Ticket 6366 - AllocGRES id recorded as 7696487
Summary: AllocGRES id recorded as 7696487
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 18.08.4
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Duplicates: 6599
Depends on:
Blocks:
 
Reported: 2019-01-16 10:06 MST by Kilian Cavalotti
Modified: 2019-02-27 09:52 MST

See Also:
Site: Stanford
Machine Name: Sherlock
Version Fixed: 18.08.5


Description Kilian Cavalotti 2019-01-16 10:06:31 MST
Hi,

This is a follow up on bug #4650.

We're seeing some (but not all) of our GPU jobs' accounting records created with an AllocGRES value of "7696487:x" (x being the number of requested GPUs), exactly as described in bug #4650.

We've checked our slurm.conf and gres.conf files, and the node configuration seems to be consistent regarding GRES config.

I cleaned up the database yesterday by manually updating the records, as suggested in #4650:

MariaDB [slurm_acct_db]> update sherlock_job_table set gres_alloc='gpu:1' where gres_alloc='7696487:1';
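
The statement above only covers records with a count of 1. As a rough sketch of the same cleanup for any count, the Python helper below rewrites a gres_alloc string by swapping the numeric prefix for the GRES name. The function name and the id-to-name table are hypothetical, and the handling of comma-separated entries is an assumption about how multi-GRES values might be stored, not something shown in this ticket.

```python
# Assumption: 7696487 is the only bad prefix seen, and it stands for "gpu".
PLUGIN_ID_TO_NAME = {"7696487": "gpu"}


def normalize_gres_alloc(value: str) -> str:
    """Replace a numeric plugin-id prefix with its GRES name.

    Each comma-separated entry is expected to look like "<name-or-id>:<count>".
    Entries with an unknown prefix are returned unchanged.
    """
    fixed = []
    for entry in value.split(","):
        prefix, sep, rest = entry.partition(":")
        name = PLUGIN_ID_TO_NAME.get(prefix, prefix)
        fixed.append(name + sep + rest)
    return ",".join(fixed)


print(normalize_gres_alloc("7696487:4"))
```

Records that already read "gpu:x" pass through untouched, so the helper can be run repeatedly over the same data.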

And today, I can see that more job records have been created with that gres id:

MariaDB [slurm_acct_db]> SELECT id_job,gres_alloc,gres_req,nodelist from sherlock_job_table where gres_alloc like '7696487%' LIMIT 10;
+----------+------------+----------+-----------+
| id_job   | gres_alloc | gres_req | nodelist  |
+----------+------------+----------+-----------+
| 35564295 | 7696487:1  | gpu:0    | sh-115-01 |
| 35564298 | 7696487:1  | gpu:0    | sh-115-01 |
| 35669388 | 7696487:1  | gpu:0    | sh-19-02  |
| 35669389 | 7696487:1  | gpu:0    | sh-112-12 |
| 35669423 | 7696487:1  | gpu:0    | sh-112-13 |
| 35669438 | 7696487:1  | gpu:0    | sh-114-04 |
| 35669439 | 7696487:1  | gpu:0    | sh-114-04 |
| 35669440 | 7696487:1  | gpu:0    | sh-114-04 |
| 35669455 | 7696487:1  | gpu:0    | sh-114-04 |
| 35669456 | 7696487:1  | gpu:0    | sh-114-04 |
+----------+------------+----------+-----------+

MariaDB [slurm_acct_db]> SELECT COUNT(*) from sherlock_job_table where gres_alloc like '7696487%';
+----------+
| COUNT(*) |
+----------+
|      588 |
+----------+

I'll attach the config files (gres.conf and slurm.conf) as well as the slurmctld log and the slurmd log from a relevant node (sh-115-01).


Thanks!
-- 
Kilian
Comment 1 Kilian Cavalotti 2019-01-16 10:57:04 MST
I'd actually like to make the config and log file attachments private, but there's no Privacy setting checkbox when I try to upload an attachment. :(
Comment 2 Felip Moll 2019-01-16 13:06:23 MST
(In reply to Kilian Cavalotti from comment #1)
> I'd actually like to make the config and log file attachments private, but
> there's no Privacy setting checkbox when I try to upload an attachment. :(

Hi Kilian, your bug is already private.
It will be safe to attach it here.

Thanks
Comment 9 Michael Hinton 2019-01-24 13:23:08 MST
Hey Kilian,

We figured out where the problem is. We are just making sure that our fix looks good before committing.

Thanks,
-Michael
Comment 10 Kilian Cavalotti 2019-01-24 13:24:43 MST
(In reply to Michael Hinton from comment #9)
> Hey Kilian,
> 
> We figured out where the problem is. We are just making sure that our fix
> looks good before committing.

Good news! Thanks for the update.

Cheers,
-- 
Kilian
Comment 12 Michael Hinton 2019-01-25 14:24:48 MST
Kilian,

Here is the patch, slated for 18.08.5: https://github.com/SchedMD/slurm/commit/588aacf5b13da5ef. Let me know if that works for you!

Thanks,
-Michael
Comment 13 Michael Hinton 2019-01-25 14:49:37 MST
Here's a quick explanation:

`7696487` is the gpu plugin id, and is simply a special hash of the string `gpu`.
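
To illustrate how a short name can turn into that number, here is a minimal Python sketch. It assumes the hash adds each byte of the name shifted left by 0, 8, 16, then 24 bits (wrapping back to 0 after that) into a 32-bit value. The ticket does not spell out the algorithm, so treat the function and its name as assumptions; the arithmetic does reproduce the id reported here.

```python
def build_plugin_id(name: str) -> int:
    """Pack the bytes of a GRES name into a 32-bit integer id.

    Assumed scheme: byte i is shifted left by (8 * i) mod 32 bits
    and the results are summed.
    """
    plugin_id = 0
    shift = 0
    for byte in name.encode("ascii"):
        plugin_id += byte << shift
        shift = (shift + 8) % 32
    return plugin_id & 0xFFFFFFFF  # keep the result within 32 bits


# 'g' = 103, 'p' = 112 << 8 = 28672, 'u' = 117 << 16 = 7667712
print(build_plugin_id("gpu"))  # 7696487
```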

Before, Slurm would try to see if that id matched any gpu records parsed from (effectively) a random node's gres.conf. On heterogeneous systems (or homogeneous systems with gres.confs that are mismatched, like in bug 4650), this means that sometimes a gpu record wasn't found, so the string "gpu" wasn't found. The fallback was to use the plugin id instead.

The simplifying realization was that searching gpu records from a random node's gres.conf for a gres name string was not very smart. Instead, we can simply check the GresTypes strings configured in the controller's slurm.conf.
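
The difference between the two lookups can be sketched in Python as follows. This is an illustration of the behavior described above, not the actual Slurm code: the function names, the list arguments, and the assumed hash are all hypothetical stand-ins.

```python
from typing import Iterable


def build_plugin_id(name: str) -> int:
    """Assumed hash: sum of bytes shifted by 0, 8, 16, 24 bits in turn."""
    plugin_id = 0
    shift = 0
    for byte in name.encode("ascii"):
        plugin_id += byte << shift
        shift = (shift + 8) % 32
    return plugin_id & 0xFFFFFFFF


def resolve_name(plugin_id: int, known_names: Iterable[str]) -> str:
    """Return the GRES name matching plugin_id, else the id as a string."""
    for name in known_names:
        if build_plugin_id(name) == plugin_id:
            return name
    return str(plugin_id)  # fallback: the raw plugin id


def alloc_gres_old(plugin_id: int, count: int, node_gres_names: Iterable[str]) -> str:
    """Old behavior: search the records parsed from some node's gres.conf."""
    return f"{resolve_name(plugin_id, node_gres_names)}:{count}"


def alloc_gres_new(plugin_id: int, count: int, gres_types: Iterable[str]) -> str:
    """New behavior: search the GresTypes configured in slurm.conf."""
    return f"{resolve_name(plugin_id, gres_types)}:{count}"


gpu_id = build_plugin_id("gpu")
# A node whose gres.conf has no gpu record triggers the fallback.
print(alloc_gres_old(gpu_id, 1, []))
# The controller's GresTypes always lists the configured names.
print(alloc_gres_new(gpu_id, 1, ["gpu"]))
```

With the old lookup, the result depends on which node's records happen to be searched, which would explain why only some jobs were affected.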
Comment 14 Kilian Cavalotti 2019-01-25 15:06:55 MST
Hi Michael, 

(In reply to Michael Hinton from comment #13)
> Here's a quick explanation:
> 
> `7696487` is the gpu plugin id, and is simply a special hash of the string
> `gpu`.
> 
> Before, Slurm would try to see if that id matched any gpu records parsed
> from (effectively) a random node's gres.conf. On heterogeneous systems (or
> homogeneous systems with gres.confs that are mismatched, like in bug 4650),
> this means that sometimes a gpu record wasn't found, so the string "gpu"
> wasn't found. The fallback was to use the plugin id instead.
> 
> The simplifying realization was that searching gpu records from a random
> node's gres.conf for a gres name string was not very smart. Instead, we can
> simply check the GresTypes strings configured in the controller's slurm.conf.

Thanks for the explanation, and for the patch!

Cheers,
-- 
Kilian
Comment 15 Michael Hinton 2019-01-25 15:12:18 MST
Closing
Comment 16 Michael Hinton 2019-02-27 09:52:38 MST
*** Ticket 6599 has been marked as a duplicate of this ticket. ***