Ticket 6366

Summary: AllocGRES id recorded as 7696487
Product: Slurm
Reporter: Kilian Cavalotti <kilian>
Component: Accounting
Assignee: Director of Support <support>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: felip.moll, greg.wickham, valentin.plugaru
Version: 18.08.4
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4650
https://bugs.schedmd.com/show_bug.cgi?id=6348
Site: Stanford
Machine Name: Sherlock
Version Fixed: 18.08.5

Description Kilian Cavalotti 2019-01-16 10:06:31 MST
Hi,

This is a follow-up to bug #4650.

We're seeing some (but not all) of our GPU job accounting records created with an AllocGRES value of "7696487:x" (x being the number of requested GPUs), exactly as described in bug #4650.

We've checked our slurm.conf and gres.conf files, and the node configuration seems to be consistent regarding GRES config.

I cleaned up the database yesterday by manually updating the records, as suggested in #4650:

MariaDB [slurm_acct_db]> update sherlock_job_table set gres_alloc='gpu:1' where gres_alloc='7696487:1';
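
The statement above only covers the ":1" case. As a rough sketch of the same rewrite for any GPU count, the helper below maps the numeric id back to "gpu" in a gres_alloc string. The function name `fix_gres_alloc` is hypothetical, and the assumption that gres_alloc may hold several comma-separated entries is mine, not something confirmed in this ticket.

```python
import re

# Numeric id that shows up in place of the "gpu" GRES name (from this ticket).
GPU_PLUGIN_ID = "7696487"


def fix_gres_alloc(value: str) -> str:
    """Replace a leading 7696487 id with "gpu" in each GRES entry.

    Assumption: entries are comma-separated and look like "name:count".
    Entries that do not start with exactly this id are left untouched.
    """
    entries = value.split(",")
    # The lookahead ensures the id is followed by ":" or the end of the entry,
    # so a longer number such as 76964870 is not rewritten by mistake.
    fixed = [re.sub(rf"^{GPU_PLUGIN_ID}(?=:|$)", "gpu", e) for e in entries]
    return ",".join(fixed)


print(fix_gres_alloc("7696487:4"))  # gpu:4
```

A helper like this would only be applied to rows already selected with gres_alloc like '7696487%', so correct records are never touched.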

And today, I can see that more job records have been created with that GRES id:

MariaDB [slurm_acct_db]> SELECT id_job,gres_alloc,gres_req,nodelist from sherlock_job_table where gres_alloc like '7696487%' LIMIT 10;
+----------+------------+----------+-----------+
| id_job   | gres_alloc | gres_req | nodelist  |
+----------+------------+----------+-----------+
| 35564295 | 7696487:1  | gpu:0    | sh-115-01 |
| 35564298 | 7696487:1  | gpu:0    | sh-115-01 |
| 35669388 | 7696487:1  | gpu:0    | sh-19-02  |
| 35669389 | 7696487:1  | gpu:0    | sh-112-12 |
| 35669423 | 7696487:1  | gpu:0    | sh-112-13 |
| 35669438 | 7696487:1  | gpu:0    | sh-114-04 |
| 35669439 | 7696487:1  | gpu:0    | sh-114-04 |
| 35669440 | 7696487:1  | gpu:0    | sh-114-04 |
| 35669455 | 7696487:1  | gpu:0    | sh-114-04 |
| 35669456 | 7696487:1  | gpu:0    | sh-114-04 |
+----------+------------+----------+-----------+

MariaDB [slurm_acct_db]> SELECT COUNT(*) from sherlock_job_table where gres_alloc like '7696487%';
+----------+
| COUNT(*) |
+----------+
|      588 |
+----------+

I'll attach the config files (gres.conf and slurm.conf) as well as the slurmctld log and the slurmd log from a relevant node (sh-115-01).


Thanks!
-- 
Kilian
Comment 1 Kilian Cavalotti 2019-01-16 10:57:04 MST
I'd actually like to make the config and log file attachments private, but there's no Privacy setting checkbox when I try to upload an attachment. :(
Comment 2 Felip Moll 2019-01-16 13:06:23 MST
(In reply to Kilian Cavalotti from comment #1)
> I'd actually like to make the config and log file attachments private, but
> there's no Privacy setting checkbox when I try to upload an attachment. :(

Hi Kilian, your bug is already private.
It will be safe to attach it here.

Thanks
Comment 9 Michael Hinton 2019-01-24 13:23:08 MST
Hey Kilian,

We figured out where the problem is. We are just making sure that our fix looks good before committing.

Thanks,
-Michael
Comment 10 Kilian Cavalotti 2019-01-24 13:24:43 MST
(In reply to Michael Hinton from comment #9)
> Hey Kilian,
> 
> We figured out where the problem is. We are just making sure that our fix
> looks good before committing.

Good news! Thanks for the update.

Cheers,
-- 
Kilian
Comment 12 Michael Hinton 2019-01-25 14:24:48 MST
Kilian,

Here is the patch, slated for 18.08.5: https://github.com/SchedMD/slurm/commit/588aacf5b13da5ef. Let me know if that works for you!

Thanks,
-Michael
Comment 13 Michael Hinton 2019-01-25 14:49:37 MST
Here's a quick explanation:

`7696487` is the gpu plugin id, and is simply a special hash of the string `gpu`.
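
For illustration, the number can be reproduced by packing the bytes of the name into a 32-bit integer, one byte per 8-bit slot. This is a minimal sketch that assumes that byte-packing scheme; the function name `gres_plugin_id` is made up for this example and is not a Slurm API.

```python
def gres_plugin_id(name: str) -> int:
    """Pack the ASCII bytes of a GRES name into a 32-bit id.

    Assumed scheme: byte 0 is added at bit 0, byte 1 at bit 8, byte 2 at
    bit 16, byte 3 at bit 24, then the shift wraps back to 0.
    """
    plugin_id = 0
    shift = 0
    for byte in name.encode("ascii"):
        # Keep the running sum within 32 bits, like an unsigned 32-bit integer.
        plugin_id = (plugin_id + (byte << shift)) & 0xFFFFFFFF
        shift = (shift + 8) % 32
    return plugin_id


# 'g' = 0x67, 'p' = 0x70, 'u' = 0x75  ->  0x757067
print(gres_plugin_id("gpu"))  # 7696487
```

Under this scheme the id is just the name's bytes read as a little-endian integer, which is why "gpu" always maps to the same value on every cluster.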

Before, Slurm would try to see if that id matched any gpu records parsed from (effectively) a random node's gres.conf. On heterogeneous systems (or homogeneous systems with gres.confs that are mismatched, like in bug 4650), this means that sometimes a gpu record wasn't found, so the string "gpu" wasn't found. The fallback was to use the plugin id instead.

The simplifying realization was that searching gpu records from a random node's gres.conf for a gres name string was not very smart. Instead, we can simply check the GresTypes strings configured in the controller's slurm.conf.
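
The difference between the two lookups can be sketched roughly as follows. This is a simplified Python model, not the actual patch: the function names, the record layout, and the byte-packing id scheme are all assumptions made for illustration.

```python
def gres_plugin_id(name: str) -> int:
    """Assumed id scheme: pack the name's ASCII bytes into 32 bits."""
    plugin_id = 0
    shift = 0
    for byte in name.encode("ascii"):
        plugin_id = (plugin_id + (byte << shift)) & 0xFFFFFFFF
        shift = (shift + 8) % 32
    return plugin_id


def name_from_node_records(plugin_id: int, node_gres_records: list) -> str:
    """Old behaviour (sketch): search one node's parsed gres.conf records.

    If that node has no matching record, fall back to the numeric id.
    """
    for record in node_gres_records:
        if record["plugin_id"] == plugin_id:
            return record["name"]
    return str(plugin_id)


def name_from_gres_types(plugin_id: int, gres_types: list) -> str:
    """New behaviour (sketch): match against GresTypes from slurm.conf."""
    for name in gres_types:
        if gres_plugin_id(name) == plugin_id:
            return name
    return str(plugin_id)


gpu_id = gres_plugin_id("gpu")

# A node without GPU records reproduces the symptom from this ticket.
print(name_from_node_records(gpu_id, []))     # 7696487
# The controller-side GresTypes list does not depend on which node is picked.
print(name_from_gres_types(gpu_id, ["gpu"]))  # gpu
```

The point of the model is that the second lookup depends only on the controller's configuration, so the result no longer varies with which node's records happen to be consulted.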
Comment 14 Kilian Cavalotti 2019-01-25 15:06:55 MST
Hi Michael, 

(In reply to Michael Hinton from comment #13)
> Here's a quick explanation:
> 
> `7696487` is the gpu plugin id, and is simply a special hash of the string
> `gpu`.
> 
> Before, Slurm would try to see if that id matched any gpu records parsed
> from (effectively) a random node's gres.conf. On heterogeneous systems (or
> homogeneous systems with gres.confs that are mismatched, like in bug 4650),
> this means that sometimes a gpu record wasn't found, so the string "gpu"
> wasn't found. The fallback was to use the plugin id instead.
> 
> The simplifying realization was that searching gpu records from a random
> node's gres.conf for a gres name string was not very smart. Instead, we can
> simply check the GresTypes strings configured in the controller's slurm.conf.

Thanks for the explanation, and for the patch!

Cheers,
-- 
Kilian
Comment 15 Michael Hinton 2019-01-25 15:12:18 MST
Closing
Comment 16 Michael Hinton 2019-02-27 09:52:38 MST
*** Ticket 6599 has been marked as a duplicate of this ticket. ***