Ticket 4650

Summary: AllocGRES gives strange value
Product: Slurm Reporter: GSK-ONYX-SLURM <slurm-support>
Component: Accounting    Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: felip.moll, kilian
Version: 17.02.7   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4695
https://bugs.schedmd.com/show_bug.cgi?id=6366
Site: GSK

Description GSK-ONYX-SLURM 2018-01-19 08:19:53 MST
Hi.
The command I am using is:

sacct -a -S 2017-11-01 -o jobid,partition,ReqTRES,ReqGRES,AllocTRES,AllocGRES,stat 

The issue is that for some records AllocGRES returns number:1 (a numeric ID followed by a count) rather than gpu:number, which would appear to be the expected result.  See below.  What does this mean?  It happens for any completion state.

781              uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+        gpu:1  COMPLETED
781.0                                     gpu:1 cpu=1,mem+        gpu:1  COMPLETED
782              uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+    7696487:1  COMPLETED
782.0                                     gpu:1 cpu=1,mem+    7696487:1  COMPLETED

830              uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+    7696487:1  COMPLETED
830.0                                     gpu:1 cpu=1,mem+    7696487:1  COMPLETED
831              uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+        gpu:1  COMPLETED
831.0                                     gpu:1 cpu=1,mem+        gpu:1  COMPLETED

1325             uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+        gpu:1     FAILED
1325.0                                    gpu:1 cpu=1,mem+        gpu:1  COMPLETED
1326             uk_hpc cpu=1,mem+        gpu:3 cpu=1,mem+        gpu:3     FAILED
1326.0                                    gpu:3 cpu=1,mem+        gpu:3  COMPLETED
1327             uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+    7696487:1     FAILED
1327.0                                    gpu:1 cpu=1,mem+    7696487:1  COMPLETED
1328             uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+    7696487:1     FAILED
1328.0                                    gpu:1 cpu=1,mem+    7696487:1 CANCELLED+
1329             uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+    7696487:1  COMPLETED
1329.0                                    gpu:1 cpu=1,mem+    7696487:1  COMPLETED
1330             uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+    7696487:1  COMPLETED
1330.0                                    gpu:1 cpu=1,mem+    7696487:1  COMPLETED
1331             uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+    7696487:1     FAILED
1331.0                                    gpu:1 cpu=1,mem+    7696487:1     FAILED
1332             uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+    7696487:1     FAILED
1332.0                                    gpu:1 cpu=1,mem+    7696487:1 CANCELLED+
1333             uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+    7696487:1     FAILED
1333.0                                    gpu:1 cpu=1,mem+    7696487:1 CANCELLED+
1334             uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+    7696487:1     FAILED
1334.0                                    gpu:1 cpu=1,mem+    7696487:1 CANCELLED+
1335             uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+        gpu:1  COMPLETED
1335.0                                    gpu:1 cpu=1,mem+        gpu:1  COMPLETED
1336             uk_hpc cpu=1,mem+        gpu:1 cpu=1,mem+    7696487:1  COMPLETED
1336.0                                    gpu:1 cpu=1,mem+    7696487:1  COMPLETED
Comment 2 Felip Moll 2018-01-22 11:32:50 MST
Hello,

This happens in two situations:

1. If you are using a GRES plugin, e.g. gpu, and you have inconsistencies between slurm.conf and gres.conf. Can you check whether the gpu counts for nodes in slurm.conf differ from what is in gres.conf?

2. When you start slurmd before starting slurmctld and you have defined a gres.conf. This appears to be a bug; I am investigating this case.

I will come back to you as soon as possible, but in the meantime please check situation 1.
Comment 3 Felip Moll 2018-01-23 05:02:55 MST
More detailed explanation:

What probably happened here is that you have some nodes with a configuration mismatch between slurm.conf and gres.conf.

For example, if you have a line in slurm.conf like:

NodeName=node0001 gres=gpu:1

and in gres.conf there is no entry for node0001, then when slurmd starts on this node, it will send a registration message to the controller with the values read from gres.conf: none. The controller will then detect this mismatch and mark the node drained with reason:

   Reason=gres/gpu count too low (0 < 1) [slurm@2018-01-23T12:51:13]

After that, you can force this node back up with:

scontrol update nodename=node0001 state=resume

and submit jobs to it requesting gres=gpu, since it is defined in slurm.conf.

The job is then accepted, but at the time of recording which GRES resources were allocated there is no match between gpu and what is in gres.conf, so the numeric plugin ID is saved instead of the string 'gpu'.


The ID 7696487 corresponds to the gpu GRES plugin.
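For reference, a consistent pair of configuration entries would look like the following sketch (the node name and the device path in File= are assumptions for illustration; adapt them to your hardware):

   # slurm.conf
   NodeName=node0001 Gres=gpu:1

   # gres.conf on node0001
   Name=gpu File=/dev/nvidia0

The gpu count declared for the node in slurm.conf must match the number of device lines (or the Count=) in that node's gres.conf.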

So to solve this:

1. Check for a mismatch between slurm.conf and gres.conf, correct it, then run 'scontrol reconfig' and restart the slurmd daemon on the affected nodes, in that order.

2. The string '7696487' is already in the database. If you want to correct these values, you will have to do it manually:

To see all records:
> select id_job,gres_alloc from <your_cluster_name>_job_table;

Only the wrong gpu ones:
> select id_job,gres_alloc from <your_cluster_name>_job_table where gres_alloc like '7696487%';

Example update:
> update <your_cluster_name>_job_table set gres_alloc='gpu:1' where gres_alloc='7696487:1';

3. I will check internally how to avoid this situation.
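As a sketch, the manual correction in step 2 could be wrapped in a transaction so it can be reviewed before committing (this assumes a MySQL/MariaDB accounting database; 'mycluster' is a placeholder for your cluster name):

> start transaction;
> update mycluster_job_table set gres_alloc='gpu:1' where gres_alloc='7696487:1';
> update mycluster_job_table set gres_alloc='gpu:3' where gres_alloc='7696487:3';
> select id_job,gres_alloc from mycluster_job_table where gres_alloc like '7696487%';
> commit;

The final select should return no rows once every affected count (here 1 and 3, the values seen in the sacct output above) has been covered; otherwise, add an update for the remaining counts before committing.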




Please tell me how steps 1 and 2 go for you.
Comment 8 Felip Moll 2018-02-13 08:46:51 MST
Hello,

I am marking this bug as resolved and in parallel I am opening an enhancement request to fix this situation.

In the future we will probably prevent an admin from setting nodes to IDLE if they have a reason related to a bad GRES count, CPU count, or any other configuration mismatch.

Should you have any questions, just raise a new bug. Comment 3 will solve your current issues.

Best regards and thanks for reporting.
Comment 9 Kilian Cavalotti 2019-01-15 15:19:50 MST
Hi Felip, 

Has there been any fix for this issue? We seem to be experiencing the same thing, although our slurm.conf and gres.conf appear to be consistent.

We definitely have occasions where slurmd starts before slurmctld, though.

So I went to the Slurm DB and updated all the records with id 7696487, as you detailed in Comment #3. I restarted slurmctld, and then all slurmds on GPU nodes, but I continue to see new jobs being recorded with gres_alloc="7696487:x".

None of our nodes is drained with "Reason=gres/gpu count too low" either.

Cheers,
-- 
Kilian
Comment 10 Felip Moll 2019-01-16 01:08:56 MST
(In reply to Kilian Cavalotti from comment #9)

Hi Kilian,

Though this may apply to recent versions, this bug is tagged against the old
17.02. Given that your situation does not exactly match the one in this bug,
would you mind opening a new one? You can directly attach your newest
slurm.conf/gres.conf, the slurmctld/slurmd logs, and the issued command.

Also, remember that commenting on "resolved" bugs without reopening them
may lead us to unintentionally miss the notifications. Always reopen the
bug or open a new one.

Thanks