Ticket 5128

Summary: CUDA gres count not equal to the number of devices
Product: Slurm Reporter: Yann <yann.sagon>
Component: slurmd    Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: felip.moll
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
Site: Université de Genève

Description Yann 2018-05-03 00:48:00 MDT
Dear team,

We have nodes with GPUs (P100) and we manage their allocation with gres. It's working fine, but some days ago a user told me that $CUDA_VISIBLE_DEVICES wasn't set on one of the nodes (gpu004) and that all the jobs were running on the same GPU (this node has 6 P100s). I checked the log and saw something strange:

[2018-04-05T03:19:39.755] [6796384.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2018-04-05T03:19:39.757] [6796384.0] done with job
[2018-04-05T11:15:44.592] Slurmd shutdown completing
[2018-04-05T11:15:45.683] Message aggregation disabled
[2018-04-05T11:15:45.684] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2018-04-05T11:15:45.684] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2018-04-05T11:15:45.684] gpu device number 2(/dev/nvidia2):c 195:2 rwm
[2018-04-05T11:15:45.684] gpu device number 3(/dev/nvidia3):c 195:3 rwm
[2018-04-05T11:15:45.684] gpu device number 4(/dev/nvidia4):c 195:4 rwm
[2018-04-05T11:15:45.684] gpu device number 5(/dev/nvidia5):c 195:5 rwm
[2018-04-05T11:15:45.695] slurmd version 17.11.5 started
[2018-04-05T11:15:45.696] slurmd started on Thu, 05 Apr 2018 11:15:45 +0200
[2018-04-05T11:15:45.696] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=128941 TmpDisk=373170 Uptime=1794181 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-04-05T11:49:09.976] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:09.980] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:09.981] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:09.985] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:09.986] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:10.007] _run_prolog: run job script took usec=31175
[2018-04-05T11:49:10.007] _run_prolog: prolog with lock for job 6796393 ran for 1 seconds
[2018-04-05T11:49:10.007] Launching batch job 6796393 for UID 326279
[2018-04-05T11:49:10.011] _run_prolog: run job script took usec=30698
[2018-04-05T11:49:10.011] _run_prolog: prolog with lock for job 6796396 ran for 1 seconds
[2018-04-05T11:49:10.012] _run_prolog: run job script took usec=30708
[2018-04-05T11:49:10.012] _run_prolog: prolog with lock for job 6796395 ran for 1 seconds
[2018-04-05T11:49:10.016] _run_prolog: run job script took usec=30693
[2018-04-05T11:49:10.016] _run_prolog: prolog with lock for job 6796398 ran for 1 seconds
[2018-04-05T11:49:10.017] _run_prolog: run job script took usec=30713
[2018-04-05T11:49:10.017] _run_prolog: prolog with lock for job 6796399 ran for 1 seconds
[2018-04-05T11:49:10.020] [6796393.batch] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:10.021] Launching batch job 6796396 for UID 326279

I have the same /etc/slurm/gres.conf for every node. The line for gpu004 is:

NodeName=gpu004 Name=gpu Type=pascal File=/dev/nvidia[0-5]
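For reference, the device count in gres.conf has to match the Gres= count on the node's NodeName line in slurm.conf. A consistent pair for gpu004 might look like the following; the slurm.conf line is an assumed sketch, since that file was not posted in this ticket:

```
# slurm.conf (assumed sketch -- the actual NodeName line was not posted)
NodeName=gpu004 Gres=gpu:pascal:6 ...

# gres.conf (as posted above): six device files, matching the count of 6
NodeName=gpu004 Name=gpu Type=pascal File=/dev/nvidia[0-5]
```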

I have nvidia-persistenced running.
I restarted slurm and it seems the problem went away.

Best
Comment 1 Felip Moll 2018-05-03 08:53:01 MDT
(In reply to Yann from comment #0)
> [...]
> [2018-04-05T11:49:09.976] error: common_gres_set_env: gres list is not equal
> to the number of gres_devices.  This should never happen.
> [...]
> I restarted slurm and it seems the problem went away.

Hi Yann,

I can reproduce this error as follows:

1. Set a bad gres count (lower than the one in slurm.conf) in the gres.conf of a specific node.
2. Restart slurmd.
3. The node is set to drain.
4. Update the node with state=resume.
5. Send a job to the node.

Is there any chance that something similar happened in your environment? I.e., a node running with a bad gres.conf that was manually resumed.

What this error indicates is basically that the node sees a different number of devices than what slurmctld asks to allocate.

If everything is fine after a restart and you don't see the error anymore, I guess it was a situation similar to this one.
Comment 2 Felip Moll 2018-05-03 08:57:23 MDT
The same happens if you modify slurm.conf and decrease the GPU device count with respect to what is in the node's gres.conf:

1. Modify the NodeName line in slurm.conf, decreasing the gres count.
2. Restart slurmctld.
3. Run a job.
Comment 3 Yann 2018-05-08 09:08:22 MDT
Dear Felip,

Yes, it's possible that such a scenario happened.

In fact, I think the node once showed only 5 P100s instead of 6, and I don't remember exactly what I did about it (changed gres.conf, rebooted the node, or something else).

Let's close this issue. If I run into a similar issue again with more details, I'll reopen it.

Best
Comment 4 Felip Moll 2018-05-08 09:50:37 MDT
Ok Yann, no problem.

Just reopen if the issue persists.

Regards