Ticket 5128

Summary: CUDA gres count not equal to the number of devices
Product: Slurm Reporter: Yann <yann.sagon>
Component: slurmd    Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: felip.moll
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
Site: Université de Genève

Description Yann 2018-05-03 00:48:00 MDT
Dear team,

We have nodes with GPUs (P100) and we manage their allocation with gres. It's working fine, but some days ago a user told me that $CUDA_VISIBLE_DEVICES wasn't set on one of the nodes (gpu004) and that all the jobs were running on the same GPU (this node has 6 P100s). I checked the log and saw something strange:

[2018-04-05T03:19:39.755] [6796384.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2018-04-05T03:19:39.757] [6796384.0] done with job
[2018-04-05T11:15:44.592] Slurmd shutdown completing
[2018-04-05T11:15:45.683] Message aggregation disabled
[2018-04-05T11:15:45.684] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2018-04-05T11:15:45.684] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2018-04-05T11:15:45.684] gpu device number 2(/dev/nvidia2):c 195:2 rwm
[2018-04-05T11:15:45.684] gpu device number 3(/dev/nvidia3):c 195:3 rwm
[2018-04-05T11:15:45.684] gpu device number 4(/dev/nvidia4):c 195:4 rwm
[2018-04-05T11:15:45.684] gpu device number 5(/dev/nvidia5):c 195:5 rwm
[2018-04-05T11:15:45.695] slurmd version 17.11.5 started
[2018-04-05T11:15:45.696] slurmd started on Thu, 05 Apr 2018 11:15:45 +0200
[2018-04-05T11:15:45.696] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=128941 TmpDisk=373170 Uptime=1794181 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-04-05T11:49:09.976] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:09.980] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:09.981] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:09.985] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:09.986] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:10.007] _run_prolog: run job script took usec=31175
[2018-04-05T11:49:10.007] _run_prolog: prolog with lock for job 6796393 ran for 1 seconds
[2018-04-05T11:49:10.007] Launching batch job 6796393 for UID 326279
[2018-04-05T11:49:10.011] _run_prolog: run job script took usec=30698
[2018-04-05T11:49:10.011] _run_prolog: prolog with lock for job 6796396 ran for 1 seconds
[2018-04-05T11:49:10.012] _run_prolog: run job script took usec=30708
[2018-04-05T11:49:10.012] _run_prolog: prolog with lock for job 6796395 ran for 1 seconds
[2018-04-05T11:49:10.016] _run_prolog: run job script took usec=30693
[2018-04-05T11:49:10.016] _run_prolog: prolog with lock for job 6796398 ran for 1 seconds
[2018-04-05T11:49:10.017] _run_prolog: run job script took usec=30713
[2018-04-05T11:49:10.017] _run_prolog: prolog with lock for job 6796399 ran for 1 seconds
[2018-04-05T11:49:10.020] [6796393.batch] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:10.021] Launching batch job 6796396 for UID 326279

I have the same /etc/slurm/gres.conf for every node. The line for gpu004 is:

NodeName=gpu004 Name=gpu Type=pascal File=/dev/nvidia[0-5]
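For reference, the device count in gres.conf has to match the Gres= count on the node's NodeName line in slurm.conf. A consistent pair for gpu004 might look like the following; the slurm.conf line is an assumed sketch, since that file was not posted in this ticket:

```
# slurm.conf (assumed sketch -- the actual NodeName line was not posted)
NodeName=gpu004 Gres=gpu:pascal:6 ...

# gres.conf (as posted above): six device files, matching the count of 6
NodeName=gpu004 Name=gpu Type=pascal File=/dev/nvidia[0-5]
```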

I have nvidia-persistenced running.
I restarted slurm and it seems the problem went away.

Best
Comment 1 Felip Moll 2018-05-03 08:53:01 MDT
(In reply to Yann from comment #0)
> [...]
> [2018-04-05T11:49:09.976] error: common_gres_set_env: gres list is not equal
> to the number of gres_devices.  This should never happen.
> [...]
> I restarted slurm and it seems the problem went away.

Hi Yann,

I can reproduce this error as follows:

1. Set a bad gres count (lower than the one in slurm.conf) in the gres.conf of a specific node.
2. Restart slurmd.
3. The node is set to drain.
4. Update the node with state=resume.
5. Send a job to the node.

Is there any chance that something similar happened in your environment? I.e., a node running with a bad gres.conf that was manually resumed.

What this error indicates is basically that the node sees a different number of devices than what slurmctld asks to allocate.

If everything is fine after a restart and you don't see the error anymore, I guess it was a situation similar to this one.
Comment 2 Felip Moll 2018-05-03 08:57:23 MDT
The same happens if you modify slurm.conf and decrease the GPU device count with respect to what is in the node's gres.conf:

1. Modify the NodeName line in slurm.conf, decreasing the gres count.
2. Restart slurmctld.
3. Run a job.
Comment 3 Yann 2018-05-08 09:08:22 MDT
Dear Felip,

Yes, it's possible that such a scenario happened.

In fact, I think the node once showed only 5 P100s instead of 6, and I don't remember exactly what I did about it (changed gres.conf, rebooted the node, or something else).

Let's close this issue. If I run into a similar issue again with more details, I'll reopen it.

Best
Comment 4 Felip Moll 2018-05-08 09:50:37 MDT
Ok Yann, no problem.

Just reopen if the issue persists.

Regards