Ticket 5128 - cuda gres not equal number of device
Summary: cuda gres not equal number of device
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.11.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-05-03 00:48 MDT by Yann
Modified: 2018-05-08 09:50 MDT

See Also:
Site: Université de Genève
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Yann 2018-05-03 00:48:00 MDT
Dear team,

We have nodes with GPUs (P100) and we manage their allocation with gres. It's working fine, but some days ago a user told me that $CUDA_VISIBLE_DEVICES wasn't set on one of the nodes (gpu004) and that all the jobs were running on the same GPU (this node has 6 P100s). I checked the log and saw something strange:

[2018-04-05T03:19:39.755] [6796384.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2018-04-05T03:19:39.757] [6796384.0] done with job
[2018-04-05T11:15:44.592] Slurmd shutdown completing
[2018-04-05T11:15:45.683] Message aggregation disabled
[2018-04-05T11:15:45.684] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2018-04-05T11:15:45.684] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2018-04-05T11:15:45.684] gpu device number 2(/dev/nvidia2):c 195:2 rwm
[2018-04-05T11:15:45.684] gpu device number 3(/dev/nvidia3):c 195:3 rwm
[2018-04-05T11:15:45.684] gpu device number 4(/dev/nvidia4):c 195:4 rwm
[2018-04-05T11:15:45.684] gpu device number 5(/dev/nvidia5):c 195:5 rwm
[2018-04-05T11:15:45.695] slurmd version 17.11.5 started
[2018-04-05T11:15:45.696] slurmd started on Thu, 05 Apr 2018 11:15:45 +0200
[2018-04-05T11:15:45.696] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=128941 TmpDisk=373170 Uptime=1794181 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-04-05T11:49:09.976] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:09.980] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:09.981] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:09.985] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:09.986] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:10.007] _run_prolog: run job script took usec=31175
[2018-04-05T11:49:10.007] _run_prolog: prolog with lock for job 6796393 ran for 1 seconds
[2018-04-05T11:49:10.007] Launching batch job 6796393 for UID 326279
[2018-04-05T11:49:10.011] _run_prolog: run job script took usec=30698
[2018-04-05T11:49:10.011] _run_prolog: prolog with lock for job 6796396 ran for 1 seconds
[2018-04-05T11:49:10.012] _run_prolog: run job script took usec=30708
[2018-04-05T11:49:10.012] _run_prolog: prolog with lock for job 6796395 ran for 1 seconds
[2018-04-05T11:49:10.016] _run_prolog: run job script took usec=30693
[2018-04-05T11:49:10.016] _run_prolog: prolog with lock for job 6796398 ran for 1 seconds
[2018-04-05T11:49:10.017] _run_prolog: run job script took usec=30713
[2018-04-05T11:49:10.017] _run_prolog: prolog with lock for job 6796399 ran for 1 seconds
[2018-04-05T11:49:10.020] [6796393.batch] error: common_gres_set_env: gres list is not equal to the number of gres_devices.  This should never happen.
[2018-04-05T11:49:10.021] Launching batch job 6796396 for UID 326279

I have the same /etc/slurm/gres.conf for every node. The line for gpu004 is:

NodeName=gpu004 Name=gpu Type=pascal File=/dev/nvidia[0-5]

I have nvidia-persistenced running.
I restarted slurm and it seems the problem went away.
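
For reference, a node definition in slurm.conf has to declare the same GPU count that gres.conf enumerates. A hypothetical sketch of the matching controller-side line for this node (the actual line may carry additional parameters):

```
# slurm.conf: 6 GPUs, matching File=/dev/nvidia[0-5] in gres.conf
NodeName=gpu004 Gres=gpu:pascal:6 CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1
```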

Best
Comment 1 Felip Moll 2018-05-03 08:53:01 MDT
(In reply to Yann from comment #0)

Hi Yann,

I can reproduce this error as follows:

1. Set a bad GRES count (lower than in slurm.conf) in the gres.conf of a specific node.
2. Restart slurmd.
3. The node is set to drain.
4. Update the node with state=resume.
5. Send a job to the node.
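
Steps 4 and 5 correspond to commands along these lines (node name and job request are illustrative, on a live cluster):

```
# Step 4: manually resume the drained node
scontrol update NodeName=gpu004 State=RESUME

# Step 5: send a GPU job to that node
sbatch --nodelist=gpu004 --gres=gpu:1 --wrap="echo \$CUDA_VISIBLE_DEVICES"
```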

Is there any chance that something similar happened in your environment? I.e. a node running with a bad gres.conf that was manually resumed.

What this error indicates is basically that the node sees a different number of devices than what slurmctld requests for allocation.
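
In pseudocode terms, the check that fires here behaves roughly like the following (a simplified Python sketch of the invariant, not Slurm's actual C code; the function and argument names are illustrative):

```python
def set_gres_env(allocated_gres, gres_devices):
    """Map allocated GRES entries to device indices for CUDA_VISIBLE_DEVICES.

    allocated_gres: per-device allocation flags as granted by slurmctld.
    gres_devices:   device files slurmd discovered from gres.conf.
    """
    if len(allocated_gres) != len(gres_devices):
        # Controller and node disagree on the device count: refuse to set
        # the environment rather than expose the wrong devices.
        raise RuntimeError(
            "gres list is not equal to the number of gres_devices.  "
            "This should never happen.")
    visible = ",".join(
        str(i) for i, granted in enumerate(allocated_gres) if granted)
    return {"CUDA_VISIBLE_DEVICES": visible}
```

When the counts diverge (e.g. slurmctld grants 5 entries but the node enumerates 6 device files), the error is logged and $CUDA_VISIBLE_DEVICES is never set, which matches the symptom of all jobs landing on the same GPU.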

If after a restart everything is fine and you don't see the error anymore, I would guess it was something similar to this situation.
Comment 2 Felip Moll 2018-05-03 08:57:23 MDT
The same happens if you modify slurm.conf and decrease the GPU device count with respect to what is in the node's gres.conf:

1. Modify the NodeName line in slurm.conf, decreasing the GRES count.
2. Restart slurmctld.
3. Run a job.
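
Concretely, a mismatch like the following would trigger the same error (counts and paths are illustrative): the controller advertises fewer GPUs than the node enumerates.

```
# slurm.conf (controller side): 5 GPUs advertised
NodeName=gpu004 Gres=gpu:pascal:5

# gres.conf on gpu004 (node side): 6 device files
NodeName=gpu004 Name=gpu Type=pascal File=/dev/nvidia[0-5]
```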
Comment 3 Yann 2018-05-08 09:08:22 MDT
Dear Felip,

yes it's possible that such a scenario happened.

In fact, I think what happened is that the node at one point showed only 5 P100s instead of 6, and I don't remember exactly what I did (changed gres.conf, rebooted the node, or something else).

Let's close this issue. If I have a similar issue again with more details, I'll reopen it.

Best
Comment 4 Felip Moll 2018-05-08 09:50:37 MDT
Ok Yann, no problem.

Just reopen if the issue persists.

Regards