Dear team,

We have nodes with GPUs (P100) and we manage their allocation with GRES. It has been working fine, but a few days ago a user told me that $CUDA_VISIBLE_DEVICES wasn't set on one of the nodes (gpu004) and that all the jobs were running on the same GPU (this node has six P100s). I checked the log and saw something strange:

[2018-04-05T03:19:39.755] [6796384.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2018-04-05T03:19:39.757] [6796384.0] done with job
[2018-04-05T11:15:44.592] Slurmd shutdown completing
[2018-04-05T11:15:45.683] Message aggregation disabled
[2018-04-05T11:15:45.684] gpu device number 0(/dev/nvidia0):c 195:0 rwm
[2018-04-05T11:15:45.684] gpu device number 1(/dev/nvidia1):c 195:1 rwm
[2018-04-05T11:15:45.684] gpu device number 2(/dev/nvidia2):c 195:2 rwm
[2018-04-05T11:15:45.684] gpu device number 3(/dev/nvidia3):c 195:3 rwm
[2018-04-05T11:15:45.684] gpu device number 4(/dev/nvidia4):c 195:4 rwm
[2018-04-05T11:15:45.684] gpu device number 5(/dev/nvidia5):c 195:5 rwm
[2018-04-05T11:15:45.695] slurmd version 17.11.5 started
[2018-04-05T11:15:45.696] slurmd started on Thu, 05 Apr 2018 11:15:45 +0200
[2018-04-05T11:15:45.696] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=128941 TmpDisk=373170 Uptime=1794181 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2018-04-05T11:49:09.976] error: common_gres_set_env: gres list is not equal to the number of gres_devices. This should never happen.
[2018-04-05T11:49:09.980] error: common_gres_set_env: gres list is not equal to the number of gres_devices. This should never happen.
[2018-04-05T11:49:09.981] error: common_gres_set_env: gres list is not equal to the number of gres_devices. This should never happen.
[2018-04-05T11:49:09.985] error: common_gres_set_env: gres list is not equal to the number of gres_devices. This should never happen.
[2018-04-05T11:49:09.986] error: common_gres_set_env: gres list is not equal to the number of gres_devices. This should never happen.
[2018-04-05T11:49:10.007] _run_prolog: run job script took usec=31175
[2018-04-05T11:49:10.007] _run_prolog: prolog with lock for job 6796393 ran for 1 seconds
[2018-04-05T11:49:10.007] Launching batch job 6796393 for UID 326279
[2018-04-05T11:49:10.011] _run_prolog: run job script took usec=30698
[2018-04-05T11:49:10.011] _run_prolog: prolog with lock for job 6796396 ran for 1 seconds
[2018-04-05T11:49:10.012] _run_prolog: run job script took usec=30708
[2018-04-05T11:49:10.012] _run_prolog: prolog with lock for job 6796395 ran for 1 seconds
[2018-04-05T11:49:10.016] _run_prolog: run job script took usec=30693
[2018-04-05T11:49:10.016] _run_prolog: prolog with lock for job 6796398 ran for 1 seconds
[2018-04-05T11:49:10.017] _run_prolog: run job script took usec=30713
[2018-04-05T11:49:10.017] _run_prolog: prolog with lock for job 6796399 ran for 1 seconds
[2018-04-05T11:49:10.020] [6796393.batch] error: common_gres_set_env: gres list is not equal to the number of gres_devices. This should never happen.
[2018-04-05T11:49:10.021] Launching batch job 6796396 for UID 326279

I have the same /etc/slurm/gres.conf on every node. The line for gpu004 is:

NodeName=gpu004 Name=gpu Type=pascal File=/dev/nvidia[0-5]

I have nvidia-persistenced running. I restarted Slurm and the problem seems to have gone away.

Best
(In reply to Yann from comment #0)

Hi Yann,

I can reproduce this error as follows:

1. Set a bad GRES count (lower than in slurm.conf) in the gres.conf of a specific node.
2. Restart slurmd.
3. The node is set to drain.
4. Update the node with state=resume.
5. Send a job to the node.

Is there any chance that something similar happened in your environment? I.e., a node running with a bad gres.conf that was manually resumed.
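Step 1 of the reproduction above could look like the following (my illustration, with the counts chosen to match gpu004's six GPUs; the slurm.conf line is an assumption written to mirror the gres.conf entry from the report):

```
# slurm.conf (unchanged): the controller still advertises 6 GPUs
NodeName=gpu004 ... Gres=gpu:pascal:6

# gres.conf on gpu004 (bad): only 5 devices listed
NodeName=gpu004 Name=gpu Type=pascal File=/dev/nvidia[0-4]
```

With this mismatch in place, restarting slurmd drains the node; manually resuming it and sending a job then triggers the common_gres_set_env error.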
What this error indicates is basically that the node sees a different number of devices than what slurmctld requests for allocation. If everything is fine after a restart and you no longer see the error, I guess something similar to this situation happened.
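The invariant behind the error can be illustrated with a toy sketch (this is not Slurm's actual code; the function names are hypothetical): expand the bracket range in a gres.conf File= pattern and compare the resulting device count against the count the controller advertises.

```python
import re

def expand_device_range(pattern):
    """Expand a gres.conf File= pattern such as /dev/nvidia[0-5]
    into the list of device paths it denotes."""
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if m is None:
        return [pattern]  # single device, no bracket range
    prefix, lo, hi, suffix = m.group(1), int(m.group(2)), int(m.group(3)), m.group(4)
    return [f"{prefix}{i}{suffix}" for i in range(lo, hi + 1)]

def counts_agree(file_pattern, advertised_count):
    """True when the gres.conf device list matches the GRES count the
    controller advertises -- the invariant whose violation produces
    "gres list is not equal to the number of gres_devices"."""
    return len(expand_device_range(file_pattern)) == advertised_count

# gpu004 advertises 6 GPUs; /dev/nvidia[0-5] expands to 6 devices -> consistent
print(counts_agree("/dev/nvidia[0-5]", 6))  # True
# a stale gres.conf listing only 5 devices would violate the invariant
print(counts_agree("/dev/nvidia[0-4]", 6))  # False
```

In the real daemon the check is between the in-memory gres list and the discovered gres_devices, but the arithmetic is the same: both sides must see the same number of GPUs.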
The same happens if you modify slurm.conf and decrease the GPU count relative to what is in the node's gres.conf:

1. Modify the NodeName line in slurm.conf, decreasing the GRES count.
2. Restart slurmctld.
3. Run a job.
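This reverse mismatch, sketched with hypothetical counts (the slurm.conf line is my assumption, not from the report):

```
# slurm.conf (modified): GRES count lowered to 4
NodeName=gpu004 ... Gres=gpu:pascal:4

# gres.conf on gpu004 (unchanged): still lists 6 devices
NodeName=gpu004 Name=gpu Type=pascal File=/dev/nvidia[0-5]
```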
Dear Felip,

Yes, it's possible that such a scenario happened. In fact, I think the node at one point showed only 5 P100s instead of 6, and I don't remember exactly what I did (changed gres.conf, rebooted the node, or something else). Let's close this issue; if I run into a similar issue again with more details, I'll reopen it.

Best
Ok Yann, no problem. Just reopen if the issue persists.

Regards