| Summary: | cuda gres not equal number of device | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Yann <yann.sagon> |
| Component: | slurmd | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | felip.moll |
| Version: | 17.11.5 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Université de Genève | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
|
Description
Yann
2018-05-03 00:48:00 MDT
(In reply to Yann from comment #0)

> Dear team,
>
> We have nodes with GPUs (P100) and we manage their allocation with gres.
> It's working fine, but some days ago a user told me that
> $CUDA_VISIBLE_DEVICES wasn't set on one of the nodes (gpu004) and that all
> the jobs were running on the same GPU (this node has six P100s). I checked
> the log and saw something strange:
>
> ```
> [2018-04-05T03:19:39.755] [6796384.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
> [2018-04-05T03:19:39.757] [6796384.0] done with job
> [2018-04-05T11:15:44.592] Slurmd shutdown completing
> [2018-04-05T11:15:45.683] Message aggregation disabled
> [2018-04-05T11:15:45.684] gpu device number 0(/dev/nvidia0):c 195:0 rwm
> [2018-04-05T11:15:45.684] gpu device number 1(/dev/nvidia1):c 195:1 rwm
> [2018-04-05T11:15:45.684] gpu device number 2(/dev/nvidia2):c 195:2 rwm
> [2018-04-05T11:15:45.684] gpu device number 3(/dev/nvidia3):c 195:3 rwm
> [2018-04-05T11:15:45.684] gpu device number 4(/dev/nvidia4):c 195:4 rwm
> [2018-04-05T11:15:45.684] gpu device number 5(/dev/nvidia5):c 195:5 rwm
> [2018-04-05T11:15:45.695] slurmd version 17.11.5 started
> [2018-04-05T11:15:45.696] slurmd started on Thu, 05 Apr 2018 11:15:45 +0200
> [2018-04-05T11:15:45.696] CPUs=20 Boards=1 Sockets=2 Cores=10 Threads=1 Memory=128941 TmpDisk=373170 Uptime=1794181 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
> [2018-04-05T11:49:09.976] error: common_gres_set_env: gres list is not equal to the number of gres_devices. This should never happen.
> [2018-04-05T11:49:09.980] error: common_gres_set_env: gres list is not equal to the number of gres_devices. This should never happen.
> [2018-04-05T11:49:09.981] error: common_gres_set_env: gres list is not equal to the number of gres_devices. This should never happen.
> [2018-04-05T11:49:09.985] error: common_gres_set_env: gres list is not equal to the number of gres_devices. This should never happen.
> [2018-04-05T11:49:09.986] error: common_gres_set_env: gres list is not equal to the number of gres_devices. This should never happen.
> [2018-04-05T11:49:10.007] _run_prolog: run job script took usec=31175
> [2018-04-05T11:49:10.007] _run_prolog: prolog with lock for job 6796393 ran for 1 seconds
> [2018-04-05T11:49:10.007] Launching batch job 6796393 for UID 326279
> [2018-04-05T11:49:10.011] _run_prolog: run job script took usec=30698
> [2018-04-05T11:49:10.011] _run_prolog: prolog with lock for job 6796396 ran for 1 seconds
> [2018-04-05T11:49:10.012] _run_prolog: run job script took usec=30708
> [2018-04-05T11:49:10.012] _run_prolog: prolog with lock for job 6796395 ran for 1 seconds
> [2018-04-05T11:49:10.016] _run_prolog: run job script took usec=30693
> [2018-04-05T11:49:10.016] _run_prolog: prolog with lock for job 6796398 ran for 1 seconds
> [2018-04-05T11:49:10.017] _run_prolog: run job script took usec=30713
> [2018-04-05T11:49:10.017] _run_prolog: prolog with lock for job 6796399 ran for 1 seconds
> [2018-04-05T11:49:10.020] [6796393.batch] error: common_gres_set_env: gres list is not equal to the number of gres_devices. This should never happen.
> [2018-04-05T11:49:10.021] Launching batch job 6796396 for UID 326279
> ```
>
> I have the same /etc/slurm/gres.conf for every node. The line for gpu004 is:
>
> ```
> NodeName=gpu004 Name=gpu Type=pascal File=/dev/nvidia[0-5]
> ```
>
> I have nvidia-persistenced running.
> I restarted slurm and it seems the problem went away.
>
> Best

Hi Yann,

I reproduced this error as follows:

1. Set a bad gres count (lower than the one in slurm.conf) in the gres.conf of a specific node.
2. Restart slurmd.
3. The node is set to drain.
4. Update the node with state=resume.
5. Send a job to the node.

Is there any chance that something similar happened in your environment, i.e. a node running with a bad gres.conf that was manually resumed?
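The misconfigured state behind these reproduction steps can be sketched with a pair of config fragments. These are illustrative only (the CPU and memory parameters are placeholders, not taken from this cluster):

```
# slurm.conf (controller side): the node is declared with 6 GPUs
NodeName=gpu004 Gres=gpu:pascal:6 CPUs=20 RealMemory=128941

# gres.conf on gpu004 (slurmd side): only 5 device files listed.
# On restart slurmd drains the node; if the node is then manually
# resumed, jobs hit the "gres list is not equal to the number of
# gres_devices" error.
NodeName=gpu004 Name=gpu Type=pascal File=/dev/nvidia[0-4]
```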
What this error indicates is basically that the node sees a different number of devices than what slurmctld asks to allocate. If everything is fine after a restart and you no longer see the error, I would guess it was something similar to this situation. The same happens if you modify slurm.conf and decrease the GPU device count with respect to what is in the node's gres.conf:

1. Modify the NodeName line in slurm.conf, decreasing the gres count.
2. Restart slurmctld.
3. Run a job.

Dear Felip,

Yes, it's possible that such a scenario happened. In fact, I think the node at one point showed only 5 P100s instead of 6, and I don't remember exactly what I did about it (changed gres.conf, rebooted the node, or something else). Let's close this issue; if I have a similar issue again with more details, I'll reopen it.

Best

Ok Yann, no problem. Just reopen if the issue persists.

Regards
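As a sanity check for this class of mismatch, an administrator could compare the device count implied by a gres.conf `File=` spec against the device nodes actually present on the host. The helper below is a hypothetical sketch, not part of Slurm, and it only handles a single numeric bracket range of the form used in this report:

```python
import glob
import re

def visible_gpu_devices(pattern="/dev/nvidia[0-9]*"):
    """Count NVIDIA device nodes currently visible on this host."""
    return len(glob.glob(pattern))

def declared_gpu_count(file_spec):
    """Derive a device count from a gres.conf File= spec.

    Handles a trailing numeric bracket range, e.g.
    '/dev/nvidia[0-5]' -> 6; a plain path such as
    '/dev/nvidia0' counts as a single device.
    """
    m = re.search(r"\[(\d+)-(\d+)\]$", file_spec)
    if m:
        low, high = int(m.group(1)), int(m.group(2))
        return high - low + 1
    return 1

# gpu004's gres.conf line declares six devices:
assert declared_gpu_count("/dev/nvidia[0-5]") == 6
```

If `declared_gpu_count(...)` disagrees with `visible_gpu_devices()` on a node (e.g. a GPU has dropped off the bus), that is exactly the state in which slurmd would drain the node on restart.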