| Summary: | SLURM assigning incorrect GPU IDs | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Simran <simran> |
| Component: | Scheduling | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | brian, da |
| Version: | 14.03.9 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Genentech (Roche) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, gres.conf | | |
Created attachment 1963 [details]: slurm.conf
Created attachment 1964 [details]: gres.conf
Your configuration looks good. I suspect this is a bug that was fixed in version 14.03.10:
* Changes in Slurm 14.03.10
===========================
-- Fix bug that prevented preservation of a job's GRES bitmap on slurmctld
restart or reconfigure (bug was introduced in 14.03.5 "Clear record of a
job's gres when requeued" and only applies when GRES mapped to specific
files).
Someone will investigate more on Monday morning, but I believe the problem will persist until version 14.03.10 (or later) is installed and all jobs started under earlier versions of Slurm end.
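For context on "GRES mapped to specific files": that is the gres.conf style in which each GPU is bound to a device file. A sketch of such a configuration for a two-GPU node like amber221 (the device paths are illustrative; the site's actual gres.conf is attached above):

```
# gres.conf: one line per GPU, each mapped to a specific device file.
# The 14.03.10 fix above only applies to configurations of this form.
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```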
We are unable to reproduce this. Was the slurmctld restarted, or reconfigured, during the lifetime of job 17753? The commit Moe referenced is https://github.com/SchedMD/slurm/commit/1209a664840a431428658c0950b16078af7aff63, which was found in Bug 1192. Can you apply the above patch or upgrade to 14.03.10 or .11?

Okay, so I just got some downtime for today and finally upgraded to Slurm v14.11.6. However, now I am having even more issues. I can't seem to submit two jobs to a node with 2 GPUs. The second job goes into pending state even though there is a free GPU:

    [simran@amber400 ~]$ sinfo -lNe | grep -i amber240
    amber240 1 test idle 40 2:10:2 31943 0 1 (null) none

    [simran@amber400 ~]$ scontrol show node amber240
    NodeName=amber240 Arch=x86_64 CoresPerSocket=10
    CPUAlloc=0 CPUErr=0 CPUTot=40 CPULoad=0.00
    Features=(null) Gres=gpu:2
    NodeAddr=amber240 NodeHostName=amber240 Version=(null)
    OS=Linux RealMemory=31943 AllocMem=0 Sockets=2 Boards=1
    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
    BootTime=2015-06-13T12:34:26 SlurmdStartTime=2015-06-13T12:35:34
    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

    [simran@amber400 ~]$ srun -p test --gres=gpu:1 -u bash -i
    bash: cannot set terminal process group (-1): Invalid argument
    bash: no job control in this shell
    simran@amber240:~ %

Trying another job submission now does not work:

    [simran@amber400 ~]$ srun -p test --gres=gpu:1 -u bash -i
    srun: job 230 queued and waiting for resources

Your assistance with this would be greatly appreciated. I only have a few more hours left in my downtime and would like to get this resolved soon.

Thanks,
-Simran

Try specifying DefMemPerCPU in your slurm.conf. If you don't specify a memory limit, the job will use all of the memory on the node.

That seems to have worked; asking the users to test. Is there a way to set this to a different default per partition, or is it just a global setting across all partitions?

Thanks,
-Simran

Great. It was changed in 14.11.5 so that if you didn't specify a memory limit, all of the memory would be used. You can set DefMemPerCPU at the partition level as well, e.g. PartitionName=debug ... DefMemPerCPU=1024. How is 14.11.6 running for you?

So far so good. I changed DefMemPerCPU as recommended and did not see any further issues. We have not really loaded the cluster with user jobs yet, so we will probably know in a few weeks if any other issues arise. Feel free to close this request for now. Thanks for all your help!

Regards,
-Simran

Good to hear. Let us know if you have any other issues.

Thanks,
Brian
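To make the partition-level option concrete, a slurm.conf fragment along these lines would give one partition its own memory default while keeping a global fallback (the node names and values here are illustrative, not taken from the attached config):

```
# Global fallback: 1 GB of memory per allocated CPU
DefMemPerCPU=1024

# Partition-level override: jobs in "test" default to 2 GB per CPU instead
PartitionName=test Nodes=amber240 DefMemPerCPU=2048 State=UP
```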
Hello,

The Slurm scheduler seems to be incorrectly assigning GPUs for Amber jobs, which is causing some jobs to fail intermittently. Here is an example of the scenario that is currently happening. My Slurm head node is amber300, and amber221 is one of the compute nodes. It currently has one job running on it:

    [root@amber300 slurm]# squeue -l | grep -i amber221
    17753 amber2 test9.pr tomp RUNNING 16-08:43:00 UNLIMITED 1 amber221

This job has asked for 1 GPU:

    [root@amber300 slurm]# scontrol show job 17753
    JobId=17753 Name=test9.prod
    UserId=tomp(2647) GroupId=Pharmrd(176)
    Priority=4294884127 Nice=0 Account=research QOS=normal
    JobState=RUNNING Reason=None Dependency=(null)
    Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
    RunTime=16-08:43:19 TimeLimit=UNLIMITED TimeMin=N/A
    SubmitTime=2015-05-22T07:03:14 EligibleTime=2015-05-22T07:03:14
    StartTime=2015-05-22T07:03:15 EndTime=Unknown
    PreemptTime=None SuspendTime=None SecsPreSuspend=0
    Partition=amber2 AllocNode:Sid=amber300:16464
    ReqNodeList=(null) ExcNodeList=(null)
    NodeList=amber221 BatchHost=amber221
    NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
    Features=(null) Gres=gpu:1 Reservation=(null)
    Shared=0 Contiguous=0 Licenses=(null) Network=(null)
    Command=/gne/dev/pharmr/pharmr/tomp/MD/subscripts/cont.slurm
    WorkDir=/gne/dev/pharmr/pharmr/tomp/DomainSwap/test9
    StdErr=/gne/dev/pharmr/pharmr/tomp/DomainSwap/test9/slurm-17753.out
    StdIn=/dev/null
    StdOut=/gne/dev/pharmr/pharmr/tomp/DomainSwap/test9/slurm-17753.out

I see Gres=gpu:1.
Now, when I log in to the node running that job and look at /proc/pid/environ, I can confirm that CUDA_VISIBLE_DEVICES=0 is set and the job is running on GPU ID 0:

    [root@amber221 ~]# nvidia-smi
    Sun Jun  7 15:47:52 2015
    +------------------------------------------------------+
    | NVIDIA-SMI 331.49     Driver Version: 331.49         |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K20Xm         On   | 0000:08:00.0     Off |                   0* |
    | N/A   60C    P0    72W / 235W |   2316MiB /  5759MiB |      0%   E. Process |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla K20Xm         On   | 0000:27:00.0     Off |                   0* |
    | N/A   39C    P8    16W / 235W |     13MiB /  5759MiB |      0%   E. Process |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Compute processes:                                               GPU Memory |
    |  GPU       PID  Process name                                     Usage      |
    |=============================================================================|
    |    0     42364  pmemd.cuda                                       2300MiB    |
    +-----------------------------------------------------------------------------+

So now, if I submit another job, that variable should be set to 1 and not 0:

    [simran@amber300 ~]$ srun -p amber2 -w amber221 --gres=gpu:1 -u bash -i
    bash: cannot set terminal process group (-1): Invalid argument
    bash: no job control in this shell
    simran@amber221:~ % env | grep -i visible
    CUDA_VISIBLE_DEVICES=0
    simran@amber221:~ %

However, I still get CUDA_VISIBLE_DEVICES set to 0. Should that not be set to 1? That is being set by Slurm, correct? Your help with this would be appreciated.

-Simran
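As background on what this check is testing: Slurm sets CUDA_VISIBLE_DEVICES for each job from the GRES it allocated, and the CUDA runtime renumbers whatever GPUs are listed starting at 0 inside the job. So two single-GPU jobs on this node should see CUDA_VISIBLE_DEVICES=0 and CUDA_VISIBLE_DEVICES=1 respectively, even though each application still addresses "device 0". A minimal sketch of the remapping (plain bash, no Slurm required; the variable's value is assumed for illustration):

```shell
# Sketch: how CUDA_VISIBLE_DEVICES remaps physical GPU IDs inside a job.
# Assume a second job was (correctly) granted physical GPU 1 only:
export CUDA_VISIBLE_DEVICES=1

# CUDA renumbers the visible devices from 0, so "device 0" in the
# application (e.g. pmemd.cuda) is physical GPU 1 here.
IFS=',' read -ra visible <<< "$CUDA_VISIBLE_DEVICES"
echo "visible GPU count: ${#visible[@]}"               # -> 1
echo "logical device 0 is physical GPU ${visible[0]}"  # -> 1
```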