Ticket 1729 - Slurm assigning incorrect GPU IDs
Summary: Slurm assigning incorrect GPU IDs
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 14.03.9
Hardware: Linux
Severity: 2 - High Impact
Assignee: Brian Christiansen
 
Reported: 2015-06-07 10:49 MDT by Simran
Modified: 2015-06-17 03:50 MDT

See Also:
Site: Genentech (Roche)


Attachments
slurm.conf (4.33 KB, text/plain)
2015-06-07 10:50 MDT, Simran
Details
gres.conf (161 bytes, text/plain)
2015-06-07 10:51 MDT, Simran
Details

Description Simran 2015-06-07 10:49:37 MDT
Hello,

The Slurm scheduler seems to be incorrectly assigning GPUs for Amber jobs, which is causing some jobs to fail intermittently.  Here is an example of the scenario that is currently happening.  My Slurm head node is amber300, and amber221 is one of the compute nodes.  It currently has one job running on it:

[root@amber300 slurm]# squeue  -l | grep -i amber221
             17753    amber2 test9.pr     tomp  RUNNING 16-08:43:00 UNLIMITED      1 amber221

This job has asked for 1 GPU:

[root@amber300 slurm]# scontrol show job 17753
JobId=17753 Name=test9.prod
   UserId=tomp(2647) GroupId=Pharmrd(176)
   Priority=4294884127 Nice=0 Account=research QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=16-08:43:19 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2015-05-22T07:03:14 EligibleTime=2015-05-22T07:03:14
   StartTime=2015-05-22T07:03:15 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=amber2 AllocNode:Sid=amber300:16464
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=amber221
   BatchHost=amber221
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=gpu:1 Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/gne/dev/pharmr/pharmr/tomp/MD/subscripts/cont.slurm
   WorkDir=/gne/dev/pharmr/pharmr/tomp/DomainSwap/test9
   StdErr=/gne/dev/pharmr/pharmr/tomp/DomainSwap/test9/slurm-17753.out
   StdIn=/dev/null
   StdOut=/gne/dev/pharmr/pharmr/tomp/DomainSwap/test9/slurm-17753.out

I see Gres=gpu:1.  Now, when I log in to the node running that job and look at /proc/&lt;pid&gt;/environ, I can confirm that CUDA_VISIBLE_DEVICES=0 is set and the job is running on GPU ID 0:

[root@amber221 ~]# nvidia-smi 
Sun Jun  7 15:47:52 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 331.49     Driver Version: 331.49         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         On   | 0000:08:00.0     Off |                   0* |
| N/A   60C    P0    72W / 235W |   2316MiB /  5759MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm         On   | 0000:27:00.0     Off |                   0* |
| N/A   39C    P8    16W / 235W |     13MiB /  5759MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     42364  pmemd.cuda                                          2300MiB |
+-----------------------------------------------------------------------------+
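The environ check described above can be scripted; a minimal sketch follows. Entries in /proc/&lt;pid&gt;/environ are NUL-separated, so they must be translated to newlines before grepping (42364 is the pmemd.cuda PID from the nvidia-smi listing; the synthetic string below stands in for a real environ file):

```shell
# For the running job, the check would be:
#     tr '\0' '\n' < /proc/42364/environ | grep CUDA_VISIBLE_DEVICES
# Demonstrated here on a synthetic NUL-separated string:
printf 'PATH=/usr/bin\0CUDA_VISIBLE_DEVICES=0\0' | tr '\0' '\n' | grep CUDA_VISIBLE_DEVICES
# prints: CUDA_VISIBLE_DEVICES=0
```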

So now, if I submit another job, that variable should be set to 1, not 0:

[simran@amber300 ~]$ srun -p amber2 -w amber221 --gres=gpu:1 -u bash -i
bash: cannot set terminal process group (-1): Invalid argument
bash: no job control in this shell
simran@amber221:~ % env | grep -i visible
env | grep -i visible
CUDA_VISIBLE_DEVICES=0
simran@amber221:~ % 

However, I still get CUDA_VISIBLE_DEVICES set to 0.  Should that not be set to 1?  That is being set by Slurm, correct?

Your help with this would be appreciated.

-Simran
Comment 1 Simran 2015-06-07 10:50:41 MDT
Created attachment 1963 [details]
slurm.conf
Comment 2 Simran 2015-06-07 10:51:06 MDT
Created attachment 1964 [details]
gres.conf
Comment 3 Moe Jette 2015-06-07 15:42:34 MDT
Your configuration looks good. I suspect this is a bug that was fixed in version 14.03.10:

* Changes in Slurm 14.03.10
===========================
 -- Fix bug that prevented preservation of a job's GRES bitmap on slurmctld
    restart or reconfigure (bug was introduced in 14.03.5 "Clear record of a
    job's gres when requeued" and only applies when GRES mapped to specific
    files).

Someone will investigate more on Monday morning, but I believe the problem will persist until version 14.03.10 (or later) is installed and all jobs started under earlier versions of Slurm end.
Comment 4 Brian Christiansen 2015-06-08 07:03:23 MDT
We are unable to reproduce this. Was the slurmctld restarted, or reconfigured, during the lifetime of job 17753?

The commit Moe referenced is:
https://github.com/SchedMD/slurm/commit/1209a664840a431428658c0950b16078af7aff63

which was found in Bug 1192.

Can you apply the above patch or upgrade to 14.03.10 or .11?
Comment 5 Simran 2015-06-13 09:09:02 MDT
Okay, so I just got some downtime for today and finally upgraded to Slurm v14.11.6.  However, now I am having even more issues.  I can't seem to submit two jobs to a node with two GPUs.  The second job goes into a pending state even though there is a free GPU:

[simran@amber400 ~]$ sinfo -lNe | grep -i amber240
amber240                1      test        idle   40   2:10:2  31943        0      1   (null) none                
[simran@amber400 ~]$ scontrol show node amber240
NodeName=amber240 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUErr=0 CPUTot=40 CPULoad=0.00 Features=(null)
   Gres=gpu:2
   NodeAddr=amber240 NodeHostName=amber240 Version=(null)
   OS=Linux RealMemory=31943 AllocMem=0 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
   BootTime=2015-06-13T12:34:26 SlurmdStartTime=2015-06-13T12:35:34
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

[simran@amber400 ~]$ srun -p test --gres=gpu:1 -u bash -i
bash: cannot set terminal process group (-1): Invalid argument
bash: no job control in this shell
simran@amber240:~ % 


Submitting another job now does not work:

[simran@amber400 ~]$ srun -p test --gres=gpu:1 -u bash -i
srun: job 230 queued and waiting for resources


--

Your assistance with this would be greatly appreciated.  I only have a few hours left in my downtime window and would like to get this resolved soon.

Thanks,
-Simran
Comment 6 Brian Christiansen 2015-06-13 09:21:54 MDT
Try specifying DefMemPerCPU in your slurm.conf. If you don't specify a memory limit, the job will use all of the memory on the node.
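A minimal slurm.conf sketch of the suggestion above (the 1024 MB value is illustrative, not a recommendation; size it so a default job leaves memory free for other jobs on the node):

```
# Default memory per allocated CPU, in MB. Without a default, a job that
# requests no memory is allocated all of the node's memory, which blocks
# other jobs (including GPU jobs) from starting on that node.
DefMemPerCPU=1024
```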
Comment 7 Simran 2015-06-13 09:27:50 MDT
That seems to have worked; asking the users to test.  Is there a way to set a different default per partition, or is it just a global setting across all partitions?

Thanks,
-Simran
Comment 8 Brian Christiansen 2015-06-13 09:37:22 MDT
Great. The behavior changed in 14.11.5: if you don't specify a memory limit, all of the memory on the node is allocated. You can set DefMemPerCPU at the partition level as well. 

ex.
PartitionName=debug ... DefMemPerCPU=1024
Comment 9 Brian Christiansen 2015-06-15 04:28:56 MDT
How is 14.11.6 running for you?
Comment 10 Simran 2015-06-16 11:59:34 MDT
So far so good.  I changed DefMemPerCPU as recommended and have not seen any further issues.  We have not really loaded the cluster with user jobs yet, so we will probably know in a few weeks if any other issues arise.  Feel free to close this request for now.  Thanks for all your help!

Regards,
-Simran
Comment 11 Brian Christiansen 2015-06-17 03:50:36 MDT
Good to hear. Let us know if you have any other issues.

Thanks,
Brian