Ticket 5198 - GPU jobs are not launched: "Required node not available (down, drained or reserved)"
Summary: GPU jobs are not launched: "Required node not available (down, drained or reserved)"
Status: RESOLVED DUPLICATE of ticket 4932
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 17.02.10
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Felip Moll
 
Reported: 2018-05-21 17:54 MDT by Sergey Meirovich
Modified: 2018-08-22 13:49 MDT

Site: AMAT


Attachments
slurm.conf (9.24 KB, text/plain)
2018-05-21 17:57 MDT, Sergey Meirovich
Details
gres.conf (669 bytes, text/plain)
2018-05-21 17:58 MDT, Sergey Meirovich
Details
slurmctld log (1.72 MB, application/x-gzip)
2018-05-21 17:59 MDT, Sergey Meirovich
Details
slurmctld SlurmctldDebug=6 DebugFlags=SelectType (3.09 MB, application/x-gzip)
2018-05-22 15:13 MDT, Sergey Meirovich
Details
slurmctld log SlurmctldDebug=6 DebugFlags=SelectType,Gres (3.58 MB, application/x-gzip)
2018-05-22 15:43 MDT, Sergey Meirovich
Details
Files collected as per c16 for job 925717 (560.09 KB, application/x-bzip)
2018-08-17 11:51 MDT, Sergey Meirovich
Details

Description Sergey Meirovich 2018-05-21 17:54:27 MDT
[root@DCALPH000 ~]# scontrol show node=dcalph134
NodeName=dcalph134 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=9 CPUErr=0 CPUTot=48 CPULoad=4.51
   AvailableFeatures=8160M,768G,nv-p100,rhel7
   ActiveFeatures=8160M,768G,nv-p100,rhel7
   Gres=gpu:p100:1
   NodeAddr=dcalph134 NodeHostName=dcalph134 Version=17.02
   OS=Linux RealMemory=773521 AllocMem=0 FreeMem=715530 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=1484434 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu,gpu_open 
   BootTime=Apr 18 18:58 SlurmdStartTime=May 21 16:44
   CfgTRES=cpu=48,mem=773521M,gres/gpu=1,gres/gpu:p100=1
   AllocTRES=cpu=9
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

[root@DCALPH000 ~]# 

But:

-bash-4.1$ srun  -p gpu --gres=gpu:p100:1 -n 1 nvidia-smi
srun: Required node not available (down, drained or reserved)
srun: job 754150 queued and waiting for resources

Why are GPU jobs not being accepted?
Comment 1 Sergey Meirovich 2018-05-21 17:57:50 MDT
Created attachment 6911 [details]
slurm.conf
Comment 2 Sergey Meirovich 2018-05-21 17:58:20 MDT
Created attachment 6912 [details]
gres.conf
Comment 3 Sergey Meirovich 2018-05-21 17:59:13 MDT
Created attachment 6913 [details]
slurmctld log
Comment 4 Felip Moll 2018-05-22 06:10:46 MDT
Hi Sergey,

Can you check if the node is in a reservation?

scontrol show res

I am moving this to sev-3 since the system still seems usable.
Comment 5 Felip Moll 2018-05-22 06:15:02 MDT
There are also many errors like these:

[2018-05-21T16:42:49.803] error: Node dcalph134 appears to have a different slurm.conf than the slurmctld.  This could cause issues with communication and functionality.  Please review both files and make sure they are the same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.


Please make sure your slurm.conf is identical on all your cluster nodes/servers, that you've issued a scontrol reconfig, and that the error disappears.
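A minimal sketch of the kind of check meant here: compare a controller-side copy of slurm.conf against a node's copy by byte comparison. The paths and the way the node copy would be fetched (e.g. scp from dcalph134) are assumptions; two local stand-in files are used so the comparison itself can be shown end to end.

```shell
# In practice the node's copy would be fetched first, e.g.:
#   scp dcalph134:/etc/slurm/slurm.conf /tmp/slurm.conf.node
# Stand-in files for illustration only:
printf 'NodeName=dcalph134 Gres=gpu:p100:1\n' > /tmp/slurm.conf.ctld
printf 'NodeName=dcalph134 Gres=gpu:p100:1\n' > /tmp/slurm.conf.node

# cmp -s exits 0 only if the files are byte-identical
if cmp -s /tmp/slurm.conf.ctld /tmp/slurm.conf.node; then
    echo "slurm.conf matches"
else
    echo "slurm.conf differs -- sync it, then run: scontrol reconfig"
fi
```

Repeating this against every node (or running a checksum such as md5sum on each) would confirm the NO_CONF_HASH error is really gone cluster-wide.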
Comment 6 Felip Moll 2018-05-22 09:31:46 MDT
And finally, this also happens if there's a job already using the gpu. Since the state was MIXED, there was a job on the node.

Can you run a 'scontrol show jobs' and identify if it is using the gpu?
Comment 7 Sergey Meirovich 2018-05-22 11:30:26 MDT
(In reply to Felip Moll from comment #5)
> There are also many errors like these:
> 
> [2018-05-21T16:42:49.803] error: Node dcalph134 appears to have a different
> slurm.conf than the slurmctld.  This could cause issues with communication
> and functionality.  Please review both files and make sure they are the
> same.  If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
> slurm.conf.
> 
> 
> Please, make sure your slurm.conf is equal in all your cluster nodes/servers
> and you've issued a scontrol reconfig and that the error disappears.

These were artefacts that predate the issue. As you can see, right now there is no mismatch, but we still see the issue.
[root@DCALPH000 class]# ssh dcalph134 service slurm restart
Restarting slurm (via systemctl):  [  OK  ]
[root@DCALPH000 class]# service slurm restart
stopping slurmctld:                                        [  OK  ]
slurmctld is stopped
starting slurmctld:                                        [  OK  ]
[root@DCALPH000 class]# grep  'Node dcalph134 appears to have a different' /var/log/slurmctld
[root@DCALPH000 class]# logout
-bash-4.1$ srun -p gpu --gres=gpu:1 -n 1 singularity exec --nv /dat/sw/singularity/tensorflow-18.04-py3.simg python /dat/usr/e154466/models/tutorials/image/mnist/convolutional.py
srun: Required node not available (down, drained or reserved)
srun: job 755200 queued and waiting for resources
^Csrun: Job allocation 755200 has been revoked
srun: Force Terminated job 755200
-bash-4.1$
Comment 8 Sergey Meirovich 2018-05-22 11:34:39 MDT
(In reply to Felip Moll from comment #6)
> And finally, this also happens if there's a job already using the gpu. Since
> the state was MIXED, there was a job in the node.
> 
> Can you run a 'scontrol show jobs' and identify if it is using the gpu?

Hmm,

I see something pretty strange:

JobId=755200 JobName=singularity
   UserId=e154466(19383) GroupId=boks_users(2080) MCS_label=N/A
   Priority=10099999 Nice=0 Account=e154466_gpu QOS=normal
   JobState=CANCELLED Reason=ReqNodeNotAvail,_UnavailableNodes: Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=255:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=May 22 10:29 EligibleTime=May 22 10:29
   StartTime=May 22 10:29 EndTime=May 22 10:29 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=gpu AllocNode:Sid=DCALPH000:77705
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:1 Reservation=(null)
   OverSubscribe=USER Contiguous=0 Licenses=(null) Network=(null)
   Command=singularity
   WorkDir=/user/e154466
   Comment={"script":"","cmdline":"singularity","comment":"","env":{}} 
   Power=
   

-bash-4.1$ 

Why are CANCELLED jobs still showing up in scontrol show jobs?
Comment 9 Sergey Meirovich 2018-05-22 11:49:08 MDT
Ok. Cancelled jobs are showing up in scontrol show jobs only for a limited amount of time.

Now I do not see anything referencing GPU at all, but the behaviour has not changed at all.

-bash-4.1$ scontrol show jobs > /tmp/scontrol-show-jobs 
-bash-4.1$ scontrol show job | grep 'TRES=' | grep gpu
-bash-4.1$ srun -p gpu --gres=gpu:1 -n 1 singularity exec --nv /dat/sw/singularity/tensorflow-18.04-py3.simg python /dat/usr/e154466/models/tutorials/image/mnist/convolutional.py
srun: Required node not available (down, drained or reserved)
srun: job 755223 queued and waiting for resources
Comment 10 Sergey Meirovich 2018-05-22 11:53:01 MDT
Created attachment 6916 [details]
scontrol show jobs > /tmp/scontrol-show-jobs  # from comment #9

Comment 11 Sergey Meirovich 2018-05-22 12:51:56 MDT
(In reply to Felip Moll from comment #4)
> Hi Sergey,
> 
> Can you check if the node is in a reservation?
> 
> scontrol show res
> 
> I am moving to sev-3 since the system seems still usable.

There is no reservation.

-bash-4.1$ scontrol show  res
No reservations in the system
-bash-4.1$ 

All in all, something very suspicious is going on.
Comment 12 Sergey Meirovich 2018-05-22 12:57:08 MDT
The only job running on that node is #749725.
But it is running in the gpu_open partition, allocated a single CPU core, and is not using any GPU at all.
Comment 13 Sergey Meirovich 2018-05-22 15:13:35 MDT
Created attachment 6917 [details]
slurmctld SlurmctldDebug=6 DebugFlags=SelectType

I took the liberty and collected slurmctld logs with
SlurmctldDebug=6
DebugFlags=SelectType

Job is 755274

Please look into it as well
Comment 14 Sergey Meirovich 2018-05-22 15:43:49 MDT
Created attachment 6918 [details]
slurmctld log SlurmctldDebug=6 DebugFlags=SelectType,Gres

After a bit of thinking I have added the Gres flag, which clearly shows that the GPU is available.

SlurmctldDebug=6
DebugFlags=SelectType,Gres

Job #755355
Comment 15 Felip Moll 2018-05-23 04:14:12 MDT
After looking at all your logs I still really don't see the problem.

Can you also upload the slurmd log?
Does it happen only when you specify gpu?

In your first comment, in show node dcalph134, I see:

   CfgTRES=cpu=48,mem=773521M,gres/gpu=1,gres/gpu:p100=1
   AllocTRES=cpu=9

There were 9 cpus allocated at that time.


Will try to reproduce it with your configuration in my testing servers.
Comment 16 Felip Moll 2018-05-23 06:51:27 MDT
Also, do this:

$ srun  -p gpu --gres=gpu:p100:1 -n 1 nvidia-smi

*do not cancel it*

While the job is pending:

$ scontrol show jobs
$ sinfo
$ squeue
$ scontrol show node dcalph134
Comment 17 Sergey Meirovich 2018-05-23 13:58:05 MDT
Hmm,

We rebooted the node and the issue went away. Not sure why. We still have another issue with that node.

A running job is not appearing in squeue:

-bash-4.1$ sacct -r gpu_open
       JobID    JobName  Partition    Account  AllocCPUS      State        NodeList ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- --------------- -------- 
756822             wrap   gpu_open e154466_g+          1    RUNNING       dcalph134      0:0 
-bash-4.1$ scontrol show jobid=756822
JobId=756822 JobName=wrap
   UserId=e154466(19383) GroupId=boks_users(2080) MCS_label=N/A
   Priority=1349999 Nice=0 Account=e154466_gpu QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=17:10:47 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=May 22 19:45 EligibleTime=May 22 19:45
   StartTime=May 22 19:45 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=gpu_open AllocNode:Sid=DCALPH000:93594
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dcalph134
   BatchHost=dcalph134
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/user/e154466
   Comment={"script":"#!/bin/sh\n# This script was created by sbatch --wrap.\n\nsleep 100000\n","cmdline":"","comment":"","env":{"MODULE_VERSION_STACK":"3.2.10","MANPATH":"/cm/shared/apps/slurm/17.02.10/man:/usr/share/man/overrides:/usr/share/man/en:/usr/share/man:/opt/boksm/man:/usr/local/share/man:/cm/local/apps/environment-modules/current/share/man","HOSTNAME":"DCALPH000","TERM":"xterm","SHELL":"/bin/bash","HISTSIZE":"1000","SSH_CLIENT":"172.24.4.121 51346 22","LIBRARY_PATH":"/cm/shared/apps/slurm/17.02.10/lib64/slurm:/cm/shared/apps/slurm/17.02.10/lib64","QTDIR":"/usr/lib64/qt-3.3","QTINC":"/usr/lib64/qt-3.3/include","SSH_TTY":"/dev/pts/43","SQUEUE_PARTITION":"test,interact,license,lic_low,normal,low,open","USER":"e154466","LD_LIBRARY_PATH":"/usr/local/cuda-9.1/lib64:/cm/shared/apps/slurm/17.02.10/lib64/slurm:/cm/shared/apps/slurm/17.02.10/lib64","LS_COLORS":"rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*
.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:","SINFO_FORMAT":"%n %.10T %.5a %.8e %.7m %.4c %.8O %C","CPATH":"/cm/shared/apps/slurm/17.02.10/include","SACCT_FORMAT":"JobID,JobName,Partition,Account,AllocCPUS,State,NodeList,ExitCode","MODULE_VERSION":"3.2.10","MAIL":"/var/spool/mail/e154466","PATH":"/usr/local/cuda-9.1/bin:/sw/QuantumWise/VNL-ATK-2017.1/bin:/sw/QuantumWise/VNL-ATK-2017.1/bin:/sw/QuantumWise/VNL-ATK-2017.1/bin:/cm/shared/apps/slurm/17.02.10/sbin:/cm/shared/apps/slurm/17.02.10/bin:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/opt/boksm/bin:/usr/local/sbin:/usr/sbin:/sbin:/sbin:/usr/sbin:/cm/local/apps/environment-modules/3.2.10/bin:/opt/dell/srvadmin/bin","SQUEUE_SORT":"U,P,N","PWD":"/user/e154466","_LMFILES_":"/cm/shared/modulefiles/slurm/17.02.10:/cm/shared/modulefiles/app_env/cuda-9.1","LANG":"en_US.UTF-8","MODULEPATH":"/cm/local/modulefiles:/cm/shared/modulefiles","ESI_HOME":"/hpc_lsf/application/ESI_Software","LOADEDMODULES":"slurm/17.02.10:app_env/cuda-9.1","SSH_ASKPASS":"/usr/libexec/openssh/gnome-ssh-askpass","HISTCONTROL":"ignoredups","SQUEUE_FORMAT2":"jobid:7,username:9,statecompact:3,partition:13,name:15,command:10,submittime:13,numcpus:5,gres:11,numnodes:6,reasonlist:50","SHLVL":"1","HOME":"/user/e154466","LOGNAME":"e154466","QTLIB":"/usr/lib64/qt-3.3/lib","CVS_RSH":"ssh","SSH_CONNECTION":"172.24.4.121 51346 10.41.26.200 22","MODULESHOME":"/cm/local/apps/environment-modules/3.2.10/Modules/3.2.10","SLURM_TIME_FORMAT":"%b %e %k:%M","LESSOPEN":"||/usr/bin/lesspipe.sh %s","G_BROKEN_FILENAMES":"1","BASH_FUNC_module()":"() {  eval `/cm/local/apps/environment-modules/3.2.10/Modules/$MODULE_VERSION/bin/modulecmd bash 
$*`\n}","_":"/cm/shared/apps/slurm/17.02.10/bin/sbatch","SLURM_NPROCS":"1","SLURM_NTASKS":"1","SLURM_JOB_NAME":"wrap","SLURM_RLIMIT_CPU":"18446744073709551615","SLURM_RLIMIT_FSIZE":"18446744073709551615","SLURM_RLIMIT_DATA":"18446744073709551615","SLURM_RLIMIT_STACK":"18446744073709551615","SLURM_RLIMIT_CORE":"0","SLURM_RLIMIT_RSS":"18446744073709551615","SLURM_RLIMIT_NPROC":"2066973","SLURM_RLIMIT_NOFILE":"65536","SLURM_RLIMIT_MEMLOCK":"18446744073709551615","SLURM_RLIMIT_AS":"18446744073709551615","SLURM_PRIO_PROCESS":"0","SLURM_SUBMIT_DIR":"/user/e154466","SLURM_SUBMIT_HOST":"DCALPH000","SLURM_UMASK":"0022"}} 
   StdErr=/user/e154466/slurm-756822.out
   StdIn=/dev/null
   StdOut=/user/e154466/slurm-756822.out
   Power=
   

-bash-4.1$ squeue -w dcalph134
JOBID  USER     ST PARTITION    NAME           COMMAND   SUBMIT_TIME  CPUS GRES       NODES NODELIST(REASON)                                  
-bash-4.1$ squeue -u e154466
JOBID  USER     ST PARTITION    NAME           COMMAND   SUBMIT_TIME  CPUS GRES       NODES NODELIST(REASON)                                  
755497 e154466  R  test         wrap           (null)    May 22 16:52 1    (null)     1     dcalph132                                         
755498 e154466  R  test         wrap           (null)    May 22 16:53 1    gpu:2      1     dcalph198                                         
-bash-4.1$ 


I am going to open another bug for this.
Comment 18 Sergey Meirovich 2018-05-23 14:44:39 MDT
Please put this bug on hold. For the squeue issues I have opened https://bugs.schedmd.com/show_bug.cgi?id=5208
Comment 19 Felip Moll 2018-05-24 01:16:27 MDT
I think your slurmd restart on the node was not successful, or there were other stalled processes that got cleaned up once you rebooted. Maybe the node had some kind of problem (network, I/O, ...); it would be worth investigating.

I am closing this issue now, but please, reopen if you encounter more problems. If this is the case upload the info requested in comment 16.

Regards
Comment 20 Sergey Meirovich 2018-08-17 11:45:47 MDT
I have just reproduced the issue with host dcalph198:
-bash-4.1$ srun -p gpu --gres=gpu:2 nvidia-smi
srun: Required node not available (down, drained or reserved)
srun: job 925717 queued and waiting for resources

Going to upload files collected as per comment 16.
Comment 21 Sergey Meirovich 2018-08-17 11:51:28 MDT
Created attachment 7637 [details]
Files collected as per c16 for job 925717
Comment 22 Felip Moll 2018-08-20 06:40:34 MDT
I see in the squeue output your job 925717 requesting partition gpu and gpu:2, but at the same time showing ReqNodeNotAvail with unrelated UnavailableNodes
dcalph001 and dcalph128. These two nodes are effectively set to drain due to some failure.

squeue:925717 e154466  PD gpu          nvidia-smi     nvidia-smiAug 17 10:40 1    gpu:2      1     (ReqNodeNotAvail, UnavailableNodes:dcalph[001,128]

There's a related bug already fixed in 17.11.6: bug 4932.

It seems the cause of the problem is that slurmctld can set a wrong reason for jobs with "--exclusive=user", and your partition gpu has this flag set: ExclusiveUser=Yes.
In theory this state shouldn't be permanent or disturb scheduling, but for your curiosity the commits are:

https://github.com/SchedMD/slurm/commit/e2a14b8d7f4f
https://github.com/SchedMD/slurm/commit/fc4e5ac9e056

Are you planning to upgrade? 18.08 is to be released soon and 17.02 will end its support period at that time.

I also want to recommend removing the Shared= parameter from the partitions, since it has been deprecated in favor of OverSubscribe (see man slurm.conf).
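For reference, a hypothetical slurm.conf partition fragment illustrating the change recommended above. The node list and the Shared=NO value are illustrative assumptions, not taken from the attached slurm.conf; Shared values map one-to-one onto OverSubscribe values (NO, YES, EXCLUSIVE, FORCE).

```
# Deprecated form (Shared= is superseded by OverSubscribe=):
#PartitionName=gpu Nodes=dcalph[134,198] ExclusiveUser=YES Shared=NO

# Recommended equivalent using OverSubscribe:
PartitionName=gpu Nodes=dcalph[134,198] ExclusiveUser=YES OverSubscribe=NO
```

Note that ExclusiveUser=YES is kept here because the partition relies on it; it is only the Shared= spelling that is deprecated, and it is this ExclusiveUser flag that triggers the wrong-reason bug fixed in 17.11.6.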
Comment 23 Sergey Meirovich 2018-08-22 13:49:10 MDT
Thanks a lot! Closing it.

*** This ticket has been marked as a duplicate of ticket 4932 ***