| Summary: | GPU jobs are not launched: "Required node not available (down, drained or reserved)" | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sergey Meirovich <sergey_meirovich> |
| Component: | Scheduling | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | felip.moll |
| Version: | 17.02.10 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | AMAT | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf; gres.conf; slurmctld log; slurmctld SlurmctldDebug=6 DebugFlags=SelectType; slurmctld log SlurmctldDebug=6 DebugFlags=SelectType,Gres; Files collected as per c16 for job 925717 |
|
Description
Sergey Meirovich
2018-05-21 17:54:27 MDT
Created attachment 6911 [details]
slurm.conf
Created attachment 6912 [details]
gres.conf
Created attachment 6913 [details]
slurmctld log
Hi Sergey,

Can you check if the node is in a reservation?

scontrol show res

I am moving to sev-3 since the system still seems usable.

There are also many errors like these:

[2018-05-21T16:42:49.803] error: Node dcalph134 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

Please make sure your slurm.conf is identical on all your cluster nodes/servers, that you have issued a scontrol reconfig, and that the error disappears.

And finally, this also happens if there is a job already using the GPU. Since the node state was MIXED, there was a job on the node. Can you run 'scontrol show jobs' and identify whether it is using the GPU?

(In reply to Felip Moll from comment #5)
> There are also many errors like these:
>
> [2018-05-21T16:42:49.803] error: Node dcalph134 appears to have a different
> slurm.conf than the slurmctld. This could cause issues with communication
> and functionality. Please review both files and make sure they are the
> same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
> slurm.conf.
>
> Please, make sure your slurm.conf is equal in all your cluster nodes/servers
> and you've issued a scontrol reconfig and that the error disappears.

Those were artefacts that predate the issue. As you can see, there is no mismatch now, but we still see the issue.
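As an aside, the usual way to rule out slurm.conf drift like the NO_CONF_HASH error above is to compare checksums of the controller's and each node's copy. This is only a sketch: on a real cluster you would run something like `md5sum /etc/slurm/slurm.conf; ssh dcalph134 md5sum /etc/slurm/slurm.conf` (the path and node name are illustrative); here the comparison logic is demonstrated with two local copies so the script is self-contained.

```shell
#!/bin/sh
# Sketch: detect slurm.conf drift by comparing checksums.
# On a real cluster the second copy would come from a compute node via
# ssh/pdsh; here we simulate it with a local duplicate.
dir=$(mktemp -d)
printf 'ClusterName=demo\nSelectType=select/cons_res\n' > "$dir/ctld.conf"
cp "$dir/ctld.conf" "$dir/node.conf"            # simulate the node's copy
ref=$(md5sum "$dir/ctld.conf" | awk '{print $1}')
h=$(md5sum "$dir/node.conf" | awk '{print $1}')
if [ "$h" = "$ref" ]; then
    echo "slurm.conf MATCH"
else
    echo "slurm.conf MISMATCH"
fi
```

If any node reports a different checksum, fix the file there and run `scontrol reconfig`.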
[root@DCALPH000 class]# ssh dcalph134 service slurm restart
Restarting slurm (via systemctl):                          [  OK  ]
[root@DCALPH000 class]# service slurm restart
stopping slurmctld:                                        [  OK  ]
slurmctld is stopped
starting slurmctld:                                        [  OK  ]
[root@DCALPH000 class]# grep 'Node dcalph134 appears to have a different' /var/log/slurmctld
[root@DCALPH000 class]# logout
-bash-4.1$ srun -p gpu --gres=gpu:1 -n 1 singularity exec --nv /dat/sw/singularity/tensorflow-18.04-py3.simg python /dat/usr/e154466/models/tutorials/image/mnist/convolutional.py
srun: Required node not available (down, drained or reserved)
srun: job 755200 queued and waiting for resources
^Csrun: Job allocation 755200 has been revoked
srun: Force Terminated job 755200
-bash-4.1$

(In reply to Felip Moll from comment #6)
> And finally, this also happens if there's a job already using the gpu. Since
> the state was MIXED, there was a job in the node.
>
> Can you run a 'scontrol show jobs' and identify if it is using the gpu?

Hmm, I see something pretty strange:

JobId=755200 JobName=singularity
   UserId=e154466(19383) GroupId=boks_users(2080) MCS_label=N/A
   Priority=10099999 Nice=0 Account=e154466_gpu QOS=normal
   JobState=CANCELLED Reason=ReqNodeNotAvail,_UnavailableNodes: Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=255:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=May 22 10:29 EligibleTime=May 22 10:29
   StartTime=May 22 10:29 EndTime=May 22 10:29 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=gpu AllocNode:Sid=DCALPH000:77705
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:1 Reservation=(null)
   OverSubscribe=USER Contiguous=0 Licenses=(null) Network=(null)
   Command=singularity
   WorkDir=/user/e154466
   Comment={"script":"","cmdline":"singularity","comment":"","env":{}}
   Power=
-bash-4.1$

Why is a CANCELLED job still showing up in scontrol show jobs?

Ok, cancelled jobs show up in scontrol show jobs only for a limited amount of time. Now I do not see anything referencing a GPU at all, but the behaviour has not changed:

-bash-4.1$ scontrol show jobs > /tmp/scontrol-show-jobs
-bash-4.1$ scontrol show job | grep 'TRES=' | grep gpu
-bash-4.1$ srun -p gpu --gres=gpu:1 -n 1 singularity exec --nv /dat/sw/singularity/tensorflow-18.04-py3.simg python /dat/usr/e154466/models/tutorials/image/mnist/convolutional.py
srun: Required node not available (down, drained or reserved)
srun: job 755223 queued and waiting for resources

Created attachment 6916 [details]
scontrol show jobs > /tmp/scontrol-show-jobs # from comment #9

(In reply to Felip Moll from comment #4)
> Hi Sergey,
>
> Can you check if the node is in a reservation?
>
> scontrol show res
>
> I am moving to sev-3 since the system seems still usable.

No reservation at all:

-bash-4.1$ scontrol show res
No reservations in the system
-bash-4.1$

All in all, something very suspicious is going on. The only job running on that node is #749725, but it is running in the gpu_open partition, has a single CPU core allocated, and is not using any GPU at all.

Created attachment 6917 [details]
slurmctld SlurmctldDebug=6 DebugFlags=SelectType
I took the liberty and collected slurmctld logs with
SlurmctldDebug=6
DebugFlags=SelectType
Job is 755274
Please look into it as well
Created attachment 6918 [details]
slurmctld log SlurmctldDebug=6 DebugFlags=SelectType,Gres
After a bit of thinking I added the Gres flag, which clearly shows that the GPU is available.
SlurmctldDebug=6
DebugFlags=SelectType,Gres
Job #755355
After looking at all your logs I still really don't see the problem. Can you also upload the slurmd log? Does it happen only when you specify a GPU?

In your first comment, in the output of show node dcalph134, I see:

CfgTRES=cpu=48,mem=773521M,gres/gpu=1,gres/gpu:p100=1
AllocTRES=cpu=9

There were 9 CPUs allocated at that time. I will try to reproduce the problem with your configuration on my testing servers.

Also, do this:

$ srun -p gpu --gres=gpu:p100:1 -n 1 nvidia-smi

*do not cancel it*

While the job is pending:

$ scontrol show jobs
$ sinfo
$ squeue
$ scontrol show node dcalph134
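The four commands requested above can be captured into a single file for attaching to the ticket. This is a sketch, not part of the original exchange: it assumes the Slurm client tools are on PATH and uses the node name from this ticket; each command's failure is recorded rather than aborting, so the script also runs where Slurm is absent.

```shell
#!/bin/sh
# Sketch: collect the pending-job diagnostics requested in the comment
# above into one file. Node name is the one from this ticket.
out=$(mktemp)
node=dcalph134
for cmd in "scontrol show jobs" "sinfo" "squeue" "scontrol show node $node"; do
    printf '=== %s ===\n' "$cmd" >> "$out"
    # Run the command; if it is missing or fails, note that and continue.
    $cmd >> "$out" 2>&1 || printf '(failed or unavailable)\n' >> "$out"
done
echo "diagnostics written to $out"
```

The resulting file can then be uploaded as a single attachment instead of four separate pastes.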
We rebooted the node and the issue went away; not sure why. We still have another issue with that node: a running job is not appearing in squeue:
-bash-4.1$ sacct -r gpu_open
JobID JobName Partition Account AllocCPUS State NodeList ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------------- --------
756822 wrap gpu_open e154466_g+ 1 RUNNING dcalph134 0:0
-bash-4.1$ scontrol show jobid=756822
JobId=756822 JobName=wrap
UserId=e154466(19383) GroupId=boks_users(2080) MCS_label=N/A
Priority=1349999 Nice=0 Account=e154466_gpu QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=17:10:47 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=May 22 19:45 EligibleTime=May 22 19:45
StartTime=May 22 19:45 EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=gpu_open AllocNode:Sid=DCALPH000:93594
ReqNodeList=(null) ExcNodeList=(null)
NodeList=dcalph134
BatchHost=dcalph134
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=(null)
WorkDir=/user/e154466
Comment={"script":"#!/bin/sh\n# This script was created by sbatch --wrap.\n\nsleep 100000\n","cmdline":"","comment":"","env":{"MODULE_VERSION_STACK":"3.2.10","MANPATH":"/cm/shared/apps/slurm/17.02.10/man:/usr/share/man/overrides:/usr/share/man/en:/usr/share/man:/opt/boksm/man:/usr/local/share/man:/cm/local/apps/environment-modules/current/share/man","HOSTNAME":"DCALPH000","TERM":"xterm","SHELL":"/bin/bash","HISTSIZE":"1000","SSH_CLIENT":"172.24.4.121 51346 22","LIBRARY_PATH":"/cm/shared/apps/slurm/17.02.10/lib64/slurm:/cm/shared/apps/slurm/17.02.10/lib64","QTDIR":"/usr/lib64/qt-3.3","QTINC":"/usr/lib64/qt-3.3/include","SSH_TTY":"/dev/pts/43","SQUEUE_PARTITION":"test,interact,license,lic_low,normal,low,open","USER":"e154466","LD_LIBRARY_PATH":"/usr/local/cuda-9.1/lib64:/cm/shared/apps/slurm/17.02.10/lib64/slurm:/cm/shared/apps/slurm/17.02.10/lib64","LS_COLORS":"rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mi
d=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:","SINFO_FORMAT":"%n %.10T %.5a %.8e %.7m %.4c %.8O %C","CPATH":"/cm/shared/apps/slurm/17.02.10/include","SACCT_FORMAT":"JobID,JobName,Partition,Account,AllocCPUS,State,NodeList,ExitCode","MODULE_VERSION":"3.2.10","MAIL":"/var/spool/mail/e154466","PATH":"/usr/local/cuda-9.1/bin:/sw/QuantumWise/VNL-ATK-2017.1/bin:/sw/QuantumWise/VNL-ATK-2017.1/bin:/sw/QuantumWise/VNL-ATK-2017.1/bin:/cm/shared/apps/slurm/17.02.10/sbin:/cm/shared/apps/slurm/17.02.10/bin:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/opt/boksm/bin:/usr/local/sbin:/usr/sbin:/sbin:/sbin:/usr/sbin:/cm/local/apps/environment-modules/3.2.10/bin:/opt/dell/srvadmin/bin","SQUEUE_SORT":"U,P,N","PWD":"/user/e154466","_LMFILES_":"/cm/shared/modulefiles/slurm/17.02.10:/cm/shared/modulefiles/app_env/cuda-9.1","LANG":"en_US.UTF-8","MODULEPATH":"/cm/local/modulefiles:/cm/shared/modulefiles","ESI_HOME":"/hpc_lsf/application/ESI_Software","LOADEDMODULES":"slurm/17.02.10:app_env/cuda-9.1","SSH_ASKPASS":"/usr/libexec/openssh/gnome-ssh-askpass","HISTCONTROL":"ignoredups","SQUEUE_FORMAT2":"jobid:7,username:9,statecompact:3,partition:13,name:15,command:10,submittime:13,numcpus:5,gres:11,numnodes:6,reasonlist:50","SHLVL":"1","HOME":"/user/e154466","LOGNAME":"e154466","QTLIB":"/usr/lib64/qt-3.3/lib","CVS_RSH":"ssh","SSH_CONNECTION":"172.24.4.121 51346 10.41.26.200 22","MODULESHOME":"/cm/local/apps/environment-modules/3.2.10/Modules/3.2.10","SLURM_TIME_FORMAT":"%b %e %k:%M","LESSOPEN":"||/usr/bin/lesspipe.sh %s","G_BROKEN_FILENAMES":"1","BASH_FUNC_module()":"() { eval `/cm/local/apps/environment-modules/3.2.10/Modules/$MODULE_VERSION/bin/modulecmd bash 
$*`\n}","_":"/cm/shared/apps/slurm/17.02.10/bin/sbatch","SLURM_NPROCS":"1","SLURM_NTASKS":"1","SLURM_JOB_NAME":"wrap","SLURM_RLIMIT_CPU":"18446744073709551615","SLURM_RLIMIT_FSIZE":"18446744073709551615","SLURM_RLIMIT_DATA":"18446744073709551615","SLURM_RLIMIT_STACK":"18446744073709551615","SLURM_RLIMIT_CORE":"0","SLURM_RLIMIT_RSS":"18446744073709551615","SLURM_RLIMIT_NPROC":"2066973","SLURM_RLIMIT_NOFILE":"65536","SLURM_RLIMIT_MEMLOCK":"18446744073709551615","SLURM_RLIMIT_AS":"18446744073709551615","SLURM_PRIO_PROCESS":"0","SLURM_SUBMIT_DIR":"/user/e154466","SLURM_SUBMIT_HOST":"DCALPH000","SLURM_UMASK":"0022"}}
StdErr=/user/e154466/slurm-756822.out
StdIn=/dev/null
StdOut=/user/e154466/slurm-756822.out
Power=
-bash-4.1$ squeue -w dcalph134
JOBID USER ST PARTITION NAME COMMAND SUBMIT_TIME CPUS GRES NODES NODELIST(REASON)
-bash-4.1$ squeue -u e154466
JOBID USER ST PARTITION NAME COMMAND SUBMIT_TIME CPUS GRES NODES NODELIST(REASON)
755497 e154466 R test wrap (null) May 22 16:52 1 (null) 1 dcalph132
755498 e154466 R test wrap (null) May 22 16:53 1 gpu:2 1 dcalph198
-bash-4.1$
I am going to open another bug for this.
Please put this bug on hold. For the squeue issues I have opened https://bugs.schedmd.com/show_bug.cgi?id=5208

I think your slurmd restart on the node was not successful, or there were other stalled processes that got cleaned up once you rebooted. Maybe the node had some kind of problem (network, IO, ...); it would be worth investigating. I am closing this issue now, but please reopen if you encounter more problems. If you do, upload the info requested in comment 16.

Regards

I have just reproduced the issue with host dcalph198:

-bash-4.1$ srun -p gpu --gres=gpu:2 nvidia-smi
srun: Required node not available (down, drained or reserved)
srun: job 925717 queued and waiting for resources

Going to upload the files collected as per c16.

Created attachment 7637 [details]
Files collected as per c16 for job 925717
I see in the squeue output your job 925717 requesting partition gpu and gpu:2, but at the same time showing ReqNodeNotAvail with the unrelated UnavailableNodes dcalph001 and dcalph128. These two nodes are indeed set to drain due to some failure.

squeue: 925717 e154466 PD gpu nvidia-smi nvidia-smi Aug 17 10:40 1 gpu:2 1 (ReqNodeNotAvail, UnavailableNodes:dcalph[001,128])

There is a related bug already fixed in 17.11.6: bug 4932. It seems the cause of the problem is that slurmctld can set the wrong reason for jobs with "--exclusive=user", and your partition gpu has this flag set: ExclusiveUser=Yes. In theory this state should not be permanent or disturb scheduling, but for your curiosity the commits are:

https://github.com/SchedMD/slurm/commit/e2a14b8d7f4f
https://github.com/SchedMD/slurm/commit/fc4e5ac9e056

Are you planning to upgrade? 18.08 is to be released soon, and 17.02 will reach the end of its support period at that time.

I also want to recommend removing the Shared= parameter from the partitions, since it has been deprecated in favor of OverSubscribe (see man slurm.conf).

Thanks a lot! Closing it.

*** This ticket has been marked as a duplicate of ticket 4932 ***
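To check whether a partition carries the flag implicated above, the partition record can be inspected with `scontrol show partition gpu` and grepped for ExclusiveUser. As a self-contained sketch, the scontrol output below is a shortened, illustrative sample (not captured from this cluster); on a live system the first line would be replaced by the real command.

```shell
#!/bin/sh
# Sketch: look for the ExclusiveUser flag tied to bug 4932.
# On a real cluster: sample=$(scontrol show partition gpu)
# The line below is an abridged, illustrative stand-in for that output.
sample='PartitionName=gpu AllowGroups=ALL ExclusiveUser=YES OverSubscribe=NO State=UP'
echo "$sample" | grep -o 'ExclusiveUser=[A-Za-z]*'
```

Partitions reporting ExclusiveUser=YES on 17.02 are the ones exposed to the wrong-reason behaviour fixed in 17.11.6.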