[root@DCALPH000 ~]# scontrol show node=dcalph134
NodeName=dcalph134 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=9 CPUErr=0 CPUTot=48 CPULoad=4.51
   AvailableFeatures=8160M,768G,nv-p100,rhel7
   ActiveFeatures=8160M,768G,nv-p100,rhel7
   Gres=gpu:p100:1
   NodeAddr=dcalph134 NodeHostName=dcalph134 Version=17.02
   OS=Linux RealMemory=773521 AllocMem=0 FreeMem=715530 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=1484434 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu,gpu_open
   BootTime=Apr 18 18:58 SlurmdStartTime=May 21 16:44
   CfgTRES=cpu=48,mem=773521M,gres/gpu=1,gres/gpu:p100=1
   AllocTRES=cpu=9
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[root@DCALPH000 ~]#

But:

-bash-4.1$ srun -p gpu --gres=gpu:p100:1 -n 1 nvidia-smi
srun: Required node not available (down, drained or reserved)
srun: job 754150 queued and waiting for resources

Why are GPU jobs not accepted?
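As a quick sanity check, the relevant fields can be pulled out of the flat 'scontrol show node' record with standard tools. This is only a sketch: the here-string below is an abridged copy of the dcalph134 record above; on a live system you would pipe the real 'scontrol show node=dcalph134' output instead.

```shell
# Abridged copy of the node record above; on a live cluster, replace the
# here-string with real 'scontrol show node=dcalph134' output.
node_record='NodeName=dcalph134 Gres=gpu:p100:1 State=MIXED CfgTRES=cpu=48,mem=773521M,gres/gpu=1,gres/gpu:p100=1 AllocTRES=cpu=9'

# Each field is a space-separated Key=Value token, so splitting on spaces
# lets grep pick out the fields of interest.
for field in Gres State CfgTRES AllocTRES; do
    echo "$node_record" | tr ' ' '\n' | grep "^${field}="
done
```

Note that AllocTRES lists only cpu=9 and no gres/gpu, i.e. the record itself claims the GPU is configured but unallocated, which is what makes the srun refusal surprising.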
Created attachment 6911 [details] slurm.conf
Created attachment 6912 [details] gres.conf
Created attachment 6913 [details] slurmctld log
Hi Sergey,

Can you check if the node is in a reservation?

scontrol show res

I am moving to sev-3 since the system still seems usable.
There are also many errors like these:

[2018-05-21T16:42:49.803] error: Node dcalph134 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

Please make sure your slurm.conf is identical on all your cluster nodes/servers, issue an 'scontrol reconfig', and check that the error disappears.
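The check slurmctld performs here is essentially a hash comparison of the config file contents. A minimal local sketch of the idea (two temp files stand in for the controller's and the node's copy; on a real cluster the second copy would come from e.g. 'ssh dcalph134 cat /etc/slurm/slurm.conf', a path I am assuming):

```shell
# Two local files stand in for the controller's and the compute node's
# slurm.conf; the contents are illustrative.
ctl_conf=$(mktemp)
node_conf=$(mktemp)
printf 'ClusterName=test\nGresTypes=gpu\n' > "$ctl_conf"
printf 'ClusterName=test\nGresTypes=gpu\n' > "$node_conf"

# slurmctld compares a hash of the file contents; md5sum does the same job
# for a manual check.
ctl_sum=$(md5sum < "$ctl_conf")
node_sum=$(md5sum < "$node_conf")
if [ "$ctl_sum" = "$node_sum" ]; then
    echo "slurm.conf copies match"
else
    echo "slurm.conf copies differ"
fi
```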
And finally, this also happens if there is a job already using the GPU. Since the state was MIXED, there was a job on the node.

Can you run 'scontrol show jobs' and identify whether it is using the GPU?
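A quick way to answer that question is to look for gres/gpu in each job's TRES string. Sketch below, run against two illustrative job records; on a live system you would pipe the real 'scontrol show jobs' output instead.

```shell
# Illustrative records mimicking the Key=Value stream of 'scontrol show jobs';
# the job data here is made up for the example.
jobs='JobId=749725 Partition=gpu_open TRES=cpu=1,node=1
JobId=755200 Partition=gpu TRES=cpu=1,node=1,gres/gpu=1'

# A job holds a GPU iff its TRES string contains gres/gpu; print those JobIds.
echo "$jobs" | grep 'gres/gpu' | tr ' ' '\n' | grep '^JobId='
```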
(In reply to Felip Moll from comment #5)
> There are also many errors like these:
>
> [2018-05-21T16:42:49.803] error: Node dcalph134 appears to have a different
> slurm.conf than the slurmctld. This could cause issues with communication
> and functionality. Please review both files and make sure they are the
> same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your
> slurm.conf.
>
> Please, make sure your slurm.conf is equal in all your cluster nodes/servers
> and you've issued a scontrol reconfig and that the error disappears.

These were artifacts that predate the issue. As you can see, right now there is no mismatch, but we still see the issue.

[root@DCALPH000 class]# ssh dcalph134 service slurm restart
Restarting slurm (via systemctl):  [ OK ]
[root@DCALPH000 class]# service slurm restart
stopping slurmctld:   [ OK ]
slurmctld is stopped
starting slurmctld:   [ OK ]
[root@DCALPH000 class]# grep 'Node dcalph134 appears to have a different' /var/log/slurmctld
[root@DCALPH000 class]# logout
-bash-4.1$ srun -p gpu --gres=gpu:1 -n 1 singularity exec --nv /dat/sw/singularity/tensorflow-18.04-py3.simg python /dat/usr/e154466/models/tutorials/image/mnist/convolutional.py
srun: Required node not available (down, drained or reserved)
srun: job 755200 queued and waiting for resources
^Csrun: Job allocation 755200 has been revoked
srun: Force Terminated job 755200
-bash-4.1$
(In reply to Felip Moll from comment #6)
> And finally, this also happens if there's a job already using the gpu. Since
> the state was MIXED, there was a job in the node.
>
> Can you run a 'scontrol show jobs' and identify if it is using the gpu?

Hmm, I see something pretty strange:

JobId=755200 JobName=singularity
   UserId=e154466(19383) GroupId=boks_users(2080) MCS_label=N/A
   Priority=10099999 Nice=0 Account=e154466_gpu QOS=normal
   JobState=CANCELLED Reason=ReqNodeNotAvail,_UnavailableNodes: Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=255:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=May 22 10:29 EligibleTime=May 22 10:29
   StartTime=May 22 10:29 EndTime=May 22 10:29 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=gpu AllocNode:Sid=DCALPH000:77705
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:1 Reservation=(null)
   OverSubscribe=USER Contiguous=0 Licenses=(null) Network=(null)
   Command=singularity
   WorkDir=/user/e154466
   Comment={"script":"","cmdline":"singularity","comment":"","env":{}}
   Power=

-bash-4.1$

Why are CANCELLED jobs still showing up in 'scontrol show jobs'?
OK. Cancelled jobs show up in 'scontrol show jobs' only for a limited amount of time. Now I do not see anything referencing a GPU at all, but the behaviour has not changed.

-bash-4.1$ scontrol show jobs > /tmp/scontrol-show-jobs
-bash-4.1$ scontrol show job | grep 'TRES=' | grep gpu
-bash-4.1$ srun -p gpu --gres=gpu:1 -n 1 singularity exec --nv /dat/sw/singularity/tensorflow-18.04-py3.simg python /dat/usr/e154466/models/tutorials/image/mnist/convolutional.py
srun: Required node not available (down, drained or reserved)
srun: job 755223 queued and waiting for resources
Created attachment 6916 [details] scontrol show jobs > /tmp/scontrol-show-jobs # from comment #9
(In reply to Felip Moll from comment #4)
> Hi Sergey,
>
> Can you check if the node is in a reservation?
>
> scontrol show res
>
> I am moving to sev-3 since the system seems still usable.

There is no reservation:

-bash-4.1$ scontrol show res
No reservations in the system
-bash-4.1$

All in all, something very suspicious is going on.
The only job running on that node is #749725. But it is running in the gpu_open partition, has a single CPU core allocated, and is not using any GPU at all.
Created attachment 6917 [details] slurmctld SlurmctldDebug=6 DebugFlags=SelectType

I took the liberty of collecting slurmctld logs with SlurmctldDebug=6 DebugFlags=SelectType.

The job is 755274. Please look into it as well.
Created attachment 6918 [details] slurmctld log SlurmctldDebug=6 DebugFlags=SelectType,Gres

After a bit of thinking I added the Gres flag, which clearly shows that the GPU is available.

SlurmctldDebug=6 DebugFlags=SelectType,Gres
Job #755355
After looking at all your logs I still don't see the problem. Can you also upload the slurmd log? Does it happen only when you specify a GPU?

In your first comment, in the show node output for dcalph134, I see:

CfgTRES=cpu=48,mem=773521M,gres/gpu=1,gres/gpu:p100=1
AllocTRES=cpu=9

There were 9 CPUs allocated at that time.

I will try to reproduce it with your configuration on my testing servers.
Also, do this:

$ srun -p gpu --gres=gpu:p100:1 -n 1 nvidia-smi

*do not cancel it*

While the job is pending:

$ scontrol show jobs
$ sinfo
$ squeue
$ scontrol show node dcalph134
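The four snapshots above can be gathered with one small wrapper. The sketch below is my own convenience script, not part of Slurm: the output directory comes from mktemp, and the fallback branch only exists so the loop degrades gracefully on a machine without the Slurm tools installed.

```shell
# Collect the requested diagnostics into one directory while the job pends.
outdir=$(mktemp -d)
for cmd in 'scontrol show jobs' 'sinfo' 'squeue' 'scontrol show node dcalph134'; do
    # Derive a filename from the command, e.g. scontrol_show_jobs.
    name=$(echo "$cmd" | tr ' /' '__')
    # Record the command's output, or a note if it fails / is unavailable here.
    if ! $cmd > "$outdir/$name" 2>&1; then
        echo "command failed or unavailable: $cmd" > "$outdir/$name"
    fi
done
ls "$outdir"
```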
Hmm, we rebooted the node and the issue went away. Not sure why.

We still have another issue with that node: a running job is not appearing in squeue.

-bash-4.1$ sacct -r gpu_open
       JobID    JobName  Partition    Account  AllocCPUS      State        NodeList ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------------- --------
756822             wrap   gpu_open e154466_g+          1    RUNNING       dcalph134      0:0
-bash-4.1$ scontrol show jobid=756822
JobId=756822 JobName=wrap
   UserId=e154466(19383) GroupId=boks_users(2080) MCS_label=N/A
   Priority=1349999 Nice=0 Account=e154466_gpu QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=17:10:47 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=May 22 19:45 EligibleTime=May 22 19:45
   StartTime=May 22 19:45 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=gpu_open AllocNode:Sid=DCALPH000:93594
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dcalph134 BatchHost=dcalph134
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/user/e154466
   Comment={"script":"#!/bin/sh\n# This script was created by sbatch --wrap.\n\nsleep 100000\n","cmdline":"","comment":"","env":{"MODULE_VERSION_STACK":"3.2.10","MANPATH":"/cm/shared/apps/slurm/17.02.10/man:/usr/share/man/overrides:/usr/share/man/en:/usr/share/man:/opt/boksm/man:/usr/local/share/man:/cm/local/apps/environment-modules/current/share/man","HOSTNAME":"DCALPH000","TERM":"xterm","SHELL":"/bin/bash","HISTSIZE":"1000","SSH_CLIENT":"172.24.4.121 51346 
22","LIBRARY_PATH":"/cm/shared/apps/slurm/17.02.10/lib64/slurm:/cm/shared/apps/slurm/17.02.10/lib64","QTDIR":"/usr/lib64/qt-3.3","QTINC":"/usr/lib64/qt-3.3/include","SSH_TTY":"/dev/pts/43","SQUEUE_PARTITION":"test,interact,license,lic_low,normal,low,open","USER":"e154466","LD_LIBRARY_PATH":"/usr/local/cuda-9.1/lib64:/cm/shared/apps/slurm/17.02.10/lib64/slurm:/cm/shared/apps/slurm/17.02.10/lib64","LS_COLORS":"rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lz=01;31:*.xz=01;31:*.bz2=01;31:*.tbz=01;31:*.tbz2=01;31:*.bz=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:","SINFO_FORMAT":"%n %.10T %.5a %.8e %.7m %.4c %.8O 
%C","CPATH":"/cm/shared/apps/slurm/17.02.10/include","SACCT_FORMAT":"JobID,JobName,Partition,Account,AllocCPUS,State,NodeList,ExitCode","MODULE_VERSION":"3.2.10","MAIL":"/var/spool/mail/e154466","PATH":"/usr/local/cuda-9.1/bin:/sw/QuantumWise/VNL-ATK-2017.1/bin:/sw/QuantumWise/VNL-ATK-2017.1/bin:/sw/QuantumWise/VNL-ATK-2017.1/bin:/cm/shared/apps/slurm/17.02.10/sbin:/cm/shared/apps/slurm/17.02.10/bin:/usr/lib64/qt-3.3/bin:/bin:/usr/bin:/opt/boksm/bin:/usr/local/sbin:/usr/sbin:/sbin:/sbin:/usr/sbin:/cm/local/apps/environment-modules/3.2.10/bin:/opt/dell/srvadmin/bin","SQUEUE_SORT":"U,P,N","PWD":"/user/e154466","_LMFILES_":"/cm/shared/modulefiles/slurm/17.02.10:/cm/shared/modulefiles/app_env/cuda-9.1","LANG":"en_US.UTF-8","MODULEPATH":"/cm/local/modulefiles:/cm/shared/modulefiles","ESI_HOME":"/hpc_lsf/application/ESI_Software","LOADEDMODULES":"slurm/17.02.10:app_env/cuda-9.1","SSH_ASKPASS":"/usr/libexec/openssh/gnome-ssh-askpass","HISTCONTROL":"ignoredups","SQUEUE_FORMAT2":"jobid:7,username:9,statecompact:3,partition:13,name:15,command:10,submittime:13,numcpus:5,gres:11,numnodes:6,reasonlist:50","SHLVL":"1","HOME":"/user/e154466","LOGNAME":"e154466","QTLIB":"/usr/lib64/qt-3.3/lib","CVS_RSH":"ssh","SSH_CONNECTION":"172.24.4.121 51346 10.41.26.200 22","MODULESHOME":"/cm/local/apps/environment-modules/3.2.10/Modules/3.2.10","SLURM_TIME_FORMAT":"%b %e %k:%M","LESSOPEN":"||/usr/bin/lesspipe.sh %s","G_BROKEN_FILENAMES":"1","BASH_FUNC_module()":"() { eval `/cm/local/apps/environment-modules/3.2.10/Modules/$MODULE_VERSION/bin/modulecmd bash 
$*`\n}","_":"/cm/shared/apps/slurm/17.02.10/bin/sbatch","SLURM_NPROCS":"1","SLURM_NTASKS":"1","SLURM_JOB_NAME":"wrap","SLURM_RLIMIT_CPU":"18446744073709551615","SLURM_RLIMIT_FSIZE":"18446744073709551615","SLURM_RLIMIT_DATA":"18446744073709551615","SLURM_RLIMIT_STACK":"18446744073709551615","SLURM_RLIMIT_CORE":"0","SLURM_RLIMIT_RSS":"18446744073709551615","SLURM_RLIMIT_NPROC":"2066973","SLURM_RLIMIT_NOFILE":"65536","SLURM_RLIMIT_MEMLOCK":"18446744073709551615","SLURM_RLIMIT_AS":"18446744073709551615","SLURM_PRIO_PROCESS":"0","SLURM_SUBMIT_DIR":"/user/e154466","SLURM_SUBMIT_HOST":"DCALPH000","SLURM_UMASK":"0022"}} StdErr=/user/e154466/slurm-756822.out StdIn=/dev/null StdOut=/user/e154466/slurm-756822.out Power= -bash-4.1$ squeue -w dcalph134 JOBID USER ST PARTITION NAME COMMAND SUBMIT_TIME CPUS GRES NODES NODELIST(REASON) -bash-4.1$ squeue -u e154466 JOBID USER ST PARTITION NAME COMMAND SUBMIT_TIME CPUS GRES NODES NODELIST(REASON) 755497 e154466 R test wrap (null) May 22 16:52 1 (null) 1 dcalph132 755498 e154466 R test wrap (null) May 22 16:53 1 gpu:2 1 dcalph198 -bash-4.1$ I am going to open another bug for this.
Please put this bug on hold. For squeue issues I have opened https://bugs.schedmd.com/show_bug.cgi?id=5208
I think the slurmd restart on the node was not successful, or there were stale processes that got cleaned up once you rebooted. Maybe the node had some kind of problem (network, I/O, ...); it would be worth investigating.

I am closing this issue now, but please reopen if you encounter more problems. If that is the case, upload the info requested in comment 16.

Regards
I have just reproduced the issue with host dcalph198:

-bash-4.1$ srun -p gpu --gres=gpu:2 nvidia-smi
srun: Required node not available (down, drained or reserved)
srun: job 925717 queued and waiting for resources

Going to upload the files collected as per comment 16.
Created attachment 7637 [details] Files collected as per c16 for job 925717
I see in the squeue output your job 925717 requesting partition gpu and gpu:2, but at the same time showing ReqNodeNotAvail with the unrelated UnavailableNodes dcalph001 and dcalph128. These two nodes are effectively set to drain due to some failure.

squeue:
925717 e154466 PD gpu nvidia-smi nvidia-smi Aug 17 10:40 1 gpu:2 1 (ReqNodeNotAvail, UnavailableNodes:dcalph[001,128]

There is a related bug already fixed in 17.11.6: bug 4932. It seems the cause of the problem is that slurmctld can set the wrong reason for jobs with "--exclusive=user", and your partition gpu has this flag set: ExclusiveUser=Yes. In theory this state shouldn't be permanent or disturb scheduling, but for your curiosity the commits are:

https://github.com/SchedMD/slurm/commit/e2a14b8d7f4f
https://github.com/SchedMD/slurm/commit/fc4e5ac9e056

Are you planning to upgrade? 18.08 is to be released soon, and 17.02 will reach the end of its support period at that time.

I also want to recommend removing the Shared= parameter from the partitions, since it has been deprecated in favor of OverSubscribe (see man slurm.conf).
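Both flags are easy to spot in a partition definition. A sketch over an illustrative PartitionName line (the values here are made up, modeled on the settings mentioned above; on a real system you would grep the PartitionName lines of slurm.conf itself):

```shell
# Illustrative partition line; real systems should inspect slurm.conf directly.
part='PartitionName=gpu ExclusiveUser=YES Shared=NO State=UP'

# Split the space-separated Key=Value tokens and surface the two flags of
# interest: the deprecated Shared= and the ExclusiveUser= setting tied to
# the wrong-reason bug.
echo "$part" | tr ' ' '\n' | grep -E '^(ExclusiveUser|Shared)='
```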
Thanks a lot! Closing it. *** This ticket has been marked as a duplicate of ticket 4932 ***