Ticket 1357

Summary: Draining Nodes Due to Batch Job Complete Failure
Product: Slurm
Reporter: Will French <will>
Component: Other
Assignee: Brian Christiansen <brian>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---
CC: brian, da
Version: 14.11.0
Hardware: Linux
OS: Linux
Site: Vanderbilt
Attachments: slurmd log file

Description Will French 2015-01-09 10:00:27 MST
Hello again,

We've had this happen with several jobs this week, and I haven't found a very complete description of the problem from searching Bugzilla, slurm-dev, and the SchedMD docs. 

As an example (please excuse the aliases and wrapped lines):

[frenchwr@vmps11 prolog]$ sinfoprobs 
HOSTNAMES    STATE      FEATURES             TIMESTAMP            USER         REASON                          
vmp687       draining   intel                2015-01-09T16:50:40  slurm        batch job complete failure      
vmp690       draining   intel                2015-01-09T16:53:26  slurm        batch job complete failure      
[frenchwr@vmps11 prolog]$ sacctl --allusers | grep vmp687
uscms4307 7419                cms 2-00:00:00   00:55:34     2950Mn                              1          1          vmp687      0:0 
uscms4307 7420                cms 2-00:00:00   00:00:00     2950Mn                              1          1          vmp687      0:0 
          7420.batch          cms              00:00:00     2950Mn          0          0        1          1          vmp687      0:0 
[frenchwr@vmps11 prolog]$ sacctl --allusers | grep vmp690
uscms4307 7440                cms 2-00:00:00   00:53:04     2950Mn                              1          1          vmp690      0:0 
uscms4307 7441                cms 2-00:00:00   00:00:00     2950Mn                              1          1          vmp690      0:0 
          7441.batch          cms              00:00:00     2950Mn          0          0        1          1          vmp690      0:0 
uscms4307 7442                cms 2-00:00:00   00:53:04     2950Mn                              1          1          vmp690      0:0


Can you describe what exactly happened to cause these nodes to enter the draining state? What does "batch job complete failure" mean, exactly? It's odd that the exit status of the most recent jobs that ran on the affected nodes is 0 in every case, yet SLURM has detected that something went wrong and is taking the nodes offline. Is this SLURM's way of preventing jobs from landing on nodes it considers suspect?

Once we notice drained nodes we use:

scontrol update NodeName=vmp690 State=Resume

and the nodes come back and jobs seem to run on them without issue.
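
For reference, the manual resume step above could be scripted. The following sketch (hypothetical helper name; assumes `sinfo -R`-style output where the node name is the first column) pulls out the nodes drained with this particular reason:

```shell
# Hypothetical helper: given `sinfo -R`-style text on stdin, print the names
# of nodes whose drain reason is "batch job complete failure".
drained_nodes() {
    awk '/batch job complete failure/ { print $1 }'
}

# Possible use (not run here):
#   for n in $(sinfo -R -h -o '%n %E' | drained_nodes); do
#       scontrol update NodeName="$n" State=Resume
#   done
```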

Is this something we should be concerned about?
Comment 1 Moe Jette 2015-01-09 10:08:58 MST
Something is probably seriously wrong with the node. Can you provide the slurmd log from that node around that time? It probably describes exactly what happened. The slurmctld log file may also be helpful.
Comment 2 Will French 2015-01-09 23:53:50 MST
Created attachment 1541 [details]
slurmd log file
Comment 3 Will French 2015-01-10 00:01:22 MST
From the slurmd log file on one of the drained nodes:

Jan  9 16:50:42 vmp687 slurmstepd[19461]: error: unable to mount memory cgroup namespace: Device or resource busy
Jan  9 16:50:42 vmp687 slurmstepd[19461]: error: task/cgroup: unable to create memory namespace
Jan  9 16:50:42 vmp687 slurmstepd[19461]: error: Couldn't load specified plugin name for task/cgroup: Plugin init() callback failed
Jan  9 16:50:42 vmp687 slurmstepd[19461]: error: cannot create task context for task/cgroup
Jan  9 16:50:42 vmp687 slurmstepd[19461]: error: job_manager exiting abnormally, rc = 1011
Jan  9 16:50:42 vmp687 slurmstepd[19461]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:1011 status 0
Jan  9 19:18:03 vmp687 slurmstepd[19456]: error: Exceeded step memory limit at some point. Step may have been partially swapped out to disk.
Jan  9 19:18:04 vmp687 slurmstepd[19456]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0

These messages occurred at the same time that the final job landed on this node and exited immediately, and I assume that draining began here. Nothing interesting was reported in the slurmctld log. I'm a little perplexed by the last two messages, which occurred ~2.5 hours after the final job had exited.

Looks like an issue with the cgroup mount. Any idea what could cause this? Problem with the hard disk? Here is our cgroup.conf:

# cgroup.conf
CgroupAutomount=yes
CgroupMountpoint=/cgroup
CgroupReleaseAgentDir="/usr/scheduler/slurm/cgroup"
ConstrainCores=yes
TaskAffinity=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

slurm.conf:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=slurmsched1
ControlAddr=10.0.0.49
BackupController=slurmsched2
BackupAddr=10.0.0.50
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
# EnforcePartLimits=YES
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
PluginDir=/usr/scheduler/slurm/lib/slurm
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/usr/spool/slurm
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/usr/scheduler/slurm/state
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=3000
FastSchedule=1
#MaxMemPerCPU=
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
#
#
# JOB PRIORITY
#PriorityFlags=
PriorityType=priority/multifactor
PriorityDecayHalfLife=168:00:00
PriorityUsageResetPeriod=NONE
#PriorityCalcPeriod=
#PriorityFavorSmall=
PriorityMaxAge=336:00:00
PriorityWeightAge=1000
PriorityWeightFairshare=1000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=1000
#
#
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=safe
AccountingStorageHost=slurmdb
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=accre
DebugFlags=NO_CONF_HASH
JobCompHost=slurmdb
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/mysql
JobCompUser=slurm
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=7
#SlurmctldLogFile=
SlurmdDebug=7
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=vmp[101-103,105-110,112-120] RealMemory=60000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 Feature=amd
NodeName=vmp[301-380] RealMemory=123000 CPUs=12 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 Feature=intel,sandy_bridge
NodeName=vmp[552-574,652-653,659-662,664-677,679-690] RealMemory=20500 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 Feature=intel
# vmp678 appears to have some faulty memory
NodeName=vmp[678] RealMemory=16000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 Feature=intel
NodeName=vmp[801-803,805-809,813,815,817-819,821-830,835,838,840] RealMemory=45000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 Feature=cuda42
NodeName=vmp[831-834] RealMemory=45000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 Feature=cuda6

PartitionName=production Nodes=vmp[101-103,105-110,112-120,301-380,552-574,652-653,659-662,664-690] Default=YES MaxTime=20160 DefaultTime=15 State=UP Shared=NO DenyAccounts=accre_gpu,biostat_gpu,cgg_gpu,csb_gpu,cummings_gpu,guest_gpu_account,haines_gpu,h_cutting_gpu,h_lasko_gpu,h_vmac_gpu,lola_gpu,lss_gpu,lybrand_gpu,mccabe_gpu,nbody_gpu,nbody_gpu_account,palmeri_gpu,palmeri_gpu_account,polyn_gpu,p_masi_gpu,rokas_gpu,tplab_gpu,tplab_gpu_account,zhao_gpu
# GPU nodes in separate partition with nodes not shared between jobs
PartitionName=gpu Nodes=vmp[801-803,805-809,813,815,817-819,821-835,838,840] Default=NO MaxTime=20160 DefaultTime=15 State=UP Shared=EXCLUSIVE
Comment 4 Brian Christiansen 2015-01-12 10:30:55 MST
Hey Will,

What's happening is that the slurmstepd thinks the cgroup mount point doesn't exist, so it tries to mount it again, causing the "Device or resource busy" message. This percolates up to the controller, which puts the node into a drain state so that other jobs won't land on it. I can reproduce this by making the code always try to mount the cgroup.
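
A quick way to see what the stepd would find is to inspect /proc/mounts yourself. This sketch (hypothetical function name; assumes the standard one-mount-per-line /proc/mounts format) tests whether a cgroup filesystem is mounted at a given mountpoint, such as the one under CgroupMountpoint=/cgroup:

```shell
# Check whether a cgroup filesystem is mounted at the given mountpoint.
# Reads /proc/mounts by default; an alternate file can be passed for testing.
check_cgroup_mounted() {
    mp=$1
    file=${2:-/proc/mounts}
    # /proc/mounts fields: device mountpoint fstype options dump pass
    awk -v mp="$mp" '$2 == mp && $3 == "cgroup" { found = 1 } END { exit !found }' "$file"
}

# e.g.: check_cgroup_mounted /cgroup/memory && echo "memory cgroup already mounted"
```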

Will you make sure the slurmds are restarted with SlurmdDebug set to debug2? There is some additional logging in this area that may help diagnose the problem. You should see a debug message like the following:

Jan 12 16:25:20 compy slurmstepd[28304]: debug level is 6.

Then send the logs when you see it happen again.

Thanks,
Brian
Comment 5 Will French 2015-01-13 03:38:20 MST
Hi Brian,

I'm assuming you meant set SlurmdDebug to debug6 and not debug2? We currently have it set to debug3 so debug2 would actually provide less information.

One quick note:

We did not go out and create the /cgroup directory on all our compute nodes, instead relying on CgroupAutomount=yes in cgroup.conf. There's nothing wrong with that approach, I hope?
Comment 6 Brian Christiansen 2015-01-13 05:01:45 MST
I meant debug2, but your setting of debug3 is fine. I think the slurmds hadn't been restarted since debug3 was configured, so we didn't see the extra logging.

FYI.
debug level 6 == debug2
debug level 7 == debug3

You can set Slurm[ctl]dDebug to the number or the name. I'm always opening sview to remember how to correlate the numbers to the names.
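
For the record, the full numeric-to-name mapping, as I understand it from the slurm.conf man page (sketch only), can be captured in a tiny helper:

```shell
# Map Slurm's numeric log levels to their names (mapping as documented in
# the slurm.conf man page; the "debug level is 6" message above is debug2).
slurm_debug_name() {
    case "$1" in
        0) echo quiet ;;
        1) echo fatal ;;
        2) echo error ;;
        3) echo info ;;
        4) echo verbose ;;
        5) echo debug ;;
        6) echo debug2 ;;
        7) echo debug3 ;;
        8) echo debug4 ;;
        9) echo debug5 ;;
        *) echo unknown ;;
    esac
}
```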

Having Slurm automount the cgroup is fine. I'm hoping that the additional logging will tell us why Slurm thought the cgroup wasn't mounted and tried to remount it.
Comment 7 Will French 2015-01-13 09:20:36 MST
Got it. Looks like I now have the logging set to the appropriate level. I've been hammering the node with jobs for the last hour to try to reproduce the error, but so far everything is running fine. Below are the logs from one of these jobs. There are a few messages (e.g. unable to open '/cgroup/cpuset/slurm/cpus' for reading : No such file or directory) that make me nervous. Any cause for concern?

Jan 13 17:12:31 vmp687 slurmd[2077]: Launching batch job 17189 for UID 112888
Jan 13 17:12:31 vmp687 slurmstepd[2119]: debug level is 6.
Jan 13 17:12:31 vmp687 slurmstepd[2119]: setup for a batch_job
Jan 13 17:12:31 vmp687 slurmstepd[2119]: No acct_gather.conf file (/usr/scheduler/slurm-14.11.0/etc/acct_gather.conf)
Jan 13 17:12:31 vmp687 slurmstepd[2119]: profile signalling type Task
Jan 13 17:12:31 vmp687 slurmstepd[2119]: hwloc_topology_init
Jan 13 17:12:31 vmp687 slurmstepd[2119]: hwloc_topology_load
Jan 13 17:12:31 vmp687 slurmstepd[2119]: Before call to spank_init()
Jan 13 17:12:31 vmp687 slurmstepd[2119]: After call to spank_init()
Jan 13 17:12:31 vmp687 slurmstepd[2119]: unable to open '/cgroup/cpuset/slurm/cpus' for reading : No such file or directory
Jan 13 17:12:31 vmp687 slurmstepd[2119]: unable to get parameter 'cpus' for '/cgroup/cpuset/slurm'
Jan 13 17:12:31 vmp687 slurmstepd[2119]: cpu_freq_cgroup_validate: request value =            0         0
Jan 13 17:12:31 vmp687 slurmstepd[2119]:   jobid=17189, stepid=4294967294, tasks=1 cpu/task=0, cpus=8
Jan 13 17:12:31 vmp687 slurmstepd[2119]:   cpu_bind_type=   0, cpu_bind map=(null)
Jan 13 17:12:31 vmp687 slurmstepd[2119]:   step logical cores = 0-7, step physical cores = 0-15
Jan 13 17:12:31 vmp687 slurmstepd[2119]: task/cgroup: /slurm/uid_112888/job_17189: alloc=500MB mem.limit=500MB memsw.limit=500MB
Jan 13 17:12:31 vmp687 slurmstepd[2119]: task/cgroup: /slurm/uid_112888/job_17189/step_batch: alloc=500MB mem.limit=500MB memsw.limit=500MB
Jan 13 17:12:31 vmp687 slurmstepd[2119]: Uncached user/gid: frenchwr/36014
Jan 13 17:12:31 vmp687 slurmstepd[2119]: Entering _setup_normal_io
Jan 13 17:12:31 vmp687 slurmstepd[2119]: Uncached user/gid: frenchwr/36014
Jan 13 17:12:31 vmp687 slurmstepd[2119]: Leaving  _setup_normal_io
Jan 13 17:12:31 vmp687 slurmstepd[2119]: debug level = 2
Jan 13 17:12:31 vmp687 slurmstepd[2119]: Uncached user/gid: frenchwr/36014
Jan 13 17:12:31 vmp687 slurmstepd[2119]: task 0 (2123) started 2015-01-13T17:12:31
Jan 13 17:12:31 vmp687 slurmstepd[2123]: Uncached user/gid: frenchwr/36014
Jan 13 17:12:31 vmp687 slurmstepd[2119]: adding task 0 pid 2123 on node 0 to jobacct
Jan 13 17:12:31 vmp687 slurmstepd[2119]: jag_common_poll_data: 2123 mem size 1640 190212 time 0(0+0)
Jan 13 17:12:31 vmp687 slurmstepd[2119]: _get_sys_interface_freq_line: filename = /sys/devices/system/cpu/cpu10/cpufreq/cpuinfo_cur_freq 
Jan 13 17:12:31 vmp687 slurmstepd[2119]: _get_sys_interface_freq_line: filename = /proc/cpuinfo 
Jan 13 17:12:31 vmp687 slurmstepd[2119]: cpunfo_frequency= 2266057
Jan 13 17:12:31 vmp687 slurmstepd[2119]: jag_common_poll_data: Task average frequency = 2266057 pid 2123 mem size 1640 190212 time 0(0+0)
Jan 13 17:12:31 vmp687 slurmstepd[2119]: energycounted = 0
Jan 13 17:12:31 vmp687 slurmstepd[2119]: getjoules_task energy = 0
Jan 13 17:12:31 vmp687 slurmstepd[2119]: unable to get cgroup '/cgroup/cpuset' entry '/cgroup/cpuset/slurm/system' properties: No such file or directory
Jan 13 17:12:31 vmp687 slurmstepd[2119]: unable to get cgroup '/cgroup/memory' entry '/cgroup/memory/slurm/system' properties: No such file or directory
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 10485760
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: RLIMIT_NPROC  : max:192014 cur:192014 req:1024
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in value: 65536
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
Jan 13 17:12:31 vmp687 slurmstepd[2123]: Set task rss(500 MB)
Jan 13 17:12:32 vmp687 slurmstepd[2119]: jag_common_poll_data: 2123 mem size 57752 914608 time 0(0+0)
Jan 13 17:12:32 vmp687 slurmstepd[2119]: jag_common_poll_data: Task average frequency = 2266057 pid 2123 mem size 57752 914608 time 0(0+0)
Jan 13 17:12:32 vmp687 slurmstepd[2119]: energycounted = 0
Jan 13 17:12:32 vmp687 slurmstepd[2119]: getjoules_task energy = 0
Jan 13 17:12:36 vmp687 slurmstepd[2119]: getjoules_task energy = 0
Jan 13 17:12:36 vmp687 slurmstepd[2119]: removing task 0 pid 2123 from jobacct
Jan 13 17:12:36 vmp687 slurmstepd[2119]: task 0 (2123) exited with exit code 0.
Jan 13 17:12:36 vmp687 slurmstepd[2119]: Aggregated 1 task exit messages
Jan 13 17:12:36 vmp687 slurmstepd[2119]: Before call to spank_fini()
Jan 13 17:12:36 vmp687 slurmstepd[2119]: After call to spank_fini()
Jan 13 17:12:36 vmp687 slurmstepd[2119]: job 17189 completed with slurm_rc = 0, job_rc = 0
Jan 13 17:12:36 vmp687 slurmstepd[2119]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Jan 13 17:12:36 vmp687 slurmstepd[2119]:   false, shutdown
Jan 13 17:12:36 vmp687 slurmstepd[2119]: done with job
Comment 8 Brian Christiansen 2015-01-13 10:23:03 MST
These two are fine:
Jan 13 17:12:31 vmp687 slurmstepd[2119]: unable to open '/cgroup/cpuset/slurm/cpus' for reading : No such file or directory
Jan 13 17:12:31 vmp687 slurmstepd[2119]: unable to get parameter 'cpus' for '/cgroup/cpuset/slurm'

Slurm first tries /cgroup/cpuset/slurm/cpus, and if that doesn't exist it tries /cgroup/cpuset/slurm/cpuset.cpus. I get the same logs. There are no errors afterwards, so we're OK. Older kernels only used cpus, without prepending "cpuset.".
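
That fallback can be sketched in shell (hypothetical function name; assumes readable files under the cgroup directory):

```shell
# Read a cpuset parameter, trying the bare name first (older kernels) and
# the "cpuset."-prefixed name second (newer kernels).
read_cpuset_param() {
    dir=$1
    param=$2
    if [ -r "$dir/$param" ]; then
        cat "$dir/$param"
    else
        cat "$dir/cpuset.$param"
    fi
}

# e.g.: read_cpuset_param /cgroup/cpuset/slurm cpus
```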
Comment 9 Will French 2015-01-14 07:06:50 MST
Hi Brian,

A few other messages we see in the slurmd log:

Jan 14 14:58:46 vmp689 slurmd[26496]: Resource spec: system memory limit not configured for this node
Jan 14 14:58:46 vmp689 slurmd[26498]: Warning: Core limit is only 0 KB

What do these mean? 

Another related question: is it safe to restart slurmd on a compute node where jobs are currently running? I've done a quick test and the job seems unaffected, but just want to verify. The reason I ask is that I'd like to change the debug level on all the nodes but this requires restarting the daemon.
Comment 10 Brian Christiansen 2015-01-14 08:21:47 MST
Jan 14 14:58:46 vmp689 slurmd[26496]: Resource spec: system memory limit not configured for this node

This one means that MemSpecLimit isn't specified on the node. See man slurm.conf. It's not necessary. 
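
If you did want to reserve memory for system processes, a node line would carry the parameter directly; this is a hypothetical example (values made up for illustration), not a recommendation:

```
# slurm.conf sketch: reserve 2048 MB of RealMemory for system use on one node
NodeName=vmp689 RealMemory=20500 MemSpecLimit=2048 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 Feature=intel
```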


Jan 14 14:58:46 vmp689 slurmd[26498]: Warning: Core limit is only 0 KB

This means that the core file size limit is 0 for the slurmd, so the slurmd won't generate a core file if it crashes.

check:
ulimit -Hc
ulimit -Sc
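
As a sketch (assuming slurmd is launched from a plain shell whose limits it inherits; the init-script path is illustrative), raising the limit before starting the daemon looks like:

```shell
# Raise the core file size limit so a daemon started from this shell can
# dump core if it crashes, then report the resulting soft limit.
ulimit -c unlimited 2>/dev/null || ulimit -c "$(ulimit -Hc)"
echo "core limit now: $(ulimit -Sc)"
# ...then start slurmd from this shell so it inherits the limit, e.g.:
#   /etc/init.d/slurm start
```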


Restarting the daemons with running jobs is fine. I should have told you that you can change the debug level without restarting. You can do:

scontrol reconfigure

which will tell all of the daemons to re-read the config.
Comment 11 Brian Christiansen 2015-01-26 04:46:04 MST
Hey Will,

Have you been able to reproduce this again?

Thanks,
Brian
Comment 12 Will French 2015-01-26 05:54:14 MST
Hi Brian, 

No, we haven't seen this error again. It seems to have been an isolated incident. You can close this ticket if you'd like and I will re-open it if this error re-appears.

Best,

Will
Comment 13 Brian Christiansen 2015-01-26 05:56:38 MST
Sounds good. Let us know if you see it again.

Thanks,
Brian