| Summary: | Draining Nodes Due to Batch Job Complete Failure | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Will French <will> |
| Component: | Other | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | brian, da |
| Version: | 14.11.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Vanderbilt | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave Sites: | --- |
| Cray Sites: | --- | DS9 Clusters: | --- |
| Google Sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC Sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurmd log file | | |
Something is probably seriously bad with the node. Can you provide the slurmd log from that node around that time? It probably describes exactly what happened. The slurmctld log file may also be helpful.

---

Created attachment 1541 [details]
slurmd log file
From the slurmd log file on one of the drained nodes:

```
Jan 9 16:50:42 vmp687 slurmstepd[19461]: error: unable to mount memory cgroup namespace: Device or resource busy
Jan 9 16:50:42 vmp687 slurmstepd[19461]: error: task/cgroup: unable to create memory namespace
Jan 9 16:50:42 vmp687 slurmstepd[19461]: error: Couldn't load specified plugin name for task/cgroup: Plugin init() callback failed
Jan 9 16:50:42 vmp687 slurmstepd[19461]: error: cannot create task context for task/cgroup
Jan 9 16:50:42 vmp687 slurmstepd[19461]: error: job_manager exiting abnormally, rc = 1011
Jan 9 16:50:42 vmp687 slurmstepd[19461]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:1011 status 0
Jan 9 19:18:03 vmp687 slurmstepd[19456]: error: Exceeded step memory limit at some point. Step may have been partially swapped out to disk.
Jan 9 19:18:04 vmp687 slurmstepd[19456]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
```

These messages occurred at the same time that the final job landed on this node and exited immediately, and I assume that draining began here. Nothing interesting was reported in the slurmctld log. I'm a little perplexed by the last two errors, which occurred ~2.5 hours after the final job had exited.

Looks like an issue with the cgroup mount. Any idea what could cause this? A problem with the hard disk?

Here is our cgroup.conf:

```
# cgroup.conf
CgroupAutomount=yes
CgroupMountpoint=/cgroup
CgroupReleaseAgentDir="/usr/scheduler/slurm/cgroup"
ConstrainCores=yes
TaskAffinity=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
```

slurm.conf:

```
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=slurmsched1
ControlAddr=10.0.0.49
BackupController=slurmsched2
BackupAddr=10.0.0.50
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#
EnforcePartLimits=YES
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
PluginDir=/usr/scheduler/slurm/lib/slurm
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/usr/spool/slurm
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/usr/scheduler/slurm/state
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=3000
FastSchedule=1
#MaxMemPerCPU=
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
#
#
# JOB PRIORITY
#PriorityFlags=
PriorityType=priority/multifactor
PriorityDecayHalfLife=168:00:00
PriorityUsageResetPeriod=NONE
#PriorityCalcPeriod=
#PriorityFavorSmall=
PriorityMaxAge=336:00:00
PriorityWeightAge=1000
PriorityWeightFairshare=1000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=1000
#
#
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=safe
AccountingStorageHost=slurmdb
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=accre
DebugFlags=NO_CONF_HASH
JobCompHost=slurmdb
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/mysql
JobCompUser=slurm
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=7
#SlurmctldLogFile=
SlurmdDebug=7
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=vmp[101-103,105-110,112-120] RealMemory=60000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 Feature=amd
NodeName=vmp[301-380] RealMemory=123000 CPUs=12 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 Feature=intel,sandy_bridge
NodeName=vmp[552-574,652-653,659-662,664-677,679-690] RealMemory=20500 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 Feature=intel
# vmp678 appears to have some faulty memory
NodeName=vmp[678] RealMemory=16000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 Feature=intel
NodeName=vmp[801-803,805-809,813,815,817-819,821-830,835,838,840] RealMemory=45000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 Feature=cuda42
NodeName=vmp[831-834] RealMemory=45000 CPUs=8 Sockets=2 CoresPerSocket=4 ThreadsPerCore=2 Feature=cuda6
PartitionName=production Nodes=vmp[101-103,105-110,112-120,301-380,552-574,652-653,659-662,664-690] Default=YES MaxTime=20160 DefaultTime=15 State=UP Shared=NO DenyAccounts=accre_gpu,biostat_gpu,cgg_gpu,csb_gpu,cummings_gpu,guest_gpu_account,haines_gpu,h_cutting_gpu,h_lasko_gpu,h_vmac_gpu,lola_gpu,lss_gpu,lybrand_gpu,mccabe_gpu,nbody_gpu,nbody_gpu_account,palmeri_gpu,palmeri_gpu_account,polyn_gpu,p_masi_gpu,rokas_gpu,tplab_gpu,tplab_gpu_account,zhao_gpu
# GPU nodes in separate partition with nodes not shared between jobs
PartitionName=gpu Nodes=vmp[801-803,805-809,813,815,817-819,821-835,838,840] Default=NO MaxTime=20160 DefaultTime=15 State=UP Shared=EXCLUSIVE
```

---

Hey Will,

What's happening is that the slurmstepd thinks that the cgroup mount point doesn't exist, so it tries to mount it again, causing the "Device or resource busy" message. This percolates up to the controller, and the controller puts the node into a drain state so that other jobs won't land on the node. I can reproduce this by making the code always try to mount the cgroup.

Will you make sure that the slurmd's are restarted with their SlurmdDebug set to debug2? There is some additional logging around this area that may help in diagnosing the problem. You should see a debug message like the following:

```
Jan 12 16:25:20 compy slurmstepd[28304]: debug level is 6.
```

Then send the logs when you see it happen again.

Thanks,
Brian

---

Hi Brian,

I'm assuming you meant set SlurmdDebug to debug6 and not debug2? We currently have it set to debug3, so debug2 would actually provide less information.

One quick note: we did not go out and create the /cgroup directory on all our compute nodes, instead relying on CgroupAutomount=yes in cgroup.conf. There's nothing wrong with that approach, I hope?

---

I meant debug2, but your setting of debug3 is fine. I think the slurmd's hadn't been restarted since debug3 was configured, so we didn't see the extra logging. FYI:
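Brian's diagnosis is that slurmstepd believed the cgroup hierarchy was not mounted and tried to remount it, which fails with "Device or resource busy" when it actually is mounted. One way to see what the kernel thinks is mounted is to read /proc/mounts. The sketch below runs against a hypothetical captured sample (the paths assume this ticket's CgroupMountpoint=/cgroup), not a live node:

```shell
# Hypothetical /proc/mounts excerpt from a node with Slurm's cgroup
# controllers mounted under /cgroup (sample data, not from the ticket):
sample='cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /cgroup/memory cgroup rw,relatime,memory 0 0'

# On a real node you would read the live table instead:
#   grep ' cgroup ' /proc/mounts

# Check whether the memory controller is already mounted at /cgroup/memory;
# if it is, a second mount attempt is exactly what produces EBUSY.
if printf '%s\n' "$sample" | grep -q '^cgroup /cgroup/memory '; then
    echo "memory cgroup already mounted"
else
    echo "memory cgroup not mounted"
fi
```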
```
debug level 6 == debug2
debug level 7 == debug3
```

You can set Slurm[ctl]dDebug to the number or the name. I'm always opening sview to remember how to correlate the numbers to the names.

Having Slurm automount the cgroup is fine. I'm hoping that the additional logging will tell us why Slurm thought the cgroup wasn't mounted and tried to remount it.

---

Got it. It looks like I now have the logging set to the appropriate level. I've been hammering the node with jobs for the last hour to try to reproduce the error, but so far everything is running fine. Below are the logs from one of these jobs. There are a few messages (e.g. `unable to open '/cgroup/cpuset/slurm/cpus' for reading : No such file or directory`) that make me nervous. Any cause for concern?

```
Jan 13 17:12:31 vmp687 slurmd[2077]: Launching batch job 17189 for UID 112888
Jan 13 17:12:31 vmp687 slurmstepd[2119]: debug level is 6.
Jan 13 17:12:31 vmp687 slurmstepd[2119]: setup for a batch_job
Jan 13 17:12:31 vmp687 slurmstepd[2119]: No acct_gather.conf file (/usr/scheduler/slurm-14.11.0/etc/acct_gather.conf)
Jan 13 17:12:31 vmp687 slurmstepd[2119]: profile signalling type Task
Jan 13 17:12:31 vmp687 slurmstepd[2119]: hwloc_topology_init
Jan 13 17:12:31 vmp687 slurmstepd[2119]: hwloc_topology_load
Jan 13 17:12:31 vmp687 slurmstepd[2119]: Before call to spank_init()
Jan 13 17:12:31 vmp687 slurmstepd[2119]: After call to spank_init()
Jan 13 17:12:31 vmp687 slurmstepd[2119]: unable to open '/cgroup/cpuset/slurm/cpus' for reading : No such file or directory
Jan 13 17:12:31 vmp687 slurmstepd[2119]: unable to get parameter 'cpus' for '/cgroup/cpuset/slurm'
Jan 13 17:12:31 vmp687 slurmstepd[2119]: cpu_freq_cgroup_validate: request value = 0 0
Jan 13 17:12:31 vmp687 slurmstepd[2119]: jobid=17189, stepid=4294967294, tasks=1 cpu/task=0, cpus=8
Jan 13 17:12:31 vmp687 slurmstepd[2119]: cpu_bind_type= 0, cpu_bind map=(null)
Jan 13 17:12:31 vmp687 slurmstepd[2119]: step logical cores = 0-7, step physical cores = 0-15
Jan 13 17:12:31 vmp687 slurmstepd[2119]: task/cgroup: /slurm/uid_112888/job_17189: alloc=500MB mem.limit=500MB memsw.limit=500MB
Jan 13 17:12:31 vmp687 slurmstepd[2119]: task/cgroup: /slurm/uid_112888/job_17189/step_batch: alloc=500MB mem.limit=500MB memsw.limit=500MB
Jan 13 17:12:31 vmp687 slurmstepd[2119]: Uncached user/gid: frenchwr/36014
Jan 13 17:12:31 vmp687 slurmstepd[2119]: Entering _setup_normal_io
Jan 13 17:12:31 vmp687 slurmstepd[2119]: Uncached user/gid: frenchwr/36014
Jan 13 17:12:31 vmp687 slurmstepd[2119]: Leaving _setup_normal_io
Jan 13 17:12:31 vmp687 slurmstepd[2119]: debug level = 2
Jan 13 17:12:31 vmp687 slurmstepd[2119]: Uncached user/gid: frenchwr/36014
Jan 13 17:12:31 vmp687 slurmstepd[2119]: task 0 (2123) started 2015-01-13T17:12:31
Jan 13 17:12:31 vmp687 slurmstepd[2123]: Uncached user/gid: frenchwr/36014
Jan 13 17:12:31 vmp687 slurmstepd[2119]: adding task 0 pid 2123 on node 0 to jobacct
Jan 13 17:12:31 vmp687 slurmstepd[2119]: jag_common_poll_data: 2123 mem size 1640 190212 time 0(0+0)
Jan 13 17:12:31 vmp687 slurmstepd[2119]: _get_sys_interface_freq_line: filename = /sys/devices/system/cpu/cpu10/cpufreq/cpuinfo_cur_freq
Jan 13 17:12:31 vmp687 slurmstepd[2119]: _get_sys_interface_freq_line: filename = /proc/cpuinfo
Jan 13 17:12:31 vmp687 slurmstepd[2119]: cpunfo_frequency= 2266057
Jan 13 17:12:31 vmp687 slurmstepd[2119]: jag_common_poll_data: Task average frequency = 2266057 pid 2123 mem size 1640 190212 time 0(0+0)
Jan 13 17:12:31 vmp687 slurmstepd[2119]: energycounted = 0
Jan 13 17:12:31 vmp687 slurmstepd[2119]: getjoules_task energy = 0
Jan 13 17:12:31 vmp687 slurmstepd[2119]: unable to get cgroup '/cgroup/cpuset' entry '/cgroup/cpuset/slurm/system' properties: No such file or directory
Jan 13 17:12:31 vmp687 slurmstepd[2119]: unable to get cgroup '/cgroup/memory' entry '/cgroup/memory/slurm/system' properties: No such file or directory
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_CPU no change in value: 18446744073709551615
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_FSIZE no change in value: 18446744073709551615
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_DATA no change in value: 18446744073709551615
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_STACK no change in value: 10485760
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_CORE no change in value: 0
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_RSS no change in value: 18446744073709551615
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: RLIMIT_NPROC : max:192014 cur:192014 req:1024
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_NPROC succeeded
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: RLIMIT_NOFILE : max:4096 cur:4096 req:1024
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_NOFILE succeeded
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_MEMLOCK no change in value: 65536
Jan 13 17:12:31 vmp687 slurmstepd[2123]: _set_limit: conf setrlimit RLIMIT_AS no change in value: 18446744073709551615
Jan 13 17:12:31 vmp687 slurmstepd[2123]: Set task rss(500 MB)
Jan 13 17:12:32 vmp687 slurmstepd[2119]: jag_common_poll_data: 2123 mem size 57752 914608 time 0(0+0)
Jan 13 17:12:32 vmp687 slurmstepd[2119]: jag_common_poll_data: Task average frequency = 2266057 pid 2123 mem size 57752 914608 time 0(0+0)
Jan 13 17:12:32 vmp687 slurmstepd[2119]: energycounted = 0
Jan 13 17:12:32 vmp687 slurmstepd[2119]: getjoules_task energy = 0
Jan 13 17:12:36 vmp687 slurmstepd[2119]: getjoules_task energy = 0
Jan 13 17:12:36 vmp687 slurmstepd[2119]: removing task 0 pid 2123 from jobacct
Jan 13 17:12:36 vmp687 slurmstepd[2119]: task 0 (2123) exited with exit code 0.
Jan 13 17:12:36 vmp687 slurmstepd[2119]: Aggregated 1 task exit messages
Jan 13 17:12:36 vmp687 slurmstepd[2119]: Before call to spank_fini()
Jan 13 17:12:36 vmp687 slurmstepd[2119]: After call to spank_fini()
Jan 13 17:12:36 vmp687 slurmstepd[2119]: job 17189 completed with slurm_rc = 0, job_rc = 0
Jan 13 17:12:36 vmp687 slurmstepd[2119]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
Jan 13 17:12:36 vmp687 slurmstepd[2119]: false, shutdown
Jan 13 17:12:36 vmp687 slurmstepd[2119]: done with job
```

---

These two are fine:

```
Jan 13 17:12:31 vmp687 slurmstepd[2119]: unable to open '/cgroup/cpuset/slurm/cpus' for reading : No such file or directory
Jan 13 17:12:31 vmp687 slurmstepd[2119]: unable to get parameter 'cpus' for '/cgroup/cpuset/slurm'
```

Slurm first tries /cgroup/cpuset/slurm/cpus, and if it doesn't exist, it tries /cgroup/cpuset/slurm/cpuset.cpus. I get the same logs. There are no errors after, so we are OK. Older kernels only did cpus, without prepending "cpuset.".

---

Hi Brian,

A few other messages we see in the slurmd log:

```
Jan 14 14:58:46 vmp689 slurmd[26496]: Resource spec: system memory limit not configured for this node
Jan 14 14:58:46 vmp689 slurmd[26498]: Warning: Core limit is only 0 KB
```

What do these mean?

Another related question: is it safe to restart slurmd on a compute node where jobs are currently running? I've done a quick test and the job seems unaffected, but I just want to verify. The reason I ask is that I'd like to change the debug level on all the nodes, but this requires restarting the daemon.

---

> Jan 14 14:58:46 vmp689 slurmd[26496]: Resource spec: system memory limit not configured for this node

This one means that MemSpecLimit isn't specified on the node. See man slurm.conf. It's not necessary.

> Jan 14 14:58:46 vmp689 slurmd[26498]: Warning: Core limit is only 0 KB

This means that the core file size limit is 0 for the slurmd, so the slurmd won't generate a core file if it crashes.
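The correspondence between the level names used in `SlurmdDebug` and the numbers printed in the logs ("debug level is 6") can be sketched as a small lookup. Only debug2 == 6 and debug3 == 7 are confirmed in this ticket; the rest of the table is an assumption extrapolated from the usual quiet-through-debug5 ordering and should be verified against your Slurm version:

```shell
# Map a Slurm debug level name to the number that appears in
# slurmd/slurmstepd logs. debug2==6 and debug3==7 are confirmed above;
# the other entries are assumed from the standard level ordering.
debug_num() {
    case "$1" in
        quiet)   echo 0 ;;
        fatal)   echo 1 ;;
        error)   echo 2 ;;
        info)    echo 3 ;;
        verbose) echo 4 ;;
        debug)   echo 5 ;;
        debug2)  echo 6 ;;
        debug3)  echo 7 ;;
        debug4)  echo 8 ;;
        debug5)  echo 9 ;;
        *)       echo "unknown" ;;
    esac
}

debug_num debug2   # prints 6
debug_num debug3   # prints 7
```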
Check:

```
ulimit -Hc
ulimit -Sc
```

Restarting the daemons with running jobs is fine. I should have told you that you can change the debug level without restarting. You can do:

```
scontrol reconfigure
```

which will tell all of the daemons to re-read the config.

---

Hey Will,

Have you been able to reproduce this again?

Thanks,
Brian

---

Hi Brian,

No, we haven't seen this error again. It seems to have been an isolated incident. You can close this ticket if you'd like, and I will re-open it if this error re-appears.

Best,
Will

---

Sounds good. Let us know if you see it again.

Thanks,
Brian
Hello again,

We've had this happen with several jobs this week, and I haven't found a very complete description of the problem from searching Bugzilla, slurm-dev, and the SchedMD docs. As an example (please excuse the aliases and wrapped lines):

```
[frenchwr@vmps11 prolog]$ sinfoprobs
HOSTNAMES STATE    FEATURES TIMESTAMP           USER  REASON
vmp687    draining intel    2015-01-09T16:50:40 slurm batch job complete failure
vmp690    draining intel    2015-01-09T16:53:26 slurm batch job complete failure
[frenchwr@vmps11 prolog]$ sacctl --allusers | grep vmp687
uscms4307  7419 cms 2-00:00:00 00:55:34 2950Mn 1 1 vmp687 0:0
uscms4307  7420 cms 2-00:00:00 00:00:00 2950Mn 1 1 vmp687 0:0
7420.batch      cms            00:00:00 2950Mn 0 0 1 1 vmp687 0:0
[frenchwr@vmps11 prolog]$ sacctl --allusers | grep vmp690
uscms4307  7440 cms 2-00:00:00 00:53:04 2950Mn 1 1 vmp690 0:0
uscms4307  7441 cms 2-00:00:00 00:00:00 2950Mn 1 1 vmp690 0:0
7441.batch      cms            00:00:00 2950Mn 0 0 1 1 vmp690 0:0
uscms4307  7442 cms 2-00:00:00 00:53:04 2950Mn 1 1 vmp690 0:0
```

Can you describe what exactly has happened to cause these nodes to enter the draining state? What does "batch job complete failure" mean, exactly? It's odd that the exit status on the most recent job(s) that have run on the relevant nodes is 0 in all cases, yet Slurm has detected that something has gone wrong and is taking the nodes offline. Is this Slurm's way of preventing jobs from landing on nodes it considers suspect?

Once we notice drained nodes, we use:

```
scontrol update NodeName=vmp690 State=Resume
```

and the nodes come back, and jobs seem to run on them without issue. Is this something we should be concerned about?
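The manual `scontrol update ... State=Resume` step above could be scripted by filtering the drain reasons for "batch job complete failure". The sketch below parses a captured sample shaped like the `sinfoprobs` alias output (the column layout is an assumption); the live `scontrol` call is shown commented out since it needs a running cluster:

```shell
# Hypothetical capture in the shape of the sinfoprobs output above
# (column layout assumed: node, state, features, timestamp, user, reason):
sample='vmp687 draining intel 2015-01-09T16:50:40 slurm batch job complete failure
vmp690 draining intel 2015-01-09T16:53:26 slurm batch job complete failure
vmp101 idle intel 2015-01-09T12:00:00 slurm none'

# Pick out the node names whose drain reason is "batch job complete failure":
nodes=$(printf '%s\n' "$sample" | awk '/batch job complete failure/ {print $1}')
echo "$nodes"

# On a live cluster, after checking each node's slurmd log, you could
# resume them (commented out here):
# for n in $nodes; do
#     scontrol update NodeName="$n" State=Resume
# done
```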