We've just upgraded to 17.02.5 (from 15.08.7). We're now having a problem where nodes are being drained with reason "batch job complete failure".

Example slurmctld.log entries:

[2017-06-29T12:28:40.004+01:00] error: slurmd error running JobId=3302451 on node(s)=expspicesrv003: Kill task failed
[2017-06-29T12:28:40.004+01:00] drain_nodes: node expspicesrv003 state set to DRAIN
[2017-06-29T12:28:40.004+01:00] job_complete: JobID=3302451 State=0x1 NodeCnt=1 WEXITSTATUS 1
[2017-06-29T12:28:40.004+01:00] job_complete: JobID=3302451 State=0x8005 NodeCnt=1 done

Corresponding slurmd.log entries:

[2017-06-29T12:28:40.000+01:00] [3302451] error: *** JOB 3302451 STEPD TERMINATED ON expspicesrv003 AT 2017-06-29T12:28:40 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2017-06-29T12:28:40.000+01:00] [3302451] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 256
[2017-06-29T12:28:40.003+01:00] [3302451] done with job

What does this error mean, and why are the nodes being drained as a result? I can see this error is new in 17.02 but I'm unclear what its purpose is or what triggers it:
https://github.com/SchedMD/slurm/commit/f18390e81766a46b50ffa08a8cf1b7946ecdbf90
From the log entries you provided, it looks like the slurmd had a problem with a job step that didn't end correctly, and the slurmd wasn't able to kill it even after the timeout was reached. Because of this, the node is drained to give admins the opportunity to manually intervene and determine whether there is a problem that needs to be addressed, rather than continuing to run jobs on a potentially problematic node. That's the reasoning behind draining the node in this situation.

When you're ready, you can bring the node back online for jobs with:

scontrol update nodename=expspicesrv003 state=resume

Please let me know if you have further questions.

Tim
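(A quick way to spot affected nodes and the recorded drain reason, before resuming them, is the standard sinfo reason listing:)

# show down/drained nodes together with the reason slurmctld recorded
sinfo -R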
What could cause this? I can't see any evidence of issues on the nodes. Given that this is a new feature in slurm, what would the behaviour have been at 15.08? We had 4 nodes drained yesterday with this problem, so I certainly need to get this fixed.
We are continuing to have problems. At the moment I have no idea how to prevent this. We did not have this problem with 15.08. So (as already asked), could you please explain how 15.08 would have behaved in this situation? Are you saying that slurm will now always drain a node if the job step can't be killed? Is this related to the "UnkillableStepTimeout" setting?

Previously we would see occasional errors such as:

proctrack_p_wait: Unable to destroy container 44825 in cgroup plugin, giving up after 128 sec

I think this happens when processes get stuck in uninterruptible sleep due to transient I/O issues on one of our network file systems. I haven't seen this error message since the upgrade (although that message still exists in the slurm code) but is that because slurm is now giving up much earlier?

In the most recent case I checked, I can see that the cgroup of the job that was terminated has not been cleaned up.
*** Ticket 3958 has been marked as a duplicate of this ticket. ***
David,

I believe in 15.08 you would get the error messages and it would spend the time trying to clean up, but eventually it would just move on. So 17.02 is exposing that problem more by draining the node.

A few suggestions:

First, set UnkillableStepTimeout to 120. I noticed in the code that the step timeout defaults to 60 seconds but the proctrack cgroup timeout is 120 seconds, so let's make them the same to get them synced up.

Second, set SlurmdDebug=debug2. This will provide a lot more detail on the attempts to kill the process and the specific pid, which will hopefully shed more light on the situation. It will also produce a much larger, more verbose log file, so you'll want to set it back to your previous log level as soon as you catch the problem in the log file.

Lastly, could you attach your slurm.conf to the bug just so we can refer to it if needed.

Thanks.
Tim
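(For reference, the configuration changes suggested above would look something like the slurm.conf lines below; the values are the ones discussed in this ticket, and the commented-out UnkillableStepProgram path is only a hypothetical example. Apply with "scontrol reconfig" or a slurmd restart, and remember to drop SlurmdDebug back down once the problem has been captured.)

# slurm.conf excerpt (sketch)
UnkillableStepTimeout=120     # seconds before a step is declared unkillable; matches the proctrack/cgroup timeout
SlurmdDebug=debug2            # temporary, to log the individual kill attempts and pids
#UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh   # optional site diagnostic script (hypothetical path)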
Created attachment 4887 [details] slurm.conf
Created attachment 4888 [details] slurmd.log
I've just had the first failure after making the changes you suggested. I've attached the slurmd.log contents from the time of job submission until the failure. I've also attached our slurm.conf.

The job in question was cancelled by the user. Relevant slurmctld.log entries:

[2017-07-06T16:08:14.204+01:00] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 3528793 uid 10895
[2017-07-06T16:09:49.787+01:00] Resending TERMINATE_JOB request JobId=3528793 Nodelist=expspicesrv018
[2017-07-06T16:10:15.003+01:00] error: slurmd error running JobId=3528793 on node(s)=expspicesrv018: Kill task failed
[2017-07-06T16:10:15.285+01:00] cleanup_completing: job 3528793 completion process took 121 seconds
(In reply to Tim Shaw from comment #7)
> I believe in 15.08 you would get the error messages and it would spend the
> time trying to clean up, but eventually it would just move on.

What error messages do you mean? The "STEPD TERMINATED DUE TO JOB NOT ENDING WITH SIGNALS" message did not exist in 15.08.

> A few suggestions: First, set UnkillableStepTimeout to 120. I noticed in the
> code that the step timeout defaults to 60 seconds but the proctrack cgroup
> timeout is 120 seconds, so let's make them the same to get them synced up.

Note that I'm still not getting any message from proctrack_p_wait. However, that message mentioned 128s, not 120, so perhaps we didn't increase UnkillableStepTimeout enough?
Looking at your log, you'll see the entry:

killing process 29865 (inherited_task) with signal 9

many times. This is the stuck process that the slurmd isn't able to kill. And the cgroup can't be destroyed because the process is still running inside of it:

_slurm_cgroup_destroy: problem deleting step cgroup path /cgroup/freezer/slurm/uid_10895/job_3528793/step_batch: Device or resource busy

Why the process is getting stuck is hard to say at this point. You'll want to contact the user that submitted the job and find out details about the job and submission. Could it be stuck waiting on I/O? Finding out more about the nature of the job may reveal why it's getting stuck.

> What error messages do you mean? The "STEPD TERMINATED DUE TO JOB NOT ENDING
> WITH SIGNALS" message did not exist in 15.08.

I'm not referring to any specific error message, although the message you provided:

proctrack_p_wait: Unable to destroy container 44825 in cgroup plugin, giving up after 128 sec

seems to stem from the same problem. And rather than doing anything about it, older versions of slurmd would just ignore it and move on.

> Note that I'm still not getting any message from proctrack_p_wait. However,
> that message mentioned 128s, not 120, so perhaps we didn't increase
> UnkillableStepTimeout enough?

Possibly, but I don't think it would make a difference.

Regards.
Tim
OK. I'll do some further investigation and report back.
I'm trying to use UnkillableStepProgram to gather information when the problem occurs. However, as far as I can see this is not working. Presumably this should still work and I should be able to see evidence of it running (with SlurmdDebug=debug2)?
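(For anyone following along: an UnkillableStepProgram is just a site-provided script that slurmd runs once a step has been declared unkillable, so a minimal diagnostic version of the kind being attempted here could look like the sketch below. The path, output location, and choice of commands are illustrative assumptions, not something taken from this ticket.)

#!/bin/bash
# Hypothetical /usr/local/sbin/unkillable_step.sh - capture node state when a step is declared unkillable.
out=/var/log/slurm/unkillable.$(hostname -s).$(date +%Y%m%dT%H%M%S)
{
    echo "=== $(date) unkillable step reported on $(hostname) ==="
    # Processes in uninterruptible sleep (state D) are the usual suspects when I/O hangs
    ps -eo pid,ppid,user,stat,wchan:32,args | awk 'NR==1 || $4 ~ /D/'
    # Recent kernel messages often show hung-task warnings or NFS timeouts
    dmesg | tail -n 50
} >> "$out" 2>&1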
Yes, UnkillableStepProgram should be called if the process is determined to be unkillable and there should be entries noting it in the log file. I'm currently working on trying to reproduce this issue myself. I'll let you know what I find.
David,

I wasn't able to reproduce a problem with running the UnkillableStepProgram. Once my UnkillableStepTimeout expired, my UnkillableStepProgram ran and there was this entry in the log:

debug: step_terminate_monitor: unkillable after 60 sec, calling: /home/tshaw/test/3941/unkillable_program.sh

So I don't see an issue with it. Look to see if you see that log entry in your slurmd log as well.

Also, when you see a drained node and log in to investigate, do you find the unkillable process still running, or has it exited on its own by then? You can find the pid in the log on this line:

debug2: killing process 29865 (inherited_task) with signal 9

If the process does eventually run to completion on its own, you may just need to increase the UnkillableStepTimeout value. That option is the amount of time, in seconds, that Slurm will wait before deciding that a process is unkillable. That way, if the problem is I/O related and the process becomes responsive later, it will be killed without the node draining. I would try to determine what that amount of time is and then set UnkillableStepTimeout to something higher than that. Or, if that's too hard to catch, try increasing the value incrementally until the drainings no longer occur.

Hope that helps.
Tim, I don't see that debug line in my logs.

I'm confused by the code. It looks like the UnkillableStepProgram won't be run if job->state != SLURMSTEPD_STEP_RUNNING:
https://github.com/SchedMD/slurm/blob/slurm-17.02/src/slurmd/slurmstepd/step_terminate_monitor.c#L178

But job->state appears to be set to SLURMSTEPD_STEP_ENDING before step_terminate_monitor is ever started:
https://github.com/SchedMD/slurm/blob/slurm-17.02/src/slurmd/slurmstepd/mgr.c

So how will UnkillableStepProgram ever run? Apologies if I'm misunderstanding the code.

How did you manage to test this? Can you suggest any way to replicate the problem? We're fairly certain all our problems are I/O related, so I'm assuming we have processes stuck in uninterruptible sleep.
Before I respond to your last questions, I wanted to follow up on my last response. Have you had a chance to increase your UnkillableStepTimeout? Have you seen any improvement from doing so? Assuming your stuck processes do eventually free up, this will prevent any further drainings from occurring. Let me know how that goes.

(In reply to David Matthews from comment #17)
> Tim, I don't see that debug line in my logs.
>
> I'm confused by the code. It looks like the UnkillableStepProgram won't be
> run if job->state != SLURMSTEPD_STEP_RUNNING:
> https://github.com/SchedMD/slurm/blob/slurm-17.02/src/slurmd/slurmstepd/step_terminate_monitor.c#L178
>
> But job->state appears to be set to SLURMSTEPD_STEP_ENDING before
> step_terminate_monitor is ever started:
> https://github.com/SchedMD/slurm/blob/slurm-17.02/src/slurmd/slurmstepd/mgr.c
>
> So how will UnkillableStepProgram ever run?
> Apologies if I'm misunderstanding the code.

I'm assuming you're referring to the logic in mgr.c:1332. The code there gets called at the beginning of the step and probably hasn't hit the problem yet, so the state does not get set to SLURMSTEPD_STEP_ENDING. If it did, you would also see these entries in the log file:

Before call to spank_fini()
After call to spank_fini()

> How did you manage to test this?
> Can you suggest any way to replicate the problem?

Because I couldn't create a process that was genuinely unkillable, I tested this by creating a job script that would trap and ignore SIGINT & SIGTERM signals and then sleep for a really long time. Then, for testing purposes only, I temporarily changed the stepd code to send a SIGTERM rather than a SIGKILL (proctrack_cgroup.c:539) and recompiled. Then I submitted the job with a really short walltime so that the job was sure to run past it and the slurmstepd would be sure to try to kill the long-running job:

sbatch -t 00:05 job.sh

That simulated a stuck job because my script was ignoring the SIGTERM signals sent to it by the slurmd. I'm not sure how helpful this will be for you, but that is how I tested it.

> We're fairly certain all our problems are I/O related, so I'm assuming we
> have processes stuck in uninterruptible sleep.

I'm curious to know if the stuck processes ever free up. If they don't ever free up, at least the processes are there to inspect and do whatever investigation is needed (like an strace).

Tim
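(To make the test recipe above concrete, a job script of the kind described - with an arbitrary name and sleep pattern, and assuming the temporary SIGKILL-to-SIGTERM code change Tim mentions - might look like this:)

#!/bin/bash
# job.sh - sketch of a test job that ignores the termination signals slurmd sends at the time limit
trap '' INT TERM
# stay alive well past the requested walltime; looping over short sleeps keeps the
# script running even if an individual sleep is interrupted
while true; do
    sleep 60
done

(Submitted with "sbatch -t 00:05 job.sh" as above, the script outlives its walltime and, with the modified stepd, is eventually declared unkillable.)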
+1 for this problem on 17.02.6. We've had a handful of nodes drained for this reason since upgrading to 17.02.6.

We have Cgroups enabled. No NFS problems (that I know of). No stray user processes left behind (hours later). Nothing in the syslog around the time of the error, no OOM-killer messages.

This user has previously run jobs exceeding the memory allocation, so I guess the job got killed for that reason.

I've increased UnkillableStepTimeout to 150 now and run "scontrol reconfig". Is there any need to restart slurmd on the nodes as well?

The slurmd.log for this job leading up to the draining of the node at 22:30:08:

[2017-07-15T20:46:11.980] task_p_slurmd_batch_request: 121567
[2017-07-15T20:46:11.980] task/affinity: job 121567 CPU input mask for node: 0xFFFFFF
[2017-07-15T20:46:11.980] task/affinity: job 121567 CPU final HW mask for node: 0xFFFFFF
[2017-07-15T20:46:11.981] _run_prolog: run job script took usec=4
[2017-07-15T20:46:11.981] _run_prolog: prolog with lock for job 121567 ran for 0 seconds
[2017-07-15T20:46:11.981] Launching batch job 121567 for UID 221341
[2017-07-15T20:46:12.016] [121567] task/cgroup: /slurm/uid_221341/job_121567: alloc=192000MB mem.limit=192000MB memsw.limit=192000MB
[2017-07-15T20:46:12.016] [121567] task/cgroup: /slurm/uid_221341/job_121567/step_batch: alloc=192000MB mem.limit=192000MB memsw.limit=192000MB
[2017-07-15T20:46:12.046] [121567] task_p_pre_launch: Using sched_affinity for tasks
[2017-07-15T22:28:09.672] [121567] error: *** JOB 121567 ON x073 CANCELLED AT 2017-07-15T22:28:09 ***
[2017-07-15T22:28:10.692] [121567] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
[2017-07-15T22:28:10.695] [121567] done with job
[2017-07-15T22:28:32.088] task_p_slurmd_batch_request: 123236
[2017-07-15T22:28:32.088] task/affinity: job 123236 CPU input mask for node: 0xFFFFFF
[2017-07-15T22:28:32.088] task/affinity: job 123236 CPU final HW mask for node: 0xFFFFFF
[2017-07-15T22:28:32.089] _run_prolog: run job script took usec=4
[2017-07-15T22:28:32.089] _run_prolog: prolog with lock for job 123236 ran for 0 seconds
[2017-07-15T22:28:32.089] Launching batch job 123236 for UID 24803
[2017-07-15T22:28:32.126] [123236] task/cgroup: /slurm/uid_24803/job_123236: alloc=256000MB mem.limit=256000MB memsw.limit=256000MB
[2017-07-15T22:28:32.126] [123236] task/cgroup: /slurm/uid_24803/job_123236/step_batch: alloc=256000MB mem.limit=256000MB memsw.limit=256000MB
[2017-07-15T22:28:32.150] [123236] task_p_pre_launch: Using sched_affinity for tasks
[2017-07-15T22:29:09.185] [123236] Defering sending signal, processes in job are currently core dumping
[2017-07-15T22:30:08.033] [123236] error: *** JOB 123236 STEPD TERMINATED ON x073 AT 2017-07-15T22:30:07 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2017-07-15T22:30:08.040] [123236] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 10752
[2017-07-15T22:30:08.047] [123236] done with job
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #19)
> +1 for this problem on 17.02.6. We've had a handful of nodes drained for
> this reason since upgrading to 17.02.6.
>
> We have Cgroups enabled. No NFS problems (that I know of). No stray user
> processes left behind (hours later). Nothing in the syslog around the time
> of the error, no OOM-killer messages.
>
> This user has previously run jobs exceeding the memory allocation, so I
> guess the job got killed for that reason.

A likely explanation for our issue is large timeouts when terminating the job: the user's job is producing huge core dump files, of the order of 9 GB. These large files could take minutes to write to a busy NFS server over the compute node's 1 Gbit Ethernet link.

It would make sense to me to kill the user process even during the core dumping stage, if possible.
We're seeing more of these node drain events today. But even though I've increased the timeout:

# scontrol show config | grep UnkillableStepTimeout
UnkillableStepTimeout   = 150 sec

and run "scontrol reconfig", the slurmd on a compute node still times out after 60 seconds; see this slurmd.log:

[2017-07-17T13:21:26.764] [123429] error: *** JOB 123429 ON g034 CANCELLED AT 2017-07-17T13:19:27 DUE TO TIME LIMIT ***
[2017-07-17T13:22:30.284] [123429] error: *** JOB 123429 STEPD TERMINATED ON g034 AT 2017-07-17T13:22:27 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2017-07-17T13:22:30.284] [123429] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4001 status 15
[2017-07-17T13:22:30.286] [123429] done with job

Question: Must we restart the slurmd daemons in order to reconfigure the UnkillableStepTimeout?

Note added: In order to get an alert for nodes that get drained, I've created a Slurm trigger as described in https://wiki.fysik.dtu.dk/niflheim/SLURM#slurm-trigger-information (changing the state from "down" to "drained").
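(For reference, the drained-node trigger mentioned above can be registered with strigger along these lines; the notification script path is a placeholder and the exact options should be checked against the strigger man page for your version:)

# Register a permanent trigger, run as SlurmUser, that fires when any node enters the DRAINED state
strigger --set --node --drained --program=/usr/local/bin/notify_drained_nodes.sh --flags=perm
# List the currently registered triggers to confirm it took effect
strigger --get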
Tim, thanks for the info on how you tested this. I still don't understand why I can't get the UnkillableStepProgram to run - I'll try to add some extra diagnostics and let you know what I find.

I've tried increasing UnkillableStepTimeout but that hasn't really helped. When we have I/O issues the problem can last for an extended period. I assume the processes do eventually free up (I haven't actually managed to get any firm evidence of what's happening because I can't get UnkillableStepProgram to run).

I think there are several points to consider here:

1) Is the new behaviour of slurm sensible? Previously you had a choice whether to drain the node or not (via the UnkillableStepProgram) - now it happens regardless. However, I can see that draining the node is probably not a bad thing to do in most cases (although I would prefer a more specific reason than "batch job complete failure").

2) Assuming the new behaviour is retained, I think it needs documenting. For example, the documentation for UnkillableStepTimeout only mentions that UnkillableStepProgram will be executed:
https://slurm.schedmd.com/slurm.conf.html
It ought to mention that, if this timeout is reached, the node will also be drained with reason "batch job complete failure".

3) I still need to work out why I can't run UnkillableStepProgram.

Hope that makes sense.
David,

Check to make sure that the UnkillableStepProgram is on the compute node's local file system (or accessible on an NFS share) and that you've reconfigured or restarted the slurmd process. Also, do you see any error or reference to "unkillable" in the slurmd.log? If not, run "scontrol show config" on the compute node:

# scontrol show config | grep Unkill
UnkillableStepProgram   = /home/tshaw/test/3941/unkillable_program.sh
UnkillableStepTimeout   = 60 sec

Perhaps it just hasn't picked up the slurm.conf change.

Regarding your points:

> 1) Is the new behaviour of slurm sensible? Previously you had a choice
> whether to drain the node or not (via the UnkillableStepProgram) - now it
> happens regardless. However, I can see that draining the node is probably
> not a bad thing to do in most cases (although I would prefer a more specific
> reason than "batch job complete failure").

Understood. This change was implemented to protect against scheduling any jobs on a node whose state is suspect, but we're open to recommendations from customers on how to make the problem more obvious. I'll see what's involved with doing that.

> 2) Assuming the new behaviour is retained, I think it needs documenting. For
> example, the documentation for UnkillableStepTimeout only mentions that
> UnkillableStepProgram will be executed:
> https://slurm.schedmd.com/slurm.conf.html
> It ought to mention that, if this timeout is reached, the node will also be
> drained with reason "batch job complete failure".

Agreed. I put together a doc patch for this as well.

Tim
By adding some extra debug statements I've now confirmed the problem is this line:
https://github.com/SchedMD/slurm/blob/slurm-17.02/src/slurmd/slurmstepd/step_terminate_monitor.c#L178

When call_external_program is called, job->state has the value "3" (SLURMSTEPD_STEP_ENDING?), so it just returns.

So the question is: why is job->state wrong, or why is the external program not called in this job state?
Created attachment 4939 [details]
Remove state check from UnkillableProgram

David,

Here is a patch that removes that state check so that your UnkillableProgram can run. Can you test it and let me know how it works for you?

Also, because you've confirmed that your job is in an ending state, it makes me think that the I/O problem is in the epilog script, rather than the job. You may want to inspect it to verify.

Tim
(In reply to Tim Shaw from comment #29)
> Here is a patch that removes that state check so that your UnkillableProgram
> can run. Can you test it and let me know how it works for you?

Thanks - I've implemented this but am still waiting for a failure to confirm it works (I'm sure it will).

> Also, because you've confirmed that your job is in an ending state, it makes
> me think that the I/O problem is in the epilog script, rather than the job.
> You may want to inspect it to verify.

If you look at the slurmd.log output I provided, you'll see the entry "debug: [job 3528793] attempting to run epilog". This occurs after the "STEPD TERMINATED ... DUE TO JOB NOT ENDING WITH SIGNALS" entry, so the problem can't be in the epilog, can it?
> If you look at the slurmd.log output I provided, you'll see the entry "debug:
> [job 3528793] attempting to run epilog". This occurs after the "STEPD
> TERMINATED ... DUE TO JOB NOT ENDING WITH SIGNALS" entry, so the problem
> can't be in the epilog, can it?

Yeah, you're probably right there. Just let me know when you've seen the next failure and how it went.
(In reply to Tim Shaw from comment #38)
> Just let me know when you've seen the next failure and how it went.

The UnkillableStepProgram is now running fine - thanks. Will your patch make it into the next 17.02 release?
David, This patch has been checked into commit 9f48e07c2ab8c8 and will be available in the next 17.02.7 release. Regards. Tim
Thanks Tim. Has the doc patch gone in as well?
Hi Tim,

(In reply to Tim Shaw from comment #40)
> This patch has been checked into commit 9f48e07c2ab8c8 and will be available
> in the next 17.02.7 release.

Have you planned a date for the 17.02.7 release yet?

Thanks!
--
Kilian
David, I got a doc change committed in 04b431b43b which will also be out in 17.02.7. Tim
Kilian, 17.02.7 doesn't have a firm date yet but should be out in the next couple of weeks. Tim
Just a comment that we're seeing this issue on 17.11.5 on RHEL7. We're not using the UnkillableStepProgram, but I have just increased UnkillableStepTimeout to 130s from its default of 60s. If this solves it, then it's probably best to increase the default value in the code.

cheers,
Chris
Hi There,

We are currently running 17.02.10 and the issue is not fixed. We are doing some troubleshooting, and what we've noticed is that, even though Slurm tries to kill them, the jobs are still there. I wonder why it fails, since we are able to use the kill command and the jobs just get killed. I can confirm that the status of the processes is R when this happens (so no zombies or D-state processes):

[root@syrah01 ~]# sacct -j 345914
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
345914       jobarstest  cpunormal     geomag          4    TIMEOUT      0:0
345914.batch      batch                geomag          4     FAILED    127:0

[root@cnode033 ~]# ps aufx | grep slurm -A 3
root       5069  0.0  0.0 3469752  7984 ?        S    Apr18   0:04 /cm/shared/apps/slurm/17.02.10/sbin/slurmd
gopinath 117227  0.0  0.0  113144  1436 ?        S    Apr23   0:00 /bin/bash /cm/local/apps/slurm/var/spool/job345914/slurm_script
gopinath 117229  398  0.0  304240 10268 ?        Rl   Apr23 11550:53  \_ ./main imexrk_Nr32_Nm64_dt1.e-8.txt
gopinath 117228  0.0  0.0  113144  1436 ?        S    Apr23   0:00 /bin/bash /cm/local/apps/slurm/var/spool/job345785/slurm_script
gopinath 117230  715  0.0  582100 10600 ?        Rl   Apr23 20749:55  \_ ./main rk4_Nr32_Nm64_dt1.e-8.txt

[root@cnode033 ~]# kill 117227

[root@cnode033 ~]# ps aufx | grep slurm -A 3
root       5069  0.0  0.0 3469752  7984 ?        S    Apr18   0:04 /cm/shared/apps/slurm/17.02.10/sbin/slurmd
gopinath 117228  0.0  0.0  113144  1436 ?        S    Apr23   0:00 /bin/bash /cm/local/apps/slurm/var/spool/job345785/slurm_script
gopinath 117230  715  0.0  582100 10600 ?        Rl   Apr23 20757:57  \_ ./main rk4_Nr32_Nm64_dt1.e-8.txt
gopinath 117229  398  0.0  304240 10268 ?        Rl   Apr23 11555:22 ./main imexrk_Nr32_Nm64_dt1.e-8.txt

[root@cnode033 ~]# pstree -aps 117229
systemd,1 --system
  └─main,117229 imexrk_Nr32_Nm64_dt1.e-8.txt
      ├─{main},117231
      ├─{main},117232
      ├─{main},117233
      └─{main},117234

[root@cnode033 ~]# kill 117229

[root@cnode033 ~]# ps aux | grep 117229
root       9036  0.0  0.0 112656   964 pts/2   S+   14:55   0:00 grep --color=auto 117229

And the job just disappears.

We are working on a script to be executed when the job is not killable, which will perform a simple kill - but it would be good to know the root cause. What kind of signals is Slurm sending to those applications?

Another thing we noticed is that we killed the parent (117227) of the process, but the child kept running. (That shouldn't depend on Slurm, I suppose.)

Regards,
Andrea
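(The "simple kill" script mentioned above could be as small as the sketch below. How it gets invoked and where it finds the job ID are assumptions; the freezer cgroup path mirrors the one seen earlier in this ticket and may differ between sites and cgroup setups.)

#!/bin/bash
# Hypothetical last-resort cleanup: SIGKILL every task left in the job's freezer cgroup(s).
jobid=$1    # assumed to be passed in by whatever calls this script
for tasks in /cgroup/freezer/slurm/uid_*/job_${jobid}/step_*/tasks; do
    [ -e "$tasks" ] || continue
    while read -r pid; do
        kill -9 "$pid" 2>/dev/null
    done < "$tasks"
done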
Hi Andrea - Please file a separate bug on this; I need to keep customer versions of these issues (especially resolved ones) separate from new non-customer reports of similar issues. - Tim