We have a job that exits with a non-zero code, but it doesn't go into SE. Instead, it disappears from the queue.

In the job's log, the exit code is 134:

== JOB END STATUS=134 HOST=lnod0004 DATE=Wed Aug 26 00:27:19 BST 2015 RUNTIME=33618 ==

But in sacct, the exit code is 1:0:

[root@lque0001 etc]# sacct -j 4492303_25
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
4492303_25   220_PROD_+       idle                    24  NODE_FAIL      1:0

compute node log:

2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: _rpc_terminate_job, uid = 601
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: task_p_slurmd_release_resources: 4492331
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: credential for job 4492331 revoked
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: Waiting for job 4492331's prolog to complete
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: Finished wait for job 4492331's prolog to complete
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: Calling /d/sw/slurm/20150805/sbin/slurmstepd spank epilog
2015-08-26T01:20:07+01:00 lnod0004 spank-epilog[5336]: Reading slurm.conf file: /d/sw/slurm/20150805/etc/slurm.conf
2015-08-26T01:20:07+01:00 lnod0004 spank-epilog[5336]: Running spank/epilog for jobid [4492331] uid [1476]
2015-08-26T01:20:07+01:00 lnod0004 spank-epilog[5336]: spank: opening plugin stack /d/sw/slurm/20150805/etc/plugstack.conf
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: [job 4492331] attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: completed epilog for jobid 4492331
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: Job 4492331: sent epilog complete msg: rc = 0
2015-08-26T01:20:08+01:00 lnod0004 monitoring: Free Mem: 122977M (95%), 9% Cached, 1/5/15 Load 0.06 0.02 0.41, iowait: 0.04%, 53 ARP entries, eth3: RX 0 TX 0 Mbps
2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: debug: task_p_slurmd_batch_request: 4493341
2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: debug: Calling /d/sw/slurm/20150805/sbin/slurmstepd spank prolog
2015-08-26T01:20:23+01:00 lnod0004 spank-prolog[5374]: Reading slurm.conf file: /d/sw/slurm/20150805/etc/slurm.conf
2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received

slurmctld log:

[root@lque0001 etc]# zgrep 4492331 /var/log/slurm/slurmctld.log-20150826.gz
[2015-08-25T15:07:00.531] backfill: Started JobId=4492303_25 (4492331) on lnod0004
[2015-08-26T01:20:07.620] Batch JobId=4492331 missing from node 0 (not found BatchStartTime after startup)
[2015-08-26T01:20:07.620] job_complete: JobID=4492303_25(4492331) State=0x1 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 255
[2015-08-26T01:20:07.620] job_complete: JobID=4492303_25(4492331) State=0x1 NodeCnt=1 cancelled from interactive user or node failure
[2015-08-26T01:20:07.620] job_complete: JobID=4492303_25(4492331) State=0x8007 NodeCnt=1 done

Perhaps there is some network issue, because there are a lot of "[2015-08-26T00:29:55.697] error: slurm_receive_msg: Zero Bytes were transmitted or received" messages in the log around the time this happened, but I thought Slurm would requeue the job in case of node failure?
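For context (an aside, not from the original report): shell-style exit codes above 128 conventionally mean the process was killed by a signal, so 134 = 128 + 6 corresponds to SIGABRT, typically an abort()/assertion failure in the task. A small Python sketch to decode such a code:

```python
import signal

def decode_exit_code(code):
    """Decode a shell-style exit code: values above 128 mean
    the process was killed by signal (code - 128)."""
    if code > 128:
        return "killed by %s" % signal.Signals(code - 128).name
    return "exited with status %d" % code

print(decode_exit_code(134))  # -> killed by SIGABRT
```

This is only about how the job script saw the task exit; the 1:0 that sacct reports is Slurm's own job exit code, recorded separately.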
Hi, the job will be requeued if a node fails unless you have JobRequeue set to zero in your slurm.conf. Is 134 among the exit codes that tell slurmctld to requeue the job? Let me review the logs and get back to you. David
We have these values for RequeueExit & RequeueExitHold:

RequeueExitHold=1-199,201-255
RequeueExit=200

We don't set JobRequeue in our slurm.conf, so it should default to 1, right? We use the same configuration in all of our offices, and recently we had an issue with our cluster and the jobs did get requeued, so I don't think there is a problem with the configuration. I will attach the logs and conf.
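For reference, spelling the default out explicitly would look like this (a sketch of the relevant slurm.conf fragment; the behavior described in the comments is my reading of the slurm.conf man page, not from this report):

```
JobRequeue=1                     # default: requeue batch jobs on node failure
RequeueExit=200                  # exit code 200 -> requeue and run again
RequeueExitHold=1-199,201-255    # any other non-zero exit code -> requeue held (special exit, SE)
```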
Created attachment 2147 [details] slurm.conf
Created attachment 2148 [details] compute node log
Created attachment 2149 [details] slurmctld log
Hi Akmal, this is what I read from the logs:

The job was terminated by userid 601. slurmd went to process the termination request, which went fine, but then it had problems talking to slurmctld. This must have lasted until the next day; when slurmd on lnod0004 registered back with slurmctld, the job was gone from lnod0004. I assume you rebooted lnod0004.

Even though sacct shows NODE_FAIL, the job did not terminate because of a node failure; it just terminated, and that's why it was not requeued. The default value for JobRequeue is indeed 1.

The question is why we see these messages:

2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: debug: _slurm_recv_timeout at 0 of 4, recv zero bytes

This indicates the sender received EOF because the connection was closed, so either slurmctld was exiting in the middle of the communication, or some faulty hardware caused the connection to be reset.

David
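To illustrate the "recv zero bytes" symptom (a minimal sketch, not Slurm code): on a stream socket, recv() returning zero bytes is the conventional signal that the peer closed the connection, which is what _slurm_recv_timeout is reporting:

```python
import socket

# A connected pair of stream sockets standing in for the
# slurmd <-> slurmctld connection.
a, b = socket.socketpair()

b.close()            # the peer goes away (daemon exit, connection reset, ...)
data = a.recv(4096)  # recv() returns b'' -- EOF, not an error
print(repr(data))    # -> b''
a.close()
```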
There is a parameter in slurm.conf called LogTimeFormat. If you set it to:

LogTimeFormat=thread_id

the "thread_id" format shows the timestamp in the C standard ctime() form, without the year but including the microseconds, the daemon's process ID, and the current thread ID. The log will look like this:

Aug 26 14:55:31.239351 7228 0x7fe61e689700 slurmctld: debug: Spawning ping agent for rigatone,tonnarello
Aug 26 14:55:45.500511 7228 0x7fe61e689700 slurmctld: debug: sched: Running job scheduler

In my opinion this makes debugging easier. Do you think you can enable it?

David
hmm... I'm not sold yet. :)

According to the task log, the task finished running (by crashing) at Wed Aug 26 00:27:19 BST 2015. So the first question is indeed what happened between 00:27 and 01:20? The node remained up and on the network the entire time; our monitoring system writes to a remote syslog server every minute with details of its memory, load, network, etc.

I'm sure that our user wasn't logged in at 1:20 AM canceling jobs (and indeed, he wouldn't have then sent me a concerned email the next morning asking why this one task disappeared :). Even if he *had* canceled it, that doesn't explain what prevented the task from going into SE promptly at 00:27. It would be quite a coincidence that the one task that someone cancels happens to be the one that gets stuck for 40 minutes after crashing in the middle of the night. Right?

> The job was terminated by userid 601.

uid 601 is our slurm user

> slurmd went to process the termination
> request which went fine but had a problems talking but the slurmctld.
> This must have last for a long time till the next day when slurmd on
> lnod0004 registered back with slurmctld the job was gone from lnod0004.
> I assume you rebooted lnod0004.

lnod0004 wasn't rebooted -- monitoring shows it was healthy and on the network the entire time.

> The question is why we see these messages:
>
> 2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: debug:
> _slurm_recv_timeout at 0 of 4, recv zero bytes
>
> this indicates the sender received EOF because the connection was closed,
> so either slurmctld was exiting in the middle of the communication or some
> faulty hardware causes the connection to be reset.

I think Akmal omitted some important part of the log from the compute node:

> 2015-08-26T00:13:26+01:00 lnod0004 slurmd[14520]: debug: found apparently running job 4492331
> 2015-08-26T00:27:19+01:00 lnod0004 slurmstepd[31840]: task 0 (31844) exited with exit code 134.
> 2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: error: _step_connect: connect() failed dir /var/spool/slurmd node lnod0004 job 4492331 step -2 Connection refused
> 2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: debug: Cleaned up stray socket /var/spool/slurmd/lnod0004_4492331.4294967294
> 2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job4492331/slurm_script
> 2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: _rpc_terminate_job, uid = 601
> 2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: task_p_slurmd_release_resources: 4492331
> 2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: credential for job 4492331 revoked

(...and then continuing as he already provided)

But this still leaves some bewildering questions -- like:

- what was slurmd/slurmstepd doing on lnod0004 from 00:27 through 00:46?
- what is the Connection refused all about, and what are its side-effects?
- what is issuing the termination at 01:20? Are we sure that it's not slurmctld itself issuing that termination, after this message:

> [2015-08-26T01:20:07.620] Batch JobId=4492331 missing from node 0 (not found BatchStartTime after startup)

?

Just to be clear: I am totally willing to believe that there was a problem with the cluster node. The task did crash after all; maybe it was solar flares, a power surge, electromagnetic interference -- whatever! Totally happy to accept that. My concern is about the task dropping out of the queue afterward, rather than going into SE (if the non-zero exit code was received by slurmctld) or PD (if it believes the node failed).
I have 8 months of slurmctld logs, and these are all instances of the "missing from node 0" message:

/var/log/slurm/slurmctld.log-20150520.gz:[2015-05-19T11:24:47.925] Batch JobId=3830138 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.536] Batch JobId=4058826 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.536] Batch JobId=4058827 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.536] Batch JobId=4058828 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.537] Batch JobId=4058829 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.537] Batch JobId=4058830 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.537] Batch JobId=4058831 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.537] Batch JobId=4058832 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.538] Batch JobId=4058833 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.538] Batch JobId=4058834 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.538] Batch JobId=4058835 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.538] Batch JobId=4058836 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.538] Batch JobId=4058837 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150724.gz:[2015-07-21T14:58:12.967] Batch JobId=4319245 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150724.gz:[2015-07-22T18:22:38.979] Batch JobId=4323435 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150724.gz:[2015-07-22T18:22:38.981] Batch JobId=4323394 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150724.gz:[2015-07-22T18:22:38.990] Batch JobId=4323397 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150724.gz:[2015-07-22T18:22:39.035] Batch JobId=4323920 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150724.gz:[2015-07-22T18:24:18.346] Batch JobId=4323444 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:20:07.605] Batch JobId=4493264 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:20:07.620] Batch JobId=4492331 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:20:07.627] Batch JobId=4493272 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:53:27.122] Batch JobId=4493281 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:53:27.127] Batch JobId=4493285 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:53:27.171] Batch JobId=4493288 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:53:27.172] Batch JobId=4493290 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:53:27.175] Batch JobId=4493284 missing from node 0 (not found BatchStartTime after startup)
/var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:53:27.184] Batch JobId=4493286 missing from node 0 (not found BatchStartTime after startup)

I find it extremely suspicious that they tend to arrive in large clumps like that! Does each one represent a disappearing task, I wonder?

This isn't the first time something like this has happened to us (bug 833, bug 866), although it has become much rarer.

Cheers,
-Phil
I am not sold either... :-) just trying to reverse engineer what I see.

This changes the scenario a little bit. What I read this time is that slurmd found a running job; it periodically scans the list of jobs and asks slurmstepd for status. slurmd talks to slurmstepd over a unix socket, and this failed as you showed in your new log (file system full?). Then at 1:20 slurmd registers itself with slurmctld, which detects the missing job and terminates it, so yes, the _rpc_terminate_job does come from the controller.

Then we have the network issue... it is not a connection refused, it is really a broken connection: slurmd connect()s, possibly sends some bytes, then waits for a reply and prints "Zero Bytes were transmitted or received", which indicates a connection reset.

> Just to be clear: I am totally willing to believe that there was a
> problem with the cluster node. The task did crash after all; maybe it was
> solar flares, a power surge, electromagnetic interference -- whatever!
> Totally happy to accept that.

I don't think so. It is more likely that all the disks in your cluster spinning counterclockwise are slowing down the earth's rotation, causing a perturbation in time that affects the TCP sync algorithm, hence the problem.

Let me think a bit harder over the logs and see what other data we can collect to understand this.

David
> My concern is about the task dropping out of the queue afterward, rather than
> going into SE (if the non-zero exit code was received by slurmctld)
> or PD (if it believes the node failed).

I think the reason here is that somehow the job got lost and the controller never saw its exit code, so it had no chance to requeue it. The question is why.

David
(In reply to David Bigagli from comment #7)
> This in my opinion makes the debugging better. Do you think you can enable
> it?

I'll leave that to Akmal, but I can't see why it would be a problem.

(In reply to David Bigagli from comment #9)
> This changes the scenario a little bit. From what I read this time is that
> slurmd found a running jobs, it periodically scans the list of jobs and asks
> stepd for status. slurmd talks to slurmstepd over a unix socket and this
> failed as you showed in your new log (file system full?.

I don't think so... /var is an independent ramdisk on all of our cluster nodes, and there are 16,000,000 free inodes. Also, that socket would already have been created when the job was launched, right? Would it be doing anything at this point that would require it to create a new inode that could potentially fail?

(In reply to David Bigagli from comment #10)
> I think the reason here is that somehow the job got lost and the controller
> never saw its exit code, so it has no chance to requeue it. The question is
> why.

I agree! And this has always been my concern about the fundamental design of slurm. In my opinion, a task should only be removed from the queue under two circumstances:

- a user explicitly cancels it (which I think we now agree did not happen in this case), or
- it receives an affirmative message from the node that the task executed and returned an exit code indicative of success (which for our configuration is only 0)

If neither of those two things happens, the default behaviour *must* be to put it back in the queue automatically. That would be a fail-safe design. If slurm relies on something to explicitly put the task back in the queue, then we'll forever be losing tasks when we encounter a crack in the system that the designers didn't explicitly plan for and handle. Not fail-safe.
And it's a lot, lot more work for you guys, who must make the relationship and protocol between slurmctld, slurmd, and slurmstepd absolutely air-tight in the face of every possible failure scenario -- otherwise tasks vanish. That sounds like no fun, to me. Am I wrong? Cheers, -Phil
LogTimeFormat=thread_id added to our slurm.conf
Is there any core file from slurmstepd on lnod0004? You can check in the log directory.

David
Nope, didn't see any.
Hi Phil, the socket is created when the job is launched and is used during the job's lifetime by slurmd to send commands to slurmstepd. The socket is removed when the job is done. Nothing else should happen.

I agree with your design given your use case. I can comment, as I am not one of the original slurm designers. :-) I think the problem is that we are trying to solve different use cases with one code base. As an example, you remember the fix we did so that if a job has an invalid dependency slurm no longer terminates it; this is something we discussed and did a while back. Well, this fix was not favorably received by other customers, to my surprise, and they complained so hard that we had to create a configurable parameter called kill_invalid_depend.

Anyway, I am still investigating this issue, trying to come up with a good explanation or find a bug.

David
Here is the story of job 4492331 as reported by the log of lnod0004.

1) The job starts:

2015-08-25T15:07:01+01:00 lnod0004 slurmd[6023]: Launching batch job 4492331 for UID 1476

2) It runs well and its state is periodically polled by slurmd; this is the time of last known contact between slurmd and slurmstepd:

2015-08-26T00:13:26+01:00 lnod0004 slurmstepd[31840]: Handling REQUEST_STATE
2015-08-26T00:13:26+01:00 lnod0004 slurmd[14520]: debug: found apparently running job 4492331

3) Then this happens:

2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: error: _step_connect: connect() failed dir /var/spool/slurmd node lnod0004 job 4492331 step -2 Connection refused
2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: debug: Cleaned up stray socket /var/spool/slurmd/lnod0004_4492331.4294967294
2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job4492331/slurm_script

slurmstepd must have died. You reported the user task crashed, but this should not matter to slurmstepd, as it runs as root, although in the same session as the user job. In this situation slurmd throws away the job.

4) Then comes the requiem, and slurmctld terminates the job as well:

2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug: _rpc_terminate_job, uid = 601

Two things to understand: what happened to the slurmstepd, and whether we can create a patch in which slurmctld requeues the job, instead of terminating it, when a running job is not reported by slurmd. The first item is probably hard at this point, but the second can surely be investigated.

Although worthy of investigation, I think the message about "Zero Bytes were transmitted or received" is probably independent of this problem.

David
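The "Connection refused" in step 3 fits this reading: the socket file still existed in /var/spool/slurmd, but the slurmstepd that should have been accepting on it was gone. A minimal sketch (illustrative only, not Slurm code; paths are made up) reproducing that condition:

```python
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "job.sock")

# Create the socket file, then let its owner die without ever accepting:
# this mimics slurmstepd having exited while its socket remains on disk.
owner = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
owner.bind(path)   # the socket file now exists in the filesystem
owner.close()      # ...but nobody is listening on it anymore

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    client.connect(path)  # analogous to what slurmd's _step_connect() attempts
except ConnectionRefusedError:
    print("Connection refused")  # the error slurmd logged
finally:
    client.close()
    os.unlink(path)  # clean up the stray socket, as slurmd then did
```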
Evening. I'm just catching up on this, and I find this a little disturbing:

> 3) then this happens:
>
> 2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: error: _step_connect:
> connect() failed dir /var/spool/slurmd node lnod0004 job 4492331 step -2
> Connection refused
> 2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: debug: Cleaned up stray
> socket /var/spool/slurmd/lnod0004_4492331.4294967294
> 2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: _handle_stray_script:
> Purging vestigial job script /var/spool/slurmd/job4492331/slurm_script
>
> slurmstepd must have died. You reported the user task crashed but this should
> be not matter to slurmstepd as it runs as root although in the same session
> as the user job. In this situation slurmd throws away the job.

So, as far as slurm is concerned, something not caused by the user or the job results in slurm just throwing the job away? Surely this should be treated almost identically to a node crash? The job goes back on the queue in PD state? And once communication with the node is resumed, it should be treated as if the node had just been rebooted?

If a node is struggling for memory, the OOM killer might just kill slurmstepd if slurmstepd allocates a byte of memory at just the moment the OOM killer is making its decision.
Indeed, that's item 2 of the investigation: provide a patch in which slurmctld puts the job back to pending. I think it is fine for slurmd to clean up the socket and the job script, as there is nothing else it can do, but slurmctld can make a different decision than terminating the job in this case.

David
Oh, absolutely, cleaning up is the only sensible thing to do. I guess covering the case where both slurmd and slurmstepd get whacked is also good :)

I tend to agree with Phil, though. Having the ability to choose the "catch-all" for job failures would be good. So, rather than just tossing the job, it would be good to have a config option to change the catch-all to put the job in SE. That way, no matter what happens, if slurm doesn't see a known exit path, the job ends up in SE, leaving it to the user to decide what to do, and maintaining job dependencies etc.

In reality, it's the job-dependency tree that really bites us.
Yes, we just need to enumerate all possible scenarios in which this happens, but I am confident that, working a few more months with you, we'll catch them all :-))

Do you mean a job that gets lost like this one is in the middle of a dependency tree, which then breaks as a consequence?

David
> Do you mean a job that gets lost like this one is in the middle of a
> dependency tree which then breaks consequently?

Yes. Cleaning up after such an event can be quite painful for our users...
Fixed in this commit:

https://github.com/SchedMD/slurm/commit/d8e6f55d4168807b279ad68c0faee9cbd8a7733c

The change will be in Slurm version 14.11.9 when released, probably on Monday.

Our best guess is that the slurmstepd process (Slurm's job shepherd) somehow aborted. There are no indications of how that might have happened, and the status of the job is very much unknown at that point. This patch will requeue the job under those conditions (i.e. a batch job with requeue enabled). It also improves the logging. The slurmctld log will now look like this under those conditions:

slurmctld: sched: Allocate JobId=20719 NodeList=tux10 #CPUs=1
slurmctld: Batch JobId=20719 missing from node 0 (not found BatchStartTime after startup), Requeuing job
slurmctld: job_complete: JobID=20719 State=0x1 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 255
slurmctld: job_complete: JobID=20719 State=0x1 NodeCnt=1 cancelled due to node failure
slurmctld: job_complete: requeue JobID=20719 State=0x8000 NodeCnt=1 due to node failure
slurmctld: job_complete: JobID=20719 State=0x8000 NodeCnt=1 done
slurmctld: Requeuing JobID=20719 State=0x0 NodeCnt=0
slurmctld: backfill: Started JobId=20719 on tux10
Thanks.