Ticket 1889 - Job disappear from queue (due to node failure?)
Summary: Job disappear from queue (due to node failure?)
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.11.8
Hardware: Linux Linux
Severity: 2 - High Impact
Assignee: David Bigagli
 
Reported: 2015-08-25 21:57 MDT by Akmal Madzlan
Modified: 2015-08-30 14:35 MDT
4 users

See Also:
Site: DownUnder GeoSolutions
Version Fixed: 14.11.9


Attachments
slurm.conf (1.66 KB, text/plain)
2015-08-25 23:40 MDT, Akmal Madzlan
Details
compute node log (3.65 MB, text/x-log)
2015-08-25 23:42 MDT, Akmal Madzlan
Details
slurmctld log (4.28 MB, application/x-gzip)
2015-08-25 23:44 MDT, Akmal Madzlan
Details

Description Akmal Madzlan 2015-08-25 21:57:17 MDT
We have a job that exited with a non-zero code, but it didn't go into SE. Instead, it disappeared from the queue.

In the job's log, the exit code is 134

== JOB END  STATUS=134 HOST=lnod0004 DATE=Wed Aug 26 00:27:19 BST 2015 RUNTIME=33618 ==

But in sacct, the exit code is

[root@lque0001 etc]# sacct -j 4492303_25
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
4492303_25   220_PROD_+       idle                    24  NODE_FAIL      1:0 


compute node log

2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  _rpc_terminate_job, uid = 601
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  task_p_slurmd_release_resources: 4492331
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  credential for job 4492331 revoked
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  Waiting for job 4492331's prolog to complete
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  Finished wait for job 4492331's prolog to complete
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  Calling /d/sw/slurm/20150805/sbin/slurmstepd spank epilog
2015-08-26T01:20:07+01:00 lnod0004 spank-epilog[5336]: Reading slurm.conf file: /d/sw/slurm/20150805/etc/slurm.conf
2015-08-26T01:20:07+01:00 lnod0004 spank-epilog[5336]: Running spank/epilog for jobid [4492331] uid [1476]
2015-08-26T01:20:07+01:00 lnod0004 spank-epilog[5336]: spank: opening plugin stack /d/sw/slurm/20150805/etc/plugstack.conf
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  [job 4492331] attempting to run epilog [/d/sw/slurm/etc/slurm_epilog.sh]
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  completed epilog for jobid 4492331
2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  Job 4492331: sent epilog complete msg: rc = 0
2015-08-26T01:20:08+01:00 lnod0004 monitoring: Free Mem: 122977M (95%), 9% Cached, 1/5/15 Load 0.06 0.02 0.41, iowait: 0.04%, 53 ARP entries, eth3: RX 0 TX 0 Mbps
2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: error: slurm_receive_msg_and_forward: Zero Bytes were transmitted or received
2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: debug:  task_p_slurmd_batch_request: 4493341
2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: debug:  Calling /d/sw/slurm/20150805/sbin/slurmstepd spank prolog
2015-08-26T01:20:23+01:00 lnod0004 spank-prolog[5374]: Reading slurm.conf file: /d/sw/slurm/20150805/etc/slurm.conf
2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received

slurmctld log

[root@lque0001 etc]# zgrep 4492331 /var/log/slurm/slurmctld.log-20150826.gz
[2015-08-25T15:07:00.531] backfill: Started JobId=4492303_25 (4492331) on lnod0004
[2015-08-26T01:20:07.620] Batch JobId=4492331 missing from node 0 (not found BatchStartTime after startup)
[2015-08-26T01:20:07.620] job_complete: JobID=4492303_25(4492331) State=0x1 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 255
[2015-08-26T01:20:07.620] job_complete: JobID=4492303_25(4492331) State=0x1 NodeCnt=1 cancelled from interactive user or node failure
[2015-08-26T01:20:07.620] job_complete: JobID=4492303_25(4492331) State=0x8007 NodeCnt=1 done

Perhaps there is some network issue, because there are a lot of "[2015-08-26T00:29:55.697] error: slurm_receive_msg: Zero Bytes were transmitted or received" messages in the log around the time this happened. But I thought Slurm would requeue the job in case of node failure?
Comment 1 David Bigagli 2015-08-25 22:08:37 MDT
Hi,
   the job will be requeued if a node fails unless you have JobRequeue set to zero
in your slurm.conf. Is 134 among the exit codes that tell slurmctld to requeue
the job? Let me review the logs and get back to you.

David
Comment 2 Akmal Madzlan 2015-08-25 23:39:13 MDT
We have these values for RequeueExit & RequeueExitHold:

RequeueExitHold=1-199,201-255
RequeueExit=200

We don't set JobRequeue in our slurm.conf, so it should default to 1, right?

We use the same configuration in all of our offices, and when we recently had an issue with our cluster, the jobs did get requeued. So I don't think there is an issue with the configuration.

I will attach the logs and conf
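For reference, the classification those two settings should produce can be sketched as follows. This is illustrative Python, not Slurm source; the range parsing is assumed from the slurm.conf syntax quoted above, and the state names are shorthand.

```python
# Hypothetical model (not Slurm code) of how RequeueExit /
# RequeueExitHold should classify a batch job's exit code,
# using the slurm.conf values quoted above.

def parse_ranges(spec):
    """Expand a Slurm-style range list like '1-199,201-255' into a set."""
    codes = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            codes.update(range(int(lo), int(hi) + 1))
        else:
            codes.add(int(part))
    return codes

REQUEUE_EXIT = parse_ranges("200")
REQUEUE_EXIT_HOLD = parse_ranges("1-199,201-255")

def classify(exit_code):
    if exit_code in REQUEUE_EXIT_HOLD:
        return "requeued, held in SpecialExit (SE)"
    if exit_code in REQUEUE_EXIT:
        return "requeued (PD)"
    return "completed normally"

print(classify(134))  # the exit code from the job log -> SE hold
```

So exit code 134 falls in the RequeueExitHold range, which is why the job was expected to land in SE rather than vanish.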
Comment 3 Akmal Madzlan 2015-08-25 23:40:55 MDT
Created attachment 2147 [details]
slurm.conf
Comment 4 Akmal Madzlan 2015-08-25 23:42:49 MDT
Created attachment 2148 [details]
compute node log
Comment 5 Akmal Madzlan 2015-08-25 23:44:38 MDT
Created attachment 2149 [details]
slurmctld log
Comment 6 David Bigagli 2015-08-26 00:30:44 MDT
Hi Akmal,
        this is what I read from the logs:
The job was terminated by userid 601. slurmd went to process the termination
request, which went fine, but it had problems talking to the slurmctld.
This must have lasted a long time; by the next day, when slurmd on
lnod0004 registered back with slurmctld, the job was gone from lnod0004.
I assume you rebooted lnod0004. Even though sacct shows NODE_FAIL, the job
did not terminate because the node failed; it just terminated, and that's
why it was not requeued.

The default value for JobRequeue is indeed 1.

The question is why we see these messages:

2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes

this indicates the sender received EOF because the connection was closed,
so either slurmctld was exiting in the middle of the communication or some
faulty hardware caused the connection to be reset.

David
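As an aside, the condition behind that "Zero Bytes were transmitted or received" message can be reproduced in miniature: a recv() that returns zero bytes means the peer closed the connection (EOF), which is distinct from a timeout or a refused connection. This is a generic sockets illustration, not Slurm code.

```python
# Minimal illustration of a zero-byte recv(): the peer closed the
# connection, so recv() returns b"" (EOF) rather than raising an error.

import socket
import threading

server = socket.socket()
server.bind(("127.0.0.1", 0))   # pick any free port
server.listen(1)
port = server.getsockname()[1]

def accept_and_close():
    conn, _ = server.accept()
    conn.close()                # orderly close: peer's recv() returns b""

threading.Thread(target=accept_and_close).start()

client = socket.socket()
client.connect(("127.0.0.1", port))
data = client.recv(4)
print(repr(data))               # b'' -> zero bytes received, peer closed
client.close()
server.close()
```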
Comment 7 David Bigagli 2015-08-26 00:56:34 MDT
There is a parameter in slurm.conf called LogTimeFormat. If you set it to:

LogTimeFormat=thread_id

The "thread_id" format shows the timestamp in the C standard ctime() function
form, without the year but including the microseconds, the daemon's process ID,
and the current thread ID.

The log will look like this:

Aug 26 14:55:31.239351  7228 0x7fe61e689700 slurmctld: debug:  Spawning ping agent for rigatone,tonnarello
Aug 26 14:55:45.500511  7228 0x7fe61e689700 slurmctld: debug:  sched: Running job scheduler


This, in my opinion, makes debugging easier. Do you think you can enable it?

David
Comment 8 Phil Schwan 2015-08-26 01:15:49 MDT
hmm... I'm not sold yet. :)

According to the task log, the task finished running (by crashing) at Wed Aug 26 00:27:19 BST 2015.

So the first question is indeed what happened between 00:27 and 01:20?

The node remained up and on the network the entire time; our monitoring system writes to a remote syslog server every minute with details of its memory, load, network, etc.

I'm sure that our user wasn't logged in at 1:20 AM canceling jobs (and indeed, he wouldn't have then sent me a concerned email the next morning asking why this one task disappeared :).

Even if he *had* canceled it, that doesn't explain what prevented the task from going into SE promptly at 00:27. It would be quite a coincidence that the one task that someone cancels happens to be the one that gets stuck for 40 minutes after crashing in the middle of the night. Right?

> The job was terminated by userid 601.

uid 601 is our slurm user

> slurmd went to process the termination
> request which went fine but had a problems talking but the slurmctld.
> This must have last for a long time till the next day when slurmd on 
> lnod0004 registered back with slurmctld the job was gone from lnod0004.
> I assume you rebooted lnod0004.

lnod0004 wasn't rebooted -- monitoring shows it was healthy and on the network the entire time.

> The question is why we see these messages:
> 
> 2015-08-26T01:20:23+01:00 lnod0004 slurmd[14520]: debug: 
> _slurm_recv_timeout at 0 of 4, recv zero bytes
> 
> this indicates the sender received EOF because the connection was closed,
> so either slurmctld was exiting in the middle of the communication or some
> faulty hardware causes the connection to be reset.

I think Akmal omitted some important part of the log from the compute node:

> 2015-08-26T00:13:26+01:00 lnod0004 slurmd[14520]: debug:  found apparently running job 4492331
> 2015-08-26T00:27:19+01:00 lnod0004 slurmstepd[31840]: task 0 (31844) exited with exit code 134.
> 2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: error: _step_connect: connect() failed dir /var/spool/slurmd node lnod0004 job 4492331 step -2 Connection refused
> 2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: debug:  Cleaned up stray socket /var/spool/slurmd/lnod0004_4492331.4294967294
> 2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job4492331/slurm_script
> 2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  _rpc_terminate_job, uid = 601
> 2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  task_p_slurmd_release_resources: 4492331
> 2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  credential for job 4492331 revoked
(...and then continuing as he already provided)

But this still leaves some bewildering questions -- like:

- what was slurmd/slurmstepd doing on lnod0004 from 00:27 through 00:46?

- what is the Connection refused all about, and what are its side-effects?

- what is issuing the termination at 01:20?

Are we sure that it's not slurmctld itself issuing that termination, after this message:

> [2015-08-26T01:20:07.620] Batch JobId=4492331 missing from node 0 (not found BatchStartTime after startup)

?

Just to be clear: I am totally willing to believe that there was a problem with the cluster node.  The task did crash after all; maybe it was solar flares, a power surge, electromagnetic interference -- whatever!  Totally happy to accept that.

My concern is about the task dropping out of the queue afterward, rather than going into SE (if the non-zero exit code was received by slurmctld) or PD (if it believes the node failed).

I have 8 months of slurmctld logs, and these are all instances of the "missing from node 0" message:

 /var/log/slurm/slurmctld.log-20150520.gz:[2015-05-19T11:24:47.925] Batch JobId=3830138 missing from node 0 (not found BatchStartTime after startup)

 /var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.536] Batch JobId=4058826 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.536] Batch JobId=4058827 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.536] Batch JobId=4058828 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.537] Batch JobId=4058829 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.537] Batch JobId=4058830 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.537] Batch JobId=4058831 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.537] Batch JobId=4058832 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.538] Batch JobId=4058833 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.538] Batch JobId=4058834 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.538] Batch JobId=4058835 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.538] Batch JobId=4058836 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150616.gz:[2015-06-10T13:04:52.538] Batch JobId=4058837 missing from node 0 (not found BatchStartTime after startup)

 /var/log/slurm/slurmctld.log-20150724.gz:[2015-07-21T14:58:12.967] Batch JobId=4319245 missing from node 0 (not found BatchStartTime after startup)

 /var/log/slurm/slurmctld.log-20150724.gz:[2015-07-22T18:22:38.979] Batch JobId=4323435 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150724.gz:[2015-07-22T18:22:38.981] Batch JobId=4323394 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150724.gz:[2015-07-22T18:22:38.990] Batch JobId=4323397 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150724.gz:[2015-07-22T18:22:39.035] Batch JobId=4323920 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150724.gz:[2015-07-22T18:24:18.346] Batch JobId=4323444 missing from node 0 (not found BatchStartTime after startup)

 /var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:20:07.605] Batch JobId=4493264 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:20:07.620] Batch JobId=4492331 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:20:07.627] Batch JobId=4493272 missing from node 0 (not found BatchStartTime after startup)

 /var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:53:27.122] Batch JobId=4493281 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:53:27.127] Batch JobId=4493285 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:53:27.171] Batch JobId=4493288 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:53:27.172] Batch JobId=4493290 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:53:27.175] Batch JobId=4493284 missing from node 0 (not found BatchStartTime after startup)
 /var/log/slurm/slurmctld.log-20150826.gz:[2015-08-26T01:53:27.184] Batch JobId=4493286 missing from node 0 (not found BatchStartTime after startup)

I find it extremely suspicious that they tend to arrive in large clumps like that!  Does each one represent a disappearing task, I wonder?

This isn't the first time something like this has happened to us (bug 833, bug 866) although it has become much more rare.

Cheers,

-Phil
Comment 9 David Bigagli 2015-08-26 01:40:05 MDT
I am not sold either... :-) just trying to reverse engineer what I see.

This changes the scenario a little bit. What I read this time is that
slurmd found a running job; it periodically scans the list of jobs and asks
slurmstepd for status. slurmd talks to slurmstepd over a unix socket, and this
failed as you showed in your new log (file system full?).

Then at 1:20 slurmd registers itself with slurmctld, which detects the missing job and terminates it, so yes, the _rpc_terminate_job does come from the controller.

Then we have the network issue... it is not a connection refused, it is really
a broken connection: slurmd connect()s, possibly sends some bytes,
then waits for a reply, and it prints "Zero Bytes were transmitted or received",
which indicates a connection reset.

->Just to be clear: I am totally willing to believe that there was a 
->problem with the cluster node.  The task did crash after all; maybe it was
->solar flares, a power surge, electromagnetic interference -- whatever!  
->Totally happy to accept that.

I don't think so. It is more likely that all the disks in your cluster spinning
counterclockwise are slowing down the earth's rotation, causing a perturbation in time that affects the TCP sync algorithm, hence the problem.

Let me think a bit harder over the logs and see what other data we can collect
to understand this.

David
Comment 10 David Bigagli 2015-08-26 01:47:01 MDT
->My concern is about the task dropping out of the queue afterward, rather than ->going into SE (if the non-zero exit code was received by slurmctld) 
->or PD (if it believes the node failed).

I think the reason here is that somehow the job got lost and the controller
never saw its exit code, so it had no chance to requeue it. The question is
why.

David
Comment 11 Phil Schwan 2015-08-26 13:05:26 MDT
(In reply to David Bigagli from comment #7)
> 
> This in my opinion makes the debugging better. Do you think you can enable
> it?

I'll leave that to Akmal, but I can't see why it would be a problem.



(In reply to David Bigagli from comment #9)
> 
> This changes the scenario a little bit. From what I read this time is that 
> slurmd found a running jobs, it periodically scans the list of jobs and asks
> stepd for status. slurmd talks to slurmstepd over a unix socket and this 
> failed as you showed in your new log (file system full?. 

I don't think so... /var is an independent ramdisk on all of our cluster nodes, and there are 16,000,000 free inodes.

Also, that socket would already have been created when the job was launched, right?  Would it be doing anything at this point that would require it to create a new inode that could potentially fail?



(In reply to David Bigagli from comment #10)
> 
> I think the reason here is that somehow the job got lost and the controller 
> never saw its exit code, so it has no chance to requeue it. The question is
> why.

I agree!  And this has always been my concern about the fundamental design of slurm.

In my opinion, a task should only be removed from the queue under two circumstances:

- a user explicitly cancels it (which I think we now agree did not happen in this case)

or

- it receives an affirmative message from the node that the task executed and returned an exit code indicative of success (which for our configuration is only 0)

If neither of those two things happens, the default behaviour *must* be to put it back in the queue automatically.  That would be a fail-safe design.

If slurm relies on something to explicitly put the task back in the queue, then we'll forever be losing tasks when we encounter a crack in the system that the designers didn't explicitly plan for and handle.  Not fail-safe.
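The fail-safe rule described above can be sketched as a small state machine. The names here are hypothetical, chosen for illustration; this is not Slurm code, just a model of the proposed invariant.

```python
# Sketch of the fail-safe semantics argued for above: a job leaves the
# queue only on explicit user cancellation or a positively confirmed
# successful exit; every other outcome requeues it.

from enum import Enum

class Outcome(Enum):
    USER_CANCELLED = 1
    EXITED = 2          # the node reported an exit code
    LOST = 3            # no report: node failure, lost daemon, etc.

def next_state(outcome, exit_code=None, success_codes=(0,)):
    if outcome is Outcome.USER_CANCELLED:
        return "removed from queue"
    if outcome is Outcome.EXITED and exit_code in success_codes:
        return "removed from queue"
    # fail-safe default: anything unconfirmed goes back to pending
    return "requeued (PD)"

print(next_state(Outcome.LOST))  # the job-4492331 scenario -> requeued
```

Under this model, the "job got lost" scenario in this ticket falls through to the requeue branch by default, rather than requiring every failure path to be explicitly handled.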

And it's a lot, lot more work for you guys, who must make the relationship and protocol between slurmctld, slurmd, and slurmstepd absolutely air-tight in the face of every possible failure scenario -- otherwise tasks vanish.  That sounds like no fun, to me.

Am I wrong?

Cheers,

-Phil
Comment 12 Akmal Madzlan 2015-08-26 20:00:28 MDT
LogTimeFormat=thread_id added to our slurm.conf
Comment 13 David Bigagli 2015-08-26 20:56:25 MDT
Is there any core file from slurmstepd on lnod0004? You check in the log directory.

David
Comment 14 Akmal Madzlan 2015-08-26 21:39:56 MDT
nope
didnt see any
Comment 15 David Bigagli 2015-08-26 21:49:05 MDT
Hi Phil,
        the socket is created when the job is launched and used during the job's
lifetime by slurmd to send commands to slurmstepd. The socket is removed when
the job is done. Nothing else should happen.

I agree with your design given your use case. I can comment, as I am not one of
the original Slurm designers. :-) I think the problem is that we are trying to
solve different use cases with one code base. As an example, you remember the
fix we did so that if a job has an invalid dependency, Slurm does not terminate it
anymore; this is something we discussed and did a while back. Well, this fix was
not favorably received by other customers, to my surprise, and they complained
so hard that we had to create a configurable parameter called kill_invalid_depend.

Anyway I am still investigating this issue trying to come up with a good
explanation or find a bug.

David
Comment 16 David Bigagli 2015-08-27 01:02:35 MDT
Here is the story of job 4492331 as reported by the log of lnod0004.

1) Job starts 

2015-08-25T15:07:01+01:00 lnod0004 slurmd[6023]: Launching batch job 4492331 for UID 1476

2) It runs fine, and its state is periodically polled by slurmd;
this is the time of the last known contact between slurmd and slurmstepd:

2015-08-26T00:13:26+01:00 lnod0004 slurmstepd[31840]: Handling REQUEST_STATE
2015-08-26T00:13:26+01:00 lnod0004 slurmd[14520]: debug:  found apparently running job 4492331

3) then this happens:

2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: error: _step_connect: connect() failed dir /var/spool/slurmd node lnod0004 job 4492331 step -2 Connection refused
2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: debug:  Cleaned up stray socket /var/spool/slurmd/lnod0004_4492331.4294967294
2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job4492331/slurm_script

slurmstepd must have died. You reported that the user task crashed, but this should
not matter to slurmstepd, as it runs as root, although in the same session
as the user job. In this situation slurmd throws away the job.

4) Then comes the requiem and slurmctld terminates the job as well.

2015-08-26T01:20:07+01:00 lnod0004 slurmd[14520]: debug:  _rpc_terminate_job, uid = 601

Two things to understand: what happened to the slurmstepd, and whether we can create
a patch in which slurmctld requeues a job that is not reported as running
by slurmd, instead of terminating it. The first item is probably hard at this
point, but the second can surely be investigated.

Although worthy of investigation, I think the message about "Zero Bytes were transmitted or received" is probably independent of this problem.

David
Comment 17 Stuart Midgley 2015-08-27 01:32:37 MDT
Evening

I'm just catching up on this and find this a little disturbing

> 3) then this happens:
> 
> 2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: error: _step_connect:
> connect() failed dir /var/spool/slurmd node lnod0004 job 4492331 step -2
> Connection refused
> 2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: debug:  Cleaned up stray
> socket /var/spool/slurmd/lnod0004_4492331.4294967294
> 2015-08-26T00:46:47+01:00 lnod0004 slurmd[14520]: _handle_stray_script:
> Purging vestigial job script /var/spool/slurmd/job4492331/slurm_script
> 
> slurmstepd must have died. You reported the user task crashed but this should
> be not matter to slurmstepd as it runs as root although in the same session
> as the user job. In this situation slurmd throws away the job.


So, something that, as far as slurm is concerned, was caused by neither the user nor the job results in slurm just throwing the job away?  Surely this should be treated almost identically to a node crash?  The job goes back on the queue in PD state?  And once communication with the node resumes, it should be treated as if the node was just rebooted?

If a node is struggling for memory, the OOM killer might just kill slurmstepd if slurmstepd allocates a byte of memory at the moment the OOM killer is making its decision.
Comment 18 David Bigagli 2015-08-27 01:38:16 MDT
Indeed, that's item 2 of the investigation: provide a patch in which
slurmctld puts the job back to pending. I think it is ok for slurmd to clean
up the socket and the job script, as there is nothing else it can do, but
slurmctld can make a different decision than terminating the job in this
case.

David
Comment 19 Stuart Midgley 2015-08-27 01:43:55 MDT
Oh, absolutely, cleaning up is the only sensible thing to do.  I guess covering the case where both slurmd and slurmstepd get whacked is also good :)

I tend to agree with Phil though.  Having the ability to choose the "catch-all" for job failures would be good.  So, rather than just tossing the job, it would be good to have a config option to change the catch-all to put the job in SE.  That way, no matter what happens, if slurm doesn't see a known exit method, the job ends up in SE, leaving it to the user to decide what to do and maintaining job dependencies etc.

In reality, it's the job-dependency tree that really bites us.
Comment 20 David Bigagli 2015-08-27 01:48:08 MDT
Yes, we just need to enumerate all possible scenarios in which this happens,
but I am confident that after working a few more months with you we'll catch them all :-))

Do you mean a job that gets lost like this one is in the middle of a
dependency tree which then breaks consequently?

David
Comment 21 Stuart Midgley 2015-08-27 01:50:27 MDT
Yes.  Cleaning up after such an event can be quite painful for our users... 

> Do you mean a job that gets lost like this one is in the middle of a
> dependency tree which then breaks consequently?
Comment 22 Moe Jette 2015-08-28 05:40:47 MDT
Fixed in this commit:
https://github.com/SchedMD/slurm/commit/d8e6f55d4168807b279ad68c0faee9cbd8a7733c

The change will be in Slurm version 14.11.9 when released, probably on Monday.

Our best guess is that the slurmstepd process (Slurm's job shepherd) somehow aborted. There are no indications of how that might have happened, and the status of the job is very much unknown at that point. This patch will requeue the job under those conditions (i.e. a batch job with requeue enabled). It also improves the logging. The slurmctld log will now look like this under those conditions:

slurmctld: sched: Allocate JobId=20719 NodeList=tux10 #CPUs=1
slurmctld: Batch JobId=20719 missing from node 0 (not found BatchStartTime after startup), Requeuing job
slurmctld: job_complete: JobID=20719 State=0x1 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 255
slurmctld: job_complete: JobID=20719 State=0x1 NodeCnt=1 cancelled due to node failure
slurmctld: job_complete: requeue JobID=20719 State=0x8000 NodeCnt=1 due to node failure
slurmctld: job_complete: JobID=20719 State=0x8000 NodeCnt=1 done
slurmctld: Requeuing JobID=20719 State=0x0 NodeCnt=0
slurmctld: backfill: Started JobId=20719 on tux10
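The new controller behavior shown in that log can be sketched as follows. This is an illustrative Python model of the described decision, not the actual C patch; the function and field names are hypothetical.

```python
# Sketch of the post-14.11.9 behavior described above: when a
# registering slurmd does not report a batch job the controller
# believes is running on it, requeue the job (if requeue is enabled)
# instead of completing it as NODE_FAIL and dropping it.

def handle_node_registration(reported_job_ids, controller_jobs,
                             job_requeue=True):
    """Return the action to take for each job the node failed to report."""
    actions = {}
    for job in controller_jobs:
        missing = (job["state"] == "RUNNING" and job["batch"]
                   and job["id"] not in reported_job_ids)
        if missing:
            # Pre-patch: the job was completed as NODE_FAIL and lost.
            # Post-patch: it goes back to pending so no work disappears.
            actions[job["id"]] = "requeue" if job_requeue else "complete"
    return actions

jobs = [{"id": 4492331, "state": "RUNNING", "batch": True}]
print(handle_node_registration(reported_job_ids=set(),
                               controller_jobs=jobs))
```

The key design change is the default: an unreported running batch job is now treated like a node failure with requeue, matching the fail-safe behavior argued for earlier in the ticket.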
Comment 23 Stuart Midgley 2015-08-30 14:35:31 MDT
Thanks.