Ticket 840

Summary:	slurmd crashes in close proximity to an (unrelated, surely?) lustre message
Product:	Slurm	Reporter:	Phil Schwan <phils>
Component:	slurmd	Assignee:	David Bigagli <david>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	da, stuartm
Version:	14.03.2
Hardware:	Linux
OS:	Linux
Site:	DownUnder GeoSolutions	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---

Description Phil Schwan 2014-05-27 23:21:38 MDT

This was running your version ded9c22dd2 plus our standard modifications.

Trying to understand bug 833 has led me to this discovery.

One specific failure that occurred this morning:

> [2014-05-28T09:50:36.141] sched: Allocate JobId=2377353 NodeList=clus344 #CPUs=24
> ...
> [2014-05-28T09:54:49.018] Batch JobId=2377353 missing from node 0
> [2014-05-28T09:54:49.018] completing job 2377353 status -2

(at which point bug 833 led to the job vanishing, but that's another matter)

The user was especially perplexed, because the job didn't even create a logfile, which is the very first thing any job submitted with our normal scripts will do.  This suggests that the job never actually started.

So now the question is: what happened between 9:50 and 9:54?

If we look at the syslog of the node itself:

> 2014-05-28T09:50:37+08:00 clus344 slurmd[2176]: debug:  task_p_slurmd_batch_request: 2377353
> 2014-05-28T09:50:37+08:00 clus344 slurmd[2176]: debug:  Calling /d/sw/slurm/20140522/sbin/slurmstepd spank prolog
> 2014-05-28T09:50:37+08:00 clus344 spank-prolog[29341]: Reading slurm.conf file: /d/sw/slurm/20140522/etc/slurm.conf
> 2014-05-28T09:50:37+08:00 clus344 spank-prolog[29341]: Running spank/prolog for jobid [2377353] uid [1007]
> 2014-05-28T09:50:37+08:00 clus344 spank-prolog[29341]: spank: opening plugin stack /d/sw/slurm/20140522/etc/plugstack.conf
> 2014-05-28T09:50:37+08:00 clus344 slurmd[2176]: debug:  [job 2377353] attempting to run prolog [/d/sw/slurm/etc/slurm_prolog.sh]
> 2014-05-28T09:50:37+08:00 clus344 kernel: type=1701 audit(1401241837.120:14759): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=kernel pid=29339 comm="slurmd" sig=7
> 2014-05-28T09:50:37+08:00 clus344 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.0.247@tcp. The mds_close operation failed with -116
> 2014-05-28T09:50:37+08:00 clus344 kernel: LustreError: 2289:0:(file.c:116:ll_close_inode_openhandle()) inode 6106 mdc close failed: rc = -116
> 2014-05-28T09:51:08+08:00 clus344 slurmstepd[1584]: error: Unable to destroy container 1584
> 2014-05-28T09:53:16+08:00 clus344 slurmstepd[1584]: error: Unable to destroy container 1584
> 2014-05-28T09:54:47+08:00 clus344 monitoring: slurmd not running -- attempting restart.

That type=1701 audit message is slurmd crashing with signal 7 (SIGBUS)

That Lustre error appears to be a file close() returning an error (-ESTALE).  I can no longer recall off the top of my head in which situation that might occur.

Then at 09:54 our monitoring script detects the lack of slurmd and restarts it, at which point everything discovers that the job is missing.

> [2014-05-28T09:54:49.018] Batch JobId=2377353 missing from node 0
> [2014-05-28T09:54:49.018] completing job 2377353 status -2

Interestingly, we've had 5 slurmd crashes (on 5 different nodes) today, and they're _all_ accompanied in close proximity by that mdc close error.

But that Lustre filesystem isn't the root filesystem, and it's not /var/spool/spurmd -- it's a user data filesystem.  slurmd gets pushed the job script from the controller node, so it shouldn't be touching this filesystem at all, ever?  Right?

I can't imagine how this can be related, but I also don't believe in coincidence.

...

And for what it's worth, I don't think this slurmd crash is the only scenario in which we're potentially losing jobs.

In this 3-day period, we got this message more than a hundred times a day!

# zcat slurmctld.log-20140528.gz | grep "missing from node 0" | wc -l
343

Should I be concerned about this?  Is there any situation in which "Batch JobID=x missing from node 0" is "normal" or at least expected?

Comment 1 David Bigagli 2014-05-28 10:54:44 MDT

Hi,
   if you have a slurmd core dump we would like to see the stack(s) to inspect them and compare them. Do you still have the core files?

Batch missing from node can happen in the following scenario:

1) batch job starting on execution host, slurmctld knows there is 1 job
   running on the host
2) the slurmd dies and it is not able to launch the job
3) when slurmd comes up it checks the jobs running on the host, but it does not
   find any because there is no entry in the SlurmdSpoolDir so it reports
   no jobs on the node
4) slurmctld purges the zombie job

the question is did you have 343 such failures in one day? 
This does sound quite strange... Basically in any situation in which there
is a job count mismatch between what reported by slurmd and what is in the controller memory will lead to zombie jobs that will get purged.

I agree with you that failure in Lustre should not affect slurmd if the spool
and the binary are not in Lustre. I assume that /d/sw is a local file system.
One thing that comes to mind is that some file access operation call internally
getcwd() which in turn traverses the full path eventually doing stat() on
the Lustre file system.

David

Comment 2 Phil Schwan 2014-05-28 11:57:47 MDT

(In reply to David Bigagli from comment #1)
>    if you have a slurmd core dump we would like to see the stack(s) to
> inspect them and compare them. Do you still have the core files?

Hmm.  I don't think so.  Its CWD is /var/spool/slurmd, in which I don't see any cores.

I suspect its ulimit for cores is probably 0.  I'll try to get that changed, going forward.

> the question is did you have 343 such failures in one day? 

I don't think we had 343 crashes; I only found evidence of a couple dozen crashes in those 3 days.  The rest of the "missing from node" messages are I think for some other reason.

> I agree with you that failure in Lustre should not affect slurmd if the spool
> and the binary are not in Lustre. I assume that /d/sw is a local file system.

Unfortunately it looks like I told a fib.  Having re-examined this with fresh eyes, I realise that I was tricked by a confusing hostname.

That close() error *is* on the filesystem where the slurm spool is stored.  I apologise for the confusion.

/d/sw/ is a separate NFS filesystem, on which I can see no sign of trouble.

> One thing that comes to mind is that some file access operation call
> internally
> getcwd() which in turn traverses the full path eventually doing stat() on
> the Lustre file system.

I can't find the inodes referenced in that close() error -- specifically, it's not the inode of any file that remains in /var/spool/slurmd, or in the path leading to it --  and there are lots of different inodes when that message recurs.

So I'm not quite sure what to make of it.  But this close() is definitely happening on the filesystem with /var/spool/slurmd; sorry.

Comment 3 David Bigagli 2014-06-03 09:10:47 MDT

Hi,
   did slurmd crashed again and were you able to get the stack?

David

Comment 4 Phil Schwan 2014-06-04 12:31:19 MDT

I haven't wanted to go messing too much with the slurm environment while Stu was away, but he's back today, so we should be able to get that changed.

Comment 5 Phil Schwan 2014-06-05 11:36:51 MDT

We've got the environment changed now, so cores should begin to accumulate.  In typical fashion, there haven't been any crashes since then.

Comment 6 David Bigagli 2014-06-11 09:47:56 MDT

Close for now as it cannot be reproduced. We can reopen it if needed.


David