Ticket 806

Summary: During Slurm reconfig, job files can be purged before the epilog gets a chance to run
Product: Slurm Reporter: Phil Schwan <phils>
Component: slurmctld Assignee: David Bigagli <david>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: da, stuartm
Version: 14.03.2   
Hardware: Linux   
OS: Linux   
Site: DownUnder GeoSolutions
Version Fixed: 14.03.4
Attachments: dug mods against slurm-14.03

Description Phil Schwan 2014-05-12 23:58:15 MDT
Following on from bug 805, the log continues:

> [2014-05-13T03:16:01.690] error: read_slurm_conf: default partition not set.
> [2014-05-13T03:16:01.704] restoring original state of nodes
> [2014-05-13T03:16:01.704] restoring original partition state
> [2014-05-13T03:16:01.747] cons_res: select_p_node_init
> [2014-05-13T03:16:01.747] cons_res: preparing for 75 partitions
> [2014-05-13T03:16:01.832] Killing job 1624824 on DOWN node clus658
> [2014-05-13T03:16:01.846] _sync_nodes_to_jobs updated state of 1 nodes
> [2014-05-13T03:16:08.700] Purging files for defunct batch job 1624824

1624824 is the one that it _just_ killed -- the epilog hasn't even had a chance to run yet.

The epilog does run a few seconds later and dutifully requeues the job into the SE (special exit) state -- but its job files are already gone, so it can never run again.

(It also purged a dozen other defunct jobs, but I don't know what to make of those.)
Comment 1 Phil Schwan 2014-05-13 00:03:24 MDT
Also, there's this:

> [2014-05-13T03:16:08.819] Job 1624824 in completing state
> [2014-05-13T03:16:09.317] _slurm_rpc_requeue: 1624824: usec=8624
> [2014-05-13T03:16:09.991] completing job 1624824 status 15
> [2014-05-13T03:16:11.181] Job 1624824 completion process took 11124 seconds

At first I thought this value was just mislabelled milliseconds, but looking at the other places this "completion process took..." message appears, it really does mean seconds.

Maybe this is related to why it purged the job script? It thought the job had been in CG for 3+ hours?
Comment 2 Moe Jette 2014-05-13 09:20:03 MDT
The completion time is the number of seconds between the current time and the job's "end_time" field. It really looks like something is corrupting memory. Do you have local mods to the Slurm code?
Comment 3 Stuart Midgley 2014-05-13 09:37:00 MDT
We do... how do you want them?  A git diff?
Comment 4 Moe Jette 2014-05-13 09:42:35 MDT
(In reply to Stuart Midgley from comment #3)
> We do... how do you want them?  A git diff?

That would be good.
Comment 5 Stuart Midgley 2014-05-13 10:00:35 MDT
Created attachment 847 [details]
dug mods against slurm-14.03
Comment 6 Moe Jette 2014-05-14 03:59:22 MDT
(In reply to Stuart Midgley from comment #5)
> Created attachment 847 [details]
> dug mods against slurm-14.03

I was concerned about the possibility of memory corruption related to this patch, but I don't see any signs of anything that might corrupt memory here.

Do make sure that if you "git pull" from our github repository that you do so from the "slurm-14.03" branch. We don't want you working from the development branch, "master".
Comment 7 Stuart Midgley 2014-05-14 04:04:22 MDT
Yeh, I have a script that runs git fetch and then git merges origin/slurm-14.03 into my dug_mods branch.
Comment 8 Moe Jette 2014-05-14 05:42:38 MDT
I still have no idea how this job got an invalid node name, but this patch will run the EpilogSlurmctld if a job is killed on slurmctld reconfiguration and there are no up nodes in its allocation (e.g. the one node in the job allocation is DOWN):

https://github.com/SchedMD/slurm/commit/87128cf0cc2e9affe2efa9d3352be24ec0c7399c