Following on from bug 805, the log continues:

> [2014-05-13T03:16:01.690] error: read_slurm_conf: default partition not set.
> [2014-05-13T03:16:01.704] restoring original state of nodes
> [2014-05-13T03:16:01.704] restoring original partition state
> [2014-05-13T03:16:01.747] cons_res: select_p_node_init
> [2014-05-13T03:16:01.747] cons_res: preparing for 75 partitions
> [2014-05-13T03:16:01.832] Killing job 1624824 on DOWN node clus658
> [2014-05-13T03:16:01.846] _sync_nodes_to_jobs updated state of 1 nodes
> [2014-05-13T03:16:08.700] Purging files for defunct batch job 1624824

1624824 is the job it _just_ killed -- the epilog hasn't even had a chance to run yet. The epilog does run a few seconds later, which of course dutifully puts it into SE state -- but by then its job data is gone, so it can never run. (It also purged a dozen other defunct jobs, but I don't know what to make of those.)
Also, there's this:

> [2014-05-13T03:16:08.819] Job 1624824 in completing state
> [2014-05-13T03:16:09.317] _slurm_rpc_requeue: 1624824: usec=8624
> [2014-05-13T03:16:09.991] completing job 1624824 status 15
> [2014-05-13T03:16:11.181] Job 1624824 completion process took 11124 seconds

At first I thought this value was just mislabelled milliseconds, but looking at the other places this "completion process took..." message appears, it really does seem to mean seconds. Maybe this is related to why it purged the job script? 11124 seconds is just over three hours, so perhaps it thought this was a job that had been stuck in CG for 3+ hours?
The completion time is the number of seconds between the current time and the job's "end_time" field. It really looks like something is corrupting memory. Do you have local mods to the Slurm code?
We do... how do you want them? A git diff?
(In reply to Stuart Midgley from comment #3)
> We do... how do you want them? A git diff?

That would be good.
Created attachment 847 [details] dug mods against slurm-14.03
(In reply to Stuart Midgley from comment #5)
> Created attachment 847 [details]
> dug mods against slurm-14.03

I was concerned about the possibility of memory corruption related to this patch, but I don't see anything in it that might corrupt memory. Do make sure that when you "git pull" from our GitHub repository, you pull from the "slurm-14.03" branch. We don't want you working from the development branch, "master".
Yeah -- I have a script that does a git fetch and then merges origin/slurm-14.03 into my dug_mods branch.
I still have no idea how this job got an invalid node name, but this patch will run the EpilogSlurmctld if a job is killed on slurmctld reconfiguration and there are no up nodes in its allocation (e.g. the one node in the job allocation is DOWN): https://github.com/SchedMD/slurm/commit/87128cf0cc2e9affe2efa9d3352be24ec0c7399c