| Summary: | during slurm reconfig, job files can get purged before an epilog gets a chance to run | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Phil Schwan <phils> |
| Component: | slurmctld | Assignee: | David Bigagli <david> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | da, stuartm |
| Version: | 14.03.2 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 14.03.4 |
| Attachments: | dug mods against slurm-14.03 | ||
Description
Phil Schwan
2014-05-12 23:58:15 MDT
Also, there's this:
> [2014-05-13T03:16:08.819] Job 1624824 in completing state
> [2014-05-13T03:16:09.317] _slurm_rpc_requeue: 1624824: usec=8624
> [2014-05-13T03:16:09.991] completing job 1624824 status 15
> [2014-05-13T03:16:11.181] Job 1624824 completion process took 11124 seconds
At first I thought maybe this value was just mislabelled milliseconds. But looking at the other times I see this "completion process took..." message, it looks like it really does think it should be seconds.
Maybe this is related to why it purged the job script? It thought it was a job that had been in CG for 3+ hours?
The completion time is the number of seconds between the current time and the job's "end_time" field. It really looks like something is corrupting memory. Do you have local mods to the Slurm code?

We do... how do you want them? A git diff?

(In reply to Stuart Midgley from comment #3)
> We do... how do you want them? A git diff?

That would be good.

Created attachment 847
dug mods against slurm-14.03

(In reply to Stuart Midgley from comment #5)
> Created attachment 847
> dug mods against slurm-14.03

I was concerned about the possibility of memory corruption related to this patch, but I don't see any signs of anything that might corrupt memory here. Do make sure that if you "git pull" from our github repository that you do so from the "slurm-14.03" branch. We don't want you working from the development branch, "master".

Yeh, I have a script that git fetch and then git merge origin/slurm-14.03 into my dug_mods branch.

I still have no idea how this job got an invalid node name, but this patch will run the EpilogSlurmctld if a job is killed on slurmctld reconfiguration and there are no up nodes in its allocation (e.g. the one node in the job allocation is DOWN): https://github.com/SchedMD/slurm/commit/87128cf0cc2e9affe2efa9d3352be24ec0c7399c
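
For reference, the "completion process took 11124 seconds" figure quoted in the description can be sanity-checked with a little arithmetic. The sketch below is not Slurm code: `completion_seconds` is a hypothetical helper that mirrors the description given in this thread (current time minus the job's "end_time"), and the implied end time is derived from the log line rather than read from any log. Sub-second precision in the log timestamp is ignored.

```python
from datetime import datetime, timedelta


def completion_seconds(now: datetime, end_time: datetime) -> int:
    """Hypothetical helper: whole seconds between a job's end_time and now."""
    return int((now - end_time).total_seconds())


# Timestamp of the "completion process took 11124 seconds" log line
# (fractional seconds dropped).
logged_at = datetime(2014, 5, 13, 3, 16, 11)
reported = timedelta(seconds=11124)

# end_time implied by the reported duration. This is derived by
# subtraction only; it is not a value taken from the slurmctld log.
implied_end_time = logged_at - reported

print(reported)          # 3:05:24
print(implied_end_time)  # 2014-05-13 00:10:47
```

So if the message is taken at face value, slurmctld believed the job's "end_time" was a little over three hours before the requeue, which matches the "in CG for 3+ hours" guess in the description.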