Summary: | Long time to complete jobs exhibit strange behaviour when viewed using Slurm tools/log output | |
---|---|---|---
Product: | Slurm | Reporter: | Kevin Buckley <kevin.buckley>
Component: | Other | Assignee: | Albert Gil <albert.gil>
Status: | RESOLVED TIMEDOUT | QA Contact: |
Severity: | 3 - Medium Impact | |
Priority: | --- | CC: | darran.carey
Version: | 18.08.6 | |
Hardware: | Cray XC | |
OS: | Linux | |
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6769 | |
Site: | Pawsey | Alineos Sites: | ---
Atos/Eviden Sites: | --- | Confidential Site: | ---
Coreweave sites: | --- | Cray Sites: | ---
DS9 clusters: | --- | HPCnow Sites: | ---
HPE Sites: | --- | IBM Sites: | ---
NOAA Site: | --- | NoveTech Sites: | ---
Nvidia HWinf-CS Sites: | --- | OCF Sites: | ---
Recursion Pharma Sites: | --- | SFW Sites: | ---
SNIC sites: | --- | Tzag Elita Sites: | ---
Linux Distro: | Other | Machine Name: | magnus
CLE Version: | CLE6 UP05 | Version Fixed: |
Target Release: | --- | DevPrio: | ---
Emory-Cloud Sites: | --- | |
Attachments: | Logs as requested | |
Description
Kevin Buckley 2019-10-29 02:12:21 MDT

Hi Kevin - would you please attach the full logs from that day for us to review. Both the slurmctld and the slurmd logs would be helpful in understanding what occurred.

I would be intrigued to know if you also have entries such as the following:

> _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=54 uid 1000
> job_signal: 9 of running JobId=54 successful 0x8004

On the node, at "debug", you should also see entries such as:

> _rpc_terminate_job
> Sent signal

Created attachment 12153 [details]
Logs as requested
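For reference, a minimal sketch of how such entries can be searched for, assuming the common log locations /var/log/slurm/slurmctld.log (controller) and /var/log/slurm/slurmd.log (compute node); the actual paths are whatever SlurmctldLogFile and SlurmdLogFile point to in slurm.conf:

# On the controller host: look for the kill request and signal delivery for the job in question
grep -E '_slurm_rpc_kill_job|job_signal' /var/log/slurm/slurmctld.log

# On the compute node (requires slurmd logging at "debug" or higher)
grep -E '_rpc_terminate_job|Sent signal' /var/log/slurm/slurmd.log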
Hi Kevin,

In the controller logs I can see this:

[2019-10-28T19:53:34.546] backfill: Started JobId=4172215_17(4172812) in workq on nid00170
[2019-10-28T19:54:24.472] _job_complete: JobId=4172215_17(4172812) WEXITSTATUS 0
[2019-10-28T19:54:24.474] _job_complete: JobId=4172215_17(4172812) done

But later on a lot of these:

[2019-10-28T19:54:54.557] error: select/cons_res: node nid00170 memory is under-allocated (0-60000) for JobId=4172215_17(4172812)
[2019-10-28T19:54:54.681] error: select/cons_res: node nid00170 memory is under-allocated (0-60000) for JobId=4172215_17(4172812)
[2019-10-28T19:54:54.685] error: select/cons_res: node nid00170 memory is under-allocated (0-60000) for JobId=4172215_17(4172812)

And, every 3 minutes:

[2019-10-28T19:57:06.072] Resending TERMINATE_JOB request JobId=4172215_17(4172812) Nodelist=nid00170
[2019-10-28T20:00:06.111] Resending TERMINATE_JOB request JobId=4172215_17(4172812) Nodelist=nid00170
[2019-10-28T20:03:06.873] Resending TERMINATE_JOB request JobId=4172215_17(4172812) Nodelist=nid00170
(...)
[2019-10-28T20:18:06.546] Resending TERMINATE_JOB request JobId=4172215_17(4172812) Nodelist=nid00170

And finally:

[2019-10-28T20:20:17.994] step_partial_comp: JobId=4172215_17(4172812) StepID=4294967295 invalid
[2019-10-28T20:20:25.189] cleanup_completing: JobId=4172215_17(4172812) completion process took 1561 seconds

So the problem seems to be that, once the batch script ends, Slurm tries to finish the .extern step and for some reason it cannot, so the step is finally marked as "Cancelled".

I'm not totally sure, but I think the problem may be related to those "memory is under-allocated" messages. Those error messages are fixed in bug 6769, in 19.05.3. I will try to verify whether they are really related.

I'll let you know,
Albert

On 2019/10/31 02:56, bugs@schedmd.com wrote:
> So, the problem seems that, once the batch scripts ends, Slurm tries to finish
> the .extern step and for some reason it cannot. So it finally marked as
> "Cancelled".
>
> I'm not totally sure, but I think that the problem may be related to those
> "memory is under-allocated" messages.
> That error messages are fixed in bug 6769, in 19.05.3.
> I will try to verify if they are really related.
>
> I'll let you know,
> Albert
>

I checked out the master and slurm-19.05 branches from the GitHub repo and, from what I can see, the three commits mentioned in Bug 6769 (one of the three is just comments) would apply cleanly to the 18.08.6-2 codebase we are committed to using on our production systems until the next upgrade at the start of Jan 2020, just ahead of the start of our next project accounting period.

I suppose the issue for us would be that, if the file that gets patched,

src/plugins/select/cons_res/job_test.c

is part of a library that might affect many Slurm binaries/operations, then we'd need to ensure nothing else breaks.

Of course, our TDS is currently running 19.05.03-2 as we look to test that release out, ahead of the Jan upgrade.

Maybe answering my own question here, but a closer inspection of what gets deployed suggests that any commit patch(es) to

src/plugins/select/cons_res/job_test.c

would only end up in one file,

lib64/slurm/select_cons_res.so

and so a rebuild of 18.08.06-2 and swapping into place of the one "patched" shared-object file might be all that's needed?

Hi Kevin,
> Maybe answering my own question here, but a closer
> inspection of what gets deployed suggests that any
> commit patch(es) to
>
> src/plugins/select/cons_res/job_test.c
>
> would only end up in one file
>
> lib64/slurm/select_cons_res.so
>
> and so a rebuild of 18.08.06-2 and swapping into place of
> the one "patched" shared-object file might be all that's
> needed ?
Good research, you are basically right.
But you will also need to restart slurmctld.
Although what you say is true, and although we don't recommend this kind of cherry-picking, it is also true that doing what you suggest could help us a lot to see whether your actual problem (long kill times and cancelled .extern steps) is caused by the "memory is under-allocated" error. I'm not certain of it, but it might be.
If you try that cherry-pick, rebuild Slurm and restart the controller, and the "memory is under-allocated" errors then disappear (which is good by itself) but the other problem doesn't, then we know that they are unrelated.
If you are willing to try it out, that sounds good to me.
Let me know,
Albert
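A rough sketch of the cherry-pick-and-swap approach discussed above, for illustration only: the tag name and commit hashes are placeholders (the real commits are listed in bug 6769), the install prefix and use of systemd are site assumptions, and a build from a git checkout may need autogen/autoreconf before configure:

# Apply the bug 6769 commits on top of the 18.08.6-2 source (tag and hashes are placeholders)
git checkout slurm-18-08-6-2
git cherry-pick <commit-1> <commit-2> <commit-3>

# Rebuild; only the cons_res select plugin actually changes
./configure --prefix=/opt/slurm/18.08.6-2 && make -j

# Swap in the rebuilt plugin (libtool places the built .so under .libs/)
install -m 755 src/plugins/select/cons_res/.libs/select_cons_res.so \
        /opt/slurm/18.08.6-2/lib64/slurm/select_cons_res.so

# slurmctld only loads plugins at startup, so restart the controller afterwards
systemctl restart slurmctld    # or the site's usual controller restart procedure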
> I'm not certain of it, but it might be.
Sadly, in terms of making progress on this, that's unlikely to be enough justification for us to sanction a change to our production systems.
If SchedMD's testing had indicated that the "memory is under-allocated"
messages WERE RELATED, ie were a symptom of the jobs hanging around in
the CG state, then we'd probably try swapping the "select_cons_res.so"
file, but I can't see it flying here, if SchedMD can't say for certain
that it's the fix.
In terms of impact on throughput, simply monitoring and putting "long
term CG state" nodes into a "+DRAIN" state, so as to get them offside
from the scheduler, is less of a change to the production systems.
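As a sketch of that monitor-and-drain workaround (the squeue format string and the drain reason text are only illustrative; nid00170 is the node from the logs above):

# List jobs stuck in the completing (CG) state, with elapsed time and node
squeue --states=COMPLETING -o "%.18i %.9P %.8T %.10M %R"

# Take a node holding long-term CG jobs out of scheduling
scontrol update NodeName=nid00170 State=DRAIN Reason="long-term CG jobs"

# Return it to service once it is clean again
scontrol update NodeName=nid00170 State=RESUME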
Hi Kevin,

> If SchedMD's testing had indicated that the "memory is under-allocated"
> messages WERE RELATED, ie were a symptom of the jobs hanging around in
> the CG state, then we'd probably try swapping the "select_cons_res.so"
> file, but I can't see it flying here, if SchedMD can't say for certain
> that it's the fix.

I totally agree. I didn't mean that you should do it. Yet.
Let me verify whether they are related or not.

> In terms of impact on throughput, simply monitoring and putting "long
> term CG state" nodes into a "+DRAIN" state, so as to get them offside
> from the scheduler is less of a change to the production systems.

If I understand correctly, you are noticing that the nodes that end up with jobs in this "long term CG state" tend to have not just one but several such jobs, while the rest of the nodes have none. Is that correct?

If it is, then we are probably heading in the right direction, because it suggests that this is not a job-wise or job-random problem, but a node-wise problem, like the "memory is under-allocated" error is.

I'll keep investigating and I'll let you know,
Albert

Hi Kevin,

Sorry that I didn't follow up on this ticket earlier. I'm closing it as timed out, but please don't hesitate to reopen it if you need further support.

Regards,
Albert