Summary: | Long time to complete jobs exhibit strange behaviour when viewed using Slurm tools/log output | |
---|---|---|---
Product: | Slurm | Reporter: | Kevin Buckley <kevin.buckley>
Component: | Other | Assignee: | Albert Gil <albert.gil>
Status: | RESOLVED TIMEDOUT | QA Contact: |
Severity: | 3 - Medium Impact | |
Priority: | --- | CC: | darran.carey
Version: | 18.08.6 | |
Hardware: | Cray XC | |
OS: | Linux | |
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6769 | |
Site: | Pawsey | Alineos Sites: | ---
Atos/Eviden Sites: | --- | Confidential Site: | ---
Coreweave sites: | --- | Cray Sites: | ---
DS9 clusters: | --- | HPCnow Sites: | ---
HPE Sites: | --- | IBM Sites: | ---
NOAA Site: | --- | NoveTech Sites: | ---
Nvidia HWinf-CS Sites: | --- | OCF Sites: | ---
Recursion Pharma Sites: | --- | SFW Sites: | ---
SNIC sites: | --- | Tzag Elita Sites: | ---
Linux Distro: | Other | Machine Name: | magnus
CLE Version: | CLE6 UP05 | Version Fixed: |
Target Release: | --- | DevPrio: | ---
Emory-Cloud Sites: | --- | |
Attachments: | Logs as requested | |
Description
Kevin Buckley 2019-10-29 02:12:21 MDT

Hi Kevin - would you please attach the full logs from that day for us to review. Both the slurmctld and the slurmd logs would be helpful in understanding what occurred.

I would be intrigued to know if you also have entries such as the following:

> _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=54 uid 1000
> job_signal: 9 of running JobId=54 successful 0x8004

On the node, at "debug", you should also see entries such as:

> _rpc_terminate_job
> Sent signal

Created attachment 12153 [details]
Logs as requested
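For reference, a minimal sketch of how such entries can be searched for, assuming the common log locations /var/log/slurm/slurmctld.log (controller) and /var/log/slurm/slurmd.log (compute node); the actual paths are whatever SlurmctldLogFile and SlurmdLogFile point to in slurm.conf:

# On the controller host: look for the kill request and signal delivery for the job in question
grep -E '_slurm_rpc_kill_job|job_signal' /var/log/slurm/slurmctld.log

# On the compute node (requires slurmd logging at "debug" or higher)
grep -E '_rpc_terminate_job|Sent signal' /var/log/slurm/slurmd.log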
Hi Kevin,

In the controller logs I can see this:

[2019-10-28T19:53:34.546] backfill: Started JobId=4172215_17(4172812) in workq on nid00170
[2019-10-28T19:54:24.472] _job_complete: JobId=4172215_17(4172812) WEXITSTATUS 0
[2019-10-28T19:54:24.474] _job_complete: JobId=4172215_17(4172812) done

But later on a lot of these:

[2019-10-28T19:54:54.557] error: select/cons_res: node nid00170 memory is under-allocated (0-60000) for JobId=4172215_17(4172812)
[2019-10-28T19:54:54.681] error: select/cons_res: node nid00170 memory is under-allocated (0-60000) for JobId=4172215_17(4172812)
[2019-10-28T19:54:54.685] error: select/cons_res: node nid00170 memory is under-allocated (0-60000) for JobId=4172215_17(4172812)

And, every 3 minutes:

[2019-10-28T19:57:06.072] Resending TERMINATE_JOB request JobId=4172215_17(4172812) Nodelist=nid00170
[2019-10-28T20:00:06.111] Resending TERMINATE_JOB request JobId=4172215_17(4172812) Nodelist=nid00170
[2019-10-28T20:03:06.873] Resending TERMINATE_JOB request JobId=4172215_17(4172812) Nodelist=nid00170
(...)
[2019-10-28T20:18:06.546] Resending TERMINATE_JOB request JobId=4172215_17(4172812) Nodelist=nid00170

And finally:

[2019-10-28T20:20:17.994] step_partial_comp: JobId=4172215_17(4172812) StepID=4294967295 invalid
[2019-10-28T20:20:25.189] cleanup_completing: JobId=4172215_17(4172812) completion process took 1561 seconds

So the problem seems to be that, once the batch script ends, Slurm tries to finish the .extern step and for some reason it cannot, so the step is finally marked as "Cancelled".

I'm not totally sure, but I think the problem may be related to those "memory is under-allocated" messages. Those error messages are fixed in bug 6769, in 19.05.3. I will try to verify whether they are really related.

I'll let you know,
Albert

On 2019/10/31 02:56, bugs@schedmd.com wrote:
> So, the problem seems that, once the batch scripts ends, Slurm tries to finish
> the .extern step and for some reason it cannot. So it finally marked as
> "Cancelled".
>
> I'm not totally sure, but I think that the problem may be related to those
> "memory is under-allocated" messages.
> That error messages are fixed in bug 6769, in 19.05.3.
> I will try to verify if they are really related.
>
> I'll let you know,
> Albert
>

I checked out the master and slurm-19.05 branches from the GitHub repo and, from what I can see, the three commits mentioned in Bug 6769 (one of the three is just comments) would apply cleanly to the 18.08.6-2 codebase we are committed to using on our production systems until the next upgrade at the start of Jan 2020, just ahead of the start of our next project accounting period.

I suppose the issue for us would be that, if the file that gets patched,

src/plugins/select/cons_res/job_test.c

is part of a library that might affect many Slurm binaries/operations, then we'd need to ensure nothing else breaks.

Of course, our TDS is currently running 19.05.03-2 as we look to test that release out, ahead of the Jan upgrade.

Maybe answering my own question here, but a closer inspection of what gets deployed suggests that any commit patch(es) to

src/plugins/select/cons_res/job_test.c

would only end up in one file,

lib64/slurm/select_cons_res.so

and so a rebuild of 18.08.06-2 and swapping into place of the one "patched" shared-object file might be all that's needed?

Hi Kevin,
> Maybe answering my own question here, but a closer
> inspection of what gets deployed suggests that any
> commit patch(es) to
>
> src/plugins/select/cons_res/job_test.c
>
> would only end up in one file
>
> lib64/slurm/select_cons_res.so
>
> and so a rebuild of 18.08.06-2 and swapping into place of
> the one "patched" shared-object file might be all that's
> needed ?
Good research, you are basically right.
But you will also need to restart slurmctld.
Although what you say is true, and although we don't recommend this kind of cherry-picking, it is also true that doing what you suggest could help us a lot to see whether your actual problem (long kill times and cancelled .extern steps) is caused by the "memory is under-allocated" error. I'm not certain of it, but it might be.
If you try that cherry-pick, rebuild Slurm and restart the controller, and the "memory is under-allocated" errors then disappear (which is good by itself) but the other problem doesn't, then we know that they are unrelated.
If you are willing to try it out, that sounds good to me.
Let me know,
Albert
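A rough sketch of the cherry-pick-and-swap approach discussed above, for illustration only: the tag name and commit hashes are placeholders (the real commits are listed in bug 6769), the install prefix and use of systemd are site assumptions, and a build from a git checkout may need autogen/autoreconf before configure:

# Apply the bug 6769 commits on top of the 18.08.6-2 source (tag and hashes are placeholders)
git checkout slurm-18-08-6-2
git cherry-pick <commit-1> <commit-2> <commit-3>

# Rebuild; only the cons_res select plugin actually changes
./configure --prefix=/opt/slurm/18.08.6-2 && make -j

# Swap in the rebuilt plugin (libtool places the built .so under .libs/)
install -m 755 src/plugins/select/cons_res/.libs/select_cons_res.so \
        /opt/slurm/18.08.6-2/lib64/slurm/select_cons_res.so

# slurmctld only loads plugins at startup, so restart the controller afterwards
systemctl restart slurmctld    # or the site's usual controller restart procedure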
> I'm not certain of it, but it might be.
Sadly, in terms of making progress on this, that's unlikely to be enough justification for us to sanction a change to our production systems.
If SchedMD's testing had indicated that the "memory is under-allocated"
messages WERE RELATED, ie were a symptom of the jobs hanging around in
the CG state, then we'd probably try swapping the "select_cons_res.so"
file, but I can't see it flying here, if SchedMD can't say for certain
that it's the fix.
In terms of impact on throughput, simply monitoring and putting "long
term CG state" nodes into a "+DRAIN" state, so as to get them offside
from the scheduler, is less of a change to the production systems.
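As a sketch of that monitor-and-drain workaround (the squeue format string and the drain reason text are only illustrative; nid00170 is the node from the logs above):

# List jobs stuck in the completing (CG) state, with elapsed time and node
squeue --states=COMPLETING -o "%.18i %.9P %.8T %.10M %R"

# Take a node holding long-term CG jobs out of scheduling
scontrol update NodeName=nid00170 State=DRAIN Reason="long-term CG jobs"

# Return it to service once it is clean again
scontrol update NodeName=nid00170 State=RESUME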
Hi Kevin,

> If SchedMD's testing had indicated that the "memory is under-allocated"
> messages WERE RELATED, ie were a symptom of the jobs hanging around in
> the CG state, then we'd probably try swapping the "select_cons_res.so"
> file, but I can't see it flying here, if SchedMD can't say for certain
> that it's the fix.

I totally agree. I didn't mean that you should do it. Yet.
Let me verify whether they are related or not.

> In terms of impact on throughput, simply monitoring and putting "long
> term CG state" nodes into a "+DRAIN" state, so as to get them offside
> from the scheduler is less of a change to the production systems.

If I understand correctly, you are noticing that the nodes that end up with jobs in this "long term CG state" tend to have not just one but several such jobs, while the rest of the nodes have none. Is that correct?

If it is, then we are probably heading in the right direction, because it suggests that this is not a job-wise or job-random problem, but a node-wise problem, like the "memory is under-allocated" error is.

I'll keep investigating and I'll let you know,
Albert

Hi Kevin,

Sorry that I didn't follow up on this ticket earlier. I'm closing it as timed out, but please don't hesitate to reopen it if you need further support.

Regards,
Albert