Ticket 6630 - array job split into semi-array jobs
Summary: array job split into semi-array jobs
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld (show other tickets)
Version: 18.08.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-03-04 11:40 MST by John Hanks
Modified: 2019-03-06 17:50 MST (History)
0 users

See Also:
Site: Stanford
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description John Hanks 2019-03-04 11:40:14 MST
A week or two (maybe three) ago we had an array job with 18000 tasks that was submitted with an incorrect cpus-per-task request and a task throttle of 20 (`%20`). Rather than restart it, we used `scontrol update` to change the resource requests, like so (I have this from bash history, but it was trial and error and I'm not sure which commands actually worked, so the following is reconstructed from memory):

I think this didn't work:
scontrol update jobid=7188203 CPUsPerTask=2 NumTasks=1 NumNodes=1-1 NumCPUS=2

and at some point I got an error message saying some aspect of this couldn't be changed for an array. So eventually tried this:

squeue -u $USER --array -t pd -h --format=%i | xargs -n 1 -i scontrol update jobid={} CPUsPerTask=2 NumTasks=1 NumNodes=1-1 NumCPUS=2

At some point in this process the jobs became individual jobs loosely linked to the original array job. For instance, I could update the task throttle:

scontrol update jobid=7188203 ArrayTaskThrottle=1000

but the original limit of 20 was still enforced. The user was in no particular rush to have these complete, so I ignored it at that point. 

Today I needed to restart slurmctld and upon restarting it spends a long time doing this:

slurmctld: debug:  first reg: starting JobId=7188203_6546(7194975) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6547(7194976) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6548(7194977) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6549(7194978) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6550(7194979) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6551(7194980) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6552(7194981) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6553(7194982) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6554(7194983) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6555(7194984) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6556(7194985) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6557(7194986) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6558(7194987) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6559(7194988) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6560(7194989) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6561(7194990) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6562(7194991) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6563(7194992) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6564(7194993) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6565(7194994) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6566(7194995) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6567(7194996) in accounting
slurmctld: debug:  first reg: starting JobId=7188203_6568(7194997) in accounting

This proceeds slowly, at 5 or 6 jobs/sec, after which it performs the same task much more quickly for the remaining jobs.

I'm reporting this to find out if this is a bug and if not, is there something I can do in the future when updating array jobs to avoid having them permanently split like this.
Comment 2 Michael Hinton 2019-03-04 18:22:58 MST
(In reply to John Hanks from comment #0)
> A week or two (maybe three) ago we had an array job with 18000 tasks that
> was submitted with an incorrect cpus-per-task request and a task throttle of
> %20, rather than restart it we used `scontrol update` to change the resource
> requests, like so (I have this from bash history, but it was trial and error
> and I'm not sure which commands actually worked so the following is
> reconstructed from memory):
> 
> I think this didn't work:
> scontrol update jobid=7188203 CPUsPerTask=2 NumTasks=1 NumNodes=1-1 NumCPUS=2
I get "scontrol: error: Job resizing not supported for job arrays"

> and at some point I got an error message saying some aspect of this couldn't
> be changed for an array. So eventually tried this:
> 
> squeue -u $USER --array -t pd -h --format=%i | xargs -n 1 -i scontrol update
> jobid={} CPUsPerTask=2 NumTasks=1 NumNodes=1-1 NumCPUS=2
In my testing, this is where the array tasks "break out" into their own non-array jobs. Since job resizing isn't supported for job arrays, this makes perfect sense.
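
A minimal way to observe that break-out, consistent with the behavior described above (a sketch; the array spec and job ID are illustrative, not from this ticket):

```shell
# Submit a small test array with a throttle of 2; suppose it gets JobId=1000
sbatch --array=0-9%2 --wrap='sleep 600'

# Resizing a single pending task is what forces it out of the array
# and into a stand-alone job record
scontrol update jobid=1000_5 NumCPUs=2

# The split-off task should now appear under its own plain job ID
# instead of being grouped with the 1000_[...] array record
squeue -r -j 1000
```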

> At some point in this process the jobs became individual jobs loosely linked
> to the original array job. For instance, I could update the task throttle:
> 
> scontrol update jobid=7188203 ArrayTaskThrottle=1000
> 
> but the original limit of 20 was still enforced. The user was in no particular
> rush to have these complete, so I ignored it at that point. 
I have not yet been able to reproduce this behavior, though. Even after the scontrol update resize and restart of slurmctld, the jobs all responded to a change in ArrayTaskThrottle as if they were still an array job (evidence of some loosely-linked behavior, as you mentioned).

Is it possible that only 20 jobs could be running at a time due to some other limit? Likely not, but just wanted to check.

> Today I needed to restart slurmctld and upon restarting it spends a long
> time doing this:
> 
> slurmctld: debug:  first reg: starting JobId=7188203_6546(7194975) in accounting
> [...]
> slurmctld: debug:  first reg: starting JobId=7188203_6568(7194997) in accounting
> 
> This proceeds slowly, 5 or 6 jobs/sec after which it does this same task
> really quickly for the remaining jobs.
I haven’t seen this behavior yet either. I'll look into it further.
 
> I'm reporting this to find out if this is a bug
I don’t think so. This splitting off into a stand-alone job under the hood seems to be how it was implemented.

> and if not, is there
> something I can do in the future when updating array jobs to avoid having
> them permanently split like this.
There may be a better method of tackling this kind of problem without resorting to job resizing. It would probably involve requeuing or resubmitting the job. Let me do some more research, and I’ll get back to you.
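
As an illustration only (untested; the job ID is the one from this ticket, and the throttle value and script name are hypothetical), one such approach could be to cancel just the still-pending tasks and resubmit them as a new array with the corrected request:

```shell
# Hold the array so no more tasks start while we work
scontrol hold 7188203

# Collect the array task IDs that are still pending (%K = ARRAY_TASK_ID)
PENDING=$(squeue -j 7188203 -t PD -h -r --format=%K | paste -sd, -)

# Cancel the pending tasks...
squeue -j 7188203 -t PD -h -r --format=%i | xargs scancel

# ...then resubmit only those task IDs with the corrected per-task
# resources, keeping the 20-task throttle (script name is hypothetical)
sbatch --array="${PENDING}%20" --cpus-per-task=2 job_script.sh
```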

Thanks,
Michael
Comment 3 John Hanks 2019-03-05 13:54:47 MST
Hi Michael,

The 20 job limit was enforced even when we were only at about 60% utilization and several hundred of these could have been running. Looking back at my bash history I can see I made many attempts to get more of these to run but nothing worked. We eventually cancelled them and resubmitted.

Now that I know the job splitting is by design, I'll just avoid doing that, at least for large arrays, and if I do need to split them I'll make sure to remove the throttle first.
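
For the record, that ordering might look like this (a sketch; the job ID is the one from this ticket, and ArrayTaskThrottle=0 meaning "no limit" is my understanding, worth verifying on your Slurm version):

```shell
# Clear the throttle while the job is still a single array record
# (0 should mean no limit; verify against your scontrol man page)
scontrol update jobid=7188203 ArrayTaskThrottle=0

# Only after that, apply any per-task updates, which split the tasks
# into stand-alone jobs
squeue -j 7188203 -t PD -h -r --format=%i | \
    xargs -n1 -I{} scontrol update jobid={} CPUsPerTask=2 NumCPUs=2
```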

Unless you want to follow up on the 'debug:  first reg: starting' stuff, this can be closed INFOGIVEN. I don't have an easy way to reproduce it now, so I won't be much help.

Thanks,

jbh
Comment 4 Michael Hinton 2019-03-06 17:50:23 MST
(In reply to John Hanks from comment #3)
> Hi Michael,
> 
> The 20 job limit was enforced even when we were only at about 60%
> utilization and several hundred of these could have been running. Looking
> back at my bash history I can see I made many attempts to get more of these
> to run but nothing worked. We eventually cancelled them and resubmitted.
Interesting. I wonder why your throttle updates were ignored while mine took effect.

> Now that I know the job splitting is by design, I'll just avoid doing that
> at least for large arrays and if I do need to split them I'll make sure the
> remove the throttle first.
That's a great idea: apply array-wide changes before the split so they actually take effect.
 
> Unless you want to follow up on the 'debug:  first reg: starting ' stuff,
> this can be closed INFOGIVEN. I don't have an easy way to reproduce that now
> so I won't be much help.
Will do. Feel free to reopen this if it turns out to be a problem for you again or if you know how to reproduce it.

Thanks!
Michael