Ticket 5957 - HetJob components each count as a separate job for limits, instead of as a single job
Summary: HetJob components each count as a separate job for limits, instead of as a single job
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 18.08.1
Hardware: Linux
Severity: 5 - Enhancement
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2018-10-31 10:20 MDT by Anthony DelSorbo
Modified: 2019-07-15 19:23 MDT
CC List: 2 users

See Also:
Site: NOAA
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: NESCC
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name: Selene
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Anthony DelSorbo 2018-10-31 10:20:02 MDT
We have run into a situation where no jobs are running.  Below are the outputs from the squeue, sinfo, and sacctmgr show qos commands.  There are 5 jobs pending and stuck with the reason MaxJobsPerAccount, and that limit is indeed set on the chosen QOS.  It's not clear why the jobs aren't running.  However, there was a change in configuration today to address bug 5946.  I implemented the change in real time and restarted the slurmd and slurmctld daemons.  The change was to the NodeName CPU count, which was changed from CPUs=48 to CPUs=24.  It's entirely possible that this may have affected these jobs.
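A sketch of what that NodeName change looks like in slurm.conf (the node range is taken from the sinfo output below; all other NodeName parameters are omitted):

# Before the change:
NodeName=s[0001-0021] CPUs=48
# After the change:
NodeName=s[0001-0021] CPUs=24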

But before I ask the users to cancel their jobs, it's not clear why the "MaxJobsPerAccount" jobs are stuck (the "show job" output is shown below).  There should at least be one of these jobs in the run state.

squeue -l
Wed Oct 31 15:16:41 2018
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
           26586+1    selene    test1 Christop  PENDING       0:00      2:00      2 (None)
           26588+1    selene    test1 Christop  PENDING       0:00      2:00      2 (None)
           26590+1    selene    test1 Christop  PENDING       0:00      2:00      2 (None)
           26592+1    selene    test1 Christop  PENDING       0:00      2:00      2 (None)
           26594+1    selene    test1 Christop  PENDING       0:00      2:00      2 (None)
           26586+0    selene    test1 Christop  PENDING       0:00      2:00      1 (MaxJobsPerAccount)
           26588+0    selene    test1 Christop  PENDING       0:00      2:00      1 (MaxJobsPerAccount)
           26590+0    selene    test1 Christop  PENDING       0:00      2:00      1 (MaxJobsPerAccount)
           26592+0    selene    test1 Christop  PENDING       0:00      2:00      1 (MaxJobsPerAccount)
           26594+0    selene    test1 Christop  PENDING       0:00      2:00      1 (MaxJobsPerAccount)
             27635    selene cpnmmb_2 Jili.Don  PENDING       0:00     29:00     20 (Priority)


sinfo -l
Wed Oct 31 15:19:30 2018
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
selene*      up    8:00:00 1-infinite   no EXCLUSIV        all     21        idle s[0001-0021]
batch        up    8:00:00 1-infinite   no EXCLUSIV        all     21        idle s[0001-0021]
shared       up    8:00:00          1   no       NO        all     21        idle s[0001-0021]
service      up 1-00:00:00          1   no  FORCE:4        all      1        idle sfe01
bigmem       up    8:00:00 1-infinite   no EXCLUSIV        all      1        idle sb01
fge          up    8:00:00 1-infinite   no EXCLUSIV        all      2        idle sg[001-002]
admin        up 1-00:00:00 1-infinite   no EXCLUSIV        all     25        idle s[0001-0021],sb01,sfe01,sg[001-002]


sacctmgr show qos format=Name,Priority,Flags,UsageThres,UsageFactor,GrpTRES,GrpTRESMins,MaxTRES,MaxTRESPerNode,MaxJobsPA,MaxTRESPA,MaxSubmitJobsPA,MaxJobsPU
      Name   Priority                Flags UsageThres UsageFactor       GrpTRES   GrpTRESMins       MaxTRES MaxTRESPerNode MaxJobsPA     MaxTRESPA MaxSubmitPA MaxJobsPU 
---------- ---------- -------------------- ---------- ----------- ------------- ------------- ------------- -------------- --------- ------------- ----------- --------- 
    normal          0                                    1.000000                                                                                                        
  windfall          1                                    1.000000                                                                                                        
     batch         20                                    1.000000                                  cpu=4104                                                              
     debug         30                                    1.000000                                  cpu=4104                                                            2 
    urgent         40                                    1.500000                                  cpu=4104                        1                                     
     admin         90                                    1.000000                                                                                                        
     novel         50                                    1.000000                                                                                                        
maximum-q+        100                                    1.000000                                                                                                        


[slurm@bqs5 etc]$ scontrol show job 26586
JobId=26586 PackJobId=26586 PackJobOffset=0 JobName=test1
   PackJobIdSet=26586-26587
   UserId=Christopher.W.Harrop(3441) GroupId=nesccmgmt(18001) MCS_label=N/A
   Priority=401816474 Nice=0 Account=nesccmgmt QOS=urgent
   JobState=PENDING Reason=MaxJobsPerAccount Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2018-10-30T22:54:00 EligibleTime=2018-10-30T22:54:00
   AccrueTime=2018-10-30T22:54:00
   StartTime=2018-10-31T16:12:19 EndTime=2018-10-31T16:14:19 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-10-31T16:12:13
   Partition=selene AllocNode:Sid=sfe01:128306
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=s0003
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=512M,node=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=512M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/Christopher.W.Harrop/opt/rocoto/develop/test
   StdErr=/home/Christopher.W.Harrop/opt/rocoto/develop/test/log/test/test_1.join
   StdIn=/dev/null
   StdOut=/home/Christopher.W.Harrop/opt/rocoto/develop/test/log/test/test_1.join
   Power=

JobId=26587 PackJobId=26586 PackJobOffset=1 JobName=test1
   PackJobIdSet=26586-26587
   UserId=Christopher.W.Harrop(3441) GroupId=nesccmgmt(18001) MCS_label=N/A
   Priority=401816613 Nice=0 Account=nesccmgmt QOS=urgent
   JobState=PENDING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2018-10-30T22:54:00 EligibleTime=2018-10-30T22:54:00
   AccrueTime=2018-10-30T22:54:00
   StartTime=2018-10-31T16:12:19 EndTime=2018-10-31T16:14:19 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-10-31T16:12:13
   Partition=selene AllocNode:Sid=sfe01:128306
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=s[0014-0015]
   NumNodes=2-2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=9600M,node=2,billing=4
   Socks/Node=* NtasksPerN:B:S:C=4:0:*:1 CoreSpec=*
   MinCPUsNode=4 MinMemoryCPU=1200M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/Christopher.W.Harrop/opt/rocoto/develop/test
   StdErr=/home/Christopher.W.Harrop/opt/rocoto/develop/test/log/test/test_1.join
   StdIn=/dev/null
   StdOut=/home/Christopher.W.Harrop/opt/rocoto/develop/test/log/test/test_1.join
   Power=
Comment 1 Anthony DelSorbo 2018-10-31 11:06:12 MDT
Let me clarify that the pending "MaxJobsPerAccount" jobs were submitted yesterday, long before the configuration change was made today.  Further, I now have other jobs, submitted after the configuration change, that are pending due to priority:

             27807    selene slurm-mp Raghu.Re PD       0:00      1 (Priority)
             27639    selene RAPP_gsi Eric.Jam PD       0:00      6 (Priority)
           27691+1    selene     tcsh Leslie.B PD       0:00      2 (Priority)
           27691+0    selene     tcsh Leslie.B PD       0:00      2 (Priority)

The 27807 job is a single-node job and should be permitted to run - especially since backfill is enabled.

scontrol show job 27807
JobId=27807 JobName=slurm-mpirun
   UserId=Raghu.Reddy(537) GroupId=nesccmgmt(18001) MCS_label=N/A
   Priority=300019255 Nice=0 Account=nesccmgmt QOS=debug
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2018-10-31T16:44:33 EligibleTime=2018-10-31T16:44:33
   AccrueTime=2018-10-31T16:44:33
   StartTime=2018-10-31T16:56:00 EndTime=2018-10-31T17:26:00 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-10-31T16:54:44
   Partition=selene AllocNode:Sid=sfe01:144462
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=s0021
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1200M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1200M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/Slurm/bugs/omp-scalability/gv.job
   WorkDir=/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/Slurm/bugs/omp-scalability
   StdErr=/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/Slurm/bugs/omp-scalability/%x.o27807
   StdIn=/dev/null
   StdOut=/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/Slurm/bugs/omp-scalability/%x.o27807
   Power=
Comment 2 Jason Booth 2018-10-31 11:41:02 MDT
Tony, 

  The change from bug 5946 seems to only affect jobs submitted prior to the modification. I think this change would have been better to make with no running/queued jobs in the system. Are your users able to resubmit?

-Jason
Comment 3 Jason Booth 2018-10-31 11:43:08 MDT
Hi Tony,

 Would you please also upload your slurmctld.log?

-Jason
Comment 4 Anthony DelSorbo 2018-10-31 14:06:43 MDT
How can I remotely send a file to this case?  Our current process is convoluted: I have to scp the file to my production system, connect to it from my Windows machine with WinSCP, transfer the file to my local machine, and then send that to you via my browser.  It would be great if there were a tool to scp the file directly to the case.

I'll send you something in a few minutes.  Going through the process ....
Comment 5 Anthony DelSorbo 2018-10-31 14:17:22 MDT
Created attachment 8141
bqs5 slurmctld.log file
Comment 6 Jason Booth 2018-10-31 14:31:14 MDT
Thanks for the logs Tony. I will let you know what I find.

-Jason
Comment 7 Jason Booth 2018-10-31 16:01:41 MDT
Hi Tony,

 Job 27807 did eventually run, so it does look like jobs are going through the system.

 Did you make any changes after the upload, and was this the first job to run in a while? Have more started from the "urgent" QOS? Would you try updating the MaxSubmitPA to 1, waiting a second, and then updating it back to 2? Also, can you restart slurmctld again to see if the issue clears up with regard to the MaxSubmitPA?
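A sketch of that sequence, assuming the "urgent" QOS is the one being adjusted and that slurmctld is managed by systemd:

sacctmgr -i modify qos urgent set MaxSubmitJobsPerAccount=1
sleep 1
sacctmgr -i modify qos urgent set MaxSubmitJobsPerAccount=2
# and then, separately:
systemctl restart slurmctld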


[2018-10-31T18:26:43.538] debug3: sched: JobId=27807 initiated
[2018-10-31T18:26:43.538] sched: Allocate JobId=27807 NodeList=s0021 #CPUs=24 Partition=selene
[2018-10-31T18:26:43.539] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/hash.7/job.27807/script` as Buf
[2018-10-31T18:26:43.541] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:26:43.604] debug2: Processing RPC: REQUEST_COMPLETE_PROLOG from JobId=27807
[2018-10-31T18:26:43.604] debug2: _slurm_rpc_complete_prolog JobId=27807 usec=21
[2018-10-31T18:26:45.327] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=27807
[2018-10-31T18:26:45.327] _job_complete: JobId=27807 WEXITSTATUS 0
[2018-10-31T18:26:45.327] debug3: select/cons_res: _rm_job_from_res: JobId=27807 action 0
[2018-10-31T18:26:45.327] debug3: select/cons_res: removed JobId=27807 from part selene row 0
[2018-10-31T18:26:45.328] _job_complete: JobId=27807 done
[2018-10-31T18:26:45.332] debug2: _slurm_rpc_complete_batch_script JobId=27807 usec=1738
[2018-10-31T18:26:45.369] debug2: epilog_slurmctld JobId=27807 epilog completed
[2018-10-31T18:26:45.464] debug2: _slurm_rpc_epilog_complete: JobId=27807 Node=s0021 
[2018-10-31T18:26:48.000] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:27:07.552] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:27:12.000] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:27:49.671] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:27:58.149] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:28:06.899] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:32:43.955] debug2: _purge_files_thread: purging files from JobId=27807
Comment 9 Anthony DelSorbo 2018-10-31 18:57:22 MDT
Jason, 

I honestly don't know why the delay is occurring between jobs.  Even though that job finally ran, there was a very long delay (SubmitTime 2018-10-31T16:44:33, StartTime 2018-10-31T18:26:43).  So I'll need to know why that occurred.  What would be the best way to determine what is really causing a job's delay?

However for the MaxJobsPerAccount jobs, it appears that changing the MaxJobsPerAccount value to 2 fixed the issue.  It was actually my colleague, Raghu Reddy, who made me see the light.  Sometimes you can't see the forest for the trees.  Turns out, for heterogeneous jobs (and likely under other circumstances), as in this job:

           27818+1    selene    test1 Christop  R       0:22      2 s[0014-0015]
           27818+0    selene    test1 Christop  R       0:22      1 s0016

there are actually 2 jobs submitted to the urgent QOS.  Once I changed the MaxJobsPerAccount value to 2, that particular job was able to run.
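That change amounts to something along these lines (the exact command is assumed; the QOS name comes from the scontrol output above):

sacctmgr -i modify qos urgent set MaxJobsPerAccount=2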

So, this is a problem for us.  We need to be able to limit users to 1 "job set" per account for that particular QOS.  So, how do we accomplish that?

I'll keep digging to see why the other jobs were delayed.

Thanks,

Tony.
Comment 10 Jason Booth 2018-11-01 10:11:38 MDT
Hi Tony,

> So, this is a problem for us.  We need to be able to limit users to 1 "job set" per account for that particular QOS.  So, how do we accomplish that?

 This is currently a restriction in how Slurm treats heterogeneous jobs and MaxSubmitPA, so you will need a MaxSubmitPA of two if you wish to have these types of jobs run. We do not have a configuration option that maps MaxSubmitPA to a single heterogeneous job. If this becomes an issue then you may need an additional account that heterogeneous jobs can submit to.

 In regard to the slow-starting jobs: once the MaxSubmitPA was changed to 2, did you notice a change in scheduling?
 

-Jason
Comment 11 Jason Booth 2018-11-01 10:17:09 MDT
Tony

One other note: I see that you marked 18.08.1 on this ticket, but if you are actually on 18.08.2 then I would highly suggest that you upgrade to 18.08.3.

18.08.3 includes a fix for a regression introduced in 18.08.2 and 17.11.11 that could lead to a loss of accounting records if the slurmdbd was offline. All sites running 18.08.2 or 17.11.11 slurmctld processes are encouraged to upgrade them ASAP.

-Jason
Comment 12 Anthony DelSorbo 2018-11-01 12:55:48 MDT
Jason,

Thanks for the feedback.  This is an unscalable problem.  A heterogeneous job will be treated as "N" jobs, where "N" is the number of colon-separated components.  So this affects more than just the urgent QOS; it also affects the debug QOS, which is meant to limit users to 2 jobs per account.

We need a solution as quickly as possible that treats a heterogeneous job as a single unit in terms of configured limits, etc., or our policies will not be enforceable.

I noted that there's a "PackJobIdSet=27818-27819" setting.  Is there a means of coming up with a workaround such that, if that field is present, the limit can be bypassed for the additional components?  Or is there a better way?

Still trying to understand the problem of the slow jobs.

Tony.
Comment 13 Jason Booth 2018-11-01 15:36:17 MDT
Hi Tony,

 I have looked for options here; however, the functionality does not currently exist in Slurm. I would suggest that you reach out to Jess Arrington and speak with him about obtaining development time to expand the accounting system for this use case. I am dropping the severity for this issue down to sev 3 since we know the root cause of why the heterogeneous jobs would not start.

-Jason
Comment 15 Anthony DelSorbo 2018-11-02 07:38:19 MDT
Jason,

Thanks for the feedback.  

To be clear, we're not asking for a new feature.  It's simply a bug that needs to be corrected.  The fact is, from both a user's perspective and an organizational perspective, it's a single submission.  This is a single submission:

sbatch -q batch --ntasks=1 : --ntasks=24 --nodes=2 --export=launcher,nranks,ALL hello-mpirun-2-part.job

But because you have designed the system to make it two jobs, the implementation breaks the intent of MaxJobsPA, MaxJobsPU, MaxSubmitPA, and MaxSubmitPU.
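For illustration, the same submission written as an 18.08 batch script looks roughly like this (a sketch; the script body and program name are hypothetical):

#!/bin/bash
#SBATCH -q batch --ntasks=1
#SBATCH packjob
#SBATCH --ntasks=24 --nodes=2
# One logical submission, but Slurm creates two job records (reported by
# squeue as <jobid>+0 and <jobid>+1), each of which is counted against the
# MaxJobsPA/MaxSubmitPA limits.
srun --pack-group=0,1 ./hello-mpirun-2-part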

I hope you can see our perspective.

Thanks,

Tony.
Comment 17 Jason Booth 2018-11-02 11:35:04 MDT
Hi Tony,

 We have discussed this internally to confirm that this is not a bug. Heterogeneous allocations are treated as separate job allocations internally in Slurm, so limits such as MaxJobsPA will apply to each allocation and not to the heterogeneous job as a whole.

 It would be best to talk with Jess about the changes you are asking for, since they are non-trivial and would most likely take a few weeks to implement correctly.

 As a side question: what is your site's definition of a job allocation within these heterogeneous jobs? Perhaps GrpTRES=cpu=## would be a better approach.
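For example, something like this (a sketch; the CPU count is arbitrary):

# Cap the total CPUs in use at any one time by jobs running under this QOS:
sacctmgr -i modify qos urgent set GrpTRES=cpu=96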
Comment 18 Anthony DelSorbo 2018-11-02 13:42:45 MDT
Jason,

I sent Jess a message requesting a meeting.

Meanwhile, to us, a heterogeneous job is just that - a singular job.  Not plural.  Now because you treat the job as multiple allocations - which, by the way, is causing quite a stir around here - it is breaking the limit infrastructure.  To a user it is one job.  

So, this is one allocation of resources to us.  I'm asking for an allocation of resources from slurm with a certain layout for a certain amount of time.  It is up to me, the user, to determine how to go about using those resources.  With a subsequent srun you force additional jobs.  This is unnecessary (and wasteful of job IDs) in my view.  But I'm willing to listen to try to understand why that's necessary.

With the current batch system, we simply use the nodefile generated by the scheduler and mpirun -np ... -np ... etc.  to accomplish the task.  It all remains encapsulated within that one job.
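Roughly this pattern (a sketch; the executable names and rank counts are made up, and $PBS_NODEFILE is the Torque-generated nodefile):

mpirun -machinefile $PBS_NODEFILE -np 6 ./atmosphere.exe : -np 24 ./ocean.exe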

On your last comment, I fail to see how GrpTRES=cpu=## would help.  The limit must apply to the job set as a whole - which could be an allocation of 1 CPU or 1000 CPUs.

Were you able to look into how we can remotely send a file to a case?  Would you rather start a different ticket for this issue?

Thanks,

Tony.
Comment 19 Jason Booth 2018-11-02 14:43:59 MDT
Hi Tony,

 We are not against what you are proposing, and we understand the logic here. What we are saying is that the current way heterogeneous jobs are implemented with limits will not support what you are trying to do with MaxJobsPA, and it would need development time to support this.

> So, this is one allocation of resources to us.  I'm asking for an allocation of resources from slurm with a certain layout for a certain amount of time.  It is up to me, the user, to determine how to go about using those resources.  With a subsequent srun you force additional jobs.  This is unnecessary (and wasteful in job IDs) in my view.  But, I'm will to listen to try to understand why that's necessary.

This was done by design and, at the time, was the solution developed by the site and the Slurm developers. If you would like development hours to change the current design, then Jess would be the person to talk to.

> Were you able to look into how we can remotely send a file to a case?  Would you rather start a different ticket for this issue?

Currently, we do not offer another way to send in files. If you have a file server on your end that can expose these files, then we are open to pulling a file down from it.
Comment 21 Tim Wickberg 2018-12-05 15:57:17 MST
Just to summarize, and update the ticket:

The fundamental problem here is that the separate components of a HetJob are treated internally as separate jobs for the purpose of limit calculation.

I understand that this leads to a lot of confusion, and agree that the semantics here aren't ideal from a user standpoint.

I will be looking further into fixing this in a future release, but the changes required to support treating the HetJob components as a single job will require further study, and would be invasive enough that we cannot consider including them on a stable branch.
Comment 23 Tim Wickberg 2019-01-14 16:12:04 MST
Unfortunately, on further review I do not believe we can change this without creating significant issues elsewhere, and limiting the overall functionality of HetJobs.

I understand this is not ideal for sites used to thinking of these as "one job"; but with the implementation in Slurm they truly are separate pieces, and cannot be tracked as one larger job in a number of different limits.

As background, limits in Slurm are built in two main areas:

- Associations, which can be a limit on {cluster, account}, {cluster, account, user}, or {cluster, account, user, partition}, and which - through the Grp limits - allow the limits within a hierarchy to aggregate usage by multiple sub-accounts.

- QOS, which a job can select separately from its association, and which can also be used as a Partition QOS.
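Both kinds of limits are set through sacctmgr; a minimal sketch (the names and values here are only illustrative):

# Association-scoped limit ({cluster, account, user}):
sacctmgr -i modify user where name=Christopher.W.Harrop account=nesccmgmt set MaxJobs=4
# QOS-scoped limit, applied via whichever QOS the job selects:
sacctmgr -i modify qos urgent set MaxJobsPerAccount=1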

Today, the components of a HetJob are allowed to vary in their QOS, Partition, and Account. If, for a given HetJob, there is any variance in this, it becomes impossible to account for usage of that HetJob without compromising in some fashion.

If the components belong to multiple QOSs, which one is to be accounted against? Or do we define some rule that a HetJob with components in multiple QOSs is accounted for separately, but that one with a common QOS can be aggregated as a single QOS? And do we handle multiple partitions the same way? This introduces a ton of new edge cases which would be hard to test fully, and IMO would also lead to even more confusion for the end users.

And those are the easy ones to address. Associations, especially the inheritance, are a seemingly intractable issue.

If the HetJob components are in separate associations, do we charge "1 job" against both? But how do we aggregate that usage to higher levels? Do we continue to charge them as separate jobs, or do we try to collapse them at some tier once those separate hierarchies converge? The logic to handle that - especially to remove the job counter once the HetJob completes - would be considerably messy.

The only way I see around this would be to limit the HetJob to having all components in the same partition, account, and QOS, and I'm unwilling to add those restrictions at this late date. I also believe that would substantially limit their utility on the systems they were originally designed for.

In conclusion, I do not have a workable design to approach this from, and thus am closing this as resolved/wontfix.
Comment 24 Anthony DelSorbo 2019-01-15 09:04:51 MST
Tim,

This is disappointing and just about puts us at risk on this project.  We're not sure how we got here, but I believe it was with your guidance during the training.  So if heterogeneous jobs are not the answer, then perhaps there's a better way.

For NOAA jobs, what we are mostly discussing is non-uniform placement, possibly with different executables, on a homogeneous set of nodes.

- Typical WRF use case: the same executable, but with a different processor count per node for some MPI tasks (for example, IO tasks)

- There are a few applications that run “atmosphere” code on a set of nodes with ppn=6 and “ocean” on a set of nodes with ppn=24 (picking examples out of thin air).

If there is an easier way to support this that does not use the "Hetero" terminology, that will work too.
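One hypothetical non-HetJob layout for the ppn=6 / ppn=24 case, assuming the two components run as separate executables in separate job steps inside a single homogeneous allocation (node counts and program names are made up; this does not give the components a common MPI_COMM_WORLD):

#!/bin/bash
#SBATCH --nodes=12 -q batch
# Atmosphere on 4 nodes at 6 tasks per node, ocean on 8 nodes at 24 tasks per
# node, launched as two concurrent job steps within the one allocation.
srun --nodes=4 --ntasks-per-node=6 ./atmosphere.exe &
srun --nodes=8 --ntasks-per-node=24 ./ocean.exe &
wait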

Please assist us in determining the right approach.  

Thanks,

Tony.
Comment 25 Anthony DelSorbo 2019-01-16 12:45:42 MST
Tim,

We need an answer to my last comment.  This is important to us: not being able to do non-uniform placement leaves us with an unusable solution and a high potential for project transition failure.

This also applies to bug 5718, which is Cray-specific but is also in critical need.

If Het jobs are not the answer, we need better guidance from your team.

Thanks,

Tony.
Comment 26 Tim Wickberg 2019-01-16 17:34:16 MST
I'm still looking into other ways to structure your jobs, or other mitigating factors we could pursue.

I could see one potential route out of this, although it would still entail some additional development time on our end, which would be to add some flag that forced the components of a hetjob to have matching partition, account, and qos options. If those all match, then the issues around how to account for the multiple components correctly mostly disappear, and that same flag could change the behavior over to what you're after.
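To make that constraint concrete: such a flag would apply to submissions where every component repeats the same partition, account, and QOS, e.g. (a hypothetical example; the option values are taken from elsewhere in this ticket):

sbatch -p selene -A nesccmgmt -q urgent --ntasks=1 : \
       -p selene -A nesccmgmt -q urgent --ntasks=24 --nodes=2 test1.job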

That's a large assumption on my end, and certainly not valid for a lot of our other customer use cases, but if it holds on NOAA systems then I could look further into whether such a constraint is workable, and what that would take in terms of NRE effort.

- Tim
Comment 27 Anthony DelSorbo 2019-03-26 15:33:51 MDT
Tim,

This issue came to a head today during our local weekly report.  Our customer is none too pleased that we don't have a solution by now.  If this can't be fixed by a near-term patch or code, we still need guidance from you on a workaround.  Perhaps we can do something in the job_submit filter to prevent a second job per account from being run?  If so, what's the proper method of getting that accomplished?  You don't have to code it for us, just help us get through this so we can get to production and we can get to work on a longer-term solution - which may include some development/NRE work once this craziness is over.

Thanks,

Tony.
Comment 28 Tim Wickberg 2019-03-27 20:14:34 MDT
(In reply to Anthony DelSorbo from comment #27)
> This issue came to a head today during our local weekly report.  Our
> customer is none too pleased that we don't have a solution by now.  If this
> can't be fixed by a near-term patch or code, we still need guidance from you
> on a workaround.

I should have checked in, and could certainly have made this more obvious, but I was expecting some feedback on what I'd proposed as a possible solution in comment 26. I'll restate that below.

> Perhaps we can do something in the job_submit filter to
> prevent a second job per account from being run?  If so, what's the proper
> method of getting that accomplished?  You don't have to code it for us, just
> help us get through this so we can get to production and we can get to work
> on a longer-term solution - which may include some development/NRE work once
> this craziness is over.

I don't see an easy approach that would emulate this through job_submit.

(In reply to Tim Wickberg from comment #26)
> I could see one potential route out of this, although it would still entail
> some additional development time on our end, which would be to add some flag
> that forced the components of a hetjob to have matching partition, account,
> and qos options. If those all match, then the issues around how to account
> for the multiple components correctly mostly disappear, and that same flag
> could change the behavior over to what you're after.

To make the implied question here explicit:

Are those restrictions that would be required - namely that each HetJob have matching partition, account, and QOS settings throughout all components - workable for NOAA?

If so, I can see an approach that should solve this. If not, we're back to square one.

I will caution that we're in the midst of wrapping up development on 19.05, and as such I cannot comment on when/if I can make such a patch available.

- Tim
Comment 29 Anthony DelSorbo 2019-03-27 20:41:36 MDT
Tim,

Sorry I didn't pick up on the implied question.  I would answer yes on that for the short term and immediate need since we don't currently use separate partitions, account, or QOS settings for a single Het job.  This may change in the future.  But by that time, we would have sufficient Slurm usage experience to be able to specify a set of requirements to develop against.

I understand your time schedule.  We too are under one where we are trying to go live by 1 May.  And we have a major constraint of losing Moab/Torque support as well.  So, if this transition fails we fail big.

Give me an idea of what can be done and when so that I can present a plausible plan to the customer.

Thanks,

Tony.

Comment 30 Tim Wickberg 2019-04-01 21:45:58 MDT
(In reply to Anthony DelSorbo from comment #29)
> Tim,
> 
> Sorry I didn't pick up on the implied question.  I would answer yes on that
> for the short term and immediate need since we don't currently use separate
> partitions, account, or QOS settings for a single Het job.  This may change
> in the future.  But by that time, we would have sufficient Slurm usage
> experience to be able to specify a set of requirements to develop against.

Just to be clear: relaxing those constraints with additional development is not something I'm interested in exploring. Modifying the constraint logic to support mixing some number of those while still counting as a single job in the correct locations is not possible with the current architecture for those limits.
Comment 32 Tim Wickberg 2019-06-12 15:18:18 MDT
Just as an update: I will get a proof-of-concept of this change - which would require that the qos/partition/account all match - attached here ASAP. I did hit a few snags in how the HetJobs are represented at submission time that I need to work through still, which is delaying this slightly.

If said POC looks like it'd satisfy NOAA's request here, I can then move to get an SOW generated on having that included, alongside an appropriate configuration flag and documentation.

- Tim
Comment 34 Anthony DelSorbo 2019-06-13 14:56:44 MDT
(In reply to Tim Wickberg from comment #32)
> Just as an update: I will get a proof-of-concept of this change - which
> would require that the qos/partition/account all match - attached here ASAP.
> I did hit a few snags in how the HetJobs are represented at submission time
> that I need to work through still, which is delaying this slightly.
> 
> If said POC looks like it'd satisfy NOAA's request here, I can then move to
> get an SOW generated on having that included, alongside an appropriate
> configuration flag and documentation.
> 
> - Tim

Tim,

Would you be able to include this additional SOW with the other set of 3 for a total of 4?  

Thanks,

Tony.
Comment 35 Tim Wickberg 2019-06-19 02:03:40 MDT
> Would you be able to include this additional SOW with the other set of 3 for
> a total of 4?  

Yes, Jess will have this included as well with the rest.

- Tim
Comment 37 Tim Wickberg 2019-07-15 19:23:54 MDT
Tony -

Re-tagging this as an enhancement request. You should have a SOW on building out an option to control this; we're waiting for further news on that before making any further changes to this ticket.

- Tim