We have run into a situation where no jobs are running. Below are the outputs from the squeue, sinfo, and sacctmgr show qos commands. There are 5 jobs pending and stuck with the reason MaxJobsPerAccount, and that limit is indeed set on the chosen QOS. It's not clear why the jobs aren't running. However, there was a configuration change today to address bug 5946. I applied the change live and restarted the slurmd and slurmctld daemons. The change was to the NodeName CPU count, from CPUs=48 to CPUs=24. It's entirely possible that this affected these jobs. But before I ask the users to cancel their jobs, I'd like to understand why the "MaxJobsPerAccount" jobs are stuck (the "show job" output is shown below). At least one of these jobs should be in the run state.

squeue -l
Wed Oct 31 15:16:41 2018
  JOBID PARTITION     NAME     USER    STATE  TIME TIME_LIMI NODES NODELIST(REASON)
26586+1    selene    test1 Christop  PENDING  0:00      2:00     2 (None)
26588+1    selene    test1 Christop  PENDING  0:00      2:00     2 (None)
26590+1    selene    test1 Christop  PENDING  0:00      2:00     2 (None)
26592+1    selene    test1 Christop  PENDING  0:00      2:00     2 (None)
26594+1    selene    test1 Christop  PENDING  0:00      2:00     2 (None)
26586+0    selene    test1 Christop  PENDING  0:00      2:00     1 (MaxJobsPerAccount)
26588+0    selene    test1 Christop  PENDING  0:00      2:00     1 (MaxJobsPerAccount)
26590+0    selene    test1 Christop  PENDING  0:00      2:00     1 (MaxJobsPerAccount)
26592+0    selene    test1 Christop  PENDING  0:00      2:00     1 (MaxJobsPerAccount)
26594+0    selene    test1 Christop  PENDING  0:00      2:00     1 (MaxJobsPerAccount)
  27635    selene cpnmmb_2 Jili.Don  PENDING  0:00     29:00    20 (Priority)

sinfo -l
Wed Oct 31 15:19:30 2018
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
selene*      up    8:00:00 1-infinite   no EXCLUSIV    all    21 idle  s[0001-0021]
batch        up    8:00:00 1-infinite   no EXCLUSIV    all    21 idle  s[0001-0021]
shared       up    8:00:00          1   no       NO    all    21 idle  s[0001-0021]
service      up 1-00:00:00          1   no  FORCE:4    all     1 idle  sfe01
bigmem       up    8:00:00 1-infinite   no EXCLUSIV    all     1 idle  sb01
fge          up    8:00:00 1-infinite   no EXCLUSIV    all     2 idle  sg[001-002]
admin        up 1-00:00:00 1-infinite   no EXCLUSIV    all    25 idle  s[0001-0021],sb01,sfe01,sg[001-002]

sacctmgr show qos format=Name,Priority,Flags,UsageThres,UsageFactor,GrpTRES,GrpTRESMins,MaxTRES,MaxTRESPerNode,MaxJobsPA,MaxTRESPA,MaxSubmitJobsPA,MaxJobsPU
      Name   Priority                Flags UsageThres UsageFactor       GrpTRES   GrpTRESMins       MaxTRES MaxTRESPerNode MaxJobsPA     MaxTRESPA MaxSubmitPA MaxJobsPU
---------- ---------- -------------------- ---------- ----------- ------------- ------------- ------------- -------------- --------- ------------- ----------- ---------
    normal          0   1.000000
  windfall          1   1.000000
     batch         20   1.000000   cpu=4104
     debug         30   1.000000   cpu=4104   2
    urgent         40   1.500000   cpu=4104   1
     admin         90   1.000000
     novel         50   1.000000
maximum-q+        100   1.000000

[slurm@bqs5 etc]$ scontrol show job 26586
JobId=26586 PackJobId=26586 PackJobOffset=0 JobName=test1
   PackJobIdSet=26586-26587
   UserId=Christopher.W.Harrop(3441) GroupId=nesccmgmt(18001) MCS_label=N/A
   Priority=401816474 Nice=0 Account=nesccmgmt QOS=urgent
   JobState=PENDING Reason=MaxJobsPerAccount Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2018-10-30T22:54:00 EligibleTime=2018-10-30T22:54:00
   AccrueTime=2018-10-30T22:54:00
   StartTime=2018-10-31T16:12:19 EndTime=2018-10-31T16:14:19 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-10-31T16:12:13
   Partition=selene AllocNode:Sid=sfe01:128306
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=s0003
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=512M,node=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=512M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/Christopher.W.Harrop/opt/rocoto/develop/test
   StdErr=/home/Christopher.W.Harrop/opt/rocoto/develop/test/log/test/test_1.join
   StdIn=/dev/null
   StdOut=/home/Christopher.W.Harrop/opt/rocoto/develop/test/log/test/test_1.join
   Power=

JobId=26587 PackJobId=26586 PackJobOffset=1 JobName=test1
   PackJobIdSet=26586-26587
   UserId=Christopher.W.Harrop(3441) GroupId=nesccmgmt(18001) MCS_label=N/A
   Priority=401816613 Nice=0 Account=nesccmgmt QOS=urgent
   JobState=PENDING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:02:00 TimeMin=N/A
   SubmitTime=2018-10-30T22:54:00 EligibleTime=2018-10-30T22:54:00
   AccrueTime=2018-10-30T22:54:00
   StartTime=2018-10-31T16:12:19 EndTime=2018-10-31T16:14:19 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-10-31T16:12:13
   Partition=selene AllocNode:Sid=sfe01:128306
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=s[0014-0015]
   NumNodes=2-2 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=9600M,node=2,billing=4
   Socks/Node=* NtasksPerN:B:S:C=4:0:*:1 CoreSpec=*
   MinCPUsNode=4 MinMemoryCPU=1200M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/Christopher.W.Harrop/opt/rocoto/develop/test
   StdErr=/home/Christopher.W.Harrop/opt/rocoto/develop/test/log/test/test_1.join
   StdIn=/dev/null
   StdOut=/home/Christopher.W.Harrop/opt/rocoto/develop/test/log/test/test_1.join
   Power=
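For reference, a minimal sketch of the configuration change and restart sequence described above. The node range is taken from the sinfo output; everything else on the NodeName line, and the use of clush for the compute nodes, is an assumption:

  # slurm.conf on the controller (bqs5) - only the CPU count changed:
  #   before: NodeName=s[0001-0021] ... CPUs=48 ...
  #   after:  NodeName=s[0001-0021] ... CPUs=24 ...
  # then restart the daemons, as described above:
  systemctl restart slurmctld                          # on the controller
  clush -w s[0001-0021] 'systemctl restart slurmd'     # on the compute nodes (clush assumed)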
Let me clarify that the pending "MaxJobsPerAccount" jobs were submitted yesterday, long before the configuration change was made today. Further, I now have other jobs, submitted after the configuration change, that are pending due to priority:

  27807    selene slurm-mp Raghu.Re PD 0:00  1 (Priority)
  27639    selene RAPP_gsi Eric.Jam PD 0:00  6 (Priority)
  27691+1  selene     tcsh Leslie.B PD 0:00  2 (Priority)
  27691+0  selene     tcsh Leslie.B PD 0:00  2 (Priority)

Job 27807 is a single-node job and should be permitted to run - especially since backfill is enabled.

scontrol show job 27807
JobId=27807 JobName=slurm-mpirun
   UserId=Raghu.Reddy(537) GroupId=nesccmgmt(18001) MCS_label=N/A
   Priority=300019255 Nice=0 Account=nesccmgmt QOS=debug
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2018-10-31T16:44:33 EligibleTime=2018-10-31T16:44:33
   AccrueTime=2018-10-31T16:44:33
   StartTime=2018-10-31T16:56:00 EndTime=2018-10-31T17:26:00 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2018-10-31T16:54:44
   Partition=selene AllocNode:Sid=sfe01:144462
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=s0021
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1200M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1200M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/Slurm/bugs/omp-scalability/gv.job
   WorkDir=/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/Slurm/bugs/omp-scalability
   StdErr=/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/Slurm/bugs/omp-scalability/%x.o27807
   StdIn=/dev/null
   StdOut=/tds_scratch3/SYSADMIN/nesccmgmt/Raghu.Reddy/Slurm/bugs/omp-scalability/%x.o27807
   Power=
Tony,

The change in bug 5946 seems to be affecting only jobs submitted prior to the modification. I think this change would have been better made with no running or queued jobs in the system. Are your users able to resubmit?

-Jason
Hi Tony,

Would you please also upload your slurmctld.log?

-Jason
How can I remotely send a file to this case? Our current process is convoluted: I have to scp the file to our production system, connect to it from my Windows machine with WinSCP, copy the file down to my local machine, and then send it to you via my browser. It would be great if there were a tool that let me just scp the file directly to the case. I'll send you something in a few minutes. Going through the process ....
Created attachment 8141 [details]
bqs5 slurmctld.log file
Thanks for the logs, Tony. I will let you know what I find.

-Jason
Hi Tony,

Job 27807 did eventually run, so it does look like jobs are going through the system. Did you make any changes after the upload, and was this the first job to run in a while? Have more started from the "urgent" account?

Would you try updating the MaxSubmitPA to 1, wait for a second, and then update it again to 2? Also, can you restart slurmctld again to see if the issue clears up with regard to the MaxSubmitPA?

[2018-10-31T18:26:43.538] debug3: sched: JobId=27807 initiated
[2018-10-31T18:26:43.538] sched: Allocate JobId=27807 NodeList=s0021 #CPUs=24 Partition=selene
[2018-10-31T18:26:43.539] debug3: create_mmap_buf: loaded file `/var/spool/slurmctld/hash.7/job.27807/script` as Buf
[2018-10-31T18:26:43.541] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:26:43.604] debug2: Processing RPC: REQUEST_COMPLETE_PROLOG from JobId=27807
[2018-10-31T18:26:43.604] debug2: _slurm_rpc_complete_prolog JobId=27807 usec=21
[2018-10-31T18:26:45.327] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=27807
[2018-10-31T18:26:45.327] _job_complete: JobId=27807 WEXITSTATUS 0
[2018-10-31T18:26:45.327] debug3: select/cons_res: _rm_job_from_res: JobId=27807 action 0
[2018-10-31T18:26:45.327] debug3: select/cons_res: removed JobId=27807 from part selene row 0
[2018-10-31T18:26:45.328] _job_complete: JobId=27807 done
[2018-10-31T18:26:45.332] debug2: _slurm_rpc_complete_batch_script JobId=27807 usec=1738
[2018-10-31T18:26:45.369] debug2: epilog_slurmctld JobId=27807 epilog completed
[2018-10-31T18:26:45.464] debug2: _slurm_rpc_epilog_complete: JobId=27807 Node=s0021
[2018-10-31T18:26:48.000] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:27:07.552] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:27:12.000] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:27:49.671] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:27:58.149] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:28:06.899] debug3: Writing job id 27807 to header record of job_state file
[2018-10-31T18:32:43.955] debug2: _purge_files_thread: purging files from JobId=27807
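A sketch of the limit toggling suggested above, assuming the limit in question sits on the "urgent" QOS shown in the earlier sacctmgr output (option names are the long forms sacctmgr accepts):

  sacctmgr -i modify qos urgent set MaxSubmitJobsPerAccount=1
  sleep 1
  sacctmgr -i modify qos urgent set MaxSubmitJobsPerAccount=2
  systemctl restart slurmctld    # only if the limit still appears stuck afterwards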
Jason,

I honestly don't know why the delay is occurring between jobs. Even though that job finally ran, there was a very long delay: submit time 2018-10-31T16:44:33, start time 2018-10-31T18:26:43. So I'll need to know why that occurred. What would be the best way to determine what is really causing a job's delay?

However, for the MaxJobsPerAccount jobs, it appears that changing the MaxJobsPerAccount value to 2 fixed the issue. It was actually my colleague, Raghu Reddy, who made me see the light. Sometimes you can't see the forest for the trees. It turns out that for heterogeneous jobs (and likely under other circumstances), as in this job:

  27818+1    selene    test1 Christop  R  0:22  2 s[0014-0015]
  27818+0    selene    test1 Christop  R  0:22  1 s0016

there are actually 2 jobs submitted to the urgent queue. Once I changed the MaxJobsPerAccount value to 2, that particular job was able to run.

So, this is a problem for us. We need to be able to limit users to 1 "job set" per account for that particular QOS. So, how do we accomplish that?

I'll keep digging to see why the other jobs were delayed.

Thanks,

Tony.
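For reference, a minimal sketch of the change described above, assuming the MaxJobsPerAccount limit sits on the urgent QOS:

  sacctmgr -i modify qos urgent set MaxJobsPerAccount=2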
Hi Tony,

> So, this is a problem for us. We need to be able to limit users to 1 "job set" per account for that particular QOS. So, how do we accomplish that?

This is currently a restriction in how Slurm treats heterogeneous jobs and MaxSubmitPA, so you will need a MaxSubmitPA of two if you wish to have these types of jobs run. We do not have a configuration option to map MaxSubmitPA to a single heterogeneous job. If this becomes an issue then you may need an additional account that heterogeneous jobs can submit to.

In regards to the slow-starting jobs: once the MaxSubmitPA was changed to 2, did you notice a change in scheduling?

-Jason
Tony,

One other note: I see that you marked 18.08.1 in the bug, but if you are on 18.08.2 then I would highly suggest that you upgrade to 18.08.3. These versions include a fix for a regression introduced in 18.08.2 and 17.11.11 that could lead to a loss of accounting records if the slurmdbd was offline. All sites with 18.08.2 or 17.11.11 slurmctld processes are encouraged to upgrade them ASAP.

-Jason
Jason,

Thanks for the feedback. This is an unscalable problem. A heterogeneous job will be treated as "N" jobs, where "N" is the number of colon-separated components. So this affects more than just the urgent QOS; it also affects the debug QOS, where we limit users to 2 jobs per account. We need a solution as quickly as possible to treat a heterogeneous job as a single unit in terms of configuration limits, etc., or our policies will not be enforceable.

I noted that there's a "PackJobIdSet=27818-27819" setting. Is there a means of coming up with a workaround such that, if that exists, the limit can be bypassed on a job-by-job basis? Or is there a better way?

Still trying to understand the problem of the slow jobs.

Tony.
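The pack-job fields mentioned above do appear in the job record, so membership in a heterogeneous job can at least be detected; a sketch using the job from the squeue output above:

  scontrol show job 27818 | grep -o 'PackJob[A-Za-z]*=[^ ]*'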
Hi Tony,

I have looked for options here; however, the functionality does not exist in Slurm currently. I would suggest that you reach out to Jess Arrington and speak with him about obtaining development time to expand the accounting system for this use case.

I am dropping the severity of this issue down to sev 3 since we know the root cause for why the heterogeneous jobs would not start.

-Jason
Jason,

Thanks for the feedback. To be clear, we're not asking for a new feature. It's simply a bug that needs to be corrected. The fact is, from a user's perspective and an organizational perspective, it's a single submission. This is a single submission:

  sbatch -q batch --ntasks=1 : --ntasks=24 --nodes=2 --export=launcher,nranks,ALL hello-mpirun-2-part.job

But because you have designed the system to treat it as two jobs, the implementation breaks the intent of MaxJobsPA, MaxJobsPU, MaxSubmitPA, and MaxSubmitPU. I hope you can see our perspective.

Thanks,

Tony.
Hi Tony,

We have discussed this internally to make sure this is not a bug. Heterogeneous allocations are treated as separate job allocations internally in Slurm, so limits such as MaxJobsPA will apply to each allocation and not to the heterogeneous job as a whole. It would be best to talk with Jess about the changes you are asking for, since they are non-trivial and would most likely take a few weeks to implement correctly.

As a side question: what is your site's definition of a job allocation within these heterogeneous jobs? Perhaps a GrpTRES=cpu=## might be a better approach.
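A sketch of the GrpTRES alternative mentioned above; the QOS name comes from this ticket and the cpu value is purely illustrative:

  sacctmgr -i modify qos urgent set GrpTRES=cpu=96

GrpTRES on a QOS caps the total TRES held by all jobs running under that QOS, so it limits aggregate resource consumption rather than counting job records.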
Jason,

I sent Jess a message requesting a meeting. Meanwhile, to us, a heterogeneous job is just that - a singular job. Not plural. Now, because you treat the job as multiple allocations - which, by the way, is causing quite a stir around here - it is breaking the limit infrastructure. To a user it is one job. So, this is one allocation of resources to us. I'm asking for an allocation of resources from Slurm with a certain layout for a certain amount of time. It is up to me, the user, to determine how to go about using those resources. With a subsequent srun you force additional jobs. This is unnecessary (and wasteful in job IDs) in my view. But I'm willing to listen to try to understand why that's necessary. With the current batch system, we simply use the nodefile generated by the scheduler and mpirun -np ... -np ... etc. to accomplish the task. It all remains encapsulated within that one job.

On your last comment, I fail to see how GrpTRES=cpu=## would help. The limit must apply to the job set as a whole - which could be an allocation of 1 CPU or 1000 CPUs.

Were you able to look into how we can remotely send a file to a case? Would you rather start a different ticket for this issue?

Thanks,

Tony.
Hi Tony,

We are not against what you are proposing, and we understand the logic here. What we are saying is that the current way heterogeneous jobs are implemented with limits will not support what you are trying to do with MaxJobsPA, and it would need development time to support this.

> So, this is one allocation of resources to us. I'm asking for an allocation of resources from Slurm with a certain layout for a certain amount of time. It is up to me, the user, to determine how to go about using those resources. With a subsequent srun you force additional jobs. This is unnecessary (and wasteful in job IDs) in my view. But I'm willing to listen to try to understand why that's necessary.

This was done by design and, at the time, was the solution that was developed by the site and the Slurm developers. If you would like development hours to change the current design then Jess would be the person to talk to.

> Were you able to look into how we can remotely send a file to a case? Would you rather start a different ticket for this issue?

Currently, we do not offer another way to send in files. If you have a file server on your end that can expose these then we are open to pulling down a file from it.
Just to summarize and update the ticket: the fundamental problem here is that the separate components of a HetJob are treated internally as separate jobs for the purpose of limit calculation. I understand that this leads to a lot of confusion, and I agree that the semantics here aren't ideal from a user standpoint.

I will be looking further into fixing this in a future release, but the changes required to treat the HetJob components as a single job will require further study, and would be invasive enough that we cannot consider including them on a stable branch.
Unfortunately, on further review I do not believe we can change this without creating significant issues elsewhere and limiting the overall functionality of HetJobs. I understand this is not ideal for sites used to thinking of these as "one job"; but with the implementation in Slurm they truly are separate pieces, and cannot be tracked as one larger job in a number of different limits.

As background, limits in Slurm are built in two main areas:

- Associations, which can be a limit on {cluster, account}, {cluster, account, user}, or {cluster, account, user, partition}, and which - through the Grp limits - allow the limits within a hierarchy to aggregate usage by multiple sub-accounts.

- QOS, which can be selected into separately from the associations, and which can also be used as a Partition QOS.

Today, the components of a HetJob are allowed to vary in their QOS, Partition, and Account. If, for a given HetJob, there is any variance in these, it becomes impossible to account for usage of that HetJob without compromising in some fashion. If the components belong to multiple QOSs, which one is the job to be accounted against? Or do we define some rule that a HetJob with components in multiple QOSs should be accounted for separately, but that one with the QOS in common can be aggregated as a single QOS? And do we handle multiple partitions the same way? This introduces a ton of new edge cases which would be hard to test fully, and also, in my opinion, would lead to even more confusion for the end users.

And those are the easy ones to address. Associations, especially the inheritance, are a seemingly intractable issue. If the HetJob components are in separate associations, do we charge "1 job" against both? But how do we aggregate that usage to higher levels? Do we continue to charge them as separate jobs, or do we try to collapse them at some tier once those separate hierarchies converge? The logic to handle that - especially removing the job counter once the HetJob completes - would be considerably messy.

The only way I see around this would be to limit the HetJob to having all components in the same partition, account, and QOS, and I'm unwilling to add those restrictions at this late date. I also believe that would substantially limit their utility on the systems they were originally designed for.

In conclusion, I do not have a workable design to approach this from, and thus am closing this as resolved/wontfix.
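The two limit layers described above can be inspected directly with sacctmgr; a sketch using standard format fields:

  sacctmgr show assoc tree format=Cluster,Account,User,Partition,GrpJobs,MaxJobs
  sacctmgr show qos format=Name,MaxJobsPA,MaxSubmitJobsPA,GrpJobs,GrpTRES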
Tim,

This is disappointing and just about puts this project at risk. We're not sure how we got here, but I believe it was with your guidance during the training. So if heterogeneous jobs are not the answer, then perhaps there's a better way. For NOAA jobs, what we are mostly discussing is non-uniform placement, possibly with different executables, on a homogeneous set of nodes:

- WRF typical use case: same executable, but with a different number of processors per node for some MPI tasks (for example, IO tasks).

- There are a few applications that run "atmosphere" code on a set of nodes with ppn=6 and "ocean" code on a set of nodes with ppn=24 (picking examples out of thin air).

If there is an easier way to support this (a sketch of the kind of layout we mean is below), that will work too, and it need not use the "Hetero" terminology. Please assist us in determining the right approach.

Thanks,

Tony.
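A minimal sketch of that ppn=6/ppn=24 layout inside one ordinary (non-heterogeneous) allocation, along the lines of the nodefile-plus-mpirun workflow described earlier in this ticket. It assumes an Open MPI-style mpirun, whole-node exclusive scheduling, and purely illustrative node counts, rank counts, and executable names:

  #!/bin/bash
  #SBATCH --nodes=4 --exclusive --partition=selene --time=00:30:00
  # Split the allocated node list into the two groups (2 nodes each here).
  scontrol show hostnames "$SLURM_JOB_NODELIST" > nodes.txt
  head -n 2 nodes.txt > atmos.hosts      # "atmosphere" nodes, 6 ranks per node
  tail -n 2 nodes.txt > ocean.hosts      # "ocean" nodes, 24 ranks per node
  # One MPMD launch spanning both groups; exact flags depend on the MPI stack.
  mpirun -np 12 --hostfile atmos.hosts --npernode 6  ./atmos.exe : \
         -np 48 --hostfile ocean.hosts --npernode 24 ./ocean.exe

Whether an MPI library honors per-context host files this way varies by implementation, so this is only one possible shape of a workaround.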
Tim,

We need an answer to my last comment. This is important to us, and not being able to do non-uniform placement leaves us with an unusable solution and a high potential for project transition failure. This also applies to bug 5718 - which is Cray-specific, but also in critical need. If HetJobs are not the answer, we need better guidance from your team.

Thanks,

Tony.
I'm still looking into other ways to structure your jobs, or other mitigating factors we could pursue.

I could see one potential route out of this, although it would still entail some additional development time on our end, which would be to add some flag that forces the components of a hetjob to have matching partition, account, and qos options. If those all match, then the issues around how to account for the multiple components correctly mostly disappear, and that same flag could change the behavior over to what you're looking for.

That's a large assumption on my end, and certainly not valid for a lot of our other customer use cases, but if it holds on NOAA systems then I could look further into whether such a constraint is workable, and what that would take in terms of NRE effort.

- Tim
Tim,

This issue came to a head today during our local weekly report. Our customer is none too pleased that we don't have a solution by now. If this can't be fixed by a near-term patch or code, we still need guidance from you on a workaround. Perhaps we can do something in the job_submit filter to prevent a second job per account from being run? If so, what's the proper method of getting that accomplished? You don't have to code it for us, just help us get through this so we can get to production, and we can then work on a longer-term solution - which may include some development/NRE work once this craziness is over.

Thanks,

Tony.
(In reply to Anthony DelSorbo from comment #27)
> This issue came to a head today during our local weekly report. Our
> customer is none too pleased that we don't have a solution by now. If this
> can't be fixed by a near-term patch or code, we still need guidance from you
> on a workaround.

I should have checked in, and could certainly have made this more obvious, but what I'd proposed as a possible solution in comment 26 was something I was expecting feedback on. I'll restate it below.

> Perhaps we can do something in the job_submit filter to prevent a second job
> per account from being run? If so, what's the proper method of getting that
> accomplished? You don't have to code it for us, just help us get through this
> so we can get to production, and we can then work on a longer-term solution -
> which may include some development/NRE work once this craziness is over.

I don't see an easy approach that would emulate this through job_submit.

(In reply to Tim Wickberg from comment #26)
> I could see one potential route out of this, although it would still entail
> some additional development time on our end, which would be to add some flag
> that forces the components of a hetjob to have matching partition, account,
> and qos options. If those all match, then the issues around how to account
> for the multiple components correctly mostly disappear, and that same flag
> could change the behavior over to what you're looking for.

To make the implied question here explicit: are those restrictions - namely that each HetJob have matching partition, account, and QOS settings throughout all components - workable for NOAA?

If so, I can see an approach that should solve this. If not, we're back to square one.

I will caution that we're in the midst of wrapping up development on 19.05, and as such I cannot comment on when/if I can make such a patch available.

- Tim
Tim,

Sorry I didn't pick up on the implied question. I would answer yes to that for the short-term, immediate need, since we don't currently use separate partitions, accounts, or QOS settings for a single het job. This may change in the future, but by that time we would have sufficient Slurm usage experience to be able to specify a set of requirements to develop against.

I understand your time schedule. We too are under one, where we are trying to go live by 1 May, and we have the major constraint of losing Moab/Torque support as well. So if this transition fails, we fail big. Give me an idea of what can be done and when, so that I can present a plausible plan to the customer.

Thanks,

Tony.
(In reply to Anthony DelSorbo from comment #29)
> Sorry I didn't pick up on the implied question. I would answer yes to that
> for the short-term, immediate need, since we don't currently use separate
> partitions, accounts, or QOS settings for a single het job. This may change
> in the future, but by that time we would have sufficient Slurm usage
> experience to be able to specify a set of requirements to develop against.

Just to be clear: relaxing those constraints with additional development is not something I'm interested in exploring. Modifying the constraint logic to support mixing some number of those while still counting as a single job in the correct locations is not possible with the current architecture for those limits.
Just as an update: I will get a proof-of-concept of this change - which would require that the QOS/partition/account all match - attached here ASAP. I did hit a few snags in how the HetJobs are represented at submission time that I still need to work through, which is delaying this slightly.

If said POC looks like it would satisfy NOAA's request here, I can then move to get an SOW generated for having that included, alongside an appropriate configuration flag and documentation.

- Tim
(In reply to Tim Wickberg from comment #32)
> Just as an update: I will get a proof-of-concept of this change - which
> would require that the QOS/partition/account all match - attached here ASAP.
> I did hit a few snags in how the HetJobs are represented at submission time
> that I still need to work through, which is delaying this slightly.
>
> If said POC looks like it would satisfy NOAA's request here, I can then move
> to get an SOW generated for having that included, alongside an appropriate
> configuration flag and documentation.
>
> - Tim

Tim,

Would you be able to include this additional SOW with the other set of 3, for a total of 4?

Thanks,

Tony.
> Would you be able to include this additional SOW with the other set of 3, for
> a total of 4?

Yes, Jess will have this included as well with the rest.

- Tim
Tony -

Re-tagging this as an enhancement request. You should have an SOW on building out an option to control this; we're waiting for further news on that before making any further changes to this ticket.

- Tim