| Summary: | MaxEnergyPerAccount and QOSGrpEnergy | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | lhuang |
| Component: | slurmctld | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | cblack |
| Version: | 20.11.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NY Genome | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, screenlog.0, slurmdbd.log, tres.txt, qos.txt, assoc.txt, slurmctld.log-20201125.gz, slurmctld.log-20201124.gz, slurmctld.log-20201120.gz, slurmctld.log | | |
Hi,

Can you show me the output of:

sacctmgr show assoc -p
sacctmgr show qos -p
sacctmgr show tres

It would also be useful to have the slurmdbd logs. I suppose you have done a database conversion from 20.02, is that right? If so, just attach the logs. Thanks!

Created attachment 16890 [details] screenlog.0

Here are the attached logs. I started slurmdbd manually in a screen session during the upgrade and have attached its output. We upgraded from 19.05.3-2. I noticed that the MaxEnergyPerAccount and QOSGrpEnergy messages are most likely triggered by users hitting either the memory limit or the core-count limit.

Created attachment 16891 [details]
slurmdbd.log
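As context for the diagnostics requested above: `sacctmgr show tres` prints the table that maps TRES types to numeric ids, and a shifted id mapping would make a limit configured on cpu or memory show up under an energy label. A hedged check along those lines (this requires a host with the Slurm client tools; the interpretation that the ids are shifted is an assumption, not a confirmed diagnosis):

```shell
# Dump the TRES table to verify the id -> type mapping. In a stock Slurm
# database, cpu, mem, energy, node and billing receive ids 1 through 5;
# if limits configured on cpu/mem are being reported against energy,
# compare these ids with whatever the controller log messages reference.
sacctmgr show tres format=Type,Name,ID
```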
Created attachment 16892 [details]
tres.txt
Created attachment 16893 [details]
qos.txt
Created attachment 16894 [details]
assoc.txt
> I noticed the MaxEnergyPerAccount and QOSGrpEnergy are most likely triggered
> by the users hitting either the memory resource limit or the number of cores.
Those are indeed the two limits defined in the QOS.
Can you upload the slurmctld log too?
Created attachment 16898 [details] slurmctld.log-20201125.gz

We didn't define an energy limit; the only limits we defined are CPU and memory. In the past we would see MaxCPUPerAccount rather than MaxEnergyPerAccount, so we are a little puzzled about why we see a different message after the upgrade.

Created attachment 16899 [details]
slurmctld.log-20201124.gz
Created attachment 16900 [details]
slurmctld.log-20201120.gz
Created attachment 16901 [details]
slurmctld.log
While I am debugging this, could you please increase the slurmctld debug level to debug2, run a job, check that it gets blocked for the mentioned reasons, and then send me the slurmctld log? With debug2 messages I will be able to see more than with just 'error'.

SlurmctldDebug=debug2

Also, can you restart slurmctld after setting debug2? I'd like to see the entire initialization sequence of the Slurm controller at this verbosity. To summarize:

1. Change SlurmctldDebug to debug2
2. Stop and start slurmctld
3. Submit a job and check that it gets the wrong pending reason
4. Send me the log

Thank you

We are unsure if this was triggered by a version mismatch between slurmd and the Slurm commands. Now that we have updated all nodes to the same version, we no longer see the issue.
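The debug2 procedure above can be sketched as a shell session. The config path, the systemd unit name, and the log location below are assumptions for illustration, not details taken from this site's setup:

```shell
# 1. Raise controller verbosity by editing slurm.conf (path assumed to be
#    /etc/slurm/slurm.conf) and setting:
#      SlurmctldDebug=debug2
# 2. Restart the controller so the entire init sequence is logged at debug2
systemctl restart slurmctld            # assumes a systemd-managed slurmctld
# 3. Submit a throwaway job and watch its pending reason
sbatch --wrap="sleep 60"
squeue -u "$USER" -o "%i %T %r"        # job id, state, pending reason
# 4. Collect the log for the ticket (SlurmctldLogFile path is an assumption)
grep -i energy /var/log/slurm/slurmctld.log
```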
Created attachment 16844 [details] slurm.conf

Jobs are showing MaxEnergyPerAccount and QOSGrpEnergy in the NODELIST(REASON) column. We do not have any energy QOS or resource limits in place. Any idea what could be causing this?

9071964_[7003-8787      pe2 SCEPTRE-  jmorris PD  0:00      1 (MaxEnergyPerAccount)
9076733              bigmem submit_s   fgaiti PD  0:00      1 (QOSGrpEnergy)

JobId=9076499 ArrayJobId=9071964 ArrayTaskId=4207 JobName=SCEPTRE-RNA
   UserId=jmorris(50680) GroupId=nslab(9013) MCS_label=N/A
   Priority=234669 Nice=0 Account=nslab QOS=nslab
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=01:19:29 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2020-11-25T16:15:33 EligibleTime=2020-11-25T16:15:33
   AccrueTime=2020-11-25T16:15:33
   StartTime=2020-11-25T20:04:41 EndTime=2020-11-25T21:24:10 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-25T20:04:41
   Partition=pe2 AllocNode:Sid=pe2-login01:86721
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=pe2cc2-038
   BatchHost=pe2cc2-038
   NumNodes=1 NumCPUs=2 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=4G,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs/commons/groups/sanjana_lab/jmorris/190124_CRISPR-GWAS/200916_FullPilot/bin/run_SCEPTRE-RNA.sh
   WorkDir=/gpfs/commons/groups/sanjana_lab/jmorris/190124_CRISPR-GWAS/200916_FullPilot
   StdErr=/gpfs/commons/groups/sanjana_lab/jmorris/190124_CRISPR-GWAS/200916_FullPilot/logs/9071964_4207.out
   StdIn=/dev/null
   StdOut=/gpfs/commons/groups/sanjana_lab/jmorris/190124_CRISPR-GWAS/200916_FullPilot/logs/9071964_4207.out
   Power=

[lhuang@pe2-login01 ~]$ scontrol show job 9076733
JobId=9076733 JobName=submit_splice_pipeline
   UserId=fgaiti(50293) GroupId=dllab(9012) MCS_label=N/A
   Priority=119998 Nice=0 Account=dllab QOS=dllab
   JobState=PENDING Reason=QOSGrpEnergy Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=10-00:00:00 TimeMin=N/A
   SubmitTime=2020-11-25T20:22:08 EligibleTime=2020-11-25T20:22:08
   AccrueTime=2020-11-25T20:22:08
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-25T23:04:01
   Partition=bigmem AllocNode:Sid=pe2-login01:3352
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=100000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=100000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs/commons/groups/landau_lab/SF3B1_splice_project/9.Splice_pipeline_example_run/P6/P6_run_splice_pipelinev4.sh
   WorkDir=/gpfs/commons/groups/landau_lab/SF3B1_splice_project/9.Splice_pipeline_example_run/P6
   StdErr=/gpfs/commons/groups/landau_lab/SF3B1_splice_project/9.Splice_pipeline_example_run/P6/slurm-9076733.out
   StdIn=/dev/null
   StdOut=/gpfs/commons/groups/landau_lab/SF3B1_splice_project/9.Splice_pipeline_example_run/P6/slurm-9076733.out
   Power=