| Summary: | MaxEnergyPerAccount and QOSGrpEnergy | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | lhuang |
| Component: | slurmctld | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | cblack |
| Version: | 20.11.0 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NY Genome | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, screenlog.0, slurmdbd.log, tres.txt, qos.txt, assoc.txt, slurmctld.log-20201125.gz, slurmctld.log-20201124.gz, slurmctld.log-20201120.gz, slurmctld.log | | |
Hi,

Can you show me the output of:

sacctmgr show assoc -p
sacctmgr show qos -p
sacctmgr show tres

It would also be useful to have the slurmdbd logs. I suppose you have done a database conversion from 20.02, is that right? If so, just attach the logs. Thanks!

Created attachment 16890 [details] screenlog.0

Here are the attached logs. I started slurmdbd manually in a screen session during the upgrade and have attached its output. We upgraded from 19.05.3-2. I noticed that the MaxEnergyPerAccount and QOSGrpEnergy messages are most likely triggered by users hitting either the memory limit or the core-count limit.

Created attachment 16891 [details]
slurmdbd.log
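As context for the diagnostics requested above: `sacctmgr show tres` prints the table that maps TRES types to numeric ids, and a shifted id mapping would make a limit configured on cpu or memory show up under an energy label. A hedged check along those lines (this requires a host with the Slurm client tools; the interpretation that the ids are shifted is an assumption, not a confirmed diagnosis):

```shell
# Dump the TRES table to verify the id -> type mapping. In a stock Slurm
# database, cpu, mem, energy, node and billing receive ids 1 through 5;
# if limits configured on cpu/mem are being reported against energy,
# compare these ids with whatever the controller log messages reference.
sacctmgr show tres format=Type,Name,ID
```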
Created attachment 16892 [details]
tres.txt
Created attachment 16893 [details]
qos.txt
Created attachment 16894 [details]
assoc.txt
> I noticed the MaxEnergyPerAccount and QOSGrpEnergy are most likely triggered
> by the users hitting either the memory resource limit or the number of cores.
Those are indeed the two limits defined in the QOS.
Can you upload the slurmctld log too?
Created attachment 16898 [details] slurmctld.log-20201125.gz

We didn't define an energy limit; the only limits we defined are CPU and memory. In the past we would see MaxCPUPerAccount rather than MaxEnergyPerAccount, so we are a little puzzled about why we see a different message after the upgrade.

Created attachment 16899 [details]
slurmctld.log-20201124.gz
Created attachment 16900 [details]
slurmctld.log-20201120.gz
Created attachment 16901 [details]
slurmctld.log
While I am debugging this, could you please increase the slurmctld debug level to debug2, run a job, check that it gets blocked for the mentioned reasons, and then send me the slurmctld log? With debug2 messages I will be able to see more than with just 'error'.

SlurmctldDebug=debug2

Also, can you restart slurmctld after setting debug2? I'd like to see the entire initialization sequence of the Slurm controller at this verbosity. To summarize:

1. Change SlurmctldDebug to debug2
2. Stop and start slurmctld
3. Submit a job and check that it gets the wrong pending reason
4. Send me the log

Thank you

We are unsure if this was triggered by a version mismatch between slurmd and the Slurm commands. Now that we have updated all nodes to the same version, we no longer see the issue.
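The debug2 procedure above can be sketched as a shell session. The config path, the systemd unit name, and the log location below are assumptions for illustration, not details taken from this site's setup:

```shell
# 1. Raise controller verbosity by editing slurm.conf (path assumed to be
#    /etc/slurm/slurm.conf) and setting:
#      SlurmctldDebug=debug2
# 2. Restart the controller so the entire init sequence is logged at debug2
systemctl restart slurmctld            # assumes a systemd-managed slurmctld
# 3. Submit a throwaway job and watch its pending reason
sbatch --wrap="sleep 60"
squeue -u "$USER" -o "%i %T %r"        # job id, state, pending reason
# 4. Collect the log for the ticket (SlurmctldLogFile path is an assumption)
grep -i energy /var/log/slurm/slurmctld.log
```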
Created attachment 16844 [details] slurm.conf

Jobs are showing MaxEnergyPerAccount and QOSGrpEnergy in the NODELIST(REASON) column. We do not have any energy QOS or resource limits in place. Any idea what could be causing this?

9071964_[7003-8787      pe2 SCEPTRE-  jmorris PD  0:00      1 (MaxEnergyPerAccount)
9076733              bigmem submit_s   fgaiti PD  0:00      1 (QOSGrpEnergy)

JobId=9076499 ArrayJobId=9071964 ArrayTaskId=4207 JobName=SCEPTRE-RNA
   UserId=jmorris(50680) GroupId=nslab(9013) MCS_label=N/A
   Priority=234669 Nice=0 Account=nslab QOS=nslab
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=01:19:29 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2020-11-25T16:15:33 EligibleTime=2020-11-25T16:15:33
   AccrueTime=2020-11-25T16:15:33
   StartTime=2020-11-25T20:04:41 EndTime=2020-11-25T21:24:10 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-25T20:04:41
   Partition=pe2 AllocNode:Sid=pe2-login01:86721
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=pe2cc2-038
   BatchHost=pe2cc2-038
   NumNodes=1 NumCPUs=2 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=4G,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs/commons/groups/sanjana_lab/jmorris/190124_CRISPR-GWAS/200916_FullPilot/bin/run_SCEPTRE-RNA.sh
   WorkDir=/gpfs/commons/groups/sanjana_lab/jmorris/190124_CRISPR-GWAS/200916_FullPilot
   StdErr=/gpfs/commons/groups/sanjana_lab/jmorris/190124_CRISPR-GWAS/200916_FullPilot/logs/9071964_4207.out
   StdIn=/dev/null
   StdOut=/gpfs/commons/groups/sanjana_lab/jmorris/190124_CRISPR-GWAS/200916_FullPilot/logs/9071964_4207.out
   Power=

[lhuang@pe2-login01 ~]$ scontrol show job 9076733
JobId=9076733 JobName=submit_splice_pipeline
   UserId=fgaiti(50293) GroupId=dllab(9012) MCS_label=N/A
   Priority=119998 Nice=0 Account=dllab QOS=dllab
   JobState=PENDING Reason=QOSGrpEnergy Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=10-00:00:00 TimeMin=N/A
   SubmitTime=2020-11-25T20:22:08 EligibleTime=2020-11-25T20:22:08
   AccrueTime=2020-11-25T20:22:08
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-11-25T23:04:01
   Partition=bigmem AllocNode:Sid=pe2-login01:3352
   ReqNodeList=(null) ExcNodeList=(null) NodeList=(null)
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=100000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=100000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs/commons/groups/landau_lab/SF3B1_splice_project/9.Splice_pipeline_example_run/P6/P6_run_splice_pipelinev4.sh
   WorkDir=/gpfs/commons/groups/landau_lab/SF3B1_splice_project/9.Splice_pipeline_example_run/P6
   StdErr=/gpfs/commons/groups/landau_lab/SF3B1_splice_project/9.Splice_pipeline_example_run/P6/slurm-9076733.out
   StdIn=/dev/null
   StdOut=/gpfs/commons/groups/landau_lab/SF3B1_splice_project/9.Splice_pipeline_example_run/P6/slurm-9076733.out
   Power=