Ticket 14084

Summary: sacct showing incorrect MaxRSS values
Product: Slurm Reporter: DRW GridOps <gridadm>
Component: Accounting    Assignee: Oscar Hernández <oscar.hernandez>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: felip.moll
Version: 21.08.6   
Hardware: Linux   
OS: Linux   
Site: DRW Trading
Attachments: slurmdbd.conf
slurm.conf

Description DRW GridOps 2022-05-17 06:53:59 MDT
Created attachment 25061 [details]
slurmdbd.conf

After jobs complete on our cluster, we are seeing very low MaxRSS values, generally around 1MB. These do not line up with the values we see when we take a snapshot of the running processes.

For example, job 1422205 reports 1028k:

# sacct -j 1422205 --format jobid,maxrss
JobID            MaxRSS
------------ ----------
1422205           1028K

However, when examining the running job on the compute node, this is what I see:
# ps  -o pid,ppid,cmd,rss --pid 1080584 --pid 1080569 --pid 1080558
    PID    PPID CMD                           RSS
1080558       1 slurmstepd: [1422205.0]      5988
1080569 1080558 /bin/bash /pool/netapp/home  3500
1080584 1080569 /home/mventrice/miniconda3/ 135792

Note that the actual RSS is around 135x what sacct reported. We're currently seeing this problem on all the jobs we've spot-checked, and it appears to have been going on for a long time, possibly the life of the cluster. (It is difficult to tell for certain since we do not have historical ps snapshots.)

Note that while jobs are running, sacct reports MaxRSS as a blank value (which makes sense as it represents a peak over the lifetime of the job).  After the job completes, the field is populated with a low value.

I am rating this as 'medium' because we have several users migrating into this cluster, and they are attempting to right-size their memory reservations.  This is greatly hampered by them being unable to examine the past performance of jobs.
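
For anyone lining these numbers up: here is a minimal sketch of a helper we use (our own illustration, not a Slurm tool) to normalize sacct's MaxRSS strings to bytes so they can be compared against ps RSS, which reports KiB. It assumes sacct's default K/M/G/T suffixes.

```python
# Convert a sacct-style MaxRSS string (e.g. "1028K", "1.5G") to bytes.
# Hypothetical helper for comparing sacct output against `ps -o rss` (KiB).
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def maxrss_to_bytes(value):
    value = value.strip()
    if not value:
        return None  # sacct leaves MaxRSS blank while the job is still running
    if value[-1] in _UNITS:
        return int(float(value[:-1]) * _UNITS[value[-1]])
    return int(value)  # bare number: already bytes

print(maxrss_to_bytes("1028K"))  # 1052672
```

The 1048948 KiB reported by ps for the stress workers above comes out to roughly 1 GiB each, versus the ~1 MiB that sacct recorded.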
Comment 1 DRW GridOps 2022-05-17 06:54:14 MDT
Created attachment 25062 [details]
slurm.conf
Comment 2 DRW GridOps 2022-05-17 14:06:55 MDT
Here's a clean reproduction of it:

--

cseraphine@chhq-supgcmp001-14:40:44-/home/cseraphine$ srun -n 4  -w $HOSTNAME stress --vm 4  --vm-stride 8 --vm-hang 30 --vm-keep --vm-bytes 1G -t 200 &
[1] 32190
cseraphine@chhq-supgcmp001-14:40:56-/home/cseraphine$ stress: info: [32207] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [32209] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [32208] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [32210] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd

cseraphine@chhq-supgcmp001-14:40:57-/home/cseraphine$ ps -u cseraphine -o pid,cmd,rss
    PID CMD                           RSS
  27087 /lib/systemd/systemd --user  9812
  27197 -bash                        8456
  32190 srun -n 4 -w chhq-supgcmp00  8352
  32193 srun -n 4 -w chhq-supgcmp00   888
  32207 /usr/bin/stress --vm 4 --vm   972
  32208 /usr/bin/stress --vm 4 --vm   968
  32209 /usr/bin/stress --vm 4 --vm   972
  32210 /usr/bin/stress --vm 4 --vm  1032
  32219 /usr/bin/stress --vm 4 --vm 1048948
  32220 /usr/bin/stress --vm 4 --vm 1048948
  32221 /usr/bin/stress --vm 4 --vm 1048948
  32222 /usr/bin/stress --vm 4 --vm 1048948
  32223 /usr/bin/stress --vm 4 --vm 1048948
  32224 /usr/bin/stress --vm 4 --vm 1048948
  32225 /usr/bin/stress --vm 4 --vm 1048948
  32226 /usr/bin/stress --vm 4 --vm 1048884
  32227 /usr/bin/stress --vm 4 --vm 1048948
  32228 /usr/bin/stress --vm 4 --vm 1048884
  32229 /usr/bin/stress --vm 4 --vm 1049004
  32230 /usr/bin/stress --vm 4 --vm 1048884
  32231 /usr/bin/stress --vm 4 --vm 1049004
  32232 /usr/bin/stress --vm 4 --vm 1048884
  32233 /usr/bin/stress --vm 4 --vm 1049004
  32234 /usr/bin/stress --vm 4 --vm 1049004
  32236 ps -u cseraphine -o pid,cmd  3568
cseraphine@chhq-supgcmp001-14:41:03-/home/cseraphine$ squeue -u cseraphine
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1475020 slurm21pm   stress cseraphi  R       0:38      1 chhq-supgcmp001
cseraphine@chhq-supgcmp001-14:41:34-/home/cseraphine$ sacct -j 1475020 --format jobid,jobname,state,maxrss
JobID           JobName      State     MaxRSS
------------ ---------- ---------- ----------
1475020          stress    RUNNING
cseraphine@chhq-supgcmp001-14:42:10-/home/cseraphine$ scontrol show job 1475020
JobId=1475020 JobName=stress
   UserId=cseraphine(29599) GroupId=cseraphine(29599) MCS_label=N/A
   Priority=3292 Nice=0 Account=gridadmins QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:01:37 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2022-05-17T14:40:56 EligibleTime=2022-05-17T14:40:56
   AccrueTime=Unknown
   StartTime=2022-05-17T14:40:56 EndTime=2022-05-17T15:40:56 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-17T14:40:56 Scheduler=Main
   Partition=slurm21pmain AllocNode:Sid=chhq-supgcmp001:27197
   ReqNodeList=chhq-supgcmp001 ExcNodeList=(null)
   NodeList=chhq-supgcmp001
   BatchHost=chhq-supgcmp001
   NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=4G,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=stress
   WorkDir=/home/cseraphine
   Power=


cseraphine@chhq-supgcmp001-14:42:33-/home/cseraphine$ fg
srun -n 4 -w $HOSTNAME stress --vm 4 --vm-stride 8 --vm-hang 30 --vm-keep --vm-bytes 1G -t 200
stress: info: [32208] successful run completed in 200s
stress: info: [32210] successful run completed in 200s
stress: info: [32207] successful run completed in 200s
stress: info: [32209] successful run completed in 200s
cseraphine@chhq-supgcmp001-14:44:16-/home/cseraphine$ sacct -j 1475020 --format jobid,jobname,state,maxrss
JobID           JobName      State     MaxRSS
------------ ---------- ---------- ----------
1475020          stress  COMPLETED      1036K
Comment 3 Oscar Hernández 2022-05-18 02:13:29 MDT
Hi,

After taking a look at your slurm.conf, it looks like you have the following configuration:

JobAcctGatherFrequency=energy=0,filesystem=0,network=0,task=0

If these values are set to 0, as yours are, accounting gathering is disabled until job termination, at which point a single sample is collected (that is the inaccurate "MaxRSS" value you are observing). The idea is to avoid even minimal interference from Slurm with the processes running on the node.

For the information you want, enabling task gathering is enough. Could you please repeat your test case after changing the values (in slurm.conf) to:

JobAcctGatherFrequency=energy=0,filesystem=0,network=0,task=30

After the change, make it effective with:

$ scontrol reconfigure

I am suggesting 30s for the frequency because it is the value we configure by default, but feel free to change it to a value that better fits your needs. You will find more information here:

https://slurm.schedmd.com/slurm.conf.html#OPT_JobAcctGatherFrequency
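
To illustrate the semantics of that option, here is a small sketch (our own illustration, not Slurm code) that parses a JobAcctGatherFrequency string into per-datatype intervals; an interval of 0 means periodic sampling for that datatype is disabled:

```python
# Parse a JobAcctGatherFrequency value such as
# "energy=0,filesystem=0,network=0,task=30" into a dict of intervals.
# Hypothetical helper; function and variable names are our own.
def parse_gather_frequency(value):
    freq = {}
    for item in value.split(","):
        key, _, secs = item.partition("=")
        freq[key.strip()] = int(secs)
    return freq

freq = parse_gather_frequency("energy=0,filesystem=0,network=0,task=30")
# task=30: task accounting is sampled every 30 seconds.
# A value of 0 disables periodic sampling for that datatype.
sampling_enabled = {k: v > 0 for k, v in freq.items()}
```

After running scontrol reconfigure, you can confirm the active value with: scontrol show config | grep JobAcctGatherFrequency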

Please try the suggested changes and let us know if they work as expected.

Kind regards,

Oscar
Comment 4 DRW GridOps 2022-05-18 09:47:51 MDT
Oh good, I was hoping it was something stupid/easy like that, thank you!

The docs seemed to suggest that setting the intervals to zero would result in no data being gathered during the job's run:

> If the task sampling interval is 0, accounting information is collected only at job termination (reducing Slurm interference with the job).

That is 100% what we are seeing, but I believe it should be made explicit that the MaxRSS information that _does_ get recorded at job end might be garbage. In *hindsight* this makes perfect sense (how do you see the peak if you aren't looking at the process during its lifecycle?), but apparently it is subtle enough to fool at least some Slurm admins :-(

Could we get a few words tacked onto that sentence to reflect that? Something like:

"(reducing Slurm interference with the job, and making accurate tracking of values such as MaxRSS impractical)"

Thanks!
Comment 6 Oscar Hernández 2022-05-18 11:17:22 MDT
Glad it worked out!
 
I see your point; the simple fact of having a value recorded at all might be a little misleading. We will certainly take your comments into consideration.

I am closing this bug for now. If you have any follow-up questions, feel free to re-open it.

Regards,
Oscar