| Summary: | sacct showing incorrect MaxRSS values | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | DRW GridOps <gridadm> |
| Component: | Accounting | Assignee: | Oscar Hernández <oscar.hernandez> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | felip.moll |
| Version: | 21.08.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | DRW Trading | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurmdbd.conf, slurm.conf | | |
Created attachment 25062 [details]
slurm.conf
Here's a clean reproduction of it:
--
cseraphine@chhq-supgcmp001-14:40:44-/home/cseraphine$ srun -n 4 -w $HOSTNAME stress --vm 4 --vm-stride 8 --vm-hang 30 --vm-keep --vm-bytes 1G -t 200 &
[1] 32190
cseraphine@chhq-supgcmp001-14:40:56-/home/cseraphine$ stress: info: [32207] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [32209] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [32208] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [32210] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
cseraphine@chhq-supgcmp001-14:40:57-/home/cseraphine$ ps -u cseraphine -o pid,cmd,rss
PID CMD RSS
27087 /lib/systemd/systemd --user 9812
27197 -bash 8456
32190 srun -n 4 -w chhq-supgcmp00 8352
32193 srun -n 4 -w chhq-supgcmp00 888
32207 /usr/bin/stress --vm 4 --vm 972
32208 /usr/bin/stress --vm 4 --vm 968
32209 /usr/bin/stress --vm 4 --vm 972
32210 /usr/bin/stress --vm 4 --vm 1032
32219 /usr/bin/stress --vm 4 --vm 1048948
32220 /usr/bin/stress --vm 4 --vm 1048948
32221 /usr/bin/stress --vm 4 --vm 1048948
32222 /usr/bin/stress --vm 4 --vm 1048948
32223 /usr/bin/stress --vm 4 --vm 1048948
32224 /usr/bin/stress --vm 4 --vm 1048948
32225 /usr/bin/stress --vm 4 --vm 1048948
32226 /usr/bin/stress --vm 4 --vm 1048884
32227 /usr/bin/stress --vm 4 --vm 1048948
32228 /usr/bin/stress --vm 4 --vm 1048884
32229 /usr/bin/stress --vm 4 --vm 1049004
32230 /usr/bin/stress --vm 4 --vm 1048884
32231 /usr/bin/stress --vm 4 --vm 1049004
32232 /usr/bin/stress --vm 4 --vm 1048884
32233 /usr/bin/stress --vm 4 --vm 1049004
32234 /usr/bin/stress --vm 4 --vm 1049004
32236 ps -u cseraphine -o pid,cmd 3568
cseraphine@chhq-supgcmp001-14:41:03-/home/cseraphine$ squeue -u cseraphine
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1475020 slurm21pm stress cseraphi R 0:38 1 chhq-supgcmp001
cseraphine@chhq-supgcmp001-14:41:34-/home/cseraphine$ sacct -j 1475020 --format jobid,jobname,state,maxrss
JobID JobName State MaxRSS
------------ ---------- ---------- ----------
1475020 stress RUNNING
cseraphine@chhq-supgcmp001-14:42:10-/home/cseraphine$ scontrol show job 1475020
JobId=1475020 JobName=stress
UserId=cseraphine(29599) GroupId=cseraphine(29599) MCS_label=N/A
Priority=3292 Nice=0 Account=gridadmins QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:01:37 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2022-05-17T14:40:56 EligibleTime=2022-05-17T14:40:56
AccrueTime=Unknown
StartTime=2022-05-17T14:40:56 EndTime=2022-05-17T15:40:56 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-17T14:40:56 Scheduler=Main
Partition=slurm21pmain AllocNode:Sid=chhq-supgcmp001:27197
ReqNodeList=chhq-supgcmp001 ExcNodeList=(null)
NodeList=chhq-supgcmp001
BatchHost=chhq-supgcmp001
NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=4,mem=4G,node=1,billing=4
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=stress
WorkDir=/home/cseraphine
Power=
cseraphine@chhq-supgcmp001-14:42:33-/home/cseraphine$ fg
srun -n 4 -w $HOSTNAME stress --vm 4 --vm-stride 8 --vm-hang 30 --vm-keep --vm-bytes 1G -t 200
stress: info: [32208] successful run completed in 200s
stress: info: [32210] successful run completed in 200s
stress: info: [32207] successful run completed in 200s
stress: info: [32209] successful run completed in 200s
cseraphine@chhq-supgcmp001-14:44:16-/home/cseraphine$ sacct -j 1475020 --format jobid,jobname,state,maxrss
JobID JobName State MaxRSS
------------ ---------- ---------- ----------
1475020 stress COMPLETED 1036K
Hi,

After taking a look at your slurm.conf, it looks like you have the following configuration:

JobAcctGatherFrequency=energy=0,filesystem=0,network=0,task=0

When these values are set to 0, as yours are, job gathering is disabled until job termination, at which point a single metric is collected (that is the inaccurate MaxRSS value you are observing). The idea is to avoid even minimal interference from Slurm with the processes running on the node.

For the information you want, it is enough to enable task gathering. Could you please repeat your test case after changing the value (in slurm.conf) to:

JobAcctGatherFrequency=energy=0,filesystem=0,network=0,task=30

After the change, you will need to make it effective with:

$ scontrol reconfigure

I am suggesting 30s for the frequency because it is the value we configure by default, but feel free to change it to a value that better fits your needs. You will find more information here:

https://slurm.schedmd.com/slurm.conf.html#OPT_JobAcctGatherFrequency

Please try the suggested changes, and let us know if they work as expected.

Kind regards,
Oscar

Oh good, I was hoping it was something stupid/easy like that, thank you!
The docs seemed to suggest that setting the intervals to zero would result in no data being gathered during the job's run,
> If the task sampling interval is 0, accounting information is collected only at job termination (reducing Slurm interference with the job).
That is 100% what we are seeing, but I believe it should be made explicit that the MaxRSS information that _does_ get applied at job end might be garbage. In *hindsight* this makes perfect sense (how do you see the peaks if you aren't looking at the process during its lifecycle?) but apparently it is subtle enough to fool at least some Slurm admins :-(
Could we get a few words tacked onto that sentence to reflect that? Something like:
"(reducing Slurm interference with the job, and making accurate tracking of values such as MaxRSS impractical)"
Thanks!
Glad it worked out! I see your point: the mere fact that a value is recorded at all can be a little misleading. We will certainly take your comments into consideration.

I am closing this bug for now. If you have any follow-up questions, feel free to re-open it.

Regards,
Oscar
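For reference, the change suggested in this thread amounts to a one-line slurm.conf edit. A sketch of the relevant fragment (30 seconds is the default SchedMD mentions above; tune it to taste):

```
# slurm.conf fragment: per-class accounting gather intervals, in seconds.
# task=0 defers all task sampling to job termination, so MaxRSS peaks
# are never observed; task=30 samples running tasks every 30 seconds.
JobAcctGatherFrequency=energy=0,filesystem=0,network=0,task=30
```

As noted in the thread, the change takes effect after running `scontrol reconfigure`.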
Created attachment 25061 [details]
slurmdbd.conf

After jobs complete on our cluster, we are seeing very low MaxRSS values, generally around 1 MB. These do not line up with the values we see when we take a snapshot of the running processes. For example, job 1422205 reports 1028K:

# sacct -j 1422205 --format jobid,maxrss
JobID MaxRSS
------------ ----------
1422205 1028K

However, when examining the running job on the compute node, this is what I see:

# ps -o pid,ppid,cmd,rss --pid 1080584 --pid 1080569 --pid 1080558
PID PPID CMD RSS
1080558 1 slurmstepd: [1422205.0] 5988
1080569 1080558 /bin/bash /pool/netapp/home 3500
1080584 1080569 /home/mventrice/miniconda3/ 135792

Note that the actual RSS is around 135x what sacct reported. We are currently seeing this problem on all the jobs we have spot-checked, and it appears to have been going on for a long time, possibly the life of the cluster. (It is difficult to tell for certain, since we do not have historical ps snapshots.)

Note also that while jobs are running, sacct reports MaxRSS as a blank value (which makes sense, as it represents a peak over the lifetime of the job). After the job completes, the field is populated with a low value.

I am rating this as 'medium' because we have several users migrating onto this cluster who are attempting to right-size their memory reservations. They are greatly hampered by being unable to examine the past performance of their jobs.
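The mismatch described above can be made concrete by converting both numbers to the same unit. A small shell sketch (values copied from this report; `numfmt` is from GNU coreutils and is assumed to be available):

```shell
# Normalize sacct's human-readable MaxRSS (e.g. "1028K") and ps's RSS
# column (reported in KiB) to bytes so the two can be compared directly.
maxrss="1028K"        # e.g. from: sacct -j 1422205 -n -o maxrss
ps_rss_kib=135792     # e.g. from: ps -o rss= --pid 1080584
sacct_bytes=$(numfmt --from=iec "$maxrss")   # IEC suffixes: K = 1024
ps_bytes=$((ps_rss_kib * 1024))
echo "sacct reported $sacct_bytes bytes; ps observed $ps_bytes bytes"
```

Here the two figures differ by roughly two orders of magnitude, which is the gap the report is about.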