Created attachment 25061 [details]
slurmdbd.conf

After jobs complete on our cluster, we are seeing very low values in RSS, generally around 1MB. These do not line up with the values we see when we take a snapshot of the running processes. For example, job 1422205 reports 1028K:

# sacct -j 1422205 --format jobid,maxrss
       JobID     MaxRSS
------------ ----------
1422205           1028K

However, when examining the running job on the compute node, this is what I see:

# ps -o pid,ppid,cmd,rss --pid 1080584 --pid 1080569 --pid 1080558
    PID    PPID CMD                            RSS
1080558       1 slurmstepd: [1422205.0]       5988
1080569 1080558 /bin/bash /pool/netapp/home   3500
1080584 1080569 /home/mventrice/miniconda3/ 135792

Note that the RSS is around 135x what sacct reported. We're currently seeing this problem on all the jobs we've spot checked, and it appears to have been going on for a long time, possibly the life of the cluster. (It is difficult to tell for certain since we do not have historical ps snapshots.)

Note that while jobs are running, sacct reports MaxRSS as a blank value (which makes sense, as it represents a peak over the lifetime of the job). After the job completes, the field is populated with a low value.

I am rating this as 'medium' because we have several users migrating into this cluster, and they are attempting to right-size their memory reservations. This is greatly hampered by them being unable to examine the past performance of jobs.
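When comparing the two tools, note that ps prints RSS in KiB while sacct appends a unit suffix. A small helper along these lines (hypothetical, not part of Slurm) normalizes sacct's suffixed MaxRSS strings to plain KiB so the numbers can be compared directly:

```shell
# Hypothetical helper: convert a sacct MaxRSS string such as "1028K",
# "2M", or "1.5G" into plain KiB, matching the unit of ps's RSS column.
rss_to_kib() {
  echo "$1" | awk '
    /K$/ { printf "%d\n", $0 + 0;                 next }
    /M$/ { printf "%d\n", ($0 + 0) * 1024;        next }
    /G$/ { printf "%d\n", ($0 + 0) * 1024 * 1024; next }
         { printf "%d\n", $0 + 0 }'
}

rss_to_kib 1028K    # -> 1028
rss_to_kib 2G       # -> 2097152
```

With both values in KiB, the 1028K reported by sacct against the 135792 KiB shown by ps is an apples-to-apples comparison.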
Created attachment 25062 [details] slurm.conf
Here's a clean reproduction of it:
--
cseraphine@chhq-supgcmp001-14:40:44-/home/cseraphine$ srun -n 4 -w $HOSTNAME stress --vm 4 --vm-stride 8 --vm-hang 30 --vm-keep --vm-bytes 1G -t 200 &
[1] 32190
cseraphine@chhq-supgcmp001-14:40:56-/home/cseraphine$ stress: info: [32207] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [32209] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [32208] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
stress: info: [32210] dispatching hogs: 0 cpu, 0 io, 4 vm, 0 hdd
cseraphine@chhq-supgcmp001-14:40:57-/home/cseraphine$ ps -u cseraphine -o pid,cmd,rss
  PID CMD                             RSS
27087 /lib/systemd/systemd --user    9812
27197 -bash                          8456
32190 srun -n 4 -w chhq-supgcmp00    8352
32193 srun -n 4 -w chhq-supgcmp00     888
32207 /usr/bin/stress --vm 4 --vm     972
32208 /usr/bin/stress --vm 4 --vm     968
32209 /usr/bin/stress --vm 4 --vm     972
32210 /usr/bin/stress --vm 4 --vm    1032
32219 /usr/bin/stress --vm 4 --vm 1048948
32220 /usr/bin/stress --vm 4 --vm 1048948
32221 /usr/bin/stress --vm 4 --vm 1048948
32222 /usr/bin/stress --vm 4 --vm 1048948
32223 /usr/bin/stress --vm 4 --vm 1048948
32224 /usr/bin/stress --vm 4 --vm 1048948
32225 /usr/bin/stress --vm 4 --vm 1048948
32226 /usr/bin/stress --vm 4 --vm 1048884
32227 /usr/bin/stress --vm 4 --vm 1048948
32228 /usr/bin/stress --vm 4 --vm 1048884
32229 /usr/bin/stress --vm 4 --vm 1049004
32230 /usr/bin/stress --vm 4 --vm 1048884
32231 /usr/bin/stress --vm 4 --vm 1049004
32232 /usr/bin/stress --vm 4 --vm 1048884
32233 /usr/bin/stress --vm 4 --vm 1049004
32234 /usr/bin/stress --vm 4 --vm 1049004
32236 ps -u cseraphine -o pid,cmd    3568
cseraphine@chhq-supgcmp001-14:41:03-/home/cseraphine$ squeue -u cseraphine
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
1475020 slurm21pm   stress cseraphi  R       0:38      1 chhq-supgcmp001
cseraphine@chhq-supgcmp001-14:41:34-/home/cseraphine$ sacct -j 1475020 --format jobid,jobname,state,maxrss
       JobID    JobName      State     MaxRSS
------------ ---------- ---------- ----------
1475020          stress    RUNNING
cseraphine@chhq-supgcmp001-14:42:10-/home/cseraphine$ scontrol show job 1475020
JobId=1475020 JobName=stress
   UserId=cseraphine(29599) GroupId=cseraphine(29599) MCS_label=N/A
   Priority=3292 Nice=0 Account=gridadmins QOS=normal
   JobState=RUNNING Reason=None Dependency=(null) Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:01:37 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2022-05-17T14:40:56 EligibleTime=2022-05-17T14:40:56
   AccrueTime=Unknown
   StartTime=2022-05-17T14:40:56 EndTime=2022-05-17T15:40:56 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-05-17T14:40:56 Scheduler=Main
   Partition=slurm21pmain AllocNode:Sid=chhq-supgcmp001:27197
   ReqNodeList=chhq-supgcmp001 ExcNodeList=(null)
   NodeList=chhq-supgcmp001
   BatchHost=chhq-supgcmp001
   NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=4,mem=4G,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=stress
   WorkDir=/home/cseraphine
   Power=

cseraphine@chhq-supgcmp001-14:42:33-/home/cseraphine$ fg
srun -n 4 -w $HOSTNAME stress --vm 4 --vm-stride 8 --vm-hang 30 --vm-keep --vm-bytes 1G -t 200
stress: info: [32208] successful run completed in 200s
stress: info: [32210] successful run completed in 200s
stress: info: [32207] successful run completed in 200s
stress: info: [32209] successful run completed in 200s
cseraphine@chhq-supgcmp001-14:44:16-/home/cseraphine$ sacct -j 1475020 --format jobid,jobname,state,maxrss
       JobID    JobName      State     MaxRSS
------------ ---------- ---------- ----------
1475020          stress  COMPLETED      1036K
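To put rough numbers on the mismatch in this reproduction: each of the 4 tasks runs `stress --vm 4 --vm-bytes 1G`, i.e. 4 workers keeping ~1 GiB resident each (visible as the ~1048948 KiB entries in the ps output), so the true per-task peak RSS should be on the order of 4 GiB, yet sacct reports about 1 MiB. A back-of-the-envelope check (plain arithmetic, no Slurm required):

```shell
# Each task spawns 4 vm workers of 1 GiB, expressed in KiB
# (the unit both ps and sacct's "K" suffix use):
expected_kib=$((4 * 1024 * 1024))
reported_kib=1036   # MaxRSS sacct showed after completion
echo "expected ~${expected_kib}K, sacct reported ${reported_kib}K"
echo "underreported by a factor of ~$((expected_kib / reported_kib))"
```

That is roughly a factor of 4000 between the actual peak and the recorded MaxRSS.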
Hi,

After taking a look at your slurm.conf, it looks like you have the following configuration:

JobAcctGatherFrequency=energy=0,filesystem=0,network=0,task=0

If these values are set to 0, as yours are, gathering is disabled until job termination, at which point a single metric is collected (that is the inaccurate "MaxRSS" value you are observing). The idea is to minimize Slurm's interference with the processes running on the node.

For the information you want, it is enough to enable task gathering, so could you please repeat your test case after changing the value (in slurm.conf) to:

JobAcctGatherFrequency=energy=0,filesystem=0,network=0,task=30

After the change, you will need to make it effective with:

$ scontrol reconfigure

I am suggesting 30s for the frequency because it is the value we configure by default, but feel free to change it to a value that better fits your needs.

You will find more information here:
https://slurm.schedmd.com/slurm.conf.html#OPT_JobAcctGatherFrequency

Please try the suggested changes and let us know if they work as expected.

Kind regards,
Oscar
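For admins who want to audit this on other clusters, a tiny helper along these lines (hypothetical, not a Slurm tool) pulls the task sampling interval out of a JobAcctGatherFrequency string, such as the one printed by `scontrol show config`; a value of 0 means per-sample gathering is disabled and MaxRSS cannot be trusted:

```shell
# Hypothetical helper: extract the task sampling interval from a
# JobAcctGatherFrequency value, e.g. "energy=0,filesystem=0,network=0,task=30".
task_interval() {
  echo "$1" | tr ',' '\n' | awk -F= '$1 == "task" {print $2}'
}

task_interval "energy=0,filesystem=0,network=0,task=30"   # -> 30
task_interval "energy=0,filesystem=0,network=0,task=0"    # -> 0
```

On a live system this could be fed from the running configuration, e.g. `task_interval "$(scontrol show config | awk '/JobAcctGatherFrequency/ {print $3}')"`, though the exact field position in scontrol's output should be verified on your version.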
Oh good, I was hoping it was something stupid/easy like that, thank you!

The docs seemed to suggest that setting the intervals to zero would result in no data being gathered during the job's run:

> If the task sampling interval is 0, accounting information is collected only at job termination (reducing Slurm interference with the job).

That is 100% what we are seeing, but I believe it should be made explicit that the MaxRSS information that _does_ get applied at job end might be garbage. In *hindsight* this makes perfect sense (how do you see the peaks if you aren't looking at the process during its lifecycle?), but apparently it is subtle enough to fool at least some Slurm admins :-(

Could we get a few words tacked onto that sentence to reflect that? Something like: "(reducing Slurm interference with the job, and making accurate tracking of values such as MaxRSS impractical)"

Thanks!
Glad it worked out!

I see your point; the simple fact of having it accounted might be a little misleading. We will certainly take your comments into consideration.

I am closing this bug for now. If you have any follow-up questions, feel free to re-open it.

Regards,
Oscar