Ticket 15630 - CPU Utilization for Time Out Jobs
Summary: CPU Utilization for Time Out Jobs
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 22.05.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact: Ben Roberts
URL:
Depends on:
Blocks:
 
Reported: 2022-12-15 07:45 MST by Paul Edmon
Modified: 2023-02-22 08:10 MST

See Also:
Site: Harvard University
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02.0
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm_epilog (9.30 KB, application/x-shellscript)
2022-12-16 11:41 MST, Paul Edmon

Description Paul Edmon 2022-12-15 07:45:18 MST
We've noticed an interesting behavior: jobs that go into TIMEOUT in the scheduler do not have their child processes accounted for in their TotalCPU field. For instance, this job:

[root@holy7c22501 ~]# seff 35510337
Job ID: 35510337
Cluster: odyssey
User/Group: aunzuetamartinez/girguis_lab
State: TIMEOUT (exit code 0)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 10:00:10 core-walltime
Job Wall-clock time: 10:00:10
Memory Utilized: 840.41 MB
Memory Efficiency: 0.82% of 100.00 GB

The job definitely did something, as it ran for 10 hours and used nearly 1 GB of memory, but it shows no CPU time.

[root@holy7c22501 ~]# sacct -j 35510337 -o TotalCPU
  TotalCPU 
---------- 
 00:00.023 
 00:00.022 
 00:00.001 

We noticed a question posted to the slurm-users group by our former director that raised the same issue 8 years ago, and it still seems to be the case. See: https://groups.google.com/g/slurm-users/c/3aMM7yKvepc/m/GSiroHLOG9sJ

Can the child process tracking be fixed for TIMEOUT and other non-COMPLETED job states? Even if a job doesn't reach the COMPLETED state, it may still have run as intended: if it hits TIMEOUT, the user may just be able to restart it. The same applies to jobs that are preempted or otherwise terminated early.
Comment 2 Marcin Stolarek 2022-12-16 04:14:52 MST
I tried to reproduce the issue without success:

>[vagrant@slurmctl ~]$ sbatch -t1 --wrap="/bin/yes > /dev/null"
>Submitted batch job 4694
>[vagrant@slurmctl ~]$ sacct -j 4694 -o JobId,Time,CpuTime,TotalCPU,Elapsed,State
>JobID         Timelimit    CPUTime   TotalCPU    Elapsed      State 
>------------ ---------- ---------- ---------- ---------- ---------- 
>4694           00:01:00   00:02:50  01:19.026   00:01:25    TIMEOUT 
>4694.batch                00:02:50  01:19.026   00:01:25  CANCELLED

Are you sure that the timed-out job actually consumed CPU time? For instance, for `sleep`, CPU time is expected to be close to 0.

>[vagrant@slurmctl ~]$ sbatch -t1 --wrap="sleep 1000"
>Submitted batch job 4693
>[vagrant@slurmctl ~]$  sacct -j 4693 -o JobId,Time,CpuTime,TotalCPU,Elapsed,State
>JobID         Timelimit    CPUTime   TotalCPU    Elapsed      State 
>------------ ---------- ---------- ---------- ---------- ---------- 
>4693           00:01:00   00:02:26  00:00.016   00:01:13    TIMEOUT 
>4693.batch                00:02:26  00:00.016   00:01:13  CANCELLED
Comment 3 Paul Edmon 2022-12-16 11:39:35 MST
We did some testing with your reproduction commands and found that we got 0.

[nweeks@builds02 ~]$ sbatch -t 1 -p test --wrap="/bin/yes > /dev/null"
Submitted batch job 36019264
[nweeks@builds02 ~]$ seff 36019264
Job ID: 36019264
Cluster: odyssey
User/Group: nweeks/informatics
State: TIMEOUT (exit code 0)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:01:01 core-walltime
Job Wall-clock time: 00:01:01
Memory Utilized: 5.82 MB
Memory Efficiency: 5.82% of 100.00 MB
[nweeks@builds02 ~]$ sbatch -t 2 -p test --wrap="/bin/yes > /dev/null"
Submitted batch job 36019296
[nweeks@builds02 ~]$ seff 36019296
Job ID: 36019296
Cluster: odyssey
User/Group: nweeks/informatics
State: TIMEOUT (exit code 0)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:02:36 core-walltime
Job Wall-clock time: 00:02:36
Memory Utilized: 5.82 MB
Memory Efficiency: 5.82% of 100.00 MB

We suspect that this may be due to our epilog, which terminates user processes to make sure there are none remaining. I will post a copy of our epilog in the bug report. Would the epilog run prior to Slurm's own cleanup of the job? Could it mess up the accounting for jobs that exit due to TIMEOUT?

We also tested hopping on a node and killing the PID directly with kill; the efficiency likewise shows up as 0.

-Paul Edmon-

Comment 4 Paul Edmon 2022-12-16 11:41:53 MST
Created attachment 28224 [details]
slurm_epilog
Comment 7 Marcin Stolarek 2022-12-19 05:12:05 MST
Paul,

I did reproduce the issue. The key was that I didn't use the same ProctrackType plugin as you (based on the configs you shared with us before).
I don't think the pkill in your epilog affects this behavior.

I'll continue the analysis to fully understand the mechanics of CPUTime gathering. I'll keep you posted.

cheers,
Marcin
Comment 12 Marcin Stolarek 2023-01-02 00:49:43 MST
Paul,

I wanted to share the details of the code analysis I did. The CPU time of computing processes is gathered by Slurm using the last argument of wait3[1] (rusage) while waiting for the Slurm-spawned process (i.e., we don't directly wait for processes spawned from the batch/step script). The only documentation I found on how rusage is constructed by the OS is the getrusage[2] man page. From the experiments I did, I see that the rusage returned by wait3 behaves like getrusage called with RUSAGE_CHILDREN.

Which means:
>       RUSAGE_CHILDREN
>              Return resource usage statistics for all children of the
>              calling process that have terminated and been waited for.

and in NOTES we find:
>       Resource usage metrics are preserved across an execve(2).

The issue here is that for all signals (except SIGKILL) Slurm is supposed to signal only the main process and let it handle signaling its child processes. It looks like in the case of a timeout, or just an scancel on a job, signaling only the main process doesn't result in rusage being gathered as explained by the getrusage man page quotes above. It then contains only the resource usage of the Slurm-spawned process, which is very low since the real computing is done by its descendants - forks and execve's.

I see that this behavior (and the way Slurm signaling is done) differs for proctrack/pgid, where we actually use killpg[3] instead of kill, which has an impact on the rusage returned by the mentioned wait3[4] call.

I also see that the documentation for TotalCPU[5] mentions:
>NOTE: TotalCPU provides a measure of the task's parent process and does not include CPU time of child processes.
which isn't correct for a computing job that completed in the regular way (in that case it does contain all the resources used, as expected per the getrusage manual).

Looking at the code history, this hasn't changed for more than 10 years. With that in mind, I'm thinking about documenting those details/exceptions rather than looking for code changes, especially because any potential change would touch the essential process-signaling path, while the impact we noticed is on accounting that is based on OS functionality.

Let me know what you think.

cheers,
Marcin


[1]https://linux.die.net/man/2/wait3
[2]https://man7.org/linux/man-pages/man2/getrusage.2.html
[3]https://man7.org/linux/man-pages/man3/killpg.3.html
[4]https://github.com/SchedMD/slurm/blob/slurm-22-05-3-1/src/slurmd/slurmstepd/mgr.c#L2159
[5]https://slurm.schedmd.com/sacct.html#OPT_TotalCPU
Comment 13 Paul Edmon 2023-01-03 07:30:24 MST
Interesting. What I would be curious about is whether job termination can be unified so it is handled the same way everywhere, rather than having disparate methods for terminating jobs. If that is too complicated, then a note in the docs would, I think, be sufficient so that people are aware of the behavior.

-Paul Edmon-

Comment 14 Marcin Stolarek 2023-01-03 23:14:06 MST
That's what I'm thinking about - documenting how things work in Slurm 22.05 and opening an internal ticket to check whether we can avoid those exceptions through code changes. The major issue is that even though the proctrack/pgid behavior differs from what is documented, we're very cautious with long-term behavior changes, since we know that users rely on the behavior as it is in those cases.

I'll keep you posted.

cheers,
Marcin
Comment 22 Marcin Stolarek 2023-02-22 04:57:24 MST
Paul,

We've updated our docs to better explain the behavior in 89e8ecb54[1]. Since it's really a result of how the OS handles getrusage[2], I don't see a way to change/improve the behavior in Slurm.

Let me know if you have any questions.

cheers,
Marcin
[1]https://github.com/SchedMD/slurm/commit/89e8ecb5471fa5e95c03892eafdb0f630e21b51b
[2]https://man7.org/linux/man-pages/man2/getrusage.2.html
Comment 23 Paul Edmon 2023-02-22 07:28:19 MST
Thanks!

-Paul Edmon-
