| Summary: | CPU Utilization for Time Out Jobs | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
| Component: | Accounting | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED FIXED | QA Contact: | Ben Roberts <ben> |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | cinek |
| Version: | 22.05.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Harvard University | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 23.02.0 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm_epilog | | |
|
Description
Paul Edmon
2022-12-15 07:45:18 MST
Comment 2
Marcin Stolarek

I tried to reproduce the issue without success:

>[vagrant@slurmctl ~]$ sbatch -t1 --wrap="/bin/yes > /dev/null"
>Submitted batch job 4694
>[vagrant@slurmctl ~]$ sacct -j 4694 -o JobId,Time,CpuTime,TotalCPU,Elapsed,State
>JobID           Timelimit    CPUTime   TotalCPU    Elapsed      State
>------------ ---------- ---------- ---------- ---------- ----------
>4694           00:01:00   00:02:50  01:19.026   00:01:25    TIMEOUT
>4694.batch                00:02:50  01:19.026   00:01:25  CANCELLED

Are you sure that the timed out job actually consumed CPU time? For instance, for `sleep` the CPU time is expected to be close to 0:

>[vagrant@slurmctl ~]$ sbatch -t1 --wrap="sleep 1000"
>Submitted batch job 4693
>[vagrant@slurmctl ~]$ sacct -j 4693 -o JobId,Time,CpuTime,TotalCPU,Elapsed,State
>JobID           Timelimit    CPUTime   TotalCPU    Elapsed      State
>------------ ---------- ---------- ---------- ---------- ----------
>4693           00:01:00   00:02:26  00:00.016   00:01:13    TIMEOUT
>4693.batch                00:02:26  00:00.016   00:01:13  CANCELLED

Paul Edmon

We did some testing with your script and found that we got 0:

[nweeks@builds02 ~]$ sbatch -t 1 -p test --wrap="/bin/yes > /dev/null"
Submitted batch job 36019264
[nweeks@builds02 ~]$ seff 36019264
Job ID: 36019264
Cluster: odyssey
User/Group: nweeks/informatics
State: TIMEOUT (exit code 0)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:01:01 core-walltime
Job Wall-clock time: 00:01:01
Memory Utilized: 5.82 MB
Memory Efficiency: 5.82% of 100.00 MB

[nweeks@builds02 ~]$ sbatch -t 2 -p test --wrap="/bin/yes > /dev/null"
Submitted batch job 36019296
[nweeks@builds02 ~]$ seff 36019296
Job ID: 36019296
Cluster: odyssey
User/Group: nweeks/informatics
State: TIMEOUT (exit code 0)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:02:36 core-walltime
Job Wall-clock time: 00:02:36
Memory Utilized: 5.82 MB
Memory Efficiency: 5.82% of 100.00 MB

We suspect that this may be due to our epilog, which terminates user processes to make sure none remain. I will post a copy of our epilog in the bug report.
Paul Edmon

Would the epilog run prior to Slurm's own cleanup of the job? Could it mess up the accounting for jobs that exit due to timeout?

We tested this another way as well: if we hop on a node and kill the PID by hand, the efficiency also shows up as 0.

-Paul Edmon-

Created attachment 28224 [details]
slurm_epilog
Marcin Stolarek

Paul,

I did reproduce the issue. The key was that I didn't use the same ProctrackType plugin as you (based on the configs you shared with us before). I don't think the pkill in your epilog affects the functionality. I'll continue the analysis to fully understand the mechanics of CPUTime gathering. I'll keep you posted.

cheers,
Marcin

Comment 12
Marcin Stolarek

Paul,

I wanted to share the details of the code analysis I did. The CPU time of computing processes is gathered by Slurm from the last argument of wait3[1] (rusage) while waiting for the Slurm-spawned process (i.e. we don't directly wait for processes spawned from the batch/step script). The only documentation I found on how rusage is constructed by the OS is getrusage[2]. From my experiments, the rusage returned by wait3 behaves like getrusage called with RUSAGE_CHILDREN, which means:

> RUSAGE_CHILDREN
>     Return resource usage statistics for all children of the
>     calling process that have terminated and been waited for.

and in NOTES we find:

> Resource usage metrics are preserved across an execve(2).

The issue here is that for all signals (except SIGKILL) Slurm is supposed to signal only the main process and let it handle its child processes. It looks like in the case of a timeout, or simply an scancel on a job, signaling only the main process doesn't result in rusage being gathered as explained by the getrusage quotes above. The rusage then contains only the resource usage of the Slurm-spawned process, which is very low, since the real computing is done by its children - forks and execve's.

I see that this behavior (and the way Slurm signaling is done) differs for proctrack/pgid, where we actually use killpg[3] instead of kill, which has an impact on the rusage returned by the mentioned wait3[4] call.

I also see that the documentation for TotalCPU[5] mentions:

> NOTE: TotalCPU provides a measure of the task's parent process and does not include CPU time of child processes.
which isn't correct for a computing job that completed in a regular way (and the fact that it then contains all the resources used is expected according to the getrusage manual). Looking at the code history, this hasn't changed in more than 10 years. With that in mind, I'm thinking about documenting those details/exceptions rather than looking for code changes, especially because the potential changes are in an essential part of process signaling, and the impact we noticed is on accounting that is based on OS functionality.

Let me know what you think.

cheers,
Marcin

[1] https://linux.die.net/man/2/wait3
[2] https://man7.org/linux/man-pages/man2/getrusage.2.html
[3] https://man7.org/linux/man-pages/man3/killpg.3.html
[4] https://github.com/SchedMD/slurm/blob/slurm-22-05-3-1/src/slurmd/slurmstepd/mgr.c#L2159
[5] https://slurm.schedmd.com/sacct.html#OPT_TotalCPU

Paul Edmon

Interesting. What I would be curious about is whether job termination can be unified so that it is all done the same way, rather than having disparate methods for terminating jobs. If that is too complicated, then I think a note in the docs would be sufficient so that people are aware of the behavior.

-Paul Edmon-
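For the record, the rusage behavior Marcin describes can be demonstrated outside of Slurm. The sketch below is an illustration, not Slurm code: Python's `os.wait4` exposes the same rusage structure that `wait3` fills in, and the function names (`burn_cpu`, `child_cpu_via_wait4`) are illustrative. A forked child's CPU time only becomes visible to the parent once the child has terminated and been waited for:

```python
import os
import resource

def burn_cpu(seconds):
    """Spin until this process has accumulated `seconds` of user CPU time."""
    while resource.getrusage(resource.RUSAGE_SELF).ru_utime < seconds:
        pass

def child_cpu_via_wait4(seconds=0.2):
    """Fork a CPU-burning child; return (RUSAGE_CHILDREN utime sampled
    before waiting, the utime reported by wait4 for the child)."""
    pid = os.fork()
    if pid == 0:
        burn_cpu(seconds)
        os._exit(0)
    # Until the child has terminated *and been waited for*, its CPU time
    # does not show up under RUSAGE_CHILDREN.
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    # os.wait4 hands back the same rusage that wait3() would fill in.
    _, _, rusage = os.wait4(pid, 0)
    return before, rusage.ru_utime

if __name__ == "__main__":
    before, after = child_cpu_via_wait4()
    print(f"RUSAGE_CHILDREN before wait: {before:.3f}s user CPU")
    print(f"wait4-reported child usage:  {after:.3f}s user CPU")
```

If the child were killed before it could wait for its own descendants - which is what happens when only the batch script is signaled on timeout - the rusage handed back here would stay near zero, matching the 0% efficiency seen in the seff output above.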
Marcin Stolarek

That's what I'm thinking about - documenting how things work in Slurm 22.05, and opening an internal ticket to check whether we can avoid those exceptions through code changes. The major issue is that even though the proctrack/pgid behavior differs from what is documented, we're very cautious about long-term behavior changes, since we know that users rely on the behavior as it is in those cases. I'll keep you posted.

cheers,
Marcin

Comment 22
Marcin Stolarek

Paul,

We've updated our docs to better explain the behavior in 89e8ecb54[1]. Since it's really a result of how the OS handles getrusage[2], I don't see a way to change/improve the behavior in Slurm.

Let me know if you have any questions.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/89e8ecb5471fa5e95c03892eafdb0f630e21b51b
[2] https://man7.org/linux/man-pages/man2/getrusage.2.html

Paul Edmon

Thanks!

-Paul Edmon-
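As a footnote to the resolution, the kill vs killpg difference that distinguishes proctrack/pgid from the other plugins can also be shown with a minimal sketch using plain POSIX process groups - no Slurm involved, and the "leader"/"worker" roles are illustrative stand-ins for the batch script and its computational child:

```python
import os
import signal
import time

def alive(pid):
    """True if `pid` still exists (signal 0 probes without delivering)."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False

def demo():
    # "Batch script" stand-in: made the leader of its own process group.
    leader = os.fork()
    if leader == 0:
        time.sleep(60)
        os._exit(0)
    os.setpgid(leader, leader)

    # "Computational" child, placed into the leader's process group.
    worker = os.fork()
    if worker == 0:
        time.sleep(60)
        os._exit(0)
    os.setpgid(worker, leader)

    # kill(): only the leader is signaled; the worker keeps running.
    os.kill(leader, signal.SIGTERM)
    os.waitpid(leader, 0)
    survived_kill = alive(worker)

    # killpg(): every member of the group is signaled, worker included.
    os.killpg(leader, signal.SIGTERM)
    os.waitpid(worker, 0)             # reap, so the pid really disappears
    survived_killpg = alive(worker)
    return survived_kill, survived_killpg

if __name__ == "__main__":
    k, kp = demo()
    print("worker survived kill():  ", k)
    print("worker survived killpg():", kp)
```

The surviving worker in the kill() case is the analogue of the computation whose CPU time never makes it into the rusage that slurmstepd collects.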