Ticket 24366

Summary: TotalCPU not accounted when using srun command
Product: Slurm
Reporter: Institut Pasteur HPC Admin <hpc>
Component: Accounting
Assignee: Thomas Sorkin <thomas>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
Priority: ---
CC: thomas
Version: 25.05.5
Hardware: Linux
OS: Linux
Site: Institut Pasteur
Version Fixed: 25.05,25.11
Attachments: slurm_tsched.conf
cgroup_tsched.conf
mariadb_1559605.log
cgroup_debug_1559605.log

Description Institut Pasteur HPC Admin 2025-12-22 09:55:56 MST
Created attachment 43999 [details]
slurm_tsched.conf

Hi,

We have been facing an issue since upgrading Slurm from 24.11.5 to 25.05.5. When a job uses srun, or has a step inside an sbatch script, the total CPU time is always reported as 0. We reproduced this in our test environment, and the issue occurs with both cgroup v1 and cgroup v2.

Here's an example:
"""
[braffest@maestro-1002 ~]$ srun -w maestro-1003 -c 10 stress --vm 10 --timeout 30
stress: info: [76839] dispatching hogs: 0 cpu, 0 io, 10 vm, 0 hdd
stress: info: [76839] successful run completed in 30s
[braffest@maestro-1002 ~]$ sacct -j 1559605 --format=jobid,elapsed,totalcpu,nodelist,tresusageinmax%40
JobID           Elapsed   TotalCPU        NodeList                           TRESUsageInMax 
------------ ---------- ---------- --------------- ---------------------------------------- 
1559605        00:00:31   00:00:00    maestro-1003                                          
1559605.ext+   00:00:31   00:00:00    maestro-1003                                 energy=0 
1559605.0      00:00:30   00:00:00    maestro-1003 cpu=00:04:55,energy=0,fs/disk=3109,mem=+ 
"""

In this test, maestro-1003 is a compute node using cgroup v2. TotalCPU is reported as 0 seconds even though TRESUsageInMax shows the actual CPU usage (cpu=00:04:55, roughly the 10 workers x 30 seconds we would expect from this stress run).

Checking the job inside the database:
"""
MariaDB [slurm_acct_db]> select t1.id_job,FROM_UNIXTIME(t1.time_start,"%Y-%m-%dT%H:%i:%s"),t1.job_name,user_sec,user_usec,sys_sec,sys_usec,tres_usage_in_tot from mtest_job_table as t1 join mtest_step_table as t2 on t2.job_db_inx=t1.job_db_inx where id_job=1559605 order by id_job;
+---------+--------------------------------------------------+----------+----------+-----------+---------+----------+------------------------------------------+
| id_job  | FROM_UNIXTIME(t1.time_start,"%Y-%m-%dT%H:%i:%s") | job_name | user_sec | user_usec | sys_sec | sys_usec | tres_usage_in_tot                        |
+---------+--------------------------------------------------+----------+----------+-----------+---------+----------+------------------------------------------+
| 1559605 | 2025-12-22T17:23:50                              | stress   |        0 |         0 |       0 |        0 | 3=0                                      |
| 1559605 | 2025-12-22T17:23:50                              | stress   |        0 |         0 |       0 |        0 | 1=295438,2=2544394240,3=0,6=3109,7=0,8=0 |
+---------+--------------------------------------------------+----------+----------+-----------+---------+----------+------------------------------------------+
2 rows in set (0.001 sec)
"""

The user/sys CPU values are set to 0 in the database (see the queries in the mariadb_1559605.log file).
To continue investigating, I added debug lines in the cgroup_p_task_get_acct_data function (see cgroup_debug_1559605.log):
  - Before "if (cpu_stat)" corresponds to this line: https://github.com/SchedMD/slurm/blob/5362d6fb57f265f3dbf9ad319b7d70d8ff84d081/src/plugins/cgroup/v2/cgroup_v2.c#L2819
  - Before "return stats" corresponds to this line: https://github.com/SchedMD/slurm/blob/5362d6fb57f265f3dbf9ad319b7d70d8ff84d081/src/plugins/cgroup/v2/cgroup_v2.c#L2861

As shown in cgroup_debug_1559605.log, the usec/ssec variables are not zero, so I am not sure where the values get lost.
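
For completeness, the values read by cgroup_p_task_get_acct_data come from the step's cpu.stat file in the cgroup v2 hierarchy, so they can also be cross-checked directly while a step is running. This is only a sketch; the exact path depends on the node's cgroup layout:

"""
# Rough cross-check while a step is running; treat the path as indicative
# only, since the exact cgroup hierarchy depends on the node setup.
# cpu.stat contains usage_usec, user_usec and system_usec.
cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_1559605/step_0/cpu.stat
"""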

I am also attaching our cgroup.conf and slurm.conf files.
In our test environment, we use both cgroup v1 and v2 nodes, but in production we only use cgroup v2 and see the same behavior.

Has this issue been reported before?

Let us know if you need more information.

Best,
Brice
Comment 1 Institut Pasteur HPC Admin 2025-12-22 09:56:20 MST
Created attachment 44000 [details]
cgroup_tsched.conf
Comment 2 Institut Pasteur HPC Admin 2025-12-22 09:56:43 MST
Created attachment 44001 [details]
mariadb_1559605.log
Comment 3 Institut Pasteur HPC Admin 2025-12-22 09:56:59 MST
Created attachment 44002 [details]
cgroup_debug_1559605.log
Comment 6 Thomas Sorkin 2025-12-23 13:39:38 MST
Hi Brice,

Unfortunately this is a known regression in 25.05 that we identified recently. It was introduced in commit af2c0bd during work on ticket 20207:

> Ticket: https://support.schedmd.com/show_bug.cgi?id=20207

> Commit: https://github.com/SchedMD/slurm/commit/af2c0bd43055e4486e4d455bbeda39b514a2b294

A patch to fix this regression is currently being reviewed and I will let you know as soon as it is merged. I'm sorry to say, though, that the patch will be for versions 25.11+ due to our release eligibility criteria. For now, you can still view some info about CPU usage through the sacct TRESUsageIn* format parameters, for example:

> [1] TRESUsageInAve: https://slurm.schedmd.com/sacct.html#OPT_TresUsageInAve
> [2] TRESUsageInMax: https://slurm.schedmd.com/sacct.html#OPT_TresUsageInMax
> [3] TRESUsageInMin: https://slurm.schedmd.com/sacct.html#OPT_TresUsageInMin
> [4] TRESUsageInTot: https://slurm.schedmd.com/sacct.html#OPT_TresUsageInTot

These will report the CPU usage time for tasks in a job through the cpu TRES, though they will not split it into user vs. system CPU time, and the value is rounded down to the nearest second.
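
For example, to see the total CPU time a step actually consumed (the field list below is just a suggestion; adjust the fields and widths as you prefer):

"""
sacct -j 1559605 --format=JobID,Elapsed,TRESUsageInTot%60
"""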

I'll keep you updated about the fix.

Regards,
Thomas
Comment 7 Institut Pasteur HPC Admin 2025-12-23 15:03:58 MST
Hi,

Thanks for the information.

Is the fix not being proposed for version 25.05 because this isn't considered a major regression?
I've only seen the patch eligibility matrix in the SLUG24 presentation. Is there any more information about patch eligibility criteria in the documentation?

To work around the issue on our side, since it still breaks commands like seff, I'm thinking about injecting the CPU time value from TRESUsageInTot into the user_sec/user_usec fields in the database so we still have a meaningful TotalCPU value. Do you see any issues with this idea?

Best,
Brice
Comment 8 Thomas Sorkin 2025-12-29 17:19:59 MST
Hi Brice,

That's correct, good memory. You can see the Slurm eligibility matrix in slide 43 of the SLUG24 Field Notes slideshow, likely the place you first saw the matrix:

> [1] https://slurm.schedmd.com/SLUG24/Field-Notes-8.pdf

Slurm version 25.05 was released 7 months ago, so it falls under the T+6 category. This means minor regressions are not accepted, but major regressions are generally allowed. The eligibility matrix is not reproduced anywhere in the docs because it's really just for development purposes, though it's not meant to be a secret either.

After looking into this more, I am pushing for it to be classified as a major regression so that you can get the fix as soon as it is finished.

Regarding setting the user/sys CPU values manually in the database based on the TRES usage values: I'm assuming that by "broke command like seff" you are referring to an existing pipeline of yours that uses the seff command. If seff itself fails when run on a job affected by this issue, please let me know, since that would be a separate problem. On my end, seff works fine when viewing jobs affected by this problem, though it does report user/sys CPU time as 0, since it pulls this information from the step table just like sacct.

What you've described is fine as a temporary fix; you would just have to make sure you update the correct step with the tres_usage_in_tot value for CPU (you can see the TRES ID for CPU by running sacctmgr show TRES cpu). Note that you will not be able to tell how much of that TRES usage time was user CPU time and how much was system CPU time, since they are added together in the TRES metrics.
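
Purely as an illustration of that approach (not something we ship or have tested), an update along the lines below could work. It assumes the cpu TRES ID is 1 and that the stored value is in milliseconds, which the 1=295438 vs. cpu=00:04:55 pair in your query suggests; please verify both against sacctmgr and your data, and take a database backup before running anything like this:

"""
-- Illustrative sketch only: copy the cpu portion of tres_usage_in_tot into
-- user_sec for the affected steps of one job. Table names follow the
-- mtest_* tables from your query; adjust them to your production tables.
-- Assumes cpu TRES ID 1 and a value stored in milliseconds.
UPDATE mtest_step_table AS s
JOIN mtest_job_table AS j ON j.job_db_inx = s.job_db_inx
SET s.user_sec = FLOOR(
        SUBSTRING_INDEX(
            SUBSTRING_INDEX(CONCAT(',', s.tres_usage_in_tot), ',1=', -1),
            ',', 1) / 1000)
WHERE j.id_job = 1559605
  AND s.user_sec = 0
  AND s.sys_sec = 0
  AND CONCAT(',', s.tres_usage_in_tot) LIKE '%,1=%';
"""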

I will keep you updated on the fix for this issue.

Regards,
Thomas
Comment 9 Institut Pasteur HPC Admin 2025-12-30 02:34:05 MST
Hey Thomas,

Thanks for pushing to have this classified as a major regression.

When I say that commands like seff are broken, I just mean that seff always shows the CPU efficiency as 0% for srun jobs, while the memory field still works correctly. So I'm seeing the same behavior as you.

Thanks for the information; it's okay not to have the real sys/user CPU time for now.

Thanks for your support.
Best,
Brice
Comment 10 Thomas Sorkin 2026-01-20 09:11:14 MST
Hi Brice,

Thank you for your patience. We have now merged a patch to track cputime stats for job steps started by srun. You can see the commit here:

> [1] https://github.com/SchedMD/slurm/commit/4890b1a1739bae01ce4adbb552e570723d1eed76

Since this was ultimately handled as a major regression, the patch was merged into both Slurm 25.05 and 25.11, so you can pick it up as soon as you'd like.
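
Once you are running a version with the fix, TotalCPU for srun steps should be populated again; a quick way to confirm is a check like this one, using the fields from your original report:

"""
sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,NodeList
"""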

With this fix I'll go ahead and close this ticket, but please do not hesitate to reopen it if you run into further issues related to step accounting or if you have follow-up questions.

Regards,
Thomas