Ticket 15142 - Gres underflow error
Summary: Gres underflow error
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 22.05.4
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Scott Hilton
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-10-10 23:03 MDT by Simon Michnowicz
Modified: 2023-12-04 12:03 MST

See Also:
Site: Monash University


Attachments
Slurm controller log that shows above errors (1.22 MB, application/x-gzip), 2022-10-10 23:03 MDT, Simon Michnowicz
slurmd log of file with prolog error (2.09 KB, application/x-gzip), 2022-10-10 23:05 MDT, Simon Michnowicz
slurm.conf (7.97 KB, text/plain), 2022-10-10 23:06 MDT, Simon Michnowicz

Description Simon Michnowicz 2022-10-10 23:03:20 MDT
Created attachment 27206 [details]
Slurm controller log that shows above errors

Our Slurm controller is showing thousands of errors of the kind
"[2022-10-11T15:19:54.010] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)
[2022-10-11T15:19:54.011] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)
[2022-10-11T15:19:54.012] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)"
 
and other errors, which may be unrelated:

[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES cpu grp_used_tres_run_secs underflow, tried to remove 1692 seconds when only 1686 remained.
[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES mem grp_used_tres_run_secs underflow, tried to remove 15882240 seconds when only 15825920 remained.
[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES node grp_used_tres_run_secs underflow, tried to remove 282 seconds when only 281 remained.
[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES billing grp_used_tres_run_secs underflow, tried to remove 1692 seconds when only 1686 remained.
[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 282 seconds when only 281 remained.

We do not know the meaning of these errors. We also have some random nodes going down with a "Prolog error". Dmesg shows

"[2022-10-11T15:38:25.502] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[2022-10-11T15:38:45.497] error: Waiting for JobId=27159216 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec"

This may be related to the first problem. The nodes still have a connection to the controller and can be resumed when this error occurs.
Comment 1 Simon Michnowicz 2022-10-10 23:05:19 MDT
Created attachment 27207 [details]
slurmd log of file with prolog error
Comment 2 Simon Michnowicz 2022-10-10 23:06:18 MDT
Created attachment 27208 [details]
slurm.conf
Comment 3 Simon Michnowicz 2022-10-10 23:35:27 MDT
There is a typo in my original comment: instead of `dmesg` I meant the slurmd logs.
Comment 7 Simon Michnowicz 2022-10-15 00:05:04 MDT
on the nodes with the Prolog Error I see

[2022-09-29T14:55:33.437] error: Invalid PrologSlurmctld(`/opt/slurm-latest/etc/slurmctld.prolog`): No such file or directory

This is weird, as this file does exist on the controller nodes, and other nodes do not show this message.
Comment 9 Scott Hilton 2022-10-19 13:25:55 MDT
Simon,

The "gres count underflow" error seems to be an issue with how the backfill scheduler deals with gres. I am working on a fix and will get back to you.

This issue may affect the backfill scheduler's ability to use those nodes, but otherwise shouldn't cause any other problems, though I haven't looked too deeply yet.

As for the other errors, they may be related but I am unsure at the moment. I will let you know when I have more information.

-Scott
Comment 13 Scott Hilton 2022-10-28 14:26:00 MDT
Simon, 

For this error in your slurmd.log:
[2022-10-10T15:14:09.914] error: private-tmpdir: mkdir("/mnt/privatedir/tmpdir.27261746.0",0700): No such file or directory

Do you have this directory on the node?

-Scott
Comment 14 Simon Michnowicz 2022-11-02 20:41:15 MDT
Scott
we use SPANK plugins to mount local file systems into /tmp.
/mnt/privatedir/tmpdir.27261746.0
is the name of such a folder created for this purpose, and 27261746 is the job number.

The folder is deleted at the end of the job by this snippet in the Slurm epilog file:

TMPDIR_BASE=/mnt/privatedir/tmpdir
PRIVATE_TMPDIR="${TMPDIR_BASE}.${SLURM_JOB_ID}.${SLURM_RESTART_COUNT:=0}"
if [ -d "${PRIVATE_TMPDIR}" ]; then
   rm -rf "${PRIVATE_TMPDIR}"
fi
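(Editorial sketch: the `mkdir(...): No such file or directory` error in comment 13 indicates the base path was missing when the plugin ran, so a prolog-side counterpart could pre-create it. This is a hypothetical sketch, not the site's actual prolog; the `${VAR:-default}` fallbacks are demo values only so the sketch runs standalone, since slurmd provides `SLURM_JOB_ID` in a real prolog.)

```shell
# Hypothetical prolog-side sketch: pre-create the per-job private
# tmpdir so a later mkdir inside it cannot fail with ENOENT.
# The fallbacks below are demo values for standalone runs only.
TMPDIR_BASE="${TMPDIR_BASE:-$(mktemp -d)/tmpdir}"
SLURM_JOB_ID="${SLURM_JOB_ID:-27261746}"
SLURM_RESTART_COUNT="${SLURM_RESTART_COUNT:-0}"
PRIVATE_TMPDIR="${TMPDIR_BASE}.${SLURM_JOB_ID}.${SLURM_RESTART_COUNT}"
mkdir -p "${PRIVATE_TMPDIR}"
chmod 0700 "${PRIVATE_TMPDIR}"
```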



We have not seen this error for a while. I am not sure what triggered it or why it is no longer present. Perhaps a user job caused it.

regards
Simon
Comment 15 Scott Hilton 2022-11-03 16:03:16 MDT
Simon,

Could you send the output of this command for the relevant jobs:
>sacct -p -j 27275414,27268224,27254366 --format=jobid,Start,Eligible,Submit,End,Suspended,Timelimit

-Scott
Comment 16 Simon Michnowicz 2022-11-03 21:47:23 MDT
 sacct -p -j 27275414,27268224,27254366 --format=jobid,Start,Eligible,Submit,End,Suspended,Timelimit
JobID|Start|Eligible|Submit|End|Suspended|Timelimit|
27254366|2022-10-10T23:11:06|2022-10-09T18:29:42|2022-10-09T18:29:42|2022-10-16T23:12:31|00:00:00|6-00:00:00|
27254366.batch|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-16T23:12:35|00:00:00||
27254366.extern|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-16T23:12:31|00:00:00||
27254366.0|2022-10-13T16:51:35|2022-10-13T16:51:35|2022-10-13T16:51:35|2022-10-16T18:35:22|00:00:00||
27254366.1|2022-10-16T19:47:37|2022-10-16T19:47:37|2022-10-16T19:47:37|2022-10-16T23:12:31|00:00:00||
27268224|2022-10-10T23:43:06|2022-10-10T19:57:37|2022-10-10T19:57:37|2022-10-12T23:44:15|00:00:00|2-00:00:00|
27268224.batch|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-12T23:44:16|00:00:00||
27268224.extern|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-12T23:44:15|00:00:00||
27275414|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:22|00:00:00|05:00:00|
27275414.batch|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:23|00:00:00||
27275414.extern|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:23|00:00:00||

Thanks. Is it possible there is some race condition happening?
regards
Simon
Comment 17 Scott Hilton 2022-11-04 09:13:23 MDT
Simon,

Could you send the slurmctld logs for Nov 10 and 12 as well?

-Scott
Comment 18 Scott Hilton 2022-11-04 09:14:45 MDT
Also, can you send the output of this command:
>sacct -p -j 27275414,27268224,27254366 --format=jobid,Start,Eligible,Submit,End,Suspended,Timelimit,reqtres,alloctres
Comment 19 Simon Michnowicz 2022-11-06 19:53:55 MST
Scott
> Could you send the slurmctld logs for Nov 10 and 12 as well?

I think you mean Oct 10 and 12. Unfortunately we do not keep our logs that far back, as log rotation eventually deletes them once they reach a certain size. The latest I have is 29 Oct.

I enclose the sacct command.
regards
Simon

[smichnow@m3-login2 ~]$ sacct -p -j 27275414,27268224,27254366 --format=jobid,Start,Eligible,Submit,End,Suspended,Timelimit,reqtres,alloctres
JobID|Start|Eligible|Submit|End|Suspended|Timelimit|ReqTRES|AllocTRES|
27254366|2022-10-10T23:11:06|2022-10-09T18:29:42|2022-10-09T18:29:42|2022-10-16T23:12:31|00:00:00|6-00:00:00|billing=8,cpu=8,gres/gpu=1,mem=24G,node=1|billing=8,cpu=8,gres/gpu=1,mem=24G,node=1|
27254366.batch|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-16T23:12:35|00:00:00|||cpu=8,gres/gpu=1,mem=24G,node=1|
27254366.extern|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-16T23:12:31|00:00:00|||billing=8,cpu=8,gres/gpu=1,mem=24G,node=1|
27254366.0|2022-10-13T16:51:35|2022-10-13T16:51:35|2022-10-13T16:51:35|2022-10-16T18:35:22|00:00:00|||cpu=8,gres/gpu=1,mem=24G,node=1|
27254366.1|2022-10-16T19:47:37|2022-10-16T19:47:37|2022-10-16T19:47:37|2022-10-16T23:12:31|00:00:00|||cpu=8,gres/gpu=1,mem=24G,node=1|
27268224|2022-10-10T23:43:06|2022-10-10T19:57:37|2022-10-10T19:57:37|2022-10-12T23:44:15|00:00:00|2-00:00:00|billing=2,cpu=2,gres/gpu=2,mem=62.50G,node=1|billing=2,cpu=2,gres/gpu=2,mem=62.50G,node=1|
27268224.batch|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-12T23:44:16|00:00:00|||cpu=2,gres/gpu=2,mem=62.50G,node=1|
27268224.extern|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-12T23:44:15|00:00:00|||billing=2,cpu=2,gres/gpu=2,mem=62.50G,node=1|
27275414|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:22|00:00:00|05:00:00|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|
27275414.batch|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:23|00:00:00|||cpu=6,gres/gpu=1,mem=55G,node=1|
27275414.extern|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:23|00:00:00|||billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|
Comment 20 Scott Hilton 2022-11-28 13:43:43 MST
Simon,

Sorry for the delay. Are you still getting these error messages? Is this affecting your workflow?

Is there anything you know that could help me reproduce the issue?

-Scott
Comment 21 help@massive.org.au 2022-11-28 23:20:42 MST
Scott
we are still getting "underflow" errors in our controller logs.
The last was a few weeks ago.

i.e.
slurmctld.log-20221107:[2022-11-07T00:50:14.296] error: _handle_assoc_tres_run_secs: job 27690996: assoc 2007 TRES node grp_used_tres_run_secs underflow, tried to remove 299 seconds when only 298 remained.
slurmctld.log-20221107:[2022-11-07T00:50:14.296] error: _handle_assoc_tres_run_secs: job 27690996: assoc 2007 TRES billing grp_used_tres_run_secs underflow, tried to remove 1794 seconds when only 1788 remained.
slurmctld.log-20221107:[2022-11-07T00:50:14.296] error: _handle_assoc_tres_run_secs: job 27690996: assoc 2007 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 299 seconds when only 298 remained.
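(Editorial sketch: messages like these can be tallied per job id with a single awk pass over the log, which makes it easy to see how many distinct jobs are affected. The function name and log path are examples, not anything from the site's tooling.)

```shell
# Tally grp_used_tres_run_secs underflow messages per job id.
# Relies on the "... job <id>: ..." log format shown in the excerpts.
tally_underflows() {
  awk '/grp_used_tres_run_secs underflow/ {
    for (i = 1; i <= NF; i++)
      if ($i == "job") { id = $(i + 1); sub(/:$/, "", id); count[id]++; break }
  } END { for (j in count) print j, count[j] }' "$1"
}
# Usage: tally_underflows /var/log/slurm/slurmctld.log
```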

This does not interfere with our workflow, so it is not of immediate concern to us (other than a general concern over unusual errors). You mentioned you were working on a fix for the `gres underflow` issue. Is this something we should consider patching, or should we wait for the next Slurm release?

regards
Simon
Comment 23 Caden Ellis 2023-01-31 11:38:35 MST
Simon,

Do you use QOSs? What are their settings? One way to capture this is to attach the output of `scontrol show assoc`. I am particularly interested in the flags, GrpTRESRunMins, and usage factor.

Do users resize jobs often? Do you change usage factors on QOSs while jobs are running, or have you ever changed usage factors? Are you ever restarting the controller or reconfiguring?

Caden
Comment 24 Caden Ellis 2023-01-31 11:56:44 MST
This command could also help; it is more concise for what I am looking for:

sacctmgr show qos format=name,grptresrunmins,usagefactor,flags
Comment 26 Scott Hilton 2023-06-26 14:45:45 MDT
Simon,

There was a change in 23.02.1 that may fix this issue. Have you upgraded yet? Do you still see this issue in your cluster?

-Scott
Comment 27 help@massive.org.au 2023-06-26 18:57:28 MDT
Scott
thanks for the update. We have no immediate plans to upgrade but will advise once we do.

regards
Simon
Comment 30 Scott Hilton 2023-11-28 15:41:30 MST
Simon,

Have you had a chance to upgrade yet? Is this still an issue for you?

-Scott
Comment 31 help@massive.org.au 2023-11-28 16:02:03 MST
Hi
We have upgraded to 23.02.6 and I think the problem has gone away. There are just a few errors that look different.

[ec2-user@m3-mgmt4 slurm-logs]$ sudo grep underflow slurmctld.log
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES cpu grp_used_tres_run_secs underflow, tried to remove 1716 seconds when only 1710 remained.
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES mem grp_used_tres_run_secs underflow, tried to remove 16107520 seconds when only 16051200 remained.
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES node grp_used_tres_run_secs underflow, tried to remove 286 seconds when only 285 remained.
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES billing grp_used_tres_run_secs underflow, tried to remove 1716 seconds when only 1710 remained.
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 286 seconds when only 285 remained.

I think we can close the ticket now.

Thanks
Simon
Comment 32 Scott Hilton 2023-11-29 10:20:25 MST
Simon,

I am still a little worried about the grp_used_tres_run_secs underflow error. I was hoping it was related to the other error and would go away when the associated bug was fixed, but this seems not to be the case.

Does this error happen often? Can you reproduce it on demand? Is there anything different about the jobs that exhibit this issue?

-Scott
Comment 33 help@massive.org.au 2023-11-29 19:57:28 MST
Scott
There are more such errors today

[root@m3-mgmt4 ec2-user]# grep  -i underflow /mnt/slurm-logs/slurmctld.log
[2023-11-30T03:47:05.239] error: _handle_assoc_tres_run_secs: job 32667547: assoc 6896 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 273 seconds when only 272 remained.
[2023-11-30T04:22:05.238] error: _handle_assoc_tres_run_secs: job 32677122: assoc 4942 TRES cpu grp_used_tres_run_secs underflow, tried to remove 280 seconds when only 279 remained.
[2023-11-30T04:22:05.238] error: _handle_assoc_tres_run_secs: job 32677122: assoc 4942 TRES mem grp_used_tres_run_secs underflow, tried to remove 1146880 seconds when only 1142784 remained.
[2023-11-30T04:22:05.238] error: _handle_assoc_tres_run_secs: job 32677122: assoc 4942 TRES node grp_used_tres_run_secs underflow, tried to remove 280 seconds when only 279 remained.
[2023-11-30T04:22:05.238] error: _handle_assoc_tres_run_secs: job 32677122: assoc 4942 TRES billing grp_used_tres_run_secs underflow, tried to remove 280 seconds when only 279 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES cpu grp_used_tres_run_secs underflow, tried to remove 924 seconds when only 918 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES mem grp_used_tres_run_secs underflow, tried to remove 8673280 seconds when only 8616960 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES node grp_used_tres_run_secs underflow, tried to remove 154 seconds when only 153 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES billing grp_used_tres_run_secs underflow, tried to remove 924 seconds when only 918 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 154 seconds when only 153 remained.


I do not know what causes it. Only 3 jobs are mentioned: one ended via timeout, one was cancelled, and one completed normally, all on different compute nodes.

Would out-of-sync clocks cause this?

Simon
Comment 34 Scott Hilton 2023-11-30 17:09:04 MST
Could you send me the output of this command:
>sacct -P --format=jobid,associd,ntasks,alloctres,reqtres,nodelist,eligible,submit,start,end,elapsed,timelimit,state -j 32667547,32677122,32675078
-Scott
Comment 35 help@massive.org.au 2023-11-30 22:11:02 MST
[smichnow@m3-login2 ~]$ sacct -P --format=jobid,associd,ntasks,alloctres,reqtres,nodelist,eligible,submit,start,end,elapsed,timelimit,state -j 32667547,32677122,32675078
JobID|AssocID|NTasks|AllocTRES|ReqTRES|NodeList|Eligible|Submit|Start|End|Elapsed|Timelimit|State
32667547|6896||billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|m3p000|2023-11-28T21:46:10|2023-11-28T21:46:10|2023-11-28T21:46:37|2023-11-30T03:47:43|1-06:01:06|1-06:00:00|TIMEOUT
32667547.batch|6896|1|cpu=6,gres/gpu=1,mem=55G,node=1||m3p000|2023-11-28T21:46:37|2023-11-28T21:46:37|2023-11-28T21:46:37|2023-11-30T03:47:46|1-06:01:09||CANCELLED
32667547.extern|6896|1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1||m3p000|2023-11-28T21:46:37|2023-11-28T21:46:37|2023-11-28T21:46:37|2023-11-30T03:47:46|1-06:01:09||CANCELLED
32675078|4207||billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|m3p003|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-30T08:20:55|1-00:01:16|1-00:00:00|TIMEOUT
32675078.batch|4207|1|cpu=6,gres/gpu=1,mem=55G,node=1||m3p003|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-30T08:20:57|1-00:01:18||CANCELLED
32675078.extern|4207|1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1||m3p003|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-30T08:20:57|1-00:01:18||CANCELLED
32677122|4942||billing=1,cpu=1,mem=4G,node=1|billing=1,cpu=1,mem=4G,node=1|m3i004|2023-11-29T11:41:26|2023-11-29T11:41:26|2023-11-29T11:41:44|2023-11-30T04:22:50|16:41:06|16:40:00|TIMEOUT
32677122.batch|4942|1|cpu=1,mem=4G,node=1||m3i004|2023-11-29T11:41:44|2023-11-29T11:41:44|2023-11-29T11:41:44|2023-11-30T04:22:51|16:41:07||CANCELLED
32677122.extern|4942|1|billing=1,cpu=1,mem=4G,node=1||m3i004|2023-11-29T11:41:44|2023-11-29T11:41:44|2023-11-29T11:41:44|2023-11-30T04:22:51|16:41:07||COMPLETED
[smichnow@m3-login2 ~]$
Comment 38 Scott Hilton 2023-12-01 14:48:33 MST
Simon,

It looks like all three jobs ended with TIMEOUT as the reason, which is expected given the error.

I looked into it more, but I am still not sure what causes these error messages. However, it seems that, in this case, it is probably benign. If you don't see any problems related to it, I will go ahead and close this ticket.

I agree with your guess: I think some minor timing issue occasionally leads to this error, but the underlying discrepancy is no more than about one second.

-Scott
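(Editorial note: this timing explanation is consistent with the numbers in the logs, because the overshoot scales with the TRES count: a one-second discrepancy on a 6-CPU job appears as 6 CPU-seconds. The sketch below is illustrative arithmetic only; the variable names are assumptions, not Slurm internals.)

```shell
# Illustrative arithmetic, not Slurm source: grp_used_tres_run_secs
# tracks remaining-runtime x TRES-count per association, so a
# one-second drift at job end is multiplied by the TRES count,
# matching "tried to remove 1692 seconds when only 1686 remained"
# for a 6-CPU job.
cpus=6
drift_seconds=1
tracked_remaining=1686
to_remove=$(( tracked_remaining + cpus * drift_seconds ))
echo "tried to remove ${to_remove} seconds when only ${tracked_remaining} remained"
```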
Comment 41 help@massive.org.au 2023-12-03 16:09:57 MST
Scott
I will close the ticket now.
regards
Simon
Comment 42 Scott Hilton 2023-12-04 12:03:01 MST
Closing