Created attachment 27206 [details] Slurm controller log that shows above errors

Our Slurm controller is showing thousands of errors of the kind:

[2022-10-11T15:19:54.010] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)
[2022-10-11T15:19:54.011] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)
[2022-10-11T15:19:54.012] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)

and other errors which may be unrelated:

[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES cpu grp_used_tres_run_secs underflow, tried to remove 1692 seconds when only 1686 remained.
[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES mem grp_used_tres_run_secs underflow, tried to remove 15882240 seconds when only 15825920 remained.
[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES node grp_used_tres_run_secs underflow, tried to remove 282 seconds when only 281 remained.
[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES billing grp_used_tres_run_secs underflow, tried to remove 1692 seconds when only 1686 remained.
[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 282 seconds when only 281 remained.

We do not know the meaning of these errors.

We also have some random nodes going down with "Prolog error". Dmesg shows:

[2022-10-11T15:38:25.502] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[2022-10-11T15:38:45.497] error: Waiting for JobId=27159216 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec

This may be related to the first problem. The nodes do still have a connection to the controller and can be resumed after this error occurs.
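For reference, a quick way to gauge how widespread these underflow messages are is to tally them per job and node. This is only a sketch: the sample lines below are copied from the excerpt above into a temporary file for illustration, and on a live system you would grep the real slurmctld log instead (the log path varies by site).

```shell
#!/bin/sh
# Sketch: tally "gres count underflow" errors per job/node.
# Sample lines are embedded for illustration; on a real controller,
# replace /tmp/underflow_sample.log with the actual slurmctld log path.
cat <<'EOF' > /tmp/underflow_sample.log
[2022-10-11T15:19:54.010] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)
[2022-10-11T15:19:54.011] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)
[2022-10-11T15:19:54.012] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)
EOF
# Extract the "job N dealloc node X" part and count occurrences of each pair.
grep -o 'job [0-9]* dealloc node [a-z0-9]*' /tmp/underflow_sample.log | sort | uniq -c
```

On the sample above this prints one line with a count of 3 for job 27254366 on node m3g009, which makes it easy to see whether the errors are confined to a few jobs or spread across the cluster.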
Created attachment 27207 [details] slurmd log of file with prolog error
Created attachment 27208 [details] slurm.conf
There is a typo in my original comment: instead of `dmesg` I meant the slurmd logs.
On the nodes with the Prolog error I see:

[2022-09-29T14:55:33.437] error: Invalid PrologSlurmctld(`/opt/slurm-latest/etc/slurmctld.prolog`): No such file or directory

This is weird, as this file does exist on the controller nodes, and other nodes do not show this message.
Simon,

The "gres count underflow" error seems to be an issue with how the backfill scheduler deals with gres. I am working on a fix and will get back to you. This issue may affect the backfill scheduler's ability to use those nodes, but otherwise shouldn't cause any other problems, though I haven't looked too deeply yet.

As for the other errors, they may be related, but I am unsure at the moment. I will let you know when I have more information.

-Scott
Simon,

For this error in your slurmd.log:

[2022-10-10T15:14:09.914] error: private-tmpdir: mkdir("/mnt/privatedir/tmpdir.27261746.0",0700): No such file or directory

Do you have this directory on the node?

-Scott
Scott,

We use SPANK plugins to mount local file systems into /tmp. /mnt/privatedir/tmpdir.27261746.0 is the name of such a folder created for this purpose, and 27261746 should be the job number. The folder is deleted at the end of the job; this is done in the Slurm epilog file:

TMPDIR_BASE=/mnt/privatedir/tmpdir
PRIVATE_TMPDIR="${TMPDIR_BASE}.${SLURM_JOB_ID}.${SLURM_RESTART_COUNT:=0}"
if [ -d "${PRIVATE_TMPDIR}" ]; then
    rm -rf "${PRIVATE_TMPDIR}"
fi

We have not seen this error for a while. I am not sure what triggered it or why it is no longer present; perhaps a user job caused it.

regards
Simon
Simon,

Could you send the output of this command for the relevant jobs?

sacct -p -j 27275414,27268224,27254366 --format=jobid,Start,Eligible,Submit,End,Suspended,Timelimit

-Scott
sacct -p -j 27275414,27268224,27254366 --format=jobid,Start,Eligible,Submit,End,Suspended,Timelimit

JobID|Start|Eligible|Submit|End|Suspended|Timelimit|
27254366|2022-10-10T23:11:06|2022-10-09T18:29:42|2022-10-09T18:29:42|2022-10-16T23:12:31|00:00:00|6-00:00:00|
27254366.batch|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-16T23:12:35|00:00:00||
27254366.extern|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-16T23:12:31|00:00:00||
27254366.0|2022-10-13T16:51:35|2022-10-13T16:51:35|2022-10-13T16:51:35|2022-10-16T18:35:22|00:00:00||
27254366.1|2022-10-16T19:47:37|2022-10-16T19:47:37|2022-10-16T19:47:37|2022-10-16T23:12:31|00:00:00||
27268224|2022-10-10T23:43:06|2022-10-10T19:57:37|2022-10-10T19:57:37|2022-10-12T23:44:15|00:00:00|2-00:00:00|
27268224.batch|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-12T23:44:16|00:00:00||
27268224.extern|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-12T23:44:15|00:00:00||
27275414|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:22|00:00:00|05:00:00|
27275414.batch|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:23|00:00:00||
27275414.extern|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:23|00:00:00||

Thanks. Is it possible there is some race condition happening?

regards
Simon
Simon,

Could you send the slurmctld logs for nov 10 and 12 as well?

-Scott
Also, can you send the output of this command:

sacct -p -j 27275414,27268224,27254366 --format=jobid,Start,Eligible,Submit,End,Suspended,Timelimit,reqtres,alloctres
Scott,

> Could you send the slurmctld logs for nov 10 and 12 as well?

I think you mean Oct 10 and 12. Unfortunately we do not keep our logs that far back, as we have a log rotate that eventually deletes them once they reach a certain size. The latest I have is 29 Oct. I enclose the sacct output.

regards
Simon

[smichnow@m3-login2 ~]$ sacct -p -j 27275414,27268224,27254366 --format=jobid,Start,Eligible,Submit,End,Suspended,Timelimit,reqtres,alloctres
JobID|Start|Eligible|Submit|End|Suspended|Timelimit|ReqTRES|AllocTRES|
27254366|2022-10-10T23:11:06|2022-10-09T18:29:42|2022-10-09T18:29:42|2022-10-16T23:12:31|00:00:00|6-00:00:00|billing=8,cpu=8,gres/gpu=1,mem=24G,node=1|billing=8,cpu=8,gres/gpu=1,mem=24G,node=1|
27254366.batch|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-16T23:12:35|00:00:00|||cpu=8,gres/gpu=1,mem=24G,node=1|
27254366.extern|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-16T23:12:31|00:00:00|||billing=8,cpu=8,gres/gpu=1,mem=24G,node=1|
27254366.0|2022-10-13T16:51:35|2022-10-13T16:51:35|2022-10-13T16:51:35|2022-10-16T18:35:22|00:00:00|||cpu=8,gres/gpu=1,mem=24G,node=1|
27254366.1|2022-10-16T19:47:37|2022-10-16T19:47:37|2022-10-16T19:47:37|2022-10-16T23:12:31|00:00:00|||cpu=8,gres/gpu=1,mem=24G,node=1|
27268224|2022-10-10T23:43:06|2022-10-10T19:57:37|2022-10-10T19:57:37|2022-10-12T23:44:15|00:00:00|2-00:00:00|billing=2,cpu=2,gres/gpu=2,mem=62.50G,node=1|billing=2,cpu=2,gres/gpu=2,mem=62.50G,node=1|
27268224.batch|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-12T23:44:16|00:00:00|||cpu=2,gres/gpu=2,mem=62.50G,node=1|
27268224.extern|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-12T23:44:15|00:00:00|||billing=2,cpu=2,gres/gpu=2,mem=62.50G,node=1|
27275414|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:22|00:00:00|05:00:00|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|
27275414.batch|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:23|00:00:00|||cpu=6,gres/gpu=1,mem=55G,node=1|
27275414.extern|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:23|00:00:00|||billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|
Simon,

Sorry for the delay. Are you still getting these error messages? Is this affecting your workflow? Is there anything you know of that could help me reproduce the issue?

-Scott
Scott,

We are still getting "underflow" errors in our controller logs. The last was a few weeks ago, e.g.:

slurmctld.log-20221107:[2022-11-07T00:50:14.296] error: _handle_assoc_tres_run_secs: job 27690996: assoc 2007 TRES node grp_used_tres_run_secs underflow, tried to remove 299 seconds when only 298 remained.
slurmctld.log-20221107:[2022-11-07T00:50:14.296] error: _handle_assoc_tres_run_secs: job 27690996: assoc 2007 TRES billing grp_used_tres_run_secs underflow, tried to remove 1794 seconds when only 1788 remained.
slurmctld.log-20221107:[2022-11-07T00:50:14.296] error: _handle_assoc_tres_run_secs: job 27690996: assoc 2007 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 299 seconds when only 298 remained.

This does not interfere with our workflow, so it is not of immediate concern to us (other than a general concern about unusual errors). You mentioned you were working on a fix for the `gres underflow` issue. Is this something we should consider patching, or should we wait for the next Slurm release?

regards
Simon
Simon,

Do you use QOSes? What are their settings? One way to capture these is to attach the output of `scontrol show assoc`. I am particularly interested in the flags, GrpTRESRunMins, and usage factor.

Do users resize jobs often? Do you change usage factors on QOSes while jobs are running, or have you ever changed usage factors? Are you ever restarting the controller or reconfiguring?

Caden
This command could also help, and is more concise for what I am looking for:

sacctmgr show qos format=name,grptresrunmins,usagefactor,flags
Simon, There was a change in 23.02.1 that may fix this issue. Have you upgraded yet? Do you still see this issue in your cluster? -Scott
Scott,

Thanks for the update. We have no immediate plans to upgrade, but will advise once we do.

regards
Simon
Simon, Have you had a chance to upgrade yet? Is this still an issue for you? -Scott
Hi,

We have upgraded to 23.02.6 and I think the problem has gone away. There are just a few errors that look different:

[ec2-user@m3-mgmt4 slurm-logs]$ sudo grep underflow slurmctld.log
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES cpu grp_used_tres_run_secs underflow, tried to remove 1716 seconds when only 1710 remained.
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES mem grp_used_tres_run_secs underflow, tried to remove 16107520 seconds when only 16051200 remained.
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES node grp_used_tres_run_secs underflow, tried to remove 286 seconds when only 285 remained.
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES billing grp_used_tres_run_secs underflow, tried to remove 1716 seconds when only 1710 remained.
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 286 seconds when only 285 remained.

I think we can close the ticket now.

Thanks
Simon
Simon,

I am still a little worried about the grp_used_tres_run_secs underflow error. I was hoping it was related to the other error and would go away when the associated bug was fixed, but that seems not to be the case.

Does this error happen often? Can you reproduce it on demand? Is there anything different about the jobs that exhibit this issue?

-Scott
Scott,

There are more such errors today:

[root@m3-mgmt4 ec2-user]# grep -i underflow /mnt/slurm-logs/slurmctld.log
[2023-11-30T03:47:05.239] error: _handle_assoc_tres_run_secs: job 32667547: assoc 6896 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 273 seconds when only 272 remained.
[2023-11-30T04:22:05.238] error: _handle_assoc_tres_run_secs: job 32677122: assoc 4942 TRES cpu grp_used_tres_run_secs underflow, tried to remove 280 seconds when only 279 remained.
[2023-11-30T04:22:05.238] error: _handle_assoc_tres_run_secs: job 32677122: assoc 4942 TRES mem grp_used_tres_run_secs underflow, tried to remove 1146880 seconds when only 1142784 remained.
[2023-11-30T04:22:05.238] error: _handle_assoc_tres_run_secs: job 32677122: assoc 4942 TRES node grp_used_tres_run_secs underflow, tried to remove 280 seconds when only 279 remained.
[2023-11-30T04:22:05.238] error: _handle_assoc_tres_run_secs: job 32677122: assoc 4942 TRES billing grp_used_tres_run_secs underflow, tried to remove 280 seconds when only 279 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES cpu grp_used_tres_run_secs underflow, tried to remove 924 seconds when only 918 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES mem grp_used_tres_run_secs underflow, tried to remove 8673280 seconds when only 8616960 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES node grp_used_tres_run_secs underflow, tried to remove 154 seconds when only 153 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES billing grp_used_tres_run_secs underflow, tried to remove 924 seconds when only 918 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 154 seconds when only 153 remained.

I do not know what causes it.
Only three jobs are mentioned: one ended via timeout, one was cancelled, and one completed normally, all on different compute nodes. Would out-of-sync clocks cause this?

Simon
Could you send me the output of this command:

sacct -P --format=jobid,associd,ntasks,alloctres,reqtres,nodelist,eligible,submit,start,end,elapsed,timelimit,state -j 32667547,32677122,32675078

-Scott
[smichnow@m3-login2 ~]$ sacct -P --format=jobid,associd,ntasks,alloctres,reqtres,nodelist,eligible,submit,start,end,elapsed,timelimit,state -j 32667547,32677122,32675078
JobID|AssocID|NTasks|AllocTRES|ReqTRES|NodeList|Eligible|Submit|Start|End|Elapsed|Timelimit|State
32667547|6896||billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|m3p000|2023-11-28T21:46:10|2023-11-28T21:46:10|2023-11-28T21:46:37|2023-11-30T03:47:43|1-06:01:06|1-06:00:00|TIMEOUT
32667547.batch|6896|1|cpu=6,gres/gpu=1,mem=55G,node=1||m3p000|2023-11-28T21:46:37|2023-11-28T21:46:37|2023-11-28T21:46:37|2023-11-30T03:47:46|1-06:01:09||CANCELLED
32667547.extern|6896|1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1||m3p000|2023-11-28T21:46:37|2023-11-28T21:46:37|2023-11-28T21:46:37|2023-11-30T03:47:46|1-06:01:09||CANCELLED
32675078|4207||billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|m3p003|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-30T08:20:55|1-00:01:16|1-00:00:00|TIMEOUT
32675078.batch|4207|1|cpu=6,gres/gpu=1,mem=55G,node=1||m3p003|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-30T08:20:57|1-00:01:18||CANCELLED
32675078.extern|4207|1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1||m3p003|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-30T08:20:57|1-00:01:18||CANCELLED
32677122|4942||billing=1,cpu=1,mem=4G,node=1|billing=1,cpu=1,mem=4G,node=1|m3i004|2023-11-29T11:41:26|2023-11-29T11:41:26|2023-11-29T11:41:44|2023-11-30T04:22:50|16:41:06|16:40:00|TIMEOUT
32677122.batch|4942|1|cpu=1,mem=4G,node=1||m3i004|2023-11-29T11:41:44|2023-11-29T11:41:44|2023-11-29T11:41:44|2023-11-30T04:22:51|16:41:07||CANCELLED
32677122.extern|4942|1|billing=1,cpu=1,mem=4G,node=1||m3i004|2023-11-29T11:41:44|2023-11-29T11:41:44|2023-11-29T11:41:44|2023-11-30T04:22:51|16:41:07||COMPLETED
Simon,

It looks like all three jobs ended with TIMEOUT, which is expected given the error. I looked into it further but am still not sure what causes these error messages. In this case, however, it is probably benign. If you don't see any problems related to it, I will go ahead and close this ticket.

I agree with your guess: there is likely some minor timing issue that leads to this error occurring sometimes, but the difference is no more than one second.

-Scott
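As a sanity check on the one-second theory, the underflow deltas reported for job 32675078 line up exactly with its allocated TRES counts from the sacct output (cpu=6, mem=55G, node=1, gres/gpu=1) multiplied by one second of elapsed time. A small arithmetic sketch using the numbers from the log:

```shell
#!/bin/sh
# Sanity check: each underflow delta for job 32675078 equals the job's
# allocated TRES count times a one-second elapsed-time discrepancy.
cpu_delta=$((924 - 918))           # 6     = cpu(6)            * 1s
mem_delta=$((8673280 - 8616960))   # 56320 = mem 55G in MiB (55*1024) * 1s
node_delta=$((154 - 153))          # 1     = node(1)           * 1s
gpu_delta=$((154 - 153))           # 1     = gres/gpu(1)       * 1s
echo "cpu=$cpu_delta mem=$mem_delta node=$node_delta gpu=$gpu_delta"
# prints: cpu=6 mem=56320 node=1 gpu=1
```

Since every delta is exactly (TRES count) x 1 second, this is consistent with the controller computing the job's run seconds with a one-second difference at teardown, rather than with any corruption of the usage counters themselves.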
Scott I will close the ticket now regards Simon
Closing