Created attachment 27206 [details] Slurm controller log that shows above errors

Our Slurm controller is showing thousands of errors of the kind:

[2022-10-11T15:19:54.010] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)
[2022-10-11T15:19:54.011] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)
[2022-10-11T15:19:54.012] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)

and other errors which may be unrelated:

[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES cpu grp_used_tres_run_secs underflow, tried to remove 1692 seconds when only 1686 remained.
[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES mem grp_used_tres_run_secs underflow, tried to remove 15882240 seconds when only 15825920 remained.
[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES node grp_used_tres_run_secs underflow, tried to remove 282 seconds when only 281 remained.
[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES billing grp_used_tres_run_secs underflow, tried to remove 1692 seconds when only 1686 remained.
[2022-10-11T15:20:14.296] error: _handle_assoc_tres_run_secs: job 27275414: assoc 5971 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 282 seconds when only 281 remained.

We do not know the meaning of these errors.

We also have some random nodes going down with "Prolog error". Dmesg shows:

[2022-10-11T15:38:25.502] error: service_connection: slurm_receive_msg: Zero Bytes were transmitted or received
[2022-10-11T15:38:45.497] error: Waiting for JobId=27159216 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec

This may be related to the first problem. The nodes do still have a connection to the controller and can be resumed after this error occurs.
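For reference, a quick way to gauge how widespread these underflow messages are is to tally them per job and node. This is only a sketch: the sample lines below are copied from the excerpt above into a temporary file for illustration, and on a live system you would grep the real slurmctld log instead (the log path varies by site).

```shell
#!/bin/sh
# Sketch: tally "gres count underflow" errors per job/node.
# Sample lines are embedded for illustration; on a real controller,
# replace /tmp/underflow_sample.log with the actual slurmctld log path.
cat <<'EOF' > /tmp/underflow_sample.log
[2022-10-11T15:19:54.010] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)
[2022-10-11T15:19:54.011] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)
[2022-10-11T15:19:54.012] error: gres/gpu: job 27254366 dealloc node m3g009 type V100 gres count underflow (0 1)
EOF
# Extract the "job N dealloc node X" part and count occurrences of each pair.
grep -o 'job [0-9]* dealloc node [a-z0-9]*' /tmp/underflow_sample.log | sort | uniq -c
```

On the sample above this prints one line with a count of 3 for job 27254366 on node m3g009, which makes it easy to see whether the errors are confined to a few jobs or spread across the cluster.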
Created attachment 27207 [details] slurmd log of file with prolog error
Created attachment 27208 [details] slurm.conf
There is a typo in my original comment: instead of `dmesg` I meant the slurmd logs.
On the nodes with the Prolog error I see:

[2022-09-29T14:55:33.437] error: Invalid PrologSlurmctld(`/opt/slurm-latest/etc/slurmctld.prolog`): No such file or directory

This is weird, as this file does exist on the controller nodes, and other nodes do not show this message.
Simon,

The "gres count underflow" error seems to be an issue with how the backfill scheduler deals with gres. I am working on a fix and will get back to you. This issue may affect the backfill scheduler's ability to use those nodes, but otherwise shouldn't cause any other problems, though I haven't looked too deeply yet.

As for the other errors, they may be related, but I am unsure at the moment. I will let you know when I have more information.

-Scott
Simon,

For this error in your slurmd.log:

[2022-10-10T15:14:09.914] error: private-tmpdir: mkdir("/mnt/privatedir/tmpdir.27261746.0",0700): No such file or directory

Do you have this directory on the node?

-Scott
Scott,

We use SPANK plugins to mount local file systems into /tmp. /mnt/privatedir/tmpdir.27261746.0 is the name of such a folder created for this purpose, and 27261746 should be the job number. The folder is deleted at the end of the job; this is done in the Slurm epilog file:

TMPDIR_BASE=/mnt/privatedir/tmpdir
PRIVATE_TMPDIR="${TMPDIR_BASE}.${SLURM_JOB_ID}.${SLURM_RESTART_COUNT:=0}"
if [ -d "${PRIVATE_TMPDIR}" ]; then
    rm -rf "${PRIVATE_TMPDIR}"
fi

We have not seen this error for a while. I am not sure what triggered it or why it is no longer present; perhaps a user job caused it.

regards
Simon
Simon,

Could you send the output of this command for the relevant jobs?

sacct -p -j 27275414,27268224,27254366 --format=jobid,Start,Eligible,Submit,End,Suspended,Timelimit

-Scott
sacct -p -j 27275414,27268224,27254366 --format=jobid,Start,Eligible,Submit,End,Suspended,Timelimit

JobID|Start|Eligible|Submit|End|Suspended|Timelimit|
27254366|2022-10-10T23:11:06|2022-10-09T18:29:42|2022-10-09T18:29:42|2022-10-16T23:12:31|00:00:00|6-00:00:00|
27254366.batch|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-16T23:12:35|00:00:00||
27254366.extern|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-16T23:12:31|00:00:00||
27254366.0|2022-10-13T16:51:35|2022-10-13T16:51:35|2022-10-13T16:51:35|2022-10-16T18:35:22|00:00:00||
27254366.1|2022-10-16T19:47:37|2022-10-16T19:47:37|2022-10-16T19:47:37|2022-10-16T23:12:31|00:00:00||
27268224|2022-10-10T23:43:06|2022-10-10T19:57:37|2022-10-10T19:57:37|2022-10-12T23:44:15|00:00:00|2-00:00:00|
27268224.batch|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-12T23:44:16|00:00:00||
27268224.extern|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-12T23:44:15|00:00:00||
27275414|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:22|00:00:00|05:00:00|
27275414.batch|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:23|00:00:00||
27275414.extern|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:23|00:00:00||

Thanks. Is it possible there is some race condition happening?

regards
Simon
Simon,

Could you send the slurmctld logs for nov 10 and 12 as well?

-Scott
Also, can you send the output of this command:

sacct -p -j 27275414,27268224,27254366 --format=jobid,Start,Eligible,Submit,End,Suspended,Timelimit,reqtres,alloctres
Scott,

> Could you send the slurmctld logs for nov 10 and 12 as well?

I think you mean Oct 10 and 12. Unfortunately we do not keep our logs that far back, as we have a log rotate that eventually deletes them once they reach a certain size. The latest I have is 29 Oct. I enclose the sacct output.

regards
Simon

[smichnow@m3-login2 ~]$ sacct -p -j 27275414,27268224,27254366 --format=jobid,Start,Eligible,Submit,End,Suspended,Timelimit,reqtres,alloctres
JobID|Start|Eligible|Submit|End|Suspended|Timelimit|ReqTRES|AllocTRES|
27254366|2022-10-10T23:11:06|2022-10-09T18:29:42|2022-10-09T18:29:42|2022-10-16T23:12:31|00:00:00|6-00:00:00|billing=8,cpu=8,gres/gpu=1,mem=24G,node=1|billing=8,cpu=8,gres/gpu=1,mem=24G,node=1|
27254366.batch|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-16T23:12:35|00:00:00|||cpu=8,gres/gpu=1,mem=24G,node=1|
27254366.extern|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-10T23:11:06|2022-10-16T23:12:31|00:00:00|||billing=8,cpu=8,gres/gpu=1,mem=24G,node=1|
27254366.0|2022-10-13T16:51:35|2022-10-13T16:51:35|2022-10-13T16:51:35|2022-10-16T18:35:22|00:00:00|||cpu=8,gres/gpu=1,mem=24G,node=1|
27254366.1|2022-10-16T19:47:37|2022-10-16T19:47:37|2022-10-16T19:47:37|2022-10-16T23:12:31|00:00:00|||cpu=8,gres/gpu=1,mem=24G,node=1|
27268224|2022-10-10T23:43:06|2022-10-10T19:57:37|2022-10-10T19:57:37|2022-10-12T23:44:15|00:00:00|2-00:00:00|billing=2,cpu=2,gres/gpu=2,mem=62.50G,node=1|billing=2,cpu=2,gres/gpu=2,mem=62.50G,node=1|
27268224.batch|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-12T23:44:16|00:00:00|||cpu=2,gres/gpu=2,mem=62.50G,node=1|
27268224.extern|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-10T23:43:06|2022-10-12T23:44:15|00:00:00|||billing=2,cpu=2,gres/gpu=2,mem=62.50G,node=1|
27275414|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:22|00:00:00|05:00:00|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|
27275414.batch|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:23|00:00:00|||cpu=6,gres/gpu=1,mem=55G,node=1|
27275414.extern|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T10:19:55|2022-10-11T15:21:23|00:00:00|||billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|
Simon,

Sorry for the delay. Are you still getting these error messages? Is this affecting your workflow? Is there anything you know of that could help me reproduce the issue?

-Scott
Scott,

We are still getting "underflow" errors in our controller logs. The last was a few weeks ago, e.g.:

slurmctld.log-20221107:[2022-11-07T00:50:14.296] error: _handle_assoc_tres_run_secs: job 27690996: assoc 2007 TRES node grp_used_tres_run_secs underflow, tried to remove 299 seconds when only 298 remained.
slurmctld.log-20221107:[2022-11-07T00:50:14.296] error: _handle_assoc_tres_run_secs: job 27690996: assoc 2007 TRES billing grp_used_tres_run_secs underflow, tried to remove 1794 seconds when only 1788 remained.
slurmctld.log-20221107:[2022-11-07T00:50:14.296] error: _handle_assoc_tres_run_secs: job 27690996: assoc 2007 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 299 seconds when only 298 remained.

This does not interfere with our workflow, so it is not of immediate concern to us (other than a general concern about unusual errors). You mentioned you were working on a fix for the `gres underflow` issue. Is this something we should consider patching, or should we wait for the next Slurm release?

regards
Simon
Simon,

Do you use QOSes? What are their settings? One way to capture these is to attach the output of `scontrol show assoc`. I am particularly interested in the flags, GrpTRESRunMins, and usage factor.

Do users resize jobs often? Do you change usage factors on QOSes while jobs are running, or have you ever changed usage factors? Are you ever restarting the controller or reconfiguring?

Caden
This command could also help, and is more concise for what I am looking for:

sacctmgr show qos format=name,grptresrunmins,usagefactor,flags
Simon, There was a change in 23.02.1 that may fix this issue. Have you upgraded yet? Do you still see this issue in your cluster? -Scott
Scott,

Thanks for the update. We have no immediate plans to upgrade, but will advise once we do.

regards
Simon
Simon, Have you had a chance to upgrade yet? Is this still an issue for you? -Scott
Hi,

We have upgraded to 23.02.6 and I think the problem has gone away. There are just a few errors that look different:

[ec2-user@m3-mgmt4 slurm-logs]$ sudo grep underflow slurmctld.log
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES cpu grp_used_tres_run_secs underflow, tried to remove 1716 seconds when only 1710 remained.
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES mem grp_used_tres_run_secs underflow, tried to remove 16107520 seconds when only 16051200 remained.
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES node grp_used_tres_run_secs underflow, tried to remove 286 seconds when only 285 remained.
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES billing grp_used_tres_run_secs underflow, tried to remove 1716 seconds when only 1710 remained.
[2023-11-29T05:42:05.238] error: _handle_assoc_tres_run_secs: job 32667912: assoc 6215 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 286 seconds when only 285 remained.

I think we can close the ticket now.

Thanks
Simon
Simon,

I am still a little worried about the grp_used_tres_run_secs underflow error. I was hoping it was related to the other error and would go away when the associated bug was fixed, but that seems not to be the case.

Does this error happen often? Can you reproduce it on demand? Is there anything different about the jobs that exhibit this issue?

-Scott
Scott,

There are more such errors today:

[root@m3-mgmt4 ec2-user]# grep -i underflow /mnt/slurm-logs/slurmctld.log
[2023-11-30T03:47:05.239] error: _handle_assoc_tres_run_secs: job 32667547: assoc 6896 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 273 seconds when only 272 remained.
[2023-11-30T04:22:05.238] error: _handle_assoc_tres_run_secs: job 32677122: assoc 4942 TRES cpu grp_used_tres_run_secs underflow, tried to remove 280 seconds when only 279 remained.
[2023-11-30T04:22:05.238] error: _handle_assoc_tres_run_secs: job 32677122: assoc 4942 TRES mem grp_used_tres_run_secs underflow, tried to remove 1146880 seconds when only 1142784 remained.
[2023-11-30T04:22:05.238] error: _handle_assoc_tres_run_secs: job 32677122: assoc 4942 TRES node grp_used_tres_run_secs underflow, tried to remove 280 seconds when only 279 remained.
[2023-11-30T04:22:05.238] error: _handle_assoc_tres_run_secs: job 32677122: assoc 4942 TRES billing grp_used_tres_run_secs underflow, tried to remove 280 seconds when only 279 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES cpu grp_used_tres_run_secs underflow, tried to remove 924 seconds when only 918 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES mem grp_used_tres_run_secs underflow, tried to remove 8673280 seconds when only 8616960 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES node grp_used_tres_run_secs underflow, tried to remove 154 seconds when only 153 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES billing grp_used_tres_run_secs underflow, tried to remove 924 seconds when only 918 remained.
[2023-11-30T08:20:55.757] error: _handle_assoc_tres_run_secs: job 32675078: assoc 4175 TRES gres/gpu grp_used_tres_run_secs underflow, tried to remove 154 seconds when only 153 remained.

I do not know what causes it.
Only three jobs are mentioned: one ended via timeout, one was cancelled, and one completed normally, all on different compute nodes. Would out-of-sync clocks cause this?

Simon
Could you send me the output of this command:

sacct -P --format=jobid,associd,ntasks,alloctres,reqtres,nodelist,eligible,submit,start,end,elapsed,timelimit,state -j 32667547,32677122,32675078

-Scott
[smichnow@m3-login2 ~]$ sacct -P --format=jobid,associd,ntasks,alloctres,reqtres,nodelist,eligible,submit,start,end,elapsed,timelimit,state -j 32667547,32677122,32675078
JobID|AssocID|NTasks|AllocTRES|ReqTRES|NodeList|Eligible|Submit|Start|End|Elapsed|Timelimit|State
32667547|6896||billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|m3p000|2023-11-28T21:46:10|2023-11-28T21:46:10|2023-11-28T21:46:37|2023-11-30T03:47:43|1-06:01:06|1-06:00:00|TIMEOUT
32667547.batch|6896|1|cpu=6,gres/gpu=1,mem=55G,node=1||m3p000|2023-11-28T21:46:37|2023-11-28T21:46:37|2023-11-28T21:46:37|2023-11-30T03:47:46|1-06:01:09||CANCELLED
32667547.extern|6896|1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1||m3p000|2023-11-28T21:46:37|2023-11-28T21:46:37|2023-11-28T21:46:37|2023-11-30T03:47:46|1-06:01:09||CANCELLED
32675078|4207||billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1|m3p003|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-30T08:20:55|1-00:01:16|1-00:00:00|TIMEOUT
32675078.batch|4207|1|cpu=6,gres/gpu=1,mem=55G,node=1||m3p003|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-30T08:20:57|1-00:01:18||CANCELLED
32675078.extern|4207|1|billing=6,cpu=6,gres/gpu=1,mem=55G,node=1||m3p003|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-29T08:19:39|2023-11-30T08:20:57|1-00:01:18||CANCELLED
32677122|4942||billing=1,cpu=1,mem=4G,node=1|billing=1,cpu=1,mem=4G,node=1|m3i004|2023-11-29T11:41:26|2023-11-29T11:41:26|2023-11-29T11:41:44|2023-11-30T04:22:50|16:41:06|16:40:00|TIMEOUT
32677122.batch|4942|1|cpu=1,mem=4G,node=1||m3i004|2023-11-29T11:41:44|2023-11-29T11:41:44|2023-11-29T11:41:44|2023-11-30T04:22:51|16:41:07||CANCELLED
32677122.extern|4942|1|billing=1,cpu=1,mem=4G,node=1||m3i004|2023-11-29T11:41:44|2023-11-29T11:41:44|2023-11-29T11:41:44|2023-11-30T04:22:51|16:41:07||COMPLETED
Simon,

It looks like all three jobs ended with TIMEOUT, which is expected given the error. I looked into it further but am still not sure what causes these error messages. In this case, however, it is probably benign. If you don't see any problems related to it, I will go ahead and close this ticket.

I agree with your guess: there is likely some minor timing issue that leads to this error occurring sometimes, but the difference is no more than one second.

-Scott
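As a sanity check on the one-second theory, the underflow deltas reported for job 32675078 line up exactly with its allocated TRES counts from the sacct output (cpu=6, mem=55G, node=1, gres/gpu=1) multiplied by one second of elapsed time. A small arithmetic sketch using the numbers from the log:

```shell
#!/bin/sh
# Sanity check: each underflow delta for job 32675078 equals the job's
# allocated TRES count times a one-second elapsed-time discrepancy.
cpu_delta=$((924 - 918))           # 6     = cpu(6)            * 1s
mem_delta=$((8673280 - 8616960))   # 56320 = mem 55G in MiB (55*1024) * 1s
node_delta=$((154 - 153))          # 1     = node(1)           * 1s
gpu_delta=$((154 - 153))           # 1     = gres/gpu(1)       * 1s
echo "cpu=$cpu_delta mem=$mem_delta node=$node_delta gpu=$gpu_delta"
# prints: cpu=6 mem=56320 node=1 gpu=1
```

Since every delta is exactly (TRES count) x 1 second, this is consistent with the controller computing the job's run seconds with a one-second difference at teardown, rather than with any corruption of the usage counters themselves.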
Scott I will close the ticket now regards Simon
Closing