Ticket 13215 - gpus in use by jobs does not match AllocTRES
Summary: gpus in use by jobs does not match AllocTRES
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: GPU
Version: 21.08.4
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-01-18 11:30 MST by Renata Dart
Modified: 2022-03-23 14:37 MDT
1 user

See Also:
Site: SLAC
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (10.09 KB, TEXT/PLAIN)
2022-01-18 14:25 MST, Renata Dart
Details
gres.conf (972 bytes, TEXT/PLAIN)
2022-01-18 14:25 MST, Renata Dart
Details
slurmctld.log (2.09 MB, TEXT/PLAIN)
2022-01-18 14:44 MST, Renata Dart
Details
slurmd.logs.tar.gz (922.10 KB, APPLICATION/x-gzip)
2022-01-18 16:20 MST, Renata Dart
Details
ampt000.slurmd.log (1.40 MB, TEXT/PLAIN)
2022-01-20 16:55 MST, Renata Dart
Details

Description Renata Dart 2022-01-18 11:30:43 MST
Hi SchedMD, we are seeing a mismatch between the GPUs shown in use by AllocTRES and those requested by jobs.  The jobs collectively show one fewer GPU in use than the full number reported by scontrol show node.  At the moment 3 gpu nodes, tur017, ampt007, and psc001 are showing this issue. 

tur017  has 10 gpu
ampt007 has 4  gpu
psc001  has 10 gpu

Following is a show node and a show job for each running job to show the mismatch:

[renata@sdf-login03 licensing]$ scontrol show node tur017
NodeName=tur017 Arch=x86_64 CoresPerSocket=12 
   CPUAlloc=2 CPUTot=48 CPULoad=0.23
   AvailableFeatures=CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
   ActiveFeatures=CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
   Gres=gpu:geforce_rtx_2080_ti:10
   NodeAddr=tur017 NodeHostName=tur017 Version=21.08.4
   OS=Linux 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 
   RealMemory=191552 AllocMem=1500 FreeMem=162798 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=56117 Owner=N/A MCS_label=N/A
   Partitions=shared,cryoem,lcls,neutrino,ml,atlas 
   BootTime=2021-12-08T08:36:27 SlurmdStartTime=2021-12-08T15:48:06
   LastBusyTime=2022-01-17T12:35:26
   CfgTRES=cpu=48,mem=191552M,billing=48,gres/gpu=10,gres/gpu:geforce_rtx_2080_ti=10
   AllocTRES=cpu=2,mem=1500M,gres/gpu=10,gres/gpu:geforce_rtx_2080_ti=10
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[renata@sdf-login03 licensing]$ squeue | grep tur017
           1504819    shared      fah      ytl  R   21:33:01      1 tur017

[renata@sdf-login03 licensing]$ scontrol show job 1504819 | grep gpu
   TRES=cpu=2,mem=1500M,node=1,billing=2,gres/gpu=9,gres/gpu:geforce_rtx_2080_ti=9
   Command=/sdf/home/y/ytl/fah/fah-gpu-e.slurm
   TresPerJob=gres:gpu:9

----------------------------------------------------------
[renata@sdf-login03 licensing]$ scontrol show node ampt007
NodeName=ampt007 Arch=x86_64 CoresPerSocket=64 
   CPUAlloc=55 CPUTot=128 CPULoad=54.07
   AvailableFeatures=CPU_GEN:RME,CPU_SKU:7542,CPU_FRQ:2.10GHz,GPU_GEN:AMP,GPU_SKU:A100,GPU_MEM:40GB,GPU_CC:8.0
   ActiveFeatures=CPU_GEN:RME,CPU_SKU:7542,CPU_FRQ:2.10GHz,GPU_GEN:AMP,GPU_SKU:A100,GPU_MEM:40GB,GPU_CC:8.0
   Gres=gpu:a100:4
   NodeAddr=ampt007 NodeHostName=ampt007 Version=21.08.4
   OS=Linux 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 
   RealMemory=1029344 AllocMem=554460 FreeMem=586033 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=227207 Owner=N/A MCS_label=N/A
   Partitions=shared,cryoem,neutrino,ml,ard,atlas 
   BootTime=2022-01-16T20:49:40 SlurmdStartTime=2022-01-16T20:59:20
   LastBusyTime=2022-01-16T20:59:20
   CfgTRES=cpu=128,mem=1029344M,billing=128,gres/gpu=4,gres/gpu:a100=4
   AllocTRES=cpu=55,mem=554460M,gres/gpu=4,gres/gpu:a100=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[renata@sdf-login03 licensing]$ squeue | grep ampt007
           1506847    shared      fah      ytl  R    1:05:15      1 ampt007
           1506238    shared Imp_Samp   milesm  R    7:19:13      1 ampt007
[renata@sdf-login03 licensing]$ scontrol show job 1506847 | grep gpu
   TRES=cpu=1,mem=1500M,node=1,billing=1,gres/gpu=3,gres/gpu:a100=3
   Command=/sdf/home/y/ytl/fah/fah-gpu-e.slurm
   TresPerJob=gres:gpu:3
[renata@sdf-login03 licensing]$ scontrol show job 1506238 | grep gpu
[renata@sdf-login03 licensing]$

---------------------------------------------------------
[renata@sdf-login03 licensing]$ scontrol show node psc001
NodeName=psc001 Arch=x86_64 CoresPerSocket=12 
   CPUAlloc=10 CPUTot=48 CPULoad=0.87
   AvailableFeatures=CPU_GEN:HSW,CPU_SKU:E5-2670v3,CPU_FRQ:2.30GHz,GPU_GEN:PSC,GPU_SKU:GTX1080TI,GPU_MEM:11GB,GPU_CC:6.1
   ActiveFeatures=CPU_GEN:HSW,CPU_SKU:E5-2670v3,CPU_FRQ:2.30GHz,GPU_GEN:PSC,GPU_SKU:GTX1080TI,GPU_MEM:11GB,GPU_CC:6.1
   Gres=gpu:geforce_gtx_1080_ti:10
   NodeAddr=psc001 NodeHostName=psc001 Version=21.08.4
   OS=Linux 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 
   RealMemory=257336 AllocMem=14192 FreeMem=217910 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=58319 Owner=N/A MCS_label=N/A
   Partitions=shared,cryoem,atlas 
   BootTime=2021-12-10T15:22:12 SlurmdStartTime=2021-12-10T15:23:20
   LastBusyTime=2022-01-16T07:51:09
   CfgTRES=cpu=48,mem=257336M,billing=48,gres/gpu=10,gres/gpu:geforce_gtx_1080_ti=10
   AllocTRES=cpu=10,mem=14192M,gres/gpu=10,gres/gpu:geforce_gtx_1080_ti=10
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[renata@sdf-login03 licensing]$ squeue | grep psc001
           1506916    shared arc_pilo osgatlas  R       4:44      1 psc001
           1503765    shared      fah      ytl  R 1-14:58:05      1 psc001
           1504788    shared      fah      ytl  R   22:03:29      1 psc001
           1503921    shared      fah      ytl  R 1-13:03:02      1 psc001
           1504022    shared      fah      ytl  R 1-13:01:21      1 psc001
[renata@sdf-login03 licensing]$ scontrol show job 1506916 | grep gpu
   TRES=cpu=2,mem=8G,node=1,billing=2,gres/gpu=1,gres/gpu:geforce_gtx_1080_ti=1
   TresPerJob=gres:gpu:1
[renata@sdf-login03 licensing]$ scontrol show job 1503765 | grep gpu
   TRES=cpu=2,mem=1500M,node=1,billing=2,gres/gpu=2,gres/gpu:geforce_gtx_1080_ti=2
   Command=/sdf/home/y/ytl/fah/fah-gpu-e.slurm
   TresPerJob=gres:gpu:2
[renata@sdf-login03 licensing]$ scontrol show job 1504788 | grep gpu
   TRES=cpu=2,mem=1500M,node=1,billing=2,gres/gpu=1,gres/gpu:geforce_gtx_1080_ti=1
   Command=/sdf/home/y/ytl/fah/fah-gpu-a.slurm
   TresPerJob=gres:gpu:1
[renata@sdf-login03 licensing]$ scontrol show job 1503921 | grep gpu
   TRES=cpu=2,mem=1500M,node=1,billing=1,gres/gpu=3,gres/gpu:geforce_gtx_1080_ti=3
   Command=/sdf/home/y/ytl/fah/fah-gpu-b.slurm
   TresPerJob=gres:gpu:3
[renata@sdf-login03 licensing]$ scontrol show job 1504022 | grep gpu
   TRES=cpu=2,mem=1500M,node=1,billing=2,gres/gpu=1,gres/gpu:geforce_gtx_1080_ti=1
   Command=/sdf/home/y/ytl/fah/fah-gpu-b.slurm
   TresPerJob=gres:gpu:1


Is this a problem with our configuration?

Thanks,
Renata
Comment 1 Jason Booth 2022-01-18 14:01:14 MST
Would you please attach your slurm.conf and gres.conf? Please also attach the slurmctld.log from the time the jobs were submitted until they started?
Comment 2 Renata Dart 2022-01-18 14:25:07 MST
Created attachment 23034 [details]
slurm.conf

Hi Jason, I have attached our slurm.conf and gres.conf.  Will
check the log and send that in a bit.

Renata

Comment 3 Renata Dart 2022-01-18 14:25:07 MST
Created attachment 23035 [details]
gres.conf
Comment 4 Renata Dart 2022-01-18 14:44:35 MST
Created attachment 23036 [details]
slurmctld.log

Hi again, I have just attached the slurmctld.log that
covers the time period for these jobs.

Renata

Comment 6 Michael Hinton 2022-01-18 15:41:05 MST
Hi Renata,

(In reply to Renata Dart from comment #0)
> At the moment 3 gpu nodes, tur017, ampt007, and psc001 are showing
> this issue. 
I see this in slurmctld.log:

[2022-01-16T19:17:28.326] error: gres/gpu: job 1503517 dealloc of node psc001 bad node_offset 0 count is 0
[2022-01-16T19:17:28.361] error: gres/gpu: job 1503488 dealloc of node tur017 bad node_offset 0 count is 0
[2022-01-16T19:19:28.379] error: gres/gpu: job 1503519 dealloc of node ampt007 bad node_offset 0 count is 0

This is probably another symptom of the issue you are seeing. Could you attach the slurmd logs for those 3 nodes?

Your configuration looks good to me, and there doesn't seem to be anything unique about those bad nodes.

I wonder if those slurmds just failed to get restarted on those nodes. I would restart the controller and the slurmds for those nodes to see if that clears the issue.

If that doesn't help, then drain those nodes and make sure no jobs are running on them. With no jobs on the node, check to make sure that no GPUs are still in use. If GPUs are still in use but no jobs are running on the node, restart the ctld and the nodes again.
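For example (assuming slurmctld and slurmd are managed by systemd on your hosts, and adjusting the node list and reason text as needed), that sequence would look roughly like:

   # on the controller
   systemctl restart slurmctld
   # on tur017, ampt007, and psc001
   systemctl restart slurmd

   # if that doesn't clear it, drain the nodes and let running jobs finish
   scontrol update NodeName=tur017,ampt007,psc001 State=DRAIN Reason="gres/gpu accounting mismatch"
   # once idle, verify no GPUs are still shown as allocated, then return the nodes to service
   scontrol update NodeName=tur017,ampt007,psc001 State=RESUME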

The issues you are seeing here can also cause the dealloc errors in bug 13203.

I wouldn't use `squeue | grep <nodename>`, because you may not get all the jobs using that node. For example, if a job is allocated "psc001" and "psc002", I believe it could show up as "psc00[1-2]" and thus not get caught by grep. A safer method of filtering by nodes is with an squeue option: `squeue -w <nodename>`.

Another tip for debugging GPU jobs: Add `-d` to scontrol show [node|job]. This will show you the GPU bitmaps for the node|job, which may help.
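Putting those two tips together for the nodes and job from the description, that would be something like:

   squeue -w tur017,ampt007,psc001
   scontrol -d show node tur017
   scontrol -d show job 1504819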

Thanks!
-Michael
Comment 7 Michael Hinton 2022-01-18 16:02:42 MST
The 3 nodes in question were all allocated to jobs started by backfill, and those jobs eventually ended in a "bad node_offset" error:

[2022-01-16T19:13:49.048] requeue job JobId=1495647 due to failure of node ampt000
[2022-01-16T19:13:49.048] Requeuing JobId=1495647
[2022-01-16T19:13:49.048] cleanup_completing: JobId=1503488 completion process took 614 seconds
[2022-01-16T19:13:49.048] cleanup_completing: JobId=1503517 completion process took 614 seconds
[2022-01-16T19:13:49.048] cleanup_completing: JobId=1503519 completion process took 614 seconds
[2022-01-16T19:13:49.048] cleanup_completing: JobId=1503581 completion process took 614 seconds
[2022-01-16T19:13:49.049] error: Nodes ampt000 not responding, setting DOWN
[2022-01-16T19:14:27.616] sched/backfill: _start_job: Started JobId=1503488 in shared on tur017
[2022-01-16T19:14:27.616] sched/backfill: _start_job: Started JobId=1503581 in shared on psc001
[2022-01-16T19:14:27.617] sched/backfill: _start_job: Started JobId=1503517 in shared on psc001
[2022-01-16T19:14:27.618] sched/backfill: _start_job: Started JobId=1503519 in shared on ampt007
...
[2022-01-16T19:17:28.326] error: gres/gpu: job 1503517 dealloc of node psc001 bad node_offset 0 count is 0
[2022-01-16T19:17:28.361] error: gres/gpu: job 1503488 dealloc of node tur017 bad node_offset 0 count is 0
[2022-01-16T19:19:28.344] error: gres/gpu: job 1503581 dealloc of node psc001 bad node_offset 0 count is 0
[2022-01-16T19:19:28.379] error: gres/gpu: job 1503519 dealloc of node ampt007 bad node_offset 0 count is 0

Can you get any more information about those four jobs? If they are no longer in scontrol, what does sacct say about them and about the GPUs they allocated, and how many times were they requeued? Are they related to ampt000 failing (e.g. were those jobs originally running on ampt000 and then requeued)?
Comment 8 Renata Dart 2022-01-18 16:20:28 MST
Created attachment 23040 [details]
slurmd.logs.tar.gz

Hi Michael, I have attached slurmd.logs.tar.gz  

ampt007-slurmd.log  
psc001-slurmd.log 
psc001-slurmd.log-20220118.gz  
tur017-slurmd.log

as requested by Jason.

On Tue, 18 Jan 2022, bugs@schedmd.com wrote:

>https://bugs.schedmd.com/show_bug.cgi?id=13215
>
>--- Comment #6 from Michael Hinton <hinton@schedmd.com> ---
>Hi Renata,
>
>(In reply to Renata Dart from comment #0)
>> At the moment 3 gpu nodes, tur017, ampt007, and psc001 are showing
>> this issue. 
>I see this in slurmctld.log:
>
>[2022-01-16T19:17:28.326] error: gres/gpu: job 1503517 dealloc of node psc001
>bad node_offset 0 count is 0
>[2022-01-16T19:17:28.361] error: gres/gpu: job 1503488 dealloc of node tur017
>bad node_offset 0 count is 0
>[2022-01-16T19:19:28.379] error: gres/gpu: job 1503519 dealloc of node ampt007
>bad node_offset 0 count is 0
>
>This is probably another symptom of the issue you are seeing. Could you attach
>the slurmd logs for those 3 nodes?
>
>Your configuration looks good to me, and there doesn't seem to be anything
>unique about those bad nodes.
>
>I wonder if maybe those slurmds just failed to get restarted on those nodes. I
>would restart the controller and the slurmds for those nodes to see if maybe
>this helps fix the issue.

I'll try that next.

>
>If that doesn't help, then drain those nodes and make sure no jobs are running
>on them. With no jobs on the node, check to make sure that no GPUs are still in
>use. If GPUs are still in use but no jobs are running on the node, restart the
>ctld and the nodes again.
>
>The issues you are seeing here can also cause the dealloc errors in bug 13203.
>
>I wouldn't use `squeue | grep <nodename>`, because you may not get all the jobs
>using that node. For example, if a job is allocated "psc001" and "psc002", I
>believe it could show up as "psc00[1-2]" and thus not get caught by grep. A
>safer method of filtering by nodes is with an squeue option: `squeue -w
><nodename>`.

Thanks for that tip, I did also search for just psc to double-check, but your
way is much cleaner.

>
>Another tip for debugging GPU jobs: Add `-d` to scontrol show [node|job]. This
>will show you the GPU bitmaps for the node|job, which may help.

Sounds good.


Thanks,
Renata


>
>Thanks!
>-Michael
Comment 10 Michael Hinton 2022-01-18 17:16:44 MST
I see a lot of these errors in the slurmd logs you attached:

[2022-01-16T19:14:27.970] [1503488.extern] debug:  cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-16T19:14:27.970] [1503488.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1503488/devices.allow: Invalid argument

Do the slurmd logs for your good nodes have those errors? Or is this just limited to the bad nodes?
Comment 11 Renata Dart 2022-01-19 08:23:30 MST
Hi Michael, yes I do see them on a "good" node.  This is ampt015, which
has 3 jobs, two of which are using gpu.  Show node says 4 gpu are in
use and the jobs concur (one is using 3 and the other 1):

[renata@ampt015 ~]$ squeue -w ampt015
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1495647    cryoem 4k-2Tomo summ1234  R 2-06:59:34      1 ampt015
           1508857        ml  cryonet   axlevy  R   13:54:32      1 ampt015
           1509619    shared      fah      ytl  R    5:24:07      1 ampt015
[root@ampt015 slurm]# scontrol -d show node ampt015 | grep -i gpu
   AvailableFeatures=CPU_GEN:RME,CPU_SKU:7542,CPU_FRQ:2.10GHz,GPU_GEN:AMP,GPU_SKU:A100,GPU_MEM:40GB,GPU_CC:8.0
   ActiveFeatures=CPU_GEN:RME,CPU_SKU:7542,CPU_FRQ:2.10GHz,GPU_GEN:AMP,GPU_SKU:A100,GPU_MEM:40GB,GPU_CC:8.0
   Gres=gpu:a100:4
   GresUsed=gpu:a100:4(IDX:0-3)
   CfgTRES=cpu=128,mem=1029344M,billing=128,gres/gpu=4,gres/gpu:a100=4
   AllocTRES=cpu=71,mem=900572M,gres/gpu=4,gres/gpu:a100=4
[renata@ampt015 ~]$ scontrol show job 1495647 | grep -i gpu
[renata@ampt015 ~]$
[renata@ampt015 ~]$
[root@ampt015 slurm]# scontrol -d show job 1508857 | grep gpu
   TRES=cpu=64,mem=128G,node=1,billing=64,gres/gpu=1,gres/gpu:a100=1
   JOB_GRES=gpu:a100:1
     Nodes=ampt015 CPU_IDs=0-1,16-27,30-79,128-191 Mem=131072 GRES=gpu:a100:1(IDX:2)
   TresPerJob=gres:gpu:1
[renata@ampt015 ~]$
[renata@ampt015 ~]$
[root@ampt015 slurm]# scontrol -d show job 1509619 | grep gpu
   TRES=cpu=1,mem=1500M,node=1,billing=1,gres/gpu=3,gres/gpu:a100=3
   JOB_GRES=gpu:a100:3
     Nodes=ampt015 CPU_IDs=14-15 Mem=1500 GRES=gpu:a100:3(IDX:0-1,3)
   Command=/sdf/home/y/ytl/fah/fah-gpu-e.slurm
   TresPerJob=gres:gpu:3

[root@ampt015 slurm]#
[root@ampt015 slurm]#
[root@ampt015 slurmd]# cd /var/log/slurm/
[root@ampt015 slurm]#
[root@ampt015 slurm]# grep "safe_write (5 of 5) failed: Invalid argument" slurmd.log | tail
[2022-01-18T16:59:16.440] [1508852.extern] debug:  cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-18T16:59:17.601] [1508852.batch] debug:  cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-18T17:02:56.994] [1508857.extern] debug:  cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-18T17:02:57.076] [1508857.batch] debug:  cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-19T01:21:20.353] [1509470.extern] debug:  cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-19T01:21:20.418] [1509470.batch] debug:  cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-19T01:28:21.410] [1509550.extern] debug:  cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-19T01:28:21.480] [1509550.batch] debug:  cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-19T01:33:21.911] [1509619.extern] debug:  cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-19T01:33:21.981] [1509619.batch] debug:  cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[root@ampt015 slurm]# grep "error: common_file_write_content" slurmd.log | tail
[2022-01-18T16:59:16.440] [1508852.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_1102874/job_1508852/devices.allow: Invalid argument
[2022-01-18T16:59:17.601] [1508852.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_1102874/job_1508852/devices.allow: Invalid argument
[2022-01-18T17:02:56.994] [1508857.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_17753/job_1508857/devices.allow: Invalid argument
[2022-01-18T17:02:57.076] [1508857.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_17753/job_1508857/devices.allow: Invalid argument
[2022-01-19T01:21:20.353] [1509470.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509470/devices.allow: Invalid argument
[2022-01-19T01:21:20.418] [1509470.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509470/devices.allow: Invalid argument
[2022-01-19T01:28:21.410] [1509550.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509550/devices.allow: Invalid argument
[2022-01-19T01:28:21.480] [1509550.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509550/devices.allow: Invalid argument
[2022-01-19T01:33:21.911] [1509619.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509619/devices.allow: Invalid argument
[2022-01-19T01:33:21.981] [1509619.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509619/devices.allow: Invalid argument


Renata


Comment 12 Renata Dart 2022-01-19 08:48:05 MST
Hi Michael, here is the sacct -l output for each of the jobs:

[renata@sdf-login03 conf-cache]$ sacct -l -j 1503517
JobID        JobIDRaw        JobName  Partition  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks  AllocCPUS    Elapsed      State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov     ReqMem ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask    AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask   AveDiskWrite    ReqTRES  AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutAve TRESUsageOutTot 
------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -------------- ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- ------------------- ------------------- --------------- --------------- 
1503517      1503517             fah     shared                                                                                                                                                                                                               2   00:03:01     FAILED    127:0                  Unknown       Unknown       Unknown      1500M              0                                                                                                                           billing=1+ billing=1+                                                                                                                                                                                                                                 
1503517.bat+ 1503517.bat+      batch               221356K         psc001              0    221356K      1880K     psc001          0      1880K        0       psc001              0          0   00:00:00     psc001          0   00:00:00        1          2   04:53:33     FAILED    127:0      2.30G             0             0             0                         0        1.14M          psc001               0          1.14M        0.08M           psc001                0          0.08M            cpu=2,mem+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=psc001,fs/d+           fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 
1503517.ext+ 1503517.ext+     extern               212184K         psc001              0    212184K       892K     psc001          0       892K        0       psc001              0          0   00:00:00     psc001          0   00:00:00        1          2   04:53:33  COMPLETED      0:0      1.20G             0             0             0                         0        0.00M          psc001               0          0.00M            0           psc001                0              0            billing=1+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=psc001,fs/d+           fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 
[renata@sdf-login03 conf-cache]$ sacct -l -j 1503488
JobID        JobIDRaw        JobName  Partition  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks  AllocCPUS    Elapsed      State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov     ReqMem ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask    AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask   AveDiskWrite    ReqTRES  AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutAve TRESUsageOutTot 
------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -------------- ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- ------------------- ------------------- --------------- --------------- 
1503488      1503488             fah     shared                                                                                                                                                                                                               2   00:03:01     FAILED    127:0                  Unknown       Unknown       Unknown      1500M              0                                                                                                                           billing=1+ billing=1+                                                                                                                                                                                                                                 
1503488.bat+ 1503488.bat+      batch               221360K         tur017              0    221360K      1880K     tur017          0      1880K        0       tur017              0          0   00:00:00     tur017          0   00:00:00        1          2   05:50:41     FAILED    127:0      2.30G             0             0             0                         0        1.17M          tur017               0          1.17M        0.09M           tur017                0          0.09M            cpu=2,mem+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=tur017,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=tur017,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=tur017,fs/d+           fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 
1503488.ext+ 1503488.ext+     extern               212184K         tur017              0    212184K       896K     tur017          0       896K        0       tur017              0          0   00:00:00     tur017          0   00:00:00        1          2   05:50:41  COMPLETED      0:0      2.00G             0             0             0                         0        0.00M          tur017               0          0.00M            0           tur017                0              0            billing=1+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=tur017,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=tur017,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=tur017,fs/d+           fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 
[renata@sdf-login03 conf-cache]$ sacct -l -j 1503581
JobID        JobIDRaw        JobName  Partition  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks  AllocCPUS    Elapsed      State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov     ReqMem ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask    AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask   AveDiskWrite    ReqTRES  AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutAve TRESUsageOutTot 
------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -------------- ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- ------------------- ------------------- --------------- --------------- 
1503581      1503581             fah     shared                                                                                                                                                                                                               2   00:05:01     FAILED    127:0                  Unknown       Unknown       Unknown      1500M              0                                                                                                                           billing=1+ billing=1+                                                                                                                                                                                                                                 
1503581.bat+ 1503581.bat+      batch               221356K         psc001              0    221356K      1880K     psc001          0      1880K        0       psc001              0          0   00:00:00     psc001          0   00:00:00        1          2   03:25:16     FAILED    127:0      2.21G             0             0             0                         0        1.14M          psc001               0          1.14M        0.08M           psc001                0          0.08M            cpu=2,mem+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=psc001,fs/d+           fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 
1503581.ext+ 1503581.ext+     extern               108056K         psc001              0    108056K       352K     psc001          0       352K        0       psc001              0          0   00:00:00     psc001          0   00:00:00        1          2   03:25:16  COMPLETED      0:0      1.20G             0             0             0                         0        0.00M          psc001               0          0.00M            0           psc001                0              0            billing=1+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=psc001,fs/d+           fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 
[renata@sdf-login03 conf-cache]$ sacct -l -j 1503519
JobID        JobIDRaw        JobName  Partition  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks  AllocCPUS    Elapsed      State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov     ReqMem ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask    AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask   AveDiskWrite    ReqTRES  AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutAve TRESUsageOutTot 
------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -------------- ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- ------------------- ------------------- --------------- --------------- 
1503519      1503519             fah     shared                                                                                                                                                                                                               1   00:05:01     FAILED    127:0                  Unknown       Unknown       Unknown      1500M              0                                                                                                                           billing=1+ billing=1+                                                                                                                                                                                                                                 
1503519.bat+ 1503519.bat+      batch               221356K        ampt007              0    221356K      1876K    ampt007          0      1876K        0      ampt007              0          0   00:00:00    ampt007          0   00:00:00        1          1   04:52:59     FAILED    127:0      1.50G             0             0             0                         0        2.03M         ampt007               0          2.03M        0.08M          ampt007                0          0.08M            cpu=1,mem+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=ampt007,energ+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=ampt007,energ+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=ampt007,fs/+           fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 
1503519.ext+ 1503519.ext+     extern               212348K        ampt007              0    212348K      1292K    ampt007          0      1292K        0      ampt007              0          0   00:00:00    ampt007          0   00:00:00        1          1   04:52:59  COMPLETED      0:0      1.50G             0             0             0                         0        0.00M         ampt007               0          0.00M            0          ampt007                0              0            billing=1+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=ampt007,energ+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=ampt007,energ+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=ampt007,fs/+           fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 


Renata

Comment 13 Michael Hinton 2022-01-19 10:27:16 MST
Do you have a cgroup_allowed_devices_file.conf file?
Comment 14 Michael Hinton 2022-01-19 11:21:42 MST
(In reply to Michael Hinton from comment #13)
> Do you have a cgroup_allowed_devices_file.conf file?
If so, you should remove it. It has been obsolete for quite a while now, and now in 21.08, it causes those (harmless) cgroup errors.
Comment 15 Renata Dart 2022-01-19 11:48:06 MST
Hi Michael, yes we do:

[root@slurmctld1 slurm]# cat cgroup_allowed_devices_file.conf
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*

To remove it is it enough to:

remove it from our slurmctld server
systemctl restart slurmctld
scontrol reconfigure

And was there some replacement we were supposed to do for
the information in that file?  

Renata

Comment 16 Michael Hinton 2022-01-19 12:00:09 MST
(In reply to Renata Dart from comment #15)
> To remove it is it enough to:
> 
> remove it from our slurmctld server
> systemctl restart slurmctld
> scontrol reconfigure
You should remove it from all slurmds as well. Removing it and restarting Slurm (ctld and slurmds) should be all that is required.

> And was there some replacement we were supposed to do for
> the information in that file?  
Nothing else needs to be done. The replacement is specifying File= in gres.conf, which is already required. File= whitelists those GPU device files, and by default all other device files are not allowed for the job (I believe).

To be clear, this cgroup issue should be harmless, and it is unrelated to the actual issue described in this ticket. But it will be nice to clear up all those errors.
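To sketch it out (the /etc/slurm path and the gres.conf lines below are illustrative, not copied from your attached files; adjust them to your actual layout):

   # on the ctld host and on every slurmd host
   rm /etc/slurm/cgroup_allowed_devices_file.conf
   systemctl restart slurmctld    # controller
   systemctl restart slurmd       # each compute node

and the existing File= entries in gres.conf would look something like:

   NodeName=tur017  Name=gpu Type=geforce_rtx_2080_ti File=/dev/nvidia[0-9]
   NodeName=ampt007 Name=gpu Type=a100                File=/dev/nvidia[0-3]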
Comment 17 Michael Hinton 2022-01-19 12:02:13 MST
Sorry, could you redo the sacct commands for the jobs, but add `-p`? That way nothing gets truncated.
Comment 18 Michael Hinton 2022-01-19 12:10:41 MST
(In reply to Michael Hinton from comment #17)
> Sorry, could you redo the sacct commands for the jobs, but add `-p`? That
> makes it so that nothing gets truncated.
Also, can you add the `-D` option? That should show us the duplicate records in case these jobs were requeued (and I think they were).
Comment 19 Renata Dart 2022-01-19 12:11:01 MST
Hi Michael, here they are:

[renata@sdf-login03 conf-cache]$ sacct -l -p -j 1503517
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
1503517|1503517|fah|shared||||||||||||||||||2|00:03:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1||||||||||||||
1503517.batch|1503517.batch|batch||221356K|psc001|0|221356K|1880K|psc001|0|1880K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|04:53:33|FAILED|127:0|2.30G|0|0|0||0|1.14M|psc001|0|1.14M|0.08M|psc001|0|0.08M||cpu=2,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|energy=0,fs/disk=87019|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=87019|energy=0,fs/disk=87019|
1503517.extern|1503517.extern|extern||212184K|psc001|0|212184K|892K|psc001|0|892K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|04:53:33|COMPLETED|0:0|1.20G|0|0|0||0|0.00M|psc001|0|0.00M|0|psc001|0|0||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|energy=0,fs/disk=0|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|
[renata@sdf-login03 conf-cache]$ sacct -l -p -j 1503488
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
1503488|1503488|fah|shared||||||||||||||||||2|00:03:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1||||||||||||||
1503488.batch|1503488.batch|batch||221360K|tur017|0|221360K|1880K|tur017|0|1880K|0|tur017|0|0|00:00:00|tur017|0|00:00:00|1|2|05:50:41|FAILED|127:0|2.30G|0|0|0||0|1.17M|tur017|0|1.17M|0.09M|tur017|0|0.09M||cpu=2,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|energy=0,fs/disk=92034|energy=tur017,fs/disk=tur017|fs/disk=0|energy=0,fs/disk=92034|energy=0,fs/disk=92034|
1503488.extern|1503488.extern|extern||212184K|tur017|0|212184K|896K|tur017|0|896K|0|tur017|0|0|00:00:00|tur017|0|00:00:00|1|2|05:50:41|COMPLETED|0:0|2.00G|0|0|0||0|0.00M|tur017|0|0.00M|0|tur017|0|0||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|energy=0,fs/disk=0|energy=tur017,fs/disk=tur017|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|
[renata@sdf-login03 conf-cache]$ sacct -l -p -j 1503581
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
1503581|1503581|fah|shared||||||||||||||||||2|00:05:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1||||||||||||||
1503581.batch|1503581.batch|batch||221356K|psc001|0|221356K|1880K|psc001|0|1880K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|03:25:16|FAILED|127:0|2.21G|0|0|0||0|1.14M|psc001|0|1.14M|0.08M|psc001|0|0.08M||cpu=2,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|energy=0,fs/disk=87036|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=87036|energy=0,fs/disk=87036|
1503581.extern|1503581.extern|extern||108056K|psc001|0|108056K|352K|psc001|0|352K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|03:25:16|COMPLETED|0:0|1.20G|0|0|0||0|0.00M|psc001|0|0.00M|0|psc001|0|0||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|energy=0,fs/disk=0|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|
[renata@sdf-login03 conf-cache]$ sacct -l -p -j 1503519
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
1503519|1503519|fah|shared||||||||||||||||||1|00:05:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1||||||||||||||
1503519.batch|1503519.batch|batch||221356K|ampt007|0|221356K|1876K|ampt007|0|1876K|0|ampt007|0|0|00:00:00|ampt007|0|00:00:00|1|1|04:52:59|FAILED|127:0|1.50G|0|0|0||0|2.03M|ampt007|0|2.03M|0.08M|ampt007|0|0.08M||cpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|energy=0,fs/disk=88096|energy=ampt007,fs/disk=ampt007|fs/disk=0|energy=0,fs/disk=88096|energy=0,fs/disk=88096|
1503519.extern|1503519.extern|extern||212348K|ampt007|0|212348K|1292K|ampt007|0|1292K|0|ampt007|0|0|00:00:00|ampt007|0|00:00:00|1|1|04:52:59|COMPLETED|0:0|1.50G|0|0|0||0|0.00M|ampt007|0|0.00M|0|ampt007|0|0||billing=1,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|energy=0,fs/disk=0|energy=ampt007,fs/disk=ampt007|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|


Renata

Comment 20 Michael Hinton 2022-01-19 12:37:03 MST
I'm sorry to have to ask you to do this a third time, but could you add the -D option and redo all the sacct calls? I think seeing the older job records before the requeue may help us understand how to replicate this issue. Thanks!
Comment 21 Renata Dart 2022-01-19 14:57:53 MST
Hi Michael, sorry for the delay.  I went off to an appointment and tried
to let you know with an email from my cell phone, but I have a different
userid there, so it was rejected since your site didn't recognize it.
In any case, here they are with -D:

[renata@sdf-login03 conf-cache]$ sacct -l -D -p -j 1503517
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
1503517|1503517|fah|shared||||||||||||||||||2|00:03:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1||||||||||||||
1503517.batch|1503517.batch|batch||221356K|psc001|0|221356K|1880K|psc001|0|1880K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|04:53:33|FAILED|127:0|2.30G|0|0|0||0|1.14M|psc001|0|1.14M|0.08M|psc001|0|0.08M||cpu=2,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|energy=0,fs/disk=87019|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=87019|energy=0,fs/disk=87019|
1503517.extern|1503517.extern|extern||212184K|psc001|0|212184K|892K|psc001|0|892K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|04:53:33|COMPLETED|0:0|1.20G|0|0|0||0|0.00M|psc001|0|0.00M|0|psc001|0|0||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|energy=0,fs/disk=0|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|
[renata@sdf-login03 conf-cache]$ 
[renata@sdf-login03 conf-cache]$ sacct -l -D -p -j 1503488
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
1503488|1503488|fah|shared||||||||||||||||||2|00:03:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1||||||||||||||
1503488.batch|1503488.batch|batch||221360K|tur017|0|221360K|1880K|tur017|0|1880K|0|tur017|0|0|00:00:00|tur017|0|00:00:00|1|2|05:50:41|FAILED|127:0|2.30G|0|0|0||0|1.17M|tur017|0|1.17M|0.09M|tur017|0|0.09M||cpu=2,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|energy=0,fs/disk=92034|energy=tur017,fs/disk=tur017|fs/disk=0|energy=0,fs/disk=92034|energy=0,fs/disk=92034|
1503488.extern|1503488.extern|extern||212184K|tur017|0|212184K|896K|tur017|0|896K|0|tur017|0|0|00:00:00|tur017|0|00:00:00|1|2|05:50:41|COMPLETED|0:0|2.00G|0|0|0||0|0.00M|tur017|0|0.00M|0|tur017|0|0||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|energy=0,fs/disk=0|energy=tur017,fs/disk=tur017|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|
[renata@sdf-login03 conf-cache]$ 
[renata@sdf-login03 conf-cache]$ sacct -l -D -p -j 1503581
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
1503581|1503581|fah|shared||||||||||||||||||2|00:05:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1||||||||||||||
1503581.batch|1503581.batch|batch||221356K|psc001|0|221356K|1880K|psc001|0|1880K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|03:25:16|FAILED|127:0|2.21G|0|0|0||0|1.14M|psc001|0|1.14M|0.08M|psc001|0|0.08M||cpu=2,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|energy=0,fs/disk=87036|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=87036|energy=0,fs/disk=87036|
1503581.extern|1503581.extern|extern||108056K|psc001|0|108056K|352K|psc001|0|352K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|03:25:16|COMPLETED|0:0|1.20G|0|0|0||0|0.00M|psc001|0|0.00M|0|psc001|0|0||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|energy=0,fs/disk=0|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|
[renata@sdf-login03 conf-cache]$ 
[renata@sdf-login03 conf-cache]$ sacct -l -D -p -j 1503519
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
1503519|1503519|fah|shared||||||||||||||||||1|00:05:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1||||||||||||||
1503519.batch|1503519.batch|batch||221356K|ampt007|0|221356K|1876K|ampt007|0|1876K|0|ampt007|0|0|00:00:00|ampt007|0|00:00:00|1|1|04:52:59|FAILED|127:0|1.50G|0|0|0||0|2.03M|ampt007|0|2.03M|0.08M|ampt007|0|0.08M||cpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|energy=0,fs/disk=88096|energy=ampt007,fs/disk=ampt007|fs/disk=0|energy=0,fs/disk=88096|energy=0,fs/disk=88096|
1503519.extern|1503519.extern|extern||212348K|ampt007|0|212348K|1292K|ampt007|0|1292K|0|ampt007|0|0|00:00:00|ampt007|0|00:00:00|1|1|04:52:59|COMPLETED|0:0|1.50G|0|0|0||0|0.00M|ampt007|0|0.00M|0|ampt007|0|0||billing=1,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|energy=0,fs/disk=0|energy=ampt007,fs/disk=ampt007|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0|

Renata
Comment 22 Michael Hinton 2022-01-19 17:18:51 MST
Renata, could you run the following command for me and paste the output?

sacct -p -D -j 1495647,1503488,1503517,1503519,1503581 --format=jobid,jobidraw,jobname,partition,nodelist,nnodes,ntasks,reqcpus,alloccpus,ncpus,reqnodes,allocnodes,reqmem,reqtres,alloctres,state,exitcode,derivedexitcode,reason,elapsed,elapsedraw,submit,eligible,start,end,suspended,submitline

All of these jobs were on node ampt000, but that node stopped responding. It seems 1495647 was requeued successfully, while 1503488, 1503517, 1503519, and 1503581 were marked completed and then rescheduled by the backfill scheduler (but apparently not requeued...). I think that explains the odd differences in elapsed time between the jobs themselves and their batch and extern steps.

It's confusing to me how jobs 1503488, 1503517, 1503519, and 1503581 could get scheduled twice (once by the main scheduler, once by backfill) without apparently being requeued, and that error may be causing some of the bad accounting. I'll keep looking into it. At any rate, it seems to be an edge case that occurs when a node fails and jobs are moved/requeued away from the failed node to a new one.

Thanks!
-Michael
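
For reference, a minimal sketch of how those duplicate records can be pulled apart once -D is in the picture (the job ID is a placeholder; the Flags field is what distinguishes records placed by the main scheduler from those placed by backfill):

# All records for a job, including the pre-requeue ones, and which scheduler placed each:
sacct -p -D -j <jobid> --format=JobID,Start,End,State,NodeList,Flags,Reason

# The same query without -D only shows the most recent record for the job:
sacct -p -j <jobid> --format=JobID,Start,End,State,NodeList,Flags,Reason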
Comment 23 Renata Dart 2022-01-19 17:40:40 MST
Hi Michael, here it is:

[renata@rome0001 ~]$ sacct -p -D -j 1495647,1503488,1503517,1503519,1503581 --format=jobid,jobidraw,jobname,partition,nodelist,nnodes,ntasks,reqcpus,alloccpus,ncpus,reqnodes,allocnodes,reqmem,reqtres,alloctres,state,exitcode,derivedexitcode,reason,elapsed,elapsedraw,submit,eligible,start,end,suspended,submitline
JobID|JobIDRaw|JobName|Partition|NodeList|NNodes|NTasks|ReqCPUS|AllocCPUS|NCPUS|ReqNodes|AllocNodes|ReqMem|ReqTRES|AllocTRES|State|ExitCode|DerivedExitCode|Reason|Elapsed|ElapsedRaw|Submit|Eligible|Start|End|Suspended|SubmitLine|
1495647|1495647|4k-2TomoRecon003|cryoem|ampt008|1||6|6|6|1|1|750G|billing=6,cpu=6,mem=750G,node=1|billing=6,cpu=6,mem=750G,node=1|NODE_FAIL|1:0|0:0|None|3-13:47:14|308834|2022-01-12T16:26:45|2022-01-12T16:26:45|2022-01-12T19:34:27|2022-01-16T09:21:41|00:00:00|sbatch batchScript.sbatch .|
1495647.batch|1495647.batch|batch||ampt008|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|CANCELLED||||3-13:47:14|308834|2022-01-12T19:34:27|2022-01-12T19:34:27|2022-01-12T19:34:27|2022-01-16T09:21:41|00:00:00||
1495647.extern|1495647.extern|extern||ampt008|1|1|6|6|6|1|1|||billing=6,cpu=6,mem=750G,node=1|CANCELLED||||3-13:47:14|308834|2022-01-12T19:34:27|2022-01-12T19:34:27|2022-01-12T19:34:27|2022-01-16T09:21:41|00:00:00||
1495647.0|1495647.0|e2tomogram.py||ampt008|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|CANCELLED||||3-13:47:13|308833|2022-01-12T19:34:28|2022-01-12T19:34:28|2022-01-12T19:34:28|2022-01-16T09:21:41|00:00:00|srun e2tomogram.py tiltseries/20211022-HSM-Lam4_Tilt7.hdf --tltstep=2.0 --npk=20 --tltkeep=0.9 --outsize=4k --niter=2,1,1,1 --bytile --notmp --pkkeep=0.9 --compressbits=-1 --clipz=-1 --bxsz=64 --filterto=0.35 --rmbeadthr=-1.0 --threads=48 --patchtrack=-1|
1495647|1495647|4k-2TomoRecon003|cryoem|ampt000|1||6|6|6|1|1|750G|billing=6,cpu=6,mem=750G,node=1|billing=6,cpu=6,mem=750G,node=1|NODE_FAIL|0:0|0:0|BeginTime|09:49:59|35399|2022-01-16T09:21:41|2022-01-16T09:23:42|2022-01-16T09:23:50|2022-01-16T19:13:49|00:00:00|sbatch batchScript.sbatch .|
1495647.batch|1495647.batch|batch||ampt000|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|CANCELLED||||09:49:59|35399|2022-01-16T09:23:50|2022-01-16T09:23:50|2022-01-16T09:23:50|2022-01-16T19:13:49|00:00:00||
1495647.extern|1495647.extern|extern||ampt000|1|1|6|6|6|1|1|||billing=6,cpu=6,mem=750G,node=1|CANCELLED||||09:49:59|35399|2022-01-16T09:23:50|2022-01-16T09:23:50|2022-01-16T09:23:50|2022-01-16T19:13:49|00:00:00||
1495647.0|1495647.0|e2tomogram.py||ampt000|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|CANCELLED||||09:49:58|35398|2022-01-16T09:23:51|2022-01-16T09:23:51|2022-01-16T09:23:51|2022-01-16T19:13:49|00:00:00|srun e2tomogram.py tiltseries/20211022-HSM-Lam4_Tilt7.hdf --tltstep=2.0 --npk=20 --tltkeep=0.9 --outsize=4k --niter=2,1,1,1 --bytile --notmp --pkkeep=0.9 --compressbits=-1 --clipz=-1 --bxsz=64 --filterto=0.35 --rmbeadthr=-1.0 --threads=48 --patchtrack=-1|
1503488|1503488|fah|shared|tur017|1||1|2|2|1|1|1500M|billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|FAILED|127:0|0:0|None|00:03:01|181|2022-01-16T13:26:46|2022-01-16T13:26:46|2022-01-16T19:14:27|2022-01-16T19:17:28|00:00:00|sbatch --nice=100 --gpus=1 --time=2-2:00:00 /sdf/home/y/ytl/fah/fah-gpu-b.slurm|
1503488.batch|1503488.batch|batch||tur017|1|1|2|2|2|1|1|||cpu=2,mem=1500M,node=1|FAILED|127:0|||05:50:41|21041|2022-01-16T13:26:47|2022-01-16T13:26:47|2022-01-16T13:26:47|2022-01-16T19:17:28|00:00:00||
1503488.extern|1503488.extern|extern||tur017|1|1|2|2|2|1|1|||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|COMPLETED|0:0|||05:50:41|21041|2022-01-16T13:26:47|2022-01-16T13:26:47|2022-01-16T13:26:47|2022-01-16T19:17:28|00:00:00||
1503517|1503517|fah|shared|psc001|1||1|2|2|1|1|1500M|billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|FAILED|127:0|0:0|None|00:03:01|181|2022-01-16T14:23:55|2022-01-16T14:23:55|2022-01-16T19:14:27|2022-01-16T19:17:28|00:00:00|sbatch --nice=100 --gpus=1 --time=2-2:00:00 /sdf/home/y/ytl/fah/fah-gpu-b.slurm|
1503517.batch|1503517.batch|batch||psc001|1|1|2|2|2|1|1|||cpu=2,mem=1500M,node=1|FAILED|127:0|||04:53:33|17613|2022-01-16T14:23:55|2022-01-16T14:23:55|2022-01-16T14:23:55|2022-01-16T19:17:28|00:00:00||
1503517.extern|1503517.extern|extern||psc001|1|1|2|2|2|1|1|||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|COMPLETED|0:0|||04:53:33|17613|2022-01-16T14:23:55|2022-01-16T14:23:55|2022-01-16T14:23:55|2022-01-16T19:17:28|00:00:00||
1503519|1503519|fah|shared|ampt007|1||1|1|1|1|1|1500M|billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|FAILED|127:0|0:0|None|00:05:01|301|2022-01-16T14:26:26|2022-01-16T14:26:26|2022-01-16T19:14:27|2022-01-16T19:19:28|00:00:00|sbatch --nice=100 --gpus=1 --time=2-2:00:00 /sdf/home/y/ytl/fah/fah-gpu-a.slurm|
1503519.batch|1503519.batch|batch||ampt007|1|1|1|1|1|1|1|||cpu=1,mem=1500M,node=1|FAILED|127:0|||04:52:59|17579|2022-01-16T14:26:29|2022-01-16T14:26:29|2022-01-16T14:26:29|2022-01-16T19:19:28|00:00:00||
1503519.extern|1503519.extern|extern||ampt007|1|1|1|1|1|1|1|||billing=1,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|COMPLETED|0:0|||04:52:59|17579|2022-01-16T14:26:29|2022-01-16T14:26:29|2022-01-16T14:26:29|2022-01-16T19:19:28|00:00:00||
1503581|1503581|fah|shared|psc001|1||1|2|2|1|1|1500M|billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|FAILED|127:0|0:0|None|00:05:01|301|2022-01-16T15:54:11|2022-01-16T15:54:11|2022-01-16T19:14:27|2022-01-16T19:19:28|00:00:00|sbatch --nice=100 --gpus=1 --time=2-2:00:00 /sdf/home/y/ytl/fah/fah-gpu-a.slurm|
1503581.batch|1503581.batch|batch||psc001|1|1|2|2|2|1|1|||cpu=2,mem=1500M,node=1|FAILED|127:0|||03:25:16|12316|2022-01-16T15:54:12|2022-01-16T15:54:12|2022-01-16T15:54:12|2022-01-16T19:19:28|00:00:00||
1503581.extern|1503581.extern|extern||psc001|1|1|2|2|2|1|1|||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|COMPLETED|0:0|||03:25:16|12316|2022-01-16T15:54:12|2022-01-16T15:54:12|2022-01-16T15:54:12|2022-01-16T19:19:28|00:00:00||
1495647|1495647|4k-2TomoRecon003|cryoem|ampt001|1||6|6|6|1|1|750G|billing=6,cpu=6,mem=750G,node=1|billing=6,cpu=6,mem=750G,node=1|NODE_FAIL|1:0|0:0|BeginTime|04:39:31|16771|2022-01-16T19:13:49|2022-01-16T19:15:50|2022-01-16T19:16:03|2022-01-16T23:55:34|00:00:00|sbatch batchScript.sbatch .|
1495647.batch|1495647.batch|batch||ampt001|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|CANCELLED||||04:39:31|16771|2022-01-16T19:16:03|2022-01-16T19:16:03|2022-01-16T19:16:03|2022-01-16T23:55:34|00:00:00||
1495647.extern|1495647.extern|extern||ampt001|1|1|6|6|6|1|1|||billing=6,cpu=6,mem=750G,node=1|CANCELLED||||04:39:31|16771|2022-01-16T19:16:03|2022-01-16T19:16:03|2022-01-16T19:16:03|2022-01-16T23:55:34|00:00:00||
1495647.0|1495647.0|e2tomogram.py||ampt001|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|CANCELLED||||04:39:31|16771|2022-01-16T19:16:03|2022-01-16T19:16:03|2022-01-16T19:16:03|2022-01-16T23:55:34|00:00:00|srun e2tomogram.py tiltseries/20211022-HSM-Lam4_Tilt7.hdf --tltstep=2.0 --npk=20 --tltkeep=0.9 --outsize=4k --niter=2,1,1,1 --bytile --notmp --pkkeep=0.9 --compressbits=-1 --clipz=-1 --bxsz=64 --filterto=0.35 --rmbeadthr=-1.0 --threads=48 --patchtrack=-1|
1495647|1495647|4k-2TomoRecon003|cryoem|ampt015|1||6|6|6|1|1|750G|billing=6,cpu=6,mem=750G,node=1|billing=6,cpu=6,mem=750G,node=1|RUNNING|0:0|0:0|BeginTime|2-16:41:16|232876|2022-01-16T23:55:34|2022-01-16T23:57:35|2022-01-16T23:57:54|Unknown|00:00:00|sbatch batchScript.sbatch .|
1495647.batch|1495647.batch|batch||ampt015|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|RUNNING|0:0|||2-16:41:16|232876|2022-01-16T23:57:54|2022-01-16T23:57:54|2022-01-16T23:57:54|Unknown|00:00:00||
1495647.extern|1495647.extern|extern||ampt015|1|1|6|6|6|1|1|||billing=6,cpu=6,mem=750G,node=1|RUNNING|0:0|||2-16:41:16|232876|2022-01-16T23:57:54|2022-01-16T23:57:54|2022-01-16T23:57:54|Unknown|00:00:00||
1495647.0|1495647.0|e2tomogram.py||ampt015|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|RUNNING|0:0|||2-16:41:16|232876|2022-01-16T23:57:54|2022-01-16T23:57:54|2022-01-16T23:57:54|Unknown|00:00:00|srun e2tomogram.py tiltseries/20211022-HSM-Lam4_Tilt7.hdf --tltstep=2.0 --npk=20 --tltkeep=0.9 --outsize=4k --niter=2,1,1,1 --bytile --notmp --pkkeep=0.9 --compressbits=-1 --clipz=-1 --bxsz=64 --filterto=0.35 --rmbeadthr=-1.0 --threads=48 --patchtrack=-1|

-Renata


Comment 24 Michael Hinton 2022-01-20 11:55:56 MST
How often do you see this situation occur? Is it only when a node fails? Have you only started seeing this issue after upgrading to 21.08?
Comment 25 Renata Dart 2022-01-20 13:29:48 MST
Hi Michael, the mismatch of gpus in show node vs show job was
happening before we upgraded, and we had hoped it would be fixed when
we went to 21.08 in December.  On Tuesday (after your suggestion) I
restarted slurmctld and slurmd on the 3 nodes I reported at the start
of the ticket:

tur017  has 10 gpu
ampt007 has 4  gpu
psc001  has 10 gpu

but that didn't seem to make a difference for the jobs already running.
Today, however, I removed cgroup_allowed_devices_file.conf, restarted
slurmctld, ran scontrol reconfigure, and restarted all the slurmds, and
I double-checked the jobs currently running on those 3 nodes and
everything looks consistent.  I also checked a few other gpu nodes, and
show node and show job agree on the allocated gpus there as well.  So
maybe this was an odd case, as you surmised.

Renata
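
For reference, a minimal sketch of one way to cross-check a node's AllocTRES GPU count against the jobs running on it (the node name is just an example; assumes GNU grep with -P support and the usual scontrol/squeue output formats):

#!/bin/bash
node=tur017

# gres/gpu count reported on the node's AllocTRES line
node_gpus=$(scontrol show node "$node" | grep -oP 'AllocTRES=.*?gres/gpu=\K[0-9]+')

# Sum of gres/gpu over the jobs currently running on that node.
# (For multi-node jobs this counts the job's total gres/gpu, not just this node's share.)
job_gpus=$(squeue -h -w "$node" -t R -o '%i' | while read -r jobid; do
    scontrol show job "$jobid" | grep -oP '^\s*TRES=.*?gres/gpu=\K[0-9]+'
done | awk '{s += $1} END {print s + 0}')

echo "$node: AllocTRES gres/gpu=$node_gpus, sum over running jobs=$job_gpus"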


Comment 26 Michael Hinton 2022-01-20 15:25:14 MST
Could you give me the slurmd log of node ampt000 for 2022-01-16? Thanks!
Comment 27 Michael Hinton 2022-01-20 15:42:08 MST
Could you also run this command and paste the output?

sacct -p -D -j 1495647,1503488,1503517,1503519,1503581 --format=JobID,JobName,DBIndex,Start,timelimit,TimelimitRaw,Flags,TotalCPU,MinCPU,MinCPUNode,MinCPUTask
Comment 28 Renata Dart 2022-01-20 16:32:51 MST
Hi Michael, here is the output:

[renata@sdf-login02 conf-cache]$ sacct -p -D -j 1495647,1503488,1503517,1503519,1503581 --format=JobID,JobName,DBIndex,Start,timelimit,TimelimitRaw,Flags,TotalCPU,MinCPU,MinCPUNode,MinCPUTask
JobID|JobName|DBIndex|Start|Timelimit|TimelimitRaw|Flags|TotalCPU|MinCPU|MinCPUNode|MinCPUTask|
1495647|4k-2TomoRecon003|2044544|2022-01-12T19:34:27|4-02:00:00|5880|SchedMain|00:00:00||||
1495647.batch|batch|2044544|2022-01-12T19:34:27||||00:00:00||||
1495647.extern|extern|2044544|2022-01-12T19:34:27||||00:00:00||||
1495647.0|e2tomogram.py|2044544|2022-01-12T19:34:28||||00:00:00||||
1495647|4k-2TomoRecon003|2053945|2022-01-16T09:23:50|4-02:00:00|5880|SchedMain|00:00:00||||
1495647.batch|batch|2053945|2022-01-16T09:23:50||||00:00:00||||
1495647.extern|extern|2053945|2022-01-16T09:23:50||||00:00:00||||
1495647.0|e2tomogram.py|2053945|2022-01-16T09:23:51||||00:00:00||||
1503488|fah|2054315|2022-01-16T19:14:27|2-02:00:00|3000|SchedBackfill|00:00.360||||
1503488.batch|batch|2054315|2022-01-16T13:26:47||||00:00.358|00:00:00|tur017|0|
1503488.extern|extern|2054315|2022-01-16T13:26:47||||00:00.001|00:00:00|tur017|0|
1503517|fah|2054349|2022-01-16T19:14:27|2-02:00:00|3000|SchedBackfill|00:00.364||||
1503517.batch|batch|2054349|2022-01-16T14:23:55||||00:00.362|00:00:00|psc001|0|
1503517.extern|extern|2054349|2022-01-16T14:23:55||||00:00.001|00:00:00|psc001|0|
1503519|fah|2054353|2022-01-16T19:14:27|2-02:00:00|3000|SchedBackfill|00:00.356||||
1503519.batch|batch|2054353|2022-01-16T14:26:29||||00:00.355|00:00:00|ampt007|0|
1503519.extern|extern|2054353|2022-01-16T14:26:29||||00:00.001|00:00:00|ampt007|0|
1503581|fah|2054444|2022-01-16T19:14:27|2-02:00:00|3000|SchedBackfill|00:00.396||||
1503581.batch|batch|2054444|2022-01-16T15:54:12||||00:00.395|00:00:00|psc001|0|
1503581.extern|extern|2054444|2022-01-16T15:54:12||||00:00.001|00:00:00|psc001|0|
1495647|4k-2TomoRecon003|2054955|2022-01-16T19:16:03|4-02:00:00|5880|SchedMain|00:00:00||||
1495647.batch|batch|2054955|2022-01-16T19:16:03||||00:00:00||||
1495647.extern|extern|2054955|2022-01-16T19:16:03||||00:00:00||||
1495647.0|e2tomogram.py|2054955|2022-01-16T19:16:03||||00:00:00||||
1495647|4k-2TomoRecon003|2055317|2022-01-16T23:57:54|4-02:00:00|5880|SchedMain|00:00:00||||
1495647.batch|batch|2055317|2022-01-16T23:57:54||||00:00:00||||
1495647.extern|extern|2055317|2022-01-16T23:57:54||||00:00:00||||
1495647.0|e2tomogram.py|2055317|2022-01-16T23:57:54||||00:00:00||||


Renata

Comment 29 Renata Dart 2022-01-20 16:35:26 MST
Hi Michael, the slurmd logs for ampt000 roll over on 2022-01-16.
The current one starts at [2022-01-16T08:40:33.834].  Will that
work, or do you need the earlier one?

Renata

Comment 30 Michael Hinton 2022-01-20 16:44:23 MST
(In reply to Renata Dart from comment #29)
> Hi Michael, the slurmd logs for ampt000 roll over on 2022-01-16.
> The current one starts at [2022-01-16T08:40:33.834].  Will that
> work, or do you need the earlier one?
That should work. Thanks!
Comment 31 Renata Dart 2022-01-20 16:55:44 MST
Created attachment 23075 [details]
ampt000.slurmd.log

Hi Michael, the log is attached.

Renata

Comment 32 Michael Hinton 2022-02-01 09:27:17 MST
Renata,

How often are you seeing this issue, and how disruptive is it?

Thanks!
-Michael
Comment 33 Michael Hinton 2022-03-16 10:47:14 MDT
Hi Renata,

Are you still seeing issues?

Are you able to reproduce these issues on a test cluster? A reliable reproducer will help us the most, as it is difficult to understand how the cluster is getting into this state from logs alone. I suspect you are hitting similar issues as in bug 13600.

Thanks!
-Michael
Comment 34 Renata Dart 2022-03-16 14:39:36 MDT
Hi Michael, I don't think we have seen this again.  I just ran clush
against all of our hosts' slurmd logs and no longer see the dealloc
error message.

I did open a ticket for roughly the opposite problem, where a user
requested only 1 gpu and 32 cpus through a bash slurm job and got the
entire node of 4 gpus and 128 cpus.  But that probably isn't related.

However I am seeing this error message in our slurmd logs:

[2022-03-16T13:15:27.886] [2107856.batch] debug:  task/cgroup: task_cgroup_devices_init: unable to open /var/spool/slurmd/conf-cache/cgroup_allowed_devices_file.conf: No such file or directory

Based on your recommendation earlier in this ticket I "removed" our
cgroup_allowed_devices_file.conf by renaming it to
cgroup_allowed_devices_file.conf.bak.  I did this shortly after you
suggested it, and I have restarted slurmctld and slurmd at least a
couple of times since then.

Renata
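
For reference, a minimal sketch of that kind of clush sweep (the slurmd log path and the grep pattern are placeholders rather than the exact dealloc message text; nodes with zero matches will exit non-zero, which clush will flag):

clush -a -b "grep -c 'dealloc' /var/log/slurm/slurmd.log"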


Comment 35 Michael Hinton 2022-03-16 15:44:36 MDT
(In reply to Renata Dart from comment #34)
> [2022-03-16T13:15:27.886] [2107856.batch] debug:  task/cgroup:
> task_cgroup_devices_init: unable to open
> /var/spool/slurmd/conf-cache/cgroup_allowed_devices_file.conf: No such file
> or directory
That should be fine. We left in the code that checks for cgroup_allowed_devices_file.conf, and apparently there is still a debug message that prints when the file is not found.

I'm going to go ahead and mark this as resolved, since the issue has gone away. Feel free to reopen this if you see it again, and if you do, try to put together a reproducer, as that will be the best way for us to figure out a solution.

We are working on a similar issue in bug 13600 that may be related as well. I'll keep you posted if anything comes of that.

Thanks,
-Michael
Comment 36 Renata Dart 2022-03-16 16:03:19 MDT
Thanks Michael, that sounds good.  Thanks for all your help.

Renata

Comment 37 Michael Hinton 2022-03-23 11:37:11 MDT
(In reply to Renata Dart from comment #34)
> [2022-03-16T13:15:27.886] [2107856.batch] debug:  task/cgroup:
> task_cgroup_devices_init: unable to open
> /var/spool/slurmd/conf-cache/cgroup_allowed_devices_file.conf: No such file
> or directory
This debug message will be removed in Slurm 22.05, by the way.
Comment 38 Renata Dart 2022-03-23 14:37:30 MDT
Hi Michael, thanks for the update; it will be helpful to have it
removed.

Renata

