| Summary: | gpus in use by jobs does not match AllocTRES |
|---|---|
| Product: | Slurm |
| Component: | GPU |
| Reporter: | Renata Dart <renata> |
| Assignee: | Director of Support <support> |
| Status: | RESOLVED CANNOTREPRODUCE |
| Severity: | 4 - Minor Issue |
| CC: | lyeager |
| Version: | 21.08.4 |
| Hardware: | Linux |
| OS: | Linux |
| Site: | SLAC |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=13203, https://bugs.schedmd.com/show_bug.cgi?id=13600 |
| Attachments: | slurm.conf, gres.conf, slurmctld.log, slurmd.logs.tar.gz, ampt000.slurmd.log |
--- Comment #1 from Jason Booth <jbooth@schedmd.com> ---

Would you please attach your slurm.conf and gres.conf? Please also attach the slurmctld.log covering the time from when the jobs were submitted until they started?

--- Comment from Renata Dart <renata> ---

Created attachment 23034: slurm.conf

Hi Jason, I have attached our slurm.conf and gres.conf. I will check the log and send that in a bit.

Renata

--- Comment from Renata Dart <renata> ---

Created attachment 23035: gres.conf
--- Comment from Renata Dart <renata> ---

Created attachment 23036: slurmctld.log

Hi again, I have just attached the slurmctld.log that covers the time period for these jobs.

Renata

--- Comment #6 from Michael Hinton <hinton@schedmd.com> ---

Hi Renata,

(In reply to Renata Dart from comment #0)
> At the moment 3 gpu nodes, tur017, ampt007, and psc001 are showing
> this issue.

I see this in slurmctld.log:

[2022-01-16T19:17:28.326] error: gres/gpu: job 1503517 dealloc of node psc001 bad node_offset 0 count is 0
[2022-01-16T19:17:28.361] error: gres/gpu: job 1503488 dealloc of node tur017 bad node_offset 0 count is 0
[2022-01-16T19:19:28.379] error: gres/gpu: job 1503519 dealloc of node ampt007 bad node_offset 0 count is 0

This is probably another symptom of the issue you are seeing. Could you attach the slurmd logs for those 3 nodes?

Your configuration looks good to me, and there doesn't seem to be anything unique about those bad nodes.

I wonder if those slurmds simply failed to get restarted. I would restart the controller and the slurmds for those nodes to see if that fixes the issue.

If that doesn't help, then drain those nodes and make sure no jobs are running on them. With no jobs on a node, check that no GPUs are still in use. If GPUs are still in use but no jobs are running on the node, restart the ctld and the nodes again.

The issues you are seeing here can also cause the dealloc errors in bug 13203.

I wouldn't use `squeue | grep <nodename>`, because you may not get all the jobs using that node. For example, if a job is allocated "psc001" and "psc002", I believe it could show up as "psc00[1-2]" and thus not get caught by grep. A safer way to filter by node is the squeue option `squeue -w <nodename>`.

Another tip for debugging GPU jobs: add `-d` to `scontrol show [node|job]`. This will show you the GPU bitmaps for the node or job, which may help.

Thanks!
-Michael

--- Comment #7 from Michael Hinton <hinton@schedmd.com> ---

The 3 nodes in question were all allocated to jobs started by backfill, and those jobs eventually ended in a "bad node_offset" error:

[2022-01-16T19:13:49.048] requeue job JobId=1495647 due to failure of node ampt000
[2022-01-16T19:13:49.048] Requeuing JobId=1495647
[2022-01-16T19:13:49.048] cleanup_completing: JobId=1503488 completion process took 614 seconds
[2022-01-16T19:13:49.048] cleanup_completing: JobId=1503517 completion process took 614 seconds
[2022-01-16T19:13:49.048] cleanup_completing: JobId=1503519 completion process took 614 seconds
[2022-01-16T19:13:49.048] cleanup_completing: JobId=1503581 completion process took 614 seconds
[2022-01-16T19:13:49.049] error: Nodes ampt000 not responding, setting DOWN
[2022-01-16T19:14:27.616] sched/backfill: _start_job: Started JobId=1503488 in shared on tur017
[2022-01-16T19:14:27.616] sched/backfill: _start_job: Started JobId=1503581 in shared on psc001
[2022-01-16T19:14:27.617] sched/backfill: _start_job: Started JobId=1503517 in shared on psc001
[2022-01-16T19:14:27.618] sched/backfill: _start_job: Started JobId=1503519 in shared on ampt007
...
[2022-01-16T19:17:28.326] error: gres/gpu: job 1503517 dealloc of node psc001 bad node_offset 0 count is 0
[2022-01-16T19:17:28.361] error: gres/gpu: job 1503488 dealloc of node tur017 bad node_offset 0 count is 0
[2022-01-16T19:19:28.344] error: gres/gpu: job 1503581 dealloc of node psc001 bad node_offset 0 count is 0
[2022-01-16T19:19:28.379] error: gres/gpu: job 1503519 dealloc of node ampt007 bad node_offset 0 count is 0

Can you get any more information about those four jobs? If they are no longer in scontrol, what does sacct say about them and about the GPUs they allocated, and how many times were they requeued? Are they related to ampt000 failing (e.g. maybe those jobs were originally running on ampt000 but got requeued)?

--- Comment from Renata Dart <renata> ---

Created attachment 23040: slurmd.logs.tar.gz

Hi Michael, I have attached slurmd.logs.tar.gz, as requested by Jason, containing:

ampt007-slurmd.log
psc001-slurmd.log
psc001-slurmd.log-20220118.gz
tur017-slurmd.log

On the suggestions in comment #6: I'll try restarting the controller and the slurmds next. Thanks for the `squeue -w` tip; I did also search for just "psc" to double-check, but your way is much cleaner. Adding `-d` to scontrol sounds good too.

Thanks,
Renata

--- Comment #10 from Michael Hinton <hinton@schedmd.com> ---

I see a lot of these errors in the slurmd logs you attached:

[2022-01-16T19:14:27.970] [1503488.extern] debug: cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-16T19:14:27.970] [1503488.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1503488/devices.allow: Invalid argument

Do the slurmd logs for your good nodes have those errors? Or is this just limited to the bad nodes?

--- Comment from Renata Dart <renata> ---

Hi Michael, yes I do see them on a "good" node. This is ampt015, it
has 3 jobs, two of which are using GPUs. `scontrol show node` says 4 GPUs are in
use and the jobs concur: one is using 3 and the other 1:
[renata@ampt015 ~]$ squeue -w ampt015
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1495647 cryoem 4k-2Tomo summ1234 R 2-06:59:34 1 ampt015
1508857 ml cryonet axlevy R 13:54:32 1 ampt015
1509619 shared fah ytl R 5:24:07 1 ampt015
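On Michael's earlier point about grepping squeue output: Slurm compresses node lists into bracketed ranges, so a plain grep for a node name can miss jobs. A minimal sketch of why, with a hypothetical expansion helper (not part of Slurm; it handles only a single simple bracket, while real Slurm hostlists can be more complex):

```python
import re

def expand_hostlist(hosts):
    """Expand a simple Slurm hostlist like 'psc00[1-2]' into node names.
    Only handles one [a-b,c] bracket; real hostlists can nest and combine."""
    m = re.fullmatch(r"([^\[]+)\[([0-9,\-]+)\]", hosts)
    if not m:
        return [hosts]
    prefix, spec = m.groups()
    nodes = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)
            nodes.extend(f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1))
        else:
            nodes.append(prefix + part)
    return nodes

# 'psc001' never appears literally in the compressed form, so grep misses it:
nodelist = "psc00[1-2]"
assert "psc001" not in nodelist               # what `squeue | grep psc001` sees
assert "psc001" in expand_hostlist(nodelist)  # what `squeue -w psc001` matches
```

This is the gap `squeue -w <nodename>` closes: the filter is applied to the expanded node set, not to the compressed display string.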
[root@ampt015 slurm]# scontrol -d show node ampt015 | grep -i gpu
AvailableFeatures=CPU_GEN:RME,CPU_SKU:7542,CPU_FRQ:2.10GHz,GPU_GEN:AMP,GPU_SKU:A100,GPU_MEM:40GB,GPU_CC:8.0
ActiveFeatures=CPU_GEN:RME,CPU_SKU:7542,CPU_FRQ:2.10GHz,GPU_GEN:AMP,GPU_SKU:A100,GPU_MEM:40GB,GPU_CC:8.0
Gres=gpu:a100:4
GresUsed=gpu:a100:4(IDX:0-3)
CfgTRES=cpu=128,mem=1029344M,billing=128,gres/gpu=4,gres/gpu:a100=4
AllocTRES=cpu=71,mem=900572M,gres/gpu=4,gres/gpu:a100=4
[renata@ampt015 ~]$ scontrol show job 1495647 | grep -i gpu
[renata@ampt015 ~]$
[root@ampt015 slurm]# scontrol -d show job 1508857 | grep gpu
TRES=cpu=64,mem=128G,node=1,billing=64,gres/gpu=1,gres/gpu:a100=1
JOB_GRES=gpu:a100:1
Nodes=ampt015 CPU_IDs=0-1,16-27,30-79,128-191 Mem=131072 GRES=gpu:a100:1(IDX:2)
TresPerJob=gres:gpu:1
[root@ampt015 slurm]# scontrol -d show job 1509619 | grep gpu
TRES=cpu=1,mem=1500M,node=1,billing=1,gres/gpu=3,gres/gpu:a100=3
JOB_GRES=gpu:a100:3
Nodes=ampt015 CPU_IDs=14-15 Mem=1500 GRES=gpu:a100:3(IDX:0-1,3)
Command=/sdf/home/y/ytl/fah/fah-gpu-e.slurm
TresPerJob=gres:gpu:3
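As a side note, the IDX strings in the output above can be cross-checked mechanically: on a healthy node, the per-job GRES index sets should exactly cover the node's GresUsed set with no overlap. A minimal sketch with a hypothetical parser (not part of Slurm), using the values shown above:

```python
import re

def parse_idx(gres):
    """Extract GPU indices from a GRES string such as 'gpu:a100:4(IDX:0-3)'
    or 'gpu:a100:3(IDX:0-1,3)'. Returns an empty set if no IDX is present."""
    m = re.search(r"IDX:([0-9,\-]+)", gres)
    if not m:
        return set()
    idx = set()
    for part in m.group(1).split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            idx.update(range(lo, hi + 1))
        else:
            idx.add(int(part))
    return idx

# Values taken from the scontrol output above
node_used = parse_idx("gpu:a100:4(IDX:0-3)")  # node view (GresUsed)
job_a = parse_idx("gpu:a100:1(IDX:2)")        # job 1508857
job_b = parse_idx("gpu:a100:3(IDX:0-1,3)")    # job 1509619

assert job_a | job_b == node_used  # jobs and node agree on this healthy node
assert not (job_a & job_b)         # no GPU is double-booked
```

On the bad nodes in this ticket, the same check would fail: AllocTRES claims GPUs that no running job's IDX set accounts for.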
[root@ampt015 slurmd]# cd /var/log/slurm/
[root@ampt015 slurm]# grep "safe_write (5 of 5) failed: Invalid argument" slurmd.log | tail
[2022-01-18T16:59:16.440] [1508852.extern] debug: cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-18T16:59:17.601] [1508852.batch] debug: cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-18T17:02:56.994] [1508857.extern] debug: cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-18T17:02:57.076] [1508857.batch] debug: cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-19T01:21:20.353] [1509470.extern] debug: cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-19T01:21:20.418] [1509470.batch] debug: cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-19T01:28:21.410] [1509550.extern] debug: cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-19T01:28:21.480] [1509550.batch] debug: cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-19T01:33:21.911] [1509619.extern] debug: cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[2022-01-19T01:33:21.981] [1509619.batch] debug: cgroup/v1: common_file_write_content: cgroup_common.c:400: common_file_write_content: safe_write (5 of 5) failed: Invalid argument
[root@ampt015 slurm]# grep "error: common_file_write_content" slurmd.log | tail
[2022-01-18T16:59:16.440] [1508852.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_1102874/job_1508852/devices.allow: Invalid argument
[2022-01-18T16:59:17.601] [1508852.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_1102874/job_1508852/devices.allow: Invalid argument
[2022-01-18T17:02:56.994] [1508857.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_17753/job_1508857/devices.allow: Invalid argument
[2022-01-18T17:02:57.076] [1508857.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_17753/job_1508857/devices.allow: Invalid argument
[2022-01-19T01:21:20.353] [1509470.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509470/devices.allow: Invalid argument
[2022-01-19T01:21:20.418] [1509470.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509470/devices.allow: Invalid argument
[2022-01-19T01:28:21.410] [1509550.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509550/devices.allow: Invalid argument
[2022-01-19T01:28:21.480] [1509550.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509550/devices.allow: Invalid argument
[2022-01-19T01:33:21.911] [1509619.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509619/devices.allow: Invalid argument
[2022-01-19T01:33:21.981] [1509619.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509619/devices.allow: Invalid argument
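The grep output above can be summarized per job to see how widespread the cgroup error is; a small sketch with a hypothetical helper (log lines shortened to the fields the regex uses):

```python
import re
from collections import Counter

def count_cgroup_errors(lines):
    """Count 'common_file_write_content' errors per job id across slurmd
    log lines, keyed by the job id in the [jobid.step] tag."""
    counts = Counter()
    for line in lines:
        m = re.search(r"\[(\d+)\.(extern|batch)\] error: common_file_write_content", line)
        if m:
            counts[m.group(1)] += 1
    return counts

log = [
    "[2022-01-19T01:33:21.911] [1509619.extern] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509619/devices.allow: Invalid argument",
    "[2022-01-19T01:33:21.981] [1509619.batch] error: common_file_write_content: unable to write 5 bytes to cgroup /sys/fs/cgroup/devices/slurm/uid_7017/job_1509619/devices.allow: Invalid argument",
]
assert count_cgroup_errors(log) == {"1509619": 2}  # both steps of the job hit it
```

As the output above shows, every job's extern and batch steps hit the error on this node, which points at a node-wide configuration cause rather than a per-job one.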
Renata

--- Comment from Renata Dart <renata> ---
Hi Michael, here is the sacct -l for each of the jobs: [renata@sdf-login03 conf-cache]$ sacct -l -j 1503517 JobID JobIDRaw JobName Partition MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AllocCPUS Elapsed State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqMem ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite ReqTRES AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutAve TRESUsageOutTot ------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -------------- ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- ------------------- ------------------- --------------- --------------- 1503517 1503517 fah shared 2 00:03:01 FAILED 127:0 Unknown Unknown Unknown 1500M 0 billing=1+ billing=1+ 1503517.bat+ 1503517.bat+ batch 221356K psc001 0 221356K 1880K psc001 0 1880K 0 psc001 0 0 00:00:00 psc001 0 00:00:00 1 2 04:53:33 FAILED 127:0 2.30G 0 0 0 0 1.14M psc001 0 1.14M 0.08M psc001 0 0.08M cpu=2,mem+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=psc001,fs/d+ 
fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 1503517.ext+ 1503517.ext+ extern 212184K psc001 0 212184K 892K psc001 0 892K 0 psc001 0 0 00:00:00 psc001 0 00:00:00 1 2 04:53:33 COMPLETED 0:0 1.20G 0 0 0 0 0.00M psc001 0 0.00M 0 psc001 0 0 billing=1+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=psc001,fs/d+ fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ sacct -l -j 1503488 JobID JobIDRaw JobName Partition MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AllocCPUS Elapsed State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqMem ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite ReqTRES AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutAve TRESUsageOutTot ------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -------------- ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- 
--------------- ------------------- ------------------- --------------- --------------- 1503488 1503488 fah shared 2 00:03:01 FAILED 127:0 Unknown Unknown Unknown 1500M 0 billing=1+ billing=1+ 1503488.bat+ 1503488.bat+ batch 221360K tur017 0 221360K 1880K tur017 0 1880K 0 tur017 0 0 00:00:00 tur017 0 00:00:00 1 2 05:50:41 FAILED 127:0 2.30G 0 0 0 0 1.17M tur017 0 1.17M 0.09M tur017 0 0.09M cpu=2,mem+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=tur017,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=tur017,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=tur017,fs/d+ fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 1503488.ext+ 1503488.ext+ extern 212184K tur017 0 212184K 896K tur017 0 896K 0 tur017 0 0 00:00:00 tur017 0 00:00:00 1 2 05:50:41 COMPLETED 0:0 2.00G 0 0 0 0 0.00M tur017 0 0.00M 0 tur017 0 0 billing=1+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=tur017,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=tur017,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=tur017,fs/d+ fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ sacct -l -j 1503581 JobID JobIDRaw JobName Partition MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AllocCPUS Elapsed State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqMem ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite ReqTRES AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutAve TRESUsageOutTot ------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- 
---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -------------- ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- ------------------- ------------------- --------------- --------------- 1503581 1503581 fah shared 2 00:05:01 FAILED 127:0 Unknown Unknown Unknown 1500M 0 billing=1+ billing=1+ 1503581.bat+ 1503581.bat+ batch 221356K psc001 0 221356K 1880K psc001 0 1880K 0 psc001 0 0 00:00:00 psc001 0 00:00:00 1 2 03:25:16 FAILED 127:0 2.21G 0 0 0 0 1.14M psc001 0 1.14M 0.08M psc001 0 0.08M cpu=2,mem+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=psc001,fs/d+ fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 1503581.ext+ 1503581.ext+ extern 108056K psc001 0 108056K 352K psc001 0 352K 0 psc001 0 0 00:00:00 psc001 0 00:00:00 1 2 03:25:16 COMPLETED 0:0 1.20G 0 0 0 0 0.00M psc001 0 0.00M 0 psc001 0 0 billing=1+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=psc001,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=psc001,fs/d+ fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ sacct -l -j 1503519 JobID JobIDRaw JobName Partition MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AllocCPUS Elapsed State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqMem 
ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite ReqTRES AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutAve TRESUsageOutTot ------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -------------- ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- ------------------- ------------------- --------------- --------------- 1503519 1503519 fah shared 1 00:05:01 FAILED 127:0 Unknown Unknown Unknown 1500M 0 billing=1+ billing=1+ 1503519.bat+ 1503519.bat+ batch 221356K ampt007 0 221356K 1876K ampt007 0 1876K 0 ampt007 0 0 00:00:00 ampt007 0 00:00:00 1 1 04:52:59 FAILED 127:0 1.50G 0 0 0 0 2.03M ampt007 0 2.03M 0.08M ampt007 0 0.08M cpu=1,mem+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=ampt007,energ+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=ampt007,energ+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=ampt007,fs/+ fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 1503519.ext+ 1503519.ext+ extern 212348K ampt007 0 212348K 1292K ampt007 0 1292K 0 ampt007 0 0 00:00:00 ampt007 0 00:00:00 1 1 04:52:59 COMPLETED 0:0 1.50G 0 0 0 0 0.00M ampt007 0 0.00M 0 ampt007 0 0 billing=1+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=ampt007,energ+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=ampt007,energ+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ 
energy=0,fs/di+ energy=ampt007,fs/+ fs/disk=0 energy=0,fs/di+ energy=0,fs/di+

Renata

--- Comment #13 from Michael Hinton <hinton@schedmd.com> ---

Do you have a cgroup_allowed_devices_file.conf file?

--- Comment #14 from Michael Hinton <hinton@schedmd.com> ---

(In reply to Michael Hinton from comment #13)
> Do you have a cgroup_allowed_devices_file.conf file?
If so, you should remove it. It has been obsolete for quite a while now, and in 21.08 it causes those (harmless) cgroup errors.

--- Comment #15 from Renata Dart <renata> ---

Hi Michael, yes we do:

[root@slurmctld1 slurm]# cat cgroup_allowed_devices_file.conf
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*

To remove it, is it enough to:

remove it from our slurmctld server
systemctl restart slurmctld
scontrol reconfigure

And was there some replacement we were supposed to do for the information in that file?

Renata

--- Comment from Michael Hinton <hinton@schedmd.com> ---

(In reply to Renata Dart from comment #15)
> To remove it, is it enough to:
>
> remove it from our slurmctld server
> systemctl restart slurmctld
> scontrol reconfigure

You should remove it from all slurmds as well. Removing it and restarting Slurm (ctld and slurmds) should be all that is required.

> And was there some replacement we were supposed to do for
> the information in that file?

Nothing else needs to be done. The replacement is specifying File= in gres.conf, which is already required. File= whitelists those GPU device files, and by default all other device files are not allowed for the job (I believe).

To be clear, this cgroup issue should be harmless, and it is unrelated to the actual issue described in this ticket. But it will be nice to clear up all those errors.

--- Comment #17 from Michael Hinton <hinton@schedmd.com> ---

Sorry, could you redo the sacct commands for the jobs, but add `-p`? That makes it so that nothing gets truncated.

--- Comment from Michael Hinton <hinton@schedmd.com> ---

(In reply to Michael Hinton from comment #17)
> Sorry, could you redo the sacct commands for the jobs, but add `-p`? That
> makes it so that nothing gets truncated.

Also, can you add the `-D` option? That should show us duplicates in case these jobs were requeued (and I think they were).

--- Comment from Renata Dart <renata> ---

Hi Michael, here they are:

[renata@sdf-login03 conf-cache]$ sacct -l -p -j 1503517
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot|
1503517|1503517|fah|shared||||||||||||||||||2|00:03:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1||||||||||||||
1503517.batch|1503517.batch|batch||221356K|psc001|0|221356K|1880K|psc001|0|1880K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|04:53:33|FAILED|127:0|2.30G|0|0|0||0|1.14M|psc001|0|1.14M|0.08M|psc001|0|0.08M||cpu=2,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|energy=0,fs/disk=87019|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=87019|energy=0,fs/disk=87019| 1503517.extern|1503517.extern|extern||212184K|psc001|0|212184K|892K|psc001|0|892K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|04:53:33|COMPLETED|0:0|1.20G|0|0|0||0|0.00M|psc001|0|0.00M|0|psc001|0|0||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|energy=0,fs/disk=0|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0| [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ sacct -l -p -j 1503488 
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot| 1503488|1503488|fah|shared||||||||||||||||||2|00:03:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|||||||||||||| 1503488.batch|1503488.batch|batch||221360K|tur017|0|221360K|1880K|tur017|0|1880K|0|tur017|0|0|00:00:00|tur017|0|00:00:00|1|2|05:50:41|FAILED|127:0|2.30G|0|0|0||0|1.17M|tur017|0|1.17M|0.09M|tur017|0|0.09M||cpu=2,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|energy=0,fs/disk=92034|energy=tur017,fs/disk=tur017|fs/disk=0|energy=0,fs/disk=92034|energy=0,fs/disk=92034| 
1503488.extern|1503488.extern|extern||212184K|tur017|0|212184K|896K|tur017|0|896K|0|tur017|0|0|00:00:00|tur017|0|00:00:00|1|2|05:50:41|COMPLETED|0:0|2.00G|0|0|0||0|0.00M|tur017|0|0.00M|0|tur017|0|0||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|energy=0,fs/disk=0|energy=tur017,fs/disk=tur017|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0| [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ sacct -l -p -j 1503581 JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot| 1503581|1503581|fah|shared||||||||||||||||||2|00:05:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|||||||||||||| 
1503581.batch|1503581.batch|batch||221356K|psc001|0|221356K|1880K|psc001|0|1880K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|03:25:16|FAILED|127:0|2.21G|0|0|0||0|1.14M|psc001|0|1.14M|0.08M|psc001|0|0.08M||cpu=2,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|energy=0,fs/disk=87036|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=87036|energy=0,fs/disk=87036| 1503581.extern|1503581.extern|extern||108056K|psc001|0|108056K|352K|psc001|0|352K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|03:25:16|COMPLETED|0:0|1.20G|0|0|0||0|0.00M|psc001|0|0.00M|0|psc001|0|0||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|energy=0,fs/disk=0|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0| [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ sacct -l -p -j 1503519 
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot| 1503519|1503519|fah|shared||||||||||||||||||1|00:05:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|||||||||||||| 1503519.batch|1503519.batch|batch||221356K|ampt007|0|221356K|1876K|ampt007|0|1876K|0|ampt007|0|0|00:00:00|ampt007|0|00:00:00|1|1|04:52:59|FAILED|127:0|1.50G|0|0|0||0|2.03M|ampt007|0|2.03M|0.08M|ampt007|0|0.08M||cpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|energy=0,fs/disk=88096|energy=ampt007,fs/disk=ampt007|fs/disk=0|energy=0,fs/disk=88096|energy=0,fs/disk=88096| 
1503519.extern|1503519.extern|extern||212348K|ampt007|0|212348K|1292K|ampt007|0|1292K|0|ampt007|0|0|00:00:00|ampt007|0|00:00:00|1|1|04:52:59|COMPLETED|0:0|1.50G|0|0|0||0|0.00M|ampt007|0|0.00M|0|ampt007|0|0||billing=1,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|energy=0,fs/disk=0|energy=ampt007,fs/disk=ampt007|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0| Renata On Wed, 19 Jan 2022, bugs@schedmd.com wrote: >https://bugs.schedmd.com/show_bug.cgi?id=13215 > >--- Comment #17 from Michael Hinton <hinton@schedmd.com> --- >Sorry, could you redo the sacct commands for the jobs, but add `-p`? That makes >it so that nothing gets truncated. > >-- >You are receiving this mail because: >You reported the bug. I'm sorry to have to ask you to do this a third time, but could you add the -D option and redo all the sacct calls? I think seeing the older job records before the requeue may help us understand how to replicate this issue. Thanks! Hi Michael, sorry for the delay, I went off to an appt and tried to let you know with an email from my cell phone, but I have a different userid there and so it was rejected since your site didn't recognize it. 
In any case here they are with the -D: [renata@sdf-login03 conf-cache]$ sacct -l -D -p -j 1503517 JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot| 1503517|1503517|fah|shared||||||||||||||||||2|00:03:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|||||||||||||| 1503517.batch|1503517.batch|batch||221356K|psc001|0|221356K|1880K|psc001|0|1880K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|04:53:33|FAILED|127:0|2.30G|0|0|0||0|1.14M|psc001|0|1.14M|0.08M|psc001|0|0.08M||cpu=2,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200276,mem=1880K,pages=0,vmem=221356K|energy=0,fs/disk=87019|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=87019|energy=0,fs/disk=87019| 
1503517.extern|1503517.extern|extern||212184K|psc001|0|212184K|892K|psc001|0|892K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|04:53:33|COMPLETED|0:0|1.20G|0|0|0||0|0.00M|psc001|0|0.00M|0|psc001|0|0||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=892K,pages=0,vmem=212184K|energy=0,fs/disk=0|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0| [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ sacct -l -D -p -j 1503488 JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot| 1503488|1503488|fah|shared||||||||||||||||||2|00:03:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|||||||||||||| 
1503488.batch|1503488.batch|batch||221360K|tur017|0|221360K|1880K|tur017|0|1880K|0|tur017|0|0|00:00:00|tur017|0|00:00:00|1|2|05:50:41|FAILED|127:0|2.30G|0|0|0||0|1.17M|tur017|0|1.17M|0.09M|tur017|0|0.09M||cpu=2,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1222499,mem=1880K,pages=0,vmem=221360K|energy=0,fs/disk=92034|energy=tur017,fs/disk=tur017|fs/disk=0|energy=0,fs/disk=92034|energy=0,fs/disk=92034| 1503488.extern|1503488.extern|extern||212184K|tur017|0|212184K|896K|tur017|0|896K|0|tur017|0|0|00:00:00|tur017|0|00:00:00|1|2|05:50:41|COMPLETED|0:0|2.00G|0|0|0||0|0.00M|tur017|0|0.00M|0|tur017|0|0||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|cpu=tur017,energy=tur017,fs/disk=tur017,mem=tur017,pages=tur017,vmem=tur017|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=896K,pages=0,vmem=212184K|energy=0,fs/disk=0|energy=tur017,fs/disk=tur017|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0| [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ sacct -l -D -p -j 1503581 
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot| 1503581|1503581|fah|shared||||||||||||||||||2|00:05:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|||||||||||||| 1503581.batch|1503581.batch|batch||221356K|psc001|0|221356K|1880K|psc001|0|1880K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|03:25:16|FAILED|127:0|2.21G|0|0|0||0|1.14M|psc001|0|1.14M|0.08M|psc001|0|0.08M||cpu=2,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=1200332,mem=1880K,pages=0,vmem=221356K|energy=0,fs/disk=87036|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=87036|energy=0,fs/disk=87036| 
1503581.extern|1503581.extern|extern||108056K|psc001|0|108056K|352K|psc001|0|352K|0|psc001|0|0|00:00:00|psc001|0|00:00:00|1|2|03:25:16|COMPLETED|0:0|1.20G|0|0|0||0|0.00M|psc001|0|0.00M|0|psc001|0|0||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|cpu=psc001,energy=psc001,fs/disk=psc001,mem=psc001,pages=psc001,vmem=psc001|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=352K,pages=0,vmem=108056K|energy=0,fs/disk=0|energy=psc001,fs/disk=psc001|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0| [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ [renata@sdf-login03 conf-cache]$ sacct -l -D -p -j 1503519 JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot| 1503519|1503519|fah|shared||||||||||||||||||1|00:05:01|FAILED|127:0||Unknown|Unknown|Unknown|1500M|0|||||||||billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|||||||||||||| 
1503519.batch|1503519.batch|batch||221356K|ampt007|0|221356K|1876K|ampt007|0|1876K|0|ampt007|0|0|00:00:00|ampt007|0|00:00:00|1|1|04:52:59|FAILED|127:0|1.50G|0|0|0||0|2.03M|ampt007|0|2.03M|0.08M|ampt007|0|0.08M||cpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2130711,mem=1876K,pages=0,vmem=221356K|energy=0,fs/disk=88096|energy=ampt007,fs/disk=ampt007|fs/disk=0|energy=0,fs/disk=88096|energy=0,fs/disk=88096| 1503519.extern|1503519.extern|extern||212348K|ampt007|0|212348K|1292K|ampt007|0|1292K|0|ampt007|0|0|00:00:00|ampt007|0|00:00:00|1|1|04:52:59|COMPLETED|0:0|1.50G|0|0|0||0|0.00M|ampt007|0|0.00M|0|ampt007|0|0||billing=1,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|cpu=ampt007,energy=ampt007,fs/disk=ampt007,mem=ampt007,pages=ampt007,vmem=ampt007|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=00:00:00,energy=0,fs/disk=2012,mem=1292K,pages=0,vmem=212348K|energy=0,fs/disk=0|energy=ampt007,fs/disk=ampt007|fs/disk=0|energy=0,fs/disk=0|energy=0,fs/disk=0| On Wed, 19 Jan 2022, bugs@schedmd.com wrote: Renata >https://bugs.schedmd.com/show_bug.cgi?id=13215 > >--- Comment #20 from Michael Hinton <hinton@schedmd.com> --- >I'm sorry to have to ask you to do this a third time, but could you add the -D >option and redo all the 
sacct calls? I think seeing the older job records >before the requeue may help us understand how to replicate this issue. Thanks!

Renata, could you run the following command for me and paste the output?

sacct -p -D -j 1495647,1503488,1503517,1503519,1503581 --format=jobid,jobidraw,jobname,partition,nodelist,nnodes,ntasks,reqcpus,alloccpus,ncpus,reqnodes,allocnodes,reqmem,reqtres,alloctres,state,exitcode,derivedexitcode,reason,elapsed,elapsedraw,submit,eligible,start,end,suspended,submitline

All of these jobs were on node ampt000, but that node stopped responding. So 1495647 got requeued successfully it seems, but 1503488,1503517,1503519,1503581 got completed and then rescheduled with backfill (but apparently not requeued...). I think that explains the weirdness of the differences in elapsed time between the job itself and the batch and extern steps. It's confusing to me how jobs 1503488,1503517,1503519,1503581 could get scheduled twice (once with regular scheduler, once with backfill) without apparently getting requeued, and maybe that error is causing some of the bad accounting. I'll keep looking into it. At any rate, it seems to be an edge case that happens when a node fails and jobs are moved/requeued away from the failed node to a new node. Thanks!
-Michael Hi Michael, here it is: [renata@rome0001 ~]$ sacct -p -D -j 1495647,1503488,1503517,1503519,1503581 --format=jobid,jobidraw,jobname,partition,nodelist,nnodes,ntasks,reqcpus,alloccpus,ncpus,reqnodes,allocnodes,reqmem,reqtres,alloctres,state,exitcode,derivedexitcode,reason,elapsed,elapsedraw,submit,eligible,start,end,suspended,submitline JobID|JobIDRaw|JobName|Partition|NodeList|NNodes|NTasks|ReqCPUS|AllocCPUS|NCPUS|ReqNodes|AllocNodes|ReqMem|ReqTRES|AllocTRES|State|ExitCode|DerivedExitCode|Reason|Elapsed|ElapsedRaw|Submit|Eligible|Start|End|Suspended|SubmitLine| 1495647|1495647|4k-2TomoRecon003|cryoem|ampt008|1||6|6|6|1|1|750G|billing=6,cpu=6,mem=750G,node=1|billing=6,cpu=6,mem=750G,node=1|NODE_FAIL|1:0|0:0|None|3-13:47:14|308834|2022-01-12T16:26:45|2022-01-12T16:26:45|2022-01-12T19:34:27|2022-01-16T09:21:41|00:00:00|sbatch batchScript.sbatch .| 1495647.batch|1495647.batch|batch||ampt008|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|CANCELLED||||3-13:47:14|308834|2022-01-12T19:34:27|2022-01-12T19:34:27|2022-01-12T19:34:27|2022-01-16T09:21:41|00:00:00|| 1495647.extern|1495647.extern|extern||ampt008|1|1|6|6|6|1|1|||billing=6,cpu=6,mem=750G,node=1|CANCELLED||||3-13:47:14|308834|2022-01-12T19:34:27|2022-01-12T19:34:27|2022-01-12T19:34:27|2022-01-16T09:21:41|00:00:00|| 1495647.0|1495647.0|e2tomogram.py||ampt008|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|CANCELLED||||3-13:47:13|308833|2022-01-12T19:34:28|2022-01-12T19:34:28|2022-01-12T19:34:28|2022-01-16T09:21:41|00:00:00|srun e2tomogram.py tiltseries/20211022-HSM-Lam4_Tilt7.hdf --tltstep=2.0 --npk=20 --tltkeep=0.9 --outsize=4k --niter=2,1,1,1 --bytile --notmp --pkkeep=0.9 --compressbits=-1 --clipz=-1 --bxsz=64 --filterto=0.35 --rmbeadthr=-1.0 --threads=48 --patchtrack=-1| 
1495647|1495647|4k-2TomoRecon003|cryoem|ampt000|1||6|6|6|1|1|750G|billing=6,cpu=6,mem=750G,node=1|billing=6,cpu=6,mem=750G,node=1|NODE_FAIL|0:0|0:0|BeginTime|09:49:59|35399|2022-01-16T09:21:41|2022-01-16T09:23:42|2022-01-16T09:23:50|2022-01-16T19:13:49|00:00:00|sbatch batchScript.sbatch .| 1495647.batch|1495647.batch|batch||ampt000|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|CANCELLED||||09:49:59|35399|2022-01-16T09:23:50|2022-01-16T09:23:50|2022-01-16T09:23:50|2022-01-16T19:13:49|00:00:00|| 1495647.extern|1495647.extern|extern||ampt000|1|1|6|6|6|1|1|||billing=6,cpu=6,mem=750G,node=1|CANCELLED||||09:49:59|35399|2022-01-16T09:23:50|2022-01-16T09:23:50|2022-01-16T09:23:50|2022-01-16T19:13:49|00:00:00|| 1495647.0|1495647.0|e2tomogram.py||ampt000|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|CANCELLED||||09:49:58|35398|2022-01-16T09:23:51|2022-01-16T09:23:51|2022-01-16T09:23:51|2022-01-16T19:13:49|00:00:00|srun e2tomogram.py tiltseries/20211022-HSM-Lam4_Tilt7.hdf --tltstep=2.0 --npk=20 --tltkeep=0.9 --outsize=4k --niter=2,1,1,1 --bytile --notmp --pkkeep=0.9 --compressbits=-1 --clipz=-1 --bxsz=64 --filterto=0.35 --rmbeadthr=-1.0 --threads=48 --patchtrack=-1| 1503488|1503488|fah|shared|tur017|1||1|2|2|1|1|1500M|billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|FAILED|127:0|0:0|None|00:03:01|181|2022-01-16T13:26:46|2022-01-16T13:26:46|2022-01-16T19:14:27|2022-01-16T19:17:28|00:00:00|sbatch --nice=100 --gpus=1 --time=2-2:00:00 /sdf/home/y/ytl/fah/fah-gpu-b.slurm| 1503488.batch|1503488.batch|batch||tur017|1|1|2|2|2|1|1|||cpu=2,mem=1500M,node=1|FAILED|127:0|||05:50:41|21041|2022-01-16T13:26:47|2022-01-16T13:26:47|2022-01-16T13:26:47|2022-01-16T19:17:28|00:00:00|| 1503488.extern|1503488.extern|extern||tur017|1|1|2|2|2|1|1|||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|COMPLETED|0:0|||05:50:41|21041|2022-01-16T13:26:47|2022-01-16T13:26:47|2022-01-16T13:26:47|2022-01-16T19:17:28|00:00:00|| 
1503517|1503517|fah|shared|psc001|1||1|2|2|1|1|1500M|billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|FAILED|127:0|0:0|None|00:03:01|181|2022-01-16T14:23:55|2022-01-16T14:23:55|2022-01-16T19:14:27|2022-01-16T19:17:28|00:00:00|sbatch --nice=100 --gpus=1 --time=2-2:00:00 /sdf/home/y/ytl/fah/fah-gpu-b.slurm| 1503517.batch|1503517.batch|batch||psc001|1|1|2|2|2|1|1|||cpu=2,mem=1500M,node=1|FAILED|127:0|||04:53:33|17613|2022-01-16T14:23:55|2022-01-16T14:23:55|2022-01-16T14:23:55|2022-01-16T19:17:28|00:00:00|| 1503517.extern|1503517.extern|extern||psc001|1|1|2|2|2|1|1|||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|COMPLETED|0:0|||04:53:33|17613|2022-01-16T14:23:55|2022-01-16T14:23:55|2022-01-16T14:23:55|2022-01-16T19:17:28|00:00:00|| 1503519|1503519|fah|shared|ampt007|1||1|1|1|1|1|1500M|billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|FAILED|127:0|0:0|None|00:05:01|301|2022-01-16T14:26:26|2022-01-16T14:26:26|2022-01-16T19:14:27|2022-01-16T19:19:28|00:00:00|sbatch --nice=100 --gpus=1 --time=2-2:00:00 /sdf/home/y/ytl/fah/fah-gpu-a.slurm| 1503519.batch|1503519.batch|batch||ampt007|1|1|1|1|1|1|1|||cpu=1,mem=1500M,node=1|FAILED|127:0|||04:52:59|17579|2022-01-16T14:26:29|2022-01-16T14:26:29|2022-01-16T14:26:29|2022-01-16T19:19:28|00:00:00|| 1503519.extern|1503519.extern|extern||ampt007|1|1|1|1|1|1|1|||billing=1,cpu=1,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|COMPLETED|0:0|||04:52:59|17579|2022-01-16T14:26:29|2022-01-16T14:26:29|2022-01-16T14:26:29|2022-01-16T19:19:28|00:00:00|| 1503581|1503581|fah|shared|psc001|1||1|2|2|1|1|1500M|billing=1,cpu=1,gres/gpu=1,mem=1500M,node=1|billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|FAILED|127:0|0:0|None|00:05:01|301|2022-01-16T15:54:11|2022-01-16T15:54:11|2022-01-16T19:14:27|2022-01-16T19:19:28|00:00:00|sbatch --nice=100 --gpus=1 --time=2-2:00:00 /sdf/home/y/ytl/fah/fah-gpu-a.slurm| 
1503581.batch|1503581.batch|batch||psc001|1|1|2|2|2|1|1|||cpu=2,mem=1500M,node=1|FAILED|127:0|||03:25:16|12316|2022-01-16T15:54:12|2022-01-16T15:54:12|2022-01-16T15:54:12|2022-01-16T19:19:28|00:00:00|| 1503581.extern|1503581.extern|extern||psc001|1|1|2|2|2|1|1|||billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1|COMPLETED|0:0|||03:25:16|12316|2022-01-16T15:54:12|2022-01-16T15:54:12|2022-01-16T15:54:12|2022-01-16T19:19:28|00:00:00|| 1495647|1495647|4k-2TomoRecon003|cryoem|ampt001|1||6|6|6|1|1|750G|billing=6,cpu=6,mem=750G,node=1|billing=6,cpu=6,mem=750G,node=1|NODE_FAIL|1:0|0:0|BeginTime|04:39:31|16771|2022-01-16T19:13:49|2022-01-16T19:15:50|2022-01-16T19:16:03|2022-01-16T23:55:34|00:00:00|sbatch batchScript.sbatch .| 1495647.batch|1495647.batch|batch||ampt001|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|CANCELLED||||04:39:31|16771|2022-01-16T19:16:03|2022-01-16T19:16:03|2022-01-16T19:16:03|2022-01-16T23:55:34|00:00:00|| 1495647.extern|1495647.extern|extern||ampt001|1|1|6|6|6|1|1|||billing=6,cpu=6,mem=750G,node=1|CANCELLED||||04:39:31|16771|2022-01-16T19:16:03|2022-01-16T19:16:03|2022-01-16T19:16:03|2022-01-16T23:55:34|00:00:00|| 1495647.0|1495647.0|e2tomogram.py||ampt001|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|CANCELLED||||04:39:31|16771|2022-01-16T19:16:03|2022-01-16T19:16:03|2022-01-16T19:16:03|2022-01-16T23:55:34|00:00:00|srun e2tomogram.py tiltseries/20211022-HSM-Lam4_Tilt7.hdf --tltstep=2.0 --npk=20 --tltkeep=0.9 --outsize=4k --niter=2,1,1,1 --bytile --notmp --pkkeep=0.9 --compressbits=-1 --clipz=-1 --bxsz=64 --filterto=0.35 --rmbeadthr=-1.0 --threads=48 --patchtrack=-1| 1495647|1495647|4k-2TomoRecon003|cryoem|ampt015|1||6|6|6|1|1|750G|billing=6,cpu=6,mem=750G,node=1|billing=6,cpu=6,mem=750G,node=1|RUNNING|0:0|0:0|BeginTime|2-16:41:16|232876|2022-01-16T23:55:34|2022-01-16T23:57:35|2022-01-16T23:57:54|Unknown|00:00:00|sbatch batchScript.sbatch .| 
1495647.batch|1495647.batch|batch||ampt015|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|RUNNING|0:0|||2-16:41:16|232876|2022-01-16T23:57:54|2022-01-16T23:57:54|2022-01-16T23:57:54|Unknown|00:00:00|| 1495647.extern|1495647.extern|extern||ampt015|1|1|6|6|6|1|1|||billing=6,cpu=6,mem=750G,node=1|RUNNING|0:0|||2-16:41:16|232876|2022-01-16T23:57:54|2022-01-16T23:57:54|2022-01-16T23:57:54|Unknown|00:00:00|| 1495647.0|1495647.0|e2tomogram.py||ampt015|1|1|6|6|6|1|1|||cpu=6,mem=750G,node=1|RUNNING|0:0|||2-16:41:16|232876|2022-01-16T23:57:54|2022-01-16T23:57:54|2022-01-16T23:57:54|Unknown|00:00:00|srun e2tomogram.py tiltseries/20211022-HSM-Lam4_Tilt7.hdf --tltstep=2.0 --npk=20 --tltkeep=0.9 --outsize=4k --niter=2,1,1,1 --bytile --notmp --pkkeep=0.9 --compressbits=-1 --clipz=-1 --bxsz=64 --filterto=0.35 --rmbeadthr=-1.0 --threads=48 --patchtrack=-1| -Renata On Thu, 20 Jan 2022, bugs@schedmd.com wrote: >https://bugs.schedmd.com/show_bug.cgi?id=13215 > >--- Comment #22 from Michael Hinton <hinton@schedmd.com> --- >Renata, could you run the following command for me and paste the output? > >sacct -p -D -j 1495647,1503488,1503517,1503519,1503581 >--format=jobid,jobidraw,jobname,partition,nodelist,nnodes,ntasks,reqcpus,alloccpus,ncpus,reqnodes,allocnodes,reqmem,reqtres,alloctres,state,exitcode,derivedexitcode,reason,elapsed,elapsedraw,submit,eligible,start,end,suspended,submitline > >All of these jobs were on node ampt000, but that node stopped responding. So >1495647 got requeued successfully it seems, but 1503488,1503517,1503519,1503581 >got completed and then rescheduled with backfill (but apparently not >requeued...). I think that explains the weirdness of the differences in elapsed >time between the job itself and the batch and extern steps. > >It's confusing to me how jobs 1503488,1503517,1503519,1503581 could get >scheduled twice (once with regular scheduler, once with backfill) without >apparently getting requeued, and maybe that error is causing some of the bad >accounting. 
I'll keep looking into it. At any rate, it seems to be an edge case >that happens when a node fails and jobs are moved/requeued away from the failed >node to a new node. > >Thanks! >-Michael

How often do you see this situation occur? Is it only when a node fails? Have you only started seeing this issue after upgrading to 21.08?

Hi Michael, the mismatch of GPUs in show node vs show job was happening before we upgraded, and we thought it would be fixed when we went to 21.08 in December. Tuesday (after your suggestion) I did restart slurmctld and slurmd on the 3 nodes I reported at the start of the ticket:

tur017 has 10 GPUs
ampt007 has 4 GPUs
psc001 has 10 GPUs

but it didn't seem to make a difference for the currently running jobs. Today, however, I removed cgroup_allowed_devices_file.conf, restarted slurmctld, ran scontrol reconfigure, and restarted all the slurmds, and I double-checked the jobs currently running on those 3 nodes and all looks well. I also checked a few other GPU nodes and they also agreed on the GPUs allocated in show node and show job. So maybe this was an odd case as you have surmised.

Renata

Could you give me the slurmd log of node ampt000 for 2022-01-16? Thanks!
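The double-check described here (comparing GPUs in `scontrol show node` against the jobs on that node) can be scripted. Below is a rough, editorial sketch of the parsing logic; the node name and sample TRES values are made up, and the here-documents stand in for the live commands noted in the comments (including `squeue -w`, which expands host ranges like psc00[1-2] and so avoids the `squeue | grep <nodename>` pitfall mentioned earlier in this ticket).

```shell
# Sketch: does the GPU count in a node's AllocTRES match the sum of GPU
# GRES across the jobs running on it? The here-documents hold sample
# (made-up) output so the parsing logic can run anywhere; on a live
# cluster they would be replaced by something like:
#   scontrol show node "$node"
#   squeue -w "$node" -t RUNNING -h -O tres-alloc

node_info=$(cat <<'EOF'
NodeName=psc001 Arch=x86_64 CoresPerSocket=16
   CfgTRES=cpu=64,mem=500000M,billing=64,gres/gpu=10
   AllocTRES=cpu=4,mem=3000M,gres/gpu=3
EOF
)

job_tres=$(cat <<'EOF'
billing=1,cpu=2,gres/gpu:a100=1,gres/gpu=1,mem=1500M,node=1
billing=2,cpu=4,gres/gpu:a100=2,gres/gpu=2,mem=3000M,node=1
EOF
)

# GPUs the controller thinks are allocated on the node:
node_gpus=$(printf '%s\n' "$node_info" |
    sed -n 's/.*AllocTRES=.*gres\/gpu=\([0-9][0-9]*\).*/\1/p')

# GPUs summed across the jobs' AllocTRES strings (matching the untyped
# gres/gpu=N entry, not the typed gres/gpu:a100=N one):
job_gpus=$(printf '%s\n' "$job_tres" |
    grep -o 'gres\/gpu=[0-9]*' |
    awk -F= '{ sum += $2 } END { print sum }')

if [ "$node_gpus" = "$job_gpus" ]; then
    echo "OK: node and jobs agree on $node_gpus GPU(s)"
else
    echo "MISMATCH: node says $node_gpus, jobs sum to $job_gpus"
fi
```

With the sample data above, both counts come out to 3 and the script reports agreement; on a node exhibiting the mismatch in this ticket, the two numbers would differ.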
Could you also run this command and paste the output?:

sacct -p -D -j 1495647,1503488,1503517,1503519,1503581 --format=JobID,JobName,DBIndex,Start,timelimit,TimelimitRaw,Flags,TotalCPU,MinCPU,MinCPUNode,MinCPUTask

Hi Michael, here is the output:

[renata@sdf-login02 conf-cache]$ sacct -p -D -j 1495647,1503488,1503517,1503519,1503581 --format=JobID,JobName,DBIndex,Start,timelimit,TimelimitRaw,Flags,TotalCPU,MinCPU,MinCPUNode,MinCPUTask
JobID|JobName|DBIndex|Start|Timelimit|TimelimitRaw|Flags|TotalCPU|MinCPU|MinCPUNode|MinCPUTask|
1495647|4k-2TomoRecon003|2044544|2022-01-12T19:34:27|4-02:00:00|5880|SchedMain|00:00:00||||
1495647.batch|batch|2044544|2022-01-12T19:34:27||||00:00:00||||
1495647.extern|extern|2044544|2022-01-12T19:34:27||||00:00:00||||
1495647.0|e2tomogram.py|2044544|2022-01-12T19:34:28||||00:00:00||||
1495647|4k-2TomoRecon003|2053945|2022-01-16T09:23:50|4-02:00:00|5880|SchedMain|00:00:00||||
1495647.batch|batch|2053945|2022-01-16T09:23:50||||00:00:00||||
1495647.extern|extern|2053945|2022-01-16T09:23:50||||00:00:00||||
1495647.0|e2tomogram.py|2053945|2022-01-16T09:23:51||||00:00:00||||
1503488|fah|2054315|2022-01-16T19:14:27|2-02:00:00|3000|SchedBackfill|00:00.360||||
1503488.batch|batch|2054315|2022-01-16T13:26:47||||00:00.358|00:00:00|tur017|0|
1503488.extern|extern|2054315|2022-01-16T13:26:47||||00:00.001|00:00:00|tur017|0|
1503517|fah|2054349|2022-01-16T19:14:27|2-02:00:00|3000|SchedBackfill|00:00.364||||
1503517.batch|batch|2054349|2022-01-16T14:23:55||||00:00.362|00:00:00|psc001|0|
1503517.extern|extern|2054349|2022-01-16T14:23:55||||00:00.001|00:00:00|psc001|0|
1503519|fah|2054353|2022-01-16T19:14:27|2-02:00:00|3000|SchedBackfill|00:00.356||||
1503519.batch|batch|2054353|2022-01-16T14:26:29||||00:00.355|00:00:00|ampt007|0|
1503519.extern|extern|2054353|2022-01-16T14:26:29||||00:00.001|00:00:00|ampt007|0|
1503581|fah|2054444|2022-01-16T19:14:27|2-02:00:00|3000|SchedBackfill|00:00.396||||
1503581.batch|batch|2054444|2022-01-16T15:54:12||||00:00.395|00:00:00|psc001|0|
1503581.extern|extern|2054444|2022-01-16T15:54:12||||00:00.001|00:00:00|psc001|0|
1495647|4k-2TomoRecon003|2054955|2022-01-16T19:16:03|4-02:00:00|5880|SchedMain|00:00:00||||
1495647.batch|batch|2054955|2022-01-16T19:16:03||||00:00:00||||
1495647.extern|extern|2054955|2022-01-16T19:16:03||||00:00:00||||
1495647.0|e2tomogram.py|2054955|2022-01-16T19:16:03||||00:00:00||||
1495647|4k-2TomoRecon003|2055317|2022-01-16T23:57:54|4-02:00:00|5880|SchedMain|00:00:00||||
1495647.batch|batch|2055317|2022-01-16T23:57:54||||00:00:00||||
1495647.extern|extern|2055317|2022-01-16T23:57:54||||00:00:00||||
1495647.0|e2tomogram.py|2055317|2022-01-16T23:57:54||||00:00:00||||

Renata

On Thu, 20 Jan 2022, bugs@schedmd.com wrote:

>https://bugs.schedmd.com/show_bug.cgi?id=13215
>
>--- Comment #27 from Michael Hinton <hinton@schedmd.com> ---
>Could you also run this command and paste the output?:
>
>sacct -p -D -j 1495647,1503488,1503517,1503519,1503581
>--format=JobID,JobName,DBIndex,Start,timelimit,TimelimitRaw,Flags,TotalCPU,MinCPU,MinCPUNode,MinCPUTask
>
>--
>You are receiving this mail because:
>You reported the bug.

Hi Michael, the slurmd logs for ampt000 transition on 2022-01-16. The current one starts [2022-01-16T08:40:33.834]. Will that work, or do you need the earlier one?

Renata

On Thu, 20 Jan 2022, bugs@schedmd.com wrote:

>https://bugs.schedmd.com/show_bug.cgi?id=13215
>
>--- Comment #26 from Michael Hinton <hinton@schedmd.com> ---
>Could you give me the slurmd log of node ampt000 for 2022-01-16? Thanks!
>
>--
>You are receiving this mail because:
>You reported the bug.

(In reply to Renata Dart from comment #29)
> Hi Michael, the slurmd logs for ampt000 transition on 2022-01-16.
> The current one starts [2022-01-16T08:40:33.834]. Will that
> work, or do you need the earlier one?

That should work. Thanks!

Created attachment 23075 [details]
ampt000.slurmd.log

Hi Michael, the log is attached.
Renata

On Thu, 20 Jan 2022, bugs@schedmd.com wrote:

>https://bugs.schedmd.com/show_bug.cgi?id=13215
>
>--- Comment #30 from Michael Hinton <hinton@schedmd.com> ---
>(In reply to Renata Dart from comment #29)
>> Hi Michael, the slurmd logs for ampt000 transition on 2022-01-16.
>> The current one starts [2022-01-16T08:40:33.834]. Will that
>> work, or do you need the earlier one?
>That should work. Thanks!
>
>--
>You are receiving this mail because:
>You reported the bug.

Renata,

How often are you seeing this issue, and how disruptive is it?

Thanks!
-Michael

Hi Renata,

Are you still seeing issues?

Are you able to reproduce these issues on a test cluster? A reliable reproducer will help us the most, as it is difficult to understand how the cluster is getting into this state from logs alone. I suspect you are hitting similar issues as in bug 13600.

Thanks!
-Michael

Hi Michael, I don't think we have seen this again. And I just ran a clush against all of our hosts' slurmd logs and no longer see the dealloc error message.

I did open a ticket for sort of the opposite problem, where a user requested only 1 gpu and 32 cpu through a bash slurm job and got the entire node of 4 gpus and 128 cpu. But that probably isn't related.

However, I am seeing this error message in our slurmd logs:

[2022-03-16T13:15:27.886] [2107856.batch] debug: task/cgroup: task_cgroup_devices_init: unable to open /var/spool/slurmd/conf-cache/cgroup_allowed_devices_file.conf: No such file or directory

Based on your recommendation earlier in this ticket, I "removed" our cgroup_allowed_devices_file.conf by renaming it to cgroup_allowed_devices_file.conf.bak. I did this pretty close to the time that you suggested it. Since then I have done at least a couple of slurmctld and slurmd restarts.

Renata

On Wed, 16 Mar 2022, bugs@schedmd.com wrote:

>https://bugs.schedmd.com/show_bug.cgi?id=13215
>
>--- Comment #33 from Michael Hinton <hinton@schedmd.com> ---
>Hi Renata,
>
>Are you still seeing issues?
>
>Are you able to reproduce these issues on a test cluster? A reliable
>reproducer will help us the most, as it is difficult to understand how the
>cluster is getting into this state from logs alone. I suspect you are hitting
>similar issues as in bug 13600.
>
>Thanks!
>-Michael
>
>--
>You are receiving this mail because:
>You reported the bug.

(In reply to Renata Dart from comment #34)
> [2022-03-16T13:15:27.886] [2107856.batch] debug: task/cgroup:
> task_cgroup_devices_init: unable to open
> /var/spool/slurmd/conf-cache/cgroup_allowed_devices_file.conf: No such file
> or directory

That should be fine. We left in the code that checks for cgroup_allowed_devices_file.conf, and apparently there is still a debug message that prints when it's not found.

I'm going to go ahead and mark this as resolved, since the issue has gone away. Feel free to reopen this if you see it again. And if you see it again, try to create a reproducer, as that will be the best way for us to figure out a solution.

We are working on a similar issue in bug 13600 that may be related as well. I'll keep you posted if anything comes of that.

Thanks,
-Michael

Thanks Michael, that sounds good. Thanks for all your help.

Renata

On Wed, 16 Mar 2022, bugs@schedmd.com wrote:

>https://bugs.schedmd.com/show_bug.cgi?id=13215
>
>Michael Hinton <hinton@schedmd.com> changed:
>
>           What    |Removed     |Added
>----------------------------------------------------------------------------
>     Resolution|---         |CANNOTREPRODUCE
>         Status|OPEN        |RESOLVED
>
>--- Comment #35 from Michael Hinton <hinton@schedmd.com> ---
>(In reply to Renata Dart from comment #34)
>> [2022-03-16T13:15:27.886] [2107856.batch] debug: task/cgroup:
>> task_cgroup_devices_init: unable to open
>> /var/spool/slurmd/conf-cache/cgroup_allowed_devices_file.conf: No such file
>> or directory
>That should be fine. We left in the code that checks for
>cgroup_allowed_devices_file.conf, and apparently there is still a debug
>message that prints when it's not found.
>
>I'm going to go ahead and mark this as resolved, since the issue has gone
>away. Feel free to reopen this if you see it again. And if you see it again,
>try to create a reproducer, as that will be the best way for us to figure out
>a solution.
>
>We are working on a similar issue in bug 13600 that may be related as well.
>I'll keep you posted if anything comes of that.
>
>Thanks,
>-Michael
>
>--
>You are receiving this mail because:
>You reported the bug.

(In reply to Renata Dart from comment #34)
> [2022-03-16T13:15:27.886] [2107856.batch] debug: task/cgroup:
> task_cgroup_devices_init: unable to open
> /var/spool/slurmd/conf-cache/cgroup_allowed_devices_file.conf: No such file
> or directory

This debug message will be removed in Slurm 22.05, by the way.

Hi Michael, thanks for the update; it will be helpful to have it removed.

Renata

On Wed, 23 Mar 2022, bugs@schedmd.com wrote:

>https://bugs.schedmd.com/show_bug.cgi?id=13215
>
>--- Comment #37 from Michael Hinton <hinton@schedmd.com> ---
>(In reply to Renata Dart from comment #34)
>> [2022-03-16T13:15:27.886] [2107856.batch] debug: task/cgroup:
>> task_cgroup_devices_init: unable to open
>> /var/spool/slurmd/conf-cache/cgroup_allowed_devices_file.conf: No such file
>> or directory
>This debug message will be removed in Slurm 22.05, by the way.
>
>--
>You are receiving this mail because:
>You reported the bug.
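For reference, the clush check mentioned above (scanning every host's slurmd logs for the dealloc error) can be reduced to a simple pattern match on the error quoted at the start of this ticket. A minimal sketch; the helper name and regex are mine, and the sample lines are the ones quoted in this ticket:

```python
import re

# Matches errors of the form seen in slurmctld.log earlier in this ticket:
#   error: gres/gpu: job 1503517 dealloc of node psc001 bad node_offset 0 count is 0
DEALLOC_RE = re.compile(
    r"error: gres/gpu: job (\d+) dealloc of node (\S+) bad node_offset")

def find_bad_deallocs(log_text):
    """Return (job_id, node) pairs for every bad-dealloc error in a log."""
    return DEALLOC_RE.findall(log_text)

log = (
    "[2022-01-16T19:17:28.326] error: gres/gpu: job 1503517 dealloc of node psc001 bad node_offset 0 count is 0\n"
    "[2022-01-16T19:17:28.361] error: gres/gpu: job 1503488 dealloc of node tur017 bad node_offset 0 count is 0\n"
)
print(find_bad_deallocs(log))  # -> [('1503517', 'psc001'), ('1503488', 'tur017')]
```

An absence of matches across all hosts, as in the clush check above, is a reasonable signal that the accounting mismatch is no longer occurring.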
Hi SchedMD, we are seeing a mismatch between the gpus shown in use by AllocTRES and those requested by jobs. The jobs show 1 fewer gpu in use than the total, whereas show node says the full number are in use. At the moment 3 gpu nodes, tur017, ampt007, and psc001, are showing this issue:

tur017 has 10 gpu
ampt007 has 4 gpu
psc001 has 10 gpu

Following is a show node and a show job for each running job to show the mismatch:

[renata@sdf-login03 licensing]$ scontrol show node tur017
NodeName=tur017 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=2 CPUTot=48 CPULoad=0.23
   AvailableFeatures=CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
   ActiveFeatures=CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
   Gres=gpu:geforce_rtx_2080_ti:10
   NodeAddr=tur017 NodeHostName=tur017 Version=21.08.4
   OS=Linux 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021
   RealMemory=191552 AllocMem=1500 FreeMem=162798 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=56117 Owner=N/A MCS_label=N/A
   Partitions=shared,cryoem,lcls,neutrino,ml,atlas
   BootTime=2021-12-08T08:36:27 SlurmdStartTime=2021-12-08T15:48:06
   LastBusyTime=2022-01-17T12:35:26
   CfgTRES=cpu=48,mem=191552M,billing=48,gres/gpu=10,gres/gpu:geforce_rtx_2080_ti=10
   AllocTRES=cpu=2,mem=1500M,gres/gpu=10,gres/gpu:geforce_rtx_2080_ti=10
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[renata@sdf-login03 licensing]$ squeue | grep tur017
1504819 shared fah ytl R 21:33:01 1 tur017

[renata@sdf-login03 licensing]$ scontrol show job 1504819 | grep gpu
   TRES=cpu=2,mem=1500M,node=1,billing=2,gres/gpu=9,gres/gpu:geforce_rtx_2080_ti=9
   Command=/sdf/home/y/ytl/fah/fah-gpu-e.slurm
   TresPerJob=gres:gpu:9

----------------------------------------------------------

[renata@sdf-login03 licensing]$ scontrol show node ampt007
NodeName=ampt007 Arch=x86_64 CoresPerSocket=64
   CPUAlloc=55 CPUTot=128 CPULoad=54.07
   AvailableFeatures=CPU_GEN:RME,CPU_SKU:7542,CPU_FRQ:2.10GHz,GPU_GEN:AMP,GPU_SKU:A100,GPU_MEM:40GB,GPU_CC:8.0
   ActiveFeatures=CPU_GEN:RME,CPU_SKU:7542,CPU_FRQ:2.10GHz,GPU_GEN:AMP,GPU_SKU:A100,GPU_MEM:40GB,GPU_CC:8.0
   Gres=gpu:a100:4
   NodeAddr=ampt007 NodeHostName=ampt007 Version=21.08.4
   OS=Linux 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021
   RealMemory=1029344 AllocMem=554460 FreeMem=586033 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=227207 Owner=N/A MCS_label=N/A
   Partitions=shared,cryoem,neutrino,ml,ard,atlas
   BootTime=2022-01-16T20:49:40 SlurmdStartTime=2022-01-16T20:59:20
   LastBusyTime=2022-01-16T20:59:20
   CfgTRES=cpu=128,mem=1029344M,billing=128,gres/gpu=4,gres/gpu:a100=4
   AllocTRES=cpu=55,mem=554460M,gres/gpu=4,gres/gpu:a100=4
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[renata@sdf-login03 licensing]$ squeue | grep ampt007
1506847 shared fah ytl R 1:05:15 1 ampt007
1506238 shared Imp_Samp milesm R 7:19:13 1 ampt007

[renata@sdf-login03 licensing]$ scontrol show job 1506847 | grep gpu
   TRES=cpu=1,mem=1500M,node=1,billing=1,gres/gpu=3,gres/gpu:a100=3
   Command=/sdf/home/y/ytl/fah/fah-gpu-e.slurm
   TresPerJob=gres:gpu:3
[renata@sdf-login03 licensing]$ scontrol show job 1506238 | grep gpu
[renata@sdf-login03 licensing]$

---------------------------------------------------------

[renata@sdf-login03 licensing]$ scontrol show node psc001
NodeName=psc001 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=10 CPUTot=48 CPULoad=0.87
   AvailableFeatures=CPU_GEN:HSW,CPU_SKU:E5-2670v3,CPU_FRQ:2.30GHz,GPU_GEN:PSC,GPU_SKU:GTX1080TI,GPU_MEM:11GB,GPU_CC:6.1
   ActiveFeatures=CPU_GEN:HSW,CPU_SKU:E5-2670v3,CPU_FRQ:2.30GHz,GPU_GEN:PSC,GPU_SKU:GTX1080TI,GPU_MEM:11GB,GPU_CC:6.1
   Gres=gpu:geforce_gtx_1080_ti:10
   NodeAddr=psc001 NodeHostName=psc001 Version=21.08.4
   OS=Linux 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021
   RealMemory=257336 AllocMem=14192 FreeMem=217910 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=58319 Owner=N/A MCS_label=N/A
   Partitions=shared,cryoem,atlas
   BootTime=2021-12-10T15:22:12 SlurmdStartTime=2021-12-10T15:23:20
   LastBusyTime=2022-01-16T07:51:09
   CfgTRES=cpu=48,mem=257336M,billing=48,gres/gpu=10,gres/gpu:geforce_gtx_1080_ti=10
   AllocTRES=cpu=10,mem=14192M,gres/gpu=10,gres/gpu:geforce_gtx_1080_ti=10
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[renata@sdf-login03 licensing]$ squeue | grep psc001
1506916 shared arc_pilo osgatlas R 4:44 1 psc001
1503765 shared fah ytl R 1-14:58:05 1 psc001
1504788 shared fah ytl R 22:03:29 1 psc001
1503921 shared fah ytl R 1-13:03:02 1 psc001
1504022 shared fah ytl R 1-13:01:21 1 psc001

[renata@sdf-login03 licensing]$ scontrol show job 1506916 | grep gpu
   TRES=cpu=2,mem=8G,node=1,billing=2,gres/gpu=1,gres/gpu:geforce_gtx_1080_ti=1
   TresPerJob=gres:gpu:1
[renata@sdf-login03 licensing]$ scontrol show job 1503765 | grep gpu
   TRES=cpu=2,mem=1500M,node=1,billing=2,gres/gpu=2,gres/gpu:geforce_gtx_1080_ti=2
   Command=/sdf/home/y/ytl/fah/fah-gpu-e.slurm
   TresPerJob=gres:gpu:2
[renata@sdf-login03 licensing]$ scontrol show job 1504788 | grep gpu
   TRES=cpu=2,mem=1500M,node=1,billing=2,gres/gpu=1,gres/gpu:geforce_gtx_1080_ti=1
   Command=/sdf/home/y/ytl/fah/fah-gpu-a.slurm
   TresPerJob=gres:gpu:1
[renata@sdf-login03 licensing]$ scontrol show job 1503921 | grep gpu
   TRES=cpu=2,mem=1500M,node=1,billing=1,gres/gpu=3,gres/gpu:geforce_gtx_1080_ti=3
   Command=/sdf/home/y/ytl/fah/fah-gpu-b.slurm
   TresPerJob=gres:gpu:3
[renata@sdf-login03 licensing]$ scontrol show job 1504022 | grep gpu
   TRES=cpu=2,mem=1500M,node=1,billing=2,gres/gpu=1,gres/gpu:geforce_gtx_1080_ti=1
   Command=/sdf/home/y/ytl/fah/fah-gpu-b.slurm
   TresPerJob=gres:gpu:1

Is this a problem with our configuration?

Thanks,
Renata
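The manual comparison above (node AllocTRES vs. the sum of the jobs' allocations) can be scripted. A minimal sketch using the tur017 values from this report; the helper name is mine, and in practice the TRES strings would come from `scontrol show node`/`scontrol show job` output:

```python
import re

def gpu_count(tres):
    """Extract the generic gres/gpu count from a TRES or AllocTRES string."""
    m = re.search(r"gres/gpu=(\d+)", tres)
    return int(m.group(1)) if m else 0

# AllocTRES from `scontrol show node tur017` above:
node_alloc = "cpu=2,mem=1500M,gres/gpu=10,gres/gpu:geforce_rtx_2080_ti=10"
# TRES from each job running on tur017 (only job 1504819 here):
job_tres = [
    "cpu=2,mem=1500M,node=1,billing=2,gres/gpu=9,gres/gpu:geforce_rtx_2080_ti=9",
]

node_gpus = gpu_count(node_alloc)
job_gpus = sum(gpu_count(t) for t in job_tres)
print(node_gpus, job_gpus, node_gpus == job_gpus)  # -> 10 9 False
```

A `False` result reproduces the off-by-one mismatch reported here: the node claims 10 GPUs allocated while the only running job accounts for 9. Note that gathering the job list with `scontrol show node` job enumeration (or `squeue -w <node>`) is safer than `squeue | grep <node>`, for the hostlist-compression reason mentioned above.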