Ticket 15940 - Job always tries to start on same node
Summary: Job always tries to start on same node
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: - Unsupported Older Versions
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Oriol Vilarrubi
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-02-03 03:04 MST by Research Computing, University of Bath
Modified: 2023-03-01 07:39 MST
CC List: 3 users

See Also:
Site: University of Bath
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: CentOS
Machine Name:
CLE Version:
Version Fixed: 20.11.7
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
scontrol nodes (464.66 KB, text/plain), 2023-02-08 02:15 MST, Research Computing, University of Bath
scontrol show jobs (1.49 KB, text/plain), 2023-02-08 02:15 MST, Research Computing, University of Bath
scontrol show jobs (1.47 KB, text/plain), 2023-02-08 02:15 MST, Research Computing, University of Bath
slurm.conf (1.97 KB, text/plain), 2023-02-08 02:16 MST, Research Computing, University of Bath
cyclecloud.conf (25.61 KB, text/plain), 2023-02-08 02:16 MST, Research Computing, University of Bath
sdiag.out (6.09 KB, text/plain), 2023-02-08 02:17 MST, Research Computing, University of Bath
slurmctld log (7.83 MB, application/xz), 2023-02-09 03:44 MST, Research Computing, University of Bath

Description Research Computing, University of Bath 2023-02-03 03:04:50 MST
Using slurm 20.11.7 and Cyclecloud in Azure.

When I submit a job to a particular partition, the same node is always tried, and if I submit two jobs, the second is held in the PD state with the reason given as `Resources`, despite there being nodes in an idle state.

A further issue, which may or may not be related, is that when I start an interactive job using salloc/srun, the job hangs with:

srun: job 67116758 queued and waiting for resources
srun: job 67116758 has been allocated resources

Some delay is to be expected while Azure brings the node up, but even when the node is up and I can ssh onto it, the job still hangs for an inordinate amount of time, and typically gets a node_fail when it does eventually start.
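For reference, the pending-job symptom can be inspected with checks along these lines (`<partition>` is a placeholder for the affected partition):

```shell
# Pending reason and target node for each job in the partition
squeue --partition <partition> --Format "JobID,State,Reason,NodeList"

# Per-node states in the same partition, to confirm idle nodes exist
sinfo --partition <partition> --Node --long
```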

Thanks!
Comment 1 Jason Booth 2023-02-03 14:19:29 MST
Please attach the following.

> slurm.conf
> The slurmctld.log and slurmd.log that is/was part of the job in question
> The output of "sdiag"
> The output of "sinfo"
> The output of "scontrol show nodes"
> The output of "scontrol show job <JOB_ID>"

Please also include the job ID/s in question and the nodes you expect the job to start on.
Comment 2 Research Computing, University of Bath 2023-02-08 02:15:01 MST
Created attachment 28752 [details]
scontrol nodes
Comment 3 Research Computing, University of Bath 2023-02-08 02:15:30 MST
Created attachment 28753 [details]
scontrol show jobs
Comment 4 Research Computing, University of Bath 2023-02-08 02:15:56 MST
Created attachment 28754 [details]
scontrol show jobs
Comment 5 Research Computing, University of Bath 2023-02-08 02:16:17 MST
Created attachment 28755 [details]
slurm.conf
Comment 6 Research Computing, University of Bath 2023-02-08 02:16:47 MST
Created attachment 28756 [details]
cyclecloud.conf
Comment 7 Research Computing, University of Bath 2023-02-08 02:17:25 MST
Created attachment 28757 [details]
sdiag.out
Comment 8 Research Computing, University of Bath 2023-02-08 02:18:19 MST
Attached Jason - thanks.
Comment 10 Jason Booth 2023-02-08 10:44:58 MST
Thanks. Can you also attach the slurmctld logs as well?

Just to confirm, are these spot instances you are using?

Slurmd may not have fully started yet, since you have pointed out that you can ssh in but the node is unavailable. It is also possible that CycleCloud is running startup scripts that further delay the boot.

Can you also supply the slurmd.log from a few of these nodes and verify that the slurmds are running?
Comment 11 Research Computing, University of Bath 2023-02-09 03:44:10 MST
Created attachment 28771 [details]
slurmctld log
Comment 12 Research Computing, University of Bath 2023-02-09 03:50:11 MST
Hi Jason, 

slurmctld attached. 

I think these are two different issues: the wait when bringing a node up, and the resources issue where the job waits for the same node.

These are spot instances (the delay before the job actually does anything is particularly bad on the GPUs).

Also if I try and start an interactive job with:

srun --partition spot-fsv2-1 --nodes 1  --account sysadmin  --qos sysadmin --job-name "int" --cpus-per-task 1 --time 24:00:00 --pty bash


on any of the instance types/partitions, I will get the following many times before actually being successful in launching an interactive session:

srun: job 67117846 queued and waiting for resources
srun: job 67117846 has been allocated resources
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for StepId=67117846.0 failed on node nimbus-1-spot-fsv2-1-pg0-27: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Thanks!

Conn
Comment 13 Research Computing, University of Bath 2023-02-09 03:57:40 MST
And sorry - it is only one partition where I have noticed the job trying/waiting to start on the same node, and it is one of the GPU spot instance partitions.
Comment 14 Nick Ihli 2023-02-09 16:23:36 MST
For the srun issue: srun requires the ephemeral port range to be opened, as the port used is chosen at random from that range. I have seen some Azure sites have that blocked. You can use SrunPortRange to specifically configure which ports srun will use.

https://slurm.schedmd.com/slurm.conf.html#OPT_SrunPortRange
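As a sketch of what that might look like, assuming firewalld on these CentOS nodes (the range 60001-63000 is purely illustrative; pick one that fits your firewall policy):

```shell
# In slurm.conf, pin srun's listening ports to a known range (example values):
#   SrunPortRange=60001-63000

# Then open that same range on every node that runs srun or slurmd:
firewall-cmd --permanent --add-port=60001-63000/tcp
firewall-cmd --reload
```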


For the issue of the job trying to use the same node, what partition is the problematic one? Also, please share how you are submitting your job.

Thanks,
nick
Comment 15 Research Computing, University of Bath 2023-02-10 02:06:16 MST
Thanks Nick, 

The problematic partition is the spot-ncv3-6 one. It always tries to start on nimbus-1-spot-ncv3-6-pg0-2, and if there is a job already on it the next job will wait in the queue with the reason given as `Resources`.



The issue with the slow startup times and this error:



srun: job 67117846 queued and waiting for resources
srun: job 67117846 has been allocated resources
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for StepId=67117846.0 failed on node nimbus-1-spot-fsv2-1-pg0-27: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Happens on all nodes.
Comment 16 Nick Ihli 2023-02-10 10:16:56 MST
Did you look at the srun ports?

Also, what does your job submission script/requirements look like when you submit jobs to the spot-ncv3-6 partition?
Comment 17 Research Computing, University of Bath 2023-02-14 05:13:04 MST
Hi,

What should I set the srun port to?

I'm submitting a job like:

srun --partition spot-ncv3-6 --nodes 1  --account sysadmin  --qos sysadmin --job-name "int" --cpus-per-task 6 --time 24:00:00 --pty bash
Comment 18 Nick Ihli 2023-02-14 09:53:58 MST
See this link for more specifics. srun requires the ephemeral port range to be opened, as the port used is chosen at random from that range. I have seen some Azure sites have that blocked. First, check whether the range 1024-65535 is blocked. If it is, then either unblock it or set the range explicitly using SrunPortRange.

https://slurm.schedmd.com/slurm.conf.html#OPT_SrunPortRange


For the other issue, I was thinking it might be something to do with the memory request, but the problematic partition does have a default memory request, so that shouldn't be the issue. Please send the sinfo output for that partition at the time when the same node is getting scheduled for the jobs.
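A quick way to gather both pieces of information on a node (the commands are a sketch; firewalld is assumed since the nodes run CentOS, and the partition name is the one from this ticket):

```shell
# Is the ephemeral port range restricted on this node?
sysctl net.ipv4.ip_local_port_range

# What is the firewall actually allowing? Look for the ephemeral tcp range.
firewall-cmd --list-all

# Node states in the problematic partition while the second job is pending
sinfo --partition spot-ncv3-6 --Node --long
```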

Oriol has now been assigned this ticket, so I will turn it over to him to continue helping.
Comment 19 Research Computing, University of Bath 2023-02-15 06:53:31 MST
Hi 

I tried SrunPortRange=0-1000 and the problem still persists.

Thanks!
Comment 20 Oriol Vilarrubi 2023-02-15 11:30:50 MST
Hello, did you also open the firewall ports from 0-1000?

Also, you would ideally choose a different SrunPortRange: that range falls within the TCP "well-known ports" (0-1023), which are normally used by other services such as ssh and web, and could cause conflicts. Anything above 1024 should be OK. I am looking at your Slurm config files and logs to try to explain why the second job keeps pending and does not use a new node.
Comment 21 Research Computing, University of Bath 2023-02-20 04:04:15 MST
OK - thanks. 

I have a call with Azure shortly to try and figure out if the ports are the issue. 

I also seem to be getting quite a few jobs stuck in `CG` state, and I am unable to remove them.

Any ideas why this may be happening?


The slurmd.log from one of the nodes looks like:

[2023-02-20T10:54:16.788] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2023-02-20T10:54:16.810] debug:  skipping GRES for NodeName=nimbus-1-paygo-ncv3-12-pg0-[1-16]  Name=gpu Count=2 File=/dev/nvidia[0-1]
[2023-02-20T10:54:16.810] debug:  skipping GRES for NodeName=nimbus-1-paygo-ncv3-24-pg0-[1-16]  Name=gpu Count=4 File=/dev/nvidia[0-3]
[2023-02-20T10:54:16.810] debug:  skipping GRES for NodeName=nimbus-1-paygo-ncv3r-24-pg0-[1-16]  Name=gpu Count=4 File=/dev/nvidia[0-3]
[2023-02-20T10:54:16.810] debug:  skipping GRES for NodeName=nimbus-1-paygo-ndv2-40-pg0-[1-16]  Name=gpu Count=8 File=/dev/nvidia[0-7]
[2023-02-20T10:54:16.810] debug:  skipping GRES for NodeName=nimbus-1-spot-ncv3-12-pg0-[1-16]  Name=gpu Count=2 File=/dev/nvidia[0-1]
[2023-02-20T10:54:16.810] debug:  skipping GRES for NodeName=nimbus-1-spot-ncv3-24-pg0-[1-16]  Name=gpu Count=4 File=/dev/nvidia[0-3]
[2023-02-20T10:54:16.810] debug:  skipping GRES for NodeName=nimbus-1-spot-ncv3-6-pg0-[1-16]  Name=gpu Count=1 File=/dev/nvidia0
[2023-02-20T10:54:16.810] debug:  skipping GRES for NodeName=nimbus-1-spot-ncv3r-24-pg0-[1-16]  Name=gpu Count=4 File=/dev/nvidia[0-3]
[2023-02-20T10:54:16.810] debug:  skipping GRES for NodeName=nimbus-1-spot-ndv2-40-pg0-[1-16]  Name=gpu Count=8 File=/dev/nvidia[0-7]
[2023-02-20T10:54:16.810] debug:  skipping GRES for NodeName=nimbus-1-vis-ncv3-12-pg0-[1-16]  Name=gpu Count=2 File=/dev/nvidia[0-1]
[2023-02-20T10:54:16.811] debug:  skipping GRES for NodeName=nimbus-1-vis-ncv3-24-pg0-[1-16]  Name=gpu Count=4 File=/dev/nvidia[0-3]
[2023-02-20T10:54:16.811] debug:  skipping GRES for NodeName=nimbus-1-vis-ncv3-6-pg0-[1-16]  Name=gpu Count=1 File=/dev/nvidia0
[2023-02-20T10:54:16.811] debug:  skipping GRES for NodeName=nimbus-1-vis-ndv2-40-pg0-[1-16]  Name=gpu Count=8 File=/dev/nvidia[0-7]
[2023-02-20T10:54:16.815] debug:  gres/gpu: init: loaded
[2023-02-20T10:54:16.815] debug:  gpu/generic: init: init: GPU Generic plugin loaded
[2023-02-20T10:54:16.815] Gres Name=gpu Type=(null) Count=1
[2023-02-20T10:54:16.816] topology/tree: init: topology tree plugin loaded
[2023-02-20T10:54:16.816] debug:  topology/tree: _read_topo_file: Reading the topology.conf file
[2023-02-20T10:54:16.835] topology/tree: _validate_switches: TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-fsv2-1-Standard_F2s_v2-pg0 nodes:nimbus-1-paygo-fsv2-1-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-fsv2-16-Standard_F32s_v2-pg0 nodes:nimbus-1-paygo-fsv2-16-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-fsv2-2-Standard_F4s_v2-pg0 nodes:nimbus-1-paygo-fsv2-2-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-fsv2-24-Standard_F48s_v2-pg0 nodes:nimbus-1-paygo-fsv2-24-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-fsv2-32-Standard_F64s_v2-pg0 nodes:nimbus-1-paygo-fsv2-32-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-fsv2-36-Standard_F72s_v2-pg0 nodes:nimbus-1-paygo-fsv2-36-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-fsv2-4-Standard_F8s_v2-pg0 nodes:nimbus-1-paygo-fsv2-4-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-fsv2-8-Standard_F16s_v2-pg0 nodes:nimbus-1-paygo-fsv2-8-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-hb-60-Standard_HB60rs-pg0 nodes:nimbus-1-paygo-hb-60-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-hbv2-120-Standard_HB120rs_v2-pg0 nodes:nimbus-1-paygo-hbv2-120-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-hbv3-120-Standard_HB120rs_v3-pg0 nodes:nimbus-1-paygo-hbv3-120-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-hc-44-Standard_HC44rs-pg0 nodes:nimbus-1-paygo-hc-44-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-ncv3-12-Standard_NC12s_v3-pg0 nodes:nimbus-1-paygo-ncv3-12-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-ncv3-24-Standard_NC24s_v3-pg0 nodes:nimbus-1-paygo-ncv3-24-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-ncv3-6-Standard_NC6s_v3-pg0 nodes:nimbus-1-paygo-ncv3-6-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-ncv3r-24-Standard_NC24rs_v3-pg0 nodes:nimbus-1-paygo-ncv3r-24-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:paygo-ndv2-40-Standard_ND40rs_v2-pg0 nodes:nimbus-1-paygo-ndv2-40-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-fsv2-1-Standard_F2s_v2-pg0 nodes:nimbus-1-spot-fsv2-1-pg0-[1-64] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-fsv2-16-Standard_F32s_v2-pg0 nodes:nimbus-1-spot-fsv2-16-pg0-[1-20] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-fsv2-2-Standard_F4s_v2-pg0 nodes:nimbus-1-spot-fsv2-2-pg0-[1-32] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-fsv2-24-Standard_F48s_v2-pg0 nodes:nimbus-1-spot-fsv2-24-pg0-[1-20] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-fsv2-32-Standard_F64s_v2-pg0 nodes:nimbus-1-spot-fsv2-32-pg0-[1-20] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-fsv2-36-Standard_F72s_v2-pg0 nodes:nimbus-1-spot-fsv2-36-pg0-[1-20] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-fsv2-4-Standard_F8s_v2-pg0 nodes:nimbus-1-spot-fsv2-4-pg0-[1-32] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-fsv2-8-Standard_F16s_v2-pg0 nodes:nimbus-1-spot-fsv2-8-pg0-[1-20] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-hb-60-Standard_HB60rs-pg0 nodes:nimbus-1-spot-hb-60-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-hbv2-120-Standard_HB120rs_v2-pg0 nodes:nimbus-1-spot-hbv2-120-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-hbv3-120-Standard_HB120rs_v3-pg0 nodes:nimbus-1-spot-hbv3-120-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-hc-44-Standard_HC44rs-pg0 nodes:nimbus-1-spot-hc-44-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-ncv3-12-Standard_NC12s_v3-pg0 nodes:nimbus-1-spot-ncv3-12-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-ncv3-24-Standard_NC24s_v3-pg0 nodes:nimbus-1-spot-ncv3-24-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-ncv3-6-Standard_NC6s_v3-pg0 nodes:nimbus-1-spot-ncv3-6-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-ncv3r-24-Standard_NC24rs_v3-pg0 nodes:nimbus-1-spot-ncv3r-24-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:spot-ndv2-40-Standard_ND40rs_v2-pg0 nodes:nimbus-1-spot-ndv2-40-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:vis-ncv3-12-Standard_NC12s_v3-pg0 nodes:nimbus-1-vis-ncv3-12-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:vis-ncv3-24-Standard_NC24s_v3-pg0 nodes:nimbus-1-vis-ncv3-24-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:vis-ncv3-6-Standard_NC6s_v3-pg0 nodes:nimbus-1-vis-ncv3-6-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] debug:  topology/tree: _log_switches: Switch level:0 name:vis-ndv2-40-Standard_ND40rs_v2-pg0 nodes:nimbus-1-vis-ndv2-40-pg0-[1-16] switches:(null)
[2023-02-20T10:54:16.835] route/default: init: route default plugin loaded
[2023-02-20T10:54:16.835] CPU frequency setting not configured for this node
[2023-02-20T10:54:16.835] debug:  Resource spec: No specialized cores configured by default on this node
[2023-02-20T10:54:16.835] debug:  Resource spec: Reserved system memory limit not configured for this node
[2023-02-20T10:54:16.835] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2023-02-20T10:54:16.849] task/affinity: init: task affinity plugin loaded with CPU mask 0x3f
[2023-02-20T10:54:16.850] debug:  task/cgroup: init: core enforcement enabled
[2023-02-20T10:54:16.850] debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:112646M allowed:100%(enforced), swap:0%(enforced), max:100%(112646M) max+swap:100%(225292M) min:30M kmem:100%(112646M permissive) min:30M swappiness:0(unset)
[2023-02-20T10:54:16.850] debug:  task/cgroup: init: memory enforcement enabled
[2023-02-20T10:54:16.851] debug:  task/cgroup: task_cgroup_devices_init: unable to open /etc/slurm/cgroup_allowed_devices_file.conf: No such file or directory
[2023-02-20T10:54:16.851] debug:  task/cgroup: init: device enforcement enabled
[2023-02-20T10:54:16.851] debug:  task/cgroup: init: task/cgroup: loaded
[2023-02-20T10:54:16.851] debug:  auth/munge: init: Munge authentication plugin loaded
[2023-02-20T10:54:16.851] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
[2023-02-20T10:54:16.851] cred/munge: init: Munge credential signature plugin loaded
[2023-02-20T10:54:16.851] slurmd version 20.11.7 started
[2023-02-20T10:54:16.851] debug:  jobacct_gather/linux: init: Job accounting gather LINUX plugin loaded
[2023-02-20T10:54:16.851] debug:  job_container/none: init: job_container none plugin loaded
[2023-02-20T10:54:16.857] debug:  switch/none: init: switch NONE plugin loaded
[2023-02-20T10:54:16.858] slurmd started on Mon, 20 Feb 2023 10:54:16 +0000
[2023-02-20T10:54:16.858] CPUs=6 Boards=1 Sockets=1 Cores=6 Threads=1 Memory=112646 TmpDisk=64520 Uptime=1708 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2023-02-20T10:54:16.858] debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
[2023-02-20T10:54:16.858] debug:  acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
[2023-02-20T10:54:16.858] debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
[2023-02-20T10:54:16.859] debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
[2023-02-20T10:54:16.871] debug:  _handle_node_reg_resp: slurmctld sent back 9 TRES.
[2023-02-20T10:54:21.302] debug:  [job 67119232] attempting to run prolog [/sched/slurm.prolog]
[2023-02-20T10:54:21.310] launch task StepId=67119232.0 request from UID:20003 GID:20003 HOST:172.18.88.22 PORT:54108
[2023-02-20T10:54:21.310] debug:  Checking credential with 596 bytes of sig data
[2023-02-20T10:54:21.310] debug:  task/affinity: task_p_slurmd_launch_request: task affinity : before lllp distribution cpu bind method is '(null type)' ((null))
[2023-02-20T10:54:21.310] debug:  task/affinity: lllp_distribution: binding tasks:6 to nodes:1 sockets:1:0 cores:6:0 threads:6
[2023-02-20T10:54:21.310] task/affinity: lllp_distribution: JobId=67119232 implicit auto binding: cores,one_thread, dist 8192
[2023-02-20T10:54:21.310] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2023-02-20T10:54:21.310] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [67119232]: mask_cpu,one_thread, 0x3F
[2023-02-20T10:54:21.310] debug:  task/affinity: task_p_slurmd_launch_request: task affinity : after lllp distribution cpu bind method is 'mask_cpu,one_thread' (0x3F)
[2023-02-20T10:54:21.310] debug:  Waiting for job 67119232's prolog to complete
[2023-02-20T10:55:13.035] debug:  _rpc_terminate_job, uid = 11100 JobId=67119232
[2023-02-20T10:55:13.036] debug:  credential for job 67119232 revoked
[2023-02-20T10:55:13.036] debug:  _rpc_terminate_job: sent SUCCESS for 67119232, waiting for prolog to finish
[2023-02-20T10:55:13.036] debug:  Waiting for job 67119232's prolog to complete
[2023-02-20T10:56:38.930] debug:  _rpc_terminate_job, uid = 11100 JobId=67119232
[2023-02-20T10:59:29.652] debug:  _rpc_terminate_job, uid = 11100 JobId=67119232
[2023-02-20T10:59:31.967] debug:  _handle_node_reg_resp: slurmctld sent back 9 TRES.
Comment 22 Research Computing, University of Bath 2023-02-21 04:18:35 MST
OK - I think I have got to the bottom of the various issues here. 

Drivers on the GPU node images weren't properly installed; fixing that cleared up the hang on those nodes.

The socket timeout issue seems to be down to problems with the prolog and epilog scripts that were causing them to run for far too long.

I have put fixes in for these, and now everything seems to be working as expected.

The only remaining issue is the jobs stuck in a `CG` state that I can't get rid of.
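(For reference, a common workaround for jobs wedged in completing state, when the node itself is stuck, is to cycle the node's state so slurmctld can close out the job records. The node name below is a placeholder.)

```shell
# Mark the stuck node down, which forces slurmctld to finish the CG jobs...
scontrol update NodeName=<node> State=DOWN Reason="stuck in CG"

# ...then return it to service
scontrol update NodeName=<node> State=RESUME
```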
Comment 23 Research Computing, University of Bath 2023-02-22 09:34:20 MST
Actually I also have another issue that I can't get to the bottom of. 

During a batch/interactive job, the user's groups aren't propagated to the compute node. On the login node the user's groups are fine, but when I run a batch or interactive job for some users, the groups are not propagated correctly.
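A quick way to see the discrepancy is to compare group membership on the login node with what a job step reports on a compute node (the partition name here is just one from this cluster, used as an example):

```shell
# On the login node, as the affected user:
id

# The same lookup as seen by a job step on a compute node:
srun --partition spot-fsv2-1 --nodes 1 id
```

If the two outputs differ, the compute node's name service (e.g. sssd/LDAP) is a likely suspect rather than Slurm itself.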
Comment 24 Research Computing, University of Bath 2023-03-01 07:39:43 MST
Solved.