Ticket 5543

Summary: srun under salloc: step creation temporarily disabled, retrying
Product: Slurm Reporter: Levi Morrison <levi_morrison>
Component: User Commands Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: da, tim
Version: 17.11.8   
Hardware: Linux   
OS: Linux   
Site: BYU - Brigham Young University
Attachments: slurm.conf for RHEL 6
cgroup.conf for RHEL 6
nodes.conf - RHEL 6
nodes.conf - RHEL 7
Excerpt from slurmctld.log for relevant job

Description Levi Morrison 2018-08-09 11:48:57 MDT
We are currently running two operating systems: RHEL 6 and 7.

On RHEL 6 we are able to launch a job via salloc and then use srun to launch job steps; something like:

  salloc --nodes=1 --exclusive --mem-per-cpu=2G --time=1:00:00
  srun ./solver-mpi --input 4032x4032.grid --output 4032x4032.grid.done

This works. However, doing the same thing on our RHEL 7 OS hangs on the `srun` call.

The slurmctld.log does not have any information in it about the job step.

The SallocDefaultCommand is the same on each OS:
  SallocDefaultCommand = "srun --mem-per-cpu=0 -n1 -N1 --pty --preserve-env --mpi=none $SHELL"

We tried running the command manually without `--mpi=none` in case that was somehow affecting things, but it didn't make a difference.

If I attach a debugger the last few frames look like this:
#0  0x00007f932fe84a20 in __poll_nocancel () from /lib64/libc.so.6
#1  0x00007f93303c7b01 in slurm_step_ctx_create_timeout (step_params=step_params@entry=0x150ea00, 
    timeout=121000) at step_ctx.c:287
#2  0x0000000000409f0e in launch_common_create_job_step (job=job@entry=0x150e8f0, 
    use_all_cpus=<optimized out>, signal_function=0x41668c <_signal_while_allocating>, 
    destroy_job=0x625f98 <destroy_job>, opt_local=0x626180 <opt>) at launch.c:334

The last few lines of `srun -vvvvv` look like this:
srun: remote command    : `./solver-mpi --input 4032x4032.grid --output 4032x4032.grid.done'
srun: debug:  propagating RLIMIT_STACK=8388608
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug2: srun PMI messages to port=42102
srun: debug3: Trying to load plugin /usr/local/lib/slurm/auth_munge.so
srun: debug:  Munge authentication plugin loaded
srun: debug3: Success.
srun: jobid 25589008: nodes(1):`m9g-1-20', cpu counts: 28(x1)
srun: debug2: creating job with 1 tasks
srun: debug:  requesting job 25589008, user 20497, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name solver-mpi, relative 65534
srun: Job 25589008 step creation temporarily disabled, retrying
srun: Job 25589008 step creation still disabled, retrying

We aren't sure what else to check. Ideas?
Comment 1 Alejandro Sanchez 2018-08-10 05:49:05 MDT
Hi,

Are the RHEL 6 and 7 nodes on different clusters? If so, can you attach both slurm.conf and cgroup.conf from each cluster? Otherwise, just the files for the one cluster.

Are you using the exact same request on both OSes (same --exclusive, --mem-per-cpu, etc.)?
Comment 2 Levi Morrison 2018-08-10 08:42:18 MDT
Created attachment 7562 [details]
slurm.conf for RHEL 6

The slurm.conf for RHEL 7 is identical except:
82c82
< SlurmctldPidFile=/var/run/slurmctld.pid
---
> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
84c84
< SlurmdPidFile=/var/run/slurmd.pid
---
> SlurmdPidFile=/var/run/slurm/slurmd.pid
Comment 3 Levi Morrison 2018-08-10 08:45:31 MDT
Created attachment 7563 [details]
cgroup.conf for RHEL 6

The cgroup.conf for RHEL 7 is the same except:
1a2
> CgroupMountpoint="/sys/fs/cgroup"
8d8
< CgroupMountpoint="/cgroup"
Comment 4 Levi Morrison 2018-08-10 08:51:44 MDT
The flags to `srun` are the same except for a reservation (which is how we are currently distinguishing them).

There are some other differences I've thought of:
  - RHEL 6 is running version 17.11.6 clients while RHEL 7 is running 17.11.8 clients.
  - RHEL 7 uses PMIx v2 (which works through sbatch, just not srun) and RHEL 6 uses PMI2.
Comment 5 Alejandro Sanchez 2018-08-10 08:56:42 MDT
Can you attach the node and partition definitions as well? Are they the same across the two systems?

Can you also provide the exact submission command and/or job script for both cases?

Temporarily setting 'scontrol setdebugflags +cpu_bind,steps,selecttype' before submitting the failing step might help with debugging.
Comment 6 Levi Morrison 2018-08-10 14:30:32 MDT
Hmm. Would an `srun` inside an allocated job need anything from nodes.conf or partition info?

The job script just runs this line:

  srun ./solver-mpi --input 4032x4032.grid --output 4032x4032.grid.done

Just the same as I do in an salloc'd job.

It doesn't matter what srun is trying to run; it doesn't get that far. `srun hostname` also fails.
Comment 7 Tim Wickberg 2018-08-12 03:06:39 MDT
*** Ticket 5555 has been marked as a duplicate of this ticket. ***
Comment 11 Alejandro Sanchez 2018-08-13 09:18:57 MDT
I'd like to have your node and partition information to see how --mem-per-cpu interacts with the specific nodes/partitions. It would also be great if you could temporarily set

scontrol setdebug debug2
scontrol setdebugflags +steps,selecttype,cpu_bind

Then execute your failing request

salloc --nodes=1 --exclusive --mem-per-cpu=2G --time=1:00:00
srun hostname

and attach the slurmctld.log. I vaguely suspect there's some sort of issue with the job memory.
Comment 13 Alejandro Sanchez 2018-08-13 10:15:17 MDT
Could you also try this without --exclusive?

salloc --nodes=1 --mem-per-cpu=2G --time=1:00:00
srun hostname

and see if that runs? It looks like there's an uncaught edge case when requesting --exclusive + --mem-per-cpu at once.
Comment 14 Levi Morrison 2018-08-13 10:54:17 MDT
(In reply to Alejandro Sanchez from comment #13)
> Could you also try this without --exclusive?
> 
> salloc --nodes=1 --mem-per-cpu=2G --time=1:00:00
> srun hostname
> 
> and see if that runs? It looks like there's an uncaught edge case when
> requesting --exclusive + --mem-per-cpu at once.

With either of these sallocs:

  salloc --nodes=1 --mem-per-cpu=2G --time=1:00:00
  salloc --nodes=1 --ntasks=24 --mem-per-cpu=2G --time=1:00:00


the `srun hostname` still hangs.
Comment 15 Alejandro Sanchez 2018-08-13 10:57:20 MDT
OK, then I'd need the info requested in comment 11, since I haven't been able to reproduce this so far. Thanks!
Comment 16 Levi Morrison 2018-08-13 12:17:16 MDT
Created attachment 7579 [details]
nodes.conf - RHEL 6
Comment 17 Levi Morrison 2018-08-13 12:20:07 MDT
Created attachment 7580 [details]
nodes.conf - RHEL 7

The only apparent difference is that the RHEL 6 nodes.conf has more `rhel7` features in it.
Comment 19 Levi Morrison 2018-08-13 16:42:28 MDT
Created attachment 7594 [details]
Excerpt from slurmctld.log for relevant job
Comment 22 Alejandro Sanchez 2018-08-14 03:43:06 MDT
Levi,

While it would be great to have the full slurmctld.log rather than only the lines filtered by jobid (we are missing relevant information), and it would also be nice to have the requested partition definitions, I've deduced a few things and I think I know what's going on now.

1. The job is submitted to more than one partition, which doesn't correspond to the salloc requests you mention throughout the ticket comments (I don't see a -p <multiple_parts> request anywhere), but the logs clearly show a multi-partition request for the filtered job:

[2018-08-13T16:36:59.349] debug:  Job 25633483 has more than one partition (m8g)(106700)
[2018-08-13T16:36:59.349] debug:  Job 25633483 has more than one partition (m9g)(106700)

2. I see this in your slurm.conf

#JobSubmitPlugins=all_partitions,lua
JobSubmitPlugins=partition # "partition" is the C plugin we created using an existing file name. This was the least disruptive way to make it happen. Life would be much easier if each plugin type had a "site" plugin that does nothing and we know would never conflict in git

This makes me guess that your job submit plugin modifies the request so that it ends up as a multi-partition request. In any case, the job is submitted to the m8g and m9g partitions according to the filtered logs.

3. I don't have the partition definition, but I _guess_ these partitions include these nodes:

#gpus
NodeName=m8g-1-[1-8],m8g-[2-3]-[1-12]		sockets=2 corespersocket=12 weight=1 RealMemory=65536	 feature=public,gpu,intel,haswell,sse4.1,sse4.2,avx,avx2,fma,xorg,ib,kepler gres=gpu:4
NodeName=m9g-[1-2]-[1-20]                       sockets=2 corespersocket=14 weight=1 RealMemory=131072   feature=public,gpu,intel,broadwell,sse4.1,sse4.2,avx,avx2,fma,xorg,ib,opa,pascal,rhel7 gres=gpu:4

4. The scheduler iterates over the requested partitions trying to allocate resources. It looks like no nodes satisfy the job requirements in partition m8g, but the job is allocated resources on a node in partition m9g:

[2018-08-13T16:36:59.349] job_test_resv: job:25633483 reservation:pmix_test nodes:m9g-1-20
[2018-08-13T16:36:59.349] _build_node_list: No nodes satisfy job 25633483 requirements in partition m8g
[2018-08-13T16:36:59.349] debug2: Try job 25633483 on next partition m9g
[2018-08-13T16:36:59.349] job_test_resv: job:25633483 reservation:pmix_test nodes:m9g-1-20
[2018-08-13T16:36:59.349] job_test_resv: job:25633483 reservation:pmix_test nodes:m9g-1-20
...
[2018-08-13T16:36:59.355] debug2: sched: JobId=25633483 allocated resources: NodeList=m9g-1-20
[2018-08-13T16:36:59.355] sched: _slurm_rpc_allocate_resources JobId=25633483 NodeList=m9g-1-20 usec=6821
[2018-08-13T16:36:59.357] debug2: _slurm_rpc_job_ready(25633483)=3 usec=1
[2018-08-13T16:36:59.499] debug2: Processing RPC: REQUEST_COMPLETE_PROLOG from JobId=25633483
[2018-08-13T16:36:59.499] debug2: _slurm_rpc_complete_prolog JobId=25633483 usec=20
[2018-08-13T16:37:02.359] debug2: _slurm_rpc_job_ready(25633483)=3 usec=2
[2018-08-13T16:37:02.376] debug:  _slurm_rpc_job_pack_alloc_info: JobId=25633483 NodeList=m9g-1-20 usec=19

5. Your slurm.conf shows you have PrologFlags=contain, and the following line is consistent with that:

[2018-08-13T16:37:02.378] step 25633483.4294967295 has nodes m9g-1-20

25633483.4294967295 corresponds to job 25633483's extern step:

slurm/slurm.h.in:#define SLURM_EXTERN_CONT  (0xffffffff)
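(As a quick aside, my own check rather than something from the logs: the 4294967295 suffix is just SLURM_EXTERN_CONT, i.e. 0xffffffff, rendered as an unsigned 32-bit integer, which any shell printf can confirm.)

```shell
# 0xffffffff printed as an unsigned integer gives the extern step suffix.
printf '%u\n' 0xffffffff   # prints 4294967295
```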

6. This extern step is allocated resources, but the srun step inside the allocation isn't, and it hangs:

[2018-08-13T16:37:02.379] step_id:25633483.0
[2018-08-13T16:37:02.379] sched: _slurm_rpc_job_step_create: StepId=25633483.0 m9g-1-20 usec=355
[2018-08-13T16:37:11.487] debug:  _slurm_rpc_job_pack_alloc_info: JobId=25633483 NodeList=m9g-1-20 usec=16
[2018-08-13T16:37:11.490] _slurm_rpc_job_step_create for job 25633483: Requested nodes are busy

7. Node m9g-1-20 is configured with gres=gpu:4 and the pascal feature. The salloc extern step is consuming all 4 GPUs (I'm not sure whether your job submit plugin modifies the job request to request GPUs), leaving no GPU resources for the srun step inside the allocation.

8. Based on all the information above, and the fact that you mentioned the same request works with sbatch, I think that adding:

--gres=gpu:0

to your current SallocDefaultCommand

SallocDefaultCommand = "srun --mem-per-cpu=0 -n1 -N1 --pty --preserve-env --mpi=none $SHELL"

so that it changes to

SallocDefaultCommand = "srun --mem-per-cpu=0 --gres=gpu:0 -n1 -N1 --pty --preserve-env --mpi=none $SHELL"

would prevent salloc from consuming all the GPUs on the node and blocking subsequent srun requests inside such an allocation.
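For what it's worth, here is a sketch of splicing the flag in mechanically. The sed expression is my own illustration, run against a copy of the line rather than a live slurm.conf, and should be reviewed before applying it to the real file:

```shell
# Hypothetical one-liner: insert --gres=gpu:0 right after --mem-per-cpu=0 in
# the SallocDefaultCommand line. Single quotes keep $SHELL literal.
line='SallocDefaultCommand = "srun --mem-per-cpu=0 -n1 -N1 --pty --preserve-env --mpi=none $SHELL"'
echo "$line" | sed 's/--mem-per-cpu=0/--mem-per-cpu=0 --gres=gpu:0/'
```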
Comment 23 Levi Morrison 2018-08-14 08:30:44 MDT
The multiple partitions come from our job submit plugin, which is why we don't see them explicitly.

You are right; the GPUs make a difference. I was able to reproduce this on our old OS on m8g under similar circumstances.

If I launch on a GPU node like this (wrapped because the line is long):

  salloc --nodes=1 --exclusive --mem-per-cpu=2G --time=1:00:00 \
    --reservation=pmix_test --gres=gpu:4  srun --mem-per-cpu=0 -n1 \
    -N1 --pty --preserve-env --mpi=none --gres=gpu:0 /usr/bin/bash

And then run `srun hostname`, it works.

Thank you for your help. Adding `--gres=gpu:0` does not seem to affect jobs that didn't request GPUs, so I consider this resolved. If it causes issues later I'll open a new ticket.