| Summary: | srun under salloc: step creation temporarily disabled, retrying | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Levi Morrison <levi_morrison> |
| Component: | User Commands | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | da, tim |
| Version: | 17.11.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | BYU - Brigham Young University | | |
| Attachments: | slurm.conf for RHEL 6; cgroup.conf for RHEL 6; nodes.conf - RHEL 6; nodes.conf - RHEL 7; Excerpt from slurmctld.log for relevant job | | |

Hi, are the RHEL 6 and RHEL 7 nodes on different clusters? If so, can you attach both slurm.conf and cgroup.conf from each cluster? Otherwise, just the files for one cluster. Are you using the exact same request on both OSes (same --exclusive, --mem-per-cpu, etc.)?

Created attachment 7562 [details] slurm.conf for RHEL 6

The slurm.conf for RHEL 7 is identical except:

```
82c82
< SlurmctldPidFile=/var/run/slurmctld.pid
---
> SlurmctldPidFile=/var/run/slurm/slurmctld.pid
84c84
< SlurmdPidFile=/var/run/slurmd.pid
---
> SlurmdPidFile=/var/run/slurm/slurmd.pid
```

Created attachment 7563 [details] cgroup.conf for RHEL 6

The cgroup.conf for RHEL 7 is the same except:

```
1a2
> CgroupMountpoint="/sys/fs/cgroup"
8d8
< CgroupMountpoint="/cgroup"
```

The flags to `srun` are the same except for a reservation (which is how we are currently distinguishing them). There are some other differences I've thought of:

- RHEL 6 is running version 17.11.6 clients while RHEL 7 is running 17.11.8 clients.
- RHEL 7 uses PMIx v2 (which works through sbatch, just not srun) and RHEL 6 uses PMI2.

Can you attach the node and partition definitions as well? Are they the same across the two systems? Can you also provide the exact submission command and/or script for both cases? Temporarily setting `scontrol setdebugflags +cpu_bind,steps,selecttype` before submitting the failing step might help debugging.

Hmm. Would an `srun` inside of an allocated job need anything from nodes.conf or the partition info? The job script just runs this line:

```
srun ./solver-mpi --input 4032x4032.grid --output 4032x4032.grid.done
```

just the same as I do in an salloc'd job. It doesn't matter what srun is trying to run; it doesn't get that far. `srun hostname` also fails.

*** Ticket 5555 has been marked as a duplicate of this ticket. ***

I'd like to have your node and partition information to see how --mem-per-cpu interacts with the specific nodes/partitions.
It would also be great if you could temporarily set

```
scontrol setdebug debug2
scontrol setdebugflags +steps,selecttype,cpu_bind
```

then execute your failing request

```
salloc --nodes=1 --exclusive --mem-per-cpu=2G --time=1:00:00
srun hostname
```

and attach the slurmctld.log. I'm vaguely suspecting there's some sort of issue with the job memory.

Could you also try this without --exclusive?

```
salloc --nodes=1 --mem-per-cpu=2G --time=1:00:00
srun hostname
```

and see if that runs? It looks like there's an uncaught edge case when requesting --exclusive + --mem-per-cpu at once.

(In reply to Alejandro Sanchez from comment #13)
> Could you also try this without --exclusive?
>
> salloc --nodes=1 --mem-per-cpu=2G --time=1:00:00
> srun hostname
>
> and see if that runs? It looks like there's an uncaught edge case when
> requesting --exclusive + --mem-per-cpu at once.

With either of these sallocs:

```
salloc --nodes=1 --mem-per-cpu=2G --time=1:00:00
salloc --nodes=1 --ntasks=24 --mem-per-cpu=2G --time=1:00:00
```

the `srun hostname` still hangs.

OK, then I'd need the info requested in comment 11, since I've not been able to reproduce so far. Thanks!

Created attachment 7579 [details]
nodes.conf - RHEL 6
Created attachment 7580 [details]
nodes.conf - RHEL 7
The only apparent difference is that the RHEL 6 nodes.conf has more `rhel7` features in it.
Created attachment 7594 [details]
Excerpt from slurmctld.log for relevant job
Levi, although it would be great to have the full slurmctld.log (not only the lines filtered by jobid, since we are missing relevant information), and it would be nice to have the requested partition definition, I've deduced a few things and I think I know what's going on now.

1. The job is submitted to more than one partition, which doesn't correspond to the salloc requests you mention throughout the bug comments (I don't see any `-p <multiple_parts>` request anywhere), but the logs clearly show a multi-partition request for the filtered job:

```
[2018-08-13T16:36:59.349] debug: Job 25633483 has more than one partition (m8g)(106700)
[2018-08-13T16:36:59.349] debug: Job 25633483 has more than one partition (m9g)(106700)
```

2. I see this in your slurm.conf:

```
#JobSubmitPlugins=all_partitions,lua
JobSubmitPlugins=partition
# "partition" is the C plugin we created using an existing file name. This was
# the least disruptive way to make it happen. Life would be much easier if each
# plugin type had a "site" plugin that does nothing and we know would never
# conflict in git.
```

This makes me guess that your job request might be modified so that it ends up being a multi-partition request. In any case, the job is submitted against the m8g and m9g partitions according to the filtered logs.

3. I don't have the partition definition, but I _guess_ these partitions include these nodes:

```
#gpus
NodeName=m8g-1-[1-8],m8g-[2-3]-[1-12] sockets=2 corespersocket=12 weight=1 RealMemory=65536 feature=public,gpu,intel,haswell,sse4.1,sse4.2,avx,avx2,fma,xorg,ib,kepler gres=gpu:4
NodeName=m9g-[1-2]-[1-20] sockets=2 corespersocket=14 weight=1 RealMemory=131072 feature=public,gpu,intel,broadwell,sse4.1,sse4.2,avx,avx2,fma,xorg,ib,opa,pascal,rhel7 gres=gpu:4
```

4. The scheduler iterates over the requested partitions trying to allocate resources. It looks like no nodes satisfy the job requirements in partition m8g, but the job is allocated resources on a partition m9g node:

```
[2018-08-13T16:36:59.349] job_test_resv: job:25633483 reservation:pmix_test nodes:m9g-1-20
[2018-08-13T16:36:59.349] _build_node_list: No nodes satisfy job 25633483 requirements in partition m8g
[2018-08-13T16:36:59.349] debug2: Try job 25633483 on next partition m9g
[2018-08-13T16:36:59.349] job_test_resv: job:25633483 reservation:pmix_test nodes:m9g-1-20
[2018-08-13T16:36:59.349] job_test_resv: job:25633483 reservation:pmix_test nodes:m9g-1-20
...
[2018-08-13T16:36:59.355] debug2: sched: JobId=25633483 allocated resources: NodeList=m9g-1-20
[2018-08-13T16:36:59.355] sched: _slurm_rpc_allocate_resources JobId=25633483 NodeList=m9g-1-20 usec=6821
[2018-08-13T16:36:59.357] debug2: _slurm_rpc_job_ready(25633483)=3 usec=1
[2018-08-13T16:36:59.499] debug2: Processing RPC: REQUEST_COMPLETE_PROLOG from JobId=25633483
[2018-08-13T16:36:59.499] debug2: _slurm_rpc_complete_prolog JobId=25633483 usec=20
[2018-08-13T16:37:02.359] debug2: _slurm_rpc_job_ready(25633483)=3 usec=2
[2018-08-13T16:37:02.376] debug: _slurm_rpc_job_pack_alloc_info: JobId=25633483 NodeList=m9g-1-20 usec=19
```

5. Your slurm.conf shows you have PrologFlags=contain, and the following line is consistent with that:

```
[2018-08-13T16:37:02.378] step 25633483.4294967295 has nodes m9g-1-20
```

25633483.4294967295 corresponds to job 25633483's extern step:

```
slurm/slurm.h.in:#define SLURM_EXTERN_CONT (0xffffffff)
```

6. This extern step is allocated resources, but the srun step inside the allocation isn't, and it hangs:

```
[2018-08-13T16:37:02.379] step_id:25633483.0
[2018-08-13T16:37:02.379] sched: _slurm_rpc_job_step_create: StepId=25633483.0 m9g-1-20 usec=355
[2018-08-13T16:37:11.487] debug: _slurm_rpc_job_pack_alloc_info: JobId=25633483 NodeList=m9g-1-20 usec=16
[2018-08-13T16:37:11.490] _slurm_rpc_job_step_create for job 25633483: Requested nodes are busy
```

7. Node m9g-1-20 is configured with gres=gpu:4 and the pascal feature. The salloc step is consuming the 4 GPUs (I'm not sure whether your job submit plugin modifies the job request to request GPUs), leaving no resources for the srun step inside the allocation.

8. Based on all the information above, and the fact that you mentioned that the same request works with sbatch, I think adding `--gres=gpu:0` to your current SallocDefaultCommand

```
SallocDefaultCommand = "srun --mem-per-cpu=0 -n1 -N1 --pty --preserve-env --mpi=none $SHELL"
```

so that it changes to

```
SallocDefaultCommand = "srun --mem-per-cpu=0 --gres=gpu:0 -n1 -N1 --pty --preserve-env --mpi=none $SHELL"
```

would prevent salloc from consuming all the GPUs on the node and blocking subsequent srun requests inside that allocation.

The multiple partitions come from our Job Submit plugin; that's why we don't see them explicitly.
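As an editorial aside, the GRES accounting described in points 7 and 8 can be sketched with a toy model (not Slurm code; the GPU count comes from the `gres=gpu:4` node definition quoted above):

```python
# Toy model (editor's sketch, not Slurm source) of the step accounting:
# the m9g node exposes gres=gpu:4, the shell step that SallocDefaultCommand
# launches holds GPUs for the life of the allocation, and a later step that
# needs those GPUs must wait, which srun reports as
# "step creation temporarily disabled, retrying".

NODE_GPUS = 4  # from: NodeName=m9g-[1-2]-[1-20] ... gres=gpu:4

def step_can_start(gpus_held_by_other_steps: int, gpus_needed: int) -> bool:
    """A step starts only if enough GPUs are left unheld on the node."""
    return NODE_GPUS - gpus_held_by_other_steps >= gpus_needed

# Original SallocDefaultCommand: the shell step ends up holding all 4 GPUs,
# so any step that needs even one of them blocks:
print(step_can_start(gpus_held_by_other_steps=4, gpus_needed=1))  # False

# With --gres=gpu:0 added, the shell step holds none and the step can run:
print(step_can_start(gpus_held_by_other_steps=0, gpus_needed=1))  # True
```

This is only an illustration of the resource bookkeeping, not of Slurm's actual select-plugin logic.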
You are right; the GPUs make a difference. I was able to reproduce this on our old OS on m8g under similar circumstances.

If I launch on a GPU node like this (wrapped because the line is long):

```
salloc --nodes=1 --exclusive --mem-per-cpu=2G --time=1:00:00 \
    --reservation=pmix_test --gres=gpu:4 srun --mem-per-cpu=0 -n1 \
    -N1 --pty --preserve-env --mpi=none --gres=gpu:0 /usr/bin/bash
```

and then run `srun hostname`, it works.
Thank you for your help. Adding `--gres=gpu:0` does not seem to affect jobs that didn't request GPUs, so I consider this resolved. If it causes issues later I'll open a new ticket.

---

We are currently running two operating systems: RHEL 6 and 7. On RHEL 6 we are able to launch a job via salloc and then use srun to launch job steps; something like:

```
salloc --nodes=1 --exclusive --mem-per-cpu=2G --time=1:00:00
srun ./solver-mpi --input 4032x4032.grid --output 4032x4032.grid.done
```

This works. However, doing the same thing on our RHEL 7 OS hangs on the `srun` call. The slurmctld.log does not have any information in it about the job step. The SallocDefaultCommand is the same on each OS:

```
SallocDefaultCommand = "srun --mem-per-cpu=0 -n1 -N1 --pty --preserve-env --mpi=none $SHELL"
```

We tried running the command manually without `--mpi=none` in case that was somehow affecting things, but it didn't make a difference.

If I attach a debugger, the last few frames look like this:

```
#0 0x00007f932fe84a20 in __poll_nocancel () from /lib64/libc.so.6
#1 0x00007f93303c7b01 in slurm_step_ctx_create_timeout (step_params=step_params@entry=0x150ea00, timeout=121000) at step_ctx.c:287
#2 0x0000000000409f0e in launch_common_create_job_step (job=job@entry=0x150e8f0, use_all_cpus=<optimized out>, signal_function=0x41668c <_signal_while_allocating>, destroy_job=0x625f98 <destroy_job>, opt_local=0x626180 <opt>) at launch.c:334
```

The last few lines of `srun -vvvvv` look like this:

```
srun: remote command : `./solver-mpi --input 4032x4032.grid --output 4032x4032.grid.done'
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug2: srun PMI messages to port=42102
srun: debug3: Trying to load plugin /usr/local/lib/slurm/auth_munge.so
srun: debug: Munge authentication plugin loaded
srun: debug3: Success.
srun: jobid 25589008: nodes(1):`m9g-1-20', cpu counts: 28(x1)
srun: debug2: creating job with 1 tasks
srun: debug: requesting job 25589008, user 20497, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name solver-mpi, relative 65534
srun: Job 25589008 step creation temporarily disabled, retrying
srun: Job 25589008 step creation still disabled, retrying
```

We aren't sure what else to check. Ideas?
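As an editorial aside, the debugger frames above show srun parked in `slurm_step_ctx_create_timeout()` while it prints "step creation temporarily disabled, retrying": the controller rejects the step with a retryable error and srun keeps polling until resources free up or a timeout expires. A toy sketch of that retry shape (not Slurm source; the timeout and controller behavior here are simulated):

```python
import time

# Editor's sketch (not Slurm source) of the retry loop behind
# "step creation temporarily disabled, retrying": keep asking the
# controller for a step until it succeeds or the timeout expires.

def create_step_with_retry(try_create, timeout_s=2.0, poll_s=0.05):
    """Retry a step-create callback until it succeeds or the deadline hits."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if try_create():          # controller accepted: step is created
            return True
        time.sleep(poll_s)        # real srun poll()s a message socket here
    return False                  # still disabled when the deadline passed

# Simulate the controller answering "Requested nodes are busy" twice and
# then accepting, as happens once the blocking step releases the GPUs:
attempts = []
def fake_controller():
    attempts.append(1)
    return len(attempts) >= 3

print(create_step_with_retry(fake_controller))  # True after three attempts
```

In the real failure, `try_create` never succeeds while the salloc shell step holds all four GPUs, so srun keeps printing "still disabled, retrying" indefinitely.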