| Summary: | Wrong number of tasks started on each node | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Bjørn-Helge Mevik <b.h.mevik> |
| Component: | Scheduling | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | da |
| Version: | 2.4.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
I am guessing there is a problem with the sort in the hostlist functions, probably a strcmp() where a strnatcmp() is needed, but what you ask for doesn't specify that you want 2 tasks on any specific node. You would need the arbitrary distribution option for that. Unless you use it, Slurm will lay the tasks out any way it can, and reorders the node list as soon as it comes in. Let us know if you are able to fix the sort, but based on the request, you got what I would expect to happen. Try the arbitrary distribution mode and see if that works as you would expect.

I'm sorry if I was unclear, so let me try to explain again. I was not trying to specify explicitly where the tasks are run, but merely provided a minimal, reproducible example of the problem. See a bigger, more general example below.
We are using
Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:100,athena:4 State=unknown
PartitionName=DEFAULT State=up Shared=NO
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup
i.e., we are using consumable resources, which should not be shared.
The problem is that Slurm _allocates_ a number of tasks to a node (based on the amount of unallocated resources there), but _runs_ a different number of tasks there.
Typically, the consequence is that some nodes get more tasks than they have cores, or have their memory overallocated. This is bad for HPC.
Also, Slurm's memory limits are based on where it _allocates_ the tasks, so a job that starts, for instance, 8 tasks on a node where it should have run 2 will likely exceed its memory allowance on that node and be killed.
Here is a more general example. According to "scontrol show job" and the SLURM_* environment variables, it should have run 16 tasks on each of the nodes in rack 13 (c13-X), but as the srun output shows, it doesn't; for instance, it only runs 8 tasks on c13-10. On the other hand, it should only run 9 tasks on c5-4, but in fact runs 16 tasks there.
$ cat env.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:10:0
#SBATCH --mem-per-cpu=500M
echo On batch node:
echo '****************'
hostname
env | grep SLURM | sort
scontrol show job $SLURM_JOB_ID --details
echo Srun:
echo '****************'
srun hostname | sort | uniq -c
$ sbatch --ntasks=800 env.sm
Submitted batch job 642175
$ cat slurm-642175.out
On batch node:
****************
compute-13-1.local
SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
SLURM_CPUS_ON_NODE=16
SLURMD_NODENAME=c13-1
SLURM_GTIDS=0
SLURM_JOB_CPUS_PER_NODE=16(x29),9,15(x3),16,14,8(x4),16(x5),8,4,16(x8)
SLURM_JOB_ID=642175
SLURM_JOBID=642175
SLURM_JOB_NAME=env.sm
SLURM_JOB_NODELIST=c13-[1-6,8-12,14-28],c5-[1-28]
SLURM_JOB_NUM_NODES=54
SLURM_LOCALID=0
SLURM_MEM_PER_CPU=500
SLURM_NNODES=54
SLURM_NODE_ALIASES=(null)
SLURM_NODEID=0
SLURM_NODELIST=c13-[1-6,8-12,14-28],c5-[1-28]
SLURM_NPROCS=800
SLURM_NTASKS=800
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
SLURM_TASK_PID=1224
SLURM_TASKS_PER_NODE=16(x29),9,15(x3),16,14,8(x4),16(x5),8,4,16(x8)
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
SLURM_TOPOLOGY_ADDR=root.rack13.c13-1
JobId=642175 Name=env.sm
UserId=bhm(10231) GroupId=users(100)
Priority=20477 Account=staff QOS=staff
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
SubmitTime=2012-12-05T10:12:57 EligibleTime=2012-12-05T10:12:57
StartTime=2012-12-05T10:12:57 EndTime=2012-12-05T10:22:57
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=normal AllocNode:Sid=login-0-2:16237
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c13-[1-6,8-12,14-28],c5-[1-28]
BatchHost=c13-1
NumNodes=54 NumCPUs=800 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=c13-[1-6,8-12,14-28],c5-[1-3] CPU_IDs=0-15 Mem=8000
Nodes=c5-4 CPU_IDs=0-7,13 Mem=4500
Nodes=c5-[5-7] CPU_IDs=1-15 Mem=7500
Nodes=c5-8 CPU_IDs=0-15 Mem=8000
Nodes=c5-9 CPU_IDs=0-13 Mem=7000
Nodes=c5-[10-13] CPU_IDs=8-15 Mem=4000
Nodes=c5-[14-18] CPU_IDs=0-15 Mem=8000
Nodes=c5-19 CPU_IDs=0-7 Mem=4000
Nodes=c5-20 CPU_IDs=4-7 Mem=2000
Nodes=c5-[21-28] CPU_IDs=0-15 Mem=8000
MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/cluster/home/bhm/slurm/env.sm
WorkDir=/cluster/home/bhm/slurm
Srun:
****************
8 compute-13-10.local
8 compute-13-11.local
8 compute-13-12.local
16 compute-13-14.local
16 compute-13-15.local
16 compute-13-16.local
16 compute-13-17.local
16 compute-13-18.local
8 compute-13-19.local
16 compute-13-1.local
4 compute-13-20.local
16 compute-13-21.local
16 compute-13-22.local
16 compute-13-23.local
16 compute-13-24.local
16 compute-13-25.local
16 compute-13-26.local
16 compute-13-27.local
16 compute-13-28.local
9 compute-13-2.local
15 compute-13-3.local
15 compute-13-4.local
15 compute-13-5.local
16 compute-13-6.local
14 compute-13-8.local
8 compute-13-9.local
16 compute-5-10.local
16 compute-5-11.local
16 compute-5-12.local
16 compute-5-13.local
16 compute-5-14.local
16 compute-5-15.local
16 compute-5-16.local
16 compute-5-17.local
16 compute-5-18.local
16 compute-5-19.local
16 compute-5-1.local
16 compute-5-20.local
16 compute-5-21.local
16 compute-5-22.local
16 compute-5-23.local
16 compute-5-24.local
16 compute-5-25.local
16 compute-5-26.local
16 compute-5-27.local
16 compute-5-28.local
16 compute-5-2.local
16 compute-5-3.local
16 compute-5-4.local
16 compute-5-5.local
16 compute-5-6.local
16 compute-5-7.local
16 compute-5-8.local
16 compute-5-9.local
I've done some more experimenting, and it looks like it is the output of
"scontrol show job" and the environment variables that are wrong. (The memory limits on the nodes are also wrong; see below.)
I created a reservation for two nodes, c9-3 and c11-36, and started a number
of jobs there, so that the nodes were partially allocated:
# bjob -u bhm
JOBID NAME USER ACCOUNT PARTITI QOS ST PRIORI TIME TIME_LEFT CPUS NOD MIN_MEM MIN_TMP NODELIST(REASON)
643529 sleeeep.sm bhm staff normal staff R 20477 1:24 58:36 1 1 1024 0 c11-36
643528 sleeeep.sm bhm staff normal staff R 20477 1:25 58:35 1 1 1024 0 c11-36
643527 sleeeep.sm bhm staff normal staff R 20477 1:46 58:14 1 1 1024 0 c9-3
643526 sleeeep.sm bhm staff normal staff R 20477 1:47 58:13 1 1 1024 0 c9-3
643524 sleeeep.sm bhm staff normal staff R 20477 1:48 58:12 1 1 1024 0 c9-3
643525 sleeeep.sm bhm staff normal staff R 20477 1:48 58:12 1 1 1024 0 c9-3
643523 sleeeep.sm bhm staff normal staff R 20477 1:49 58:11 1 1 1024 0 c9-3
643522 sleeeep.sm bhm staff normal staff R 20477 2:01 57:59 1 1 1024 0 c9-3
643520 sleeeep.sm bhm staff normal staff R 20477 2:02 57:58 1 1 1024 0 c9-3
643521 sleeeep.sm bhm staff normal staff R 20477 2:02 57:58 1 1 1024 0 c9-3
643519 sleeeep.sm bhm staff normal staff R 20477 2:04 57:56 1 1 1024 0 c9-3
643518 sleeeep.sm bhm staff normal staff R 20477 2:13 57:47 1 1 1024 0 c9-3
643516 sleeeep.sm bhm staff normal staff R 20477 2:23 57:37 1 1 1024 0 c9-3
643517 sleeeep.sm bhm staff normal staff R 20477 2:23 57:37 1 1 1024 0 c9-3
643514 sleeeep.sm bhm staff normal staff R 20477 2:24 57:36 1 1 1024 0 c9-3
643515 sleeeep.sm bhm staff normal staff R 20477 2:24 57:36 1 1 1024 0 c9-3
643513 sleeeep.sm bhm staff normal staff R 20477 2:25 57:35 1 1 1024 0 c9-3
I.e., there is 1 unallocated core on c9-3 and 14 on c11-36 (all our nodes have 16 cores). Then I started a 15-task job in the reservation:
$ sbatch --reservation=bhmtest --ntasks=15 env.sm
Submitted batch job 643532
$ cat slurm-643532.out [I've truncated the output a bit]
On batch node:
****************
compute-11-36.local
SLURM_JOB_CPUS_PER_NODE=1,14
SLURM_JOB_NODELIST=c11-36,c9-3
SLURM_NODELIST=c11-36,c9-3
SLURM_NPROCS=15
SLURM_NTASKS=15
SLURM_TASKS_PER_NODE=1,14
JobId=643532 Name=env.sm
NodeList=c11-36,c9-3
BatchHost=c11-36
NumNodes=2 NumCPUs=15 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=c11-36 CPU_IDs=0 Mem=500
Nodes=c9-3 CPU_IDs=0-13 Mem=7000
MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
Srun:
****************
14 compute-11-36.local
1 compute-9-3.local
So it seems the job ran 1 task on c9-3 and 14 tasks on c11-36, as it should, but
the scontrol output and the job environment variables are wrong. Thus my
claim that nodes would be overallocated is wrong.
However, the memory limits for the job on the nodes are wrong:
# ssh c9-3 grep '643532.*limit_in_bytes' /var/log/slurm/slurmd.log
[2012-12-05T12:30:06] [643532.0] parameter 'memory.limit_in_bytes' set to '7340032000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.memsw.limit_in_bytes' set to '7340032000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.limit_in_bytes' set to '7340032000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_0'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.memsw.limit_in_bytes' set to '7340032000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_0'
# ssh c11-36 grep '643532.*limit_in_bytes' /var/log/slurm/slurmd.log
[2012-12-05T12:30:05] [643532] parameter 'memory.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:05] [643532] parameter 'memory.memsw.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:05] [643532] parameter 'memory.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_4294967294'
[2012-12-05T12:30:05] [643532] parameter 'memory.memsw.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_4294967294'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.memsw.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_0'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.memsw.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_0'
I.e., the limit is 7000 MiB on c9-3 and 500 MiB on c11-36, which should have
been the other way around.
Thus jobs risk getting killed.
Based upon my very limited testing, this one-line change fixes the problem. I need to do a lot more testing on different systems (e.g. BlueGene and Cray) before making this change in our code, and I have other work needing my attention, but if you want to test with this and report the results, that would be appreciated.
diff --git a/src/common/hostlist.c b/src/common/hostlist.c
index 34c04b0..abc3445 100644
--- a/src/common/hostlist.c
+++ b/src/common/hostlist.c
@@ -901,7 +901,7 @@ static int hostrange_prefix_cmp(hostrange_t h1, hostrange_t h2)
 	if (h2 == NULL)
 		return -1;
-	retval = strcmp(h1->prefix, h2->prefix);
+	retval = strnatcmp(h1->prefix, h2->prefix);
 	return retval == 0 ? h2->singlehost - h1->singlehost : retval;
 }
Thanks! We will test and report!

I've tested it on our test cluster, and it seems to solve the problem, yes. :) I'll do some more tests before porting it to our production cluster. Thanks!

Several more tests have not shown any problems with the patch, and it does solve the issue, so we will port it to our production cluster now. Thanks!

This change can be found in version 2.5.1 (when released). You will need to use the patch until upgrading. Thank you for testing.
I think we have come across a bug in Slurm. We are running Slurm 2.4.3 on a Rocks 6.0 cluster (based on CentOS 6.2). When submitting jobs to several nodes (using sbatch), the wrong number of tasks is sometimes started on the nodes. It appears this happens when the job is allocated nodes which are ordered differently under numerical and alphabetical ordering of the node name prefixes, i.e. c2-3,c12-3 (numerical: 2 < 12) versus c12-3,c2-3 (alphabetical: c12 < c2). For instance:
$ cat env.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:10:0
#SBATCH --mem-per-cpu=500M
#SBATCH --output=out/env-%j.out
echo On batch node:
echo '****************'
hostname
env|sort|grep SLURM
scontrol show job $SLURM_JOB_ID --details
echo Srun:
echo '****************'
srun -l hostname
echo '****************'
srun -l env | sort|grep SLURM
echo done
$ sbatch --nodes=2 --ntasks=3 --nodelist=c6-1,c17-3 env.sm
Submitted batch job 634845
$ cat out/env-634845.out
On batch node:
****************
compute-17-3.local
SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
SLURM_CPUS_ON_NODE=2
SLURMD_NODENAME=c17-3
SLURM_GTIDS=0
SLURM_JOB_CPUS_PER_NODE=2,1
SLURM_JOB_ID=634845
SLURM_JOBID=634845
SLURM_JOB_NAME=env.sm
SLURM_JOB_NODELIST=c17-3,c6-1
SLURM_JOB_NUM_NODES=2
SLURM_LOCALID=0
SLURM_MEM_PER_CPU=500
SLURM_NNODES=2
SLURM_NODE_ALIASES=(null)
SLURM_NODEID=0
SLURM_NODELIST=c17-3,c6-1
SLURM_NPROCS=3
SLURM_NTASKS=3
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
SLURM_TASK_PID=25935
SLURM_TASKS_PER_NODE=2,1
SLURM_TOPOLOGY_ADDR=c17-3
SLURM_TOPOLOGY_ADDR_PATTERN=node
JobId=634845 Name=env.sm
UserId=bhm(10231) GroupId=users(100)
Priority=20478 Account=staff QOS=staff
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:04 TimeLimit=00:10:00 TimeMin=N/A
SubmitTime=2012-12-04T10:11:09 EligibleTime=2012-12-04T10:11:09
StartTime=2012-12-04T10:11:42 EndTime=2012-12-04T10:11:46
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=normal AllocNode:Sid=login-0-2:21143
ReqNodeList=c17-3,c6-1 ExcNodeList=(null)
NodeList=c17-3,c6-1
BatchHost=c17-3
NumNodes=2 NumCPUs=3 CPUs/Task=1 ReqS:C:T=*:*:*
Nodes=c17-3 CPU_IDs=14-15 Mem=1000
Nodes=c6-1 CPU_IDs=15 Mem=500
MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/cluster/home/bhm/slurm/env.sm
WorkDir=/cluster/home/bhm/slurm
Srun:
****************
1: compute-6-1.local
0: compute-6-1.local
2: compute-17-3.local
****************
0: SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
0: SLURM_CPUS_ON_NODE=1
0: SLURM_DISTRIBUTION=block
0: SLURMD_NODENAME=c6-1
0: SLURM_GTIDS=0,1
0: SLURM_JOB_CPUS_PER_NODE=2,1
0: SLURM_JOB_ID=634845
0: SLURM_JOBID=634845
0: SLURM_JOB_NAME=env.sm
0: SLURM_JOB_NODELIST=c17-3,c6-1
0: SLURM_JOB_NUM_NODES=2
0: SLURM_LABELIO=1
0: SLURM_LAUNCH_NODE_IPADDR=10.110.253.224
0: SLURM_LOCALID=0
0: SLURM_MEM_PER_CPU=500
0: SLURM_NNODES=2
0: SLURM_NODEID=0
0: SLURM_NODELIST=c17-3,c6-1
0: SLURM_NPROCS=3
0: SLURM_NTASKS=3
0: SLURM_PRIO_PROCESS=0
0: SLURM_PROCID=0
0: SLURM_SRUN_COMM_HOST=10.110.253.224
0: SLURM_SRUN_COMM_PORT=35257
0: SLURM_STEP_ID=1
0: SLURM_STEPID=1
0: SLURM_STEP_LAUNCHER_PORT=35257
0: SLURM_STEP_NODELIST=c6-1,c17-3
0: SLURM_STEP_NUM_NODES=2
0: SLURM_STEP_NUM_TASKS=3
0: SLURM_STEP_TASKS_PER_NODE=2,1
0: SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
0: SLURM_TASK_PID=23136
0: SLURM_TASKS_PER_NODE=2,1
0: SLURM_TOPOLOGY_ADDR=c6-1
0: SLURM_TOPOLOGY_ADDR_PATTERN=node
1: SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
1: SLURM_CPUS_ON_NODE=1
1: SLURM_DISTRIBUTION=block
1: SLURMD_NODENAME=c6-1
1: SLURM_GTIDS=0,1
1: SLURM_JOB_CPUS_PER_NODE=2,1
1: SLURM_JOB_ID=634845
1: SLURM_JOBID=634845
1: SLURM_JOB_NAME=env.sm
1: SLURM_JOB_NODELIST=c17-3,c6-1
1: SLURM_JOB_NUM_NODES=2
1: SLURM_LABELIO=1
1: SLURM_LAUNCH_NODE_IPADDR=10.110.253.224
1: SLURM_LOCALID=1
1: SLURM_MEM_PER_CPU=500
1: SLURM_NNODES=2
1: SLURM_NODEID=0
1: SLURM_NODELIST=c17-3,c6-1
1: SLURM_NPROCS=3
1: SLURM_NTASKS=3
1: SLURM_PRIO_PROCESS=0
1: SLURM_PROCID=1
1: SLURM_SRUN_COMM_HOST=10.110.253.224
1: SLURM_SRUN_COMM_PORT=35257
1: SLURM_STEP_ID=1
1: SLURM_STEPID=1
1: SLURM_STEP_LAUNCHER_PORT=35257
1: SLURM_STEP_NODELIST=c6-1,c17-3
1: SLURM_STEP_NUM_NODES=2
1: SLURM_STEP_NUM_TASKS=3
1: SLURM_STEP_TASKS_PER_NODE=2,1
1: SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
1: SLURM_TASK_PID=23137
1: SLURM_TASKS_PER_NODE=2,1
1: SLURM_TOPOLOGY_ADDR=c6-1
1: SLURM_TOPOLOGY_ADDR_PATTERN=node
2: SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
2: SLURM_CPUS_ON_NODE=2
2: SLURM_DISTRIBUTION=block
2: SLURMD_NODENAME=c17-3
2: SLURM_GTIDS=2
2: SLURM_JOB_CPUS_PER_NODE=2,1
2: SLURM_JOB_ID=634845
2: SLURM_JOBID=634845
2: SLURM_JOB_NAME=env.sm
2: SLURM_JOB_NODELIST=c17-3,c6-1
2: SLURM_JOB_NUM_NODES=2
2: SLURM_LABELIO=1
2: SLURM_LAUNCH_NODE_IPADDR=10.110.253.224
2: SLURM_LOCALID=0
2: SLURM_MEM_PER_CPU=500
2: SLURM_NNODES=2
2: SLURM_NODEID=1
2: SLURM_NODELIST=c17-3,c6-1
2: SLURM_NPROCS=3
2: SLURM_NTASKS=3
2: SLURM_PRIO_PROCESS=0
2: SLURM_PROCID=2
2: SLURM_SRUN_COMM_HOST=10.110.253.224
2: SLURM_SRUN_COMM_PORT=35257
2: SLURM_STEP_ID=1
2: SLURM_STEPID=1
2: SLURM_STEP_LAUNCHER_PORT=35257
2: SLURM_STEP_NODELIST=c6-1,c17-3
2: SLURM_STEP_NUM_NODES=2
2: SLURM_STEP_NUM_TASKS=3
2: SLURM_STEP_TASKS_PER_NODE=2,1
2: SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
2: SLURM_TASK_PID=26030
2: SLURM_TASKS_PER_NODE=2,1
2: SLURM_TOPOLOGY_ADDR=c17-3
2: SLURM_TOPOLOGY_ADDR_PATTERN=node
done
Notice that according to "scontrol show job", the job should have two tasks on c17-3 and one on c6-1. However, "srun -l hostname" clearly starts two tasks on c6-1 and one on c17-3. This has two bad effects. One is that when the job starts too many tasks on a node, it can hamper other jobs there.
Another is that the cgroup limits are set up according to the information that "scontrol show job" returns, so the job risks being killed by cgroup without having used more memory than it asked for.