I think we have come across a bug in Slurm. We are running Slurm 2.4.3 on a Rocks 6.0 cluster (based on CentOS 6.2). When submitting jobs to several nodes (using sbatch), the wrong number of tasks is sometimes started on the nodes. It appears this happens when the job is allocated nodes whose nodename prefixes order differently numerically and alphabetically, e.g. c2-3,c12-3 (numerical: 2 < 12) versus c12-3,c2-3 (alphabetical: c12 < c2). For instance:

$ cat env.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:10:0
#SBATCH --mem-per-cpu=500M
#SBATCH --output=out/env-%j.out

echo On batch node:
echo '****************'
hostname
env | sort | grep SLURM
scontrol show job $SLURM_JOB_ID --details

echo Srun:
echo '****************'
srun -l hostname
echo '****************'
srun -l env | sort | grep SLURM
echo done

$ sbatch --nodes=2 --ntasks=3 --nodelist=c6-1,c17-3 env.sm
Submitted batch job 634845

$ cat out/env-634845.out
On batch node:
****************
compute-17-3.local
SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
SLURM_CPUS_ON_NODE=2
SLURMD_NODENAME=c17-3
SLURM_GTIDS=0
SLURM_JOB_CPUS_PER_NODE=2,1
SLURM_JOB_ID=634845
SLURM_JOBID=634845
SLURM_JOB_NAME=env.sm
SLURM_JOB_NODELIST=c17-3,c6-1
SLURM_JOB_NUM_NODES=2
SLURM_LOCALID=0
SLURM_MEM_PER_CPU=500
SLURM_NNODES=2
SLURM_NODE_ALIASES=(null)
SLURM_NODEID=0
SLURM_NODELIST=c17-3,c6-1
SLURM_NPROCS=3
SLURM_NTASKS=3
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
SLURM_TASK_PID=25935
SLURM_TASKS_PER_NODE=2,1
SLURM_TOPOLOGY_ADDR=c17-3
SLURM_TOPOLOGY_ADDR_PATTERN=node
JobId=634845 Name=env.sm
   UserId=bhm(10231) GroupId=users(100)
   Priority=20478 Account=staff QOS=staff
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:04 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2012-12-04T10:11:09 EligibleTime=2012-12-04T10:11:09
   StartTime=2012-12-04T10:11:42 EndTime=2012-12-04T10:11:46
   PreemptTime=None
   SuspendTime=None SecsPreSuspend=0
   Partition=normal AllocNode:Sid=login-0-2:21143
   ReqNodeList=c17-3,c6-1 ExcNodeList=(null)
   NodeList=c17-3,c6-1
   BatchHost=c17-3
   NumNodes=2 NumCPUs=3 CPUs/Task=1 ReqS:C:T=*:*:*
     Nodes=c17-3 CPU_IDs=14-15 Mem=1000
     Nodes=c6-1 CPU_IDs=15 Mem=500
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/cluster/home/bhm/slurm/env.sm
   WorkDir=/cluster/home/bhm/slurm
Srun:
****************
1: compute-6-1.local
0: compute-6-1.local
2: compute-17-3.local
****************
0: SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
0: SLURM_CPUS_ON_NODE=1
0: SLURM_DISTRIBUTION=block
0: SLURMD_NODENAME=c6-1
0: SLURM_GTIDS=0,1
0: SLURM_JOB_CPUS_PER_NODE=2,1
0: SLURM_JOB_ID=634845
0: SLURM_JOBID=634845
0: SLURM_JOB_NAME=env.sm
0: SLURM_JOB_NODELIST=c17-3,c6-1
0: SLURM_JOB_NUM_NODES=2
0: SLURM_LABELIO=1
0: SLURM_LAUNCH_NODE_IPADDR=10.110.253.224
0: SLURM_LOCALID=0
0: SLURM_MEM_PER_CPU=500
0: SLURM_NNODES=2
0: SLURM_NODEID=0
0: SLURM_NODELIST=c17-3,c6-1
0: SLURM_NPROCS=3
0: SLURM_NTASKS=3
0: SLURM_PRIO_PROCESS=0
0: SLURM_PROCID=0
0: SLURM_SRUN_COMM_HOST=10.110.253.224
0: SLURM_SRUN_COMM_PORT=35257
0: SLURM_STEP_ID=1
0: SLURM_STEPID=1
0: SLURM_STEP_LAUNCHER_PORT=35257
0: SLURM_STEP_NODELIST=c6-1,c17-3
0: SLURM_STEP_NUM_NODES=2
0: SLURM_STEP_NUM_TASKS=3
0: SLURM_STEP_TASKS_PER_NODE=2,1
0: SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
0: SLURM_TASK_PID=23136
0: SLURM_TASKS_PER_NODE=2,1
0: SLURM_TOPOLOGY_ADDR=c6-1
0: SLURM_TOPOLOGY_ADDR_PATTERN=node
1: SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
1: SLURM_CPUS_ON_NODE=1
1: SLURM_DISTRIBUTION=block
1: SLURMD_NODENAME=c6-1
1: SLURM_GTIDS=0,1
1: SLURM_JOB_CPUS_PER_NODE=2,1
1: SLURM_JOB_ID=634845
1: SLURM_JOBID=634845
1: SLURM_JOB_NAME=env.sm
1: SLURM_JOB_NODELIST=c17-3,c6-1
1: SLURM_JOB_NUM_NODES=2
1: SLURM_LABELIO=1
1: SLURM_LAUNCH_NODE_IPADDR=10.110.253.224
1: SLURM_LOCALID=1
1: SLURM_MEM_PER_CPU=500
1: SLURM_NNODES=2
1: SLURM_NODEID=0
1: SLURM_NODELIST=c17-3,c6-1
1: SLURM_NPROCS=3
1: SLURM_NTASKS=3
1: SLURM_PRIO_PROCESS=0
1: SLURM_PROCID=1
1: SLURM_SRUN_COMM_HOST=10.110.253.224
1: SLURM_SRUN_COMM_PORT=35257
1: SLURM_STEP_ID=1
1: SLURM_STEPID=1
1: SLURM_STEP_LAUNCHER_PORT=35257
1: SLURM_STEP_NODELIST=c6-1,c17-3
1: SLURM_STEP_NUM_NODES=2
1: SLURM_STEP_NUM_TASKS=3
1: SLURM_STEP_TASKS_PER_NODE=2,1
1: SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
1: SLURM_TASK_PID=23137
1: SLURM_TASKS_PER_NODE=2,1
1: SLURM_TOPOLOGY_ADDR=c6-1
1: SLURM_TOPOLOGY_ADDR_PATTERN=node
2: SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
2: SLURM_CPUS_ON_NODE=2
2: SLURM_DISTRIBUTION=block
2: SLURMD_NODENAME=c17-3
2: SLURM_GTIDS=2
2: SLURM_JOB_CPUS_PER_NODE=2,1
2: SLURM_JOB_ID=634845
2: SLURM_JOBID=634845
2: SLURM_JOB_NAME=env.sm
2: SLURM_JOB_NODELIST=c17-3,c6-1
2: SLURM_JOB_NUM_NODES=2
2: SLURM_LABELIO=1
2: SLURM_LAUNCH_NODE_IPADDR=10.110.253.224
2: SLURM_LOCALID=0
2: SLURM_MEM_PER_CPU=500
2: SLURM_NNODES=2
2: SLURM_NODEID=1
2: SLURM_NODELIST=c17-3,c6-1
2: SLURM_NPROCS=3
2: SLURM_NTASKS=3
2: SLURM_PRIO_PROCESS=0
2: SLURM_PROCID=2
2: SLURM_SRUN_COMM_HOST=10.110.253.224
2: SLURM_SRUN_COMM_PORT=35257
2: SLURM_STEP_ID=1
2: SLURM_STEPID=1
2: SLURM_STEP_LAUNCHER_PORT=35257
2: SLURM_STEP_NODELIST=c6-1,c17-3
2: SLURM_STEP_NUM_NODES=2
2: SLURM_STEP_NUM_TASKS=3
2: SLURM_STEP_TASKS_PER_NODE=2,1
2: SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
2: SLURM_TASK_PID=26030
2: SLURM_TASKS_PER_NODE=2,1
2: SLURM_TOPOLOGY_ADDR=c17-3
2: SLURM_TOPOLOGY_ADDR_PATTERN=node
done

Notice that according to "scontrol show job", the job should have two tasks on c17-3 and one on c6-1. However, "srun -l hostname" clearly starts two tasks on c6-1 and one on c17-3. This has two bad effects: One is that when the job starts too many tasks on a node, it can hamper other jobs there.
The other is that the cgroup limits are set up according to the information that "scontrol show job" returns, so the job risks being killed by the cgroup limit without ever using more memory than it asked for.
I am guessing there is a problem with the sort in the hostlist functions, probably a strcmp() where a strnatcmp() is needed.

> What you ask for doesn't specify that you want 2 tasks on any specific node; you would need the arbitrary distribution option for that. Unless you do that, Slurm will lay the tasks out any way it can, and it reorders the node list as soon as it comes in. Let us know if you are able to fix the sort, but based on the request, you got what I would expect to happen. Try the arbitrary distribution mode and see if that works as you expect.
I'm sorry if I was unclear; let me try to explain again. I was not trying to specify explicitly where the tasks should run, but merely provided a minimal, reproducible example of the problem. See a bigger, general example below. We are using

NodeName=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:100,athena:4 State=unknown
PartitionName=DEFAULT State=up Shared=NO
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

i.e., we are using consumable resources, which should not be shared. The problem is that Slurm _allocates_ a number of tasks to a node (based on the amount of unallocated resources there), but _runs_ a different number of tasks there. Typically, the consequence is that some nodes get more tasks than they have cores, or get their memory overallocated. This is bad for HPC. Also, Slurm's memory limits are based on where it _allocates_ the tasks, so a job that starts, for instance, 8 tasks on a node where it should have run 2 will likely exceed its memory allowance on that node and be killed. Here is a general example. According to "scontrol show job" and the SLURM_* environment variables, it should have run 16 tasks on each of the nodes in rack 13 (c13-X), but as the srun output shows, it doesn't. For instance, it only runs 8 tasks on c13-10. On the other hand, it should only run 9 tasks on c5-4, but in fact runs 16 tasks there.
$ cat env.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:10:0
#SBATCH --mem-per-cpu=500M

echo On batch node:
echo '****************'
hostname
env | grep SLURM | sort
scontrol show job $SLURM_JOB_ID --details

echo Srun:
echo '****************'
srun hostname | sort | uniq -c

$ sbatch --ntasks=800 env.sm
Submitted batch job 642175

$ cat slurm-642175.out
On batch node:
****************
compute-13-1.local
SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
SLURM_CPUS_ON_NODE=16
SLURMD_NODENAME=c13-1
SLURM_GTIDS=0
SLURM_JOB_CPUS_PER_NODE=16(x29),9,15(x3),16,14,8(x4),16(x5),8,4,16(x8)
SLURM_JOB_ID=642175
SLURM_JOBID=642175
SLURM_JOB_NAME=env.sm
SLURM_JOB_NODELIST=c13-[1-6,8-12,14-28],c5-[1-28]
SLURM_JOB_NUM_NODES=54
SLURM_LOCALID=0
SLURM_MEM_PER_CPU=500
SLURM_NNODES=54
SLURM_NODE_ALIASES=(null)
SLURM_NODEID=0
SLURM_NODELIST=c13-[1-6,8-12,14-28],c5-[1-28]
SLURM_NPROCS=800
SLURM_NTASKS=800
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
SLURM_TASK_PID=1224
SLURM_TASKS_PER_NODE=16(x29),9,15(x3),16,14,8(x4),16(x5),8,4,16(x8)
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
SLURM_TOPOLOGY_ADDR=root.rack13.c13-1
JobId=642175 Name=env.sm
   UserId=bhm(10231) GroupId=users(100)
   Priority=20477 Account=staff QOS=staff
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2012-12-05T10:12:57 EligibleTime=2012-12-05T10:12:57
   StartTime=2012-12-05T10:12:57 EndTime=2012-12-05T10:22:57
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=normal AllocNode:Sid=login-0-2:16237
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c13-[1-6,8-12,14-28],c5-[1-28]
   BatchHost=c13-1
   NumNodes=54 NumCPUs=800 CPUs/Task=1 ReqS:C:T=*:*:*
     Nodes=c13-[1-6,8-12,14-28],c5-[1-3] CPU_IDs=0-15 Mem=8000
     Nodes=c5-4 CPU_IDs=0-7,13 Mem=4500
     Nodes=c5-[5-7] CPU_IDs=1-15 Mem=7500
     Nodes=c5-8 CPU_IDs=0-15 Mem=8000
     Nodes=c5-9 CPU_IDs=0-13 Mem=7000
     Nodes=c5-[10-13] CPU_IDs=8-15 Mem=4000
     Nodes=c5-[14-18] CPU_IDs=0-15 Mem=8000
     Nodes=c5-19 CPU_IDs=0-7 Mem=4000
     Nodes=c5-20 CPU_IDs=4-7 Mem=2000
     Nodes=c5-[21-28] CPU_IDs=0-15 Mem=8000
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/cluster/home/bhm/slurm/env.sm
   WorkDir=/cluster/home/bhm/slurm
Srun:
****************
      8 compute-13-10.local
      8 compute-13-11.local
      8 compute-13-12.local
     16 compute-13-14.local
     16 compute-13-15.local
     16 compute-13-16.local
     16 compute-13-17.local
     16 compute-13-18.local
      8 compute-13-19.local
     16 compute-13-1.local
      4 compute-13-20.local
     16 compute-13-21.local
     16 compute-13-22.local
     16 compute-13-23.local
     16 compute-13-24.local
     16 compute-13-25.local
     16 compute-13-26.local
     16 compute-13-27.local
     16 compute-13-28.local
      9 compute-13-2.local
     15 compute-13-3.local
     15 compute-13-4.local
     15 compute-13-5.local
     16 compute-13-6.local
     14 compute-13-8.local
      8 compute-13-9.local
     16 compute-5-10.local
     16 compute-5-11.local
     16 compute-5-12.local
     16 compute-5-13.local
     16 compute-5-14.local
     16 compute-5-15.local
     16 compute-5-16.local
     16 compute-5-17.local
     16 compute-5-18.local
     16 compute-5-19.local
     16 compute-5-1.local
     16 compute-5-20.local
     16 compute-5-21.local
     16 compute-5-22.local
     16 compute-5-23.local
     16 compute-5-24.local
     16 compute-5-25.local
     16 compute-5-26.local
     16 compute-5-27.local
     16 compute-5-28.local
     16 compute-5-2.local
     16 compute-5-3.local
     16 compute-5-4.local
     16 compute-5-5.local
     16 compute-5-6.local
     16 compute-5-7.local
     16 compute-5-8.local
     16 compute-5-9.local
I've done some more experimenting, and it looks like it is the output of "scontrol show job" and the environment variables that is wrong. (Also, the memory limits on the nodes are wrong; see below.) I created a reservation for two nodes, c9-3 and c11-36, and started a number of jobs there, so that the nodes were partially allocated:

# bjob -u bhm
 JOBID NAME       USER ACCOUNT PARTITI QOS   ST PRIORI TIME TIME_LEFT CPUS NOD MIN_MEM MIN_TMP NODELIST(REASON)
643529 sleeeep.sm bhm  staff   normal  staff R  20477  1:24 58:36        1   1    1024       0 c11-36
643528 sleeeep.sm bhm  staff   normal  staff R  20477  1:25 58:35        1   1    1024       0 c11-36
643527 sleeeep.sm bhm  staff   normal  staff R  20477  1:46 58:14        1   1    1024       0 c9-3
643526 sleeeep.sm bhm  staff   normal  staff R  20477  1:47 58:13        1   1    1024       0 c9-3
643524 sleeeep.sm bhm  staff   normal  staff R  20477  1:48 58:12        1   1    1024       0 c9-3
643525 sleeeep.sm bhm  staff   normal  staff R  20477  1:48 58:12        1   1    1024       0 c9-3
643523 sleeeep.sm bhm  staff   normal  staff R  20477  1:49 58:11        1   1    1024       0 c9-3
643522 sleeeep.sm bhm  staff   normal  staff R  20477  2:01 57:59        1   1    1024       0 c9-3
643520 sleeeep.sm bhm  staff   normal  staff R  20477  2:02 57:58        1   1    1024       0 c9-3
643521 sleeeep.sm bhm  staff   normal  staff R  20477  2:02 57:58        1   1    1024       0 c9-3
643519 sleeeep.sm bhm  staff   normal  staff R  20477  2:04 57:56        1   1    1024       0 c9-3
643518 sleeeep.sm bhm  staff   normal  staff R  20477  2:13 57:47        1   1    1024       0 c9-3
643516 sleeeep.sm bhm  staff   normal  staff R  20477  2:23 57:37        1   1    1024       0 c9-3
643517 sleeeep.sm bhm  staff   normal  staff R  20477  2:23 57:37        1   1    1024       0 c9-3
643514 sleeeep.sm bhm  staff   normal  staff R  20477  2:24 57:36        1   1    1024       0 c9-3
643515 sleeeep.sm bhm  staff   normal  staff R  20477  2:24 57:36        1   1    1024       0 c9-3
643513 sleeeep.sm bhm  staff   normal  staff R  20477  2:25 57:35        1   1    1024       0 c9-3

I.e., there is 1 unallocated core on c9-3 and 14 on c11-36 (all our nodes have 16 cores).
Then I started a 15-task job in the reservation:

$ sbatch --reservation=bhmtest --ntasks=15 env.sm
Submitted batch job 643532

$ cat slurm-643532.out
[I've truncated the output a bit]
On batch node:
****************
compute-11-36.local
SLURM_JOB_CPUS_PER_NODE=1,14
SLURM_JOB_NODELIST=c11-36,c9-3
SLURM_NODELIST=c11-36,c9-3
SLURM_NPROCS=15
SLURM_NTASKS=15
SLURM_TASKS_PER_NODE=1,14
JobId=643532 Name=env.sm
   NodeList=c11-36,c9-3
   BatchHost=c11-36
   NumNodes=2 NumCPUs=15 CPUs/Task=1 ReqS:C:T=*:*:*
     Nodes=c11-36 CPU_IDs=0 Mem=500
     Nodes=c9-3 CPU_IDs=0-13 Mem=7000
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
Srun:
****************
     14 compute-11-36.local
      1 compute-9-3.local

So it seems the job ran 1 task on c9-3 and 14 tasks on c11-36, as it should, but the scontrol output and the job environment variables are wrong. Thus my claim that nodes would be overallocated is wrong. However, the memory limits for the job on the nodes are wrong:

# ssh c9-3 grep '643532.*limit_in_bytes' /var/log/slurm/slurmd.log
[2012-12-05T12:30:06] [643532.0] parameter 'memory.limit_in_bytes' set to '7340032000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.memsw.limit_in_bytes' set to '7340032000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.limit_in_bytes' set to '7340032000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_0'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.memsw.limit_in_bytes' set to '7340032000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_0'

# ssh c11-36 grep '643532.*limit_in_bytes' /var/log/slurm/slurmd.log
[2012-12-05T12:30:05] [643532] parameter 'memory.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:05] [643532] parameter 'memory.memsw.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:05] [643532] parameter 'memory.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_4294967294'
[2012-12-05T12:30:05] [643532] parameter 'memory.memsw.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_4294967294'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.memsw.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_0'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.memsw.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_0'

I.e., the limit is 7000 MiB on c9-3 and 500 MiB on c11-36, which should have been the other way around. Thus jobs risk getting killed.
Based upon my very limited testing, this one-line change fixes the problem. I need to do a lot more testing on different systems (e.g. BlueGene and Cray) before making this change in our code, and I have other work needing my attention, but if you want to test with this and report the results, that would be appreciated.

diff --git a/src/common/hostlist.c b/src/common/hostlist.c
index 34c04b0..abc3445 100644
--- a/src/common/hostlist.c
+++ b/src/common/hostlist.c
@@ -901,7 +901,7 @@ static int hostrange_prefix_cmp(hostrange_t h1, hostrange_t h2)
 	if (h2 == NULL)
 		return -1;
 
-	retval = strcmp(h1->prefix, h2->prefix);
+	retval = strnatcmp(h1->prefix, h2->prefix);
 	return retval == 0 ? h2->singlehost - h1->singlehost : retval;
 }
Thanks! We will test and report!
I've tested it on our test cluster, and it seems to solve the problem, yes. :) I'll do some more tests before porting it to our production cluster. Thanks!
Several more tests have not shown any problems with the patch, and it does solve the issue, so we will port it to our production cluster now. Thanks!
This change can be found in version 2.5.1 (when released). You will need to use the patch until you upgrade. Thank you for testing.