Ticket 185

Summary: Wrong number of tasks started on each node
Product: Slurm Reporter: Bjørn-Helge Mevik <b.h.mevik>
Component: Scheduling    Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
Severity: 2 - High Impact    
Priority: --- CC: da
Version: 2.4.x   
Hardware: Linux   
OS: Linux   
Site: -Other-

Description Bjørn-Helge Mevik 2012-12-03 20:42:48 MST
I think we have come across a bug in Slurm.

We are running Slurm 2.4.3 on a Rocks 6.0 cluster (based on CentOS 6.2).


When submitting jobs to several nodes (using sbatch), the wrong number of
tasks is sometimes started on each node.  It appears this happens when the job
is allocated nodes that are ordered differently by numerical and alphabetical
ordering of the node name prefixes, e.g. c2-3,c12-3 (numerical: 2 < 12) versus
c12-3,c2-3 (alphabetical: "c12" < "c2").

For instance:

$ cat env.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:10:0
#SBATCH --mem-per-cpu=500M
#SBATCH --output=out/env-%j.out

echo On batch node:
echo '****************'
hostname
env|sort|grep SLURM
scontrol show job $SLURM_JOB_ID --details

echo Srun:
echo '****************'
srun -l hostname
echo '****************'
srun -l env | sort|grep SLURM

echo done

$ sbatch --nodes=2 --ntasks=3 --nodelist=c6-1,c17-3 env.sm
Submitted batch job 634845

$ cat out/env-634845.out
On batch node:
****************
compute-17-3.local
SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
SLURM_CPUS_ON_NODE=2
SLURMD_NODENAME=c17-3
SLURM_GTIDS=0
SLURM_JOB_CPUS_PER_NODE=2,1
SLURM_JOB_ID=634845
SLURM_JOBID=634845
SLURM_JOB_NAME=env.sm
SLURM_JOB_NODELIST=c17-3,c6-1
SLURM_JOB_NUM_NODES=2
SLURM_LOCALID=0
SLURM_MEM_PER_CPU=500
SLURM_NNODES=2
SLURM_NODE_ALIASES=(null)
SLURM_NODEID=0
SLURM_NODELIST=c17-3,c6-1
SLURM_NPROCS=3
SLURM_NTASKS=3
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
SLURM_TASK_PID=25935
SLURM_TASKS_PER_NODE=2,1
SLURM_TOPOLOGY_ADDR=c17-3
SLURM_TOPOLOGY_ADDR_PATTERN=node
JobId=634845 Name=env.sm
   UserId=bhm(10231) GroupId=users(100)
   Priority=20478 Account=staff QOS=staff
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:04 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2012-12-04T10:11:09 EligibleTime=2012-12-04T10:11:09
   StartTime=2012-12-04T10:11:42 EndTime=2012-12-04T10:11:46
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=normal AllocNode:Sid=login-0-2:21143
   ReqNodeList=c17-3,c6-1 ExcNodeList=(null)
   NodeList=c17-3,c6-1
   BatchHost=c17-3
   NumNodes=2 NumCPUs=3 CPUs/Task=1 ReqS:C:T=*:*:*
     Nodes=c17-3 CPU_IDs=14-15 Mem=1000
     Nodes=c6-1 CPU_IDs=15 Mem=500
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/cluster/home/bhm/slurm/env.sm
   WorkDir=/cluster/home/bhm/slurm
Srun:
****************
1: compute-6-1.local
0: compute-6-1.local
2: compute-17-3.local
****************
0: SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
0: SLURM_CPUS_ON_NODE=1
0: SLURM_DISTRIBUTION=block
0: SLURMD_NODENAME=c6-1
0: SLURM_GTIDS=0,1
0: SLURM_JOB_CPUS_PER_NODE=2,1
0: SLURM_JOB_ID=634845
0: SLURM_JOBID=634845
0: SLURM_JOB_NAME=env.sm
0: SLURM_JOB_NODELIST=c17-3,c6-1
0: SLURM_JOB_NUM_NODES=2
0: SLURM_LABELIO=1
0: SLURM_LAUNCH_NODE_IPADDR=10.110.253.224
0: SLURM_LOCALID=0
0: SLURM_MEM_PER_CPU=500
0: SLURM_NNODES=2
0: SLURM_NODEID=0
0: SLURM_NODELIST=c17-3,c6-1
0: SLURM_NPROCS=3
0: SLURM_NTASKS=3
0: SLURM_PRIO_PROCESS=0
0: SLURM_PROCID=0
0: SLURM_SRUN_COMM_HOST=10.110.253.224
0: SLURM_SRUN_COMM_PORT=35257
0: SLURM_STEP_ID=1
0: SLURM_STEPID=1
0: SLURM_STEP_LAUNCHER_PORT=35257
0: SLURM_STEP_NODELIST=c6-1,c17-3
0: SLURM_STEP_NUM_NODES=2
0: SLURM_STEP_NUM_TASKS=3
0: SLURM_STEP_TASKS_PER_NODE=2,1
0: SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
0: SLURM_TASK_PID=23136
0: SLURM_TASKS_PER_NODE=2,1
0: SLURM_TOPOLOGY_ADDR=c6-1
0: SLURM_TOPOLOGY_ADDR_PATTERN=node
1: SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
1: SLURM_CPUS_ON_NODE=1
1: SLURM_DISTRIBUTION=block
1: SLURMD_NODENAME=c6-1
1: SLURM_GTIDS=0,1
1: SLURM_JOB_CPUS_PER_NODE=2,1
1: SLURM_JOB_ID=634845
1: SLURM_JOBID=634845
1: SLURM_JOB_NAME=env.sm
1: SLURM_JOB_NODELIST=c17-3,c6-1
1: SLURM_JOB_NUM_NODES=2
1: SLURM_LABELIO=1
1: SLURM_LAUNCH_NODE_IPADDR=10.110.253.224
1: SLURM_LOCALID=1
1: SLURM_MEM_PER_CPU=500
1: SLURM_NNODES=2
1: SLURM_NODEID=0
1: SLURM_NODELIST=c17-3,c6-1
1: SLURM_NPROCS=3
1: SLURM_NTASKS=3
1: SLURM_PRIO_PROCESS=0
1: SLURM_PROCID=1
1: SLURM_SRUN_COMM_HOST=10.110.253.224
1: SLURM_SRUN_COMM_PORT=35257
1: SLURM_STEP_ID=1
1: SLURM_STEPID=1
1: SLURM_STEP_LAUNCHER_PORT=35257
1: SLURM_STEP_NODELIST=c6-1,c17-3
1: SLURM_STEP_NUM_NODES=2
1: SLURM_STEP_NUM_TASKS=3
1: SLURM_STEP_TASKS_PER_NODE=2,1
1: SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
1: SLURM_TASK_PID=23137
1: SLURM_TASKS_PER_NODE=2,1
1: SLURM_TOPOLOGY_ADDR=c6-1
1: SLURM_TOPOLOGY_ADDR_PATTERN=node
2: SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
2: SLURM_CPUS_ON_NODE=2
2: SLURM_DISTRIBUTION=block
2: SLURMD_NODENAME=c17-3
2: SLURM_GTIDS=2
2: SLURM_JOB_CPUS_PER_NODE=2,1
2: SLURM_JOB_ID=634845
2: SLURM_JOBID=634845
2: SLURM_JOB_NAME=env.sm
2: SLURM_JOB_NODELIST=c17-3,c6-1
2: SLURM_JOB_NUM_NODES=2
2: SLURM_LABELIO=1
2: SLURM_LAUNCH_NODE_IPADDR=10.110.253.224
2: SLURM_LOCALID=0
2: SLURM_MEM_PER_CPU=500
2: SLURM_NNODES=2
2: SLURM_NODEID=1
2: SLURM_NODELIST=c17-3,c6-1
2: SLURM_NPROCS=3
2: SLURM_NTASKS=3
2: SLURM_PRIO_PROCESS=0
2: SLURM_PROCID=2
2: SLURM_SRUN_COMM_HOST=10.110.253.224
2: SLURM_SRUN_COMM_PORT=35257
2: SLURM_STEP_ID=1
2: SLURM_STEPID=1
2: SLURM_STEP_LAUNCHER_PORT=35257
2: SLURM_STEP_NODELIST=c6-1,c17-3
2: SLURM_STEP_NUM_NODES=2
2: SLURM_STEP_NUM_TASKS=3
2: SLURM_STEP_TASKS_PER_NODE=2,1
2: SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
2: SLURM_TASK_PID=26030
2: SLURM_TASKS_PER_NODE=2,1
2: SLURM_TOPOLOGY_ADDR=c17-3
2: SLURM_TOPOLOGY_ADDR_PATTERN=node
done


Notice that according to "scontrol show job", the job should have two tasks on
c17-3 and one on c6-1.  However, the "srun -l hostname" clearly starts two
tasks on c6-1 and one on c17-3.

This has two bad effects.  One is that when the job starts too many tasks on a
node, it can hamper other jobs there.  The other is that the cgroup limits are
set up based on the information that "scontrol show job" returns, so the job
risks being killed by the cgroup mechanism without using more memory than it
asked for.
Comment 1 Danny Auble 2012-12-04 03:57:52 MST
I am guessing there is a problem with the sort in the hostlist functions, probably a strcmp() where a strnatcmp() is needed, but what you asked for doesn't specify that you want 2 tasks on any specific node.  You would need the arbitrary distribution option for that.  Unless you use it, Slurm will lay the tasks out any way it can, and it reorders the node list as soon as it comes in.

Let us know if you are able to fix the sort, but based on your request, you got what I would expect to happen.  Try the arbitrary distribution mode and see if it works as you expect.
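For reference, arbitrary distribution is driven by a hostfile listing one node name per task; a CLI sketch (node names and the file path are illustrative, and this obviously only runs inside a Slurm allocation):

```shell
# Sketch of the arbitrary-distribution workflow mentioned above.
# Each line of the hostfile pins one task to that node, in order.
cat > hosts.txt <<'EOF'
c17-3
c17-3
c6-1
EOF

export SLURM_HOSTFILE=$PWD/hosts.txt
srun --ntasks=3 --distribution=arbitrary -l hostname
```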
Comment 2 Bjørn-Helge Mevik 2012-12-04 19:25:36 MST
I'm sorry if I was unclear, so let me try to explain again.  I was not trying to specify explicitly where the tasks are run, but merely provided a minimal, reproducible example of the problem.  See a bigger, general example below.

We are using

Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:100,athena:4 State=unknown
PartitionName=DEFAULT State=up Shared=NO
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

I.e., we are using consumable resources, which should not be shared.

The problem is that slurm _allocates_ a number of tasks to a node (based on the amount of unallocated resources there), but _runs_ a different number of tasks there.

Typically, the consequence is that some nodes get more tasks than they have cores, or get their memory overallocated.  This is bad for HPC.

Also, Slurm's memory limits are based on where it _allocates_ the tasks, so a job that starts, for instance, 8 tasks on a node where it should have run 2 will likely exceed its memory allowance on that node and be killed.


Here is a general example.  According to "scontrol show job" and the SLURM_* environment variables, it should have run 16 tasks on each of the nodes in rack 13 (c13-X), but as the srun output shows, it doesn't.  For instance, it only runs 8 tasks on c13-10.  On the other hand, it should only run 9 tasks on c5-4, but in fact runs 16 tasks there.

$ cat env.sm 
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:10:0
#SBATCH --mem-per-cpu=500M

echo On batch node:
echo '****************'
hostname
env | grep SLURM | sort
scontrol show job $SLURM_JOB_ID --details

echo Srun:
echo '****************'
srun hostname | sort | uniq -c

$ sbatch --ntasks=800 env.sm
Submitted batch job 642175

$ cat slurm-642175.out 
On batch node:
****************
compute-13-1.local
SLURM_CHECKPOINT_IMAGE_DIR=/cluster/home/bhm/slurm
SLURM_CPUS_ON_NODE=16
SLURMD_NODENAME=c13-1
SLURM_GTIDS=0
SLURM_JOB_CPUS_PER_NODE=16(x29),9,15(x3),16,14,8(x4),16(x5),8,4,16(x8)
SLURM_JOB_ID=642175
SLURM_JOBID=642175
SLURM_JOB_NAME=env.sm
SLURM_JOB_NODELIST=c13-[1-6,8-12,14-28],c5-[1-28]
SLURM_JOB_NUM_NODES=54
SLURM_LOCALID=0
SLURM_MEM_PER_CPU=500
SLURM_NNODES=54
SLURM_NODE_ALIASES=(null)
SLURM_NODEID=0
SLURM_NODELIST=c13-[1-6,8-12,14-28],c5-[1-28]
SLURM_NPROCS=800
SLURM_NTASKS=800
SLURM_PRIO_PROCESS=0
SLURM_PROCID=0
SLURM_SUBMIT_DIR=/cluster/home/bhm/slurm
SLURM_TASK_PID=1224
SLURM_TASKS_PER_NODE=16(x29),9,15(x3),16,14,8(x4),16(x5),8,4,16(x8)
SLURM_TOPOLOGY_ADDR_PATTERN=switch.switch.node
SLURM_TOPOLOGY_ADDR=root.rack13.c13-1
JobId=642175 Name=env.sm
   UserId=bhm(10231) GroupId=users(100)
   Priority=20477 Account=staff QOS=staff
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2012-12-05T10:12:57 EligibleTime=2012-12-05T10:12:57
   StartTime=2012-12-05T10:12:57 EndTime=2012-12-05T10:22:57
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=normal AllocNode:Sid=login-0-2:16237
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c13-[1-6,8-12,14-28],c5-[1-28]
   BatchHost=c13-1
   NumNodes=54 NumCPUs=800 CPUs/Task=1 ReqS:C:T=*:*:*
     Nodes=c13-[1-6,8-12,14-28],c5-[1-3] CPU_IDs=0-15 Mem=8000
     Nodes=c5-4 CPU_IDs=0-7,13 Mem=4500
     Nodes=c5-[5-7] CPU_IDs=1-15 Mem=7500
     Nodes=c5-8 CPU_IDs=0-15 Mem=8000
     Nodes=c5-9 CPU_IDs=0-13 Mem=7000
     Nodes=c5-[10-13] CPU_IDs=8-15 Mem=4000
     Nodes=c5-[14-18] CPU_IDs=0-15 Mem=8000
     Nodes=c5-19 CPU_IDs=0-7 Mem=4000
     Nodes=c5-20 CPU_IDs=4-7 Mem=2000
     Nodes=c5-[21-28] CPU_IDs=0-15 Mem=8000
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/cluster/home/bhm/slurm/env.sm
   WorkDir=/cluster/home/bhm/slurm

Srun:
****************
      8 compute-13-10.local
      8 compute-13-11.local
      8 compute-13-12.local
     16 compute-13-14.local
     16 compute-13-15.local
     16 compute-13-16.local
     16 compute-13-17.local
     16 compute-13-18.local
      8 compute-13-19.local
     16 compute-13-1.local
      4 compute-13-20.local
     16 compute-13-21.local
     16 compute-13-22.local
     16 compute-13-23.local
     16 compute-13-24.local
     16 compute-13-25.local
     16 compute-13-26.local
     16 compute-13-27.local
     16 compute-13-28.local
      9 compute-13-2.local
     15 compute-13-3.local
     15 compute-13-4.local
     15 compute-13-5.local
     16 compute-13-6.local
     14 compute-13-8.local
      8 compute-13-9.local
     16 compute-5-10.local
     16 compute-5-11.local
     16 compute-5-12.local
     16 compute-5-13.local
     16 compute-5-14.local
     16 compute-5-15.local
     16 compute-5-16.local
     16 compute-5-17.local
     16 compute-5-18.local
     16 compute-5-19.local
     16 compute-5-1.local
     16 compute-5-20.local
     16 compute-5-21.local
     16 compute-5-22.local
     16 compute-5-23.local
     16 compute-5-24.local
     16 compute-5-25.local
     16 compute-5-26.local
     16 compute-5-27.local
     16 compute-5-28.local
     16 compute-5-2.local
     16 compute-5-3.local
     16 compute-5-4.local
     16 compute-5-5.local
     16 compute-5-6.local
     16 compute-5-7.local
     16 compute-5-8.local
     16 compute-5-9.local
Comment 3 Bjørn-Helge Mevik 2012-12-04 21:53:10 MST
I've done some more experimenting, and it looks like it is the output of
"scontrol show job" and the environment variables that are wrong.  (Also, the memory limits on the nodes are wrong; see below.)

I created a reservation for two nodes, c9-3 and c11-36, and started a number
of jobs there, so that the nodes were partially allocated:

# bjob -u bhm
   JOBID NAME       USER     ACCOUNT   PARTITI QOS     ST PRIORI        TIME   TIME_LEFT CPUS NOD MIN_MEM MIN_TMP NODELIST(REASON)
  643529 sleeeep.sm bhm      staff     normal  staff    R  20477        1:24       58:36    1   1    1024       0 c11-36
  643528 sleeeep.sm bhm      staff     normal  staff    R  20477        1:25       58:35    1   1    1024       0 c11-36
  643527 sleeeep.sm bhm      staff     normal  staff    R  20477        1:46       58:14    1   1    1024       0 c9-3
  643526 sleeeep.sm bhm      staff     normal  staff    R  20477        1:47       58:13    1   1    1024       0 c9-3
  643524 sleeeep.sm bhm      staff     normal  staff    R  20477        1:48       58:12    1   1    1024       0 c9-3
  643525 sleeeep.sm bhm      staff     normal  staff    R  20477        1:48       58:12    1   1    1024       0 c9-3
  643523 sleeeep.sm bhm      staff     normal  staff    R  20477        1:49       58:11    1   1    1024       0 c9-3
  643522 sleeeep.sm bhm      staff     normal  staff    R  20477        2:01       57:59    1   1    1024       0 c9-3
  643520 sleeeep.sm bhm      staff     normal  staff    R  20477        2:02       57:58    1   1    1024       0 c9-3
  643521 sleeeep.sm bhm      staff     normal  staff    R  20477        2:02       57:58    1   1    1024       0 c9-3
  643519 sleeeep.sm bhm      staff     normal  staff    R  20477        2:04       57:56    1   1    1024       0 c9-3
  643518 sleeeep.sm bhm      staff     normal  staff    R  20477        2:13       57:47    1   1    1024       0 c9-3
  643516 sleeeep.sm bhm      staff     normal  staff    R  20477        2:23       57:37    1   1    1024       0 c9-3
  643517 sleeeep.sm bhm      staff     normal  staff    R  20477        2:23       57:37    1   1    1024       0 c9-3
  643514 sleeeep.sm bhm      staff     normal  staff    R  20477        2:24       57:36    1   1    1024       0 c9-3
  643515 sleeeep.sm bhm      staff     normal  staff    R  20477        2:24       57:36    1   1    1024       0 c9-3
  643513 sleeeep.sm bhm      staff     normal  staff    R  20477        2:25       57:35    1   1    1024       0 c9-3

I.e., there is 1 unallocated core on c9-3 and 14 on c11-36 (all our nodes have 16 cores).  Then I started a 15-task job in the reservation:

$ sbatch --reservation=bhmtest --ntasks=15 env.sm
Submitted batch job 643532
$ cat slurm-643532.out [I've truncated the output a bit]
On batch node:
****************
compute-11-36.local
SLURM_JOB_CPUS_PER_NODE=1,14
SLURM_JOB_NODELIST=c11-36,c9-3
SLURM_NODELIST=c11-36,c9-3
SLURM_NPROCS=15
SLURM_NTASKS=15
SLURM_TASKS_PER_NODE=1,14
JobId=643532 Name=env.sm
   NodeList=c11-36,c9-3
   BatchHost=c11-36
   NumNodes=2 NumCPUs=15 CPUs/Task=1 ReqS:C:T=*:*:*
     Nodes=c11-36 CPU_IDs=0 Mem=500
     Nodes=c9-3 CPU_IDs=0-13 Mem=7000
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0

Srun:
****************
     14 compute-11-36.local
      1 compute-9-3.local

So it seems the job ran 1 task on c9-3 and 14 tasks on c11-36, as it should,
but the scontrol output and the job environment variables are wrong.  Thus my
claim that nodes would be overallocated is wrong.

However, the memory limits for the job on the nodes are wrong:

 # ssh c9-3 grep '643532.*limit_in_bytes' /var/log/slurm/slurmd.log
[2012-12-05T12:30:06] [643532.0] parameter 'memory.limit_in_bytes' set to '7340032000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.memsw.limit_in_bytes' set to '7340032000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.limit_in_bytes' set to '7340032000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_0'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.memsw.limit_in_bytes' set to '7340032000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_0'

# ssh c11-36 grep '643532.*limit_in_bytes' /var/log/slurm/slurmd.log
[2012-12-05T12:30:05] [643532] parameter 'memory.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:05] [643532] parameter 'memory.memsw.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:05] [643532] parameter 'memory.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_4294967294'
[2012-12-05T12:30:05] [643532] parameter 'memory.memsw.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_4294967294'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.memsw.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_0'
[2012-12-05T12:30:06] [643532.0] parameter 'memory.memsw.limit_in_bytes' set to '524288000' for '/dev/cgroup/memory/slurm/uid_10231/job_643532/step_0'

I.e., the limit is 7000 MiB on c9-3 and 500 MiB on c11-36, which should have
been the other way around.

Thus jobs risk getting killed.
Comment 4 Moe Jette 2012-12-06 04:10:10 MST
Based upon my very limited testing, this one-line change fixes the problem.  I need to do a lot more testing on different systems (e.g. BlueGene and Cray) before making this change in our code, and I have other work needing my attention, but if you want to test with this and report the results, that would be appreciated.

diff --git a/src/common/hostlist.c b/src/common/hostlist.c
index 34c04b0..abc3445 100644
--- a/src/common/hostlist.c
+++ b/src/common/hostlist.c
@@ -901,7 +901,7 @@ static int hostrange_prefix_cmp(hostrange_t h1, hostrange_t h2)
        if (h2 == NULL)
                return -1;
 
-       retval = strcmp(h1->prefix, h2->prefix);
+       retval = strnatcmp(h1->prefix, h2->prefix);
        return retval == 0 ? h2->singlehost - h1->singlehost : retval;
 }
Comment 5 Bjørn-Helge Mevik 2012-12-06 18:49:10 MST
Thanks!  We will test and report!
Comment 6 Bjørn-Helge Mevik 2012-12-07 00:47:51 MST
I've tested it on our test cluster, and it seems to solve the problem, yes. :)

I'll do some more tests before porting it to our production cluster.

Thanks!
Comment 7 Bjørn-Helge Mevik 2012-12-11 21:41:41 MST
Several more tests have not shown any problems with the patch, and it does solve the issue, so we will port it to our production cluster now.

Thanks!
Comment 8 Moe Jette 2012-12-12 02:20:28 MST
This change can be found in version 2.5.1 (when released). You will need to use the patch until upgrading. Thank you for testing.