I think we have come across a bug in Slurm. We are running Slurm 2.4.3 on a Rocks 6.0 cluster (based on CentOS 6.2). When submitting jobs to several nodes (using sbatch), steps (started with either srun or mpirun) sometimes fail to start on some of the nodes. For instance:

$ cat test.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:05:0
#SBATCH --mem-per-cpu=500M
srun -l hostname

$ sbatch --nodes=300 test.sm
Submitted batch job 610698

$ scontrol show job 610698
JobId=610698 Name=test.sm
   UserId=bhm(10231) GroupId=users(100)
   Priority=20707 Account=staff QOS=staff
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2012-11-29T12:09:32 EligibleTime=2012-11-29T12:09:32
   StartTime=2012-11-29T12:09:32 EndTime=2012-11-29T12:09:34
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=normal AllocNode:Sid=login-0-2:21143
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c10-[1-16,32-36],c11-[1-4],c12-[4-36],c13-[1-31],c15-[33-36],c16-[1-36],c17-[1-36],c18-[1-19],c3-[19-36],c4-[1-36],c5-[1-36],c6-[1-3],c9-[14-36]
   BatchHost=c10-1
   NumNodes=300 NumCPUs=300 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/cluster/home/bhm/slurm/test.sm
   WorkDir=/cluster/home/bhm/slurm

$ cat slurm-610698.out
130: compute-10-15.local
srun: error: Task launch for 610698.0 failed on node c9-34: Invalid job credential
srun: error: Task launch for 610698.0 failed on node c9-35: Invalid job credential
srun: error: Task launch for 610698.0 failed on node c4-15: Invalid job credential
[...]
184: compute-13-11.local
282: compute-18-2.local
130: slurmd[c10-15]: *** STEP 610698.0 KILLED AT 2012-11-29T12:09:32 WITH SIGNAL 9 ***
[...]
218: slurmd[c16-10]: *** STEP 610698.0 KILLED AT 2012-11-29T12:09:32 WITH SIGNAL 9 ***
116: compute-10-1.local
srun: error: Timed out waiting for job step to complete

On the nodes where the srun (or mpirun) fails, slurmd.log contains the following messages:

# pdsh -w c10-[1-16,32-36],c11-[1-4],c12-[4-36],c13-[1-31],c15-[33-36],c16-[1-36],c17-[1-36],c18-[1-19],c3-[19-36],c4-[1-36],c5-[1-36],c6-[1-3],c9-[14-36] 'grep "610698.*not in hostset" /var/log/slurm/slurmd.log'
c3-24: [2012-11-29T12:09:32] error: Invalid job 610698.0 credential for user 10231: host c3-24 not in hostset c3-[19-36],c4-[1-36],c5-[1-36],c6-[1-3],c9-[14-36],c10-[1-16,32-36],c11-[1-4],c12-[4-36],c13-[1-31],c15-[33-36],c16-[1-36],c17-[1-36],c18-[1-19]
c3-26: [2012-11-29T12:09:32] error: Invalid job 610698.0 credential for user 10231: host c3-26 not in hostset c3-[19-36],c4-[1-36],c5-[1-36],c6-[1-3],c9-[14-36],c10-[1-16,32-36],c11-[1-4],c12-[4-36],c13-[1-31],c15-[33-36],c16-[1-36],c17-[1-36],c18-[1-19]
c3-23: [2012-11-29T12:09:32] error: Invalid job 610698.0 credential for user 10231: host c3-23 not in hostset c3-[19-36],c4-[1-36],c5-[1-36],c6-[1-3],c9-[14-36],c10-[1-16,32-36],c11-[1-4],c12-[4-36],c13-[1-31],c15-[33-36],c16-[1-36],c17-[1-36],c18-[1-19]
[...]

In all the cases we've seen, the node in question actually _is_ in the list of node names printed in the error message.

We've experimented a bit, and found that the problem seems to be related to the ordering of the nodes. In the error message, the node names (cX-Y) are ordered first in numerical order by X, then in numerical order by Y. (I'll call this the numerical order henceforth.) On the other hand, scontrol show job orders the nodes first in alphabetical order by the cX prefix, then in numerical order by Y. (I'll call this the alphabetical order.) We've discovered that the error only seems to occur if the nodes allocated to the job would be sorted differently by the two orderings, and then it seems to happen every time.
Also, the nodes where the error occurs are always precisely the nodes that would have been sorted later in the alphabetical order than in the numerical order.

We can reproduce the error with very few nodes, for instance:

$ sbatch --nodelist=c1-23,c2-17,c12-4 --nodes=3 test.sm
Submitted batch job 610762

$ scontrol show job 610762
JobId=610762 Name=test.sm
   UserId=bhm(10231) GroupId=users(100)
   Priority=20479 Account=staff QOS=staff
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2012-11-29T12:29:51 EligibleTime=2012-11-29T12:29:51
   StartTime=2012-11-29T12:29:51 EndTime=2012-11-29T12:29:53
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=normal AllocNode:Sid=login-0-2:21143
   ReqNodeList=c1-23,c12-4,c2-17 ExcNodeList=(null)
   NodeList=c1-23,c12-4,c2-17
   BatchHost=c1-23
   NumNodes=3 NumCPUs=3 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/cluster/home/bhm/slurm/test.sm
   WorkDir=/cluster/home/bhm/slurm

# pdsh -w c1-23,c2-17,c12-4 'grep "610762.*not in hostset" /var/log/slurm/slurmd.log'
c2-17: [2012-11-29T12:29:51] error: Invalid job 610762.0 credential for user 10231: host c2-17 not in hostset c1-23,c2-17,c12-4

If we specify nodes that would be sorted in the same way by the two orderings, the problem does not occur:

$ sbatch --nodelist=c1-23,c2-17,c4-1 --nodes=3 test.sm
Submitted batch job 610770

$ cat slurm-610770.out
1: compute-2-17.local
2: compute-4-1.local
0: compute-1-23.local

# pdsh -w c1-23,c2-17,c4-1 'grep "610770.*not in hostset" /var/log/slurm/slurmd.log'
# (no matches)

I've tried to dig into the code to see what really happens, but haven't got too far.
The error message is from _check_job_credential() in src/slurmd/slurmd/req.c, and as far as I can see, either the hostset returned by s_hset = hostset_create(arg.step_hostlist) fails to contain the reordered nodes, or the check hostset_within(s_hset, conf->node_name) fails to find them in the set.

Our slurm.conf defines the nodes in numerical order, like this:

Nodename=c1-[1-36] NodeHostname=compute-1-[1-36] Weight=4172 Feature=rack1,intel,ib
Nodename=c2-[1-36] NodeHostname=compute-2-[1-36] Weight=4172 Feature=rack2,intel,ib
Nodename=c3-[1-36] NodeHostname=compute-3-[1-36] Weight=4172 Feature=rack3,intel,ib
Nodename=c4-[1-36] NodeHostname=compute-4-[1-36] Weight=4172 Feature=rack4,intel,ib
[...]
Nodename=c16-[1-36] NodeHostname=compute-16-[1-36] Weight=6172 Feature=rack16,intel,ib
Nodename=c17-[1-36] NodeHostname=compute-17-[1-36] Weight=6172 Feature=rack17,intel,ib
Nodename=c18-[1-20] NodeHostname=compute-18-[1-20] Weight=6172 Feature=rack18,intel,ib
Nodename=c18-[21-40] NodeHostname=compute-18-[21-40] Weight=6172 Feature=rack18,intel,ib

As a test, we've reordered the nodes in slurm.conf into the alphabetical order, but that didn't help; the error still occurs:

$ sbatch --nodelist=c1-23,c2-17,c12-4 --nodes=3 test.sm
Submitted batch job 611490

$ cat slurm-611490.out
srun: error: Task launch for 611490.0 failed on node c2-17: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2: compute-12-4.local
srun: error: Timed out waiting for job step to complete

One interesting point here: the node list in the error message is still in numerical order, while scontrol show job lists the nodes in alphabetical order:

# pdsh -w c1-23,c2-17,c12-4 'grep "611490.*not in hostset" /var/log/slurm/slurmd.log'
c2-17: [2012-11-29T13:27:08] error: Invalid job 611490.0 credential for user 10231: host c2-17 not in hostset c1-23,c2-17,c12-4

$ scontrol show job 611490 | grep ReqNodeList
   ReqNodeList=c1-23,c12-4,c2-17 ExcNodeList=(null)

Needless to say, this impairs our ability to run parallel jobs quite a bit, so a quick resolution would be very appreciated.

Regards,
Bjørn-Helge Mevik
I spent the night debugging on our test cluster, and figured it out.

The problem is in the function hostrange_hn_within in src/common/hostlist.c. The logic that was added to allow things like nid0000[2-7] assumes that the prefixes of all hostranges are equally long. In our case, with node names like c1-1, c10-2 and c2-5, they are not. When hostrange_hn_within is used to check a host c2-5 against a range c10-[1-5], the logic modifies the host prefix from c2- to c2-5. So later, when checking whether c2-5 is in the range c2-[5-6], say, the prefix comparison fails, trying to compare c2- with c2-5.

I'm attaching a small workaround fix for this. It simply skips the added logic unless the last character of the node range prefix is a digit. So it will still handle cases like nid0000[2-7], but will not mess with ranges like c2-[5-6]. It would perhaps be more general to modify the logic so that it does not change the function arguments, but that would mean the logic would have to be performed for each range.

Cheers,
Bjørn-Helge
Created attachment 163 [details] Fix for hostname prefixes of varying length
Awesome! Good to see it was so easy. This will be in 2.4.5/2.5.0, probably both tagged next week.