Ticket 179 - Tasks fail to start when the numerical and alphabetical orderings of nodes do not match
Summary: Tasks fail to start when the numerical and alphabetical orderings of nodes do not match
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 2.4.x
Hardware: Linux
Severity: 1 - System not usable
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2012-11-28 22:54 MST by Bjørn-Helge Mevik
Modified: 2012-11-30 05:01 MST

See Also:
Site: -Other-


Attachments
Fix for hostname prefixes of varying length (763 bytes, patch)
2012-11-29 23:58 MST, Bjørn-Helge Mevik

Description Bjørn-Helge Mevik 2012-11-28 22:54:45 MST
I think we have come across a bug in Slurm.

We are running Slurm 2.4.3 on a Rocks 6.0 cluster (based on CentOS 6.2).


When submitting jobs to several nodes (using sbatch), job steps (started with
either srun or mpirun) sometimes fail to start on some of the nodes.  For instance:

$ cat test.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:05:0
#SBATCH --mem-per-cpu=500M

srun -l hostname

$ sbatch  --nodes=300 test.sm
Submitted batch job 610698
$ scontrol show job 610698
JobId=610698 Name=test.sm
   UserId=bhm(10231) GroupId=users(100)
   Priority=20707 Account=staff QOS=staff
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2012-11-29T12:09:32 EligibleTime=2012-11-29T12:09:32
   StartTime=2012-11-29T12:09:32 EndTime=2012-11-29T12:09:34
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=normal AllocNode:Sid=login-0-2:21143
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=c10-[1-16,32-36],c11-[1-4],c12-[4-36],c13-[1-31],c15-[33-36],c16-[1-36],c17-[1-36],c18-[1-19],c3-[19-36],c4-[1-36],c5-[1-36],c6-[1-3],c9-[14-36]
   BatchHost=c10-1
   NumNodes=300 NumCPUs=300 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/cluster/home/bhm/slurm/test.sm
   WorkDir=/cluster/home/bhm/slurm

$ cat slurm-610698.out
130: compute-10-15.local
srun: error: Task launch for 610698.0 failed on node c9-34: Invalid job credential
srun: error: Task launch for 610698.0 failed on node c9-35: Invalid job credential
srun: error: Task launch for 610698.0 failed on node c4-15: Invalid job credential
[...]
184: compute-13-11.local
282: compute-18-2.local
130: slurmd[c10-15]: *** STEP 610698.0 KILLED AT 2012-11-29T12:09:32 WITH SIGNAL 9 ***
[...]
218: slurmd[c16-10]: *** STEP 610698.0 KILLED AT 2012-11-29T12:09:32 WITH SIGNAL 9 ***
116: compute-10-1.local
srun: error: Timed out waiting for job step to complete


On the nodes where the srun (or mpirun) fails, slurmd.log contains the following
message:

# pdsh -w c10-[1-16,32-36],c11-[1-4],c12-[4-36],c13-[1-31],c15-[33-36],c16-[1-36],c17-[1-36],c18-[1-19],c3-[19-36],c4-[1-36],c5-[1-36],c6-[1-3],c9-[14-36] 'grep "610698.*not in hostset" /var/log/slurm/slurmd.log'
c3-24: [2012-11-29T12:09:32] error: Invalid job 610698.0 credential for user 10231: host c3-24 not in hostset c3-[19-36],c4-[1-36],c5-[1-36],c6-[1-3],c9-[14-36],c10-[1-16,32-36],c11-[1-4],c12-[4-36],c13-[1-31],c15-[33-36],c16-[1-36],c17-[1-36],c18-[1-19]
c3-26: [2012-11-29T12:09:32] error: Invalid job 610698.0 credential for user 10231: host c3-26 not in hostset c3-[19-36],c4-[1-36],c5-[1-36],c6-[1-3],c9-[14-36],c10-[1-16,32-36],c11-[1-4],c12-[4-36],c13-[1-31],c15-[33-36],c16-[1-36],c17-[1-36],c18-[1-19]
c3-23: [2012-11-29T12:09:32] error: Invalid job 610698.0 credential for user 10231: host c3-23 not in hostset c3-[19-36],c4-[1-36],c5-[1-36],c6-[1-3],c9-[14-36],c10-[1-16,32-36],c11-[1-4],c12-[4-36],c13-[1-31],c15-[33-36],c16-[1-36],c17-[1-36],c18-[1-19]
[...]


In all the cases we've seen, the node in question actually _is_ in the list of
node names printed in the error message.

We've experimented a bit, and found that the problem seems to be related to
the ordering of the nodes.  In the error message, the node names (cX-Y) are
sorted first numerically by X, then numerically by Y.  (I'll call this the
numerical order henceforth.)  On the other hand, scontrol show job sorts the
nodes first alphabetically by the cX prefix, then numerically by Y.  (I'll
call this the alphabetical order.)

We've discovered that the error only seems to occur if the nodes allocated to
the job would be sorted differently by the two orderings, and then it seems to
happen every time.  Also, the nodes where the error occurs are always precisely
those that sort later in the alphabetical order than in the numerical order.
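
The two orderings described above can be sketched with a pair of qsort comparators.  This is a hypothetical illustration, not code from Slurm: the split_name helper and the assumption that every name has the form cX-Y are inventions for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the Slurm source): parse a node name of
 * the form cX-Y into its two numeric components. */
static void split_name(const char *name, int *x, int *y)
{
    *x = *y = 0;
    sscanf(name, "c%d-%d", x, y);
}

/* "Numerical" order: first numerically by X, then numerically by Y.
 * This is the order the slurmd error messages show. */
static int numeric_cmp(const void *a, const void *b)
{
    int ax, ay, bx, by;
    split_name(*(const char *const *)a, &ax, &ay);
    split_name(*(const char *const *)b, &bx, &by);
    return (ax != bx) ? ax - bx : ay - by;
}

/* "Alphabetical" order: a plain string comparison, which sorts c12-4
 * before c2-17 because '1' < '2'.  This matches what scontrol prints. */
static int alpha_cmp(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}
```

Sorting the three nodes from the small reproducer below gives c1-23, c2-17, c12-4 under numeric_cmp but c1-23, c12-4, c2-17 under alpha_cmp: exactly the discrepancy between the slurmd error message and scontrol show job.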


We can reproduce the error with very few nodes, for instance:

$ sbatch  --nodelist=c1-23,c2-17,c12-4 --nodes=3 test.sm
Submitted batch job 610762
$ scontrol show job 610762
JobId=610762 Name=test.sm
   UserId=bhm(10231) GroupId=users(100)
   Priority=20479 Account=staff QOS=staff
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:05:00 TimeMin=N/A
   SubmitTime=2012-11-29T12:29:51 EligibleTime=2012-11-29T12:29:51
   StartTime=2012-11-29T12:29:51 EndTime=2012-11-29T12:29:53
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=normal AllocNode:Sid=login-0-2:21143
   ReqNodeList=c1-23,c12-4,c2-17 ExcNodeList=(null)
   NodeList=c1-23,c12-4,c2-17
   BatchHost=c1-23
   NumNodes=3 NumCPUs=3 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryCPU=500M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/cluster/home/bhm/slurm/test.sm
   WorkDir=/cluster/home/bhm/slurm
# pdsh -w c1-23,c12-4,c2-17 'grep "610762.*not in hostset" /var/log/slurm/slurmd.log'
c2-17: [2012-11-29T12:29:51] error: Invalid job 610762.0 credential for user 10231: host c2-17 not in hostset c1-23,c2-17,c12-4


If we specify nodes that would be sorted in the same way by the two orderings,
the problem does not occur:

$ sbatch  --nodelist=c1-23,c2-17,c4-1 --nodes=3 test.sm
Submitted batch job 610770
$ cat slurm-610770.out 
1: compute-2-17.local
2: compute-4-1.local
0: compute-1-23.local
# pdsh -w c1-23,c2-17,c4-1 'grep "610770.*not in hostset" /var/log/slurm/slurmd.log'
# (no matches)


I've tried to dig into the code to see what really happens, but haven't gotten
very far.  The error message comes from _check_job_credential() in
src/slurmd/slurmd/req.c, and as far as I can tell, either the hostset returned
by

  s_hset = hostset_create(arg.step_hostlist)

fails to contain the reordered nodes, or the check

  hostset_within(s_hset, conf->node_name)

fails to find them in the set.


Our slurm.conf defines the nodes in numerical order, like this:

Nodename=c1-[1-36] NodeHostname=compute-1-[1-36] Weight=4172 Feature=rack1,intel,ib
Nodename=c2-[1-36] NodeHostname=compute-2-[1-36] Weight=4172 Feature=rack2,intel,ib
Nodename=c3-[1-36] NodeHostname=compute-3-[1-36] Weight=4172 Feature=rack3,intel,ib
Nodename=c4-[1-36] NodeHostname=compute-4-[1-36] Weight=4172 Feature=rack4,intel,ib
[...]
Nodename=c16-[1-36] NodeHostname=compute-16-[1-36] Weight=6172 Feature=rack16,intel,ib
Nodename=c17-[1-36] NodeHostname=compute-17-[1-36] Weight=6172 Feature=rack17,intel,ib
Nodename=c18-[1-20] NodeHostname=compute-18-[1-20] Weight=6172 Feature=rack18,intel,ib
Nodename=c18-[21-40] NodeHostname=compute-18-[21-40] Weight=6172 Feature=rack18,intel,ib

As a test, we've reordered the nodes in the slurm.conf into the alphabetical
order, but that didn't help; the error still occurs:

$ sbatch  --nodelist=c1-23,c2-17,c12-4 --nodes=3 test.sm
Submitted batch job 611490
$ cat slurm-611490.out 
srun: error: Task launch for 611490.0 failed on node c2-17: Invalid job credential
srun: error: Application launch failed: Invalid job credential
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2: compute-12-4.local
srun: error: Timed out waiting for job step to complete


One interesting point here:  The node list in the error message is still in
numerical order, while scontrol show job lists the nodes in alphabetical order:

# pdsh -w c1-23,c2-17,c12-4 'grep "611490.*not in hostset" /var/log/slurm/slurmd.log'
c2-17: [2012-11-29T13:27:08] error: Invalid job 611490.0 credential for user 10231: host c2-17 not in hostset c1-23,c2-17,c12-4

$ scontrol show job 611490 | grep ReqNodeList
   ReqNodeList=c1-23,c12-4,c2-17 ExcNodeList=(null)


Needless to say, this seriously impairs our ability to run parallel jobs, so a quick resolution would be much appreciated.

Regards,
Bjørn-Helge Mevik
Comment 1 Bjørn-Helge Mevik 2012-11-29 23:56:38 MST
I spent the night debugging on our test cluster, and figured it out.  The problem is in the function hostrange_hn_within in src/common/hostlist.c.

The logic that was added to allow things like nid0000[2-7] assumes that the prefixes of all hostranges are equally long.  In our case, with node names like c1-1, c10-2 and c2-5, they are not.  When hostrange_hn_within checks a host c2-5 against a range c10-[1-5], the logic modifies the host prefix from c2- to c2-5.  So later, when checking whether c2-5 is in the range c2-[5-6], say, the prefix comparison fails, comparing c2-5 against c2-.

I'm attaching a small workaround fix for this.  It simply skips the added logic unless the last character of the hostrange prefix is a digit, so it will still handle cases like nid0000[2-7], but will no longer mangle ranges like c2-[5-6].

It would perhaps be more general to modify the logic so that it does not change the function arguments, but then the logic would have to be performed for each range.
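
The mechanism described above can be sketched in miniature.  This is a hypothetical model, not the real hostlist.c code: the hostname/hostrange structures are reduced to a prefix plus numbers, the padding step is heavily simplified, and the apply_guard flag stands in for the attached patch's last-character-is-a-digit check.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-ins for the hostname/hostrange objects in
 * src/common/hostlist.c; the real structures carry more fields. */
struct hostname  { char prefix[32]; unsigned long num; };
struct hostrange { char prefix[32]; unsigned long lo, hi; };

/* Sketch of the membership test.  When the hostname prefix is shorter
 * than the range prefix, the nid0000[2-7]-style logic moves leading
 * digits of the numeric suffix into the hostname prefix -- mutating the
 * hostname in place, which is the bug described above.  apply_guard
 * models the workaround: only run that logic when the range prefix
 * itself ends in a digit. */
static int hn_within(struct hostname *hn, const struct hostrange *hr,
                     int apply_guard)
{
    size_t rlen = strlen(hr->prefix);
    size_t hlen = strlen(hn->prefix);

    if (hlen < rlen &&
        (!apply_guard || isdigit((unsigned char)hr->prefix[rlen - 1]))) {
        char buf[24];
        size_t i = 0;
        snprintf(buf, sizeof(buf), "%lu", hn->num);
        while (hlen < rlen && buf[i] != '\0') {
            hn->prefix[hlen++] = buf[i++];   /* destructive: *hn changes */
            hn->prefix[hlen] = '\0';
        }
    }
    if (strcmp(hn->prefix, hr->prefix) != 0)
        return 0;                  /* prefix mismatch: not in this range */
    return hn->num >= hr->lo && hn->num <= hr->hi;
}
```

With the guard off, checking c2-5 against c10-[1-5] rewrites the hostname's prefix to c2-5, so a subsequent check against c2-[5-6] fails even though the host is in that range.  With the guard on, the range prefix c10- ends in '-', the padding is skipped, and both checks behave as expected.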

Cheers,
Bjørn-Helge
Comment 2 Bjørn-Helge Mevik 2012-11-29 23:58:42 MST
Created attachment 163 [details]
Fix for hostname prefixes of varying length
Comment 3 Danny Auble 2012-11-30 05:01:43 MST
Awesome!  Good to see it was so easy.  This will be in 2.4.5/2.5.0, probably both tagged next week.