Ticket 1513

Summary: ordering of hosts in slurm.conf with --enable-multiple-slurmd
Product: Slurm
Reporter: Michael Gutteridge <mrg>
Component: slurmctld
Assignee: David Bigagli <david>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: brian, da
Version: 15.08.x
Hardware: Linux
OS: Linux
Site: FHCRC - Fred Hutchinson Cancer Research Center
Attachments: slurm.conf with misordered nodes

Description Michael Gutteridge 2015-03-09 11:59:45 MDT
I'm running into an issue in our test environment. To replicate our production cluster, I've set up a cluster using the "multiple slurmd" feature.

It all seems to work well, but it appears that the order in which nodes are declared in slurm.conf matters. I'm not sure what the design is here, but (for example) if I put a "real" host at the bottom of slurm.conf I get:

slurmctld: error: find_node_record: lookup failure for puck1
slurmctld: error: node_name2bitmap: invalid node specified puck1
slurmctld: fatal: Invalid node names in partition slapshot2

If I move it above all other node definitions, it starts properly.

It's kind of an issue, as I'm building these host files with an external tool (Puppet templates), and the order of the entries isn't something I can readily control.

Any hints and/or advice?

Thanks much

Michael
Comment 1 David Bigagli 2015-03-10 05:18:42 MDT
Hi, could you please attach the slurm.conf that shows the problem?

Thanks,
David
Comment 2 Michael Gutteridge 2015-03-10 05:24:30 MDT
Created attachment 1701
slurm.conf with misordered nodes
Comment 3 Michael Gutteridge 2015-03-10 05:25:12 MDT
OK, attached. When I start slurmctld with this configuration, I get:

mrg@slapshot[~]: sudo /usr/sbin/slurmctld -D
slurmctld: slurmctld version 15.08.0-0pre2 started on cluster slapshot
slurmctld: layouts: no layout to initialize
slurmctld: error: Reconfiguration for node puck1, ignoring!
slurmctld: _parse_part_spec: changing default partition from slapshot to campus
slurmctld: layouts: loading entities/relations information
slurmctld: error: find_node_record: lookup failure for puck1
slurmctld: Recovered state of 406 nodes
slurmctld: Recovered information about 0 jobs
slurmctld: error: find_node_record: lookup failure for puck1
slurmctld: error: node_name2bitmap: invalid node specified puck1
slurmctld: fatal: Invalid node names in partition slapshot


Thanks

M
Comment 4 David Bigagli 2015-03-10 09:55:31 MDT
Hi, this was fixed in commit ce32018a28d6b7a. It is available in 15.08.0pre3.
If you update your code to the latest version you will get the fix.

David
Comment 5 David Bigagli 2015-03-10 11:30:46 MDT
We discovered that the issue still exists in the latest code. Sorry for the confusion.
The solution is to use NodeAddr instead of NodeHostName.

Thanks,
David
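
For illustration only, here is a rough sketch of what the NodeAddr change might look like in slurm.conf. The attached configuration is not reproduced in this ticket, so this assumes the emulated nodes were mapped onto the real host puck1 through NodeHostName. The emulated node names (sim1, sim2) and the port numbers are hypothetical; only puck1 and the partition name slapshot come from the logs above.

```
# Assumed original form (hypothetical names and ports):
#   NodeName=sim1 NodeHostName=puck1 Port=17001
#   NodeName=sim2 NodeHostName=puck1 Port=17002

# Same emulated nodes, pointing at the real host through NodeAddr instead:
NodeName=sim1 NodeAddr=puck1 Port=17001
NodeName=sim2 NodeAddr=puck1 Port=17002

# The "real" host, declared after the emulated nodes:
NodeName=puck1

PartitionName=slapshot Nodes=puck1,sim1,sim2
```

With this layout each emulated slurmd would presumably need to be started with its own node name (slurmd -N sim1, and so on), since its NodeHostName no longer matches the machine's hostname.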