I'm running into an issue in our test environment. To replicate our production cluster, I've set up a cluster using the "multiple slurmd" feature. It all seems to work well, but it appears that the order in which nodes are declared in slurm.conf matters. I'm not sure what the design is here, but (for example) if I put a "real" host at the bottom of slurm.conf I get:

slurmctld: error: find_node_record: lookup failure for puck1
slurmctld: error: node_name2bitmap: invalid node specified puck1
slurmctld: fatal: Invalid node names in partition slapshot2

If I move it above all other node definitions, slurmctld starts properly. This is a problem for us because I'm building these host files with an external tool (Puppet templates), where ordering isn't really something I can define.

Any hints and/or advice?

Thanks much,
Michael
Hi, could you please attach the slurm.conf that shows the problem? Thanks, David
Created attachment 1701: slurm.conf with misordered nodes
OK, attached. When I start slurmctld with this configuration, I get:

mrg@slapshot[~]: sudo /usr/sbin/slurmctld -D
slurmctld: slurmctld version 15.08.0-0pre2 started on cluster slapshot
slurmctld: layouts: no layout to initialize
slurmctld: error: Reconfiguration for node puck1, ignoring!
slurmctld: _parse_part_spec: changing default partition from slapshot to campus
slurmctld: layouts: loading entities/relations information
slurmctld: error: find_node_record: lookup failure for puck1
slurmctld: Recovered state of 406 nodes
slurmctld: Recovered information about 0 jobs
slurmctld: error: find_node_record: lookup failure for puck1
slurmctld: error: node_name2bitmap: invalid node specified puck1
slurmctld: fatal: Invalid node names in partition slapshot

Thanks,
M
Hi, this was fixed in commit ce32018a28d6b7a, which is available in 15.08.0pre3. If you update your code to the latest version, you will get the fix. David
We discovered that the issue still exists in the latest code. Sorry for the confusion. The solution is to use NodeAddr instead of NodeHostName in the node definitions. Thanks, David
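For illustration, the change might look like the fragment below. This is only a sketch under stated assumptions: the attached slurm.conf is not reproduced in this thread, so the emulated node names (sim1, sim2), the port numbers, and the partition's node list are hypothetical. Only the host name puck1, the partition name slapshot, and the NodeAddr and NodeHostName parameters come from the discussion above.

```
# Hypothetical emulated nodes, each served by its own slurmd on the physical host puck1.
# Assumed earlier form: NodeName=sim1 NodeHostName=puck1 Port=17001
NodeName=sim1 NodeAddr=puck1 Port=17001
NodeName=sim2 NodeAddr=puck1 Port=17002

# The "real" host, declared after the emulated nodes
# (the ordering that was reported to trigger the lookup failure).
NodeName=puck1

# Hypothetical node list for the partition named in the error output.
PartitionName=slapshot Nodes=puck1,sim[1-2]
```

If the suggestion holds, pointing the emulated nodes at the physical host through NodeAddr should let the puck1 definition appear anywhere in the file, which matters when the file is generated from templates. It is worth confirming this against your own configuration before relying on it.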