Hi guys,

we are testing the SLURM upgrade from 14.11.11 to 15.08.6 on our test cluster before doing it on the production cluster. Since the database conversion is the crucial part of the process, we dumped the production database and imported it into the test environment. Everything went fine until we tried to start slurmctld and received the following error:

# slurmctld -D -vvv
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: slurmctld version 15.08.6 started on cluster testcluster
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 20
slurmctld: preempt/none loaded
slurmctld: debug: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug: AcctGatherEnergy NONE plugin loaded
slurmctld: debug: AcctGatherProfile NONE plugin loaded
slurmctld: debug: AcctGatherInfiniband NONE plugin loaded
slurmctld: debug: AcctGatherFilesystem NONE plugin loaded
slurmctld: debug2: No acct_gather.conf file (/usr/scheduler/slurm-15.08.6/etc/acct_gather.conf)
slurmctld: debug: Job accounting gather LINUX plugin loaded
slurmctld: ExtSensors NONE plugin loaded
slurmctld: debug: switch NONE plugin loaded
slurmctld: debug: No backup controller to shutdown
slurmctld: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
slurmctld: debug: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmctld: debug: slurmdbd: Sent DbdInit msg
slurmctld: error: slurmdbd: Invalid message version=0, type:1432
slurmctld: error: no buffer given
slurmctld: error: slurmdbd: Invalid message version=0, type:1432
slurmctld: error: no buffer given
slurmctld: error: slurmdbd: Invalid message version=0, type:1424
slurmctld: error: no buffer given
[...]
[...]
[...]
slurmctld: error: slurmdbd: Invalid message version=0, type:1425
slurmctld: error: no buffer given
slurmctld: slurmdbd: recovered 0 pending RPCs
^Cslurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: debug: Reading slurm.conf file: /usr/scheduler/slurm-15.08.6/etc/slurm.conf
slurmctld: layouts: no layout to initialize
slurmctld: topology NONE plugin loaded
slurmctld: debug: No DownNodes
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: route default plugin loaded
slurmctld: layouts: loading entities/relations information
slurmctld: debug: layouts: 2/2 nodes in hash table, rc=0
slurmctld: debug: layouts: loading stage 1
slurmctld: debug: layouts: loading stage 1.1 (restore state)
slurmctld: debug: layouts: loading stage 2
slurmctld: debug: layouts: loading stage 3
slurmctld: Recovered state of 2 nodes
slurmctld: recovered job step 28.0
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 28 with invalid association
slurmctld: recovered job step 31.0
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 31 with invalid association
slurmctld: recovered job step 32.0
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 32 with invalid association
slurmctld: recovered job step 30.0
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 30 with invalid association
slurmctld: recovered job step 34.0
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 34 with invalid association
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 35 with invalid association
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 36 with invalid association
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 37 with invalid association
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 38 with invalid association
slurmctld: Recovered information about 9 jobs
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 2 partitions
slurmctld: debug2: init_requeue_policy: kill_invalid_depend is set to 0
slurmctld: debug: Updating partition uid access list
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
slurmctld: error: Invalid assoc_ptr for jobid=28
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=28
slurmctld: error: Invalid assoc_ptr for jobid=28
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=28
slurmctld: error: Invalid assoc_ptr for jobid=31
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=31
slurmctld: error: Invalid assoc_ptr for jobid=31
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=31
slurmctld: error: Invalid assoc_ptr for jobid=32
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=32
slurmctld: error: Invalid assoc_ptr for jobid=32
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=32
slurmctld: error: Invalid assoc_ptr for jobid=30
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=30
slurmctld: error: Invalid assoc_ptr for jobid=30
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=30
slurmctld: error: Invalid assoc_ptr for jobid=34
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=34
slurmctld: error: Invalid assoc_ptr for jobid=34
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=34
slurmctld: error: Invalid assoc_ptr for jobid=35
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=35
slurmctld: error: Invalid assoc_ptr for jobid=36
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=36
slurmctld: error: Invalid assoc_ptr for jobid=37
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=37
slurmctld: error: Invalid assoc_ptr for jobid=38
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=38
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: cons_res: select_p_reconfigure
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 2 partitions
slurmctld: Running as primary controller
slurmctld: Registering slurmctld at port 6817 with slurmdbd.
slurmctld: error: slurmdbd: Issue with call DBD_REGISTER_CTLD(1434): 4294967295(This cluster hasn't been added to accounting yet)
slurmctld: fatal: You need to add this cluster to accounting if you want to enforce associations, or no jobs will ever run.

If I understand correctly, the error arises because the production cluster (where the database was dumped) and the test cluster (where it was imported) have different names. Or is there something else?

We were pretty confident that everything was fine before the upgrade because SLURM did not complain after we imported the database from the production cluster.

So, what is the best procedure to replicate a database on a testbed?

Thank you in advance for your help.
Davide
Hey Davide,

I am guessing the name of the cluster on your test machine is different from the one on production. Change it in your slurm.conf to match, restart, and see if things start working correctly. I would also verify that the test machine is talking to the new slurmdbd.

If things don't work right away after switching the cluster name, please send the slurmdbd.log file from the startup.
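The cluster-name check suggested above can be sketched in a few lines of Python. This is only an illustration, not part of SLURM: the helper names `read_cluster_name` and `cluster_is_registered` are made up for this example, `prodcluster` is a hypothetical production name (the thread does not give it), and the list of registered clusters is assumed to have been collected by hand, for instance from the output of `sacctmgr show cluster`.

```python
import re


def read_cluster_name(conf_text):
    """Return the ClusterName value found in slurm.conf text, or None."""
    for raw_line in conf_text.splitlines():
        # Drop anything after '#', which starts a comment in slurm.conf.
        line = raw_line.split("#", 1)[0].strip()
        # Parameter names in slurm.conf are case-insensitive.
        match = re.match(r"(?i)clustername\s*=\s*(\S+)", line)
        if match:
            return match.group(1)
    return None


def cluster_is_registered(conf_text, accounting_clusters):
    """True if the configured ClusterName is among the accounting clusters."""
    name = read_cluster_name(conf_text)
    return name is not None and name in accounting_clusters


if __name__ == "__main__":
    # Hypothetical test-cluster config; only the ClusterName line matters here.
    conf = "# test cluster\nClusterName=testcluster\n"
    # Hypothetical content of the imported production accounting database.
    registered = ["prodcluster"]
    if not cluster_is_registered(conf, registered):
        print("ClusterName %r is not in accounting: %s"
              % (read_cluster_name(conf), registered))
```

With a database imported from production, a mismatch here would be consistent with the "This cluster hasn't been added to accounting yet" error in the log above.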
Hi Danny,

thanks for the hint. Everything worked fine.

Have a great day!
Davide

(In reply to Danny Auble from comment #1)
> Hey Davide, I am guessing the name of cluster on your test machine is
> different than your production. Change that in your slurm.conf and restart
> and see if things start working correctly. I would also verify the test
> machine is talking to the new slurmdbd. If things don't work right away
> after switching the cluster name please send the slurmdbd.log file from the
> startup.