Ticket 2359 - slurmctld database error when upgrading to 15.08.6
Summary: slurmctld database error when upgrading to 15.08.6
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 15.08.6
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Danny Auble
Reported: 2016-01-18 09:12 MST by Davide Vanzo
Modified: 2016-01-27 01:06 MST
Site: Vanderbilt


Description Davide Vanzo 2016-01-18 09:12:54 MST
Hi guys,
we are testing the SLURM upgrade from 14.11.11 to 15.08.6 on our test cluster before doing it on the production cluster. Since the database conversion is the crucial part of the process, we dumped the production database and imported it into the test environment. Everything went fine until we tried to start slurmctld and received the following error:


# slurmctld -D -vvv
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: slurmctld version 15.08.6 started on cluster testcluster
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 20
slurmctld: preempt/none loaded
slurmctld: debug:  Checkpoint plugin loaded: checkpoint/none
slurmctld: debug:  AcctGatherEnergy NONE plugin loaded
slurmctld: debug:  AcctGatherProfile NONE plugin loaded
slurmctld: debug:  AcctGatherInfiniband NONE plugin loaded
slurmctld: debug:  AcctGatherFilesystem NONE plugin loaded
slurmctld: debug2: No acct_gather.conf file (/usr/scheduler/slurm-15.08.6/etc/acct_gather.conf)
slurmctld: debug:  Job accounting gather LINUX plugin loaded
slurmctld: ExtSensors NONE plugin loaded
slurmctld: debug:  switch NONE plugin loaded
slurmctld: debug:  No backup controller to shutdown
slurmctld: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
slurmctld: debug:  auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmctld: debug:  slurmdbd: Sent DbdInit msg
slurmctld: error: slurmdbd: Invalid message version=0, type:1432
slurmctld: error: no buffer given
slurmctld: error: slurmdbd: Invalid message version=0, type:1432
slurmctld: error: no buffer given
slurmctld: error: slurmdbd: Invalid message version=0, type:1424
slurmctld: error: no buffer given
[...]
[...]
[...]
slurmctld: error: slurmdbd: Invalid message version=0, type:1425
slurmctld: error: no buffer given
slurmctld: slurmdbd: recovered 0 pending RPCs
^Cslurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: debug:  Reading slurm.conf file: /usr/scheduler/slurm-15.08.6/etc/slurm.conf
slurmctld: layouts: no layout to initialize
slurmctld: topology NONE plugin loaded
slurmctld: debug:  No DownNodes
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: route default plugin loaded
slurmctld: layouts: loading entities/relations information
slurmctld: debug:  layouts: 2/2 nodes in hash table, rc=0
slurmctld: debug:  layouts: loading stage 1
slurmctld: debug:  layouts: loading stage 1.1 (restore state)
slurmctld: debug:  layouts: loading stage 2
slurmctld: debug:  layouts: loading stage 3
slurmctld: Recovered state of 2 nodes
slurmctld: recovered job step 28.0
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 28 with invalid association
slurmctld: recovered job step 31.0
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 31 with invalid association
slurmctld: recovered job step 32.0
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 32 with invalid association
slurmctld: recovered job step 30.0
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 30 with invalid association
slurmctld: recovered job step 34.0
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 34 with invalid association
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 35 with invalid association
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 36 with invalid association
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 37 with invalid association
slurmctld: debug2: _find_assoc_rec_id: no associations added yet
slurmctld: Holding job 38 with invalid association
slurmctld: Recovered information about 9 jobs
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 2 partitions
slurmctld: debug2: init_requeue_policy: kill_invalid_depend is set to 0
slurmctld: debug:  Updating partition uid access list
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
slurmctld: error: Invalid assoc_ptr for jobid=28
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=28
slurmctld: error: Invalid assoc_ptr for jobid=28
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=28
slurmctld: error: Invalid assoc_ptr for jobid=31
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=31
slurmctld: error: Invalid assoc_ptr for jobid=31
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=31
slurmctld: error: Invalid assoc_ptr for jobid=32
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=32
slurmctld: error: Invalid assoc_ptr for jobid=32
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=32
slurmctld: error: Invalid assoc_ptr for jobid=30
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=30
slurmctld: error: Invalid assoc_ptr for jobid=30
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=30
slurmctld: error: Invalid assoc_ptr for jobid=34
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=34
slurmctld: error: Invalid assoc_ptr for jobid=34
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=34
slurmctld: error: Invalid assoc_ptr for jobid=35
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=35
slurmctld: error: Invalid assoc_ptr for jobid=36
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=36
slurmctld: error: Invalid assoc_ptr for jobid=37
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=37
slurmctld: error: Invalid assoc_ptr for jobid=38
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: debug2: _find_assoc_rec: no associations added yet
slurmctld: _validate_job_assoc: invalid account or partition for uid=389801 jobid=38
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: cons_res: select_p_reconfigure
slurmctld: cons_res: select_p_node_init
slurmctld: cons_res: preparing for 2 partitions
slurmctld: Running as primary controller
slurmctld: Registering slurmctld at port 6817 with slurmdbd.
slurmctld: error: slurmdbd: Issue with call DBD_REGISTER_CTLD(1434): 4294967295(This cluster hasn't been added to accounting yet)
slurmctld: fatal: You need to add this cluster to accounting if you want to enforce associations, or no jobs will ever run.


If I understand correctly, the error arises because the production cluster (from which the database was dumped) and the test cluster (into which it was imported) have different names. Or is there something else?

We were pretty confident that everything was fine before the upgrade because SLURM was not complaining after we imported the database from the production cluster. So, what is the best procedure to replicate a database on a testbed?

Thank you in advance for your help.

Davide
Comment 1 Danny Auble 2016-01-18 09:37:22 MST
Hey Davide, I am guessing the name of the cluster on your test machine is different from the one on your production machine.  Change that (ClusterName) in your slurm.conf, restart, and see if things start working correctly.  I would also verify that the test machine is talking to the new slurmdbd.  If things don't work right away after switching the cluster name, please send the slurmdbd.log file from the startup.
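
The check suggested above could be scripted roughly as follows. This is only a sketch, not something from the ticket: the helper names cluster_name_from_conf and cluster_is_registered are hypothetical, and it assumes the accounting database lists its clusters one per line when queried with sacctmgr. The comparison is exact, so a difference in letter case counts as a mismatch.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Slurm) for comparing the ClusterName in
# slurm.conf with the cluster names known to the accounting database.

# Print the ClusterName value from the slurm.conf file given as $1.
# Comments are stripped first; the key is matched case-insensitively.
cluster_name_from_conf() {
    sed 's/#.*//' "$1" | awk -F= '
        tolower($1) ~ /^[ \t]*clustername[ \t]*$/ {
            gsub(/[ \t]/, "", $2)
            name = $2
        }
        END { if (name != "") print name }'
}

# Succeed if the cluster name $1 appears on stdin
# (one registered cluster name per line, exact match).
cluster_is_registered() {
    grep -qxF -- "$1"
}

# Possible usage on the test controller (path taken from the log above):
#   conf=/usr/scheduler/slurm-15.08.6/etc/slurm.conf
#   sacctmgr -n -P list cluster format=cluster \
#       | cluster_is_registered "$(cluster_name_from_conf "$conf")" \
#       || echo "ClusterName from $conf is not in accounting"
```

If the check fails on a testbed loaded from a production dump, that matches the situation in this ticket: the database only knows the production cluster name, so slurmctld started under a different ClusterName cannot register.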
Comment 2 Davide Vanzo 2016-01-27 00:30:47 MST
Hi Danny,
thanks for the hint. Everything worked fine.

Have a great day!

Davide


(In reply to Danny Auble from comment #1)
> Hey Davide, I am guessing the name of cluster on your test machine is
> different than your production.  Change that in your slurm.conf and restart
> and see if things start working correctly.  I would also verify the test
> machine is talking to the new slurmdbd.  If things don't work right away
> after switching the cluster name please send the slurmdbd.log file from the
> startup.