Ticket 3302

Summary: Lost state during upgrade
Product: Slurm    Reporter: Deb Crocker <crock>
Component: Other    Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: alex
Version: 15.08.12   
Hardware: Linux   
OS: Linux   
Site: University of Alabama Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: slurm.conf
included headnode.conf
included nodenames.con
included partitions.conf
slurmdbd.conf
slurmctld.log
Final slurmctld.log

Description Deb Crocker 2016-12-01 07:57:25 MST
During an upgrade to 15.08.12 from 15.08.10 we lost our state files. There were no queued jobs. Is there any way to recover the state? We are running with the accounting database.
Comment 1 Alejandro Sanchez 2016-12-01 08:20:36 MST
Hi Deb,

- Have you lost all state files or just a few?
- Can you list the contents of the state file?
- Did you lose them because you accidentally removed them?
- When you say there were no queued jobs, you mean before the upgrade or after?

If you still have the *_state.old files, you can recover from them: copy the *_state.old files to a temporary directory outside StateSaveLocation, stop the daemons, then move each copy back into StateSaveLocation renamed from *_state.old to *_state. When you start the daemons again, they will read these *_state files. Each time state is saved, the saving process first writes to a *_state.new file, rotates *_state to *_state.old, rotates *_state.new to *_state, and finally unlinks *_state.new. But if none of the state files were preserved, I can't come up with a way to recover the state unless you have a backup of the filesystem.
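The restore procedure above can be sketched as a small shell helper. This is a hedged illustration: the function name is made up, and the directory passed in stands for your StateSaveLocation; run it only while slurmctld and slurmdbd are stopped.

```shell
# Illustrative helper: promote *_state.old files back to *_state.
# restore_state_files is a made-up name; the argument is your
# StateSaveLocation. Run only with the Slurm daemons stopped.
restore_state_files() {
    state_dir="$1"
    tmp_dir="$(mktemp -d)"

    # 1. Copy the .old files aside before touching anything.
    cp "$state_dir"/*_state.old "$tmp_dir"/ || return 1

    # 2. Promote each copy to *_state in StateSaveLocation.
    for f in "$tmp_dir"/*_state.old; do
        base="$(basename "$f" .old)"   # e.g. job_state.old -> job_state
        mv "$f" "$state_dir/$base"
    done
    rm -rf "$tmp_dir"
}

# Usage sketch (daemon commands depend on your init system):
#   systemctl stop slurmctld slurmdbd
#   restore_state_files /path/to/StateSaveLocation
#   systemctl start slurmdbd slurmctld
```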
Comment 2 Alejandro Sanchez 2016-12-01 08:21:15 MST
Second question is:

- Can you list the contents of the state file _directory_*?
Comment 3 Deb Crocker 2016-12-01 08:29:41 MST
At this point I'm not sure. Files were there in slurm.state, but since squeue showed nothing, I assumed they had been initialized. It may actually have been /etc/slurm.

I had a backup of that and restored but now slurmdbd gets this error:

[2016-12-01T09:29:14.787] debug:  slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
[2016-12-01T09:29:14.787] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Comment 4 Alejandro Sanchez 2016-12-01 08:37:18 MST
May I see the full slurmdbd.log, slurmctld.log and your slurm.conf?
Comment 5 Deb Crocker 2016-12-01 08:43:14 MST
Created attachment 3782 [details]
slurm.conf

slurm.conf has several includes. I will add those, too.
Comment 6 Deb Crocker 2016-12-01 08:44:09 MST
Created attachment 3783 [details]
included headnode.conf
Comment 7 Deb Crocker 2016-12-01 08:44:29 MST
Created attachment 3784 [details]
included nodenames.con
Comment 8 Deb Crocker 2016-12-01 08:44:51 MST
Created attachment 3785 [details]
included partitions.conf
Comment 9 Deb Crocker 2016-12-01 08:46:31 MST
We didn't have many jobs running, so it may be sufficient now to clean out and restart. Note that slurmctld was up and some jobs got submitted before I could close nodes.


[2016-12-01T09:38:40.438] debug:  slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
[2016-12-01T09:38:40.438] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
[2016-12-01T09:38:45.000] debug:  slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
[2016-12-01T09:38:45.439] debug:  slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
[2016-12-01T09:38:45.439] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
[2016-12-01T09:38:50.439] debug:  slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
[2016-12-01T09:38:50.439] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
[2016-12-01T09:38:55.000] debug:  slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
[2016-12-01T09:38:55.440] debug:  slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
[2016-12-01T09:38:55.440] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
[2016-12-01T09:38:56.744] Terminate signal (SIGINT or SIGTERM) received
[2016-12-01T09:38:56.744] debug:  sched: slurmctld terminating
[2016-12-01T09:38:56.787] Saving all slurm state
[2016-12-01T09:38:56.792] slurmdbd: saved 122 pending RPCs
[2016-12-01T09:38:56.892] debug:  slurmdbd: Sent fini msg
[2016-12-01T09:38:56.892] debug:  layouts/base: dumping 88 records into state file
[2016-12-01T09:38:56.893] layouts: all layouts are now unloaded.
Comment 10 Deb Crocker 2016-12-01 08:49:47 MST
Is the state directory mainly important for queued jobs? Because there weren't any of those.
Comment 11 Deb Crocker 2016-12-01 08:52:12 MST
My next observation is that the running jobs have been killed off. Nothing that was running on the cluster is there now. At this point we just need to get the parts running again.
Comment 12 Alejandro Sanchez 2016-12-01 09:03:01 MST
(In reply to Deb Crocker from comment #10)
> Is the state directory really most important for queued jobs because there
> were not any of those.

Definitely it is. When you stop the daemons, all the information about pending and running jobs has to be saved somewhere, and that is exactly the purpose of the state files.
Comment 13 Deb Crocker 2016-12-01 09:15:36 MST
Another question - should slurmd be running on the master node? It is not right now.
Comment 14 Alejandro Sanchez 2016-12-01 09:34:31 MST
(In reply to Deb Crocker from comment #13)
> Another question - should slurmd be running on the master node? It is not
> right now.

The typical configuration is to run one slurmd on every compute node. Another option is to build Slurm with multiple-slurmd support; then you can run several slurmd instances on the same machine that slurmctld runs on.
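For reference, a multiple-slurmd setup also needs each instance to have its own node entry and a distinct port in slurm.conf. The fragment below is a hypothetical sketch, not taken from this site's configuration:

```
# Hypothetical slurm.conf fragment for a multiple-slurmd build:
# each slurmd instance gets its own NodeName and a distinct Port.
NodeName=node0 NodeHostname=uahpc Port=17001
NodeName=node1 NodeHostname=uahpc Port=17002
```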
Comment 15 Deb Crocker 2016-12-01 09:35:26 MST
I just noticed that this was supposed to be submitted at the highest priority. That did not get set. We're down at the moment. 

Any ideas as to what is supposed to be listening on port 6819?
Comment 16 Alejandro Sanchez 2016-12-01 09:39:59 MST
(In reply to Deb Crocker from comment #15)
> I just noticed that this was supposed to be submitted at the highest
> priority. That did not get set. We're down at the moment. 
> 
> Any ideas as to what is supposed to be listening on port 6819?

In your slurm.conf you have:

SlurmctldPort=6817
SlurmdPort=6818

Execute this on the uahpc node to see what is listening on that port:
$ sudo netstat -tlnp | grep 6819
Comment 17 Alejandro Sanchez 2016-12-01 09:41:10 MST
And please attach your full slurmctld.log and slurmdbd.log, as well as your slurmdbd.conf.
Comment 18 Deb Crocker 2016-12-01 09:46:10 MST
Created attachment 3786 [details]
slurmdbd.conf
Comment 19 Deb Crocker 2016-12-01 09:47:02 MST
Created attachment 3787 [details]
slurmctld.log
Comment 20 Deb Crocker 2016-12-01 09:47:12 MST
Nothing is listening on port 6819.

slurmdbd.log has rolled over several times and is empty
Comment 21 Alejandro Sanchez 2016-12-01 09:48:40 MST
I also see DefaultStorageType=slurmdbd in headnode.conf. This is incorrect, AccountingStorageType=accounting_storage/slurmdbd and the Accounting* parameters should be used instead. DefaultStorageType doesn't accept slurmdbd.
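For reference, a minimal sketch of how the accounting parameters are usually written in slurm.conf. The host and port below are taken from this ticket; treat the fragment as an example rather than this site's exact configuration:

```
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=uahpc
AccountingStoragePort=6819
```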
Comment 22 Deb Crocker 2016-12-01 09:50:37 MST
Is this new for 15.08.12? This was the working config prior to the upgrade.
Comment 23 Deb Crocker 2016-12-01 09:56:38 MST
Here is debug output from slurmdbd:

[root@uahpc slurm]# slurmdbd -D -vvv
slurmdbd: debug:  auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'slurm'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'slurm'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'slurm'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'slurm'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'slurm'@'localhost' (using password: YES)
slurmdbd: error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
Comment 24 Alejandro Sanchez 2016-12-01 09:57:46 MST
You have DbdPort=6819 in your slurmdbd.conf, so slurmctld expects to talk to slurmdbd through this port.
Comment 25 Deb Crocker 2016-12-01 10:00:00 MST
Why is slurmdbd complaining that it can't connect to that port if it is supposed to be listening for slurmctld there:

slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
Comment 26 Deb Crocker 2016-12-01 10:02:15 MST
So this is the problem?

slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'slurm'@'localhost' (using password: YES)
Comment 27 Alejandro Sanchez 2016-12-01 10:03:01 MST
Is MySQL up and running? Does the user 'slurm' have access to the database slurm_acct_db?

The messages you see come from slurmctld; it is unable to contact slurmdbd because slurmdbd cannot access the MySQL database.
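If the grants are the problem, they can be realigned with the credentials in slurmdbd.conf. The statements below are a hedged sketch for a MySQL 5.x server; the password is a placeholder that must match StoragePass in slurmdbd.conf:

```
-- Run in the MySQL client as an administrative user.
-- 'slurmdbd_password_here' is a placeholder.
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost'
    IDENTIFIED BY 'slurmdbd_password_here';
FLUSH PRIVILEGES;
```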
Comment 28 Deb Crocker 2016-12-01 10:07:01 MST
I reset the password to the value in slurmdbd.conf and slurmdbd now starts.
Comment 29 Alejandro Sanchez 2016-12-01 10:09:08 MST
(In reply to Deb Crocker from comment #28)
> reset password to value in slurmdbd.conf and slurmdbd now starts.

Good. Can you attach your slurmdbd.conf and slurmctld.log again? We've solved part of the problem but not all.
Comment 30 Alejandro Sanchez 2016-12-01 10:10:27 MST
Sorry, I meant slurmdbd.log, not the conf; we already have your slurmdbd.conf.
Comment 31 Deb Crocker 2016-12-01 10:16:29 MST
The only thing in slurmdbd.log is what I sent earlier when I ran it in debug mode. I will attach the latest slurmctld.log.

I have updated the AccountingStorageType, although there was never a complaint about the value we had before.
Comment 32 Alejandro Sanchez 2016-12-01 10:17:54 MST
(In reply to Deb Crocker from comment #31)
> The only thing in slurmbdb.log is what I sent early when I ran it in debug
> mode. I will attach the latest slurmctld.log

But after you fixed the password, do you still see the MySQL errors?

> I have updated the AccountingStorageType - although there has never been a
> complaint on the value we had when starting.

No need for that. Sorry, DefaultStorageType also accepts slurmdbd; it's just not documented in the man page.
Comment 33 Deb Crocker 2016-12-01 10:22:15 MST
I don't see any MySQL errors in slurmctld.log, and slurmdbd.log must not be set to show debugging.

There were errors when I updated AccountingStorageType, so I have set that back.
Comment 34 Deb Crocker 2016-12-01 10:24:04 MST
So along with mysql credential change I have this: 

[2016-12-01T11:22:28.598] error: Munge decode failed: Invalid credential
[2016-12-01T11:22:28.598] error: authentication: Invalid credential 
[2016-12-01T11:22:28.598] error: slurm_receive_msgs: Protocol authentication error
[2016-12-01T11:22:28.598] error: Munge decode failed: Invalid credential
[2016-12-01T11:22:28.598] error: authentication: Invalid credential 
[2016-12-01T11:22:28.598] error: slurm_receive_msgs: Protocol authentication error
[2016-12-01T11:22:28.598] error: Munge decode failed: Invalid credential
[2016-12-01T11:22:28.598] error: Munge decode failed: Invalid credential
[2016-12-01T11:22:28.598] error: authentication: Invalid credential 
[2016-12-01T11:22:28.598] error: slurm_receive_msgs: Protocol authentication error
[2016-12-01T11:22:28.598] error: Munge decode failed: Invalid credential
[2016-12-01T11:22:28.598] error: authentication: Invalid credential 
[2016-12-01T11:22:28.598] error: slurm_receive_msgs: Protocol authentication error
[2016-12-01T11:22:28.598] error: authentication: Invalid credential 
[2016-12-01T11:22:28.598] error: slurm_receive_msgs: Protocol authentication error

Only about half of my nodes are responding. How do we clear this up?
Comment 35 Alejandro Sanchez 2016-12-01 10:26:31 MST
The munge key must be identical on the master and the compute nodes. The error is solved by copying /etc/munge/munge.key from the master to the compute nodes.
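A quick way to confirm two hosts share the same key is to compare the files byte for byte. The helper name below is illustrative, and the paths and host in the usage sketch are assumptions:

```shell
# Illustrative check that two copies of the munge key are identical.
munge_keys_match() {
    cmp -s "$1" "$2"    # exit 0 only if the files are byte-identical
}

# Usage sketch (paths and host are assumptions):
#   scp compute-0-0:/etc/munge/munge.key /tmp/compute-0-0.key
#   munge_keys_match /etc/munge/munge.key /tmp/compute-0-0.key && echo "in sync"
```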
Comment 36 Alejandro Sanchez 2016-12-01 10:29:47 MST
Also please make sure the clocks are synced across the cluster.
Comment 37 Deb Crocker 2016-12-01 10:32:53 MST
Times are good. I copied key and restarted munge. Nodes seem all good now.
Comment 38 Alejandro Sanchez 2016-12-01 10:34:57 MST
(In reply to Deb Crocker from comment #37)
> Times are good. I copied key and restarted munge. Nodes seem all good now.

Can 'sacctmgr' talk to slurmdbd? And what is the output of 'squeue' and 'sinfo' now?
Comment 39 Deb Crocker 2016-12-01 10:35:23 MST
I'll upload a final slurmctld.log
Comment 40 Deb Crocker 2016-12-01 10:36:31 MST
sacctmgr can talk to it. My MySQL data seems fine (the accounting fair-share tree is all there). squeue and sinfo look good.
Comment 41 Alejandro Sanchez 2016-12-01 10:39:21 MST
(In reply to Deb Crocker from comment #40)
> sacctmgr can talk. My mysql data seems fine (the accounting fair-share tree
> is all there). squeue and sinfo look good

What about the lost jobs from the state files? Do you see them in the queue?
Comment 42 Deb Crocker 2016-12-01 10:39:54 MST
Created attachment 3789 [details]
Final slurmctld.log
Comment 43 Deb Crocker 2016-12-01 10:40:53 MST
Running jobs had died and I don't think there were any queued jobs. Nothing in the queue right now. 

Looking over some stuff to see if I can tell users to come back in.
Comment 44 Alejandro Sanchez 2016-12-01 10:49:02 MST
(In reply to Deb Crocker from comment #43)
> Running jobs had died and I don't think there were any queued jobs. Nothing
> in the queue right now. 

My guess is that's because of the state files removal.

> Looking over some stuff to see if I can tell users to come back in.

Ok. Looking at your slurmctld.log, it appears you have a lot of verbosity in the logging. I just rechecked your slurm.conf and you have:

SlurmctldDebug=5
DebugFlags=Priority,NO_CONF_HASH,backfill,BackfillMap

Unless you are consciously debugging something specific, you can reset SlurmctldDebug to info, which is the default for production, and perhaps remove the DebugFlags you are not interested in logging.
Comment 45 Alejandro Sanchez 2016-12-01 11:03:32 MST
Deb - looking at your first slurmctld.log, we've noticed that slurm.conf had somehow been messed up:

[2016-12-01T08:49:28.328] slurmctld version 15.08.12 started on cluster uahpc
[2016-12-01T08:49:29.038] layouts: no layout to initialize
[2016-12-01T08:49:29.048] layouts: loading entities/relations information
[2016-12-01T08:49:29.048] error: find_node_record: lookup failure for compute-0-0
[2016-12-01T08:49:29.048] error: Node compute-0-0 has vanished from configuration
[2016-12-01T08:49:29.048] error: find_node_record: lookup failure for compute-0-1
[2016-12-01T08:49:29.048] error: Node compute-0-1 has vanished from configuration
[...]
and so on with the nodes.
[...]
[2016-12-01T08:49:29.050] cons_res: preparing for 2 partitions
[2016-12-01T08:49:29.050] error: Invalid partition (owners) for job 43330
[2016-12-01T08:49:29.050] error: find_node_record: lookup failure for compute-4-2
[2016-12-01T08:49:29.050] error: node_name2bitmap: invalid node specified compute-4-2
[2016-12-01T08:49:29.050] error: find_node_record: lookup failure for compute-4-2
[2016-12-01T08:49:29.050] error: node_name2bitmap: invalid node specified compute-4-2
[...]
So because slurm.conf was messed up it killed all jobs.

Looks like slurm.conf was fixed at:
[2016-12-01T09:03:21.550] slurmctld version 15.08.12 started on cluster uahpc

but the damage had already been done. There is no recovery after that:
[2016-12-01T09:03:25.317] debug: validate_node_specs: node compute-3-7 registered with 0 jobs
[2016-12-01T09:03:25.317] debug: validate_node_specs: node compute-4-12 registered with 0 jobs
[2016-12-01T09:03:25.317] debug: validate_node_specs: node compute-3-12 registered with 0 jobs

and the nodes registered with 0 jobs. So this no longer looks like a state-file removal issue, but rather a bad slurm.conf change.
Comment 46 Alejandro Sanchez 2016-12-01 11:04:42 MST
Switching severity to 3 now. Please, let us know how it goes short-term with the cluster.
Comment 47 Deb Crocker 2016-12-01 11:07:00 MST
Yes, my setup got overwritten. I did have a backup of /etc/slurm from before the upgrade which I put back in place.
Comment 48 Alejandro Sanchez 2016-12-05 03:34:31 MST
Hi Deb - do you need any more assistance regarding this bug?
Comment 49 Deb Crocker 2016-12-05 09:31:09 MST
No, we're okay now.
Comment 50 Alejandro Sanchez 2016-12-05 09:38:46 MST
All right. Marking as resolved/infogiven.