| Summary: | Lost state during upgrade | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Deb Crocker <crock> |
| Component: | Other | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | alex |
| Version: | 15.08.12 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | University of Alabama | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, included headnode.conf, included nodenames.con, included partitions.conf, slurmdbd.conf, slurmctld.log, Final slurmctld.log | | |
Description
Deb Crocker
2016-12-01 07:57:25 MST
Hi Deb,

- Have you lost all state files or just a few?
- Can you list the contents of the state file?
- Did you lose them because you accidentally removed them?
- When you say there were no queued jobs, do you mean before the upgrade or after?

If you still have the *_state.old files, you could copy them to a temp directory outside StateSaveLocation, then stop the daemons and mv the *_state.old files from the temp dir back into StateSaveLocation as *_state. When you start the daemons again, these *_state files are what they will read.

Every time state is saved, the process is: write to a *_state.new file first, rotate *_state to *_state.old, rotate *_state.new to *_state, and finally unlink *_state.new. But if you don't have any of the state files preserved, I can't come up with a way to recover the state unless you have a backup of the filesystem.

A correction to the second question above: can you list the contents of the state file _directory_?

At this point I'm not sure. Files were there in slurm.state, but since squeue showed nothing I was assuming they had been initialized. It may have been that it was actually /etc/slurm. I had a backup of that and restored it, but now slurmdbd gets this error:

    [2016-12-01T09:29:14.787] debug: slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
    [2016-12-01T09:29:14.787] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused

May I see the full slurmdbd.log, slurmctld.log and your slurm.conf?

Created attachment 3782 [details]
slurm.conf
slurm.conf has several includes. I will add those, too.
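The save/rotate sequence described above (write *_state.new, rotate *_state to *_state.old, rotate *_state.new to *_state, then unlink *_state.new) can be simulated in a scratch directory. This is only a sketch: the directory and file contents below are stand-ins, not the real files under StateSaveLocation.

```shell
#!/bin/sh
# Simulate Slurm's state-save rotation in a scratch directory.
# "state_demo" and the file contents are hypothetical; real state
# files (job_state, node_state, ...) live under StateSaveLocation.
set -e
mkdir -p state_demo
echo "previous state" > state_demo/job_state       # state from the last save
echo "current state" > state_demo/job_state.new    # 1. write fresh state to *_state.new
mv state_demo/job_state state_demo/job_state.old   # 2. rotate *_state -> *_state.old
mv state_demo/job_state.new state_demo/job_state   # 3. rotate *_state.new -> *_state
# 4. *_state.new no longer exists after the rename (the "unlink" step)
cat state_demo/job_state
```

The recovery procedure described in the comment is the inverse: with the daemons stopped, move job_state.old back to job_state before restarting.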
Created attachment 3783 [details]
included headnode.conf
Created attachment 3784 [details]
included nodenames.con
Created attachment 3785 [details]
included partitions.conf
We didn't have many jobs running, so it may be sufficient now to clean out and restart. Note that slurmctld was up and some jobs got submitted before I could close nodes.

    [2016-12-01T09:38:40.438] debug: slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
    [2016-12-01T09:38:40.438] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
    [2016-12-01T09:38:45.000] debug: slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
    [2016-12-01T09:38:45.439] debug: slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
    [2016-12-01T09:38:45.439] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
    [2016-12-01T09:38:50.439] debug: slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
    [2016-12-01T09:38:50.439] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
    [2016-12-01T09:38:55.000] debug: slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
    [2016-12-01T09:38:55.440] debug: slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused
    [2016-12-01T09:38:55.440] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
    [2016-12-01T09:38:56.744] Terminate signal (SIGINT or SIGTERM) received
    [2016-12-01T09:38:56.744] debug: sched: slurmctld terminating
    [2016-12-01T09:38:56.787] Saving all slurm state
    [2016-12-01T09:38:56.792] slurmdbd: saved 122 pending RPCs
    [2016-12-01T09:38:56.892] debug: slurmdbd: Sent fini msg
    [2016-12-01T09:38:56.892] debug: layouts/base: dumping 88 records into state file
    [2016-12-01T09:38:56.893] layouts: all layouts are now unloaded.

Is the state directory really most important for queued jobs? Because there were not any of those.

My next observation is that the running jobs have been killed off. Nothing that was running on the cluster is there now. At this point we just need to get the parts running again.

(In reply to Deb Crocker from comment #10)
> Is the state directory really most important for queued jobs because there
> were not any of those.
Definitely it is. When you stop the daemons, all the information about pending and running jobs has to be saved somewhere, and that's exactly the purpose of the state files.

Another question - should slurmd be running on the master node? It is not right now.

(In reply to Deb Crocker from comment #13)
> Another question - should slurmd be running on the master node? It is not
> right now.

The typical configuration is to run one slurmd on every compute node. Another option is to configure slurm --with-multiple-slurmd; then you can run different instances of slurmd on the same machine that slurmctld runs on.

I just noticed that this was supposed to be submitted at the highest priority. That did not get set. We're down at the moment.

Any ideas as to what is supposed to be listening on port 6819?

(In reply to Deb Crocker from comment #15)
> I just noticed that this was supposed to be submitted at the highest
> priority. That did not get set. We're down at the moment.
>
> Any ideas as to what is supposed to be listening on port 6819?

In your slurm.conf you have:

    SlurmctldPort=6817
    SlurmdPort=6818

Execute this on the uahpc node and see who is listening on that port:

    $ sudo netstat -tlnp | grep 6819

And please attach your full slurmctld.log and slurmdbd.log, as well as your slurmdbd.conf.

Created attachment 3786 [details]
slurmdbd.conf
Created attachment 3787 [details]
slurmctld.log
Nothing is listening on port 6819. slurmdbd.log has rolled over several times and is empty.

I also see DefaultStorageType=slurmdbd in headnode.conf. This is incorrect; AccountingStorageType=accounting_storage/slurmdbd and the Accounting* parameters should be used instead. DefaultStorageType doesn't accept slurmdbd.

Is this new for 15.08.12? This was the working config prior to this. Here is debug output from slurmdbd:

    [root@uahpc slurm]# slurmdbd -D -vvv
    slurmdbd: debug: auth plugin for Munge (http://code.google.com/p/munge/) loaded
    slurmdbd: debug2: mysql_connect() called for db slurm_acct_db
    slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'slurm'@'localhost' (using password: YES)
    slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
    slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'slurm'@'localhost' (using password: YES)
    slurmdbd: error: The database must be up when starting the MYSQL plugin. Trying again in 5 seconds.
    [...]

You have DbdPort=6819 in your slurmdbd.conf, so slurmctld expects to talk to slurmdbd through this port.

Why is slurmdbd complaining that it can't connect to that port if it is supposed to be listening for slurmctld there?

    slurmdbd: slurm_open_msg_conn to uahpc:6819: Connection refused

So this is the problem?
    slurmdbd: error: mysql_real_connect failed: 1045 Access denied for user 'slurm'@'localhost' (using password: YES)

Is MySQL up and running? Does the user 'slurm' have access to the database slurm_acct_db? The messages you see come from slurmctld, about it being unable to contact slurmdbd, because slurmdbd is not able to access the mysql database.

Reset password to the value in slurmdbd.conf and slurmdbd now starts.

(In reply to Deb Crocker from comment #28)
> reset password to value in slurmdbd.conf and slurmdbd now starts.

Good. Can you attach your slurmdbd.conf and slurmctld.log again? We've solved part of the problem but not all. Sorry, slurmdbd.log*, not the conf; we already have it.

The only thing in slurmdbd.log is what I sent earlier when I ran it in debug mode. I will attach the latest slurmctld.log. I have updated the AccountingStorageType - although there has never been a complaint on the value we had when starting.

(In reply to Deb Crocker from comment #31)
> The only thing in slurmbdb.log is what I sent early when I ran it in debug
> mode. I will attach the latest slurmctld.log

But after you fixed the password, do you still see the MySQL errors?

> I have updated the AccountingStorageType - although there has never been a
> complaint on the value we had when starting.

No need for that. Sorry, DefaultStorageType also accepts slurmdbd; it's just not documented in the man page.

I don't see any mysql errors in slurmctld.log, and slurmdbd.log must not be set to show debugging. There were errors when I updated AccountingStorageType so I have set that back.
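For reference, an "Access denied" error like the one above is typically cleared by making the MySQL grant for the 'slurm' user match the StoragePass in slurmdbd.conf. This is a hypothetical sketch using MySQL 5.x syntax (era-appropriate for Slurm 15.08); 'your_storagepass' is a placeholder, not the site's actual value:

```sql
-- Hypothetical sketch (run as the MySQL root user).
-- 'your_storagepass' stands in for the StoragePass value in slurmdbd.conf.
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY 'your_storagepass';
FLUSH PRIVILEGES;
```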
So along with the mysql credential change I have this:

    [2016-12-01T11:22:28.598] error: Munge decode failed: Invalid credential
    [2016-12-01T11:22:28.598] error: authentication: Invalid credential
    [2016-12-01T11:22:28.598] error: slurm_receive_msgs: Protocol authentication error
    [...]

Only about half of my nodes are responding. How do we clear this up?

The munge key must be identical on both the master and the compute nodes. Anyway, the error is solved by copying it from /etc/munge/something.key on the master to the compute nodes. Also please make sure the clocks are synced across the cluster.

Times are good. I copied the key and restarted munge. Nodes seem all good now.

(In reply to Deb Crocker from comment #37)
> Times are good. I copied key and restarted munge. Nodes seem all good now.

Can 'sacctmgr' talk to slurmdbd? And what's the output of 'squeue' and 'sinfo' now?

I'll upload a final slurmctld.log. sacctmgr can talk. My mysql data seems fine (the accounting fair-share tree is all there). squeue and sinfo look good.

(In reply to Deb Crocker from comment #40)
> sacctmgr can talk.
> My mysql data seems fine (the accounting fair-share tree is all there).
> squeue and sinfo look good.

What about the lost jobs from the state files? Do you see them in the queue?

Created attachment 3789 [details]
Final slurmctld.log
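On the munge issue resolved above: the key must be byte-identical on every node, and a quick way to verify that before restarting munge is to compare checksums of the key file on each host. Sketched here against scratch files standing in for /etc/munge/munge.key on two hosts; in practice you would run cksum (or md5sum) over ssh on each node.

```shell
#!/bin/sh
# Compare munge key copies by checksum. The demo files stand in for
# /etc/munge/munge.key on the master and a compute node.
set -e
mkdir -p munge_demo
printf 'the-shared-munge-key' > munge_demo/master.key
printf 'the-shared-munge-key' > munge_demo/node.key
master_sum=$(cksum < munge_demo/master.key)
node_sum=$(cksum < munge_demo/node.key)
if [ "$master_sum" = "$node_sum" ]; then
    echo "keys match"
else
    echo "keys differ"
fi
```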
Running jobs had died, and I don't think there were any queued jobs. Nothing in the queue right now.

Looking over some stuff to see if I can tell users to come back in.

(In reply to Deb Crocker from comment #43)
> Running jobs had died and I don't think there were any queued jobs. Nothing
> in the queue right now.

My guess is that's because of the state files removal.

> Looking over some stuff to see if I can tell users to come back in.

Ok. Looking at your slurmctld.log, it appears you have a lot of verbosity in the amount of logs being logged. I just rechecked your slurm.conf and you have:

    SlurmctldDebug=5
    DebugFlags=Priority,NO_CONF_HASH,backfill,BackfillMap

Unless you are consciously debugging specific things, you can reset SlurmctldDebug to info, which is the default for production, and maybe remove some of the DebugFlags you're not interested in having logged.

Deb - looking at your first slurmctld.log, we noticed that slurm.conf had somehow been messed up:

    [2016-12-01T08:49:28.328] slurmctld version 15.08.12 started on cluster uahpc
    [2016-12-01T08:49:29.038] layouts: no layout to initialize
    [2016-12-01T08:49:29.048] layouts: loading entities/relations information
    [2016-12-01T08:49:29.048] error: find_node_record: lookup failure for compute-0-0
    [2016-12-01T08:49:29.048] error: Node compute-0-0 has vanished from configuration
    [2016-12-01T08:49:29.048] error: find_node_record: lookup failure for compute-0-1
    [2016-12-01T08:49:29.048] error: Node compute-0-1 has vanished from configuration
    [...] and so on with the nodes. [...]
    [2016-12-01T08:49:29.050] cons_res: preparing for 2 partitions
    [2016-12-01T08:49:29.050] error: Invalid partition (owners) for job 43330
    [2016-12-01T08:49:29.050] error: find_node_record: lookup failure for compute-4-2
    [2016-12-01T08:49:29.050] error: node_name2bitmap: invalid node specified compute-4-2
    [2016-12-01T08:49:29.050] error: find_node_record: lookup failure for compute-4-2
    [2016-12-01T08:49:29.050] error: node_name2bitmap: invalid node specified compute-4-2
    [...]

So because slurm.conf was messed up, it killed all the jobs. It looks like slurm.conf was fixed at:

    [2016-12-01T09:03:21.550] slurmctld version 15.08.12 started on cluster uahpc

but the damage had already been done. There is no recovery after that:

    [2016-12-01T09:03:25.317] debug: validate_node_specs: node compute-3-7 registered with 0 jobs
    [2016-12-01T09:03:25.317] debug: validate_node_specs: node compute-4-12 registered with 0 jobs
    [2016-12-01T09:03:25.317] debug: validate_node_specs: node compute-3-12 registered with 0 jobs

and the nodes registered with 0 jobs. So it no longer looks like a state-files-removal issue, but rather a bad slurm.conf change. Switching severity to 3 now. Please let us know how it goes short-term with the cluster.

Yes, my setup got overwritten. I did have a backup of /etc/slurm from before the upgrade, which I put back in place.

Hi Deb - do you need any more assistance regarding this bug?

No, we're okay now.

All right. Marking as resolved/infogiven.
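For reference, the production logging settings suggested in the ticket would look like this as a slurm.conf fragment (a sketch; the single retained DebugFlags value is only an example):

```
# slurm.conf fragment (sketch)
SlurmctldDebug=info          # production default; was 5 in the attached config
# Keep only flags actively being investigated, e.g.:
# DebugFlags=Backfill
```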