Hi Support,

When I updated one of our clusters (after successfully updating others), all job states were lost. Log excerpt below. Two questions:

1. How can we recover these job states?
2. Why did it happen, and how can we avoid it in the future?

[2017-02-21T15:19:04.368] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:04.372] error: unpackmem_ptr: Buffer to be unpacked is too large (1952984942 > 67108864)
[2017-02-21T15:19:04.372] error: slurm_receive_msg: Invalid Protocol Version 7680 from uid=-1 at 172.16.250.121:43092
[2017-02-21T15:19:04.372] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:04.382] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:04.415] error: unpackmem_ptr: Buffer to be unpacked is too large (1952984942 > 67108864)
[2017-02-21T15:19:04.415] error: slurm_receive_msg: Invalid Protocol Version 7680 from uid=-1 at 172.16.7.136:56831
[2017-02-21T15:19:04.415] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:04.425] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:04.593] error: unpackmem_ptr: Buffer to be unpacked is too large (1952984942 > 67108864)
[2017-02-21T15:19:04.593] error: slurm_receive_msg: Invalid Protocol Version 7680 from uid=-1 at 172.16.5.3:40283
[2017-02-21T15:19:04.593] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:04.603] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:04.658] job_complete: JobID=23711916_12(23712866) State=0x1 NodeCnt=1 WEXITSTATUS 0
[2017-02-21T15:19:04.658] job_complete: JobID=23711916_12(23712866) State=0x8003 NodeCnt=1 done
[2017-02-21T15:19:04.694] error: unpackmem_ptr: Buffer to be unpacked is too large (1952984942 > 67108864)
[2017-02-21T15:19:04.694] error: slurm_receive_msg: Invalid Protocol Version 7680 from uid=-1 at 172.16.250.154:48441
[2017-02-21T15:19:04.694] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:04.705] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:04.905] error: unpackmem_ptr: Buffer to be unpacked is too large (1952984942 > 67108864)
[2017-02-21T15:19:04.905] error: slurm_receive_msg: Invalid Protocol Version 7680 from uid=-1 at 172.16.250.30:47417
[2017-02-21T15:19:04.905] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:04.915] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:05.278] error: unpackmem_ptr: Buffer to be unpacked is too large (1952984942 > 67108864)
[2017-02-21T15:19:05.278] error: slurm_receive_msg: Invalid Protocol Version 7680 from uid=-1 at 172.16.7.9:54509
[2017-02-21T15:19:05.278] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:05.289] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:05.660] Terminate signal (SIGINT or SIGTERM) received
[2017-02-21T15:19:05.676] Saving all slurm state
[2017-02-21T15:19:06.510] layouts: all layouts are now unloaded.
[2017-02-21T15:19:06.960] slurmctld version 16.05.7 started on cluster perth
[2017-02-21T15:19:10.995] layouts: no layout to initialize
[2017-02-21T15:19:11.220] error: read_slurm_conf: default partition not set.
[2017-02-21T15:19:11.235] error: Unable to resolve "bud38": Unknown host
[2017-02-21T15:19:11.235] error: slurm_set_addr failure on bud38
[2017-02-21T15:19:11.294] layouts: loading entities/relations information
[2017-02-21T15:19:11.295] Recovered state of 625 nodes
[2017-02-21T15:19:11.296] error: we don't have select plugin type 4096
[2017-02-21T15:19:11.296] error: select_g_select_jobinfo_unpack: unpack error
[2017-02-21T15:19:11.296] error: Incomplete job record
[2017-02-21T15:19:11.296] error: Incomplete job state save file
[2017-02-21T15:19:11.296] Recovered information about 0 jobs
[2017-02-21T15:19:11.297] cons_res: select_p_node_init
[2017-02-21T15:19:11.297] cons_res: preparing for 150 partitions
[2017-02-21T15:19:11.298] Purged files for defunct batch job 23083128
[2017-02-21T15:19:11.311] Purged files for defunct batch job 23674818
[2017-02-21T15:19:11.318] Purged files for defunct batch job 23082938
[2017-02-21T15:19:11.326] Purged files for defunct batch job 23083028
[2017-02-21T15:19:11.334] Purged files for defunct batch job 23711388
[2017-02-21T15:19:11.334] Purged files for defunct batch job 23683308
[2017-02-21T15:19:11.339] Purged files for defunct batch job 23704308
[2017-02-21T15:19:11.348] Purged files for defunct batch job 23083068
[2017-02-21T15:19:11.353] Purged files for defunct batch job 23704328
[2017-02-21T15:19:11.357] Purged files for defunct batch job 23658568
[2017-02-21T15:19:11.359] Purged files for defunct batch job 23712888
[2017-02-21T15:19:11.359] Purged files for defunct batch job 23068258
Hi Paul. From/to which versions did you upgrade? It looks like the client commands and slurmctld differ from each other by more than two minor releases:

[2017-02-21T15:19:04.368] error: slurm_receive_msg: Incompatible versions of client and server code
[2017-02-21T15:19:04.372] error: slurm_receive_msg: Invalid Protocol Version 7680 from uid=-1 at 172.16.250.121:43092

Can you check the current version of all your components? Remember that Slurm daemons only support RPCs and state files from the two previous minor releases (e.g. a 16.05.x slurmdbd will support slurmctld daemons and commands at version 16.05.x, 15.08.x, or 14.11.x).
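For example, running each daemon and command with its version flag should make any mismatch obvious (check a sample of compute nodes for slurmd):

    # controller / dbd host
    slurmctld -V
    slurmdbd -V
    # client commands (login nodes)
    sinfo --version
    scontrol --version
    # on a few compute nodes
    slurmd -V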
14.11.10 to 16.05.7. Again, this same procedure worked fine on the other clusters, but failed on this one.
I'm going to state the obvious... BUT when it failed to read the state file, wouldn't it be a good time to exit and not kill every job on the cluster? This is obviously going to be a massive mess for our users to clean up.
What we need to know (and hopefully quickly) is... can we recover? We only have a backup of the Slurm database and NOT the state files (unless Slurm keeps backups somewhere). If we can't recover... then we might as well get on with life, restart Slurm on the new version, and clean up.
Slurm state files are located under StateSaveLocation, and Slurm doesn't make backups of that directory. I don't think you will be able to recover the state from before the upgrade. Can you please attach the full slurmctld.log? Did you upgrade slurmdbd first?
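For future upgrades, a minimal way to locate and snapshot that directory beforehand might look like this (the path shown is only an example; use whatever your config reports):

    scontrol show config | grep StateSaveLocation
    # e.g. StateSaveLocation = /var/spool/slurmctld  (site-specific)
    tar -czf /root/statesave-$(date +%F).tar.gz /var/spool/slurmctld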
Created attachment 4076 [details] slurmctrld.log Yes slurmdbd was upgraded first. Please see the attached log.
[2017-02-21T15:19:06.960] slurmctld version 16.05.7 started on cluster perth
...
[2017-02-21T15:19:11.296] error: we don't have select plugin type 4096
[2017-02-21T15:19:11.296] error: select_g_select_jobinfo_unpack: unpack error
[2017-02-21T15:19:11.296] error: Incomplete job record
[2017-02-21T15:19:11.296] error: Incomplete job state save file
[2017-02-21T15:19:11.296] Recovered information about 0 jobs
[2017-02-21T15:19:11.297] cons_res: select_p_node_init
[2017-02-21T15:19:11.297] cons_res: preparing for 150 partitions
[2017-02-21T15:19:11.298] Purged files for defunct batch job 23083128
...

Have you recently changed SelectType in slurm.conf?

[2017-02-21T15:19:11.296] error: we don't have select plugin type 4096

If so, it may be that the controller is using cons_res while the compute nodes were not restarted and are still using a different select type. I also see many messages about differing slurm.conf files across the cluster. Can you check that, as well as the versions of all slurmd's, slurmdbd, slurmctld, and the client commands?
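If you have a parallel shell such as pdsh available, a rough spot-check could look like this (node list and config path are placeholders for your site):

    # compare slurm.conf between the controller and the nodes
    md5sum /etc/slurm/slurm.conf
    pdsh -w node[001-625] md5sum /etc/slurm/slurm.conf | dshbak -c
    # check the slurmd version on every node
    pdsh -w node[001-625] slurmd -V | dshbak -c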
We have not made any plugin changes between version 14.11.10 and version 16.05.7. All of the slurm.conf version errors were due to the controller restarting into the new version with MemSpecLimit changing for all nodes. The compute nodes had not yet been restarted with the new config file (that was the next step after restarting the controller).
How did you install? Was it with RPMs, and if so, did you install the Slurm plugin RPMs as well? Can you check the dates of your select_*.so files? Can you also attach your slurm.conf? Where is PluginDir pointing?
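Something along these lines would show both (the lib path below is just an example; substitute whatever PluginDir reports):

    scontrol show config | grep -i PluginDir
    ls -l /usr/lib64/slurm/select_*.so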
Also, in which order did you upgrade the components? You said slurmdbd was first; and then? Did you perhaps upgrade the slurmd's before slurmctld? Have you followed the safe upgrade procedure described here: https://slurm.schedmd.com/quickstart_admin.html

This error doesn't make much sense, since we don't have any plugin with plugin id 4096, so I can't tell whether your state file was older than expected or corrupted:

[2017-02-21T15:19:11.296] error: we don't have select plugin type 4096

If we had a copy of the StateSaveLocation directory, as is recommended, we would have something to work from, but we don't, so it's difficult to tell exactly what happened. From here I'd recommend getting back to a stable production cluster: get all the components running the same version and make sure there is just one version of slurm.conf across the whole cluster.
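For reference, the documented order boils down to roughly this (assuming systemd units named slurmdbd, slurmctld and slurmd; adjust to your init system and install method):

    # 1. back up the accounting database and the StateSaveLocation directory
    # 2. upgrade the slurmdbd host first, then restart it
    systemctl restart slurmdbd
    # 3. upgrade the controller next, then restart it
    systemctl restart slurmctld
    # 4. finally upgrade and restart slurmd on the compute nodes
    systemctl restart slurmd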
Hi Alejandro,

I believe I have found the cause.

We link all of our daemons to a "latest" directory for Slurm on a network file system; "latest" then points to the current version. For slurmdbd I manually pointed the daemon at the new version's directory, but for slurmctld I changed "latest" to point to the newer version. I have done this before in the test environment without issue, but in this case it seems the errors from the clients not pointing to the proper version (the old bin, lib, etc.) were more than Slurm could handle.

With that said, I am concerned that Slurm kept going after failing to recover the job state and then purged files. I would expect a config setting to override state recovery on daemon restart, and, if that setting is not present, for the daemon to fail when recovery fails. Or something of the kind.

In this situation we have moved forward with the new version, requiring resubmission of all jobs. Measures will be put in place as a fail-safe moving forward.

I do appreciate the quick response. Any idea whether anything can be added to avoid this happening again?

Thanks,
Paul
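P.S. To make the layout concrete, it looks roughly like this (paths are illustrative, and the per-daemon link at the end is the fail-safe we're considering, not something already in place):

    /nfs/slurm/14.11.10/{bin,sbin,lib}
    /nfs/slurm/16.05.7/{bin,sbin,lib}
    /nfs/slurm/latest -> /nfs/slurm/14.11.10    # shared link that daemons and clients follow
    # planned: point each daemon at an explicit version during upgrades instead of flipping "latest"
    ln -sfn /nfs/slurm/16.05.7 /nfs/slurm/ctld-current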
(In reply to paull from comment #13)

Well, there is already a backup of the state files within the StateSaveLocation; they are named with a .old suffix and get overwritten on each restart in rotating fashion. What I meant in comment #5 is that Slurm doesn't back up the whole StateSaveLocation somewhere else, but there is still a .old copy in that directory.

If you still have the .old files from before the upgrade attempt, we'd be interested in analyzing them, especially for this error:

[2017-02-21T15:19:11.296] error: we don't have select plugin type 4096
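If anything from before the upgrade does survive, the files of interest sit next to the current ones in the StateSaveLocation (path below is only an example), and copying them aside before any further restarts would stop them being rotated away:

    ls -l /var/spool/slurmctld/job_state*
    # job_state      current save
    # job_state.old  previous save, overwritten on each restart
    cp -a /var/spool/slurmctld/job_state.old /root/job_state.old.pre-upgrade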
I looked at the .old file and due to the server being restarted a couple of times, it contained the same info as the normal state file.
BUT let's be really clear here. This is still a massive bug in Slurm. Slurm should NOT delete every job on the system (running and queued) just because it can't read the state file... it should EXIT, tell the admin what the issue is, and let them delete the files if they want to... or find some other solution.
Apologies for not responding faster; Alex is out on vacation and this is my fault for not catching this yesterday.

There are two separate issues here, and I'm filing a separate bug to cover part of it:

1) The upgrade was done out of order, resulting in running jobs being cancelled. Unfortunately, there isn't much we can do to prevent this due to the way the protocols are currently designed; handling this properly would require significant architectural changes which I do not feel are worth the associated complexity at this time.

2) slurmctld could not read the job_state file at startup. I would still like to understand the root cause of this. The "4096" value that was read as a select plugin type does not correspond to anything within the Slurm source code, and either indicates you may have been running a local patch that modified the state format (which we haven't received confirmation on either way yet, comment #1 on bug 3484), or that the state file had been corrupted somehow. If the state file was corrupted, as it appears to have been, there would be no way to recover the system state. If this was due to a local modification, it may have been possible to proceed, but that would be outside the scope of our support. However, since you did not take any backups beforehand (as recommended in step #8 here: https://slurm.schedmd.com/quickstart_admin.html#upgrade) and do not have a copy of that file, I do not think we'll be able to reproduce this, and we should thus focus our efforts on preventing such an event in the future.

I agree that the startup code should be modified to fail fast: rather than proceeding to cancel all jobs, it should refuse to start, with warnings as to why the cluster cannot be brought up properly and how to override that limit. I've opened bug 3502 to track this issue, and will make sure it is reviewed and finished before the next point releases for 16.05 and 17.02 are issued.

Is there anything else we can address on this specific issue, or can I close this and focus our efforts on bug 3502?

- Tim
I'm happy to close. Paul has the upgrade procedure sorted and has done all our offices now. Thanks.
Marking resolved/infogiven, and moving on to bug 3484 to cover mitigating the impact of unloadable state files.
We were just bit by this again (apparently changing SelectType torches the entire job state?), and I don't seem to have access to bug 3502. Is it intentionally private?