Ticket 2095 - All jobs killed on Upgrade
Summary: All jobs killed on Upgrade
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.11.8
Hardware: Linux
Severity: 2 - High Impact
Assignee: Brian Christiansen
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-11-03 03:24 MST by Martins Innus
Modified: 2015-11-04 07:02 MST
4 users

See Also:
Site: University of Buffalo (SUNY)
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 14.11.11 15.08.3 16.05.0pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (2.97 KB, text/plain) - 2015-11-03 04:41 MST, Martins Innus
slurm-cluster.conf (4.47 KB, text/plain) - 2015-11-03 04:44 MST, Martins Innus
slurmd.log (5.49 KB, text/plain) - 2015-11-03 04:44 MST, Martins Innus
slurmctld.log (61.79 KB, text/plain) - 2015-11-03 04:45 MST, Martins Innus
slurmd_270265.log (8.26 KB, text/plain) - 2015-11-04 00:01 MST, Martins Innus
slurmctld_270265.log (1.96 MB, text/plain) - 2015-11-04 00:01 MST, Martins Innus
slurmd_debug3.log (11.32 KB, text/plain) - 2015-11-04 04:04 MST, Martins Innus
spec file patch (1.36 KB, patch) - 2015-11-04 06:13 MST, Martins Innus

Description Martins Innus 2015-11-03 03:24:28 MST
Hi,
  We just upgraded from 14.11.8 to a build created from the 14-11-10 tag in git.

  We did an upgrade in our test infrastructure and all seemed to go well.  When we upgraded our production system, all running jobs were lost.

Our procedure was:

1. Upgrade and restart slurmdbd
2. Upgrade and restart slurmctld
3. Upgrade and restart slurmd on all compute nodes

We have done this procedure before successfully.
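In concrete terms, that was roughly the following (package names and commands are approximate, not our exact local package set):

# on the database host
yum update slurm-slurmdbd && systemctl restart slurmdbd
# on the controller
yum update slurm && systemctl restart slurmctld
# on every compute node
yum update slurm && systemctl restart slurmd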

The messages from all nodes are similar; here is an example for one killed job:

On slurmctld:

[2015-11-03T08:42:49.448] Terminate signal (SIGINT or SIGTERM) received
[2015-11-03T08:42:49.517] Saving all slurm state
[2015-11-03T08:42:49.786] layouts: all layouts are now unloaded.
[2015-11-03T08:42:50.212] slurmctld version 14.11.10 started on cluster chemistry
[2015-11-03T08:42:52.088] layouts: no layout to initialize
...
...
[2015-11-03T08:42:52.238] Recovered JobID=270265 State=0x1 NodeCnt=0 Assoc=243
...
...
[2015-11-03T09:52:17.083] Batch JobId=270265 missing from node 0 (not found BatchStartTime after startup)
[2015-11-03T09:52:17.083] job_complete: JobID=270265 State=0x1 NodeCnt=1 WTERMSIG 126
[2015-11-03T09:52:17.083] job_complete: JobID=270265 State=0x1 NodeCnt=1 cancelled by node failure


On the node cpn-p26-30  /var/log/slurm/slurmd.log

[2015-11-03T09:52:17.077] slurmd version 14.11.10 started
[2015-11-03T09:52:17.077] slurmd started on Tue, 03 Nov 2015 09:52:17 -0500
[2015-11-03T09:52:17.078] CPUs=12 Boards=1 Sockets=2 Cores=6 Threads=1 Memory=128536 TmpDisk=286652 Uptime=959992 CPUSpecList=(null)
[2015-11-03T09:52:17.078] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-p26-30 job 270265 step -2 Connection refused
[2015-11-03T09:52:17.078] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job270265/slurm_script


Any thoughts on what happened, or how to prevent this in the future?

Thanks

Martins
Comment 1 Brian Christiansen 2015-11-03 04:16:08 MST
Sorry for the inconvenience of losing jobs. We are investigating what might have happened. Will you attach your slurm.conf, the slurmctld log file, and cpn-p26-30's slurmd log file?

Thanks,
Brian
Comment 2 Martins Innus 2015-11-03 04:41:55 MST
Created attachment 2371 [details]
slurm.conf
Comment 3 Martins Innus 2015-11-03 04:44:06 MST
Created attachment 2372 [details]
slurm-cluster.conf
Comment 4 Martins Innus 2015-11-03 04:44:59 MST
Created attachment 2373 [details]
slurmd.log
Comment 5 Martins Innus 2015-11-03 04:45:39 MST
Created attachment 2374 [details]
slurmctld.log
Comment 6 Martins Innus 2015-11-03 04:46:38 MST
OK, uploaded.  I truncated them to just before the upgrade since the logs were quite large.  Let me know if you need earlier info.
Comment 7 Brian Christiansen 2015-11-03 09:27:46 MST
Will you grab the logs from the first occurrence of job 270265 in both the slurmctld and slurmd logs?

We also noticed that the slurm.conf on the slurmctld differs from the one on the slurmd's. We don't recommend this, as it can cause problems. When was the last time the slurm.conf files were updated?

Were there any other system upgrades that happened at the same time?
Comment 8 Martins Innus 2015-11-03 10:40:10 MST
Sure, I can do that tomorrow.

With respect to the slurm.conf, we noticed those messages as well, but as far as we can tell everything is the same. The file is managed by Puppet, so it should be identical everywhere. Is there a way to tell what Slurm thinks the differences are?

The slurm.conf was updated everywhere during the upgrade. I will have to check the exact timing.

No other system upgrades happened at this time. This was just a Slurm upgrade.

Also, this is on CentOS 7, if that matters.
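For reference, one sanity check we plan to run ourselves is simply comparing checksums of the config files on the controller and a compute node (paths are our layout; this may not be exactly what Slurm compares internally):

# run on the slurmctld host and on a compute node, then compare the output
md5sum /etc/slurm/slurm.conf /etc/slurm/slurm-cluster.conf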


Comment 9 Martins Innus 2015-11-04 00:01:11 MST
Created attachment 2382 [details]
slurmd_270265.log
Comment 10 Martins Innus 2015-11-04 00:01:57 MST
Created attachment 2383 [details]
slurmctld_270265.log
Comment 11 Martins Innus 2015-11-04 00:25:20 MST
OK, the 2 files are uploaded.

We have done some more analysis.

We upgraded 4 clusters at the same time, each with its own controller, but talking to the same dbd, and we lost all the jobs on all 4 clusters.  I have been sending the logs from just the "chemistry" cluster, but the other 3 look the same.
Comment 12 Martins Innus 2015-11-04 02:51:43 MST
This is 100% reproducible in our test environment.  We must not have been as thorough as we thought when testing the upgrade before deploying.

The cluster is running 14.11.10-1....

1. Submit a single job to a single node.  The job simply sleeps for a long time (an example submission is shown after this list).

2. Rebuild the current RPM with just a new release number so the upgrade will work easily: 14.11.10-2....

3. Just on the node running the job: upgrade and restart Slurm.  The job is lost with errors similar to those previously posted.
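For step 1, a submission like this is enough (partition and account options are site-specific and omitted here):

sbatch -N1 --wrap='sleep 1000000'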


Output from the slurmd.log is pasted below.  The suspicious part looks to be the "step -2" entry appearing before the "step 1" entry.




[2015-11-04T11:32:30.618] _run_prolog: run job script took usec=202503
[2015-11-04T11:32:30.618] _run_prolog: prolog with lock for job 2159 ran for 0 seconds
[2015-11-04T11:32:30.618] Launching batch job 2159 for UID 27330
[2015-11-04T11:32:30.664] [2159] task/cgroup: /slurm/uid_27330/job_2159: alloc=5600MB mem.limit=5600MB memsw.limit=5880MB
[2015-11-04T11:32:30.664] [2159] task/cgroup: /slurm/uid_27330/job_2159/step_batch: alloc=5600MB mem.limit=5600MB memsw.limit=5880MB
[2015-11-04T11:32:30.711] launch task 2159.0 request from 27330.16038@10.104.13.27 (port 61372)
[2015-11-04T11:32:30.735] [2159.0] task/cgroup: /slurm/uid_27330/job_2159: alloc=5600MB mem.limit=5600MB memsw.limit=5880MB
[2015-11-04T11:32:30.735] [2159.0] task/cgroup: /slurm/uid_27330/job_2159/step_0: alloc=5600MB mem.limit=5600MB memsw.limit=5880MB
[2015-11-04T11:32:30.891] [2159.0] done with job
[2015-11-04T11:32:30.906] launch task 2159.1 request from 27330.16038@10.104.13.27 (port 64444)
[2015-11-04T11:32:30.930] [2159.1] task/cgroup: /slurm/uid_27330/job_2159: alloc=5600MB mem.limit=5600MB memsw.limit=5880MB
[2015-11-04T11:32:30.930] [2159.1] task/cgroup: /slurm/uid_27330/job_2159/step_1: alloc=5600MB mem.limit=5600MB memsw.limit=5880MB
[2015-11-04T11:34:13.152] Slurmd shutdown completing
[2015-11-04T11:37:14.387] error: can't stat gres.conf file /etc/slurm/gres.conf, assuming zero resource counts
[2015-11-04T11:37:14.389] No specialized cores configured by default on this node
[2015-11-04T11:37:14.389] Resource spec: Reserved system memory limit not configured for this node
[2015-11-04T11:37:14.401] slurmd version 14.11.10 started
[2015-11-04T11:37:14.402] slurmd started on Wed, 04 Nov 2015 11:37:14 -0500
[2015-11-04T11:37:14.402] CPUs=8 Boards=1 Sockets=2 Cores=4 Threads=1 Memory=23939 TmpDisk=229703 Uptime=2248525 CPUSpecList=(null)
[2015-11-04T11:37:14.403] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
[2015-11-04T11:37:14.403] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job02159/slurm_script
[2015-11-04T11:37:14.403] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:37:14.411] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
[2015-11-04T11:37:14.411] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job02159/slurm_script
[2015-11-04T11:37:14.411] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:37:14.411] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
[2015-11-04T11:37:14.412] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job02159/slurm_script
[2015-11-04T11:37:14.412] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:37:14.412] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
[2015-11-04T11:37:14.412] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job02159/slurm_script
[2015-11-04T11:37:14.413] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:37:14.413] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
[2015-11-04T11:37:14.413] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job02159/slurm_script
[2015-11-04T11:37:14.414] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:37:15.119] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:37:47.293] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:39:58.886] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
[2015-11-04T11:39:58.886] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job02159/slurm_script
Comment 13 Brian Christiansen 2015-11-04 03:11:19 MST
Will you run your test again and verify that the slurmstepd is running prior to the upgrade and then see if it is there after the upgrade?

We're wondering if the stepd is crashing for some reason. We've found that the slurmstepds won't generate core files without some extra configuration, since they call setuid.

Will you set these options on the compute nodes and see if you get any core files from the stepds?

Set /proc/sys/fs/suid_dumpable to 2.
This could be set permanently in sysctl.conf with:
fs.suid_dumpable = 2

or temporarily with:

sysctl fs.suid_dumpable=2


On CentOS 6, I've also had to set "ProcessUnpackaged = yes" in /etc/abrt/abrt-action-save-package-data.conf.

In testing, once these were set and I aborted the stepd, I saw these messages in /var/log/messages:
Oct 15 11:31:20 knc abrt[21489]: Saved core dump of pid 21477 (/localhome/brian/slurm/14.11/knc/sbin/slurmstepd) to /var/spool/abrt/ccpp-2015-10-15-11:31:20-21477 (6639616 bytes)
Oct 15 11:31:20 knc abrtd: Directory 'ccpp-2015-10-15-11:31:20-21477' creation detected

There is a coredump file inside the directory.



On 3.6 and newer kernels (e.g., Ubuntu), fs.suid_dumpable=2 requires a fully qualified path in the core_pattern.
e.g.
sysctl kernel.core_pattern=/tmp/core.%e.%p
Comment 14 Brian Christiansen 2015-11-04 03:13:46 MST
Will you also set your debuglevel to debug3 on the slurmctld and slurmd to see if we get any extra information?

Thanks,
Brian
Comment 15 Moe Jette 2015-11-04 03:14:51 MST
This is looking to be closely related to bug 2028, which I am working on (sorry, it's not visible outside of SchedMD and that specific customer).

Here's an explanation of what this message means:
[2015-11-04T11:37:14.403] error: _step_connect: connect() failed dir
/var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused

When the slurmstepd starts a job step, it creates a named socket so that the slurmd can talk with it. When the slurmd tries to talk with the stepd, it uses the function _step_connect. The job id here is 2159, and the step id of -2 means this is the batch step (the job script). If they can't talk, the job gets killed.
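If you want to check by hand whether a given stepd's socket is still around, something like this should do it (exact socket file names under the spool directory vary, so treat it as a rough check):

ls -l /var/spool/slurmd/           # per-job socket files live under the SlurmdSpoolDir
ss -xlp | grep slurmstepd          # Unix-domain sockets a slurmstepd is still listening on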

Are newly started jobs also being killed, or were only running jobs affected?
The other site is seeing ongoing problems.

I do have some code tweaks that might help, but don't know exactly what is happening yet. I can share what I do have if you want.
Comment 16 Martins Innus 2015-11-04 04:03:21 MST
stepd is running before:

root     16480     1  0 12:23 ?        00:00:00 slurmstepd: [2160]
root     16550     1  0 12:23 ?        00:00:00 slurmstepd: [2160.1]


And actually, it turns out we don't even need to do an upgrade.  Just a:

systemctl restart slurmd

will cause the bad behavior.  


The stepd is gone after that point, but I can't tell exactly when it happens since the restart happens so fast.


Actually trying something else:

systemctl stop slurmd

Causes the stepd to disappear.  I guess this is not supposed to happen?


I'm still working on trying to get a core dump. We had abrtd disabled, so I'm working on getting that set up.
Comment 17 Martins Innus 2015-11-04 04:04:05 MST
Created attachment 2385 [details]
slurmd_debug3.log
Comment 18 Brian Christiansen 2015-11-04 04:17:09 MST
Yep, the stepds should stay around after the slurmd goes down. Do you see anything in the syslog about the stepd processes being killed?

What OS/distro are you using? Do you have SELinux enabled?
Comment 19 Martins Innus 2015-11-04 04:30:33 MST
Nothing in messages about the stepd. SELinux is disabled.

CentOS 7.1 on x86_64.

The only non-standard thing I can think of is that we have removed the /etc/init.d/slurm file, since CentOS 7 uses systemd. The Slurm RPM installs both the systemd unit files and the init.d files, and this causes systemd to complain.
Comment 20 Martins Innus 2015-11-04 04:41:34 MST
doing:

kill -TERM <pid of slurmd>

causes the stepd to disappear.


gdb attached to the stepd when I do this:

(gdb) cont
Continuing.
[New Thread 0x7f3ba37f1700 (LWP 26208)]
[Thread 0x7f3ba37f1700 (LWP 26208) exited]
[New Thread 0x7f3ba37f1700 (LWP 26212)]
[Thread 0x7f3ba37f1700 (LWP 26212) exited]

Program received signal SIGTERM, Terminated.
[Switching to Thread 0x7f3badd47700 (LWP 25962)]
0x00007f3baccc3705 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
Comment 21 Brian Christiansen 2015-11-04 05:00:55 MST
Will you try running the slurmd manually (outside of systemd):

1. start the slurmd in the foreground:
./slurmd -D
2. submit your job
3. stop the slurmd by doing ctrl-c
Comment 22 Martins Innus 2015-11-04 05:17:27 MST
Yup, that's it.

Something to do with how CentOS/systemd handle cgroups.

Doing it manually, I can stop and start slurmd with no problem.

The systemd cgroup info is pasted below. systemd thinks it should be managing the stepd processes. Not sure what the solution is here?



[minnus@cpn-d13-27 slurmd.service]$ ps -ef |grep stepd
root     31525     1  0 14:14 ?        00:00:00 slurmstepd: [2172]
root     31599     1  0 14:14 ?        00:00:00 slurmstepd: [2172.1]
minnus   31669 31441  0 14:14 pts/2    00:00:00 grep --color=auto stepd


[minnus@cpn-d13-27 slurmd.service]$ pwd
/sys/fs/cgroup/systemd/system.slice/slurmd.service


[minnus@cpn-d13-27 slurmd.service]$ more tasks 
31502
31505
31507
31523
31525
31530
31531
31532
31533
31574
31585
31586
31587
31591
31599
31608
31609
31611
31616
31617
31618
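Another way to confirm which unit systemd thinks owns the stepds (31525 is one of the slurmstepd PIDs from the ps output above):

cat /proc/31525/cgroup                        # the name=systemd line shows system.slice/slurmd.service
systemd-cgls system.slice/slurmd.service      # shows the stepds grouped under the slurmd unit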
Comment 23 Brian Christiansen 2015-11-04 05:19:44 MST
Glad we found it. We'll investigate a solution. Thanks for your help.
Comment 24 Brian Christiansen 2015-11-04 06:00:37 MST
Will you add "KillMode=process" to the [Service] section of slurmd.service? That fixes it for me.

ex.
[Service]
Type=forking
...
KillMode=process
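If you'd rather not touch the packaged unit file, a systemd drop-in should work just as well:

# /etc/systemd/system/slurmd.service.d/killmode.conf
[Service]
KillMode=process

# then reload and restart
systemctl daemon-reload
systemctl restart slurmd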
Comment 25 Martins Innus 2015-11-04 06:13:11 MST
Yup, that did it, thanks!


For systemd support in general, could you add something like the attached patch to prevent the init.d and systemd files from both being installed?

And maybe something similar to stop the service in the %preun section?
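Something along these lines is what I had in mind for the %preun part (illustrative only, not the attached patch; it assumes the standard RPM systemd macros are available):

%preun
%systemd_preun slurmd.service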

Thanks for the solution to this!

Martins
Comment 26 Martins Innus 2015-11-04 06:13:53 MST
Created attachment 2387 [details]
spec file patch
Comment 27 Brian Christiansen 2015-11-04 07:02:12 MST
I've committed the KillMode patch to 14.11. There will most likely not be another 14.11 release.
https://github.com/SchedMD/slurm/commit/508f866ea10e4c359d62d443279198082d587107

For your other spec file requests, will you submit an enhancement bug for those? Bug 2092 looks similar to your requests.

Let us know if you have any questions. Thanks for your help on this!