| Summary: | All jobs killed on Upgrade | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Martins Innus <minnus> |
| Component: | slurmctld | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | | |
| Priority: | --- | CC: | alex, brian, da, tim |
| Version: | 14.11.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | University of Buffalo (SUNY) | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 14.11.11, 15.08.3, 16.05.0pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf, slurm-cluster.conf, slurmd.log, slurmctld.log, slurmd_270265.log, slurmctld_270265.log, slurmd_debug3.log, spec file patch | | |
Description
Martins Innus
2015-11-03 03:24:28 MST
Sorry for the inconvenience of losing jobs. We are investigating what might have happened. Will you attach your slurm.conf, the slurmctld log file, and cpn-p26-30's slurmd log file? Thanks, Brian

Created attachment 2371 [details]: slurm.conf
Created attachment 2372 [details]: slurm-cluster.conf
Created attachment 2373 [details]: slurmd.log
Created attachment 2374 [details]: slurmctld.log
OK, uploaded. I truncated them to just before the upgrade since the logs were quite large. Let me know if you need earlier info.

Will you grab the logs from the first occurrence of job 270265 in both the slurmctld and slurmd logs? We also noticed that the slurm.conf's differ between the slurmctld and the slurmd's. We don't recommend this, as it can cause problems. When was the last time the slurm.conf's were updated? Were there any other system upgrades that happened at the same time?

Sure, I can do that tomorrow. With respect to the slurm.conf, we noticed those messages as well, but as far as we can tell everything is the same. The file is managed by Puppet, so it should be identical everywhere. Is there a way to tell what Slurm thinks the differences are? slurm.conf was updated everywhere during the upgrade; I will have to check the exact timing. No other system upgrades happened at this time, this was just a Slurm upgrade. Also, this is on CentOS 7, if that matters.
Created attachment 2382 [details]: slurmd_270265.log
Created attachment 2383 [details]: slurmctld_270265.log
OK, the two files are uploaded. We have done some more analysis. We upgraded 4 clusters at the same time, each with its own controller but talking to the same dbd, and we lost all the jobs on all 4 clusters. I have been sending the logs from just the "chemistry" cluster, but the other 3 look the same. This is 100% reproducible in our test environment; we must not have been as thorough as we thought when testing the upgrade before deploying. The cluster is running 14.11.10-1...

1. Submit a single job to a single node. The job simply sleeps for a long time.
2. Rebuild the current rpm just with a new name so the upgrade will work easily: 14.11.10-2...
3. Just on the node running the job: update and restart slurm.

The job is lost with errors similar to those previously posted. Output from the slurmd.log is pasted below. The suspicious part looks to be the "step -2" entry before the "step 1" entry.

```
[2015-11-04T11:32:30.618] _run_prolog: run job script took usec=202503
[2015-11-04T11:32:30.618] _run_prolog: prolog with lock for job 2159 ran for 0 seconds
[2015-11-04T11:32:30.618] Launching batch job 2159 for UID 27330
[2015-11-04T11:32:30.664] [2159] task/cgroup: /slurm/uid_27330/job_2159: alloc=5600MB mem.limit=5600MB memsw.limit=5880MB
[2015-11-04T11:32:30.664] [2159] task/cgroup: /slurm/uid_27330/job_2159/step_batch: alloc=5600MB mem.limit=5600MB memsw.limit=5880MB
[2015-11-04T11:32:30.711] launch task 2159.0 request from 27330.16038@10.104.13.27 (port 61372)
[2015-11-04T11:32:30.735] [2159.0] task/cgroup: /slurm/uid_27330/job_2159: alloc=5600MB mem.limit=5600MB memsw.limit=5880MB
[2015-11-04T11:32:30.735] [2159.0] task/cgroup: /slurm/uid_27330/job_2159/step_0: alloc=5600MB mem.limit=5600MB memsw.limit=5880MB
[2015-11-04T11:32:30.891] [2159.0] done with job
[2015-11-04T11:32:30.906] launch task 2159.1 request from 27330.16038@10.104.13.27 (port 64444)
[2015-11-04T11:32:30.930] [2159.1] task/cgroup: /slurm/uid_27330/job_2159: alloc=5600MB mem.limit=5600MB memsw.limit=5880MB
[2015-11-04T11:32:30.930] [2159.1] task/cgroup: /slurm/uid_27330/job_2159/step_1: alloc=5600MB mem.limit=5600MB memsw.limit=5880MB
[2015-11-04T11:34:13.152] Slurmd shutdown completing
[2015-11-04T11:37:14.387] error: can't stat gres.conf file /etc/slurm/gres.conf, assuming zero resource counts
[2015-11-04T11:37:14.389] No specialized cores configured by default on this node
[2015-11-04T11:37:14.389] Resource spec: Reserved system memory limit not configured for this node
[2015-11-04T11:37:14.401] slurmd version 14.11.10 started
[2015-11-04T11:37:14.402] slurmd started on Wed, 04 Nov 2015 11:37:14 -0500
[2015-11-04T11:37:14.402] CPUs=8 Boards=1 Sockets=2 Cores=4 Threads=1 Memory=23939 TmpDisk=229703 Uptime=2248525 CPUSpecList=(null)
[2015-11-04T11:37:14.403] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
[2015-11-04T11:37:14.403] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job02159/slurm_script
[2015-11-04T11:37:14.403] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:37:14.411] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
[2015-11-04T11:37:14.411] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job02159/slurm_script
[2015-11-04T11:37:14.411] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:37:14.411] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
[2015-11-04T11:37:14.412] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job02159/slurm_script
[2015-11-04T11:37:14.412] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:37:14.412] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
[2015-11-04T11:37:14.412] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job02159/slurm_script
[2015-11-04T11:37:14.413] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:37:14.413] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
[2015-11-04T11:37:14.413] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job02159/slurm_script
[2015-11-04T11:37:14.414] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:37:15.119] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:37:47.293] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step 1 Connection refused
[2015-11-04T11:39:58.886] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
[2015-11-04T11:39:58.886] _handle_stray_script: Purging vestigial job script /var/spool/slurmd/job02159/slurm_script
```

Will you run your test again and verify that the slurmstepd is running prior to the upgrade, and then see if it is still there after the upgrade? We're wondering if the stepd is crashing for some reason. We've found that the slurmstepd's won't generate cores without some extra configuration, since they call setuid. Will you set these options on the compute nodes and see if you get any core files from the stepds? Set /proc/sys/fs/suid_dumpable to 2. This could be set permanently in sysctl.conf with:

```
fs.suid_dumpable = 2
```

or temporarily with:

```
sysctl fs.suid_dumpable=2
```

On CentOS 6, I've also had to set "ProcessUnpackaged = yes" in /etc/abrt/abrt-action-save-package-data.conf.
In testing, once these were set and I aborted the stepd, I saw these messages in /var/log/messages:

```
Oct 15 11:31:20 knc abrt[21489]: Saved core dump of pid 21477 (/localhome/brian/slurm/14.11/knc/sbin/slurmstepd) to /var/spool/abrt/ccpp-2015-10-15-11:31:20-21477 (6639616 bytes)
Oct 15 11:31:20 knc abrtd: Directory 'ccpp-2015-10-15-11:31:20-21477' creation detected
```

There is a core dump file inside the directory. On a 3.6 kernel (Ubuntu), fs.suid_dumpable requires a fully qualified path in the core_pattern, e.g.:

```
sysctl kernel.core_pattern=/tmp/core.%e.%p
```

Will you also set your debug level to debug3 on the slurmctld and slurmd to see if we get any extra information? Thanks, Brian

This looks to be closely related to bug 2028, which I am working on (sorry, it's not visible outside of SchedMD and that specific customer). Here's an explanation of what this message means:

```
[2015-11-04T11:37:14.403] error: _step_connect: connect() failed dir /var/spool/slurmd node cpn-d13-27 job 2159 step -2 Connection refused
```

When the slurmstepd starts a job step, it creates a named socket so that the slurmd can talk with it. When the slurmd tries to talk with it, it uses the function _step_connect. The job id here is 2159, and the -2 means this is the batch script. If they can't talk, the job gets killed. Are newly started jobs also being killed, or were only running jobs affected? This other site is seeing ongoing problems. I do have some code tweaks that might help, but I don't know exactly what is happening yet. I can share what I have if you want.

The stepd is running before:

```
root 16480     1  0 12:23 ?  00:00:00 slurmstepd: [2160]
root 16550     1  0 12:23 ?  00:00:00 slurmstepd: [2160.1]
```

And actually, it turns out we don't even need to do an upgrade. Just a:

```
systemctl restart slurmd
```

will cause the bad behavior. The stepd is gone after this point, but I can't tell exactly when it happens since the restart happens so fast.
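The failure mode Brian describes can be illustrated with a short, self-contained sketch (this is not SchedMD code; the socket path is invented for the example). A Unix named socket whose file still exists on disk, but whose listening process has exited, produces ECONNREFUSED on connect(), which is exactly what _step_connect reports when the stepd has died:

```python
import errno
import os
import socket
import tempfile

# Hypothetical stand-in for the per-step named socket under SlurmdSpoolDir
# (e.g. /var/spool/slurmd); the real path and naming scheme differ.
sock_path = os.path.join(tempfile.mkdtemp(), "job2159.step_batch")

# The "stepd" binds and listens on the named socket...
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(sock_path)
srv.listen(1)

# ...but if the process dies, the socket file is left behind on disk.
srv.close()

# The "slurmd" side (analogous to _step_connect) now gets ECONNREFUSED,
# even though the socket file still exists.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
try:
    cli.connect(sock_path)
except OSError as e:
    print(e.errno == errno.ECONNREFUSED)  # True
finally:
    cli.close()
```

In this bug, the stepd was being killed out from under the jobs, so every subsequent connect from the restarted slurmd hit this same refusal and the job was purged.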
Actually, trying something else:

```
systemctl stop slurmd
```

causes the stepd to disappear. I guess this is not supposed to happen? I'm still working on trying to get a core dump; we had abrtd disabled, so I'm working on getting that set up.

Created attachment 2385 [details]: slurmd_debug3.log
Yep, the stepd's should stay around after the slurmd goes down. Do you see anything in the syslog about the stepd process being killed? What OS/distro are you using? Do you have SELinux enabled?

Nothing in the messages about the stepd. SELinux is disabled. CentOS 7.1 on x86_64. The only non-standard thing I can think of that we do is that we have removed the /etc/init.d/slurm file, since CentOS 7 uses systemd. The slurm rpm installs both the systemd files and the init.d files, and this causes systemd to complain. Doing:

```
kill -TERM <pid of slurmd>
```

causes the stepd to disappear. From a gdb attached to the stepd when I do this:

```
(gdb) cont
Continuing.
[New Thread 0x7f3ba37f1700 (LWP 26208)]
[Thread 0x7f3ba37f1700 (LWP 26208) exited]
[New Thread 0x7f3ba37f1700 (LWP 26212)]
[Thread 0x7f3ba37f1700 (LWP 26212) exited]

Program received signal SIGTERM, Terminated.
[Switching to Thread 0x7f3badd47700 (LWP 25962)]
0x00007f3baccc3705 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
```

Will you try running the slurmd manually (outside of systemd):

1. Start the slurmd in the foreground: ./slurmd -D
2. Submit your job
3. Stop the slurmd with ctrl-c

Yup, that's it. It is something to do with how CentOS/systemd are handling cgroups. Doing it manually, I can stop and start slurmd with no problem. systemd cgroup info is pasted below; systemd thinks it should be managing the stepd processes. Not sure what the solution is here?

```
[minnus@cpn-d13-27 slurmd.service]$ ps -ef | grep stepd
root   31525     1  0 14:14 ?      00:00:00 slurmstepd: [2172]
root   31599     1  0 14:14 ?      00:00:00 slurmstepd: [2172.1]
minnus 31669 31441  0 14:14 pts/2  00:00:00 grep --color=auto stepd
[minnus@cpn-d13-27 slurmd.service]$ pwd
/sys/fs/cgroup/systemd/system.slice/slurmd.service
[minnus@cpn-d13-27 slurmd.service]$ more tasks
31502
31505
31507
31523
31525
31530
31531
31532
31533
31574
31585
31586
31587
31591
31599
31608
31609
31611
31616
31617
31618
```

Glad we found it. We'll do some investigating on a solution. Thanks for your help.
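The cgroup listing above is the key evidence: the slurmstepd PIDs (31525, 31599) appear in slurmd.service's tasks file, so systemd considers them part of the service and kills them on stop/restart. A minimal sketch (assuming a Linux /proc filesystem; this is not Slurm code) of checking which systemd cgroup a given PID sits in:

```python
import os

def systemd_cgroup(pid):
    """Return the systemd-hierarchy cgroup path for a pid, or None.

    Lines in /proc/<pid>/cgroup have the form
    "hierarchy-id:controller-list:path". On cgroup v1 (as on this
    CentOS 7 node) the systemd hierarchy has controller "name=systemd";
    on cgroup v2 there is a single "0::<path>" line.
    """
    with open("/proc/%d/cgroup" % pid) as f:
        for line in f:
            _, controllers, path = line.rstrip("\n").split(":", 2)
            if controllers in ("name=systemd", ""):
                return path
    return None

# A stepd PID reporting /system.slice/slurmd.service here is the smoking
# gun: systemd will tear it down with the daemon unless told otherwise.
print(systemd_cgroup(os.getpid()))
```

Running this against the stepd PIDs on the affected node would have shown them inside /system.slice/slurmd.service, matching the tasks file above.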
Will you add "KillMode=process" to the [Service] section of slurmd.service? This fixes it for me. E.g.:

```
[Service]
Type=forking
...
KillMode=process
```

Yup, that did it, thanks! For systemd support in general, could you add something like the attached patch to prevent both the init.d and systemd files from being installed? And maybe something similar to stop the service in the preun section? Thanks for the solution to this! Martins

Created attachment 2387 [details]: spec file patch
I've committed the KillMode patch to 14.11, though there will most likely not be another 14.11 release.

https://github.com/SchedMD/slurm/commit/508f866ea10e4c359d62d443279198082d587107

For your other spec file requests, will you submit an enhancement bug for those? Bug 2092 looks similar to your requests. Let us know if you have any questions. Thanks for your help on this!