I redeployed slurm today to bring in some of your patches... and it created a disaster.

The redeploy basically installs the new software, then does a

  pdsh -l root -w list '/etc/init.d/slurm restart'

on all nodes that use slurm.

When the backup controller was restarted... it "recovered" and did

[2014-06-05T09:36:45.722] Purging files for defunct batch job 2618232

then on the primary controller

[2014-06-05T09:57:00.631] error: Error opening file /d/sw/slurm/var/spool/slurm_perth/job.2618232/script, No such file or directory
[2014-06-05T09:57:00.631] error: Batch script is missing, aborting job 2618232
[2014-06-05T09:57:00.631] completing job 2618232 status 0
[2014-06-05T09:57:00.639] sched: job_complete for JobId=2618232 successful, exit code=0
I see 2 things:

1. Why does the backup start ~20 minutes after the primary?
2. The state files don't seem to be in sync. What could be the reason?

On June 4, 2014 7:04:54 PM PDT, bugs@schedmd.com wrote:
> http://bugs.schedmd.com/show_bug.cgi?id=860
>
>      Site: DownUnder GeoSolutions
>    Bug ID: 860
>   Summary: backup controller deleted a heap of job scripts
>   Product: SLURM
>   Version: 14.03.0
>  Hardware: Linux
>        OS: Linux
>    Status: UNCONFIRMED
>  Severity: 2 - High Impact
>  Priority: ---
> Component: slurmctld daemon
>  Assignee: david@schedmd.com
>  Reporter: stuartm@dugeo.com
>        CC: da@schedmd.com, david@schedmd.com, jette@schedmd.com
As part of the deploy, the backup may restart minutes later, whenever the pdsh gets to it... but either way, the state files "should" be in sync: they come off an NFS mount from a third machine.
Can you attach more of the logs? I'd like to see the backup slurmctld log from the time it assumed control.
Created attachment 902 [details] backup controller log
Created attachment 903 [details] primary controller log
We have stopped using the backup controller for the moment. We found that having the state files on an NFS mount was just too slow and was causing client commands to time out trying to talk to slurmctld. We have gone back to keeping the state files on a local disk.
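For reference, the state file location is controlled by StateSaveLocation in slurm.conf; the path below is illustrative, not taken from this site's config. Note the trade-off: with state on a local disk, a BackupController can no longer see the primary's saved state.

```
# slurm.conf (sketch; path is illustrative)
StateSaveLocation=/var/spool/slurmctld/state
```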
I can address the job loss with a couple of changes, but there is a more fundamental problem that needs to be addressed.

https://github.com/SchedMD/slurm/commit/42ab25c3e7b0dfe9a8073985381f8932418d31e6
Test the time when the job_state file was written in order to detect multiple primary slurmctld daemons (e.g. both backup and primary are functioning as primary and there is a split-brain problem). If the backup becomes the primary and finds the primary is still writing to the shared file system, then exit.

https://github.com/SchedMD/slurm/commit/25c53b8f5c61d930d5768edc17daa4ef741c811d
If the backup slurmctld assumes primary status, do NOT purge any job state files (batch script and environment files), but if any attempt is made to re-use them, treat that as a fatal error. It may indicate that multiple primary slurmctld daemons are active (e.g. both backup and primary are functioning as primary and there is a split-brain problem).

Here is the more fundamental issue, from the primary log:

[2014-06-05T09:37:14.024] _slurm_rpc_submit_batch_job JobId=2615283 usec=617437235

What the heck is the daemon doing for 10+ minutes? That's long enough for the backup to start responding, accepting new jobs, etc. I'm guessing that is what caused bug 805: the job was accepted by the backup, then the primary started responding again and used the same job ID for another job, assigned it different nodes, etc.

The only thing that stands out is SchedulerParameters=max_sched_time=15. That will cause slurmctld to run the scheduler for 15 seconds at a time. During that time, thousands of RPCs might pile up and none of them will be serviced. Also note that threads are not necessarily serviced on a first-in first-out basis; some systems schedule pthreads first-in last-out. In short, I would feel much better about removing max_sched_time and using the backfill scheduler to manage most of the scheduling.
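The mtime test in the first commit above can be sketched in shell. This is an illustration of the idea, not what slurmctld does internally; the wait time and the calling convention are assumptions.

```shell
# Sketch of the split-brain check: after takeover, if the shared job_state
# file's mtime still advances, another primary slurmctld must be writing it.
# $1 = state file path, $2 = seconds to wait (default 5; both illustrative).
check_split_brain() {
    state_file=$1
    wait_secs=${2:-5}
    before=$(stat -c %Y "$state_file") || return 2
    sleep "$wait_secs"
    after=$(stat -c %Y "$state_file")
    if [ "$after" -gt "$before" ]; then
        echo "split brain: another slurmctld is writing $state_file"
        return 1
    fi
    echo "no concurrent writer detected"
    return 0
}
```

On takeover, a backup following this logic would run the check against its StateSaveLocation and exit rather than purge job files when a concurrent writer is detected.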
The backfill scheduler releases its locks every couple of seconds and is designed to handle large job counts (unlike the main scheduling loop). SchedulerParameters for backfill might look like this: bf_continue,bf_interval=300,bf_max_job_test=50000,bf_resolution=300. There might also be delays related to NFS or something else, but all I can tell is that the system was unresponsive for 10+ minutes.
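Written out as slurm.conf lines, the suggestion above would look something like this (parameter names as documented in slurm.conf; the values are the suggested starting points, to be tuned per site):

```
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=300,bf_max_job_test=50000,bf_resolution=300
```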
Thanks. Given that we were seeing client commands time out, which we believe was due to the NFS mount (certainly that's what our straces seemed to indicate, i.e. ~60 RPCs/s max), I can imagine that slurmctld would become unresponsive long enough for the backup to kick in.
(In reply to Moe Jette from comment #7)

I have removed our value of max_sched_time=15 and I assume it will now get the default max of 4.
Correct, but you may need the backfill scheduler to check more jobs.
Our current settings are:

SchedulerParameters=bf_continue,bf_max_job_test=50000,bf_interval=300,bf_resolution=300,defer,max_depend_depth=3,sched_interval=20,batch_sched_delay=10
That should be lower overhead. You also need SchedulerType=sched/backfill. The sdiag tool should help you see how it's working.
Yes, we have SchedulerType=sched/backfill
FWIW

*******************************************************
sdiag output at Fri Jun 6 10:48:56 2014
Data since      Fri Jun 6 10:17:09 2014
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 242
Jobs started:   630
Jobs completed: 574
Jobs canceled:  531
Jobs failed:    0

Main schedule statistics (microseconds):
        Last cycle:        64104
        Max cycle:         578622
        Total cycles:      118
        Mean cycle:        64736
        Mean depth cycle:  190
        Cycles per minute: 3
        Last queue length: 2883

Backfilling stats
        Total backfilled jobs (since last slurm start): 18
        Total backfilled jobs (since last stats cycle start): 18
        Total cycles:     5
        Last cycle when:  Fri Jun 6 10:42:04 2014
        Last cycle:       963180
        Max cycle:        1219593
        Mean cycle:       984140
        Last depth cycle: 1165
        Last depth cycle (try sched): 1034
        Depth Mean:       1142
        Depth Mean (try depth): 1035
        Last queue length: 3012
        Queue length mean: 2786
Some hours later... it doesn't appear backfill is giving us much.

*******************************************************
sdiag output at Fri Jun 6 13:21:46 2014
Data since      Fri Jun 6 10:17:09 2014
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 1232
Jobs started:   12647
Jobs completed: 11716
Jobs canceled:  763
Jobs failed:    0

Main schedule statistics (microseconds):
        Last cycle:        607295
        Max cycle:         1164293
        Total cycles:      694
        Mean cycle:        106711
        Mean depth cycle:  224
        Cycles per minute: 3
        Last queue length: 4079

Backfilling stats
        Total backfilled jobs (since last slurm start): 237
        Total backfilled jobs (since last stats cycle start): 237
        Total cycles:     29
        Last cycle when:  Fri Jun 6 13:13:14 2014
        Last cycle:       3832556
        Max cycle:        6533449
        Mean cycle:       2317317
        Last depth cycle: 2302
        Last depth cycle (try sched): 2299
        Depth Mean:       1520
        Depth Mean (try depth): 1439
        Last queue length: 2976
        Queue length mean: 2721
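One rough way to quantify "how much backfill is giving us" is backfilled jobs per backfill cycle, pulled straight out of sdiag output. This is a sketch; the function name is made up, and it only looks at the two counters shown above.

```shell
# Sketch: compute backfilled jobs per backfill cycle from `sdiag` output.
# Reads sdiag text on stdin; uses the "since last slurm start" counter and
# the Backfilling section's own "Total cycles" (not the main scheduler's).
bf_jobs_per_cycle() {
    awk -F': *' '
        /Backfilling stats/ { in_bf = 1 }
        in_bf && /Total backfilled jobs \(since last slurm start\)/ { jobs = $2 }
        in_bf && /^[ \t]*Total cycles/ { cycles = $2 }
        END { if (cycles > 0) printf "%.1f\n", jobs / cycles }
    '
}
```

Usage would be `sdiag | bf_jobs_per_cycle`; for the output above, 237 jobs over 29 cycles works out to about 8 backfilled jobs per cycle.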