| Summary: | backup controller deleted a heap of job scripts | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Stuart Midgley <stuartm> |
| Component: | slurmctld | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | da |
| Version: | 14.03.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | Slinky Site: | --- |
| CLE Version: | --- | Version Fixed: | 14.03.4 |
| Attachments: | attachment-4956-0.html, backup controller log, primary controller log, attachment-15222-0.html, attachment-7754-0.html | | |
I see two things:

1. Why does the backup start minutes (~20) after the primary?
2. The state files don't seem to be in sync. What could be the reason?

As part of the deploy, the backup may restart minutes later, whenever the pdsh gets to it... but either way, the state files "should" be in sync; they are coming off an NFS mount from a third machine.

Can you attach more of the logs? I'd like to see the backup slurmctld log from the time it assumed control.

Created attachment 902 [details]
backup controller log
Created attachment 903 [details]
primary controller log
We have stopped using the backup controller for the moment. We found that having the state files on an NFS mount was just too slow, causing client commands to time out trying to talk to slurmctld. We have gone back to keeping the state files on a local disk.

I can address the job loss with a couple of changes, but there is a more fundamental problem that needs to be addressed.

https://github.com/SchedMD/slurm/commit/42ab25c3e7b0dfe9a8073985381f8932418d31e6
Test the time when the job_state file was written in order to detect multiple primary slurmctld daemons (e.g. both backup and primary are functioning as primary and there is a split-brain problem). If the backup becomes the primary and finds the primary is still writing to the shared file system, then exit.

https://github.com/SchedMD/slurm/commit/25c53b8f5c61d930d5768edc17daa4ef741c811d
If the backup slurmctld assumes primary status, then do NOT purge any job state files (batch script and environment files), but if any attempt is made to re-use them, treat that as a fatal error. It may indicate that multiple primary slurmctld daemons are active (split brain).

Here is the more fundamental issue, from the primary log:

[2014-06-05T09:37:14.024] _slurm_rpc_submit_batch_job JobId=2615283 usec=617437235

What the heck is the daemon doing for 10+ minutes? That's long enough for the backup to start responding, accepting new jobs, etc. I'm guessing that is what caused bug 805: the job was accepted by the backup, then the primary started responding again and used the same job ID for another job, assigned it different nodes, etc.

The only thing that stands out is SchedulerParameters=max_sched_time=15. That will cause slurmctld to run the scheduler for 15 seconds at a time. During that time, thousands of RPCs might pile up and none of them will be serviced. Also note that threads are not necessarily serviced on a first-in first-out basis; some systems schedule pthreads first-in last-out. In short, I would feel much better about removing max_sched_time and using the backfill scheduler to manage most of the scheduling. The backfill scheduler releases locks every couple of seconds and is designed to handle large job counts (unlike the main scheduling loop). SchedulerParameters for backfill might look like this:

bf_continue,bf_interval=300,bf_max_job_test=50000,bf_resolution=300

There might also be delays related to NFS or something else, but all I can tell is that the system was unresponsive for 10+ minutes.

Thanks. Given that we were seeing client commands time out, which we believe was due to the NFS mount (certainly that's what our straces seemed to indicate, i.e. 60 RPCs/s max), I can imagine that slurmctld would become unresponsive long enough for the backup to kick in. I have removed our value of max_sched_time=15 and assume it will now get the default maximum of 4.

Created attachment 908 [details] attachment-15222-0.html

Correct, but you may need the backfill scheduler to check more jobs.
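The timestamp test in the first commit above (42ab25c3) can be sketched roughly as follows. This is a hedged Python illustration, not SchedMD's actual C implementation; the state-file path and the age threshold below are invented for the example.

```python
import os
import time

# Illustrative only: slurmctld implements this check in C. The path and
# the 120-second threshold are invented for this sketch.
STATE_FILE = "/tmp/job_state"  # stand-in for <StateSaveLocation>/job_state
MAX_AGE = 120                  # seconds of silence before takeover looks safe

def safe_to_take_over(state_file, max_age, now):
    """Return True if the primary has not written the state file recently."""
    try:
        mtime = os.path.getmtime(state_file)
    except OSError:
        return True  # no state file at all: nothing to conflict with
    return (now - mtime) >= max_age

if __name__ == "__main__":
    if safe_to_take_over(STATE_FILE, MAX_AGE, time.time()):
        print("job_state is quiet; backup may assume control")
    else:
        # Mirrors the fix's behavior: a recently written state file means
        # another primary is alive, so the backup should exit, not take over.
        print("primary still writing job_state; refusing takeover")
```

The point of the check is exactly the failure mode in this report: during the rolling pdsh restart, both daemons briefly acted as primary against the same NFS state directory, and the backup purged job scripts the primary still needed.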
Our current settings are:

SchedulerParameters=bf_continue,bf_max_job_test=50000,bf_interval=300,bf_resolution=300,defer,max_depend_depth=3,sched_interval=20,batch_sched_delay=10

Created attachment 909 [details] attachment-7754-0.html

That should be lower overhead. You also need SchedulerType=sched/backfill. The sdiag tool should help to see how it's working.
Yes, we have SchedulerType=sched/backfill. FWIW:
*******************************************************
sdiag output at Fri Jun 6 10:48:56 2014
Data since Fri Jun 6 10:17:09 2014
*******************************************************
Server thread count: 3
Agent queue size: 0
Jobs submitted: 242
Jobs started: 630
Jobs completed: 574
Jobs canceled: 531
Jobs failed: 0
Main schedule statistics (microseconds):
Last cycle: 64104
Max cycle: 578622
Total cycles: 118
Mean cycle: 64736
Mean depth cycle: 190
Cycles per minute: 3
Last queue length: 2883
Backfilling stats
Total backfilled jobs (since last slurm start): 18
Total backfilled jobs (since last stats cycle start): 18
Total cycles: 5
Last cycle when: Fri Jun 6 10:42:04 2014
Last cycle: 963180
Max cycle: 1219593
Mean cycle: 984140
Last depth cycle: 1165
Last depth cycle (try sched): 1034
Depth Mean: 1142
Depth Mean (try depth): 1035
Last queue length: 3012
Queue length mean: 2786
Some hours later... it doesn't appear backfill is giving us much.
*******************************************************
sdiag output at Fri Jun 6 13:21:46 2014
Data since Fri Jun 6 10:17:09 2014
*******************************************************
Server thread count: 3
Agent queue size: 0
Jobs submitted: 1232
Jobs started: 12647
Jobs completed: 11716
Jobs canceled: 763
Jobs failed: 0
Main schedule statistics (microseconds):
Last cycle: 607295
Max cycle: 1164293
Total cycles: 694
Mean cycle: 106711
Mean depth cycle: 224
Cycles per minute: 3
Last queue length: 4079
Backfilling stats
Total backfilled jobs (since last slurm start): 237
Total backfilled jobs (since last stats cycle start): 237
Total cycles: 29
Last cycle when: Fri Jun 6 13:13:14 2014
Last cycle: 3832556
Max cycle: 6533449
Mean cycle: 2317317
Last depth cycle: 2302
Last depth cycle (try sched): 2299
Depth Mean: 1520
Depth Mean (try depth): 1439
Last queue length: 2976
Queue length mean: 2721
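Stuart's reading can be checked with quick arithmetic on the second sdiag dump above: backfill placed only a small fraction of the started jobs and was busy for only a small slice of the window. A short Python check, with the figures copied from the dump:

```python
# Figures copied from the second sdiag dump above (Fri Jun 6 13:21:46).
jobs_started = 12647          # Jobs started since 10:17:09
backfilled = 237              # Total backfilled jobs since last slurm start
bf_cycles = 29                # Total backfill cycles
bf_mean_cycle_us = 2317317    # Mean backfill cycle time, microseconds

# Share of started jobs that backfill placed:
bf_share = backfilled / jobs_started
print(f"backfill placed {bf_share:.1%} of started jobs")  # about 1.9%

# Time backfill spent in its cycles over the roughly 3-hour window:
window_s = (13 * 3600 + 21 * 60 + 46) - (10 * 3600 + 17 * 60 + 9)
bf_busy_s = bf_cycles * bf_mean_cycle_us / 1e6
print(f"backfill ran about {bf_busy_s:.0f}s of a {window_s}s window")
```

So roughly 1.9% of started jobs came from backfill, in about 67 seconds of backfill activity over an 11,077-second window, which is consistent with "it doesn't appear backfill is giving us much" for this workload of many short jobs.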
Created attachment 901 [details] attachment-4956-0.html

I redeployed slurm today to bring in some of your patches... and it created a disaster. The redeploy basically installs the new software and then does a

pdsh -l root -w list '/etc/init.d/slurm restart'

on all nodes that use slurm. When the backup controller was restarted... it "recovered" and did

[2014-06-05T09:36:45.722] Purging files for defunct batch job 2618232

then on the primary controller

[2014-06-05T09:57:00.631] error: Error opening file /d/sw/slurm/var/spool/slurm_perth/job.2618232/script, No such file or directory
[2014-06-05T09:57:00.631] error: Batch script is missing, aborting job 2618232
[2014-06-05T09:57:00.631] completing job 2618232 status 0
[2014-06-05T09:57:00.639] sched: job_complete for JobId=2618232 successful, exit code=0