I redeployed slurm today to bring in some of your patches... and it created a disaster.

The redeploy basically installs the new software, then does a

  pdsh -l root -w list '/etc/init.d/slurm restart'

on all nodes that use slurm.

When the backup controller was restarted... it "recovered" and did

[2014-06-05T09:36:45.722] Purging files for defunct batch job 2618232

then on the primary controller

[2014-06-05T09:57:00.631] error: Error opening file /d/sw/slurm/var/spool/slurm_perth/job.2618232/script, No such file or directory
[2014-06-05T09:57:00.631] error: Batch script is missing, aborting job 2618232
[2014-06-05T09:57:00.631] completing job 2618232 status 0
[2014-06-05T09:57:00.639] sched: job_complete for JobId=2618232 successful, exit code=0
I see 2 things:

1. Why does the backup start ~20 minutes after the primary?
2. The state files don't seem to be in sync. What could be the reason?

On June 4, 2014 7:04:54 PM PDT, bugs@schedmd.com wrote:
> http://bugs.schedmd.com/show_bug.cgi?id=860
>
>      Site: DownUnder GeoSolutions
>    Bug ID: 860
>   Summary: backup controller deleted a heap of job scripts
>   Product: SLURM
>   Version: 14.03.0
>  Hardware: Linux
>        OS: Linux
>    Status: UNCONFIRMED
>  Severity: 2 - High Impact
>  Priority: ---
> Component: slurmctld daemon
>  Assignee: david@schedmd.com
>  Reporter: stuartm@dugeo.com
>        CC: da@schedmd.com, david@schedmd.com, jette@schedmd.com
As part of the deploy, the backup may restart minutes later, whenever the pdsh gets to it... but either way, the state files "should" be in sync: they come off an NFS mount from a third machine.
Can you attach more of the logs? I'd like to see the backup slurmctld log from the time it assumed control.
Created attachment 902 [details] backup controller log
Created attachment 903 [details] primary controller log
We have stopped using the backup controller for the moment. We found that having the state files on an NFS mount was just too slow and was causing client commands to time out trying to talk to slurmctld. We have gone back to keeping the state files on a local disk.
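For reference, the state file location is controlled by StateSaveLocation in slurm.conf; the path below is illustrative, not taken from this site's config. Note the trade-off: with state on a local disk, a BackupController can no longer see the primary's saved state.

```
# slurm.conf (sketch; path is illustrative)
StateSaveLocation=/var/spool/slurmctld/state
```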
I can address the job loss with a couple of changes, but there is a more fundamental problem that needs to be addressed.

https://github.com/SchedMD/slurm/commit/42ab25c3e7b0dfe9a8073985381f8932418d31e6
Test the time when the job_state file was written in order to detect multiple primary slurmctld daemons (e.g. both backup and primary are functioning as primary and there is a split-brain problem). If the backup becomes the primary and finds the primary is still writing to the shared file system, then exit.

https://github.com/SchedMD/slurm/commit/25c53b8f5c61d930d5768edc17daa4ef741c811d
If the backup slurmctld assumes primary status, do NOT purge any job state files (batch script and environment files), but if any attempt is made to re-use them, treat that as a fatal error. It may indicate that multiple primary slurmctld daemons are active (e.g. both backup and primary are functioning as primary and there is a split-brain problem).

Here is the more fundamental issue, from the primary log:

[2014-06-05T09:37:14.024] _slurm_rpc_submit_batch_job JobId=2615283 usec=617437235

What the heck is the daemon doing for 10+ minutes? That's long enough for the backup to start responding, accepting new jobs, etc. I'm guessing that is what caused bug 805: the job was accepted by the backup, then the primary started responding again and used the same job ID for another job, assigned it different nodes, etc.

The only thing that stands out is SchedulerParameters=max_sched_time=15. That will cause slurmctld to run the scheduler for 15 seconds at a time. During that time, thousands of RPCs might pile up and none of them will be serviced. Also note that threads are not necessarily serviced on a first-in first-out basis; some systems schedule pthreads first-in last-out. In short, I would feel much better about removing max_sched_time and using the backfill scheduler to manage most of the scheduling.
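The mtime test in the first commit above can be sketched in shell. This is an illustration of the idea, not what slurmctld does internally; the wait time and the calling convention are assumptions.

```shell
# Sketch of the split-brain check: after takeover, if the shared job_state
# file's mtime still advances, another primary slurmctld must be writing it.
# $1 = state file path, $2 = seconds to wait (default 5; both illustrative).
check_split_brain() {
    state_file=$1
    wait_secs=${2:-5}
    before=$(stat -c %Y "$state_file") || return 2
    sleep "$wait_secs"
    after=$(stat -c %Y "$state_file")
    if [ "$after" -gt "$before" ]; then
        echo "split brain: another slurmctld is writing $state_file"
        return 1
    fi
    echo "no concurrent writer detected"
    return 0
}
```

On takeover, a backup following this logic would run the check against its StateSaveLocation and exit rather than purge job files when a concurrent writer is detected.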
The backfill scheduler releases its locks every couple of seconds and is designed to handle large job counts (unlike the main scheduling loop). SchedulerParameters for backfill might look like this: bf_continue,bf_interval=300,bf_max_job_test=50000,bf_resolution=300. There might also be delays related to NFS or something else, but all I can tell is that the system was unresponsive for 10+ minutes.
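Written out as slurm.conf lines, the suggestion above would look something like this (parameter names as documented in slurm.conf; the values are the suggested starting points, to be tuned per site):

```
SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=300,bf_max_job_test=50000,bf_resolution=300
```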
Thanks. Given that we were seeing client commands time out, which we believe was due to the NFS mount (certainly that's what our straces seemed to indicate, i.e. ~60 RPCs/s max), I can imagine that slurmctld would become unresponsive long enough for the backup to kick in.
(In reply to Moe Jette from comment #7)

I have removed our value of max_sched_time=15 and I assume it will now get the default max of 4.
Correct, but you may need the backfill scheduler to check more jobs.
Our current settings are:

SchedulerParameters=bf_continue,bf_max_job_test=50000,bf_interval=300,bf_resolution=300,defer,max_depend_depth=3,sched_interval=20,batch_sched_delay=10
That should be lower overhead. You also need SchedulerType=sched/backfill. The sdiag tool should help you see how it's working.
Yes, we have SchedulerType=sched/backfill
FWIW

*******************************************************
sdiag output at Fri Jun 6 10:48:56 2014
Data since      Fri Jun 6 10:17:09 2014
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 242
Jobs started:   630
Jobs completed: 574
Jobs canceled:  531
Jobs failed:    0

Main schedule statistics (microseconds):
        Last cycle:        64104
        Max cycle:         578622
        Total cycles:      118
        Mean cycle:        64736
        Mean depth cycle:  190
        Cycles per minute: 3
        Last queue length: 2883

Backfilling stats
        Total backfilled jobs (since last slurm start): 18
        Total backfilled jobs (since last stats cycle start): 18
        Total cycles:     5
        Last cycle when:  Fri Jun 6 10:42:04 2014
        Last cycle:       963180
        Max cycle:        1219593
        Mean cycle:       984140
        Last depth cycle: 1165
        Last depth cycle (try sched): 1034
        Depth Mean:       1142
        Depth Mean (try depth): 1035
        Last queue length: 3012
        Queue length mean: 2786
Some hours later... it doesn't appear backfill is giving us much.

*******************************************************
sdiag output at Fri Jun 6 13:21:46 2014
Data since      Fri Jun 6 10:17:09 2014
*******************************************************
Server thread count: 3
Agent queue size:    0

Jobs submitted: 1232
Jobs started:   12647
Jobs completed: 11716
Jobs canceled:  763
Jobs failed:    0

Main schedule statistics (microseconds):
        Last cycle:        607295
        Max cycle:         1164293
        Total cycles:      694
        Mean cycle:        106711
        Mean depth cycle:  224
        Cycles per minute: 3
        Last queue length: 4079

Backfilling stats
        Total backfilled jobs (since last slurm start): 237
        Total backfilled jobs (since last stats cycle start): 237
        Total cycles:     29
        Last cycle when:  Fri Jun 6 13:13:14 2014
        Last cycle:       3832556
        Max cycle:        6533449
        Mean cycle:       2317317
        Last depth cycle: 2302
        Last depth cycle (try sched): 2299
        Depth Mean:       1520
        Depth Mean (try depth): 1439
        Last queue length: 2976
        Queue length mean: 2721
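One rough way to quantify "how much backfill is giving us" is backfilled jobs per backfill cycle, pulled straight out of sdiag output. This is a sketch; the function name is made up, and it only looks at the two counters shown above.

```shell
# Sketch: compute backfilled jobs per backfill cycle from `sdiag` output.
# Reads sdiag text on stdin; uses the "since last slurm start" counter and
# the Backfilling section's own "Total cycles" (not the main scheduler's).
bf_jobs_per_cycle() {
    awk -F': *' '
        /Backfilling stats/ { in_bf = 1 }
        in_bf && /Total backfilled jobs \(since last slurm start\)/ { jobs = $2 }
        in_bf && /^[ \t]*Total cycles/ { cycles = $2 }
        END { if (cycles > 0) printf "%.1f\n", jobs / cycles }
    '
}
```

Usage would be `sdiag | bf_jobs_per_cycle`; for the output above, 237 jobs over 29 cycles works out to about 8 backfilled jobs per cycle.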