Ticket 2539 - BB teardown via SLURM results in unexpected behavior
Summary: BB teardown via SLURM results in unexpected behavior
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Burst Buffers
Version: 15.08.8
Hardware: Cray XC Linux
Severity: 3 - Medium Impact
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2016-03-10 11:48 MST by Melissa
Modified: 2016-03-16 03:44 MDT
6 users

See Also:
Site: LBNL - Lawrence Berkeley National Laboratory
Machine Name: Cori
CLE Version:
Version Fixed: 15.08.9


Attachments
Possible Five Minute 'Delay' (251.44 KB, image/png)
2016-03-14 15:15 MDT, Melissa
Details

Description Melissa 2016-03-10 11:48:03 MST
On Cori with slurm 15.08.8 and DataWarp, there are 2 issues, likely related to the same bug. When I reference slurm code line numbers, I am actually referencing the latest github source code as of Thu, March 10, 2016 at 4:45PM PST.

Issue #1 (easy fix): 
In a batch script that uses the Burst Buffer (via #DW directive) and when email is enabled via the sbatch options: mail-type=<type> and mail-user=<user>, the email sent to the user reports the StageOut time as the teardown time, even when no stage out has been requested by the user. Sample Subject of Email when using a Burst Buffer but never calling #DW stage_out: "SLURM Job_id=1261175 Name=slurmSO Staged Out, StageOut time 00:05:37" I would expect this to say something like "[...] Complete, Teardown Time [...]" StageOut time would only be relevant if something is actually marked to be staged out, and it should be kept separate from the teardown time.

Issue #2:
It seemed everyone's jobs were taking ~5 minutes (give or take 0-30 seconds) for teardown. Cray reported that the longest wait times for teardown that they have seen are ~1 minute when they are at constant utilization. After digging through the slurm code, I identified a timeout in burst_buffer_common.h, line 59: "#define DEFAULT_OTHER_TIMEOUT	300	/* 5 minutes */" In addition, I investigated the slurm teardown code for the BB in the file burst_buffer_cray.c. Please see lines 1854-1861 in the *_start_teardown function. The 5 minute timeout is used in this code. 

After looking at the slurm logs, it turns out that most of the jobs are reporting that the function is entering a SIGKILL status (I am assuming that is what status 9 represents, please correct me if I am wrong). See a sample of a log file for 1 job (please note for a given day, almost all of the jobs appear to experience this behavior - I shortened to 1 example to save space):

[2016-03-07T07:15:48.492] _start_teardown: teardown for job 1275671 ran for usec=300016437
[2016-03-07T07:15:48.492] error: burst_buffer cray plugin: _start_teardown: teardown for job 1275671 status:9 response:

The burst buffer instances do seem to go away at some point, but I am not sure if this happens cleanly (based on these error messages). Slurm bug #2143 seems to suggest that there were issues with the teardown function before (namely, it looks like it might've failed and was not retried by the slurm code), however, this ticket has been marked as resolved, and the fix was incorporated as of 15.08.5. This code just adds a 5 second delay in between teardown retries (would not result in 5 minutes worth of delay). I have repeated my tests with data in the BB vs no data in the BB and the error exists in both cases. This applies specifically to BBs that are allocated for the length of a job (i.e., using #DW jobdw), since dw_wlm_cli uses -f destroy_persistent for persistent reservations rather than -f teardown.

I did read the comment on lines 1866-1868 that says: `Teardown is run at every termination of every job that _might_ have a burst buffer, so an error of "token not found" should be fairly common and not indicative of a problem.` I have no idea if this would give a status of 9 (SIGKILL). But sigkill intuitively seems to be indicative of a timeout. 

Either way, it's not reasonable for teardown to take a minimum of 5 minutes. If the DataWarp layer takes only seconds to tear down, but slurm is timing out for 5 minutes, the bug will not affect users (other than a delay in the email that their job has completed). HOWEVER, DataWarp performance degrades significantly if teardown of one user's BB is occurring at the same time as another [new] allocation on that burst buffer node is requested (the degradation is noticed by the new user, not the old one). So, if there is some unintended retry loop occurring until the timeout is reached, it might actually have a direct influence on users.
Comment 1 Tim Wickberg 2016-03-10 14:24:59 MST
> Issue #1 (easy fix): 
> In a batch script that uses the Burst Buffer (via #DW directive) and when
> email is enabled via the sbatch options: mail-type=<type> and
> mail-user=<user>, the email sent to the user reports the StageOut time as
> the teardown time, even when no stage out has been requested by the user.
> Sample Subject of Email when using a Burst Buffer but never calling #DW
> stage_out: "SLURM Job_id=1261175 Name=slurmSO Staged Out, StageOut time
> 00:05:37" I would expect this to say something like "[...] Complete,
> Teardown Time [...]" StageOut time would only be relevant if something is
> actually marked to be staged out, and it should be kept separate from the
> teardown time.

Just to check that I'm understanding this correctly - the message is being sent on all burst buffer jobs, not just ones with stage out set?

The reported time is now() - end_of_job, which would include both the tear down and stage out times. It looks like, absent a stage out command, the email shouldn't be generated. This may be a quirk with the Cray plugin in particular, I'll look into it further tomorrow.

> Issue #2:

> I did read the comment on lines 1866-1868 that says: `Teardown is run at
> every termination of every job that _might_ have a burst buffer, so an error
> of "token not found" should be fairly common and not indicative of a
> problem.` I have no idea if this would give a status of 9 (SIGKILL). But
> sigkill intuitively seems to be indicative of a timeout. 

bb_run_script() will kill(SIGKILL) the tear down (or any other script) if it exceeds the timeout, so that's where the '9' is coming from as you surmised.

Why the teardown script is hanging is an interesting question - I'll look through the logs we have from earlier to see if anything jumps out. I know Moe noted on 2384 that the DataWarp API was sporadically returning non-json results; this may be related.

- Tim
Comment 2 Melissa 2016-03-10 16:12:29 MST
Hi Tim-

Thanks for your super quick reply! Thank you for looking into Issue #2. I am hoping to get to the bottom of this as well, since it has sparked my curiosity. Please let me know if there are other things we can provide on this end to help.

As far as Issue #1, please see answers to your questions inline.

> Just to check that I'm understanding this correctly - 
> the message is being sent on all burst buffer jobs, 
> not just ones with stage out set?

Yes.

> It looks like, absent a stage out command, the email shouldn't be generated.

Yes, I think sending no email in that case is the intended behavior. For example, when I specify --mail-type=ALL for a normal non-BB job, the output emails I receive from slurm are supposed to include: BEGIN, END, FAIL, REQUEUE, and STAGE_OUT. But I don't get a job REQUEUE email if my job never has to requeue.


In the case where I request a burst buffer but do *not* request stage out, the script looks similar to this:

---------------------
#!/bin/bash
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -t 00:10:00
#SBATCH -J slurmSO
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@mail.com
#DW jobdw type=scratch capacity=200GB access_mode=striped

srun echo "Hello World";

-----------------------

I receive three emails:

One when the job begins [expected]
1) SLURM Job_id=1261144 Name=slurmSO Began, Queued time 00:00:08

One when the job completes [expected]
2) SLURM Job_id=1261144 Name=slurmSO Ended, Run time 00:00:05, COMPLETED, ExitCode 0

And then... a little over five minutes later, a third email [not expected]
3) SLURM Job_id=1261144 Name=slurmSO Staged Out, StageOut time 00:05:04


One quick note though... while the third email probably shouldn't be sent out to the user in the absence of a stage out command, I do think having the teardown time recorded, whether there is a stage out command or not, would still be useful in either the logs or the slurm DB (for admin purposes). 

Of course it's ultimately up to you guys at SchedMD, but since slurm is the primary interface with DataWarp, both teardown time and [when applicable] post-run stage out time are important statistics for us that can really only be accurately captured by the workload manager (since dw_wlm_cli -f teardown requires a jobID).
Comment 3 Tim Wickberg 2016-03-11 02:02:17 MST
(In reply to Melissa from comment #2)
> Hi Tim-
> 
> Thanks for your super quick reply! Thank you for looking into Issue #2. I am
> hoping to get to the bottom of this as well, since it has sparked my
> curiosity. Please let me know if there are other things we can provide on
> this end to help.
> 
> As far as Issue #1, please see answers to your questions inline.

I'm still looking into the teardown stalls.

I can explain the email, but unfortunately I do not have a simple fix for it.

For the (not yet implemented) burst_buffer/generic plugin, sending email out only when stage out completes is easily fixed.

Unfortunately for the Cray version, at the moment Slurm itself does not distinguish between jobs requiring stage out and those that don't - thus Slurm always generates the email for all jobs using burst buffers. The stage out command (which is set in a #DW line rather than a #BB line... BB is Slurm-specific, DW is Cray-specific, and only a few specific DW commands are parsed) isn't seen by us, and the burst_buffer plugin always calls 'dw_wlm_cli data_out', which itself re-reads the job script to determine if stage-out is necessary.

> Of course it's ultimately up to you guys at SchedMD, but since slurm is the
> primary interface with DataWarp, both teardown time and [when applicable]
> post-run stage out time are important statistics for us that can really only
> be accurately captured by the workload manager (since dw_wlm_cli -f teardown
> requires a jobID).

Some of the relevant timing should be in the logs already. If there's a specific timer that isn't it's straightforward to add it.

I don't think we've ever discussed sending burst buffer profiling information to the database. How useful would such functionality be? Would you expect users to ever profile this directly, or is it only of interest to the admin staff and thus something that can be culled from the logs when necessary?

cheers,
- Tim
Comment 4 Tim Wickberg 2016-03-11 02:56:09 MST
(Inline, heavily editing your first comment.)

> [2016-03-07T07:15:48.492] _start_teardown: teardown for job 1275671 ran for
> usec=300016437
> [2016-03-07T07:15:48.492] error: burst_buffer cray plugin: _start_teardown:
> teardown for job 1275671 status:9 response:
> 
> I did read the comment on lines 1866-1868 that says: `Teardown is run at
> every termination of every job that _might_ have a burst buffer, so an error
> of "token not found" should be fairly common and not indicative of a
> problem.` I have no idea if this would give a status of 9 (SIGKILL). But
> sigkill intuitively seems to be indicative of a timeout. 

The log I have from Wednesday doesn't have any of these timeouts. I'm seeing a worst case of 138406092 usec and a 99th percentile of 79742758.7 usec, both well within the limit.

My best guess here is that the DataWarp API stopped responding correctly for some reason. I think Brian Gilmer had noted something on 2384 about this, but I don't think we'd followed up on that.

Has this happened again since the last set of patches Doug applied for us yesterday? And if so, can you please attach the logs?
Comment 5 Melissa 2016-03-11 03:00:04 MST
RE: the minor email issue [another comment coming in reference to the comment you just posted]

> Unfortunately for the Cray version, at the moment Slurm itself 
> does not distinguish between jobs requiring stage out versus 
> those that don't

Hm... thanks for the explanation. But this rules out most of my theory on Issue #2. Well -- it still seems like a timeout, just that it's probably hitting that timeout in a different function than I thought. :) 

After your comment, I took another look at the code. Here is my new understanding (for the email issue). Please correct me if I am wrong (admittedly, I was hastily combing through the code in order to provide more info for my issue). So in my new understanding:

`_queue_teardown` (burst_buffer_cray.c) is only called first when there has been some sort of failure with the job. In that case, it is safe for slurm to set `bool hurry /* Set to destroy in a hurry (no stage-out) */` to 1 and to only call `dw_wlm_cli -f teardown`, because the job was rejected, failed, or only wound up there after `-f data_out`.

Instead, the successful job triggers `bb_p_job_start_stage_out` which calls `_queue_stage_out` which ultimately calls `dw_wlm_cli -f post_run` (I am assuming this is some sort of signal that they provided so that slurm can still identify the BB by the job number - maybe?) and then `dw_wlm_cli -f data_out` and THEN the teardown function/`-f teardown` etc etc.

If this is correct, perhaps a simple fix would be to just change the slurm-generated stage out email (in the Cray case) for now to use the terminology "BB Teardown/StageOut = HH:MM:SS" or even "BB Post-Run = HH:MM:SS" [I personally vote for the first option, because I think post-run might be too ambiguous and result in the same kind of confusion]. 

This way our users don't get confused, and all of the code stays the same and retains the same meaning. What do you think? I know that seems like a trivial thing to make a patch for, but we have had more than a few users reporting that a stage out was occurring when it actually was not. In the meantime, we will clarify this on the NERSC BB website.
Comment 6 Melissa 2016-03-11 04:24:21 MST
A few things to catch up on:

1) I spoke to Doug [Jacobsen]. So far, there have only been two patches applied to 15.08.8. One for increasing debugging of gres underflow messages and another for changing interactions with the Cray NHC. I am told that neither patch should affect the burst buffer. 

2) Is there a specific patch that Doug can apply that should fix the behavior? For the logs, it would be better to look at something like Monday or a day last week (or today would probably work, once they get generated), because Cray was not here at that time and BB activity was likely much, MUCH higher. The thread for bug #2384 is pretty long -- it seems at the end that DW wasn't involved at all? But if I missed something, please point me to the proper patch.

3) I am not sure that it could be a random DW malfunction. This behavior has been observed over a period of several weeks. I also tested it after your comment. I still get an email reporting 5 min 13 sec (one 100MB file in BB) and 5 min 3 sec (no files) for 'StageOut' (aka: teardown). I was thinking it is probably easier to debug on your end with breakpoints or additional info to see if it is actually timing out or if Cray is really taking 5+ minutes to report a successful teardown for dw_wlm_cli. I am not sure how much BB activity there was on Wednesday, since Cray was here giving an all-day presentation for us and our 'early BB users'.

One more comment to come about the DB stuff...
Comment 7 Tim Wickberg 2016-03-11 04:27:21 MST
(In reply to Melissa from comment #5)
> Hm... thanks for the explanation. But this rules out most of my theory on
> Issue #2. Well -- it still seems like a timeout, just that it's probably
> hitting that timeout in a different function than I thought. :) 

Yep. They're only tangentially related. And I suspect the other issue is transient, but await your reply on that thread. :)

> If this is correct, perhaps a simple fix would be to just change the
> slurm-generated stage out email (in the Cray case) for now to use the
> terminology "BB Teardown/StageOut = HH:MM:SS" or even "BB Post-Run =
> HH:MM:SS" [I personally vote for the first option, because I think post-run
> might be too ambiguous and result in the same kind of confusion].

I've opened bug 2541 as an enhancement request to add the logic to prevent the email from being generated. It's not as difficult to add as I'd thought it might be, but it'll require a minor change to an internal struct, so it will have to wait until 16.05.

If you want to submit a patch to change the email in the meantime I can add that before 15.08.9. Look for _mail_type_str() in src/slurmctld/agent.c for where the subject line is defined.
Comment 8 Tim Wickberg 2016-03-11 04:52:57 MST
(In reply to Melissa from comment #6)
> A few things to catch up on:
> 
> 1) I spoke to Doug [Jacobsen]. So far, there have only been two patches
> applied to 15.08.8. One for increasing debugging of gres underflow messages
> and another for changing interactions with the Cray NHC. I am told that
> neither patch should affect the burst buffer. 

You're right. I thought fec5e03b900aeb may have had an impact but it doesn't appear that's the case.

Do you know if 3bb2e602b has been applied yet?

commit 3bb2e602bfa1161427384145486b47f506457414
Author: Morris Jette <jette@schedmd.com>
Date:   Thu Mar 10 08:50:28 2016 -0800

    Cray Datawarp job requeue bug fix
    
    burst_buffer/cray plugin: Prevent a requeued job from being restarted while
        file stage-out is still in progress. Previous logic could restart the job
        job and not perform a new stage-in.
    bug 2584, comment #45

Although I doubt that is directly related here.

> 2) Is there a specific patch that Doug can apply that should fix the
> behavior? It is better to look at something like Monday or some day last
> week (or today would probably work after it gets generated) in reference to
> the logs, because Cray was not here at that time and BB activity was likely
> much, MUCH more active. The thread for bug #2384 is pretty long -- it seems
> at the end that DW wasn't involved at all? But if I missed something, please
> point me to the proper patch.

(See above.)

> 3) I am not sure that it could be a random DW malfunction. This behavior has
> been observed over a period of several weeks. I also tested it after your
> comment. I still get an email reporting 5 min 13 sec (1 100MB file in BB)
> and 5 min 3 sec (no files) for 'StageOut' (aka: teardown). I was thinking it
> is probably easier to debug on your end with breakpoints or addtl info to
> see if it is actually timing out or if Cray is really taking 5+ minutes to
> report a successful teardown for dw_wlm_cli. I am not sure how much BB
> activity there was on Wednesday, since Cray was here giving an all-day
> presentation for us and our 'early BB users'. 

Ahh... alright. So I just have the logs for the one day this didn't happen.

I sadly don't have an XC40 at hand (although if you have one you're not using I can find a home for it...), so I can't really say what's happening with the "real" DW API in such an instance. If I need to, we have access to a test system at Cray, but won't necessarily be able to provoke the same issue you're seeing.

I suspect the "teardown" is not successful - hence the timeout - but could be wrong. It might be completing in the background, or have completed but the API call didn't close the connection correctly for some reason. I'm hoping the logs should give some sort of indication either way.

I can trivially recreate the issue when emulating it - simply changing the dummy dw_wlm_cli to sleep for 1000 seconds on a teardown command matches things up to what you've described. So I think Slurm's side of things is working as expected here, but I'm curious if there's a specific set of DW API calls, or something else, that may be triggering that behavior. If I don't get a solid indication from the logs, I may give you a patch to log the raw responses directly - we've seen invalid JSON in responses (or non-json responses) to the API in the past that might be causing this although I don't think they'd lead to the timeout.
Comment 9 Melissa 2016-03-11 10:16:18 MST
> I'm hoping the logs should give some sort of indication either way.

I spoke with Doug and he says we have several GBs' worth of logs that he can provide. Since the log you have isn't very informative, maybe we can provide a few other days' worth of logs. That might also help indicate whether you think we should log the raw responses or not. Please let me know what the best way to get these to you is.

> I can trivially recreate the issue when emulating it - 
> simply changing the dummy dw_wlm_cli to sleep for 1000 seconds 
> on a teardown command matches things up to what you've described. 
> So I think Slurm's side of things is working as expected here, but 
> I'm curious if there's a specific set of DW API calls, or something 
> else, that may be triggering that behavior. 

It's funny - a bit of a deadlock: you are without a Cray/BB, and I would need to mess around with a production-level slurm to use the wlm commands directly related to job-length BBs, which I can't.

By the way, I did another test where the job script had an invalid command. I got an email that my slurm job failed, "SLURM Job_id=1315905 Name=slurmSO Failed, Run time 00:00:05, FAILED, ExitCode 2," but 5 min later, "SLURM Job_id=1315905 Name=slurmSO Staged Out, StageOut time 00:05:25." The slurm code looked like it would call teardown directly in the case of a job failing (but I guess I am wrong).


Alright, I have had more time to look through the logs, and I am fairly confused. I need to look at today's logs when they are generated tonight (so I will look at them tomorrow). 

I do see it reporting hitting the timeout at certain times. Also, for the most part, it seems to happen exclusively in the teardown code -- which means that the `dw_wlm_cli -f post_run` and `dw_wlm_cli -f data_out` probably completed okay. 

[2016-03-07T07:10:48.463] error: bb_run_script: teardown poll timeout @ 300000 msec

I will reply a bit later with some more information. As for putting BB profiling info in the slurm DB, I believe it would be useful information for internal teams here that do not have admin access. Also, the logs are one day delayed, and some BB functionality tests require information in a more realtime fashion instead of waiting 24hrs. However, I would like to note that I really can't speak for all of NERSC, and I also cannot speak to whether it would be a priority compared to other things.
Comment 10 Melissa 2016-03-13 16:15:51 MDT
Hi Tim,

Okay - I did some more digging in the log files, but I would like to get your input. It really does wait for 5 minutes, but I can't tell /why/. It seems like slurm receives something back from the burst buffer code indicating that teardown was successful, but then afterward, waits for 5 minutes. And then counts that five minutes in the calculation of stage out time.

Here are the relevant parts of 2 jobs I ran Friday (I have the complete job information but will list only the relevant parts - I can provide full output upon request):

This first one failed, because there was a bug in my run script. But there was still a five minute delay in the email.

-------------------------------------------
[2016-03-11T13:31:07.402] job_complete: JobID=1315904 State=0x1 NodeCnt=1 WEXITSTATUS 2
[2016-03-11T13:31:07.402] burst_buffer/cray: bb_p_job_start_stage_out: JobID=1315904
[2016-03-11T13:31:07.402] email msg to MRomanus@lbl.gov: SLURM Job_id=1315904 Name=slurmSO Failed, Run time 00:00:04, FAILED, ExitCode 2
[2016-03-11T13:31:07.407] job_complete: JobID=1315904 State=0x8005 NodeCnt=1 done
[2016-03-11T13:31:08.407] burst_buffer/cray: bb_p_job_test_post_run: JobID=1315904
[2016-03-11T13:31:10.428] _start_stage_out: dws_post_run for job 1315904 ran for usec=3026589
[2016-03-11T13:31:10.428] dw_wlm_cli --function post_run --token 1315904 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.4/job.1315904/script
[2016-03-11T13:31:11.492] burst_buffer/cray: bb_p_job_test_post_run: JobID=1315904
[2016-03-11T13:31:11.826] _start_stage_out: dws_data_out for job 1315904 ran for usec=350588
[2016-03-11T13:31:11.826] dw_wlm_cli --function data_out --token 1315904 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.4/job.1315904/script
[2016-03-11T13:31:11.826] stage operations completed
[2016-03-11T13:31:11.874] _start_stage_out: Stage-out/post-run complete for job 1315904
[2016-03-11T13:31:15.425] _start_teardown: teardown for job 1315904 ran for usec=3550786
[2016-03-11T13:31:15.425] dw_wlm_cli --function teardown --token 1315904 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.4/job.1315904/script
[2016-03-11T13:31:15.425] session 3531 torn down

[[ JOB WAS SUPPOSEDLY TORN DOWN -- BUT FIVE MINUTES LATER... ]]

[2016-03-11T13:37:04.646] burst_buffer/cray: bb_p_job_test_stage_out: JobID=1315904
[2016-03-11T13:37:04.646] email msg to MRomanus@lbl.gov: SLURM Job_id=1315904 Name=slurmSO Staged Out, StageOut time 00:05:57
-------------------------------------------


Second job that was successful:

-------------------------------------------
[2016-03-11T13:39:27.142] job_complete: JobID=1315908 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2016-03-11T13:39:27.142] burst_buffer/cray: bb_p_job_start_stage_out: JobID=1315908
[2016-03-11T13:39:27.142] email msg to MRomanus@lbl.gov: SLURM Job_id=1315908 Name=slurmSO Ended, Run time 00:07:05, COMPLETED, ExitCode 0
[2016-03-11T13:39:27.148] job_complete: JobID=1315908 State=0x8003 NodeCnt=1 done
[2016-03-11T13:39:29.195] burst_buffer/cray: bb_p_job_test_post_run: JobID=1315908
[2016-03-11T13:39:30.353] _start_stage_out: dws_post_run for job 1315908 ran for usec=3210982
[2016-03-11T13:39:30.353] dw_wlm_cli --function post_run --token 1315908 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1315908/script
[2016-03-11T13:39:31.672] burst_buffer/cray: bb_p_job_test_post_run: JobID=1315908
[2016-03-11T13:39:32.303] _start_stage_out: dws_data_out for job 1315908 ran for usec=630884
[2016-03-11T13:39:32.303] dw_wlm_cli --function data_out --token 1315908 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1315908/script
[2016-03-11T13:39:34.174] _start_stage_out: Stage-out/post-run complete for job 1315908
[2016-03-11T13:39:38.465] _start_teardown: teardown for job 1315908 ran for usec=4290206
[2016-03-11T13:39:38.465] dw_wlm_cli --function teardown --token 1315908 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1315908/script

[[FIVE MINUTES LATER]]

[2016-03-11T13:45:05.120] burst_buffer/cray: bb_p_job_test_stage_out: JobID=1315908
[2016-03-11T13:45:05.120] email msg to MRomanus@lbl.gov: SLURM Job_id=1315908 Name=slurmSO Staged Out, StageOut time 00:05:38
-------------------------------------------

I spent more time going through the code, but I honestly cannot find a reason for it to have waited for five minutes. It seems like - if it hit the timeout, there would have been an error message in most cases. 

I did notice that job 1 has the message "session 3531 torn down" while job 2 does not. But they both have the 5 minute wait, which occurs afterward. The only other thing I can think of is some sort of delay between BB_STATE_TEARDOWN and BB_STATE_COMPLETE. I'm honestly not sure why bb_p_job_test_stage_out is called after teardown. It seems like it would run after "_start_stage_out: Stage-out/post-run complete for job 1315908," before teardown?

I see that the example in the documentation (https://github.com/SchedMD/slurm/blob/d9259af5dc747384f5bfd55fcdb3625ea602ef9f/doc/html/burst_buffer.shtml) also shows a stage out time of 5:07. I am not sure if there is some kind of event that triggers the email or maybe it's periodic? Anyway, I figured these more updated logs might be helpful.
Comment 11 Melissa 2016-03-14 15:15:47 MDT
Created attachment 2853 [details]
Possible Five Minute 'Delay'

The uploaded picture is part of Moe's BB slides. My primary concern is where I added a red box. I spoke with Doug today, and he gave me some hints, which I will elaborate on below.

Once the job completes successfully, slurm obviously still keeps the job record until the stage out completes. However, it is unclear to me if any of it is busy-waiting. It's possible that slurm spawns the threads for stage out and teardown and then just does not poll to update the job status again until 5 minutes later (assuming that slurm updates jobs on a 5 minute-ish interval). In this case, the threads for the stage out and teardown may complete in mere microseconds, but the slurm controller or slurm daemon (not sure which - this is just a theory) doesn't check back in on the job until some 'regular interval' which could possibly be 5 minutes. In this case, it makes sense that it would call the stage out test after 5 minutes, because it wants to verify that it can delete the job record entirely.

I think it is worth understanding more about this piece of the 'slurm execution' after a BB job has completed but prior to slurm deleting the job record. I need to understand that piece before I can figure out the potential impact to the users. Let me know what you think - thank you!
Comment 13 Moe Jette 2016-03-16 02:18:12 MDT
I think the real cause for the 5 minute delay in the message is this configuration parameter:
MinJobAge=300     ## seconds until job removed from active database, 300s is default, minimum 2

The current Slurm logic does not even test whether the stage out has completed until it is ready to purge the job (5 minutes after completion). The email is generated when that test completes, which could well be 5 minutes after the teardown actually finishes.

I would not recommend changing the configuration parameter, but I will investigate the possibility of moving logic to generate the message as soon as the teardown completes.
Comment 14 Moe Jette 2016-03-16 03:44:01 MDT
I think we have both of these issues addressed now.

This commit:
https://github.com/SchedMD/slurm/commit/fae4c3d38218bffff060f93a8393bd00d124690f
changes the message format from
    "SLURM Job_id=1360353 Name=tmp Staged Out, StageOut time 00:01:47" to
    "SLURM Job_id=1360353 Name=tmp StageOut/Teardown time 00:01:47"

This commit:
https://github.com/SchedMD/slurm/commit/d85cdcc77c6acc5cf8ed234520a9c41e4d9de7df
sends the email as soon as the teardown completes rather than waiting for job purging.

These changes are both pretty small (just a few lines each) and will be in version 15.08.9 when released, probably around March 24. Or you could patch an earlier version of Slurm if desired.

I'll close the ticket based upon those patches, but feel free to re-open the ticket if necessary.