Ticket 4529 - Burst Buffer Stage Out Fails silently without Error message from Slurm
Summary: Burst Buffer Stage Out Fails silently without Error message from Slurm
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Burst Buffers
Version: 17.02.9
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Moe Jette
Duplicates: 4526
Reported: 2017-12-14 16:49 MST by S Senator
Modified: 2018-01-02 10:00 MST
Site: LANL
Version Fixed: 17.11.1


Attachments
combined patch for v17.02 (6.81 KB, patch)
2017-12-20 09:30 MST, Moe Jette
Description S Senator 2017-12-14 16:49:24 MST
Burst Buffer stage-out fails if the Burst Buffer source directory contains a regular file and the destination Lustre directory has a softlink with the same name, due to Burst Buffer issues with softlinks. The failure is silent, however: some files get staged out while others do not. We believe that when stage-out (or stage-in) fails, Slurm should report the failure with an error message.
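
For anyone wanting to check for the triggering condition before stage-out, here is a minimal sketch that lists regular files in a source directory whose relative path is a softlink in the destination directory. The function name `find_symlink_collisions` and the directory names are hypothetical; this is a site-side check written for illustration, not part of Slurm or DataWarp.

```python
import os
import tempfile
from pathlib import Path


def find_symlink_collisions(source_dir, dest_dir):
    """Return sorted relative paths that are regular files in source_dir
    and symlinks at the same relative path in dest_dir."""
    source = Path(source_dir)
    dest = Path(dest_dir)
    collisions = []
    for path in source.rglob("*"):
        # Only regular files on the source side matter; skip symlinks and directories.
        if path.is_symlink() or not path.is_file():
            continue
        relative = path.relative_to(source)
        # is_symlink() does not follow the link, so dangling links are caught too.
        if (dest / relative).is_symlink():
            collisions.append(str(relative))
    return sorted(collisions)


if __name__ == "__main__":
    # Build a throwaway layout that mimics the reported case.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp, "bb_src")
        dst = Path(tmp, "lustre_dst")
        src.mkdir()
        dst.mkdir()
        (src / "result.dat").write_text("data")
        (src / "log.txt").write_text("log")
        (dst / "target.dat").write_text("old")
        os.symlink(dst / "target.dat", dst / "result.dat")
        print(find_symlink_collisions(src, dst))
```

Running such a check at the end of the job script would at least flag the files at risk, independent of what Slurm or DataWarp report.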

We have made this request to Cray as well. We are opening this case to request that, when Cray reports an error that causes stage-out to fail, Slurm captures the error and reports it.

Meanwhile Cray is working on enhancing their interactions with softlinks.
Comment 2 Moe Jette 2017-12-14 17:37:18 MST
Based upon discussions with Cray, I've been working on a patch for this. Here is a proposed solution. Let me know if this is satisfactory or if you want something different.

After stage-out failure:

Keep the job in slurmctld in completed state, but with the job's "Reason" field including the full DW error message. It will be visible using "scontrol show job". The DW buffer will also remain.

If burst_buffer.conf is configured with a flag of "TeardownFailure" (i.e. tear down the buffer on stage-out failure), the stage-out error will be written to the job's "AdminComment" field and will be visible using the sacct tool.
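
The proposal above can be read as a two-way decision on the TeardownFailure flag. The sketch below is only a toy model of that reading: `JobRecord`, its fields, and `handle_stage_out_failure` are hypothetical names, not Slurm data structures or functions. It also assumes that in the TeardownFailure case the job is not retained and "Reason" is left untouched, which the proposal does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical stand-in for the job fields discussed in this ticket."""
    job_id: int
    reason: str = ""
    admin_comment: str = ""
    buffer_present: bool = True
    retained_in_slurmctld: bool = False


def handle_stage_out_failure(job, dw_error, teardown_failure):
    """Apply the proposed policy to a job whose stage-out has failed."""
    if teardown_failure:
        # TeardownFailure set: record the error for accounting and drop the buffer.
        job.admin_comment = dw_error
        job.buffer_present = False
        job.retained_in_slurmctld = False  # assumption, see lead-in
    else:
        # Default: keep the job record and the buffer so the error stays inspectable.
        job.reason = dw_error
        job.buffer_present = True
        job.retained_in_slurmctld = True
    return job


if __name__ == "__main__":
    kept = handle_stage_out_failure(JobRecord(1), "DW stage-out error", False)
    torn = handle_stage_out_failure(JobRecord(2), "DW stage-out error", True)
    print(kept)
    print(torn)
```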
Comment 4 Moe Jette 2017-12-15 10:58:56 MST
*** Ticket 4526 has been marked as a duplicate of this ticket. ***
Comment 7 Moe Jette 2017-12-15 11:03:19 MST
We're working on a patch for this and the changes are such that we're not anxious to make them in version 17.02 and risk destabilizing a very mature release. What are your plans regarding upgrading to version 17.11?
We may be able to provide a patch for v17.02 if necessary.
Comment 17 Moe Jette 2017-12-19 12:12:27 MST
We've added logic to the next version 17.11 to address these issues to the extent possible. Here is a summary:

Fix small memory leaks in slurmctld daemon on some DataWarp errors:
https://github.com/SchedMD/slurm/commit/040c6a0a1e02d6163e27feece3b966581089ea9a

Do not purge a job record from the slurmctld daemon if the stage-out operation fails unless the TeardownFailure flag is set in the burst buffer configuration:
https://github.com/SchedMD/slurm/commit/35d8e193060faeedab74444277c2a137fd3b9aec

Log DataWarp errors in the job's "AdminComment" field. This information will be available in the accounting database, except for errors that occur after the job has already completed (stage-out or teardown errors). In order to capture that information, a job update RPC will need to be written to the database and that requires changes that need to be deferred to version 18.08.
https://github.com/SchedMD/slurm/commit/540d0e5cb71a84657172c102a89f7b5358f4a0be

I don't think we can do much more with these failures. While upgrading to version 17.11 would get you all of these changes, the patches would also apply cleanly to version 17.02.
Comment 18 S Senator 2017-12-19 12:19:53 MST
It would be very useful for our reporting, analysis and accounting if the comment in the third delta:
   //FIXME: WRITE ADMIN_COMMENT UPDATE TO DATABASE HERE, JOB IS ALREADY COMPLETED
were implemented.
Comment 19 Moe Jette 2017-12-19 15:47:36 MST
(In reply to S Senator from comment #18)
> It would be very useful for our reporting, analysis and accounting if the
> comment in the third delta:
>    //FIXME: WRITE ADMIN_COMMENT UPDATE TO DATABASE HERE, JOB IS ALREADY
> COMPLETED
> were implemented.

In order to capture that information, a job update RPC will need to be written to the database and that requires changes that need to be deferred to version 18.08. I'll see if we might be able to provide you with some means of getting that, at least for your site, before then.
Comment 20 Doug Jacobsen 2017-12-20 06:47:12 MST
Regarding the patch to this (commit 540d0e5cb71a846), we have code in our jobcomp/nersc plugin that assumes we have exclusive access to AdminComment, which we make heavy use of to store extra details of the job in our database.  I had thought the AdminComment was intended for admin use?
Comment 21 S Senator 2017-12-20 08:58:09 MST
As long as this is in the plan for a scheduled release, our oversight requirements would be met. We must show a scheduled fix.
Comment 22 Doug Jacobsen 2017-12-20 09:19:04 MST
My concern about slurmctld adding _anything_ to AdminComment is that we inject formatted output into it (json document) and then do a good deal of post-processing on that data.  I think I would like to capture this data as well, but would prefer to be able to extract it directly from something in the job_record and format it as required.
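
To illustrate the post-processing concern, here is a minimal sketch of a defensive reader that tolerates an AdminComment holding either a site-written JSON document or free text written by something else. The function name `split_admin_comment` and the sample values are hypothetical; this is not code from the jobcomp/nersc plugin.

```python
import json


def split_admin_comment(admin_comment):
    """Return (site_data, free_text) for an AdminComment string.

    site_data is the parsed JSON object when the whole string is a JSON
    object, otherwise None; free_text carries anything that is not.
    """
    text = (admin_comment or "").strip()
    if not text:
        return None, ""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Not JSON at all, e.g. an error message written by another component.
        return None, text
    if isinstance(parsed, dict):
        return parsed, ""
    # Valid JSON but not an object (e.g. a bare number): treat it as text.
    return None, text


if __name__ == "__main__":
    print(split_admin_comment('{"project": "abc", "nodes": 4}'))
    print(split_admin_comment("stage-out failed"))
```

A reader like this avoids crashing on unexpected content, but it does not solve the ownership question raised here: two writers sharing one field can still overwrite each other.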
Comment 23 Moe Jette 2017-12-20 09:23:56 MST
(In reply to Doug Jacobsen from comment #22)
> My concern about slurmctld adding _anything_ to AdminComment is that we
> inject formatted output into it (json document) and then do a good deal of
> post-processing on that data.  I think I would like to capture this data as
> well, but would prefer to be able to extract it directly from something in
> the job_record and format it as required.

After internal discussion we decided to move the information from the "admin_comment" to the "comment" field in version 17.11. Here is the commit with that change:
https://github.com/SchedMD/slurm/commit/2a17e7d74f48da3b6a0ed9e5aab1093ad9224518

We also discussed adding a "slurm_comment" field to the job record in version 18.08 and using that for the burst buffer errors. That will result in one comment field each for admins, slurm, and users.
Comment 25 Moe Jette 2017-12-20 09:30:27 MST
Created attachment 5796
combined patch for v17.02

(In reply to S Senator from comment #21)
> As long as this is in the plan for a scheduled release, our oversight
> requirements would be met. We must show a scheduled fix.

We plan to release version 17.11.1 this afternoon with the fix. The single attached patch has all of the same changes for version 17.02, which is your current version.
Comment 28 Moe Jette 2017-12-27 14:26:37 MST
Slurm version 18.08 will have a new job field called "system_comment" to distinguish it from "admin_comment" for sys admins and "comment" for users. The DataWarp failure logs will be written to that new "system_comment" field in version 18.08. Until then you can work with the provided patch (for version 17.02) or upgrade to version 17.11.1 (or later).
Comment 29 S Senator 2018-01-02 09:04:52 MST
Which of these fields would be stored in the database, as governed by the AccountingStoreJobComment parameter?

My preference would be that this parameter's meaning be unchanged, but that there would be additional parameter(s) that govern the admin_comment and system_comment.
Comment 30 Moe Jette 2018-01-02 09:25:41 MST
(In reply to S Senator from comment #29)
> Which of these fields would be stored into the data base, as governed by the
> AccountingStoreJobComment?
> 
> My preference would be that this parameter's meaning be unchanged, but that
> there would be additional parameter(s) that govern the admin_comment and
> system_comment.

"AccountingStoreJobComment" controls the storage of the user comment.
The "admin_comment" is always recorded in the database.
I would expect "system_comment" to also always be recorded.

I'll update the documentation accordingly
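
The storage rules stated above can be summarized in a small sketch. The function name `stored_comment_fields` is hypothetical, and the handling of "system_comment" reflects only the expectation given in this comment, not confirmed behavior.

```python
def stored_comment_fields(accounting_store_job_comment):
    """List the comment fields expected to reach the accounting database."""
    # admin_comment is always recorded; system_comment is expected to be as well.
    fields = ["admin_comment", "system_comment"]
    if accounting_store_job_comment:
        # Only the user comment is governed by AccountingStoreJobComment.
        fields.append("comment")
    return fields


if __name__ == "__main__":
    print(stored_comment_fields(True))
    print(stored_comment_fields(False))
```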
Comment 31 Moe Jette 2018-01-02 10:00:13 MST
(In reply to Moe Jette from comment #30)
> "AccountingStoreJobComment" controls the storage of the user comment.
> The "admin_comment" is always recorded in the database.
> I would expect "system_comment" to also always be recorded.
> 
> I'll update the documentation accordingly

https://github.com/SchedMD/slurm/commit/443ed127a04841c006dbc9f762073720bbcffb2b