| Summary: | Burst Buffer Stage Out Fails silently without Error message from Slurm | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | S Senator <sts> |
| Component: | Burst Buffers | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | dmjacobsen, lena |
| Version: | 17.02.9 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=4566 | ||
| Site: | LANL | Slinky Site: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 17.11.1 |
| Target Release: | --- | DevPrio: | --- |
| Attachments: | combined patch for v17.02 | ||
|
Description
S Senator
2017-12-14 16:49:24 MST
Based upon discussions with Cray, I've been working on a patch for this. Here is a proposed solution. Let me know if this is satisfactory or if you want something different.

After stage-out failure: Keep the job in slurmctld in completed state, but with the job's "Reason" field including the full DW error message. It will be visible using "scontrol show job". The DW buffer will also remain.

If burst_buffer.conf is configured with a flag of "TeardownFailure" (i.e. tear down the buffer on stage-out failure), the stage-out error will be written to the job's "AdminComment" field and visible using the sacct tool.

*** Ticket 4526 has been marked as a duplicate of this ticket. ***

We're working on a patch for this and the changes are such that we're not anxious to make them in version 17.02 and risk destabilizing a very mature release. What are your plans regarding upgrading to version 17.11? We may be able to provide a patch for v17.02 if necessary.

We've added logic to the next version 17.11 to address these issues to the extent possible. Here is a summary:

Fix small memory leaks in slurmctld daemon on some DataWarp errors:
https://github.com/SchedMD/slurm/commit/040c6a0a1e02d6163e27feece3b966581089ea9a

Do not purge a job record from the slurmctld daemon if the stage-out operation fails unless the TeardownFailure flag is set in the burst buffer configuration:
https://github.com/SchedMD/slurm/commit/35d8e193060faeedab74444277c2a137fd3b9aec

Log DataWarp errors in the job's "AdminComment" field. This information will be available in the accounting database for completed jobs except if the job has already completed (stage-out or teardown errors). In order to capture that information, a job update RPC will need to be written to the database and that requires changes that need to be deferred to version 18.08.
https://github.com/SchedMD/slurm/commit/540d0e5cb71a84657172c102a89f7b5358f4a0be

I don't think we can do much more with these failures.
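For readers following along, the stage-out failure handling first proposed in this thread can be sketched as a small decision function. This is only an illustrative model of the proposal's wording, not the slurmctld implementation: the names `StageOutOutcome` and `on_stage_out_failure` are hypothetical, and the field that carries the error text is changed by later comments in this ticket.

```python
from dataclasses import dataclass


@dataclass
class StageOutOutcome:
    """Hypothetical summary of what happens after a failed stage-out."""
    purge_job_record: bool  # may slurmctld drop the job record?
    keep_buffer: bool       # does the DataWarp buffer remain?
    error_field: str        # job field that carries the DW error text
    visible_with: str       # tool named in the proposal for viewing it


def on_stage_out_failure(teardown_failure: bool) -> StageOutOutcome:
    """Model the proposal: TeardownFailure decides buffer and record fate."""
    if teardown_failure:
        # Buffer is torn down; the error is kept for accounting queries.
        return StageOutOutcome(True, False, "AdminComment", "sacct")
    # Default: job stays in slurmctld with the error in its Reason field.
    return StageOutOutcome(False, True, "Reason", "scontrol show job")


if __name__ == "__main__":
    for flag in (False, True):
        print(f"TeardownFailure={flag}: {on_stage_out_failure(flag)}")
```

The point of the sketch is that the single TeardownFailure flag selects both whether the buffer survives and where an administrator would look for the error.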
While upgrading to version 17.11 would get you all of these changes, the patches would also apply cleanly to version 17.02.

It would be very useful for our reporting, analysis and accounting if the comment in the third delta:
//FIXME: WRITE ADMIN_COMMENT UPDATE TO DATABASE HERE, JOB IS ALREADY COMPLETED
were implemented.

(In reply to S Senator from comment #18)
> It would be very useful for our reporting, analysis and accounting if the
> comment in the third delta:
> //FIXME: WRITE ADMIN_COMMENT UPDATE TO DATABASE HERE, JOB IS ALREADY
> COMPLETED
> were implemented.

In order to capture that information, a job update RPC will need to be written to the database and that requires changes that need to be deferred to version 18.08. I'll see if we might be able to provide you with some means of getting that, at least for your site, before then.

Regarding the patch to this (commit 540d0e5cb71a846), we have code in our jobcomp/nersc plugin that assumes we have exclusive access to AdminComment, which we make heavy use of to store extra details of the job in our database. I had thought the AdminComment was intended for admin use?

As long as this is in the plan for a scheduled release, our oversight requirements would be met. We must show a scheduled fix.

My concern about slurmctld adding _anything_ to AdminComment is that we inject formatted output into it (json document) and then do a good deal of post-processing on that data. I think I would like to capture this data as well, but would prefer to be able to extract it directly from something in the job_record and format it as required.

(In reply to Doug Jacobsen from comment #22)
> My concern about slurmctld adding _anything_ to AdminComment is that we
> inject formatted output into it (json document) and then do a good deal of
> post-processing on that data. I think I would like to capture this data as
> well, but would prefer to be able to extract it directly from something in
> the job_record and format it as required.
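The concern quoted above can be illustrated with a minimal sketch. It assumes a site stores a JSON document in AdminComment and parses it later; the sample document, the appended error text, and the helper name `parse_admin_comment` are all hypothetical and are not taken from the jobcomp/nersc plugin.

```python
import json


def parse_admin_comment(raw: str):
    """Return the decoded JSON document, or None if the field is not pure JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


# Hypothetical site-written content: a JSON document with extra job details.
site_json = '{"project": "example", "nodes": 4}'

# Hypothetical result if another writer appends free text to the same field.
mixed = site_json + "\nDataWarp stage-out failed"

if __name__ == "__main__":
    print(parse_admin_comment(site_json))  # decoded dict
    print(parse_admin_comment(mixed))      # None: trailing text breaks parsing
```

Under these assumptions, any text added to the field by a second writer makes the whole document unparseable, which is why a separate field for the error text is attractive.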
After internal discussion we decided to move the information from the "admin_comment" to the "comment" field in version 17.11. Here is the commit with that change:
https://github.com/SchedMD/slurm/commit/2a17e7d74f48da3b6a0ed9e5aab1093ad9224518
We also discussed adding a "slurm_comment" field to the job record in version 18.08 and using that for the burst buffer errors. That will result in one comment field each for admins, slurm, and users.

Created attachment 5796
combined patch for v17.02

(In reply to S Senator from comment #21)
> As long as this is in the plan for a scheduled release, our oversight
> requirements would be met. We must show a scheduled fix.

We plan to release version 17.11.1 this afternoon with the fix. The single attached patch has all of the same changes for version 17.02, which is your current version.

Slurm version 18.08 will have a new job field called "system_comment" to distinguish it from "admin_comment" for sys admins and "comment" for users. The datawarp failure logs will be written to that new "system_comment" field in version 18.08. Until then you can work with the provided patch (for version 17.02) or upgrade to version 17.11.1 (or later).

Which of these fields would be stored into the database, as governed by the AccountingStoreJobComment?

My preference would be that this parameter's meaning be unchanged, but that there would be additional parameter(s) that govern the admin_comment and system_comment.

(In reply to S Senator from comment #29)
> Which of these fields would be stored into the database, as governed by the
> AccountingStoreJobComment?
>
> My preference would be that this parameter's meaning be unchanged, but that
> there would be additional parameter(s) that govern the admin_comment and
> system_comment.

"AccountingStoreJobComment" controls the storage of the user comment.
The "admin_comment" is always recorded in the database.
I would expect "system_comment" to also always be recorded.
I'll update the documentation accordingly

(In reply to Moe Jette from comment #30)
> "AccountingStoreJobComment" controls the storage of the user comment.
> The "admin_comment" is always recorded in the database.
> I would expect "system_comment" to also always be recorded.
>
> I'll update the documentation accordingly

https://github.com/SchedMD/slurm/commit/443ed127a04841c006dbc9f762073720bbcffb2b |
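The storage rules stated in the last exchange can be summarized in a short sketch. It only restates the thread's answer; the function name `comment_fields_stored` is hypothetical, and the `system_comment` entry reflects an expectation voiced in the thread rather than confirmed behavior.

```python
from typing import Dict


def comment_fields_stored(accounting_store_job_comment: bool) -> Dict[str, bool]:
    """Map each job comment field to whether it reaches the accounting database."""
    return {
        # User comment: governed by AccountingStoreJobComment.
        "comment": accounting_store_job_comment,
        # Admin comment: always recorded, per the thread.
        "admin_comment": True,
        # System comment: expected to always be recorded (an expectation only).
        "system_comment": True,
    }


if __name__ == "__main__":
    for setting in (True, False):
        print(f"AccountingStoreJobComment={setting}: {comment_fields_stored(setting)}")
```

In other words, the existing parameter keeps its meaning and only the user-facing comment is optional.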