| Summary: | Datawarp job that requires KNL mode reboot does not function correctly | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | David Paul <dpaul> |
| Component: | Burst Buffers | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 2 - High Impact | ||
| Priority: | --- | CC: | dmjacobsen, dpaul |
| Version: | 16.05.9 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 16.05.10 |
| Attachments: | Proposed fix | ||
Description
David Paul
2017-02-10 16:50:30 MST
Just to confirm - the issue is that step_extern isn't setup and run after a node_features based reboot? I'm assuming all steps launched by the job will still be setup appropriately. Is this only an issue when using a burst buffer alongside the job, or does this happen without the burst buffer as well?

"Just to confirm - the issue is that step_extern isn't setup and run after a node_features based reboot? I'm assuming all steps launched by the job will still be setup appropriately."
- See sacct output below. Does this answer the question?
"Is this only an issue when using a burst buffer alongside the job, or does this happen without the burst buffer as well?"
- Yes, only with burst buffer. Other jobs requiring KNL reboots work fine.
- Also to note: the job does not produce .err & .out data.
=============================================================================
[dpaul@cori02]==> sacct -j 3724962
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3724962      knl_reboo+        knl     nstaff       1088 CANCELLED+      0:0
3724962.ext+     extern                nstaff       1088  CANCELLED     -2:0
[dpaul@cori02]==> sacct -j 3724068
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3724068         sto-def      debug      mp148       1024  COMPLETED      0:0
3724068.bat+      batch                 mp148         64  COMPLETED      0:0
3724068.ext+     extern                 mp148       1024  COMPLETED      0:0
3724068.0      DLPOLY.Z                 mp148        512  COMPLETED      0:0
[dpaul@cori02]==>
Hi,

To clarify, the batch step never starts on the knl reboot jobs with datawarp. As far as I can tell the job is never really launched.

(In reply to David Paul from comment #2)
> "Just to confirm - the issue is that step_extern isn't setup and run after a
> node_features based reboot? I'm assuming all steps launched by the job will
> still be setup appropriately."
>
> - See sacct output below. Does this answer the question?
>
> "Is this only an issue when using a burst buffer alongside the job, or does
> this happen without the burst buffer as well?"
>
> - Yes, only with burst buffer. Other jobs requiring KNL reboots work fine.
>
> - Also to note: the job does not produce .err & .out data.

Alright, that's good to know. Something in the state transitions with both the node_features and burst_buffer plugins active for the job means the batch step launch is skipped over... should be straightforward enough to reproduce.

Created attachment 4042
Proposed fix

NOTE: This patch is dependent upon a patch currently under review for bug 3446, and both patches still need testing.

This patch adds a new function to an existing pthread that defers the DW pre-run operation until after node boots complete or the job is killed.

I was able to test today on a Cray system. The final patch differs somewhat from the attached proposed patch. This commit is dependent upon changes in another commit made today for another bug.

Fix for this bug:
https://github.com/SchedMD/slurm/commit/bd7504fbe0986f87b3c63539f79a5d81cc122f56

It depends upon logic added in this commit:
https://github.com/SchedMD/slurm/commit/f6d42fdbb293ca89da609779db8d8c04a86a8d13

Let us know if problems are not resolved with these changes.
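
For readers following the thread, the deferral described in the proposed fix (hold the DW pre-run until node boots complete or the job is killed) can be illustrated with a small sketch. This is not the code from the commits linked above, and it is written in Python rather than Slurm's C purely for brevity. Every name here (`DeferredJob`, `nodes_booting`, `next_action`, `poll_once`) is hypothetical and only models the decision a periodic check might make, assuming the controller can tell how many allocated nodes are still rebooting.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    WAIT = auto()     # some nodes are still rebooting: check again next pass
    PRE_RUN = auto()  # all nodes are up: pre-run may start now
    SKIP = auto()     # job was killed while deferred: never start pre-run


@dataclass
class DeferredJob:
    """Hypothetical record for a job whose pre-run is being held back."""
    job_id: int
    nodes_booting: int   # allocated nodes that have not finished rebooting
    killed: bool = False


def next_action(job: DeferredJob) -> Action:
    """Decide what to do with one deferred job on this pass."""
    if job.killed:
        return Action.SKIP
    if job.nodes_booting > 0:
        return Action.WAIT
    return Action.PRE_RUN


def poll_once(jobs):
    """One pass of the periodic check.

    Returns (still_deferred, ready_ids): the jobs to re-check later and the
    IDs of jobs whose pre-run can be started. Killed jobs are dropped.
    """
    still_deferred, ready_ids = [], []
    for job in jobs:
        action = next_action(job)
        if action is Action.WAIT:
            still_deferred.append(job)
        elif action is Action.PRE_RUN:
            ready_ids.append(job.job_id)
        # Action.SKIP: fall through, the job is removed from the list.
    return still_deferred, ready_ids
```

The point of the sketch is the ordering: a killed job is checked first so it is never left waiting on a reboot, and pre-run is only released once no allocated node is still booting.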