Ticket 3461 - Datawarp job that requires KNL mode reboot does not function correctly
Summary: Datawarp job that requires KNL mode reboot does not function correctly
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Burst Buffers
Version: 16.05.9
Hardware: Cray XC Linux
Severity: 2 - High Impact
Assignee: Moe Jette
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-02-10 16:50 MST by David Paul
Modified: 2017-02-13 11:47 MST

See Also:
Site: NERSC
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 16.05.10
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Proposed fix (1.25 KB, patch)
2017-02-10 20:00 MST, Moe Jette
Details | Diff

Description David Paul 2017-02-10 16:50:30 MST
If a KNL job using Datawarp requires a node reboot to change the mode, the job's slurmstepd will not launch on the node when reboot is completed.

squeue displays the job in "Running" state immediately (during the reboot), as opposed to displaying "Configuring".

Comparing the knl_dw job to a haswell_dw job indicates the knl_dw job never triggers "_slurm_rpc_job_alloc_info_lite" in slurmctld.log. (See below)

All other functions occur as expected: DW_allocation, stage_in, remount_after_boot and stage_out.
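The distinguishing symptom is mechanical: the working Haswell job logs the `_slurm_rpc_job_alloc_info_lite` RPC in slurmctld.log and the failing KNL+DataWarp job never does. A minimal illustrative sketch of that check (not an actual NERSC or SchedMD tool; log lines abbreviated from the excerpts below):

```python
# Check whether a job's allocation ever reached the step-launch RPC.
# MARKER is the slurmctld debug line logged by the working Haswell job
# and never logged by the failing KNL+DataWarp job.
MARKER = "_slurm_rpc_job_alloc_info_lite"

def job_launched(log_lines, jobid):
    """True if slurmctld logged the alloc-info RPC for this job."""
    needle = f"JobId={jobid}"
    return any(MARKER in line and needle in line for line in log_lines)

# Abbreviated excerpts from the two jobs compared in this ticket:
knl_log = [
    "[2017-02-10T11:55:46.869] sched: Allocate JobID=3724962 "
    "NodeList=nid0[9171-9174] #CPUs=1088 Partition=knl",
]
haswell_log = [
    "[2017-02-10T08:26:24.636] debug:  _slurm_rpc_job_alloc_info_lite "
    "JobId=3724068 NodeList=nid000[09-16,18-25] usec=1",
]

print(job_launched(knl_log, 3724962))      # False - step launch never happens
print(job_launched(haswell_log, 3724068))  # True
```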

===========================================================================
<knl_dw job from slurmctld.log>

[2017-02-10T11:55:46.683] debug:  sched: Running job scheduler
[2017-02-10T11:55:46.685] burst_buffer/cray: bb_p_job_test_stage_in: JobID=3724962 test_only:0
[2017-02-10T11:55:46.688] burst_buffer/cray: bb_p_job_begin: JobID=3724962
[2017-02-10T11:55:46.824] bb_p_job_begin: paths ran for usec=135140
[2017-02-10T11:55:46.824] dw_wlm_cli --function paths --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.2/job.3724962/script --token 3724962 --pathfile /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.2/job.3724962/path
[2017-02-10T11:55:46.824] _update_job_env: DW_JOB_STRIPED=/var/opt/cray/dws/mounts/batch/3724962/ss/
[2017-02-10T11:55:46.867] power_save: pid 31305 reboot nodes nid0[9171-9174] features quad,cache
[2017-02-10T11:55:46.869] sched: Allocate JobID=3724962 NodeList=nid0[9171-9174] #CPUs=1088 Partition=knl
[2017-02-10T11:55:46.869] burst_buffer/cray: bb_p_job_test_stage_in: JobID=3715265 test_only:0
[2017-02-10T11:55:46.927] debug:  sched: already tested 402 jobs, breaking out
[2017-02-10T11:55:46.932] debug:  Still waiting for boot of node nid09173
[2017-02-10T11:55:46.932] debug:  Still waiting for boot of node nid09172
[2017-02-10T11:55:46.932] debug:  Still waiting for boot of node nid09171
[2017-02-10T11:55:46.932] debug:  Still waiting for boot of node nid09174
[2017-02-10T11:55:48.214] debug:  /usr/sbin/capmc_resume[31305]: set_mcdram_cfg sent to 9171-9174
[2017-02-10T11:55:49.044] _start_pre_run: dws_pre_run for JobID=3724962 ran for usec=2220123
[2017-02-10T11:55:49.044] dw_wlm_cli --function pre_run --token 3724962 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.2/job.3724962/script --nidlistfile /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.2/job.3724962/client_nids
[2017-02-10T11:55:49.044] dwwlm_pre_run nidlist ['nid09171', 'nid09172', 'nid09173', 'nid09174']
activation 11437 created
[2017-02-10T11:55:49.210] debug:  /usr/sbin/capmc_resume[31305]: set_numa_cfg sent to 9171-9174
[2017-02-10T11:55:50.140] debug:  /usr/sbin/capmc_resume[31305]: node_reinit sent to 9171-9174
[2017-02-10T11:55:50.141] debug:  auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2017-02-10T11:55:50.141] _update_node_active_features: nodes nid09171 active features set to: quad,cache,knl
[2017-02-10T11:55:50.141] _update_node_active_features: nodes nid09172 active features set to: quad,cache,knl
[2017-02-10T11:55:50.141] _update_node_active_features: nodes nid09173 active features set to: quad,cache,knl
[2017-02-10T11:55:50.141] _update_node_active_features: nodes nid09174 active features set to: quad,cache,knl
[2017-02-10T11:55:50.573] debug:  sched: Running job scheduler
[2017-02-10T11:55:50.575] burst_buffer/cray: bb_p_job_test_stage_in: JobID=3715265 test_only:0
...
...
[2017-02-10T12:39:27.959] _slurm_rpc_kill_job2: REQUEST_KILL_JOB job 3724962 uid 15448
[2017-02-10T12:39:27.959] burst_buffer/cray: bb_p_job_start_stage_out: JobID=3724962
[2017-02-10T12:39:27.982] jobcomp/nersc: {"jobId":3724962,"arrayJobId":0,"arrayTaskId":4294967294,"jobDerivedExitCode":0,"jobExitCode":0,"submitTime":1486755788,"startTime":1486756546,"endTime":1486759167,"allocCpus":1088,"allocNodes":4,"uid":15448,"timeLimit":0,"resizing":0,"jobAccount":"nstaff","workingDirectory":"/global/u1/d/dpaul/cori/datawarp/P2/knl","stdinPath":"/dev/null","script":"#!/bin/bash\n#SBATCH -p knl_reboot\n#SBATCH -C knl,quad,cache\n#SBATCH -N 4\n#SBATCH -t 01:00:00\n#SBATCH --reservation=dpaul_knl_dw2\n#DW jobdw capacity=1800GB access_mode=striped type=scratch  pool=sm_pool\n\n#DW stage_in destination=$DW_JOB_STRIPED/ source=/global/cscratch1/sd/dpaul/testBB_stage_in  type=file\n#DW stage_in destination=$DW_JOB_STRIPED/ source=/global/cscratch1/sd/dpaul/slurm_logs type=directory\n#DW stage_in destination=$DW_JOB_STRIPED/ source=/global/cscratch1/sd/dpaul/andrey/output/ type=directory\n\n#DW stage_out source=$DW_JOB_STRIPED/ destination=/global/cscratch1/sd/dpaul/knl_output type=directory\n\necho \"check1\"\necho \"====================\"\nsrun echo $DW_JOB_STRIPED\necho \"====================\"\necho \"check2\"\necho \"====================\"\necho $DW_JOB_STRIPED\necho \"====================\"\nsrun hostname\necho \"====================\"\ndf -Pk\necho \"====================\"\nsrun df -Ph | grep dws\necho \"====================\"\n\nsrun ls -lR $DW_JOB_STRIPED\necho \"====================\"\n\nsrun sleep 7200\n\n","argv":"/global/u1/d/dpaul/cori/datawarp/P2/knl/knl_reboot_in.sh","features":"knl&quad&cache","nodes":"nid0[9171-9174]","partition":"knl","name":"knl_reboot_in.sh","burstBuffer":"#DW jobdw capacity=1800GB access_mode=striped type=scratch  pool=sm_pool\n#DW stage_in destination=$DW_JOB_STRIPED/ source=/global/cscratch1/sd/dpaul/testBB_stage_in  type=file\n#DW stage_in destination=$DW_JOB_STRIPED/ source=/global/cscratch1/sd/dpaul/slurm_logs type=directory\n#DW stage_in destination=$DW_JOB_STRIPED/ 
source=/global/cscratch1/sd/dpaul/andrey/output/ type=directory\n#DW stage_out source=$DW_JOB_STRIPED/ destination=/global/cscratch1/sd/dpaul/knl_output type=directory","gresRequest":"craynetwork:1","tresRequest":"1=1088,2=368640,4=4,6=1856160","reservation":"dpaul_knl_dw2","batchHost":"nid09171","cluster":"cori","qos":"normal_knl_2"}
[2017-02-10T12:39:27.982] debug:  jobcomp/nersc: took 0.0225s to collect data for job 3724962
[2017-02-10T12:39:28.006] _job_signal: 9 of running JobID=3724962 State=0x8004 NodeCnt=4 successful 0x8004
[2017-02-10T12:39:28.034] debug:  jobcomp/nersc: took 0.0287s to execute /etc/slurm/jobcomplete.sh
[2017-02-10T12:39:29.159] debug:  sched: Running job scheduler
[2017-02-10T12:39:29.161] burst_buffer/cray: bb_p_job_test_stage_in: JobID=3715265 test_only:0
[2017-02-10T12:39:29.212] debug:  sched: already tested 402 jobs, breaking out
[2017-02-10T12:39:29.480] _start_stage_out: dws_post_run for job 3724962 ran for usec=1520126
[2017-02-10T12:39:29.480] dw_wlm_cli --function post_run --token 3724962 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.2/job.3724962/script
[2017-02-10T12:39:29.480] Activation 11437 unmounted

============================================================================
<Haswell job with Datawarp>
# grep 3724068  /var/tmp/slurm/slurmctld.log | grep -v sessions
[2017-02-10T08:24:03.958] debug:  Job 3724068 feature list: haswell
[2017-02-10T08:24:03.963] cray: Job:3724068 UserID:60891 Swap:0x0 TotalSize:108129157120
[2017-02-10T08:24:03.963] burst_buffer/cray: bb_p_job_validate2: JobID=3724068
[2017-02-10T08:24:04.100] dw_wlm_cli --function job_process --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.3724068/script
[2017-02-10T08:24:04.100] burst_buffer/cray: bb_p_job_test_stage_in: JobID=3724068 test_only:1
[2017-02-10T08:24:04.101] _slurm_rpc_submit_batch_job JobId=3724068 usec=142615
[2017-02-10T08:24:04.454] burst_buffer/cray: bb_p_job_test_stage_in: JobID=3724068 test_only:0
[2017-02-10T08:24:04.454] _alloc_job_bb: start job allocate JobID=3724068
[2017-02-10T08:24:06.756] _start_stage_in: setup for job 3724068 ran for usec=2302108
[2017-02-10T08:24:06.756] dw_wlm_cli --function setup --token 3724068 --caller SLURM --user 60891 --groupid 60891 --capacity sm_pool:103120MB --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.3724068/script
[2017-02-10T08:24:06.891] _start_stage_in: dws_data_in for job 3724068 ran for usec=134837
[2017-02-10T08:24:06.891] dw_wlm_cli --function data_in --token 3724068 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.3724068/script
[2017-02-10T08:24:06.891] _start_stage_in: Setup/stage-in complete for job 3724068
[2017-02-10T08:24:07.570] burst_buffer/cray: bb_p_job_test_stage_in: JobID=3724068 test_only:0
[2017-02-10T08:24:55.718] burst_buffer/cray: bb_p_job_test_stage_in: JobID=3724068 test_only:0
[2017-02-10T08:25:11.122] burst_buffer/cray: bb_p_job_test_stage_in: JobID=3724068 test_only:0
[2017-02-10T08:25:31.536] burst_buffer/cray: bb_p_job_test_stage_in: JobID=3724068 test_only:0
[2017-02-10T08:26:19.682] burst_buffer/cray: bb_p_job_test_stage_in: JobID=3724068 test_only:0
[2017-02-10T08:26:19.684] burst_buffer/cray: bb_p_job_begin: JobID=3724068
[2017-02-10T08:26:19.819] dw_wlm_cli --function paths --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.3724068/script --token 3724068 --pathfile /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.3724068/path
[2017-02-10T08:26:19.819] _update_job_env: DW_JOB_STRIPED=/var/opt/cray/dws/mounts/batch/3724068/ss/
[2017-02-10T08:26:19.842] sched: Allocate JobID=3724068 NodeList=nid000[09-16,18-25] #CPUs=1024 Partition=debug
[2017-02-10T08:26:24.289] _start_pre_run: dws_pre_run for JobID=3724068 ran for usec=4469792
[2017-02-10T08:26:24.289] dw_wlm_cli --function pre_run --token 3724068 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.3724068/script --nidlistfile /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.3724068/client_nids
[2017-02-10T08:26:24.636] debug:  _slurm_rpc_job_alloc_info_lite JobId=3724068 NodeList=nid000[09-16,18-25] usec=1
[2017-02-10T08:26:24.637] debug:  reserved ports 63823 for step 3724068.0
[2017-02-10T08:41:11.873] debug:  freed ports 63823 for step 3724068.0
[2017-02-10T08:41:12.032] job_complete: JobID=3724068 State=0x1 NodeCnt=16 WEXITSTATUS 0
[2017-02-10T08:41:12.032] burst_buffer/cray: bb_p_job_start_stage_out: JobID=3724068
[2017-02-10T08:41:12.053] jobcomp/nersc: {"jobId":3724068,"arrayJobId":0,"arrayTaskId":4294967294,"jobDerivedExitCode":0,"jobExitCode":0,"submitTime":1486743843,"startTime":1486743979,"endTime":1486744872,"allocCpus":1024,"allocNodes":16,"uid":60891,"timeLimit":0,"resizing":0,"jobAccount":"mp148","workingDirectory":"/global/cscratch1/sd/ezar/STO/640MeVXe/new-TABLE/pristine/equil","stdinPath":"/dev/null","script":"#!/bin/bash -l\n#SBATCH -p debug\n#SBATCH -N 16\n#SBATCH -t 00:30:00\n#SBATCH -C haswell\n#SBATCH -A mp148\n#SBATCH -J sto-def\n#DW jobdw capacity=100GB access_mode=striped type=scratch pool=sm_pool\n\nsrun -n 512 /global/cscratch1/sd/ezar/code/dl_poly_4.05/xy-cray/execute/DLPOLY.Z>output --scratchdirname=$DW_JOB_STRIPED\n","argv":"/global/cscratch1/sd/ezar/STO/640MeVXe/new-TABLE/pristine/equil/bb","features":"haswell","nodes":"nid000[09-16,18-25]","partition":"debug","name":"sto-def","burstBuffer":"#DW jobdw capacity=100GB access_mode=striped type=scratch pool=sm_pool","licenses":"cscratch1:1","gresRequest":"craynetwork:1","tresRequest":"1=1024,2=1998848,4=16,6=103120","batchHost":"nid00009","cluster":"cori","qos":"debug"}
[2017-02-10T08:41:12.053] debug:  jobcomp/nersc: took 0.0211s to collect data for job 3724068
[2017-02-10T08:41:12.075] job_complete: JobID=3724068 State=0x8003 NodeCnt=16 done
[2017-02-10T08:41:13.545] _start_stage_out: dws_post_run for job 3724068 ran for usec=1513296
[2017-02-10T08:41:13.545] dw_wlm_cli --function post_run --token 3724068 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.3724068/script
[2017-02-10T08:41:13.716] _start_stage_out: dws_data_out for job 3724068 ran for usec=171092
[2017-02-10T08:41:13.716] dw_wlm_cli --function data_out --token 3724068 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.3724068/script
[2017-02-10T08:41:13.716] _start_stage_out: Stage-out/post-run complete for job 3724068
[2017-02-10T08:41:17.174] _start_teardown: teardown for job 3724068 ran for usec=3458032
[2017-02-10T08:41:17.174] dw_wlm_cli --function teardown --token 3724068 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.3724068/script
[2017-02-10T08:46:57.873] burst_buffer/cray: bb_p_job_test_stage_out: JobID=3724068
ctl1:~ #
Comment 1 Tim Wickberg 2017-02-10 16:55:30 MST
Just to confirm - the issue is that step_extern isn't setup and run after a node_features based reboot? I'm assuming all steps launched by the job will still be setup appropriately.

Is this only an issue when using a burst buffer alongside the job, or does this happen without the burst buffer as well?
Comment 2 David Paul 2017-02-10 18:10:09 MST
"Just to confirm - the issue is that step_extern isn't setup and run after a node_features based reboot? I'm assuming all steps launched by the job will still be setup appropriately."

- See sacct output below. Does this answer the question?

"Is this only an issue when using a burst buffer alongside the job, or does this happen without the burst buffer as well?"

- Yes, only with burst buffer. Other jobs requiring KNL reboots work fine.

- Also to note: the job does not produce .err & .out data.

=============================================================================

[dpaul@cori02]==> sacct -j 3724962
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3724962      knl_reboo+        knl     nstaff       1088 CANCELLED+      0:0
3724962.ext+     extern                nstaff       1088  CANCELLED     -2:0


[dpaul@cori02]==> sacct -j 3724068
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3724068         sto-def      debug      mp148       1024  COMPLETED      0:0
3724068.bat+      batch                 mp148         64  COMPLETED      0:0
3724068.ext+     extern                 mp148       1024  COMPLETED      0:0
3724068.0      DLPOLY.Z                 mp148        512  COMPLETED      0:0
[dpaul@cori02]==>
Comment 3 Doug Jacobsen 2017-02-10 18:19:19 MST
Hi,

To clarify, the batch step never starts on the knl reboot jobs with
datawarp.  As far as I can tell the job is never really launched.
Comment 4 Tim Wickberg 2017-02-10 18:26:27 MST
(In reply to David Paul from comment #2)
> "Just to confirm - the issue is that step_extern isn't setup and run after a
> node_features based reboot? I'm assuming all steps launched by the job will
> still be setup appropriately."
> 
> - See sacct output below. Does this answer the question?
> 
> "Is this only an issue when using a burst buffer alongside the job, or does
> this happen without the burst buffer as well?"
> 
> - Yes, only with burst buffer. Other jobs requiring KNL reboots work fine.
> 
> - Also to note: the job does not produce .err & .out data.

Alright, that's good to know. Something in the state transitions with both node_features and burst_buffer plugins active for the job means the batch step launch is skipped over... should be straightforward enough to reproduce.
Comment 7 Moe Jette 2017-02-10 20:00:08 MST
Created attachment 4042 [details]
Proposed fix

NOTE: This patch is dependent upon a patch currently under review for bug 3446, and both patches still need testing. This patch adds a new function to an existing pthread that defers the DW pre-run operation until after node boots complete or the job is killed.
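The deferral logic described above — wait for the node boots to complete before running the DataWarp pre-run, bailing out if the job is killed first — can be sketched roughly as follows (a hypothetical illustration in Python, not SchedMD's actual C code; `boot_done`, `job_killed`, and `pre_run` are stand-ins for slurmctld state):

```python
import threading

def deferred_pre_run(boot_done, job_killed, pre_run, poll_interval=0.05):
    """Block until all nodes have booted, then run the DW pre-run step.
    If the job is killed while waiting, skip pre-run and return False."""
    while not boot_done.is_set():
        if job_killed.is_set():
            return False          # job cancelled during reboot: no pre_run
        job_killed.wait(poll_interval)
    pre_run()                     # nodes are up: safe to mount/pre-run now
    return True

# Simulated scenario: the nodes finish rebooting shortly after job start.
boot_done, job_killed = threading.Event(), threading.Event()
ran = []
worker = threading.Thread(
    target=lambda: ran.append(
        deferred_pre_run(boot_done, job_killed, lambda: None)))
worker.start()
boot_done.set()                   # reboot completes
worker.join()
print(ran)                        # [True] - pre_run ran after boot
```

The point of moving this into the existing pthread is that slurmctld never blocks the scheduler waiting on a reboot; the pre-run simply fires later, once the boot (or kill) event arrives.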
Comment 8 Moe Jette 2017-02-13 11:47:36 MST
I was able to test today on a Cray system. The final patch differs somewhat from the attached proposed patch. This commit is dependent upon changes in another commit made today for another bug.

Fix for this bug:
https://github.com/SchedMD/slurm/commit/bd7504fbe0986f87b3c63539f79a5d81cc122f56

It depends upon logic added in this commit:
https://github.com/SchedMD/slurm/commit/f6d42fdbb293ca89da609779db8d8c04a86a8d13

Let us know if problems are not resolved with these changes.