| Summary: | Persistent Reservation created but JobHeldAdmin | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | David Paul <dpaul> |
| Component: | Burst Buffers | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | dpaul, tim |
| Version: | 15.08.10 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | NERSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave Sites: | --- | Cray Sites: | --- |
| DS9 Clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC Sites: | --- | Linux Distro: | --- |
| Machine Name: | Cori | CLE Version: | |
| Version Fixed: | 15.08.11 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
David Paul
2016-04-11 05:58:47 MDT
David - your initial comment came through as empty for some reason... I'm not sure if that was expected or a problem with Bugzilla.

Jobs submitted to create (or destroy) a Persistent Reservation actually create (or destroy) the PR, but the job ends up in the JobHeldAdmin state due to a return of "error: bb_g_job_begin(1833891): Burst Buffer request invalid". I wonder if this was a result of the patch(es) for our bug# 2404.

(slurmctld.log):
[2016-04-11T11:42:17.979] cray: Job:1833891 UserID:15448 Swap:0x0 TotalSize:0
[2016-04-11T11:42:17.979] Create Name:bbdave Size:228606345216 Access:striped Type:scratch State:pending
[2016-04-11T11:42:17.993] _slurm_rpc_submit_batch_job JobId=1833891 usec=16318
[2016-04-11T11:42:18.994] debug: sched: Running job scheduler
[2016-04-11T11:42:33.225] burst_buffer/cray: bb_p_job_try_stage_in
[2016-04-11T11:42:33.225] _alloc_job_bb: start job allocate JobID=1833891
[2016-04-11T11:42:33.226] debug: cray: select_p_state_save
[2016-04-11T11:42:34.557] dw_wlm_cli --function create_persistent -c CLI -t bbdave -u 15448 -C wlm_pool:228606345216 -a striped -T scratch
[2016-04-11T11:42:34.557] session 5456 created instance 4905 created configuration 5182 created
[2016-04-11T11:42:34.557] create_persistent of bbdave ran for usec=1464263
[2016-04-11T11:42:35.142] debug: _bb_get_sessions: show_sessions ran for usec=585383
[2016-04-11T11:42:35.142] dw_wlm_cli --function show_sessions "owner": 15448, "state": {"actualized": true, "fuse_blown": false, "goal": "create", "mixed": false, "transitioning": false}, "token": "bbdave"}]}
[2016-04-11T11:43:17.799] backfill test for JobID=1833891 Prio=69124 Partition=debug
[2016-04-11T11:43:17.799] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1833891 test_only:1
[2016-04-11T11:43:17.799] burst_buffer/cray: bb_p_job_get_est_start: JobID=1833891
[2016-04-11T11:43:20.730] _alloc_job_bb: start job allocate JobID=1833891
[2016-04-11T11:43:21.273] _start_stage_in: setup for job 1833891 ran for usec=542139
[2016-04-11T11:43:21.273] dw_wlm_cli --function setup --token 1833891 --caller SLURM --user 15448 --capacity wlm_pool:0 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.1/job.1833891/script
[2016-04-11T11:43:21.876] _start_stage_in: dws_data_in for job 1833891 ran for usec=603359
[2016-04-11T11:43:21.876] dw_wlm_cli --function data_in --token 1833891 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.1/job.1833891/script
[2016-04-11T11:43:21.876] stage operations completed
[2016-04-11T11:43:33.414] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1833891 test_only:0
[2016-04-11T11:43:33.424] burst_buffer/cray: bb_p_job_begin: JobID=1833891
[2016-04-11T11:43:34.170] bb_p_job_begin: paths ran for usec=744611
[2016-04-11T11:43:34.170] dw_wlm_cli --function paths --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.1/job.1833891/script --token 1833891 --pathfile /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.1/job.1833891/path
[2016-04-11T11:43:34.170] error: bb_p_job_begin: paths for job 1833891 status:256 response:
[2016-04-11T11:43:34.170] error: bb_g_job_begin(1833891): Burst Buffer request invalid
[2016-04-11T11:43:34.815] _start_pre_run: dws_pre_run for JobID=1833891 ran for usec=644942
[2016-04-11T11:43:34.815] dw_wlm_cli --function pre_run --token 1833891 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.1/job.1833891/script --nidlistfile /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.1/job.1833891/client_nids
[2016-04-11T11:43:34.815] dwwlm_pre_run nidlist ['nid00051']
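A note on the Size value in the log above, reading it against the dwstat output below: Slurm appears to parse capacity=200GB as 200 GiB (200 x 2^30 = 214748364800 bytes; compare the later test log on another system, which shows Size:214748364800 for the same request) and then round the allocation up to one whole wlm_pool granule, since 228606345216 bytes / 2^30 = 212.91 GiB, exactly the gran value dwstat reports for wlm_pool.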
nid00837:/var/tmp/slurm # dwstat most
pool      units quantity  free      gran
test_pool bytes 5.82TiB   5.82TiB   16MiB
wlm_pool  bytes 832.5TiB  705.46TiB 212.91GiB

sess state token  creator owner created             expiration nodes
5456 CA--- bbdave CLI     15448 2016-04-11T11:42:33 never      0

inst state sess bytes     nodes created             expiration intact label  public confs
4905 CA--- 5456 212.91GiB 1     2016-04-11T11:42:33 never      true   bbdave true   1

conf state inst type    access_type activs
5182 CA--- 4905 scratch stripe      0

Ah, I had hit <cr> instead of <tab>. Thanks for checking.

Hi. Reducing the severity of this bug. Doug is submitting a more serious issue on the accounting DB.

Sev 2 vs 3 isn't a huge deal. I'm thinking these two bugs are related - some changes to BB reservations may have had unintended side effects on both scheduling and accounting. Are you running any patches on top of 15.08.9 at the moment? I haven't committed a patch related to the 2404 you reference, and don't think that bug is related here. Although I know we've been working through some other BB issues with you recently, and am guessing you have a local patch or two related to that?

Could I get the header of the script? I just want to see the #DW and #BB options.

[dpaul@cori12]==> cat bb_create.sh
#!/bin/bash
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -t 00:05:00
#SBATCH -o create_bb.%j.out
#SBATCH -e create_bb.%j.err
#SBATCH -J create_bb
#BB create_persistent name=bbdave capacity=200GB access=striped type=scratch
echo "hostname is:"
hostname

I was able to confirm the problem and fix it here:
https://github.com/SchedMD/slurm/commit/87e1cc6760a7cf7e6f18789c11928296daa16afc
The fix will be in version 15.08.11 when released, likely late April or early May.

The essence of the problem is that Slurm calls the DW API with the user script and asks for the environment variables that need to be set for the job. If the job does not use burst buffers (but only creates or deletes persistent buffers), the DW API returns an error code, but says the script is fine (the response is "job_file_valid True"). Slurm specifically looks for that response and, if found, treats the response as OK (i.e., ignores the exit code of 1).

Thanks for the quick fix!

Was this a regression? I had tested PR back in Dec/Jan and did not encounter this issue.

(In reply to David Paul from comment #9)
> Thanks for the quick fix!
>
> Was this a regression?
>
> I had tested PR back in Dec/Jan and did not encounter this issue.

The bug in the code existed previously, but was latent until this bug fix in version 15.08.9:
 -- burst_buffer/cray: Set environment variables just before starting job
    rather than at job submission time to reflect persistent buffers created
    or modified while the job is pending.
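For reference, a minimal shell sketch of the "paths" response handling described above. This is not Slurm source (the real logic is C code in the burst_buffer/cray plugin); SCRIPT, JOBID, and PATHFILE are hypothetical placeholders:

# Sketch only: approximate acceptance logic for the DW "paths" call.
response=$(dw_wlm_cli --function paths --job "$SCRIPT" --token "$JOBID" \
           --pathfile "$PATHFILE")
rc=$?
# Note: "status:256" in the slurmctld log is the raw wait() status,
# which corresponds to an exit code of 1.
if [ "$rc" -eq 0 ] || echo "$response" | grep -q 'job_file_valid True'; then
    :  # accepted: read the pathfile to set the job's DW_* environment variables
else
    echo "error: bb_g_job_begin: Burst Buffer request invalid" >&2
fi

On the failing system described later in this ticket, the call returns exit code 1 with an empty response, so neither condition holds and the job is held.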
Hi,

Doug applied the patch(es) today. The behavior is the same for removing an existing PR. However, trying to create a new PR gets stuck at BurstBufferResources.
[dpaul@cori06]==> squeue -l --qos=burstbuffer
Mon Apr 18 17:16:51 2016
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
1926140 debug PRcreate dpaul PENDING 0:00 2:00 1 (BurstBufferResources)
<slurmctld.log>
[2016-04-18T17:16:42.191] cray: Job:1926140 UserID:15448 Swap:0x0 TotalSize:0
[2016-04-18T17:16:42.191] Use Name:davebb3
[2016-04-18T17:16:42.191] burst_buffer/cray: bb_p_job_validate2: JobID=1926140
[2016-04-18T17:16:42.586] bb_p_job_validate2: job_process ran for usec=394512
[2016-04-18T17:16:42.586] dw_wlm_cli --function job_process --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.0/job.1926140/script
[2016-04-18T17:16:42.601] _slurm_rpc_submit_batch_job JobId=1926140 usec=411328
[2016-04-18T17:16:42.684] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1926140 test_only:0
[2016-04-18T17:16:42.684] _alloc_job_bb: start job allocate JobID=1926140
[2016-04-18T17:16:42.686] _alloc_job_bb: start job allocate JobID=1926140
[2016-04-18T17:16:47.750] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1926140 test_only:0
[2016-04-18T17:16:47.750] _alloc_job_bb: start job allocate JobID=1926140
<repeat>
[2016-04-18T17:17:24.764] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1926140 test_only:0
[2016-04-18T17:17:24.764] _alloc_job_bb: start job allocate JobID=1926140
(In reply to David Paul from comment #11)
> Hi,
>
> Doug applied the patch(es) today. The behavior is the same for removing an
> existing PR. However, trying to create a new PR gets stuck at
> BurstBufferResources.

Just to clarify: you can remove a PR now, and it is only create that hangs, correct?

We can remove a PR now, but the job then becomes JobHeldAdmin, same as it did prior to the patch. Now we cannot create a PR; it hangs at "BurstBufferResources". Previously it would create the PR and then become JobHeldAdmin.

BTW - We are now at 15.08.10 plus patches.

(In reply to David Paul from comment #11)
> Hi,
>
> Doug applied the patch(es) today. The behavior is the same for removing an
> existing PR. However, trying to create a new PR gets stuck at
> BurstBufferResources.
>
> [dpaul@cori06]==> squeue -l --qos=burstbuffer
> Mon Apr 18 17:16:51 2016
> JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
> 1926140 debug PRcreate dpaul PENDING 0:00 2:00 1 (BurstBufferResources)
>
> <slurmctld.log>
> [2016-04-18T17:16:42.191] cray: Job:1926140 UserID:15448 Swap:0x0 TotalSize:0
> [2016-04-18T17:16:42.191] Use Name:davebb3

I am testing with version 15.08.10, plus the patch identified in comment #8:
https://github.com/SchedMD/slurm/commit/87e1cc6760a7cf7e6f18789c11928296daa16afc

When I submit a job to create the persistent burst buffer as described in comment #7:
...
#BB create_persistent name=bbdave capacity=200GB access=striped type=scratch
...
that generates a log like this:
slurmctld: cray: Job:572 UserID:108 Swap:0x0 TotalSize:0
slurmctld: Create Name:bbdave Size:214748364800 Access:striped Type:scratch State:pending
and it creates the buffer as designed. Ditto for the destroy, which generates a log like this:
slurmctld: cray: Job:574 UserID:108 Swap:0x0 TotalSize:0
slurmctld: Destroy Name:bbdave Hurry:0

But your log shows this:
[2016-04-18T17:16:42.191] cray: Job:1926140 UserID:15448 Swap:0x0 TotalSize:0
[2016-04-18T17:16:42.191] Use Name:davebb3

Note the "Create" vs. "Use". Could you check the script contents? It really looks like the script is just trying to use the buffer, which will fail if the buffer doesn't exist at the time the job tries to start.
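To make the Create/Use/Destroy distinction concrete, here are minimal script directives for each case, following the syntax already shown in this ticket (the buffer names are illustrative, and the use case assumes the standard #DW persistentdw directive):

# Create a persistent buffer (what the script in comment #7 does):
#BB create_persistent name=bbdave capacity=200GB access=striped type=scratch

# Use an already-existing persistent buffer (the "Use Name:davebb3" case;
# fails if the buffer does not exist when the job starts):
#DW persistentdw name=davebb3

# Destroy a persistent buffer (the "Destroy Name:bbdave Hurry:0" case):
#BB destroy_persistent name=bbdave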
You are correct. My apologies for the poor testing. I was hurrying to test before leaving for the day.

However, the 'original' problem has not been fixed. Creating the PR succeeds but the job becomes JHA reporting "256 response: Burst Buffer request invalid". Did the patch address the correct bug? (I hope my testing is correct this time.)

[dpaul@cori06]==> cat PRcreate.sh
#!/bin/bash
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -t 00:02:00
#BB create_persistent name=dpaulbb3 capacity=200GB access=striped type=scratch
echo "DW_PERSISTENT_STRIPED_piname is: "
sleep 90

<squeue -l -u dpaul>
1942598 debug PRcreate dpaul PENDING 0:00 2:00 1 (JobHeldAdmin)

nid00837:/var/tmp/slurm # dwstat sessions
sess state token    creator owner created             expiration nodes
5748 CA--- dpaulbb3 CLI     15448 2016-04-19T14:52:48 never      0

[2016-04-19T14:52:42.083] job_submit.lua: looking up username for uid 15448
[2016-04-19T14:52:42.384] job_submit.lua: slurm_job_submit: user dpaul
[2016-04-19T14:52:42.384] job_submit.lua: req max_nodes: 1
[2016-04-19T14:52:42.384] job_submit.lua: req min_nodes: 1
[2016-04-19T14:52:42.384] job_submit.lua: req num_tasks: 4294967294
[2016-04-19T14:52:42.384] job_submit.lua: req ntasks_per_core: 65534
[2016-04-19T14:52:42.384] job_submit.lua: req ntasks_per_socket: 65534
[2016-04-19T14:52:42.384] job_submit.lua: req ntasks_per_node: 65534
[2016-04-19T14:52:46.368] job_submit.lua: slurm_job_submit: worked out that we need 1 nodes
[2016-04-19T14:52:46.368] job_submit.lua: processing debug partition for qos, 1 nodes
[2016-04-19T14:52:46.368] job_submit.lua: got timeLimit of 2 minutes
[2016-04-19T14:52:46.368] job_submit.lua: coreHours of 1.066667, chargefactor: 2.500000
[2016-04-19T14:52:46.368] Warning: Note very large processing time from job_submit_plugin_submit: usec=4408261 began=14:52:41.960
[2016-04-19T14:52:46.369] burst_buffer/cray: bb_p_job_validate: job_user_id:15448, submit_uid:15448
[2016-04-19T14:52:46.369] bb_p_job_validate: burst_buffer:#BB create_persistent name=dpaulbb3 capacity=200GB access=striped type=scratch
[2016-04-19T14:52:46.371] cray: Job:1942598 UserID:15448 Swap:0x0 TotalSize:0
[2016-04-19T14:52:46.371] Create Name:dpaulbb3 Size:228606345216 Access:striped Type:scratch State:pending
[2016-04-19T14:52:46.383] _slurm_rpc_submit_batch_job JobId=1942598 usec=14818
[2016-04-19T14:52:47.689] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1942598 test_only:0
[2016-04-19T14:52:47.689] _alloc_job_bb: start job allocate JobID=1942598
[2016-04-19T14:52:47.722] burst_buffer/cray: bb_p_job_try_stage_in
[2016-04-19T14:52:47.723] _alloc_job_bb: start job allocate JobID=1939863
[2016-04-19T14:52:47.723] _alloc_job_bb: start job allocate JobID=1942598
[2016-04-19T14:52:48.759] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1942598 test_only:0
[2016-04-19T14:52:48.759] _alloc_job_bb: start job allocate JobID=1942598
[2016-04-19T14:52:51.338] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1942598 test_only:0
[2016-04-19T14:52:51.338] _alloc_job_bb: start job allocate JobID=1942598
[2016-04-19T14:52:54.422] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1942598 test_only:0
[2016-04-19T14:52:54.799] _start_stage_in: setup for job 1942598 ran for usec=3460577
[2016-04-19T14:52:54.799] dw_wlm_cli --function setup --token 1942598 --caller SLURM --user 15448 --capacity wlm_pool:0 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1942598/script
[2016-04-19T14:52:55.195] _start_stage_in: dws_data_in for job 1942598 ran for usec=395875
[2016-04-19T14:52:55.195] dw_wlm_cli --function data_in --token 1942598 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1942598/script
[2016-04-19T14:52:55.195] stage operations completed
[2016-04-19T14:53:03.422] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1942598 test_only:0
[2016-04-19T14:53:03.435] burst_buffer/cray: bb_p_job_begin: JobID=1942598
[2016-04-19T14:53:03.836] bb_p_job_begin: paths ran for usec=401082
[2016-04-19T14:53:03.836] dw_wlm_cli --function paths --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1942598/script --token 1942598 --pathfile /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1942598/path
[2016-04-19T14:53:03.836] error: bb_p_job_begin: paths for job 1942598 status:256 response:
[2016-04-19T14:53:03.836] error: bb_g_job_begin(1942598): Burst Buffer request invalid
[2016-04-19T14:53:04.296] _start_pre_run: dws_pre_run for JobID=1942598 ran for usec=459446
[2016-04-19T14:53:04.296] dw_wlm_cli --function pre_run --token 1942598 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1942598/script --nidlistfile /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1942598/client_nids
[2016-04-19T14:53:04.296] dwwlm_pre_run nidlist ['nid00029']

This looks like a problem with the DW API. Slurm is looking for the API to either return a zero exit code or a message of "job_file_valid True", which is what I see on Cray's test cluster. The API on your system is returning with an exit code of 1 and no message. I'll take this up with Cray. What version of the Cray software are you running?

Is this enough info to determine the version of DWS?

# which dwstat
/opt/cray/dws/1.0-1.0502.66278.43.10.ari/bin/dwstat

The system OS is 5.2UP04.

This brings up a question. We are planning to bring up 6.0UP00+patches next week on our test system (gerty) and 6.0UP01-pre-release on Cori by the end of May. Will Slurm support these versions, and is there a specific version of Slurm that is required? Thanks.

(In reply to David Paul from comment #18)
> Is this enough info to determine the version of DWS?
>
> # which dwstat
> /opt/cray/dws/1.0-1.0502.66278.43.10.ari/bin/dwstat
>
> The system OS is 5.2UP04.

That should do. I'll try to get an answer from Cray.

> This brings up a question. We are planning to bring up 6.0UP00+patches next
> week on our test system (gerty) and 6.0UP01-pre-release on Cori by the end
> of May. Will Slurm support these versions and is there a specific version
> of Slurm that is required?

Newer versions of DW software should just allow us to remove some Slurm kludges around various DW bugs/features without introducing any problems. I'll see if Cray can install matching software on one of their test systems that I can use, but no guarantee.

The word that I have from Cray is that the specific version of Datawarp software you have installed returns an exit code of 1 and no message when calling the "paths" function with a script that does not include "#DW" directives. Other Datawarp versions either return an exit code of zero or a response message of "job_file_valid True", both of which are handled by Slurm. I've wrapped the Slurm logic calling the "paths" function to disable it if the script only creates or destroys burst buffers, in the following commit (it looks far more complex than it really is: the indentation that I added resulted in many changed lines, when all that changed is some white space). Here is the commit:
https://github.com/SchedMD/slurm/commit/1391d29ad6970be118100ea4419ee069d9a343f9
Please let me know if this does not fix the problem for you. The change will be in Slurm version 15.08.11 when released, likely later in April.
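In shell terms, the guard described above amounts to something like the following sketch. The actual change is C code in the burst_buffer/cray plugin (see the commit); SCRIPT, JOBID, and PATHFILE are hypothetical placeholders:

# Sketch only: skip the DW "paths" call when the script has no #DW directives,
# i.e. it only creates/destroys persistent buffers via #BB lines. With no
# "paths" call, the broken exit-1-with-no-message response from this DWS
# version is never consulted, so the job is not held.
if grep -q '^#DW' "$SCRIPT"; then
    dw_wlm_cli --function paths --job "$SCRIPT" --token "$JOBID" \
               --pathfile "$PATHFILE"
fi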