Ticket 2624

Summary: Persistent Reservation created but JobHeldAdmin
Product: Slurm Reporter: David Paul <dpaul>
Component: Burst Buffers Assignee: Moe Jette <jette>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: dpaul, tim
Version: 15.08.10   
Hardware: Linux   
OS: Linux   
Site: NERSC
Machine Name: Cori CLE Version:
Version Fixed: 15.08.11 Target Release: ---

Description David Paul 2016-04-11 05:58:47 MDT

    
Comment 1 Tim Wickberg 2016-04-11 06:04:12 MDT
David - your initial comment came through as empty for some reason... I'm not sure if that was expected or a problem with Bugzilla.
Comment 2 David Paul 2016-04-11 06:05:47 MDT
Jobs submitted to create (or destroy) a Persistent Reservation actually create (or destroy) the PR, but the job ends up in the JobHeldAdmin state due to a return of "error: bb_g_job_begin(1833891): Burst Buffer request invalid".  I wonder if this was a result of the patch(es) for our bug# 2404.

(slurmctld.log):

[2016-04-11T11:42:17.979] cray: Job:1833891 UserID:15448 Swap:0x0 TotalSize:0
[2016-04-11T11:42:17.979]   Create  Name:bbdave Size:228606345216 Access:striped Type:scratch State:pending
[2016-04-11T11:42:17.993] _slurm_rpc_submit_batch_job JobId=1833891 usec=16318
[2016-04-11T11:42:18.994] debug:  sched: Running job scheduler
[2016-04-11T11:42:33.225] burst_buffer/cray: bb_p_job_try_stage_in
[2016-04-11T11:42:33.225] _alloc_job_bb: start job allocate JobID=1833891
[2016-04-11T11:42:33.226] debug:  cray: select_p_state_save
[2016-04-11T11:42:34.557] dw_wlm_cli --function create_persistent -c CLI -t bbdave -u 15448 -C wlm_pool:228606345216 -a striped -T scratch
[2016-04-11T11:42:34.557] session 5456 created
instance 4905 created
configuration 5182 created

[2016-04-11T11:42:34.557] create_persistent of bbdave ran for usec=1464263
[2016-04-11T11:42:35.142] debug:  _bb_get_sessions: show_sessions ran for usec=585383
[2016-04-11T11:42:35.142] dw_wlm_cli --function show_sessions
"owner": 15448, "state": {"actualized": true, "fuse_blown": false, "goal": "create", "mixed": false, "transitioning": false}, "token": "bbdave"}]}

[2016-04-11T11:43:17.799] backfill test for JobID=1833891 Prio=69124 Partition=debug
[2016-04-11T11:43:17.799] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1833891 test_only:1
[2016-04-11T11:43:17.799] burst_buffer/cray: bb_p_job_get_est_start: JobID=1833891
[2016-04-11T11:43:20.730] _alloc_job_bb: start job allocate JobID=1833891
[2016-04-11T11:43:21.273] _start_stage_in: setup for job 1833891 ran for usec=542139
[2016-04-11T11:43:21.273] dw_wlm_cli --function setup --token 1833891 --caller SLURM --user 15448 --capacity wlm_pool:0 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.1/job.1833891/script
[2016-04-11T11:43:21.876] _start_stage_in: dws_data_in for job 1833891 ran for usec=603359
[2016-04-11T11:43:21.876] dw_wlm_cli --function data_in --token 1833891 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.1/job.1833891/script
[2016-04-11T11:43:21.876] stage operations completed

[2016-04-11T11:43:33.414] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1833891 test_only:0
[2016-04-11T11:43:33.424] burst_buffer/cray: bb_p_job_begin: JobID=1833891
[2016-04-11T11:43:34.170] bb_p_job_begin: paths ran for usec=744611
[2016-04-11T11:43:34.170] dw_wlm_cli --function paths --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.1/job.1833891/script --token 1833891 --pathfile /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.1/job.1833891/path
[2016-04-11T11:43:34.170] error: bb_p_job_begin: paths for job 1833891 status:256 response:
[2016-04-11T11:43:34.170] error: bb_g_job_begin(1833891): Burst Buffer request invalid

[2016-04-11T11:43:34.815] _start_pre_run: dws_pre_run for JobID=1833891 ran for usec=644942
[2016-04-11T11:43:34.815] dw_wlm_cli --function pre_run --token 1833891 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.1/job.1833891/script --nidlistfile /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.1/job.1833891/client_nids
[2016-04-11T11:43:34.815] dwwlm_pre_run nidlist ['nid00051']

nid00837:/var/tmp/slurm # dwstat most
     pool units quantity      free      gran
test_pool bytes  5.82TiB   5.82TiB     16MiB
 wlm_pool bytes 832.5TiB 705.46TiB 212.91GiB

sess state        token creator owner             created expiration nodes
5456 CA---       bbdave     CLI 15448 2016-04-11T11:42:33      never     0
inst state sess     bytes nodes             created expiration intact        label public confs
4905 CA--- 5456 212.91GiB     1 2016-04-11T11:42:33      never   true       bbdave   true     1
conf state inst    type access_type activs
5182 CA--- 4905 scratch      stripe      0
Comment 3 David Paul 2016-04-11 06:06:44 MDT
Ah, I had hit <cr> instead of <tab>. Thanks for checking.
Comment 4 David Paul 2016-04-11 06:21:52 MDT
Hi. Reducing the severity of this bug. Doug is submitting a more serious issue on the accounting DB.
Comment 5 Tim Wickberg 2016-04-11 06:31:06 MDT
Sev 2 vs 3 isn't a huge deal. I'm thinking these two bugs are related - some changes to BB reservations may have had unintended side effects on both scheduling and accounting.

Are you running any patches on top of 15.08.9 at the moment? I haven't committed a patch related to the 2404 you reference, and don't think that bug is related here. Although I know we've been working through some other BB issues with you recently, and am guessing you have a local patch or two related to that?
Comment 6 Moe Jette 2016-04-11 07:16:34 MDT
Could I get the header of the script?
I just want to see the #DW and #BB options.
Comment 7 David Paul 2016-04-11 08:24:39 MDT
[dpaul@cori12]==> cat bb_create.sh 
#!/bin/bash

#SBATCH -p debug
#SBATCH -N 1
#SBATCH -t 00:05:00
#SBATCH -o create_bb.%j.out
#SBATCH -e create_bb.%j.err
#SBATCH -J create_bb

#BB create_persistent name=bbdave capacity=200GB access=striped type=scratch

echo "hostname is:"
hostname
Comment 8 Moe Jette 2016-04-11 08:30:57 MDT
I was able to confirm the problem and fix it here:
https://github.com/SchedMD/slurm/commit/87e1cc6760a7cf7e6f18789c11928296daa16afc

The fix will be in version 15.08.11 when released, likely late April or early May.

The essence of the problem is that Slurm calls the DW API with the user script and asks for the environment variables that need to be set for the job. If the job does not use burst buffers (but only creates or deletes persistent buffers), the DW API returns an error code, but reports that the script is fine (response is "job_file_valid True"). Slurm specifically looks for that response and, if found, treats the result as OK (i.e. ignores the exit code of 1).
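As a minimal sketch of the acceptance check described above (illustrative only, not the actual Slurm source):

```python
def paths_call_ok(exit_code: int, response: str) -> bool:
    """Treat the DW 'paths' result as valid if the call exited zero, or if
    the response explicitly says the script is fine despite a non-zero exit.
    Illustrative sketch of the logic described in this comment."""
    return exit_code == 0 or "job_file_valid True" in response
```

The failing case in this ticket (exit status 256, empty response) satisfies neither condition and falls through to the "Burst Buffer request invalid" error.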
Comment 9 David Paul 2016-04-11 11:06:00 MDT
Thanks for the quick fix!

Was this a regression?

I had tested PR back in Dec/Jan and did not encounter this issue.
Comment 10 Moe Jette 2016-04-12 01:37:33 MDT
(In reply to David Paul from comment #9)
> Thanks for the quick fix!
> 
> Was this a regression?
> 
> I had tested PR back in Dec/Jan and did not encounter this issue.

The bug in the code existed previously, but was latent until this bug fix in version 15.08.9:

 -- burst_buffer/cray: Set environment variables just before starting job rather
    than at job submission time to reflect persistent buffers created or
    modified while the job is pending.
Comment 11 David Paul 2016-04-18 11:28:01 MDT
Hi,

Doug applied the patch(es) today.  The behavior is the same for removing an existing PR. However, trying to create a new PR gets stuck at BurstBufferResources.

[dpaul@cori06]==> squeue -l --qos=burstbuffer
Mon Apr 18 17:16:51 2016
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
           1926140     debug PRcreate    dpaul  PENDING       0:00      2:00      1 (BurstBufferResources)


<slurmctld.log>
[2016-04-18T17:16:42.191] cray: Job:1926140 UserID:15448 Swap:0x0 TotalSize:0
[2016-04-18T17:16:42.191]   Use  Name:davebb3
[2016-04-18T17:16:42.191] burst_buffer/cray: bb_p_job_validate2: JobID=1926140
[2016-04-18T17:16:42.586] bb_p_job_validate2: job_process ran for usec=394512
[2016-04-18T17:16:42.586] dw_wlm_cli --function job_process --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.0/job.1926140/script
[2016-04-18T17:16:42.601] _slurm_rpc_submit_batch_job JobId=1926140 usec=411328
[2016-04-18T17:16:42.684] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1926140 test_only:0
[2016-04-18T17:16:42.684] _alloc_job_bb: start job allocate JobID=1926140
[2016-04-18T17:16:42.686] _alloc_job_bb: start job allocate JobID=1926140
[2016-04-18T17:16:47.750] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1926140 test_only:0
[2016-04-18T17:16:47.750] _alloc_job_bb: start job allocate JobID=1926140
<repeat>
[2016-04-18T17:17:24.764] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1926140 test_only:0
[2016-04-18T17:17:24.764] _alloc_job_bb: start job allocate JobID=1926140
Comment 12 Moe Jette 2016-04-19 02:47:43 MDT
(In reply to David Paul from comment #11)
> Hi,
> 
> Doug applied the patch(es) today.  The behavior is the same for removing an
> existing PR. However, trying to create a new PR gets stuck at 
> BurstBufferResources.

Just to clarify: You can remove a PR now, only create that hangs, correct?
Comment 13 David Paul 2016-04-19 03:01:25 MDT
We can remove a PR now but the job then becomes JobHeldAdmin, same as it did prior to the patch.

Now we cannot create a PR, it hangs at "BurstBufferResources". Previously it would create the PR and then become JobHeldAdmin.
Comment 14 David Paul 2016-04-19 03:09:52 MDT
BTW- We are now at 15.08.10 plus patches.
Comment 15 Moe Jette 2016-04-19 05:06:39 MDT
(In reply to David Paul from comment #11)
> Hi,
> 
> Doug applied the patch(es) today.  The behavior is the same for removing an
> existing PR. However, trying to create a new PR gets stuck at
> BurstBufferResources.
> 
> [dpaul@cori06]==> squeue -l --qos=burstbuffer
> Mon Apr 18 17:16:51 2016
>              JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI
> NODES NODELIST(REASON)
>            1926140     debug PRcreate    dpaul  PENDING       0:00      2:00
> 1 (BurstBufferResources)
> 
> 
> <slurmctld.log>
> [2016-04-18T17:16:42.191] cray: Job:1926140 UserID:15448 Swap:0x0 TotalSize:0
> [2016-04-18T17:16:42.191]   Use  Name:davebb3

I am testing with version 15.08.10, plus the patch identified in comment #8:
https://github.com/SchedMD/slurm/commit/87e1cc6760a7cf7e6f18789c11928296daa16afc

When I submit a job to create the persistent burst buffer as described in comment #7:
...
#BB create_persistent name=bbdave capacity=200GB access=striped type=scratch
...

That generates a log like this:
slurmctld: cray: Job:572 UserID:108 Swap:0x0 TotalSize:0
slurmctld:   Create  Name:bbdave Size:214748364800 Access:striped Type:scratch State:pending
and it creates the buffer as designed. Ditto for the destroy, which generates a log like this:
slurmctld: cray: Job:574 UserID:108 Swap:0x0 TotalSize:0
slurmctld:   Destroy Name:bbdave Hurry:0

But your log shows this:
[2016-04-18T17:16:42.191] cray: Job:1926140 UserID:15448 Swap:0x0 TotalSize:0
[2016-04-18T17:16:42.191]   Use  Name:davebb3

Note the "Create" vs. "Use".

Could you check that script contents?
It really looks like the script is just trying to use the buffer, which will fail if the buffer doesn't exist at the time the job tries to start.
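The Create/Use distinction can be sketched as a check on the script's directives. This is an illustrative classifier (not Slurm code), assuming the documented "#BB create_persistent"/"#BB destroy_persistent" and "#DW persistentdw" directive forms:

```python
def classify_bb_intent(script: str) -> str:
    """Mirror the Create/Use distinction seen in the slurmctld log:
    '#BB create_persistent'  -> "Create"  (makes a persistent buffer)
    '#BB destroy_persistent' -> "Destroy" (removes one)
    '#DW persistentdw'       -> "Use"     (attaches an existing buffer)
    Illustrative sketch only, not how slurmctld actually parses scripts."""
    for line in script.splitlines():
        stripped = line.strip()
        if stripped.startswith("#BB create_persistent"):
            return "Create"
        if stripped.startswith("#BB destroy_persistent"):
            return "Destroy"
        if stripped.startswith("#DW persistentdw"):
            return "Use"
    return "None"
```

A script whose only burst-buffer line is `#DW persistentdw name=davebb3` would classify as "Use", matching the log above, while the comment #7 script classifies as "Create".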
Comment 16 David Paul 2016-04-19 09:17:22 MDT
You are correct.  My apologies for the poor testing.  I was hurrying to test before leaving for the day.

However, the 'original' problem has not been fixed. Creating the PR succeeds but the job becomes JHA reporting "256 response: Burst Buffer request invalid".

Did the patch address the correct bug?

(I hope my testing is correct this time.)

[dpaul@cori06]==> cat PRcreate.sh
#!/bin/bash
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -t 00:02:00

#BB create_persistent name=dpaulbb3 capacity=200GB access=striped type=scratch

echo "DW_PERSISTENT_STRIPED_piname is: "

sleep 90

<squeue -l -u dpaul>
1942598     debug PRcreate    dpaul  PENDING       0:00      2:00      1 (JobHeldAdmin)

nid00837:/var/tmp/slurm # dwstat sessions
sess state        token creator owner             created expiration nodes
5748 CA---     dpaulbb3     CLI 15448 2016-04-19T14:52:48      never     0

[2016-04-19T14:52:42.083] job_submit.lua: looking up username for uid 15448
[2016-04-19T14:52:42.384] job_submit.lua: slurm_job_submit: user dpaul
[2016-04-19T14:52:42.384] job_submit.lua: req max_nodes: 1
[2016-04-19T14:52:42.384] job_submit.lua: req min_nodes: 1
[2016-04-19T14:52:42.384] job_submit.lua: req num_tasks: 4294967294
[2016-04-19T14:52:42.384] job_submit.lua: req ntasks_per_core: 65534
[2016-04-19T14:52:42.384] job_submit.lua: req ntasks_per_socket: 65534
[2016-04-19T14:52:42.384] job_submit.lua: req ntasks_per_node: 65534
[2016-04-19T14:52:46.368] job_submit.lua: slurm_job_submit: worked out that we need 1 nodes
[2016-04-19T14:52:46.368] job_submit.lua: processing debug partition for qos, 1 nodes
[2016-04-19T14:52:46.368] job_submit.lua: got timeLimit of 2 minutes
[2016-04-19T14:52:46.368] job_submit.lua: coreHours of 1.066667, chargefactor: 2.500000
[2016-04-19T14:52:46.368] Warning: Note very large processing time from job_submit_plugin_submit: usec=4408261 began=14:52:41.960
[2016-04-19T14:52:46.369] burst_buffer/cray: bb_p_job_validate: job_user_id:15448, submit_uid:15448
[2016-04-19T14:52:46.369] bb_p_job_validate: burst_buffer:#BB create_persistent name=dpaulbb3 capacity=200GB access=striped type=scratch
[2016-04-19T14:52:46.371] cray: Job:1942598 UserID:15448 Swap:0x0 TotalSize:0
[2016-04-19T14:52:46.371]   Create  Name:dpaulbb3 Size:228606345216 Access:striped Type:scratch State:pending
[2016-04-19T14:52:46.383] _slurm_rpc_submit_batch_job JobId=1942598 usec=14818
[2016-04-19T14:52:47.689] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1942598 test_only:0
[2016-04-19T14:52:47.689] _alloc_job_bb: start job allocate JobID=1942598
[2016-04-19T14:52:47.722] burst_buffer/cray: bb_p_job_try_stage_in
[2016-04-19T14:52:47.723] _alloc_job_bb: start job allocate JobID=1939863
[2016-04-19T14:52:47.723] _alloc_job_bb: start job allocate JobID=1942598
[2016-04-19T14:52:48.759] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1942598 test_only:0
[2016-04-19T14:52:48.759] _alloc_job_bb: start job allocate JobID=1942598
[2016-04-19T14:52:51.338] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1942598 test_only:0
[2016-04-19T14:52:51.338] _alloc_job_bb: start job allocate JobID=1942598
[2016-04-19T14:52:54.422] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1942598 test_only:0
[2016-04-19T14:52:54.799] _start_stage_in: setup for job 1942598 ran for usec=3460577
[2016-04-19T14:52:54.799] dw_wlm_cli --function setup --token 1942598 --caller SLURM --user 15448 --capacity wlm_pool:0 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1942598/script
[2016-04-19T14:52:55.195] _start_stage_in: dws_data_in for job 1942598 ran for usec=395875
[2016-04-19T14:52:55.195] dw_wlm_cli --function data_in --token 1942598 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1942598/script
[2016-04-19T14:52:55.195] stage operations completed

[2016-04-19T14:53:03.422] burst_buffer/cray: bb_p_job_test_stage_in: JobID=1942598 test_only:0
[2016-04-19T14:53:03.435] burst_buffer/cray: bb_p_job_begin: JobID=1942598
[2016-04-19T14:53:03.836] bb_p_job_begin: paths ran for usec=401082
[2016-04-19T14:53:03.836] dw_wlm_cli --function paths --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1942598/script --token 1942598 --pathfile /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1942598/path
[2016-04-19T14:53:03.836] error: bb_p_job_begin: paths for job 1942598 status:256 response:
[2016-04-19T14:53:03.836] error: bb_g_job_begin(1942598): Burst Buffer request invalid
[2016-04-19T14:53:04.296] _start_pre_run: dws_pre_run for JobID=1942598 ran for usec=459446
[2016-04-19T14:53:04.296] dw_wlm_cli --function pre_run --token 1942598 --job /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1942598/script --nidlistfile /global/syscom/cori/sc/nsg/var/cori-slurm-state/hash.8/job.1942598/client_nids
[2016-04-19T14:53:04.296] dwwlm_pre_run nidlist ['nid00029']
Comment 17 Moe Jette 2016-04-19 10:00:18 MDT
This looks like a problem with the DW API.

Slurm expects the API to either return a zero exit code or a message of "job_file_valid True", which is what I see on Cray's test cluster. The API on your system is returning an exit code of 1 and no message.

I'll take this up with Cray. What version of the Cray software are you running?
Comment 18 David Paul 2016-04-19 10:06:56 MDT
Is this enough info to determine the version of DWS?

# which dwstat
/opt/cray/dws/1.0-1.0502.66278.43.10.ari/bin/dwstat

The system OS is 5.2UP04.

This brings up a question.  We are planning to bring up 6.0UP00+patches next week on our test system (gerty) and 6.0UP01-pre-release on Cori by the end of May.  Will Slurm support these versions and is there a specific version of Slurm that is required?

Thanks.
Comment 19 Moe Jette 2016-04-19 10:11:39 MDT
(In reply to David Paul from comment #18)
> Is this enough info to determine the version of DWS?
> 
> # which dwstat
> /opt/cray/dws/1.0-1.0502.66278.43.10.ari/bin/dwstat
> 
> The system OS is 5.2UP04.

That should do. I'll try to get an answer from Cray.


> This brings up a question.  We are planning to bring up 6.0UP00+patches next
> week on our test system (gerty) and 6.0UP01-pre-release on Cori by the end
> of May.  Will Slurm support these versions and is there a specific version
> of Slurm that is required?

Newer versions of the DW software should just allow us to remove some Slurm kludges around various DW bugs/features without introducing any problems. I'll see if Cray can install matching software on one of their test systems that I can use, but no guarantee.
Comment 20 Moe Jette 2016-04-20 08:49:28 MDT
The word that I have from Cray is that the specific version of the Datawarp software you have installed returns an exit code of 1 and no message when calling the "paths" function with a script that does not include "#DW" directives.

Other Datawarp versions either return an exit code of zero or a response message of "job_file_valid True", both of which are handled by Slurm. I've wrapped the Slurm logic calling the "paths" function to disable it when the script only creates or destroys burst buffers, in the following commit (it looks far more complex than it really is: the indentation I added changed many lines, but all that changed is whitespace). Here is the commit:
https://github.com/SchedMD/slurm/commit/1391d29ad6970be118100ea4419ee069d9a343f9

Please let me know if this does not fix the problem for you. The change will be in Slurm version 15.08.11 when released, likely later in April.
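The guard described above can be sketched roughly as follows (assumed logic for illustration, not the actual commit): the "paths" call is skipped unless the script contains at least one "#DW" directive.

```python
def should_call_paths(script: str) -> bool:
    """Skip the DW 'paths' call for scripts with no '#DW' directives,
    i.e. ones that only create/destroy persistent buffers via '#BB' lines.
    Illustrative sketch of the guard, not the actual Slurm change."""
    return any(line.strip().startswith("#DW") for line in script.splitlines())
```

Under this sketch, the create-only script from comment #16 never triggers the "paths" call, so the version-specific exit-1-with-no-message behavior can no longer hold the job.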