Ticket 14521

Summary: sbatch: error: Batch job submission failed: Burst Buffer permission denied
Product: Slurm    Reporter: karan singh <info.kng>
Component: Burst Buffers    Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID    QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 21.08.8   
Hardware: Linux   
OS: Linux   
Site: -Other-

Description karan singh 2022-07-13 00:11:54 MDT
How can I enable the root user to create a burst buffer?
When I switch to the slurm user, the job gets submitted, but the slurm controller goes down.

[Enabled datawarp plugin in slurm.conf]

BurstBufferType=burst_buffer/datawarp
DebugFlags=BurstBuffer

[burst_buffer.conf file details below]

# Excerpt of burst_buffer.conf file for datawarp plugin

Flags=EnablePersistent,PrivateData
GetSysState=/opt/deepops/build/slurm/src/plugins/burst_buffer/datawarp/dw_wlm_cli
GetSysStatus=/opt/deepops/build/slurm/src/plugins/burst_buffer/datawarp/dwstat
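
If the intent is to control which accounts may create burst buffers, burst_buffer.conf also accepts user filters. The lines below are only an illustrative sketch (the user names are placeholders, not taken from this site's configuration), and AllowUsers and DenyUsers should not be combined:

# Hypothetical additions to burst_buffer.conf
# AllowUsers is a colon-delimited list of user names and/or UIDs permitted to use burst buffers
#AllowUsers=slurm:karan
# Alternatively, DenyUsers blacklists specific accounts
#DenyUsers=root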

[scratch.sh] --> batch script used to create the burst buffer

#!/bin/bash

#### How many nodes?
#SBATCH -N 1

#### How long to run the job?
#SBATCH -t 00:01:00

#### Name the job
#SBATCH -J "job_scratch"

#### Set the output file name
#SBATCH -o "job_scratch.log"

#### Request a 200GB scratch allocation, striped over BB nodes, in the default pool (which has 82GiB granularity, so this gives you grains on 3 BB nodes)
#DW jobdw capacity=100GB access_mode=striped type=scratch pool=wlm_pool

(env) [root@hpc-master IntroToBB]# sbatch -w hpc-master scratch.sh

sbatch: error: Batch job submission failed: Burst Buffer permission denied
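
For reference, the controller's view of the burst buffer configuration and pools can be inspected with scontrol; these are generic commands, not output captured from this system:

(env) [root@hpc-master IntroToBB]# scontrol show burstbuffer
(env) [root@hpc-master IntroToBB]# scontrol show dwstat    # datawarp status, if dwstat is reachable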
Comment 1 karan singh 2022-07-15 02:14:26 MDT
The issue appears to be a munge authentication failure when using the burst buffer; without the burst buffer I was able to run jobs between the nodes.

(env) [root@hpc-master burst-buffer]# srun --bbf=bb-ior.spec --uid=slurm --gid=slurm -N 1 -w hpc-master ior.out
srun: job 86 queued and waiting for resources
srun: error: Munge decode failed: Unauthorized credential for client UID=0 GID=0
srun: auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
srun: auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
srun: error: slurm_unpack_received_msg: auth_g_verify: RESPONSE_RESOURCE_ALLOCATION has authentication error: Unspecified error
srun: error: slurm_unpack_received_msg: Protocol authentication error
srun: error: _accept_msg_connection[192.168.61.31:45016]: Unspecified error
srun: error: Munge decode failed: Unauthorized credential for client UID=0 GID=0
srun: auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
srun: auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
srun: error: slurm_unpack_received_msg: auth_g_verify: SRUN_JOB_COMPLETE has authentication error: Unspecified error
srun: error: slurm_unpack_received_msg: Protocol authentication error
srun: error: eio_message_socket_accept: slurm_receive_msg[192.168.61.31:52204]: Unspecified error
srun: error: Job allocation 86 has been revoked
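
The bb-ior.spec file is not attached to this report; for context, a persistent-buffer spec of the kind the slurmctld log below shows (create_persistent of mytestbuffer) would typically look something like this sketch, with the capacity only a placeholder:

#BB create_persistent name=mytestbuffer capacity=200GB access=striped type=scratch pool=wlm_pool
#DW persistentdw name=mytestbuffer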
Comment 2 karan singh 2022-07-15 03:57:29 MDT
Below is the slurmctld.log output from when the job is submitted.

[2022-07-15T09:41:01.250] burst_buffer/datawarp: bb_job_log: datawarp: JobId=88 UserID:22078 Swap:1x1 TotalSize:429496729600
[2022-07-15T09:41:01.250] burst_buffer/datawarp: bb_job_log:   Create  Name:mytestbuffer Pool:wlm_pool Size:214748364800 Access:striped Type:scratch State:pending
[2022-07-15T09:41:01.250] burst_buffer/datawarp: bb_job_log:   Use  Name:mytestbuffer
[2022-07-15T09:41:01.370] burst_buffer/datawarp: bb_p_job_validate2: job_process ran for usec=119837
[2022-07-15T09:41:01.370] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function job_process --job /var/spool/slurm/ctld/hash.8/job.88/script
[2022-07-15T09:41:01.370] sched: _slurm_rpc_allocate_resources JobId=88 NodeList=(null) usec=120197
[2022-07-15T09:41:01.389] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function create_persistent -c CLI -t mytestbuffer -u 22078 -C wlm_pool:214748364800 -a striped -T scratch
[2022-07-15T09:41:01.389] burst_buffer/datawarp: _log_script_argv: created

[2022-07-15T09:41:01.389] burst_buffer/datawarp: _create_persistent: create_persistent of mytestbuffer ran for usec=18781
[2022-07-15T09:41:01.408] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function show_sessions
[2022-07-15T09:41:01.408] burst_buffer/datawarp: _log_script_argv: { "sessions": [ ] }

[2022-07-15T09:41:01.408] error: An id is needed to add a reservation.
[2022-07-15T09:41:07.894] burst_buffer/datawarp: _start_stage_in: setup for job JobId=88 ran for usec=6020009
[2022-07-15T09:41:07.894] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function setup --token 88 --caller SLURM --user 22078 --groupid 966 --capacity wlm_pool:400GiB --job /var/spool/slurm/ctld/hash.8/job.88/script
[2022-07-15T09:41:10.915] burst_buffer/datawarp: _start_stage_in: dws_data_in for JobId=88 ran for usec=3020125
[2022-07-15T09:41:10.915] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function data_in --token 88 --job /var/spool/slurm/ctld/hash.8/job.88/script
[2022-07-15T09:41:10.934] burst_buffer/datawarp: _start_stage_in: real_size ran for usec=18965
[2022-07-15T09:41:10.934] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function real_size --token 88
[2022-07-15T09:41:11.902] burst_buffer/datawarp: bb_p_job_begin: paths ran for usec=19170
[2022-07-15T09:41:11.902] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function paths --job /var/spool/slurm/ctld/hash.8/job.88/script --token 88 --pathfile /var/spool/slurm/ctld/hash.8/job.88/path
[2022-07-15T09:41:11.902] sched: Allocate JobId=88 NodeList=hpc-master #CPUs=2 Partition=batch
[2022-07-15T09:41:11.914] error: slurm_receive_msgs: [[hpc-master]:44095] failed: Zero Bytes were transmitted or received
[2022-07-15T09:41:11.924] Killing interactive JobId=88: Communication connection failure
[2022-07-15T09:41:11.924] _job_complete: JobId=88 WEXITSTATUS 1
[2022-07-15T09:41:11.924] _job_complete: JobId=88 done
[2022-07-15T09:41:11.935] burst_buffer/datawarp: _start_pre_run: dws_pre_run for JobId=88 terminated by slurmctld
[2022-07-15T09:41:12.020] _slurm_rpc_complete_job_allocation: JobId=88 error Job/step already completing or completed
[2022-07-15T09:41:12.044] burst_buffer/datawarp: _start_teardown: teardown for JobId=88 ran for usec=119950
[2022-07-15T09:41:12.044] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function teardown --token 88 --job /var/spool/slurm/ctld/hash.8/job.88/script --hurry
[2022-07-15T09:41:14.043] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function show_instances
[2022-07-15T09:41:14.043] burst_buffer/datawarp: _log_script_argv: { "instances": [ ] }


Below is the munge connectivity test between hosts:

(env) [root@hpc-master ~]# munge -n | unmunge
STATUS:           Success (0)
ENCODE_HOST:      hpc-master (192.168.61.31)
ENCODE_TIME:      2022-07-15 09:53:38 +0000 (1657878818)
DECODE_TIME:      2022-07-15 09:53:38 +0000 (1657878818)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0

(env) [root@hpc-master ~]# munge -n | ssh mlperf1 unmunge
STATUS:           Success (0)
ENCODE_HOST:      hpc-master (192.168.61.31)
ENCODE_TIME:      2022-07-15 09:54:39 +0000 (1657878879)
DECODE_TIME:      2022-07-15 09:54:39 +0000 (1657878879)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0
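
Since the unmunge round trips above succeed, a further generic check (not output from this system) would be to confirm that the munge key and the clocks match on every node, for example:

(env) [root@hpc-master ~]# md5sum /etc/munge/munge.key
(env) [root@hpc-master ~]# ssh mlperf1 md5sum /etc/munge/munge.key
(env) [root@hpc-master ~]# date; ssh mlperf1 date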
Comment 3 karan singh 2022-07-27 02:03:43 MDT
Could anyone please help? Please let me know if more information is required; I am stuck at the moment.