How can I enable the root user to create a burst buffer (BB)? When I switch to the slurm user, the job gets submitted, but the Slurm controller goes down.

[Enabled datawarp plugin in slurm.conf]
BurstBufferType=burst_buffer/datawarp
DebugFlags=BurstBuffer

[burst_buffer.conf file details below]
# Excerpt of burst_buffer.conf file for datawarp plugin
Flags=EnablePersistent,PrivateData
GetSysState=/opt/deepops/build/slurm/src/plugins/burst_buffer/datawarp/dw_wlm_cli
GetSysStatus=/opt/deepops/build/slurm/src/plugins/burst_buffer/datawarp/dwstat

[scratch.sh] --> Slurm batch script to create the BB
#!/bin/bash
#### How many nodes?
#SBATCH -N 1
#### How long to run the job?
#SBATCH -t 00:1:00
#### Name the job
#SBATCH -J "job_scratch"
#### Set the output file name
#SBATCH -o "job_scratch.log"
#### Request a 200GB scratch allocation, striped over BB nodes, in the default pool (which has 82GiB granularity, so this gives you grains on 3 BB nodes)
#DW jobdw capacity=100GB access_mode=striped type=scratch pool=wlm_pool

(env) [root@hpc-master IntroToBB]# sbatch -w hpc-master scratch.sh
sbatch: error: Batch job submission failed: Burst Buffer permission denied
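For context on the first question: "Burst Buffer permission denied" comes from the plugin's job-validation step, before anything is staged, and root-submitted jobs appear to be rejected there regardless of configuration (which would match the error above, since no Allow/Deny lists are set). The usual route is to submit as an ordinary unprivileged account and, if needed, restrict buffer access in burst_buffer.conf. A hedged sketch, not a drop-in replacement; the user names in AllowUsers are assumptions for your site (see the burst_buffer.conf man page):

```
# burst_buffer.conf - sketch only
# EnablePersistent lets unprivileged users create/destroy persistent buffers
Flags=EnablePersistent,PrivateData
# Optional: colon-delimited list of users permitted to use burst buffers.
# "testuser1:testuser2" is a placeholder; use regular (non-root) accounts.
AllowUsers=testuser1:testuser2
GetSysState=/opt/deepops/build/slurm/src/plugins/burst_buffer/datawarp/dw_wlm_cli
GetSysStatus=/opt/deepops/build/slurm/src/plugins/burst_buffer/datawarp/dwstat
```

In other words, enabling root to create buffers is likely not possible by configuration; testing with a normal user account is the safer path.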
The issue shows a munge auth error when using the burst buffer; without the burst buffer I was able to run jobs between the nodes.

(env) [root@hpc-master burst-buffer]# srun --bbf=bb-ior.spec --uid=slurm --gid=slurm -N 1 -w hpc-master ior.out
srun: job 86 queued and waiting for resources
srun: error: Munge decode failed: Unauthorized credential for client UID=0 GID=0
srun: auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
srun: auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
srun: error: slurm_unpack_received_msg: auth_g_verify: RESPONSE_RESOURCE_ALLOCATION has authentication error: Unspecified error
srun: error: slurm_unpack_received_msg: Protocol authentication error
srun: error: _accept_msg_connection[192.168.61.31:45016]: Unspecified error
srun: error: Munge decode failed: Unauthorized credential for client UID=0 GID=0
srun: auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
srun: auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
srun: error: slurm_unpack_received_msg: auth_g_verify: SRUN_JOB_COMPLETE has authentication error: Unspecified error
srun: error: slurm_unpack_received_msg: Protocol authentication error
srun: error: eio_message_socket_accept: slurm_receive_msg[192.168.61.31:52204]: Unspecified error
srun: error: Job allocation 86 has been revoked
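One thing worth ruling out here: `srun --uid`/`--gid` only re-labels the job's ownership inside Slurm; the munge credential for the client's own RPCs is still minted by the invoking process, which is root in this session. That is consistent with "Unauthorized credential for client UID=0 GID=0". A hedged alternative is to run the whole client as the target account (user name and paths below are taken from the post; also note that submitting as the slurm system account itself is unusual, and a regular user is the safer test):

```shell
# Sketch: submit as the target user so every RPC credential is minted by
# that user, instead of re-labelling a root-owned submission with --uid.
# "slurm", the node name, and file names are taken from the post above.
submit_user="slurm"
cmd="sudo -u ${submit_user} srun -N 1 -w hpc-master --bbf=bb-ior.spec ior.out"
echo "would run: ${cmd}"   # printed rather than executed in this sketch
```

If the same munge error persists even when the client runs entirely as one unprivileged user, that points back at the daemons' auth setup rather than at `--uid`.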
Below is the slurmctld.log when the job is submitted:

[2022-07-15T09:41:01.250] burst_buffer/datawarp: bb_job_log: datawarp: JobId=88 UserID:22078 Swap:1x1 TotalSize:429496729600
[2022-07-15T09:41:01.250] burst_buffer/datawarp: bb_job_log: Create Name:mytestbuffer Pool:wlm_pool Size:214748364800 Access:striped Type:scratch State:pending
[2022-07-15T09:41:01.250] burst_buffer/datawarp: bb_job_log: Use Name:mytestbuffer
[2022-07-15T09:41:01.370] burst_buffer/datawarp: bb_p_job_validate2: job_process ran for usec=119837
[2022-07-15T09:41:01.370] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function job_process --job /var/spool/slurm/ctld/hash.8/job.88/script
[2022-07-15T09:41:01.370] sched: _slurm_rpc_allocate_resources JobId=88 NodeList=(null) usec=120197
[2022-07-15T09:41:01.389] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function create_persistent -c CLI -t mytestbuffer -u 22078 -C wlm_pool:214748364800 -a striped -T scratch
[2022-07-15T09:41:01.389] burst_buffer/datawarp: _log_script_argv: created
[2022-07-15T09:41:01.389] burst_buffer/datawarp: _create_persistent: create_persistent of mytestbuffer ran for usec=18781
[2022-07-15T09:41:01.408] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function show_sessions
[2022-07-15T09:41:01.408] burst_buffer/datawarp: _log_script_argv: { "sessions": [ ] }
[2022-07-15T09:41:01.408] error: An id is needed to add a reservation.
[2022-07-15T09:41:07.894] burst_buffer/datawarp: _start_stage_in: setup for job JobId=88 ran for usec=6020009
[2022-07-15T09:41:07.894] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function setup --token 88 --caller SLURM --user 22078 --groupid 966 --capacity wlm_pool:400GiB --job /var/spool/slurm/ctld/hash.8/job.88/script
[2022-07-15T09:41:10.915] burst_buffer/datawarp: _start_stage_in: dws_data_in for JobId=88 ran for usec=3020125
[2022-07-15T09:41:10.915] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function data_in --token 88 --job /var/spool/slurm/ctld/hash.8/job.88/script
[2022-07-15T09:41:10.934] burst_buffer/datawarp: _start_stage_in: real_size ran for usec=18965
[2022-07-15T09:41:10.934] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function real_size --token 88
[2022-07-15T09:41:11.902] burst_buffer/datawarp: bb_p_job_begin: paths ran for usec=19170
[2022-07-15T09:41:11.902] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function paths --job /var/spool/slurm/ctld/hash.8/job.88/script --token 88 --pathfile /var/spool/slurm/ctld/hash.8/job.88/path
[2022-07-15T09:41:11.902] sched: Allocate JobId=88 NodeList=hpc-master #CPUs=2 Partition=batch
[2022-07-15T09:41:11.914] error: slurm_receive_msgs: [[hpc-master]:44095] failed: Zero Bytes were transmitted or received
[2022-07-15T09:41:11.924] Killing interactive JobId=88: Communication connection failure
[2022-07-15T09:41:11.924] _job_complete: JobId=88 WEXITSTATUS 1
[2022-07-15T09:41:11.924] _job_complete: JobId=88 done
[2022-07-15T09:41:11.935] burst_buffer/datawarp: _start_pre_run: dws_pre_run for JobId=88 terminated by slurmctld
[2022-07-15T09:41:12.020] _slurm_rpc_complete_job_allocation: JobId=88 error Job/step already completing or completed
[2022-07-15T09:41:12.044] burst_buffer/datawarp: _start_teardown: teardown for JobId=88 ran for usec=119950
[2022-07-15T09:41:12.044] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function teardown --token 88 --job /var/spool/slurm/ctld/hash.8/job.88/script --hurry
[2022-07-15T09:41:14.043] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function show_instances
[2022-07-15T09:41:14.043] burst_buffer/datawarp: _log_script_argv: { "instances": [ ] }

Below is the munge connectivity between the hosts:

(env) [root@hpc-master ~]# munge -n | unmunge
STATUS: Success (0)
ENCODE_HOST: hpc-master (192.168.61.31)
ENCODE_TIME: 2022-07-15 09:53:38 +0000 (1657878818)
DECODE_TIME: 2022-07-15 09:53:38 +0000 (1657878818)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0

(env) [root@hpc-master ~]# munge -n | ssh mlperf1 unmunge
STATUS: Success (0)
ENCODE_HOST: hpc-master (192.168.61.31)
ENCODE_TIME: 2022-07-15 09:54:39 +0000 (1657878879)
DECODE_TIME: 2022-07-15 09:54:39 +0000 (1657878879)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
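The interactive `munge -n | unmunge` checks above only prove the CLI path for root between those two hosts; the failing credentials are exchanged between the daemons, so it may still be worth confirming that every node carries a byte-identical key and a synchronized clock (the TTL here is 300 s). A small sketch, run on each node, assuming the stock key path /etc/munge/munge.key:

```shell
# Sketch: print the munge key hash and the clock on this node.
# Run on every node and diff the output; the key path is the munge
# default and is an assumption about this installation.
key=/etc/munge/munge.key
if [ -r "$key" ]; then
    key_hash=$(sha256sum "$key" | awk '{print $1}')
else
    key_hash="unreadable-without-privileges"
fi
now=$(date -u +%s)
echo "key  : $key_hash"
echo "clock: $now   # epoch seconds; nodes must agree within the 300s TTL"
```

If the hashes or clocks diverge between hpc-master and the compute nodes, the daemons would produce exactly the kind of "Unauthorized credential" errors seen above even though the interactive test succeeds.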
Could anyone please help? Please let me know if more info is required; I am stuck at the moment.