| Summary: | sbatch: error: Batch job submission failed: Burst Buffer permission denied | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | karan singh <info.kng> |
| Component: | Burst Buffers | Assignee: | Jacob Jenson <jacob> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 6 - No support contract | | |
| Priority: | --- | | |
| Version: | 21.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | -Other- | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
karan singh
2022-07-13 00:11:54 MDT
The issue shows munge authentication errors when using a burst buffer; without the burst buffer I was able to run jobs between the nodes. (An illustrative sketch of the kind of spec file involved is included after the slurmctld log below.)

(env) [root@hpc-master burst-buffer]# srun --bbf=bb-ior.spec --uid=slurm --gid=slurm -N 1 -w hpc-master ior.out
srun: job 86 queued and waiting for resources
srun: error: Munge decode failed: Unauthorized credential for client UID=0 GID=0
srun: auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
srun: auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
srun: error: slurm_unpack_received_msg: auth_g_verify: RESPONSE_RESOURCE_ALLOCATION has authentication error: Unspecified error
srun: error: slurm_unpack_received_msg: Protocol authentication error
srun: error: _accept_msg_connection[192.168.61.31:45016]: Unspecified error
srun: error: Munge decode failed: Unauthorized credential for client UID=0 GID=0
srun: auth/munge: _print_cred: ENCODED: Thu Jan 01 00:00:00 1970
srun: auth/munge: _print_cred: DECODED: Thu Jan 01 00:00:00 1970
srun: error: slurm_unpack_received_msg: auth_g_verify: SRUN_JOB_COMPLETE has authentication error: Unspecified error
srun: error: slurm_unpack_received_msg: Protocol authentication error
srun: error: eio_message_socket_accept: slurm_receive_msg[192.168.61.31:52204]: Unspecified error
srun: error: Job allocation 86 has been revoked

Below is the slurmctld.log when the job is submitted:
[2022-07-15T09:41:01.250] burst_buffer/datawarp: bb_job_log: datawarp: JobId=88 UserID:22078 Swap:1x1 TotalSize:429496729600
[2022-07-15T09:41:01.250] burst_buffer/datawarp: bb_job_log: Create Name:mytestbuffer Pool:wlm_pool Size:214748364800 Access:striped Type:scratch State:pending
[2022-07-15T09:41:01.250] burst_buffer/datawarp: bb_job_log: Use Name:mytestbuffer
[2022-07-15T09:41:01.370] burst_buffer/datawarp: bb_p_job_validate2: job_process ran for usec=119837
[2022-07-15T09:41:01.370] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function job_process --job /var/spool/slurm/ctld/hash.8/job.88/script
[2022-07-15T09:41:01.370] sched: _slurm_rpc_allocate_resources JobId=88 NodeList=(null) usec=120197
[2022-07-15T09:41:01.389] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function create_persistent -c CLI -t mytestbuffer -u 22078 -C wlm_pool:214748364800 -a striped -T scratch
[2022-07-15T09:41:01.389] burst_buffer/datawarp: _log_script_argv: created
[2022-07-15T09:41:01.389] burst_buffer/datawarp: _create_persistent: create_persistent of mytestbuffer ran for usec=18781
[2022-07-15T09:41:01.408] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function show_sessions
[2022-07-15T09:41:01.408] burst_buffer/datawarp: _log_script_argv: { "sessions": [ ] }
[2022-07-15T09:41:01.408] error: An id is needed to add a reservation.
[2022-07-15T09:41:07.894] burst_buffer/datawarp: _start_stage_in: setup for job JobId=88 ran for usec=6020009
[2022-07-15T09:41:07.894] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function setup --token 88 --caller SLURM --user 22078 --groupid 966 --capacity wlm_pool:400GiB --job /var/spool/slurm/ctld/hash.8/job.88/script
[2022-07-15T09:41:10.915] burst_buffer/datawarp: _start_stage_in: dws_data_in for JobId=88 ran for usec=3020125
[2022-07-15T09:41:10.915] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function data_in --token 88 --job /var/spool/slurm/ctld/hash.8/job.88/script
[2022-07-15T09:41:10.934] burst_buffer/datawarp: _start_stage_in: real_size ran for usec=18965
[2022-07-15T09:41:10.934] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function real_size --token 88
[2022-07-15T09:41:11.902] burst_buffer/datawarp: bb_p_job_begin: paths ran for usec=19170
[2022-07-15T09:41:11.902] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function paths --job /var/spool/slurm/ctld/hash.8/job.88/script --token 88 --pathfile /var/spool/slurm/ctld/hash.8/job.88/path
[2022-07-15T09:41:11.902] sched: Allocate JobId=88 NodeList=hpc-master #CPUs=2 Partition=batch
[2022-07-15T09:41:11.914] error: slurm_receive_msgs: [[hpc-master]:44095] failed: Zero Bytes were transmitted or received
[2022-07-15T09:41:11.924] Killing interactive JobId=88: Communication connection failure
[2022-07-15T09:41:11.924] _job_complete: JobId=88 WEXITSTATUS 1
[2022-07-15T09:41:11.924] _job_complete: JobId=88 done
[2022-07-15T09:41:11.935] burst_buffer/datawarp: _start_pre_run: dws_pre_run for JobId=88 terminated by slurmctld
[2022-07-15T09:41:12.020] _slurm_rpc_complete_job_allocation: JobId=88 error Job/step already completing or completed
[2022-07-15T09:41:12.044] burst_buffer/datawarp: _start_teardown: teardown for JobId=88 ran for usec=119950
[2022-07-15T09:41:12.044] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function teardown --token 88 --job /var/spool/slurm/ctld/hash.8/job.88/script --hurry
[2022-07-15T09:41:14.043] burst_buffer/datawarp: _log_script_argv: dw_wlm_cli --function show_instances
[2022-07-15T09:41:14.043] burst_buffer/datawarp: _log_script_argv: { "instances": [ ] }
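For reference, a burst_buffer/datawarp spec file of the shape implied by the bb_job_log lines above would look roughly like the following. This is only an illustrative reconstruction (the buffer name, access mode, type, and capacity are taken from the log; the swap line is a guess from Swap:1x1, and the actual bb-ior.spec may differ):

#BB create_persistent name=mytestbuffer capacity=200GB access=striped type=scratch
#DW persistentdw name=mytestbuffer
#DW swap 1GB

The spec file is then passed to the job via --bbf, as in the srun command at the top of this report.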
Below is the munge connectivity between hosts:
(env) [root@hpc-master ~]# munge -n | unmunge
STATUS: Success (0)
ENCODE_HOST: hpc-master (192.168.61.31)
ENCODE_TIME: 2022-07-15 09:53:38 +0000 (1657878818)
DECODE_TIME: 2022-07-15 09:53:38 +0000 (1657878818)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
(env) [root@hpc-master ~]# munge -n | ssh mlperf1 unmunge
STATUS: Success (0)
ENCODE_HOST: hpc-master (192.168.61.31)
ENCODE_TIME: 2022-07-15 09:54:39 +0000 (1657878879)
DECODE_TIME: 2022-07-15 09:54:39 +0000 (1657878879)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
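One additional check that could be run to rule out a key or ID mismatch between the hosts (not yet done here; the default key path /etc/munge/munge.key and the usual munge/slurm service accounts are assumed):

# compare munge key checksums across hosts (default key path assumed)
md5sum /etc/munge/munge.key
ssh mlperf1 md5sum /etc/munge/munge.key
# confirm the munge and slurm UIDs/GIDs match on both hosts
id munge; ssh mlperf1 id munge
id slurm; ssh mlperf1 id slurm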
Can anyone please help? Please let me know if more information is required. I am stuck at this point.