Ticket 9447 - sbcast fails with unspecified error
Summary: sbcast fails with unspecified error
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 20.02.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-07-22 03:17 MDT by Ahmed Essam ElMazaty
Modified: 2020-08-13 09:08 MDT (History)
1 user

See Also:
Site: KAUST
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Ahmed Essam ElMazaty 2020-07-22 03:17:03 MDT
Hello,
Some of our users mentioned that they can't use the "sbcast" command to copy their files to compute nodes.
It always fails with "unspecified error".
I ran a test job and got the same error:

#!/bin/bash

#SBATCH --time=12:00:00
#SBATCH -N 2
sbcast -v /ibex/scratch/mazatyae/new_job.out /tmp/mazatyae



The output:
sbcast: -----------------------------
sbcast: block_size = 8388608
sbcast: compress   = 0
sbcast: force      = false
sbcast: fanout     = 0
sbcast: jobid      = 11279665
sbcast: preserve   = false
sbcast: timeout    = 0
sbcast: verbose    = 1
sbcast: source     = /ibex/scratch/mazatyae/new_job.out
sbcast: dest       = /tmp/mazatyae
sbcast: -----------------------------
sbcast: modes    = 100644
sbcast: uid      = 167627
sbcast: gid      = 1167627
sbcast: atime    = Thu Sep 19 13:06:49 2019
sbcast: mtime    = Thu Sep 19 13:06:49 2019
sbcast: ctime    = Thu May 07 21:36:36 2020
sbcast: size     = 23
sbcast: jobid      = 11279665
sbcast: node_cnt   = 1
sbcast: node_list  = cn509-29-r
sbcast: Sbcast_cred: Jobid   11279665
sbcast: Sbcast_cred: Nodes   cn509-29-r
sbcast: Sbcast_cred: ctime   Wed Jul 22 12:11:28 2020
sbcast: Sbcast_cred: Expire  Thu Jul 23 00:11:28 2020
sbcast: error: REQUEST_FILE_BCAST(cn509-29-r): Unspecified error


I've tried this job on multiple nodes, and all of them failed with the same error.
Can you please help me with this?
Thanks,
Ahmed
Comment 1 Marcin Stolarek 2020-07-22 03:20:13 MDT
Ahmed,

Did you check appropriate slurmd logs?

Is /tmp a shared filesystem?
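
For example, something like the following on the failing node (the log path below is only an assumption; the actual location is set by SlurmdLogFile in slurm.conf):

```shell
# On the failing node: find where slurmd logs, then search that log
# around the time of the failed sbcast. The path below is a guess.
scontrol show config | grep -i SlurmdLogFile
grep -iE 'bcast|sbcast' /var/log/slurmd.log | tail -n 20
```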

cheers,
Marcin
Comment 2 Ahmed Essam ElMazaty 2020-07-22 06:02:11 MDT
(In reply to Marcin Stolarek from comment #1)
> Ahmed,
> 
> Did you check appropriate slurmd logs?
> 
> Is /tmp a shared filesystem?
> 
> cheers,
> Marcin

Dear Marcin,
I think I can explain the behaviour better now.
We use the private-tmpdir SPANK plugstack plugin to bind-mount "/tmp" and "/local/scratch" (from the job's perspective) to directories under "/local/tmp":
# cat /etc/slurm/plugstack.conf.d/private-tmpdir.conf 
required  private-tmpdir.so  base=/local/tmp/ mount=/var/tmp mount=/tmp mount=/local/scratch

sbcast generates these errors when it's used to copy to one of those bind-mounted filesystems. It works fine when the destination is on a shared filesystem.

After having a look at the logs, it seems that sbcast was trying to copy the file to the node's original "/tmp" instead of the job's "/tmp", which is bound to "/local/tmp/<job_id>/tmp", and that was generating permission errors.

I've tried touch, rsync, scp, etc. inside the job script to access the job's /tmp directory, and all of them work as expected. Only sbcast uses the original "/tmp" and "/local/scratch" directories.
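
A sketch of how that can be seen from inside a job (assuming the bind mount is active, the device:inode pair of the job's /tmp should differ from the node's real /tmp, which is what slurmd sees):

```shell
#!/bin/bash
#SBATCH -N 1
# Inside the job, /tmp is the plugin's bind mount, so its device:inode
# differs from the node's real /tmp that slurmd (and hence the sbcast
# file-bcast handler) operates in.
stat -c 'job view of /tmp -> %d:%i' /tmp
# Show the mount entry for /tmp in this process's namespace
# (may print nothing if /tmp is not a separate mount here).
grep ' /tmp ' /proc/self/mountinfo || true
```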

Hope it's clear now.
Best regards,
Ahmed
Comment 6 Marcin Stolarek 2020-08-04 03:58:03 MDT
Ahmed,

I assume you're using this SPANK plugin[1], which creates a separate mount namespace and bind-mounts specific directories to a private space in slurm_spank_job_prolog (just before the prolog script), and binds to it in slurm_spank_init_post_opt (salloc, sbatch execution).

Unfortunately, we don't call any SPANK function before opening the file while handling the sbcast RPC, so the handler is forked from slurmd in the daemon's own mount namespace.

Given the current state of the code in both Slurm and the SPANK plugin, the only easy way to make this work is to educate users to target a path where the private tmp is actually created, and possibly export a variable with that path for users' convenience.
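
A sketch of that workaround, assuming the layout described in comment #2 (the plugin's base=/local/tmp/ and the /local/tmp/<job_id>/tmp mapping you observed; the SBCAST_DEST variable name is made up here):

```shell
#!/bin/bash
#SBATCH --time=12:00:00
#SBATCH -N 2
# Workaround sketch: sbcast's handler runs in slurmd's mount namespace,
# so target the backing directory directly rather than the job's
# private /tmp. Path layout assumed from the private-tmpdir config.
SBCAST_DEST="/local/tmp/${SLURM_JOB_ID}/tmp"
sbcast -v /ibex/scratch/mazatyae/new_job.out "${SBCAST_DEST}/new_job.out"
```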

cheers,
Marcin

[1]https://github.com/hpc2n/spank-private-tmp
Comment 7 Marcin Stolarek 2020-08-04 04:00:59 MDT
Ahmed,

By accident, the previous reply didn't reach you by email.

Could you please check it (using the web interface) and let me know if you have further questions?

cheers,
Marcin
Comment 8 Marcin Stolarek 2020-08-11 03:34:01 MDT
Ahmed,

Do you have any additional questions on this case? If I don't hear back, I'll close the ticket as "information given".

cheers,
Marcin
Comment 9 Marcin Stolarek 2020-08-13 09:08:23 MDT
Ahmed,

I'm closing this as "infogiven" now.
Please reopen if you have additional questions.

cheers,
Marcin