Hello,

Some of our users mentioned that they can't use the "sbcast" command to copy their files to compute nodes. It always fails with "Unspecified error". I ran a test job and got the same error:

    #!/bin/bash
    #SBATCH --time=12:00:00
    #SBATCH -N 2
    sbcast -v /ibex/scratch/mazatyae/new_job.out /tmp/mazatyae

The output:

    sbcast: -----------------------------
    sbcast: block_size = 8388608
    sbcast: compress   = 0
    sbcast: force      = false
    sbcast: fanout     = 0
    sbcast: jobid      = 11279665
    sbcast: preserve   = false
    sbcast: timeout    = 0
    sbcast: verbose    = 1
    sbcast: source     = /ibex/scratch/mazatyae/new_job.out
    sbcast: dest       = /tmp/mazatyae
    sbcast: -----------------------------
    sbcast: modes    = 100644
    sbcast: uid      = 167627
    sbcast: gid      = 1167627
    sbcast: atime    = Thu Sep 19 13:06:49 2019
    sbcast: mtime    = Thu Sep 19 13:06:49 2019
    sbcast: ctime    = Thu May 07 21:36:36 2020
    sbcast: size     = 23
    sbcast: jobid    = 11279665
    sbcast: node_cnt = 1
    sbcast: node_list = cn509-29-r
    sbcast: Sbcast_cred: Jobid   11279665
    sbcast: Sbcast_cred: Nodes   cn509-29-r
    sbcast: Sbcast_cred: ctime   Wed Jul 22 12:11:28 2020
    sbcast: Sbcast_cred: Expire  Thu Jul 23 00:11:28 2020
    sbcast: error: REQUEST_FILE_BCAST(cn509-29-r): Unspecified error

I've tried this job on multiple nodes and all of them failed with the same error. Can you please help me with this?

Thanks,
Ahmed
Ahmed,

Did you check the appropriate slurmd logs?

Is /tmp a shared filesystem?

cheers,
Marcin
(In reply to Marcin Stolarek from comment #1)
> Ahmed,
>
> Did you check appropriate slurmd logs?
>
> Is /tmp a shared filesystem?
>
> cheers,
> Marcin

Dear Marcin,

I think I can explain the behaviour better now. We use the private-tmpdir plugstack plugin to bind mount "/tmp" and "/local/scratch", from the job's perspective, to a directory under "/local/tmp":

    # cat /etc/slurm/plugstack.conf.d/private-tmpdir.conf
    required private-tmpdir.so base=/local/tmp/ mount=/var/tmp mount=/tmp mount=/local/scratch

sbcast generates those errors when it's used to copy to one of these filesystems; it works fine with shared filesystems.

After looking into the logs, it seems that sbcast was trying to copy the file to the original "/tmp" on the node instead of the job's "/tmp", which is bound to "/local/tmp/<job_id>/tmp", and that was generating permission errors.

I've tried touch, rsync, scp, etc. inside the job script to access the job's /tmp directory, and all of them work as expected. Only sbcast uses the original "/tmp" and "/local/scratch" directories.

Hope it's clear now.

Best regards,
Ahmed
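For what it's worth, one way to see which /tmp a process actually gets is to inspect its own mount table from inside the job step. This is only a debugging sketch (the grep pattern is my own, not part of the plugin); inside a private-tmpdir job it should show the bind mount, while in the daemon's namespace it would show the node's original /tmp:

    #!/bin/sh
    # Print the mount entry backing /tmp as seen by this process.
    # Inside a job using private-tmpdir, the source should be a
    # directory under /local/tmp/<job_id>; outside the job's
    # namespace it is the node's original /tmp.
    grep -w '/tmp' /proc/self/mountinfo \
        || echo "/tmp is not a separate mount in this namespace"

Running this from the job script versus from a shell on the node makes the namespace mismatch visible directly.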
Ahmed,

I assume you're using this spank plugin [1], which creates a separate mount namespace and binds the configured directories to a private space in slurm_spank_job_prolog (just before the prolog script) and binds into it in slurm_spank_init_post_opt (salloc, sbatch execution).

Unfortunately, we don't call any spank function before opening the file while handling the sbcast RPC, so it's forked from slurmd in the mount namespace of the daemon. The only easy way to make it work, given the current state of the code in both Slurm and the spank plugin, is to educate users to use the path where the private tmp is actually created, and eventually to export a variable with that path for the users' convenience.

cheers,
Marcin

[1] https://github.com/hpc2n/spank-private-tmp
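To illustrate that workaround, a job script could point sbcast at the backing directory instead of the bind-mounted /tmp. This is only a sketch: the /local/tmp/<job_id>/tmp layout is taken from Ahmed's description in comment #2 and may differ between plugin versions, so the path construction should be verified on your site first:

    #!/bin/bash
    #SBATCH --time=12:00:00
    #SBATCH -N 2
    # Build the path where private-tmpdir creates the per-job /tmp
    # (base=/local/tmp as in plugstack.conf; layout assumed from comment #2).
    PRIVATE_TMP="/local/tmp/${SLURM_JOB_ID}/tmp"
    # sbcast writes from slurmd's mount namespace, so target the real
    # backing directory rather than the job's bind-mounted /tmp:
    sbcast -v /ibex/scratch/mazatyae/new_job.out "${PRIVATE_TMP}/new_job.out"

Exporting PRIVATE_TMP (or similar) from the prolog, as suggested above, would spare users from hard-coding the base path.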
Ahmed,

By accident, the previous reply didn't reach you by email. Could you please check it (using the web interface) and let me know if you have further questions?

cheers,
Marcin
Ahmed,

Do you have any additional questions on this case? If there's no reply, I'll close the bug report as "information given".

cheers,
Marcin
Ahmed,

I'm closing this as "infogiven" now. Please reopen if you have additional questions.

cheers,
Marcin