Hi,

We have NVMe drives on part of our compute nodes, where users can copy data for faster I/O. A Slurm prolog sets up an LVM partition, and each job gets a separate volume that is removed when the job ends.

However, sbcast is problematic:

sbcast /nfs/file.tar /local_nvme/path/file.tar

If the job is cancelled before the sbcast finishes, the user's sbcast process is killed, but slurmd and the Slurm epilog scripts keep an open file handle to the destination directory, which blocks its umount:

fuser -m /run/nvme/job_823542
/run/nvme/job_823542: 30645 46407

lsof:
slurmd 30645 root 8w REG 253,2 5242880000 69 /run/nvme/job_823542/data/file.tar

The only way to recover is a slurmd restart.

-Tommi
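As an interim mitigation until the bug itself is fixed, an epilog could retry the umount and fall back to a lazy (detached) unmount so the volume is at least released once the stray file handle closes. This is only a sketch; the mount path, retry count, and helper name are illustrative, not part of any Slurm API:

```shell
#!/bin/sh
# Hypothetical epilog helper (illustrative): try to unmount the per-job
# NVMe volume, and if a process (e.g. a slurmd with a leaked sbcast fd)
# still holds it open, fall back to a lazy unmount with "umount -l".

MNT="/run/nvme/job_${SLURM_JOB_ID:-test}"   # example path, adjust to your layout

retry_umount() {
    mnt="$1"; tries="${2:-5}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Plain umount succeeds once nothing holds the mount open.
        if umount "$mnt" 2>/dev/null; then
            return 0
        fi
        # Log who still holds the mount, as in the fuser output above.
        fuser -m "$mnt" 2>/dev/null || true
        i=$((i + 1))
        sleep 1
    done
    # Last resort: detach the mount now; the kernel frees it when the
    # last open file handle is closed. Note this does not release the
    # underlying LVM volume until that happens.
    umount -l "$mnt"
}
```

A lazy unmount hides the symptom rather than fixing it: the LVM volume cannot be removed until the stale handle is actually closed, so a slurmd restart may still be needed to fully reclaim the space.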
Hi,

I can reproduce this issue. I will let you know when we have a fix.

Dominik
Hi,

First, sorry this took so long. This commit should fix the issue, and it will be included in 21.08.6 and above:

https://github.com/SchedMD/slurm/commit/0c385b3c835

I'll go ahead and close this out. Feel free to comment or reopen if needed.

Dominik