Ticket 12861 - Slurmd will leave open file descriptor to sbcast destination if job is cancelled
Summary: Slurmd will leave open file descriptor to sbcast destination if job is cancelled
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd (show other tickets)
Version: 20.11.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-11-16 03:11 MST by CSC sysadmins
Modified: 2022-01-10 04:15 MST (History)
2 users (show)

See Also:
Site: CSC - IT Center for Science
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 21.08.6
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description CSC sysadmins 2021-11-16 03:11:52 MST
Hi,

We have NVMe drives on some of the compute nodes where users can copy data for faster I/O. The Slurm prolog sets up an LVM partition, and each job gets a separate volume that is removed when the job ends. But sbcast is the problematic one:

sbcast /nfs/file.tar /local_nvme/path/file.tar
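For context, the transfer typically runs from inside the job script, along these lines (the partition name and options below are illustrative, not our exact setup):

```shell
#!/bin/bash
#SBATCH --job-name=stage-data      # illustrative job script, not our exact one
#SBATCH --partition=nvme           # assumed partition name
#SBATCH --time=01:00:00

# Stage the input onto the node-local NVMe volume created by the prolog,
# then work against the local copy.
sbcast /nfs/file.tar /local_nvme/path/file.tar
tar -xf /local_nvme/path/file.tar -C /local_nvme/path/
```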

If the job is cancelled before the sbcast finishes, the user's sbcast process is killed, but slurmd and the slurm_epilog scripts keep a file handle open to the destination directory, which blocks its umount.

fuser -m /run/nvme/job_823542
/run/nvme/job_823542: 30645 46407

lsof: 
slurmd    30645             root    8w      REG              253,2 5242880000                 69 /run/nvme/job_823542/data/file.tar

The only way to recover is to restart slurmd.
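Until there is a fix, an epilog-side check can at least make the leak visible before the umount fails. Below is a minimal sketch (the helper name and mount path are assumptions, not our production epilog) that scans /proc for descriptors open under the job's volume, similar in spirit to `fuser -m`:

```shell
#!/bin/sh
# find_holders: print PIDs that hold files open under the given directory,
# by scanning /proc/<pid>/fd symlinks. Hypothetical helper, not part of Slurm.
find_holders() {
    mnt=$1
    for fd in /proc/[0-9]*/fd/*; do
        # Unreadable or vanished fds just fail readlink; skip them.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            "$mnt"/*)
                pid=${fd#/proc/}
                echo "${pid%%/*}"
                ;;
        esac
    done | sort -un
}

# Epilog usage sketch: warn before the umount so the leaked slurmd fd
# (the sbcast destination) shows up in the logs instead of a silent EBUSY.
mnt="/run/nvme/job_${SLURM_JOB_ID:-0}"
holders=$(find_holders "$mnt")
if [ -n "$holders" ]; then
    echo "WARN: files still open under $mnt by PIDs: $holders" >&2
fi
```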

-Tommi
Comment 1 Dominik Bartkiewicz 2021-11-18 10:54:52 MST
Hi

I can reproduce this issue.
I will let you know when we have a fix.

Dominik
Comment 5 Dominik Bartkiewicz 2022-01-10 04:15:47 MST
Hi

First, sorry this took so long.
This commit should fix the issue; it will be included in 21.08.6 and later:
https://github.com/SchedMD/slurm/commit/0c385b3c835

I'll go ahead and close this out. Feel free to comment or reopen if needed.

Dominik