Ticket 12861 - Slurmd will leave open file descriptor to sbcast destination if job is cancelled
Summary: Slurmd will leave open file descriptor to sbcast destination if job is cancelled
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd (show other tickets)
Version: 20.11.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2021-11-16 03:11 MST by CSC sysadmins
Modified: 2022-01-10 04:15 MST (History)
2 users (show)

See Also:
Site: CSC - IT Center for Science
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 21.08.6
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description CSC sysadmins 2021-11-16 03:11:52 MST
Hi,

We have NVMe drives on some of the compute nodes where users can copy data for faster I/O. The Slurm prolog sets up an LVM partition, and each job gets a separate volume that is removed when the job ends. But sbcast is the problematic one:

sbcast /nfs/file.tar /local_nvme/path/file.tar
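For context, the transfer typically runs from inside the job script, along these lines (the partition name and options below are illustrative, not our exact setup):

```shell
#!/bin/bash
#SBATCH --job-name=stage-data      # illustrative job script, not our exact one
#SBATCH --partition=nvme           # assumed partition name
#SBATCH --time=01:00:00

# Stage the input onto the node-local NVMe volume created by the prolog,
# then work against the local copy.
sbcast /nfs/file.tar /local_nvme/path/file.tar
tar -xf /local_nvme/path/file.tar -C /local_nvme/path/
```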

If the job is cancelled before the sbcast finishes, the user's sbcast process is killed, but slurmd and the slurm_epilog scripts keep a file handle open to the destination directory, which blocks its umount.

fuser -m /run/nvme/job_823542
/run/nvme/job_823542: 30645 46407

lsof: 
slurmd    30645             root    8w      REG              253,2 5242880000                 69 /run/nvme/job_823542/data/file.tar

The only way to recover is to restart slurmd.
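Until there is a fix, an epilog-side check can at least make the leak visible before the umount fails. Below is a minimal sketch (the helper name and mount path are assumptions, not our production epilog) that scans /proc for descriptors open under the job's volume, similar in spirit to `fuser -m`:

```shell
#!/bin/sh
# find_holders: print PIDs that hold files open under the given directory,
# by scanning /proc/<pid>/fd symlinks. Hypothetical helper, not part of Slurm.
find_holders() {
    mnt=$1
    for fd in /proc/[0-9]*/fd/*; do
        # Unreadable or vanished fds just fail readlink; skip them.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            "$mnt"/*)
                pid=${fd#/proc/}
                echo "${pid%%/*}"
                ;;
        esac
    done | sort -un
}

# Epilog usage sketch: warn before the umount so the leaked slurmd fd
# (the sbcast destination) shows up in the logs instead of a silent EBUSY.
mnt="/run/nvme/job_${SLURM_JOB_ID:-0}"
holders=$(find_holders "$mnt")
if [ -n "$holders" ]; then
    echo "WARN: files still open under $mnt by PIDs: $holders" >&2
fi
```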

-Tommi
Comment 1 Dominik Bartkiewicz 2021-11-18 10:54:52 MST
Hi

I can reproduce this issue.
I will let you know when we have a fix.

Dominik
Comment 5 Dominik Bartkiewicz 2022-01-10 04:15:47 MST
Hi

First, sorry this took so long.
This commit should fix the issue; it will be included in 21.08.6 and later:
https://github.com/SchedMD/slurm/commit/0c385b3c835

I'll go ahead and close this out. Feel free to comment or reopen if needed.

Dominik