Ticket 9982

Summary: Submitted jobs sometimes get stuck before writing an output file
Product: Slurm
Reporter: Mario Döbler <mario.doebler>
Component: slurmd
Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID
QA Contact:
Severity: 6 - No support contract
Priority: ---
Version: 20.02.2
Hardware: Linux
OS: Linux
Site: -Other-

Description Mario Döbler 2020-10-13 09:47:42 MDT
Dear Slurm Team,

We have encountered quite strange behavior: sometimes a job submitted with sbatch gets stuck before writing its output file. squeue shows the job in the R state, and scancel only moves it to the CG state, where it remains forever. The logs show that the last message for such a job is that the prolog completed. We found that whenever a job gets stuck, slurmd.service shows a second slurmd process. We have never encountered the problem with salloc.

Cancelling such a stuck job and then killing the "additional" slurmd process resolves the problem.
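For reference, this is roughly how we check a node for the duplicate slurmd process. This is just a sketch of our check, assuming a single slurmd daemon per node is the healthy state on our setup:

```shell
#!/bin/sh
# Count running processes whose exact name is "slurmd". On a healthy
# node we expect exactly one. Note that slurmstepd is a different
# binary and is deliberately not matched here.
count=$(pgrep -c -x slurmd || true)
echo "slurmd processes: $count"

# More than one slurmd matches the stuck-job symptom we observed.
if [ "$count" -gt 1 ]; then
    echo "WARNING: extra slurmd process detected - a job may be stuck"
fi
```

Running this while a job is stuck shows a count of 2 on the affected node; after killing the extra process, the count returns to 1.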

If you need any other information, please let us know!

Thanks a lot,
Mario