Ticket 9982 - Submitted jobs sometimes get stuck before writing an output file
Summary: Submitted jobs sometimes get stuck before writing an output file
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 20.02.2
Hardware: Linux
OS: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2020-10-13 09:47 MDT by Mario Döbler
Modified: 2020-10-13 09:47 MDT

See Also:
Site: -Other-
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Mario Döbler 2020-10-13 09:47:42 MDT
Dear Slurm Team,

We encountered rather strange behavior: sometimes a job submitted with sbatch gets stuck before writing its output file. squeue shows the job in the R (running) state; scancel moves it to the CG (completing) state, where it remains forever. The logs show that the last message for such a job is that the prolog completed. We found that whenever a job gets stuck, the slurmd.service unit shows a second slurmd process. We have never encountered the problem with salloc.

Cancelling such a stuck job and then killing the "additional" slurmd process resolves the problem.
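The workaround described above can be sketched as shell commands (a minimal sketch: the job ID 12345 is a placeholder, and the PID of the extra slurmd process must be read from the ps output on the affected compute node):

```shell
# Placeholder job ID for the stuck job; substitute the real one.
# The stuck job still shows as running:
squeue -j 12345 -t R

# Cancel it; it moves to the CG state and hangs there:
scancel 12345

# On the affected compute node, list slurmd processes.
# The bracketed pattern [s]lurmd keeps grep from matching its own
# command line, so only real slurmd processes are shown.
ps -eo pid,ppid,cmd | grep '[s]lurmd'

# Kill the extra (second) slurmd process by the PID seen above,
# leaving the main daemon running:
# kill <extra-slurmd-pid>
```

This only clears the symptom; it does not explain why the second slurmd process appears.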

If you need any other information, please let us know!

Thanks a lot,
Mario