Ticket 3998

Summary: Possible deadlock in slurmd in commit 52ce3ff0ec5e25bf9704b0ad5f6f59b3b56c63f5
Product: Slurm Reporter: Thomas Opfer <hrz>
Component: slurmd    Assignee: Jacob Jenson <jacob>
Status: RESOLVED DUPLICATE
Severity: 6 - No support contract
Priority: --- CC: bart, hrz
Version: 17.02.6   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=3977
Site: -Other-
Machine Name: Lichtenberg High Performance Computer

Description Thomas Opfer 2017-07-17 02:27:27 MDT
I know we have no support contract, but I want to make you aware that there seems to be a possible deadlock in commit 52ce3ff0ec5e25bf9704b0ad5f6f59b3b56c63f5.

When we submit a job array of more than a few (e.g. 16) jobs to one node, such that they start at the same time, only one or two of them start; the others hang.

I was able to avoid this problem by replacing

slurm_mutex_lock(&dummy_lock);
slurm_cond_wait(&conf->prolog_running_cond, &dummy_lock);
slurm_mutex_unlock(&dummy_lock);

by

sleep(1);

I know this is no real solution, but it works until the issue is properly fixed. My guess is that conf->prolog_running_cond is signaled while _prolog_is_running(job_id) has not yet finished, so the waiting thread misses the wakeup. I think slurm_cond_wait() should not be used with a dummy mutex but with a real one that protects the state being waited on.
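
For illustration, the failure described above is the classic lost-wakeup race. Below is a minimal self-contained sketch, using plain pthreads and hypothetical names rather than the actual slurmd code, of how waiting on a dummy mutex can miss the signal:

/*
 * Sketch of the lost-wakeup race: the waiter checks the predicate and
 * then blocks on the condition variable while holding only a dummy
 * mutex, so the signaler can clear the predicate and broadcast in
 * between, after which the waiter sleeps forever.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t dummy_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  prolog_running_cond = PTHREAD_COND_INITIALIZER;
static bool prolog_running = true;   /* stands in for _prolog_is_running(job_id) */

static void wait_for_prolog_broken(void)
{
    while (prolog_running) {
        /* <-- if prolog_finished() runs right here, its broadcast is
         *     sent before we block below, and the wakeup is lost */
        pthread_mutex_lock(&dummy_lock);
        pthread_cond_wait(&prolog_running_cond, &dummy_lock);
        pthread_mutex_unlock(&dummy_lock);
    }
}

static void prolog_finished(void)
{
    prolog_running = false;                        /* not under dummy_lock */
    pthread_cond_broadcast(&prolog_running_cond);  /* may wake nobody */
}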

Best regards,
Thomas
Comment 1 Dominik Bartkiewicz 2017-07-20 03:55:05 MDT
Hi

Thanks for your report.
As you can see, we solved this in the following commit:
https://github.com/SchedMD/slurm/commit/b40bd8d35ef851

dominik

*** This ticket has been marked as a duplicate of ticket 3977 ***
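
For readers without the diff at hand: the usual remedy for this class of lost wakeup is to check and change the waited-on state only while holding the same real mutex that the condition variable uses. A minimal sketch of that general pattern follows, again with plain pthreads and hypothetical names; it is not a copy of the change made in the commit linked above.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t prolog_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  prolog_running_cond = PTHREAD_COND_INITIALIZER;
static bool prolog_running = true;

static void wait_for_prolog_fixed(void)
{
    pthread_mutex_lock(&prolog_lock);
    while (prolog_running)    /* re-check the predicate after every wakeup */
        pthread_cond_wait(&prolog_running_cond, &prolog_lock);
    pthread_mutex_unlock(&prolog_lock);
}

static void prolog_finished(void)
{
    pthread_mutex_lock(&prolog_lock);
    prolog_running = false;                        /* state change under the lock, */
    pthread_cond_broadcast(&prolog_running_cond);  /* so no waiter can miss it */
    pthread_mutex_unlock(&prolog_lock);
}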