Ticket 3998 - Possible deadlock in slurmd in commit 52ce3ff0ec5e25bf9704b0ad5f6f59b3b56c63f5
Summary: Possible deadlock in slurmd in commit 52ce3ff0ec5e25bf9704b0ad5f6f59b3b56c63f5
Status: RESOLVED DUPLICATE of ticket 3977
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.02.6
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-07-17 02:27 MDT by Thomas Opfer
Modified: 2017-07-20 03:55 MDT

See Also:
Site: -Other-
Machine Name: Lichtenberg High Performance Computer


Description Thomas Opfer 2017-07-17 02:27:27 MDT
I know we have no support contract, but I want to make you aware of a possible deadlock introduced in commit 52ce3ff0ec5e25bf9704b0ad5f6f59b3b56c63f5.

When we submit a job array of more than a few jobs (e.g. 16) to a single node so that they start at the same time, only one or two of them actually start; the others hang.

I was able to avoid this problem by replacing

slurm_mutex_lock(&dummy_lock);
slurm_cond_wait(&conf->prolog_running_cond, &dummy_lock);
slurm_mutex_unlock(&dummy_lock);

with

sleep(1);

I know this is no real solution, but it works until the bug is fixed properly. My guess is that conf->prolog_running_cond is signaled while _prolog_is_running(job_id) has not yet returned, so the wakeup is lost. I think slurm_cond_wait() should not be used with a dummy mutex, but with the real mutex that protects the state being waited on.
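To illustrate the point: a minimal sketch of the standard condition-variable pattern, not actual slurmd code. The waiter must test its predicate under the same mutex the signaler holds, and re-test in a loop, so a wakeup that fires before the wait begins (or a spurious wakeup) is never lost. The names `prolog_running`, `prolog_thread`, and `wait_for_prolog` are illustrative stand-ins, not real slurmd symbols.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int prolog_running = 1;  /* stand-in for the state _prolog_is_running() checks */

/* Simulates the prolog finishing on another thread. */
static void *prolog_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    prolog_running = 0;            /* change the state under the lock ... */
    pthread_cond_broadcast(&cond); /* ... then wake every waiter          */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Block until the prolog finishes; returns 0 on success. */
static int wait_for_prolog(void)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, prolog_thread, NULL) != 0)
        return -1;

    pthread_mutex_lock(&lock);
    while (prolog_running)                /* re-test in a loop: guards against */
        pthread_cond_wait(&cond, &lock);  /* spurious and early wakeups        */
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return 0;
}
```

With a dummy mutex, nothing orders the signal against the start of the wait: if the prolog finishes between the running check and slurm_cond_wait(), the broadcast is missed and the waiter sleeps forever, which matches the hang described above.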

Best regards,
Thomas
Comment 1 Dominik Bartkiewicz 2017-07-20 03:55:05 MDT
Hi

Thanks for your report.
As you can see, we solved this in commit:
https://github.com/SchedMD/slurm/commit/b40bd8d35ef851

dominik

*** This ticket has been marked as a duplicate of ticket 3977 ***