Ticket 3998

Summary: Possible deadlock in slurmd in commit 52ce3ff0ec5e25bf9704b0ad5f6f59b3b56c63f5
Product: Slurm Reporter: Thomas Opfer <hrz>
Component: slurmd    Assignee: Jacob Jenson <jacob>
Status: RESOLVED DUPLICATE
Severity: 6 - No support contract
Priority: --- CC: bart, hrz
Version: 17.02.6   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=3977
Site: -Other-
Machine Name: Lichtenberg High Performance Computer

Description Thomas Opfer 2017-07-17 02:27:27 MDT
I know we have no support contract, but I want to make you aware that there seems to be a possible deadlock in commit 52ce3ff0ec5e25bf9704b0ad5f6f59b3b56c63f5.

When we submit a job array of more than a few (e.g. 16) jobs to one node, such that they start at the same time, only one or two of them start; the others hang.

I was able to avoid this problem by replacing

slurm_mutex_lock(&dummy_lock);
slurm_cond_wait(&conf->prolog_running_cond, &dummy_lock);
slurm_mutex_unlock(&dummy_lock);

by

sleep(1);

I know this is no real solution, but it works until the issue is properly fixed. My guess is that conf->prolog_running_cond is signaled while _prolog_is_running(job_id) has not yet finished, so the waiting thread misses the wakeup. I think slurm_cond_wait() should not be used with a dummy mutex but with a real one that protects the state being waited on.
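
For illustration, the failure described above is the classic lost-wakeup race. Below is a minimal self-contained sketch, using plain pthreads and hypothetical names rather than the actual slurmd code, of how waiting on a dummy mutex can miss the signal:

/*
 * Sketch of the lost-wakeup race: the waiter checks the predicate and
 * then blocks on the condition variable while holding only a dummy
 * mutex, so the signaler can clear the predicate and broadcast in
 * between, after which the waiter sleeps forever.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t dummy_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  prolog_running_cond = PTHREAD_COND_INITIALIZER;
static bool prolog_running = true;   /* stands in for _prolog_is_running(job_id) */

static void wait_for_prolog_broken(void)
{
    while (prolog_running) {
        /* <-- if prolog_finished() runs right here, its broadcast is
         *     sent before we block below, and the wakeup is lost */
        pthread_mutex_lock(&dummy_lock);
        pthread_cond_wait(&prolog_running_cond, &dummy_lock);
        pthread_mutex_unlock(&dummy_lock);
    }
}

static void prolog_finished(void)
{
    prolog_running = false;                        /* not under dummy_lock */
    pthread_cond_broadcast(&prolog_running_cond);  /* may wake nobody */
}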

Best regards,
Thomas
Comment 1 Dominik Bartkiewicz 2017-07-20 03:55:05 MDT
Hi

Thanks for your report.
As you can see, we solved this in the following commit:
https://github.com/SchedMD/slurm/commit/b40bd8d35ef851

dominik

*** This ticket has been marked as a duplicate of ticket 3977 ***
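
For readers without the diff at hand: the usual remedy for this class of lost wakeup is to check and change the waited-on state only while holding the same real mutex that the condition variable uses. A minimal sketch of that general pattern follows, again with plain pthreads and hypothetical names; it is not a copy of the change made in the commit linked above.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t prolog_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  prolog_running_cond = PTHREAD_COND_INITIALIZER;
static bool prolog_running = true;

static void wait_for_prolog_fixed(void)
{
    pthread_mutex_lock(&prolog_lock);
    while (prolog_running)    /* re-check the predicate after every wakeup */
        pthread_cond_wait(&prolog_running_cond, &prolog_lock);
    pthread_mutex_unlock(&prolog_lock);
}

static void prolog_finished(void)
{
    pthread_mutex_lock(&prolog_lock);
    prolog_running = false;                        /* state change under the lock, */
    pthread_cond_broadcast(&prolog_running_cond);  /* so no waiter can miss it */
    pthread_mutex_unlock(&prolog_lock);
}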