Ticket 3998 - Possible deadlock in slurmd in commit 52ce3ff0ec5e25bf9704b0ad5f6f59b3b56c63f5
Summary: Possible deadlock in slurmd in commit 52ce3ff0ec5e25bf9704b0ad5f6f59b3b56c63f5
Status: RESOLVED DUPLICATE of ticket 3977
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 17.02.6
Hardware: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-07-17 02:27 MDT by Thomas Opfer
Modified: 2017-07-20 03:55 MDT

See Also:
Site: -Other-
Machine Name: Lichtenberg High Performance Computer


Description Thomas Opfer 2017-07-17 02:27:27 MDT
I know we have no support contract, but I want to make you aware of a possible deadlock introduced in commit 52ce3ff0ec5e25bf9704b0ad5f6f59b3b56c63f5.

When we submit a job array of more than a few jobs (e.g. 16) to a single node so that they start at the same time, only one or two of them actually start; the others hang.

I was able to avoid this problem by replacing

slurm_mutex_lock(&dummy_lock);
slurm_cond_wait(&conf->prolog_running_cond, &dummy_lock);
slurm_mutex_unlock(&dummy_lock);

with

sleep(1);

I know this is no real solution, but it works until the bug is fixed properly. My guess is that conf->prolog_running_cond is signaled while _prolog_is_running(job_id) has not yet returned, so the wakeup is lost. I think slurm_cond_wait() should not be used with a dummy mutex, but with the real mutex that protects the state being waited on.
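To illustrate the point: a minimal sketch of the standard condition-variable pattern, not actual slurmd code. The waiter must test its predicate under the same mutex the signaler holds, and re-test in a loop, so a wakeup that fires before the wait begins (or a spurious wakeup) is never lost. The names `prolog_running`, `prolog_thread`, and `wait_for_prolog` are illustrative stand-ins, not real slurmd symbols.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int prolog_running = 1;  /* stand-in for the state _prolog_is_running() checks */

/* Simulates the prolog finishing on another thread. */
static void *prolog_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    prolog_running = 0;            /* change the state under the lock ... */
    pthread_cond_broadcast(&cond); /* ... then wake every waiter          */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Block until the prolog finishes; returns 0 on success. */
static int wait_for_prolog(void)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, prolog_thread, NULL) != 0)
        return -1;

    pthread_mutex_lock(&lock);
    while (prolog_running)                /* re-test in a loop: guards against */
        pthread_cond_wait(&cond, &lock);  /* spurious and early wakeups        */
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return 0;
}
```

With a dummy mutex, nothing orders the signal against the start of the wait: if the prolog finishes between the running check and slurm_cond_wait(), the broadcast is missed and the waiter sleeps forever, which matches the hang described above.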

Best regards,
Thomas
Comment 1 Dominik Bartkiewicz 2017-07-20 03:55:05 MDT
Hi

Thanks for your report.
As you can see, we solved this in commit:
https://github.com/SchedMD/slurm/commit/b40bd8d35ef851

dominik

*** This ticket has been marked as a duplicate of ticket 3977 ***