2083 – srun coredump when job is stopped by Ctrl+C and started anyway

Summary: srun coredump when job is stopped by Ctrl+C and started anyway

Status:	RESOLVED FIXED

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	User Commands
Version:	15.08.2
Hardware:	Linux Linux

Severity:	4 - Minor Issue
Assignee:	Tim Wickberg
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2015-10-29 03:18 MDT by Thomas Cadeau
Modified:	2015-11-08 21:22 MST
CC List:	0 users

See Also:
Site:	Atos/Eviden Sites
Slinky Site:	---
Alineos Sites:	---
Atos/Eviden Sites:	Grenoble
Confidential Site:	---
Coreweave sites:	---
Cray Sites:	---
DS9 clusters:	---
Google sites:	---
HPCnow Sites:	---
HPE Sites:	---
IBM Sites:	---
NOAA SIte:	---
NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---
OCF Sites:	---
Recursion Pharma Sites:	---
SFW Sites:	---
SNIC sites:	---
Tzag Elita Sites:	---
Linux Distro:	---
Machine Name:
CLE Version:
Version Fixed:	15.08.4
Target Release:	---
DevPrio:	---
Emory-Cloud Sites:	---

Attachments
check for non null pointer for ctx (482 bytes, patch) 2015-10-29 03:18 MDT, Thomas Cadeau

Description Thomas Cadeau 2015-10-29 03:18:03 MDT

Created attachment 2353
check for non null pointer for ctx

On very busy clusters, an srun can still be launched even though the user has already pressed Ctrl+C to cancel it.
In this case, srun prints 2 errors and then segfaults.

I think we can keep the 2 errors.

$ srun hostname
^Csrun: Job allocation 51 has been revoked
srun: Force Terminated job 51
srun: error: Not a valid slurm_step_ctx_t!
srun: error: Application launch failed: Invalid argument
Segmentation fault (core dumped)

I have attached a patch to avoid this segfault.
The affected function is slurm_step_launch_abort.
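
The kind of guard described above can be sketched as follows. This is only an illustration, not the attached patch and not Slurm's actual code: the type `step_ctx_t`, the constant `STEP_CTX_MAGIC`, the `launch_state` layout, and the function `step_launch_abort` are all hypothetical stand-ins chosen for the example.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT Slurm's real definitions. */
#define STEP_CTX_MAGIC 0xc7a3u

struct launch_state {
    int abort;                          /* 1 once an abort has been requested */
};

typedef struct {
    unsigned magic;                     /* marks an initialized context */
    struct launch_state *launch_state;  /* per-launch bookkeeping */
} step_ctx_t;

/*
 * Request an abort of a step launch.
 * Returns 0 on success, -1 if there is no usable context to abort.
 */
int step_launch_abort(step_ctx_t *ctx)
{
    /* Check the pointer before touching any field: ctx can still be
     * NULL if the job was cancelled before a step context existed. */
    if (ctx == NULL || ctx->magic != STEP_CTX_MAGIC ||
        ctx->launch_state == NULL)
        return -1;

    ctx->launch_state->abort = 1;
    return 0;
}
```

With a guard like this, calling the abort function with a NULL context returns an error instead of dereferencing the pointer, so the two error messages could still be reported while the crash is avoided.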

Comment 1 Tim Wickberg 2015-11-06 10:45:35 MST

Not sure why this wound up stuck as a "6 - No Support Contract" ticket (those are dealt with outside our normal support workflow), but I'm fixing that here now.

David - can you look into this on Monday?

Comment 2 David Bigagli 2015-11-08 19:48:14 MST

Commit a12fb24fda497c72f5d25ae improves logging in order to distinguish
where in the code the error happened.

David

Comment 3 David Bigagli 2015-11-08 19:56:59 MST

Hi Thomas,
do you have a stack trace from the core dump?

David

Comment 4 David Bigagli 2015-11-08 21:22:26 MST

I see how this can happen: if the signal handler runs during the job allocation,
the step_context has not been set yet and is still NULL. This is probably a very
rare situation, but it still has to be fixed. The code has to be changed in some
other places as well, as you can see by looking at the final diffs.
Commits: 39fed766318 and 1da8e3f065f0b9.

Good catch! Thanks.

David
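
The race described in Comment 4 can be sketched in a few lines of C. This is a simplified illustration under stated assumptions, not the code from the commits above: the names `step_ctx`, `on_sigint`, `install_sigint_handler`, and `handle_pending_sigint` are hypothetical, and the standard `signal()` call is used here only to keep the example short.

```c
#include <signal.h>
#include <stddef.h>

/* Hypothetical step context for illustration; not Slurm's real type. */
typedef struct {
    int aborted;
} step_ctx_t;

/* Set by the handler, consumed by the main flow. */
static volatile sig_atomic_t got_sigint = 0;

/* Stays NULL until the allocation is done and the step is created. */
static step_ctx_t *step_ctx = NULL;

/* The handler only records that Ctrl+C was pressed. */
static void on_sigint(int signo)
{
    (void)signo;
    got_sigint = 1;
}

/* Returns 0 on success, -1 if the handler could not be installed. */
int install_sigint_handler(void)
{
    return signal(SIGINT, on_sigint) == SIG_ERR ? -1 : 0;
}

/*
 * Called from the main flow, never from the handler.
 * Returns 1 if a step was aborted, 0 if there was nothing to abort
 * (no signal pending, or the signal arrived before a step existed).
 */
int handle_pending_sigint(void)
{
    if (!got_sigint)
        return 0;
    got_sigint = 0;

    /* The signal can arrive during allocation, while step_ctx is
     * still NULL; dereferencing it here would crash. */
    if (step_ctx == NULL)
        return 0;

    step_ctx->aborted = 1;
    return 1;
}
```

The point of the sketch is that any code path reachable from the Ctrl+C handling has to tolerate a step context that has not been created yet, which is why a fix of this kind tends to touch more than one place.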