Ticket 2083

Summary: srun coredump when job is stopped by Ctrl+C and started anyway
Product: Slurm Reporter: Thomas Cadeau <thomas.cadeau>
Component: User Commands Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 15.08.2   
Hardware: Linux   
OS: Linux   
Site: Atos/Eviden Sites Alineos Sites: ---
Atos/Eviden Sites: Grenoble Confidential Site: ---
Version Fixed: 15.08.4 Target Release: ---
Attachments: check for non null pointer for ctx

Description Thomas Cadeau 2015-10-29 03:18:03 MDT
Created attachment 2353: check for non null pointer for ctx

On very busy clusters, it can happen that srun launches a job even though the user has already pressed Ctrl+C to cancel it.
In this case, the result is 2 errors and a segfault.

I think we can keep the 2 errors. 

$ srun hostname
^Csrun: Job allocation 51 has been revoked
srun: Force Terminated job 51
srun: error: Not a valid slurm_step_ctx_t!
srun: error: Application launch failed: Invalid argument
Segmentation fault (core dumped)

I have attached a patch that avoids this segfault.
The affected function is slurm_step_launch_abort.
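
The attached patch is not reproduced in this ticket. As a rough illustration of the kind of guard its title describes, here is a minimal sketch in C. All names (demo_step_ctx, DEMO_CTX_MAGIC, demo_step_launch_abort) are hypothetical stand-ins and are not Slurm's actual structures or functions.

```c
#include <stddef.h>

/* Hypothetical stand-in for a step context; NOT Slurm's real struct. */
#define DEMO_CTX_MAGIC 0xc7a3u

struct demo_step_ctx {
    unsigned magic;       /* set once the context is fully created */
    int abort_requested;  /* flag the launch code would poll */
};

/*
 * Record an abort request on the context.
 * Returns 0 on success, -1 if there is no valid context to abort.
 */
int demo_step_launch_abort(struct demo_step_ctx *ctx)
{
    /* Check the pointer before touching any field: this is the guard. */
    if (ctx == NULL || ctx->magic != DEMO_CTX_MAGIC) {
        return -1;
    }
    ctx->abort_requested = 1;
    return 0;
}
```

With a guard like this, calling demo_step_launch_abort(NULL) returns -1 instead of dereferencing a NULL pointer, so the caller can still report its errors without crashing.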
Comment 1 Tim Wickberg 2015-11-06 10:45:35 MST
Not sure why this wound up stuck as a "6 - No Support Contract" ticket (which are dealt with outside our normal support workflow), but I'm fixing that here now.

David - can you look into this on Monday?
Comment 2 David Bigagli 2015-11-08 19:48:14 MST
Commit a12fb24fda497c72f5d25ae improves logging in order to distinguish
where in the code the error happened.

David
Comment 3 David Bigagli 2015-11-08 19:56:59 MST
Hi Thomas,
do you have a stack trace from the core dump?

David
Comment 4 David Bigagli 2015-11-08 21:22:26 MST
I see how this can happen if the signal handler runs during the job allocation,
while the step_context is not yet set and is still NULL. This is probably a very
rare situation, yet it still has to be fixed. The code has to be changed in some
other places as well, as you can see by looking at the final diffs.
Commits: 39fed766318 and 1da8e3f065f0b9.

Good catch! Thanks.

David
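
The race described in comment 4 can be sketched as follows. This is only an illustration under stated assumptions: the names (demo_ctx, demo_current_ctx, demo_on_sigint, demo_start_step) are hypothetical, and the code is not taken from the commits listed above.

```c
#include <signal.h>
#include <stddef.h>

/* Hypothetical names throughout; this is NOT Slurm's code. */
struct demo_ctx {
    volatile sig_atomic_t aborted;
};

/* Stays NULL until the allocation has finished and a step context exists. */
static struct demo_ctx *demo_current_ctx = NULL;

/* Set by the handler so the main path can notice an early Ctrl+C. */
static volatile sig_atomic_t demo_interrupted = 0;

/*
 * Signal handler: may run before any context has been created.
 * Note: publishing the pointer is not made signal-safe in this sketch;
 * real code would typically block the signal around that assignment.
 */
void demo_on_sigint(int signo)
{
    (void)signo;
    demo_interrupted = 1;
    if (demo_current_ctx != NULL) {   /* guard against the early-signal case */
        demo_current_ctx->aborted = 1;
    }
}

/*
 * Main path: refuse to start the step if Ctrl+C already arrived.
 * Returns 0 if the step may start, -1 if it was interrupted first.
 */
int demo_start_step(struct demo_ctx *ctx)
{
    if (demo_interrupted) {
        return -1;
    }
    demo_current_ctx = ctx;
    return 0;
}
```

If demo_on_sigint runs while demo_current_ctx is still NULL, it only records the interruption, and a later demo_start_step call returns -1 rather than launching the step.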