Created attachment 2353: check for non null pointer for ctx

On very busy clusters, it can happen that srun is still being launched after the user has already pressed Ctrl+C. In that case, srun prints two errors and then segfaults. I think we can keep the two errors.

$ srun hostname
^Csrun: Job allocation 51 has been revoked
srun: Force Terminated job 51
srun: error: Not a valid slurm_step_ctx_t!
srun: error: Application launch failed: Invalid argument
Segmentation fault (core dumped)

I attach a patch to avoid this segfault. The function concerned is slurm_step_launch_abort.
Not sure why this wound up stuck as a "6 - No Support Contract" ticket (those are dealt with outside our normal support workflow), but I'm fixing that here now. David - can you look into this on Monday?
Commit a12fb24fda497c72f5d25ae improves the logging so that we can distinguish where in the code the error happened. David
Hi Thomas, do you have a stack trace from the core dump? David
I see how this can happen: if the signal handler runs during the job allocation, the step_context has not been set yet and is still NULL. This is probably a very rare situation, but it still has to be fixed. The code also had to be changed in some other places, as you can see by looking at the final diffs. Commits: 39fed766318 and 1da8e3f065f0b9. Good catch! Thanks. David
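
For readers following the thread, here is a minimal sketch of the kind of NULL guard discussed above. It is an illustration only: the struct layouts and the function name example_step_launch_abort are hypothetical stand-ins, not Slurm's real definitions, and this is not the content of the attached patch or of the commits listed above. It only shows the idea of returning early when the step context has not been set yet.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT Slurm's actual types. */
struct example_launch_state {
    bool abort;                       /* set when the launch should be aborted */
};

struct example_step_ctx {
    struct example_launch_state *launch_state;
};

/*
 * Hypothetical abort helper. Returns 0 if the abort flag was set,
 * or -1 if there was no usable context (for example, the signal
 * arrived before the step context was created).
 */
int example_step_launch_abort(struct example_step_ctx *ctx)
{
    /* Guard: never dereference a context that was left as NULL. */
    if (ctx == NULL || ctx->launch_state == NULL)
        return -1;

    ctx->launch_state->abort = true;
    return 0;
}
```

With a guard like this, a caller that reaches the abort path with a NULL context gets an error code back instead of crashing, so the two error messages can still be reported.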