Ticket 571

Summary: Epilog randomly kills array jobs
Product: Slurm Reporter: Rod Schultz <Rod.Schultz>
Component: Other Assignee: David Bigagli <david>
Status: RESOLVED INVALID QA Contact:
Severity: 2 - High Impact    
Priority: --- CC: da, nancy.kritkausky, yiannis.georgiou
Version: 2.6.x   
Hardware: Linux   
OS: Linux   
Site: Coventry University (UK)
Attachments: Patch to handle array style jobid in epilog clean

Description Rod Schultz 2014-01-23 04:07:02 MST
Created attachment 598
Patch to handle array style jobid in epilog clean

The epilog clean script doesn't handle the new job_id format for array jobs.
As a result, the epilog script runs while some of the jobs in the job array are still running.
Comment 1 David Bigagli 2014-01-23 04:18:11 MST
Hi Rod,

Thanks for the diff. Could you please add the steps to reproduce the problem so we can see it?

David
Comment 2 David Bigagli 2014-01-23 07:32:14 MST
Hi,

I cannot reproduce the problem. Are they using an epilog script that has been
modified from the one in the example?

If you invoke squeue as the script does:

squeue --format=%A

the command returns the job array IDs without the underscore. These are
the values of the SLURM_JOB_ID environment variable, as documented in the
squeue man page.

If you invoke squeue without the format option, you get the job array IDs
with the underscore.
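
To make the distinction concrete, here is a minimal sketch of the kind of check an epilog could run on such a list. It is not the example script from the distribution: the function names is_array_style_id and other_jobs_listed are hypothetical, and the ID list is passed in as a plain argument standing in for the output of squeue --format=%A.

```shell
#!/bin/sh
# Illustrative sketch only; the function names are hypothetical, not Slurm code.

# Succeeds when an ID is in the underscore form (e.g. 1234_7) that squeue
# prints for job array elements when no format is given.
is_array_style_id() {
    case "$1" in
        *_*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Prints "yes" if the list in $2 contains a job other than $1, else "no".
# $1: ID of the finishing job (SLURM_JOB_ID in an epilog).
# $2: whitespace-separated IDs, standing in for `squeue --format=%A` output.
other_jobs_listed() {
    for job_id in $2; do
        if [ "$job_id" != "$1" ]; then
            echo yes
            return 0
        fi
    done
    echo no
}
```

Because the %A form has no underscore, each listed ID can be compared directly with SLURM_JOB_ID. An epilog built along these lines would presumably skip its cleanup whenever other_jobs_listed prints "yes".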

The example script appears to be correct.

David
Comment 3 Rod Schultz 2014-01-24 04:15:37 MST
David,

Thanks for looking at this.

You are right, the script is correct.

I've asked the submitter for his script and a better description of the symptoms.

Rod.
Comment 4 David Bigagli 2014-01-24 04:18:04 MST
Closing. False alarm.

David