| Summary: | Job special error states via epilogslurmctld have race conditions | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Stuart Midgley <stuartm> |
| Component: | slurmctld | Assignee: | David Bigagli <david> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | da, phils |
| Version: | 14.03.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 14.11.0pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
The idea we have is to specify a configurable parameter

    RequeueExitValues=1,2,3,4,5

then, if the job exits with one of these values, we requeue it in a special state. Would this work?

David

Yes, I expect it would work very well! We have a wrapper script that catches all internal exit codes from user programs, whatever they might be, and normalises them to a consistent set of values.

After some thought and discussion, we might have another idea. We could set MinJobAge=infinite (or something really, really large) and then have the epilogslurmctld remove successful jobs from Slurm. That way it is fail-safe: failed jobs will sit in CG state in the queue forever, and people would have to requeue them to get them to rerun once they have fixed the problem.

Oops... a value of zero does what we want... shows how well we can read :)

If we go down the road of setting MinJobAge=0, what does that do for accounting? And any other systems...

(In reply to Stuart Midgley from comment #5)
> If we go down the road of setting MinJobAge=0 what does that do for
> accounting? And any other systems...

It means the job records will never get purged from the slurmctld daemon. Performance will take a big hit and you'll exhaust memory. You could set it fairly large (say an hour), but not zero.

But our epilogslurmctld will purge it.

(In reply to Stuart Midgley from comment #7)
> But our epilogslurmctld will purge it.

You can cancel it, but I don't see how your script could purge it from slurmctld's memory.

Right, I understand. So you're saying that even if we scancel the job, it will still stay in memory. Bugger. Closing.

David

Does that mean we're not doing your proposal in comment 1?

Wrong comment... sorry.

Done. Feature commit number 60e18f3456f224.

David

Thanks.

(In reply to David Bigagli from comment #1)
> The idea we have is to specify a configurable parameter
>
> RequeueExitValues=1,2,3,4,5
>
> then if the job exits with one of these we requeue it in special state.

To double check: do we only need to account for exit codes actually returned by the job itself? Or do we also need to list every possible internal Slurm error code? (Which does not sound like a very robust option.) For example, in bug 855 we see:

> [2014-06-04T09:58:15.042] completing job 2534566 status 25600

We would never have known to add 25600 to RequeueExitValues, because that value is internal to Slurm.

Hi, yes, you only need to account for the exit codes of the job itself. The number 25600 you see is a little misleading: it is the raw return status of your job as reported by the kernel, which then has to be processed by the WIFEXITED(), WEXITSTATUS() and possibly other macros. It is worth fixing it so the log prints the exit code the user expects.

David
Afternoon. I think we are getting to the point where doing the job special error states in the epilogslurmctld is almost unworkable. There are race conditions we can't think of good ways to solve; we really need the queue to perform the operation atomically. Here is our epilogslurmctld:

```bash
#!/bin/bash
{
    export PATH=/d/sw/slurm/latest/bin:/d/sw/slurm/latest/sbin:$PATH
    JOB_EXIT=${SLURM_JOB_EXIT_CODE2%:*}
    echo "$(date) "$(for v in ${!SLURM_*} JOB_EXIT; do echo -n "${v}=${!v} "; done)
    if [[ "$SLURM_JOB_EXIT_CODE2" != "0:0" ]] || (( JOB_EXIT > 0 )); then
        # echo "$(date) "$(for v in ${!SLURM_*}; do echo -n "${v}=${!v} "; done)
        for ((i = 1; ; i++)); do
            status=0
            scontrol requeuehold state=SpecialExit ${SLURM_JOB_ID} && break
            status=$?
            sleep $(( RANDOM % (30 + i) ))
        done
        exit $status
    fi
} >> /var/log/slurm/slurmctld_epilog.log 2>&1
```

This has at least two race conditions. First, the log is controlled by logrotate, and if rotation occurs while we are in the scontrol call, the sleep, or anywhere else in the script, we lose logging information. To be honest, we can probably work around this, and it is not a "massive" production issue.

The second is more of a problem, and I think we have already seen it happen. If a user deletes the job while we are in the loop, the scontrol will continue to fail forever... and the epilog will never finish. We need the infinite loop to try to ensure that the SpecialExit state is set (i.e. to prevent losing the job if we have communication or timeout issues).
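For the logrotate race, one common mitigation (our suggestion, not something Slurm mandates) is to rotate the epilog log with logrotate's `copytruncate` option, so a long-running epilog that still holds the file open via `>>` keeps writing to the live file rather than to a renamed one. A hypothetical fragment:

```conf
# Hypothetical /etc/logrotate.d/slurmctld_epilog
/var/log/slurm/slurmctld_epilog.log {
    weekly
    rotate 4
    compress
    missingok
    # Copy the log, then truncate the original in place, so processes
    # that still hold the file descriptor keep logging to the same inode.
    copytruncate
}
```

Note that `copytruncate` can itself drop lines written in the small window between the copy and the truncate, so it narrows the race rather than eliminating it.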
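One way to defuse the second race would be to bound the retry loop so the epilog eventually gives up instead of spinning forever on a job that no longer exists. A sketch under our assumptions (the `retry` helper and its limits are illustrative names, not part of Slurm; the sleep bound is shortened from the 30+ seconds used in the real epilog):

```bash
#!/bin/bash
# Bounded retry with jittered backoff. After max_tries failures we return
# the last exit status instead of looping forever, so a job deleted
# mid-loop (e.g. by scancel) cannot wedge the epilog.
retry() {
    local max_tries=$1; shift
    local i status=0
    for ((i = 1; i <= max_tries; i++)); do
        "$@" && return 0
        status=$?
        sleep $(( RANDOM % (1 + i) ))   # jitter to avoid hammering slurmctld
    done
    return $status
}

# In the epilog this would be invoked as something like:
#   retry 10 scontrol requeuehold state=SpecialExit "${SLURM_JOB_ID}"
```

This trades the original guarantee ("keep trying until SpecialExit is set") for liveness; a fully safe fix still needs the controller to perform the requeue atomically, as the report argues.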