| Summary: | Job special error states via epilogslurmctld have race conditions | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Stuart Midgley <stuartm> |
| Component: | slurmctld | Assignee: | David Bigagli <david> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | da, phils |
| Version: | 14.03.0 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 14.11.0pre1 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
The idea we have is to specify a configurable parameter

    RequeueExitValues=1,2,3,4,5

then, if the job exits with one of these values, we requeue it in a special state. Would this work?

David

Yes, I expect it would work very well! We have a wrapper script that catches all internal exit codes from user programs, whatever they might be, and normalises them to a consistent set of values.

After some thought and discussion, we might have another idea. We could set MinJobAge=infinite (or something really, really large) and then have the epilogslurmctld remove successful jobs from Slurm. That way it is fail-safe: failed jobs will sit in CG state in the queue forever, and people would have to requeue them to get them to rerun once they have fixed the problem.

Oops... a value of zero does what we want... shows how well we can read :)

If we go down the road of setting MinJobAge=0, what does that do for accounting? And any other systems...

(In reply to Stuart Midgley from comment #5)
> If we go down the road of setting MinJobAge=0 what does that do for
> accounting? And any other systems...

It means the job records will never get purged from the slurmctld daemon. Performance will take a big hit and you'll exhaust memory. You could set it fairly large (say an hour), but not zero.

But our epilogslurmctld will purge it.

(In reply to Stuart Midgley from comment #7)
> But our epilogslurmctld will purge it.

You can cancel it, but I don't see how your script could purge it from slurmctld's memory.

Right, I understand. So you're saying that even if we scancel the job, it will still stay in memory. Bugger. Closing.

David

Does that mean we're not doing your proposal in comment 1?

Wrong comment... sorry.

Done. Feature commit number 60e18f3456f224.

David

Thanks.

(In reply to David Bigagli from comment #1)
> The idea we have is to specify a configurable parameter
>
> RequeueExitValues=1,2,3,4,5
>
> then if the job exits with one of these we requeue it in special state.

To double check: do we only need to account for exit codes actually returned by the job itself? Or do we also need to list every possible internal Slurm error code? (Which does not sound like a very robust option.) For example, in bug 855 we see:

> [2014-06-04T09:58:15.042] completing job 2534566 status 25600

We would never have known to add 25600 to RequeueExitValues, because that value is internal to Slurm.

Hi, yes, you only need to account for the exit codes of the job itself. The number 25600 you see is a little misleading: it is the raw return status of your job as reported by the kernel, which then has to be processed by the WIFEXITED(), WEXITSTATUS() and possibly other macros. It is worth fixing it so the log prints the exit code the user expects.

David
Afternoon. I think we are getting to the point where doing the job special error states in the epilogslurmctld is almost unworkable. There are race conditions we can't think of good ways to solve; we really need the queue to perform the operation atomically. Here is our epilogslurmctld:

```bash
#!/bin/bash
{
    export PATH=/d/sw/slurm/latest/bin:/d/sw/slurm/latest/sbin:$PATH
    JOB_EXIT=${SLURM_JOB_EXIT_CODE2%:*}
    echo "$(date) "$(for v in ${!SLURM_*} JOB_EXIT; do echo -n "${v}=${!v} "; done)
    if [[ "$SLURM_JOB_EXIT_CODE2" != "0:0" ]] || (( JOB_EXIT > 0 )); then
        # echo "$(date) "$(for v in ${!SLURM_*}; do echo -n "${v}=${!v} "; done)
        for ((i = 1; ; i++)); do
            status=0
            scontrol requeuehold state=SpecialExit ${SLURM_JOB_ID} && break
            status=$?
            sleep $(( RANDOM % (30 + i) ))
        done
        exit $status
    fi
} >> /var/log/slurm/slurmctld_epilog.log 2>&1
```

This has at least two race conditions. First, the log is controlled by logrotate, and if rotation occurs while we are in the scontrol call, the sleep, or anywhere else in the script, we lose logging information. To be honest, we can probably work around this, and it is not a "massive" production issue.

The second is more of a problem, and I think we have already seen it happen. If a user deletes the job while we are in the loop, the scontrol will continue to fail forever... and the epilog will never finish. We need the infinite loop to try to ensure that the SpecialExit state is set (i.e. to prevent losing the job if we have communication or timeout issues).
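For the logrotate race, one common mitigation (our suggestion, not something Slurm mandates) is to rotate the epilog log with logrotate's `copytruncate` option, so a long-running epilog that still holds the file open via `>>` keeps writing to the live file rather than to a renamed one. A hypothetical fragment:

```conf
# Hypothetical /etc/logrotate.d/slurmctld_epilog
/var/log/slurm/slurmctld_epilog.log {
    weekly
    rotate 4
    compress
    missingok
    # Copy the log, then truncate the original in place, so processes
    # that still hold the file descriptor keep logging to the same inode.
    copytruncate
}
```

Note that `copytruncate` can itself drop lines written in the small window between the copy and the truncate, so it narrows the race rather than eliminating it.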
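One way to defuse the second race would be to bound the retry loop so the epilog eventually gives up instead of spinning forever on a job that no longer exists. A sketch under our assumptions (the `retry` helper and its limits are illustrative names, not part of Slurm; the sleep bound is shortened from the 30+ seconds used in the real epilog):

```bash
#!/bin/bash
# Bounded retry with jittered backoff. After max_tries failures we return
# the last exit status instead of looping forever, so a job deleted
# mid-loop (e.g. by scancel) cannot wedge the epilog.
retry() {
    local max_tries=$1; shift
    local i status=0
    for ((i = 1; i <= max_tries; i++)); do
        "$@" && return 0
        status=$?
        sleep $(( RANDOM % (1 + i) ))   # jitter to avoid hammering slurmctld
    done
    return $status
}

# In the epilog this would be invoked as something like:
#   retry 10 scontrol requeuehold state=SpecialExit "${SLURM_JOB_ID}"
```

This trades the original guarantee ("keep trying until SpecialExit is set") for liveness; a fully safe fix still needs the controller to perform the requeue atomically, as the report argues.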