| Summary: | Batch JobId= missing from node 0 / job credential revoked | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Phil Schwan <phils> |
| Component: | slurmctld | Assignee: | David Bigagli <david> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 1 - System not usable | ||
| Priority: | --- | CC: | da, stuartm |
| Version: | 14.11.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | DownUnder GeoSolutions | Slinky Site: | --- |
| CLE Version: | | Version Fixed: | 14.03.4 |
| Attachments: | slurmctld epilog in C | ||
Description
Phil Schwan
2014-06-06 13:21:26 MDT
The root cause of the problem seems to be that the epilog thinks it's failing (but appears to actually be succeeding). From our epilog log:

> Sat Jun 7 01:44:50 WST 2014 SLURM_ARRAY_JOB_ID=2735047 SLURM_ARRAY_TASK_ID=1996 SLURM_CLUSTER_NAME=perth SLURM_JOBID=2735054
> SLURM_JOB_ACCOUNT=(null) SLURM_JOB_CONSTRAINTS=(null) SLURM_JOB_DERIVED_EC=0 SLURM_JOB_EXIT_CODE=25600 SLURM_JOB_EXIT_CODE2=100:0
> SLURM_JOB_GID=2114 SLURM_JOB_GROUP=teamswan SLURM_JOB_ID=2735054 SLURM_JOB_NAME=dp_NMO SLURM_JOB_NODELIST=clus509
> SLURM_JOB_PARTITION=teamswan,teamswanTest SLURM_JOB_UID=1288 JOB_EXIT=100
> Job can not be altered now, try again later for job 2735047_1996 (2735054)
> slurm_requeue error: Job can not be altered now, try again later

To ensure that jobs don't vanish (see bug 717), our epilog attempts the requeue until it succeeds, so this runs in a loop. Unfortunately, it appears that the requeue is "failing" no matter how many times it is attempted:

> Job can not be altered now, try again later for job 2735047_1996 (2735054)
> slurm_requeue error: Job can not be altered now, try again later

...which is why we have this on our slurmctld node:

> # ps auxw | grep epilog.sh | wc -l
> 1154

But of course it's not failing. It's requeueing the job just fine, which is why the job keeps getting killed and put into SE. It's also managing to bring jobs back from the dead after an explicit scancel (!)

I'm probably going to try to roll back to a version from a few days ago, to see if I can find some stability.

I wonder if perhaps bab22e4f8ad335 is to blame for the epilog thinking it's failing over and over, because it will be attempting to requeue jobs that are in CG state?

Yes, the commit bab22e4f8ad335 is not right, we got carried away... We should return success and requeue the job, basically allowing requeue of jobs in completing state. I realized this during the weekend but you were faster... sorry.

In any case I would change your requeue script to not retry scontrol forever, but just a few times. Another option would be to use the job_requeue() API; it will be faster and give you better control over the return codes.

David

Created attachment 914
slurmctld epilog in C

Commit a9c1c8e5f50f98f0f6 reverts bab22e4f8ad335.

David

Well, I think the intention is to do away with the epilog requeue entirely (per bug 717) -- right?

Absolutely yes. Actually I am going to retest the development in light of the new fixes, especially the dependencies.

David

However, that will take some time for you to implement I guess, so the option of using the API is probably better: faster, and with better control over the return code than a script. Just an idea...

David

Revert bad commit.

David
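For reference, the suggestion above to retry the requeue only a few times rather than forever could look roughly like the sketch below. It is not the site's actual epilog: `retry_bounded` is a hypothetical helper name, and the attempt count and delay are arbitrary. Note that in this bug the requeue was succeeding while reporting an error, so a bounded loop would only cap the pile-up of `epilog.sh` processes, not fix the underlying behaviour.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): run a command up to
# max_attempts times, sleeping delay seconds between failed attempts.
# Returns 0 on the first success, 1 if every attempt failed.
retry_bounded() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Only sleep if another attempt is still to come.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Possible use inside an epilog, assuming SLURM_JOB_ID is set as in the
# log above (5 attempts and a 2 second delay are illustrative values):
#   retry_bounded 5 2 scontrol requeue "$SLURM_JOB_ID" ||
#       echo "giving up on requeue of job $SLURM_JOB_ID" >&2
```

Giving up after a fixed number of attempts means a job may occasionally not be requeued, so the failure branch should log enough to allow a manual requeue later.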