| Summary: | Job array stuck and can't cancel - annoying user with multiple emails | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Wei Feinstein <wfeinstein> |
| Component: | User Commands | Assignee: | Brian Christiansen <brian> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | aradeva, brian.gilmer, bweigand, jonathon.anderson, kaylea.nelson, remi-externe.palancher |
| Version: | 17.11.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | LBNL - Lawrence Berkeley National Laboratory | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | Logs for job in slurmctld.log. | | |
Hey Jackie,

The jobs almost certainly got into this state due to a bug which was fixed in commit (in 17.11.4): https://github.com/SchedMD/slurm/commit/f381e4e6abca6ce45709b86989112442487f856a

The job array hashes are getting corrupted, so the jobs can't be found to be killed. Also, the meta job of the array is being cleared from the slurmctld's memory -- it should stay in memory until all of the tasks are gone -- so the jobs can't be requeued (the meta job holds the job script for the array). If you can apply this patch sooner than 17.11.4, that would help prevent future incidents.

As for resolving the issue now: if you restart the slurmctld, the jobs should go away. Since the meta job of the array should already be out of slurmctld's memory, on restart the slurmctld will see that each job's script is missing, mark the jobs as FAILED, and clean them up. You'll see messages like:

error: Script for job 15152 lost, state set to FAILED

Can you try restarting the controller?

Thanks,
Brian

An "scontrol reconfigure" will also cause the slurmctld to check for jobs without job files.

I will perform that tomorrow, since it is already after hours.

Thanks,
Jackie

Hey Jackie, were you able to clear up the array jobs?

Thanks,
Brian

Yes, restarting slurmctld worked.
Thanks,
Jackie

Great. Thanks Jackie.

*** Ticket 5079 has been marked as a duplicate of this ticket. ***
*** Ticket 5170 has been marked as a duplicate of this ticket. ***
*** Ticket 5272 has been marked as a duplicate of this ticket. ***
*** Ticket 9676 has been marked as a duplicate of this ticket. ***
*** Ticket 13138 has been marked as a duplicate of this ticket. ***
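The two recovery paths Brian describes above can be sketched as a shell session. This is a minimal sketch of the approach, not site-specific instructions: the systemd unit name and slurmctld log path are assumptions and vary by installation, and the job ID is the one from this ticket.

```shell
# Option 1: restart the controller. On restart, slurmctld notices that
# the array tasks' job script is missing, sets those jobs to FAILED,
# and purges them. (Unit name is an assumption; some sites use init scripts.)
systemctl restart slurmctld

# Option 2: a reconfigure also triggers the check for jobs whose job
# files are missing, without a full controller restart.
scontrol reconfigure

# Confirm the cleanup in the controller log (log path is site-specific):
grep "lost, state set to FAILED" /var/log/slurm/slurmctld.log

# Verify the stuck array tasks are gone:
squeue -j 11840060
```

Either path works because both cause slurmctld to re-validate its job records against the on-disk job scripts, which is exactly the state the corrupted array hashes had broken.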
Created attachment 6239 [details]
Logs for job in slurmctld.log.

The following jobs are orphaned and cannot be cancelled by either the user or an admin:

11840060_2 lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)
11840060_3 lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)
11840060_4 lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)
11840060_5 lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)
11840060_6 lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)
11840060_7 lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)
11840060_8 lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)
11840060_9 lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)
11840060_10 lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)
11840060_11 lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)
11840060_13 lr3 thm_dime aszasz RH 0:00 1 (JobHoldMaxRequeue)

I tried the following steps: scontrol release (to release the JobHoldMaxRequeue hold), and then scancel with the -f and --signal=TERM options. When the jobs were released they were in PD with (BeginTime), and then they went right back to JobHoldMaxRequeue.

The user is getting annoyed by the number of emails he is receiving, and I have not found a way to remove his orphaned tasks. I am enclosing the logs that show the status of his job from Slurm. "scontrol show job" no longer shows the job, so I cannot kill it.

Also, is there a way to stop the emails for the user until this is resolved?

Thanks,
Jackie
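The steps attempted in the report above amount to the following shell session (job ID taken from the report). As described, none of these had a lasting effect, because the controller could no longer locate the job records for the corrupted array.

```shell
# Release the held array tasks (they were stuck in JobHoldMaxRequeue):
scontrol release 11840060

# Attempt to cancel: first forced, then with an explicit signal.
scancel -f 11840060
scancel --signal=TERM 11840060

# Check the result, one array task per line. Here the tasks briefly
# showed PD/(BeginTime) and then fell back to JobHoldMaxRequeue:
squeue -j 11840060 -r
```

The release/cancel cycle fails precisely because of the bug diagnosed later in the ticket: the array's meta job was evicted from slurmctld's memory, so scancel cannot resolve the task records it is asked to kill.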