| Summary: | messages about old jobs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Anthony DelSorbo <anthony.delsorbo> |
| Component: | slurmctld | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | rmoye |
| Version: | 18.08.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NOAA | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | NESCC | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | Any | CLE Version: | |
| Version Fixed: | 18.08.8 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | syslog containing slurmctld logs | ||
|
Description
Anthony DelSorbo
2019-04-12 09:36:21 MDT
Hi Anthony,

Could you please provide the output of the following command?

    $ sacct -Plj 132335

I'm not certain, but it looks like Slurm is trying to kill that job due to a time limit, while the controller's internal record of the job is wrong for some reason. For a better diagnosis, could you also provide the slurmctld and slurmd logs for job 132335?

Thanks,
Albert

WARNING: Please note that there is a regression bug in sacct in version 18.08.6 that makes the above command return empty output (bug 6697). We released 18.08.7 early to fix it, and we strongly recommend upgrading to 18.08.7 to avoid it.

Albert,

I attempted to send you data on this yesterday directly from the batch server via e-mail, but it seems to have failed. Is there a method I should use to send support files to you directly? I'd rather not have to copy them to my laptop and then upload them to you from there.

I'm afraid there's not much information from the sacct -Plj command:

    [root@bqs5 ~]# sacct -Plj 132335
    JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot
    132335+0|132335|test|selene||||||||||||||||||0|00:00:00|FAILED|1:0||Unknown|Unknown|Unknown|2500Mc||||||||||||billing=1,cpu=1,mem=2500M,node=1||||||||||||||

Here's a snippet of the batch system logs (slurmctld output is sent to syslog) at the time of the job submission:

    Mar 28 17:42:15 bqs5 EPILOG-CTLD: Job 132334: epilog.ctld.sh ending
    Mar 28 17:42:15 bqs5 slurmctld[16421]: debug: sched: Running job scheduler
    Mar 28 17:42:15 bqs5 slurmctld[16421]: debug: sched: Running job scheduler
    Mar 28 17:42:15 bqs5 slurmctld[16421]: debug: sched: Running job scheduler
    Mar 28 17:42:44 bqs5 slurmctld[16421]: debug: backfill: beginning
    Mar 28 17:42:44 bqs5 slurmctld[16421]: debug: backfill: no jobs to backfill
    Mar 28 17:42:55 bqs5 slurmctld[16421]: debug: sched: Running job scheduler
    Mar 28 17:42:57 bqs5 slurmctld[16421]: debug: Requesting control from backup controller bqs6
    Mar 28 17:42:57 bqs5 slurmctld[16421]: error: _shutdown_bu_thread:send/recv bqs6: Connection refused
    Mar 28 17:43:28 bqs5 sshd[15745]: Connection closed by 10.181.4.12
    Mar 28 17:43:46 bqs5 slurmctld[16421]: _job_create: exceeded association/QOS limit for user 20411: MaxSubmitJobsPerAccount
    Mar 28 17:43:46 bqs5 slurmctld[16421]: _slurm_rpc_submit_batch_pack_job: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
    Mar 28 17:43:46 bqs5 slurmctld[16421]: Cancelling aborted pack job submit: JobId=132335+0(132335)
    Mar 28 17:43:47 bqs5 slurmctld[16421]: debug: backfill: beginning
    Mar 28 17:43:47 bqs5 slurmctld[16421]: debug: backfill: no jobs to backfill
    Mar 28 17:43:55 bqs5 slurmctld[16421]: debug: sched: Running job scheduler

I'll pull the slurmctld logs and send those up to you ...

Tony.

Actually, it turns out that's the best I can do, as the slurmctld.log files are rotated daily and only the last 5 days are kept due to their size. I can send you the entire syslog for the first day of the job if you think that's useful. Also, since the job failed to launch, there are no slurmd logs either.

Hi Tony,

> I attempted to send you data on this yesterday directly from the batch
> server via e-mail, but it seems to have failed. Is there a method I should
> use to be able to directly send you support files? I'd rather not have to
> send them to my laptop, then upload them to you from there.

As Jason said in bug 5957 comment 19, right now we do not offer another way to send files. But if you have a file server that you would like to use to share files with us, please go ahead and let us know how to pull them from it.

> I'm afraid there's not much information from the sacct -Plj command:
> Here's a snippet of the batch system logs (slurmctld is being sent to the
> syslogs) at the time of the job submit:

From your logs, I can see the following scenario:
- The HetJob 132335 was rejected due to MaxSubmitJobsPerAccount.
- But for some reason it was not purged correctly, and it is still in the controller's job_list.
- As the job is very old, the controller tries to kill it because of its time limit, but since the record contains NULL pointers (it was not purged correctly), the kill operation fails and the error that you see is printed.
- The rejection (and purge) happens right after an error between the main controller and the backup.

Now I'm working on two parallel tasks:
- Diagnosing why that happened, to fix the logic so it cannot happen again.
- Finding a way to correctly purge that job in your situation, so the error stops being printed. As it looks harmless, we could simply suppress the printed error, but I would rather make sure that the controller's job list is really clean.

I'll come back here when I have more progress,
Albert

Hi Anthony,

I've been digging into this and, although I don't have a fix yet, I have some good clues about where the issue could be. I've been able to replicate the problem, but it turns out that it only happens when the controller is restarted after a rejected hetjob. More specifically, I think it is related to how the rejection of the hetjob is saved to / cleaned from the StateSaveLocation, so it probably also happens when a backup controller is enabled (which seems to be your case, judging from your logs).

Just to double-check that I'm replicating the same issue as yours, could you verify that you see these kinds of lines in the slurmctld log when you (re)start it?

    ERROR: [../../../slurm/src/common/hostlist.c:1745] Invalid range: `(null)': Invalid argument
    slurmctld: error: JobId=34+0(34) has invalid pack_job_id_set((null))
    slurmctld: error: build_cg_bitmap: node_bitmap is NULL
    slurmctld: debug2: acct_policy_remove_job_submit: used_submit_jobs underflow for account development

Thanks,
Albert

Hi Tony,

I just want to let you know that we've found the root cause of the problem explained in comment 10, and a patch is being developed so that no more messages about old jobs keep appearing in your logs. We're also developing a patch to actually purge the "corrupted" jobs that you already have and that show up in the logs, as mentioned in comment 5.

By the way, can you confirm what I asked for in comment 10? Just to be sure that we've found, and are fixing, the same problem that you are facing.

Thanks,
Albert

Created attachment 10212 [details]
syslog containing slurmctld logs

This log contains all entries from slurmctld during the time of the incident.
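As a side note for anyone triaging a similar ticket: the pipe-delimited output of `sacct -P` quoted earlier is easy to post-process. A minimal sketch, assuming an abbreviated sample of the output above; the `sacct_field` helper is illustrative, not part of Slurm:

```shell
# Sketch: pull a named field out of pipe-delimited `sacct -P` output.
# Reads sacct output on stdin; the first line is the header row.
sacct_field() {  # usage: ... | sacct_field FIELDNAME
  awk -F'|' -v f="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == f) col = i; next }
    col     { print $col }'
}

# Abbreviated sample of the sacct output quoted in this ticket:
sample='JobID|JobName|Partition|Elapsed|State|ExitCode|ReqMem
132335+0|test|selene|00:00:00|FAILED|1:0|2500Mc'

printf '%s\n' "$sample" | sacct_field State     # FAILED
printf '%s\n' "$sample" | sacct_field ExitCode  # 1:0
```

In a live setup the same helper would be fed from `sacct -Plj <jobid>` directly.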
Thanks Tony,

From the logs I can see that the messages started appearing right after an "scontrol reconfigure" rather than after a restart, as in my reproduction, but I can confirm that the problem is the same:

    Mar 28 17:45:52 bqs5 slurmctld[16421]: Processing RPC: REQUEST_RECONFIGURE from uid=377
    Mar 28 17:45:52 bqs5 slurmctld[16421]: error: JobId=132335+0(132335) has invalid pack_job_id_set((null))

Hi Tony,
> I just want to let you know that we've found the root cause of the problem
> explained in comment 10 and a patch is being developed to avoid that more
> messages of old jobs keep appearing in your logs.
> We're also developing some patch to actually purge the "corrupted jobs" that
> you already have and appear to the logs, as mentioned in comment 5.
We have already developed a patch that fixes both, and we are reviewing it before committing it to the repo.
We'll keep you updated,
Albert
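In the meantime, a quick grep is enough to check a log for the signatures Albert lists in comment 10. A sketch; the `check_corrupted_jobs` helper name is hypothetical and the log path is site-specific:

```shell
# Sketch: scan a log file for the corrupted-hetjob signatures from comment 10
# (pack_job_id_set / build_cg_bitmap / used_submit_jobs underflow).
# check_corrupted_jobs is a hypothetical helper, not Slurm tooling.
check_corrupted_jobs() {  # usage: check_corrupted_jobs LOGFILE
  grep -E 'invalid pack_job_id_set|build_cg_bitmap: node_bitmap is NULL|used_submit_jobs underflow' "$1"
}

# Example usage (adjust the path to your syslog or slurmctld log):
# check_corrupted_jobs /var/log/messages
```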
Hi Tony,

The patch has been pushed to the slurm-18.08 and slurm-19.05 branches, and it will be part of the next minor releases, 18.08.8 and 19.05.1:

- https://github.com/SchedMD/slurm/commit/43ec65b9e6f88f4800c185a5433d82adadb72b54
- https://github.com/SchedMD/slurm/commit/2d587530e488b7f68efa6985f2506c708de9a7cd

I'm closing this bug as fixed,
Albert

*** Ticket 12781 has been marked as a duplicate of this ticket. ***
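For readers landing here later: since the fix first ships in 18.08.8 and 19.05.1, a small version comparison tells whether a given installation already includes it. A sketch using GNU `sort -V`; the `has_fix` helper is illustrative, and the `sinfo --version` output format ("slurm X.Y.Z") is an assumption to verify on your system:

```shell
# Sketch: does an installed Slurm version include the fix from this ticket?
# Fixed in 18.08.8 (18.08 branch) and 19.05.1 (19.05 branch); later major
# releases were cut after the fix. has_fix is a hypothetical helper.
has_fix() {  # usage: has_fix 18.08.7
  case "$1" in
    18.08.*) min=18.08.8 ;;
    19.05.*) min=19.05.1 ;;
    *)       min=19.05.1 ;;  # assumes anything newer than 19.05 carries it
  esac
  # true when $1 >= $min (GNU sort -V performs the version comparison)
  [ "$(printf '%s\n' "$min" "$1" | sort -V | head -n1)" = "$min" ]
}

# Example usage:
# has_fix "$(sinfo --version | awk '{print $2}')" && echo "patched"
```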