| Summary: | messages about old jobs | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Anthony DelSorbo <anthony.delsorbo> |
| Component: | slurmctld | Assignee: | Albert Gil <albert.gil> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | rmoye |
| Version: | 18.08.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | NOAA | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | NESCC | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | Any | CLE Version: | |
| Version Fixed: | 18.08.8 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | syslog containing slurmctld logs | ||
|
Description
Anthony DelSorbo
2019-04-12 09:36:21 MDT
Hi Anthony,

Could you please provide the output of the following command?

    $ sacct -Plj 132335

I'm not certain, but it looks like Slurm is trying to kill that job due to a time limit, while the controller's internal record of the job is wrong for some reason. For a better diagnosis, could you also provide the slurmctld and slurmd logs for job 132335?

Thanks,
Albert

WARNING: Please note that there is a regression bug in sacct in version 18.08.6 that makes the above command return empty output (bug 6697). We released 18.08.7 early to fix it, and we strongly recommend upgrading to 18.08.7 to avoid it.

Albert,

I attempted to send you data on this yesterday directly from the batch server via e-mail, but it seems to have failed. Is there a method I should use to send support files to you directly? I'd rather not have to copy them to my laptop and then upload them to you from there.

I'm afraid there's not much information from the sacct -Plj command:

    [root@bqs5 ~]# sacct -Plj 132335
    JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot
    132335+0|132335|test|selene||||||||||||||||||0|00:00:00|FAILED|1:0||Unknown|Unknown|Unknown|2500Mc||||||||||||billing=1,cpu=1,mem=2500M,node=1||||||||||||||

Here's a snippet of the batch system logs (slurmctld output is sent to syslog) at the time of the job submission:

    Mar 28 17:42:15 bqs5 EPILOG-CTLD: Job 132334: epilog.ctld.sh ending
    Mar 28 17:42:15 bqs5 slurmctld[16421]: debug: sched: Running job scheduler
    Mar 28 17:42:15 bqs5 slurmctld[16421]: debug: sched: Running job scheduler
    Mar 28 17:42:15 bqs5 slurmctld[16421]: debug: sched: Running job scheduler
    Mar 28 17:42:44 bqs5 slurmctld[16421]: debug: backfill: beginning
    Mar 28 17:42:44 bqs5 slurmctld[16421]: debug: backfill: no jobs to backfill
    Mar 28 17:42:55 bqs5 slurmctld[16421]: debug: sched: Running job scheduler
    Mar 28 17:42:57 bqs5 slurmctld[16421]: debug: Requesting control from backup controller bqs6
    Mar 28 17:42:57 bqs5 slurmctld[16421]: error: _shutdown_bu_thread:send/recv bqs6: Connection refused
    Mar 28 17:43:28 bqs5 sshd[15745]: Connection closed by 10.181.4.12
    Mar 28 17:43:46 bqs5 slurmctld[16421]: _job_create: exceeded association/QOS limit for user 20411: MaxSubmitJobsPerAccount
    Mar 28 17:43:46 bqs5 slurmctld[16421]: _slurm_rpc_submit_batch_pack_job: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
    Mar 28 17:43:46 bqs5 slurmctld[16421]: Cancelling aborted pack job submit: JobId=132335+0(132335)
    Mar 28 17:43:47 bqs5 slurmctld[16421]: debug: backfill: beginning
    Mar 28 17:43:47 bqs5 slurmctld[16421]: debug: backfill: no jobs to backfill
    Mar 28 17:43:55 bqs5 slurmctld[16421]: debug: sched: Running job scheduler

I'll pull the slurmctld logs and send those up to you ...

Tony.

Actually, it turns out that's the best I can do, as the slurmctld.log files are rotated daily and only the last 5 days are kept due to their size. I can send you the entire syslog for the first day of the job if you think that's useful. Also, since the job failed to launch, there are no slurmd logs either.

Hi Tony,

> I attempted to send you data on this yesterday directly from the batch
> server via e-mail, but it seems to have failed. Is there a method I should
> use to be able to directly send you support files? I'd rather not have to
> send them to my laptop, then upload them to you from there.

As Jason said in bug 5957 comment 19, right now we do not offer another way to send files. But if you have a file server that you would like to use to share files with us, please go ahead and let us know how to pull them from it.

> I'm afraid there's not much information from the sacct -Plj command:
> Here's a snippet of the batch system logs (slurmctld is being sent to the
> syslogs) at the time of the job submit:

From your logs, I can see the following scenario:
- The HetJob 132335 was rejected due to MaxSubmitJobsPerAccount.
- But for some reason it was not purged correctly, and it is still in the controller's job_list.
- As the job is very old, the controller tries to kill it because of its time limit, but since the record contains NULL pointers (it was not purged correctly), the kill operation fails and the error that you see is printed.
- The rejection (and purge) happens right after an error between the main controller and the backup.

Now I'm working on two parallel tasks:
- Diagnosing why that happened, to fix the logic so it cannot happen again.
- Finding a way to correctly purge that job in your situation, so the error stops being printed. As it looks harmless, we could simply suppress the printed error, but I would rather make sure that the controller's job list is really clean.

I'll come back here when I have more progress,
Albert

Hi Anthony,

I've been digging into this and, although I don't have a fix yet, I have some good clues about where the issue could be. I've been able to replicate the problem, but it turns out that it only happens when the controller is restarted after a rejected hetjob. More specifically, I think it is related to how the rejection of the hetjob is saved to / cleaned from the StateSaveLocation, so it probably also happens when a backup controller is enabled (which seems to be your case, judging from your logs).

Just to double-check that I'm replicating the same issue as yours, could you verify that you see these kinds of lines in the slurmctld log when you (re)start it?

    ERROR: [../../../slurm/src/common/hostlist.c:1745] Invalid range: `(null)': Invalid argument
    slurmctld: error: JobId=34+0(34) has invalid pack_job_id_set((null))
    slurmctld: error: build_cg_bitmap: node_bitmap is NULL
    slurmctld: debug2: acct_policy_remove_job_submit: used_submit_jobs underflow for account development

Thanks,
Albert

Hi Tony,

I just want to let you know that we've found the root cause of the problem explained in comment 10, and a patch is being developed so that no more messages about old jobs keep appearing in your logs. We're also developing a patch to actually purge the "corrupted" jobs that you already have and that show up in the logs, as mentioned in comment 5.

By the way, can you confirm what I asked for in comment 10? Just to be sure that we've found, and are fixing, the same problem that you are facing.

Thanks,
Albert

Created attachment 10212 [details]
syslog containing slurmctld logs

This log contains all entries from slurmctld during the time of the incident.
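As a side note for anyone triaging a similar ticket: the pipe-delimited output of `sacct -P` quoted earlier is easy to post-process. A minimal sketch, assuming an abbreviated sample of the output above; the `sacct_field` helper is illustrative, not part of Slurm:

```shell
# Sketch: pull a named field out of pipe-delimited `sacct -P` output.
# Reads sacct output on stdin; the first line is the header row.
sacct_field() {  # usage: ... | sacct_field FIELDNAME
  awk -F'|' -v f="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == f) col = i; next }
    col     { print $col }'
}

# Abbreviated sample of the sacct output quoted in this ticket:
sample='JobID|JobName|Partition|Elapsed|State|ExitCode|ReqMem
132335+0|test|selene|00:00:00|FAILED|1:0|2500Mc'

printf '%s\n' "$sample" | sacct_field State     # FAILED
printf '%s\n' "$sample" | sacct_field ExitCode  # 1:0
```

In a live setup the same helper would be fed from `sacct -Plj <jobid>` directly.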
Thanks Tony,

From the logs I can see that the messages started appearing right after an "scontrol reconfigure" rather than after a restart, as in my reproduction, but I can confirm that the problem is the same:

    Mar 28 17:45:52 bqs5 slurmctld[16421]: Processing RPC: REQUEST_RECONFIGURE from uid=377
    Mar 28 17:45:52 bqs5 slurmctld[16421]: error: JobId=132335+0(132335) has invalid pack_job_id_set((null))

Hi Tony,
> I just want to let you know that we've found the root cause of the problem
> explained in comment 10 and a patch is being developed to avoid that more
> messages of old jobs keep appearing in your logs.
> We're also developing some patch to actually purge the "corrupted jobs" that
> you already have and appear to the logs, as mentioned in comment 5.
We have already developed a patch that fixes both, and we are reviewing it before committing it to the repo.
We'll keep you updated,
Albert
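In the meantime, a quick grep is enough to check a log for the signatures Albert lists in comment 10. A sketch; the `check_corrupted_jobs` helper name is hypothetical and the log path is site-specific:

```shell
# Sketch: scan a log file for the corrupted-hetjob signatures from comment 10
# (pack_job_id_set / build_cg_bitmap / used_submit_jobs underflow).
# check_corrupted_jobs is a hypothetical helper, not Slurm tooling.
check_corrupted_jobs() {  # usage: check_corrupted_jobs LOGFILE
  grep -E 'invalid pack_job_id_set|build_cg_bitmap: node_bitmap is NULL|used_submit_jobs underflow' "$1"
}

# Example usage (adjust the path to your syslog or slurmctld log):
# check_corrupted_jobs /var/log/messages
```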
Hi Tony,

The patch has been pushed to the slurm-18.08 and slurm-19.05 branches, and it will be part of the next minor releases, 18.08.8 and 19.05.1:

- https://github.com/SchedMD/slurm/commit/43ec65b9e6f88f4800c185a5433d82adadb72b54
- https://github.com/SchedMD/slurm/commit/2d587530e488b7f68efa6985f2506c708de9a7cd

I'm closing this bug as fixed,
Albert

*** Ticket 12781 has been marked as a duplicate of this ticket. ***
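For readers landing here later: since the fix first ships in 18.08.8 and 19.05.1, a small version comparison tells whether a given installation already includes it. A sketch using GNU `sort -V`; the `has_fix` helper is illustrative, and the `sinfo --version` output format ("slurm X.Y.Z") is an assumption to verify on your system:

```shell
# Sketch: does an installed Slurm version include the fix from this ticket?
# Fixed in 18.08.8 (18.08 branch) and 19.05.1 (19.05 branch); later major
# releases were cut after the fix. has_fix is a hypothetical helper.
has_fix() {  # usage: has_fix 18.08.7
  case "$1" in
    18.08.*) min=18.08.8 ;;
    19.05.*) min=19.05.1 ;;
    *)       min=19.05.1 ;;  # assumes anything newer than 19.05 carries it
  esac
  # true when $1 >= $min (GNU sort -V performs the version comparison)
  [ "$(printf '%s\n' "$min" "$1" | sort -V | head -n1)" = "$min" ]
}

# Example usage:
# has_fix "$(sinfo --version | awk '{print $2}')" && echo "patched"
```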