Ticket 6659

Summary: _remove_accrue_time_internal: accrue_cnt underflow
Product: Slurm
Reporter: Marshall Garey <marshall>
Component: slurmctld
Assignee: Marshall Garey <marshall>
Status: RESOLVED DUPLICATE
QA Contact: Danny Auble <da>
Severity: 4 - Minor Issue
Priority: ---
CC: alex, nate, sven.sternberger
Version: 18.08.5
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6879
https://bugs.schedmd.com/show_bug.cgi?id=6880
https://bugs.schedmd.com/show_bug.cgi?id=7375
https://bugs.schedmd.com/show_bug.cgi?id=7376
https://bugs.schedmd.com/show_bug.cgi?id=7385
https://bugs.schedmd.com/show_bug.cgi?id=7361
https://bugs.schedmd.com/show_bug.cgi?id=7083
https://bugs.schedmd.com/show_bug.cgi?id=8621
https://bugs.schedmd.com/show_bug.cgi?id=8629
https://bugs.schedmd.com/show_bug.cgi?id=8650
Site: SchedMD
Attachments: gdb output
configuration to reproduce
debugging patch
updated debug patch (with more logs and doesn't segfault)
updated (2) debug patch
fix for 18.08 in progress
patch with debug
18.08 patch proposal
18.08 v2 - adds a commit to remove commented-out code that wouldn't compile if uncommented
v3 - Same as last one, except I exclude the commented code instead of removing it later.
v4 - same as last one, but removes a debugging log message I'd added
v5 - remove erroneous error message
19.05 v2 - patch for 19.05, same functionality as 18.08
18.08 v6 - same as v5 but with line in NEWS
18.08 v7 patchset
18.08 v8 patchset
19.05 v3 patchset
formatting fixes and fix a small mistake

Description Marshall Garey 2019-03-07 14:09:45 MST
I get these errors from slurmctld when I submit a job to multiple partitions, where one of the partitions has a partition QOS and the others don't, and the job has spent some time pending before running. Adding SchedulerParameters=defer is sufficient (for me) to provoke the error; removing defer and running other higher-priority jobs so this one sits in the queue for a while also reproduces it.

[2019-03-07T13:56:19.357] error: _remove_accrue_time_internal: QOS partqos1 accrue_cnt underflow
[2019-03-07T13:56:19.357] error: _remove_accrue_time_internal: QOS partqos1 acct test accrue_cnt underflow
[2019-03-07T13:56:19.357] error: _remove_accrue_time_internal: QOS partqos1 user 1017 accrue_cnt underflow
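
For context, the message means slurmctld tried to decrement an accrue counter that was already zero. Conceptually it is a guarded decrement like the sketch below; this is only an illustration of the pattern, not the actual code in _remove_accrue_time_internal(), and the helper name is made up.

/* Illustration only - the real logic lives in _remove_accrue_time_internal()
 * in src/slurmctld/acct_policy.c. error() is Slurm's logging call. */
static void _decr_accrue_cnt(uint32_t *accrue_cnt, const char *qos_name)
{
	if (*accrue_cnt > 0)
		(*accrue_cnt)--;	/* normal case */
	else				/* counter is already 0: underflow */
		error("%s: QOS %s accrue_cnt underflow", __func__, qos_name);
}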



slurm.conf - relevant info:

PriorityType=priority/multifactor
PriorityWeightAge=100000
# I have other PriorityWeight settings, but I believe age is the only one that matters
AccountingStorageEnforce=qos,safe
EnforcePartLimits=all
PartitionName=DEFAULT State=UP MaxTime=600 Default=NO PriorityTier=0
PartitionName=debug Nodes=v[1-10],voyager2 Default=YES PriorityTier=1
PartitionName=debug2 Nodes=v[1-10] PriorityTier=2 MaxTime=100 Qos=partqos1
PartitionName=all Nodes=v[1-10],voyager2,centosvm,v13
PartitionName=veryshort nodes=v[1-10] MaxTime=1 DefaultTime=0
PartitionName=hidden Hidden=YES Nodes=v[1-10] RootOnly=YES


partqos1: (just the default values, nothing added to it right now)
$ sacctmgr show qos partqos1
      Name   Priority  GraceTime    Preempt PreemptMode                                    Flags UsageThres UsageFactor       GrpTRES   GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit     GrpWall       MaxTRES MaxTRESPerNode   MaxTRESMins     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU     MaxTRESPA MaxJobsPA MaxSubmitPA       MinTRES 
---------- ---------- ---------- ---------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- ------------- 
  partqos1          0   00:00:00                cluster                                                        1.000000


Example job submission that provokes the error in slurmctld

srun -pdebug,debug2,all hostname

Removing debug2 from the partition list removes the error in slurmctld.

srun -pdebug,all hostname



Has anyone else seen this error? Can anyone else reproduce it?
Comment 8 Marshall Garey 2019-05-22 15:31:47 MDT
Created attachment 10332 [details]
configuration to reproduce

This job is sufficient to reproduce this issue:

srun -pdebug,debug2 whereami

Partition debug2 has a partition QOS called partqos1, but the QOS doesn't have any limits or anything special set.
Comment 11 Alejandro Sanchez 2019-05-23 12:21:23 MDT
For the record, yesterday I also managed to trigger this by switching from not having ACCRUE_ALWAYS to having ACCRUE_ALWAYS (with active jobs present) and reconfiguring.

https://bugs.schedmd.com/show_bug.cgi?id=7083

Also, I was seeing calls to acct_policy_handle_accrue_time() for job arrays; if I printed %pJ at the beginning of the function, the jobs were printed in the form <jobid>_* rather than with an individual array task id.

https://bugs.schedmd.com/show_bug.cgi?id=7102
Comment 26 Marshall Garey 2019-06-03 16:59:44 MDT
> (In reply to Alejandro Sanchez from comment #11)
> > For the record yesterday I also managed to trigger this by having active
> > jobs and switching from not having ACCRUE_ALWAYS to having ACCRUE_ALWAYS and
> > reconfiguring.
> > 
> > https://bugs.schedmd.com/show_bug.cgi?id=7083
> 
> I haven't verified that my patch fixes this yet.

Thanks for telling me to look for something with ACCRUE_ALWAYS, Alex. I couldn't make it happen by switching it on and off, but I did find this:

I've now found more underflow errors with dependent jobs and using the ACCRUE_ALWAYS flag. This happens with and without my patch, so it's not caused by my patch, but my patch doesn't fix it yet. My patch is still good for review, but I'll also be working on this.

* PriorityFlags=ACCRUE_ALWAYS, DebugFlags=accrue
* Submit a job to fill the cluster
marshall@voyager:~/slurm/19.05/voyager$ srun -N13 --exclusive sleep 7890&

* Submit a job to multiple partitions, where at least one has a partition qos:
marshall@voyager:~/slurm/19.05/voyager$ srun -pall,all2 whereami &
  * I don't see log messages about this job being added, which also seems like a bug.

marshall@voyager:~/slurm/19.05/voyager$ squeue 
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              9636  all,all2 whereami marshall PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
              9635     debug    sleep marshall  R       6:29     13 v[1-13]

* Submit a job that depends on a previous job without any "after" tags. It doesn't seem to matter what partition it is submitted to, one or multiple, or whether those partitions have QOS's.

If I omit an "after" tag in --depend, it doesn't matter whether the job depends on the running job or the pending job - the errors happen either way. If I include an "after" tag, it only errors when the job depends on the pending job. It only errors for the jobs whose reason is "depend," so there's some subtle difference between including and excluding "after" in --depend that I don't understand. Regardless, it's a bug.

marshall@voyager:~/slurm/19.05/voyager$ srun --depend=after:9635 -pdebug whereami &
< No errors in slurmctld >
marshall@voyager:~/slurm/19.05/voyager$ srun --depend=after:9636 -pdebug whereami &  
< Accrue cnt underflow errors >
marshall@voyager:~/slurm/19.05/voyager$ srun --depend=9635 -pdebug whereami &       
< Accrue cnt underflow errors >
Comment 27 Marshall Garey 2019-06-05 15:47:43 MDT
While my currently pending patch is correct to the best of my knowledge, there are a number of other issues that I've found.

* We don't properly set accrue_cnt on the qos of every partition in the job list. Fixed by my currently pending patch.

* Scontrol resetAccrueTime doesn't actually reset accrue.
  Run these to see what it is before:
    scontrol show job | grep -i accrue
    sprio   (keep running until it gets recalculated - see PriorityCalcPeriod)
  Reset:
    scontrol update jobid=<jobid> resetAccrueTime
  What is it now?
    scontrol show job | grep -i accrue
    sprio   (keep running until it gets recalculated - see PriorityCalcPeriod)
  It isn't reset. We keep resetting it to details_ptr->begin_time.

* We don't prevent adding/removing accrue_cnt when the ACCRUE_ALWAYS flag is set: acct_policy_add_accrue_time() and acct_policy_remove_accrue_time() don't check for it.

* In both acct_policy_add_accrue_time() and acct_policy_handle_accrue_time(), we don't acquire the assoc_mgr locks before checking job_ptr->assoc. Change it to how we do it in acct_policy_remove_accrue_time(), or change acct_policy_remove_accrue_time().

* With dependent jobs, I get underflow errors when switching from ACCRUE_ALWAYS to not having it and running scontrol reconfigure. Reason: with ACCRUE_ALWAYS, the jobs get accrue_time set but are never counted in accrue_cnt. When I switch away, they still have accrue_time. add_accrue_time handles dependent jobs (doesn't add accrue), but remove_accrue_time doesn't, so we get underflow. Additionally, remove_accrue_time skips the removal if the job doesn't have details_ptr or accrue_time set, but add_accrue_time doesn't check those.
  * This is potentially multiple bugs.

* We never update sched_update, so we reacquire the priority_flags every time. This is not strictly a bug, but it does cost us in terms of performance, especially since we have to acquire the lock on the slurmctld configuration.

* In acct_policy_handle_accrue_time(), we use time(NULL) and then a little later we set the variable now=time(NULL) and use now. We should just set now a little earlier and save a time(NULL) call. Again, not a bug, but a very minor performance issue.

* (Need to check) Are there any other places that use set_qos_order (which implies just 2 qos) when they should actually be using a list of qos (up to one per partition plus one for the job)?


* We are inconsistent in how we handle the initial checks in acct_policy_add_accrue_time(), acct_policy_remove_accrue_time(), and acct_policy_handle_accrue_time(). This is causing some of these issues (such as the ACCRUE_ALWAYS issues and the assoc_mgr lock issue). It would be great to have a single function to handle all the init work (such as checking for ACCRUE_ALWAYS, enforcing limits, job pending, job_ptr->details != NULL, etc.) for these functions. A rough sketch of such a helper is below.
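
As a rough sketch of that last point (hypothetical helper name, simplified checks, Slurm identifiers written from memory - not actual Slurm code), a shared early-exit helper might look like:

/* Hypothetical helper consolidating the early-exit checks that
 * acct_policy_add_accrue_time(), acct_policy_remove_accrue_time(), and
 * acct_policy_handle_accrue_time() currently each do slightly differently.
 * Returns true if the caller should touch accrue counters at all.
 * Assumes the usual slurmctld headers (job_record, slurmctld_conf, ...). */
static bool _accrue_tracking_active(struct job_record *job_ptr)
{
	/* Nothing to do if limits aren't enforced at all */
	if (!(accounting_enforce & ACCOUNTING_ENFORCE_LIMITS))
		return false;
	/* With ACCRUE_ALWAYS we shouldn't add/remove accrue_cnt */
	if (slurmctld_conf.priority_flags & PRIORITY_FLAGS_ACCRUE_ALWAYS)
		return false;
	/* Only pending jobs with a details struct participate */
	if (!IS_JOB_PENDING(job_ptr) || !job_ptr->details)
		return false;
	return true;
}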


I'm currently working on documenting in detail the behavior as it is and how I think it's supposed to be. My current patch is still a correct thing to do, so it's still good for review, but I'll be keeping this bug open until I get all these issues fixed or determine that they aren't issues.
Comment 34 Marshall Garey 2019-06-11 17:13:00 MDT
Created attachment 10585 [details]
19.05 v3 patchset

Today in our support meeting I learned that after 18.08.8 is tagged, we aren't going to have any more 18.08 releases unless there are security or other serious bugs. Unless this makes it into 18.08.8, I assume it's going into 19.05. So, I've attached the 19.05 version of 18.08 v8. It includes all the commits from the 18.08 v8 patchset, with lines added to NEWS, plus one final patch that improves debugging.

So, here's an updated version of comment 30:

=======================================================================================
This patchset for 19.05 fixes the following things (with lines in NEWS added as appropriate):
=======================================================================================

* We don't properly set accrue_cnt on the qos of every partition in the job list.
* We don't prevent adding/removing accrue_cnt when the ACCRUE_ALWAYS flag is set for the functions acct_policy_add_accrue_time() and acct_policy_remove_accrue_time().
* With running jobs - if you start a job with ACCRUE_ALWAYS, then turn off ACCRUE_ALWAYS, we get underflow errors. acct_policy_handle_accrue_time() is calling _remove_accrue_time_internal() on the running jobs. This happens because when you start a job, its AccrueTime isn't erased. However, without ACCRUE_ALWAYS, we set JOB_ACCRUE_OVER and decrement the accrue_cnt.
* In both acct_policy_add_accrue_time() and acct_policy_handle_accrue_time(), we don't acquire the assoc_mgr locks before checking job_ptr->assoc. Change it to how we do it in acct_policy_remove_accrue_time(), or change acct_policy_remove_accrue_time().
* With dependent jobs, I get underflow errors when switching from ACCRUE_ALWAYS to not having it and running scontrol reconfigure. Reason: with ACCRUE_ALWAYS, the jobs get accrue_time set but are never counted in accrue_cnt. When I switch away, they still have accrue_time. add_accrue_time handles dependent jobs (doesn't add accrue), but remove_accrue_time doesn't, so we get underflow. Additionally, remove_accrue_time skips the removal if the job doesn't have details_ptr or accrue_time set, but add_accrue_time doesn't check those.
* We never update sched_update, so we re-read the priority_flags every time. This is not strictly a bug, but it does cost us in terms of performance, especially since we have to acquire a mutex. (See the sketch after this list.)
* In the patch handling the assoc_locks, I missed a few other spots in acct_policy.c with the same problem. I should fix this in my patchset and re-upload.
* I added an additional patch to improve debugging.
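
For the sched_update point above, the usual slurmctld idiom is to cache config-derived values and re-read them only when the configuration has actually changed, roughly like this (illustrative sketch; exact lock and field names written from memory):

/* Cache priority_flags and refresh it only when slurm.conf changes,
 * instead of taking the config lock on every call. */
static time_t sched_update = 0;
static uint16_t cached_priority_flags = 0;

static uint16_t _get_priority_flags(void)
{
	if (sched_update != slurmctld_conf.last_update) {
		slurm_ctl_conf_t *conf = slurm_conf_lock();
		cached_priority_flags = conf->priority_flags;
		sched_update = conf->last_update;
		slurm_conf_unlock();
	}
	return cached_priority_flags;
}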


=======================================================================================
I'd like some feedback here:
=======================================================================================

(In reply to Marshall Garey from comment #27)
> * Scontrol resetAccrueTime doesn't actually reset accrue
> Run these to see what it is before
> scontrol show job |grep -i accrue
> sprio (keep running until it gets recalculated - see prioritycalcperiod)
> Reset:
> scontrol update jobid=<jobid> resetAccrueTime
> What is it now?
> scontrol show job |grep -i accrue
> sprio (keep running until it gets recalculated - see prioritycalcperiod)
> It isn't reset. We keep resetting it to details_ptr->begin_time


This isn't a bug; I just didn't understand the whole picture for AccrueTime. Here it is, as I understand it:

* Initialized to 0 (in _create_job_record(), details is xmalloc'd, which initializes to zero)
* Set to 0 in acct_policy_remove_accrue_time().
* If not enforcing limits with AccountingStorageEnforce, set it to details->begin_time. Otherwise...
* Dependent/held jobs:
  - When turning on ACCRUE_ALWAYS, accrue time is set to the job's submission time.
  - When turning off ACCRUE_ALWAYS, their accrue time is set to 0.
  - When you scontrol release a job or clear its dependency, its AccrueTime is set to that moment.
* When scontrol update job <jobid> ResetAccrueTime, (Are these correct?)
  - ACCRUE_ALWAYS off:
    * its AccrueTime is set to that moment if an accrue limit (like GrpJobsAccrue) is in place, otherwise its AccrueTime is set to the job submit time.
  - ACCRUE_ALWAYS on: Nothing

* When an accrue limit is reached:
  - Pending jobs after the limit have AccrueTime=0
  - When a pending job starts, the next pending job to take its place gets AccrueTime set to that time
* Pending jobs that are not held or dependent: (Is this correct?)
  - When turning on/off ACCRUE_ALWAYS, accrue time doesn't change, even if the job's accrue time had previously been changed by some other method (scontrol update job <id> ResetAccrueTime, or a held job being released, or a dependency being fulfilled). But for new jobs submitted with this on, their AccrueTime is submit time. And, since ACCRUE_ALWAYS doesn't allow you to change AccrueTime with scontrol update ResetAccrueTime, they'll stay like that.

I believe this is all intended behavior, but I want to make sure for:
* Pending jobs that are not held or dependent: (Is this correct?)
* When scontrol update job <jobid> ResetAccrueTime, (Are these correct?)


=======================================================================================
Still need to do:
=======================================================================================

* Make a regression test. I have a lot of ideas for this. We can see the current accrue_cnt with scontrol show assoc (grep for accrue), set various limits with sacctmgr, do different tests depending on configuration. I really want to do this.
* (Need to check) Are there any other places that use set_qos_order (which implies just 2 qos) when they should actually be using a list of qos (up to one per partition plus one for the job)?


=======================================================================================
Don't need to do:
=======================================================================================

> * In acct_policy_handle_accrue_time(), we use time(NULL) and then a little
> later we set the variable now=time(NULL) and use now. We should just set now
> a little earlier and save a time(NULL) call. Again, not a bug, but a very
> minor performance issue.


It's not actually a performance issue, since when we call time(NULL) the first time, we will always goto endit.

> * We are inconsistent in how we handle the initial checks in
> acct_policy_add_accrue_time(), acct_policy_remove_accrue_time(), and
> acct_policy_handle_accrue_time(). This is causing some of these issues (such
> as the ACCRUE_ALWAYS issues and the assoc_mgr lock issue). It would be great
> to have a single function to handle all the init work (such as checking for
> ACCRUE_ALWAYS, enforcing limits, job pending, job_ptr->details != NULL,
> etc.) for these functions.


After my additional work since this comment, I don't know if a single function would be possible, since we handle the conditions somewhat differently in each one. Maybe there's some more improvement to be had?
Comment 37 Marshall Garey 2019-07-08 16:20:09 MDT
(In reply to Marshall Garey from comment #34)

The to-dos in that comment are taken care of by the following bugs, so I don't have anything left to do in this bug (besides fixing any issues found in the review process):


> * In the patch handling the assoc_locks, I missed a few other spots in
> acct_policy.c with the same problem. I should fix this in my patchset and
> re-upload.

Bug 5476.


> * Make a regression test. I have a lot of ideas for this. We can see the
> current accrue_cnt with scontrol show assoc (grep for accrue), set various
> limits with sacctmgr, do different tests depending on configuration. I
> really want to do this.

Bug 7376.



> * (Need to check) Are there any other places that use set_qos_order (which
> implies just 2 qos) when they should actually be using a list of qos (up to
> one per partition plus one for the job)?

Bug 7375.
Comment 40 Sven Sternberger 2019-09-09 10:26:02 MDT
After updating to 19.05.02 we also see these messages.
We also have another issue after updating (#7708);
maybe this is related?

It's always these 3 lines (111 times in 15 hours):
Sep  9 16:03:13 max-adm01 slurmctld[22664]: error: _remove_accrue_time_internal: QOS joblim accrue_cnt underflow
Sep  9 16:03:13 max-adm01 slurmctld[22664]: error: _remove_accrue_time_internal: QOS joblim acct ***** accrue_cnt underflow
Sep  9 16:03:13 max-adm01 slurmctld[22664]: error: _remove_accrue_time_internal: QOS joblim user ***** accrue_cnt underflow
Comment 41 Marshall Garey 2019-10-02 11:21:55 MDT
*** Ticket 7083 has been marked as a duplicate of this ticket. ***
Comment 42 Marshall Garey 2019-11-07 14:07:26 MST
(In reply to Sven Sternberger from comment #40)
> After updating to 19.05.02 we also see these messages.
> We have also some other issue after updating (#7708)
> maybe this is related?
> 
> Always like these 3 lines (111 times in 15h)
> Sep  9 16:03:13 max-adm01 slurmctld[22664]: error:
> _remove_accrue_time_internal: QOS joblim accrue_cnt underflow
> Sep  9 16:03:13 max-adm01 slurmctld[22664]: error:
> _remove_accrue_time_internal: QOS joblim acct ***** accrue_cnt underflow
> Sep  9 16:03:13 max-adm01 slurmctld[22664]: error:
> _remove_accrue_time_internal: QOS joblim user ***** accrue_cnt underflow

Hi Sven,

Sorry, I didn't see your message from September. These errors are generated because of a bug with submitting jobs to multiple partitions where the partitions have QOS's attached to them. However, this bug only matters if you use any accrue limits on QOS's. They are documented in our sacctmgr man page (https://slurm.schedmd.com/sacctmgr.html). They include:

GrpJobsAccrue
MaxJobsAccruePerAccount
MaxJobsAccruePerUser

Do you use any of these? If so, please let us know and I'll see if we can get the patches reviewed sooner. If not, feel free to ignore this error message.

- Marshall
Comment 59 Marshall Garey 2020-06-02 09:32:41 MDT
There are other places in the code with the same underlying bad assumption that there are at most two QOS - the job's QOS and one partition's QOS. I realized that and created bug 7375 to handle those other places. However, rather than tackle them individually, we've decided to tackle them all at once and track them in one place in bug 7375. The changes are extensive enough that we don't feel comfortable putting them in a 20.02 maintenance release, so we'd like to get these fixes into 20.11.

We're going to use the patches in this bug as a starting point and extend them to other places. I'm marking this bug as a duplicate of bug 7375, and I'll upload the patches as a private attachment to the other bug for us to use.


Closing as a duplicate of bug 7375.

*** This ticket has been marked as a duplicate of ticket 7375 ***