Description
Marshall Garey
2019-03-07 14:09:45 MST
Created attachment 10332 [details]
configuration to reproduce
This job is sufficient to reproduce this issue:
srun -pdebug,debug2 whereami
Partition debug2 has a partition QOS called partqos1, but that QOS has no limits or anything else special set.
For the record, yesterday I also managed to trigger this by having active jobs and switching from not having ACCRUE_ALWAYS to having ACCRUE_ALWAYS and reconfiguring. https://bugs.schedmd.com/show_bug.cgi?id=7083

I was also seeing calls to acct_policy_handle_accrue_time() for job arrays; if I printed %pJ at the beginning of the function, jobs were printed in the form <jobid>_* instead of with an actual task id. https://bugs.schedmd.com/show_bug.cgi?id=7102

(In reply to Alejandro Sanchez from comment #11)
> > For the record yesterday I also managed to trigger this by having active
> > jobs and switching from not having ACCRUE_ALWAYS to having ACCRUE_ALWAYS and
> > reconfiguring.
> >
> > https://bugs.schedmd.com/show_bug.cgi?id=7083
>
> I haven't verified that my patch fixes this yet.
Thanks for telling me to look for something with ACCRUE_ALWAYS, Alex. I couldn't reproduce it by switching the flag on and off, but I did find this:

I've now found more underflow errors with dependent jobs when using the ACCRUE_ALWAYS flag. These errors happen both with and without my patch, so my patch didn't cause them, but it doesn't fix them yet either. The patch is still good for review, but I'll also be working on this.
* PriorityFlags=ACCRUE_ALWAYS, DebugFlags=accrue
* Submit a job to fill the cluster
marshall@voyager:~/slurm/19.05/voyager$ srun -N13 --exclusive sleep 7890&
* Submit a job to multiple partitions, where at least one has a partition qos:
marshall@voyager:~/slurm/19.05/voyager$ srun -pall,all2 whereami &
* I don't see log messages about this job being added, which also seems like a bug.
marshall@voyager:~/slurm/19.05/voyager$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
9636 all,all2 whereami marshall PD 0:00 1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
9635 debug sleep marshall R 6:29 13 v[1-13]
* Submit a job that depends on a previous job, without any "after" tag. It doesn't seem to matter which partition it is submitted to, one or multiple, or whether those partitions have QOS's.

If I omit the "after" tag in --depend, the errors happen whether the job depends on the running job or on the pending job. If I include the "after" tag, it only errors when depending on the pending job. It only errors for jobs whose reason is "depend," so there's some subtle difference between including and excluding "after" in --depend that I don't understand. Regardless, it's a bug.
marshall@voyager:~/slurm/19.05/voyager$ srun --depend=after:9635 -pdebug whereami &
< No errors in slurmctld >
marshall@voyager:~/slurm/19.05/voyager$ srun --depend=after:9636 -pdebug whereami &
< Accrue cnt underflow errors >
marshall@voyager:~/slurm/19.05/voyager$ srun --depend=9635 -pdebug whereami &
< Accrue cnt underflow errors >
While my currently pending patch is correct to the best of my knowledge, there are a number of other issues that I've found.

* We don't properly set accrue_cnt on the QOS of every partition in the job's partition list. Fixed by my currently pending patch.

* scontrol ResetAccrueTime doesn't actually reset accrue. Run these to see what it is beforehand:
    scontrol show job | grep -i accrue
    sprio (keep running until it gets recalculated - see PriorityCalcPeriod)
  Reset it:
    scontrol update jobid=<jobid> resetAccrueTime
  What is it now?
    scontrol show job | grep -i accrue
    sprio (keep running until it gets recalculated - see PriorityCalcPeriod)
  It isn't reset - we keep setting it back to details_ptr->begin_time.

* We don't prevent adding/removing accrue_cnt when the ACCRUE_ALWAYS flag is set in acct_policy_add_accrue_time() and acct_policy_remove_accrue_time().

* In both acct_policy_add_accrue_time() and acct_policy_handle_accrue_time(), we don't acquire the assoc_mgr locks before checking job_ptr->assoc. Change them to match acct_policy_remove_accrue_time(), or change acct_policy_remove_accrue_time().

* With dependent jobs, I get underflow errors when switching from ACCRUE_ALWAYS to not having it and running scontrol reconfigure. Reason: with ACCRUE_ALWAYS, the jobs get accrue_time set but never accumulate accrue_cnt. When I switch the flag off, they still have accrue_time. In add_accrue_time we handle dependent jobs (don't add accrue), but in remove_accrue_time we don't, so we get underflow. Additionally, remove_accrue_time skips jobs that don't have details_ptr or accrue_time set, but add_accrue_time doesn't check either one. This is potentially multiple bugs.

* We never update sched_update, so we reacquire the priority_flags every time. Not strictly a bug, but it costs us in performance, especially since we have to take the lock on the slurmctld configuration.
* In acct_policy_handle_accrue_time(), we call time(NULL), and then a little later set the variable now = time(NULL) and use now. We should just set now a little earlier and save a time(NULL) call. Again, not a bug, but a very minor performance issue.

* (Need to check) Are there any other places that use set_qos_order() (which implies just two QOS) when they should actually be using a list of QOS (up to one per partition, plus one for the job)?

* We are inconsistent in how we handle the initial checks in acct_policy_add_accrue_time(), acct_policy_remove_accrue_time(), and acct_policy_handle_accrue_time(). This is causing some of these issues (such as the ACCRUE_ALWAYS issues and the assoc_mgr lock issue). It would be great to have a single function that handles all the init work for these functions (checking for ACCRUE_ALWAYS, enforcing limits, job pending, job_ptr->details != NULL, etc.).

I'm currently working on documenting in detail the behavior as it is and how I think it's supposed to be. My current patch is still correct, so it's still good for review, but I'll keep this bug open until I get all these issues fixed or determine that they aren't issues.

Created attachment 10585 [details]
19.05 v3 patchset

Today in our support meeting I learned that after 18.08.8 is tagged, there won't be any more 18.08 releases unless there are security or other serious bugs. Unless this makes it into 18.08.8, I assume it's going into 19.05. So I've attached the 19.05 version of the 18.08 v8 patchset. It includes all the 18.08 commits, with lines added to NEWS, plus one final patch that improves debugging.
So, here's an updated version of comment 30:

=======================================================================================
This patchset for 19.05 fixes the following (with lines added to NEWS as appropriate):
=======================================================================================

* We don't properly set accrue_cnt on the QOS of every partition in the job's partition list.

* We don't prevent adding/removing accrue_cnt when the ACCRUE_ALWAYS flag is set in acct_policy_add_accrue_time() and acct_policy_remove_accrue_time().

* With running jobs: if you start a job with ACCRUE_ALWAYS and then turn ACCRUE_ALWAYS off, we get underflow errors because acct_policy_handle_accrue_time() calls _remove_accrue_time_internal() on the running jobs. This happens because a job's AccrueTime isn't erased when it starts; without ACCRUE_ALWAYS, we then set JOB_ACCRUE_OVER and decrement accrue_cnt.

* In both acct_policy_add_accrue_time() and acct_policy_handle_accrue_time(), we don't acquire the assoc_mgr locks before checking job_ptr->assoc. Change them to match acct_policy_remove_accrue_time(), or change acct_policy_remove_accrue_time().

* With dependent jobs, I get underflow errors when switching from ACCRUE_ALWAYS to not having it and running scontrol reconfigure. Reason: with ACCRUE_ALWAYS, the jobs get accrue_time set but never accumulate accrue_cnt. When I switch the flag off, they still have accrue_time. In add_accrue_time we handle dependent jobs (don't add accrue), but in remove_accrue_time we don't, so we get underflow. Additionally, remove_accrue_time skips jobs that don't have details_ptr or accrue_time set, but add_accrue_time doesn't check either one.

* We never update sched_update, so we reacquire the priority_flags every time. Not strictly a bug, but it costs us in performance, especially since we have to acquire a mutex.
* In the patch handling the assoc locks, I missed a few other spots in acct_policy.c with the same problem. I should fix this in my patchset and re-upload.

* I added an additional patch to improve debugging.

=======================================================================================
I'd like some feedback here:
=======================================================================================

(In reply to Marshall Garey from comment #27)
> * Scontrol resetAccrueTime doesn't actually reset accrue
> Run these to see what it is before
> scontrol show job |grep -i accrue
> sprio (keep running until it gets recalculated - see PriorityCalcPeriod)
> Reset:
> scontrol update jobid=<jobid> resetAccrueTime
> What is it now?
> scontrol show job |grep -i accrue
> sprio (keep running until it gets recalculated - see PriorityCalcPeriod)
> It isn't reset. We keep resetting it to details_ptr->begin_time

This isn't a bug; I just didn't understand the whole picture for AccrueTime. Here it is, as I understand it:

* Initialized to 0 (in _create_job_record(), details is xmalloc'd, which zero-initializes it).
* Set to 0 in acct_policy_remove_accrue_time().
* If not enforcing limits with AccountingStorageEnforce, it is set to details->begin_time. Otherwise...
* Dependent/held jobs:
  - When turning on ACCRUE_ALWAYS, their accrue time is set to the job submission time.
  - When turning off ACCRUE_ALWAYS, their accrue time is set to 0.
  - When you scontrol release a job or clear its dependency, its AccrueTime is set to that moment.
* With scontrol update job <jobid> ResetAccrueTime (are these correct?):
  - ACCRUE_ALWAYS off: AccrueTime is set to that moment if an accrue limit (like GrpJobsAccrue) is in place; otherwise AccrueTime is set to the job submit time.
  - ACCRUE_ALWAYS on: nothing happens.
* When an accrue limit is reached:
  - Pending jobs past the limit have AccrueTime=0.
  - When a pending job starts, the next pending job to take its place gets AccrueTime set to that time.
* Pending jobs that are not held or dependent (is this correct?):
  - Turning ACCRUE_ALWAYS on or off doesn't change their accrue time, even if the job's accrue time had previously been changed some other way (scontrol update job <id> ResetAccrueTime, a held job being released, or a dependency being fulfilled). But new jobs submitted with the flag on get their submit time as AccrueTime, and since ACCRUE_ALWAYS doesn't allow changing AccrueTime with scontrol update ResetAccrueTime, they'll stay like that.

I believe this is all intended behavior, but I want to make sure about:
* Pending jobs that are not held or dependent (is this correct?)
* scontrol update job <jobid> ResetAccrueTime (are these correct?)

=======================================================================================
Still need to do:
=======================================================================================

* Make a regression test. I have a lot of ideas for this. We can see the current accrue_cnt with scontrol show assoc (grep for accrue), set various limits with sacctmgr, and run different tests depending on configuration. I really want to do this.

* (Need to check) Are there any other places that use set_qos_order() (which implies just two QOS) when they should actually be using a list of QOS (up to one per partition, plus one for the job)?

=======================================================================================
Don't need to do:
=======================================================================================

> * In acct_policy_handle_accrue_time(), we use time(NULL) and then a little
> later we set the variable now=time(NULL) and use now. We should just set now
> a little earlier and save a time(NULL) call.
> Again, not a bug, but a very
> minor performance issue.

It's not actually a performance issue: when we call time(NULL) the first time, we will always goto endit.

> * We are inconsistent in how we handle the initial checks in
> acct_policy_add_accrue_time(), acct_policy_remove_accrue_time(), and
> acct_policy_handle_accrue_time(). This is causing some of these issues (such
> as the ACCRUE_ALWAYS issues and the assoc_mgr lock issue). It would be great
> to have a single function to handle all the init work (such as checking for
> ACCRUE_ALWAYS, enforcing limits, job pending, job_ptr->details != NULL,
> etc.) for these functions.

After my additional work since that comment, I don't know if a single function is possible, since each of these functions handles the conditions somewhat differently. Maybe there's still some improvement to be had.

(In reply to Marshall Garey from comment #34)

The todos in that comment are taken care of by the following bugs, so I don't have anything left to do in this bug (besides fixing any issues found during review):

> * In the patch handling the assoc_locks, I missed a few other spots in
> acct_policy.c with the same problem. I should fix this in my patchset and
> re-upload.
Bug 5476.

> * Make a regression test. I have a lot of ideas for this. We can see the
> current accrue_cnt with scontrol show assoc (grep for accrue), set various
> limits with sacctmgr, do different tests depending on configuration. I
> really want to do this.
Bug 7376.

> * (Need to check) Are there any other places that use set_qos_order (which
> implies just 2 qos) when they should actually be using a list of qos (up to
> one per partition plus one for the job)?
Bug 7375.

After updating to 19.05.02 we also see these messages. We also have some other issue after updating (#7708) - maybe this is related?
Always like these 3 lines (111 times in 15h):

Sep 9 16:03:13 max-adm01 slurmctld[22664]: error: _remove_accrue_time_internal: QOS joblim accrue_cnt underflow
Sep 9 16:03:13 max-adm01 slurmctld[22664]: error: _remove_accrue_time_internal: QOS joblim acct ***** accrue_cnt underflow
Sep 9 16:03:13 max-adm01 slurmctld[22664]: error: _remove_accrue_time_internal: QOS joblim user ***** accrue_cnt underflow

*** Ticket 7083 has been marked as a duplicate of this ticket. ***

(In reply to Sven Sternberger from comment #40)
> After updating to 19.05.02 we also see these messages.
> We have also some other issue after updating (#7708)
> maybe this is related?
>
> Always like these 3 lines (111 times in 15h)
> Sep 9 16:03:13 max-adm01 slurmctld[22664]: error:
> _remove_accrue_time_internal: QOS joblim accrue_cnt underflow
> Sep 9 16:03:13 max-adm01 slurmctld[22664]: error:
> _remove_accrue_time_internal: QOS joblim acct ***** accrue_cnt underflow
> Sep 9 16:03:13 max-adm01 slurmctld[22664]: error:
> _remove_accrue_time_internal: QOS joblim user ***** accrue_cnt underflow

Hi Sven,

Sorry, I didn't see your message from September. These errors are generated because of a bug with submitting jobs to multiple partitions where the partitions have QOS's attached to them. However, this bug only matters if you use any accrue limits on QOS's. They are documented in our sacctmgr man page (https://slurm.schedmd.com/sacctmgr.html) and include:

GrpJobsAccrue
MaxJobsAccruePerAccount
MaxJobsAccruePerUser

Do you use any of these? If you do, please let us know and I'll see if we can get the patches reviewed sooner. If not, feel free to ignore this error message.

- Marshall

There are other places in the code with the same underlying bad assumption that there are only up to two QOS - the job's QOS and one partition's QOS. I realized that and created bug 7375 to handle those other places.
However, rather than tackle them individually, we've decided to tackle them all at once and track them in one place, in bug 7375. The changes are extensive enough that we don't feel comfortable putting them in a 20.02 maintenance release, so we'd like to get these fixes into 20.11. We're going to use the patches in this bug as a starting point and extend them to the other places. I'm marking this bug as a duplicate of bug 7375, and I'll upload the patches as a private attachment to the other bug for us to use.

Closing as a duplicate of bug 7375.

*** This ticket has been marked as a duplicate of ticket 7375 ***