Ticket 7083 - Improper AccrueTime removals/resets when ACCRUE_ALWAYS is set
Summary: Improper AccrueTime removals/resets when ACCRUE_ALWAYS is set
Status: RESOLVED DUPLICATE of ticket 6659
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld (show other tickets)
Version: 18.08.7
Hardware: Linux Linux
: 3 - Medium Impact
Assignee: Alejandro Sanchez
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-05-22 06:14 MDT by Alejandro Sanchez
Modified: 2019-10-02 11:21 MDT (History)
1 user (show)

See Also:
Site: SchedMD
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Note You need to log in before you can comment on or make changes to this ticket.
Description Alejandro Sanchez 2019-05-22 06:14:33 MDT
There are two issues here:

1. AccrueTime and related flags can be removed/reset by calling acct_policy_remove_accrue_time(), but there are places where instead of using this function we manually set 

job_ptr->details->accrue_time = 0;
job_ptr->bit_flags &= ~JOB_ACCRUE_OVER;

which is a subset of the stuff handled in acct_policy_remove_accrue_time(). So the first thing is study if any of these places we should be using the function instead of the manual reset. Some examples of this are:

_job_fail_account
_job_fail_qos
batch_requeue_fini

NOTE: the _job_fail* have a comment indicating we don't use the remove function on purpose. But the other spot can be discussed.

2. The second issue is if for all these places where we either reset the accrue manually or through the function, see if it also makes sense to do so when ACCRUE_ALWAYS is set and if not fix it by either checking the flag in the caller or perhaps inside if that is complete for all cases.
Comment 1 Marshall Garey 2019-07-19 10:08:38 MDT
I'm starting to look at it. I haven't done any testing yet, but those places do look suspicious. First thoughts/notes:

* acct_policy_remove_job_submit() will return without doing anything if AccountingStorageEnforce=limits.

* acct_policy_remove_job_submit() will call acct_policy_handle_accrue_time().

* _job_fail_account() (if the job's assoc_ptr != NULL) and _job_fail_qos() (if the job's qos_ptr != NULL) both call acct_policy_remove_job_submit().

> NOTE: the _job_fail* have a comment indicating we don't use the remove function on purpose. But the other spot can be discussed.
  * They'll actually potentially be calling acct_policy_remove_accrue_time() because they 


  * I don't think this correctly handles the case where QOS is not enforced by AccountingStorageEnforce. If QOS is not enforced, it's possible the job doesn't have a QOS but the partition does, in which case the partition QOS limits need to be handled, but we're skipping that part. This should be tested.
  * The reason I worry about this is that we can enforce limits without enforcing QOS. (See AccountingStorageEnforce in the slurm.conf man page.) However, if we have AccountingStorageEnforce=limits, then AccountingStorageEnforce=assoc is implied, so we don't need to worry about the case where associations aren't enforced.

* When a job starts running, acct_policy_handle_accrue_time() is called for that job and the accrue time is cleared and accrue counts are decremented. In _job_fail_account() and _job_fail_qos(), we do

job_ptr->details->accrue_time = 0;
job_ptr->bit_flags &= ~JOB_ACCRUE_OVER;

if the job is pending. Then, when acct_policy_handle_accrue_time() is called, because accrue_time == 0, it won't call _remove_accrue_time_internal(). It might eventually call _add_accrue_time_internal(). If it does, that seems wrong to me. I need to study acct_policy_handle_accrue_time().

  * Even if acct_policy_handle_accrue_time() doesn't call _add_accrue_time_internal() because of certain variables that are set (like begin_time == 0, accrue_time == 0, job pending or not pending), the comment makes it seem like we don't want to call acct_policy_handle_accrue_time() at all.

* I haven't studied where/how _job_fail_account, _job_fail_qos, batch_requeue_fini are called, so I should do that.



Final thoughts:

* This should all be tested with the patches in bug 6659 applied. There are several bugs fixed by the patches in 6659, so any fixes we do here (if needed) should probably be made on top of the ones in 6659. In particular, 6659 makes sure we properly handle ALL partition QOS, not just one of them.

* The logic here is confusing. Even if there isn't a bug, I think it should be simplified if possible.
Comment 2 Marshall Garey 2019-07-19 10:09:36 MDT
(In reply to Marshall Garey from comment #1)
> I'm starting to look at it. I haven't done any testing yet, but those places
> do look suspicious. First thoughts/notes:
> 
> * acct_policy_remove_job_submit() will return without doing anything if
> AccountingStorageEnforce=limits.
> 
> * acct_policy_remove_job_submit() will call acct_policy_handle_accrue_time().
> 
> * _job_fail_account() (if the job's assoc_ptr != NULL) and _job_fail_qos()
> (if the job's qos_ptr != NULL) both call acct_policy_remove_job_submit().
> 
> > NOTE: the _job_fail* have a comment indicating we don't use the remove function on purpose. But the other spot can be discussed.
>   * They'll actually potentially be calling acct_policy_remove_accrue_time()
> because they 

because they call acct_policy_remove_job_submit().

(I forgot to finish that thought.)
Comment 3 Marshall Garey 2019-10-02 11:21:54 MDT
I fixed this in bug 6659. The patches in that bug are still pending review, though.

*** This ticket has been marked as a duplicate of ticket 6659 ***