There are two issues here: 1. AccrueTime and related flags can be removed/reset by calling acct_policy_remove_accrue_time(), but there are places where instead of using this function we manually set job_ptr->details->accrue_time = 0; job_ptr->bit_flags &= ~JOB_ACCRUE_OVER; which is a subset of the stuff handled in acct_policy_remove_accrue_time(). So the first thing is study if any of these places we should be using the function instead of the manual reset. Some examples of this are: _job_fail_account _job_fail_qos batch_requeue_fini NOTE: the _job_fail* have a comment indicating we don't use the remove function on purpose. But the other spot can be discussed. 2. The second issue is if for all these places where we either reset the accrue manually or through the function, see if it also makes sense to do so when ACCRUE_ALWAYS is set and if not fix it by either checking the flag in the caller or perhaps inside if that is complete for all cases.
I'm starting to look at it. I haven't done any testing yet, but those places do look suspicious. First thoughts/notes: * acct_policy_remove_job_submit() will return without doing anything if AccountingStorageEnforce=limits. * acct_policy_remove_job_submit() will call acct_policy_handle_accrue_time(). * _job_fail_account() (if the job's assoc_ptr != NULL) and _job_fail_qos() (if the job's qos_ptr != NULL) both call acct_policy_remove_job_submit(). > NOTE: the _job_fail* have a comment indicating we don't use the remove function on purpose. But the other spot can be discussed. * They'll actually potentially be calling acct_policy_remove_accrue_time() because they * I don't think this correctly handles the case where QOS is not enforced by AccountingStorageEnforce. If QOS is not enforced, it's possible the job doesn't have a QOS but the partition does, in which case the partition QOS limits need to be handled, but we're skipping that part. This should be tested. * The reason I worry about this is that we can enforce limits without enforcing QOS. (See AccountingStorageEnforce in the slurm.conf man page.) However, if we have AccountingStorageEnforce=limits, then AccountingStorageEnforce=assoc is implied, so we don't need to worry about the case where associations aren't enforced. * When a job starts running, acct_policy_handle_accrue_time() is called for that job and the accrue time is cleared and accrue counts are decremented. In _job_fail_account() and _job_fail_qos(), we do job_ptr->details->accrue_time = 0; job_ptr->bit_flags &= ~JOB_ACCRUE_OVER; if the job is pending. Then, when acct_policy_handle_accrue_time() is called, because accrue_time == 0, it won't call _remove_accrue_time_internal(). It might eventually call _add_accrue_time_internal(). If it does, that seems wrong to me. I need to study acct_policy_handle_accrue_time(). * Even if acct_policy_handle_accrue_time() doesn't call _add_accrue_time_internal() because of certain variables that are set (like begin_time == 0, accrue_time == 0, job pending or not pending), the comment makes it seem like we don't want to call acct_policy_handle_accrue_time() at all. * I haven't studied where/how _job_fail_account, _job_fail_qos, batch_requeue_fini are called, so I should do that. Final thoughts: * This should all be tested with the patches in bug 6659 applied. There are several bugs fixed by the patches in 6659, so any fixes we do here (if needed) should probably be made on top of the ones in 6659. In particular, 6659 makes sure we properly handle ALL partition QOS, not just one of them. * The logic here is confusing. Even if there isn't a bug, I think it should be simplified if possible.
(In reply to Marshall Garey from comment #1) > I'm starting to look at it. I haven't done any testing yet, but those places > do look suspicious. First thoughts/notes: > > * acct_policy_remove_job_submit() will return without doing anything if > AccountingStorageEnforce=limits. > > * acct_policy_remove_job_submit() will call acct_policy_handle_accrue_time(). > > * _job_fail_account() (if the job's assoc_ptr != NULL) and _job_fail_qos() > (if the job's qos_ptr != NULL) both call acct_policy_remove_job_submit(). > > > NOTE: the _job_fail* have a comment indicating we don't use the remove function on purpose. But the other spot can be discussed. > * They'll actually potentially be calling acct_policy_remove_accrue_time() > because they because they call acct_policy_remove_job_submit(). (I forgot to finish that thought.)
I fixed this in bug 6659. The patches in that bug are still pending review, though. *** This ticket has been marked as a duplicate of ticket 6659 ***