Summary: | Improper AccrueTime removals/resets when ACCRUE_ALWAYS is set | ||
---|---|---|---|
Product: | Slurm | Reporter: | Alejandro Sanchez <alex> |
Component: | slurmctld | Assignee: | Alejandro Sanchez <alex> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | marshall |
Version: | 18.08.7 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6659 | ||
Site: | SchedMD | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | HPCnow Sites: | --- |
HPE Sites: | --- | IBM Sites: | --- |
NOAA SIte: | --- | NoveTech Sites: | --- |
Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
Recursion Pharma Sites: | --- | SFW Sites: | --- |
SNIC sites: | --- | Tzag Elita Sites: | --- |
Linux Distro: | --- | Machine Name: | |
CLE Version: | Version Fixed: | ||
Target Release: | --- | DevPrio: | --- |
Emory-Cloud Sites: | --- |
Description
Alejandro Sanchez
2019-05-22 06:14:33 MDT
I'm starting to look at it. I haven't done any testing yet, but those places do look suspicious. First thoughts/notes: * acct_policy_remove_job_submit() will return without doing anything if AccountingStorageEnforce=limits. * acct_policy_remove_job_submit() will call acct_policy_handle_accrue_time(). * _job_fail_account() (if the job's assoc_ptr != NULL) and _job_fail_qos() (if the job's qos_ptr != NULL) both call acct_policy_remove_job_submit(). > NOTE: the _job_fail* have a comment indicating we don't use the remove function on purpose. But the other spot can be discussed. * They'll actually potentially be calling acct_policy_remove_accrue_time() because they * I don't think this correctly handles the case where QOS is not enforced by AccountingStorageEnforce. If QOS is not enforced, it's possible the job doesn't have a QOS but the partition does, in which case the partition QOS limits need to be handled, but we're skipping that part. This should be tested. * The reason I worry about this is that we can enforce limits without enforcing QOS. (See AccountingStorageEnforce in the slurm.conf man page.) However, if we have AccountingStorageEnforce=limits, then AccountingStorageEnforce=assoc is implied, so we don't need to worry about the case where associations aren't enforced. * When a job starts running, acct_policy_handle_accrue_time() is called for that job and the accrue time is cleared and accrue counts are decremented. In _job_fail_account() and _job_fail_qos(), we do job_ptr->details->accrue_time = 0; job_ptr->bit_flags &= ~JOB_ACCRUE_OVER; if the job is pending. Then, when acct_policy_handle_accrue_time() is called, because accrue_time == 0, it won't call _remove_accrue_time_internal(). It might eventually call _add_accrue_time_internal(). If it does, that seems wrong to me. I need to study acct_policy_handle_accrue_time(). * Even if acct_policy_handle_accrue_time() doesn't call _add_accrue_time_internal() because of certain variables that are set (like begin_time == 0, accrue_time == 0, job pending or not pending), the comment makes it seem like we don't want to call acct_policy_handle_accrue_time() at all. * I haven't studied where/how _job_fail_account, _job_fail_qos, batch_requeue_fini are called, so I should do that. Final thoughts: * This should all be tested with the patches in bug 6659 applied. There are several bugs fixed by the patches in 6659, so any fixes we do here (if needed) should probably be made on top of the ones in 6659. In particular, 6659 makes sure we properly handle ALL partition QOS, not just one of them. * The logic here is confusing. Even if there isn't a bug, I think it should be simplified if possible. (In reply to Marshall Garey from comment #1) > I'm starting to look at it. I haven't done any testing yet, but those places > do look suspicious. First thoughts/notes: > > * acct_policy_remove_job_submit() will return without doing anything if > AccountingStorageEnforce=limits. > > * acct_policy_remove_job_submit() will call acct_policy_handle_accrue_time(). > > * _job_fail_account() (if the job's assoc_ptr != NULL) and _job_fail_qos() > (if the job's qos_ptr != NULL) both call acct_policy_remove_job_submit(). > > > NOTE: the _job_fail* have a comment indicating we don't use the remove function on purpose. But the other spot can be discussed. > * They'll actually potentially be calling acct_policy_remove_accrue_time() > because they because they call acct_policy_remove_job_submit(). (I forgot to finish that thought.) I fixed this in bug 6659. The patches in that bug are still pending review, though. *** This ticket has been marked as a duplicate of ticket 6659 *** |