| Summary: | Setting MaxWall user association has no effect on job duration | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | doug.parisek |
| Component: | Accounting | Assignee: | Marshall Garey <marshall> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 17.02.9 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Atos/Eviden Sites | Alineos Sites: | --- |
| Atos/Eviden Sites: | Internal | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | CLE Version: | ||
| Version Fixed: | Target Release: | --- | |
| DevPrio: | --- | Emory-Cloud Sites: | --- |

This has already been fixed in bug 4681. Marking as duplicate. Please reopen if it doesn't address your problem.

Specifically: https://github.com/SchedMD/slurm/commit/9143c7c964
and more work done here: https://github.com/SchedMD/slurm/commit/2ef56d4b96f93e0854

*** This ticket has been marked as a duplicate of ticket 4681 ***

Reproduced a reported problem. The problem was reported against version 16.05.5, but I had the same problem on 17.02.9, as follows:

I created a new account (test) and associated a user (dparisek) with that account. I first set MaxWall to 1 minute (and later MaxWallDurationPerJob to 1 min to see if that made a difference). I set AccountingStorageEnforce=associations,limits. User dparisek then ran a 2-minute sleep job, and the job kept running for the entire 2 mins.

========================================================================
sacctmgr modify user dparisek set MaxWallDurationPerJob=1
 Modified user associations...
  C = cluster5   A = test   U = dparisek
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

[trek0] (slurm) dhp> sacctmgr -s show user where user=dparisek format=user,maxw
      User     MaxWall
---------- -----------
  dparisek    00:01:00

[trek0] (slurm) dhp> scontrol show config | grep Accounting
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits

<< srun sleep 120& >> Ran entire 2 mins - maxwall not enforced!
========================================================================

Then I created a new QoS, associated that QoS with the user, and set MaxWall=1 min on that QoS. This DID work!

[trek0] (slurm) dhp> sacctmgr add qos qosA
 Adding QOS(s)
  qosa
 Settings
  Description    = qosa
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

[trek0] (slurm) dhp> sacctmgr modify user dparisek set qos=qosa
 Modified user associations...
  C = cluster5   A = test   U = dparisek
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y

[trek0] (slurm) dhp> sacctmgr -s show user where user=dparisek format=user,maxw,qos
      User     MaxWall                  QOS
---------- ----------- --------------------
  dparisek    00:01:00                 qosa

[trek0] (slurm) dhp> sacctmgr modify qos set maxwall=1 where user=dparisek

<< srun sleep 120& >> Maxwall was enforced - job was killed after 1 min
========================================================================

Question: Did I miss something in the first scenario, when I didn't have a QoS associated? Is associating a QoS the only way to enforce MaxWall (and maybe other limits)? If so, what is the point of allowing sacctmgr to set the limit without a QoS? Is there a bug here?

Thanks.
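
The first scenario above (association-level MaxWall, no QoS) can be sketched as a small script for re-testing after an upgrade. This is only a minimal sketch under assumptions: the `run` helper, the `repro_maxwall` function, and the `DRY_RUN`, `ACCOUNT`, and `USER_NAME` variables are hypothetical names introduced here, and `sacctmgr -i` is used to skip the interactive confirmation prompt. By default the script only prints the commands; it assumes it would be run with accounting-admin rights, and that the final `srun` would be launched as the test user.

```shell
#!/bin/sh
# Reproduction sketch for the association-level MaxWall scenario in this report.
# Assumptions (not from the ticket): helper names, DRY_RUN switch, use of -i.

DRY_RUN="${DRY_RUN:-1}"            # 1 = print commands only, 0 = execute them
ACCOUNT="${ACCOUNT:-test}"         # account name used in the report
USER_NAME="${USER_NAME:-dparisek}" # user name used in the report

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

repro_maxwall() {
    # Create the account and the user association.
    run sacctmgr -i add account "$ACCOUNT"
    run sacctmgr -i add user "$USER_NAME" account="$ACCOUNT"
    # Set a 1-minute wall-clock limit on the user association (no QoS involved).
    run sacctmgr -i modify user "$USER_NAME" set MaxWall=1
    # Confirm the limit is stored; the report shows 00:01:00 here.
    run sacctmgr -s show user where user="$USER_NAME" format=user,maxw
    # 2-minute job, to be run as the test user; the report expected it to be
    # stopped after 1 minute.
    run srun sleep 120
}

repro_maxwall
```

Running it as-is prints the five commands prefixed with `+ `; setting `DRY_RUN=0` executes them against the local cluster, so it should only be used on a test system.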