Ticket 7966

Summary: Jobs hitting AssocGrpGRES
Product: Slurm Reporter: ARC Admins <arc-slurm-admins>
Component: Accounting Assignee: Broderick Gardner <broderick>
Status: RESOLVED DUPLICATE QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 18.08.8   
Hardware: Linux   
OS: Linux   
Site: University of Michigan Slinky Site: ---
Alineos Sites: --- Atos/Eviden Sites: ---
Confidential Site: --- Coreweave sites: ---
Cray Sites: --- DS9 clusters: ---
Google sites: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: umich.dump
SlurmSchedLogFile
SlurmctldLogFile
slurm conf

Description ARC Admins 2019-10-21 12:43:53 MDT
Hello,

We have users in an account hitting the AssocGrpGRES limit despite their jobs not asking for a GRES (e.g. GPU). I added myself to the account and submitted a very simple 1-core, 1 GB job. The job is accepted, its pending reason in squeue is first Priority, and then the reason changes to AssocGrpGRES.
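For triage, a job's requested TRES string can be checked for any GRES before assuming the AssocGrpGRES reason is legitimate. A minimal sketch in Python, assuming TRES strings in the comma-separated form reported by `sacct -o ReqTRES`; the helper name is hypothetical:

```python
def requests_gres(tres_req: str) -> bool:
    """Return True if a TRES request string (e.g. from
    'sacct -X -o ReqTRES') asks for any GRES such as gres/gpu."""
    return any(item.strip().startswith("gres/")
               for item in tres_req.split(","))

# The 1-core, 1 GB test job from this ticket requests no GRES,
# so an AssocGrpGRES pending reason would be spurious for it.
print(requests_gres("billing=1,cpu=1,mem=1G,node=1"))    # False
print(requests_gres("cpu=4,gres/gpu=2,mem=16G,node=1"))  # True
```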


Do you have any thoughts?

David
Comment 2 ARC Admins 2019-11-06 12:39:50 MST
Hello,

We are seeing this again today with a user and their array job:

```
[root@gl-build ~]# squeue -u <user>
             JOBID PARTITION     NAME     USER  ACCOUNT ST       TIME  NODES NODELIST(REASON)
    1618692_[0-27]  standard extractR   <user> <PIaccount> PD       0:00      1 (Priority,AssocGrpGRES)
```
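The combined reason shown above can be pulled out of squeue output programmatically, which helps when scanning for affected array jobs. A minimal sketch, assuming the default squeue column layout seen in this ticket (the function name is hypothetical):

```python
import re

def pending_reasons(squeue_line: str) -> list[str]:
    """Extract the pending-reason list from the trailing
    NODELIST(REASON) column of a squeue output line."""
    m = re.search(r"\(([^)]*)\)\s*$", squeue_line)
    return m.group(1).split(",") if m else []

line = ("    1618692_[0-27]  standard extractR   <user> "
        "<PIaccount> PD       0:00      1 (Priority,AssocGrpGRES)")
print(pending_reasons(line))  # ['Priority', 'AssocGrpGRES']
```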

David
Comment 3 ARC Admins 2019-11-06 12:50:31 MST
Hello,

Another data point: I increased the priority of the job and it started running. So there is some weird overlap in conditions where jobs with a pending reason of Priority also list AssocGrpGRES as a reason. It's not consistent, though.

This is the second time where we've seen jobs held with a reason of AssocGrpGRES where no GRES were requested, though.

David
Comment 4 Broderick Gardner 2019-11-06 12:56:37 MST
Okay, I'm looking into this for you. Please attach the slurmctld.log and the file produced by:
$ sacctmgr dump <cluster_name> file=umich.dump

The information about priority is important; without diving in yet, it appears that the reason is erroneously set to AssocGrpGRES. I'm looking into reproducing this.
Comment 5 ARC Admins 2019-11-06 13:28:30 MST
Created attachment 12242 [details]
umich.dump
Comment 6 ARC Admins 2019-11-06 13:29:19 MST
Thanks, Broderick. We've been chasing things with Priority lately but I think that's tied to high utilization.

David
Comment 7 Broderick Gardner 2019-11-06 14:37:22 MST
And please post the slurmctld.log and slurm.conf.
Comment 8 ARC Admins 2019-11-07 07:53:26 MST
Created attachment 12251 [details]
SlurmSchedLogFile
Comment 9 ARC Admins 2019-11-07 07:54:07 MST
Created attachment 12252 [details]
SlurmctldLogFile
Comment 10 ARC Admins 2019-11-07 07:54:36 MST
Created attachment 12253 [details]
slurm conf
Comment 12 Broderick Gardner 2019-11-14 16:38:32 MST
Thanks for the information here. This ticket appears to be a duplicate of Bug 6814; Bug 8012 is also a duplicate. In summary, the sub-jobs of an array job have their Reason set to an incorrect value while they are pending.

As stated in Bug 8012, there is a patch under review and QA right now that resolves this issue. If you have a test cluster or environment, you would be welcome to test it with jobs your users run. Otherwise, we are waiting on further internal testing and potentially some testing from the customer in Bug 8012.

Assuming your issue is the same, the bug is only visual; the scheduling is working correctly. If that does not seem to be the case for you, let me know.
Comment 13 ARC Admins 2019-11-15 05:32:24 MST
Thanks, Broderick! We observe the same: the scheduler is working correctly and this appears to be only visual.
Comment 14 Broderick Gardner 2019-12-19 09:40:51 MST
As the main ticket tracking this issue is now closed, I'm going to close this ticket as a duplicate now. The fix has been committed:
commit ee3d4715f0071725a4

Thanks

*** This ticket has been marked as a duplicate of ticket 6814 ***