Hello, We have users in an account hitting the AssocGrpGRES limit despite their jobs not asking for any GRES (e.g. a GPU). I added myself to the account and submitted a very simple 1-core, 1 GB job. The job is accepted, and its pending reason in squeue changes from Priority to AssocGrpGRES. Do you have any thoughts? David
Hello, We are seeing this again today with a user and their array job:
```
[root@gl-build ~]# squeue -u <user>
         JOBID PARTITION     NAME     USER     ACCOUNT ST  TIME NODES NODELIST(REASON)
1618692_[0-27]  standard extractR   <user> <PIaccount> PD  0:00     1 (Priority,AssocGrpGRES)
```
David
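As the squeue output above shows, the NODELIST(REASON) column can carry several comma-separated pending reasons at once. A minimal sketch (parsing the sample line from this ticket, not a general squeue parser) of pulling those reasons apart:

```python
# Sample squeue line from this ticket; the parenthesized final field holds
# the pending reason(s), comma-separated when more than one applies.
line = ("1618692_[0-27]  standard extractR   <user> <PIaccount> PD "
        " 0:00     1 (Priority,AssocGrpGRES)")

# Take the text inside the final parentheses and split on commas.
reason_field = line.rsplit("(", 1)[1].rstrip(")")
reasons = reason_field.split(",")
print(reasons)  # ['Priority', 'AssocGrpGRES']
```

Seeing both Priority and AssocGrpGRES listed together on a job that requested no GRES is exactly the symptom described here.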
Hello, Another data point: I increased the priority of the job and it started running. So there is some odd overlap in conditions where jobs with a pending reason of Priority also list AssocGrpGRES as a reason. It's not consistent, though. This is also the second time we've seen jobs held with a reason of AssocGrpGRES when no GRES were requested. David
Okay, I'm looking into this for you. Please attach the slurmctld.log and the file produced by:
```
$ sacctmgr dump <cluster_name> umich.dump
```
The information about priority is important; without diving in yet, it appears that the reason is being erroneously set to AssocGrpGRES. I'm looking into reproducing this.
Created attachment 12242 [details] umich.dump
Thanks, Broderick. We've been chasing things with Priority lately but I think that's tied to high utilization. David
And please post the slurmctld.log and slurm.conf.
Created attachment 12251 [details] SlurmSchedLogFile
Created attachment 12252 [details] SlurmctldLogFile
Created attachment 12253 [details] slurm conf
Thanks for the information here. This ticket appears to be a duplicate of Bug 6814; Bug 8012 is also a duplicate. In summary, the sub-jobs of an array job have their Reason set to an incorrect value while they are pending. As stated in Bug 8012, a patch that resolves this issue is under review and QA right now. If you have a test cluster or environment, you would be welcome to test it with the jobs your users run. Otherwise, we are waiting on further internal testing and potentially some testing from the customer in Bug 8012. Assuming your issue is the same, the bug is only visual; scheduling is working correctly. If that does not seem to be the case for you, let me know.
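The triage logic described above can be sketched as a small hypothetical helper (not part of Slurm; the function name and signature are invented for illustration): if a job that requested no GRES shows AssocGrpGRES alongside Priority, it is likely the cosmetic array-job Reason bug rather than a real limit.

```python
def likely_cosmetic_reason(reasons, requested_gres):
    """Return True if an AssocGrpGRES reason is probably the visual
    artifact from Bug 6814 / Bug 8012: the job also shows Priority,
    and it requested no GRES at all."""
    return ("AssocGrpGRES" in reasons
            and "Priority" in reasons
            and not requested_gres)

# The job from this ticket: no GRES requested, both reasons listed.
print(likely_cosmetic_reason(["Priority", "AssocGrpGRES"], requested_gres=None))
# A job that actually asked for a GPU would warrant a real limit check.
print(likely_cosmetic_reason(["AssocGrpGRES"], requested_gres="gpu:1"))
```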
Thanks, Broderick! I can confirm the same observation: the scheduler is working correctly, and this appears to be only visual.
As the main ticket tracking this issue is now closed, I'm going to close this ticket as a duplicate now. The fix has been committed: commit ee3d4715f0071725a4 Thanks *** This ticket has been marked as a duplicate of ticket 6814 ***