Hello, We have users in an account hitting the AssocGrpGRES limit despite their jobs not asking for any GRES (e.g. a GPU). I added myself to the account and submitted a very simple 1-core, 1 GB job. The job is accepted, and its pending reason in squeue changes from Priority to AssocGrpGRES. Do you have any thoughts? David
Hello, We are seeing this again today with a user and their array job:
```
[root@gl-build ~]# squeue -u <user>
         JOBID PARTITION     NAME     USER     ACCOUNT ST  TIME NODES NODELIST(REASON)
1618692_[0-27]  standard extractR   <user> <PIaccount> PD  0:00     1 (Priority,AssocGrpGRES)
```
David
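As the squeue output above shows, the NODELIST(REASON) column can carry several comma-separated pending reasons at once. A minimal sketch (parsing the sample line from this ticket, not a general squeue parser) of pulling those reasons apart:

```python
# Sample squeue line from this ticket; the parenthesized final field holds
# the pending reason(s), comma-separated when more than one applies.
line = ("1618692_[0-27]  standard extractR   <user> <PIaccount> PD "
        " 0:00     1 (Priority,AssocGrpGRES)")

# Take the text inside the final parentheses and split on commas.
reason_field = line.rsplit("(", 1)[1].rstrip(")")
reasons = reason_field.split(",")
print(reasons)  # ['Priority', 'AssocGrpGRES']
```

Seeing both Priority and AssocGrpGRES listed together on a job that requested no GRES is exactly the symptom described here.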
Hello, Another data point: I increased the priority of the job and it started running. So there is some odd overlap in conditions where jobs with a pending reason of Priority also list AssocGrpGRES as a reason. It's not consistent, though. This is also the second time we've seen jobs held with a reason of AssocGrpGRES when no GRES were requested. David
Okay, I'm looking into this for you. Please attach the slurmctld.log and the file produced by:
```
$ sacctmgr dump <cluster_name> umich.dump
```
The information about priority is important; without diving in yet, it appears that the reason is being erroneously set to AssocGrpGRES. I'm looking into reproducing this.
Created attachment 12242 [details] umich.dump
Thanks, Broderick. We've been chasing things with Priority lately but I think that's tied to high utilization. David
And please post the slurmctld.log and slurm.conf.
Created attachment 12251 [details] SlurmSchedLogFile
Created attachment 12252 [details] SlurmctldLogFile
Created attachment 12253 [details] slurm conf
Thanks for the information here. This ticket appears to be a duplicate of Bug 6814; Bug 8012 is also a duplicate. In summary, the sub-jobs of an array job have their Reason set to an incorrect value while they are pending. As stated in Bug 8012, a patch that resolves this issue is under review and QA right now. If you have a test cluster or environment, you would be welcome to test it with the jobs your users run. Otherwise, we are waiting on further internal testing and potentially some testing from the customer in Bug 8012. Assuming your issue is the same, the bug is only visual; scheduling is working correctly. If that does not seem to be the case for you, let me know.
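The triage logic described above can be sketched as a small hypothetical helper (not part of Slurm; the function name and signature are invented for illustration): if a job that requested no GRES shows AssocGrpGRES alongside Priority, it is likely the cosmetic array-job Reason bug rather than a real limit.

```python
def likely_cosmetic_reason(reasons, requested_gres):
    """Return True if an AssocGrpGRES reason is probably the visual
    artifact from Bug 6814 / Bug 8012: the job also shows Priority,
    and it requested no GRES at all."""
    return ("AssocGrpGRES" in reasons
            and "Priority" in reasons
            and not requested_gres)

# The job from this ticket: no GRES requested, both reasons listed.
print(likely_cosmetic_reason(["Priority", "AssocGrpGRES"], requested_gres=None))
# A job that actually asked for a GPU would warrant a real limit check.
print(likely_cosmetic_reason(["AssocGrpGRES"], requested_gres="gpu:1"))
```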
Thanks, Broderick! I can confirm the same observation: the scheduler is working correctly, and this appears to be only visual.
As the main ticket tracking this issue is now closed, I'm going to close this ticket as a duplicate now. The fix has been committed: commit ee3d4715f0071725a4 Thanks *** This ticket has been marked as a duplicate of ticket 6814 ***