| Summary: | sreport command not working with future dates and End=now | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | ARC Admins <arc-slurm-admins> |
| Component: | User Commands | Assignee: | Albert Gil <albert.gil> |
| Status: | OPEN --- | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | albert.gil, laurent.catherine |
| Version: | 21.08.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of Michigan | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | 2023-01-18 slurmdbd log | ||
|
Description
ARC Admins
2022-03-24 06:47:38 MDT
Hi David,
Can you confirm if you have runaway jobs, or if you had and fixed them before getting this error?
Regards,
Albert
(In reply to Albert Gil from comment #3)
Albert,
> Can you confirm if you have runaway jobs, or if you had and fixed them
> before getting this error?
I personally know of no runaway jobs that were fixed recently, and none show now. That said, is there a way to query for jobs that were runaway and fixed?
David
David,
> I personally know of no runaway jobs that were fixed recently, and none show
> now.
Ok, good to know.
> That said, is there a way to query for jobs that were runaway and fixed?
That's a good question. There is no "official" way, i.e. no command or option. Maybe we could tell from inspecting certain values in the DB, but right now this is not that important.
We are working on a similar issue in other bugs and runaways seem to be related, so I asked you for that information. But I can keep investigating anyway.
I'll let you know about my findings.
Regards,
Albert
Hi David,
> Per Albert in bug 13400 I am moving the contents of comment #24 from there
> to here as a fork:
>
> ```
> date; sreport -nP -T billing cluster AccountUtilizationByUser
> Start=2022-03-01T00:00:00 End=now -M greatlakes format=account,login,used |
> awk -F "|" '{ if ($2=="") print $1","$3 }'
> Wed Mar 16 14:36:48 EDT 2022
> sreport: error: Getting response to message type: DBD_GET_ASSOCS
> sreport: error: DBD_GET_ASSOCS failure: No error
> slurmdb_report_cluster_account_by_user: Problem with get query.
> ```
>
> I have tried using the aforementioned with `End=Now` after
> `End=2022-04-01T00:00:00` started exhibiting problems.
>
> This command worked in the past iterations when the script ran, but at some
> point it started producing this. Any thoughts?
You have reopened bug 13400 and I've also looked into this. I've realized that you already faced this kind of error in bug 10907.
Reviewing it again, it seems clear that the SQL backend is returning some error to the slurmdbd query requesting the associations.
Are you still able to reproduce it?
Regards,
Albert
Hi Albert,
David is out for a couple of days, but I'm happy to help until he returns. We were able to reproduce Slurm starting a job which exceeded a limit; were you asking if we could reproduce that, or the "DBD_GET_ASSOCS failure: No error"?
Thanks,
- Matt
Hi Matt,
> We were able to reproduce Slurm starting a job which exceeded a limit; were
> you asking if we could reproduce that, or the "DBD_GET_ASSOCS failure: No
> error" ?
In this ticket we're focused on the "DBD_GET_ASSOCS failure: No error". Let's track the jobs exceeding the limit in bug 13400.
Sorry for the confusion.
Thanks,
Albert
Albert,
Yes, the ticket I reopened was 13400 for the jobs exceeding the limit. Let's focus there.
David
Hi David,
> Yes, the ticket I reopened was 13400 for the jobs exceeding the limit. Let's
> focus there.
Thanks!
So, are you still able to reproduce the "DBD_GET_ASSOCS failure: No error"?
Regards,
Albert
Albert,
Yes, I am actually. But that is tangentially related.
David
Hi David,
> So, are you still able to reproduce the "DBD_GET_ASSOCS failure: No error"?
Now that I think we have the issue related to usage and limits more under control in other tickets, I would also like to handle this (old) one.
Do you have a command that reproduces it?
If you do, could you enable these options in slurmdbd.conf, reproduce the issue, and send back the slurmdbd logs?
DebugLevel = debug2
DebugFlags=DB_ASSOC,DB_QUERY
Thanks,
Albert
Hi David,
Now that we closed bug 13400, I would like to work on this one too.
> > So, are you still able to reproduce the "DBD_GET_ASSOCS failure: No error"?
>
> Now that I think we have the issue related to usage and limits more
> under control in other tickets, I would also like to handle this (old) one.
>
> Do you have a command that reproduces it?
> If you do, could you enable these options in slurmdbd.conf, reproduce the
> issue, and send back the slurmdbd logs?
>
> DebugLevel = debug2
> DebugFlags=DB_ASSOC,DB_QUERY
Could you try it?
Thanks,
Albert
Created attachment 28494
2023-01-18 slurmdbd log
Hi Albert,
> Now that we closed bug 13400, I would like to work on this one too.
>
> > > So, are you still able to reproduce the "DBD_GET_ASSOCS failure: No error"?
> >
> > Now that I think we have the issue related to usage and limits more
> > under control in other tickets, I would also like to handle this (old) one.
> >
> > Do you have a command that reproduces it?
> > If you do, could you enable these options in slurmdbd.conf, reproduce the
> > issue, and send back the slurmdbd logs?
> >
> > DebugLevel = debug2
> > DebugFlags=DB_ASSOC,DB_QUERY
>
> Could you try it?
I've done the needful and attached the log file. Here is the command I ran today with debug2 and the aforementioned DebugFlags:
```
[root@slurmdbd ~]# date; sreport -nP -T billing cluster AccountUtilizationByUser Start=2022-12-01T00:00:00 End=now -M greatlakes format=account,login,used | awk -F "|" '{ if ($2=="") print $1","$3 }'
Wed Jan 18 11:22:13 EST 2023
sreport: error: Getting response to message type: DBD_GET_ASSOCS
sreport: error: DBD_GET_ASSOCS failure: No error
slurmdb_report_cluster_account_by_user: Problem with get query.
```
David
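Editorial note: the `awk` stage in the command above keeps only the rows whose second field (login) is empty and prints them as `account,used`. The sketch below runs that same filter on made-up input so the effect can be seen without a Slurm database. The sample rows, the account and user names, and the helper name `filter_account_totals` are hypothetical, and the assumption that account-level rows carry an empty login field is inferred from the filter itself rather than confirmed in this ticket.

```shell
# Hypothetical sample shaped like `sreport -nP ... format=account,login,used`
# output: pipe-delimited, no header. The values are invented for illustration.
sample_output='root||1200
acct_a||800
acct_a|alice|500
acct_a|bob|300
acct_b||400
acct_b|carol|400'

# Same awk filter as in the command above: keep rows whose second field
# (login) is empty and print "account,used".
filter_account_totals() {
    awk -F "|" '{ if ($2=="") print $1","$3 }'
}

result=$(printf '%s\n' "$sample_output" | filter_account_totals)
printf '%s\n' "$result"
```

With this input, only the three rows that have no login survive the filter; the per-user rows are dropped.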
Hi David,
Sorry for the late response. I'll be working on this again to see what I can find.
It is still happening, right?
Thanks,
Albert
Hi David,
I cannot reproduce the issue.
If this is ok for you, I'm closing this ticket. If you still want to pursue this, please reopen it and update the version you are using.
Regards,
Albert
Hi, Albert,
Unfortunately, we are still experiencing this issue:
[root@gl-build ~]# date; sreport -nP -T billing cluster AccountUtilizationByUser Start=2024-02-01T00:00:00 End=2024-03-01T00:00:00 -M greatlakes format=account,login,used | awk -F "|" '{ if ($2=="") print $1","$3 }'
Tue Feb 13 09:11:50 EST 2024
sreport: error: Getting response to message type: DBD_GET_ASSOCS
sreport: error: DBD_GET_ASSOCS failure: No error
slurmdb_report_cluster_account_by_user: Problem with get query.
I have changed the End= parameter to "now", and it fails in the same way, too.
We are on slurm version 23.02.6 now.
David
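Editorial note: because the report command feeds a pipeline, a script that runs it sees `awk`'s exit status, not the report command's, so a failed query can look like an empty but successful result. A minimal sketch of one way to guard against that follows. It assumes the report command exits non-zero when it prints this error, which this ticket does not confirm; `run_report` is a hypothetical stand-in for the real `sreport` call, and `collect_usage` is likewise an illustrative name.

```shell
# Hypothetical stand-in for the failing sreport call, for illustration only.
# It mimics the symptom reported in this ticket: an error on stderr and
# (by assumption) a non-zero exit status.
run_report() {
    echo "sreport: error: DBD_GET_ASSOCS failure: No error" >&2
    return 1
}

# Capture stdout first so the report command's own exit status is checked,
# instead of being hidden behind awk's status at the end of a pipeline.
collect_usage() {
    if raw=$(run_report); then
        printf '%s\n' "$raw" | awk -F "|" '{ if ($2=="") print $1","$3 }'
    else
        echo "report query failed; skipping this run" >&2
        return 1
    fi
}

status=0
collect_usage || status=$?
echo "exit status: $status"
```

In a real script, `run_report` would wrap the actual `sreport` invocation; the point is only that the check happens before the output reaches `awk`.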
|