Summary: | More time than possible | ||
---|---|---|---|
Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
Component: | slurmdbd | Assignee: | Albert Gil <albert.gil> |
Status: | RESOLVED INFOGIVEN | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | ||
Version: | 20.11.3 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Harvard University | ||
Description
Paul Edmon
2021-02-01 07:10:26 MST
Hi Paul,

This error means that when slurmdbd computes the usage information shown by sreport from the jobs and reservations that ran on the cluster in the last hour (the rollup process), it detected that the sum of all the CPU time allocated was bigger than what the actual number of CPUs in the cluster could provide. So, something is wrong.

There are several reasons that could explain those error messages. Some of them are even legitimate if OverSubscribe is set up in some specific ways.

But the most typical reason is runaway jobs: jobs that are no longer running on the system, but for some reason slurmdbd was not notified and thinks that they are still running. Runaway jobs distort the accounting information and, if your system is actually quite busy, counting these nonexistent jobs can lead to "more time than possible" being detected.

You can run "sacctmgr show runaway" to see if you are facing this problem; the command also offers to fix them. Note that fixing runaways also means that slurmdbd will start a new rollup to recompute the usage/sreport info from the oldest runaway detected. If the runaway jobs are very old, the rollup could take time to complete, and some sreport info won't be available until it completes.

If you don't feel confident enough to say Yes when asked to fix the runaways, please say No and attach the output of the command so we can help you further.

If you don't have runaways, we'll look for other reasons.

Regards,
Albert

Thanks. I suspected as much. I'm usually pretty regular about running the runaway check because we do pick up errant jobs from time to time. I ran the check today and it found a few more from Feb 1st when we did the upgrade. I will keep an eye on it and see if after the roll up this error goes away.
-Paul Edmon-

On 2/4/2021 6:09 AM, bugs@schedmd.com wrote:
> Comment #1 <https://bugs.schedmd.com/show_bug.cgi?id=10753#c1> on bug 10753 <https://bugs.schedmd.com/show_bug.cgi?id=10753> from Albert Gil <mailto:albert.gil@schedmd.com>

Hi Paul,

> Thanks. I suspected as much.
> I'm usually pretty regular about running the runaway check because we do pick up errant jobs from time to time.

Ok. Runaways should be pretty exceptional, caused by the system being in some bad condition at some point.

> I ran the check today and it found a few more from Feb 1st when we did the upgrade.

Thanks for the information. I'll try to reproduce that specific case to see if we can fix or avoid those runaways in the first place.

> I will keep an eye on it and see if after the roll up this error goes away.

Unless you really have very high throughput, the rollup shouldn't take much time to complete. A simple way to check it is querying sreport. Right now it's probably empty from Feb 1st onward, and in some hours the info will be there again.

Regards,
Albert

Looks like that did it. I'm not seeing any other errors.

-Paul Edmon-

On 2/8/2021 10:32 AM, bugs@schedmd.com wrote:
> Comment #3 <https://bugs.schedmd.com/show_bug.cgi?id=10753#c3> on bug 10753 <https://bugs.schedmd.com/show_bug.cgi?id=10753> from Albert Gil <mailto:albert.gil@schedmd.com>
Great! Closing as infogiven.
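
For anyone landing on this bug later, the check behind "more time than possible" can be illustrated with a small Python sketch. This is not slurmdbd's code, only the arithmetic described in Comment 1 under simplifying assumptions: `JobRecord`, `allocated_cpu_seconds` and `check_hour` are hypothetical names, the 100-CPU cluster and the job sizes are invented numbers, and reservations and OverSubscribe are ignored.

```python
from dataclasses import dataclass
from typing import Optional

HOUR = 3600  # length of one rollup period, in seconds


@dataclass
class JobRecord:
    """Hypothetical, simplified stand-in for a job row in the accounting database."""
    job_id: int
    cpus: int
    start: int            # seconds, relative to an arbitrary origin
    end: Optional[int]    # None: the database still believes the job is running


def allocated_cpu_seconds(job, period_start, period_end):
    """CPU-seconds this job contributes inside [period_start, period_end)."""
    job_end = period_end if job.end is None else min(job.end, period_end)
    overlap = job_end - max(job.start, period_start)
    return job.cpus * max(overlap, 0)


def check_hour(jobs, cluster_cpus, period_start):
    """Return (allocated, possible, over_limit) for one hourly period."""
    period_end = period_start + HOUR
    allocated = sum(allocated_cpu_seconds(j, period_start, period_end) for j in jobs)
    possible = cluster_cpus * HOUR
    return allocated, possible, allocated > possible


# Two real jobs that exactly fill an assumed 100-CPU cluster for the hour.
real_jobs = [JobRecord(1, 60, 0, HOUR), JobRecord(2, 40, 0, HOUR)]

# A runaway: it finished long ago, but its end was never recorded,
# so it is still counted as running for the whole hour.
runaway = JobRecord(3, 16, -86400, None)

print(check_hour(real_jobs, 100, 0))              # (360000, 360000, False)
print(check_hour(real_jobs + [runaway], 100, 0))  # (417600, 360000, True)
```

On a busy cluster the real jobs already use almost all available CPU time, so a single job with no recorded end is enough to push the hourly sum over the limit. This matches the thread: the error stopped once the runaways were fixed and the rollup had been redone.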