Ticket 10753

Summary:	More time than possible
Product:	Slurm	Reporter:	Paul Edmon <pedmon>
Component:	slurmdbd	Assignee:	Albert Gil <albert.gil>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	4 - Minor Issue
Priority:	---
Version:	20.11.3
Hardware:	Linux
OS:	Linux
Site:	Harvard University	Slinky Site:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---

Description Paul Edmon 2021-02-01 07:10:26 MST

I'm seeing the following error in my slurmdbd logs after I upgraded to 20.11.3:

Feb  1 09:07:08 holy-slurm02 slurmdbd[15803]: error: We have more time than is possible (887595120000+860675998487+0)(1748271118487) > 1290775849200 for cluster odyssey(358548847) from 2021-02-01T08:00:00 - 2021-02-01T09:00:00 tres 2
Feb  1 09:07:08 holy-slurm02 slurmdbd[15803]: error: We have more time than is possible (291733200+333814968+0)(625548168) > 501386400 for cluster odyssey(139274) from 2021-02-01T08:00:00 - 2021-02-01T09:00:00 tres 5
Feb  1 09:07:08 holy-slurm02 slurmdbd[15803]: error: We have more time than is possible (1713600+1344534+0)(3058134) > 2008800 for cluster odyssey(558) from 2021-02-01T08:00:00 - 2021-02-01T09:00:00 tres 1003

I'm guessing a few jobs or something got screwed up.  Is there any way to resolve this?  It's too late to roll back and slurmdbd is working properly aside from this error.  Thanks.

-Paul Edmon-

Comment 1 Albert Gil 2021-02-04 04:09:57 MST

Hi Paul,

This error means that when slurmdbd computed the usage information shown by sreport from the jobs and reservations that ran on the cluster during the last hour (the rollup process), it detected that the total CPU time allocated was larger than the CPU time the cluster could actually provide in that hour. So, something is wrong.
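
The numbers in the log lines from the description can be checked by hand. Below is a minimal sketch, assuming that the three addends in the first parentheses are separate usage components that sum to the total in the second parentheses, and that the limit on the right of ">" is the value in parentheses after the cluster name multiplied by 3600 seconds. Both relations hold for all three lines quoted in this ticket, but this reading of the fields is inferred from the numbers, not taken from the Slurm source. The helper name check_rollup_error is hypothetical.

```python
import re

# Matches the numeric part of the slurmdbd error line quoted in this ticket.
PATTERN = re.compile(
    r"\((\d+)\+(\d+)\+(\d+)\)\((\d+)\) > (\d+) "
    r"for cluster ([^\s(]+)\((\d+)\) from (\S+) - (\S+) tres (\d+)"
)


def check_rollup_error(line):
    """Parse one error line and re-check its arithmetic (hypothetical helper)."""
    m = PATTERN.search(line)
    if m is None:
        return None
    a, b, c, used, limit, count = (int(m.group(i)) for i in (1, 2, 3, 4, 5, 7))
    return {
        "cluster": m.group(6),
        "tres": int(m.group(10)),
        "used": used,
        "limit": limit,
        # Assumption: the three addends sum to the reported total.
        "sum_matches": a + b + c == used,
        # Assumption: limit = per-cluster count * 3600 s (one hourly window).
        "limit_is_count_times_3600": limit == count * 3600,
        # How far the reported usage exceeds the limit (greater than 1.0 here).
        "excess_ratio": used / limit,
    }


line = (
    "error: We have more time than is possible "
    "(1713600+1344534+0)(3058134) > 2008800 for cluster odyssey(558) "
    "from 2021-02-01T08:00:00 - 2021-02-01T09:00:00 tres 1003"
)
result = check_rollup_error(line)
print(result)
```

An excess_ratio well above 1.0, as in these lines, fits extra usage being counted for jobs that are no longer running, though on its own it does not rule out the other causes.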

There are several reasons that could explain those error messages.
Some of them are even legitimate if OverSubscribe is set up in certain ways.

But the most typical reason is runaway jobs.
That is, jobs that are no longer running on the system, but for some reason slurmdbd was not notified and still thinks they are running.
Runaway jobs distort the accounting information and, if your system is actually quite busy, the extra time from these nonexistent jobs can push the total past the limit and trigger the "more time than is possible" error.

You can run "sacctmgr show runaway" to see if you are facing this problem; the command also offers to fix any runaway jobs it finds.
Note that fixing runaways also means that slurmdbd will start a new rollup to recompute the usage/sreport info from the oldest runaway detected. If the runaway jobs are very old, the rollup could take a while to complete, and some sreport info won't be available until it finishes.

If you don't feel confident enough to say Yes when asked to fix the runaways, please say No and attach the output of the command so we can help you further.

If you don't have runaways, we'll look for other reasons.

Regards,
Albert

Comment 2 Paul Edmon 2021-02-08 07:33:33 MST

Thanks.  I suspected as much.  I'm usually pretty regular about running 
the runaway check because we do pick up errant jobs from time to time.  
I ran the check today and it found a few more from Feb 1st when we did 
the upgrade.  I will keep an eye on it and see if after the roll up this 
error goes away.

-Paul Edmon-

Comment 3 Albert Gil 2021-02-08 08:32:42 MST

Hi Paul,

> Thanks.  I suspected as much.  I'm usually pretty regular about running 
> the runaway check because we do pick up errant jobs from time to time.  

Ok.
Runaway jobs should be pretty exceptional, caused by the system being in bad condition at some point.

> I ran the check today and it found a few more from Feb 1st when we did 
> the upgrade.

Thanks for the information.
I'll try to reproduce that specific case to see if we can fix/avoid those runaways in the first place.

>  I will keep an eye on it and see if after the roll up this 
> error goes away.

Unless you really have very high throughput, the rollup shouldn't take much time to complete.
A simple way to check it is querying sreport.
Right now it's probably empty from Feb 1st, and in a few hours the info will be there again.

Regards,
Albert

Comment 4 Paul Edmon 2021-02-08 09:02:54 MST

Looks like that did it. I'm not seeing any other errors.

-Paul Edmon-

Comment 5 Albert Gil 2021-02-08 09:05:50 MST

Great!
Closing as infogiven.