Ticket 2068 - More allocated time than is possible?
Summary: More allocated time than is possible?
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Database
Version: 15.08.8
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Danny Auble
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2015-10-27 01:02 MDT by Josko Plazonic
Modified: 2016-03-20 20:38 MDT (History)
3 users

See Also:
Site: Princeton (PICSciE)
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 15.08.8 16.05.00-pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
runaway job identification script (1.33 KB, application/x-perl)
2015-10-27 03:39 MDT, Tim Wickberg
Details
slurmdbd.log (273.92 KB, application/octet-stream)
2016-03-17 02:17 MDT, Jean-Claude
Details

Description Josko Plazonic 2015-10-27 01:02:48 MDT
In slurmdbd.log we have a fair number of messages like:

[2015-10-27T09:00:01.016] error: We have more allocated time than is possible (315424224 > 13305600) for cluster orbital(3696) from 2015-10-27T08:00:00 - 2015-10-27T09:00:00 tres 1
[2015-10-27T09:00:01.016] error: We have more time than is possible (13305600+648000+0)(13953600) > 13305600 for cluster orbital(3696) from 2015-10-27T08:00:00 - 2015-10-27T09:00:00 tres 1
[2015-10-27T09:00:01.016] error: We have more allocated time than is possible (473365296000 > 53222400000) for cluster orbital(14784000) from 2015-10-27T08:00:00 - 2015-10-27T09:00:00 tres 2
[2015-10-27T09:00:01.016] error: We have more time than is possible (53222400000+2592000000+0)(55814400000) > 53222400000 for cluster orbital(14784000) from 2015-10-27T08:00:00 - 2015-10-27T09:00:00 tres 2
[2015-10-27T09:00:01.249] error: We have more allocated time than is possible (199136811 > 37382400) for cluster tiger(10384) from 2015-10-27T08:00:00 - 2015-10-27T09:00:00 tres 1
[2015-10-27T09:00:01.249] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-27T08:00:00 - 2015-10-27T09:00:00 tres 1
[2015-10-27T09:00:01.249] error: We have more allocated time than is possible (736817583760 > 172339200000) for cluster tiger(47872000) from 2015-10-27T08:00:00 - 2015-10-27T09:00:00 tres 2
[2015-10-27T09:00:01.249] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-27T08:00:00 - 2015-10-27T09:00:00 tres 2
[2015-10-27T09:00:08.693] error: We have more allocated time than is possible (414524469 > 12441600) for cluster della(3456) from 2015-10-27T08:00:00 - 2015-10-27T09:00:00 tres 1
[2015-10-27T09:00:08.693] error: We have more allocated time than is possible (1611299003408 > 77414400000) for cluster della(21504000) from 2015-10-27T08:00:00 - 2015-10-27T09:00:00 tres 2

Do you know why this is happening? Is it important, and can we fix it?

Thanks!
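The numbers in these messages decode as follows: the figure in parentheses after the cluster name is the cluster's total for that TRES (tres 1 is CPUs, tres 2 is memory in MB, matching Slurm's default TRES ids), and the "possible" limit on the right of the comparison is that total multiplied by the 3600 seconds in the hourly rollup window. A small sketch reproducing the limits from the log lines above:

```python
# Reproduce the "possible" limits from the error messages:
# limit = (cluster TRES total) * (seconds in the rollup window).
HOUR = 3600  # one-hour rollup window, in seconds

clusters = {
    # name: (cpu_count, memory_mb), taken from the log lines above
    "orbital": (3696, 14784000),
    "tiger":   (10384, 47872000),
    "della":   (3456, 21504000),
}

for name, (cpus, mem_mb) in clusters.items():
    cpu_limit = cpus * HOUR    # tres 1: CPU-seconds possible in one hour
    mem_limit = mem_mb * HOUR  # tres 2: MB-seconds possible in one hour
    print(f"{name}: tres1 limit {cpu_limit}, tres2 limit {mem_limit}")
```

For orbital this gives 3696 * 3600 = 13305600 and 14784000 * 3600 = 53222400000, exactly the limits printed in the first two error lines; the "allocated" figures on the left exceed those limits, which is what triggers the error.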
Comment 1 Tim Wickberg 2015-10-27 03:39:40 MDT
Created attachment 2338 [details]
runaway job identification script
Comment 2 Tim Wickberg 2015-10-27 04:08:15 MDT
Hi Josko -

That message indicates there are some jobs that "ran away" in the database - for some reason Slurm doesn't have a valid endtime set for them, so their runtime keeps being included in each hourly roll-up process. This will affect the output of "sreport" as well: it will start to over-report the allocated resources on the system. This could have happened if slurmctld crashed, or if you'd hit some other bug, and it could have happened quite some time in the past yet only be showing up now if the system had been running at a lower load.

The script I attached to the bug [1] can help identify the specific jobs with issues, you'd need to update their endtimes either to when the job actually completed (possibly during a system crash or other outage?) or otherwise zero those fields out.
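The check the attached script performs can be sketched in miniature. The function name and record layout below are hypothetical simplifications for illustration (the real script queries the `*_job_table` tables directly): a job is a runaway candidate when it has a start time but no end time recorded.

```python
def find_runaways(job_records):
    """Return jobs that started but never got an end time recorded.

    Each record is a dict mirroring *_job_table-style columns:
    id_job, time_start, time_end (0 means "not set").
    """
    return [
        job for job in job_records
        if job["time_start"] > 0 and job["time_end"] == 0
    ]

# Hypothetical sample records for illustration only:
jobs = [
    {"id_job": 100, "time_start": 1445000000, "time_end": 1445003600},  # finished
    {"id_job": 101, "time_start": 1445000000, "time_end": 0},           # runaway
    {"id_job": 102, "time_start": 0, "time_end": 0},                    # never started
]
print([j["id_job"] for j in find_runaways(jobs)])  # -> [101]
```

Note the real script also has to exclude jobs that are legitimately still running (via the job state column), which this sketch omits.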

If you're using 'sreport' you can rebuild the hourly/daily/monthly rollup values by resetting the values in orbital_last_ran_table through MySQL like so (after changing the timestamp to an appropriate value):

UPDATE orbital_last_ran_table
SET
    hourly_rollup  = UNIX_TIMESTAMP('2015-03-01 00:00:00'),
    daily_rollup   = UNIX_TIMESTAMP('2015-03-01 00:00:00'),
    monthly_rollup = UNIX_TIMESTAMP('2015-03-01 00:00:00');
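The `*_last_ran_table` columns store plain Unix epoch seconds, so the value can also be computed outside MySQL when picking a reset point. A small sketch, assuming UTC (MySQL's UNIX_TIMESTAMP interprets its argument in the server's session time zone, so values on your server may differ by your UTC offset):

```python
from datetime import datetime, timezone

# Epoch seconds for a rollup reset point, equivalent to
# UNIX_TIMESTAMP('2015-03-01 00:00:00') on a UTC server.
reset = int(datetime(2015, 3, 1, tzinfo=timezone.utc).timestamp())
print(reset)  # -> 1425168000

# The same value goes into hourly_rollup, daily_rollup, and
# monthly_rollup; slurmdbd then re-aggregates everything after it.
```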


There are a few more details captured under bug 1706 [2] that may shed more light, although hopefully that's enough to point you towards a fix.

- Tim

[1] http://bugzilla.schedmd.com/attachment.cgi?id=2338
[2] http://bugs.schedmd.com/show_bug.cgi?id=1706
Comment 3 Tim Wickberg 2015-10-27 04:55:00 MDT
This is another symptom of bug 2067, I should have caught that earlier. As part of the communication issues some jobs are missing valid endtimes in the database, causing the hourly rollup process to count more time used than is possible.

I'm going to close this out in deference to 2067 instead.

- Tim

*** This ticket has been marked as a duplicate of ticket 2067 ***
Comment 4 Tim Wickberg 2015-10-27 06:18:21 MDT
We've talked this over and concluded that may have been a bit quick. The two are most likely related, but the fix for this is different from the one for your database communication issues.

You'll still need to go and locate these runaway job records and correct or remove them from the database directly. If you can give us any additional detail on the start times for this it may help narrow down the window a bit.

We're noticing that each of these is showing up with a different timestamp, so we're wondering if something external to slurmdbd may be interacting with the database and potentially causing locking issues. Are there any cron jobs backing up / exporting / interfacing with mysql directly that could be causing contention?

- Tim
Comment 5 Josko Plazonic 2015-10-27 06:32:05 MDT
We have xdmod slurping data nightly (but via an sacct dump, so that shouldn't be too bad) and a mysql backup job (it seems to run at 11:05pm local time for some 40 minutes), but it does the dump in a safe (--single-transaction) way, so in theory it is not disruptive. To be more precise:

mysqldump -u${USER} -p${PASS} --opt --flush-logs --single-transaction \
    --ignore-table=mysql.event --skip-triggers --skip-routines \
 slurm_acct_db|

I did indeed start fixing the database - so far I have "fixed" one cluster (update della_job_table set time_end=time_start where id_job=5858358; was one of them), with 3 more to go through, and those have far more bad jobs than della (which had only 4).

Della's 4 bad jobs all started between 2015-10-21T13:02:22 (1 job with a 2 hour allocation) and 2015-10-21T14:16:38 (3 jobs with 4 hour allocations). ReqTres info is present but not much else.

I'll look at other clusters as well - they have far more bad jobs.
Comment 6 Josko Plazonic 2015-10-27 13:00:31 MDT
Didn't have time to look at the other clusters, but fixing della's lost jobs did not stop the complaints.

[root@della4 plazonic]# ./lost.pl 

CLUSTER:della

[root@della4 plazonic]# 

[2015-10-27T19:00:05.622] error: We have more allocated time than is possible (368585970 > 12441600) for cluster della(3456) from 2015-10-27T18:00:00 - 2015-10-27T19:00:00 tres 1
[2015-10-27T19:00:05.622] error: We have more allocated time than is possible (1418975235801 > 77414400000) for cluster della(21504000) from 2015-10-27T18:00:00 - 2015-10-27T19:00:00 tres 2
[2015-10-27T20:00:05.423] error: We have more allocated time than is possible (392200987 > 12441600) for cluster della(3456) from 2015-10-27T19:00:00 - 2015-10-27T20:00:00 tres 1
[2015-10-27T20:00:05.423] error: We have more allocated time than is possible (1518947735928 > 77414400000) for cluster della(21504000) from 2015-10-27T19:00:00 - 2015-10-27T20:00:00 tres 2
Comment 7 Tim Wickberg 2015-10-28 11:54:31 MDT
Can you check in on della again and see if there are any new runaway jobs? You're seeing ~30x more compute time each hour than is possible for your cluster, so it must be coming from somewhere.

Is this happening on the other clusters as well, and were you seeing any different log messages from them?
Comment 8 Josko Plazonic 2015-10-29 01:28:50 MDT
No, there are no lost/runaway jobs (as reported by the script) on della, and yes - we are seeing this for all 3 clusters sharing slurmdbd. As far as the design goes: slurmdbd runs on a separate machine that all clusters can reach. It does not have any slurm configs - indeed it complains at startup about a missing slurm.conf file - but why should it need one? Anyway, I am not 100% sure that slurmdbd can talk to the slurmctlds (though clearly they can talk to slurmdbd). And here are logs indicating that all clusters suffer from this issue:

[2015-10-29T09:00:00.705] error: We have more allocated time than is possible (163674715 > 37382400) for cluster tiger(10384) from 2015-10-28T17:00:00 - 2015-10-28T18:00:00 tres 1
[2015-10-29T09:00:00.705] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T17:00:00 - 2015-10-28T18:00:00 tres 1
[2015-10-29T09:00:00.705] error: We have more allocated time than is possible (575372884552 > 172339200000) for cluster tiger(47872000) from 2015-10-28T17:00:00 - 2015-10-28T18:00:00 tres 2
[2015-10-29T09:00:00.705] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T17:00:00 - 2015-10-28T18:00:00 tres 2
[2015-10-29T09:00:00.717] error: We have more allocated time than is possible (109632656 > 13305600) for cluster orbital(3696) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 1
[2015-10-29T09:00:00.717] error: We have more time than is possible (13305600+604800+0)(13910400) > 13305600 for cluster orbital(3696) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 1
[2015-10-29T09:00:00.717] error: We have more allocated time than is possible (164887256000 > 53222400000) for cluster orbital(14784000) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 2
[2015-10-29T09:00:00.717] error: We have more time than is possible (53222400000+2419200000+0)(55641600000) > 53222400000 for cluster orbital(14784000) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 2
[2015-10-29T09:00:00.755] error: We have more allocated time than is possible (158313041 > 37382400) for cluster tiger(10384) from 2015-10-28T18:00:00 - 2015-10-28T19:00:00 tres 1
[2015-10-29T09:00:00.755] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T18:00:00 - 2015-10-28T19:00:00 tres 1
[2015-10-29T09:00:00.755] error: We have more allocated time than is possible (562198844600 > 172339200000) for cluster tiger(47872000) from 2015-10-28T18:00:00 - 2015-10-28T19:00:00 tres 2
[2015-10-29T09:00:00.755] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T18:00:00 - 2015-10-28T19:00:00 tres 2
[2015-10-29T09:00:00.807] error: We have more allocated time than is possible (173781396 > 37382400) for cluster tiger(10384) from 2015-10-28T19:00:00 - 2015-10-28T20:00:00 tres 1
[2015-10-29T09:00:00.807] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T19:00:00 - 2015-10-28T20:00:00 tres 1
[2015-10-29T09:00:00.807] error: We have more allocated time than is possible (646419575600 > 172339200000) for cluster tiger(47872000) from 2015-10-28T19:00:00 - 2015-10-28T20:00:00 tres 2
[2015-10-29T09:00:00.807] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T19:00:00 - 2015-10-28T20:00:00 tres 2
[2015-10-29T09:00:00.831] error: We have more allocated time than is possible (2337371400 > 12441600) for cluster della(3456) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 1
[2015-10-29T09:00:00.831] error: We have more allocated time than is possible (9519241078976 > 77414400000) for cluster della(21504000) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 2
[2015-10-29T09:00:00.852] error: We have more allocated time than is possible (182051219 > 37382400) for cluster tiger(10384) from 2015-10-28T20:00:00 - 2015-10-28T21:00:00 tres 1
[2015-10-29T09:00:00.852] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T20:00:00 - 2015-10-28T21:00:00 tres 1
[2015-10-29T09:00:00.852] error: We have more allocated time than is possible (714777476360 > 172339200000) for cluster tiger(47872000) from 2015-10-28T20:00:00 - 2015-10-28T21:00:00 tres 2
[2015-10-29T09:00:00.852] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T20:00:00 - 2015-10-28T21:00:00 tres 2
[2015-10-29T09:00:00.896] error: We have more allocated time than is possible (197420130 > 37382400) for cluster tiger(10384) from 2015-10-28T21:00:00 - 2015-10-28T22:00:00 tres 1
[2015-10-29T09:00:00.896] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T21:00:00 - 2015-10-28T22:00:00 tres 1
[2015-10-29T09:00:00.896] error: We have more allocated time than is possible (1012108251400 > 172339200000) for cluster tiger(47872000) from 2015-10-28T21:00:00 - 2015-10-28T22:00:00 tres 2
[2015-10-29T09:00:00.896] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T21:00:00 - 2015-10-28T22:00:00 tres 2
[2015-10-29T09:00:00.921] error: We have more allocated time than is possible (201042368 > 37382400) for cluster tiger(10384) from 2015-10-28T22:00:00 - 2015-10-28T23:00:00 tres 1
[2015-10-29T09:00:00.921] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T22:00:00 - 2015-10-28T23:00:00 tres 1
[2015-10-29T09:00:00.921] error: We have more allocated time than is possible (1011152569200 > 172339200000) for cluster tiger(47872000) from 2015-10-28T22:00:00 - 2015-10-28T23:00:00 tres 2
[2015-10-29T09:00:00.921] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T22:00:00 - 2015-10-28T23:00:00 tres 2
[2015-10-29T09:00:00.944] error: We have more allocated time than is possible (190524023 > 37382400) for cluster tiger(10384) from 2015-10-28T23:00:00 - 2015-10-29T00:00:00 tres 1
[2015-10-29T09:00:00.944] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T23:00:00 - 2015-10-29T00:00:00 tres 1
[2015-10-29T09:00:00.944] error: We have more allocated time than is possible (727171735200 > 172339200000) for cluster tiger(47872000) from 2015-10-28T23:00:00 - 2015-10-29T00:00:00 tres 2
[2015-10-29T09:00:00.944] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T23:00:00 - 2015-10-29T00:00:00 tres 2
[2015-10-29T09:00:00.969] error: We have more allocated time than is possible (193484799 > 37382400) for cluster tiger(10384) from 2015-10-29T00:00:00 - 2015-10-29T01:00:00 tres 1
[2015-10-29T09:00:00.969] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T00:00:00 - 2015-10-29T01:00:00 tres 1
[2015-10-29T09:00:00.969] error: We have more allocated time than is possible (712515734400 > 172339200000) for cluster tiger(47872000) from 2015-10-29T00:00:00 - 2015-10-29T01:00:00 tres 2
[2015-10-29T09:00:00.969] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T00:00:00 - 2015-10-29T01:00:00 tres 2
[2015-10-29T09:00:00.994] error: We have more allocated time than is possible (195719364 > 37382400) for cluster tiger(10384) from 2015-10-29T01:00:00 - 2015-10-29T02:00:00 tres 1
[2015-10-29T09:00:00.994] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T01:00:00 - 2015-10-29T02:00:00 tres 1
[2015-10-29T09:00:00.994] error: We have more allocated time than is possible (707546413680 > 172339200000) for cluster tiger(47872000) from 2015-10-29T01:00:00 - 2015-10-29T02:00:00 tres 2
[2015-10-29T09:00:00.994] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T01:00:00 - 2015-10-29T02:00:00 tres 2
[2015-10-29T09:00:01.018] error: We have more allocated time than is possible (193018208 > 37382400) for cluster tiger(10384) from 2015-10-29T02:00:00 - 2015-10-29T03:00:00 tres 1
[2015-10-29T09:00:01.018] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T02:00:00 - 2015-10-29T03:00:00 tres 1
[2015-10-29T09:00:01.018] error: We have more allocated time than is possible (673781790400 > 172339200000) for cluster tiger(47872000) from 2015-10-29T02:00:00 - 2015-10-29T03:00:00 tres 2
[2015-10-29T09:00:01.018] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T02:00:00 - 2015-10-29T03:00:00 tres 2
[2015-10-29T09:00:01.041] error: We have more allocated time than is possible (190585550 > 37382400) for cluster tiger(10384) from 2015-10-29T03:00:00 - 2015-10-29T04:00:00 tres 1
[2015-10-29T09:00:01.041] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T03:00:00 - 2015-10-29T04:00:00 tres 1
[2015-10-29T09:00:01.041] error: We have more allocated time than is possible (674758428000 > 172339200000) for cluster tiger(47872000) from 2015-10-29T03:00:00 - 2015-10-29T04:00:00 tres 2
[2015-10-29T09:00:01.041] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T03:00:00 - 2015-10-29T04:00:00 tres 2
[2015-10-29T09:00:01.064] error: We have more allocated time than is possible (203983033 > 37382400) for cluster tiger(10384) from 2015-10-29T04:00:00 - 2015-10-29T05:00:00 tres 1
[2015-10-29T09:00:01.064] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T04:00:00 - 2015-10-29T05:00:00 tres 1
[2015-10-29T09:00:01.064] error: We have more allocated time than is possible (714000817400 > 172339200000) for cluster tiger(47872000) from 2015-10-29T04:00:00 - 2015-10-29T05:00:00 tres 2
[2015-10-29T09:00:01.064] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T04:00:00 - 2015-10-29T05:00:00 tres 2
[2015-10-29T09:00:01.086] error: We have more allocated time than is possible (194270318 > 37382400) for cluster tiger(10384) from 2015-10-29T05:00:00 - 2015-10-29T06:00:00 tres 1
[2015-10-29T09:00:01.086] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T05:00:00 - 2015-10-29T06:00:00 tres 1
[2015-10-29T09:00:01.086] error: We have more allocated time than is possible (680541284800 > 172339200000) for cluster tiger(47872000) from 2015-10-29T05:00:00 - 2015-10-29T06:00:00 tres 2
[2015-10-29T09:00:01.086] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T05:00:00 - 2015-10-29T06:00:00 tres 2
[2015-10-29T09:00:01.107] error: We have more allocated time than is possible (191678601 > 37382400) for cluster tiger(10384) from 2015-10-29T06:00:00 - 2015-10-29T07:00:00 tres 1
[2015-10-29T09:00:01.107] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T06:00:00 - 2015-10-29T07:00:00 tres 1
[2015-10-29T09:00:01.107] error: We have more allocated time than is possible (667025485600 > 172339200000) for cluster tiger(47872000) from 2015-10-29T06:00:00 - 2015-10-29T07:00:00 tres 2
[2015-10-29T09:00:01.107] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T06:00:00 - 2015-10-29T07:00:00 tres 2
[2015-10-29T09:00:01.129] error: We have more allocated time than is possible (188878930 > 37382400) for cluster tiger(10384) from 2015-10-29T07:00:00 - 2015-10-29T08:00:00 tres 1
[2015-10-29T09:00:01.129] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T07:00:00 - 2015-10-29T08:00:00 tres 1
[2015-10-29T09:00:01.129] error: We have more allocated time than is possible (646713798600 > 172339200000) for cluster tiger(47872000) from 2015-10-29T07:00:00 - 2015-10-29T08:00:00 tres 2
[2015-10-29T09:00:01.129] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T07:00:00 - 2015-10-29T08:00:00 tres 2
[2015-10-29T09:00:01.207] error: We have more allocated time than is possible (194586712 > 37382400) for cluster tiger(10384) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 1
[2015-10-29T09:00:01.207] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 1
[2015-10-29T09:00:01.207] error: We have more allocated time than is possible (652454145000 > 172339200000) for cluster tiger(47872000) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 2
[2015-10-29T09:00:01.207] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 2
Comment 9 Tim Wickberg 2015-10-29 07:00:04 MDT
We're still thinking that some specific job record(s) are responsible. If you can run something like this for della (or swap in tiger if you'd prefer), we'd appreciate it:

select job.job_db_inx, job.id_job, job.array_task_pending, job.time_submit, job.time_eligible, job.time_start, job.time_end, job.time_suspended, job.cpus_req, job.id_resv, job.tres_alloc, SUM(step.consumed_energy) from della_job_table as job left outer join della_step_table as step on job.job_db_inx=step.job_db_inx and (step.id_step>=0) where (job.time_eligible < 1446141600 && (job.time_end >= 1446138000 || job.time_end = 0)) group by job.job_db_inx order by job.id_assoc, job.time_eligible;
Comment 10 Josko Plazonic 2015-10-29 08:14:28 MDT
OK, here it is:

http://tigress-web.princeton.edu/~plazonic/della_query.txt

2486 rows - is this bad?
Comment 11 Tim Wickberg 2015-10-29 09:49:31 MDT
~2500 rows is fine - that's the jobs eligible within that one hour on della. I've got it loaded here and I'm now seeing the same accounting issue, so I can dig in further.

Do you mind sending over a few more details? We could use the output from this:

sacctmgr list events format=cluster,clusternodes,timestart,timeend,tres%70 where cluster=della

as well as the current slurm.conf for della.

thanks,
- Tim
Comment 12 Josko Plazonic 2015-10-29 10:56:51 MDT
Note that I can't run these commands on the slurmdbd machine, as we don't have a slurm.conf file there - it just fails after trying 60 times, once per second.

This was therefore run on della's slurmctld host:

sacctmgr list events format=cluster,clusternodes,timestart,timeend,tres%70 where cluster=della
   Cluster        Cluster Nodes           TimeStart             TimeEnd                                                                   TRES 
---------- -------------------- ------------------- -------------------                  ----------------------------------------------------- 
     della della-r1c1n[1-16],d+ 2015-10-13T08:25:30             Unknown                                cpu=3456,mem=21504000,energy=0,node=224 

In case you need the full list of nodes - but it looks to match:
sacctmgr list events format=clusternodes%280 where cluster=della
                                                                                                                                                                                                                                                                           Cluster Nodes 
                                                                                                                                                                                                                                   ----------------------------------------------------- 
                             della-r1c1n[1-16],della-r1c2n[1-16],della-r1c3n[1-16],della-r1c4n[1-16],della-r2c1n[1-16],della-r2c2n[1-16],della-r2c3n[1-16],della-r2c4n[1-16],della-r3c1n[1-16],della-r3c2n[1-16],della-r3c3n[1-16],della-r3c4n[1-16],della-r4c1n[1-16],della-r4c2n[1-16] 

and slurm.conf you can find at 

http://tigress-web.princeton.edu/~plazonic/slurm.conf

Note that the cluster is non-uniform - some nodes have 20 cores and some have 12.
Comment 13 Tim Wickberg 2015-11-02 10:40:46 MST
Thanks for your patience - the results from the diagnostic commands were quite helpful.

We think we've got the bug nailed down and a fix for it ready now:

https://github.com/SchedMD/slurm/commit/34e24467093dbd65.patch

The good news is that the individual job records should be okay; it's only the hourly "rollup" - which aggregates the job statistics for quicker reporting by sreport and other commands - that is affected.

You'd need to reset the hourly/daily/monthly rollup timestamps to have slurmdbd recalculate past usage as reported by sreport; it's the same statement we gave you before (and you'd need to run it for each cluster):

update della_last_ran_table SET hourly_rollup = UNIX_TIMESTAMP('2015-08-30 00:00:00'), daily_rollup = UNIX_TIMESTAMP('2015-08-30 00:00:00'), monthly_rollup = UNIX_TIMESTAMP('2015-08-30 00:00:00');

You can adjust those dates forward to whenever you updated slurmdbd to 15.08 - the bug was introduced in that release as part of the TRES work. We expect to have 15.08.3 out soon with this fix.

- Tim
Comment 14 Josko Plazonic 2015-11-02 11:27:11 MST
Thanks! Re-basing our installation to 15.08.2 + this patch (and a few others...). Will hopefully update tonight and then tell it to recalculate everything, starting with the clusters with fewer jobs.

Will update ticket once I see how it is handling recalculations with patch.
Comment 15 Moe Jette 2015-11-02 12:03:34 MST
We'll probably tag 15.08.3 on Tuesday 

Comment 16 Josko Plazonic 2015-11-02 12:23:13 MST
I guess then I'll wait - I see a few other fixes that would be good to have.
Comment 17 Danny Auble 2015-11-02 12:29:43 MST
Please test this tonight if possible.  Your results will determine if we tag or not. 

Comment 18 Josko Plazonic 2015-11-02 12:49:21 MST
OK, will do.
Comment 19 Josko Plazonic 2015-11-02 13:18:15 MST
It looks better - after upgrading slurmdbd and then updating the last-processed timestamps there are no further complaints about della; we only get:

[2015-11-02T22:08:42.125] error: id_assoc 439 doesn't have any tres
[2015-11-02T22:08:42.125] error: id_assoc 440 doesn't have any tres
[2015-11-02T22:08:42.247] error: id_assoc 439 doesn't have any tres
[2015-11-02T22:08:42.247] error: id_assoc 440 doesn't have any tres
[2015-11-02T22:09:08.384] Warning: Note very large processing time from hourly_rollup for della: usec=124312592 began=22:07:04.072

Not sure what's up with id_assoc, but that's probably unrelated to this ticket.

Looks like we have a bunch of lost jobs on tiger, but that's probably related to a switch crash; I'll try cleaning those up and rerunning processing.
Comment 20 Josko Plazonic 2015-11-02 13:37:48 MST
Tiger looks good too after cleaning up the "runaway" jobs (note that this event was on 10/29 - unrelated to the original report). It sure looks like this patch helped.
Comment 21 Tim Wickberg 2015-11-05 08:47:26 MST
Glad to hear that worked out for you.

I'm going to go ahead and close the bug, if you see any recurrence please open it back up.

- Tim
Comment 22 Yong Qin 2016-02-11 04:35:19 MST
We are still seeing this issue as of 15.08.6, which we just recently upgraded to.

[yqin@perceus-00 ~]$ sbatch -V
slurm 15.08.6

[2016-02-11T09:21:11.555] error: We have more allocated time than is possible (109163769570 > 6307200) for cluster perceus-00(1752) from 2016-02-10T06:00:00 - 2016-02-10T07:00:00 tres 2
[2016-02-11T09:21:11.555] error: We have more time than is possible (6307200+745486+0)(7052686) > 6307200 for cluster perceus-00(1752) from 2016-02-10T06:00:00 - 2016-02-10T07:00:00 tres 2
[2016-02-11T09:21:23.334] error: We have more allocated time than is possible (108556002000 > 6307200) for cluster perceus-00(1752) from 2016-02-10T07:00:00 - 2016-02-10T08:00:00 tres 2
[2016-02-11T09:21:23.334] error: We have more time than is possible (6307200+745200+0)(7052400) > 6307200 for cluster perceus-00(1752) from 2016-02-10T07:00:00 - 2016-02-10T08:00:00 tres 2
[2016-02-11T09:21:31.484] error: We have more allocated time than is possible (108556002000 > 6307200) for cluster perceus-00(1752) from 2016-02-10T08:00:00 - 2016-02-10T09:00:00 tres 2
[2016-02-11T09:21:31.484] error: We have more time than is possible (6307200+745200+0)(7052400) > 6307200 for cluster perceus-00(1752) from 2016-02-10T08:00:00 - 2016-02-10T09:00:00 tres 2
Comment 24 Tim Wickberg 2016-02-16 05:51:43 MST
Commit 120b3e6f5cf4 should fix this, and will be included in Slurm 15.08.8 which is due out soon.

On a related note[1], I'd also suggest setting a minimum value for RealMemory on each node. That would protect against situations where the node memory comes up as less than expected - I've seen nodes bring up only half the expected memory after a reboot, leading to problems running jobs. If a node reports less memory than its configured RealMemory, Slurm will mark the node down; if the configured RealMemory is lower than what the node reports, Slurm will use the reported value from the node.

- Tim

[1] The issue fixed here was that RealMemory defaults to 1 for a node if not set; the value "1752" you saw printed in the log message corresponds to the total RealMemory calculated for your cluster - in your case 1 MB per node. The fix is to calculate that value from the reported memory on each node (when FastSchedule=0), rather than from the configuration.
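The arithmetic behind that footnote, as a sketch: with RealMemory defaulting to 1 MB per node, the cluster-wide memory TRES total collapses to roughly the node count in MB, so the "possible" memory-seconds per hour becomes tiny compared with what running jobs actually allocate.

```python
# Why the "possible" limit in the perceus-00 log lines above is so small:
# with RealMemory left unset, each node defaults to 1 MB, so the
# cluster-wide memory TRES total is just ~1 MB per node.
cluster_mem_mb = 1752  # the "perceus-00(1752)" figure from the log
HOUR = 3600            # rollup window in seconds

possible = cluster_mem_mb * HOUR
print(possible)  # -> 6307200, the limit printed in the error message
```

Real allocations are counted against the memory the jobs actually received, so the left-hand side of the comparison dwarfs this limit and the error fires every rollup.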
Comment 25 Jean-Claude 2016-02-25 16:29:25 MST
Hi all,

It seems that the problem remains in version 15.08.8:

[2016-02-26T07:00:01.043] error: We have more allocated time than is possible (144786519744 > 135425458800) for cluster deneb(37618183) from 2016-02-26T06:00:00 - 2016-02-26T07:00:00 tres 2
[2016-02-26T07:00:01.043] error: We have more time than is possible (135425458800+4602898800+0)(140028357600) > 135425458800 for cluster deneb(37618183) from 2016-02-26T06:00:00 - 2016-02-26T07:00:00 tres 2
[2016-02-26T08:00:00.216] error: We have more allocated time than is possible (144267645650 > 135425458800) for cluster deneb(37618183) from 2016-02-26T07:00:00 - 2016-02-26T08:00:00 tres 2

# sbatch -V
slurm 15.08.8

I launched the perl script and did not receive any results:
# ./lost.pl 

CLUSTER:deneb

The first message of that kind appeared on 2016-02-23, one day after the Slurm upgrade (from 14.11.7 to 15.08.8).


Regards,
Jean-Claude
Comment 27 Tim Wickberg 2016-02-26 06:07:14 MST
Just to check - has slurmctld been restarted since the update? While the error message appears in the database daemon, slurmctld is where the fix was.

Usually the fix and the error message live in the same daemon; this is a rare case where they don't, so I thought I'd check.
Comment 28 Jean-Claude 2016-02-29 20:11:54 MST
Hi,

Yes, I rebooted the server.

Comment 30 Danny Auble 2016-03-16 03:45:35 MDT
Hi Jean-Claude

Just to confirm, is this still happening? Could you post some recent logs and verify that the lost.pl script still returns nothing?
Comment 31 Jean-Claude 2016-03-17 02:17:29 MDT
Created attachment 2875 [details]
slurmdbd.log

Hello,


Yes, the same error messages are still coming.


As requested, I have attached the slurmdbd.log file.

The lost.pl script still returns an empty result.


Regards,

Jean-Claude


Comment 32 Danny Auble 2016-03-17 04:58:05 MDT
Jean-Claude, could you please open a new bug for this, as this ticket was originally Princeton's issue? Thanks.
Comment 33 Jean-Claude 2016-03-20 20:38:14 MDT
Hi Danny,

I created a new bug here:
https://bugs.schedmd.com/show_bug.cgi?id=2572 

Regards,
Jean-Claude