| Summary: | More allocated time than is possible? | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Josko Plazonic <plazonic> |
| Component: | Database | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | jared.baker, jeanclaude.degiorgi, yong.qin |
| Version: | 15.08.8 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | http://bugs.schedmd.com/show_bug.cgi?id=1706, http://bugs.schedmd.com/show_bug.cgi?id=2444 | | |
| Site: | Princeton (PICSciE) | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave Sites: | --- |
| Cray Sites: | --- | DS9 Clusters: | --- |
| Google Sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC Sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | 15.08.8 16.05.00-pre1 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | runaway job identification script, slurmdbd.log | | |
Description
Josko Plazonic
2015-10-27 01:02:48 MDT

Created attachment 2338 [details]
runaway job identification script
Hi Josko -

That message indicates there are some jobs that "ran away" in the database: for some reason Slurm does not have a valid end time set for them, so their runtime keeps being included in each hourly roll-up. This also affects the output of "sreport", which will start to over-report the allocated resources on the system. This could have happened if slurmctld crashed, or if you'd hit some other bug; it could even have happened quite some time ago and only be showing up now if the system had been running at a lower load.

The script I attached to the bug [1] can help identify the specific jobs with issues. You'd then need to update their end times, either to when each job actually completed (possibly during a system crash or other outage?) or otherwise zero those fields out.

If you're using 'sreport', you can rebuild the hourly/daily/monthly rollup values by resetting the timestamps in orbital_last_ran_table through MySQL like so (after changing the timestamp to an appropriate value):
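The attached script itself is not reproduced in this ticket, but the core check it performs can be sketched as follows. This is an illustration, not the attached Perl script; the field names mirror the *_job_table columns quoted later in the ticket (id_job, time_start, time_end), and the `state_running` flag is a stand-in for the job-state check a real query would do:

```python
# A "runaway" job record has started but never received a valid end time,
# so every hourly rollup keeps counting its runtime against the cluster.
def find_runaway_jobs(jobs):
    """Return ids of jobs that started but have no valid end time recorded."""
    runaways = []
    for job in jobs:
        started = job["time_start"] > 0
        no_end = job["time_end"] == 0
        # A genuinely running job also has time_end == 0, so only flag
        # records whose state says they are no longer running.
        if started and no_end and not job.get("state_running", False):
            runaways.append(job["id_job"])
    return runaways

jobs = [
    {"id_job": 5858358, "time_start": 1445432542, "time_end": 0, "state_running": False},
    {"id_job": 5858359, "time_start": 1445432542, "time_end": 1445439742, "state_running": False},
    {"id_job": 5858360, "time_start": 1445500000, "time_end": 0, "state_running": True},
]
print(find_runaway_jobs(jobs))  # -> [5858358]
```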
UPDATE orbital_last_ran_table
SET
    hourly_rollup = UNIX_TIMESTAMP('2015-02-28 00:00:00'),
    daily_rollup = UNIX_TIMESTAMP('2015-02-28 00:00:00'),
    monthly_rollup = UNIX_TIMESTAMP('2015-02-28 00:00:00');
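For reference, the conversion UNIX_TIMESTAMP() performs can be reproduced outside MySQL to sanity-check the value before running the UPDATE. A minimal sketch in Python (assuming UTC; MySQL actually evaluates the string in the session time zone, so the exact value may shift by your UTC offset):

```python
from datetime import datetime, timezone

def unix_timestamp(s):
    """Rough equivalent of MySQL's UNIX_TIMESTAMP('YYYY-MM-DD HH:MM:SS').

    UTC is assumed here for a deterministic result; note the date string
    must be a valid calendar date or strptime will raise ValueError.
    """
    dt = datetime.strptime(s, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

print(unix_timestamp("2015-02-28 00:00:00"))  # -> 1425081600
```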
There are a few more details captured under bug 1706 [2] that may shed more light, although hopefully that's enough to point you towards a fix.
- Tim
[1] http://bugzilla.schedmd.com/attachment.cgi?id=2338
[2] http://bugs.schedmd.com/show_bug.cgi?id=1706
This is another symptom of bug 2067; I should have caught that earlier. As part of the communication issues, some jobs are missing valid end times in the database, causing the hourly rollup process to count more time used than is possible. I'm going to close this out in deference to 2067 instead.

- Tim

*** This ticket has been marked as a duplicate of ticket 2067 ***

We talked this over and concluded that may have been a bit quick. The two are most likely related, but the fix for this is different than the one for your database communication issues. You'll still need to locate these runaway job records and correct or remove them from the database directly. If you can give us any additional detail on the start times for these jobs, it may help narrow down the window a bit.

We're also noticing that each of these shows up with a different timestamp, so we're wondering if there may be something external to slurmdbd interacting with the database and potentially causing some locking issues. Are there any cron jobs backing up, exporting, or otherwise interfacing with MySQL directly that may be causing some contention?

- Tim

We have xdmod slurping data nightly (but via an sacct dump, so it shouldn't be that bad) and we have a MySQL backup job (it seems to run at 11:05pm local time for some 40 minutes), but it does the dump in a safe (--single-transaction) way, so in theory it is not disruptive. To be more precise:
mysqldump -u${USER} -p${PASS} --opt --flush-logs --single-transaction \
    --ignore-table=mysql.event --skip-triggers --skip-routines \
    slurm_acct_db
I did indeed start fixing the database. So far I have "fixed" one cluster (one of the updates was: update della_job_table set time_end=time_start where id_job=5858358;). Three more clusters to go through, and those have far more bad jobs than della, which had only 4.
Della's 4 bad jobs were all started between 2015-10-21T13:02:22 (1 job with a 2-hour allocation) and 2015-10-21T14:16:38 (3 jobs with 4-hour allocations). ReqTres info is present but not much else.
I'll look at other clusters as well - they have far more bad jobs.
Didn't have time to look at the other clusters, but fixing della's lost jobs did not stop the complaints.

[root@della4 plazonic]# ./lost.pl
CLUSTER:della
[root@della4 plazonic]#

[2015-10-27T19:00:05.622] error: We have more allocated time than is possible (368585970 > 12441600) for cluster della(3456) from 2015-10-27T18:00:00 - 2015-10-27T19:00:00 tres 1
[2015-10-27T19:00:05.622] error: We have more allocated time than is possible (1418975235801 > 77414400000) for cluster della(21504000) from 2015-10-27T18:00:00 - 2015-10-27T19:00:00 tres 2
[2015-10-27T20:00:05.423] error: We have more allocated time than is possible (392200987 > 12441600) for cluster della(3456) from 2015-10-27T19:00:00 - 2015-10-27T20:00:00 tres 1
[2015-10-27T20:00:05.423] error: We have more allocated time than is possible (1518947735928 > 77414400000) for cluster della(21504000) from 2015-10-27T19:00:00 - 2015-10-27T20:00:00 tres 2

Can you check in on della again and see if there are any new runaway jobs? You're seeing ~30x more compute time each hour than is possible for your cluster, so it must be coming from somewhere. Is this happening on the other clusters as well, and were you seeing any different log messages from them?

No, there are no lost/runaway jobs (as reported by the script) on della, and yes, we are seeing this for all 3 clusters sharing slurmdbd. As far as the design goes: slurmdbd runs on a separate machine that all clusters can reach. It does not have any Slurm configs; indeed, it complains at startup about a missing slurm.conf file, but why should it need one? Anyway, I am not 100% sure that slurmdbd can talk to the slurmctlds (though clearly they can talk to slurmdbd).
And logs indicating that all clusters suffer from this issue:

[2015-10-29T09:00:00.705] error: We have more allocated time than is possible (163674715 > 37382400) for cluster tiger(10384) from 2015-10-28T17:00:00 - 2015-10-28T18:00:00 tres 1
[2015-10-29T09:00:00.705] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T17:00:00 - 2015-10-28T18:00:00 tres 1
[2015-10-29T09:00:00.705] error: We have more allocated time than is possible (575372884552 > 172339200000) for cluster tiger(47872000) from 2015-10-28T17:00:00 - 2015-10-28T18:00:00 tres 2
[2015-10-29T09:00:00.705] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T17:00:00 - 2015-10-28T18:00:00 tres 2
[2015-10-29T09:00:00.717] error: We have more allocated time than is possible (109632656 > 13305600) for cluster orbital(3696) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 1
[2015-10-29T09:00:00.717] error: We have more time than is possible (13305600+604800+0)(13910400) > 13305600 for cluster orbital(3696) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 1
[2015-10-29T09:00:00.717] error: We have more allocated time than is possible (164887256000 > 53222400000) for cluster orbital(14784000) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 2
[2015-10-29T09:00:00.717] error: We have more time than is possible (53222400000+2419200000+0)(55641600000) > 53222400000 for cluster orbital(14784000) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 2
[2015-10-29T09:00:00.755] error: We have more allocated time than is possible (158313041 > 37382400) for cluster tiger(10384) from 2015-10-28T18:00:00 - 2015-10-28T19:00:00 tres 1
[2015-10-29T09:00:00.755] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T18:00:00 - 2015-10-28T19:00:00 tres 1
[2015-10-29T09:00:00.755] error: We have more allocated time than is possible (562198844600 > 172339200000) for cluster tiger(47872000) from 2015-10-28T18:00:00 - 2015-10-28T19:00:00 tres 2
[2015-10-29T09:00:00.755] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T18:00:00 - 2015-10-28T19:00:00 tres 2
[2015-10-29T09:00:00.807] error: We have more allocated time than is possible (173781396 > 37382400) for cluster tiger(10384) from 2015-10-28T19:00:00 - 2015-10-28T20:00:00 tres 1
[2015-10-29T09:00:00.807] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T19:00:00 - 2015-10-28T20:00:00 tres 1
[2015-10-29T09:00:00.807] error: We have more allocated time than is possible (646419575600 > 172339200000) for cluster tiger(47872000) from 2015-10-28T19:00:00 - 2015-10-28T20:00:00 tres 2
[2015-10-29T09:00:00.807] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T19:00:00 - 2015-10-28T20:00:00 tres 2
[2015-10-29T09:00:00.831] error: We have more allocated time than is possible (2337371400 > 12441600) for cluster della(3456) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 1
[2015-10-29T09:00:00.831] error: We have more allocated time than is possible (9519241078976 > 77414400000) for cluster della(21504000) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 2
[2015-10-29T09:00:00.852] error: We have more allocated time than is possible (182051219 > 37382400) for cluster tiger(10384) from 2015-10-28T20:00:00 - 2015-10-28T21:00:00 tres 1
[2015-10-29T09:00:00.852] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T20:00:00 - 2015-10-28T21:00:00 tres 1
[2015-10-29T09:00:00.852] error: We have more allocated time than is possible (714777476360 > 172339200000) for cluster tiger(47872000) from 2015-10-28T20:00:00 - 2015-10-28T21:00:00 tres 2
[2015-10-29T09:00:00.852] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T20:00:00 - 2015-10-28T21:00:00 tres 2
[2015-10-29T09:00:00.896] error: We have more allocated time than is possible (197420130 > 37382400) for cluster tiger(10384) from 2015-10-28T21:00:00 - 2015-10-28T22:00:00 tres 1
[2015-10-29T09:00:00.896] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T21:00:00 - 2015-10-28T22:00:00 tres 1
[2015-10-29T09:00:00.896] error: We have more allocated time than is possible (1012108251400 > 172339200000) for cluster tiger(47872000) from 2015-10-28T21:00:00 - 2015-10-28T22:00:00 tres 2
[2015-10-29T09:00:00.896] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T21:00:00 - 2015-10-28T22:00:00 tres 2
[2015-10-29T09:00:00.921] error: We have more allocated time than is possible (201042368 > 37382400) for cluster tiger(10384) from 2015-10-28T22:00:00 - 2015-10-28T23:00:00 tres 1
[2015-10-29T09:00:00.921] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T22:00:00 - 2015-10-28T23:00:00 tres 1
[2015-10-29T09:00:00.921] error: We have more allocated time than is possible (1011152569200 > 172339200000) for cluster tiger(47872000) from 2015-10-28T22:00:00 - 2015-10-28T23:00:00 tres 2
[2015-10-29T09:00:00.921] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T22:00:00 - 2015-10-28T23:00:00 tres 2
[2015-10-29T09:00:00.944] error: We have more allocated time than is possible (190524023 > 37382400) for cluster tiger(10384) from 2015-10-28T23:00:00 - 2015-10-29T00:00:00 tres 1
[2015-10-29T09:00:00.944] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-28T23:00:00 - 2015-10-29T00:00:00 tres 1
[2015-10-29T09:00:00.944] error: We have more allocated time than is possible (727171735200 > 172339200000) for cluster tiger(47872000) from 2015-10-28T23:00:00 - 2015-10-29T00:00:00 tres 2
[2015-10-29T09:00:00.944] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-28T23:00:00 - 2015-10-29T00:00:00 tres 2
[2015-10-29T09:00:00.969] error: We have more allocated time than is possible (193484799 > 37382400) for cluster tiger(10384) from 2015-10-29T00:00:00 - 2015-10-29T01:00:00 tres 1
[2015-10-29T09:00:00.969] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T00:00:00 - 2015-10-29T01:00:00 tres 1
[2015-10-29T09:00:00.969] error: We have more allocated time than is possible (712515734400 > 172339200000) for cluster tiger(47872000) from 2015-10-29T00:00:00 - 2015-10-29T01:00:00 tres 2
[2015-10-29T09:00:00.969] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T00:00:00 - 2015-10-29T01:00:00 tres 2
[2015-10-29T09:00:00.994] error: We have more allocated time than is possible (195719364 > 37382400) for cluster tiger(10384) from 2015-10-29T01:00:00 - 2015-10-29T02:00:00 tres 1
[2015-10-29T09:00:00.994] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T01:00:00 - 2015-10-29T02:00:00 tres 1
[2015-10-29T09:00:00.994] error: We have more allocated time than is possible (707546413680 > 172339200000) for cluster tiger(47872000) from 2015-10-29T01:00:00 - 2015-10-29T02:00:00 tres 2
[2015-10-29T09:00:00.994] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T01:00:00 - 2015-10-29T02:00:00 tres 2
[2015-10-29T09:00:01.018] error: We have more allocated time than is possible (193018208 > 37382400) for cluster tiger(10384) from 2015-10-29T02:00:00 - 2015-10-29T03:00:00 tres 1
[2015-10-29T09:00:01.018] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T02:00:00 - 2015-10-29T03:00:00 tres 1
[2015-10-29T09:00:01.018] error: We have more allocated time than is possible (673781790400 > 172339200000) for cluster tiger(47872000) from 2015-10-29T02:00:00 - 2015-10-29T03:00:00 tres 2
[2015-10-29T09:00:01.018] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T02:00:00 - 2015-10-29T03:00:00 tres 2
[2015-10-29T09:00:01.041] error: We have more allocated time than is possible (190585550 > 37382400) for cluster tiger(10384) from 2015-10-29T03:00:00 - 2015-10-29T04:00:00 tres 1
[2015-10-29T09:00:01.041] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T03:00:00 - 2015-10-29T04:00:00 tres 1
[2015-10-29T09:00:01.041] error: We have more allocated time than is possible (674758428000 > 172339200000) for cluster tiger(47872000) from 2015-10-29T03:00:00 - 2015-10-29T04:00:00 tres 2
[2015-10-29T09:00:01.041] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T03:00:00 - 2015-10-29T04:00:00 tres 2
[2015-10-29T09:00:01.064] error: We have more allocated time than is possible (203983033 > 37382400) for cluster tiger(10384) from 2015-10-29T04:00:00 - 2015-10-29T05:00:00 tres 1
[2015-10-29T09:00:01.064] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T04:00:00 - 2015-10-29T05:00:00 tres 1
[2015-10-29T09:00:01.064] error: We have more allocated time than is possible (714000817400 > 172339200000) for cluster tiger(47872000) from 2015-10-29T04:00:00 - 2015-10-29T05:00:00 tres 2
[2015-10-29T09:00:01.064] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T04:00:00 - 2015-10-29T05:00:00 tres 2
[2015-10-29T09:00:01.086] error: We have more allocated time than is possible (194270318 > 37382400) for cluster tiger(10384) from 2015-10-29T05:00:00 - 2015-10-29T06:00:00 tres 1
[2015-10-29T09:00:01.086] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T05:00:00 - 2015-10-29T06:00:00 tres 1
[2015-10-29T09:00:01.086] error: We have more allocated time than is possible (680541284800 > 172339200000) for cluster tiger(47872000) from 2015-10-29T05:00:00 - 2015-10-29T06:00:00 tres 2
[2015-10-29T09:00:01.086] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T05:00:00 - 2015-10-29T06:00:00 tres 2
[2015-10-29T09:00:01.107] error: We have more allocated time than is possible (191678601 > 37382400) for cluster tiger(10384) from 2015-10-29T06:00:00 - 2015-10-29T07:00:00 tres 1
[2015-10-29T09:00:01.107] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T06:00:00 - 2015-10-29T07:00:00 tres 1
[2015-10-29T09:00:01.107] error: We have more allocated time than is possible (667025485600 > 172339200000) for cluster tiger(47872000) from 2015-10-29T06:00:00 - 2015-10-29T07:00:00 tres 2
[2015-10-29T09:00:01.107] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T06:00:00 - 2015-10-29T07:00:00 tres 2
[2015-10-29T09:00:01.129] error: We have more allocated time than is possible (188878930 > 37382400) for cluster tiger(10384) from 2015-10-29T07:00:00 - 2015-10-29T08:00:00 tres 1
[2015-10-29T09:00:01.129] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T07:00:00 - 2015-10-29T08:00:00 tres 1
[2015-10-29T09:00:01.129] error: We have more allocated time than is possible (646713798600 > 172339200000) for cluster tiger(47872000) from 2015-10-29T07:00:00 - 2015-10-29T08:00:00 tres 2
[2015-10-29T09:00:01.129] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T07:00:00 - 2015-10-29T08:00:00 tres 2
[2015-10-29T09:00:01.207] error: We have more allocated time than is possible (194586712 > 37382400) for cluster tiger(10384) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 1
[2015-10-29T09:00:01.207] error: We have more time than is possible (37382400+115200+0)(37497600) > 37382400 for cluster tiger(10384) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 1
[2015-10-29T09:00:01.207] error: We have more allocated time than is possible (652454145000 > 172339200000) for cluster tiger(47872000) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 2
[2015-10-29T09:00:01.207] error: We have more time than is possible (172339200000+460800000+0)(172800000000) > 172339200000 for cluster tiger(47872000) from 2015-10-29T08:00:00 - 2015-10-29T09:00:00 tres 2

We're still thinking that some specific job record(s) must be responsible for this.
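The numbers in these messages are internally consistent: the right-hand ceiling in each "more allocated time than is possible" line is just the cluster's TRES total multiplied by the 3600-second hourly window. A quick sanity check of that arithmetic (TRES totals taken from the log lines above; the helper name is ours, not Slurm's):

```python
HOUR = 3600  # the hourly rollup window, in seconds

def capacity_seconds(tres_total, window=HOUR):
    """Maximum TRES-seconds a cluster can possibly allocate in one window."""
    return tres_total * window

# Cluster totals as printed in the log: della 3456 CPUs / 21504000 MB,
# tiger 10384 CPUs, orbital 3696 CPUs.
assert capacity_seconds(3456) == 12441600         # della, tres 1 (CPU)
assert capacity_seconds(21504000) == 77414400000  # della, tres 2 (memory)
assert capacity_seconds(10384) == 37382400        # tiger, tres 1 (CPU)
assert capacity_seconds(3696) == 13305600         # orbital, tres 1 (CPU)
print("ceilings match the log messages")
```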
If you can run something like this for della (or you can swap in tiger if you'd prefer), we'd appreciate it:

select job.job_db_inx, job.id_job, job.array_task_pending, job.time_submit, job.time_eligible, job.time_start, job.time_end, job.time_suspended, job.cpus_req, job.id_resv, job.tres_alloc, SUM(step.consumed_energy)
from della_job_table as job
left outer join della_step_table as step on job.job_db_inx=step.job_db_inx and (step.id_step>=0)
where (job.time_eligible < 1446141600 && (job.time_end >= 1446138000 || job.time_end = 0))
group by job.job_db_inx
order by job.id_assoc, job.time_eligible;

OK, here it is: http://tigress-web.princeton.edu/~plazonic/della_query.txt
2486 rows - is this bad?

2600 lines is fine - that's the jobs eligible within that one hour on della. I've got that loaded in here, and I'm seeing the same issue with the accounting now, so I can dig in further. Do you mind sending over a few more details? We could use the output from this:

sacctmgr list events format=cluster,clusternodes,timestart,timeend,tres%70 where cluster=della

as well as the current slurm.conf for della.

thanks,
- Tim

Note that I can't run these commands on the slurmdbd machine, as we don't have a slurm.conf file there; the command just fails after trying 60 times, once per second.
Therefore this was run on della's slurmctld host:
sacctmgr list events format=cluster,clusternodes,timestart,timeend,tres%70 where cluster=della
Cluster Cluster Nodes TimeStart TimeEnd TRES
---------- -------------------- ------------------- ------------------- -----------------------------------------------------
della della-r1c1n[1-16],d+ 2015-10-13T08:25:30 Unknown cpu=3456,mem=21504000,energy=0,node=224
In case you need the full list of nodes - but it looks to match:
sacctmgr list events format=clusternodes%280 where cluster=della
Cluster Nodes
-----------------------------------------------------
della-r1c1n[1-16],della-r1c2n[1-16],della-r1c3n[1-16],della-r1c4n[1-16],della-r2c1n[1-16],della-r2c2n[1-16],della-r2c3n[1-16],della-r2c4n[1-16],della-r3c1n[1-16],della-r3c2n[1-16],della-r3c3n[1-16],della-r3c4n[1-16],della-r4c1n[1-16],della-r4c2n[1-16]
and slurm.conf you can find at
http://tigress-web.princeton.edu/~plazonic/slurm.conf
Note that the cluster is non-uniform: some nodes have 20 cores and some have 12.
Thanks for your patience; the results from the diagnostic commands were quite helpful. We think we've got the bug nailed down and a fix for it ready now:

https://github.com/SchedMD/slurm/commit/34e24467093dbd65.patch

The good news is that the individual job records should be okay; it's only the hourly "rollup", which aggregates the job statistics together for quicker reporting by sreport and other commands, that is affected. You'd need to reset the hourly/daily/monthly rollup timestamps to have slurmdbd recalculate past usage as reported by sreport. It's the same as we'd given you before (and you'd need to run it for each cluster):

update della_last_ran_table
SET
    hourly_rollup = UNIX_TIMESTAMP('2015-08-30 00:00:00'),
    daily_rollup = UNIX_TIMESTAMP('2015-08-30 00:00:00'),
    monthly_rollup = UNIX_TIMESTAMP('2015-08-30 00:00:00');

You can adjust those dates forward to whenever you updated slurmdbd to 15.08 - the bug was introduced in this release as part of the TRES work. We expect to have a 15.08.3 out soon with this fix.

- Tim

Thanks! Re-basing our installation to 15.08.2 plus this patch (and a few others...). Will hopefully update tonight and then tell it to recalculate it all, starting with the clusters with fewer jobs. Will update the ticket once I see how it handles the recalculations with the patch.

We'll probably tag 15.08.3 on Tuesday.

I guess then I'll wait - I see a few other fixes that would be good to have.
Please test this tonight if possible. Your results will determine whether we tag or not.

OK, will do.

It looks better: after upgrading slurmdbd and then updating the last-processed timestamps, there are no further complaints about della. We only get:

[2015-11-02T22:08:42.125] error: id_assoc 439 doesn't have any tres
[2015-11-02T22:08:42.125] error: id_assoc 440 doesn't have any tres
[2015-11-02T22:08:42.247] error: id_assoc 439 doesn't have any tres
[2015-11-02T22:08:42.247] error: id_assoc 440 doesn't have any tres
[2015-11-02T22:09:08.384] Warning: Note very large processing time from hourly_rollup for della: usec=124312592 began=22:07:04.072

Not sure what's up with id_assoc, but that's probably unrelated to this ticket. Looks like we have a bunch of lost jobs on tiger, but that's probably related to a switch crash; I'll try cleaning that up and rerunning processing.

Tiger looks good too after cleaning up the "runaway" jobs (note that this event was on 10/29, unrelated to the original report). It sure looks like this patch helped.

Glad to hear that worked out for you. I'm going to go ahead and close the bug; if you see any recurrence, please open it back up.

- Tim

We are still seeing this issue as of 15.08.6, which we just recently upgraded to.
[yqin@perceus-00 ~]$ sbatch -V
slurm 15.08.6

[2016-02-11T09:21:11.555] error: We have more allocated time than is possible (109163769570 > 6307200) for cluster perceus-00(1752) from 2016-02-10T06:00:00 - 2016-02-10T07:00:00 tres 2
[2016-02-11T09:21:11.555] error: We have more time than is possible (6307200+745486+0)(7052686) > 6307200 for cluster perceus-00(1752) from 2016-02-10T06:00:00 - 2016-02-10T07:00:00 tres 2
[2016-02-11T09:21:23.334] error: We have more allocated time than is possible (108556002000 > 6307200) for cluster perceus-00(1752) from 2016-02-10T07:00:00 - 2016-02-10T08:00:00 tres 2
[2016-02-11T09:21:23.334] error: We have more time than is possible (6307200+745200+0)(7052400) > 6307200 for cluster perceus-00(1752) from 2016-02-10T07:00:00 - 2016-02-10T08:00:00 tres 2
[2016-02-11T09:21:31.484] error: We have more allocated time than is possible (108556002000 > 6307200) for cluster perceus-00(1752) from 2016-02-10T08:00:00 - 2016-02-10T09:00:00 tres 2
[2016-02-11T09:21:31.484] error: We have more time than is possible (6307200+745200+0)(7052400) > 6307200 for cluster perceus-00(1752) from 2016-02-10T08:00:00 - 2016-02-10T09:00:00 tres 2

Commit 120b3e6f5cf4 should fix this, and it will be included in Slurm 15.08.8, which is due out soon.

On a related note [1], I'd also suggest setting a minimum value of RealMemory for each node. That would protect against situations where a node comes up with less memory than expected; I've seen cases where a node brings up only half the expected memory after a reboot, leading to problems running jobs. If a node reports less memory than its configured RealMemory, Slurm will mark the node as down; if the configured RealMemory value is the lower one, Slurm will use the value reported by the node.

- Tim

[1] The issue fixed here was that RealMemory defaults to 1 for a node if it is not set; the value "1752" you saw printed in the log message corresponds to the total RealMemory calculated for your cluster - in your case 1 MB per node.
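The "1752" and "6307200" in the messages above are consistent with that explanation: the hourly memory-TRES ceiling is the cluster-wide memory total (in MB) multiplied by the 3600-second window. A sketch of that arithmetic (the helper name is ours; only the 1752 figure comes from the log):

```python
HOUR = 3600  # hourly rollup window, in seconds

def hourly_mem_ceiling(cluster_mem_total_mb):
    """Hourly memory-TRES ceiling: cluster memory total (MB) x window (s)."""
    return cluster_mem_total_mb * HOUR

# With RealMemory left unset, each node defaults to 1 MB, so the cluster
# memory total printed in the log is only 1752 MB, yielding the impossibly
# low ceiling seen in the error messages:
print(hourly_mem_ceiling(1752))  # -> 6307200, matching "(... > 6307200)"
```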
The fix made is to calculate that value from the memory reported by each node (when FastSchedule=0), rather than from the configuration.

Hi all,

It seems that the problem remains in version 15.08.8:

[2016-02-26T07:00:01.043] error: We have more allocated time than is possible (144786519744 > 135425458800) for cluster deneb(37618183) from 2016-02-26T06:00:00 - 2016-02-26T07:00:00 tres 2
[2016-02-26T07:00:01.043] error: We have more time than is possible (135425458800+4602898800+0)(140028357600) > 135425458800 for cluster deneb(37618183) from 2016-02-26T06:00:00 - 2016-02-26T07:00:00 tres 2
[2016-02-26T08:00:00.216] error: We have more allocated time than is possible (144267645650 > 135425458800) for cluster deneb(37618183) from 2016-02-26T07:00:00 - 2016-02-26T08:00:00 tres 2

# sbatch -V
slurm 15.08.8

I launched the perl script and did not receive any results:

# ./lost.pl
CLUSTER:deneb

The first message of that kind appeared on 2016-02-23, one day after the Slurm upgrade (from 14.11.7 to 15.08.8).

Regards,
Jean-Claude

Just to check - has slurmctld been restarted since the update? While the error message is in the database, slurmctld is where the fix was. Usually the fix and the error message are located in the same daemon; this is a rare case where they aren't, so I thought I'd check.

Hi,
Yes, I rebooted the server.

(In reply to Tim Wickberg from comment #27)
> Just to check - slurmctld has been restarted since the update? While the
> error message is in the database, slurmctld is where the fix was.
>
> Usually the fix and error message are located in the same daemon, this is a
> rare time when it isn't so I though I'd check.

Hi Jean-Claude,

Just to confirm, is this still happening? Could you post some recent logs and verify that the lost.pl script still returns nothing?

Created attachment 2875 [details]
slurmdbd.log

Hello,

Yes, the same error messages are still coming. As requested, I attached the slurmdbd.log file. The lost.pl script still returns an empty result.

Regards,
Jean-Claude

Jean-Claude, could you please open a new bug on this, as this was originally Princeton's issue. Thanks.

Hi Danny,

I created a new bug here: https://bugs.schedmd.com/show_bug.cgi?id=2572

Regards,
Jean-Claude