Hi, what is the latest state of the jobs in the database, are they still in running? This would seems to indicate that somehow the controller failed to update the database. David Hi, do you have reservations in the system? The reservations are counted against used cpu time. David Created attachment 2239 [details]
perl script to search for runaway jobs
Bill if you have runaway jobs this script should identify them fairly easy.
Thanks Bill, for some reason your message wasn't part of the bug at all. The answer to this is in comment 4 of the bug. You reply is included here... I’ve answered those previously on June 22, 2015. Attaching email trail. Thanks. **************************************************** From: Schadlich, William W Sent: Monday, June 22, 2015 4:42 PM To: 'bugs@schedmd.com' Subject: RE: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% We shouldn’t have any but how would I check that? Regards, Bill Bill, can you also update the version if changed. The script given in comment 4 shows runaway jobs as it is noted. To list any reservations you would have to go in the database to look before 15.08 which has 'sacctmgr list resv'. If you want to see if you have any reservations on anything before that you can log into the slurm database and run select * from clustername_resv_table; where clustername is the name of your cluster. This should give you the start and end of any reservations in your system. Hey Bill, Are you getting emails from bugzilla and have you tried replying back? We don't see any responses so we are just making sure that your responses aren't getting lost somewhere. If you are getting these, are you currently experiencing this issue? Thanks, Brian Test email Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Thursday, September 24, 2015 2:04 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 7<http://bugs.schedmd.com/show_bug.cgi?id=1706#c7> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Hey Bill, Are you getting emails from bugzilla and have you tried replying back? We don't see any responses so we are just making sure that your responses aren't getting lost somewhere. If you are getting these, are you currently experiencing this issue? Thanks, Brian ________________________________ You are receiving this mail because: * You reported the bug. We can hear you now. From a previous email: Here is what that script determined. I guess I was able to kill most of the database jobs from Nov of last year but somehow new ones are now here. CLUSTER:#myclustername# cant find job 1461 lpkozio subscript+ research RUNNING #myclustername# n0203 1 short 131-16:57:11 cant find job 1462 lpkozio subscript+ research RUNNING #myclustername# n0382 1 short 131-16:57:11 cant find job 1463 lpkozio subscript+ research RUNNING #myclustername# n0318 1 short 131-16:57:11 cant find job 1464 lpkozio subscript+ research RUNNING #myclustername# n0319 1 short 131-16:57:11 cant find job 1465 lpkozio subscript+ research RUNNING #myclustername# n0320 1 short 131-16:57:11 cant find job 1466 lpkozio subscript+ research RUNNING #myclustername# n0682 1 short 131-16:57:08 cant find job 1467 lpkozio subscript+ research RUNNING #myclustername# n0683 1 short 131-16:57:08 cant find job 1468 lpkozio subscript+ research RUNNING #myclustername# n0684 1 short 131-16:57:08 cant find job 1469 lpkozio subscript+ research RUNNING #myclustername# n0210 1 short 131-16:57:08 cant find job 1470 lpkozio subscript+ research RUNNING #myclustername# n0211 1 short 131-16:57:08 cant find job 1471 lpkozio subscript+ research RUNNING #myclustername# n0212 1 short 131-16:57:08 cant find job 1472 lpkozio subscript+ research RUNNING #myclustername# n0213 1 short 131-16:57:08 cant find job 1473 lpkozio subscript+ research RUNNING #myclustername# n0666 1 short 131-16:57:08 cant find job 1474 lpkozio subscript+ research RUNNING #myclustername# n0667 1 short 131-16:57:08 cant find job 1475 lpkozio subscript+ research RUNNING #myclustername# n0668 1 short 131-16:57:08 cant find job 1476 lpkozio subscript+ research RUNNING #myclustername# n0669 1 short 131-16:57:08 cant find job 1494 sraman1 lammps.sc+ research RUNNING #myclustername# n[0352-0356] 5 small 131-16:36:11 Were you able to clear out the long running jobs that the script reported? If not you can use these queries to clean them up.
select * from <cluster>_job_table where id_job = <jobid> and state=1 and time_end = 0;
update <cluster>_job_table set time_end = time_start, state = 3 where id_job=<jobid> and state=1 and time_end=0;
select * from <cluster>_job_table where id_job = <jobid>;
One possible explanation for the long running jobs is that the database was down for a period of time and the controller started throwing away records. The controller will cache quite a few records while the database is down but it will eventually start throwing things away. The logs will show the following messages if the cache is getting to large:
slurmdbd: agent queue filling, RESTART SLURMDBD NOW
this one goes to the syslog:
*** RESTART SLURMDBD NOW ***"
and if it starts to throw things away you'll see:
slurmdbd: agent queue is full, discarding request");
After you clean up these jobs will you run your reports again and see if they are matching up?
Here's one suggestion for your report command line options, you can use the sacct -X option to not display the steps. That way you'll just be looking at the jobs.
Thanks,
Brian
From Bill: So I ran the clean-up scripts provided and the system indicates nothing is running but I go back to months where utilization was low and it still indicates percentages in the 90s where it probably closer to 50-60. Thoughts? [wwschad@clnsand01 admin]$ sreport cluster utilization start=1/1/15 end=1/31/15 -t percent -------------------------------------------------------------------------------- Cluster Utilization 2015-01-01T00:00:00 - 2015-01-30T23:59:59 (2592000*cpus secs) Time reported in Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ------------ ---------- ---------- ------------ --------- ------------ descartes 95.80% 0.03% 0.00% 4.17% 0.00% 100.00% [wwschad@clnsand01 admin]$ sreport cluster utilization start=2/1/15 end=2/28/15 -t percent -------------------------------------------------------------------------------- Cluster Utilization 2015-02-01T00:00:00 - 2015-02-27T23:59:59 (2332800*cpus secs) Time reported in Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ------------ ---------- ---------- ------------ --------- ------------ descartes 48.49% 0.61% 0.00% 50.68% 0.21% 100.00% [wwschad@clnsand01 admin]$ sreport cluster utilization start=12/1/14 end=12/31/14 -t percent -------------------------------------------------------------------------------- Cluster Utilization 2014-12-01T00:00:00 - 2014-12-30T23:59:59 (2592000*cpus secs) Time reported in Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ------------ ---------- ---------- ------------ --------- ------------ descartes 98.79% 0.02% 0.00% 1.18% 0.01% 100.00% Now that the the jobs are cleaned up, the job usage needs to be rolled up again. sreport looks at the rolled up usage.
You can trigger the re-roll by setting the last roll up times to a previous date in the database. For example, since your sreport time is from last December you could set the time to a day before just be safe.
UPDATE = descartes_last_ran_table
SET
hourly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
daily_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');
Or you could set the columns to 0 and it will roll up everything.
By modifying these columns the slurmdbd will do the rollups the next time it does its rollup (which is every hour or on startup). It could take a little bit of time to re-roll depending on how big the database is, but just let it run.
Does the database still have the jobs from last year? Are you doing any purging? Will you attach your slurmdbd.conf?
Will you also show the output of:
select * from descartes_resv_table;
Let us know how it goes or if you have any questions.
Thanks,
Brian
From the slurmdbd log I see the following is it a concern from the work we have done? We currently have everything stored in the db from startup still. [2015-09-02T11:00:00.959] error: We have more time than is possible (15091584+14976000+0)(30067584) > 29030400 for cluster descartes(8064) from 2015-09-02T10:00:00 - 2015-09-02T11:00:00 [2015-09-02T12:00:00.104] error: We have more time than is possible (15093552+14976000+0)(30069552) > 29030400 for cluster descartes(8064) from 2015-09-02T11:00:00 - 2015-09-02T12:00:00 [2015-09-02T13:00:00.271] error: We have more time than is possible (14164480+14976000+0)(29140480) > 29030400 for cluster descartes(8064) from 2015-09-02T12:00:00 - 2015-09-02T13:00:00 [2015-09-02T20:00:00.305] error: We have more time than is possible (15091200+14976000+0)(30067200) > 29030400 for cluster descartes(8064) from 2015-09-02T19:00:00 - 2015-09-02T20:00:00 [2015-09-02T21:00:00.460] error: We have more time than is possible (15091200+14976000+0)(30067200) > 29030400 for cluster descartes(8064) from 2015-09-02T20:00:00 - 2015-09-02T21:00:00 [2015-09-02T22:00:00.659] error: We have more time than is possible (14767616+14976000+0)(29743616) > 29030400 for cluster descartes(8064) from 2015-09-02T21:00:00 - 2015-09-02T22:00:00 [2015-09-02T23:00:00.829] error: We have more time than is possible (15037392+14976000+0)(30013392) > 29030400 for cluster descartes(8064) from 2015-09-02T22:00:00 - 2015-09-02T23:00:00 [2015-09-03T00:00:00.983] error: We have more time than is possible (14868336+14976000+0)(29844336) > 29030400 for cluster descartes(8064) from 2015-09-02T23:00:00 - 2015-09-03T00:00:00 [2015-09-03T01:00:00.195] error: We have more time than is possible (14165504+14976000+0)(29141504) > 29030400 for cluster descartes(8064) from 2015-09-03T00:00:00 - 2015-09-03T01:00:00 [2015-09-21T09:06:43.643] error: Problem getting jobs for cluster -s [2015-10-01T09:00:00.661] error: We have more time than is possible (10574560+19656576+0)(30231136) > 29030400 for cluster descartes(8064) from 2015-10-01T08:00:00 - 2015-10-01T09:00:00 [2015-10-01T10:00:00.834] error: We have more time than is possible (4723200+25516800+0)(30240000) > 29030400 for cluster descartes(8064) from 2015-10-01T09:00:00 - 2015-10-01T10:00:00 [2015-10-01T14:05:04.576] Terminate signal (SIGINT or SIGTERM) received [2015-10-01T14:08:50.631] Accounting storage MYSQL plugin loaded [2015-10-01T14:08:50.689] slurmdbd version 14.11.5 started [2015-10-01T14:13:49.515] DBD_CLUSTER_CPUS: cluster not registered Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Wednesday, September 30, 2015 4:35 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 10<http://bugs.schedmd.com/show_bug.cgi?id=1706#c10> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Were you able to clear out the long running jobs that the script reported? If not you can use these queries to clean them up. select * from <cluster>_job_table where id_job = <jobid> and state=1 and time_end = 0; update <cluster>_job_table set time_end = time_start, state = 3 where id_job=<jobid> and state=1 and time_end=0; select * from <cluster>_job_table where id_job = <jobid>; One possible explanation for the long running jobs is that the database was down for a period of time and the controller started throwing away records. The controller will cache quite a few records while the database is down but it will eventually start throwing things away. The logs will show the following messages if the cache is getting to large: slurmdbd: agent queue filling, RESTART SLURMDBD NOW this one goes to the syslog: *** RESTART SLURMDBD NOW ***" and if it starts to throw things away you'll see: slurmdbd: agent queue is full, discarding request"); After you clean up these jobs will you run your reports again and see if they are matching up? Here's one suggestion for your report command line options, you can use the sacct -X option to not display the steps. That way you'll just be looking at the jobs. Thanks, Brian ________________________________ You are receiving this mail because: * You reported the bug. How do you reset the re-rollup? Is it a command line or db command? Here is the output you requested: mysql> select * from descartes_resv_table; Empty set (0.00 sec) Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Thursday, October 01, 2015 5:25 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 12<http://bugs.schedmd.com/show_bug.cgi?id=1706#c12> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Now that the the jobs are cleaned up, the job usage needs to be rolled up again. sreport looks at the rolled up usage. You can trigger the re-roll by setting the last roll up times to a previous date in the database. For example, since your sreport time is from last December you could set the time to a day before just be safe. UPDATE = descartes_last_ran_table SET hourly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'), daily_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'), monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'); Or you could set the columns to 0 and it will roll up everything. By modifying these columns the slurmdbd will do the rollups the next time it does its rollup (which is every hour or on startup). It could take a little bit of time to re-roll depending on how big the database is, but just let it run. Does the database still have the jobs from last year? Are you doing any purging? Will you attach your slurmdbd.conf? Will you also show the output of: select * from descartes_resv_table; Let us know how it goes or if you have any questions. Thanks, Brian ________________________________ You are receiving this mail because: * You reported the bug. You can actually do it using "sacctmgr roll <timestamp>" but we recommend just modifying the database directly and letting the dbd handle it the next time it does a rollup. In 14.11, it's possible to have the "sacctmgr roll" and the normal rollup happen at the same time, so we want to avoid that. You can update the database using the UPDATE query in Comment 14. You can modify the time to how far back you want the re-rollup to start from. There are log messages logged at debug2 that say when the rollup is done: "Everything rolled up" Or you can look at the descartes_last_ran_table and look for the numbers to change. Actually, Comment 12 and not Comment 14. 14 has it in reply as well. Regarding the messages: [2015-09-02T11:00:00.959] error: We have more time than is possible (15091584+14976000+0)(30067584) > 29030400 for cluster descartes(8064) from 2015-09-02T10:00:00 - 2015-09-02T11:00:00 These will go away after the re-rollup. The long running jobs that you cleaned up were being accounted for in this month's rollups. So the re-rollup will take these jobs out of this month's usage. There seems to be a syntax error is the db sql statement. Sadly I gather it's the timestamp command itself not supported in mysql but I haven't researched it in-depth.
mysql> UPDATE = descartes_last_ran_table SET hourly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'), daily_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'), monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '= descartes_last_ran_table SET hourly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00' at line 1
Regards,
Bill
______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing
EMRE Support
908-730-1011
Ticket Support:
http://itservices
Select "Get Assistance"
Select "High Performance Cluster"
Select - Request Type "Downstream: All Requests"
In Description of Issue* field
"Type name of server and problem you are having including your ID"
http://goto/bill
From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Friday, October 02, 2015 12:20 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%
Comment # 15<http://bugs.schedmd.com/show_bug.cgi?id=1706#c15> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>
You can actually do it using "sacctmgr roll <timestamp>" but we recommend just
modifying the database directly and letting the dbd handle it the next time it
does a rollup. In 14.11, it's possible to have the "sacctmgr roll" and the
normal rollup happen at the same time, so we want to avoid that.
You can update the database using the UPDATE query in Comment 14<http://bugs.schedmd.com/show_bug.cgi?id=1706#c14>. You can
modify the time to how far back you want the re-rollup to start from.
There are log messages logged at debug2 that say when the rollup is done:
"Everything rolled up"
Or you can look at the descartes_last_ran_table and look for the numbers to
change.
________________________________
You are receiving this mail because:
* You reported the bug.
Oops. There's a rouge '=' after the UPDATE. It should look like this:
UPDATE descartes_last_ran_table
SET
hourly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
daily_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');
:) much better Query OK, 1 row affected (0.04 sec) Rows matched: 1 Changed: 1 Warnings: 0 Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Friday, October 02, 2015 2:10 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 19<http://bugs.schedmd.com/show_bug.cgi?id=1706#c19> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Oops. There's a rouge '=' after the UPDATE. It should look like this: UPDATE descartes_last_ran_table SET hourly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'), daily_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'), monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'); ________________________________ You are receiving this mail because: * You reported the bug. Slurmdbd.conf # # Example slurmdbd.conf file. # # See the slurmdbd.conf man page for more information. # # Archive info #ArchiveJobs=yes #ArchiveDir="/tmp" #ArchiveSteps=yes #ArchiveScript= #JobPurge=12 #StepPurge=1 # # Authentication info AuthType=auth/munge #AuthInfo=/var/run/munge/munge.socket.2 # # slurmDBD info DbdAddr=localhost DbdHost=localhost DbdPort=6819 SlurmUser=slurm #MessageTimeout=300 DebugLevel=4 #DefaultQOS=normal,standby LogFile=/var/log/slurm/slurmdbd.log PidFile=/var/run/slurm/slurmdbd.pid #PluginDir=/usr/lib/slurm #PrivateData=accounts,users,usage,jobs #TrackWCKey=yes # # Database info StorageType=accounting_storage/mysql StorageHost=localhost #StoragePort=1234 # Just to avoid other clients (e.g. rtm) possibly corrupting db StoragePass=descartes StorageUser=slurm StorageLoc=slurm_acct_db Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Thursday, October 01, 2015 5:25 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 12<http://bugs.schedmd.com/show_bug.cgi?id=1706#c12> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Now that the the jobs are cleaned up, the job usage needs to be rolled up again. sreport looks at the rolled up usage. You can trigger the re-roll by setting the last roll up times to a previous date in the database. For example, since your sreport time is from last December you could set the time to a day before just be safe. UPDATE = descartes_last_ran_table SET hourly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'), daily_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'), monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'); Or you could set the columns to 0 and it will roll up everything. By modifying these columns the slurmdbd will do the rollups the next time it does its rollup (which is every hour or on startup). It could take a little bit of time to re-roll depending on how big the database is, but just let it run. Does the database still have the jobs from last year? Are you doing any purging? Will you attach your slurmdbd.conf? Will you also show the output of: select * from descartes_resv_table; Let us know how it goes or if you have any questions. Thanks, Brian ________________________________ You are receiving this mail because: * You reported the bug. Thanks, Bill. When the rollups are done will you check your reports again and let us know how they look? Looks much better now, though how does the reserve work since I didn't think we were using any. Is that the total percent people waiting for resources queued? Thanks. [wwschad@clnxcat02 ~]$ sreport cluster utilization start=1/1/15 end=1/31/15 -t percent -------------------------------------------------------------------------------- Cluster Utilization 2015-01-01T00:00:00 - 2015-01-30T23:59:59 (2592000*cpus secs) Time reported in Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ------------ ---------- ---------- ------------ --------- ------------ descartes 46.31% 10.15% 0.00% 41.40% 2.15% 100.00% [wwschad@clnxcat02 ~]$ sreport cluster utilization start=2/1/15 end=2/31/15 -t percent -------------------------------------------------------------------------------- Cluster Utilization 2015-02-01T00:00:00 - 2015-03-02T23:59:59 (2592000*cpus secs) Time reported in Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ------------ ---------- ---------- ------------ --------- ------------ descartes 48.71% 0.55% 0.00% 48.35% 2.39% 100.00% Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Friday, October 02, 2015 4:22 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 22<http://bugs.schedmd.com/show_bug.cgi?id=1706#c22> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Thanks, Bill. When the rollups are done will you check your reports again and let us know how they look? ________________________________ You are receiving this mail because: * You reported the bug. Good to hear! From the sreport man page: "Reserved time refers to time that a job was waiting for resources after the job had become eligible. If the value is not of importance for you the number should be grouped with idle time." For example if a job that needs four nodes is the highest priority job but it can't run becausee only two are available. The time waiting for idle resources up to 2 nodes is counted as reserved time. The time waiting for the busy nodes gets put into the unshown overcommit column. If you had a high reserve time you could say that either you need more resources or backfill could do a better job. The unshown overcommit column can also be helpful to show how much backlog you had of jobs if you had more resources, which contains all pending time for jobs waiting for resources. Let us know if you have any other questions. Thanks, Brian Very good but I am little concerned that later months are still off. [wwschad@clnhpc01 ~]$ sreport cluster utilization start=4/01/15 end=4/31/15 -t percent -------------------------------------------------------------------------------- Cluster Utilization 2015-04-01T00:00:00 - 2015-04-30T23:59:59 (2592000*cpus secs) Time reported in Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ------------ ---------- ---------- ------------ --------- ------------ descartes 96.69% 0.54% 0.00% 2.77% 0.00% 100.00% Whereas I calculated about 50.54% utilization. Thoughts? Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Wednesday, October 07, 2015 2:35 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 24<http://bugs.schedmd.com/show_bug.cgi?id=1706#c24> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Good to hear! From the sreport man page: "Reserved time refers to time that a job was waiting for resources after the job had become eligible. If the value is not of importance for you the number should be grouped with idle time." For example if a job that needs four nodes is the highest priority job but it can't run becausee only two are available. The time waiting for idle resources up to 2 nodes is counted as reserved time. The time waiting for the busy nodes gets put into the unshown overcommit column. If you had a high reserve time you could say that either you need more resources or backfill could do a better job. The unshown overcommit column can also be helpful to show how much backlog you had of jobs if you had more resources, which contains all pending time for jobs waiting for resources. Let us know if you have any other questions. Thanks, Brian ________________________________ You are receiving this mail because: * You reported the bug. Just for confirmation will you send the output of: select * from descartes_last_ran_table; What is your current sacct line that you are using for your reports? I recommend using the -X and -T options. Will you send your dbd logs from when the re-rollup happened? Thanks, Brian Created attachment 2280 [details] slurmdbd.log.tgz mysql> select * from descartes_last_ran_table; +---------------+--------------+----------------+ | hourly_rollup | daily_rollup | monthly_rollup | +---------------+--------------+----------------+ | 1444248000 | 1444190400 | 1443672000 | +---------------+--------------+----------------+ 1 row in set (0.00 sec) To compensate for the db issues we have been having for awhile I don't sacct for report since they weren't accurate. I have a cronjob running every 5 mins to take a sample of jobs running in the system. Using this method with a small margin of error I am able to determine utilization and availability. sinfo --format="%10D %20F %8t %15U %20H %10A %15C %25E %N" >> /home/slurm/acct_logs/slurm_daily_acct.`date +%Y%m%d` Also I am confused what you mean by dbd logs when the rollup happened? I guess you are talking about the slurmdbd.log on the master so I've attached those, let me know if that is not the cause. Thanks. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Wednesday, October 07, 2015 4:54 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 26<http://bugs.schedmd.com/show_bug.cgi?id=1706#c26> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Just for confirmation will you send the output of: select * from descartes_last_ran_table; What is your current sacct line that you are using for your reports? I recommend using the -X and -T options. Will you send your dbd logs from when the re-rollup happened? Thanks, Brian ________________________________ You are receiving this mail because: * You reported the bug. Looks like there may be some runaway jobs still in the database -- since there are a lot of these messages in the log: "We have more time than is possible " Will you run these commands? Do you remember doing a cold start (slurmctld -c) on April 29th? sacct -X -S 2014-12-11T17:00:00 -E 2014-12-11T18:00:00 sacct -X -S 2015-04-29T11:00:00 -E 2015-04-29T12:00:00 sacct -X -S 2015-02-19T23:00:00 -E 2015-02-20T00:00:00 sacct -X -S 2015-01-20T10:00:00 -E 2015-01-20T11:00:00 sacctmgr list events start=2015-04-29T12:00:00 end=2015-04-29T13:00:0 Created attachment 2284 [details] slurmacctmgr_listing.08OCT2015.txt April 29 I believe was an upgrade to Slurm day from 2.6.7 to 14.11. Here is the output of the next 4 scripts but the last sacctmgr script indicates the following error and then posts every event it looks like. [wwschad@clnxcat02 ~]$ sacct -X -S 2014-12-11T17:00:00 -E 2014-12-11T18:00:00 JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- [wwschad@clnxcat02 ~]$ sacct -X -S 2015-04-29T11:00:00 -E 2015-04-29T12:00:00 JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- [wwschad@clnxcat02 ~]$ sacct -X -S 2015-02-19T23:00:00 -E 2015-02-20T00:00:00 JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- [wwschad@clnxcat02 ~]$ sacct -X -S 2015-01-20T10:00:00 -E 2015-01-20T11:00:00 JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- [wwschad@clnxcat02 ~]$ sacctmgr list events start=2015-04-29T12:00:00 end=2015-04-29T13:00:0 Invalid time specification (pos=18): 2015-04-29T13:00:0 The output is attached. Thanks. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Wednesday, October 07, 2015 5:51 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 28<http://bugs.schedmd.com/show_bug.cgi?id=1706#c28> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Looks like there may be some runaway jobs still in the database -- since there are a lot of these messages in the log: "We have more time than is possible " Will you run these commands? Do you remember doing a cold start (slurmctld -c) on April 29th? sacct -X -S 2014-12-11T17:00:00 -E 2014-12-11T18:00:00 sacct -X -S 2015-04-29T11:00:00 -E 2015-04-29T12:00:00 sacct -X -S 2015-02-19T23:00:00 -E 2015-02-20T00:00:00 sacct -X -S 2015-01-20T10:00:00 -E 2015-01-20T11:00:00 sacctmgr list events start=2015-04-29T12:00:00 end=2015-04-29T13:00:0 ________________________________ You are receiving this mail because: * You reported the bug. Created attachment 2285 [details]
quries.txt
Will you run the queries in the attached file like so:
mysql slurm_acct_db < quries.txt 2>&1 > out.txt
and send the output file back?
Created attachment 2287 [details] out.txt As requested. Thanks. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Thursday, October 08, 2015 7:05 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 30<http://bugs.schedmd.com/show_bug.cgi?id=1706#c30> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Created attachment 2285 [details]<attachment.cgi?id=2285> [details]<attachment.cgi?id=2285&action=edit> quries.txt Will you run the queries in the attached file like so: mysql slurm_acct_db < quries.txt 2>&1 > out.txt and send the output file back? ________________________________ You are receiving this mail because: * You reported the bug. I believe I found some culprits. Will you run this query to verify that you see what I see? select job_db_inx, id_job, from_unixtime(time_submit), from_unixtime(time_start), from_unixtime(time_end), (time_end - time_start)/(3600*24) from exxon_job_table2 where (time_end - time_start) > (3600 * 24 * 80); I'm getting: +------------+--------+----------------------------+---------------------------+-------------------------+-----------------------------------+ | job_db_inx | id_job | from_unixtime(time_submit) | from_unixtime(time_start) | from_unixtime(time_end) | (time_end - time_start)/(3600*24) | +------------+--------+----------------------------+---------------------------+-------------------------+-----------------------------------+ | 2042 | 1228 | 2014-05-06 12:29:52 | 2014-05-06 12:30:12 | 2015-04-29 09:17:23 | 357.8661 | | 2179 | 1342 | 2014-05-08 04:49:53 | 2014-05-08 04:49:53 | 2015-04-29 09:17:23 | 356.1858 | | 2281 | 1440 | 2014-05-12 08:24:54 | 2014-05-12 08:25:09 | 2015-04-29 09:17:23 | 352.0363 | | 115281 | 80929 | 2015-04-29 07:55:40 | 1969-12-31 16:00:00 | 2015-04-29 09:17:23 | 16554.6787 | | 115282 | 80930 | 2015-04-29 08:08:37 | 1969-12-31 16:00:00 | 2015-04-29 08:08:50 | 16554.6311 | | 115285 | 80932 | 2015-04-29 08:10:59 | 1969-12-31 16:00:00 | 2015-04-29 08:11:08 | 16554.6327 | +------------+--------+----------------------------+---------------------------+-------------------------+-----------------------------------+ 6 rows in set (0.01 sec) 1228, 1342 and 1440 are the jobids that existed in both the queries that you ran for me (December 2014 and April 2015). The others are in the April 2015 batch. Is it normal for jobs to run 30 to 70 days? I see a lot that run for about 50 days. Thanks, Brian Created attachment 2288 [details] slurm_select_output.txt I made a minor modification to the script since it had the wrong table in it during run. There have been times when jobs ran a few months but over the past six we have had cluster issues which the filesystem has crashed and all jobs where cancelled. Anything prior to April 29th we had a complete cluster shutdown to upgrade slurm to 14.11. Attached is the output from your query request. Thanks. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Friday, October 09, 2015 5:55 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 32<http://bugs.schedmd.com/show_bug.cgi?id=1706#c32> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> I believe I found some culprits. Will you run this query to verify that you see what I see? select job_db_inx, id_job, from_unixtime(time_submit), from_unixtime(time_start), from_unixtime(time_end), (time_end - time_start)/(3600*24) from exxon_job_table2 where (time_end - time_start) > (3600 * 24 * 80); I'm getting: +------------+--------+----------------------------+---------------------------+-------------------------+-----------------------------------+ | job_db_inx | id_job | from_unixtime(time_submit) | from_unixtime(time_start) | from_unixtime(time_end) | (time_end - time_start)/(3600*24) | +------------+--------+----------------------------+---------------------------+-------------------------+-----------------------------------+ | 2042 | 1228 | 2014-05-06 12:29:52 | 2014-05-06 12:30:12 | 2015-04-29 09:17:23 | 357.8661 | | 2179 | 1342 | 2014-05-08 04:49:53 | 2014-05-08 04:49:53 | 2015-04-29 09:17:23 | 356.1858 | | 2281 | 1440 | 2014-05-12 08:24:54 | 2014-05-12 08:25:09 | 2015-04-29 09:17:23 | 352.0363 | | 115281 | 80929 | 2015-04-29 07:55:40 | 1969-12-31 16:00:00 | 2015-04-29 09:17:23 | 16554.6787 | | 115282 | 80930 | 2015-04-29 08:08:37 | 1969-12-31 16:00:00 | 2015-04-29 08:08:50 | 16554.6311 | | 115285 | 80932 | 2015-04-29 08:10:59 | 1969-12-31 16:00:00 | 2015-04-29 08:11:08 | 16554.6327 | +------------+--------+----------------------------+---------------------------+-------------------------+-----------------------------------+ 6 rows in set (0.01 sec) 1228, 1342 and 1440 are the jobids that existed in both the queries that you ran for me (December 2014 and April 2015). The others are in the April 2015 batch. Is it normal for jobs to run 30 to 70 days? I see a lot that run for about 50 days. Thanks, Brian ________________________________ You are receiving this mail because: * You reported the bug. Will you run this query too? It adds the time_eligible column. Thanks, Brian select job_db_inx, id_job, from_unixtime(time_submit), from_unixtime(time_eligible), from_unixtime(time_start), from_unixtime(time_end), (time_end - time_eligible)/(3600*24), (time_end - time_start)/(3600*24) from descartes_job_table where (time_end - time_start) > (3600 * 24 * 80); Created attachment 2289 [details] slurm_select_output2.txt As requested. Thanks. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Monday, October 12, 2015 11:29 AM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 34<http://bugs.schedmd.com/show_bug.cgi?id=1706#c34> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Will you run this query too? It adds the time_eligible column. Thanks, Brian select job_db_inx, id_job, from_unixtime(time_submit), from_unixtime(time_eligible), from_unixtime(time_start), from_unixtime(time_end), (time_end - time_eligible)/(3600*24), (time_end - time_start)/(3600*24) from descartes_job_table where (time_end - time_start) > (3600 * 24 * 80); ________________________________ You are receiving this mail because: * You reported the bug. Do you see any errors in slurmctld logs? Even in the last couple of days, it looks like the time_start messages aren't getting to database. There are errors from the slurmdbd.log [2015-10-02T15:05:20.724] error: We have more allocated time than is possible (44964192 > 29030400) for cluster descartes(8064) from 2015-04-29T10:00:00 - 2015-04-29T11:00:00 [2015-10-02T15:05:20.724] error: We have more time than is possible (29030400+1636944+0)(30667344) > 29030400 for cluster descartes(8064) from 2015-04-29T10:00:00 - 2015-04-29T11:00:00 [2015-10-02T15:05:20.848] error: We have more allocated time than is possible (34511168 > 29030400) for cluster descartes(8064) from 2015-04-29T11:00:00 - 2015-04-29T12:00:00 [2015-10-02T15:05:20.849] error: We have more time than is possible (29030400+21754128+0)(50784528) > 29030400 for cluster descartes(8064) from 2015-04-29T11:00:00 - 2015-04-29T12:00:00 [2015-10-02T15:05:20.974] error: We have more time than is possible (8677760+27966656+0)(36644416) > 29030400 for cluster descartes(8064) from 2015-04-29T12:00:00 - 2015-04-29T13:00:00 [2015-10-02T15:09:52.582] Warning: Note very large processing time from hourly_rollup for descartes: usec=591744348 began=15:00:00.838 [2015-10-02T15:10:06.395] Warning: Note very large processing time from daily_rollup for descartes: usec=13812530 began=15:09:52.582 [wwschad@clnxcat02 slurm]$ Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Monday, October 12, 2015 11:33 AM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 36<http://bugs.schedmd.com/show_bug.cgi?id=1706#c36> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Do you see any errors in slurmctld logs? Even in the last couple of days, it looks like the time_start messages aren't getting to database. ________________________________ You are receiving this mail because: * You reported the bug. What about in /var/log/slurm/slurmctld.log? Created attachment 2290 [details] slurmctld_log.txt Don't see anything in particular. 2000 line from the end of the file attached. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Monday, October 12, 2015 12:34 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 38<http://bugs.schedmd.com/show_bug.cgi?id=1706#c38> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> What about in /var/log/slurm/slurmctld.log? ________________________________ You are receiving this mail because: * You reported the bug. Just making sure. How far back does the log go? And if you grep for "error" do you see anything? Created attachment 2291 [details] slurmctld_log_error.txt.tgz Here are the errors in the log. I haven't rotated it in a while (or possibly ever) so it's all there. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Monday, October 12, 2015 12:56 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 40<http://bugs.schedmd.com/show_bug.cgi?id=1706#c40> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Just making sure. How far back does the log go? And if you grep for "error" do you see anything? ________________________________ You are receiving this mail because: * You reported the bug. It looks like we just need to clean up 3 jobs for now. Will you run this query:
UPDATE descartes_job_table SET time_end = time_start WHERE id_job in (1228, 1342, 1440);
and then re-run the rollup?
UPDATE descartes_last_ran_table
SET
hourly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
daily_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');
Let me know how this goes.
Thanks,
Brian
Updates completed.
mysql> UPDATE descartes_job_table SET time_end = time_start WHERE id_job in (1228,1342, 1440);
Query OK, 6 rows affected (0.04 sec)
Rows matched: 6 Changed: 6 Warnings: 0
mysql> UPDATE descartes_last_ran_table
-> SET
-> hourly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
-> daily_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
-> monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');
Query OK, 1 row affected (0.07 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql>
Regards,
Bill
______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing
EMRE Support
908-730-1011
Ticket Support:
http://itservices
Select "Get Assistance"
Select "High Performance Cluster"
Select - Request Type "Downstream: All Requests"
In Description of Issue* field
"Type name of server and problem you are having including your ID"
http://goto/bill
From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Tuesday, October 13, 2015 4:05 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%
Comment # 42<http://bugs.schedmd.com/show_bug.cgi?id=1706#c42> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>
It looks like we just need to clean up 3 jobs for now. Will you run this query:
UPDATE descartes_job_table SET time_end = time_start WHERE id_job in (1228,
1342, 1440);
and then re-run the rollup?
UPDATE descartes_last_ran_table
SET
hourly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
daily_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');
Let me know how this goes.
Thanks,
Brian
________________________________
You are receiving this mail because:
* You reported the bug.
Have the rollups finished? If so how does sreport look now? March and April seem to be off still. They should be ~54% and 50% respectively. [wwschad@clnxcat02 ~]$ sreport cluster utilization start=3/01/15 end=3/30/15 -t percent -------------------------------------------------------------------------------- Cluster Utilization 2015-03-01T00:00:00 - 2015-03-29T23:59:59 (2502000*cpus secs) Time reported in Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ------------ ---------- ---------- ------------ --------- ------------ descartes 85.53% 0.01% 0.00% 14.42% 0.05% 100.00% [wwschad@clnxcat02 ~]$ [wwschad@clnxcat02 ~]$ sreport cluster utilization start=4/01/15 end=4/30/15 -t percent -------------------------------------------------------------------------------- Cluster Utilization 2015-04-01T00:00:00 - 2015-04-29T23:59:59 (2505600*cpus secs) Time reported in Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ------------ ---------- ---------- ------------ --------- ------------ descartes 98.45% 0.55% 0.00% 1.00% 0.00% 100.00% [wwschad@clnxcat02 ~]$ Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Wednesday, October 14, 2015 12:31 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 44<http://bugs.schedmd.com/show_bug.cgi?id=1706#c44> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Have the rollups finished? If so how does sreport look now? ________________________________ You are receiving this mail because: * You reported the bug. Will you attach your latest slurmdbd.log? Bill, can you attach your latest slurmdbd.log? I want to see if there were any errors during the last re-rollup. Thanks, Brian Created attachment 2316 [details] slurmdbd.log_19oct2015.tgz Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Monday, October 19, 2015 4:52 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 47<http://bugs.schedmd.com/show_bug.cgi?id=1706#c47> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Bill, can you attach your latest slurmdbd.log? I want to see if there were any errors during the last re-rollup. Thanks, Brian ________________________________ You are receiving this mail because: * You reported the bug. Bill could you send the output of this again... select job_db_inx, id_job, from_unixtime(time_submit), from_unixtime(time_start), from_unixtime(time_end), (time_end - time_start)/(3600*24) from exxon_job_table2 where (time_end - time_start) > (3600 * 24 * 80); Created attachment 2319 [details] select_results_20oct2015.txt As requested. Minor update to correct table name. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Tuesday, October 20, 2015 1:11 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 49<http://bugs.schedmd.com/show_bug.cgi?id=1706#c49> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Danny Auble<mailto:da@schedmd.com> Bill could you send the output of this again... select job_db_inx, id_job, from_unixtime(time_submit), from_unixtime(time_start), from_unixtime(time_end), (time_end - time_start)/(3600*24) from exxon_job_table2 where (time_end - time_start) > (3600 * 24 * 80); ________________________________ You are receiving this mail because: * You reported the bug. Thanks, could you now send the output of select job_db_inx, id_job, time_submit, time_start, time_end, (time_end - time_start)/(3600*24), from_unixtime(time_submit), from_unixtime(time_start), from_unixtime(time_end) from descartes_job_table where time_end != 0 && time_start != 0 and (time_end - time_start) > (3600 * 24 * 20); Created attachment 2320 [details] select_results_20oct2015_new.txt Here is the newest run. Thanks. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Tuesday, October 20, 2015 2:22 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 51<http://bugs.schedmd.com/show_bug.cgi?id=1706#c51> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Danny Auble<mailto:da@schedmd.com> Thanks, could you now send the output of select job_db_inx, id_job, time_submit, time_start, time_end, (time_end - time_start)/(3600*24), from_unixtime(time_submit), from_unixtime(time_start), from_unixtime(time_end) from descartes_job_table where time_end != 0 && time_start != 0 and (time_end - time_start) > (3600 * 24 * 20); ________________________________ You are receiving this mail because: * You reported the bug. Thanks Bill this does give good information. From this it appears there were a few jobs in here that were indeed running up until you did the upgrade which cleaned things up (put an end time in the job). It looks like the magic timestamp is 1430324243 (2015-04-29 12:17:23). Our thought process now is to just zero all those jobs out unless you can find some information on them pointing as to when they actually ended. A query that should update them to get you sane numbers is this update descartes_job_table set time_end=time_start where time_end=1430324243; You can do a select descartes_job_table where time_end=1430324243; to give you an idea of how many jobs that would be, from this output it looks like it would be quite a few (> 200). They all appeared to start around March 11-13 with nothing obvious going wrong after that. My current believe is if you update the table with that command things will be back to normal for those 2 months. select * descartes_job_table where time_end=1430324243; is the select statement, I missed the '*', sorry. All good. 208 records updated. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Tuesday, October 20, 2015 2:42 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 54<http://bugs.schedmd.com/show_bug.cgi?id=1706#c54> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Danny Auble<mailto:da@schedmd.com> select * descartes_job_table where time_end=1430324243; is the select statement, I missed the '*', sorry. ________________________________ You are receiving this mail because: * You reported the bug. Cool, just run
UPDATE descartes_last_ran_table
SET
hourly_rollup = UNIX_TIMESTAMP('2015-2-29 00:00:00'),
daily_rollup = UNIX_TIMESTAMP('2015-2-29 00:00:00'),
monthly_rollup = UNIX_TIMESTAMP('2015-2-29 00:00:00');
and it should reroll at the top of the next hour and you should be set (hopefully).
mysql> UPDATE descartes_last_ran_table
-> SET
-> hourly_rollup = UNIX_TIMESTAMP('2015-2-29 00:00:00'),
-> daily_rollup = UNIX_TIMESTAMP('2015-2-29 00:00:00'),
-> monthly_rollup = UNIX_TIMESTAMP('2015-2-29 00:00:00');
Query OK, 1 row affected, 3 warnings (0.06 sec)
Rows matched: 1 Changed: 1 Warnings: 0
Regards,
Bill
______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing
EMRE Support
908-730-1011
Ticket Support:
http://itservices
Select "Get Assistance"
Select "High Performance Cluster"
Select - Request Type "Downstream: All Requests"
In Description of Issue* field
"Type name of server and problem you are having including your ID"
http://goto/bill
From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Tuesday, October 20, 2015 2:54 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%
Comment # 56<http://bugs.schedmd.com/show_bug.cgi?id=1706#c56> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Danny Auble<mailto:da@schedmd.com>
Cool, just run
UPDATE descartes_last_ran_table
SET
hourly_rollup = UNIX_TIMESTAMP('2015-2-29 00:00:00'),
daily_rollup = UNIX_TIMESTAMP('2015-2-29 00:00:00'),
monthly_rollup = UNIX_TIMESTAMP('2015-2-29 00:00:00');
and it should reroll at the top of the next hour and you should be set
(hopefully).
________________________________
You are receiving this mail because:
* You reported the bug.
Things look better now? Much better now. Thanks. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Tuesday, October 20, 2015 6:32 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 58<http://bugs.schedmd.com/show_bug.cgi?id=1706#c58> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Danny Auble<mailto:da@schedmd.com> Things look better now? ________________________________ You are receiving this mail because: * You reported the bug. Going forward is there a script I can run to verify what jobs aren't closed after say 60 days so that I could kill them. Can you forward those to me including scripts to kill them? Thanks. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Tuesday, October 20, 2015 6:32 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 58<http://bugs.schedmd.com/show_bug.cgi?id=1706#c58> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Danny Auble<mailto:da@schedmd.com> Things look better now? ________________________________ You are receiving this mail because: * You reported the bug. The perl script from comment 4 (attachment 2239 [details]) should do what you want, if not it can probably be modified to do so. The good news is ever since your upgrade in April it doesn't appear you have had any runaways. Glad things are better, let us know if you need anything else on this. The scripts don't cleanup the jobs automatically. Let us know if you ever see them again and we can look at it. Thanks, Brian Interestingly the sreport this month now shows 63.33% reported. Is this due to the power saving plug-in? Last month we had 100% reported. The numbers look pretty good to the sampling we are taking especially showing a good number for reserved but I need to understand and explain the Reported percentage. Thanks. Will you show an example of what you are seeing? Output in previous message. Let me know if you don't see it. Thanks. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Wednesday, November 04, 2015 11:20 AM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Brian Christiansen<mailto:brian@schedmd.com> changed bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> What Removed Added Status RESOLVED UNCONFIRMED Resolution INFOGIVEN --- Comment # 64<http://bugs.schedmd.com/show_bug.cgi?id=1706#c64> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Will you show an example of what you are seeing? ________________________________ You are receiving this mail because: * You reported the bug. I don't see it in Comment 63. Will you attach it again? Thanks. [wwschad@clnhpc01 ~]$ sreport cluster utilization start=10/1/15 end=10/31/15 -t percent -------------------------------------------------------------------------------- Cluster Utilization 2015-10-01T00:00:00 - 2015-10-30T23:59:59 (2592000*cpus secs) Time reported in Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ------------ ---------- ---------- ------------ --------- ------------ descartes 71.02% 2.05% 0.00% 15.75% 11.18% 63.33% [wwschad@clnhpc01 ~]$ sreport cluster utilization start=9/1/15 end=9/31/15 -t percent -------------------------------------------------------------------------------- Cluster Utilization 2015-09-01T00:00:00 - 2015-09-30T23:59:59 (2592000*cpus secs) Time reported in Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ------------ ---------- ---------- ------------ --------- ------------ descartes 34.23% 5.52% 0.00% 58.88% 1.37% 100.00% Thanks Bill, that helps clarify things. Will you run this command? sacctmgr show event start=10/1/15 Created attachment 2386 [details]
slurm_output_sacctmgr.04NOV2015.txt
Will you run these too -- notice that you weren't getting a full month: sreport cluster utilization start=10/1/15 end=11/1/15 -tMinPer sreport cluster utilization start=9/1/15 end=10/1/15 -tMinPer Doesn't like the topic one [wwschad@clnhpc01 ~]$ sreport cluster utilization start=10/1/15 end=11/1/15 -tMi nPer -------------------------------------------------------------------------------- Cluster Utilization 2015-10-01T00:00:00 - 2015-10-31T23:59:59 (2678400*cpus secs ) Time reported in CPU Minutes/Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- -------------------- ------------------ ------------------ ----------- --------- ------------------ -------------------- [wwschad@clnhpc01 ~]$ sreport cluster utilization start=10/1/15 end=11/1/15 -tMinPer -------------------------------------------------------------------------------- Cluster Utilization 2015-10-01T00:00:00 - 2015-10-31T23:59:59 (2678400*cpus secs) Time reported in CPU Minutes/Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- -------------------- ------------------ ------------------ -------------------- ------------------ -------------------- [wwschad@clnhpc01 ~]$ sreport cluster utilization start=9/1/15 end=10/1/15 -tMinPer -------------------------------------------------------------------------------- Cluster Utilization 2015-09-01T00:00:00 - 2015-09-30T23:59:59 (2592000*cpus secs) Time reported in CPU Minutes/Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- -------------------- ------------------ ------------------ -------------------- ------------------ -------------------- descartes 119257881(34.23%) 19233198(5.52%) 0(0.00%) 205115781(58.88%) 4757940(1.37%) 348364800(100.00%) Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Wednesday, November 04, 2015 1:50 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 70<http://bugs.schedmd.com/show_bug.cgi?id=1706#c70> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Will you run these too -- notice that you weren't getting a full month: sreport cluster utilization start=10/1/15 end=11/1/15 -tMinPer sreport cluster utilization start=9/1/15 end=10/1/15 -tMinPer ________________________________ You are receiving this mail because: * You reported the bug. Will you grab this output? It looks like the month rollup hasn't happened yet. select * from descartes_last_ran_table; and will you get this too? sreport cluster utilization start=10/1/15 end=10/31/15 -tminper mysql> select * from descartes_last_ran_table; +---------------+--------------+----------------+ | hourly_rollup | daily_rollup | monthly_rollup | +---------------+--------------+----------------+ | 1445367600 | 1445313600 | 1443672000 | +---------------+--------------+----------------+ 1 row in set (0.00 sec) [root@clnxcat02 ~]# sreport cluster utilization start=10/1/15 end=10/31/15 -tminper -------------------------------------------------------------------------------- Cluster Utilization 2015-10-01T00:00:00 - 2015-10-30T23:59:59 (2592000*cpus secs) Time reported in CPU Minutes/Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- -------------------- ------------------ ------------------ -------------------- ------------------ -------------------- descartes 156698707(71.02%) 4513662(2.05%) 0(0.00%) 34753009(15.75%) 24665662(11.18%) 220631040(63.33%) [root@clnxcat02 ~]# Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Wednesday, November 04, 2015 4:33 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 72<http://bugs.schedmd.com/show_bug.cgi?id=1706#c72> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Will you grab this output? It looks like the month rollup hasn't happened yet. select * from descartes_last_ran_table; and will you get this too? sreport cluster utilization start=10/1/15 end=10/31/15 -tminper ________________________________ You are receiving this mail because: * You reported the bug. You're rollups haven't happened since Mon Oct 19 21:00:00 2015. Will you send your slurmdbd.log? Created attachment 2392 [details]
slurmdbd.tar
Will you set the DebugLevel to debug2 in your slurmdbd.conf and restart the slurmdbd? Then watch for the messages: slurmdbd: debug2: running rollup at Thu Nov 05 09:22:24 2015 slurmdbd: debug2: Everything rolled up Once you see the last message will you run this again and send the slurmdbd.log again? select * from descartes_last_ran_table; You could restart the dbd again too without debug2. Thanks, Brian Waiting for the line to hit the file but am noticing a slowdown with the system processing new jobs is this normal? Its takes like 30-50 sec to obtain a node with a basic srun -N1 -pty bash. Thoughts? How long does it take before it decides to run? I still don't see anything that I can tell in the logs? [2015-10-22T16:36:13.994] error: Processing last message from connection 9(159.70.89.221) uid(-2) [2015-11-06T11:40:21.840] Terminate signal (SIGINT or SIGTERM) received mysql> select * from descartes_last_ran_table; +---------------+--------------+----------------+ | hourly_rollup | daily_rollup | monthly_rollup | +---------------+--------------+----------------+ | 1446832800 | 1446786000 | 1446350400 | +---------------+--------------+----------------+ 1 row in set (0.00 sec) Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Thursday, November 05, 2015 12:40 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 76<http://bugs.schedmd.com/show_bug.cgi?id=1706#c76> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com> Will you set the DebugLevel to debug2 in your slurmdbd.conf and restart the slurmdbd? Then watch for the messages: slurmdbd: debug2: running rollup at Thu Nov 05 09:22:24 2015 slurmdbd: debug2: Everything rolled up Once you see the last message will you run this again and send the slurmdbd.log again? select * from descartes_last_ran_table; You could restart the dbd again too without debug2. Thanks, Brian ________________________________ You are receiving this mail because: * You reported the bug. Job scheduling should not be significantly impacted by the database updates - if you're seeing throughput problems can you see if there are any other potential causes? Is there an unusually high number of jobs in the system, or were any config changes done when updating to 15.08.3? Those database timestamps are current for today, so the rollup process has happened and is current now. The hourly rollup timestamp should just forward every hour, daily every day. If you hadn't noticed before those are just unix timestamp values stored in the database - 1446832800 decodes as today at 18:00 UTC. - Tim I may have misread part of the bug history, my apologies. If you're still on the 14.11 release you weren't affected by the bug in 15.08. The timestamps indicate that the rollup process is happening on schedule now. Unless you've increased the debug level to debug2 or above you shouldn't see messages in slurmdbd.log related to those rollups unless there are specific problems. No we are still on 14.11.5. I was just wondering is it possible the database I getting large or one of the log files that needs to be rotated out. Thanks. I'm realizing your last message may have still been meant as a question? Database size is entirely up to you and your site configuration - we do recommend that you prune it every so often but it's not mandatory. The main thing that can take some time is the upgrade process to a new release - 15.08 in particular changes a few columns in the database and the slurmdbd needs to restructure that all in one transaction. The time that takes (and memory) is ~ O(number of job steps * number of associations) which can get quite large. We've seen updates take 15 minutes on ~ 1mil jobs (say ~=2-3 million job steps) and 400 associations, but that can vary significantly depending on hardware. Let me know if you had any further questions on this or if I can go ahead and close out the bug. - Tim Appreciate the feedback. I'm good now thanks. Regards, Bill ______________________________________________________________________ William W. Schadlich EMIT Foundation Infrastructure Middleware Retail Unix Solutions High Performance Computing EMRE Support 908-730-1011 Ticket Support: http://itservices Select "Get Assistance" Select "High Performance Cluster" Select - Request Type "Downstream: All Requests" In Description of Issue* field "Type name of server and problem you are having including your ID" http://goto/bill From: bugs@schedmd.com [mailto:bugs@schedmd.com] Sent: Tuesday, December 01, 2015 5:49 PM To: Schadlich, William W Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4% Comment # 82<http://bugs.schedmd.com/show_bug.cgi?id=1706#c82> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Tim Wickberg<mailto:tim@schedmd.com> I'm realizing your last message may have still been meant as a question? Database size is entirely up to you and your site configuration - we do recommend that you prune it every so often but it's not mandatory. The main thing that can take some time is the upgrade process to a new release - 15.08 in particular changes a few columns in the database and the slurmdbd needs to restructure that all in one transaction. The time that takes (and memory) is ~ O(number of job steps * number of associations) which can get quite large. We've seen updates take 15 minutes on ~ 1mil jobs (say ~=2-3 million job steps) and 400 associations, but that can vary significantly depending on hardware. Let me know if you had any further questions on this or if I can go ahead and close out the bug. - Tim ________________________________ You are receiving this mail because: * You reported the bug. Marking as closed. |
Running an following command shows allocated to be much higher than actually occurring. [me@servername ~]$ sreport cluster utilization start=04/01/15 end=04/30/15 -t percent -------------------------------------------------------------------------------- Cluster Utilization 2015-04-01T00:00:00 - 2015-04-29T23:59:59 (2505600*cpus secs) Time reported in Percentage of Total -------------------------------------------------------------------------------- Cluster Allocated Down PLND Down Idle Reserved Reported --------- ------------ ---------- ---------- ------------ --------- ------------ clustername 98.45% 0.55% 0.00% 1.00% 0.00% 100.00% We are noticing many jobs particularly in the partitions that have time restraints of 16hr limits have jobs that are still running well passed this timeframe which we believe is causing a skewed. Example: report we run - #sacct --allusers --starttime=$STIME --parsable2 --format=Partition,User,State,JobID,JobName,Submit,Start,End,Elapsed,Timelimit,ExitCode,DerivedExitCode,AllocCPUS,ReqCPUS,NNodes,MaxVMSize,MaxRSS,CPUTime,CPUTimeRAW,TotalCPU,SystemCPU,UserCPU | \ awk -F\| 'BEGIN{OFS="|";}{if ($1 != "") { arg1 = $1; arg2 = $2; printf("%s\n", $0); } else { $1 = arg1; $2 = arg2; printf("%s\n", $0); } }' | \ sed -e 's/,/_/g' | \ sed -e 's/|/,/g' > /home/slurm/usage_logs/slurm_usage_acct.`date +%Y%m%d`.csv Output from April 2015 (subset) - first row should be end of jobs in the short partition since 16hr time limit was hit but it shows jobs with after this timeframe continue to be there. Interestingly the start and end times are shown correctly for the first user (user1) with (383.batch) CANCELLED with the current Elapsed time but the next 3 for (user1) indicate RUNNING with (SOMEJOB1) but the start and end times are exactly the same and are not really running but the database believes so. We feel this is the case happening with all the partitions in our cluster. Partition User State JobID JobName Submit Start End Elapsed Timelimit ExitCode DerivedExitCode AllocCPUS ReqCPUS NNodes MaxVMSize MaxRSS CPUTime CPUTimeRAW TotalCPU SystemCPU UserCPU short user1 CANCELLED 382.batch batch 2015-05-01T15:49:27 2015-05-01T15:49:27 2015-05-02T07:49:28 16:00:01 0:15 16 16 1 262964K 6444K 10-16:00:16 921616 00:00.0 00:00.0 00:00.0 short user1 RUNNING 1461.81 SOMEJOB1 2015-05-12T16:56:32 2015-05-12T16:56:32 Unknown 15-22:29:35 0:00 1 1 1 15-22:29:35 1376975 0:00:00 short user1 RUNNING 1461.82 SOMEJOB1 2015-05-12T16:56:32 2015-05-12T16:56:32 Unknown 15-22:29:35 0:00 1 1 1 15-22:29:35 1376975 0:00:00 short user1 RUNNING 1461.83 SOMEJOB1 2015-05-12T16:56:32 2015-05-12T16:56:32 Unknown 15-22:29:35 0:00 1 1 1 15-22:29:35 1376975 0:00:00 I've also attached our slurm.conf configuration for review (minus some name updates and node number configuration) [wwschad@clnhpc03 ~]$ cat /etc/slurm/slurm.conf # slurm.conf file generated by configurator.html. # Put this file on all nodes of your cluster. # See the slurm.conf man page for more information. # ControlMachine=SERVERNAMEHERE ControlAddr=CLUSTERNAMEHERE #BackupController= #BackupAddr= # AuthType=auth/munge CacheGroups=0 #CheckpointType=checkpoint/none CryptoType=crypto/munge #DisableRootJobs=NO #EnforcePartLimits=NO #Epilog= #EpilogSlurmctld= #FirstJobId=1 #MaxJobId=999999 #GresTypes= #GroupUpdateForce=0 #GroupUpdateTime=600 #JobCheckpointDir=/var/slurm/checkpoint #JobCredentialPrivateKey= #JobCredentialPublicCertificate= #JobFileAppend=0 JobRequeue=0 #JobSubmitPlugins=1 #KillOnBadExit=0 #LaunchType=launch/slurm #Licenses=foo*4,bar #MailProg=/bin/mail #MaxJobCount=5000 #MaxStepCount=40000 MaxStepCount=2000000000 #MaxTasksPerNode=128 MpiDefault=none MpiParams=ports=50000-60000 #PluginDir= #PlugStackConfig= #PrivateData=jobs ProctrackType=proctrack/linuxproc #Prolog= #PrologSlurmctld= #PropagatePrioProcess=0 PropagateResourceLimits=ALL #PropagateResourceLimitsExcept= #RebootProgram= ReturnToService=1 #SallocDefaultCommand= SlurmctldPidFile=/var/run/slurmctld.pid SlurmctldPort=6817 SlurmdPidFile=/var/run/slurmd.pid SlurmdPort=6818 SlurmdSpoolDir=/var/spool/slurmd SlurmUser=slurm #SlurmdUser=root #SrunEpilog= #SrunProlog= StateSaveLocation=/var/spool/slurmctld SwitchType=switch/none #TaskEpilog= #TaskPlugin=task/none TaskPlugin=task/affinity #TaskPluginParam= #TaskProlog= #TopologyPlugin=topology/tree #TmpFS=/tmp #TrackWCKey=no #TreeWidth= #UnkillableStepProgram= UsePAM=0 # # # TIMERS #BatchStartTimeout=10 #CompleteWait=0 #EpilogMsgTime=2000 #GetEnvTimeout=2 #HealthCheckInterval=0 #HealthCheckProgram= InactiveLimit=0 KillWait=30 #MessageTimeout=10 #ResvOverRun=0 MinJobAge=300 #OverTimeLimit=0 SlurmctldTimeout=120 SlurmdTimeout=300 #UnkillableStepTimeout=60 #VSizeFactor=0 Waittime=0 # # # SCHEDULING #DefMemPerCPU=0 FastSchedule=1 #MaxMemPerCPU=0 #SchedulerRootFilter=1 #SchedulerTimeSlice=30 SchedulerType=sched/backfill SchedulerPort=7321 SelectType=select/linear #SelectTypeParameters= # # # JOB PRIORITY #PriorityFlags= #PriorityType=priority/basic #PriorityDecayHalfLife= #PriorityCalcPeriod= #PriorityFavorSmall= #PriorityMaxAge= #PriorityUsageResetPeriod= #PriorityWeightAge= #PriorityWeightFairshare= #PriorityWeightJobSize= #PriorityWeightPartition= #PriorityWeightQOS= # # # LOGGING AND ACCOUNTING AccountingStorageEnforce=Associations AccountingStorageHost=CLUSTERNAMEHERE #AccountingStorageLoc=/var/log/slurm/accounting #AccountingStoragePass=CLUSTERNAMEHERE #AccountingStoragePort=7031 AccountingStorageType=accounting_storage/slurmdbd #AccountingStorageUser=slurm AccountingStoreJobComment=YES ClusterName=CLUSTERNAMEHERE #DebugFlags= JobCompHost=CLUSTERNAMEHERE JobCompLoc=/var/log/slurm/job_completions #JobCompPass=CLUSTERNAMEHERE #JobCompPort=7031 JobCompType=jobcomp/filetxt JobCompUser=slurm JobAcctGatherFrequency=30 JobAcctGatherType=jobacct_gather/linux SlurmctldDebug=3 SlurmctldLogFile=/var/log/slurm/slurmctld.log SlurmdDebug=3 #SlurmdLogFile= #SlurmSchedLogFile= #SlurmSchedLogLevel= # # # POWER SAVE SUPPORT FOR IDLE NODES (optional) #SuspendProgram= #ResumeProgram= #SuspendTimeout= #ResumeTimeout= #ResumeRate= #SuspendExcNodes= #SuspendExcParts= #SuspendRate= #SuspendTime= # # # COMPUTE NODES Output can be emailed separately