1706 – sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Ticket 1706 - sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Summary: sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Status:	RESOLVED INFOGIVEN

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Accounting (show other tickets)
Version:	14.11.5
Hardware:	Linux Linux

Severity:	3 - Medium Impact
Assignee:	Tim Wickberg
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2015-05-28 08:58 MDT by William Schadlich
Modified:	2015-12-02 00:47 MST (History)
CC List:	1 user (show)

See Also:	2068
Site:	EM
Slinky Site:	---
Alineos Sites:	---
Atos/Eviden Sites:	---
Confidential Site:	---
Coreweave sites:	---
Cray Sites:	---
DS9 clusters:	---
Google sites:	---
HPCnow Sites:	---
HPE Sites:	---
IBM Sites:	---
NOAA SIte:	---
NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---
OCF Sites:	---
Recursion Pharma Sites:	---
SFW Sites:	---
SNIC sites:	---
Tzag Elita Sites:	---
Linux Distro:	---
Machine Name:
CLE Version:
Version Fixed:
Target Release:	---
DevPrio:	---
Emory-Cloud Sites:	---

Attachments
perl script to search for runaway jobs (1.33 KB, application/x-perl) 2015-09-18 10:01 MDT, Danny Auble	Details
slurmdbd.log.tgz (28.92 KB, application/x-compressed) 2015-10-07 09:03 MDT, William Schadlich	Details
slurmacctmgr_listing.08OCT2015.txt (995.30 KB, text/plain) 2015-10-08 02:05 MDT, William Schadlich	Details
quries.txt (1.30 KB, text/plain) 2015-10-08 11:04 MDT, Brian Christiansen	Details
out.txt (190.23 KB, text/plain) 2015-10-09 02:40 MDT, William Schadlich	Details
slurm_select_output.txt (404.88 KB, text/plain) 2015-10-12 02:12 MDT, William Schadlich	Details
slurm_select_output2.txt (536.69 KB, text/plain) 2015-10-12 03:30 MDT, William Schadlich	Details
slurmctld_log.txt (256.95 KB, text/plain) 2015-10-12 04:39 MDT, William Schadlich	Details
slurmctld_log_error.txt.tgz (1.18 MB, application/x-compressed) 2015-10-12 04:59 MDT, William Schadlich	Details
slurmdbd.log_19oct2015.tgz (52.05 KB, application/x-compressed) 2015-10-19 08:57 MDT, William Schadlich	Details
select_results_20oct2015.txt (562.60 KB, text/plain) 2015-10-20 06:12 MDT, William Schadlich	Details
select_results_20oct2015_new.txt (33.71 KB, text/plain) 2015-10-20 06:25 MDT, William Schadlich	Details
slurm_output_sacctmgr.04NOV2015.txt (216.24 KB, text/plain) 2015-11-04 04:29 MST, William Schadlich	Details
slurmdbd.tar (830.00 KB, application/x-tar) 2015-11-05 00:33 MST, William Schadlich	Details
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this ticket.

Description William Schadlich 2015-05-28 08:58:02 MDT

Running an following command shows allocated to be much higher than actually occurring.

[me@servername ~]$ sreport cluster utilization start=04/01/15 end=04/30/15 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2015-04-01T00:00:00 - 2015-04-29T23:59:59 (2505600*cpus secs)
Time reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster    Allocated       Down  PLND Down         Idle  Reserved     Reported
--------- ------------ ---------- ---------- ------------ --------- ------------
clustername     98.45%      0.55%      0.00%        1.00%     0.00%      100.00%


We are noticing many jobs particularly in the partitions that have time restraints of 16hr limits have jobs that are still running well passed this timeframe which we believe is causing a skewed.

Example:

report we run - 
#sacct --allusers --starttime=$STIME --parsable2 --format=Partition,User,State,JobID,JobName,Submit,Start,End,Elapsed,Timelimit,ExitCode,DerivedExitCode,AllocCPUS,ReqCPUS,NNodes,MaxVMSize,MaxRSS,CPUTime,CPUTimeRAW,TotalCPU,SystemCPU,UserCPU | \
awk -F\| 'BEGIN{OFS="|";}{if ($1 != "") { arg1 = $1; arg2 = $2; printf("%s\n", $0); } else { $1 = arg1; $2 = arg2; printf("%s\n", $0); } }' | \
sed -e 's/,/_/g' | \
sed -e 's/|/,/g' > /home/slurm/usage_logs/slurm_usage_acct.`date +%Y%m%d`.csv


Output from April 2015 (subset) - first row should be end of jobs in the short partition since 16hr time limit was hit but it shows jobs with after this timeframe continue to be there. Interestingly the start and end times are shown correctly for the first user (user1) with (383.batch) CANCELLED with the current Elapsed time but the next 3 for (user1) indicate RUNNING with (SOMEJOB1) but the start and end times are exactly the same and are not really running but the database believes so.  

We feel this is the case happening with all the partitions in our cluster.

Partition	User	State	JobID	JobName	Submit	Start	End	Elapsed	Timelimit	ExitCode	DerivedExitCode	AllocCPUS	ReqCPUS	NNodes	MaxVMSize	MaxRSS	CPUTime	CPUTimeRAW	TotalCPU	SystemCPU	UserCPU


short	user1	CANCELLED	382.batch	batch	2015-05-01T15:49:27	2015-05-01T15:49:27	2015-05-02T07:49:28	16:00:01		0:15		16	16	1	262964K	6444K	10-16:00:16	921616	00:00.0	00:00.0	00:00.0

short	user1	RUNNING	1461.81	SOMEJOB1	2015-05-12T16:56:32	2015-05-12T16:56:32	Unknown	15-22:29:35		0:00		1	1	1			15-22:29:35	1376975	0:00:00		
short	user1	RUNNING	1461.82	SOMEJOB1	2015-05-12T16:56:32	2015-05-12T16:56:32	Unknown	15-22:29:35		0:00		1	1	1			15-22:29:35	1376975	0:00:00		
short	user1	RUNNING	1461.83	SOMEJOB1	2015-05-12T16:56:32	2015-05-12T16:56:32	Unknown	15-22:29:35		0:00		1	1	1			15-22:29:35	1376975	0:00:00		



I've also attached our slurm.conf configuration for review (minus some name updates and node number configuration)

[wwschad@clnhpc03 ~]$ cat /etc/slurm/slurm.conf
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=SERVERNAMEHERE
ControlAddr=CLUSTERNAMEHERE
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
JobRequeue=0
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
MaxStepCount=2000000000
#MaxTasksPerNode=128
MpiDefault=none
MpiParams=ports=50000-60000
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
PropagateResourceLimits=ALL
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
#TaskPlugin=task/none
TaskPlugin=task/affinity
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageEnforce=Associations
AccountingStorageHost=CLUSTERNAMEHERE
#AccountingStorageLoc=/var/log/slurm/accounting
#AccountingStoragePass=CLUSTERNAMEHERE
#AccountingStoragePort=7031
AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=CLUSTERNAMEHERE
#DebugFlags=
JobCompHost=CLUSTERNAMEHERE
JobCompLoc=/var/log/slurm/job_completions
#JobCompPass=CLUSTERNAMEHERE
#JobCompPort=7031
JobCompType=jobcomp/filetxt
JobCompUser=slurm
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
Output can be emailed separately

Comment 1 David Bigagli 2015-06-22 08:25:33 MDT

Hi,
   what is the latest state of the jobs in the database, are they still in
running? This would seems to indicate that somehow the controller failed
to update the database.

David

Comment 2 David Bigagli 2015-06-22 08:40:49 MDT

Hi, 
   do you have reservations in the system? The reservations are counted against
used cpu time.

David

Comment 3 Danny Auble 2015-09-18 06:04:36 MDT

Bill, could you reply to comment 1 and 2?

Comment 4 Danny Auble 2015-09-18 10:01:06 MDT

Created attachment 2239 [details]
perl script to search for runaway jobs

Bill if you have runaway jobs this script should identify them fairly easy.

Comment 5 Danny Auble 2015-09-18 10:12:35 MDT

Thanks Bill, for some reason your message wasn't part of the bug at 
all.  The answer to this is in comment 4 of the bug.

You reply is included here...

I’ve answered those previously on June 22, 2015.  Attaching email 
trail.  Thanks.





****************************************************

From: Schadlich, William W
Sent: Monday, June 22, 2015 4:42 PM
To: 'bugs@schedmd.com'
Subject: RE: [Bug 1706] sreport indicates 98.45% allocated for month 
when sample clearly indicates 50.4%



We shouldn’t have any but how would I check that?





Regards,

Bill

Comment 6 Danny Auble 2015-09-18 10:43:04 MDT

Bill, can you also update the version if changed.  The script given in comment 4 shows runaway jobs as it is noted.  To list any reservations you would have to go in the database to look before 15.08 which has 'sacctmgr list resv'.

If you want to see if you have any reservations on anything before that you can log into the slurm database and run

select * from clustername_resv_table;

where clustername is the name of your cluster.  This should give you the start and end of any reservations in your system.

Comment 7 Brian Christiansen 2015-09-24 06:03:53 MDT

Hey Bill,

Are you getting emails from bugzilla and have you tried replying back? We don't see any responses so we are just making sure that your responses aren't getting lost somewhere.

If you are getting these, are you currently experiencing this issue?

Thanks,
Brian

Comment 8 William Schadlich 2015-09-30 07:56:27 MDT

Test email



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Thursday, September 24, 2015 2:04 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 7<http://bugs.schedmd.com/show_bug.cgi?id=1706#c7> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Hey Bill,



Are you getting emails from bugzilla and have you tried replying back? We don't

see any responses so we are just making sure that your responses aren't getting

lost somewhere.



If you are getting these, are you currently experiencing this issue?



Thanks,

Brian

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 9 Brian Christiansen 2015-09-30 08:32:49 MDT

We can hear you now. 

From a previous email:
Here is what that script determined.  I guess I was able to kill most of the database jobs from Nov of last year but somehow new ones are now here.

CLUSTER:#myclustername#

cant find job 1461           lpkozio subscript+   research    RUNNING  #myclustername#           n0203        1      short 131-16:57:11
cant find job 1462           lpkozio subscript+   research    RUNNING  #myclustername#           n0382        1      short 131-16:57:11
cant find job 1463           lpkozio subscript+   research    RUNNING  #myclustername#           n0318        1      short 131-16:57:11
cant find job 1464           lpkozio subscript+   research    RUNNING  #myclustername#           n0319        1      short 131-16:57:11
cant find job 1465           lpkozio subscript+   research    RUNNING  #myclustername#           n0320        1      short 131-16:57:11
cant find job 1466           lpkozio subscript+   research    RUNNING  #myclustername#           n0682        1      short 131-16:57:08
cant find job 1467           lpkozio subscript+   research    RUNNING  #myclustername#           n0683        1      short 131-16:57:08
cant find job 1468           lpkozio subscript+   research    RUNNING  #myclustername#           n0684        1      short 131-16:57:08
cant find job 1469           lpkozio subscript+   research    RUNNING  #myclustername#           n0210        1      short 131-16:57:08
cant find job 1470           lpkozio subscript+   research    RUNNING  #myclustername#           n0211        1      short 131-16:57:08
cant find job 1471           lpkozio subscript+   research    RUNNING  #myclustername#           n0212        1      short 131-16:57:08
cant find job 1472           lpkozio subscript+   research    RUNNING  #myclustername#           n0213        1      short 131-16:57:08
cant find job 1473           lpkozio subscript+   research    RUNNING  #myclustername#           n0666        1      short 131-16:57:08
cant find job 1474           lpkozio subscript+   research    RUNNING  #myclustername#           n0667        1      short 131-16:57:08
cant find job 1475           lpkozio subscript+   research    RUNNING  #myclustername#           n0668        1      short 131-16:57:08
cant find job 1476           lpkozio subscript+   research    RUNNING  #myclustername#           n0669        1      short 131-16:57:08
cant find job 1494           sraman1 lammps.sc+   research    RUNNING  #myclustername#    n[0352-0356]        5      small 131-16:36:11

Comment 10 Brian Christiansen 2015-09-30 08:35:15 MDT

Were you able to clear out the long running jobs that the script reported? If not you can use these queries to clean them up.

select * from <cluster>_job_table where id_job = <jobid> and state=1 and time_end = 0;
update <cluster>_job_table set time_end = time_start, state = 3 where id_job=<jobid> and state=1 and time_end=0;
select * from <cluster>_job_table where id_job = <jobid>;

One possible explanation for the long running jobs is that the database was down for a period of time and the controller started throwing away records. The controller will cache quite a few records while the database is down but it will eventually start throwing things away. The logs will show the following messages if the cache is getting to large:

slurmdbd: agent queue filling, RESTART SLURMDBD NOW

this one goes to the syslog:
*** RESTART SLURMDBD NOW ***"
                                                 
and if it starts to throw things away you'll see:

slurmdbd: agent queue is full, discarding request");  


After you clean up these jobs will you run your reports again and see if they are matching up?

Here's one suggestion for your report command line options, you can use the sacct -X option to not display the steps. That way you'll just be looking at the jobs.

Thanks,
Brian

Comment 11 Brian Christiansen 2015-10-01 08:45:12 MDT

From Bill:
So I ran the clean-up scripts provided and the system indicates nothing is running but I go back to months where utilization was low and it still indicates percentages in the 90s where it probably closer to 50-60.  Thoughts?

[wwschad@clnsand01 admin]$ sreport cluster utilization start=1/1/15 end=1/31/15 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2015-01-01T00:00:00 - 2015-01-30T23:59:59 (2592000*cpus secs)
Time reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster    Allocated       Down  PLND Down         Idle  Reserved     Reported
--------- ------------ ---------- ---------- ------------ --------- ------------
descartes       95.80%      0.03%      0.00%        4.17%     0.00%      100.00%


[wwschad@clnsand01 admin]$ sreport cluster utilization start=2/1/15 end=2/28/15 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2015-02-01T00:00:00 - 2015-02-27T23:59:59 (2332800*cpus secs)
Time reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster    Allocated       Down  PLND Down         Idle  Reserved     Reported
--------- ------------ ---------- ---------- ------------ --------- ------------
descartes       48.49%      0.61%      0.00%       50.68%     0.21%      100.00%


[wwschad@clnsand01 admin]$ sreport cluster utilization start=12/1/14 end=12/31/14 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2014-12-01T00:00:00 - 2014-12-30T23:59:59 (2592000*cpus secs)
Time reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster    Allocated       Down  PLND Down         Idle  Reserved     Reported
--------- ------------ ---------- ---------- ------------ --------- ------------
descartes       98.79%      0.02%      0.00%        1.18%     0.01%      100.00%

Comment 12 Brian Christiansen 2015-10-01 09:24:43 MDT

Now that the the jobs are cleaned up, the job usage needs to be rolled up again. sreport looks at the rolled up usage.

You can trigger the re-roll by setting the last roll up times to a previous date in the database. For example, since your sreport time is from last December you could set the time to a day before just be safe.

UPDATE = descartes_last_ran_table 
SET 
    hourly_rollup  = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
    daily_rollup   = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
    monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');

Or you could set the columns to 0 and it will roll up everything. 

By modifying these columns the slurmdbd will do the rollups the next time it does its rollup (which is every hour or on startup). It could take a little bit of time to re-roll depending on how big the database is, but just let it run.

Does the database still have the jobs from last year? Are you doing any purging? Will you attach your slurmdbd.conf?

Will you also show the output of:

select * from descartes_resv_table;

Let us know how it goes or if you have any questions.

Thanks,
Brian

Comment 13 William Schadlich 2015-10-02 03:51:34 MDT

From the slurmdbd  log I see the following is it a concern from the work we have done?  We currently have everything stored in the db from startup still.

[2015-09-02T11:00:00.959] error: We have more time than is possible (15091584+14976000+0)(30067584) > 29030400 for cluster descartes(8064) from 2015-09-02T10:00:00 - 2015-09-02T11:00:00
[2015-09-02T12:00:00.104] error: We have more time than is possible (15093552+14976000+0)(30069552) > 29030400 for cluster descartes(8064) from 2015-09-02T11:00:00 - 2015-09-02T12:00:00
[2015-09-02T13:00:00.271] error: We have more time than is possible (14164480+14976000+0)(29140480) > 29030400 for cluster descartes(8064) from 2015-09-02T12:00:00 - 2015-09-02T13:00:00
[2015-09-02T20:00:00.305] error: We have more time than is possible (15091200+14976000+0)(30067200) > 29030400 for cluster descartes(8064) from 2015-09-02T19:00:00 - 2015-09-02T20:00:00
[2015-09-02T21:00:00.460] error: We have more time than is possible (15091200+14976000+0)(30067200) > 29030400 for cluster descartes(8064) from 2015-09-02T20:00:00 - 2015-09-02T21:00:00
[2015-09-02T22:00:00.659] error: We have more time than is possible (14767616+14976000+0)(29743616) > 29030400 for cluster descartes(8064) from 2015-09-02T21:00:00 - 2015-09-02T22:00:00
[2015-09-02T23:00:00.829] error: We have more time than is possible (15037392+14976000+0)(30013392) > 29030400 for cluster descartes(8064) from 2015-09-02T22:00:00 - 2015-09-02T23:00:00
[2015-09-03T00:00:00.983] error: We have more time than is possible (14868336+14976000+0)(29844336) > 29030400 for cluster descartes(8064) from 2015-09-02T23:00:00 - 2015-09-03T00:00:00
[2015-09-03T01:00:00.195] error: We have more time than is possible (14165504+14976000+0)(29141504) > 29030400 for cluster descartes(8064) from 2015-09-03T00:00:00 - 2015-09-03T01:00:00
[2015-09-21T09:06:43.643] error: Problem getting jobs for cluster -s
[2015-10-01T09:00:00.661] error: We have more time than is possible (10574560+19656576+0)(30231136) > 29030400 for cluster descartes(8064) from 2015-10-01T08:00:00 - 2015-10-01T09:00:00
[2015-10-01T10:00:00.834] error: We have more time than is possible (4723200+25516800+0)(30240000) > 29030400 for cluster descartes(8064) from 2015-10-01T09:00:00 - 2015-10-01T10:00:00
[2015-10-01T14:05:04.576] Terminate signal (SIGINT or SIGTERM) received
[2015-10-01T14:08:50.631] Accounting storage MYSQL plugin loaded
[2015-10-01T14:08:50.689] slurmdbd version 14.11.5 started
[2015-10-01T14:13:49.515] DBD_CLUSTER_CPUS: cluster not registered



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Wednesday, September 30, 2015 4:35 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 10<http://bugs.schedmd.com/show_bug.cgi?id=1706#c10> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Were you able to clear out the long running jobs that the script reported? If

not you can use these queries to clean them up.



select * from <cluster>_job_table where id_job = <jobid> and state=1 and

time_end = 0;

update <cluster>_job_table set time_end = time_start, state = 3 where

id_job=<jobid> and state=1 and time_end=0;

select * from <cluster>_job_table where id_job = <jobid>;



One possible explanation for the long running jobs is that the database was

down for a period of time and the controller started throwing away records. The

controller will cache quite a few records while the database is down but it

will eventually start throwing things away. The logs will show the following

messages if the cache is getting to large:



slurmdbd: agent queue filling, RESTART SLURMDBD NOW



this one goes to the syslog:

*** RESTART SLURMDBD NOW ***"



and if it starts to throw things away you'll see:



slurmdbd: agent queue is full, discarding request");





After you clean up these jobs will you run your reports again and see if they

are matching up?



Here's one suggestion for your report command line options, you can use the

sacct -X option to not display the steps. That way you'll just be looking at

the jobs.



Thanks,

Brian

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 14 William Schadlich 2015-10-02 03:55:28 MDT

How do you reset the re-rollup?  Is it a command line or db command?

Here is the output you requested:

mysql> select * from descartes_resv_table;
Empty set (0.00 sec)



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Thursday, October 01, 2015 5:25 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 12<http://bugs.schedmd.com/show_bug.cgi?id=1706#c12> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Now that the the jobs are cleaned up, the job usage needs to be rolled up

again. sreport looks at the rolled up usage.



You can trigger the re-roll by setting the last roll up times to a previous

date in the database. For example, since your sreport time is from last

December you could set the time to a day before just be safe.



UPDATE = descartes_last_ran_table

SET

    hourly_rollup  = UNIX_TIMESTAMP('2014-11-30 00:00:00'),

    daily_rollup   = UNIX_TIMESTAMP('2014-11-30 00:00:00'),

    monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');



Or you could set the columns to 0 and it will roll up everything.



By modifying these columns the slurmdbd will do the rollups the next time it

does its rollup (which is every hour or on startup). It could take a little bit

of time to re-roll depending on how big the database is, but just let it run.



Does the database still have the jobs from last year? Are you doing any

purging? Will you attach your slurmdbd.conf?



Will you also show the output of:



select * from descartes_resv_table;



Let us know how it goes or if you have any questions.



Thanks,

Brian

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 15 Brian Christiansen 2015-10-02 04:19:35 MDT

You can actually do it using "sacctmgr roll <timestamp>" but we recommend just modifying the database directly and letting the dbd handle it the next time it does a rollup. In 14.11, it's possible to have the "sacctmgr roll" and the normal rollup happen at the same time, so we want to avoid that.

You can update the database using the UPDATE query in Comment 14. You can modify the time to how far back you want the re-rollup to start from.

There are log messages logged at debug2 that say when the rollup is done:
"Everything rolled up"

Or you can look at the descartes_last_ran_table and look for the numbers to change.

Comment 16 Brian Christiansen 2015-10-02 04:22:26 MDT

Actually, Comment 12 and not Comment 14. 14 has it in reply as well.

Comment 17 Brian Christiansen 2015-10-02 04:42:09 MDT

Regarding the messages:
[2015-09-02T11:00:00.959] error: We have more time than is possible (15091584+14976000+0)(30067584) > 29030400 for cluster descartes(8064) from 2015-09-02T10:00:00 - 2015-09-02T11:00:00

These will go away after the re-rollup. The long running jobs that you cleaned up were being accounted for in this month's rollups. So the re-rollup will take these jobs out of this month's usage.

Comment 18 William Schadlich 2015-10-02 06:01:48 MDT

There seems to be a syntax error is the db sql statement.  Sadly I gather it's the timestamp command itself not supported in mysql but I haven't researched it in-depth.

mysql> UPDATE = descartes_last_ran_table SET hourly_rollup  = UNIX_TIMESTAMP('2014-11-30 00:00:00'), daily_rollup   = UNIX_TIMESTAMP('2014-11-30 00:00:00'), monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '= descartes_last_ran_table SET hourly_rollup  = UNIX_TIMESTAMP('2014-11-30 00:00' at line 1



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Friday, October 02, 2015 12:20 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 15<http://bugs.schedmd.com/show_bug.cgi?id=1706#c15> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

You can actually do it using "sacctmgr roll <timestamp>" but we recommend just

modifying the database directly and letting the dbd handle it the next time it

does a rollup. In 14.11, it's possible to have the "sacctmgr roll" and the

normal rollup happen at the same time, so we want to avoid that.



You can update the database using the UPDATE query in Comment 14<http://bugs.schedmd.com/show_bug.cgi?id=1706#c14>. You can

modify the time to how far back you want the re-rollup to start from.



There are log messages logged at debug2 that say when the rollup is done:

"Everything rolled up"



Or you can look at the descartes_last_ran_table and look for the numbers to

change.

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 19 Brian Christiansen 2015-10-02 06:10:07 MDT

Oops. There's a rouge '=' after the UPDATE. It should look like this:

UPDATE descartes_last_ran_table
SET
    hourly_rollup  = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
    daily_rollup   = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
    monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');

Comment 20 William Schadlich 2015-10-02 06:12:04 MDT

:) much better

Query OK, 1 row affected (0.04 sec)
Rows matched: 1  Changed: 1  Warnings: 0



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Friday, October 02, 2015 2:10 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 19<http://bugs.schedmd.com/show_bug.cgi?id=1706#c19> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Oops. There's a rouge '=' after the UPDATE. It should look like this:



UPDATE descartes_last_ran_table

SET

    hourly_rollup  = UNIX_TIMESTAMP('2014-11-30 00:00:00'),

    daily_rollup   = UNIX_TIMESTAMP('2014-11-30 00:00:00'),

    monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 21 William Schadlich 2015-10-02 07:38:39 MDT

Slurmdbd.conf

#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
#ArchiveSteps=yes
#ArchiveScript=
#JobPurge=12
#StepPurge=1
#
# Authentication info
AuthType=auth/munge
#AuthInfo=/var/run/munge/munge.socket.2
#
# slurmDBD info
DbdAddr=localhost
DbdHost=localhost
DbdPort=6819
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurm/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
#
# Database info
StorageType=accounting_storage/mysql
StorageHost=localhost
#StoragePort=1234
# Just to avoid other clients (e.g. rtm) possibly corrupting db
StoragePass=descartes
StorageUser=slurm
StorageLoc=slurm_acct_db



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Thursday, October 01, 2015 5:25 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 12<http://bugs.schedmd.com/show_bug.cgi?id=1706#c12> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Now that the the jobs are cleaned up, the job usage needs to be rolled up

again. sreport looks at the rolled up usage.



You can trigger the re-roll by setting the last roll up times to a previous

date in the database. For example, since your sreport time is from last

December you could set the time to a day before just be safe.



UPDATE = descartes_last_ran_table

SET

    hourly_rollup  = UNIX_TIMESTAMP('2014-11-30 00:00:00'),

    daily_rollup   = UNIX_TIMESTAMP('2014-11-30 00:00:00'),

    monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');



Or you could set the columns to 0 and it will roll up everything.



By modifying these columns the slurmdbd will do the rollups the next time it

does its rollup (which is every hour or on startup). It could take a little bit

of time to re-roll depending on how big the database is, but just let it run.



Does the database still have the jobs from last year? Are you doing any

purging? Will you attach your slurmdbd.conf?



Will you also show the output of:



select * from descartes_resv_table;



Let us know how it goes or if you have any questions.



Thanks,

Brian

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 22 Brian Christiansen 2015-10-02 08:22:27 MDT

Thanks, Bill. When the rollups are done will you check your reports again and let us know how they look?

Comment 23 William Schadlich 2015-10-07 05:39:33 MDT

Looks much better now, though how does the reserve work since I didn't think we were using any.  Is that the total percent people waiting for resources queued?  Thanks.

[wwschad@clnxcat02 ~]$ sreport cluster utilization start=1/1/15 end=1/31/15 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2015-01-01T00:00:00 - 2015-01-30T23:59:59 (2592000*cpus secs)
Time reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster    Allocated       Down  PLND Down         Idle  Reserved     Reported
--------- ------------ ---------- ---------- ------------ --------- ------------
descartes       46.31%     10.15%      0.00%       41.40%     2.15%      100.00%
[wwschad@clnxcat02 ~]$ sreport cluster utilization start=2/1/15 end=2/31/15 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2015-02-01T00:00:00 - 2015-03-02T23:59:59 (2592000*cpus secs)
Time reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster    Allocated       Down  PLND Down         Idle  Reserved     Reported
--------- ------------ ---------- ---------- ------------ --------- ------------
descartes       48.71%      0.55%      0.00%       48.35%     2.39%      100.00%



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Friday, October 02, 2015 4:22 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 22<http://bugs.schedmd.com/show_bug.cgi?id=1706#c22> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Thanks, Bill. When the rollups are done will you check your reports again and

let us know how they look?

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 24 Brian Christiansen 2015-10-07 06:34:30 MDT

Good to hear!

From the sreport man page:
"Reserved time refers to time that a job was waiting for resources after the job had become eligible. If the value is not of importance for you the number should be grouped with idle time."

For example if a job that needs four nodes is the highest priority job but it can't run becausee only two are available. The time waiting for idle resources up to 2 nodes is counted as reserved time. The time waiting for the busy nodes gets put into the unshown overcommit column. 

If you had a high reserve time you could say that either you need more resources or backfill could do a better job.  The unshown overcommit column can also be helpful to show how much backlog you had of jobs if you had more resources, which contains all pending time for jobs waiting for resources.

Let us know if you have any other questions. 

Thanks,
Brian

Comment 25 William Schadlich 2015-10-07 06:59:17 MDT

Very good but I am little concerned that later months are still off.

[wwschad@clnhpc01 ~]$ sreport cluster utilization start=4/01/15 end=4/31/15 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2015-04-01T00:00:00 - 2015-04-30T23:59:59 (2592000*cpus secs)
Time reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster    Allocated       Down  PLND Down         Idle  Reserved     Reported
--------- ------------ ---------- ---------- ------------ --------- ------------
descartes       96.69%      0.54%      0.00%        2.77%     0.00%      100.00%


Whereas I calculated about 50.54% utilization.  Thoughts?



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Wednesday, October 07, 2015 2:35 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 24<http://bugs.schedmd.com/show_bug.cgi?id=1706#c24> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Good to hear!



From the sreport man page:

"Reserved time refers to time that a job was waiting for resources after the

job had become eligible. If the value is not of importance for you the number

should be grouped with idle time."



For example if a job that needs four nodes is the highest priority job but it

can't run becausee only two are available. The time waiting for idle resources

up to 2 nodes is counted as reserved time. The time waiting for the busy nodes

gets put into the unshown overcommit column.



If you had a high reserve time you could say that either you need more

resources or backfill could do a better job.  The unshown overcommit column can

also be helpful to show how much backlog you had of jobs if you had more

resources, which contains all pending time for jobs waiting for resources.



Let us know if you have any other questions.



Thanks,

Brian

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 26 Brian Christiansen 2015-10-07 08:53:36 MDT

Just for confirmation will you send the output of:
select * from descartes_last_ran_table;

What is your current sacct line that you are using for your reports? I recommend using the -X and -T options.

Will you send your dbd logs from when the re-rollup happened?

Thanks,
Brian

Comment 27 William Schadlich 2015-10-07 09:03:09 MDT

Created attachment 2280 [details]
slurmdbd.log.tgz

mysql> select * from descartes_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1444248000 |   1444190400 |     1443672000 |
+---------------+--------------+----------------+
1 row in set (0.00 sec)

To compensate for the db issues we have been having for awhile I don't sacct for report since they weren't accurate.  I have a cronjob running every 5 mins to take a sample of jobs running in the system.  Using this method with a small margin of error I am able to determine utilization and availability.

sinfo --format="%10D %20F %8t %15U %20H %10A %15C %25E %N" >> /home/slurm/acct_logs/slurm_daily_acct.`date +%Y%m%d`

Also I am confused what you mean by dbd logs when the rollup happened?  I guess you are talking about the slurmdbd.log on the master so I've attached those, let me know if that is not the cause.  Thanks.

Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Wednesday, October 07, 2015 4:54 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 26<http://bugs.schedmd.com/show_bug.cgi?id=1706#c26> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Just for confirmation will you send the output of:

select * from descartes_last_ran_table;



What is your current sacct line that you are using for your reports? I

recommend using the -X and -T options.



Will you send your dbd logs from when the re-rollup happened?



Thanks,

Brian

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 28 Brian Christiansen 2015-10-07 09:50:35 MDT

Looks like there may be some runaway jobs still in the database -- since there are a lot of these messages in the log:

"We have more time than is possible "

Will you run these commands? Do you remember doing a cold start (slurmctld -c) on April 29th? 

sacct -X -S 2014-12-11T17:00:00 -E 2014-12-11T18:00:00
sacct -X -S 2015-04-29T11:00:00 -E 2015-04-29T12:00:00
sacct -X -S 2015-02-19T23:00:00 -E 2015-02-20T00:00:00
sacct -X -S 2015-01-20T10:00:00 -E 2015-01-20T11:00:00
sacctmgr list events start=2015-04-29T12:00:00 end=2015-04-29T13:00:0

Comment 29 William Schadlich 2015-10-08 02:05:10 MDT

Created attachment 2284 [details]
slurmacctmgr_listing.08OCT2015.txt

April 29 I believe was an upgrade to Slurm day from 2.6.7 to 14.11.  Here is the output of the next 4 scripts but the last sacctmgr script indicates the following error and then posts every event it looks like.


[wwschad@clnxcat02 ~]$ sacct -X -S 2014-12-11T17:00:00 -E 2014-12-11T18:00:00
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[wwschad@clnxcat02 ~]$ sacct -X -S 2015-04-29T11:00:00 -E 2015-04-29T12:00:00
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[wwschad@clnxcat02 ~]$ sacct -X -S 2015-02-19T23:00:00 -E 2015-02-20T00:00:00
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[wwschad@clnxcat02 ~]$ sacct -X -S 2015-01-20T10:00:00 -E 2015-01-20T11:00:00
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

[wwschad@clnxcat02 ~]$ sacctmgr list events start=2015-04-29T12:00:00 end=2015-04-29T13:00:0
Invalid time specification (pos=18): 2015-04-29T13:00:0

The output is attached.  Thanks.



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Wednesday, October 07, 2015 5:51 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 28<http://bugs.schedmd.com/show_bug.cgi?id=1706#c28> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Looks like there may be some runaway jobs still in the database -- since there

are a lot of these messages in the log:



"We have more time than is possible "



Will you run these commands? Do you remember doing a cold start (slurmctld -c)

on April 29th?



sacct -X -S 2014-12-11T17:00:00 -E 2014-12-11T18:00:00

sacct -X -S 2015-04-29T11:00:00 -E 2015-04-29T12:00:00

sacct -X -S 2015-02-19T23:00:00 -E 2015-02-20T00:00:00

sacct -X -S 2015-01-20T10:00:00 -E 2015-01-20T11:00:00

sacctmgr list events start=2015-04-29T12:00:00 end=2015-04-29T13:00:0

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 30 Brian Christiansen 2015-10-08 11:04:53 MDT

Created attachment 2285 [details]
quries.txt

Will you run the queries in the attached file like so:

mysql slurm_acct_db < quries.txt 2>&1 > out.txt

and send the output file back?

Comment 31 William Schadlich 2015-10-09 02:40:19 MDT

Created attachment 2287 [details]
out.txt

As requested.  Thanks.



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Thursday, October 08, 2015 7:05 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 30<http://bugs.schedmd.com/show_bug.cgi?id=1706#c30> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Created attachment 2285 [details]<attachment.cgi?id=2285> [details]<attachment.cgi?id=2285&action=edit>

quries.txt



Will you run the queries in the attached file like so:



mysql slurm_acct_db < quries.txt 2>&1 > out.txt



and send the output file back?

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 32 Brian Christiansen 2015-10-09 09:55:12 MDT

I believe I found some culprits. Will you run this query to verify that you see what I see?


select job_db_inx, id_job, from_unixtime(time_submit), from_unixtime(time_start), from_unixtime(time_end), (time_end - time_start)/(3600*24) from exxon_job_table2 where (time_end - time_start) > (3600 * 24 * 80);

I'm getting:
+------------+--------+----------------------------+---------------------------+-------------------------+-----------------------------------+
| job_db_inx | id_job | from_unixtime(time_submit) | from_unixtime(time_start) | from_unixtime(time_end) | (time_end - time_start)/(3600*24) |
+------------+--------+----------------------------+---------------------------+-------------------------+-----------------------------------+
|       2042 |   1228 | 2014-05-06 12:29:52        | 2014-05-06 12:30:12       | 2015-04-29 09:17:23     |                          357.8661 |
|       2179 |   1342 | 2014-05-08 04:49:53        | 2014-05-08 04:49:53       | 2015-04-29 09:17:23     |                          356.1858 |
|       2281 |   1440 | 2014-05-12 08:24:54        | 2014-05-12 08:25:09       | 2015-04-29 09:17:23     |                          352.0363 |
|     115281 |  80929 | 2015-04-29 07:55:40        | 1969-12-31 16:00:00       | 2015-04-29 09:17:23     |                        16554.6787 |
|     115282 |  80930 | 2015-04-29 08:08:37        | 1969-12-31 16:00:00       | 2015-04-29 08:08:50     |                        16554.6311 |
|     115285 |  80932 | 2015-04-29 08:10:59        | 1969-12-31 16:00:00       | 2015-04-29 08:11:08     |                        16554.6327 |
+------------+--------+----------------------------+---------------------------+-------------------------+-----------------------------------+
6 rows in set (0.01 sec)


1228, 1342 and 1440 are the jobids that existed in both the queries that you ran for me (December 2014 and April 2015). The others are in the April 2015 batch. 

Is it normal for jobs to run 30 to 70 days? I see a lot that run for about 50 days.

Thanks,
Brian

Comment 33 William Schadlich 2015-10-12 02:12:37 MDT

Created attachment 2288 [details]
slurm_select_output.txt

I made a minor modification to the script since it had the wrong table in it during run.  There have been times when jobs ran a few months but over the past six we have had cluster issues which the filesystem has crashed and all jobs where cancelled.  Anything prior to April 29th we had a complete cluster shutdown to upgrade slurm to 14.11.

Attached is the output from your query request.  Thanks.



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Friday, October 09, 2015 5:55 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 32<http://bugs.schedmd.com/show_bug.cgi?id=1706#c32> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

I believe I found some culprits. Will you run this query to verify that you see

what I see?





select job_db_inx, id_job, from_unixtime(time_submit),

from_unixtime(time_start), from_unixtime(time_end), (time_end -

time_start)/(3600*24) from exxon_job_table2 where (time_end - time_start) >

(3600 * 24 * 80);



I'm getting:

+------------+--------+----------------------------+---------------------------+-------------------------+-----------------------------------+

| job_db_inx | id_job | from_unixtime(time_submit) | from_unixtime(time_start)

| from_unixtime(time_end) | (time_end - time_start)/(3600*24) |

+------------+--------+----------------------------+---------------------------+-------------------------+-----------------------------------+

|       2042 |   1228 | 2014-05-06 12:29:52        | 2014-05-06 12:30:12

| 2015-04-29 09:17:23     |                          357.8661 |

|       2179 |   1342 | 2014-05-08 04:49:53        | 2014-05-08 04:49:53

| 2015-04-29 09:17:23     |                          356.1858 |

|       2281 |   1440 | 2014-05-12 08:24:54        | 2014-05-12 08:25:09

| 2015-04-29 09:17:23     |                          352.0363 |

|     115281 |  80929 | 2015-04-29 07:55:40        | 1969-12-31 16:00:00

| 2015-04-29 09:17:23     |                        16554.6787 |

|     115282 |  80930 | 2015-04-29 08:08:37        | 1969-12-31 16:00:00

| 2015-04-29 08:08:50     |                        16554.6311 |

|     115285 |  80932 | 2015-04-29 08:10:59        | 1969-12-31 16:00:00

| 2015-04-29 08:11:08     |                        16554.6327 |

+------------+--------+----------------------------+---------------------------+-------------------------+-----------------------------------+

6 rows in set (0.01 sec)





1228, 1342 and 1440 are the jobids that existed in both the queries that you

ran for me (December 2014 and April 2015). The others are in the April 2015

batch.



Is it normal for jobs to run 30 to 70 days? I see a lot that run for about 50

days.



Thanks,

Brian

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 34 Brian Christiansen 2015-10-12 03:28:41 MDT

Will you run this query too? It adds the time_eligible column.

Thanks,
Brian

select job_db_inx, id_job, from_unixtime(time_submit), from_unixtime(time_eligible), from_unixtime(time_start), from_unixtime(time_end), (time_end - time_eligible)/(3600*24), (time_end - time_start)/(3600*24) from descartes_job_table where (time_end - time_start) > (3600 * 24 * 80);

Comment 35 William Schadlich 2015-10-12 03:30:28 MDT

Created attachment 2289 [details]
slurm_select_output2.txt

As requested.  Thanks.



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Monday, October 12, 2015 11:29 AM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 34<http://bugs.schedmd.com/show_bug.cgi?id=1706#c34> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Will you run this query too? It adds the time_eligible column.



Thanks,

Brian



select job_db_inx, id_job, from_unixtime(time_submit),

from_unixtime(time_eligible), from_unixtime(time_start),

from_unixtime(time_end), (time_end - time_eligible)/(3600*24), (time_end -

time_start)/(3600*24) from descartes_job_table where (time_end - time_start) >

(3600 * 24 * 80);

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 36 Brian Christiansen 2015-10-12 03:33:05 MDT

Do you see any errors in slurmctld logs? Even in the last couple of days, it looks like the time_start messages aren't getting to database.

Comment 37 William Schadlich 2015-10-12 03:58:39 MDT

There are errors from the slurmdbd.log

[2015-10-02T15:05:20.724] error: We have more allocated time than is possible (44964192 > 29030400) for cluster descartes(8064) from 2015-04-29T10:00:00 - 2015-04-29T11:00:00
[2015-10-02T15:05:20.724] error: We have more time than is possible (29030400+1636944+0)(30667344) > 29030400 for cluster descartes(8064) from 2015-04-29T10:00:00 - 2015-04-29T11:00:00
[2015-10-02T15:05:20.848] error: We have more allocated time than is possible (34511168 > 29030400) for cluster descartes(8064) from 2015-04-29T11:00:00 - 2015-04-29T12:00:00
[2015-10-02T15:05:20.849] error: We have more time than is possible (29030400+21754128+0)(50784528) > 29030400 for cluster descartes(8064) from 2015-04-29T11:00:00 - 2015-04-29T12:00:00
[2015-10-02T15:05:20.974] error: We have more time than is possible (8677760+27966656+0)(36644416) > 29030400 for cluster descartes(8064) from 2015-04-29T12:00:00 - 2015-04-29T13:00:00
[2015-10-02T15:09:52.582] Warning: Note very large processing time from hourly_rollup for descartes: usec=591744348 began=15:00:00.838
[2015-10-02T15:10:06.395] Warning: Note very large processing time from daily_rollup for descartes: usec=13812530 began=15:09:52.582
[wwschad@clnxcat02 slurm]$



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Monday, October 12, 2015 11:33 AM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 36<http://bugs.schedmd.com/show_bug.cgi?id=1706#c36> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Do you see any errors in slurmctld logs? Even in the last couple of days, it

looks like the time_start messages aren't getting to database.

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 38 Brian Christiansen 2015-10-12 04:33:34 MDT

What about in /var/log/slurm/slurmctld.log?

Comment 39 William Schadlich 2015-10-12 04:39:10 MDT

Created attachment 2290 [details]
slurmctld_log.txt

Don't see anything in particular.  2000 line from the end of the file attached.



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Monday, October 12, 2015 12:34 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 38<http://bugs.schedmd.com/show_bug.cgi?id=1706#c38> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

What about in /var/log/slurm/slurmctld.log?

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 40 Brian Christiansen 2015-10-12 04:56:04 MDT

Just making sure. How far back does the log go? And if you grep for "error" do you see anything?

Comment 41 William Schadlich 2015-10-12 04:59:47 MDT

Created attachment 2291 [details]
slurmctld_log_error.txt.tgz

Here are the errors in the log.  I haven't rotated it in a while (or possibly ever) so it's all there.


Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Monday, October 12, 2015 12:56 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 40<http://bugs.schedmd.com/show_bug.cgi?id=1706#c40> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Just making sure. How far back does the log go? And if you grep for "error" do

you see anything?

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 42 Brian Christiansen 2015-10-13 08:04:32 MDT

It looks like we just need to clean up 3 jobs for now. Will you run this query:

UPDATE descartes_job_table SET time_end = time_start WHERE id_job in (1228, 1342, 1440);

and then re-run the rollup?

UPDATE descartes_last_ran_table 
SET 
    hourly_rollup  = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
    daily_rollup   = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
    monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');

Let me know how this goes.

Thanks,
Brian

Comment 43 William Schadlich 2015-10-13 08:08:17 MDT

Updates completed.


mysql> UPDATE descartes_job_table SET time_end = time_start WHERE id_job in (1228,1342, 1440);
Query OK, 6 rows affected (0.04 sec)
Rows matched: 6  Changed: 6  Warnings: 0

mysql> UPDATE descartes_last_ran_table
    -> SET
    ->     hourly_rollup  = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
    ->     daily_rollup   = UNIX_TIMESTAMP('2014-11-30 00:00:00'),
    ->     monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');
Query OK, 1 row affected (0.07 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql>



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Tuesday, October 13, 2015 4:05 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 42<http://bugs.schedmd.com/show_bug.cgi?id=1706#c42> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

It looks like we just need to clean up 3 jobs for now. Will you run this query:



UPDATE descartes_job_table SET time_end = time_start WHERE id_job in (1228,

1342, 1440);



and then re-run the rollup?



UPDATE descartes_last_ran_table

SET

    hourly_rollup  = UNIX_TIMESTAMP('2014-11-30 00:00:00'),

    daily_rollup   = UNIX_TIMESTAMP('2014-11-30 00:00:00'),

    monthly_rollup = UNIX_TIMESTAMP('2014-11-30 00:00:00');



Let me know how this goes.



Thanks,

Brian

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 44 Brian Christiansen 2015-10-14 04:30:57 MDT

Have the rollups finished? If so how does sreport look now?

Comment 45 William Schadlich 2015-10-14 04:48:20 MDT

March and April seem to be off still.  They should be ~54% and 50% respectively.

[wwschad@clnxcat02 ~]$ sreport cluster utilization start=3/01/15 end=3/30/15 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2015-03-01T00:00:00 - 2015-03-29T23:59:59 (2502000*cpus secs)
Time reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster    Allocated       Down  PLND Down         Idle  Reserved     Reported
--------- ------------ ---------- ---------- ------------ --------- ------------
descartes       85.53%      0.01%      0.00%       14.42%     0.05%      100.00%
[wwschad@clnxcat02 ~]$

[wwschad@clnxcat02 ~]$ sreport cluster utilization start=4/01/15 end=4/30/15 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2015-04-01T00:00:00 - 2015-04-29T23:59:59 (2505600*cpus secs)
Time reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster    Allocated       Down  PLND Down         Idle  Reserved     Reported
--------- ------------ ---------- ---------- ------------ --------- ------------
descartes       98.45%      0.55%      0.00%        1.00%     0.00%      100.00%
[wwschad@clnxcat02 ~]$




Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Wednesday, October 14, 2015 12:31 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 44<http://bugs.schedmd.com/show_bug.cgi?id=1706#c44> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Have the rollups finished? If so how does sreport look now?

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 46 Brian Christiansen 2015-10-14 04:52:46 MDT

Will you attach your latest slurmdbd.log?

Comment 47 Brian Christiansen 2015-10-19 08:52:10 MDT

Bill, can you attach your latest slurmdbd.log? I want to see if there were any errors during the last re-rollup.

Thanks,
Brian

Comment 48 William Schadlich 2015-10-19 08:57:11 MDT

Created attachment 2316 [details]
slurmdbd.log_19oct2015.tgz

Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Monday, October 19, 2015 4:52 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 47<http://bugs.schedmd.com/show_bug.cgi?id=1706#c47> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Bill, can you attach your latest slurmdbd.log? I want to see if there were any

errors during the last re-rollup.



Thanks,

Brian

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 49 Danny Auble 2015-10-20 05:10:48 MDT

Bill could you send the output of this again...

select job_db_inx, id_job, from_unixtime(time_submit),
from_unixtime(time_start), from_unixtime(time_end), (time_end -
time_start)/(3600*24) from exxon_job_table2 where (time_end - time_start) >
(3600 * 24 * 80);

Comment 50 William Schadlich 2015-10-20 06:12:28 MDT

Created attachment 2319 [details]
select_results_20oct2015.txt

As requested.  Minor update to correct table name.



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Tuesday, October 20, 2015 1:11 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 49<http://bugs.schedmd.com/show_bug.cgi?id=1706#c49> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Danny Auble<mailto:da@schedmd.com>

Bill could you send the output of this again...



select job_db_inx, id_job, from_unixtime(time_submit),

from_unixtime(time_start), from_unixtime(time_end), (time_end -

time_start)/(3600*24) from exxon_job_table2 where (time_end - time_start) >

(3600 * 24 * 80);

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 51 Danny Auble 2015-10-20 06:22:06 MDT

Thanks, could you now send the output of 

select job_db_inx, id_job, time_submit, time_start, time_end, (time_end - time_start)/(3600*24), from_unixtime(time_submit), from_unixtime(time_start), from_unixtime(time_end) from descartes_job_table where time_end != 0 && time_start != 0 and (time_end - time_start) > (3600 * 24 * 20);

Comment 52 William Schadlich 2015-10-20 06:25:12 MDT

Created attachment 2320 [details]
select_results_20oct2015_new.txt

Here is the newest run.  Thanks.



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Tuesday, October 20, 2015 2:22 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 51<http://bugs.schedmd.com/show_bug.cgi?id=1706#c51> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Danny Auble<mailto:da@schedmd.com>

Thanks, could you now send the output of



select job_db_inx, id_job, time_submit, time_start, time_end, (time_end -

time_start)/(3600*24), from_unixtime(time_submit), from_unixtime(time_start),

from_unixtime(time_end) from descartes_job_table where time_end != 0 &&

time_start != 0 and (time_end - time_start) > (3600 * 24 * 20);

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 53 Danny Auble 2015-10-20 06:38:42 MDT

Thanks Bill this does give good information.

From this it appears there were a few jobs in here that were indeed running up until you did the upgrade which cleaned things up (put an end time in the job).

It looks like the magic timestamp is 1430324243 (2015-04-29 12:17:23).

Our thought process now is to just zero all those jobs out unless you can find some information on them pointing as to when they actually ended.

A query that should update them to get you sane numbers is this

update descartes_job_table set time_end=time_start where time_end=1430324243;

You can do a

select descartes_job_table where time_end=1430324243;

to give you an idea of how many jobs that would be, from this output it looks like it would be quite a few (> 200).

They all appeared to start around March 11-13 with nothing obvious going wrong after that.

My current believe is if you update the table with that command things will be back to normal for those 2 months.

Comment 54 Danny Auble 2015-10-20 06:41:46 MDT

select * descartes_job_table where time_end=1430324243;

is the select statement, I missed the '*', sorry.

Comment 55 William Schadlich 2015-10-20 06:50:21 MDT

All good.  208 records updated.



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Tuesday, October 20, 2015 2:42 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 54<http://bugs.schedmd.com/show_bug.cgi?id=1706#c54> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Danny Auble<mailto:da@schedmd.com>

select * descartes_job_table where time_end=1430324243;



is the select statement, I missed the '*', sorry.

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 56 Danny Auble 2015-10-20 06:53:40 MDT

Cool, just run

UPDATE descartes_last_ran_table
SET
    hourly_rollup  = UNIX_TIMESTAMP('2015-2-29 00:00:00'),
    daily_rollup   = UNIX_TIMESTAMP('2015-2-29 00:00:00'),
    monthly_rollup = UNIX_TIMESTAMP('2015-2-29 00:00:00');

and it should reroll at the top of the next hour and you should be set (hopefully).

Comment 57 William Schadlich 2015-10-20 06:55:51 MDT

mysql> UPDATE descartes_last_ran_table
    -> SET
    ->     hourly_rollup  = UNIX_TIMESTAMP('2015-2-29 00:00:00'),
    ->     daily_rollup   = UNIX_TIMESTAMP('2015-2-29 00:00:00'),
    ->     monthly_rollup = UNIX_TIMESTAMP('2015-2-29 00:00:00');
Query OK, 1 row affected, 3 warnings (0.06 sec)
Rows matched: 1  Changed: 1  Warnings: 0



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Tuesday, October 20, 2015 2:54 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 56<http://bugs.schedmd.com/show_bug.cgi?id=1706#c56> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Danny Auble<mailto:da@schedmd.com>

Cool, just run



UPDATE descartes_last_ran_table

SET

    hourly_rollup  = UNIX_TIMESTAMP('2015-2-29 00:00:00'),

    daily_rollup   = UNIX_TIMESTAMP('2015-2-29 00:00:00'),

    monthly_rollup = UNIX_TIMESTAMP('2015-2-29 00:00:00');



and it should reroll at the top of the next hour and you should be set

(hopefully).

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 58 Danny Auble 2015-10-20 10:31:48 MDT

Things look better now?

Comment 59 William Schadlich 2015-10-21 01:28:34 MDT

Much better now.  Thanks.



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Tuesday, October 20, 2015 6:32 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 58<http://bugs.schedmd.com/show_bug.cgi?id=1706#c58> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Danny Auble<mailto:da@schedmd.com>

Things look better now?

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 60 William Schadlich 2015-10-21 01:29:36 MDT

Going forward is there a script I can run to verify what jobs aren't closed after say 60 days so that I could kill them.  Can you forward those to me including scripts to kill them?  Thanks.



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Tuesday, October 20, 2015 6:32 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 58<http://bugs.schedmd.com/show_bug.cgi?id=1706#c58> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Danny Auble<mailto:da@schedmd.com>

Things look better now?

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 61 Danny Auble 2015-10-21 02:48:05 MDT

The perl script from comment 4 (attachment 2239 [details]) should do what you want, if not it can probably be modified to do so. 

The good news is ever since your upgrade in April it doesn't appear you have had any runaways. 

Glad things are better, let us know if you need anything else on this.

Comment 62 Brian Christiansen 2015-10-21 10:53:47 MDT

The scripts don't cleanup the jobs automatically. Let us know if you ever see them again and we can look at it.

Thanks,
Brian

Comment 63 William Schadlich 2015-11-03 07:04:42 MST

Interestingly the sreport this month now shows 63.33% reported.  Is this due to the power saving plug-in?  Last month we had 100% reported.  The numbers look pretty good to the sampling we are taking especially showing a good number for reserved but I need to understand and explain the Reported percentage.  Thanks.

Comment 64 Brian Christiansen 2015-11-04 02:19:48 MST

Will you show an example of what you are seeing?

Comment 65 William Schadlich 2015-11-04 02:34:44 MST

Output in previous message.  Let me know if you don't see it.  Thanks.



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Wednesday, November 04, 2015 11:20 AM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Brian Christiansen<mailto:brian@schedmd.com> changed bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706>
What

Removed

Added

Status

RESOLVED

UNCONFIRMED

Resolution

INFOGIVEN

---

Comment # 64<http://bugs.schedmd.com/show_bug.cgi?id=1706#c64> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Will you show an example of what you are seeing?

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 66 Brian Christiansen 2015-11-04 02:50:19 MST

I don't see it in Comment 63. Will you attach it again? Thanks.

Comment 67 William Schadlich 2015-11-04 04:15:37 MST

[wwschad@clnhpc01 ~]$ sreport cluster utilization start=10/1/15 end=10/31/15 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2015-10-01T00:00:00 - 2015-10-30T23:59:59 (2592000*cpus secs)
Time reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster    Allocated       Down  PLND Down         Idle  Reserved     Reported
--------- ------------ ---------- ---------- ------------ --------- ------------
descartes       71.02%      2.05%      0.00%       15.75%    11.18%       63.33%
[wwschad@clnhpc01 ~]$  sreport cluster utilization start=9/1/15 end=9/31/15 -t percent
--------------------------------------------------------------------------------
Cluster Utilization 2015-09-01T00:00:00 - 2015-09-30T23:59:59 (2592000*cpus secs)
Time reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster    Allocated       Down  PLND Down         Idle  Reserved     Reported
--------- ------------ ---------- ---------- ------------ --------- ------------
descartes       34.23%      5.52%      0.00%       58.88%     1.37%      100.00%

Comment 68 Brian Christiansen 2015-11-04 04:25:01 MST

Thanks Bill, that helps clarify things. Will you run this command?

sacctmgr show event start=10/1/15

Comment 69 William Schadlich 2015-11-04 04:29:55 MST

Created attachment 2386 [details]
slurm_output_sacctmgr.04NOV2015.txt

Comment 70 Brian Christiansen 2015-11-04 04:50:19 MST

Will you run these too -- notice that you weren't getting a full month:

sreport cluster utilization start=10/1/15 end=11/1/15 -tMinPer
sreport cluster utilization start=9/1/15 end=10/1/15 -tMinPer

Comment 71 William Schadlich 2015-11-04 05:05:15 MST

Doesn't like the topic one

[wwschad@clnhpc01 ~]$ sreport cluster utilization start=10/1/15 end=11/1/15 -tMi                                              nPer
--------------------------------------------------------------------------------
Cluster Utilization 2015-10-01T00:00:00 - 2015-10-31T23:59:59 (2678400*cpus secs                                              )
Time reported in CPU Minutes/Percentage of Total
--------------------------------------------------------------------------------
  Cluster            Allocated               Down          PLND Down                                                               Idle           Reserved             Reported
--------- -------------------- ------------------ ------------------ -----------                                              --------- ------------------ --------------------
[wwschad@clnhpc01 ~]$ sreport cluster utilization start=10/1/15 end=11/1/15 -tMinPer
--------------------------------------------------------------------------------
Cluster Utilization 2015-10-01T00:00:00 - 2015-10-31T23:59:59 (2678400*cpus secs)
Time reported in CPU Minutes/Percentage of Total
--------------------------------------------------------------------------------
  Cluster            Allocated               Down          PLND Down                 Idle           Reserved             Reported
--------- -------------------- ------------------ ------------------ -------------------- ------------------ --------------------
[wwschad@clnhpc01 ~]$ sreport cluster utilization start=9/1/15 end=10/1/15 -tMinPer
--------------------------------------------------------------------------------
Cluster Utilization 2015-09-01T00:00:00 - 2015-09-30T23:59:59 (2592000*cpus secs)
Time reported in CPU Minutes/Percentage of Total
--------------------------------------------------------------------------------
  Cluster            Allocated               Down          PLND Down                 Idle           Reserved             Reported
--------- -------------------- ------------------ ------------------ -------------------- ------------------ --------------------
descartes    119257881(34.23%)    19233198(5.52%)           0(0.00%)    205115781(58.88%)     4757940(1.37%)   348364800(100.00%)



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Wednesday, November 04, 2015 1:50 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 70<http://bugs.schedmd.com/show_bug.cgi?id=1706#c70> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Will you run these too -- notice that you weren't getting a full month:



sreport cluster utilization start=10/1/15 end=11/1/15 -tMinPer

sreport cluster utilization start=9/1/15 end=10/1/15 -tMinPer

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 72 Brian Christiansen 2015-11-04 07:32:30 MST

Will you grab this output? It looks like the month rollup hasn't happened yet.

select * from descartes_last_ran_table;

and will you get this too?
sreport cluster utilization start=10/1/15 end=10/31/15 -tminper

Comment 73 William Schadlich 2015-11-04 08:04:32 MST

mysql> select * from descartes_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1445367600 |   1445313600 |     1443672000 |
+---------------+--------------+----------------+
1 row in set (0.00 sec)


[root@clnxcat02 ~]# sreport cluster utilization start=10/1/15 end=10/31/15 -tminper
--------------------------------------------------------------------------------
Cluster Utilization 2015-10-01T00:00:00 - 2015-10-30T23:59:59 (2592000*cpus secs)
Time reported in CPU Minutes/Percentage of Total
--------------------------------------------------------------------------------
  Cluster            Allocated               Down          PLND Down                 Idle           Reserved             Reported
--------- -------------------- ------------------ ------------------ -------------------- ------------------ --------------------
descartes    156698707(71.02%)     4513662(2.05%)           0(0.00%)     34753009(15.75%)   24665662(11.18%)    220631040(63.33%)
[root@clnxcat02 ~]#


Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Wednesday, November 04, 2015 4:33 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 72<http://bugs.schedmd.com/show_bug.cgi?id=1706#c72> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Will you grab this output? It looks like the month rollup hasn't happened yet.



select * from descartes_last_ran_table;



and will you get this too?

sreport cluster utilization start=10/1/15 end=10/31/15 -tminper

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 74 Brian Christiansen 2015-11-04 08:18:11 MST

You're rollups haven't happened since Mon Oct 19 21:00:00 2015. Will you send your slurmdbd.log?

Comment 75 William Schadlich 2015-11-05 00:33:38 MST

Created attachment 2392 [details]
slurmdbd.tar

Comment 76 Brian Christiansen 2015-11-05 03:39:41 MST

Will you set the DebugLevel to debug2 in your slurmdbd.conf and restart the slurmdbd? Then watch for the messages: 

slurmdbd: debug2: running rollup at Thu Nov 05 09:22:24 2015
slurmdbd: debug2: Everything rolled up

Once you see the last message will you run this again and send the slurmdbd.log again?

select * from descartes_last_ran_table;

You could restart the dbd again too without debug2.

Thanks,
Brian

Comment 77 William Schadlich 2015-11-06 04:43:39 MST

Waiting for the line to hit the file but am noticing a slowdown with the system processing new jobs is this normal?  Its takes like 30-50 sec to obtain a node with a basic srun -N1 -pty bash.  Thoughts?

Comment 78 William Schadlich 2015-11-06 04:45:16 MST

How long does it take before it decides to run?  I still don't see anything that I can tell in the logs?

[2015-10-22T16:36:13.994] error: Processing last message from connection 9(159.70.89.221) uid(-2)
[2015-11-06T11:40:21.840] Terminate signal (SIGINT or SIGTERM) received


mysql> select * from descartes_last_ran_table;
+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1446832800 |   1446786000 |     1446350400 |
+---------------+--------------+----------------+
1 row in set (0.00 sec)



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Thursday, November 05, 2015 12:40 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 76<http://bugs.schedmd.com/show_bug.cgi?id=1706#c76> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Brian Christiansen<mailto:brian@schedmd.com>

Will you set the DebugLevel to debug2 in your slurmdbd.conf and restart the

slurmdbd? Then watch for the messages:



slurmdbd: debug2: running rollup at Thu Nov 05 09:22:24 2015

slurmdbd: debug2: Everything rolled up



Once you see the last message will you run this again and send the slurmdbd.log

again?



select * from descartes_last_ran_table;



You could restart the dbd again too without debug2.



Thanks,

Brian

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 79 Tim Wickberg 2015-11-06 05:02:04 MST

Job scheduling should not be significantly impacted by the database updates - if you're seeing throughput problems can you see if there are any other potential causes? Is there an unusually high number of jobs in the system, or were any config changes done when updating to 15.08.3?

Those database timestamps are current for today, so the rollup process has happened and is current now. The hourly rollup timestamp should just forward every hour, daily every day. If you hadn't noticed before those are just unix timestamp values stored in the database - 1446832800 decodes as today at 18:00 UTC.

- Tim

Comment 80 Tim Wickberg 2015-11-06 05:13:32 MST

I may have misread part of the bug history, my apologies. If you're still on the 14.11 release you weren't affected by the bug in 15.08. 

The timestamps indicate that the rollup process is happening on schedule now. Unless you've increased the debug level to debug2 or above you shouldn't see messages in slurmdbd.log related to those rollups unless there are specific problems.

Comment 81 William Schadlich 2015-11-06 05:28:30 MST

No we are still on 14.11.5.  I was just wondering is it possible the database I getting large or one of the log files that needs to be rotated out.  Thanks.

Comment 82 Tim Wickberg 2015-12-01 08:49:04 MST

I'm realizing your last message may have still been meant as a question?

Database size is entirely up to you and your site configuration - we do recommend that you prune it every so often but it's not mandatory. The main thing that can take some time is the upgrade process to a new release - 15.08 in particular changes a few columns in the database and the slurmdbd needs to restructure that all in one transaction. The time that takes (and memory) is ~ O(number of job steps * number of associations) which can get quite large. We've seen updates take 15 minutes on ~ 1mil jobs (say ~=2-3 million job steps) and 400 associations, but that can vary significantly depending on hardware.

Let me know if you had any further questions on this or if I can go ahead and close out the bug.

- Tim

Comment 83 William Schadlich 2015-12-02 00:22:54 MST

Appreciate the feedback.  I'm good now thanks.



Regards,
Bill

______________________________________________________________________
William W. Schadlich
EMIT Foundation Infrastructure Middleware Retail Unix Solutions
High Performance Computing

EMRE Support
908-730-1011
Ticket Support:
http://itservices
                          Select "Get Assistance"
                          Select "High Performance Cluster"
                          Select - Request Type "Downstream: All Requests"
                          In Description of Issue* field
                               "Type name of server and problem you are having including your ID"

http://goto/bill

From: bugs@schedmd.com [mailto:bugs@schedmd.com]
Sent: Tuesday, December 01, 2015 5:49 PM
To: Schadlich, William W
Subject: [Bug 1706] sreport indicates 98.45% allocated for month when sample clearly indicates 50.4%

Comment # 82<http://bugs.schedmd.com/show_bug.cgi?id=1706#c82> on bug 1706<http://bugs.schedmd.com/show_bug.cgi?id=1706> from Tim Wickberg<mailto:tim@schedmd.com>

I'm realizing your last message may have still been meant as a question?



Database size is entirely up to you and your site configuration - we do

recommend that you prune it every so often but it's not mandatory. The main

thing that can take some time is the upgrade process to a new release - 15.08

in particular changes a few columns in the database and the slurmdbd needs to

restructure that all in one transaction. The time that takes (and memory) is ~

O(number of job steps * number of associations) which can get quite large.

We've seen updates take 15 minutes on ~ 1mil jobs (say ~=2-3 million job steps)

and 400 associations, but that can vary significantly depending on hardware.



Let me know if you had any further questions on this or if I can go ahead and

close out the bug.



- Tim

________________________________
You are receiving this mail because:

 *   You reported the bug.

Comment 84 Tim Wickberg 2015-12-02 00:47:28 MST

Marking as closed.