Ticket 2848

Summary:	Please help understand how Elapsed is calculated by sacct
Product:	Slurm	Reporter:	Sergey Meirovich <sergey_meirovich>
Component:	Other	Assignee:	Dominik Bartkiewicz <bart>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	4 - Minor Issue
Priority:	---
Version:	15.08.12
Hardware:	Linux
OS:	Linux
Site:	AMAT	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---

Description Sergey Meirovich 2016-06-21 12:39:30 MDT

I am working on automatic classification (using python scikit-learn) of the jobs and further using these data for various reporting. I simply put classificated data into separated database and now using that cross joining slurm database with my database.

My problem is I could figure out how exactly Elapsed is calculated basing on data from slurm database.

Initially I thought that for any already finished job  this is simply (time_end - time_start - time_suspended) from job_table of step_table. That didn't happen to be true. Then I tried user_sec + sys_sec from step_table without luck.

Could you please shed a light on Elapsed formula?

Comment 1 Tim Wickberg 2016-06-21 15:18:15 MDT

As you note, elapsed should be (time_end - time_begin) for the job, minus any suspend time.

Are you seeing something different?

Can you attach a chunk of "sacct --format=jobid,start,end,elapsed" where the numbers don't match up?

Comment 2 Sergey Meirovich 2016-06-21 23:06:28 MDT

Hmm,

Looks like my errors appeared due to sometime start_time is 0

What is that?

Comment 3 Sergey Meirovich 2016-06-23 06:41:08 MDT

Hi Tim,

Fund clear abnormaly in the db:

mysql>  SELECT id_job,job_db_inx,time_end,time_start,time_suspended,CONVERT(time_end,DECIMAL(65)) - CONVERT(time_start,DECIMAL(65)) - CONVERT(time_suspended,DECIMAL(65)) as t FROM austin_job_table WHERE time_start<>0 AND time_end<>0  ORDER BY t  LIMIT 3;
+--------+------------+------------+------------+----------------+-------------+
| id_job | job_db_inx | time_end   | time_start | time_suspended | t           |
+--------+------------+------------+------------+----------------+-------------+
|   5365 |       8460 | 1453748912 | 1453748510 |     1453748912 | -1453748510 |
|   5366 |       8461 | 1453748912 | 1453748510 |     1453748912 | -1453748510 |
|   5364 |       8459 | 1453748912 | 1453748510 |     1453748912 | -1453748510 |
+--------+------------+------------+------------+----------------+-------------+
3 rows in set (1.37 sec)

mysql> 
-bash-4.1$ sacct --format=jobid,start,end,elapsed,suspended -j 5364
       JobID               Start                 End    Elapsed  Suspended 
------------ ------------------- ------------------- ---------- ---------- 
5364                Jan 25 11:01        Jan 25 11:08   00:00:00 16825-19:08:32 
5364.batch          Jan 25 11:01        Jan 27 10:46 1-23:44:19   00:00:00 
5364.0              Jan 25 11:01        Jan 27 10:46 1-23:44:19   00:00:00 
-bash-4.1$ sacct --format=jobid,start,end,elapsed,suspended -j 5365
       JobID               Start                 End    Elapsed  Suspended 
------------ ------------------- ------------------- ---------- ---------- 
5365                Jan 25 11:01        Jan 25 11:08   00:00:00 16825-19:08:32 
5365.batch          Jan 25 11:01        Jan 27 10:47 1-23:45:23   00:00:00 
5365.0              Jan 25 11:01        Jan 27 10:47 1-23:45:23   00:00:00 
-bash-4.1$ sacct --format=jobid,start,end,elapsed,suspended -j 5366
       JobID               Start                 End    Elapsed  Suspended 
------------ ------------------- ------------------- ---------- ---------- 
5366                Jan 25 11:01        Jan 25 11:08   00:00:00 16825-19:08:32 
5366.batch          Jan 25 11:01        Jan 28  9:50 2-22:48:30   00:00:00 
5366.0              Jan 25 11:01        Jan 28  9:50 2-22:48:30   00:00:00 
-bash-4.1$

Comment 4 Sergey Meirovich 2016-06-23 07:04:12 MDT

I reset these 3 time_suspended to 0 as they obviously odd...

Comment 6 Tim Wickberg 2016-07-08 12:54:37 MDT

time_suspended is the timestamp of the most recent suspension, not a value. There's a separate table - $CLUSTER_suspend_table - that deals with the individual start/stop timestamps for the job.

Something does appear to be odd here - having the entries from $CLUSTER_suspend_table and logs from slurmctld may help narrow down the issue. My best guess, assuming everything else was working properly, is that the job completed right when it was due for suspension by the gang scheduler and some race condition resulted in these records. If you're able to get info from the database and logs it'd help locate such a bug.

Comment 12 Dominik Bartkiewicz 2016-07-11 08:41:06 MDT

Hi

I know it's been a while, but could you send me slurmctld and slurmdbd logs from 25 till 28 January?

Dominik

Comment 13 Sergey Meirovich 2016-07-11 08:44:54 MDT

Hi,

Unfortunately logs from 25 till 28 January have been already rotated over.

Let's probably close that case. If the issue re-occurs we could always re-open it.

Comment 14 Dominik Bartkiewicz 2016-07-19 03:47:37 MDT

Marking resolved; please reopen if you notice this again.

Dominik