I am working on automatic classification (using python scikit-learn) of the jobs and further using these data for various reporting. I simply put classificated data into separated database and now using that cross joining slurm database with my database. My problem is I could figure out how exactly Elapsed is calculated basing on data from slurm database. Initially I thought that for any already finished job this is simply (time_end - time_start - time_suspended) from job_table of step_table. That didn't happen to be true. Then I tried user_sec + sys_sec from step_table without luck. Could you please shed a light on Elapsed formula?
As you note, elapsed should be (time_end - time_begin) for the job, minus any suspend time. Are you seeing something different? Can you attach a chunk of "sacct --format=jobid,start,end,elapsed" where the numbers don't match up?
Hmm, Looks like my errors appeared due to sometime start_time is 0 What is that?
Hi Tim, Fund clear abnormaly in the db: mysql> SELECT id_job,job_db_inx,time_end,time_start,time_suspended,CONVERT(time_end,DECIMAL(65)) - CONVERT(time_start,DECIMAL(65)) - CONVERT(time_suspended,DECIMAL(65)) as t FROM austin_job_table WHERE time_start<>0 AND time_end<>0 ORDER BY t LIMIT 3; +--------+------------+------------+------------+----------------+-------------+ | id_job | job_db_inx | time_end | time_start | time_suspended | t | +--------+------------+------------+------------+----------------+-------------+ | 5365 | 8460 | 1453748912 | 1453748510 | 1453748912 | -1453748510 | | 5366 | 8461 | 1453748912 | 1453748510 | 1453748912 | -1453748510 | | 5364 | 8459 | 1453748912 | 1453748510 | 1453748912 | -1453748510 | +--------+------------+------------+------------+----------------+-------------+ 3 rows in set (1.37 sec) mysql> -bash-4.1$ sacct --format=jobid,start,end,elapsed,suspended -j 5364 JobID Start End Elapsed Suspended ------------ ------------------- ------------------- ---------- ---------- 5364 Jan 25 11:01 Jan 25 11:08 00:00:00 16825-19:08:32 5364.batch Jan 25 11:01 Jan 27 10:46 1-23:44:19 00:00:00 5364.0 Jan 25 11:01 Jan 27 10:46 1-23:44:19 00:00:00 -bash-4.1$ sacct --format=jobid,start,end,elapsed,suspended -j 5365 JobID Start End Elapsed Suspended ------------ ------------------- ------------------- ---------- ---------- 5365 Jan 25 11:01 Jan 25 11:08 00:00:00 16825-19:08:32 5365.batch Jan 25 11:01 Jan 27 10:47 1-23:45:23 00:00:00 5365.0 Jan 25 11:01 Jan 27 10:47 1-23:45:23 00:00:00 -bash-4.1$ sacct --format=jobid,start,end,elapsed,suspended -j 5366 JobID Start End Elapsed Suspended ------------ ------------------- ------------------- ---------- ---------- 5366 Jan 25 11:01 Jan 25 11:08 00:00:00 16825-19:08:32 5366.batch Jan 25 11:01 Jan 28 9:50 2-22:48:30 00:00:00 5366.0 Jan 25 11:01 Jan 28 9:50 2-22:48:30 00:00:00 -bash-4.1$
I reset these 3 time_suspended to 0 as they obviously odd...
time_suspended is the timestamp of the most recent suspension, not a value. There's a separate table - $CLUSTER_suspend_table - that deals with the individual start/stop timestamps for the job. Something does appear to be odd here - having the entries from $CLUSTER_suspend_table and logs from slurmctld may help narrow down the issue. My best guess, assuming everything else was working properly, is that the job completed right when it was due for suspension by the gang scheduler and some race condition resulted in these records. If you're able to get info from the database and logs it'd help locate such a bug.
Hi I know it's been a while, but could you send me slurmctld and slurmdbd logs from 25 till 28 January? Dominik
Hi, Unfortunately logs from 25 till 28 January have been already rotated over. Let's probably close that case. If the issue re-occurs we could always re-open it.
Marking resolved; please reopen if you notice this again. Dominik