| Summary: | Please help understand how Elapsed is calculated by sacct | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sergey Meirovich <sergey_meirovich> |
| Component: | Other | Assignee: | Dominik Bartkiewicz <bart> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 15.08.12 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | AMAT | Alineos Sites: | --- |
Description
Sergey Meirovich
2016-06-21 12:39:30 MDT
As you note, elapsed should be (time_end - time_begin) for the job, minus any suspend time. Are you seeing something different? Can you attach a chunk of "sacct --format=jobid,start,end,elapsed" where the numbers don't match up?

Hmm, it looks like my errors appeared because start_time is sometimes 0. What does that mean?

Hi Tim,
I found a clear abnormality in the db:
mysql> SELECT id_job,job_db_inx,time_end,time_start,time_suspended,CONVERT(time_end,DECIMAL(65)) - CONVERT(time_start,DECIMAL(65)) - CONVERT(time_suspended,DECIMAL(65)) as t FROM austin_job_table WHERE time_start<>0 AND time_end<>0 ORDER BY t LIMIT 3;
+--------+------------+------------+------------+----------------+-------------+
| id_job | job_db_inx | time_end   | time_start | time_suspended | t           |
+--------+------------+------------+------------+----------------+-------------+
|   5365 |       8460 | 1453748912 | 1453748510 |     1453748912 | -1453748510 |
|   5366 |       8461 | 1453748912 | 1453748510 |     1453748912 | -1453748510 |
|   5364 |       8459 | 1453748912 | 1453748510 |     1453748912 | -1453748510 |
+--------+------------+------------+------------+----------------+-------------+
3 rows in set (1.37 sec)
mysql>
-bash-4.1$ sacct --format=jobid,start,end,elapsed,suspended -j 5364
JobID        Start               End                 Elapsed    Suspended
------------ ------------------- ------------------- ---------- ----------
5364         Jan 25 11:01        Jan 25 11:08        00:00:00   16825-19:08:32
5364.batch   Jan 25 11:01        Jan 27 10:46        1-23:44:19 00:00:00
5364.0       Jan 25 11:01        Jan 27 10:46        1-23:44:19 00:00:00
-bash-4.1$ sacct --format=jobid,start,end,elapsed,suspended -j 5365
JobID        Start               End                 Elapsed    Suspended
------------ ------------------- ------------------- ---------- ----------
5365         Jan 25 11:01        Jan 25 11:08        00:00:00   16825-19:08:32
5365.batch   Jan 25 11:01        Jan 27 10:47        1-23:45:23 00:00:00
5365.0       Jan 25 11:01        Jan 27 10:47        1-23:45:23 00:00:00
-bash-4.1$ sacct --format=jobid,start,end,elapsed,suspended -j 5366
JobID        Start               End                 Elapsed    Suspended
------------ ------------------- ------------------- ---------- ----------
5366         Jan 25 11:01        Jan 25 11:08        00:00:00   16825-19:08:32
5366.batch   Jan 25 11:01        Jan 28 9:50         2-22:48:30 00:00:00
5366.0       Jan 25 11:01        Jan 28 9:50         2-22:48:30 00:00:00
-bash-4.1$
I reset these 3 time_suspended values to 0, as they are obviously odd...

time_suspended is the timestamp of the most recent suspension, not a duration. There's a separate table - $CLUSTER_suspend_table - that deals with the individual start/stop timestamps for the job. Something does appear to be odd here - having the entries from $CLUSTER_suspend_table and the logs from slurmctld may help narrow down the issue. My best guess, assuming everything else was working properly, is that the job completed right when it was due for suspension by the gang scheduler, and some race condition resulted in these records. If you're able to get info from the database and logs, it would help locate such a bug.

Hi,
I know it's been a while, but could you send me the slurmctld and slurmdbd logs from 25 to 28 January?
Dominik

Hi,
Unfortunately the logs from 25 to 28 January have already been rotated out. Let's close this case. If the issue recurs we can always re-open it.

Marking resolved; please reopen if you notice this again.
Dominik
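
For anyone landing on this ticket later, the arithmetic discussed above can be checked with a short Python sketch. It is an illustration only, not Slurm's actual code: the function names `elapsed_seconds` and `format_duration` are hypothetical, and clamping a negative result to zero is an assumption inferred from the `00:00:00` that sacct printed for the parent jobs. The input numbers are the ones from the query output for job 5364.

```python
def elapsed_seconds(time_start: int, time_end: int, suspended: int = 0) -> int:
    """(time_end - time_start) minus suspended seconds.

    Clamping at zero is an assumption, based on sacct showing 00:00:00
    for the affected jobs rather than a negative duration.
    """
    return max(0, time_end - time_start - suspended)


def format_duration(seconds: int) -> str:
    """Render seconds in a [days-]HH:MM:SS layout similar to sacct's."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    hms = f"{hours:02d}:{minutes:02d}:{secs:02d}"
    return f"{days}-{hms}" if days else hms


# Values for job 5364, copied from the query output in this ticket.
time_start = 1453748510
time_end = 1453748912
time_suspended = 1453748912  # an epoch timestamp, not a duration

# Reading the timestamp as a duration reproduces the odd Suspended column.
print(format_duration(time_suspended))  # 16825-19:08:32

# Subtracting it from the wall time goes far negative, so Elapsed shows zero.
print(format_duration(elapsed_seconds(time_start, time_end, time_suspended)))  # 00:00:00

# With the bogus suspend value reset to 0, the job ran for 402 seconds.
print(format_duration(elapsed_seconds(time_start, time_end, 0)))  # 00:06:42
```

The first printed value matches the `16825-19:08:32` in the Suspended column, which suggests the stored timestamp was treated as a number of suspended seconds for these three records.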