Dear support,

I ran a small test job: an salloc for 1 node, then a simple srun to query the host name. Afterwards I requested some data via sacct:

minduni@dom101:~> sacct -j 1400263 -o JobID,User,NodeList,start,end
       JobID      User        NodeList               Start                 End
------------ --------- --------------- ------------------- -------------------
1400263        minduni        nid00419            09:59:11            09:59:15
1400263.ext+                  nid00419            09:59:11            09:59:15
1400263.0                     nid00419            09:59:14            09:59:15

This was fine. But if I add the parameter "-D --whole-hetjob=yes" (-D is not relevant), I get a huge list of jobs that aren't related to my query, and the command takes minutes to complete.

Using the DBG flags I captured the queries used in both cases:

### First one: sacct -j 1400263 -o JobID,User,NodeList,start,end

[2021-05-04T10:03:38.053] debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connection 1
[2021-05-04T10:03:38.053] debug2: Attempting to connect to fojorina06.cscs.ch:3306
[2021-05-04T10:03:38.094] debug2: DBD_GET_JOBS_COND: called
[2021-05-04T10:03:38.094] accounting_storage/as_mysql: _cluster_get_jobs: DB_JOB: 13(as_mysql_jobacct_process.c:582) query
select t1.account, t1.admin_comment, t1.array_max_tasks, t1.array_task_str, t1.constraints, t1.cpus_req,
  t1.derived_ec, t1.derived_es, t1.exit_code, t1.flags, t1.id_array_job, t1.id_array_task, t1.id_assoc,
  t1.id_block, t1.id_group, t1.id_job, t1.het_job_id, t1.het_job_offset, t1.id_qos, t1.id_resv, t3.resv_name,
  t1.id_user, t1.id_wckey, t1.job_db_inx, t1.job_name, t1.kill_requid, t1.mem_req, t1.node_inx, t1.nodelist,
  t1.nodes_alloc, t1.partition, t1.priority, t1.state, t1.state_reason_prev, t1.system_comment, t1.time_eligible,
  t1.time_end, t1.time_start, t1.time_submit, t1.time_suspended, t1.timelimit, t1.track_steps, t1.wckey,
  t1.gres_used, t1.tres_alloc, t1.tres_req, t1.work_dir, t1.mcs_label, t2.acct, t2.lft, t2.user
from "dom_job_table" as t1
left join "dom_assoc_table" as t2 on t1.id_assoc=t2.id_assoc
left join "dom_resv_table" as t3 on t1.id_resv=t3.id_resv
  && ((t1.time_start && (t3.time_start < t1.time_start && (t3.time_end >= t1.time_start || t3.time_end = 0)))
      || (t1.time_start = 0 && ((t3.time_start < t1.time_submit && (t3.time_end >= t1.time_submit || t3.time_end = 0))
          || (t3.time_start > t1.time_submit))))
where ((t1.time_submit <= 1620115418) && (t1.time_end >= 0 || t1.time_end = 0))
  && (t1.id_job in (1400263) || t1.het_job_id in (1400263) || (t1.id_array_job in (1400263)))
  && (t1.state != 524288)
order by id_job, time_submit desc

### Second one: sacct -D --whole-hetjob=yes -j 1400263 -o JobID,User,NodeList,start,end

[2021-05-04T10:02:42.174] accounting_storage/as_mysql: _cluster_get_jobs: DB_JOB: 10(as_mysql_jobacct_process.c:582) query
select t1.account, t1.admin_comment, t1.array_max_tasks, t1.array_task_str, t1.constraints, t1.cpus_req,
  t1.derived_ec, t1.derived_es, t1.exit_code, t1.flags, t1.id_array_job, t1.id_array_task, t1.id_assoc,
  t1.id_block, t1.id_group, t1.id_job, t1.het_job_id, t1.het_job_offset, t1.id_qos, t1.id_resv, t3.resv_name,
  t1.id_user, t1.id_wckey, t1.job_db_inx, t1.job_name, t1.kill_requid, t1.mem_req, t1.node_inx, t1.nodelist,
  t1.nodes_alloc, t1.partition, t1.priority, t1.state, t1.state_reason_prev, t1.system_comment, t1.time_eligible,
  t1.time_end, t1.time_start, t1.time_submit, t1.time_suspended, t1.timelimit, t1.track_steps, t1.wckey,
  t1.gres_used, t1.tres_alloc, t1.tres_req, t1.work_dir, t1.mcs_label, t2.acct, t2.lft, t2.user
from "dom_job_table" as t1
left join "dom_assoc_table" as t2 on t1.id_assoc=t2.id_assoc
left join "dom_resv_table" as t3 on t1.id_resv=t3.id_resv
  && ((t1.time_start && (t3.time_start < t1.time_start && (t3.time_end >= t1.time_start || t3.time_end = 0)))
      || (t1.time_start = 0 && ((t3.time_start < t1.time_submit && (t3.time_end >= t1.time_submit || t3.time_end = 0))
          || (t3.time_start > t1.time_submit))))
where ((t1.time_submit <= 1620115362) && (t1.time_end >= 0 || t1.time_end = 0))
  && (t1.id_job in (1400263)
      || (t1.het_job_offset<>4294967294 && t1.het_job_id in
          (select t4.het_job_id from "dom_job_table" as t4 where t4.id_job in (1400263)))
      || (t1.id_array_job in (1400263)))
order by id_job, time_submit desc

[2021-05-04T10:03:20.613] debug2: DBD_CLUSTER_TRES: called for pilatus(1=32768,2=43778048,3=0,4=128,5=32768,6=0,7=0,8=0)
[2021-05-04T10:03:30.756] debug2: DBD_FINI: CLOSE:1 COMMIT:0

I think that if there are no het jobs related to my job ID, nothing should be printed out besides a few rows of data.

Thank you,
Marco Induni
Thanks for that reproducer--I see the same and will look into it. -Chad
It looks like the SQL for the query is not properly constructed when --whole-hetjob=yes is used. This happens in _setup_job_cond_selected_steps(). From 20.11.6:

> 302 if (job_ids) {
> 303         if (job_cond->flags & JOBCOND_FLAG_WHOLE_HETJOB)
> 304                 xstrfmtcat(*extra, "t1.id_job in (%s) || "
> 305                            "(t1.het_job_offset<>%u && "
> 306                            "t1.het_job_id in (select "
> 307                            "t4.het_job_id from \"%s_%s\" as "
> 308                            "t4 where t4.id_job in (%s)))",
> 309                            job_ids, NO_VAL, cluster_name,
> 310                            job_table, job_ids);

Doing an abbreviated (limited output fields and only 5 records displayed) SQL query outside Slurm shows:

>mysql> select t1.id_job, t1.het_job_id, t1.het_job_offset, t1.id_user, t2.user from cluster_job_table as t1 left join cluster_assoc_table as t2 on t1.id_assoc=t2.id_assoc left join cluster_resv_table as t3 on t1.id_resv=t3.id_resv && ((t1.time_start && (t3.time_start < t1.time_start && (t3.time_end >= t1.time_start || t3.time_end = 0))) || (t1.time_start = 0 && ((t3.time_start < t1.time_submit && (t3.time_end >= t1.time_submit || t3.time_end = 0)) || (t3.time_start > t1.time_submit)))) where ((t1.time_submit <= 1620161552) && (t1.time_end >= 0 || t1.time_end = 0)) && (t1.id_job in (55) || (t1.het_job_offset<>4294967294 && t1.het_job_id in (select t4.het_job_id from cluster_job_table as t4 where t4.id_job in (55))) || (t1.id_array_job in (55))) order by id_job, time_submit desc limit 5;
>+--------+------------+----------------+---------+------+
>| id_job | het_job_id | het_job_offset | id_user | user |
>+--------+------------+----------------+---------+------+
>|      2 |          0 |              0 |    1000 | NULL |
>|      3 |          0 |              0 |    1000 | NULL |
>|      3 |          0 |              0 |    1023 | NULL |
>|      4 |          0 |              0 |    1000 | NULL |
>|      4 |          0 |              0 |    1023 | NULL |
>+--------+------------+----------------+---------+------+
>5 rows in set, 15 warnings (0.01 sec)

So we can see that het_job_id is 0, and that's what we need to add a test for to limit the output.
Changing the SQL construction in the code to:

> 302 if (job_ids) {
> 303         if (job_cond->flags & JOBCOND_FLAG_WHOLE_HETJOB)
> 304                 xstrfmtcat(*extra, "t1.id_job in (%s) || "
> 305                            "(t1.het_job_id<>0 && "
> 306                            "t1.het_job_offset<>%u && "

seems to resolve the problem. I have a working patch but need to run some more tests on it and make sure this is the right way to limit the output.
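As an illustration of why the extra het_job_id<>0 test shrinks the result set, here is a minimal sketch using Python's sqlite3 with a toy job table — this is not Slurm's actual schema or code, just the shape of the WHERE clause above; NO_VAL mirrors Slurm's 0xfffffffe:

```python
import sqlite3

NO_VAL = 4294967294  # Slurm's NO_VAL (0xfffffffe)

conn = sqlite3.connect(":memory:")
conn.execute("create table job_table (id_job int, het_job_id int, het_job_offset int)")
conn.executemany("insert into job_table values (?,?,?)", [
    (55,  0,   0),       # buggy legacy row: both fields 0 instead of NO_VAL
    (56,  0,   0),       # unrelated legacy row with the same bad values
    (100, 100, 0),       # genuine het job leader
    (101, 100, 1),       # genuine het job component
    (200, 0,   NO_VAL),  # ordinary non-het job written correctly
])

def whole_hetjob(query_id, fixed):
    # Mirrors the --whole-hetjob=yes condition; `fixed` adds het_job_id<>0.
    extra = "t1.het_job_id<>0 and " if fixed else ""
    sql = ("select t1.id_job from job_table as t1 "
           "where t1.id_job = ? or (" + extra +
           "t1.het_job_offset<>? and t1.het_job_id in "
           "(select t4.het_job_id from job_table as t4 where t4.id_job = ?))")
    return sorted(r[0] for r in conn.execute(sql, (query_id, NO_VAL, query_id)))

# Original construction: querying legacy job 55 drags in every other row
# that also has het_job_id==0 && het_job_offset==0.
print(whole_hetjob(55, fixed=False))   # -> [55, 56]
# With the extra het_job_id<>0 test only the requested job remains.
print(whole_hetjob(55, fixed=True))    # -> [55]
# A real het job is still expanded to all of its components.
print(whole_hetjob(100, fixed=True))   # -> [100, 101]
```

On a real cluster table with ~150k bad rows, the unfixed condition matches all of them, which is why the sacct output was huge and slow.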
Update: The test patch is working. However, based on observations made in comment 4, I'd like to understand why rows are being inserted with het_job_id==0 && het_job_offset==0, since I'd expect one or both to contain NO_VAL. I wonder if there's a problem elsewhere that needs to be addressed rather than just the SQL query, and I'm looking into it.
I believe the problem has already been fixed, but old job data in the table, written before you started running a version with the fix, may be causing the symptom. I'd like to find out what version of Slurm you were running when the problem last appeared and try to pinpoint the fix.

When het_job_id==0, we expect to see het_job_offset==4294967294 (NO_VAL) in your cluster job table for past jobs. When this is not the case (het_job_offset==0) and --whole-hetjob=yes is used with sacct, these jobs are also (incorrectly) listed. On my system running 20.11.6 (proper data being written in the cluster job table), I have some old data that causes results similar to yours but don't have the matching slurmctld logs to see what version of Slurm I was running then.

Would you run the following?

># mysql -D <db_name> -e "select time_submit,id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"

where <db_name> is your StorageLoc value in slurmdbd.conf (or use "slurm_acct_db" if not set) and <cluster_name> is your ClusterName value from your slurm.conf. This will list the most recent job info where het_job_id==0 and het_job_offset==0.

Once you have the time_submit timestamp (we'll call it <t>), translate it to a string:

># date -d @<t>

Then look up the id_job value in a past slurmctld log file (the date string will help pick the right file) to determine what version of Slurm you were running when that job was submitted. Look for a string like this (where <x> is the version) preceding it somewhere earlier in that file or in an earlier one:

>slurmctld version <x> started on cluster
(In reply to Chad Vizino from comment #8)
> Would you run the following?
>
> ># mysql -D <db_name> -e "select time_submit,id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"
>
> Once you have the time_submit timestamp (we'll call that <t>) translate it
> to a string:
>
> ># date -d @<t>

Forget those two steps--this will be easier (run on your dbd node):

># mysql -D <db_name> -e "select from_unixtime(time_submit),id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"

Also, any entries you could provide from slurmctld.log and/or slurmdbd.log about the returned job id may be helpful.
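For readers without a shell handy, the epoch-to-string translation that `date -d @<t>` or MySQL's from_unixtime() performs can be sketched in Python; the timestamp below is illustrative, not a value taken from the ticket's database:

```python
from datetime import datetime, timezone

# Equivalent of `date -d @<t>` for a time_submit epoch value (UTC shown here;
# `date` without -u would print it in the machine's local time zone instead).
t = 1603110428  # illustrative epoch seconds
stamp = datetime.fromtimestamp(t, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
print(stamp)  # -> 2020-10-19 12:27:08 UTC
```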
(In reply to Chad Vizino from comment #9)
> Forget those two steps--this will be easier (run on your dbd node):
>
> ># mysql -D <db_name> -e "select from_unixtime(time_submit),id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"

Hi Chad,
here are the results:

mysql -D slurmdbd_tds -e "select from_unixtime(time_submit),id_job,het_job_id,het_job_offset from dom_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"
+----------------------------+---------+------------+----------------+
| from_unixtime(time_submit) | id_job  | het_job_id | het_job_offset |
+----------------------------+---------+------------+----------------+
| 2020-10-19 14:27:08        | 1185802 |          0 |              0 |
+----------------------------+---------+------------+----------------+

Kind regards,
Marco
(In reply to Marco Induni from comment #10)
> mysql -D slurmdbd_tds -e "select
> from_unixtime(time_submit),id_job,het_job_id,het_job_offset from
> dom_job_table where het_job_id=0 and het_job_offset=0 order by time_submit
> desc limit 1"
> +----------------------------+---------+------------+----------------+
> | 2020-10-19 14:27:08        | 1185802 |          0 |              0 |
> +----------------------------+---------+------------+----------------+

Thanks. Can you provide any more detail about that job from the logs? What version of Slurm were you running then, and have you upgraded since then?

Also, can you run the following so we can see how many jobs were affected?

># mysql -D slurm_acct_db -e "select count(*) from cluster_job_table where het_job_id=0 and het_job_offset=0"
Hi Chad,

> Thanks. Can you provide any more detail about that job from the logs?

Uhm no, unfortunately not.

> What version of Slurm were you running then and have you upgraded since then?

At that time we were probably on 20.02.2; we then updated to 20.02.6, 20.11.2, 20.11.4, and we are now on 20.11.7.

> Also can you run the following so we can see how many jobs were affected?
> # mysql -D slurm_acct_db -e "select count(*) from cluster_job_table where het_job_id=0 and het_job_offset=0"

mysql -D slurmdbd_tds -e "select count(*) from dom_job_table where het_job_id=0 and het_job_offset=0"
+----------+
| count(*) |
+----------+
|   146839 |
+----------+

Kind regards,
Marco
(In reply to Marco Induni from comment #12)
> At that time we probably were at 20.02.2 then we updated to 20.02.6 to
> 20.11.2, 20.11.4 and now we are in 20.11.7
>
> +----------+
> | count(*) |
> +----------+
> |   146839 |
> +----------+

Ok, knowing the probable version helps, thanks. Since your most recent affected job is from late October, I'm fairly sure the problem has already been fixed, but I would still like to know what the fix was and in which version it was introduced so we can propose a mitigation: either by changing the source (updating the query as I mentioned earlier; see comment 4) or by manually updating the db (though I'm not sure we want to do that). I'll focus my efforts on trying to reproduce this on my system with 20.02.2, see if I can figure it out, and then advise a path forward.
*** Ticket 11717 has been marked as a duplicate of this ticket. ***
Created attachment 19947 [details] wip v3
I think that "wip v3" will fix the problem. The patch is essentially the first version I made (see comment 4 and comment 6), which adds an extra check to the SQL query so that records with het_job_id==0 && het_job_offset<>NO_VAL are excluded (see comment 16); we've seen this combination occur due to a past bug. Would you be willing to try this patch and see if it does the right thing for you?
Hi Chad,

> Would you be willing to try this patch and see if it does the right thing
> for you?

Yes, sure. These are the results of the query with and without --whole-hetjob=yes:

minduni@dom101:~> sacct -D --whole-hetjob=yes -j 1400263 -o JobID,User,NodeList,start,end
       JobID      User        NodeList               Start                 End
------------ --------- --------------- ------------------- -------------------
1400263        minduni        nid00419         4 May 09:59         4 May 09:59
1400263.ext+                  nid00419         4 May 09:59         4 May 09:59
1400263.0                     nid00419         4 May 09:59         4 May 09:59

minduni@dom101:~> sacct -j 1400263 -o JobID,User,NodeList,start,end
       JobID      User        NodeList               Start                 End
------------ --------- --------------- ------------------- -------------------
1400263        minduni        nid00419         4 May 09:59         4 May 09:59
1400263.ext+                  nid00419         4 May 09:59         4 May 09:59
1400263.0                     nid00419         4 May 09:59         4 May 09:59

I suspect that --whole-hetjob=yes is the default, right? Because I did this on a hetjob:

minduni@dom101:~> sacct -D --whole-hetjob=yes -j 1447529 -o JobID,User,NodeList,start,end
       JobID      User        NodeList               Start                 End
------------ --------- --------------- ------------------- -------------------
1447529+0      minduni        nid00405        10 Jun 10:11        10 Jun 10:12
1447529+0.e+                  nid00405        10 Jun 10:11        10 Jun 10:12
1447529+0.1                   nid00405        10 Jun 10:11        10 Jun 10:11
1447529+1      minduni        nid00406        10 Jun 10:11        10 Jun 10:12
1447529+1.e+                  nid00406        10 Jun 10:11        10 Jun 10:12
1447529+1.0                   nid00406        10 Jun 10:11        10 Jun 10:11

minduni@dom101:~> sacct -j 1447529 -o JobID,User,NodeList,start,end
       JobID      User        NodeList               Start                 End
------------ --------- --------------- ------------------- -------------------
1447529+0      minduni        nid00405        10 Jun 10:11        10 Jun 10:12
1447529+0.e+                  nid00405        10 Jun 10:11        10 Jun 10:12
1447529+0.1                   nid00405        10 Jun 10:11        10 Jun 10:11
1447529+1      minduni        nid00406        10 Jun 10:11        10 Jun 10:12
1447529+1.e+                  nid00406        10 Jun 10:11        10 Jun 10:12
1447529+1.0                   nid00406        10 Jun 10:11        10 Jun 10:11

minduni@dom101:~> sacct -D --whole-hetjob=no -j 1447529 -o JobID,User,NodeList,start,end
       JobID      User        NodeList               Start                 End
------------ --------- --------------- ------------------- -------------------
1447529+0      minduni        nid00405        10 Jun 10:11        10 Jun 10:12
1447529+0.e+                  nid00405        10 Jun 10:11        10 Jun 10:12
1447529+0.1                   nid00405        10 Jun 10:11        10 Jun 10:11

Kind regards,
Marco
Looks perfect. After applying the patch, sacct and slurmrestd both return results as expected (i.e. only the job_id matching the query).
Hi Chad,

From which version of Slurm will this patch be applied?

Kind regards,
Marco
(In reply to Marco Induni from comment #22)
> From which version of Slurm will this patch be applied?

I am not sure yet. It is waiting for internal review and approval before being checked in. I think 20.11.8 is due out soon, but I don't think this patch will make it into that release, so I would guess 20.11.9 (if there is going to be one) or 21.08. I'll supply more information once the review process has been completed.
One more question about this: Did you happen to load archived job data in the past? We're wondering if the problem of het_job_id==0 and het_job_offset==0 data in the cluster job table might have been introduced by an archive/load.
Hi Chad,

> One more question about this: Did you happen to load archived job data in
> the past? We're wondering if the problem of het_job_id==0 and
> het_job_offset==0 data in the cluster job table might have been introduced
> by an archive/load.

We started to use the archive option, but only to purge data; as far as I know, no data were loaded back.

Kind regards,
Marco
*** Ticket 12507 has been marked as a duplicate of this ticket. ***
It's taken some doing, but we have a fix committed for this issue that will be in 21.08.2. The problem exposed a few other issues that we are wrapping up, so we'll keep this ticket open a bit longer until they have been addressed.
Think we're all set so closing.