Dear support,

I ran a small test job: an salloc for 1 node, then a simple srun to query the host name. Afterwards I requested some data via sacct:

minduni@dom101:~> sacct -j 1400263 -o JobID,User,NodeList,start,end
       JobID      User        NodeList               Start                 End
------------ --------- --------------- ------------------- -------------------
1400263        minduni        nid00419            09:59:11            09:59:15
1400263.ext+                  nid00419            09:59:11            09:59:15
1400263.0                     nid00419            09:59:14            09:59:15

This was fine. But if I add the parameter "-D --whole-hetjob=yes" (-D is not relevant), I get a huge list of jobs that aren't related to my query, and the command takes minutes to complete.

Using the DBG flags I captured the queries used in both cases:

### First one: sacct -j 1400263 -o JobID,User,NodeList,start,end

[2021-05-04T10:03:38.053] debug2: accounting_storage/as_mysql: acct_storage_p_get_connection: acct_storage_p_get_connection: request new connection 1
[2021-05-04T10:03:38.053] debug2: Attempting to connect to fojorina06.cscs.ch:3306
[2021-05-04T10:03:38.094] debug2: DBD_GET_JOBS_COND: called
[2021-05-04T10:03:38.094] accounting_storage/as_mysql: _cluster_get_jobs: DB_JOB: 13(as_mysql_jobacct_process.c:582) query
select t1.account, t1.admin_comment, t1.array_max_tasks, t1.array_task_str, t1.constraints, t1.cpus_req,
  t1.derived_ec, t1.derived_es, t1.exit_code, t1.flags, t1.id_array_job, t1.id_array_task, t1.id_assoc,
  t1.id_block, t1.id_group, t1.id_job, t1.het_job_id, t1.het_job_offset, t1.id_qos, t1.id_resv, t3.resv_name,
  t1.id_user, t1.id_wckey, t1.job_db_inx, t1.job_name, t1.kill_requid, t1.mem_req, t1.node_inx, t1.nodelist,
  t1.nodes_alloc, t1.partition, t1.priority, t1.state, t1.state_reason_prev, t1.system_comment, t1.time_eligible,
  t1.time_end, t1.time_start, t1.time_submit, t1.time_suspended, t1.timelimit, t1.track_steps, t1.wckey,
  t1.gres_used, t1.tres_alloc, t1.tres_req, t1.work_dir, t1.mcs_label, t2.acct, t2.lft, t2.user
from "dom_job_table" as t1
left join "dom_assoc_table" as t2 on t1.id_assoc=t2.id_assoc
left join "dom_resv_table" as t3 on t1.id_resv=t3.id_resv
  && ((t1.time_start && (t3.time_start < t1.time_start && (t3.time_end >= t1.time_start || t3.time_end = 0)))
      || (t1.time_start = 0 && ((t3.time_start < t1.time_submit && (t3.time_end >= t1.time_submit || t3.time_end = 0))
          || (t3.time_start > t1.time_submit))))
where ((t1.time_submit <= 1620115418) && (t1.time_end >= 0 || t1.time_end = 0))
  && (t1.id_job in (1400263) || t1.het_job_id in (1400263) || (t1.id_array_job in (1400263)))
  && (t1.state != 524288)
order by id_job, time_submit desc

### Second one: sacct -D --whole-hetjob=yes -j 1400263 -o JobID,User,NodeList,start,end

[2021-05-04T10:02:42.174] accounting_storage/as_mysql: _cluster_get_jobs: DB_JOB: 10(as_mysql_jobacct_process.c:582) query
select t1.account, t1.admin_comment, t1.array_max_tasks, t1.array_task_str, t1.constraints, t1.cpus_req,
  t1.derived_ec, t1.derived_es, t1.exit_code, t1.flags, t1.id_array_job, t1.id_array_task, t1.id_assoc,
  t1.id_block, t1.id_group, t1.id_job, t1.het_job_id, t1.het_job_offset, t1.id_qos, t1.id_resv, t3.resv_name,
  t1.id_user, t1.id_wckey, t1.job_db_inx, t1.job_name, t1.kill_requid, t1.mem_req, t1.node_inx, t1.nodelist,
  t1.nodes_alloc, t1.partition, t1.priority, t1.state, t1.state_reason_prev, t1.system_comment, t1.time_eligible,
  t1.time_end, t1.time_start, t1.time_submit, t1.time_suspended, t1.timelimit, t1.track_steps, t1.wckey,
  t1.gres_used, t1.tres_alloc, t1.tres_req, t1.work_dir, t1.mcs_label, t2.acct, t2.lft, t2.user
from "dom_job_table" as t1
left join "dom_assoc_table" as t2 on t1.id_assoc=t2.id_assoc
left join "dom_resv_table" as t3 on t1.id_resv=t3.id_resv
  && ((t1.time_start && (t3.time_start < t1.time_start && (t3.time_end >= t1.time_start || t3.time_end = 0)))
      || (t1.time_start = 0 && ((t3.time_start < t1.time_submit && (t3.time_end >= t1.time_submit || t3.time_end = 0))
          || (t3.time_start > t1.time_submit))))
where ((t1.time_submit <= 1620115362) && (t1.time_end >= 0 || t1.time_end = 0))
  && (t1.id_job in (1400263)
      || (t1.het_job_offset<>4294967294 && t1.het_job_id in
          (select t4.het_job_id from "dom_job_table" as t4 where t4.id_job in (1400263)))
      || (t1.id_array_job in (1400263)))
order by id_job, time_submit desc

[2021-05-04T10:03:20.613] debug2: DBD_CLUSTER_TRES: called for pilatus(1=32768,2=43778048,3=0,4=128,5=32768,6=0,7=0,8=0)
[2021-05-04T10:03:30.756] debug2: DBD_FINI: CLOSE:1 COMMIT:0

I think that if there are no het jobs related to my job ID, nothing should be printed out besides a few rows of data.

Thank you,
Marco Induni
Thanks for that reproducer--I see the same and will look into it. -Chad
It looks like the SQL for the query is not properly constructed when --whole-hetjob=yes is used. This happens in _setup_job_cond_selected_steps(). From 20.11.6:

> 302 if (job_ids) {
> 303         if (job_cond->flags & JOBCOND_FLAG_WHOLE_HETJOB)
> 304                 xstrfmtcat(*extra, "t1.id_job in (%s) || "
> 305                            "(t1.het_job_offset<>%u && "
> 306                            "t1.het_job_id in (select "
> 307                            "t4.het_job_id from \"%s_%s\" as "
> 308                            "t4 where t4.id_job in (%s)))",
> 309                            job_ids, NO_VAL, cluster_name,
> 310                            job_table, job_ids);

Doing an abbreviated (limited output fields and only 5 records displayed) SQL query outside Slurm shows:

>mysql> select t1.id_job, t1.het_job_id, t1.het_job_offset, t1.id_user, t2.user from cluster_job_table as t1 left join cluster_assoc_table as t2 on t1.id_assoc=t2.id_assoc left join cluster_resv_table as t3 on t1.id_resv=t3.id_resv && ((t1.time_start && (t3.time_start < t1.time_start && (t3.time_end >= t1.time_start || t3.time_end = 0))) || (t1.time_start = 0 && ((t3.time_start < t1.time_submit && (t3.time_end >= t1.time_submit || t3.time_end = 0)) || (t3.time_start > t1.time_submit)))) where ((t1.time_submit <= 1620161552) && (t1.time_end >= 0 || t1.time_end = 0)) && (t1.id_job in (55) || (t1.het_job_offset<>4294967294 && t1.het_job_id in (select t4.het_job_id from cluster_job_table as t4 where t4.id_job in (55))) || (t1.id_array_job in (55))) order by id_job, time_submit desc limit 5;
>+--------+------------+----------------+---------+------+
>| id_job | het_job_id | het_job_offset | id_user | user |
>+--------+------------+----------------+---------+------+
>|      2 |          0 |              0 |    1000 | NULL |
>|      3 |          0 |              0 |    1000 | NULL |
>|      3 |          0 |              0 |    1023 | NULL |
>|      4 |          0 |              0 |    1000 | NULL |
>|      4 |          0 |              0 |    1023 | NULL |
>+--------+------------+----------------+---------+------+
>5 rows in set, 15 warnings (0.01 sec)

So we can see that het_job_id is 0, and that's what we need to add a test for to limit the output.
Changing the SQL construction in the code to:

> 302 if (job_ids) {
> 303         if (job_cond->flags & JOBCOND_FLAG_WHOLE_HETJOB)
> 304                 xstrfmtcat(*extra, "t1.id_job in (%s) || "
> 305                            "(t1.het_job_id<>0 && "
> 306                            "t1.het_job_offset<>%u && "

seems to resolve the problem. I have a working patch but need to run some more tests on it and make sure this is the right way to limit the output.
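As an illustration of why the extra het_job_id<>0 test shrinks the result set, here is a minimal sketch using Python's sqlite3 with a toy job table — this is not Slurm's actual schema or code, just the shape of the WHERE clause above; NO_VAL mirrors Slurm's 0xfffffffe:

```python
import sqlite3

NO_VAL = 4294967294  # Slurm's NO_VAL (0xfffffffe)

conn = sqlite3.connect(":memory:")
conn.execute("create table job_table (id_job int, het_job_id int, het_job_offset int)")
conn.executemany("insert into job_table values (?,?,?)", [
    (55,  0,   0),       # buggy legacy row: both fields 0 instead of NO_VAL
    (56,  0,   0),       # unrelated legacy row with the same bad values
    (100, 100, 0),       # genuine het job leader
    (101, 100, 1),       # genuine het job component
    (200, 0,   NO_VAL),  # ordinary non-het job written correctly
])

def whole_hetjob(query_id, fixed):
    # Mirrors the --whole-hetjob=yes condition; `fixed` adds het_job_id<>0.
    extra = "t1.het_job_id<>0 and " if fixed else ""
    sql = ("select t1.id_job from job_table as t1 "
           "where t1.id_job = ? or (" + extra +
           "t1.het_job_offset<>? and t1.het_job_id in "
           "(select t4.het_job_id from job_table as t4 where t4.id_job = ?))")
    return sorted(r[0] for r in conn.execute(sql, (query_id, NO_VAL, query_id)))

# Original construction: querying legacy job 55 drags in every other row
# that also has het_job_id==0 && het_job_offset==0.
print(whole_hetjob(55, fixed=False))   # -> [55, 56]
# With the extra het_job_id<>0 test only the requested job remains.
print(whole_hetjob(55, fixed=True))    # -> [55]
# A real het job is still expanded to all of its components.
print(whole_hetjob(100, fixed=True))   # -> [100, 101]
```

On a real cluster table with ~150k bad rows, the unfixed condition matches all of them, which is why the sacct output was huge and slow.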
Update: The test patch is working. However, based on observations made in comment 4, I'd like to understand why rows are being inserted with het_job_id==0 && het_job_offset==0, since I'd expect one or both to contain NO_VAL. I wonder if there's a problem elsewhere that needs to be addressed rather than just the SQL query, and I'm looking into it.
I believe the problem has already been fixed, but old job data in the table, written before you started running a version with the fix, may be causing the symptom. I'd like to find out what version of Slurm you were running when the problem last appeared and try to pinpoint the fix.

When het_job_id==0, we expect to see het_job_offset==4294967294 (NO_VAL) in your cluster job table for past jobs. When this is not the case (het_job_offset==0) and --whole-hetjob=yes is used with sacct, these jobs are also (incorrectly) listed. On my system running 20.11.6 (proper data being written in the cluster job table), I have some old data that causes results similar to yours but don't have the matching slurmctld logs to see what version of Slurm I was running then.

Would you run the following?

># mysql -D <db_name> -e "select time_submit,id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"

where <db_name> is your StorageLoc value in slurmdbd.conf (or use "slurm_acct_db" if not set) and <cluster_name> is your ClusterName value from your slurm.conf. This will list the most recent job info where het_job_id==0 and het_job_offset==0.

Once you have the time_submit timestamp (we'll call it <t>), translate it to a string:

># date -d @<t>

Then look up the id_job value in a past slurmctld log file (the date string will help pick the right file) to determine what version of Slurm you were running when that job was submitted. Look for a string like this (where <x> is the version) preceding it somewhere earlier in that file or in an earlier one:

>slurmctld version <x> started on cluster
(In reply to Chad Vizino from comment #8)
> Would you run the following?
>
> ># mysql -D <db_name> -e "select time_submit,id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"
>
> Once you have the time_submit timestamp (we'll call that <t>) translate it
> to a string:
>
> ># date -d @<t>

Forget those two steps--this will be easier (run on your dbd node):

># mysql -D <db_name> -e "select from_unixtime(time_submit),id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"

Also, any entries you could provide from slurmctld.log and/or slurmdbd.log about the returned job id may be helpful.
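For readers without a shell handy, the epoch-to-string translation that `date -d @<t>` or MySQL's from_unixtime() performs can be sketched in Python; the timestamp below is illustrative, not a value taken from the ticket's database:

```python
from datetime import datetime, timezone

# Equivalent of `date -d @<t>` for a time_submit epoch value (UTC shown here;
# `date` without -u would print it in the machine's local time zone instead).
t = 1603110428  # illustrative epoch seconds
stamp = datetime.fromtimestamp(t, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
print(stamp)  # -> 2020-10-19 12:27:08 UTC
```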
(In reply to Chad Vizino from comment #9)
> Forget those two steps--this will be easier (run on your dbd node):
>
> ># mysql -D <db_name> -e "select from_unixtime(time_submit),id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"

Hi Chad,
here are the results:

mysql -D slurmdbd_tds -e "select from_unixtime(time_submit),id_job,het_job_id,het_job_offset from dom_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"
+----------------------------+---------+------------+----------------+
| from_unixtime(time_submit) | id_job  | het_job_id | het_job_offset |
+----------------------------+---------+------------+----------------+
| 2020-10-19 14:27:08        | 1185802 |          0 |              0 |
+----------------------------+---------+------------+----------------+

Kind regards,
Marco
(In reply to Marco Induni from comment #10)
> mysql -D slurmdbd_tds -e "select
> from_unixtime(time_submit),id_job,het_job_id,het_job_offset from
> dom_job_table where het_job_id=0 and het_job_offset=0 order by time_submit
> desc limit 1"
> +----------------------------+---------+------------+----------------+
> | 2020-10-19 14:27:08        | 1185802 |          0 |              0 |
> +----------------------------+---------+------------+----------------+

Thanks. Can you provide any more detail about that job from the logs? What version of Slurm were you running then, and have you upgraded since then?

Also, can you run the following so we can see how many jobs were affected?

># mysql -D slurm_acct_db -e "select count(*) from cluster_job_table where het_job_id=0 and het_job_offset=0"
Hi Chad,

> Thanks. Can you provide any more detail about that job from the logs?

Uhm no, unfortunately not.

> What version of Slurm were you running then and have you upgraded since then?

At that time we were probably on 20.02.2; we then updated to 20.02.6, 20.11.2, 20.11.4, and we are now on 20.11.7.

> Also can you run the following so we can see how many jobs were affected?
> # mysql -D slurm_acct_db -e "select count(*) from cluster_job_table where het_job_id=0 and het_job_offset=0"

mysql -D slurmdbd_tds -e "select count(*) from dom_job_table where het_job_id=0 and het_job_offset=0"
+----------+
| count(*) |
+----------+
|   146839 |
+----------+

Kind regards,
Marco
(In reply to Marco Induni from comment #12)
> At that time we probably were at 20.02.2 then we updated to 20.02.6 to
> 20.11.2, 20.11.4 and now we are in 20.11.7
>
> +----------+
> | count(*) |
> +----------+
> |   146839 |
> +----------+

Ok, knowing the probable version helps, thanks. Since your most recent affected job is from late October, I'm fairly sure the problem has already been fixed, but I would still like to know what the fix was and in which version it was introduced so we can propose a mitigation: either by changing the source (updating the query as I mentioned earlier; see comment 4) or by manually updating the db (though I'm not sure we want to do that). I'll focus my efforts on trying to reproduce this on my system with 20.02.2, see if I can figure it out, and then advise a path forward.
*** Ticket 11717 has been marked as a duplicate of this ticket. ***
Created attachment 19947 [details] wip v3
I think that "wip v3" will fix the problem. The patch is essentially the first version I made (see comment 4 and comment 6), which adds an extra check to the SQL query so that records with het_job_id==0 && het_job_offset<>NO_VAL are excluded (see comment 16); we've seen this combination occur due to a past bug. Would you be willing to try this patch and see if it does the right thing for you?
Hi Chad,

> Would you be willing to try this patch and see if it does the right thing
> for you?

Yes, sure. These are the results of the query with and without --whole-hetjob=yes:

minduni@dom101:~> sacct -D --whole-hetjob=yes -j 1400263 -o JobID,User,NodeList,start,end
       JobID      User        NodeList               Start                 End
------------ --------- --------------- ------------------- -------------------
1400263        minduni        nid00419         4 May 09:59         4 May 09:59
1400263.ext+                  nid00419         4 May 09:59         4 May 09:59
1400263.0                     nid00419         4 May 09:59         4 May 09:59

minduni@dom101:~> sacct -j 1400263 -o JobID,User,NodeList,start,end
       JobID      User        NodeList               Start                 End
------------ --------- --------------- ------------------- -------------------
1400263        minduni        nid00419         4 May 09:59         4 May 09:59
1400263.ext+                  nid00419         4 May 09:59         4 May 09:59
1400263.0                     nid00419         4 May 09:59         4 May 09:59

I suspect that --whole-hetjob=yes is the default, right? Because I did this on a hetjob:

minduni@dom101:~> sacct -D --whole-hetjob=yes -j 1447529 -o JobID,User,NodeList,start,end
       JobID      User        NodeList               Start                 End
------------ --------- --------------- ------------------- -------------------
1447529+0      minduni        nid00405        10 Jun 10:11        10 Jun 10:12
1447529+0.e+                  nid00405        10 Jun 10:11        10 Jun 10:12
1447529+0.1                   nid00405        10 Jun 10:11        10 Jun 10:11
1447529+1      minduni        nid00406        10 Jun 10:11        10 Jun 10:12
1447529+1.e+                  nid00406        10 Jun 10:11        10 Jun 10:12
1447529+1.0                   nid00406        10 Jun 10:11        10 Jun 10:11

minduni@dom101:~> sacct -j 1447529 -o JobID,User,NodeList,start,end
       JobID      User        NodeList               Start                 End
------------ --------- --------------- ------------------- -------------------
1447529+0      minduni        nid00405        10 Jun 10:11        10 Jun 10:12
1447529+0.e+                  nid00405        10 Jun 10:11        10 Jun 10:12
1447529+0.1                   nid00405        10 Jun 10:11        10 Jun 10:11
1447529+1      minduni        nid00406        10 Jun 10:11        10 Jun 10:12
1447529+1.e+                  nid00406        10 Jun 10:11        10 Jun 10:12
1447529+1.0                   nid00406        10 Jun 10:11        10 Jun 10:11

minduni@dom101:~> sacct -D --whole-hetjob=no -j 1447529 -o JobID,User,NodeList,start,end
       JobID      User        NodeList               Start                 End
------------ --------- --------------- ------------------- -------------------
1447529+0      minduni        nid00405        10 Jun 10:11        10 Jun 10:12
1447529+0.e+                  nid00405        10 Jun 10:11        10 Jun 10:12
1447529+0.1                   nid00405        10 Jun 10:11        10 Jun 10:11

Kind regards,
Marco
Looks perfect. After applying the patch, sacct and slurmrestd both return results as expected (i.e. only the job_id matching the query).
Hi Chad,

From which version of Slurm will this patch be applied?

Kind regards,
Marco
(In reply to Marco Induni from comment #22)
> From which version of Slurm will this patch be applied?

I am not sure yet. It is waiting for internal review and approval before being checked in. I think 20.11.8 is due out soon, but I don't think this patch will make it into that release, so I would guess 20.11.9 (if there is going to be one) or 21.08. I'll supply more information once the review process has been completed.
One more question about this: Did you happen to load archived job data in the past? We're wondering if the problem of het_job_id==0 and het_job_offset==0 data in the cluster job table might have been introduced by an archive/load.
Hi Chad,

> One more question about this: Did you happen to load archived job data in
> the past? We're wondering if the problem of het_job_id==0 and
> het_job_offset==0 data in the cluster job table might have been introduced
> by an archive/load.

We started to use the archive option, but only to purge data; as far as I know, no data were loaded back.

Kind regards,
Marco
*** Ticket 12507 has been marked as a duplicate of this ticket. ***
It's taken some doing, but we have a fix committed for this issue that will be in 21.08.2. The problem exposed a few other issues that we are wrapping up, so we'll keep this ticket open a bit longer until they have been addressed.
Think we're all set so closing.