Ticket 11717

Summary:	slurmrestd returning truncated list of jobs when querying for non-het-jobs
Product:	Slurm	Reporter:	frank.schluenzen
Component:	slurmrestd	Assignee:	Oriol Vilarrubi <jvilarru>
Status:	RESOLVED DUPLICATE	QA Contact:
Severity:	3 - Medium Impact
Priority:	---	CC:	albert.gil, chad
Version:	20.11.7
Hardware:	Linux
OS:	Linux
Site:	DESY	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---
Attachments:	sacct-7737772.out restapi-7737772.out

Description frank.schluenzen 2021-05-27 07:35:44 MDT

hi,

this is obviously a duplicate of #11330 but I wasn't sure it would have been appropriate to drop a comment there. We see in any case a very similar issue.

To give an example I selected two specific job_ids for testing:

job_id=7737772 was a "homogeneous" job
job_id=7740309+0,1 was a "heterogeneous" job

# query homogeneous job with sacct
sacct -D --whole-hetjob=yes -j 7737772 -o ...

returns lots (1736) of entries including 7737772. It reports more information than expected, but contains the queried one.

# query heterogeneous job with sacct
sacct -D --whole-hetjob=yes -j 7740309 -o ...

returns entries only for 7740309+0, 7740309+1 so exactly matches the query

# query heterogeneous job with curl
curl [...] -X GET http://../slurmdb/v0.0.36/job/7740309

returns information for job_id 7740309,7740310 so exactly matches the query

# query homogeneous job with curl
curl [...] -X GET http://../slurmdb/v0.0.36/job/7737772

returns lots of entries, but actually only a small subset of entries reported by sacct. In our case 182 entries are listed by slurmrestd whereas sacct lists 1736 job_id's. jobid's returned by slurmrestd are ranging from 2428246 to 3425951 and there is consequently no information about 7737772.

sacct and slurmrestd largely provide the same information, except that slurmrestd seemingly truncates the list of jobs when querying for "homogeneous" jobs, which makes it impossible to retrieve information about most jobs via rest api. So from my naive perspective it appears to be a slurmrestd rather than a slurmdbd problem.

Cheers, Frank.

Comment 1 Oriol Vilarrubi 2021-06-03 05:46:53 MDT

Hi Frank,

I've tested what you are describing in my setup and I cannot reproduce seeing multiple entries while specifying the -j flag in the homogeneous job, could you please send me the output of the sacct command so that I can understand this better?

Thanks.

Comment 2 frank.schluenzen 2021-06-03 06:48:15 MDT

Created attachment 19780 [details]
sacct-7737772.out

Hi Oriol, 

output from 

sacct -D --whole-hetjob=yes -j 7737772 > sacct-7737772.out 
curl -s H "Content-Type: application/json" -H X-SLURM-USER-NAME:$(whoami) -H X-SLURM-USER-TOKEN:$SLURM_JWT -X GET http://restapi:6820/slurmdb/v0.0.36/job/7737772 > restapi-7737772.out 

are attached (I replaced user-names in the restapi output). 

Cheers, Frank. 

> From: "bugs" <bugs@schedmd.com>
> To: "frank schluenzen" <frank.schluenzen@desy.de>
> Sent: Thursday, 3 June, 2021 13:46:53
> Subject: [Bug 11717] slurmrestd returning truncated list of jobs when querying
> for non-het-jobs

> [ https://bugs.schedmd.com/show_bug.cgi?id=11717#c1 | Comment # 1 ] on [
> https://bugs.schedmd.com/show_bug.cgi?id=11717 | bug 11717 ] from [
> mailto:jvilarru@schedmd.com |  Oriol Vilarrubi ]
> Hi Frank,

> I've tested what you are describing in my setup and I cannot reproduce seeing
> multiple entries while specifying the -j flag in the homogeneous job, could you
> please send me the output of the sacct command so that I can understand this
> better?

> Thanks.

> You are receiving this mail because:

>     * You reported the bug.

Comment 3 frank.schluenzen 2021-06-03 06:48:16 MDT

Created attachment 19781 [details]
restapi-7737772.out

Comment 4 Oriol Vilarrubi 2021-06-03 08:57:15 MDT

Hi Frank,

In order to try to isolate the issue, could you please run the sacct with only the -j flag. This should only print one job.

Also in the restd output the job 7737772 was not even appearing, could you try with another homogeneous job to see if the issue repeats?

Regards.

Comment 6 frank.schluenzen 2021-06-03 09:24:48 MDT

Hi Oriol,

sacct -j 7737772
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
7737772         proc.sh        all         it         80  COMPLETED      0:0 
7737772.bat+      batch                    it         80  COMPLETED      0:0 

of course behaves as expected. 

The issue with restd output is always the same. Actually the restd output is always identical for homogeneous jobs:

curl ... job/7737772 > 7737772
curl ... job/6344343 > 6344343
diff 7737772 6344343 # empty

sacct likewise except for the queried job_ids: 

sacct -D --whole-hetjob=yes -j 7737772 > 7737772.s
sacct -D --whole-hetjob=yes -j 6344343 > 6344343.s

diff 7737772.s 6344343.s 
2276a2277,2278
> 6344343      dark_CALL+      exfel      exfel         72  COMPLETED      0:0 
> 6344343.bat+      batch                 exfel         72  COMPLETED      0:0 
2511,2512d2512
< 7737772         proc.sh        all         it         80  COMPLETED      0:0 
< 7737772.bat+      batch                    it         80  COMPLETED      0:0 

the other 2510 or so lines are identical. 

I also noticed that sacct and restd always report the same (wrong) job_name for all "duplicate entries", e.g. in sacct-7737772.out:

7595235      allocation                  cfel         96   REQUEUED      0:0 
7595235.bat+      batch                  cfel         96  CANCELLED          

but sacct -j 7595235 
7595235      spawner-j+       jhub       cfel          1 CANCELLED+      0:0 

Cheers, Frank.

Comment 7 Oriol Vilarrubi 2021-06-03 10:03:48 MDT

Hi Frank,

That is really weird, could you please attach your slurm.conf and the slurmdbd.conf (ensure to remove all the passwords in them). Also please ensure that you do not have any environment variables set while executing the commands.

A portion of the slurmrestd and the slurmdbd logs while executing the commands will also be useful.

Greetings.

Comment 8 Oriol Vilarrubi 2021-06-09 10:00:09 MDT

Hi Frank,

I've been talking with my colleagues and most probably the issue you are experiencing with this bug is the same one as the one in bug 11516, to summarize it, the issue is that before version 20.11.6 the heterogenous jobs were saved in the DB in a format that makes the SQL query made against the DB fail when specifying the flag --whole-hetjob=yes as it thinks that all non-het jobs are the ones you are asking for. This applies both to sacct and slurmrestd. 

I will mark this bug as duplicate of the other.

Greetings.

*** This ticket has been marked as a duplicate of ticket 11516 ***

Comment 9 Chad Vizino 2021-07-01 10:01:25 MDT

We're trying to gather more data on this bug and about what might have caused it.

Did you happen to load archived job data in the past? We're wondering if the problem of het_job_id==0 and het_job_offset==0 data in the cluster job table might have been introduced by an archive/load from the past.

Also could you run these queries on your db node and report what you see?

># mysql -D <db_name> -e "select from_unixtime(time_submit),id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"
># mysql -D slurm_acct_db -e "select count(*) from cluster_job_table where het_job_id=0 and het_job_offset=0"

where <db_name> is your StorageLoc value in slurmdbd.conf (or use "slurm_acct_db" if not set) and where <cluster_name> is your ClusterName value from your slurm.conf.

Comment 10 frank.schluenzen 2021-07-02 06:38:59 MDT

we archive data but most certainly never loaded archived data.

># mysql -D <db_name> -e "select from_unixtime(time_submit),id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"

+----------------------------+---------+------------+----------------+
| from_unixtime(time_submit) | id_job  | het_job_id | het_job_offset |
+----------------------------+---------+------------+----------------+
| 2021-05-19 12:39:06        | 7595235 |          0 |              0 |
+----------------------------+---------+------------+----------------+


> # mysql -D slurm_acct_db -e "select count(*) from cluster_job_table where het_job_id=0 and het_job_offset=0"
> 

| count(*) |
+----------+
|     1665 |
+----------+

Comment 11 Chad Vizino 2021-07-02 14:38:43 MDT

(In reply to frank.schluenzen from comment #10)
> we archive data but most certainly never loaded archived data.
> 
> ># mysql -D <db_name> -e "select from_unixtime(time_submit),id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"
> 
> +----------------------------+---------+------------+----------------+
> | from_unixtime(time_submit) | id_job  | het_job_id | het_job_offset |
> +----------------------------+---------+------------+----------------+
> | 2021-05-19 12:39:06        | 7595235 |          0 |              0 |
> +----------------------------+---------+------------+----------------+
> 
> 
> > # mysql -D slurm_acct_db -e "select count(*) from cluster_job_table where het_job_id=0 and het_job_offset=0"
> > 
> 
> | count(*) |
> +----------+
> |     1665 |
> +----------+

Thank you. Do you know what version of Slurm you were running on that date (2021-05-19)? Also, is there any detail you can provide about job 7595235 from the logs (slurmctld and slurmd and slurmdbd)?

Comment 12 frank.schluenzen 2021-07-04 06:06:12 MDT

slurm version was 19.05.5 (installed Jan 08).

7595235 was nothing special (we have lots of this kind). It was cancelled before having started. 

job_completions (replaced user(uid)):

JobId=7595235 UserId=user(12345) GroupId=cfel(3512) Name=spawner-jupyterhub JobState=PENDING Partition=jhub TimeLimit=10080 StartTime=2021-05-19T12:39:07 EndTime=2021-05-19T12:39:07 NodeList=max-wne003 NodeCnt=1 ProcCnt=96 WorkDir=/user ReservationName= Tres=cpu=1,node=1,billing=1 Account=cfel QOS=cfel WcKey= Cluster=maxwell SubmitTime=2021-05-19T12:39:06 EligibleTime=2021-05-19T12:39:06 DerivedExitCode=0:0 ExitCode=0:0 

JobId=7595235 UserId=user(12345) GroupId=cfel(3512) Name=spawner-jupyterhub JobState=CANCELLED Partition=jhub TimeLimit=10080 StartTime=2021-05-19T12:39:40 EndTime=2021-05-19T12:39:40 NodeList=(null) NodeCnt=0 ProcCnt=0 WorkDir=/home/user ReservationName= Tres=cpu=1,node=1,billing=1 Account=cfel QOS=cfel WcKey= Cluster=maxwell SubmitTime=2021-05-19T12:39:08 EligibleTime=2021-05-19T12:41:09 DerivedExitCode=0:0 ExitCode=0:0 

sacct -j 7595235 --format "jobid,partition,start,end,elapsed,nodelist,state"
       JobID  Partition               Start                 End    Elapsed        NodeList      State 
------------ ---------- ------------------- ------------------- ---------- --------------- ---------- 
7595235            jhub 2021-05-19T12:39:40 2021-05-19T12:39:40   00:00:00   None assigned CANCELLED+