| Summary: | slurmrestd returning truncated list of jobs when querying for non-het-jobs | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | frank.schluenzen |
| Component: | slurmrestd | Assignee: | Oriol Vilarrubi <jvilarru> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | albert.gil, chad |
| Version: | 20.11.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | DESY | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | sacct-7737772.out, restapi-7737772.out | | |
Description

frank.schluenzen 2021-05-27 07:35:44 MDT

Hi Frank,

I've tested what you are describing in my setup, and I cannot reproduce seeing multiple entries while specifying the -j flag for the homogeneous job. Could you please send me the output of the sacct command so that I can understand this better?

Thanks.

Created attachment 19780 [details]
sacct-7737772.out

Hi Oriol,

the outputs from

sacct -D --whole-hetjob=yes -j 7737772 > sacct-7737772.out
curl -s -H "Content-Type: application/json" -H X-SLURM-USER-NAME:$(whoami) -H X-SLURM-USER-TOKEN:$SLURM_JWT -X GET http://restapi:6820/slurmdb/v0.0.36/job/7737772 > restapi-7737772.out

are attached (I replaced user names in the restapi output).

Cheers, Frank.

> From: "bugs" <bugs@schedmd.com>
> To: "frank schluenzen" <frank.schluenzen@desy.de>
> Sent: Thursday, 3 June, 2021 13:46:53
> Subject: [Bug 11717] slurmrestd returning truncated list of jobs when querying for non-het-jobs
> [ https://bugs.schedmd.com/show_bug.cgi?id=11717#c1 | Comment # 1 ] on [ https://bugs.schedmd.com/show_bug.cgi?id=11717 | bug 11717 ] from [ mailto:jvilarru@schedmd.com | Oriol Vilarrubi ]

Created attachment 19781 [details]
restapi-7737772.out
Hi Frank,

In order to try to isolate the issue, could you please run sacct with only the -j flag? This should print only one job. Also, job 7737772 was not even appearing in the restd output; could you try with another homogeneous job to see if the issue repeats?

Regards.

Hi Oriol,
sacct -j 7737772
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
7737772 proc.sh all it 80 COMPLETED 0:0
7737772.bat+ batch it 80 COMPLETED 0:0
of course behaves as expected.
The problem with the restd output is always the same. In fact, the restd output is identical for every homogeneous job:
curl ... job/7737772 > 7737772
curl ... job/6344343 > 6344343
diff 7737772 6344343 # empty
sacct likewise except for the queried job_ids:
sacct -D --whole-hetjob=yes -j 7737772 > 7737772.s
sacct -D --whole-hetjob=yes -j 6344343 > 6344343.s
diff 7737772.s 6344343.s
2276a2277,2278
> 6344343 dark_CALL+ exfel exfel 72 COMPLETED 0:0
> 6344343.bat+ batch exfel 72 COMPLETED 0:0
2511,2512d2512
< 7737772 proc.sh all it 80 COMPLETED 0:0
< 7737772.bat+ batch it 80 COMPLETED 0:0
The other roughly 2510 lines are identical.
I also noticed that sacct and restd always report the same (wrong) job_name for all "duplicate entries", e.g. in sacct-7737772.out:
7595235 allocation cfel 96 REQUEUED 0:0
7595235.bat+ batch cfel 96 CANCELLED
but sacct -j 7595235
7595235 spawner-j+ jhub cfel 1 CANCELLED+ 0:0
Cheers, Frank.
Hi Frank,

That is really weird. Could you please attach your slurm.conf and slurmdbd.conf (make sure to remove all passwords from them)? Also please ensure that you do not have any relevant environment variables set while executing the commands. A portion of the slurmrestd and slurmdbd logs from while the commands were run would also be useful.

Greetings.

Hi Frank,

I've been talking with my colleagues, and the issue you are experiencing here is most probably the same one as in bug 11516. To summarize: before version 20.11.6, heterogeneous jobs were saved in the DB in a format that makes the SQL query against the DB misbehave when the --whole-hetjob=yes flag is specified, because it treats all non-het jobs as part of the het job you are asking for. This applies to both sacct and slurmrestd. I will mark this bug as a duplicate of the other one.

Greetings.

*** This ticket has been marked as a duplicate of ticket 11516 ***

We're trying to gather more data on this bug and about what might have caused it.
Did you happen to load archived job data in the past? We're wondering if the problem of het_job_id==0 and het_job_offset==0 data in the cluster job table might have been introduced by an archive/load from the past.
Also could you run these queries on your db node and report what you see?
# mysql -D <db_name> -e "select from_unixtime(time_submit),id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"
# mysql -D slurm_acct_db -e "select count(*) from cluster_job_table where het_job_id=0 and het_job_offset=0"
where <db_name> is your StorageLoc value in slurmdbd.conf (or use "slurm_acct_db" if not set) and where <cluster_name> is your ClusterName value from your slurm.conf.
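The failure mode described above can be sketched with a toy table. In the affected schema, ordinary (non-het) jobs are stored with het_job_id=0 and het_job_offset=0, so a --whole-hetjob-style lookup that matches on those columns sweeps in every non-het job instead of just the one requested. This is a minimal sketch using sqlite3 with invented job IDs; the real slurmdbd schema and query logic are far more involved:

```python
import sqlite3

# Toy stand-in for <cluster_name>_job_table; the real table has many more
# columns. Assumption: non-het jobs carry het_job_id=0, het_job_offset=0.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE cluster_job_table "
    "(id_job INTEGER, het_job_id INTEGER, het_job_offset INTEGER)"
)
rows = [
    (7737772, 0, 0),        # ordinary (non-het) job
    (6344343, 0, 0),        # another ordinary job
    (7595235, 0, 0),        # ordinary job stored the same way
    (8000001, 8000001, 0),  # het job leader keyed by its own job id
    (8000002, 8000001, 1),  # het job component with a nonzero offset
]
con.executemany("INSERT INTO cluster_job_table VALUES (?, ?, ?)", rows)

# A lookup that falls back to matching het_job_id=0 / het_job_offset=0
# returns every non-het job, not only the single job that was queried:
swept = [r[0] for r in con.execute(
    "SELECT id_job FROM cluster_job_table "
    "WHERE het_job_id=0 AND het_job_offset=0"
)]
print(swept)

# The diagnostic count query from the ticket, run against the toy data:
count = con.execute(
    "SELECT count(*) FROM cluster_job_table "
    "WHERE het_job_id=0 AND het_job_offset=0"
).fetchone()[0]
print(count)
```

On this toy data the first query returns all three ordinary jobs, which mirrors why the restd output was identical for any homogeneous job ID: the WHERE clause never narrows to the requested job.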
We archive data but most certainly never loaded archived data.

# mysql -D <db_name> -e "select from_unixtime(time_submit),id_job,het_job_id,het_job_offset from <cluster_name>_job_table where het_job_id=0 and het_job_offset=0 order by time_submit desc limit 1"
+----------------------------+---------+------------+----------------+
| from_unixtime(time_submit) | id_job  | het_job_id | het_job_offset |
+----------------------------+---------+------------+----------------+
| 2021-05-19 12:39:06        | 7595235 |          0 |              0 |
+----------------------------+---------+------------+----------------+

# mysql -D slurm_acct_db -e "select count(*) from cluster_job_table where het_job_id=0 and het_job_offset=0"
+----------+
| count(*) |
+----------+
|     1665 |
+----------+

(In reply to frank.schluenzen from comment #10)
> we archive data but most certainly never loaded archived data.

Thank you. Do you know what version of Slurm you were running on that date (2021-05-19)? Also, is there any detail you can provide about job 7595235 from the logs (slurmctld, slurmd, and slurmdbd)?

Slurm version was 19.05.5 (installed Jan 08).
7595235 was nothing special (we have lots of this kind). It was cancelled before having started.
job_completions (replaced user(uid)):
JobId=7595235 UserId=user(12345) GroupId=cfel(3512) Name=spawner-jupyterhub JobState=PENDING Partition=jhub TimeLimit=10080 StartTime=2021-05-19T12:39:07 EndTime=2021-05-19T12:39:07 NodeList=max-wne003 NodeCnt=1 ProcCnt=96 WorkDir=/user ReservationName= Tres=cpu=1,node=1,billing=1 Account=cfel QOS=cfel WcKey= Cluster=maxwell SubmitTime=2021-05-19T12:39:06 EligibleTime=2021-05-19T12:39:06 DerivedExitCode=0:0 ExitCode=0:0
JobId=7595235 UserId=user(12345) GroupId=cfel(3512) Name=spawner-jupyterhub JobState=CANCELLED Partition=jhub TimeLimit=10080 StartTime=2021-05-19T12:39:40 EndTime=2021-05-19T12:39:40 NodeList=(null) NodeCnt=0 ProcCnt=0 WorkDir=/home/user ReservationName= Tres=cpu=1,node=1,billing=1 Account=cfel QOS=cfel WcKey= Cluster=maxwell SubmitTime=2021-05-19T12:39:08 EligibleTime=2021-05-19T12:41:09 DerivedExitCode=0:0 ExitCode=0:0
sacct -j 7595235 --format "jobid,partition,start,end,elapsed,nodelist,state"
JobID Partition Start End Elapsed NodeList State
------------ ---------- ------------------- ------------------- ---------- --------------- ----------
7595235 jhub 2021-05-19T12:39:40 2021-05-19T12:39:40 00:00:00 None assigned CANCELLED+