Ticket 3038

Summary: Bug in squeue for job arrays with a non-1 index step
Product: Slurm
Component: User Commands
Reporter: Stuart Midgley <stuartm>
Assignee: Dominik Bartkiewicz <bart>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
CC: bart, paull, phils
Version: 14.11.10
Hardware: Linux
OS: Linux
Site: DownUnder GeoSolutions
Version Fixed: 16.05.1

Description Stuart Midgley 2016-09-01 07:45:24 MDT
Evening

I have built a nice little CGI script that helps users monitor their jobs. We have noticed that when an array has a non-1 index step, squeue returns a ridiculous number of array tasks...

> tab=" "                                                                                                                                                                                                       

> squeue -j 21075999
PARTITION   PRIORITY   NAME                     USER ST       TIME TIME_LEFT  NODES NODELIST(REASON JOBID
XXXXXXXXXXX 200        zzzzzzzzzzzzzzzzzzzz yyyyyyyy PD       0:00     21:00      1 (Resources)     21075999_[22655-22999:8]
XXXXXXXXXXX 200        zzzzzzzzzzzzzzzzzzzz yyyyyyyy  R      13:14      7:46      1 node1           21075999_22639
XXXXXXXXXXX 200        zzzzzzzzzzzzzzzzzzzz yyyyyyyy  R       8:00     13:00      1 node2           21075999_22647

> squeue -h -r -a -j 21075999 -o "%t${tab}%F${tab}%K${tab}%P${tab}%Q${tab}%u${tab}%j${tab}%k${tab}%Z${tab}%M" | wc -l
207346


And if I look at the output of that command, there are array tasks with indices larger than 22999!

PD 21075999 22996 XXXXXXXXXXX 200 yyyyyyyy zzzzzzzzzzzzzzzzzzzz aaaa/bbbb/workflow.job aaaa/bbbb/ 0:00                                                                                                                                                                                                        PD 21075999 22997 XXXXXXXXXXX 200 yyyyyyyy zzzzzzzzzzzzzzzzzzzz aaaa/bbbb/workflow.job aaaa/bbbb/ 0:00                                                                                                                                                                                                        PD 21075999 22998 XXXXXXXXXXX 200 yyyyyyyy zzzzzzzzzzzzzzzzzzzz aaaa/bbbb/workflow.job aaaa/bbbb/ 0:00                                                                                                                                                                                                        PD 21075999 22999 XXXXXXXXXXX 200 yyyyyyyy zzzzzzzzzzzzzzzzzzzz aaaa/bbbb/workflow.job aaaa/bbbb/ 0:00                                                                                                                                                                                                        PD 21075999 23000 XXXXXXXXXXX 200 yyyyyyyy zzzzzzzzzzzzzzzzzzzz aaaa/bbbb/workflow.job aaaa/bbbb/ 0:00                                                                                                                                                                                                        PD 21075999 23001 XXXXXXXXXXX 200 yyyyyyyy zzzzzzzzzzzzzzzzzzzz aaaa/bbbb/workflow.job aaaa/bbbb/ 0:00                                                                                                                                                                                                        PD 21075999 23002 XXXXXXXXXXX 200 yyyyyyyy zzzzzzzzzzzzzzzzzzzz aaaa/bbbb/workflow.job aaaa/bbbb/ 0:00                                      



(sorry for anonymising it)
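For cross-checking the expected task list: the pending range 21075999_[22655-22999:8] uses Slurm's START-END:STEP array syntax. Taken at face value, `squeue -r` should therefore report only indices 22655, 22663, and so on up to 22999 (44 pending tasks), and of the indices 22996 to 23002 listed above, only 22999 belongs to the range. The bash function below is a minimal sketch; `expand_array_range` is a hypothetical helper, not a Slurm tool. It assumes exactly one START-END[:STEP] range and does not handle comma-separated lists or `%` throttle limits.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): expand one Slurm array range
# expression of the form START-END[:STEP] into task IDs, one per line.
# Assumes a single range; comma-separated lists and "%N" limits are not handled.
expand_array_range() {
    local expr=$1 range step start end
    range=${expr%%:*}            # strip ":STEP", e.g. "22655-22999"
    if [[ $expr == *:* ]]; then
        step=${expr#*:}          # text after the colon, e.g. "8"
    else
        step=1                   # Slurm's default step when ":STEP" is absent
    fi
    start=${range%-*}            # text before the dash
    end=${range#*-}              # text after the dash
    seq "$start" "$step" "$end"  # seq FIRST INCREMENT LAST
}

# Pending range reported by squeue for job 21075999:
expand_array_range 22655-22999:8 | wc -l       # expected: 44
expand_array_range 22655-22999:8 | tail -n 1   # expected: 22999
```

Using `seq FIRST INCREMENT LAST` keeps the expansion in plain shell. For anything beyond a single range, parsing the expression inside the monitoring script itself would be more robust.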
Comment 1 Dominik Bartkiewicz 2016-09-01 08:13:47 MDT
Hi,

This looks like an old, already-solved issue:
commit cdf2587850 fixes this problem.
The patch is available as https://github.com/SchedMD/slurm/commit/cdf258785034f825234b325.patch
You can apply this patch manually.

Dominik
Comment 2 Stuart Midgley 2016-09-01 17:27:17 MDT
Sigh, thanks. I guess we really should upgrade to a more recent version :)
Comment 3 Dominik Bartkiewicz 2016-10-13 04:16:23 MDT
Marking as resolved; please reopen if there are any further questions.

Dominik