Ticket 9460

Summary: qstat wrapper does not handle job arrays correctly (no support for "qstat -t")
Product: Slurm Reporter: Troy Baer <troy>
Component: User Commands Assignee: Tim Wickberg <tim>
Status: OPEN --- QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: jess, tdockendorf
Version: 20.02.2   
Hardware: Linux   
OS: Linux   
Site: Ohio State OSC Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Troy Baer 2020-07-23 10:01:38 MDT
The qstat TORQUE compatibility wrapper script does not appear to handle job arrays correctly.  It does not recognize the -t flag, and I'm not sure what it's doing to the jobids.

On our pre-production cluster running Slurm: 

troy@pitzer-login04:~$ qsub -t 1-10 ~/test.pbs
4787

troy@pitzer-login04:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
            4787_2 serial-40     test     troy  R       0:07      1 p0001 
            4787_3 serial-40     test     troy  R       0:07      1 p0001 
            4787_4 serial-40     test     troy  R       0:07      1 p0001 
            4787_5 serial-40     test     troy  R       0:07      1 p0001 
            4787_6 serial-40     test     troy  R       0:07      1 p0001 
            4787_7 serial-40     test     troy  R       0:07      1 p0001 
            4787_8 serial-40     test     troy  R       0:07      1 p0001 
            4787_9 serial-40     test     troy  R       0:07      1 p0001 
           4787_10 serial-40     test     troy  R       0:07      1 p0001 
            4787_1 serial-40     test     troy  R       0:07      1 p0001 

troy@pitzer-login04:~$ qstat -t
Unknown option: t
Usage:
    qstat [-f] [-a|-i|-r] [-n [-1]] [-G|-M] [-u *user_list*] [-? | --help]
    [--man] [*job_id*...]

    qstat -Q [-f]

    qstat -q


troy@pitzer-login04:~$ qstat
Job id              Name             Username        Time Use S Queue          
------------------- ---------------- --------------- -------- - ---------------
4788                test             troy            00:00:00 R serial-40core  
4789                test             troy            00:00:00 R serial-40core  
4790                test             troy            00:00:00 R serial-40core  
4791                test             troy            00:00:00 R serial-40core  
4792                test             troy            00:00:00 R serial-40core  
4793                test             troy            00:00:00 R serial-40core  
4794                test             troy            00:00:00 R serial-40core  
4795                test             troy            00:00:00 R serial-40core  
4796                test             troy            00:00:00 R serial-40core  
4787                test             troy            00:00:00 R serial-40core
Comment 1 Broderick Gardner 2020-08-12 09:06:56 MDT
I am looking into fixing the qstat wrapper.
Comment 4 Tim Wickberg 2020-08-31 09:45:47 MDT
Hi Troy -

I need to intervene here at a higher level and elaborate a bit on the support model for these various wrapper scripts.

I do appreciate the patches you've submitted, and will be looking through them further, but they are not and cannot be comprehensive. Slurm is not TORQUE, or PBS, or OpenLava, but its own entity, and these wrappers are meant to aid in conversion. We make no guarantee that they will work for every situation, or implement every possible option from every other scheduler.

In this case, it is clear that the -t flag is unimplemented. If you wish to propose an implementation for it we'd be happy to review it, but otherwise development on these wrappers does not fall within scope for our support contracts.

I'm updating the ticket metadata to reflect this is an outstanding enhancement request, albeit one that I do not expect to develop internally.

To answer the open question as to where these JobIDs come from: in Slurm, each of the array job elements, when launched (or close to launching), is split off into a separate internal job record. Each of these is assigned a unique JobID which is used for process control and accounting. The "Array Job ID", combined with the "Array Task ID", are the fields you're more used to seeing substituted in on these, but no one has converted the TORQUE wrappers to handle that format.

As one critical implementation detail: these individual jobids are created lazily. Until they have been split off from the meta record they will all be captured under the original jobid (which is equivalent to the array job id).

- Tim
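
For anyone picking up the enhancement, here is a rough sketch of the JobID mapping described above. It is not the wrapper's actual code (the function names are made up for illustration). It assumes the records come from something like `squeue -h -o "%A|%F|%K"`, where the three fields are taken to be the unique JobID, the Array Job ID, and the Array Task ID, with "N/A" as the task ID for non-array jobs; check the squeue man page for your version before relying on that. The sample records are illustrative pairings, not captured output.

```python
from typing import List, Tuple

# Hypothetical sample in "JobID|ArrayJobID|ArrayTaskID" form.
# Assumption: non-array jobs report "N/A" as the task id.
SAMPLE = """\
4788|4787|2
4787|4787|1
5000|5000|N/A
"""


def parse_records(text: str) -> List[Tuple[str, str, str]]:
    """Split pipe-delimited lines into (job_id, array_job_id, task_id)."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id, array_job_id, task_id = line.split("|")
        records.append((job_id, array_job_id, task_id))
    return records


def display_id(job_id: str, array_job_id: str, task_id: str) -> str:
    """Return the ID a user expects to see for one job record.

    Array elements are shown as <ArrayJobID>_<ArrayTaskID>, the form
    squeue prints; everything else keeps its plain JobID.
    """
    if task_id in ("", "N/A"):
        return job_id
    return f"{array_job_id}_{task_id}"


if __name__ == "__main__":
    for rec in parse_records(SAMPLE):
        print(f"{rec[0]:>6} -> {display_id(*rec)}")
```

Because the per-element JobIDs are created lazily, a record that has not yet been split off would still carry the original JobID and a task range rather than a single index; this sketch passes that range through unchanged rather than expanding it.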