| Summary: | qstat wrapper does not handle job arrays correctly (no support for "qstat -t") | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Troy Baer <troy> |
| Component: | User Commands | Assignee: | Tim Wickberg <tim> |
| Status: | OPEN --- | QA Contact: | |
| Severity: | 5 - Enhancement | | |
| Priority: | --- | CC: | jess, tdockendorf |
| Version: | 20.02.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Ohio State OSC | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
I am looking into fixing the qstat wrapper.

Hi Troy - I need to intervene here at a higher level and elaborate a bit on the support model for these various wrapper scripts. I do appreciate the patches you've submitted, and will be looking through them further, but they are not and cannot be comprehensive. Slurm is not TORQUE, PBS, or OpenLava, but its own entity, and these wrappers are meant to aid in conversion. We make no guarantee that they will work in every situation, or implement every possible option from every other scheduler.

In this case, it is clear that the -t flag is unimplemented. If you wish to propose an implementation for it, we'd be happy to review it, but otherwise development on these wrappers does not fall within the scope of our support contracts. I'm updating the ticket metadata to reflect that this is an outstanding enhancement request, albeit one that I do not expect to develop internally.

To answer the open question of where these JobIDs come from: in Slurm, each array job element, when launched (or close to launching), is split off into a separate internal job record. Each of these records is assigned a unique JobID, which is used for process control and accounting. The "Array Job ID", combined with the "Array Task ID", are the fields you're more used to seeing substituted in here, but no one has converted the TORQUE wrappers to handle that format.

One critical implementation detail: these individual JobIDs are created lazily. Until the tasks have been split off from the meta record, they are all captured under the original JobID (which is equivalent to the Array Job ID).

- Tim
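To make the two ID formats concrete, here is a minimal sketch (a hypothetical helper, not part of the existing wrapper) of the conversion a -t-aware qstat wrapper would need: turning Slurm's `ArrayJobID_ArrayTaskID` notation into the `jobid[taskid]` form TORQUE users expect, while leaving plain JobIDs untouched and preserving the bracketed range of a not-yet-split meta record.

```python
import re

def slurm_to_torque_id(jobid: str) -> str:
    """Convert a Slurm array-task JobID like '4787_3' to TORQUE's '4787[3]'.

    A pending meta record such as '4787_[4-10]' keeps its task range,
    and a plain JobID such as '4788' passes through unchanged.
    """
    m = re.fullmatch(r'(\d+)_\[?([0-9,%-]+)\]?', jobid)
    if m:
        return f'{m.group(1)}[{m.group(2)}]'
    return jobid

print(slurm_to_torque_id('4787_3'))       # 4787[3]
print(slurm_to_torque_id('4787_[4-10]'))  # 4787[4-10]
print(slurm_to_torque_id('4788'))         # 4788
```

Because of the lazy splitting Tim describes, a wrapper doing the reverse lookup (TORQUE id to Slurm record) would also have to fall back to the meta record's JobID for tasks that have not yet been split off.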
The qstat TORQUE compatibility wrapper script does not appear to handle job arrays correctly. It does not recognize the -t flag, and I'm not sure what it's doing with the JobIDs. On our pre-production cluster running Slurm:

```
troy@pitzer-login04:~$ qsub -t 1-10 ~/test.pbs
4787
troy@pitzer-login04:~$ squeue
   JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
  4787_2 serial-40  test  troy  R  0:07     1 p0001
  4787_3 serial-40  test  troy  R  0:07     1 p0001
  4787_4 serial-40  test  troy  R  0:07     1 p0001
  4787_5 serial-40  test  troy  R  0:07     1 p0001
  4787_6 serial-40  test  troy  R  0:07     1 p0001
  4787_7 serial-40  test  troy  R  0:07     1 p0001
  4787_8 serial-40  test  troy  R  0:07     1 p0001
  4787_9 serial-40  test  troy  R  0:07     1 p0001
 4787_10 serial-40  test  troy  R  0:07     1 p0001
  4787_1 serial-40  test  troy  R  0:07     1 p0001
troy@pitzer-login04:~$ qstat -t
Unknown option: t
Usage:
    qstat [-f] [-a|-i|-r] [-n [-1]] [-G|-M] [-u *user_list*] [-? | --help] [--man] [*job_id*...]
    qstat -Q [-f]
    qstat -q
troy@pitzer-login04:~$ qstat
Job id               Name             Username        Time Use S Queue
-------------------  ---------------- --------------- -------- - ---------------
4788                 test             troy            00:00:00 R serial-40core
4789                 test             troy            00:00:00 R serial-40core
4790                 test             troy            00:00:00 R serial-40core
4791                 test             troy            00:00:00 R serial-40core
4792                 test             troy            00:00:00 R serial-40core
4793                 test             troy            00:00:00 R serial-40core
4794                 test             troy            00:00:00 R serial-40core
4795                 test             troy            00:00:00 R serial-40core
4796                 test             troy            00:00:00 R serial-40core
4787                 test             troy            00:00:00 R serial-40core
```
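One possible direction for a -t implementation, sketched here as a rough illustration rather than a patch: expand each array element onto its own line, the way `squeue -r` (a real Slurm flag that displays one array element per line) already does, and reformat the rows into qstat's column layout. The `format_tasks` helper below is hypothetical; the piped field order matches what `squeue -r -h -o '%i|%j|%u|%M|%t|%P'` would emit, and the state mapping covers only a few common Slurm state codes.

```python
def format_tasks(squeue_output: str) -> list[str]:
    """Format "squeue -r -h -o '%i|%j|%u|%M|%t|%P'" output as qstat-style rows.

    Each input line is jobid|name|user|time-used|state|partition. Slurm's
    compact state codes are mapped onto the single-letter states qstat
    prints (PD -> Q for queued, CG -> E for exiting; R stays R).
    """
    state_map = {'PD': 'Q', 'CG': 'E'}
    rows = []
    for line in squeue_output.strip().splitlines():
        jobid, name, user, time_used, state, queue = line.split('|')
        state = state_map.get(state, state)
        rows.append(f'{jobid:<20} {name:<16} {user:<15} {time_used:>8} {state} {queue}')
    return rows

sample = ('4787_1|test|troy|0:07|R|serial-40core\n'
          '4787_2|test|troy|0:07|PD|serial-40core\n')
for row in format_tasks(sample):
    print(row)
```

In a real wrapper the input would come from invoking `squeue` (or, as the existing Slurm wrappers do, the Perl API) rather than a captured string, and the JobIDs could additionally be rewritten into TORQUE's bracketed array notation.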