Ticket 5077

Summary: squeue jobselectinfo ("%s") non-functional
Product: Slurm Reporter: Doug Jacobsen <dmjacobsen>
Component: User Commands Assignee: Director of Support <support>
Status: RESOLVED FIXED QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
Site: NERSC Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 17.11 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: give %s format option a header in squeue

Description Doug Jacobsen 2018-04-18 03:01:34 MDT
Hello,

I have an observant user pointing out that "squeue -o%all" can produce invalid output which is causing some level of distress.

squeue -l -j <job> -o %all 
Tue Sep 13 16:00:38 2016 
ACCOUNT|GRES|MIN_CPUS|MIN_TMP_DISK|END_TIME|FEATURES|GROUP|OVER_SUBSCRIBE|JOBID|NAME|COMMENT|TIME_LIMIT|MIN_MEMORY|REQ_NODES|COMMAND|PRIORITY|QOS|REASON|PsÈ/ü|ST|USER|RESERVATION|WCKEY|EXC_NODES|NICE|S:C:T|JOBID|EXEC_HOST|CPUS|NODES|DEPENDENCY|ARRAY_JOB_ID|GROUP|SOCKETS_PER_NODE|CORES_PER_SOCKET|THREADS_PER_CORE|ARRAY_TASK_ID|TIME_LEFT|TIME|NODELIST|CONTIGUOUS|PARTITION|PRIORITY|NODELIST(REASON)|START_TIME|STATE|USER|SUBMIT_TIME|LICENSES|CORE_SPEC|SCHEDNODES|WORK_DIR 
...


Note that between "REASON" and "ST" things get silly.  Based on the rather clever way %all works:

https://github.com/SchedMD/slurm/blob/slurm-17.11/src/squeue/opts.c#L567

we can infer this is caused by "%s" (since it sits between "%r" and "%t").
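For readers who do not want to open opts.c, here is a minimal sketch of the idea: "%all" is expanded by walking the alphabet, lowercase first and then uppercase, which matches the column order in the header above. This is not the actual Slurm code; build_all_format is a hypothetical name and the "|" separator is taken from the output shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not Slurm's implementation): expand "%all" into
 * "%a|%b|...|%z|%A|...|%Z" by walking the alphabet.
 * Returns 0 on success, -1 if buf is too small.
 */
int build_all_format(char *buf, size_t size)
{
	size_t used = 0;

	assert(buf != NULL && size > 0);
	buf[0] = '\0';

	for (int pass = 0; pass < 2; pass++) {
		char first = pass ? 'A' : 'a';
		char last = pass ? 'Z' : 'z';

		for (char c = first; c <= last; c++) {
			/* Prefix every entry except the first with '|'. */
			int n = snprintf(buf + used, size - used, "%s%%%c",
					 used ? "|" : "", c);
			if (n < 0 || (size_t)n >= size - used)
				return -1;	/* would not fit */
			used += (size_t)n;
		}
	}
	return 0;
}
```

With a large enough buffer the result contains "%r|%s|%t", so whatever "%s" prints as its header lands exactly between REASON and ST.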

The user complained about this some time ago (16.05 or 17.02) and I'm just finding the ticket now. In any case, I still see random invalid data in 17.11.5:

dmj@edison01:~> squeue --format="%s" | sort | tail









Md�h�*
dmj@edison01:~>

It appears that the invalid output I'm seeing in cori's current queue is in the header.


It seems that this is because _print_job_select_jobinfo is trying to run:

		select_g_select_jobinfo_sprint(NULL,
			select_buf, sizeof(select_buf), SELECT_PRINT_HEAD);


I'm guessing select/cray is not implementing this correctly.
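To illustrate the guess, here is a sketch of the suspected failure mode, assuming the plugin returns in the header case without writing to the caller's buffer. The enum and both function names are hypothetical stand-ins, not the real select plugin API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the plugin's print modes. */
enum print_mode { PRINT_HEAD, PRINT_DATA };

/* Suspected pattern: the header case returns without touching buf. */
char *sprint_jobinfo_buggy(char *buf, size_t size, enum print_mode mode)
{
	assert(buf != NULL && size > 0);
	if (mode == PRINT_HEAD)
		return buf;	/* buf still holds whatever was there before */
	buf[0] = '\0';
	return buf;
}

/* Safer pattern: every mode leaves a valid, NUL-terminated string. */
char *sprint_jobinfo_fixed(char *buf, size_t size, enum print_mode mode)
{
	assert(buf != NULL && size > 0);
	(void) mode;		/* header and data both print as empty here */
	buf[0] = '\0';
	return buf;
}
```

If the caller passes an uninitialized stack buffer, as select_buf appears to be, the first variant would hand back stale bytes, which would explain the garbage in the header row.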

Please keep in mind that we need select/cray to do the right thing here whether or not the slurm build is for cray (the build on our elogin nodes, where squeue runs, is not for native cray).  However, the invalid output is present on both native cray and linux builds when select/cray is the select plugin.

Looks like this is called out as a FIXME:

https://github.com/SchedMD/slurm/blob/de4c76ebfe53628be255d2dbf30a2c45631776cb/src/plugins/select/cray/select_cray.c#L2597


The time has arrived. =)

Thanks,
Doug
Comment 1 Doug Jacobsen 2018-04-18 03:14:27 MDT
select/alps prints a header

select/cons_res puts in an empty string (without checking buffer length)
select/linear same
select/serial same
select/cray explicitly causes trouble (subject of this ticket)

select/bluegene is complicated but there

Seems like all of them should at least put in a header, if only so there isn't a mysterious empty column in %all output.
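A bounds-checked way to do that could look like the sketch below. sprint_header is a hypothetical helper, and any header text passed to it (for example "SELECT_INFO") is made up for illustration; none of the plugins are known to use this name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: copy a column header into buf, truncating if it
 * does not fit. Passing "" gives the empty-string behaviour.
 */
char *sprint_header(char *buf, size_t size, const char *header)
{
	assert(buf != NULL && header != NULL);
	if (size == 0)
		return buf;	/* nothing can be written safely */
	/* snprintf always NUL-terminates when size > 0 */
	snprintf(buf, size, "%s", header);
	return buf;
}
```

The point is only that the buffer length is respected and the string is always terminated, whichever header text ends up being chosen.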
Comment 2 Isaac Hartung 2018-04-19 11:33:53 MDT
Created attachment 6656
give %s format option a header in squeue
Comment 4 Isaac Hartung 2018-05-01 10:22:20 MDT
Hi Doug,

The solution for this issue is in commit d3398004245fcc29c5c8f93311957fb3960dc6b2

We have decided that the cray plugin will treat this the same way the serial, linear, and cons_res plugins do: by returning an empty string for %s.

This is the solution for 17.11.  We intend to either remove the column or give it a header independent of the plugin in 18.08.
Comment 5 Isaac Hartung 2018-05-02 09:29:12 MDT
I'm going to close this ticket.  Please reopen it should you have any further issues with the squeue %s option.