| Summary: | Incorrect CPU_IDs reported by scontrol show job for hyperthreaded nodes | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Martin Perry <martin.perry> |
| Component: | Other | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | bill.brophy, da |
| Version: | 14.11.x | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Atos/Eviden Sites | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 14.03.4 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Patch to fix bug; Patch, Version 2 | | |
Description: Martin Perry, 2014-06-02 08:44:46 MDT
As you say, this will involve a lot of code change. It is also only relevant if you want to report CPU IDs rather than "abstract core IDs" for the "scontrol -d show job" command. Have you considered changing scontrol to get the node information when needed (i.e. when "scontrol -d show job" is run)? That is probably much simpler, and probably provides better overall system scalability, than adding a ThreadsPerCore array to the job_resources_t struct.

---

job_resources_t already has arrays for sockets_per_node and cores_per_socket. That's why I thought it would make sense to add threads_per_core. But since that would mean a lot of code change, and it's not really very efficient, I'll look into changing scontrol to get the node info dynamically. Thanks.

---

(In reply to Martin Perry from comment #2)

> job_resources_t already has arrays for sockets_per_node and
> cores_per_socket. That's why I thought it would make sense to add
> threads_per_core. But since that would mean a lot of code change, and it's
> not really very efficient, I'll look into changing scontrol to get the node
> info dynamically. Thanks.

Those are needed to interpret the core_bitmap. It's easy enough to add the threads_per_core field too, but it is a lot more work and arguably not needed in that data structure. Either design should be fine, so, your choice...

---

Created attachment 922 [details]
Patch to fix bug
Issuing an RPC for each node is not a very scalable solution. Is there some reason you don't just get the full node table and pick out the information about specific nodes of interest? That offloads most of the work to the squeue command.

---

Moe,

Yes, I should make it more scalable. So I need to call slurm_load_node instead of slurm_load_node_single, and then pick out the nodes of interest from the node_array, right?

---

(In reply to Martin Perry from comment #6)

> Moe,
> Yes, I should make it more scalable. So I need to call slurm_load_node
> instead of slurm_load_node_single, and then pick out the nodes of interest
> from the node_array, right?

Correct. Load the full table and then search it by node name as needed. It's pretty simple and there are many examples (see sinfo, sview, scontrol, etc.).

---

Created attachment 928 [details]
Patch, Version 2
Version 2 of the patch is attached, addressing the scalability issue in the original version.
Martin, this patch still makes the call for each job, which should be avoided. I am not sure how to make the current API work better. Perhaps change the slurm_sprint_job_info() and slurm_print_job_info() function calls to accept a node_info_msg_t which is filled in beforehand. Calling slurm_load_*node in the function should be avoided. This is just one idea; perhaps there are others?

---

I was thinking "scontrol show job" is relatively rare, so doing the slurm_load_node each time would be OK. But I'll look into making the call outside the function. The other option Moe and I discussed previously is adding a threads_per_core array to job_resources_t, but that would impact many more files. I have a few other things that need my attention, so I probably won't get back to this today.

---

Patch, version 2 is here:
https://github.com/SchedMD/slurm/commit/83d626ca45032d71f3847a9a92e9a21fac5e8031

I moved around the code a bit so the node table only needs to be loaded once, for better performance:
https://github.com/SchedMD/slurm/commit/04921aa2d2662cc4236aaa4d6e86f99f7caeb19f

These changes will be in version 14.03.4.

---

Thanks!