| Summary: | Implement support for nodes with more than one board. | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Rod Schultz <Rod.Schultz> |
| Component: | Other | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | da, martin.perry |
| Version: | 2.5.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Atos/Eviden Sites | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Patch to 2.5.0-pre1; slurm.conf which causes crash; Patch to fix this bug | | |
All integrated with the main slurm code base.

Rod, we have found an issue with the patch. If you run a job on a clean system looking to run a task on every cpu on the node in this manner:

srun -n<num_of_cpus_on_one_node> -N1 hostname

the slurmctld will seg fault. The root of the problem appears to be in src/plugins/select/cons_res/dist_tasks.c (line 429):

ncomb_brd = comb_counts[nboards_nb-1][b_min-1];

which sets ncomb_brd to 0. This appears to corrupt the socket_list and causes issues later on. I haven't looked much further into the code, but if I set this number to 1 instead of 0, things appear to work just fine. I doubt this is really the answer, though. Perhaps you can see something so we don't have to dig in. If you run a job previous to this, everything works fine as well; the full-node job has to be the first job run after the slurmctld is started up.

Danny, I haven't been able to reproduce this problem. Could you send me your node definition and any other relevant slurm.conf settings? I'm not sure how ncomb_brd could be getting set to 0. If the node is defined with "Boards=1", or it defaults to 1 because the "Boards" param is omitted, ncomb_brd should get set to 1.

Created attachment 103 [details]
slurm.conf which causes crash
Here is my slurm.conf; let me know if you need anything else.
Remember the job has to be the first job run. It doesn't appear to matter if it was a clean start or not.
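For readers without the attachment, a node definition of the general shape involved here looks like the fragment below. These are hypothetical values for illustration only, not the contents of Danny's actual slurm.conf; the relevant parts are a Boards count and ThreadsPerCore greater than 1.

```
# Hypothetical slurm.conf fragment; not the actual attached file.
NodeName=node[01-04] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=node[01-04] Default=YES State=UP
```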
Thanks Danny. I think I've managed to reproduce the problem, or a very similar variant of it. It seems to have to do with ThreadsPerCore > 1, but I'm not completely sure. Anyway, I'll let you know when I have more information.

Danny, I have attached a patch to fix this problem in 2.5.0-pre2. It seems to fix the problem for me, but I'm still running some regression tests. Please verify that it fixes the problem for you. Thanks.

Created attachment 104 [details]
Patch to fix this bug
Excellent Martin, this appears to fix the issue.

Great. If I find a problem in my regression testing I'll let you know, but so far it looks good.
Created attachment 91 [details]
Patch to 2.5.0-pre1

This implementation also includes best fit on sockets within boards. I had to port it from our 2.5-pre1 base, which has some additional functionality, particularly in the area of power management. I believe I removed all aspects of that, but if you find defects, that may be the cause. The job state version is also incremented by two, but I think that just folds in the patch Danny made to 2.4.1. I believe I may have added boards to an existing line of the scontrol show nodes output; if you want me to rework that, just tell me what format you would like to see and I will send another patch. Also note that since we use hwloc, I had to modify the Makefile.am for all the executables to include linkage information.
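The hwloc linkage change mentioned above would look roughly like the automake fragment below. The HWLOC_* variable names are assumptions (whatever Slurm's configure macros actually export may differ); this is a sketch of the per-program change, not the patch itself.

```makefile
# Hypothetical Makefile.am fragment showing hwloc linkage for one program;
# actual variable names from Slurm's configure scripts may differ.
slurmctld_CPPFLAGS = $(AM_CPPFLAGS) $(HWLOC_CPPFLAGS)
slurmctld_LDFLAGS  = $(AM_LDFLAGS) $(HWLOC_LDFLAGS)
slurmctld_LDADD    = $(LDADD) $(HWLOC_LIBS)
```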