Ticket 86

Summary: Implement support for nodes with more than one board.
Product: Slurm Reporter: Rod Schultz <Rod.Schultz>
Component: Other    Assignee: Moe Jette <jette>
Status: RESOLVED FIXED QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: da, martin.perry
Version: 2.5.x   
Hardware: Linux   
OS: Linux   
Attachments: Patch to 2.5.0-pre1
slurm.conf which causes crash
Patch to fix this bug

Description Rod Schultz 2012-07-16 04:45:54 MDT
Created attachment 91 [details]
Patch to 2.5.0-pre1

This implementation also includes best fit on sockets within boards. 

I had to port it from our 2.5-pre1 base, which has some additional functionality, particularly in the area of power management. I believe I removed all of that, but if you find defects, that may be the cause.

The job state version is also incremented by two, but I think that just folds in the patch Danny made for 2.4.1.

I believe I may have added boards to an existing line of the scontrol show nodes output. If you want me to rework that, just tell me what format you would like to see and I will send another patch. 

Also note that since we use hwloc, I had to modify the Makefile.am for all of the executables to include the linkage information.
Comment 1 Moe Jette 2012-07-19 05:42:03 MDT
All integrated into the main Slurm code base.
Comment 2 Danny Auble 2012-08-06 09:20:34 MDT
Rod, we have found an issue with the patch.

If you run a job on a clean system that places a task on every CPU of the node, in this manner...

srun -n<num_of_cpus_on_one_node> -N1 hostname

The slurmctld will seg fault.  The root of the problem appears to be that
src/plugins/select/cons_res/dist_tasks.c (line 429)
		ncomb_brd = comb_counts[nboards_nb-1][b_min-1];
sets ncomb_brd to 0.

This appears to corrupt the socket_list and thus causes issues later on.

I haven't looked much further into the code, but if I set this number to 1 instead of 0, things appear to work just fine.  I doubt this is really the answer, though.

Perhaps you can see something so we don't have to dig in.

If you run any job previous to this, everything works fine as well.  The full-node job has to be the first job run after the slurmctld is started up.
Comment 3 Martin Perry 2012-08-07 05:33:18 MDT
Danny,

I haven't been able to reproduce this problem. Could you send me your node definition and any other relevant slurm.conf settings?

I'm not sure how ncomb_brd could be getting set to 0. If the node is defined with "Boards=1", or it defaults to 1 because the "Boards" param is omitted, ncomb_brd should get set to 1.
Comment 4 Danny Auble 2012-08-07 05:36:03 MDT
Created attachment 103 [details]
slurm.conf which causes crash

Here is my slurm.conf; let me know if you need anything else.

Remember, the job has to be the first job run.  It doesn't appear to matter whether it was a clean start or not.
Comment 5 Martin Perry 2012-08-07 09:58:43 MDT
Thanks Danny.  I think I've managed to reproduce the problem, or a very similar variant of the problem.  It seems to have to do with ThreadsPerCore > 1, but I'm not completely sure.  Anyway, I'll let you know when I have more information.
Comment 6 Martin Perry 2012-08-08 08:42:54 MDT
Danny,

I have attached a patch to fix this problem in 2.5.0-pre2.  It seems to fix the problem for me, but I'm still running some regression tests. Please verify that it fixes the problem for you.  Thanks.
Comment 7 Martin Perry 2012-08-08 08:44:15 MDT
Created attachment 104 [details]
Patch to fix this bug
Comment 8 Danny Auble 2012-08-08 11:01:42 MDT
Excellent Martin, this appears to fix the issue.
Comment 9 Martin Perry 2012-08-08 11:39:32 MDT
Great.  If I find a problem in my regression testing I'll let you know, but so far it looks good.