| Summary: | Implement support for nodes with more than one board. | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Rod Schultz <Rod.Schultz> |
| Component: | Other | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | da, martin.perry |
| Version: | 2.5.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Atos/Eviden Sites | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Patch to 2.5.0-pre1; slurm.conf which causes crash; Patch to fix this bug | | |
All integrated with the main slurm code base.

Rod, we have found an issue with the patch. If you run a job on a clean system looking to run a task on every cpu on the node in this manner:

srun -n<num_of_cpus_on_one_node> -N1 hostname

the slurmctld will seg fault. The root of the problem appears to be in src/plugins/select/cons_res/dist_tasks.c (line 429):

ncomb_brd = comb_counts[nboards_nb-1][b_min-1];

which sets ncomb_brd to 0. This appears to corrupt the socket_list and causes issues later on. I haven't looked much further into the code, but if I set this number to 1 instead of 0, things appear to work just fine. I doubt this is really the answer, though. Perhaps you can see something so we don't have to dig in. If you run a job previous to this, everything works fine as well; the full-node job has to be the first job run after the slurmctld is started up.

Danny, I haven't been able to reproduce this problem. Could you send me your node definition and any other relevant slurm.conf settings? I'm not sure how ncomb_brd could be getting set to 0. If the node is defined with "Boards=1", or it defaults to 1 because the "Boards" param is omitted, ncomb_brd should get set to 1.

Created attachment 103 [details]
slurm.conf which causes crash
Here is my slurm.conf; let me know if you need anything else.
Remember the job has to be the first job run. It doesn't appear to matter if it was a clean start or not.
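For readers without the attachment, a node definition of the general shape involved here looks like the fragment below. These are hypothetical values for illustration only, not the contents of Danny's actual slurm.conf; the relevant parts are a Boards count and ThreadsPerCore greater than 1.

```
# Hypothetical slurm.conf fragment; not the actual attached file.
NodeName=node[01-04] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=node[01-04] Default=YES State=UP
```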
Thanks Danny. I think I've managed to reproduce the problem, or a very similar variant of it. It seems to have to do with ThreadsPerCore > 1, but I'm not completely sure. Anyway, I'll let you know when I have more information.

Danny, I have attached a patch to fix this problem in 2.5.0-pre2. It seems to fix the problem for me, but I'm still running some regression tests. Please verify that it fixes the problem for you. Thanks.

Created attachment 104 [details]
Patch to fix this bug
Excellent Martin, this appears to fix the issue.

Great. If I find a problem in my regression testing I'll let you know, but so far it looks good.
Created attachment 91 [details]
Patch to 2.5.0-pre1

This implementation also includes best fit on sockets within boards. I had to port it from our 2.5-pre1 base, which has some additional functionality, particularly in the area of power management. I believe I removed all aspects of that, but if you find defects, that may be the cause. The job state version is also incremented by two, but I think that just folds in the patch Danny made to 2.4.1. I believe I may have added boards to an existing line of the scontrol show nodes output; if you want me to rework that, just tell me what format you would like to see and I will send another patch. Also note that since we use hwloc, I had to modify the Makefile.am for all the executables to include linkage information.
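The hwloc linkage change mentioned above would look roughly like the automake fragment below. The HWLOC_* variable names are assumptions (whatever Slurm's configure macros actually export may differ); this is a sketch of the per-program change, not the patch itself.

```makefile
# Hypothetical Makefile.am fragment showing hwloc linkage for one program;
# actual variable names from Slurm's configure scripts may differ.
slurmctld_CPPFLAGS = $(AM_CPPFLAGS) $(HWLOC_CPPFLAGS)
slurmctld_LDFLAGS  = $(AM_LDFLAGS) $(HWLOC_LDFLAGS)
slurmctld_LDADD    = $(LDADD) $(HWLOC_LIBS)
```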