| Summary: | with -m*:cyclic and -c2 affinity binding | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Danny Auble <da> |
| Component: | slurmd | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | da, martin.perry, matthieu.hautreux, nancy.kritkausky, yiannis.georgiou |
| Version: | 14.03.3 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | SchedMD | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 14.03.5 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | Add -m *:cyclic:fcyclic support in task/cgroup | | |
Description

Danny Auble 2014-05-20 04:43:34 MDT
When doing task binding using task/affinity or task/cgroup and asking for multiple CPUs per task (srun -c), the CPUs picked for each task are currently chosen in a cyclic fashion over the sockets on the node. So if a task requesting -c4 ran on a node with 2 sockets and 6 cores per socket, it would get 2 cores on one socket and 2 cores on the other. My feeling on the matter is that an application will perform better if a task requesting multiple CPUs per task has those CPUs grouped together in a block fashion on one socket instead of spread cyclically. In the case above, the task would get all 4 of its cores on 1 socket instead of spread across 2. Does anyone see an issue with this approach? Currently, if no distribution is requested, there is an attempt to make this happen starting with https://github.com/SchedMD/slurm/commit/d689495804bded28170cde86214a8b9a82be2911, but it needs to be fleshed out a bit more, and it doesn't apply to task/cgroup.

I agree that requesting cyclic distribution should by default mean cyclic distribution of tasks, but not of the cores associated with the tasks. The srun man page does not explain how exactly cores requested using -c should be treated by the binding logic, so using this as the default, even though it is disruptive to the current logic, could be added and the documentation updated. The current logic may still be interesting if a user really wants to launch as many tasks as cores per socket, with each task using one core on each socket, for example. This means the 3 different cases of inter-socket distribution have to be provided, as you did, and a new parameter should probably be added to describe that new scenario.
Maybe something like:
- block: distribute tasks by block, including their CPUs when >1
- cyclic: distribute tasks cyclically across sockets, including their CPUs when >1
- hybrid: distribute tasks cyclically across sockets, but distribute their CPUs using a block logic inside sockets when possible

As you did, hybrid should be the default if nothing else was explicitly requested. As I said, this is disruptive, so you might ask for comments on the mailing list first and, in case of strong opposition, use cyclic as the default and mention in the man page that hybrid is the better logic. I think it should not be so hard to add that logic to task/cgroup too.

Do you have any ideas on the option for the user to use this? We are hoping to put this into 14.03. It seems like there should be something like -m*:cyclic:block or -m*:cyclic:cyclic; this third layer doesn't exist today, but could probably be added, I suppose. I don't think people would want -m*:block:cyclic — should that even be allowed? Or are you just saying there should be a -m*:hybrid? I think pulling in a third option of -m would make it the most clear what is happening, but that might be my own opinion.

I am not sure of the interest of -m*:block:cyclic either. That is the reason I was thinking more about a -m*:block|cyclic|hybrid option, to avoid having to add the third layer -m*:block|cyclic:block|cyclic. I do not have a strong opinion on the best option at this point. FYI, I know that Open MPI is (maybe was?) in some versions parsing the a:b pattern of SLURM_DIST to get the kind of binding to apply. It might break things to add an extra layer, but it might break because of a new hybrid pragma too. This will have to be explicitly noted as a major change in either case.

I ended up calling it fcyclic (full cyclic). This can be changed if anyone hates the name. Commit a335b9749a18eec4214be50cdce8c12ca2c653a2 has these changes.
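The layouts under discussion can be illustrated with a small simulation. This is a hypothetical sketch, not Slurm's implementation: the `assign` helper and its mode names are made up here purely to show the selection order each policy implies.

```python
# Sketch of the three CPU-to-task layouts discussed above, for a node
# with `sockets` x `cores_per_socket` cores and tasks asking for
# cpus_per_task cores each. No capacity checking -- illustration only.

def assign(sockets, cores_per_socket, ntasks, cpus_per_task, mode):
    # Per-socket lists of free core indices.
    free = [list(range(cores_per_socket)) for _ in range(sockets)]
    layout = []
    sock = 0
    for _ in range(ntasks):
        cpus = []
        if mode == "fcyclic":
            # Old "cyclic": spread each task's CPUs across sockets.
            s = 0
            while len(cpus) < cpus_per_task:
                if free[s]:
                    cpus.append((s, free[s].pop(0)))
                s = (s + 1) % sockets
        elif mode == "cyclic":
            # New default: tasks cycle over sockets, but each task's
            # CPUs are taken as a block from one socket.
            while len(free[sock]) < cpus_per_task:
                sock = (sock + 1) % sockets
            for _ in range(cpus_per_task):
                cpus.append((sock, free[sock].pop(0)))
            sock = (sock + 1) % sockets
        elif mode == "block":
            # Fill one socket completely before moving to the next.
            s = 0
            while len(cpus) < cpus_per_task:
                if free[s]:
                    cpus.append((s, free[s].pop(0)))
                else:
                    s += 1
        layout.append(cpus)
    return layout

# srun -n2 -c2 on a 2-socket, 6-cores-per-socket node:
print(assign(2, 6, 2, 2, "fcyclic"))  # each task split across sockets
print(assign(2, 6, 2, 2, "cyclic"))   # each task's CPUs blocked on one socket
```

With `-n2 -c2` the `fcyclic` mode gives each task one core on each socket, while `cyclic` gives task 0 two cores on socket 0 and task 1 two cores on socket 1 — the old and new default behaviors described later in this thread.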
I only altered task/affinity to handle this; task/cgroup hasn't been changed. Comments are welcome.

Does full cyclic correspond to the current behavior of multiple CPUs being bound cyclically, and cyclic to the new logic equivalent to -m*:cyclic:block? I can try to look at how to modify task/cgroup for that, but not before next week or the week after.

Correct, the new default behavior is the equivalent of -m*:cyclic:block. To get the older default, use -m*:fcyclic. I will be sure to update the documentation to describe this. Let me know what you come up with on task/cgroup; I am hoping it isn't that bad. I think we plan on tagging 14.03.4 in the next 2 weeks, but I don't think anything is pressing right now.

Danny,

For the distribution of tasks to nodes (the "first distribution") there is a distribution method called "plane", which combines block and cyclic distribution. I think what you have implemented here is plane distribution as the default method for distributing CPUs to tasks (the "second distribution"), with the plane size equal to the number of CPUs per task. I agree this is probably a better default than the current method. It could be extended to support arbitrary plane sizes, same as for the distribution of tasks to nodes. Perhaps we should retain the name "cyclic" to refer to the existing method and use "plane" to refer to your new default method.
So with your example of a node with 2 sockets and 6 cores per socket, with -n2 -c2, the OLD default behavior is "-m *:cyclic", like this:

Task 0: Socket 0, Core 0; Socket 1, Core 0
Task 1: Socket 0, Core 1; Socket 1, Core 1

And the NEW default behavior (with your change) is equivalent to "-m *:plane=2", like this:

Task 0: Socket 0, Core 0; Socket 0, Core 1
Task 1: Socket 1, Core 0; Socket 1, Core 1

Martin,

(In reply to Martin Perry from comment #8)
> For the distribution of tasks to nodes (the "first distribution") there is a
> distribution method called "plane", which combines block and cyclic
> distribution.

I don't believe there is a -m *:plane, only -m plane=; I believe there is a difference. Also, I believe plane is related to the task layout, not the CPUs-per-task layout. Am I misunderstanding what plane is?

> I think what you have implemented here is plane distribution as the default
> method for distributing cpus to tasks (the "second distribution"), with the
> plane size equal to the number of cpus per task. I agree this is probably a
> better default than the current method. It could be extended to support
> arbitrary plane sizes, same as for the distribution of tasks to nodes.
>
> Perhaps we should retain the name "cyclic" to refer to the existing method
> and use "plane" to refer your new default method.
>
> So with your example of a node with 2 sockets and 6 cores per socket, with
> -n2 -c2, the OLD default behavior is "-m *:cyclic" like this:
>
> Task 0: Socket 0, Core 0; Socket 1, Core 0
> Task 1: Socket 0, Core 1; Socket 1, Core 1

Correct, this is what I get.

> And the NEW default behaviour (with your change) is equivalent to "-m
> *:plane=2" like this:
>
> Task 0: Socket 0, Core 0; Socket 0, Core 1
> Task 1: Socket 1, Core 0; Socket 1, Core 1

I am unable to get this from older code.
If I test this out with 2.6 (before my change) I get these results with srun -n2 -c2 -mplane=2 --exclusive:

Task 0: Socket 0, Core 0; Socket 0, Core 1
Task 1: Socket 0, Core 2; Socket 0, Core 3

Without --exclusive I get the last 4 cores on the last socket (as expected). I think after my change you will get what you are expecting, but only because it does the correct thing; I don't think it did before, and I doubt plane will do what you expect in all cases. I still think plane and what I have done to cyclic are different. Perhaps I have missed something though — please let me know. Plane distribution as I understand it is described at http://slurm.schedmd.com/dist_plane.html.

Created attachment 898 [details]
Add -m *:cyclic:fcyclic support in task/cgroup
This is a first attempt at a patch enabling the cyclic|fcyclic logic in the task/cgroup plugin. I did not have the possibility to test it well, so it would be great if one of you could run more tests with it.
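As an aside, the plane distribution referenced above (dist_plane.html) hands tasks out to nodes in blocks of the plane size, round-robin across the nodes. A minimal sketch of that idea — the `plane_dist` helper here is hypothetical, not Slurm code:

```python
# Illustrative sketch of "plane" task-to-node distribution: tasks are
# assigned in consecutive blocks of plane_size, cycling over the nodes
# until all tasks are placed.

def plane_dist(ntasks, nodes, plane_size):
    placement = {n: [] for n in range(nodes)}
    task, node = 0, 0
    while task < ntasks:
        # Place up to plane_size consecutive tasks on the current node.
        for _ in range(plane_size):
            if task >= ntasks:
                break
            placement[node].append(task)
            task += 1
        node = (node + 1) % nodes
    return placement

# 10 tasks over 3 nodes with plane size 2:
print(plane_dist(10, 3, 2))
# -> {0: [0, 1, 6, 7], 1: [2, 3, 8, 9], 2: [4, 5]}
```

This shows why plane is described as combining block (within a node, up to the plane size) and cyclic (across nodes) distribution — and why it applies to the task layout rather than to the CPUs-per-task layout discussed in this bug.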
Matthieu, I thought (I could be wrong since I don't totally remember) the current code does fcyclic for cyclic (which now means distributing the tasks cyclically over the sockets, but blocking the CPUs per task). Are you saying task/cgroup already did this? And what you added will make the CPUs per task cyclic?

The affinity logic in task/cgroup was split into 2 parts:
- _task_cgroup_cpuset_dist_cyclic: did the cyclic affinity (in the original way, which now corresponds to fcyclic)
- _task_cgroup_cpuset_dist_block: did the block affinity

This patch modifies _task_cgroup_cpuset_dist_cyclic in order to:
- change the cyclic affinity logic to do cyclic distribution of tasks across sockets, while trying to allocate multiple cores on a socket using a block approach when multiple cores are requested per task
- use a fully cyclic logic (previously called cyclic) when the requested distribution is SLURM_DIST_CYCLIC_CFULL

I think that is what we agreed on in comments #6 and #7, but I might be wrong...

This explanation seems correct. I think my confusion was that I didn't read comment 10 very well :). I read -m *:cyclic:fcyclic as -m *:fcyclic instead of -m *:cyclic|fcyclic.

Thanks Matthieu, there was only a minor issue with a print statement, but this appears to do the correct thing. This is checked into the 14.03 branch as commit 992ec0943c614be378e3820f835d673252e604af and will be in 14.03.5. It would be nice if others tested this a bit more, but in the testing I did it appeared to do the correct thing.
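The control flow Matthieu describes can be summarized in a small dispatch sketch. The enum values and function bodies below are illustrative placeholders, not the actual Slurm symbols or implementations — only the shape of the split (a block routine, a cyclic routine, and a flag selecting fully-cyclic CPU placement, SLURM_DIST_CYCLIC_CFULL in Slurm) mirrors the description above:

```python
# Sketch of the task/cgroup affinity dispatch described in this thread.
from enum import Enum, auto

class Dist(Enum):
    BLOCK = auto()         # block distribution of tasks and their CPUs
    CYCLIC = auto()        # -m *:cyclic: tasks cyclic, CPUs blocked per socket
    CYCLIC_CFULL = auto()  # -m *:fcyclic: CPUs also spread across sockets

def dist_block(task_id):
    # Stands in for _task_cgroup_cpuset_dist_block.
    return f"task {task_id}: block placement"

def dist_cyclic(task_id, full_cyclic):
    # Stands in for the patched _task_cgroup_cpuset_dist_cyclic, which
    # only spreads a task's CPUs over sockets in the fully-cyclic case.
    how = "cpus spread across sockets" if full_cyclic else "cpus blocked per socket"
    return f"task {task_id}: cyclic placement, {how}"

def place(task_id, dist):
    if dist is Dist.BLOCK:
        return dist_block(task_id)
    return dist_cyclic(task_id, full_cyclic=(dist is Dist.CYCLIC_CFULL))

print(place(0, Dist.CYCLIC))        # new default: cpus blocked per socket
print(place(0, Dist.CYCLIC_CFULL))  # old behavior: cpus spread across sockets
```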