Created attachment 38576 [details]
scripts to reproduce the unexpected behavior

Hi,

I have a multi-node allocation (interactive or batch) and I start multiple job steps under it with srun. In an attempt to minimize inter-node communication for each job step, I try to use the "--distribution=pack" option, but it behaves unexpectedly (at least to me).

In short, whenever the first job step occupies the entire first node (of the two-node allocation) and spills over onto the second node, the second job step does not start on the remaining idle cpus of the second node, even though there are clearly enough cpus on the second node to run it even with some cpus already used by the first job step. Instead, it waits for the first job step to finish and only then starts, on the first (!) node. The expected (to me) behavior would be that the second job step starts on the second node in parallel with the first job step running on the first and second nodes.

Steps to reproduce:

1) Get an interactive allocation via "salloc --partition=debug --reservation=debug --qos=debug -N2 --time=00:30:00". This gets me two nodes of 112 cpus each on my machine.

2) Run the "run.bsh" script (in the attached zip) and observe that the second step only starts after the first step is finished.

3) Additionally, even though the first step runs mostly (!) on node 0 (expected due to "--distribution=pack"), the second job step also runs on node 0, since it waited for the first step to finish.

Is this the expected behavior for "--distribution=pack", or is there something wrong with how slurm functions on our end? If this is the expected behavior, is there any way to get the behavior I want, i.e., where the job steps are "packed" (tasks not distributed evenly between the nodes), but multiple job steps can still run on multiple nodes?
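For illustration, the pattern in run.bsh boils down to something like this (a simplified sketch only - the -n values here are made up, the real scripts are in the attached zip):

sketch of run.bsh
-----------------
#!/bin/bash
# First step: asks for more tasks than one node has cpus, so with
# --distribution=pack it fills node 0 and spills onto node 1.
srun -n 120 --hint=nomultithread --distribution=pack ./jstep_01.bsh &

# Second step: small enough to fit on the idle cpus of node 1, but it
# only starts after the first step has finished.
srun -n 20 --hint=nomultithread --distribution=pack ./jstep_02.bsh &

wait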
Hello, Can you share your current slurm.conf (also showing your node configuration)? Can you also give us the output of "scontrol show job" for one of the allocations done with your reproducer? I want to check your case with a configuration as close as possible to yours. Best regards, Ricard.
Created attachment 38604 [details] /etc/slurm/slurm.conf
Hi,

# I request an interactive allocation with:
salloc --partition=debug --reservation=debug --qos=debug -N2 --time=00:20:0

# result of "scontrol show job":
JobId=1629782 JobName=interactive
   UserId=kirill(2502) GroupId=kirill(2502) MCS_label=N/A
   Priority=70669 Nice=0 Account=xd QOS=debug
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:12 TimeLimit=00:20:00 TimeMin=N/A
   SubmitTime=2024-08-29T10:46:46 EligibleTime=2024-08-29T10:46:46
   AccrueTime=Unknown
   StartTime=2024-08-29T10:46:46 EndTime=2024-08-29T11:06:46 Deadline=N/A
   PreemptEligibleTime=2024-08-29T11:16:46 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-08-29T10:46:46 Scheduler=Main
   Partition=debug AllocNode:Sid=ro-rfe3:167121
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=nid[001157-001158]
   BatchHost=nid001157
   NumNodes=2 NumCPUs=448 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=2,mem=474000M,node=2,billing=2
   AllocTRES=cpu=448,node=2,billing=448
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Reservation=debug
   OverSubscribe=NO Contiguous=0 Licenses=roscratch1@slurmdb Network=(null)
   Command=/bin/sh
   WorkDir=/users/kirill/slurm_pack
   Power=
Hello,

I used your reproducers and I was able to get similar behaviors, but I have just noticed a detail that I totally missed the first time.

In your "jstep_*.bsh" files, the environment variable that you are using to determine the node where the task is being executed is "SLURM_NODEID". The value of this variable goes from 0 to N-1 (where N is the number of nodes allocated to that step). However, this range can be different for each step allocation, and its value is the *relative* node ID inside that range. In your "distribution_01" file, you will see nodes 0 and 1 because that step actually needed to allocate resources on two different nodes. On the other hand, "distribution_02" will only ever show node 0 because it only allocates a single node.

Could you perform the same testing, but complement the outputs with "SLURM_STEP_NODELIST" (and any other env variables that you might find useful)? This will show exactly which nodes are allocated for each step and where the task is being executed relative to that. Let me know if you still get unexpected results after this.

Best regards,
Ricard.
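P.S. Something along these lines inside the jstep_*.bsh scripts would capture both the relative node ID and the step's actual node list (just a sketch, adjust the output file name as you see fit):

sketch
------
#!/bin/bash
# Record where this task actually runs: relative node ID, the step's
# node list, and the cpu affinity of the task.
{
    echo "SLURM_JOB_ID        = $SLURM_JOB_ID"
    echo "SLURM_STEP_ID       = $SLURM_STEP_ID"
    echo "SLURM_PROCID        = $SLURM_PROCID"
    echo "SLURM_NODEID        = $SLURM_NODEID"
    echo "SLURM_STEP_NODELIST = $SLURM_STEP_NODELIST"
    echo "affinity            = $(taskset -cp $$)"
} >> "step_info_${SLURM_STEP_ID}_task_${SLURM_PROCID}"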
Hi Ricard,

I just updated the script and re-generated the results. I think the results are still consistent with the picture I had: the second job step starts only after the first step is finished, even though there are enough CPUs on the second node while the first step is running.

Also, I increased the sleep duration in the script to make it a little clearer to myself. Now, when I start ./run.bsh under a two-node interactive allocation, the script runs for about 10 sec (I am just counting manually :), then it prints out "srun: Step created for StepId=1634712.1" without quitting. It runs for another ~10 sec and quits only after that.
Created attachment 38627 [details] Updated scripts printing out more job step info
Hello,

I have done some testing and you were missing the "--exact" [1] option for srun. Without it, by default a step will get assigned the resources of the whole node, thus blocking other parallel steps from executing on the same node.

Here is a working example in job script form that you can launch via sbatch as is; you only need to modify the -n parameters (my testing nodes are 4 cores with 2 threads/core).

job script
----------
>> #!/bin/bash
>>
>> #SBATCH -N 2
>> #SBATCH --exclusive
>> #SBATCH -o /dev/null
>>
>> rm -rf distribution_*
>> rm -rf step_info*
>>
>> srun -l -n 5 --exact --hint=nomultithread --distribution=pack jstep_01.bsh &
>> srun -l -n 2 --exact --hint=nomultithread --distribution=pack jstep_02.bsh
>> wait

With this, I get the following:

distribution_01
---------------
>> task 3 (node 0): pid 86201's current affinity list: 6
>> task 0 (node 0): pid 86198's current affinity list: 0
>> task 4 (node 1): pid 86195's current affinity list: 0
>> task 1 (node 0): pid 86199's current affinity list: 2
>> task 2 (node 0): pid 86200's current affinity list: 4

distribution_02
---------------
>> task 0 (node 0): pid 86194's current affinity list: 2
>> task 1 (node 0): pid 86197's current affinity list: 4

step_info_01
------------
>> SLURM_JOB_ID = 36
>> SLURM_PROCID = 0
>> SLURM_STEP_ID = 0
>> SlurmID = 36.0
>> SLURM_CPUS_PER_TASK =
>> SLURM_NTASKS = 5
>> SLURM_STEP_NODELIST = n[1-2]
>> SLURM_TASKS_PER_NODE = 4,1

step_info_02
------------
>> SLURM_JOB_ID = 36
>> SLURM_PROCID = 0
>> SLURM_STEP_ID = 1
>> SlurmID = 36.1
>> SLURM_CPUS_PER_TASK =
>> SLURM_NTASKS = 2
>> SLURM_STEP_NODELIST = n2
>> SLURM_TASKS_PER_NODE = 2

I have confirmed that they do not block each other.

I originally missed this detail because your outputs showed a value of "taskset -cp $$" with only a single cpu id present. However, it turns out that this is because you are using "--hint=nomultithread". Without it, you would actually see a range of cpu ids for each task, evenly distributing the node's available cores for that step, making it more obvious that the whole node is used. However, even if taskset does not show it, the first step would still block the second one.

Try it and let me know if you are getting the expected results now. It works in my case, so if there are still discrepancies, we will need to dig deeper.

Best regards,
Ricard.

[1] https://slurm.schedmd.com/srun.html#OPT_exact
Created attachment 38651 [details] 3 steps 2 nodes - works
Created attachment 38652 [details] 3 steps 2 nodes - does not work
Hi Ricard,

I can confirm that your example does work on my end. Moreover, when I modify my example with that "--exact" option, it works as well.

Unfortunately, it is somewhat of a moving target for me, since the examples I come up with are meant to reproduce the behavior of a more complex code of mine I am trying to make work. When I tried your proposed solution with that code of mine, it did not work. The apparent reason was that, unlike the example we have been using, I have more job steps than allocated nodes.

I just attached two more examples entitled "3 steps 2 nodes ...". The first one works just fine on two nodes with 112 cpus each. The second one does not work, though. The only difference between the two examples is the number of cpus I assign to the job steps.

The first example starts all the steps in parallel: the first step running on the first node, the second step running on the second node, and the last step running on both the first and second nodes.

The second example is weird. It starts the first two steps on the first and second node, respectively, but the third step does not start, even after the first two steps are long finished. After ~2 minutes, the last step finally starts on the first node (the node seems long idle by this time). It puts a message into sbatch.err. Additionally, the step ID for the last step is not "2" as it was for the first working example, but actually "3", so by the time this last step is finished, it looks as if the steps with IDs "0", "1" and "3" have finished, but there was no step with step ID "2"!!!

Thank you for your patience!
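P.S. For reference, the failing "3 steps 2 nodes" example boils down to roughly the following (a sketch only - the actual scripts are in the attachment, and the jstep_*.bsh names are just placeholders):

sketch
------
#!/bin/bash
#SBATCH -N 2
#SBATCH --exclusive

# Two nodes with 112 cpus each, three steps:
srun -l -n 98  --exact --hint=nomultithread --distribution=pack jstep_01.bsh &
srun -l -n 100 --exact --hint=nomultithread --distribution=pack jstep_02.bsh &
srun -l -n 20  --exact --hint=nomultithread --distribution=pack jstep_03.bsh &
wait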
Hello,

I have been able to run the second example without issues, but with -n counts adapted to my rig (all nodes are CPUS=8 CoresPerSocket=4 Sockets=1 ThreadsPerCore=2):

-n 98  ---> -n 3
-n 100 ---> -n 4
-n 20  ---> -n 1

I have tried to respect the -n proportions as closely as possible, but since I have low CPU counts (aggravated by the use of nomultithread), the granularity of my example is very coarse. This could be a source of different behaviors, but let us narrow this down as much as possible.

Can you confirm that this second example fails consistently between multiple tries and nodes? Does this only happen with those -n counts, or does this always happen as long as you run three steps that use a single node each? Just so we can determine a common denominator.

I have already checked whether there is something in your slurm.conf that could be limiting the amount of usable CPUs in your nodes (like CoreSpecList or CoreSpecCount) and making that third step wait for the first two, but I have not noticed anything yet that could explain this.

Best regards,
Ricard.
Created attachment 38681 [details] series.txt
Hi Ricard,

I am using the scripts I supplied on 09/03/24 as attachments and running them multiple times on our machine, changing only the "-n" values for the three steps run on two nodes (112 cpus each). The specific results for a series of such runs are provided in the "series.txt" attachment. In those series, I see three typically observed scenarios:

Scenario #1. All three steps run (step IDs = 0, 1, and 2) on the two nodes simultaneously: the first step starts on the first node, the second step on the second node, and the third step starts on the first node or on both nodes, depending on whether it fits into the first node with the first step already running there.

Scenario #2. The first two steps run on the two nodes (each on its own single node, step IDs = 0 and 1) simultaneously. The third step starts (with step ID = 3, i.e., step ID 2 is seemingly skipped!) ~2 min after the first two steps finished. Importantly for this scenario, the total number of available cpus (2x112) is enough to accommodate the three steps, yet the third one does not start until well after the first two steps are finished.

Scenario #3. The first two steps run on the two nodes (single node each) simultaneously; the third step starts (with step ID = 2) on the first node immediately once the first step is finished.

Discussion:

The most seemingly straightforward transition, when I am changing the "-n" numbers, is from scenario #1 to scenario #2. In all the series in series.txt, this transition happens once the number of tasks for the third step (i) becomes larger than the number of cpus available on the node where the first step is running, but (ii) is still lower than the total number of cpus on a node (112). It is almost as if something in Slurm on our machine looks at the third step, sees that it requires fewer tasks than the total number of cpus on a node, and pushes this step onto only the single first node. However, since this first node is already partially occupied, the step cannot be properly started, so it hangs there for ~2 min and then starts with step ID = 3 (step ID = 2 is just skipped!).

This empirical logic also seems to explain why there happens to be a "back transition" from scenario #2 to scenario #1 (series 2, series 3, series 4). Once the number of tasks for the third step becomes larger than the total number of cpus on a node (112), Slurm no longer thinks that this step could fit into a single node, so it allows it to be distributed over two nodes, which then succeeds.

Maybe your machine/configuration is different enough that you will never see such transitions. However, I would still like to point out that the necessary condition for me to observe the #1 -> #2 transition is to have the third step be such that it could run simultaneously with the first two steps provided it was distributed between the two nodes. In your last reply (comment 12), the third step cannot get distributed. Can I propose you try a 3+3+2 or 2+3+3 configuration instead of the 3+4+1 you tried?

I do not really understand the transition from scenario #2 to scenario #3, but perhaps the #1 -> #2 transition is the most important one, and if we are able to solve it, then there is no reason to think about #2 -> #3?
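For concreteness, here is how I read the numbers in the failing 98/100/20 case (assuming 112 usable cpus per node with --hint=nomultithread):

step 0: -n 98,  pack -> fits on node 0            (free: node 0 = 14, node 1 = 112)
step 1: -n 100, pack -> 100 > 14, goes to node 1  (free: node 0 = 14, node 1 = 12)
step 2: -n 20,  pack -> 20 <= 14 + 12, so it could start right away across
                        both nodes, but 20 is also < 112, and no single node
                        has 20 free cpus -> scenario #2 (the ~2 min hang)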
Hello,

Thanks for the thorough case-by-case testing, I think that you are onto something. Following your logic I could replicate scenario #2 consistently even with my low cpu count, even with the same exact 2-minute wait time.

I think that I am not missing anything else parameter/configuration-wise, so I am going to start combing the related source code and check where this situation gets triggered. I will give you an update as soon as I find anything.

Best regards,
Ricard.
Hi Ricard,

Very interesting. Meanwhile, I wanted to take a step back and look at the somewhat bigger picture. The reason I started looking into "--distribution=pack" is that I have several very communication-heavy job steps under a single job allocation, and so I wanted each step to be as node-localized as possible to minimize inter-node communication. Would you say that "--distribution=pack" is, at least potentially, a suggested way to go about it?
Hello,

Sorry for the late reply, last week I had limited availability due to SLUG24.

>> Would you say that "--distribution=pack" is, at least potentially, a suggested way to go about it?

I do not see anything wrong with your reasoning; I would probably go the same route as you. I will be resuming my debugging work for the behavior you found and keep you updated as soon as I get a potential patch for it.

Best regards,
Ricard.
Hello,

Quick status update: I found the underlying issue in the code. You were on the right track. This happens because, when using --distribution=pack, the maximum number of nodes a step can use gets changed to the minimum number of nodes needed to allocate all of its tasks. This is why it happens only when the number of requested tasks (cpus) is lower than the total number of cpus on a node. This was done to ensure that tasks do not get distributed between more nodes than necessary, but the current implementation has this side-effect.

The issue is clearly identified at this point, so I will try to come up with a fix for it.

Best regards,
Ricard.
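P.S. To put it schematically (this is just shorthand for the behavior, not the actual source code):

# with --distribution=pack, a step's node range is collapsed to the
# minimum number of nodes that can hold all of its tasks, roughly:
#   max_nodes = min_nodes = ceil(ntasks / usable_cpus_per_node)
#
# for the third step in your example: ceil(20 / 112) = 1, so it may only
# use one node, no matter how many cpus are free elsewhere in the allocation.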
Hi, Ricard, Very interesting. Thanks for doing this! Kirill
Hello Kirill,

Just as a status update, this is still pending a fix, but I have internally discussed possible ways to address this situation without producing a regression. For the moment, we are checking how to implement it.

Best regards,
Ricard.
Hi Ricard,

This sounds awesome, I am looking forward to playing with it. Thanks!!!

Kirill
Hello Kirill,

Quick status update: Sorry about the delay. I spent some time on this and I initially tried to go for a naive approach to solve this specific case, but I found out that it would create a regression for other use cases. I am planning on trying another idea, but I need to invest more cycles on my part to sort out how to implement it at the step/task level.

Best regards,
Ricard.
Hi Ricard, That is no problem. This issue is not currently holding me back in any of my immediate projects, but I am still obviously interested in it and would be happy to play with whatever fix you come up with. Thanks! Kirill
Status update: I have had to prioritize other tickets these days, but I am letting you know that this is still on my radar. Thank you for your patience; if this becomes a more pressing issue for your workload, let me know so I can rearrange some priorities. Best regards, Ricard.
Hello,

I'm taking over this ticket for Ricard. I think we can split this ticket into 3 questions:

1. How to pack multiple job steps into nodes, e.g. with two 4-core nodes and three steps with 3, 3, and 2 tasks, how do we get them all to run at the same time with one step running across both nodes?

For this, we should use the `--distribution` flag[1], so you were on the right track with --distribution=pack, but after doing some digging I think pack is too rigid for what you're looking for, and there unfortunately isn't a simple solution. Pack[2] will force the step to use the minimal number of nodes required, so a step that could fit on one node won't run across two. For example, I can reliably reproduce Scenario #2 (from Comment 14) with the setup I described above and --distribution=pack. However, if I submit only the 3-task steps with --distribution=pack and then the 2-task step without it, they will all run as desired. The inverse works as well (pack on the 2-task step only), resulting in 2 tasks on the first node and 1 on the second for both of the 3-task steps, and the entire 2-task step on the second node.

I think your best bet is to look at your job steps and mix and match distribution options to lay them out as you like. Another option is to use --distribution=arbitrary and specify which nodes each task should run on manually; an example helper script for this can be found here[4]. It's unfortunate that, depending on how your job works, these options could be considerably more difficult than simply setting --distribution=pack, but the "--distribution=MostlyPack" option we're looking for does not currently exist.

2. In Scenario 2, where does the 2 minute scheduling pause come from?

3. Why is there no job step 2?

I've done some initial investigation on questions 2 and 3 and I'm pretty sure they are connected; this is supported by your analysis, where both scenarios #1 and #3 have the correct job step ordering and no 2 minute wait, while #2 has both. I'm still trying to figure out the cause of this, but I think it's pretty clearly a bug. I'm going to keep digging into this and will keep you updated on my progress.

--
Will

[1]: https://slurm.schedmd.com/srun.html#OPT_distribution
[2]: https://slurm.schedmd.com/srun.html#OPT_Pack
[3]: https://slurm.schedmd.com/srun.html#OPT_block
[4]: https://slurm.schedmd.com/faq.html#arbitrary
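P.S. For illustration, the mix-and-match layout from point 1 looks roughly like this in job script form on my 4-core test nodes (assuming the same --exact and --hint=nomultithread flags as the earlier examples; script names are just placeholders):

sketch
------
#!/bin/bash
#SBATCH -N 2
#SBATCH --exclusive

# pack the two 3-task steps so each stays on a single node...
srun -l -n 3 --exact --hint=nomultithread --distribution=pack step_a.bsh &
srun -l -n 3 --exact --hint=nomultithread --distribution=pack step_b.bsh &

# ...but leave the 2-task step with the default distribution so it can use
# whatever cpus remain, on either or both nodes
srun -l -n 2 --exact --hint=nomultithread step_c.bsh &

wait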
Hello, Does my explanation for question #1 make sense? -- Will
Hello Will,

Sorry, I have been on an extended travel (still not fully back yet), and it was very inconvenient to reply.

Your explanation of question #1 makes perfect sense. Moreover, I totally agree with how you split the problem into parts, as to me it is clearly a two-part thing: (a) I would like to have a more flexible distribution option (a specific requirement/desire in my case is to minimize inter-node communication for each job step), and (b) there seems to be a bug with that 2-minute wait.

The "--distribution=arbitrary" might be what I need for (a). I have a certain python module that communicates with slurm, starting job steps whenever it sees that some previous job steps are done, keeping the nodes allocated by sbatch fully loaded. I might decide in the future to do everything "manually", i.e., start all the srun's with 'arbitrary' as the distribution option and always decide on the python level how to distribute a job step between the nodes. That should work, I think.

The one critical piece I am missing for this, and I might as well ask you about it here, is: is there a slurm command to get the distribution between the nodes for a given job step? Something like this: I provide a job step ID, e.g., 1234567.23, and it returns how many tasks this step uses on each node from the allocation. That would be a critical input in deciding how to form a distribution for a new step to be started.

Obviously, if I start everything with the arbitrary distribution, I can just keep track, on the python level, of how each job step is distributed. Or, as I do it right now, when I start a job step, I wrap the actual call to a simulation code inside a bash script that has the following line:

echo "task $SLURM_PROCID (node $SLURM_NODEID): `taskset -cp $$`" >> distribution

and that is what actually gets started by srun - it gives me a node ID for each task. However, for various reasons it is not super convenient. I would prefer to have a way to check, right now, in the moment, which nodes a job step is running on and how many CPUs are used by this step on those nodes. Is there a way to do that? I tried to find whether it is possible with sacct, for example, but did not find it.
Hello,

(In reply to kirill from comment #36)
> Hello Will,
>
> Sorry, I have been on an extended travel (still not fully back yet), and it
> was very inconvenient to reply.

No worries at all. I wanted to confirm whether this ticket was still of interest to you or not, but now that I've heard back I know it is :)

> Your explanation of question #1 makes
> perfect sense. Moreover, I totally agree with how you split the problem into
> parts as to me it is clearly a two part thing: (a) I would like to have a
> more flexible distribution option (a specific requirement/desire in my case
> is to minimize inter-node communication for each job step),

Being pedantic, the type of job packing we have been discussing would potentially increase inter-node communication inside of job steps, which is more or less why this option hasn't been added to Slurm. i.e. with --distribution=pack slurm opts to wait until it can satisfy a step's resource requirements with the minimum number of nodes to run the step; this ensures minimal inter-node communication inside of a step, but can cause a step to be stuck waiting in the queue when there are technically enough open resources in the allocation, just spread out across the allocation. Of course, if one would prefer to spread their step out, there are --distribution options for that, but packing a step, unless it could be run sooner if not packed, isn't a common request, since this could cause significant variability in the communication overhead.

> and (b) there seems to be a bug with that 2-minute wait.
>
> The "--distribution=arbitrary" might be what I need for (a). I have a
> certain python module that communicates with slurm starting job steps
> whenever it sees that some previous job steps are done, keeping the
> allocated (by sbatch) nodes fully loaded - I might decide in the future to
> do everything "manually", i.e., start all the srun's with 'arbitrary' as an
> option there and always decide how to distribute a job step between the
> nodes on the python level. That should work, I think. The one critical piece
> I am missing for this, and I might as well ask you about it here, is: is
> there a slurm command to get the distribution between the nodes for a given
> job step? Something like this: I am providing a job step ID, e.g.,
> 1234567.23, and it returns how many tasks this step uses on each node from
> the allocation? That would be a critical input in deciding how to form a
> distribution for a new step to be started. Obviously, if I start everything
> with arbitrary distribution, I can just keep track, on the python level, of
> how each job step is distributed. Or, how I do it right now, when I start a
> job step, I wrap an actual call to a simulation code inside a bash script
> that has the following line:
> echo "task $SLURM_PROCID (node $SLURM_NODEID): `taskset -cp $$`" >>
> distribution
> that is actually started by srun - it gives me a node ID for each task.
> However, for various reasons it is not super convenient. I would prefer to
> have a way to check now, in the moment, which nodes a job step is running at
> and how many CPUs are used by this step on those nodes. Is there a way to
> do that? I think I tried to find whether it is possible with sacct for
> example, but did not find it.

Unfortunately, I can't find a good way to do this either. With the correct logging enabled this information does end up in the slurmctld.log, but I don't think trying to parse this out live is a good idea, as contention on the log file can be a serious bottleneck for the controller.

I will keep working on the 2 minute wait bug, but I think your best bet for laying out job steps like this is --distribution=arbitrary.

--
Will
Hello,

After quite a bit of digging I have identified the cause of the 2 minute pause and how it relates to --hint=nomultithread.

When the stepmgr (in the slurmctld) is calculating where it can put a job step, it counts the "usable cpus" on the nodes in the allocation. It is incorrectly counting all the cpus on idle cores, not just 1 per core (with --hint=nomultithread only 1 cpu per core is actually usable). If it identifies enough cpus, it then starts towards launching the step, which fails rather quickly, as there aren't actually enough cpus available given the constraints. This causes the srun command to never get an acknowledgement that the step was started, so after SlurmctldTimeout (2 minutes by default) it tries again.

I have a patch going through review right now that will handle this better and stop the 2 minute wait from happening.

--
Will
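P.S. To make the miscount concrete, on a node like the ones in my earlier example (4 cores, 2 threads/core) it goes roughly like this (an illustration of the description above, not the patched code):

# node: 4 cores x 2 threads = 8 cpus, step uses --hint=nomultithread
# suppose 3 of the 4 cores are already busy with another step:
#
#   actually usable for the new step: 1 cpu  (1 idle core x 1 thread)
#   what the stepmgr was counting:    2 cpus (1 idle core x 2 threads)
#
# so a 2-task step looks like it fits, the launch is attempted and fails,
# srun never gets the acknowledgement, and it retries only after
# SlurmctldTimeout (2 minutes by default).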
Awesome, Will, thanks a lot for looking into this! Please let me know whenever this new feature propagates into a slurm release. I will then see what can be done to update our HPC machines with it, here at LANL. Again, thanks a lot!!!
Hello,

It took a while for this to get through review, but I am happy to report it is now merged into master with commit ce159cb364, and an update to the testsuite in commit 4043bf4034.

While the change itself is small, it is deep in the step launching logic; considering this and the severity of the bug, we felt it was best not to cherry-pick it into any of our current stable branches, so it will first appear in the 26.05.0 release.

I will resolve this ticket as fixed, but please feel free to reopen it if you have any further questions.

--
Will