Description
Ryan Novosielski
2019-07-24 10:25:36 MDT
Created attachment 10998 [details]
SLURM topology.conf file for amarel
Running 17.11.7, BTW. Planning to upgrade to 18.08.x in the near future.

(A) is correct - this is just a warning/recommendation, and does not silently change your route plugin. This warning is just telling you that jobs won't be able to use nodes where no switch connects them. If you try to run a job that requests nodes across disjointed switches, the client (sbatch/srun/salloc) gets a warning:

(For sbatch)
sbatch: error: Batch job submission failed: Requested node configuration is not available

(For srun)
srun: error: Unable to allocate resources: Requested node configuration is not available

(For salloc)
salloc: error: Job submit/allocate failed: Requested node configuration is not available
salloc: Job allocation 10243 has been revoked.

Since you're doing this intentionally, don't worry about the warning. If your users are confused about the "Requested node configuration is not available" message, make sure they check the number of nodes or nodelist they requested and make sure it's in line with your topology. Another thing you could do is make sure a single partition doesn't have nodes that are in different fabrics and therefore can never be part of the same job.

There's no problem with your topology configuration. I don't have any actual data for how common it is, but the impression at SchedMD is that it is common. I'll look into softening the language in that log message and fix the spacing.

It's a thought. It would require the users knowing about/submitting to two different partitions in a job. We also use federation in the partition in question that will allow jobs to federate to other clusters, so this would seem to be the better path.

(In reply to Ryan Novosielski from comment #7)
> It's a thought. It would require the users knowing about/submitting to two
> different partitions in a job. We also use federation in the partition in
> question that will allow jobs to federate to other clusters, so this would
> seem to be the better path.
Whatever works best for you. If you have questions about how to handle the multi-cluster environment, feel free to submit a ticket about that. I've submitted a patch to be reviewed that softens the language in the log message.

Thanks -- is it possible to have a look at it? Would be happy to mention whether or not the rewrite would have helped me. Thanks again.

(In reply to Ryan Novosielski from comment #10)
> Thanks -- is it possible to have a look at it? Would be happy to mention
> whether or not the rewrite would have helped me.

I'm further investigating why that specific log message was put there in the first place and why it indicated that route/topology shouldn't be used. So, I'm still iterating on the log message. I'll let you know when I've got something more solid and see what you think.

Thanks. Would be interested to know as well.

Hi Ryan,

Here's my current conclusion: This warning has been in there since the beginning of the route plugin. It was added in commit 0dde0a71c10 by bull:

commit 0dde0a71c10c67ca815b2e70504ce463a9a7b95b
Author: Rod Schultz <Rod.Schultz@bull.com>
Date: Fri Jul 11 10:02:22 2014 -0700

    Initial addition of the routing plugins

I can't think of how route/topology could be bad with a disjointed switch topology. It might be better to use route/topology instead of route/none with a disjointed topology (though I'm not sure). Anyway, I suspect that bull threw this error/info-warning in there because they thought it might be important, not because of any sure-fire testing that they did.

Here's the language I've been toying around with (still subject to change, in particular potentially removing the word "warning"):

info("TOPOLOGY: warning -- no switch can reach all nodes through its descendants. If this is not intentional, fix the topology.conf file.");

What do you think about that sentence? Would it have helped you more than the previous warning? Suggestions are welcome.
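For readers who want to see concretely what "no switch can reach all nodes through its descendants" means, here is a minimal Python sketch of that check. It is an illustration only, not Slurm's implementation: the in-memory dictionary form of the topology, the function names `reachable_nodes` and `covering_switches`, and the switch and node names are all hypothetical, and it does not parse topology.conf or Slurm hostlist expressions.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical in-memory form of a topology (NOT a topology.conf parser):
# switch name -> (child switch names, directly attached node names)
Topology = Dict[str, Tuple[List[str], List[str]]]


def reachable_nodes(topology: Topology, switch: str,
                    seen: Optional[Set[str]] = None) -> Set[str]:
    """Return every node reachable from `switch` through its descendants."""
    if seen is None:
        seen = set()
    if switch in seen:
        # Already visited in this walk; guards against accidental cycles.
        return set()
    seen.add(switch)
    # Unknown child switches are treated as empty rather than raising.
    children, nodes = topology.get(switch, ([], []))
    found = set(nodes)
    for child in children:
        found |= reachable_nodes(topology, child, seen)
    return found


def covering_switches(topology: Topology) -> List[str]:
    """Return the switches whose descendants reach every node in the topology."""
    all_nodes: Set[str] = set()
    for _children, nodes in topology.values():
        all_nodes.update(nodes)
    return sorted(
        name for name in topology
        if reachable_nodes(topology, name) == all_nodes
    )


if __name__ == "__main__":
    # Two fabrics with no common parent switch: the disjointed case.
    disjoint: Topology = {
        "ib1": ([], ["n1", "n2"]),
        "ib2": ([], ["n3", "n4"]),
    }
    print(covering_switches(disjoint))  # prints []

    # Adding a parent switch above both fabrics removes the condition.
    joined: Topology = dict(disjoint, top=(["ib1", "ib2"], []))
    print(covering_switches(joined))  # prints ['top']
```

An empty result corresponds to the situation the log message describes. As discussed in the thread, that can be intentional when the fabrics are genuinely separate.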
I will be out of the office from Tuesday, August 13th to Friday, August 16th, returning Monday, August 19th. If this is an urgent message, please contact help@oarc.rutgers.edu (if you have not already done so) for a more immediate response. Thank you for your patience.

Sorry for the delay. Will talk it over with the coworker that raised this as a concern to me and see if there are any more suggestions.

That's a better message than the existing one. I probably would have suggested something along the lines of:
info("TOPOLOGY: warning -- no switch can reach all nodes through its descendants. Check topology.conf for errors.");
...if I'd not read the original, but yours is good too.
Sounds good. Let us know if there are any concerns for the error message; if you or your coworker would like it changed, we can do that. Otherwise, if it's good with you, we'll commit what I've given you.

He had an interesting suggestion, I thought, which was a link to the documentation which could be more verbose. Of course, that would require whatever it was not to move around/to be a virtual URL. Solaris did something like this, if you see the below:

https://docs.oracle.com/cd/E36784_01/html/E48546/fmasvcs.html

They give you a short message and then for more information, see such and such. Maybe more work than this warrants, but a thought. He didn't have any particular comments on the suggested change, and I do agree that it's better than the existing.

That's an interesting idea. I proposed it internally, and we decided that the effort isn't worth it, especially with the risk of dead URLs like you pointed out. Our goal is to try to make source comments/logs stand alone and give enough information for someone to know what to search for. I'll let you know when we've committed the change.

We've pushed the fix in commit cad50250c58e. It will be in the next 19.05 tag (19.05.3). Thanks for the report and discussion. Let us know if you have any more concerns or questions. I'm closing this as resolved/fixed.