| Summary: | show job's communication tree (for resilience evaluation and testing) | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | S Senator <sts> |
| Component: | Other | Assignee: | Unassigned Developer <dev-unassigned> |
| Status: | OPEN --- | QA Contact: | |
| Severity: | 5 - Enhancement | ||
| Priority: | --- | CC: | bart, bsantos, fullop, kstroup, lena, mej |
| Version: | 17.11.9 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5285 | ||
| Site: | LANL | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | gadget, trinitite, trinity |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
Description
S Senator
2018-08-15 15:35:51 MDT

> We are trying to define a set of processes to test and automate verification
> of the slurmsmwd features as implemented in ticket bug#5285. Is there a
> means to collect the communications tree for a given job? The purpose of
> this request is to satisfy a goal regarding resilient codes and regression
> tests.

The topology plugin is, in the absence of RoutePlugin=route/topology, not related to the communication pattern used by slurmctld / slurmd.

If you'd like such a diagnostic capability, that's something that we can look at addressing under a development contract, although I'd need some time to see if this is desirable from our side, and write up an appropriate SoW.

> Our plan is to systematically disable specific nodes in the communication
> tree from leaf nodes up to root nodes and verify that the slurmsmwd
> correctly detects that the node is unavailable and communicates this state
> to the slurmctld. A minimal test would disable a random node within the job
> allocation, avoiding leaf nodes in the communication tree. Does such a test
> already exist in a (private?) branch of the slurm verification test suite.

No.
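For illustration only, the node-selection part of the test plan quoted above could be sketched as follows. This is a simplified model, not Slurm's actual message-forwarding logic: the heap-style k-ary layout, the `width` parameter, the function names, and the node names are all assumptions made for the example. Slurm does not currently expose a job's real communication tree, which is the point of this request.

```python
import random


def build_tree(nodes, width):
    """Map each node to its children in a heap-style k-ary tree.

    ASSUMPTION: node i's parent is node (i - 1) // width. This layout is
    hypothetical and is not taken from the Slurm source.
    """
    children = {n: [] for n in nodes}
    for i in range(1, len(nodes)):
        parent = nodes[(i - 1) // width]
        children[parent].append(nodes[i])
    return children


def leaf_to_root_order(nodes):
    """Order nodes from the deepest level up to the root.

    In the heap-style layout, list order is breadth-first order, so
    reversing the list visits deeper levels before shallower ones.
    """
    return list(reversed(nodes))


def pick_interior_node(children, rng=random):
    """Pick a random node that has at least one child (a non-leaf)."""
    candidates = [n for n, kids in children.items() if kids]
    if not candidates:
        raise ValueError("tree has no interior nodes")
    return rng.choice(candidates)


if __name__ == "__main__":
    # Hypothetical 7-node allocation with a fan-out of 2.
    allocation = [f"nid{i:05d}" for i in range(1, 8)]
    tree = build_tree(allocation, width=2)
    print("disable order:", leaf_to_root_order(allocation))
    print("minimal test target:", pick_interior_node(tree))
```

A test harness would then mark the chosen node as unavailable by whatever means the site uses and check that slurmctld records the state change. That step is left out here because it depends on the site's tooling.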