Ticket 5569 - show job's communication tree (for resilience evaluation and testing)
Summary: show job's communication tree (for resilience evaluation and testing)
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 17.11.9
Hardware: Linux Linux
Severity: 5 - Enhancement
Assignee: Unassigned Developer
Reported: 2018-08-15 15:35 MDT by S Senator
Modified: 2018-08-16 09:34 MDT
Site: LANL
Machine Name: gadget, trinitite, trinity


Description S Senator 2018-08-15 15:35:51 MDT
Please provide the equivalent of:
  $ scontrol show topology

but for a specific job:
  $ scontrol show topology -j <jobid>

We are trying to define a set of processes to test and automate verification of the slurmsmwd features as implemented in ticket bug#5285. Is there a means to collect the communications tree for a given job? The purpose of this request is to satisfy a goal regarding resilient codes and regression tests.

Our plan is to systematically disable specific nodes in the communication tree from leaf nodes up to root nodes and verify that the slurmsmwd correctly detects that the node is unavailable and communicates this state to the slurmctld. A minimal test would disable a random node within the job allocation, avoiding leaf nodes in the communication tree. Does such a test already exist in a (private?) branch of the slurm verification test suite?

Thank you.
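
For illustration only, the test plan above could be sketched as follows. The tree contents, the node names, and both helper functions are hypothetical; Slurm does not currently expose a job's communication tree, so a real test would need some other way to obtain it.

```python
import random

# Hypothetical communication tree for one job: parent -> children.
# Node names are invented for this sketch.
TREE = {
    "nid0001": ["nid0002", "nid0003"],
    "nid0002": ["nid0004", "nid0005"],
    "nid0003": ["nid0006"],
}


def leaf_to_root_order(tree, root):
    """Post-order walk: each node is listed after all of its descendants."""
    order = []

    def visit(node):
        for child in tree.get(node, []):
            visit(child)
        order.append(node)

    visit(root)
    return order


def pick_interior_node(tree, rng=random):
    """Pick a random node that has at least one child (i.e. not a leaf)."""
    interior = sorted(node for node, children in tree.items() if children)
    return rng.choice(interior)


if __name__ == "__main__":
    # Full sweep: disable nodes one at a time in this order.
    print(leaf_to_root_order(TREE, "nid0001"))
    # Minimal test: disable a single random non-leaf node.
    print(pick_interior_node(TREE))
```

The post-order walk is one reading of "from leaf nodes up to root nodes"; ordering strictly by depth would be an equally valid choice.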
Comment 1 Tim Wickberg 2018-08-15 16:04:03 MDT
> We are trying to define a set of processes to test and automate verification
> of the slurmsmwd features as implemented in ticket bug#5285. Is there a
> means to collect the communications tree for a given job? The purpose of
> this request is to satisfy a goal regarding resilient codes and regression
> tests.

Unless RoutePlugin=route/topology is set, the topology plugin is not related to the communication pattern used by slurmctld / slurmd.
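
As a rough mental model of a fan-out pattern that is independent of the network topology, the sketch below splits a node list into at most `tree_width` chunks and lets the first node of each chunk forward to the rest. This is a simplified assumption for illustration, not Slurm's actual forwarding code: the `fanout_tree` function, the even chunk split, and the node names are all hypothetical, and Slurm's own span computation may differ.

```python
def fanout_tree(nodes, tree_width):
    """Build {parent: [children]} for a simplified message fan-out.

    Assumption: each sender splits its host list into at most
    tree_width roughly equal chunks; the first host of each chunk
    receives the message and forwards it to the rest of that chunk.
    """
    tree = {}

    def split(parent, hosts):
        if not hosts:
            return
        n_chunks = min(tree_width, len(hosts))
        base, extra = divmod(len(hosts), n_chunks)
        start = 0
        for i in range(n_chunks):
            size = base + (1 if i < extra else 0)
            chunk = hosts[start:start + size]
            start += size
            head, rest = chunk[0], chunk[1:]
            tree.setdefault(parent, []).append(head)
            split(head, rest)  # head forwards to the remainder of its chunk

    split("slurmctld", list(nodes))
    return tree


if __name__ == "__main__":
    hosts = [f"n{i}" for i in range(1, 8)]
    for parent, children in fanout_tree(hosts, tree_width=2).items():
        print(parent, "->", children)
```

In this model the tree depends only on the order of the node list and the fan-out width, which is why switch-level topology output would not describe it.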

If you'd like such a diagnostic capability, we can look at addressing it under a development contract, although I'd need some time to determine whether it is desirable from our side and to write up an appropriate statement of work (SoW).

> Our plan is to systematically disable specific nodes in the communication
> tree from leaf nodes up to root nodes and verify that the slurmsmwd
> correctly detects that the node is unavailable and communicates this state
> to the slurmctld. A minimal test would disable a random node within the job
> allocation, avoiding leaf nodes in the communication tree. Does such a test
> already exist in a (private?) branch of the slurm verification test suite?

No.