| Summary: | NERSC / SchedMD live ziatest debugging session | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | Other | Assignee: | Danny Auble <da> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | ||
| Version: | 16.05.x | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Slinky Site: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Attachments: |
mega slurmd logs (400MB uncompressed)
typescripts |
||
Created attachment 2872
typescripts
Doug, will there be more time available for testing in the near future?

Hi Danny,
Unfortunately the maintenance ran long yesterday and they chose to skip the ziatest testing. However, NERSC management was encouraged by the results shown from Edison, which was the harder test anyway. I don't have any maintenances upcoming that we could take advantage of, but if something comes up, I'll let you know. Do you have anything in particular you wanted to try?
-Doug

We have improved some other aspects of the performance, and I also want to test performance with message aggregation. I suspect the long wait after a job completes is caused by a packet storm that can be avoided with message aggregation (the whole reason it was done ;)). You can probably test the message aggregation now, since that part hasn't changed. I just want to see whether a job finishes faster when wrapped with time than otherwise.

Doug, I checked on the Crystal system (around 1000 nodes), and message aggregation did seem to help the swamped slurmctld but didn't really affect the speed of ziatest one way or the other. I had the settings at WindowMsgs=100,WindowTime=10; I didn't mess with them much, though. In any case, I am going to close this based on the performance improvements we have seen. Please reopen if you feel otherwise.
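
For readers who want to try the same settings, the values quoted above would typically be set in slurm.conf roughly as shown below. This is only a sketch: it assumes the `MsgAggregationParams` option as documented in the slurm.conf man page for Slurm releases of this era, and the comment thread itself does not show the exact line that was used. Availability of the option depends on the release, so check the man page for the installed version before relying on it.

```
# slurm.conf fragment (sketch; values taken from the comment above)
# WindowMsgs: maximum number of messages collected before they are forwarded
# WindowTime: maximum time, in milliseconds, to wait before forwarding
MsgAggregationParams=WindowMsgs=100,WindowTime=10
```

The daemons would need to pick up the changed configuration before the setting takes effect.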
Created attachment 2871
mega slurmd logs (400MB uncompressed)

Hello,
I'm using this interface to upload some of the data from the live debugging session this evening.
-Doug