Ticket 2562

Summary: NERSC / SchedMD live ziatest debugging session
Product: Slurm
Reporter: Doug Jacobsen <dmjacobsen>
Component: Other
Assignee: Danny Auble <da>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue    
Priority: ---    
Version: 16.05.x   
Hardware: Cray XC   
OS: Linux   
Site: NERSC
Attachments: mega slurmd logs (400MB uncompressed)
typscripts

Description Doug Jacobsen 2016-03-16 17:05:14 MDT
Created attachment 2871
mega slurmd logs (400MB uncompressed)

Hello,

I'm using this interface to upload some of the data from the live debugging session this evening.

-Doug
Comment 1 Doug Jacobsen 2016-03-16 17:38:33 MDT
Created attachment 2872
typscripts
Comment 5 Danny Auble 2016-03-22 08:12:11 MDT
Doug, will there be more time available for testing in the near future?
Comment 6 Doug Jacobsen 2016-03-23 06:20:59 MDT
Hi Danny,

Unfortunately, the maintenance ran long yesterday and they chose to skip the ziatest testing.  However, NERSC management was encouraged by the results shown from Edison, which was the harder test anyway.

I don't have any upcoming maintenance windows that we could take advantage of, but if something comes up, I'll let you know.

Do you have anything in particular you wanted to try?

-Doug
Comment 7 Danny Auble 2016-03-23 06:24:08 MDT
We have improved some other aspects of the performance, and I also want to test performance with message aggregation.  I suspect the long wait after a job completes is caused by a packet storm, which message aggregation can avoid (the whole reason it was done ;)).

You can probably test the message aggregation now since that part hasn't changed.  I just want to see whether a job wrapped with time finishes faster with message aggregation than without.
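
One way to make the "wrapped with time" comparison concrete is to measure wall-clock time around the whole launch command, so that any wait after the job's own work ends is included in the number. The following is a minimal sketch, not something taken from the debugging session: `time_command` is a hypothetical helper, and the command to time is left as a placeholder for whatever launch line is actually used.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion; return (wall-clock seconds, exit code)."""
    start = time.perf_counter()
    completed = subprocess.run(argv, check=False)
    elapsed = time.perf_counter() - start
    return elapsed, completed.returncode


# Example call with a trivial stand-in command. Replace the list with the
# real job launch line to compare runs with and without message aggregation.
elapsed, rc = time_command([sys.executable, "-c", "pass"])
print(f"exit={rc} wall={elapsed:.3f}s")
```

Running the same launch line several times under each configuration and comparing the reported wall times would show whether the post-completion wait changes.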
Comment 8 Danny Auble 2016-03-29 05:24:15 MDT
Doug, I checked on the Crystal system (around 1000 nodes), and message aggregation did seem to help the swamped slurmctld but didn't really affect the speed of ziatest one way or the other.  I had the settings set to:

WindowMsgs=100,WindowTime=10

I didn't mess with them much though.
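
To illustrate what a message-count limit and a time limit like the two settings above control, here is a toy model of a batching window. It is only a sketch under stated assumptions and is not Slurm's implementation: `MessageWindow` and `count_sends` are hypothetical names, and the time limit is assumed to be in milliseconds. A batch is sent when the queue reaches the message limit or when the window has been open for the time limit, whichever comes first.

```python
class MessageWindow:
    """Toy batching window (illustrative only, not Slurm code)."""

    def __init__(self, window_msgs, window_time_ms):
        self.window_msgs = window_msgs
        self.window_time_ms = window_time_ms
        self.pending = []
        self.opened_at_ms = None

    def _flush(self):
        batch, self.pending = self.pending, []
        self.opened_at_ms = None
        return batch

    def poll(self, now_ms):
        """Return the pending batch if its time window has expired."""
        if self.pending and now_ms - self.opened_at_ms >= self.window_time_ms:
            return self._flush()
        return None

    def add(self, msg, now_ms):
        """Queue msg and return the list of batches that must be sent now."""
        batches = []
        expired = self.poll(now_ms)
        if expired is not None:
            batches.append(expired)
        if not self.pending:
            self.opened_at_ms = now_ms  # first message opens a new window
        self.pending.append(msg)
        if len(self.pending) >= self.window_msgs:
            batches.append(self._flush())
        return batches


def count_sends(arrival_times_ms, window_msgs, window_time_ms):
    """Count how many batches go out for the given arrival times."""
    window = MessageWindow(window_msgs, window_time_ms)
    sends = 0
    for i, t in enumerate(arrival_times_ms):
        sends += len(window.add(i, t))
    if window.pending:
        sends += 1  # the remainder goes out when its window expires
    return sends


# 1000 messages arriving at once: one send per message without batching,
# versus one send per 100 messages with a 100-message window.
print(count_sends([0] * 1000, 1, 10))
print(count_sends([0] * 1000, 100, 10))
```

In this model a burst of simultaneous messages, such as many nodes reporting at the end of a job, collapses into far fewer sends, which is the kind of relief for a swamped receiver described above.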

In any case, I am going to close this based on the performance improvements we have seen.

Please reopen if you feel otherwise.