Summary: | Abnormally large agent queue size causing wedging | ||
---|---|---|---|
Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
Component: | Scheduling | Assignee: | Director of Support <support> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | ||
Version: | 24.05.4 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Harvard University | Alineos Sites: | --- |
Atos/Eviden Sites: | --- | Confidential Site: | --- |
Coreweave sites: | --- | Cray Sites: | --- |
DS9 clusters: | --- | Google sites: | --- |
HPCnow Sites: | --- | HPE Sites: | --- |
IBM Sites: | --- | NOAA SIte: | --- |
NoveTech Sites: | --- | Nvidia HWinf-CS Sites: | --- |
OCF Sites: | --- | Recursion Pharma Sites: | --- |
SFW Sites: | --- | SNIC sites: | --- |
Tzag Elita Sites: | --- | Linux Distro: | --- |
Machine Name: | CLE Version: | ||
Version Fixed: | None | Target Release: | --- |
DevPrio: | --- | Emory-Cloud Sites: | --- |
Attachments: | Graph of agent queue and server threads; Current slurm.conf; Current topology.conf | ||
Description
Paul Edmon
2025-01-17 19:30:53 MST
Created attachment 40442: Graph of agent queue and server threads
Created attachment 40443: Current slurm.conf
Created attachment 40444: Current topology.conf
Looks like lowering the max_rpc_cnt to 32 did the trick. Things are much more stable. I'm resolving this ticket.
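For anyone landing here with the same symptom, the change would look roughly like the fragment below. This is a sketch, not the site's actual configuration: it assumes max_rpc_cnt is set through SchedulerParameters in slurm.conf, and it omits any other options already on that line in the attached slurm.conf, which would need to stay in the same comma-separated list.

```
# slurm.conf (fragment; other SchedulerParameters options at this site are not shown)
# Once the number of active slurmctld threads reaches this value, scheduling is
# deferred so that pending RPCs can be serviced first.
SchedulerParameters=max_rpc_cnt=32
```

After editing the file, `scontrol reconfigure` should make slurmctld pick up the new value. The agent queue size and server thread count shown in the attached graph are also reported by `sdiag`, so the effect can be checked there.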