Summary: | Abnormally large agent queue size causing wedging | ||
---|---|---|---|
Product: | Slurm | Reporter: | Paul Edmon <pedmon> |
Component: | Scheduling | Assignee: | Director of Support <support> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | 4 - Minor Issue | ||
Priority: | --- | ||
Version: | 24.05.4 | ||
Hardware: | Linux | ||
OS: | Linux | ||
Site: | Harvard University | ||
CLE Version: | Version Fixed: | None | |
Attachments: |
Graph of agent queue and server threads
Current slurm.conf
Current topology.conf |
Description
Paul Edmon
2025-01-17 19:30:53 MST
Created attachment 40442 [details]
Graph of agent queue and server threads
Created attachment 40443 [details]
Current slurm.conf
Created attachment 40444 [details]
Current topology.conf
Looks like lowering max_rpc_cnt (a SchedulerParameters option in slurm.conf) to 32 did the trick. Things have been much more stable since. I'm resolving this ticket.
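
For anyone landing on this ticket later, here is a minimal sketch of what that change might look like in slurm.conf. The attached slurm.conf is not reproduced in this ticket text, so the line below is an assumption: it shows only the one option, and any other SchedulerParameters values the site already uses would need to stay in the same comma-separated list. max_rpc_cnt tells slurmctld to defer job scheduling once the number of active server threads reaches the given value.

```
# slurm.conf (fragment, illustrative only)
# Assumption: merge max_rpc_cnt=32 into the existing SchedulerParameters
# list; do not add a second SchedulerParameters line.
SchedulerParameters=max_rpc_cnt=32
```

After editing the file, `scontrol reconfigure` should pick up the new value, and `sdiag` reports both "Server thread count" and "Agent queue size", which are the two quantities graphed in the first attachment.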