| Summary: | *_nodes_alloc metrics report inflated counts | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Charlie Getzen <charliegetzen> |
| Component: | slurmctld | Assignee: | Michael Steed <msteed> |
| Status: | OPEN --- | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | felip.moll, msteed |
| Version: | 26.11.x | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| See Also: | https://support.schedmd.com/show_bug.cgi?id=24999 | ||
| Site: | -Other- | Slinky Site: | --- |
| Attachments: | node_bitmap solution | ||
Hi Charlie, Thanks for reporting. We are aware of this issue and a new fix is being implemented. We'll let you know once it is done. Thanks!
Created attachment 44909: node_bitmap solution

## Problem

`slurm_user_jobs_nodes_alloc` and `slurm_jobs_nodes_alloc` report inflated node counts when a user runs multiple sub-node-sized jobs that share physical nodes.

## Root cause

`nodes_alloc` is computed by summing `total_nodes` across each individual running job for the user. When multiple jobs fit on the same physical node (e.g. jobs requesting only a fraction of a node's CPUs), each job contributes its own `total_nodes = 1` to the sum, even though they all share the same node.

## Solution

Track node allocation using a bitmap: OR each job's `node_bitmap` into a running accumulator. After all jobs have been processed, count the set bits to get the actual number of unique nodes in use.
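
The difference between the two counting strategies described above can be modelled in a few lines. This is only an illustrative sketch in Python, not the attached patch and not Slurm's C code: the `Job` class, the function names, and the use of a plain integer as a bitmap are all assumptions made for the example. Only the field names `total_nodes` and `node_bitmap` come from the description.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for a running job record (not Slurm's struct)."""
    user: str
    total_nodes: int  # node count as seen by this job alone
    node_bitmap: int  # bit i set => the job runs on node index i


def nodes_alloc_by_sum(jobs):
    """Counting as described under 'Root cause': sum each job's total_nodes."""
    return sum(job.total_nodes for job in jobs)


def nodes_alloc_by_bitmap(jobs):
    """Counting as described under 'Solution': OR the bitmaps, count set bits."""
    combined = 0
    for job in jobs:
        combined |= job.node_bitmap
    return bin(combined).count("1")


# Three sub-node jobs sharing node 0, plus one job spanning nodes 1 and 2.
jobs = [
    Job("alice", 1, 0b001),
    Job("alice", 1, 0b001),
    Job("alice", 1, 0b001),
    Job("alice", 2, 0b110),
]

print(nodes_alloc_by_sum(jobs))     # 5: node 0 is counted three times
print(nodes_alloc_by_bitmap(jobs))  # 3: nodes 0, 1 and 2, each counted once
```

In this model the summed value overshoots by exactly the number of extra jobs packed onto an already-counted node, which matches the inflation reported in the ticket.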