Ticket 22460 - ibwarn and lustre count errors
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Build System and Packaging
Version: 24.11.1
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Broderick Gardner
Reported: 2025-03-28 15:01 MDT by Trever
Modified: 2025-03-28 16:59 MDT

See Also:
Site: GA Tech


Description Trever 2025-03-28 15:01:43 MDT
We are running our on-prem Slurm RPMs in Azure. Azure does not have Lustre or InfiniBand (IB); our on-prem environment does, or can.

When an Azure node spins up on resume, interactive sessions regularly display errors such as the following:
[tnightingale6@slurm-cloud packer]$ salloc -N1 -Aphx-pace-staff -Crhel9 -qpace -pazureDS1 --mem=1000
salloc: Granted job allocation 48
salloc: Waiting for resource configuration
salloc: Nodes azDS1-1 are ready for job
slurmstepd: error: couldn't chdir to `/storage/home/hcodaman1/tnightingale6/packer': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/storage/home/hcodaman1/tnightingale6/packer': No such file or directory: going to /tmp instead
slurmstepd: error: _read_lustre_counters: can't find Lustre stats
slurmstepd: error: acct_gather_filesystem_p_get_data: cannot read lustre counters
ibwarn: [6035] get_abi_version: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
bash-5.1$ ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)

How do we get rid of these errors?

Removing the relevant lines from slurm.conf does not make the errors go away.
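
For anyone trying to reproduce this, a rough sketch of how one might check whether the removal actually reached the cloud node. It assumes (inferred from the function names in the output above, not confirmed) that the messages come from the acct_gather_filesystem/lustre and acct_gather_interconnect/ofed plugins, which are selected in slurm.conf via AcctGatherFilesystemType and AcctGatherInterconnectType. The helper name acct_gather_lines and the path /etc/slurm/slurm.conf are hypothetical, and files pulled in through Include directives are not followed.

```shell
# Print the active (uncommented) AcctGather*Type settings from a
# slurm.conf-style file. Commented-out lines are ignored because the
# pattern only allows whitespace before the parameter name.
acct_gather_lines() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -iE '^[[:space:]]*AcctGather[A-Za-z]*Type[[:space:]]*=' "$1" || true
}

# Example usage on the Azure node (path is an assumption, adjust per site):
#   acct_gather_lines /etc/slurm/slurm.conf
#   scontrol show config | grep -i '^AcctGather'
#   ls /sys/class/infiniband_mad/abi_version
```

If the file on the node shows no such lines but scontrol show config still reports the plugins, the running daemons may be using a different or older copy of the configuration than the one that was edited.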