Ticket 22460 - ibwarn and lustre count errors
Summary: ibwarn and lustre count errors
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Build System and Packaging
Version: 24.11.1
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Broderick Gardner
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-03-28 15:01 MDT by Trever
Modified: 2025-04-08 16:39 MDT

See Also:
Site: GA Tech
Linux Distro: RHEL
Machine Name:
CLE Version:
Version Fixed: 24.11.1


Description Trever 2025-03-28 15:01:43 MDT
We are using our on-prem Slurm RPMs in Azure, where the Azure nodes do not have Lustre or InfiniBand (the on-prem nodes do, or can).

When an Azure node spins up via resume, interactive sessions regularly display errors such as the following:
[tnightingale6@slurm-cloud packer]$ salloc -N1 -Aphx-pace-staff -Crhel9 -qpace -pazureDS1 --mem=1000
salloc: Granted job allocation 48
salloc: Waiting for resource configuration
salloc: Nodes azDS1-1 are ready for job
slurmstepd: error: couldn't chdir to `/storage/home/hcodaman1/tnightingale6/packer': No such file or directory: going to /tmp instead
slurmstepd: error: couldn't chdir to `/storage/home/hcodaman1/tnightingale6/packer': No such file or directory: going to /tmp instead
slurmstepd: error: _read_lustre_counters: can't find Lustre stats
slurmstepd: error: acct_gather_filesystem_p_get_data: cannot read lustre counters
ibwarn: [6035] get_abi_version: can't read ABI version from /sys/class/infiniband_mad/abi_version (No such file or directory): is ib_umad module loaded?
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
bash-5.1$ ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
ibwarn: [6035] mad_rpc_open_port: can't open UMAD port ((null):1)
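
Presumably the ibwarn messages come from the ib_umad kernel module not being loaded on the Azure nodes (they have no IB hardware); that can be checked with, e.g.:

lsmod | grep ib_umad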

How do we get rid of these errors?

Removing the relevant lines from slurm.conf does not mitigate the errors.
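
For reference, the lines we have been removing are the acct_gather plugin settings, along these lines (illustrative; plugin names inferred from the errors above):

AcctGatherFilesystemType=acct_gather_filesystem/lustre
AcctGatherInterconnectType=acct_gather_interconnect/ofed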
Comment 1 Aaron Jezghani 2025-04-03 09:38:44 MDT
Building on Trever's question, what we effectively want to determine is the following:

If a plugin such as AcctGatherFilesystemType=acct_gather_filesystem/lustre is enabled on the controller, is there a way to prevent it from executing on designated compute nodes, presumably via modified configurations? Or does slurmstepd take its configuration purely from the controller?

We've also looked at some of the acct_gather options at job submission, but those didn't seem to disable the plugins.

Any insight would be greatly appreciated.

Aaron
Comment 2 Broderick Gardner 2025-04-03 09:58:52 MDT
Sorry about the delay. I need some more detail about your configuration.

Can you send this output?
scontrol show config

Are you concerned about any of the errors in particular?

I expect the IB errors come from the shell profile/bashrc loading IB environment information. It's not clear whether the problem is on the compute node or the login node.

This one:
slurmstepd: error: couldn't chdir to `/storage/home/hcodaman1/tnightingale6/packer': No such file or directory: going to /tmp instead
is because that path doesn't exist on the compute node. You can use salloc --chdir, but going to /tmp is also valid.
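
For example, reusing the command from the original report:

salloc -N1 -Aphx-pace-staff -Crhel9 -qpace -pazureDS1 --mem=1000 --chdir=/tmp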
Comment 3 Broderick Gardner 2025-04-03 10:00:31 MDT
I didn't see your latest before sending that comment (the page had not reloaded). I am looking into what you asked.
Comment 4 Broderick Gardner 2025-04-03 10:27:12 MDT
Depending on how you sync slurm.conf, you might be able to unset acct_gather_filesystem/lustre on the nodes without Lustre. I don't know of another way to do it.

The slurmstepd runs on the compute nodes, tied to a particular job.
Comment 5 Trever 2025-04-03 12:42:16 MDT
What does it mean to "unset"?

Our current approach is to:
1.  On the Azure nodes being created (i.e. "resumed"), a systemd pre-exec step fetches the Slurm config files from the on-prem Slurm controller.
2.  The Azure node's pre-exec step removes the IB and Lustre config that exists on-prem (see the sketch after this list).
3.  slurmd starts on the Azure nodes using the slightly modified (per step 2 above) slurm.conf etc.
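
The removal in step 2 amounts to something like this (a simplified sketch; the path and match patterns here are illustrative):

sed -i -e '/^AcctGatherFilesystemType=/d' \
       -e '/^AcctGatherInterconnectType=/d' /etc/slurm/slurm.conf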

The problem is that this has not quieted the errors.
Comment 6 Broderick Gardner 2025-04-08 13:45:47 MDT
Please send your config. One way is the output of
scontrol show config

By "unset", I mean probably what you describe doing; comment out/remove the plugin in the slurm.conf on the compute nodes:
#AcctGatherFilesystemType=acct_gather_filesystem/lustre

I'm not sure yet why the slurmstepd would be loading the plugin, if you're already doing that.

I don't know where the ibwarn errors are from. Since they're outside slurm, I expect they are part of some shell profile environment.
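
Once slurmd is up on an Azure node, a quick way to confirm which acct_gather plugins are actually in effect there would be something like:

scontrol show config | grep -i acctgather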
Comment 7 Trever 2025-04-08 16:39:32 MDT
Well, we got it working.  It would appear my changes weren't taking effect because the config files weren't being read!  Absence of the IB/Lustre config settings was sufficient.

We can close this ticket, thank you.