Ticket 13265

Summary: A login/submit node is not using configless Slurm
Product: Slurm    Reporter: GSK-ONYX-SLURM <slurm-support>
Component: Scheduling    Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
CC: cinek
Version: 21.08.1
Hardware: Linux
OS: Linux
Site: GSK

Description GSK-ONYX-SLURM 2022-01-26 01:55:38 MST
Hello Team,

We have noticed an issue when submitting jobs directly with the srun command:

-bash-4.2$ /usr/local/slurm/bin/srun --time=1:00:00 --gres=gpu:1 --pty bash
srun: error: fwd_tree_thread: can't find address for host us1sxlxhgx0004, check slurm.conf
srun: error: Task launch for StepId=5793.0 failed on node us1sxlxhgx0004: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

The command is executed from a submit node.

It works once the specific compute node is added to the /etc/slurm/slurm.conf file, but we would prefer to use configless mode.

Looking at the configless documentation, it is possible to add a DNS SRV entry, but that requires installing BIND and doing additional configuration.
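For reference, such an SRV record would look roughly like this (zone, host names, and TTL below are placeholders; 6817 is the default slurmctld port):

```
; hypothetical BIND zone fragment for configless Slurm --
; clients look up _slurmctld._tcp and fetch the config from the listed host
_slurmctld._tcp.cluster.example.com. 3600 IN SRV 10 0 6817 ctld.cluster.example.com.
```

But as noted, this needs BIND (or another DNS server) set up for the cluster's domain.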

Is there any other way to get this working?

Thanks,
Radek
Comment 1 Marcin Stolarek 2022-01-26 03:51:37 MST
Radek,

One other way would be to configure the submit host as one of the cluster nodes. It doesn't have to be included in any partition, but running slurmd on it will result in a cached slurm.conf (/run/slurm/conf/slurm.conf) being stored on it and used by Slurm commands.
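With the distribution's systemd unit this typically just means pointing slurmd at the controller (controller name below is a placeholder; on Debian-style systems the file is /etc/default/slurmd instead):

```shell
# /etc/sysconfig/slurmd -- the systemd unit passes SLURMD_OPTIONS to slurmd,
# so slurmd starts in configless mode and fetches its config from the controller
SLURMD_OPTIONS="--conf-server slurmctl-primary:6817"
```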

cheers,
Marcin
Comment 2 GSK-ONYX-SLURM 2022-01-26 07:15:19 MST
Hey Marcin,

many thanks for your quick response. The submit node is part of the cluster. The slurmd service is running on that node and the cached config files are stored in /var/spool/slurmd/conf-cache. The point is that the submit node does not seem to be using them; it looks up compute nodes in /etc/slurm/slurm.conf instead. If the specific compute node is missing there, the command fails, even though the node does exist in the cached slurm.conf.

From a configless point of view, slurmd is configured properly. The cached slurm.conf, along with the other files (e.g. gres.conf), is updated every time a new configuration is pushed from the control node with the scontrol reconfig command.

Any ideas?

Cheers,
Radek
Comment 3 Marcin Stolarek 2022-01-26 11:35:35 MST
Radek,

>The point is that the submit node does not seem to be using them; it looks up compute nodes in /etc/slurm/slurm.conf instead. If the specific compute node is missing there, the command fails, even though the node does exist in the cached slurm.conf.

I guess that is the default built-in location of slurm.conf. It is used whenever the file exists. Why does the file exist if you want to run configless?

cheers,
Marcin
Comment 4 GSK-ONYX-SLURM 2022-01-27 05:03:49 MST
Hi Marcin,

The reason the slurm.conf file was left in its default location is that, without this file, it is not possible to read the server's physical configuration in order to put it into the config file later on. Obviously, it should have been removed after installation.

It looks like removing everything from /etc/slurm helped, and now the submit node uses the cached config files, including gres.conf.
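To illustrate the shadowing that bit us, here is a small stand-alone sketch (stand-in paths under a temp dir, no Slurm needed) of a local file at the default location hiding the slurmd cache:

```shell
# Stand-in paths mimicking the two config locations from this ticket.
tmp=$(mktemp -d)
mkdir -p "$tmp/etc/slurm" "$tmp/run/slurm/conf"
printf 'cached\n' > "$tmp/run/slurm/conf/slurm.conf"   # what slurmd pulled

pick_conf() {
    # Mimic a client command: the default location wins; fall back to the cache.
    if [ -f "$tmp/etc/slurm/slurm.conf" ]; then
        echo "$tmp/etc/slurm/slurm.conf"
    else
        echo "$tmp/run/slurm/conf/slurm.conf"
    fi
}

cat "$(pick_conf)"                                # -> cached
printf 'stale\n' > "$tmp/etc/slurm/slurm.conf"    # leftover install-time copy
cat "$(pick_conf)"                                # -> stale
rm "$tmp/etc/slurm/slurm.conf"                    # the fix described above
cat "$(pick_conf)"                                # -> cached
```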

Please make sure that the SchedMD documentation (https://slurm.schedmd.com/configless_slurm.html) is updated with the info that config files located in /etc/slurm/ take precedence over the cached config files.

Many thanks for your help!

Cheers,
Radek
Comment 5 Marcin Stolarek 2022-01-27 07:24:22 MST
I believe it's documented on that page:
>The order of precedence for determining what configuration source to use is as follows:
>
>    1. The slurmd --conf-server $host[:$port] option
>    2. The -f $config_file option
>    3. The SLURM_CONF environment variable (if set)
>    4. The default slurm config file (likely /etc/slurm.conf)
>    5. Any DNS SRV records (from lowest priority value to highest)
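As an aside, item 3 gives another configless-friendly escape hatch: pinning clients to the slurmd cache from comment 1, so a stale file at the default location (item 4) is never consulted. A sketch (the profile.d path is just an example):

```shell
# /etc/profile.d/slurm.sh -- point client commands straight at the cached
# copy maintained by a configless slurmd, bypassing /etc/slurm/slurm.conf
export SLURM_CONF=/run/slurm/conf/slurm.conf
```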

Does that look good to you?

cheers,
Marcin
[1] https://slurm.schedmd.com/configless_slurm.html#NOTES
Comment 6 GSK-ONYX-SLURM 2022-01-27 07:54:23 MST
You're right. I probably wasn't very careful while reading it, so it's a user issue ;-)
It is mentioned in the opposite direction (from lowest priority value to highest), but it makes sense.

Thanks again for your support. You can close the ticket.

Cheers,
Radek