Ticket 7430 - slurm_auth_get_host is preventing batch submissions from containerized hosts
Summary: slurm_auth_get_host is preventing batch submissions from containerized hosts
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 19.05.1
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Tim Wickberg
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-07-17 17:47 MDT by Felix Russell
Modified: 2021-02-11 13:38 MST

See Also:
Site: University of Washington
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 19.05.2.1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Description Felix Russell 2019-07-17 17:47:53 MDT
Hi, 

We've been using containerized instances of Open OnDemand to submit jobs to the slurmctld running on our management node. I noticed that an 18.x sbatch binary has no trouble making a batch submission against a 19.x slurmctld, but once the binary in the container is updated to 19.x, we see the following output on the 'submit host/container':

"sbatch: error: Batch job submission failed: Invalid node name specified"

Until we checked the slurmctld logs on the management host, this caused some confusion: we assumed the submission wrapper was applying a node-name constraint when it passed the job script to sbatch over stdin.

Drilling into the slurmctld logs we found the following output:

error: slurm_auth_get_host: Lookup failed: Unknown host
error: REQUEST_SUBMIT_BATCH_JOB lacks alloc_node from uid=441788
_slurm_rpc_submit_batch_job: Invalid node name specified
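For anyone else triaging this, the "Unknown host" error boils down to a failed reverse DNS lookup on the submitting host's IP. Here's a quick diagnostic sketch (not Slurm's actual code path, just the same kind of lookup) you can run inside the container to see whether its address reverse-resolves:

```python
import socket

def reverse_lookup(ip):
    """Return the PTR hostname for an IP, or None if the lookup fails.

    A None result here roughly corresponds to the condition slurmctld
    reports as 'slurm_auth_get_host: Lookup failed: Unknown host'.
    """
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
        return hostname
    except OSError:  # covers socket.herror and socket.gaierror
        return None

if __name__ == "__main__":
    # Check the address the container presents to slurmctld.
    try:
        ip = socket.gethostbyname(socket.gethostname())
    except OSError:
        ip = "127.0.0.1"  # the container's own hostname doesn't resolve
    print(ip, "->", reverse_lookup(ip))
```

In a typical containerized submit host the container's IP has no PTR record on the cluster's DNS, so the lookup returns nothing.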

I searched the git repo for commits mentioning slurm_auth_get_host and found commit 8e67c7fe60e8845c6a0e417357e2b05bf5bbcece: https://github.com/SchedMD/slurm/commit/8e67c7fe60e8845c6a0e417357e2b05bf5bbcece#diff-25edbe52f2c9729eb0c72267fba0ce81

It looks like, in an effort to improve the security of commands run from the Slurm binaries, Danny added a check that the hostname is included in the munge credential when the auth/munge scheme is used for cluster security. While this is probably a good measure in terms of security hardening, and should probably be on by default, we'd appreciate it if this particular check could be disabled in the slurm.conf file. Perhaps an 'auth/munge-nohostname-check' plugin could sit alongside `auth/munge` and `auth/none`, inheriting most of `auth/munge`'s logic but skipping the hostname check.

Reverting to 18.x binaries inside the container is a functional workaround at the moment, but I'm worried that as releases progress the 18.x binaries won't be able to talk to newer releases.

Something else to be aware of: 'query' commands like `sinfo` or `sacctmgr show` still work on nodes where the hostname check fails. I've never worked with a cluster that doesn't have auth/munge enabled, so I don't know whether nodes without a valid munge auth context are typically allowed to run query commands, but since I'm seeing a discrepancy between get-type and post-type commands, it might be worth ensuring this hostname check is also enforced for all other post-type commands.

Sorry I don't have the C chops to make a PR myself, and thanks for your time.

Felix Russell
Comment 1 Felix Russell 2019-07-17 17:56:59 MDT
As an addendum to the previous message and while I'm on the tangent of suggesting features:

It might be wise to remove the ambiguity of the sbatch error message by changing `Invalid node name specified` to `Auth error: host not recognized` or something similar.
Comment 2 Tim Wickberg 2019-07-19 11:12:59 MDT
Hi Felix -

There's a private bug open covering this, and we'll have it fixed before 19.05.2 comes out.

I'm closing this as a duplicate of that bug.

- Tim

*** This ticket has been marked as a duplicate of ticket 7255 ***
Comment 3 Felix Russell 2019-08-14 00:31:25 MDT
Blank
Comment 4 Felix Russell 2019-08-14 00:43:47 MDT
Hi Tim,

In the release notes for 19.05.2.1 I found the following entry:

 -- In munge decode set the alloc_node field to the text representation of an
    IP address if the reverse lookup fails.

Was this alloc_node field getting fetched simply so that munge could log either a hostname (or now an IP address) for the purposes of audit/security logging? Or is this considered an 'in-network' check that now falls back from hostname lookup to an 'is the IP address even pingable' sanity/security check?
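Reading the release note literally, the new behavior seems to amount to the following (a Python sketch of my understanding, not the actual C implementation in the munge decode path):

```python
import socket

def resolve_alloc_node(ip):
    """Sketch of the 19.05.2.1 release-note behavior: prefer the
    reverse-resolved hostname for alloc_node, but fall back to the
    IP's text representation if the lookup fails, rather than leaving
    alloc_node unset (which is what previously triggered the
    'Invalid node name specified' rejection)."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip  # fall back to the textual IP address
```

If that reading is right, a submit host with no PTR record would now produce a populated alloc_node field instead of an error.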

The reason I'm asking is that we have maintenance events a month apart from one another, and rebuilding my 'submit container' with 19.05.2.1 bins didn't yield a workable resolution, and I want to be confident that when we update our slurmctld host to 19.05.2.1 (in line with our scarce/finite maintenance window schedule) it will resolve this issue. 

Thanks for your patience in this matter,

Felix
Comment 5 Tim Wickberg 2019-08-14 11:26:49 MDT
(In reply to Felix Russell from comment #4)
> Hi Tim,
> 
> In the release notes for 19.05.2.1 I found the following entry:
> 
>  -- In munge decode set the alloc_node field to the text representation of an
>     IP address if the reverse lookup fails.
> 
> Was this alloc_node field getting fetched simply so that munge could log
> either a hostname (or now an ip address) for the purposes of audit/security
> logging? or is this considered a 'in-network-check' that now falls back from
> hostname lookup to an 'is the IP address even pingable' sanity/security
> check?

It's used for enforcing the 'AllocNodes' constraint on the partitions. We don't usually recommend using that setting, and thus falling back to the IP address doesn't change anything from a security perspective.
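For context, AllocNodes is a partition-level parameter in slurm.conf; a fragment using it might look like the following (node and partition names are hypothetical):

```
# slurm.conf (fragment) - hypothetical names
# Only jobs submitted from login01 or login02 may allocate in this
# partition; the alloc_node derived from the munge credential is what
# gets checked against this list.
PartitionName=batch Nodes=node[01-64] AllocNodes=login01,login02 Default=YES
```

When AllocNodes is unset (the common case), the alloc_node value is recorded but not used to reject submissions, which matches Tim's point that the IP fallback doesn't weaken anything.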

> The reason I'm asking is that we have maintenance events a month apart from
> one another, and rebuilding my 'submit container' with 19.05.2.1 bins didn't
> yield a workable resolution, and I want to be confident that when we update
> our slurmctld host to 19.05.2.1 (in line with our scarce/finite maintenance
> window schedule) it will resolve this issue. 

The slurmctld is what would have been throwing this error; the 19.05.x clients are all fine from a submission standpoint as long as the slurmctld is upgraded to 19.05.2.

- Tim