| Summary: | slurm_auth_get_host is preventing batch submissions from containerized hosts | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Felix Russell <frussell> |
| Component: | User Commands | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | ahough, frussell, mattjay |
| Version: | 19.05.1 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | University of Washington | Linux Distro: | --- |
| Version Fixed: | 19.05.2.1 | Target Release: | --- |
Description
Felix Russell
2019-07-17 17:47:53 MDT
As an addendum to the previous message, and while I'm on the tangent of suggesting features: it might be wise to remove the ambiguity of the sbatch error message by changing it from `Invalid node name specified` to `Auth Error: host not recognized` or something similar.

Hi Felix,

There's a private bug open covering this, and we'll have it fixed before 19.05.2 comes out. I'm closing this as a duplicate of that bug.

- Tim

*** This ticket has been marked as a duplicate of ticket 7255 ***

Hi Tim,
In the release notes for 19.05.2.1 I found the following entry:
-- In munge decode set the alloc_node field to the text representation of an
IP address if the reverse lookup fails.
Was this alloc_node field being fetched simply so that munge could log either a hostname (or now an IP address) for audit/security logging? Or is it considered an 'in-network' check that now falls back from a hostname lookup to an 'is the IP address even pingable' sanity/security check?
The reason I'm asking is that our maintenance events are a month apart, rebuilding my 'submit container' with 19.05.2.1 binaries didn't resolve the problem, and I want to be confident that when we update our slurmctld host to 19.05.2.1 (in line with our scarce maintenance window schedule) it will resolve this issue.
Thanks for your patience in this matter,
Felix
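The release-note entry quoted above describes a reverse lookup with a fallback: try to turn the submitting host's IP address into a hostname, and if that fails, use the address's text form as alloc_node instead of failing. A minimal Python sketch of that pattern follows. It is not Slurm's code (the real change lives in Slurm's C munge auth plugin); the function names `reverse_lookup` and `alloc_node_from_ip` and the address `10.0.3.17` are hypothetical, chosen only for illustration.

```python
import ipaddress
import socket
from typing import Callable


def reverse_lookup(ip: str) -> str:
    """Resolve `ip` to a hostname, raising OSError if it has no name."""
    # NI_NAMEREQD makes getnameinfo fail instead of quietly returning the
    # numeric address, so "no PTR record" is distinguishable from success.
    host, _port = socket.getnameinfo((ip, 0), socket.NI_NAMEREQD)
    return host


def alloc_node_from_ip(ip: str,
                       lookup: Callable[[str], str] = reverse_lookup) -> str:
    """Hypothetical helper: hostname for `ip`, else the IP's text form."""
    # Validate and normalise the address (e.g. "2001:DB8::1" -> "2001:db8::1").
    text = str(ipaddress.ip_address(ip))
    try:
        return lookup(text)
    except OSError:
        # Reverse lookup failed (common for container addresses with no
        # DNS entry): fall back to the IP text rather than rejecting.
        return text


if __name__ == "__main__":
    def no_ptr_record(ip: str) -> str:
        # Simulates a resolver that has no name for this address.
        raise socket.herror(1, "Unknown host")

    print(alloc_node_from_ip("10.0.3.17", lookup=no_ptr_record))  # 10.0.3.17
```

Injecting `lookup` keeps the fallback logic testable without depending on the local DNS setup.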
(In reply to Felix Russell from comment #4)
> Hi Tim,
>
> In the release notes for 19.05.2.1 I found the following entry:
>
> -- In munge decode set the alloc_node field to the text representation of an
> IP address if the reverse lookup fails.
>
> Was this alloc_node field getting fetched simply so that munge could log
> either a hostname (or now an ip address) for the purposes of audit/security
> logging? or is this considered a 'in-network-check' that now falls back from
> hostname lookup to an 'is the IP address even pingable' sanity/security
> check?

It's used for enforcing the 'AllocNodes' constraint on the partitions. We don't usually recommend using that setting, and thus falling back to the IP address doesn't change anything from a security perspective.

> The reason I'm asking is that we have maintenance events a month apart from
> one another, and rebuilding my 'submit container' with 19.05.2.1 bins didn't
> yield a workable resolution, and I want to be confident that when we update
> our slurmctld host to 19.05.2.1 (in line with our scarce/finite maintenance
> window schedule) it will resolve this issue.

The slurmctld is what would have been throwing the error related to this; 19.05.x releases are all fine from a submission standpoint as long as the slurmctld is upgraded to 19.05.2.

- Tim
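Tim's point that the IP fallback changes nothing security-wise can be illustrated with a deliberately simplified allow-list check. This is a hypothetical sketch, not Slurm's implementation: the function name, the convention that `None` means "no AllocNodes restriction", and the exact-string matching are all assumptions made for illustration (Slurm's real check is more involved).

```python
from typing import Optional, Sequence


def alloc_node_permitted(alloc_node: str,
                         alloc_nodes: Optional[Sequence[str]] = None) -> bool:
    """Hypothetical, simplified stand-in for a partition AllocNodes check."""
    if alloc_nodes is None:
        # No restriction configured: a hostname or an IP text both pass,
        # which is why the fallback unblocks unresolvable submit hosts.
        return True
    # With an allow list, an IP-text alloc_node only matches if that exact
    # string is listed, so the fallback cannot slip past a hostname list.
    return alloc_node in alloc_nodes


if __name__ == "__main__":
    print(alloc_node_permitted("10.0.3.17"))                        # True
    print(alloc_node_permitted("10.0.3.17", ["login1", "login2"]))  # False
    print(alloc_node_permitted("login1", ["login1", "login2"]))     # True
```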