Ticket 15021

Summary: RFC: SlurmUser check in client commands
Product: Slurm Reporter: Kilian Cavalotti <kilian>
Component: User CommandsAssignee: Carlos Tripiana Montes <tripiana>
Status: OPEN --- QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: nate, tim
Version: 22.05.3   
Hardware: Linux   
OS: Linux   
Site: Stanford Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Kilian Cavalotti 2022-09-22 09:56:34 MDT
Hi!


I have a question regarding the _validate_and_set_defaults() function in common/read_config.c. From what I understand, each time a Slurm command starts and reads the configuration file, it checks for the existence of a "SlurmUser" uid on the current host, even if that user is not used for anything after that.

I'm asking in the context of what's been described in bug 9282: running Slurm commands from a container that lacks the user account defined as "SlurmUser" in its configuration file.


I understand the need for checking the existence of a SlurmUser on the daemon side, for functions that need to drop privileges and run under that user. But for client commands (and especially read-only commands like sinfo or squeue), I'm not sure to understand why that check is done.

sinfo, for instance, will fail with the following error if the SlurmUser doesn't exist on a client node (like a submission host, without any daemon running):

$ grep SlurmUser /etc/slurm/slurm.conf
SlurmUser=foobar
$ id foobar
id: foobar: no such user
$ sinfo --version
sinfo: error: Invalid user for SlurmUser foobar, ignored
sinfo: fatal: Unable to process configuration file


But it will happily run if SlurmUser is defined to any other existing user, even if it's completely unrelated and different from the rest of the cluster:

$ grep SlurmUser /etc/slurm/slurm.conf
nfsnobody
$ id nfsnobody
uid=65534(nfsnobody) gid=65534(nfsnobody) groups=65534(nfsnobody)
$ sinfo --version
slurm 22.05.3


Could you please help me understand the rationale here?

Thanks!
--
Kilian
Comment 1 Carlos Tripiana Montes 2022-09-23 05:45:34 MDT
Hi Kilian,

That's old code. It used for all Slurm components and for some of them it's needed such check whereas in others don't.

It's a simple check, done as name to UID resolution against the system accounts.

Yes, no check between commands and slurmctld. This code is also used by daemons so such check wouldn't make much sense.

I know, this check also doesn't make much sense in the client side.

Let me think what it could be done.

Cheers,
Carlos.
Comment 2 Kilian Cavalotti 2022-09-23 10:47:08 MDT
Hi Carlos, 

(In reply to Carlos Tripiana Montes from comment #1)
> That's old code. It used for all Slurm components and for some of them it's
> needed such check whereas in others don't.
> 
> It's a simple check, done as name to UID resolution against the system
> accounts.
> 
> Yes, no check between commands and slurmctld. This code is also used by
> daemons so such check wouldn't make much sense.
> 
> I know, this check also doesn't make much sense in the client side.

That was my impression as well, thanks for confirming!

> Let me think what it could be done.

I guess the first step would be to have a specific list of cases where that check is needed. 

I don't know if it's feasible to identify the command that's calling init_slurm_conf(), and then ignore the SlurmUser uid existence check for client commands that don't require it?

But maybe a better solution would be to move that check out of the _validate_and_set_defaults() function and place it in functions where it's really required: there are already similar gid-side checks in _become_slurm_user() in src/slurmdbd/slurmdbd.c and src/slurmctld/controller.c. Maybe the user existence should be checked there as well. And where setuid/setgid calls to SlurmUser are done.

Actually, those setuid/setgid calls will already return an error if the SlurmUser doesn't exist, so checking the existence of that user while parsing the configuration file seems a bit redundant.


Anyway, this is one of the most common causes of error and confusion when trying to execute Slurm commands from a container, and one can find plenty of reports besides bug 9282:
https://github.com/apptainer/singularity/issues/5345
https://computing.ee.ethz.ch/Services/SLURM#Invalid_user_for_SlurmUser_slurm
https://lists.mpich.org/pipermail/discuss/2018-September/002399.html
Even a Debian packaging issue:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=783663

So it's probably worth considering dropping that check when parsing the configuration file, and moving it to the relevant functions.


Thanks!
--
Kilian
Comment 12 Carlos Tripiana Montes 2023-03-30 06:48:37 MDT
Hi Killian,

I've followed all the implications that this name to UID resolution have.

It's mostly what we already talked about, but there's another important shard to take into account. Some client commands, those which open an active communication with slurmctld, like salloc, sattach, srun, ... Those *really* need to know that the sender's is who claims to be. slurmctld is ran by SlurmUser, so we need to be able to check that replies to RPCs are sent by it, from the client command perspective.

Other than that, we can mostly ignore if the name to UID resolution fails for SlurmUser on a particular node, if we just fallback to slurm_conf.slurm_user_id=0.

We are on our way to address such limitation, and we're developing some POC.

I take this opportunity to apologise for not having replied to you earlier.

Cheers,
Carlos.
Comment 14 Kilian Cavalotti 2023-03-31 12:07:35 MDT
Hi Carlos,

Thanks for the follow up!

(In reply to Carlos Tripiana Montes from comment #12)
> It's mostly what we already talked about, but there's another important
> shard to take into account. Some client commands, those which open an active
> communication with slurmctld, like salloc, sattach, srun, ... Those *really*
> need to know that the sender's is who claims to be. slurmctld is ran by
> SlurmUser, so we need to be able to check that replies to RPCs are sent by
> it, from the client command perspective.

I see, that makes sense.

> Other than that, we can mostly ignore if the name to UID resolution fails
> for SlurmUser on a particular node, if we just fallback to
> slurm_conf.slurm_user_id=0.
> 
> We are on our way to address such limitation, and we're developing some POC.

Excellent, thanks for letting me know!

> I take this opportunity to apologise for not having replied to you earlier.

No worries at all!

Cheers,
--
Kilian