Ticket 16611 - srun fails with --uid flag
Summary: srun fails with --uid flag
Status: RESOLVED WONTFIX
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmd
Version: 23.02.1
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Nate Rini
QA Contact: Tim Wickberg
Depends on: 16524
Blocks: 17412
 
Reported: 2023-04-28 11:49 MDT by DRW GridOps
Modified: 2024-09-23 01:29 MDT

Site: DRW Trading


Attachments
slurm.conf (7.75 KB, text/plain)
2023-04-28 11:49 MDT, DRW GridOps

Description DRW GridOps 2023-04-28 11:49:48 MDT
Created attachment 30057
slurm.conf

(flagging this as medium; I can work around it but I'm concerned because we recently upgraded to 23.02, so this could be a more general issue.)

Hello, my cluster appears to be behaving normally when users submit jobs.  However, I get errors when I attempt to use the --uid flag.  It looks like the job makes it out to the node and then slurmd tells it to take a hike.


$ srun --chdir /tmp/ whoami
srun: job 43621200 queued and waiting for resources
srun: job 43621200 has been allocated resources
cseraphine
$ sudo srun --chdir /tmp/ whoami
root
$ sudo srun --uid cseraphine --chdir /tmp/ whoami
srun: error: Unable to create step for job 43621206: Unexpected message received
$ sudo srun --verbose  --uid cseraphine --chdir /tmp/ whoami
srun: defined options
srun: -------------------- --------------------
srun: chdir               : /tmp/
srun: uid                 : 29599
srun: verbose             : 1
srun: -------------------- --------------------
srun: end of defined options
srun: Waiting for resource configuration
srun: Nodes chhq-supgcmp046 are ready for job
srun: jobid 43621212: nodes(1):`chhq-supgcmp046', cpu counts: 1(x1)
srun: error: Unable to create step for job 43621212: Unexpected message received


slurmctld logs from that period show:
[2023-04-28T12:42:49.176] sched: _slurm_rpc_allocate_resources JobId=43621212 NodeList=chhq-supgcmp046 usec=7206
[2023-04-28T12:42:49.279] error: slurm_unpack_received_msg: [[chhq-supghome01.us.drwholdings.com]:47434] auth_g_verify: REQUEST_JOB_STEP_CREATE has authentication error: Unspecified error
[2023-04-28T12:42:49.279] error: slurm_unpack_received_msg: [[chhq-supghome01.us.drwholdings.com]:47434] Protocol authentication error
[2023-04-28T12:42:49.290] error: slurm_receive_msg [10.16.147.100:47434]: Protocol authentication error
[2023-04-28T12:42:49.291] error: slurm_unpack_received_msg: [[chhq-supghome01.us.drwholdings.com]:47436] auth_g_verify: REQUEST_COMPLETE_JOB_ALLOCATION has authentication error: Unspecified error
[2023-04-28T12:42:49.291] error: slurm_unpack_received_msg: [[chhq-supghome01.us.drwholdings.com]:47436] Protocol authentication error
[2023-04-28T12:42:49.301] error: slurm_receive_msg [10.16.147.100:47436]: Protocol authentication error
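
For reference, the workaround alluded to above is simply to switch users before invoking srun rather than asking srun to do it; a minimal example (with cseraphine as the target account):
$ sudo -u cseraphine srun whoami
cseraphine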
Comment 2 Nate Rini 2023-04-28 14:07:36 MDT
Several authentication changes in Slurm 23.02 resulted in `srun --uid` using user/group combinations that can get rejected by the controller. We have a patch pending QA to resolve the issue in bug#16524. Since that ticket is private, I will leave this ticket open and update it once the patch is upstream for the next release.
Comment 24 Nate Rini 2023-06-28 09:47:59 MDT
(In reply to Nate Rini from comment #2)
> Several authentication changes in Slurm 23.02 resulted in `srun --uid`
> using user/group combinations that can get rejected by the controller. We
> have a patch pending QA to resolve the issue in bug#16524. Since that
> ticket is private, I will leave this ticket open and update it once the
> patch is upstream for the next release.

Update: We have a patch set we have been working on. We found an issue with SELinux that we are currently working to resolve, which has slowed down the process considerably.
Comment 25 Bas van der Vlies 2023-06-28 09:48:13 MDT
I am on holidays from 23-Jun-2023 till 3-Jul-2023
Comment 26 DRW GridOps 2023-06-28 09:52:21 MDT
> We have a patch set we have been working on. 

Glorious!  Please let me know when that makes its way into a release, as we tend to skip some minor releases here.

> We found an issue with SELinux
> that we are currently working to resolve, which has slowed down the process
> considerably.

Yeah, that's what SELinux does...   I tip my hat in respect to anyone who actually runs SELinux on a Slurm cluster in enforcing mode.
Comment 53 DRW GridOps 2023-08-09 13:04:12 MDT
Any word on this?  This is still an issue as of 23.02.4.

root@shared21-14:03:44-/tmp# sudo -u cseraphine srun whoami
cseraphine
root@shared21-14:03:46-/tmp# srun --uid cseraphine  whoami
srun: error: Unable to create step for job 59487017: Unexpected message received
1-root@shared21-14:03:55-/tmp# srun --version
slurm 23.02.4
Comment 54 Nate Rini 2023-08-09 20:32:49 MDT
(In reply to DRW GridOps from comment #53)
> Any word on this?  This is still an issue as of 23.02.4.

We have a patch currently undergoing QA review. It has required several modifications, uncovered during testing, which have delayed the process.
Comment 59 Tim Wickberg 2023-08-24 14:32:43 MDT
Hey folks -

This has proven rather difficult to address - the fixes for CVE-2022-29500 that limit how the authentication tokens can be used are what have led to the majority of the complications here, and we've gone through a number of iterations (nine currently) that only allow portions of this to work as expected.

To try to summarize the issue: part of the security fix for that CVE requires the slurmctld to know what uid the srun process is operating under when connecting back to it. This allows it to restrict the scope of the authentication tokens to avoid any malicious use.

The way that the 'srun --uid' support was built requires the srun process to run initially under uid 0, and then pivot to running under the target uid. Which uid it is currently operating under depends on a number of factors - even if the slurmctld were to try to predict when this switch occurs, there are a number of subtle race conditions around job termination and node failure handling that would make this impossible to predict 100% of the time. This means that we cannot reliably communicate with the srun process while restricting the credential. And disabling this security mechanism is not something we want to do here - it's a core part of addressing CVE-2022-29500.

At this point, and after some long internal deliberation, we've concluded that it's best for Slurm to abandon the --uid / --gid options in salloc and srun entirely. We feel that the security risks vastly outweigh the benefits of restoring support for those options, and will move ahead with removing them instead.

I'm hoping that is not too inconvenient, and that the workaround you'd identified is still sufficient for your use cases?

- Tim
Comment 60 Bas van der Vlies 2023-08-24 14:32:57 MDT
I am on holidays from 10-Aug-2023 till 27-Aug-2023
Comment 61 DRW GridOps 2023-08-24 14:39:24 MDT
I'm... actually OK with this.  A change with multiple touchpoints across a runtime is going to be ugly.  The logical thing to do, given what you've described, is to have the UID change happen dramatically earlier, which would be very similar in effect to having srun re-exec itself as the new user, which in turn is precisely the same thing as me simply running the thing under sudo or runas or what have you.  Given how long the bug has been in the wild, everyone who cares about it has already adopted the obvious workaround anyway.
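
Concretely (a sketch only, reusing the example account from earlier in the ticket), the difference boils down to:
$ sudo srun --uid cseraphine whoami    # srun starts as root and pivots mid-run: broken in 23.02
$ sudo -u cseraphine srun whoami       # the uid switch happens before srun ever starts: works fine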

TL;DR: I'd rather lose a nice-to-have feature than introduce a complex fix that opens the door for security issues or other bugs.

Kill the beast!
Comment 63 Tim Wickberg 2023-08-24 14:57:00 MDT
> TL;DR: I'd rather lose a nice-to-have feature than introduce a complex fix
> that opens the door for security issues or other bugs.
> 
> Kill the beast!

Thank you for understanding; removing features is always one of the toughest parts of my job, but it's been a long-term goal of mine to simplify the handling around uid/gid to limit potential security issues, and this is as good a time as any to push forward, even if the initial break was inadvertent.

I've pushed a patch that'll be in 23.02.5 that removes --uid/--gid from salloc and srun. (They do remain in the sbatch command, and on the REST API, but in those situations the communication paths are much more straightforward.)
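
For completeness, the surviving sbatch form looks roughly like this (illustrative only; like --uid elsewhere it has to be invoked by a privileged user, hence the sudo, and --wrap just keeps the example short):
$ sudo sbatch --uid=cseraphine --wrap="whoami"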

I'll be doing some further cleanup on master now that we can simplify a number of these code paths.

thanks,
- Tim