We would like to allow users to run Slurm commands (e.g., sbatch and squeue) inside of an Apptainer container. This is similar to https://bugs.schedmd.com/show_bug.cgi?id=9282, except I will note that we are installing Slurm via RPMs that we build. As a result, the Slurm user commands are in /usr/bin/ on the host, and it is not ideal to map that entire directory into the container. A few years have also passed since that case was active.

Is there any new advice on how to accomplish this?

We've thought of using slurmrestd running as a service. A user could create a token and then use that inside of their container to authenticate when passing HTTP requests to slurmrestd running in the cluster outside of the container. But then a user would need to craft those HTTP requests by hand. Is there any thought of creating/modifying Slurm user commands to alternatively use HTTP+slurmrestd w/ JWT auth instead of talking directly to slurmctld w/ MUNGE auth? Or creating alternate "wrappers" for sbatch, etc. that work this way?
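To illustrate the kind of request a user would otherwise have to hand-craft, here is a minimal sketch; the hostname, port, OpenAPI plugin version, and token are all placeholders and will differ per site:

```shell
# Hypothetical slurmrestd endpoint; the host, port, and OpenAPI plugin
# version are placeholders and depend on the site's slurmrestd setup.
RESTD="http://slurmrestd.example.org:6820"
API="slurm/v0.0.39"
# The token would come from running `scontrol token` on the host.
SLURM_JWT="placeholder-token"
# A job-list query from inside the container would then look like:
#   curl -s -H "X-SLURM-USER-TOKEN: $SLURM_JWT" "$RESTD/$API/jobs"
echo "$RESTD/$API/jobs"
```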
(In reply to Jake Rundall from comment #0)
> We would like to allow users to run Slurm commands (e.g., sbatch and squeue)
> inside of an Apptainer container. This is similar to
> https://bugs.schedmd.com/show_bug.cgi?id=9282, except I will note that we
> are installing Slurm via RPMs that we build. As a result Slurm user commands
> are in /usr/bin/ on the host and it is not ideal to map that into the
> container. And a few years have passed since that case was active.
>
> Is there any new advice on how to accomplish this?

The `--container` arguments have been added to srun/sbatch/salloc since that ticket was closed. This doesn't exactly relate to the issue of running the CLI commands in a container, though. Having the batch step sit outside of any containers and then having all the steps run inside of the relevant containers might work, but a use case for that wasn't provided. Using the `--container` argument will ensure that the PMIx socket and any GRES-defined hardware devices are mounted inside of the containers by default.

The requirements for running Slurm CLI commands inside of a container have mostly remained the same as before:

- The Munge socket must be bind mounted in, or JWT must be used for authentication (more details below).
- The Slurm binaries must be the same version as the cluster, or up to two major releases older.
- There must be IP connectivity to all Slurm daemons, along with DNS resolution (or IPs defined in slurm.conf) for all nodes in the cluster.
- If any specialized hardware is in use, then the relevant device drivers must be mounted in too.
- If PMIx or MPI is used, their versions must be compatible with the versions Slurm was compiled against.

If your site is using cgroups v2, then slurmd can safely be run inside of a container with Slurm-23.02, with all the normal features usable. This may be an alternative if your site wishes for users to always use the same container.

> We've thought of using slurmrestd running as a service.
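As a sketch of the `--container` pattern mentioned above (the OCI bundle path is a placeholder, and the flag requires an oci.conf to be configured on the compute nodes):

```shell
# Placeholder OCI bundle path; not a real site path.
BUNDLE=/path/to/oci/bundle
# The batch step itself runs outside any container, while each srun
# step launches inside the bundle, e.g.:
#   sbatch --wrap "srun --container=$BUNDLE ./app"
# Shown here only as the composed command line:
echo "srun --container=$BUNDLE ./app"
```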
> A user could create a
> token and then use that inside of their container to authenticate when
> passing HTTP requests to slurmrestd running in the cluster outside of the
> container. But then a user would need to create those HTTP requests.

Submitting requests from inside of containers was one of the original use cases for the new auth/jwt plugin, which was added alongside slurmrestd. As of Slurm-23.02, almost all of the CLI commands should work with the SLURM_JWT environment variable set, with the notable exception of srun job steps. When using auth/jwt, Munge is not used (or needed) by the CLI commands; Munge's socket will still need to be visible to the Slurm daemons. We have documented setting up JWT here:

https://slurm.schedmd.com/jwt.html

It is possible that job steps might be made to work with auth/jwt in the future. That would be considered an RFE and would need a ticket requesting it.

> Is there any thought of creating/modifying Slurm user commands to alternatively
> use HTTP+slurmrestd w/ JWT auth instead of talking directly to slurmctld w/
> MUNGE auth?

There has been some discussion of adding this functionality, and several prerequisites for it have already been added to the Slurm-23.11 release. As of right now, such functionality is not on our official roadmap. As always, sites are welcome to submit RFE tickets for new functionality requests.
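As a sketch of the SLURM_JWT workflow described above (the token value and lifespan are placeholders, and auth/jwt must already be configured per the documentation linked above):

```shell
# 1. On the host, mint a token; this prints a line like SLURM_JWT=eyJ...
#      scontrol token lifespan=3600
# 2. Inside the container, export the token so the CLI commands
#    authenticate with auth/jwt instead of Munge:
export SLURM_JWT="placeholder-token-from-scontrol"
# 3. Most CLI commands then work without the Munge socket, e.g.:
#      squeue --me
```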
Are there any more questions, and is your site interested in sponsoring this functionality?
Thanks, sorry for the delayed response. I'm still waiting on some input from others at our site.

--------

For posterity, here's the minimal Apptainer command that allowed us to run sinfo inside a container:

apptainer shell --bind /usr/bin/sinfo,/usr/lib64/slurm,/var/spool/slurmd/conf-cache:/etc/slurm,/etc/passwd,/etc/group,/lib64/libmunge.so.2,/run/munge ubuntu_latest.sif

The host OS is Red Hat 8, but we tested with an Ubuntu container purposefully so there'd be a reasonable mismatch between host and container. Since we're building Slurm RPMs that install the Slurm commands in /usr/bin/, and since we didn't want to bind mount all of /usr/bin/ into the container, we're just bringing in the specific sinfo command. The compute node is configless and doesn't have Slurm configs at /etc/slurm/, so we bind mounted the config cache.
(In reply to Jake Rundall from comment #4)
> Thanks, sorry for the delayed response. I'm still waiting on some input from
> others at our site.

Do you wish to keep this ticket open while waiting?

> apptainer shell --bind
> /usr/bin/sinfo,/usr/lib64/slurm,/var/spool/slurmd/conf-cache:/etc/slurm,/etc/
> passwd,/etc/group,/lib64/libmunge.so.2,/run/munge ubuntu_latest.sif

I would suggest running ldd against every binary in that list and adding all of the dependencies too. Once the container and the host OS diverge too far with libc, things may not work even with that. A second Slurm-only container that is called via a wrapper script in the guest container might be a better long-term solution.
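As a sketch of collecting those dependencies (the binary list here matches the --bind example; adjust it for your site):

```shell
# Print the resolved shared-library paths for each bind-mounted binary,
# de-duplicated, so they can be appended to the --bind list.
for bin in /usr/bin/sinfo /lib64/libmunge.so.2; do
  ldd "$bin" 2>/dev/null | awk '$3 ~ /^\// {print $3}'
done | sort -u
```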
(In reply to Nate Rini from comment #5)
> (In reply to Jake Rundall from comment #4)
> > Thanks, sorry for the delayed response. I'm still waiting on some input
> > from others at our site.
>
> Do you wish to keep this ticket open while waiting?

There hasn't been a response in a week, so I assume there are no more questions. I'm closing this ticket, but feel free to respond and it will automatically re-open. If your site is interested in sponsoring the features requested in comment #0, please respond and we can begin that process.