Summary: | Slurm commands fail when run in Singularity container with the error "Invalid user for SlurmUser slurm" | |
---|---|---|---
Product: | Slurm | Reporter: | Nuance HPC Grid Admins <gridadmins>
Component: | Other | Assignee: | Tim McMullan <mcmullan>
Status: | RESOLVED INFOGIVEN | QA Contact: |
Severity: | 2 - High Impact | |
Priority: | --- | CC: | rk3199
Version: | - Unsupported Older Versions | |
Hardware: | Linux | |
OS: | Linux | |
Site: | Nuance | Alineos Sites: | ---
Atos/Eviden Sites: | --- | Confidential Site: | ---
Coreweave sites: | --- | Cray Sites: | ---
DS9 clusters: | --- | HPCnow Sites: | ---
HPE Sites: | --- | IBM Sites: | ---
NOAA Site: | --- | NoveTech Sites: | ---
Nvidia HWinf-CS Sites: | --- | OCF Sites: | ---
Recursion Pharma Sites: | --- | SFW Sites: | ---
SNIC sites: | --- | Linux Distro: | CentOS
Machine Name: | | CLE Version: |
Version Fixed: | | Target Release: | ---
DevPrio: | --- | Emory-Cloud Sites: | ---
Attachments: | Configuration from Nuance EU Tools cluster
*** Ticket 9281 has been marked as a duplicate of this ticket. ***

Hi Jason,

Per your question, our containers do not include the Slurm user. Our developers decided to use Docker for all container development without considering the security implications of running containers in a shared multi-user environment with HIPAA/PII data. We need to enforce the restriction that the container MUST run under the submitter's user/group IDs by default, which you cannot do with Docker. Because the developers have pulled in base container images that are outside of the HPC team's control, we have no way to introduce the slurm user into the original Docker containers, nor to add it into the final Singularity image that HPC requires the users to use. The slurm user is created as an integral part of the Slurm packages our cloud team built.

Mike - just to clarify here: are you using these Singularity containers as submit hosts, or as both submit hosts and scheduler?

The Slurm master is not running in a container. Our bare-metal environment is very stripped down; the requirement is that all users work in a container. The users are trying to submit jobs both interactively on the login node and non-interactively as steps in active jobs. So, they submit a job to run a script. The submitted command would resemble:

singularity exec <container> script.sh

In script.sh, there are steps that call srun, sstat, or the DRMAA submit wrapper. Without Singularity, we would be using compute nodes as submit hosts.

Hey Mike! Thanks for all the information on how you have things set up! It sounds like you have mapped in the Slurm commands, the config, and most of the important files. If you haven't already, you will likely also need to map in the munge socket and library (I mapped /run/munge and /lib/x86_64-linux-gnu/libmunge.so.2). I also looked at your slurm.conf and didn't see anything out of place.

I should note that for the most part the support staff handles bugs, development issues, and configuration problems (we are largely developers), so we don't handle a lot of integration issues like this and aren't Singularity experts. That said, I do have an idea on handling this!

The error you are seeing now is because the config parser expects that the "SlurmUser" is valid, and you don't have that user in the containers. There are a couple of options for dealing with it, one being to make sure there is a slurm user in the containers (which I know is not ideal). However, for just those user commands, setting "SlurmUser" to a user that exists in all of the containers (e.g. root) would allow them to parse the config. I was able to do this by setting SINGULARITYENV_SLURM_CONF before spawning the container.

Thanks!
--Tim

Hey Mike,

I just wanted to check in and see if there was anything else on this I could help you with, and if that idea works for you!

Let me know if there is anything else I can help with!
Thanks,
--Tim

(In reply to Tim McMullan from comment #7)
> Hey Mike,
>
> I just wanted to check in and see if there was anything else on this I could
> help you with, and if that idea works for you!
>
> Let me know if there is anything else I can help with!
> Thanks,
> --Tim

Hi Tim,

I will be running some tests next week. I need to create an updated container image that includes the Slurm user AND I need to add in the munge socket/library. I will update this ticket once I have some results.

Thank you, and enjoy the 4th of July weekend.

Hi! Thank you, sounds good, and I hope you have a good 4th of July weekend!
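For reference, a minimal sketch of the workaround Tim describes above. The alternate config path, the SlurmUser substitution, and the bind-mount locations are illustrative assumptions and will differ per site; only the general shape (a container-only slurm.conf pointed at via SINGULARITYENV_SLURM_CONF, plus the munge socket/library mapped in) comes from the ticket:

    # 1. Make a container-only copy of slurm.conf whose SlurmUser exists inside
    #    the container (e.g. root). Only the client commands read this copy;
    #    the real slurm.conf on the hosts is left untouched.
    cp /etc/slurm/slurm.conf /etc/slurm/slurm.conf.container
    sed -i 's/^SlurmUser=.*/SlurmUser=root/' /etc/slurm/slurm.conf.container

    # 2. Point the commands inside the container at that copy. Variables with
    #    the SINGULARITYENV_ prefix are exported into the container without
    #    the prefix, so SLURM_CONF is set for every process in the container.
    export SINGULARITYENV_SLURM_CONF=/etc/slurm/slurm.conf.container

    # 3. Bind in the munge socket and client library so the commands can
    #    authenticate (the library path is distro-specific).
    singularity exec \
        --bind /run/munge,/lib/x86_64-linux-gnu/libmunge.so.2 \
        <container> script.sh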
Hi, I just wanted to check in and see if you had any luck with the new image. Thanks, and I hope you had a good weekend!
--Tim

Hello Tim,

I was able to test a container that included embedded slurm and munge users and the additional mapping of libmunge and /var/run/munge. I am able to submit jobs and interact with the Slurm master from within Singularity. I think we can close this ticket now.

Thank you,
-Mike Moore

That's good to hear! Thanks Mike, I'll close this ticket out now!
--Tim
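Mike's final fix adds the slurm and munge accounts when the Singularity image is built. A rough sketch of what that might look like in the image build (e.g. the %post section of a definition file) is below; the UID/GID values are placeholders and, if used, should match the accounts on the hosts so that munge socket ownership and job credentials line up:

    # Create the system accounts the Slurm client commands and munge expect.
    # UIDs/GIDs here are illustrative only.
    groupadd --system --gid 990 munge
    useradd  --system --uid 990 --gid munge --shell /sbin/nologin munge
    groupadd --system --gid 991 slurm
    useradd  --system --uid 991 --gid slurm --shell /bin/false slurm

    # At run time the munge socket and library from the host are still mapped
    # in, as in the earlier example:
    #   singularity exec --bind /run/munge,/lib/x86_64-linux-gnu/libmunge.so.2 <container> script.sh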
> The error you are seeing now is because the config parser expects that the
> "SlurmUser" is valid, and you don't have that in the containers. There are
> a couple options for dealing with it, one being to try to make sure there is
> a slurm user in the containers (which I know is not ideal). However, for
> just those user commands setting "SlurmUser" to a user that exists in all of
> the containers (eg root) would allow them to parse the config. I was able
> to do this by setting SINGULARITYENV_SLURM_CONF before spawning the
> container.
>
> Thanks!
> --Tim
Would you share how to use SINGULARITYENV_SLURM_CONF?
(In reply to Robert Kudyba from comment #13)
> Would you share how to use SINGULARITYENV_SLURM_CONF?

"SINGULARITYENV_" is a prefix for environment variables: any variable set with this prefix before the container is spawned will be set inside the container without the prefix. For example, with SINGULARITYENV_SLURM_CONF="foo" singularity exec ..., SLURM_CONF will be set to "foo" inside the container. There is some documentation on this at https://singularity-userdoc.readthedocs.io/en/latest/environment_and_metadata.html

If you are having an issue that this ability does not solve, please feel free to open a new ticket so someone can look further into it for you!

Thanks,
--Tim
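A quick way to see the prefix mechanism in action; the config path here is only a placeholder, not a path from this ticket:

    # Any variable exported with the SINGULARITYENV_ prefix appears inside the
    # container without the prefix.
    export SINGULARITYENV_SLURM_CONF=/etc/slurm/slurm.conf.container
    singularity exec <container> /bin/sh -c 'echo "SLURM_CONF=$SLURM_CONF"'
    # Expected output: SLURM_CONF=/etc/slurm/slurm.conf.container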
Created attachment 14788 [details]
Configuration from Nuance EU Tools cluster

Nuance HPC is running several Slurm clusters in Microsoft Azure, using CycleCloud to orchestrate the deployment and dynamic provisioning of the clusters. Our users' workflows self-submit jobs/tasks from within Singularity containers. To enable this, we have set up our singularity.conf configuration file to map in the various Slurm commands, /usr/lib64/libslurm.so, /usr/lib64/libdrmaa.so, and the Slurm plugins. The users use a combination of the Slurm commands and a DRMAA-based wrapper to submit jobs.

However, we have run into an issue when using any of the commands. From within Singularity, we get the error:

Singularity> sstat
sstat: error: Invalid user for SlurmUser slurm, ignored
sstat: fatal: Unable to process configuration file

We did not build Slurm into the containers, as we would have to support multiple versions of the Slurm clients running against the Slurm masters. We are currently using Slurm 18.08, but will be upgrading to version 19 in the near future to support Microsoft's spot pricing.

Can you provide us with any best practices for using Slurm from within a container, and review our configuration for any issues? We are fairly new to Slurm, as our on-premise environments are Univa Grid Engine based. This same model of self-submission does work on-premise by mapping in the configuration/tools in the same manner.

Thank you,
-Mike Moore
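For context, "bind path" entries in singularity.conf along these lines would produce the mapping described above. The command and plugin locations shown are assumptions based on the description (and on the munge discussion later in the ticket), not the site's actual configuration:

    # Illustrative singularity.conf bind entries; adjust paths to match how
    # Slurm is packaged on the hosts.
    bind path = /etc/slurm
    bind path = /usr/bin/sbatch
    bind path = /usr/bin/srun
    bind path = /usr/bin/squeue
    bind path = /usr/bin/sstat
    bind path = /usr/lib64/libslurm.so
    bind path = /usr/lib64/libdrmaa.so
    bind path = /usr/lib64/slurm
    # Munge socket and library, per the discussion above:
    bind path = /run/munge
    bind path = /lib/x86_64-linux-gnu/libmunge.so.2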