Ticket 9282 - Slurm commands fail when run in Singularity container with the error "Invalid user for SlurmUser slurm"
Summary: Slurm commands fail when run in Singularity container with the error "Invalid user for SlurmUser slurm"
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: - Unsupported Older Versions
Hardware: Linux Linux
Severity: 2 - High Impact
Assignee: Tim McMullan
QA Contact:
URL:
Duplicates: 9281
Depends on:
Blocks:
 
Reported: 2020-06-25 16:21 MDT by Nuance HPC Grid Admins
Modified: 2024-07-03 11:32 MDT
CC List: 1 user

See Also:
Site: Nuance
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: CentOS
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Configuration from Nuance EU Tools cluster (1.58 KB, text/plain)
2020-06-25 16:21 MDT, Nuance HPC Grid Admins

Description Nuance HPC Grid Admins 2020-06-25 16:21:47 MDT
Created attachment 14788
Configuration from Nuance EU Tools cluster

Nuance HPC is running several Slurm clusters in Microsoft Azure using Cyclecloud to orchestrate the deployment and dynamic provisioning of the clusters.  Our users' workflows self-submit jobs/tasks from within Singularity containers.  To enable this, we have set up our singularity.conf configuration file to map in the various Slurm commands, /usr/lib64/libslurm.so, /usr/lib64/libdrmaa.so, and the slurm plugins.  The users use a combination of the Slurm commands and a DRMAA-based wrapper to submit jobs.
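
For reference, the relevant part of our singularity.conf looks roughly like the following; the paths shown are illustrative rather than our exact production entries:

    # singularity.conf excerpt (illustrative paths): bind the Slurm client
    # pieces from the host into every container
    bind path = /etc/slurm
    bind path = /usr/bin/sbatch
    bind path = /usr/bin/srun
    bind path = /usr/bin/squeue
    bind path = /usr/bin/sstat
    bind path = /usr/lib64/libslurm.so
    bind path = /usr/lib64/libdrmaa.so
    bind path = /usr/lib64/slurm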

However, we have run into an issue when using any of the commands.  From within singularity, we get the error:

Singularity> sstat
sstat: error: Invalid user for SlurmUser slurm, ignored
sstat: fatal: Unable to process configuration file

We did not build slurm into the containers as we would have to support multiple versions of Slurm clients running against the slurm masters.  

We are currently using Slurm 18.08, but will be upgrading to version 19 in the near future to support Microsoft's spot pricing.

Can you provide us with any best practices for using Slurm from within a container, and review our configuration for any issues?  We are fairly new to Slurm, as our on-premise environments are Univa Grid Engine based.  This same model of self-submission does work on-premise by mapping in the configuration/tools in the same manner.

Thank you
-Mike Moore
Comment 1 Jason Booth 2020-06-25 16:22:40 MDT
*** Ticket 9281 has been marked as a duplicate of this ticket. ***
Comment 2 Nuance HPC Grid Admins 2020-06-25 16:29:15 MDT
Hi Jason,

  Per your question, our containers do not include the Slurm user.  Our developers decided to use Docker for all container development without considering the security implications of running containers in a shared multi-user environment with HIPAA/PII data.  We need to enforce the restriction that the container MUST run under the submitter's user/group IDs by default, which you cannot do with Docker.

  Because the developers have pulled in base container images that are outside of the HPC team's control, we have no method to introduce the slurm user into the original Docker containers, nor to add it into the final Singularity image that HPC requires the users to use.
Comment 3 Nuance HPC Grid Admins 2020-06-25 16:30:13 MDT
The slurm user is created as an integral part of the slurm packages our cloud team built.
Comment 4 Jason Booth 2020-06-25 17:32:33 MDT
Mike - just to clarify here: are you using these Singularity containers as submit hosts only, or as both submit hosts and the scheduler?
Comment 5 Nuance HPC Grid Admins 2020-06-26 07:24:11 MDT
The slurm master is not running in a container.  Our bare-metal environment is very stripped down.  The requirement is that all users need to work in a container.  The users are trying to submit jobs both interactively on the login node, and non-interactively as steps in active jobs.

So, they submit a job to run a script. The submitted command would resemble:

    singularity exec <container> script.sh

In script.sh, there are steps that call srun, sstat, or the DRMAA submit wrapper.

Without singularity, we would be using compute nodes as submit hosts.
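
A stripped-down sketch of what such a script looks like; the individual step commands below are placeholders, not our actual workflow:

    #!/bin/bash
    # script.sh (illustrative): runs inside the container and self-submits work
    srun --ntasks=1 ./prepare_data.sh        # placeholder step under the current allocation
    sbatch --wrap="./follow_up_step.sh"      # placeholder self-submitted follow-up job
    sstat --jobs="$SLURM_JOB_ID"             # check the steps of the running job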
Comment 6 Tim McMullan 2020-06-26 11:50:27 MDT
Hey Mike!

Thanks for all the information on how you have things set up!

It sounds like you have the slurm commands, the config, and most of the important files mapped in.  If you haven't already, you will likely also need to map in the munge socket and library (I mapped /run/munge and /lib/x86_64-linux-gnu/libmunge.so.2).  I also looked at your slurm.conf and didn't see anything out of place.
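
In singularity.conf terms, that is two more bind entries along these lines; note that the libmunge path varies by distro, and on CentOS it is typically under /usr/lib64 rather than /lib/x86_64-linux-gnu:

    # additional singularity.conf binds for munge (illustrative paths)
    bind path = /run/munge
    bind path = /usr/lib64/libmunge.so.2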

I should note that for the most part the support staff handles bugs, development issues, and configuration problems (we are largely developers), and we don't handle a lot of integration issues like this and aren't singularity experts.  That said, I do have an idea on handling this!

The error you are seeing now is because the config parser expects that the "SlurmUser" is valid, and you don't have that user in the containers.  There are a couple of options for dealing with it, one being to try to make sure there is a slurm user in the containers (which I know is not ideal).  However, for just those user commands, setting "SlurmUser" to a user that exists in all of the containers (e.g. root) would allow them to parse the config.  I was able to do this by setting SINGULARITYENV_SLURM_CONF before spawning the container, so the commands inside it read a modified copy of the config.
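
The rough idea is a second copy of slurm.conf that only the containers read; the path below is just an illustration, not something from your configuration:

    # /etc/slurm/slurm.conf.container (illustrative path) - identical to the
    # cluster slurm.conf except that SlurmUser names a user every container has
    SlurmUser=root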

Thanks!
--Tim
Comment 7 Tim McMullan 2020-06-30 06:10:52 MDT
Hey Mike,

I just wanted to check in and see if there was anything else on this I could help you with, and if that idea works for you!

Let me know if there is anything else I can help with!
Thanks,
--Tim
Comment 8 Nuance HPC Grid Admins 2020-07-01 11:20:25 MDT
(In reply to Tim McMullan from comment #7)
> Hey Mike,
> 
> I just wanted to check in and see if there was anything else on this I could
> help you with, and if that idea works for you!
> 
> Let me know if there is anything else I can help with!
> Thanks,
> --Tim

Hi Tim,

  I will be running some tests next week.  I need to create an updated container image that includes the Slurm user AND I need to add in the munge socket/library.

I will update this once I have some results.

Thank you and enjoy the 4th of July weekend.
Comment 9 Tim McMullan 2020-07-01 11:25:48 MDT
Hi!

Thank you, sounds good, and I hope you have a good 4th of July weekend!
Comment 10 Tim McMullan 2020-07-07 09:37:27 MDT
Hi,

I just wanted to check in and see if you had any luck with the new image.

Thanks, and I hope you had a good weekend!
--Tim
Comment 11 Nuance HPC Grid Admins 2020-07-09 12:24:02 MDT
Hello Tim,

  I was able to test a container that included embedded slurm and munge users and the additional mapping of libmunge and /var/run/munge.  I am able to submit jobs/interact with the slurm master from within Singularity.

I think we can close this ticket now.

Thank you 
-Mike Moore
Comment 12 Tim McMullan 2020-07-09 13:05:06 MDT
That's good to hear!

Thanks Mike, I'll close this ticket out now!
--Tim
Comment 13 Robert Kudyba 2024-07-03 09:06:14 MDT
> The error you are seeing now is because the config parser expects that the
> "SlurmUser" is valid, and you don't have that in the containers.  There are
> a couple options for dealing with it, one being to try to make sure there is
> a slurm user in the containers (which I know is not ideal).  However, for
> just those user commands setting "SlurmUser" to a user that exists in all of
> the containers (eg root) would allow them to parse the config.  I was able
> to do this by setting SINGULARITYENV_SLURM_CONF before spawning the
> container.
> 
> Thanks!
> --Tim
Would you share how to use SINGULARITYENV_SLURM_CONF?
Comment 14 Tim McMullan 2024-07-03 11:32:30 MDT
(In reply to Robert Kudyba from comment #13)
> > The error you are seeing now is because the config parser expects that the
> > "SlurmUser" is valid, and you don't have that in the containers.  There are
> > a couple options for dealing with it, one being to try to make sure there is
> > a slurm user in the containers (which I know is not ideal).  However, for
> > just those user commands setting "SlurmUser" to a user that exists in all of
> > the containers (eg root) would allow them to parse the config.  I was able
> > to do this by setting SINGULARITYENV_SLURM_CONF before spawning the
> > container.
> > 
> > Thanks!
> > --Tim
> Would you share how to use SINGULARITYENV_SLURM_CONF?

"SINGULARITYENV_" is a prefix for environment variables that if set beforehand will set the following variable in the container.

For example, with SINGULARITYENV_SLURM_CONF="foo" singularity exec ..., SLURM_CONF will be set to "foo" inside the container.
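
A minimal example; the config path here is only an illustration, and it needs to exist (be mapped) inside the container:

    $ export SINGULARITYENV_SLURM_CONF=/etc/slurm/slurm.conf.container
    $ singularity exec my_container.sif sinfo

Inside the container, SLURM_CONF points at /etc/slurm/slurm.conf.container, so the Slurm commands parse that copy instead of the default location.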

There is some documentation on this at https://singularity-userdoc.readthedocs.io/en/latest/environment_and_metadata.html

If you are having an issue that this ability does not solve, please feel free to open a new ticket so someone can look further into the issue for you!

Thanks,
--Tim