Ticket 5037

Summary: Failed to load current user environment variables
Product: Slurm Reporter: Kilian Cavalotti <kilian>
Component: slurmd Assignee: Ricard Zarco Badia <ricard>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue    
Priority: --- CC: brian, felip.moll, marshall, matthews, mhautreux, ricard
Version: 17.11.5   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5928
https://bugs.schedmd.com/show_bug.cgi?id=6042
https://bugs.schedmd.com/show_bug.cgi?id=7734
https://bugs.schedmd.com/show_bug.cgi?id=8478
https://bugs.schedmd.com/show_bug.cgi?id=18229
https://bugs.schedmd.com/show_bug.cgi?id=19040
https://bugs.schedmd.com/show_bug.cgi?id=19898
Site: Stanford
Version Fixed: 24.05
Attachments: slurmd logs

Description Kilian Cavalotti 2018-04-06 11:46:57 MDT
Created attachment 6553 [details]
slurmd logs

Hi there, 

We seem to be getting a bunch of job failures lately, due to the following:


error: Failed to load current user environment variables
error: _get_user_env: Unable to get user's local environment, running only with passed environment


I've added requeue_setup_env_fail to SchedulerParameters to make sure those nodes get drained and flagged, and increased SlurmdSyslogDebug to debug3 to catch more information, but I can't seem to figure out the root cause. I don't understand why the user environment can't be retrieved.
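For reference, the corresponding slurm.conf fragment (requeue_setup_env_fail is the full option name, as used later in this ticket; any other SchedulerParameters options a site already sets would be kept alongside it):

```
SchedulerParameters=requeue_setup_env_fail
SlurmdSyslogDebug=debug3
```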

I'm attaching the full logs from a compute node where this happened, and would really appreciate if you could take a look and maybe give some pointers to understand where this error is coming from.

Thanks!
-- 
Kilian
Comment 2 Marshall Garey 2018-04-06 14:29:38 MDT
The "Failed to load current user environment variables" error starts happening after this:

Apr  5 23:16:07 sh-101-44 slurmstepd[82988]: error: Could not open stdout file /home/users/chodgson/oil/slurm-13570233.out: No space left on device
Apr  5 23:16:07 sh-101-44 slurmstepd[82988]: error: IO setup failed: No space left on device

Could you check if you're running out of disk space?

Can you also ssh into the node and try to run these commands:

[become root]
$ su <username> -c env

and see what happens?
Comment 3 Kilian Cavalotti 2018-04-06 15:15:12 MDT
Hi Marshall, 

Thanks for getting back to me.

(In reply to Marshall Garey from comment #2)
> The "Failed to load current user environment variables" error starts
> happening after this:
> 
> Apr  5 23:16:07 sh-101-44 slurmstepd[82988]: error: Could not open stdout
> file /home/users/chodgson/oil/slurm-13570233.out: No space left on device
> Apr  5 23:16:07 sh-101-44 slurmstepd[82988]: error: IO setup failed: No
> space left on device
> 
> Could you check if you're running out of disc space?

Ah, that's true, but coincidental and not related, I think.

That user indeed exceeded his home directory quota at some point, and for what it's worth, our NFS filer that serves $HOME replies with ENOSPC rather than EDQUOT when users reach their quota. So it's not related to a system drive on the node, it's really just a user quota being exceeded.

The same "Failed to load current user environment variables" also happens for other users, who are nowhere near their $HOME quotas.

> Can you also ssh into the node and try to run these commands:
> 
> [become root]
> $ su <username> -c env
> 
> and see what happens?

It works normally and displays the full environment of <username>.


So, digging a bit deeper in this, it looks like our system-wide profile is being executed each time a job starts, by the env_array_user_default() function in env.c, which calls "su - <username>".
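For reference, the capture trick behind that code path can be sketched like this. This is a rough, unprivileged approximation: plain bash stands in for "su - <username>", and the _ENV_MARKER_ string is made up here, not Slurm's actual marker.

```shell
# Approximate what env_array_user_default() does: run a shell, print a
# known marker, dump the environment, and keep only what follows the marker.
out=$(bash -c 'echo _ENV_MARKER_; /usr/bin/env')
# Print from the marker line to the end, then drop the marker line itself.
envpart=$(printf '%s\n' "$out" | sed -n '/_ENV_MARKER_/,$p' | sed '1d')
printf '%s\n' "$envpart" | head -n 3
```

The marker is what lets the caller separate profile-script chatter from the actual environment dump; it's also why a profile that hangs before the dump leaves the caller waiting for the timeout.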

In this profile, we set the user software environment via Lmod, which sits on a NFS mount, and apparently, when many jobs (or user sessions) start simultaneously on a node, the time it takes for the profile to be loaded increases.

For instance, with a single user session, we can get a new session initialized in ~0.6s

[root@sh-101-44 kilian]# for i in {1..1}; do time /bin/su - kilian -c /usr/bin/env & done |& grep real; wait
real    0m0.679s


When running 10 simultaneously, the run time increases to about 2s per session:

[root@sh-101-44 kilian]# for i in {1..10}; do time /bin/su - kilian -c /usr/bin/env & done |& grep real; wait
real    0m2.021s
real    0m2.020s
real    0m2.019s
real    0m2.022s
real    0m2.046s
real    0m2.061s
real    0m2.061s
real    0m2.064s
real    0m2.064s
real    0m2.075s


And for 32, we're at more than 6s per session.

[root@sh-101-44 kilian]# for i in {1..32}; do time /bin/su - kilian -c /usr/bin/env & done |& grep real; wait
real    0m6.355s
real    0m6.368s
real    0m6.360s
real    0m6.371s
real    0m6.368s
real    0m6.389s
real    0m6.394s
real    0m6.405s
real    0m6.400s
real    0m6.423s
real    0m6.426s
real    0m6.437s
real    0m6.449s
real    0m6.452s
real    0m6.470s
real    0m6.479s
real    0m6.481s
real    0m6.486s
real    0m6.494s
real    0m6.503s
real    0m6.522s
real    0m6.522s
real    0m6.537s
real    0m6.557s
real    0m6.549s
real    0m6.547s
real    0m6.550s
real    0m6.561s
real    0m6.562s
real    0m6.565s
real    0m6.574s
real    0m6.605s


I'll look at possible ways to limit this, but it could also be beneficial to increase the /bin/su timeout that appears to be triggered, as the logs show:

Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:01 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:01 sh-101-44 slurmd[125185]: error: Failed to load current user environment variables
Apr  6 10:30:01 sh-101-44 slurmd[125185]: error: _get_user_env: Unable to get user's local environment, running only with passed environment
Apr  6 10:30:01 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:01 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr  6 10:30:01 sh-101-44 slurmd[125185]: error: Failed to load current user environment variables
Apr  6 10:30:01 sh-101-44 slurmd[125185]: error: _get_user_env: Unable to get user's local environment, running only with passed environment
Apr  6 10:30:01 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete


My current feeling is that many jobs starting at the same time, and maybe loading a lot of software modules, are provoking some sort of contention in the Lmod part of our profile, which makes su timeout.

Is there a way to increase that /bin/su timeout?

Thanks!
-- 
Kilian
Comment 4 Marshall Garey 2018-04-06 16:27:24 MDT
It looks like this should only be called when using the --get-user-env option for sbatch. Are you hitting this call path without the --get-user-env sbatch option?

The timeout can be changed with the GetEnvTimeout slurm.conf parameter. The default is 2 seconds, so you're definitely exceeding that in some cases. The man page entry says it is "used for Moab scheduled jobs only" - ignore that. That was documented in 2008 - things have clearly changed in the past 10 years.

The sbatch man page entry for --get-user-env says that 8 seconds is the default, if not specified. But it actually looks like it uses the configuration default of 2 seconds. So we'll need to fix that, too.
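So raising the timeout would just be (10 is an arbitrary pick, simply above the ~6 s worst case measured earlier in this ticket):

```
GetEnvTimeout=10
```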
Comment 5 Kilian Cavalotti 2018-04-06 16:34:20 MDT
(In reply to Marshall Garey from comment #4)
> It looks like this should only be called when using the --get-user-env
> option for sbatch. Are you getting this this call path without the
> --get-user-env sbatch option?

Yes: we don't use the --get-user-env sbatch option, and the su timeout messages definitely appear in the slurmd logs without that option.

> The timeout can be changed with the GetEnvTimeout slurm.conf parameter. The
> default is 2 seconds, so you're definitely exceeding that in some cases. The
> man page entry says it is "used for Moab scheduled jobs only" - ignore that.
> That was documented in 2008 - things have clearly changed in the past 10
> years.
> 
> The sbatch man page entry for --get-user-env says that 8 seconds is the
> default, if not specified. But it actually looks like it uses the
> configuration default of 2 seconds. So we'll need to fix that, too.

Ok, I'll try to set GetEnvTimeout to 10 seconds and see how things go. An update of the documentation would definitely be welcome, because the Moab reference makes it easy to dismiss the option entirely. :)

Thanks!
-- 
Kilian
Comment 6 Kilian Cavalotti 2018-04-09 09:08:31 MDT
(In reply to Kilian Cavalotti from comment #5)
> Ok, I'll try to set GetEnvTimeout to 10 seconds and see how things go. An
> update of the documentation would definitely be welcome, because the Moab
> reference makes it easy to dismiss the option entirely. :)

Quick update: we increased the GetEnvTimeout value to 30 seconds, but we're still seeing occasional nodes drained with "Job env setup error" and the same errors in the logs:

Apr  6 23:28:24 sh-09-06 slurmd[1017]: error: Failed to load current user environment variables
Apr  6 23:28:24 sh-09-06 slurmd[1017]: error: _get_user_env: Unable to get user's local environment, running only with passed environment

Is there any way we could get some more information about what's really causing that issue? I feel like a 30s timeout on "su" is not the real cause here.

Thanks!
-- 
Kilian
Comment 7 Kilian Cavalotti 2018-04-09 15:27:08 MDT
Hi there, 

Side question: from the code, it looks like a _load_env_cache() function is called in case the user environment can't be readily obtained.

This function tries to read from <StateSaveLocation>/env_cache/<username>, except:
1. this directory is not created anywhere (neither on the Slurm controller nor on the compute nodes),
2. I can't find any reference to any function populating this cache.

So, how is all that user environment caching supposed to work?
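For the record, here's the lookup I'd expect that code to perform, given the path layout above. The slurm.conf line is a throwaway sample so the sketch runs anywhere; a real site would read its actual config instead.

```shell
# Locate <StateSaveLocation>/env_cache/<username> as read from the code.
conf=$(mktemp)
echo 'StateSaveLocation=/var/spool/slurmctld' > "$conf"   # sample value
statedir=$(awk -F= '/^StateSaveLocation/ {print $2}' "$conf")
cache="$statedir/env_cache/$USER"
echo "would read: $cache"
# On our systems this file never exists, since nothing ever creates it:
[ -r "$cache" ] || echo "no cache file for $USER"
rm -f "$conf"
```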


This is actually getting a bit worrisome, as enabling requeue_setup_env_fail in SchedulerParameters exposes the fact that this error happens pretty often, on several of our clusters, running different versions of Slurm.

I'm raising this ticket's Sev, as we're seeing many nodes being drained because of this, and would really like to identify the root cause.

Thanks!
-- 
Kilian
Comment 8 Marshall Garey 2018-04-10 14:15:42 MDT
So the function env_array_user_default() is called not only by using the --get-user-env option, but also by using --export=NONE or --export=<specific environment variable(s)> (or by using the sbatch input environment variable equivalents). With these options, the user environment is not passed from the client, so the compute node needs to get the user environment locally or else it will be empty.

Are any of these options being used? They're fine to use, I just want to make sure that is what is happening.
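As a loose analogy (this is plain env(1), not Slurm itself): --export=NONE behaves as if the step started with an empty environment, which is why slurmd then has to rebuild one locally.

```shell
# With an emptied environment, almost nothing is defined:
env -i /usr/bin/env | wc -l
# The default behavior is closer to passing the caller's environment through:
FOO=bar /usr/bin/env | grep '^FOO='
```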

I looked into the environment cache in StateSaveLocation. There was a separate program that would build environment files for each user and save it in StateSaveLocation. It is in 17.02 but was taken out before 17.11 was released. See this commit:

3e3cd30fccd204c203bc589abd2e85c69889e055

It was probably taken out because of comments that said it was just used for Moab. That was needed when Moab was used as a scheduler on top of Slurm. That's long past, but as I explained in the first paragraph of the post, there are certain cases where the user environment isn't passed from the client, so the compute node still needs to get the user environment.

Regardless, the cache builder program was taken out, so _load_env_cache() doesn't do anything anymore, since that environment cache file is never created. It seems there are some loose ends in the code and the documentation that can be cleaned up for 18.08, which I can take care of in separate patches.

To address why "su" is taking so long - I don't know, but I agree that is the real issue. Is it possible that there's an issue with LDAP that's causing "su" to take forever?
Comment 9 Kilian Cavalotti 2018-04-10 14:32:54 MDT
Hi Marshall, 

(In reply to Marshall Garey from comment #8)
> So the function env_array_user_default() is called not only by using the
> --get-user-env option, but also by using --export=NONE or --export=<specific
> environment variable(s)> (or by using the sbatch input environment variable
> equivalents). With these options, the user environment is not passed from
> the client, so the compute node needs to get the user environment locally or
> else it will be empty.
> 
> Are any of these options being used? They're fine to use, I just want to
> make sure that is what is happening.

Our users generally don't use the --get-user-env nor the --export options. And I checked that the jobs that generated the su timeout and env error messages didn't either.

> I looked into the environment cache in StateSaveLocation. There was a
> separate program that would build environment files for each user and save
> it in StateSaveLocation. It is in 17.02 but was taken out before 17.11 was
> released. See this commit:
> 
> 3e3cd30fccd204c203bc589abd2e85c69889e055
> 
> It was probably taken out because of comments that said it was just used for
> Moab. That was needed when Moab was used as a scheduler on top of Slurm.
> That's long past, but as I explained in the first paragraph of the post,
> there are certain cases where the user environment isn't passed from the
> client, so the compute node still needs to get the user environment.
> 
> Regardless, the cache builder program was taken out, so _load_env_cache()
> doesn't do anything anymore, since that environment cache file is never
> created. It seems there are some loose ends in the code and the
> documentation that can be cleaned up for 18.08, which I can take care of in
> separate patches.

All right, thanks for the clarification.

> To address why "su" is taking so long - I don't know, but I agree that is
> the real issue. Is it possible that there's an issue with LDAP that's
> causing "su" to take forever?

It's possible, although I can't seem to reproduce the issue outside of Slurm.

My current focus is on the shell profile scripts, and specifically the parts that set the modules environment up for our users. We use Lmod and I'm looking into ways to speed up the initialization of the module environment.

What would be really helpful, though, is to make sure that there's a way in slurm.conf to control the timeout for those "su -" commands. It doesn't look like GetEnvTimeout has much influence: I increased it way past the maximum values I could see interactively by opening many user sessions in parallel, and still got the su timeout messages in slurmd logs. Which makes me think it somehow continues to use the default 2 sec.

Thanks!
-- 
Kilian
Comment 10 Marshall Garey 2018-04-10 15:34:48 MDT
You're right, changing GetEnvTimeout isn't affecting it at all: the "timeout" parameter to env_array_user_default() is hard-coded to 120 seconds when called from slurmd. So although the function reads GetEnvTimeout from the config file, it doesn't use it, because the function parameter "timeout" is non-zero. My bad for not noticing this before.

So the timeout should be neither 2 seconds nor 30 seconds, but 120 seconds, if it's being called from slurmd. The only other place it's called from is salloc.

Can you send an example user command that's triggering this?

Below is a patch that prints the actual timeout value (in seconds). Can you also apply this patch and restart a slurmd, then run a job on that node that triggers this error?

diff --git a/src/common/env.c b/src/common/env.c
index bb8620e027..50ab8ed87e 100644
--- a/src/common/env.c
+++ b/src/common/env.c
@@ -2145,6 +2145,7 @@ char **env_array_user_default(const char *username, int timeout, int mode,
        found = 0;
        buf_read = 0;
        buffer = xmalloc(ENV_BUFSIZE);
+       info("su timeout = %d", timeout);
        while (1) {
                gettimeofday(&now, NULL);
                timeleft = timeout * 1000;
Comment 11 Marshall Garey 2018-04-18 09:44:52 MDT
Hi Kilian, have you been able to make any progress on this ticket?
Comment 12 Kilian Cavalotti 2018-04-18 09:53:41 MDT
Hi Marshall, 

(In reply to Marshall Garey from comment #11)
> Hi Kilian, have you been able to make any progress on this ticket?

Yes, but mostly outside of Slurm, by working on and optimizing the way we create our user environment. We modified our default profile scripts, tweaked Lmod and used more aggressive caching to limit the effect of starting multiple user sessions simultaneously. And the "Failed to load current user environment variables" error message basically disappeared.
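The general shape of the caching we ended up with looks something like this. It's a simplified sketch: our real change lives inside Lmod and the site profile, and MODULEPATH_DEMO is a made-up variable for illustration.

```shell
# Replay cached exports when present; only pay the slow setup cost once.
cachefile="${TMPDIR:-/tmp}/envcache.$$"
slow_setup() {
    # stand-in for the expensive Lmod initialization
    export MODULEPATH_DEMO="/share/software/modules"
}
if [ -f "$cachefile" ]; then
    . "$cachefile"                      # fast path: replay cached exports
else
    slow_setup                          # slow path: run once, then cache
    printf 'export MODULEPATH_DEMO="%s"\n' "$MODULEPATH_DEMO" > "$cachefile"
fi
echo "MODULEPATH_DEMO=$MODULEPATH_DEMO"
rm -f "$cachefile"
```

The point is that simultaneous "su - <user>" sessions mostly hit the fast path, so session startup time no longer grows with the number of concurrent jobs.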

So I guess we're good now, but I would like to put in an enhancement request to:
1. transform that hard-coded 120 seconds timeout in a settable configuration option (or maybe reuse GetEnvTimeout)
2. add more information in the error message, such as the jobid, and the value of the timeout that has been reached.

Thanks!
-- 
Kilian
Comment 13 Marshall Garey 2018-04-18 10:06:36 MDT
That's great, I'm glad it's working now.

I'm re-tagging this ticket as an enhancement request.
Comment 16 Nate Rini 2018-11-01 10:49:03 MDT
*** Ticket 5928 has been marked as a duplicate of this ticket. ***
Comment 21 Ricard Zarco Badia 2024-07-29 07:46:15 MDT
*** Ticket 18229 has been marked as a duplicate of this ticket. ***
Comment 28 Ricard Zarco Badia 2025-04-10 06:50:38 MDT
This ticket has finally reached a conclusion. Basically, we have removed all old code regarding environment caches, simplified the surviving code and added a new behavior for jobs that fail at the environment retrieval stage (requeueing them as HELD instead of letting them execute with a wrong env).

We have decided to set the timeout of this env retrieval stage to 120 seconds and it will not be configurable, since we feel that it is more than enough time for this. In case of timeouts of this kind, users should make sure that there is nothing weird in their startup scripts. We have made some documentation changes to reflect that.

All of this has been done in commits eabe4f52a[1] .. 88a61d5e[2], which will be present in the 24.05 release.

Closing this ticket as fixed.

[1] https://github.com/SchedMD/slurm/commit/eabe4f52a958724f680ebdf93325d3b2421ad82c
[2] https://github.com/SchedMD/slurm/commit/88a61d5e520b52251757b354ca388755066de149
Comment 30 Kilian Cavalotti 2025-04-14 09:47:09 MDT
(In reply to Ricard Zarco Badia from comment #28)
> This ticket has finally reached a conclusion. Basically, we have removed all
> old code regarding environment caches, simplified the surviving code and
> added a new behavior for jobs that fail at the environment retrieval stage
> (requeueing them as HELD instead of letting them execute with a wrong env).
> 
> We have decided to set the timeout of this env retrieval stage to 120
> seconds and it will not be configurable, since we feel that it is more than
> enough time for this. In case of timeouts of this kind, users should make
> sure that there is nothing weird in their startup scripts. We have made some
> documentation changes to reflect that.
> 
> All of this has been done in commits eabe4f52a[1] .. 88a61d5e[2], which will
> be present in the 24.05 release.
> 
> Closing this ticket as fixed.

Thanks Ricard!

Cheers,
--
Kilian