The "Failed to load current user environment variables" error starts happening after this:

Apr 5 23:16:07 sh-101-44 slurmstepd[82988]: error: Could not open stdout file /home/users/chodgson/oil/slurm-13570233.out: No space left on device
Apr 5 23:16:07 sh-101-44 slurmstepd[82988]: error: IO setup failed: No space left on device

Could you check if you're running out of disk space?

Can you also ssh into the node and try to run these commands:

[become root]
$ su <username> -c env

and see what happens?

Hi Marshall,

Thanks for getting back to me.

(In reply to Marshall Garey from comment #2)
> The "Failed to load current user environment variables" error starts
> happening after this:
>
> Apr 5 23:16:07 sh-101-44 slurmstepd[82988]: error: Could not open stdout
> file /home/users/chodgson/oil/slurm-13570233.out: No space left on device
> Apr 5 23:16:07 sh-101-44 slurmstepd[82988]: error: IO setup failed: No
> space left on device
>
> Could you check if you're running out of disk space?

Ah, that's true, but coincidental and not related, I think. That user indeed exceeded his home directory quota at some point, and for what it's worth, our NFS filer that serves $HOME replies with ENOSPC rather than EDQUOT when users reach their quota. So it's not related to a system drive on the node; it's really just a user quota being exceeded. The same "Failed to load current user environment variables" error also happens for other users who are nowhere near their $HOME quotas.

> Can you also ssh into the node and try to run these commands:
>
> [become root]
> $ su <username> -c env
>
> and see what happens?

It works normally and displays the full environment of <username>.

So, digging a bit deeper into this, it looks like our system-wide profile is being executed each time a job starts, by the env_array_user_default() function in env.c, which calls "su - <username>".
In this profile, we set the user software environment via Lmod, which sits on an NFS mount, and apparently, when many jobs (or user sessions) start simultaneously on a node, the time it takes for the profile to be loaded increases.

For instance, with a single user session, we can get a new session initialized in ~0.6s:

[root@sh-101-44 kilian]# for i in {1..1}; do time /bin/su - kilian -c /usr/bin/env & done |& grep real; wait
real	0m0.679s

When running 10 simultaneously, the run time increases to about 2s per session:

[root@sh-101-44 kilian]# for i in {1..10}; do time /bin/su - kilian -c /usr/bin/env & done |& grep real; wait
real	0m2.021s
real	0m2.020s
real	0m2.019s
real	0m2.022s
real	0m2.046s
real	0m2.061s
real	0m2.061s
real	0m2.064s
real	0m2.064s
real	0m2.075s

And for 32, we're at more than 6s per session:

[root@sh-101-44 kilian]# for i in {1..32}; do time /bin/su - kilian -c /usr/bin/env & done |& grep real; wait
real	0m6.355s
real	0m6.368s
real	0m6.360s
real	0m6.371s
real	0m6.368s
real	0m6.389s
real	0m6.394s
real	0m6.405s
real	0m6.400s
real	0m6.423s
real	0m6.426s
real	0m6.437s
real	0m6.449s
real	0m6.452s
real	0m6.470s
real	0m6.479s
real	0m6.481s
real	0m6.486s
real	0m6.494s
real	0m6.503s
real	0m6.522s
real	0m6.522s
real	0m6.537s
real	0m6.557s
real	0m6.549s
real	0m6.547s
real	0m6.550s
real	0m6.561s
real	0m6.562s
real	0m6.565s
real	0m6.574s
real	0m6.605s

I'll look at possible ways to limit this, but it could also be beneficial to increase the /bin/su timeout that appears to be triggered, as the logs show:

Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:00 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:01 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:01 sh-101-44 slurmd[125185]: error: Failed to load current user environment variables
Apr 6 10:30:01 sh-101-44 slurmd[125185]: error: _get_user_env: Unable to get user's local environment, running only with passed environment
Apr 6 10:30:01 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:01 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete
Apr 6 10:30:01 sh-101-44 slurmd[125185]: error: Failed to load current user environment variables
Apr 6 10:30:01 sh-101-44 slurmd[125185]: error: _get_user_env: Unable to get user's local environment, running only with passed environment
Apr 6 10:30:01 sh-101-44 slurmd[125185]: timeout waiting for /bin/su to complete

My current feeling is that many jobs starting at the same time, and maybe loading a lot of software modules, are provoking some sort of contention in the Lmod part of our profile, which makes su time out. Is there a way to increase that /bin/su timeout?

Thanks!
-- Kilian

It looks like this should only be called when using the --get-user-env option for sbatch. Are you hitting this call path without the --get-user-env sbatch option?
The timeout can be changed with the GetEnvTimeout slurm.conf parameter. The default is 2 seconds, so you're definitely exceeding that in some cases. The man page entry says it is "used for Moab scheduled jobs only" - ignore that; it was documented in 2008, and things have clearly changed in the past 10 years.

The sbatch man page entry for --get-user-env says that 8 seconds is the default if not specified, but it actually looks like it uses the configuration default of 2 seconds. So we'll need to fix that, too.

(In reply to Marshall Garey from comment #4)
> It looks like this should only be called when using the --get-user-env
> option for sbatch. Are you hitting this call path without the
> --get-user-env sbatch option?

Yes: we don't use the --get-user-env sbatch option, and the su timeout messages definitely appear in the slurmd logs without that option.

> The timeout can be changed with the GetEnvTimeout slurm.conf parameter. The
> default is 2 seconds, so you're definitely exceeding that in some cases. The
> man page entry says it is "used for Moab scheduled jobs only" - ignore that.
> That was documented in 2008 - things have clearly changed in the past 10
> years.
>
> The sbatch man page entry for --get-user-env says that 8 seconds is the
> default, if not specified. But it actually looks like it uses the
> configuration default of 2 seconds. So we'll need to fix that, too.

OK, I'll try to set GetEnvTimeout to 10 seconds and see how things go. An update of the documentation would definitely be welcome, because the Moab reference makes it easy to dismiss the option entirely. :)

Thanks!
-- Kilian

(In reply to Kilian Cavalotti from comment #5)
> Ok, I'll try to set GetEnvTimeout to 10 seconds and see how things go. An
> update of the documentation would definitely be welcome, because the Moab
> reference makes it easy to dismiss the option entirely.
> :)

Quick update: we increased the GetEnvTimeout value to 30 seconds, but we're still seeing occasional nodes drained with "Job env setup error" and the same errors in the logs:

Apr 6 23:28:24 sh-09-06 slurmd[1017]: error: Failed to load current user environment variables
Apr 6 23:28:24 sh-09-06 slurmd[1017]: error: _get_user_env: Unable to get user's local environment, running only with passed environment

Is there any way we could get some more information about what's really causing this issue? I feel like a 30s timeout on "su" is not the real cause here.

Thanks!
-- Kilian

Hi there,

Side question: it looks from the code like a _load_env_cache() function is called in case the user environment can't be readily obtained. This function tries to read from <StateSaveLocation>/env_cache/<username>, except:

1. this directory is not created anywhere (neither on the Slurm controller nor on the compute nodes), and
2. I can't find any reference to any function populating this cache.

So, how is that user environment caching supposed to work?

This is actually getting a bit worrisome, as enabling requeue_setup_env_fail in SchedulerParameters exposes the fact that this error happens pretty often, on several of our clusters, running different versions of Slurm. I'm raising this ticket's Sev, as we're seeing many nodes being drained because of this, and would really like to identify the root cause.

Thanks!
-- Kilian

So the function env_array_user_default() is called not only when using the --get-user-env option, but also when using --export=NONE or --export=<specific environment variable(s)> (or the sbatch input environment variable equivalents). With these options, the user environment is not passed from the client, so the compute node needs to get the user environment locally or else it will be empty.

Are any of these options being used? They're fine to use, I just want to make sure that is what is happening.

I looked into the environment cache in StateSaveLocation.
There was a separate program that would build environment files for each user and save it in StateSaveLocation. It is in 17.02 but was taken out before 17.11 was released. See this commit:

3e3cd30fccd204c203bc589abd2e85c69889e055

It was probably taken out because of comments that said it was just used for Moab. That was needed when Moab was used as a scheduler on top of Slurm. That's long past, but as I explained in the first paragraph of the post, there are certain cases where the user environment isn't passed from the client, so the compute node still needs to get the user environment.

Regardless, the cache builder program was taken out, so _load_env_cache() doesn't do anything anymore, since that environment cache file is never created. It seems there are some loose ends in the code and the documentation that can be cleaned up for 18.08, which I can take care of in separate patches.

To address why "su" is taking so long - I don't know, but I agree that is the real issue. Is it possible that there's an issue with LDAP that's causing "su" to take forever?

Hi Marshall,

(In reply to Marshall Garey from comment #8)
> So the function env_array_user_default() is called not only by using the
> --get-user-env option, but also by using --export=NONE or --export=<specific
> environment variable(s)> (or by using the sbatch input environment variable
> equivalents). With these options, the user environment is not passed from
> the client, so the compute node needs to get the user environment locally or
> else it will be empty.
>
> Are any of these options being used? They're fine to use, I just want to
> make sure that is what is happening.

Our users generally don't use the --get-user-env or --export options. And I checked that the jobs that generated the su timeout and env error messages didn't either.

> I looked into the environment cache in StateSaveLocation.
> There was a separate program that would build environment files for each
> user and save it in StateSaveLocation. It is in 17.02 but was taken out
> before 17.11 was released. See this commit:
>
> 3e3cd30fccd204c203bc589abd2e85c69889e055
>
> It was probably taken out because of comments that said it was just used for
> Moab. That was needed when Moab was used as a scheduler on top of Slurm.
> That's long past, but as I explained in the first paragraph of the post,
> there are certain cases where the user environment isn't passed from the
> client, so the compute node still needs to get the user environment.
>
> Regardless, the cache builder program was taken out, so _load_env_cache()
> doesn't do anything anymore, since that environment cache file is never
> created. It seems there are some loose ends in the code and the
> documentation that can be cleaned up for 18.08, which I can take care of in
> separate patches.

All right, thanks for the clarification.

> To address why "su" is taking so long - I don't know, but I agree that is
> the real issue. Is it possible that there's an issue with LDAP that's
> causing "su" to take forever?

It's possible, although I can't seem to reproduce the issue outside of Slurm. My current focus is on the shell profile scripts, and specifically the parts that set up the modules environment for our users. We use Lmod, and I'm looking into ways to speed up the initialization of the module environment.

What would be really helpful, though, is to make sure that there's a way in slurm.conf to control the timeout for those "su -" commands. It doesn't look like GetEnvTimeout has much influence: I increased it way past the maximum values I could observe interactively by opening many user sessions in parallel, and still got the su timeout messages in the slurmd logs. Which makes me think it somehow continues to use the default 2 seconds.

Thanks!
-- Kilian

You're right, changing GetEnvTimeout isn't affecting it at all: the "timeout" parameter to the function env_array_user_default() is hard-coded to 120 seconds when it is called from slurmd, so although the function reads GetEnvTimeout from the config file, it doesn't use it, because the function parameter "timeout" is non-zero. My bad for not noticing this before.
So the timeout is neither 2 seconds nor 30 seconds, but 120 seconds, when it's called from slurmd. The only other place it's called from is salloc.
Can you send an example user command that's triggering this?
Below is a patch that prints the actual timeout value (in seconds). Can you also apply this patch and restart a slurmd, then run a job on that node that triggers this error?
diff --git a/src/common/env.c b/src/common/env.c
index bb8620e027..50ab8ed87e 100644
--- a/src/common/env.c
+++ b/src/common/env.c
@@ -2145,6 +2145,7 @@ char **env_array_user_default(const char *username, int timeout, int mode,
 	found = 0;
 	buf_read = 0;
 	buffer = xmalloc(ENV_BUFSIZE);
+	info("su timeout = %d", timeout);
 	while (1) {
 		gettimeofday(&now, NULL);
 		timeleft = timeout * 1000;
Hi Kilian, have you been able to make any progress on this ticket?

Hi Marshall,

(In reply to Marshall Garey from comment #11)
> Hi Kilian, have you been able to make any progress on this ticket?

Yes, but mostly outside of Slurm, by working on and optimizing the way we create our user environment. We modified our default profile scripts, tweaked Lmod, and used more aggressive caching to limit the effect of starting multiple user sessions simultaneously. And the "Failed to load current user environment variables" error message basically disappeared.

So I guess we're good now, but I would like to put in an enhancement request to:

1. turn that hard-coded 120-second timeout into a settable configuration option (or maybe reuse GetEnvTimeout),
2. add more information to the error message, such as the jobid and the value of the timeout that was reached.

Thanks!
-- Kilian

That's great, I'm glad it's working now. I'm re-tagging this ticket as an enhancement request.

*** Ticket 5928 has been marked as a duplicate of this ticket. ***

*** Ticket 18229 has been marked as a duplicate of this ticket. ***

This ticket has finally reached a conclusion. Basically, we have removed all old code regarding environment caches, simplified the surviving code, and added a new behavior for jobs that fail at the environment retrieval stage (requeueing them as HELD instead of letting them execute with a wrong env).

We have decided to set the timeout of this env retrieval stage to 120 seconds and it will not be configurable, since we feel that it is more than enough time for this. In case of timeouts of this kind, users should make sure that there is nothing weird in their startup scripts. We have made some documentation changes to reflect that.

All of this has been done in commits eabe4f52a[1] .. 88a61d5e[2], which will be present in the 24.05 release.

Closing this ticket as fixed.
[1] https://github.com/SchedMD/slurm/commit/eabe4f52a958724f680ebdf93325d3b2421ad82c
[2] https://github.com/SchedMD/slurm/commit/88a61d5e520b52251757b354ca388755066de149

(In reply to Ricard Zarco Badia from comment #28)
> This ticket has finally reached a conclusion. Basically, we have removed all
> old code regarding environment caches, simplified the surviving code and
> added a new behavior for jobs that fail at the environment retrieval stage
> (requeueing them as HELD instead of letting them execute with a wrong env).
>
> We have decided to set the timeout of this env retrieval stage to 120
> seconds and it will not be configurable, since we feel that it is more than
> enough time for this. In case of timeouts of this kind, users should make
> sure that there is nothing weird in their startup scripts. We have made some
> documentation changes to reflect that.
>
> All of this has been done in commits eabe4f52a[1] .. 88a61d5e[2], which will
> be present in the 24.05 release.
>
> Closing this ticket as fixed.

Thanks Ricard!

Cheers,
-- Kilian
Created attachment 6553 [details]
slurmd logs

Hi there,

We seem to be getting a bunch of job failures lately, due to the following:

error: Failed to load current user environment variables
error: _get_user_env: Unable to get user's local environment, running only with passed environment

I've added requeue_setup_env_fail to SchedulerParameters to make sure those nodes get drained and flagged, and increased SlurmdSyslogDebug to debug3 to catch more information, but I can't seem to figure out the root cause for this. I don't understand why the user environment can't be retrieved.

I'm attaching the full logs from a compute node where this happened, and would really appreciate it if you could take a look and maybe give some pointers to understand where this error is coming from.

Thanks!
-- Kilian