Hello,

I think this is new behavior for Slurm in 18.08, but it is disruptive:

dmj@nid00384:~> srun shifter hostname
slurmstepd: error: execve(): /global/u2/d/dmj/shifter: Permission denied
srun: error: nid00384: task 0: Exited with exit code 13
srun: Terminating job step 1001814.1
dmj@nid00384:~> ls shifter
AUTHORS  LICENSE  NEWS  README.md  extra  imagegw  udiRoot
dmj@nid00384:~> srun /usr/bin/shifter hostname
nid00384
dmj@nid00384:~> srun hostname
nid00384
dmj@nid00384:~> ls -ld shifter
drwxr-x--- 6 dmj dmj 512 Jan  5  2016 shifter
dmj@nid00384:~>

Since I have a directory in my working directory with the same name as the desired executable, srun is using that instead of properly searching $PATH.

In slurm.conf I have:

LaunchParameters=test_exec,send_gids

Thanks,
Doug
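[Editor's note: a standalone illustration (not Slurm code) of why the error above reads "Permission denied": execve() on a directory fails with EACCES, so once the path search resolves "shifter" to the directory ./shifter instead of /usr/bin/shifter, this is exactly the error slurmstepd reports. The /tmp/shifter_demo path is invented for the demo.]

```c
/* Demo: execve() on a directory fails with EACCES ("Permission denied"),
 * matching the slurmstepd error when a directory shadows the executable. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Stand-in for the user's ~/shifter directory. */
    mkdir("/tmp/shifter_demo", 0750);

    char *argv[] = { "/tmp/shifter_demo", NULL };
    execv("/tmp/shifter_demo", argv);   /* only returns on failure */

    printf("execv failed: %s\n", strerror(errno));
    return errno == EACCES ? 0 : 1;
}
```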
In general this is also combined with a loss of the shell $PATH-like behavior that the test_exec LaunchParameters flag gave in the past. For example, if "." is not in $PATH, then a user has to request ./a.out, which is GOOD, as it delivers a more consistent experience.

On cori, slurm 17.11.9:

dmj@mom1:~> salloc -C knl -N 2 --qos=interactive
salloc: Pending job allocation 15770348
salloc: job 15770348 queued and waiting for resources
salloc: job 15770348 has been allocated resources
salloc: Granted job allocation 15770348
salloc: Waiting for resource configuration
salloc: Nodes nid0[2304,2307] are ready for job
dmj@nid02304:~> cd mpi
dmj@nid02304:~/mpi> srun a.out
srun: fatal: Can not execute a.out
Aborted
dmj@nid02304:~/mpi> srun ./a.out
hello from 1 of 2 on nid02307
hello from 0 of 2 on nid02304
dmj@nid02304:~/mpi> exit
salloc: Relinquishing job allocation 15770348
dmj@mom1:~>

On gerty, slurm 18.08.1:

dmj@gerty:~> salloc -N 2 -C haswell
salloc: Granted job allocation 1001844
salloc: Waiting for resource configuration
salloc: Nodes nid00[384-385] are ready for job
dmj@nid00384:~> cd mpi
dmj@nid00384:~/mpi> srun a.out
hello from 1 of 2 on nid00385
hello from 0 of 2 on nid00384
dmj@nid00384:~/mpi> exit
salloc: Relinquishing job allocation 1001844
dmj@gerty:~>

Note that in the example above, 18.08 allows a file in the cwd to be executed even though that path is in no way in PATH. The original reason for test_exec to exist was to ensure Linux PATH-like semantics were used to positively identify the executable in all cases.

Thanks,
Doug
I suspect this was brought in by commit https://github.com/SchedMD/slurm/commit/a84852c388e with the modifications to search_path().

I would say a minimum repair would be ensuring that the file (or if it is a symlink, does not dereference) a directory. However, I'm not a fan of searching cwd by default.
Sorry, missing words in the previous comment. I mean ensuring that the fully resolved path is not a directory, e.g., for each candidate path, perform a realpath(), then stat(), and check !S_ISDIR(st.st_mode).
I'm testing the attached patch presently; I'm handling disabling the cwd check with the opposite logic (a nocheck_cwd flag), assuming that you want to keep checking cwd as the default behavior.
Created attachment 8029 [details]
patch under consideration for inclusion with NERSC 18.08 deployment

Lightly tested patch; abstracting out _check_exec() gives a common point for adding additional criteria for testing a given path.
With the attached patch, and setting LaunchParameters=test_exec,send_gids,nocheck_cwd, I get the desired behavior:

dmj@gerty:~> salloc -C haswell -N 2 --image=ubuntu:latest
salloc: Granted job allocation 1001864
salloc: Waiting for resource configuration
salloc: Nodes nid00[384-385] are ready for job
dmj@nid00384:~> srun shifter hostname
nid00384
nid00385
dmj@nid00384:~> cd mpi
dmj@nid00384:~/mpi> srun a.out
srun: fatal: Can not execute a.out
Aborted
dmj@nid00384:~/mpi> srun ./a.out
hello from 1 of 2 on nid00385
hello from 0 of 2 on nid00384
dmj@nid00384:~/mpi> exit
salloc: Relinquishing job allocation 1001865
dmj@gerty:~>

I would be fully supportive of reversing the default for checking cwd and having it default to false, with a LaunchParameters flag of check_cwd instead of nocheck_cwd, but that may be informed by other requirements, such as the bug that initiated this change.
Doug,

Thanks for looking into this. As you may have inferred from the comment numbers, we've been privately discussing this quite a bit.

Commit https://github.com/SchedMD/slurm/commit/a84852c388e changed Slurm to be consistent with its other parts by checking cwd. I tend to agree with you that by default we shouldn't add the cwd of the binary to the search path, and that users should opt in to this. But this has actually been the default behavior of Slurm for quite a while, for good or bad. So let's make this a new feature request for 19.05, in order to separate it from the bug you discovered.

I would like to see cwd NOT checked by default, with a check_cwd launch param added, like you suggest. But this may require more lobbying before the entire SchedMD team is on board with introducing this change. Introducing the nocheck_cwd launch param may be an easier compromise to sell.

In the meantime, I'm thinking we'll only commit the dir-checking part of the patch, so that the bug is fixed without adding new features. If dirs are ignored, then it shouldn't even matter in your initial case whether cwd is checked or not. But if checking cwd still causes problems for you, you can just apply the nocheck_cwd portion of your patch.
There is another significant difference between test_exec and non-test_exec: test_exec will choose a local binary in cwd over PATH, while non-test_exec will only use a binary in the cwd if nothing else can be found in PATH. This mismatch is also a bug. I will create a follow-up patch so test_exec matches the non-test_exec behavior. This change will effectively solve your issue, as local binaries cannot trump anything on your PATH.
Doug,

The following two commits fix the bug and are checked in to 18.08:

https://github.com/SchedMD/slurm/commit/169ad58ff18c239a5e5a034ca618d148f9572910
https://github.com/SchedMD/slurm/commit/ccafaf7b60090155639edcbdbf4a3ab5e36967c6

Closing out bug. Thanks!

Michael
Hello,

Based on the recent "see also" addition, I'm thinking that the revised versions of these patches, or a related patch, has generated some big issues for us. Over the holiday break we were forced to downgrade Slurm from 18.08.4 to 18.08.3 on the cori login nodes. This was because sbatch stopped processing #SBATCH lines in the scripts. I suspect it may be because the test_exec flags were being honored by sbatch and probably should not be. In any case, we downgraded to 18.08.3 with the patch I supplied in this bug and the issue went away.

I bring it up here and now with enhanced priority because we are planning to rebuild the systems on Tuesday 1/8, and it would be nice if we had a fix appropriate for 18.08.4.

Thanks,
Doug
That's not good.

Just a note: a patch landed today from bug 6271 that partially reverts these patches:

https://github.com/SchedMD/slurm/commit/aca40d167b21835b0eeb61b18a2eaaef68c33865

When we made Slurm check the CWD last for srun, it also did that for sbatch, since they share the same path-checking code. The commit today reverts this for sbatch only, so sbatch always checks CWD _first_, like it did before.

What you describe, though, seems like it might not be fixed by this. I'll need to look into this further.

Btw, I thought the government was supposed to be shut down ;)
Hey Doug,

Can you be more specific? I wasn't able to reproduce this with Slurm 18.08.4 and LaunchParameters=test_exec set. My batch scripts read #SBATCH lines just fine. Did this work as expected when test_exec was disabled?

I'm not sure how test_exec would affect #SBATCH lines, which are inside an sbatch script. The script is already running by the time the #SBATCH lines are hit. These patches only messed with path checking.

Unless there is a clear connection with this bug, I would recommend opening up a separate sev-3 bug to track this. From what you've said, it could be caused by any number of commits within 18.08.4.
Found it -- it would seem a number of our users name their scripts after things that can be found in PATH:

dmj@gert01:~> sbatch test.sh
Submitted batch job 1133854
dmj@gert01:~> mv test.sh perl
dmj@gert01:~> sbatch perl
sbatch: error: This does not look like a batch script. The first
sbatch: error: line must start with #! followed by the path to an interpreter.
sbatch: error: For instance: #!/bin/sh
dmj@gert01:~>

I've been given to understand that this gets much worse when the user has "." in PATH (obviously a bad practice, but nonetheless).

I think the patches discussed in this bug are correct -- but only for srun and salloc (which directly exec a process or pretend to). sbatch should _not_ search PATH or exhibit execve()-like tendencies -- sbatch should just take the path offered to it and use that.
(In reply to Doug Jacobsen from comment #30)
> I think the patches discussed in this bug are correct -- but only for srun
> and salloc (which directly exec a process or pretend to). sbatch should
> _not_ search PATH or exhibit execve()-like tendencies -- sbatch should just
> take the path offered it and use that.

That's good to hear, because the patch that landed this week from bug 6271 fixes this issue:

https://github.com/SchedMD/slurm/commit/aca40d167b21835b0eeb61b18a2eaaef68c33865

This commit makes sure that sbatch always checks the CWD _first_ (the original behavior), which should fix the problems you describe. It should cleanly apply to 18.08.4.

I can see the argument for sbatch not searching PATH at all, but that is how sbatch has worked for a while now. And it shouldn't really matter once CWD is checked first.

Closing out ticket. Feel free to reopen if that doesn't work for you.

Thanks,
Michael
OK, I guess the issue with #SBATCH lines not getting processed in some cases is not this one. I'll add the specified patch to my build and will keep trying to get more info on the other bug.

Thanks,
Doug