Created attachment 11385 [details] gzipped tarball with files

hi all,

playing/testing pmix, i get a strange failure, including a strange kernel error message:

[Wed Aug 28 15:59:44 2019] uverbs_write: process 54799 (slurmstepd) changed security contexts after opening file descriptor, this is not allowed.

(i include the timestamp from dmesg -T as it might help with the other log files)

the job is the osu benchmarks latency test. osu is compiled with openmpi4; openmpi is compiled with pmix 3, ucx 1.6 and slurm support (all bells and whistles). slurm is compiled with pmix3 support using the system ucx (1.5), but srun seems to pick up the ucx used for openmpi (from our loaded modules).

we run patched slurm with 5920 (and almost 19.05.2, see https://github.com/stdweird/slurm/tree/nvidia). i understand that this is not supported, but any pointers to where things go wrong are already welcome. because there is a kernel error, i assume that the ucx version difference is not the culprit (i'll try to use the system ucx for both).

in the attachment: the job script (submitted with /usr/bin/sbatch --nodes=2 --gres=gpu:2 -J osu.sh --chdir=/user/gent/400/vsc40002 -e /vscmnt/gent_kyukon_home/_kyukon_home_gent/400/vsc40002/%x.e%A --export=NONE --get-user-env=60L -o /vscmnt/gent_kyukon_home/_kyukon_home_gent/400/vsc40002/%x.o%A), the output of the job (stdout was empty, only stderr output), slurm.conf, cgroup.conf, gres.conf, and slurmd -vvvvvvvv output of both nodes (node3300 and node3301).
removing the ucx module and using the system one (the one that slurm was built with) gives the same result and error
running slurmd with UCX_TLS=tcp,self,sm at least fixes the issue (basically forcing tcp for pmix).

i'll now need to check if i can get ib working in the mpi itself somehow (or maybe i misunderstand where pmix and ucx live with the application)

stijn
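for anyone hitting the same thing, the workaround can be applied as a systemd drop-in for the slurmd unit. a sketch, assuming the usual unit name slurmd.service and drop-in layout (the snippet writes to a temporary path purely for illustration):

```shell
# Hypothetical drop-in; on a real node this file would live in
# /etc/systemd/system/slurmd.service.d/ followed by a daemon-reload
# and a slurmd restart. We write it to a temp dir for illustration.
dropin_dir="$(mktemp -d)/slurmd.service.d"
mkdir -p "$dropin_dir"
cat > "$dropin_dir/ucx-tls.conf" <<'EOF'
[Service]
# Restrict the transports the pmix plugin's UCX may pick:
# tcp instead of rdma avoids the uverbs credential check.
Environment=UCX_TLS=tcp,self,sm
EOF
cat "$dropin_dir/ucx-tls.conf"
```

after placing the file, `systemctl daemon-reload` and a slurmd restart would be needed for the environment to take effect.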
Hi stijn, we're looking at the issue and will provide a response asap. Sorry for the initial delay.
The timestamp of the kernel log doesn't match the start of the slurmd logs, and in the slurmstepd logs I cannot see a timestamp. I am assuming that these may be related:

slurmstepd: debug: node3301 [1] pmixp_utils.c:388 [_pmix_p2p_send_core] mpi/pmix: nodelist=node3300, address=/var/spool/slurm/slurmd/stepd.slurm.pmix.40000075.0, len=763
slurmstepd: debug: node3300 [0] pmixp_agent.c:129 [_server_conn_read] mpi/pmix: Slurm PROTO: accepted connection: sd=56
slurmstepd: debug: node3300 [0] pmixp_server.c:1376 [pmixp_server_slurm_conn] mpi/pmix: Request from fd = 56
slurmstepd: debug: node3301 [1] pmixp_coll_ring.c:100 [_ring_sent_cb] mpi/pmix: 0x2b4f1c0230f8: called 0
slurmstepd: error: node3300 [0] pmixp_dconn_ucx.c:668 [_ucx_connect] mpi/pmix: ERROR: ucp_ep_create failed: Input/output error
slurmstepd: error: node3300 [0] pmixp_dconn.h:243 [pmixp_dconn_connect] mpi/pmix: ERROR: Cannot establish direct connection to node3301 (1)
[2019-08-28T16:05:19.263] [40000075.0] debug: node3301 [1] mpi_pmix.c:141 [fini] mpi/pmix: node3301: call fini()
[2019-08-28T16:05:19.263] [40000075.0] debug4: eio: handling events for 4 objects
[2019-08-28T16:05:19.263] [40000075.0] debug: node3301 [1] pmixp_agent.c:87 [_conn_readable] mpi/pmix: false, shutdown
[2019-08-28T16:05:19.263] [40000075.0] debug: node3301 [1] pmixp_agent.c:87 [_conn_readable] mpi/pmix: false, shutdown
[2019-08-28T16:05:19.263] [40000075.0] debug: node3301 [1] pmixp_agent.c:255 [_agent_thread] mpi/pmix: agent thread exit
[2019-08-28T16:05:19.263] [40000075.0] debug: node3301 [1] pmixp_client_v2.c:62 [_op_callbk] mpi/pmix: op callback is called with status=0
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll.c:281 [pmixp_coll_log] mpi/pmix: ERROR: Dumping collective state
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll_ring.c:756 [pmixp_coll_ring_log] mpi/pmix: ERROR: 0x2b4f1c023010: COLL_FENCE_RING state seq=0
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll_ring.c:758 [pmixp_coll_ring_log] mpi/pmix: ERROR: my peerid: 1:node3301
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll_ring.c:765 [pmixp_coll_ring_log] mpi/pmix: ERROR: neighbor id: next 0:node3300, prev 0:node3300
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll_ring.c:775 [pmixp_coll_ring_log] mpi/pmix: ERROR: Context ptr=0x2b4f1c023088, #0, in-use=0
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll_ring.c:775 [pmixp_coll_ring_log] mpi/pmix: ERROR: Context ptr=0x2b4f1c0230c0, #1, in-use=0
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll_ring.c:775 [pmixp_coll_ring_log] mpi/pmix: ERROR: Context ptr=0x2b4f1c0230f8, #2, in-use=1
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll_ring.c:786 [pmixp_coll_ring_log] mpi/pmix: ERROR: seq=0 contribs: loc=1/prev=0/fwd=1
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll_ring.c:788 [pmixp_coll_ring_log] mpi/pmix: ERROR: neighbor contribs [2]:
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll_ring.c:821 [pmixp_coll_ring_log] mpi/pmix: ERROR: done contrib: -
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll_ring.c:823 [pmixp_coll_ring_log] mpi/pmix: ERROR: wait contrib: node3300
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll_ring.c:825 [pmixp_coll_ring_log] mpi/pmix: ERROR: status=PMIXP_COLL_RING_PROGRESS
[2019-08-28T16:05:19.270] [40000075.0] error: node3301 [1] pmixp_coll_ring.c:829 [pmixp_coll_ring_log] mpi/pmix: ERROR: buf (offset/size): 328/984
[2019-08-28T16:05:19.270] [40000075.0] debug: node3301 [1] pmixp_server.c:477 [pmixp_stepd_finalize] mpi/pmix: Remove PMIx plugin usock
[2019-08-28T16:05:19.270] [40000075.0] debug: mpi/pmix: setup sockets
[2019-08-28T16:05:19.270] [40000075.0] done with job

But it would be nice if you could provide more complete logs, since this just seems to be the end of the task.
Also, can you describe the "failure" and consequences you are seeing with this error?

As per:

[Wed Aug 28 15:59:44 2019] uverbs_write: process 54799 (slurmstepd) changed security contexts after opening file descriptor, this is not allowed.

..the error you see from the kernel comes from the infiniband driver. A function is denying access to the file descriptor of pmix with -EACCES (linux/latest/source/drivers/infiniband/core/uverbs_main.c#L656):

	if (!ib_safe_file_access(filp)) {
		pr_err_once("uverbs_write: process %d (%s) changed security contexts after opening file descriptor, this is not allowed.\n",
			    task_tgid_vnr(current), current->comm);
		return -EACCES;
	}

uverbs_write() wants to write to the file 'filp', but before doing so it checks whether that is safe by calling ib_safe_file_access(), which is defined as:

	static inline bool ib_safe_file_access(struct file *filp)
	{
		return filp->f_cred == current_cred() && !uaccess_kernel();
	}

The first check compares the credentials recorded in the file at open time against the credentials of the current thread performing the write, i.e. it protects against the task's credentials having changed since the descriptor was opened. The second check, !uaccess_kernel(), is the one that's probably failing, and it is defined as:

	#define uaccess_kernel() segment_eq(get_fs(), KERNEL_DS)

To summarize: KERNEL_DS is the mm_segment_t value that widens the thread's address limit to cover kernel space, and get_fs() is a macro that returns the current thread's address limit. When a write function transfers data to/from user space using a user-provided pointer, the system call checks whether the provided buffer is a valid address or not. During normal operation, an address that lies in the user address range (0-3GB for a standard 32-bit kernel configuration) is considered valid, and an address that lies in kernel address space (3GB-4GB) is not.
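As a toy user-space analogy of the first check (this is not the kernel code itself, just an illustration of the logic): the driver remembers who opened the fd and refuses a write once the writer's credentials differ:

```shell
# Simulated ib_safe_file_access(): compare the uid recorded when the
# uverbs fd was opened against the uid of the task doing the write.
opener_uid=0          # slurmstepd opened /dev/infiniband/uverbs* as root
current_uid=2540002   # task uid after setresuid() to the job user

if [ "$opener_uid" -eq "$current_uid" ]; then
    echo "write allowed"
else
    echo "write refused: EACCES"    # what the job sees: Permission denied
fi
```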
What may be happening here is something related to a slurmstepd context change, since it starts as root and then becomes another user, so there may be a credential mismatch that the driver is detecting. How this is happening is not yet clear to me. There may be other reasons not directly attributable to Slurm; I've seen others hit this issue when a thread executed add_key in between the open and write operations, so this may be a pmix issue rather than a Slurm one. I will continue investigating the error messages.
hi felip,

it is entirely possible that the kernel error and the pmix failure do not coincide. the node3301 logs however mention that pmix fails when trying to use ib, so i assume they are/were related.

the error we get is that pmix does not start, and thus the mpi part of the job fails. (forcing UCX_TLS to something non-ib in the slurmd unit works, but then i can't pass UCX_TLS anymore as part of the env to test ucx things: the variable is picked up by pmix somehow, while the job cannot see the variable set in the slurmd unit file. without setting UCX_TLS, ucx does its autodetect magic and can still use eg ib; but you lose all control on the user side)

stijn
Created attachment 11439 [details] node3301 strace slurmd
hi felip,

i uploaded strace -f -s 256 output from slurmd -vvvvv -D on node3301. search for ERROR and you can see ib related failures:

64364 write(1, "[1567457092.808739] [node3301:64343:1] ib_iface.c:577 UCX ERROR iface=0x13a0ba0: failed to create rc QP TX wr:256 sge:3 inl:64 RX wr:0 sge:1 inl: 64: Permission denied\n", 175) = 175

a few lines before, pid 64364 tried to write to fd 11. fd 11 (this is probably from ucx) is

64343 open("/dev/infiniband/uverbs1", O_RDWR) = 11

(i think ;) but between the open of fd 11 and the error, lots of stuff happens (incl gpu device cgroup stuff; and processes being placed in the cgroup of the job, ie owned by the user). and as you mentioned, there are some setresuid calls too, in particular

64364 setresuid(-1, 2540002, -1 <unfinished ...>

so fd 11, opened by root, is later written to by pid 64364, which has a different uid.

anyway, i'm no expert ;) thanks for looking into this.

stijn
> and as you mentioned, there are some setresuid calls too
> in particular
> 64364 setresuid(-1, 2540002, -1 <unfinished ...>
>
> so fd11 opened by root, is later written to by pid 64364 that has different
> uid.

That makes sense. It also makes sense that switching to tcp works, since we're not using the infiniband driver anymore.

Just to discard another idea I have: are you using SELinux, AppArmor or such?

> anyway, i'm no expert ;)

Kernel code is also outside my domain, but the path we're looking into seems to be the right one.
hi felip,

selinux is disabled, and no other similar protections are in place.

wrt ib vs tcp, it's not that clear to me why it works. either ucx/pmix does not initialise the socket at the same time (so it is opened by the user), or tcp sockets have some better kernel support and are eg covered by the cgroup changes, so that the job user is somehow allowed to write to them (ie in some magical way, the cgroup changes propagate to that already opened socket. that sounds weird, so i guess it's the first: the tcp socket is created when pmix starts to communicate, not when pmix starts. tcp does not need to initialise eg the CQ and WQ etc etc).

anyway, i am happy with tcp only for pmix+ucx, however i would still like to set UCX_TLS in the job. when i do that now, this UCX_TLS setting (from eg srun --export) is picked up by pmix, and it fails again. it would be sufficient for me at least if we can configure the UCX environment used by pmix somehow with slurm (eg in srun or some slurm.conf setting, since it is slurmstepd that starts pmix).

is it somehow required that pmix starts in the job environment? is there a way to pass variables to the pmix environment, eg PMIX_UCX_TLS=..., where the PMIX_ is stripped off (yes, i know, more magic ;)

stijn
> wrt the ib vs tcp, it's not that clear to me why it works.

slurmstepd.c:job_manager() is the function which runs as root and performs the shared memory and interconnect initialization. At the start of the function all the plugins (gather, switch, proctrack, task, etc.) are initialized, and the job container and cgroups are created. mpi_hook_slurmstepd_init() is called too; it reads/sets the SLURM_MPI_TYPE env variable and registers the mpi plugin.

Then, when all this and other initialization stuff succeeds, it is time to fork the necessary threads for MPI, and mpi_hook_slurmstepd_prefork() is called. Since the plugin initialized in your case is pmix, mpi_hook_slurmstepd_prefork() calls p_mpi_hook_slurmstepd_prefork(), defined in src/plugins/mpi/pmix/mpi_pmix.c. This function in turn calls pmixp_stepd_init(), which initializes the UNIX socket for slurmd communication and then calls pmixp_dconn.c:pmixp_dconn_init(), which prepares the direct connection between slurmd and stepd (I still have no deep knowledge of this last part).

In pmixp_dconn_init() from pmixp_dconn.c you will see the #ifdef HAVE_UCX section. If you compiled with ucx support, pmixp_dconn_ucx_prepare() will be called, otherwise pmixp_dconn_tcp_prepare() will be the one called. In the UCX case I can see calls into the UCX library, like ucp_worker_create, ucp_worker_get_address, ucp_init. At this point, I guess that the UCX library is (as you suspected) using the infiniband devices and some environment variables to set up the connection.

Ok, at this point remember we're still root in slurmstepd, and we're setting up plugins and specifically the mpi connection, servers and sockets. UCX seems to have initialized the infiniband connection as root. Returning to job_manager(), when mpi_hook_slurmstepd_prefork() is done, we just continue and fork all tasks, which indeed incurs a setresuid call and a fork system call.
The infiniband driver has the particularity of detecting (see comment 5) the credentials and address space of the process which initialized the socket. Once we have changed the uid with setuid, or forked, and the new process tries to use the previously created socket, the driver detects that the writer no longer matches the process that created the socket initially, so it just returns access denied. I can see in the infiniband kernel code that this is affected both by a fork and by a setuid. The TCP driver doesn't seem to have such control over its sockets, or does it in some other way which is not affected by these two situations.

There may be one way to fix this, which is to make the first transport connection after the setuid. The workaround for now is to set UCX_TLS=tcp,self,sm for slurmd so it does not use the infiniband driver. Honestly, I still don't understand the deep details of UCX and why this happens in this situation and not in others, but I am studying it.

> anyway, i am happy with tcp only for pmix+ucx, however i would still like to
> set the UCX_TLS in the job.

I am investigating this option too.

> when i do that now, this UCX_TLS setting (from eg sron --export) is picked
> up by pmix, and it fails again. it would be sufficient for me at least if we
> can configure the UCX environment used by pmix somehow with slurm (eg in
> srun or some slurm.conf since it is slum stepd that starts the pmix)
> is it required somehow that pmix starts in the job environment? is there a
> way to pass variables to the pmix enviroment, eg PMIX_UCX_TLS=..., and the
> PMIX_ is stripped off (yes, i know, more magic ;)
>
> stijn

Looking into that. I guess setting the variable in the user's environment doesn't work, right?

export UCX_TLS=tcp,self,sm
srun ...
To be more specific, it is not the infiniband device itself that matters, but TCP vs RDMA. RDMA is what does not work when a connection is established before a setuid or a fork.
hi felip,

thanks for investigating. below is a comment from the ceph code with a similar remark.

i guess this would mean you can only do the pmix setup in slurm after the setuid, or use tcp only for pmix?

stijn

//
// Utterly important to run first network connection after setuid().
// In case of rdma transport uverbs kernel module starts returning
// -EACCESS on each operation if credentials has been changed, see
// callers of ib_safe_file_access() for details.
//
// fork() syscall also matters, so daemonization won't work in case
// of rdma.
//
(In reply to Stijn De Weirdt from comment #14)
> hi felip,
>
> thanks for investigating. below a comment from ceph code that has similar
> remark
>

I have seen this, and it's a similar issue. In our case I think we are hitting both problems, the fork and the setuid: we set up the socket before the setuid and before forking, then the forked processes try to access the ib socket as a different user and as a different process. So this check done in the ib driver:

return filp->f_cred == current_cred() && !uaccess_kernel();

may be failing both for current_cred() and uaccess_kernel().

> i guess this would mean you can only do the pmix in slurm after the setuid,
> or for tcp only for pmix?
>

Correct. To be precise, PMIx and UCX are not the problem; the rdma socket setup vs the tcp one is, and UCX detects rdma and uses it.

I am adding Artem here just for him to be aware of the issue.

---
https://www.spinics.net/lists/ceph-devel/msg45105.html
https://github.com/ceph/ceph/pull/28012
thanks for the detailed explanation.

at what point in the whole stepd process of forking and setuid'ing are the environment variables set (the ones passed by eg srun --export=ALL)? from the behaviour i see when passing UCX_TLS, i'd guess before UCX initialised the rdma, so that is quite early. is that correct? but why are the exported variables needed so early? is this so you can pass stuff to the plugins?

anyway, many thanks. i hope you can provide a way to control the UCX_TLS (or any other UCX variable) that IS picked up by pmix separately from the ONE needed/wanted for the job.

stijn
> from the behaviour i see when passing UCX_TLS, i'd guess before UCX initiliased the rdma, so that is quite early. is that correct? but why are the exported variables needed so early? is this so you can pass stuff to the plugins?

That's correct. Some variables (UCX_TLS included) are read in src/plugins/mpi/pmix/pmixp_info.c:_env_set(), called from pmixp_stepd_init()->pmixp_info_set(), before UCX initialises rdma and creates the socket. In fact it is the first action done in pmixp_stepd_init(). As you guessed, this is needed to pass stuff to the plugins.

Having said that, the environment read by _env_set() is the one passed from job->env in slurmstepd:

mpi_hook_slurmstepd_prefork(job, &job->env) != SLURM_SUCCESS)

The job->env is captured from the user environment in slurmd and passed in the slurmstepd job launch request, so I think UCX_TLS should be set. I have to dig more into that. You said that setting UCX_TLS in the slurmd systemd init file makes it work, right?

> anyway, many thanks. i hope you can provide a way to control the UCX_TLS (or
> any other UCX variable) that IS picked up by pmix and the ONE needed/wanted
> for the job.

I'll look into that. I am investigating another solution too, which is to transfer permissions of the socket to the new user after initializing it. I'll come back as soon as I have something.
hi felip,

setting UCX_TLS in the unit file works in the sense that it is picked up by pmix+ucx, and not by the job (eg openmpi+ucx) (ucx falls back to its default autodetect, which seems to use ib and some other things). however, when i set UCX_TLS in the environment (and pass it with --export ALL), it is picked up by pmix+ucx (overwriting the value from the unit file). so it works, but not for all cases.

i guess for now it is ok, but i wanted to try some ucx tls options, to compare with autodetecting.

stijn
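the precedence described above can be mimicked with nested shells: the outer environment plays the role of the slurmd unit file, the inner one the value the user exports into the job, and the inner value wins (a simulation only, with srun replaced by sh):

```shell
# Outer assignment = the Environment= line in the slurmd unit file.
# Inner assignment = UCX_TLS exported by the user and forwarded with
# --export=ALL; it overrides the unit-file value for the child.
effective=$(UCX_TLS='tcp,self,sm' sh -c '
    UCX_TLS=rc,ud
    echo "$UCX_TLS"
')
echo "effective UCX_TLS: $effective"
```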
Hello Stijn,

I'm Artem from Mellanox and I'm working on PMIx plugin development. Unfortunately, I didn't notice this bug before; we discovered and analyzed the problem independently.

The reason for the issue you observe is that the Linux kernel driver disables the control path if privileges are dropped (effectively setegid() is called), and slurmd does that prior to launching application processes. My fault that this slipped in: as a developer I do everything as my own user, using our Slurm-under-Slurm deployment system (https://github.com/artpol84/jobstart) that allows using an official Slurm installation to deploy my own Slurm and to test "almost" everything, except those things that only come into play when you install Slurm as root.

We are currently working to solve this issue by extracting the PMIx portion from slurmstepd into a slurmpmixd process that won't call the functions that hurt the IB stuff. We expect to have it finished in Feb. Would you be interested in testing it?

Of course, when you set UCX_TLS to tcp this problem goes away.
hi artem,

thanks for looking into this. if you have something i can test, i'll see if i can make some time and a cluster available (but no guarantees)

stijn
Also, regarding setting env variables for the job start: currently you have to set them when you launch the job, but I think this is more of an admin-side decision, so we are looking into adding a way to configure it through the Slurm config files.

But there is a hint you can consider. The plugin registers a "SLURM" prefix with UCX, so if you want only the plugin to pick up the TLS, but not MPI, you can set UCX_SLURM_TLS.
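If the plugin indeed resolves its prefixed variable first as described, the effective lookup order can be sketched like this (a simulation of the precedence only, not the actual UCX parsing code):

```shell
# UCX_SLURM_TLS is only consulted by the pmix plugin (registered
# "SLURM" prefix); the application's UCX only sees plain UCX_TLS.
export UCX_TLS='rc,ud'              # transports for MPI in the job
export UCX_SLURM_TLS='tcp,self,sm'  # transports for the pmix plugin

plugin_tls="${UCX_SLURM_TLS:-$UCX_TLS}"   # prefixed value wins if set
job_tls="$UCX_TLS"
echo "plugin: $plugin_tls"
echo "job:    $job_tls"
```

with this split, the plugin stays on tcp while the application remains free to experiment with other transports.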
Felip,

I want to apologize for ignoring the issue earlier; it fell off my radar as I was traveling/on vacation in September. I think you did a really good job investigating, and I wish I had known this earlier.

When we were debugging at the customer site, no kernel message was observed in dmesg and it was much harder to diagnose the issue.
(In reply to Artem Polyakov from comment #24)
> Felip,
>
> I want to apologize for ignoring the issue earlier, it felt off my radar as
> I was traveling/on vacation in September.
> I think you did a really good job investigating and I wish I knew this
> earlier.
>
> When we were debugging at the customer site, no kernel message was observed
> in dmesg and it was way much harder to dig to diagnose the issue.

Artem, it has been my fault for not pinging you before. I just added you in CC but never reached out directly. I did suspect this might need to be managed from the PMIx side, but before bothering you I wanted to give it another try.

We (SchedMD) really appreciate your effort and willingness to help. I'll stay tuned and wait for news on the extraction of this code into slurmpmixd.

Thank you very much.
I've just got an interesting update from my colleagues working with the Linux kernel. Basically, this issue is resolved when using "rdma-core". By default in MOFED up to 5.0, "legacy verbs" are used. To enable "rdma-core" you can use MOFED 5.0 with the "--upstream-libs" option. Starting from MOFED 5.1 (release is this June), "rdma-core" will be the default.

I'm going to verify internally that Slurm/UCX works in this configuration. Given that Slurm/UCX is a relatively new feature, relying on the new software is reasonable. It will need to be documented in the plugin description though.

I'll update on the results here. We might put the slurmpmixd extraction on the back burner if that works.
ah, that is good news. rdma-core is already available in 4.7 (and i think even earlier versions). i can have a look at it
Correct, it should be available in 4.7. The reason I said 5.0 is that UCX may not support it as part of MOFED 4.7; I haven't got a definite answer to this question. But if you have a way to experiment, that would be cool.
(In reply to Artem Polyakov from comment #28)
> Correct, it should be available in 4.7.
> the reason I said 5.0 is because UCX may not support it as part of MOFED
> 4.7, I haven't got a definite answer to this question.
> But if you have a way to experiment it would be cool

Hi Artem, Stijn,

I don't know whether you are still working on this issue or consider it resolved; I'd like to get some feedback from you in order to close this bug or keep it open. Can you update me?

Thanks
Felip,

We verified that rdma-core solves the issue, so we have postponed the effort to extract the pmix portion into a separate process for now. Boris implemented a POC fairly quickly though, so we can revive this at any point. Here is the link for reference, which gives some feeling for how much effort it takes (not bad at all IMO): https://github.com/karasevb/slurm/commits/pmixpd_intro

Now I think what we should do to move forward is to document that UCX support is only available with rdma-core:
* starting from MOFED 4.7, if rdma-core is explicitly enabled (need to double-check the exact instructions on how to enable it)
* starting from v5.0 - by default.
(In reply to Artem Polyakov from comment #30)
> Felip,
>
> We verified, that rdma-core solves the issue. So we have postponed an effort
> to extract the pmix portion into a separate process for now.
> Though, Boris implemented a POC fairly quickly so we can revive this at any
> point. Here is the link for the reference that gives some feeling on how
> much effort it takes (not bad at all IMO):
> https://github.com/karasevb/slurm/commits/pmixpd_intro
>
> Now I think what we should do to move forward is to document that UCX
> support is only available with rdma-core:
> * starting from MOFED 4.7 if rdma-core is explicitly enabled (need to
> double-check the exact instructions on how to enable)
> * starting from v5.0 - by default.

Great news. As regards documentation, Slurm doesn't have any for UCX yet, so I guess you are referring to docs external to Slurm. In the future it would be great to have some lines in our docs too, but this is another matter. I'm available if you need some help on our side.

Also, if you think we can close this issue since there's a workaround, just tell me.

Thanks again for your valuable and appreciated collaboration.
I am closing this issue since most of the bug has already been covered. Please mark it as open again if you need anything from us. Thanks!