| Summary: | Prevent multiple PMI2_Init calls from same rank from hanging slurmstepd | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Aaron Knister <aaron.s.knister> |
| Component: | slurmstepd | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | alex, mcoyne, mej, rhc, sts |
| Version: | 17.11.x | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=13622 | | |
| Site: | NASA - NCCS | | |
| Version Fixed: | 17.02.8 17.11.0-pre3 | Target Release: | --- |
| Attachments: | pmi2_init.c; prevent multiple pmi2_init calls from leaving libpmi2 hung | | |

Description
Aaron Knister, 2017-03-01 13:10:35 MST
I'm playing catch up here with the PMI2 API design, so please bear with me. Looping indefinitely on an invalid command, or a command we don't expect to receive at that point, is definitely a bug, and something we'd certainly want to fix ASAP. If you have a patch that handles this part, please let me know. Looking at our current PMI implementation, it doesn't appear that re-initializing is safe due to the number of static variables in use within the plugin; at the very least I'd want to audit it extensively before exploring that as an option. I'm thinking the safe approach at the moment would be to reject any successive "cmd=init" messages (and document that PMI2_Init() is only permitted once per step), rather than loop infinitely until the step times out. I'm adding Ralph Castain as a CC here in the hopes he may shed some light as well; he's our usual contact with the OpenMPI folks, knows a lot more than I do about the inner workings, and may be able to weigh in on the expected behavior of successive PMI2_Init() calls.

You actually have a bigger problem than you realize. The PMI2 server in SLURM assigns a port to the initial client and requires that only that client call back on it. In this case, you are using mpirun to launch your script, and mpirun launches its own daemons. Thus, as far as SLURM is concerned, the mpirun daemon is the PMI2 client. This highlights something often overlooked when launching with mpirun: SLURM itself has no idea that there are application processes running; it only sees the daemons started by mpirun. Adding to the problem, mpirun doesn't know anything about your worker ranks; it only sees the script. So when your worker ranks start calling PMI2_Init, the PMI2 server is caught by surprise, as the calls are coming from "non-client" processes, and more than one of them is claiming to be the same "rank" (since the rank is associated with the assigned port).

Bottom line: your use case violates the design of the PMI2 implementation, as it has multiple separations from the PMI2 server and multiple procs all claiming to be the same "rank" as far as SLURM is concerned. Your executables aren't going to work correctly if they call PMI2_Init unless SLURM makes major architectural changes to the PMI2 implementation.

Ralph - thank you for the clarification; that's what I was hoping you might be able to answer. Aaron - the "hang" you'd identified: is this from the Slurm PMI interface not knowing how to handle the (unexpected) message format and not returning an appropriate error? Or is it the application itself not knowing what to do if PMI2_Init fails?

I should have added a few alternative suggestions, if you will forgive the shameless plug for a project I lead. If you have access to version 2.0 or above of OpenMPI, it includes a copy of the PMIx code. PMIx allows this use case, where a child process fork/exec's additional children of its own that also want to call PMI_Init, and PMIx has backward compatibility for PMI-2. So you could use that version of OMPI and run your job with that mpirun. Or you can install the PMIx reference server (https://github.com/pmix/pmix-reference-server), run it in your job allocation, and then execute your job against that environment using "prun". Also, if you prefer to direct-launch with srun, PMIx support is included in v16.05 and above, so you could use --mpi=pmix to support this use case that way. HTH, Ralph

Thanks Tim, Ralph. Sorry for the confusion; the MPI implementation I'm having this issue with is SGI MPT rather than OpenMPI. I believe the issue is SLURM not responding to the init request from the application, so the application sits indefinitely waiting for a reply. I think we've found a workable solution in the meantime (unsetting the various PMI_* environment variables) in the perl worker processes that launch the MPI-compiled (but not actually MPI-using) binaries. -Aaron

Aaron, I have a couple of questions for you. Can you provide us with a simple reproducer and the compile line for your binary so we can try to reproduce it here? That would help us get a better idea of how you're doing it and how the hang is happening. Also, you mentioned using mpirun, but have you tried just using srun to submit the job? For example: srun --mpi=pmi2 -N2 -n2 hello_mpi. Lastly, the version on this bug is set to 17.11.x. Is that the version you're really using? If not, please update it so I make sure I'm using the same version you are. Many thanks, Tim

Hi Tim, the application in this case is something known as "prund" that's run as part of a job workflow. It's an in-house tool developed to run through a work queue and launch commands. I forget the specifics of how it's run, but the idea is that you have a "master" process with a list of commands to run, i.e.: cat work_to_do.txt | prund --master, where work_to_do.txt could hold a list of shell commands that each, say, generate a plot. Then you launch the workers via mpirun: mpirun prund --slave --master-host=$HOSTNAME. The idea is that the slaves connect back to the master and run each command, one per MPI rank, until all commands listed in the file have been run. It's launched with the mpirun launcher to make the workflow portable between various MPI and scheduler implementations. The trouble we run into is that although the applications run through prund are generally inherently serial and don't explicitly make any MPI calls, they're built with mpif90 and linked with various MPI libraries. On application launch, even though they're not actually MPI apps, the MPI libraries they were linked against attempt to perform various initialization routines, such as calling PMI2_Init. The MPI implementation having this issue, specifically, is SGI MPT.
The end result is that, from the perspective of slurmstepd, a single "MPI rank" (prund worker) calls PMI2_Init multiple times, once for each fork() and exec() of the non-MPI app run through prund. The first task spawned by each prund worker would run just fine, but subsequent tasks would hang indefinitely with some errors about an invalid client request. Our workaround in prund was to simply unset the various PMI2 environment variables, preventing them from being passed through to the tasks it fork()s, which seems to stop the MPI libraries from doing any PMI2 initialization. We are still on SLURM 14.03, but I have reproduced this on 17.11 as of opening the ticket. The PMI2 double-init issue was still there, and I do have a reproducer. Here's how I simulated the behavior of MPT and reproduced it:

gcc pmi2_init.c -o pmi2_init -lpmi2
srun -N1 -n1 --mpi=pmi2 ./run_pmi2.sh

I'll attach the C code and shell script here in just a second. -Aaron

Created attachment 4338 [details]: pmi2_init.c

run_pmi2.sh is just a shell script that runs ./pmi2_init twice. -Aaron

Thanks Aaron. I'll give that a try and get back to you. Tim

Aaron,
I've spent most of the day playing with this and I haven't been able to get the hang with the script the way it is currently. Here's my job submission:
```
$ srun -N8 -n8 --mpi=pmi2 ./run_pmi2.sh
over and out
over and out
over and out
over and out
over and out
over and out
over and out
over and out
```
The job finishes. However, I do see a hang if I try to add another PMI2_Init() call to the pmi2_init.c file, like this:

```c
#include <assert.h>
#include <stdio.h>
#include <slurm/pmi2.h>

int main() {
    int rc, spawned, size, rank, appnum;
    rc = PMI2_Init(&spawned, &size, &rank, &appnum);
    assert(rc == 0);
    rc = PMI2_Init(&spawned, &size, &rank, &appnum);
    assert(rc == 0);
    printf("over and out\n");
    return rc;
}
```
Now when I submit the same job I get:

```
$ srun -N8 -n8 --mpi=pmi2 ./run_pmi2.sh
slurmstepd-chiron0: error: mpi/pmi2: request not begin with 'cmd='
slurmstepd-chiron0: error: mpi/pmi2: full request is:
slurmstepd-chiron0: error: mpi/pmi2: invalid client request
slurmstepd-chiron0: error: mpi/pmi2: request not begin with 'cmd='
slurmstepd-chiron0: error: mpi/pmi2: full request is:
slurmstepd-chiron0: error: mpi/pmi2: invalid client request
slurmstepd-chiron0: error: mpi/pmi2: request not begin with 'cmd='
...
```
Despite the fact that I'm getting an error, I'm not sure if this is the correct way to reproduce the same hang you're getting. Before I started debugging code, I just wanted to run this by you to see if this is a valid test case. Just let me know if I've successfully reproduced the error.
Thanks
Tim
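The fix eventually settled on later in this ticket amounts to rejecting a repeated "cmd=init" instead of looping forever. As a rough sketch of that idea (the function name, response strings, and wire format here are illustrative assumptions, not Slurm's actual pmi2/agent.c code or the real PMI2 wire protocol):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the server-side guard discussed in this ticket:
 * instead of looping indefinitely on an unexpected second "cmd=init", the
 * handler fails fast with an error response. Names and message strings are
 * illustrative, not Slurm's actual symbols or wire format. */

static bool task_initialized = false;

/* Returns 0 on success, -1 on a malformed or duplicate-init request,
 * writing a response string for the client in either case. */
int handle_task_request(const char *req, char *resp, size_t resp_len)
{
    if (strncmp(req, "cmd=", 4) != 0) {
        /* Mirrors the "request not begin with 'cmd='" error path above. */
        snprintf(resp, resp_len, "cmd=init-response;rc=-1;msg=invalid request");
        return -1;
    }
    if (strcmp(req, "cmd=init") == 0) {
        if (task_initialized) {
            /* Second init from the same rank: reject rather than hang. */
            snprintf(resp, resp_len,
                     "cmd=init-response;rc=-1;msg=already initialized");
            return -1;
        }
        task_initialized = true;
        snprintf(resp, resp_len, "cmd=init-response;rc=0");
        return 0;
    }
    snprintf(resp, resp_len, "cmd=unknown-response;rc=-1");
    return -1;
}
```

The essential point is that the duplicate init produces an immediate error reply, so the client's PMI2_Init() fails instead of blocking forever on a response that never comes.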
I confess to still being puzzled by something Aaron said, so perhaps he can clarify for me. The implication from your comments is that your processes are not calling MPI_Init - is that true? Are you then saying that SGI's MPT somehow is calling MPI_Init for you, even though you are not invoking MPI functions? I ask because that would be really weird and counter to expected MPI behavior. Ralph

Tim,
The run_pmi2.sh script looked like this for me:

```
#!/bin/bash
./pmi2_init
./pmi2_init
```

Which I think achieves the same result as calling PMI2_Init twice from pmi2_init.c.
>
> Despite the fact that I'm getting an error, I'm not sure if this is the correct
> way to reproduce the same hang you're getting. Before I started debugging
> code, I just wanted to run this by you to see if this is a valid test case.
> Just let me know if I've successfully reproduced the error.
I think this is as accurate a way as I can come up with to reproduce the
hang. Those errors look exactly like those I was seeing.
-Aaron
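Aaron's workaround (scrubbing the PMI-related variables from the environment before prund fork()/exec()s its tasks, so a dlopen'd MPI stack never attempts PMI2 initialization) can be sketched roughly like this. The exact variable names MPT consults are an assumption, but PMI_FD, PMI_JOBID, PMI_RANK, and PMI_SIZE are what Slurm's pmi2 plugin exports:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the prund-style workaround described above: remove PMI-related
 * environment variables before fork()/exec() so an MPI-linked child skips
 * PMI2 initialization. Which variables the MPI library actually reads is
 * an assumption; this removes everything matching PMI_* or PMI2_*. */
void scrub_pmi_env(void)
{
    extern char **environ;
    for (int i = 0; environ[i] != NULL; ) {
        if (strncmp(environ[i], "PMI_", 4) == 0 ||
            strncmp(environ[i], "PMI2_", 5) == 0) {
            char name[256];
            const char *eq = strchr(environ[i], '=');
            if (eq == NULL) { i++; continue; }
            size_t len = (size_t)(eq - environ[i]);
            if (len >= sizeof(name)) { i++; continue; }
            memcpy(name, environ[i], len);
            name[len] = '\0';
            /* unsetenv() compacts environ, so do not advance i here. */
            unsetenv(name);
        } else {
            i++;
        }
    }
}
```

A parent would call scrub_pmi_env() between fork() and exec() (or once before spawning any tasks) so the children never see the PMI socket/rank variables.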
Ralph, that's correct: the processes are not calling MPI_Init, and I don't think that SGI's MPT is calling MPI_Init for me, but what it absolutely appears to be doing is calling PMI2_Init for me. -Aaron

It's a tough one to resolve - very unfortunate that SGI did that to people. I assume you must be calling at least some function in that library, or else it would be difficult to understand how it just calls PMI2_Init for you out of the blue! The issue is that SLURM uses that unique process-to-socket mapping as part of its security policy. The idea is that SLURM creates the socket and then passes it to the process at launch, thus ensuring that process is the only one who knows about it. Admittedly, someone can port-scan to find it, and so there are other mechanisms in place - but the idea that only one process should be calling back on that socket, and that any other attempt represents a "bad actor", is an important element of the overall strategy. In fact, some resource managers go so far as to actually hit your process with a SIGKILL if it violates that policy; SLURM is one of the friendlier ones in that it only generates an error. PMIx can support multiple connections because we have a different security policy that involves the exchange of multiple verifying pieces of information. It isn't any stronger than the one used here, but it was designed around the idea that a process might fork/exec a child that also needed to access PMIx. PMI (both 1 and 2) was not designed to support that mode. I'd really recommend you go back to SGI and request that they fix this problem. It would be a much cleaner solution, though I'm sure the folks here could "hack" some way to let you run - provided they are willing to "stretch" their security.

Aaron, I spent the better part of a day digging through the pmi2 library code and the slurmstepd code, and it's clear that PMI2_Init() is not meant to be called more than once per job step (slurmstepd process).
I believe Ralph is right that this is an SGI bug and they need to fix it. Regards, Tim

Thanks, Tim, Ralph. I don't necessarily consider it a bug that one can't call PMI2_Init() multiple times per rank (although the PMI2 spec doesn't explicitly prohibit it), but I do consider it a bug that SLURM hangs indefinitely. From when I looked at the code, I recall thinking that behavior would be fairly easy to fix. I'm quite willing to put together a patch - is that something you'd be interested in? -Aaron

Aaron, feel free to put together a patch that addresses this. You'll probably want to center your efforts around _handle_task_request in pmi2/agent.c. If your patch can make it so it simply returns on every subsequent PMI2_Init() call, that would be ideal. Regards, Tim

Aaron, because this functionality needs to be added, I'm switching this ticket to a sev 5 (enhancement). Thanks, Tim

*** Ticket 4024 has been marked as a duplicate of this ticket. ***

Created attachment 5219 [details]: prevent multiple pmi2_init calls from leaving libpmi2 hung

The attached patch prevents a second PMI2_Init() call from leaving our libpmi2 code stuck, and forces the socket closed for good measure, to ensure any and all successive PMI calls will fail. There's a confounding issue in our pmi2 code that will need to be addressed, which led to this patch acting in a slightly out-of-spec manner, and which I suspect may affect other implementations as well. Unfortunately, it is my understanding that a number of applications may have statically linked in their copy of libpmi2, so even if we ship a fixed version, I know that won't solve the problem for a lot of sites. There is also a tangentially related fix on master you may want to apply alongside this: https://github.com/SchedMD/slurm/commit/b018470f0d5e.patch

Thanks, Tim. That actually fixes the immediate issue for us, I believe. The MPI implementation that we're seeing this with (MPT) dlopen()'s libpmi2, so it will pick up SLURM's PMI2 lib.
I'd started looking at a more generic fix that would trigger stepd to close the socket, but that approach left me with more questions than answers. Originally I was hoping we could actually support a second PMI2_Init so that the applications could continue to function, but having them terminate with an error is an order of magnitude better than an indefinite hang, so I'll call this a win. I should also ask SGI/HPE why they're calling PMI2_Init before MPI_Init has been called. As far as I'm concerned, I think we can close this ticket. -Aaron

As you mentioned, though, this doesn't help an app speaking the PMI2 wire protocol directly that hits this rather than opening libpmi2. We haven't encountered that case yet, that I can recall. Final version of the patch is in commit b2aa25d50dca17, and will be in 17.02.8 when released.
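The shape of the client-side guard the patch describes (fail a second PMI2_Init() and close the socket so all later PMI calls fail too) might look roughly like this; the names, structure, and return values are illustrative assumptions, not the actual libpmi2 patch:

```c
#include <stdbool.h>
#include <unistd.h>

/* Illustrative sketch, not the actual libpmi2 patch: a static flag makes a
 * second init attempt fail immediately, and the PMI socket is closed so any
 * later PMI calls fail as well, instead of the step hanging. */

static bool pmi2_initialized = false;
static int  pmi2_fd = -1;

/* Stand-in for the real connection setup; returns a dummy fd here. */
static int connect_to_stepd(void) { return -1; }

int my_pmi2_init(int *spawned, int *size, int *rank, int *appnum)
{
    if (pmi2_initialized) {
        if (pmi2_fd >= 0) {
            close(pmi2_fd);   /* force subsequent PMI calls to fail too */
            pmi2_fd = -1;
        }
        return 1;             /* nonzero return = error, per PMI2's rc convention */
    }
    pmi2_fd = connect_to_stepd();
    pmi2_initialized = true;
    *spawned = 0; *size = 1; *rank = 0; *appnum = 0;  /* dummy values */
    return 0;
}
```

With this shape, the second call in the pmi2_init reproducer above would get a nonzero rc back (tripping its assert) rather than blocking the step until it times out.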