| Summary: | Prevent multiple PMI2_Init calls from same rank from hanging slurmstepd | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Aaron Knister <aaron.s.knister> |
| Component: | slurmstepd | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | alex, mcoyne, mej, rhc, sts |
| Version: | 17.11.x | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=13622 | | |
| Site: | NASA - NCCS | | |
| Version Fixed: | 17.02.8 17.11.0-pre3 | Target Release: | --- |
| Attachments: | pmi2_init.c; prevent multiple pmi2_init calls from leaving libpmi2 hung | | |

Description
Aaron Knister, 2017-03-01 13:10:35 MST
I'm playing catch up here with the PMI2 API design, so please bear with me. Looping indefinitely on an invalid command, or a command we don't expect to receive at that point, is definitely a bug, and something we'd certainly want to fix ASAP. If you have a patch that handles this part, please let me know. Looking at our current PMI implementation, it doesn't appear that re-initializing is safe due to the number of static variables in use within the plugin; at the very least I'd want to audit it extensively before exploring that as an option. I'm thinking the safe approach at the moment would be to reject any successive "cmd=init" messages (and document that PMI2_Init() is only permitted once per step), rather than loop infinitely until the step times out. I'm adding Ralph Castain as a CC here in the hopes he may shed some light as well; he's our usual contact with the OpenMPI folks, knows a lot more than I do about the inner workings, and may be able to weigh in on the expected behavior of successive PMI2_Init() calls.

You actually have a bigger problem than you realize. The PMI2 server in SLURM assigns a port to the initial client and requires that only that client call back on it. In this case, you are using mpirun to launch your script, and mpirun launches its own daemons. Thus, as far as SLURM is concerned, the mpirun daemon is the PMI2 client. This highlights something often overlooked when launching with mpirun: SLURM itself has no idea that there are application processes running; it only sees the daemons started by mpirun. Adding to the problem, mpirun doesn't know anything about your worker ranks; it only sees the script. So when your worker ranks start calling PMI2_Init, the PMI2 server is caught by surprise, as the calls are coming from "non-client" processes, and more than one of them is claiming to be the same "rank" (since the rank is associated with the assigned port).

Bottom line: your use case violates the design of the PMI2 implementation, as it has multiple separations from the PMI2 server and multiple procs all claiming to be the same "rank" as far as SLURM is concerned. Your executables aren't going to work correctly if they call PMI2_Init unless SLURM makes major architectural changes to the PMI2 implementation.

Ralph - thank you for the clarification; that's what I was hoping you might be able to answer. Aaron - the "hang" you'd identified: is this from the Slurm PMI interface not knowing how to handle the (unexpected) message format and not returning an appropriate error? Or is it the application itself not knowing what to do if PMI2_Init fails?

I should have added a few alternative suggestions, if you will forgive the shameless plug for a project I lead. If you have access to version 2.0 or above of OpenMPI, it includes a copy of the PMIx code. PMIx allows this use case, where a child process fork/exec's additional children of its own that also want to call PMI_Init, and PMIx has backward compatibility for PMI-2. So you could use that version of OMPI and run your job with that mpirun. Or you can install the PMIx reference server (https://github.com/pmix/pmix-reference-server), run it in your job allocation, and then execute your job against that environment using "prun". Also, if you prefer to direct-launch with srun, PMIx support is included in v16.05 and above, so you could use --mpi=pmix to support this use case that way. HTH, Ralph

Thanks Tim, Ralph. Sorry for the confusion; the MPI implementation I'm having this issue with is SGI MPT rather than OpenMPI. I believe the issue is SLURM not responding to the init request from the application, so the application sits indefinitely waiting for a reply. I think we've found a workable solution in the meantime (unsetting the various PMI_* environment variables) in the perl worker processes that launch the MPI-compiled (but not actually MPI-using) binaries. -Aaron

Aaron, I have a couple of questions for you. Can you provide us with a simple reproducer and the compile line for your binary so we can try to reproduce it here? That would help us get a better idea of how you're doing it and how the hang is happening. Also, you mentioned using mpirun, but have you tried just using srun to submit the job? For example: srun --mpi=pmi2 -N2 -n2 hello_mpi. Lastly, the version on this bug is set to 17.11.x. Is that the version you're really using? If not, please update it so I make sure I'm using the same version you are. Many thanks, Tim

Hi Tim, the application in this case is something known as "prund" that's run as part of a job workflow. It's an in-house tool developed to run through a work queue and launch commands. I forget the specifics of how it's run, but the idea is that you have a "master" process with a list of commands to run, i.e.: cat work_to_do.txt | prund --master, where work_to_do.txt could hold a list of shell commands that each, say, generate a plot. Then you launch the workers via mpirun: mpirun prund --slave --master-host=$HOSTNAME. The idea is that the slaves connect back to the master and run each command, one per MPI rank, until all commands listed in the file have been run. It's launched with the mpirun launcher to make the workflow portable between various MPI and scheduler implementations. The trouble we run into is that although the applications run through prund are generally inherently serial and don't explicitly make any MPI calls, they're built with mpif90 and linked with various MPI libraries. On application launch, even though they're not actually MPI apps, the MPI libraries they were linked against attempt to perform various initialization routines, such as calling PMI2_Init. The MPI implementation having this issue, specifically, is SGI MPT.
The end result is that, from the perspective of slurmstepd, a single "MPI rank" (prund worker) calls PMI2_Init multiple times, once for each fork() and exec() of the non-MPI app run through prund. The first task spawned by each prund worker would run just fine, but subsequent tasks would hang indefinitely with some errors about an invalid client request. Our workaround in prund was to simply unset the various PMI2 environment variables, preventing them from being passed through to the tasks it fork()s, which seems to stop the MPI libraries from doing any PMI2 initialization. We are still on SLURM 14.03, but I have reproduced this on 17.11 as of opening the ticket. The PMI2 double-init issue was still there, and I do have a reproducer. Here's how I simulated the behavior of MPT and reproduced it:

gcc pmi2_init.c -o pmi2_init -lpmi2
srun -N1 -n1 --mpi=pmi2 ./run_pmi2.sh

I'll attach the C code and shell script here in just a second. -Aaron

Created attachment 4338 [details]: pmi2_init.c

run_pmi2.sh is just a shell script that runs ./pmi2_init twice. -Aaron

Thanks Aaron. I'll give that a try and get back to you. Tim

Aaron,
I've spent most of the day playing with this and I haven't been able to get the hang with the script the way it is currently. Here's my job submission:
```
$ srun -N8 -n8 --mpi=pmi2 ./run_pmi2.sh
over and out
over and out
over and out
over and out
over and out
over and out
over and out
over and out
```
The job finishes. However, I do see a hang if I try to add another PMI2_Init() call to the pmi2_init.c file, like this:

```c
#include <assert.h>
#include <stdio.h>
#include <slurm/pmi2.h>

int main() {
    int rc, spawned, size, rank, appnum;
    rc = PMI2_Init(&spawned, &size, &rank, &appnum);
    assert(rc == 0);
    rc = PMI2_Init(&spawned, &size, &rank, &appnum);
    assert(rc == 0);
    printf("over and out\n");
    return rc;
}
```
Now when I submit the same job I get:

```
$ srun -N8 -n8 --mpi=pmi2 ./run_pmi2.sh
slurmstepd-chiron0: error: mpi/pmi2: request not begin with 'cmd='
slurmstepd-chiron0: error: mpi/pmi2: full request is:
slurmstepd-chiron0: error: mpi/pmi2: invalid client request
slurmstepd-chiron0: error: mpi/pmi2: request not begin with 'cmd='
slurmstepd-chiron0: error: mpi/pmi2: full request is:
slurmstepd-chiron0: error: mpi/pmi2: invalid client request
slurmstepd-chiron0: error: mpi/pmi2: request not begin with 'cmd='
...
```
Despite the fact that I'm getting an error, I'm not sure if this is the correct way to reproduce the same hang you're getting. Before I started debugging code, I just wanted to run this by you to see if this is a valid test case. Just let me know if I've successfully reproduced the error.
Thanks
Tim
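The fix eventually settled on later in this ticket amounts to rejecting a repeated "cmd=init" instead of looping forever. As a rough sketch of that idea (the function name, response strings, and wire format here are illustrative assumptions, not Slurm's actual pmi2/agent.c code or the real PMI2 wire protocol):

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the server-side guard discussed in this ticket:
 * instead of looping indefinitely on an unexpected second "cmd=init", the
 * handler fails fast with an error response. Names and message strings are
 * illustrative, not Slurm's actual symbols or wire format. */

static bool task_initialized = false;

/* Returns 0 on success, -1 on a malformed or duplicate-init request,
 * writing a response string for the client in either case. */
int handle_task_request(const char *req, char *resp, size_t resp_len)
{
    if (strncmp(req, "cmd=", 4) != 0) {
        /* Mirrors the "request not begin with 'cmd='" error path above. */
        snprintf(resp, resp_len, "cmd=init-response;rc=-1;msg=invalid request");
        return -1;
    }
    if (strcmp(req, "cmd=init") == 0) {
        if (task_initialized) {
            /* Second init from the same rank: reject rather than hang. */
            snprintf(resp, resp_len,
                     "cmd=init-response;rc=-1;msg=already initialized");
            return -1;
        }
        task_initialized = true;
        snprintf(resp, resp_len, "cmd=init-response;rc=0");
        return 0;
    }
    snprintf(resp, resp_len, "cmd=unknown-response;rc=-1");
    return -1;
}
```

The essential point is that the duplicate init produces an immediate error reply, so the client's PMI2_Init() fails instead of blocking forever on a response that never comes.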
I confess to still being puzzled by something Aaron said, so perhaps he can clarify for me. The implication from your comments is that your processes are not calling MPI_Init - is that true? Are you then saying that SGI's MPT somehow is calling MPI_Init for you, even though you are not invoking MPI functions? I ask because that would be really weird and counter to expected MPI behavior. Ralph

Tim,
The run_pmi2.sh script looked like this for me:

```
#!/bin/bash
./pmi2_init
./pmi2_init
```

Which I think achieves the same result as calling PMI2_Init twice from pmi2_init.c.
>
> Despite the fact that I'm getting an error, I'm not sure if this is the correct
> way to reproduce the same hang you're getting. Before I started debugging
> code, I just wanted to run this by you to see if this is a valid test case.
> Just let me know if I've successfully reproduced the error.
I think this is as accurate a way as I can come up with to reproduce the
hang. Those errors look exactly like those I was seeing.
-Aaron
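Aaron's workaround (scrubbing the PMI-related variables from the environment before prund fork()/exec()s its tasks, so a dlopen'd MPI stack never attempts PMI2 initialization) can be sketched roughly like this. The exact variable names MPT consults are an assumption, but PMI_FD, PMI_JOBID, PMI_RANK, and PMI_SIZE are what Slurm's pmi2 plugin exports:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the prund-style workaround described above: remove PMI-related
 * environment variables before fork()/exec() so an MPI-linked child skips
 * PMI2 initialization. Which variables the MPI library actually reads is
 * an assumption; this removes everything matching PMI_* or PMI2_*. */
void scrub_pmi_env(void)
{
    extern char **environ;
    for (int i = 0; environ[i] != NULL; ) {
        if (strncmp(environ[i], "PMI_", 4) == 0 ||
            strncmp(environ[i], "PMI2_", 5) == 0) {
            char name[256];
            const char *eq = strchr(environ[i], '=');
            if (eq == NULL) { i++; continue; }
            size_t len = (size_t)(eq - environ[i]);
            if (len >= sizeof(name)) { i++; continue; }
            memcpy(name, environ[i], len);
            name[len] = '\0';
            /* unsetenv() compacts environ, so do not advance i here. */
            unsetenv(name);
        } else {
            i++;
        }
    }
}
```

A parent would call scrub_pmi_env() between fork() and exec() (or once before spawning any tasks) so the children never see the PMI socket/rank variables.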
Ralph, that's correct: the processes are not calling MPI_Init, and I don't think that SGI's MPT is calling MPI_Init for me, but what it absolutely appears to be doing is calling PMI2_Init for me. -Aaron

It's a tough one to resolve - very unfortunate that SGI did that to people. I assume you must be calling at least some function in that library, or else it would be difficult to understand how it just calls PMI2_Init for you out of the blue! The issue is that SLURM uses that unique process-to-socket mapping as part of its security policy. The idea is that SLURM creates the socket and then passes it to the process at launch, thus ensuring that process is the only one who knows about it. Admittedly, someone can port-scan to find it, and so there are other mechanisms in place - but the idea that only one process should be calling back on that socket, and that any other attempt represents a "bad actor", is an important element of the overall strategy. In fact, some resource managers go so far as to actually hit your process with a SIGKILL if it violates that policy; SLURM is one of the friendlier ones in that it only generates an error. PMIx can support multiple connections because we have a different security policy that involves the exchange of multiple verifying pieces of information. It isn't any stronger than the one used here, but it was designed around the idea that a process might fork/exec a child that also needed to access PMIx. PMI (both 1 and 2) was not designed to support that mode. I'd really recommend you go back to SGI and request that they fix this problem. It would be a much cleaner solution, though I'm sure the folks here could "hack" some way to let you run - provided they are willing to "stretch" their security.

Aaron, I spent the better part of a day digging through the pmi2 library code and the slurmstepd code, and it's clear that PMI2_Init() is not meant to be called more than once per job step (slurmstepd process).
I believe Ralph is right that this is an SGI bug and they need to fix it. Regards, Tim

Thanks, Tim, Ralph. I don't necessarily consider it a bug that one can't call PMI2_Init() multiple times per rank (although the PMI2 spec doesn't explicitly prohibit it), but I do consider it a bug that SLURM hangs indefinitely. From when I looked at the code, I recall thinking that behavior would be fairly easy to fix. I'm quite willing to put together a patch - is that something you'd be interested in? -Aaron

Aaron, feel free to put together a patch that addresses this. You'll probably want to center your efforts around _handle_task_request in pmi2/agent.c. If your patch can make it so it simply returns on every subsequent PMI2_Init() call, that would be ideal. Regards, Tim

Aaron, because this functionality needs to be added, I'm switching this ticket to a sev 5 (enhancement). Thanks, Tim

*** Ticket 4024 has been marked as a duplicate of this ticket. ***

Created attachment 5219 [details]: prevent multiple pmi2_init calls from leaving libpmi2 hung

The attached patch prevents a second PMI2_Init() call from leaving our libpmi2 code stuck, and forces the socket closed for good measure, to ensure any and all successive PMI calls will fail. There's a confounding issue in our pmi2 code that will need to be addressed, which led to this patch acting in a slightly out-of-spec manner, and which I suspect may affect other implementations as well. Unfortunately, it is my understanding that a number of applications may have statically linked in their copy of libpmi2, so even if we ship a fixed version, I know that won't solve the problem for a lot of sites. There is also a tangentially related fix on master you may want to apply alongside this: https://github.com/SchedMD/slurm/commit/b018470f0d5e.patch

Thanks, Tim. That actually fixes the immediate issue for us, I believe. The MPI implementation that we're seeing this with (MPT) dlopen()'s libpmi2, so it will pick up SLURM's PMI2 lib.
I'd started looking at a more generic fix that would trigger stepd to close the socket, but that approach left me with more questions than answers. Originally I was hoping we could actually support a second PMI2_Init so that the applications could continue to function, but having them terminate with an error is an order of magnitude better than an indefinite hang, so I'll call this a win. I should also ask SGI/HPE why they're calling PMI2_Init before MPI_Init has been called. As far as I'm concerned, I think we can close this ticket. -Aaron

As you mentioned, though, this doesn't help an app speaking the PMI2 wire protocol directly that hits this rather than opening libpmi2. We haven't encountered that case yet, that I can recall. Final version of the patch is in commit b2aa25d50dca17, and will be in 17.02.8 when released.
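The shape of the client-side guard the patch describes (fail a second PMI2_Init() and close the socket so all later PMI calls fail too) might look roughly like this; the names, structure, and return values are illustrative assumptions, not the actual libpmi2 patch:

```c
#include <stdbool.h>
#include <unistd.h>

/* Illustrative sketch, not the actual libpmi2 patch: a static flag makes a
 * second init attempt fail immediately, and the PMI socket is closed so any
 * later PMI calls fail as well, instead of the step hanging. */

static bool pmi2_initialized = false;
static int  pmi2_fd = -1;

/* Stand-in for the real connection setup; returns a dummy fd here. */
static int connect_to_stepd(void) { return -1; }

int my_pmi2_init(int *spawned, int *size, int *rank, int *appnum)
{
    if (pmi2_initialized) {
        if (pmi2_fd >= 0) {
            close(pmi2_fd);   /* force subsequent PMI calls to fail too */
            pmi2_fd = -1;
        }
        return 1;             /* nonzero return = error, per PMI2's rc convention */
    }
    pmi2_fd = connect_to_stepd();
    pmi2_initialized = true;
    *spawned = 0; *size = 1; *rank = 0; *appnum = 0;  /* dummy values */
    return 0;
}
```

With this shape, the second call in the pmi2_init reproducer above would get a nonzero rc back (tripping its assert) rather than blocking the step until it times out.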