Ticket 614

Summary: PMI2 socket being closed
Product: Slurm
Reporter: David Bigagli <david>
Component: Other
Assignee: David Bigagli <david>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Priority: ---
CC: da
Version: 14.03.x
Hardware: Linux
OS: Linux
Site: SchedMD
Version Fixed: 14.03.0rc1

Description David Bigagli 2014-02-28 08:56:21 MST
Yo David

We're hearing some squawks about PMI2 under Slurm that sound eerily like something we saw from Cray that causes lots of problems. Basically, it appears that PMI2 is opening some file descriptors to the spawned processes that are expected to "persist", even across fork/exec boundaries.

In other words, if a process launched by Slurm wants to spawn a child process, the normal procedure is to (a) fork, (b) close all file descriptors other than 0-2, and then (c) exec. However, if you do this with PMI2 active, then PMI2 will barf.

Here is a very simple way to demonstrate the problem, courtesy of one user:

Here’s the PMI-only “this violates ‘no surprises’” demonstration.  (Nice that I still had a couple of those PMI programs hanging around.)
 
(18:15)m80<SALLOC:8on1>:~/upc$ cat pmi2-003.c
/*
  cc -Wall -I/opt/slurm/include  pmi2-003.c -L/opt/slurm/lib64  -lpmi2
  cc -Wall -I$SLURM_ROOT/include pmi2-003.c -L$SLURM_ROOT/lib64 -lpmi2
*/
 
#include "slurm/pmi2.h"
int main(int argc, char **argv)
{
    int spawned = -1, size = -1, rank = -1, appnum = -1;
    return PMI2_Init(&spawned, &size, &rank, &appnum);
}
(18:15)m80<SALLOC:8on1>:~/upc$ cc -Wall -I$SLURM_ROOT/include pmi2-003.c -L$SLURM_ROOT/lib64 -lpmi2
(18:16)m80<SALLOC:8on1>:~/upc$ srun -n 8 ./a.out
(18:16)m80<SALLOC:8on1>:~/upc$ srun -n 8 bash -cf ./a.out
(18:16)m80<SALLOC:8on1>:~/upc$ srun -n 8 csh -cf ./a.out
srun: error: n016: tasks 3-4: Exited with exit code 14
(18:16)m80<SALLOC:8on1>:~/upc$                   

Note that bash doesn't close fds prior to exec, but csh does. We can all argue about which behavior is "correct", but the fact remains that closing fds before exec is a long-acknowledged (and even taught!) best practice. Can you help us fix this mess?

All that is required is for Slurm to pass an environment variable with the PMI2 server's socket address, and for the PMI2 client to open its own socket during PMI2_Init and connect to the server.

Thanks
Ralph
Comment 1 David Bigagli 2014-02-28 09:06:12 MST
I think this is the same problem I dealt with back in the day.
What is happening is that srun sets the environment variable PMI2_fd, which tells the PMI2 library which socket to use to talk to the PMI2 backend. Unfortunately, csh closes some of these file descriptors before starting the application. It does not happen every time: csh closes fds 16 and 18, which I verified using a 4-component job; with only 2 components no problem occurs. I can see if on the srun side we can allocate a higher socket number.

David
Comment 2 David Bigagli 2014-03-20 11:39:08 MDT
Fixed in commit: 084787c0d8f26

Thanks,
David