Ticket 3363

Summary: OmniPath fabric: srun error PSM2 can't open hfi unit: -1 (err=23)
Product: Slurm
Reporter: Ole.H.Nielsen <Ole.H.Nielsen>
Component: Other
Assignee: Tim Wickberg <tim>
Status: RESOLVED DUPLICATE
Severity: 4 - Minor Issue
CC: jacob
Version: 16.05.6
Hardware: Linux
OS: Linux
Site: DTU Physics

Description Ole.H.Nielsen@fysik.dtu.dk 2016-12-22 03:49:53 MST
We're installing new nodes running CentOS 7.2 with an Intel Omni-Path (OPA) fabric. The Intel® Omni-Path Fabric Software for Linux 10.2 drivers have been installed. The OPA fabric seems to work correctly, and we can run OpenMPI tasks over the OPA fabric in interactive logins to the compute nodes. Our OpenMPI 1.10.3 has been built with configopts += '--with-psm2' for PSM2 support. We can also run Intel's OpenMPI 1.10.2.
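
For reference, PSM2 support is enabled by a single configure option in our EasyBuild easyconfig (a sketch; the rest of the easyconfig is omitted here):

configopts += '--with-psm2'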

When submitting jobs with sbatch and trying to run an MPI demo in the batch script:
  srun ./mpi_hello_world
we get some error messages:

PSM2 can't open hfi unit: -1 (err=23)
PSM2 was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.
  Error: Failure in initializing endpoint

When I use mpirun in the Slurm job, an additional error message is printed besides the PSM2 endpoint error:
hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable

I've experimented with different OpenMPI settings and googled for similar problems, but to no avail.

Question: How do we enable the OPA PSM2 endpoints within Slurm jobs so that we can run OpenMPI tasks?
Comment 1 Tim Wickberg 2016-12-22 08:27:03 MST
Welcome aboard! Jacob's getting your site added to the list at the moment, but that's a minor detail.

The PSM2 issue we'll need to investigate further. Is this test job running on a single node, or multiple nodes?

For mpirun, it looks like this may be caused by a memlock limit in place on the node preventing the driver from setting up the shared memory it uses to communicate with the adapter; it looks like adding a line like this to /etc/security/limits.conf and restarting slurmd on the node may work to fix that:

*		hard	memlock		unlimited

You can verify that slurmd is running with no restriction with a bash line like:

cat /proc/$(pgrep -u 0 slurmd)/limits|grep locked
Comment 2 Tim Wickberg 2016-12-22 08:31:23 MST
For PSM2 - would you mind installing the 1.10.5 release? There were a few commits the OpenMPI developers made recently, and I know at least one bug related to PSM2 interoperability was fixed then.
Comment 3 Tim Wickberg 2016-12-22 08:36:14 MST
> For mpirun, it looks like this may be caused by a memlock limit in place on
> the node preventing the driver from setting up the shared memory it uses to
> communicate with the adapter; it looks like adding a line like this to
> /etc/security/limits.conf and restarting slurmd on the node may work to fix
> that:
> 
> *		hard	memlock		unlimited
> 
> You can verify that slurmd is running with no restriction with a bash line
> like:
> 
> cat /proc/$(pgrep -u 0 slurmd)/limits|grep locked

Actually - can you run that command first? If you're using the systemd service file we provide you shouldn't need to set this through limits.conf; if you're using the older init script you may have to.
Comment 4 Ole.H.Nielsen@fysik.dtu.dk 2016-12-23 02:22:21 MST
(In reply to Tim Wickberg from comment #1)
> The PSM2 issue we'll need to investigate further. Is this test job running
> on a single node, or multiple nodes?

I'm explicitly testing 4 nodes with 1 task per node: sbatch -N 4.
Error messages appear from each of the 4 hosts.

> For mpirun, it looks like this may be caused by a memlock limit in place on
> the node preventing the driver from setting up the shared memory it uses to
> communicate with the adapter; it looks like adding a line like this to
> /etc/security/limits.conf and restarting slurmd on the node may work to fix
> that:
> 
> *		hard	memlock		unlimited

The Intel OPA software installation had already configured /etc/security/limits.conf (and the nodes were rebooted):
# -- All OPA Settings Start here --
# [ICS VERSION STRING: @(#) ./config/limits.conf.redhat.ES72 10_2_0_0_169 [09/08/16 17:17]
# User space Infiniband verbs require memlock permissions
# if desired you can limit these permissions to the users permitted to use OPA
# and/or reduce the limits.  Keep in mind this limit is per user
# (not per process)
* hard memlock unlimited
* soft memlock unlimited
# -- All OPA Settings End here --

> You can verify that slurmd is running with no restriction with a bash line
> like:
> 
> cat /proc/$(pgrep -u 0 slurmd)/limits|grep locked

This looks fine:
# cat /proc/$(pgrep -u 0 slurmd)/limits|grep locked
Max locked memory         unlimited            unlimited            bytes
Comment 5 Ole.H.Nielsen@fysik.dtu.dk 2016-12-23 02:33:18 MST
(In reply to Tim Wickberg from comment #2)
> For PSM2 - would you mind installing the 1.10.5 release? There were a few
> commits the OpenMPI developers made recently, and I know at least one bug
> related to PSM2 interoperability was fixed then.

Yes, OpenMPI 1.10.5 should be tried out.  We use EasyBuild to create software modules, and I need to work out a module file for 1.10.5.

Please note that Intel's OpenMPI 1.10.2 as well as my own 1.10.3 work correctly when I log in to the nodes interactively and run (for example) this mpirun using PSM2:
mpirun -H x069,x060,x061,x062 -np 4 -mca mtl psm2 -mca btl ^openib,sm ./mpi_hello_world
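
For comparison, the Slurm-native equivalent of this launch (a sketch, assuming 4 nodes with one task each as in the tests above) would be:

srun -N 4 --ntasks-per-node=1 --mpi=pmi2 ./mpi_hello_world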
Comment 6 Ole.H.Nielsen@fysik.dtu.dk 2016-12-23 03:02:58 MST
(In reply to Tim Wickberg from comment #3)
> > You can verify that slurmd is running with no restriction with a bash line
> > like:
> > 
> > cat /proc/$(pgrep -u 0 slurmd)/limits|grep locked
> 
> Actually - can you run that command first? If you're using the systemd
> service file we provide you shouldn't need to set this through limits.conf;
> if you're using the older init script you may have to.

Yes, the slurmd service file on our CentOS 7.2 nodes is installed by the slurm RPM:
# rpm -qf /usr/lib/systemd/system/slurmd.service
slurm-16.05.6-1.el7.centos.x86_64

and contain the desired settings:
...
[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
LimitSTACK=infinity
...
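
As a cross-check, the effective limits systemd applies to the unit can also be queried directly (using systemctl's property output):

systemctl show slurmd -p LimitMEMLOCK -p LimitSTACK -p LimitNOFILE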
Comment 7 Ole.H.Nielsen@fysik.dtu.dk 2016-12-25 07:08:31 MST
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #5)
> (In reply to Tim Wickberg from comment #2)
> > For PSM2 - would you mind installing the 1.10.5 release? There were a few
> > commits the OpenMPI developers made recently, and I know at least one bug
> > related to PSM2 interoperability was fixed then.
> 
> Yes, OpenMPI 1.10.5 should be tried out.  We use EasyBuild to create
> software modules, and I need to work out a module file for 1.10.5.
> 
> Please note that Intel's OpenMPI 1.10.2 as well as my own 1.10.3 work
> correctly when I log in to the nodes interactively and run (for example)
> this mpirun using PSM2:
> mpirun -H x069,x060,x061,x062 -np 4 -mca mtl psm2 -mca btl ^openib,sm
> ./mpi_hello_world

I've built OpenMPI 1.10.5 with EasyBuild now, and on 4 nodes (1 task on each node) I get the same srun and mpirun errors as with 1.10.3:

 srun --mpi=pmi2
x049.nifl.fysik.dtu.dk.25497PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
PSM2 was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
x049.nifl.fysik.dtu.dk.25497hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable
x050.nifl.fysik.dtu.dk.23098x051.nifl.fysik.dtu.dk.23107x052.nifl.fysik.dtu.dk.31625--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
PSM2 can't open hfi unit: -1 (err=23)--------------------------------------------------------------------------
PSM2 was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
PSM2 can't open hfi unit: -1 (err=23)--------------------------------------------------------------------------
PSM2 was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
PSM2 can't open hfi unit: -1 (err=23)--------------------------------------------------------------------------
PSM2 was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.

  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
*** on a NULL communicator
x051.nifl.fysik.dtu.dk.23107hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable

x050.nifl.fysik.dtu.dk.23098hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable

x052.nifl.fysik.dtu.dk.31625hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable
Comment 8 Ole.H.Nielsen@fysik.dtu.dk 2016-12-26 08:14:50 MST
Problem SOLVED: The runtime error:
   PSM2 can't open hfi unit: -1 (err=23)
is caused by slurmd's configured limit
   LimitMEMLOCK=infinity
being overridden by the user's default limits from the login node.
By default, Slurm propagates all limits to the batch job as described in https://slurm.schedmd.com/faq.html#memlock.

One can diagnose this error by adding a line to the slurm job script:
   ulimit -l
which must return "unlimited".
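
For example, a minimal diagnostic batch script (a sketch, reusing the 4-node test and MPI binary from above) could be:

   #!/bin/bash
   #SBATCH --nodes=4
   #SBATCH --ntasks-per-node=1
   # Print the locked-memory limit seen inside the job; PSM2 requires "unlimited":
   ulimit -l
   srun --mpi=pmi2 ./mpi_hello_world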

One must add
   PropagateResourceLimitsExcept=MEMLOCK
to slurm.conf in order to avoid this problem, as explained in the FAQ.
I have verified that MPI jobs now work correctly when the locked memory limit has been set to unlimited.

The PSM2 runtime error is rather unintelligible, and I'll take this issue up with my Intel representative.
Comment 9 Ole.H.Nielsen@fysik.dtu.dk 2016-12-30 02:16:52 MST
Added to the solution: The slurmd daemon is started at boot time by /etc/init.d/slurm with only the system default limits.  Verify this by:
cat "/proc/$(pgrep -u 0 slurmd)/limits"

In https://bugs.schedmd.com/show_bug.cgi?id=3371 I suggest the workaround:

echo ulimit -l unlimited -s unlimited -n 51200 >> /etc/sysconfig/slurm

in order to duplicate at boot time the limits set in /usr/lib/systemd/system/slurmd.service.
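
Should the combined ulimit call not be accepted by the shell, the same limits can be appended as separate lines instead:

echo "ulimit -l unlimited" >> /etc/sysconfig/slurm
echo "ulimit -s unlimited" >> /etc/sysconfig/slurm
echo "ulimit -n 51200" >> /etc/sysconfig/slurm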
Comment 10 Tim Wickberg 2016-12-30 07:12:55 MST
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #9)
> Added to the solution: The slurmd daemon is started at boot time by
> /etc/init.d/slurm with only the system default limits.  Verify this by:
> cat "/proc/$(pgrep -u 0 slurmd)/limits"
> 
> In https://bugs.schedmd.com/show_bug.cgi?id=3371 I suggest the workaround:
> 
> echo ulimit -l unlimited -s unlimited -n 51200 >> /etc/sysconfig/slurm
> 
> in order to duplicate at boot time the limits set in
> /usr/lib/systemd/system/slurmd.service.

Are you using both the init script and the systemd service file on the compute nodes? You should only use one or the other; although I'm unsure of what exactly happens if you have both in place.
Comment 11 Ole.H.Nielsen@fysik.dtu.dk 2016-12-30 07:47:49 MST
(In reply to Tim Wickberg from comment #10)
> (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #9)
> > Added to the solution: The slurmd daemon is started at boot time by
> > /etc/init.d/slurm with only the system default limits.  Verify this by:
> > cat "/proc/$(pgrep -u 0 slurmd)/limits"
> > 
> > In https://bugs.schedmd.com/show_bug.cgi?id=3371 I suggest the workaround:
> > 
> > echo ulimit -l unlimited -s unlimited -n 51200 >> /etc/sysconfig/slurm
> > 
> > in order to duplicate at boot time the limits set in
> > /usr/lib/systemd/system/slurmd.service.
> 
> Are you using both the init script and the systemd service file on the
> compute nodes? You should only use one or the other; although I'm unsure of
> what exactly happens if you have both in place.

Both, I think.  On RHEL/CentOS 7 I assumed that systemd was used, but I just discovered that the slurm RPM installs /etc/init.d/slurm which seems to take precedence at boot time.  The /etc/init.d/slurm as installed will start by default at boot time:
chkconfig --list slurm
slurm           0:off   1:off   2:on    3:on    4:on    5:on    6:off

We need to figure out the best practices for Systemd based systems: Should we use /etc/init.d/slurm or systemctl??  My preference would be to use systemctl for consistency with EL7.  The /etc/init.d/slurm should then be reconfigured on EL7 systems to not start by default, or the service shouldn't be added in the first place! What's your opinion?

I note that the slurm.spec file adds the service thus:

%post
if [ -x /sbin/ldconfig ]; then
    /sbin/ldconfig %{_libdir}
    if [ $1 = 1 ]; then
        [ -x /sbin/chkconfig ] && /sbin/chkconfig --add slurm
    fi
fi

This could perhaps be omitted on EL7 systems?
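
A sketch of how the %post scriptlet might be guarded (assuming the usual RPM distro conditionals; the exact form would depend on the platforms the spec targets):

%post
if [ -x /sbin/ldconfig ]; then
    /sbin/ldconfig %{_libdir}
fi
%if 0%{?rhel} && 0%{?rhel} < 7
# Register the SysV init script only on pre-systemd distributions
if [ $1 = 1 ]; then
    [ -x /sbin/chkconfig ] && /sbin/chkconfig --add slurm
fi
%endif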
Comment 12 Tim Wickberg 2016-12-30 07:58:24 MST
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #11)
> (In reply to Tim Wickberg from comment #10)
> > (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #9)
> > > Added to the solution: The slurmd daemon is started at boot time by
> > > /etc/init.d/slurm with only the system default limits.  Verify this by:
> > > cat "/proc/$(pgrep -u 0 slurmd)/limits"
> > > 
> > > In https://bugs.schedmd.com/show_bug.cgi?id=3371 I suggest the workaround:
> > > 
> > > echo ulimit -l unlimited -s unlimited -n 51200 >> /etc/sysconfig/slurm
> > > 
> > > in order to duplicate at boot time the limits set in
> > > /usr/lib/systemd/system/slurmd.service.
> > 
> > Are you using both the init script and the systemd service file on the
> > compute nodes? You should only use one or the other; although I'm unsure of
> > what exactly happens if you have both in place.
> 
> Both, I think.  On RHEL/CentOS 7 I assumed that systemd was used, but I just
> discovered that the slurm RPM installs /etc/init.d/slurm which seems to take
> precedence at boot time.  The /etc/init.d/slurm as installed will start by
> default at boot time:
> chkconfig --list slurm
> slurm           0:off   1:off   2:on    3:on    4:on    5:on    6:off
> 
> We need to figure out the best practices for Systemd based systems: Should
> we use /etc/init.d/slurm or systemctl??  My preference would be to use
> systemctl for consistency with EL7.  The /etc/init.d/slurm should then be
> reconfigured on EL7 systems to not start by default, or the service
> shouldn't be added in the first place! What's your opinion?

Use systemctl and the service file. Ignore the init scripts going forward, and I'd suggest removing them from the compute nodes if you can to prevent this.

There does seem to be an issue with the slurm.spec file where both are installed, leading to this slightly-confusing behavior. I'm going to see what we can do to mitigate that.

> I note that the slurm.spec file adds the service thus:
> 
> %post
> if [ -x /sbin/ldconfig ]; then
>     /sbin/ldconfig %{_libdir}
>     if [ $1 = 1 ]; then
>         [ -x /sbin/chkconfig ] && /sbin/chkconfig --add slurm
>     fi
> fi
> 
> This could perhaps be omitted on EL7 systems?

I'm looking through it, and this definitely seems to be an oversight. Using both service files and the systemd init compatibility mechanisms is not intentional.
Comment 13 Ole.H.Nielsen@fysik.dtu.dk 2016-12-30 08:11:07 MST
(In reply to Tim Wickberg from comment #12)
> > We need to figure out the best practices for Systemd based systems: Should
> > we use /etc/init.d/slurm or systemctl??  My preference would be to use
> > systemctl for consistency with EL7.  The /etc/init.d/slurm should then be
> > reconfigured on EL7 systems to not start by default, or the service
> > shouldn't be added in the first place! What's your opinion?
> 
> Use systemctl and the service file. Ignore the init scripts going forward,
> and I'd suggest removing them from the compute nodes if you can to prevent
> this.

I agree that this would be the most consistent approach.  On EL7 systems one might make the change thus:

chkconfig slurm off
systemctl enable slurmd

I have just verified this on a compute node: The slurmd service is running correctly after a reboot.
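
To double-check which start mechanism remains enabled after the change, one can run:

chkconfig --list slurm
systemctl is-enabled slurmd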

I wouldn't remove the file /etc/init.d/slurm because it has been installed by the slurm-16.05.6-1.el7.centos.x86_64 RPM.

> There does seem to be an issue with the slurm.spec file where both are
> installed, leading to this slightly-confusing behavior. I'm going to see
> what we can do to mitigate that.
> 
> > I note that the slurm.spec file adds the service thus:
> > 
> > %post
> > if [ -x /sbin/ldconfig ]; then
> >     /sbin/ldconfig %{_libdir}
> >     if [ $1 = 1 ]; then
> >         [ -x /sbin/chkconfig ] && /sbin/chkconfig --add slurm
> >     fi
> > fi
> > 
> > This could perhaps be omitted on EL7 systems?
> 
> I'm looking through it, and this definitely seems to be an oversight. Using
> both service files and the systemd init compatibility mechanisms is not
> intentional.

I agree that it's better to omit the slurm service on EL7 systems and rely on systemd instead.

This, however, begs for documentation! I don't believe the Slurm web pages have OS-specific instructions for EL7, for example. That's why I've been documenting my work with Slurm on CentOS 7 in a Wiki page (which has grown too long): https://wiki.fysik.dtu.dk/niflheim/SLURM
Comment 14 Ole.H.Nielsen@fysik.dtu.dk 2017-01-25 06:11:14 MST
I think we may close this case as a duplicate of https://bugs.schedmd.com/show_bug.cgi?id=3371
Comment 15 Tim Wickberg 2017-01-25 11:32:53 MST
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #14)
> I think we may close this case as a duplicate of
> https://bugs.schedmd.com/show_bug.cgi?id=3371

Yes, it appears so. Marking as a duplicate.

*** This ticket has been marked as a duplicate of ticket 3371 ***
Comment 17 Ole.H.Nielsen@fysik.dtu.dk 2018-07-30 14:20:56 MDT
I'm out of the office until August 13.

Best regards / Venlig hilsen,
Ole Holm Nielsen