We're installing new nodes running CentOS 7.2 with an Intel Omni-Path (OPA) fabric. The Intel® Omni-Path Fabric Software for Linux 10.2 drivers have been installed. The OPA fabric seems to work correctly, and we can run OpenMPI tasks over it during interactive logins to the compute nodes. Our OpenMPI 1.10.3 has been built with configopts += '--with-psm2' for PSM2 support. We can also run Intel's OpenMPI 1.10.2.

When submitting jobs with sbatch and trying to run an MPI demo in the batch script:

  srun ./mpi_hello_world

we get these error messages:

  PSM2 can't open hfi unit: -1 (err=23)
  PSM2 was unable to open an endpoint.
  Please make sure that the network link is active on
  the node and the hardware is functioning.
    Error: Failure in initializing endpoint

When I use mpirun in the Slurm job, an additional error message besides the PSM2 endpoint error is printed:

  hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable

I've experimented with different OpenMPI settings and googled for similar problems, but to no avail.

Question: How do we enable the OPA PSM2 endpoints within Slurm jobs so that we can run OpenMPI tasks?
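For reference, the test batch script is essentially the following sketch (the module name OpenMPI/1.10.3 is just our local EasyBuild module name, and the exact #SBATCH options vary between tests):

  #!/bin/bash
  #SBATCH -N 4                     # number of nodes (example)
  #SBATCH --ntasks-per-node=1      # 1 MPI task per node (example)
  module load OpenMPI/1.10.3       # site-specific EasyBuild module name
  srun ./mpi_hello_world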
Welcome aboard! Jacob's getting your site added to the list at the moment, but that's a minor detail.

The PSM2 issue we'll need to investigate further. Is this test job running on a single node, or multiple nodes?

For mpirun, it looks like this may be caused by a memlock limit in place on the node preventing the driver from setting up the shared memory it uses to communicate with the adapter; it looks like adding a line like this to /etc/security/limits.conf and restarting slurmd on the node may work to fix that:

  * hard memlock unlimited

You can verify that slurmd is running with no restriction with a bash line like:

  cat /proc/$(pgrep -u 0 slurmd)/limits|grep locked
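A quick way to see the limit a job step actually inherits is to run ulimit inside a job; this is just a sketch, so adjust partition/node options for your site:

  srun -N1 -n1 bash -c 'ulimit -l'

That should print "unlimited" if the memlock setting is making it through to job steps.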
For PSM2 - would you mind installing the 1.10.5 release? There were a few commits the OpenMPI developers made recently, and I know at least one bug related to PSM2 interoperability was fixed then.
> For mpirun, it looks like this may be caused by a memlock limit in place on
> the node preventing the driver from setting up the shared memory it uses to
> communicate with the adapter; it looks like adding a line like this to
> /etc/security/limits.conf and restarting slurmd on the node may work to fix
> that:
>
> * hard memlock unlimited
>
> You can verify that slurmd is running with no restriction with a bash line
> like:
>
> cat /proc/$(pgrep -u 0 slurmd)/limits|grep locked

Actually - can you run that command first? If you're using the systemd service file we provide you shouldn't need to set this through limits.conf; if you're using the older init script you may have to.
(In reply to Tim Wickberg from comment #1)
> The PSM2 issue we'll need to investigate further. Is this test job running
> on a single node, or multiple nodes?

I'm explicitly testing 4 nodes with 1 task per node: sbatch -N 4. Error messages appear from each of the 4 hosts.

> For mpirun, it looks like this may be caused by a memlock limit in place on
> the node preventing the driver from setting up the shared memory it uses to
> communicate with the adapter; it looks like adding a line like this to
> /etc/security/limits.conf and restarting slurmd on the node may work to fix
> that:
>
> * hard memlock unlimited

The Intel OPA software installation had already configured /etc/security/limits.conf (and the nodes were rebooted):

  # -- All OPA Settings Start here --
  # [ICS VERSION STRING: @(#) ./config/limits.conf.redhat.ES72 10_2_0_0_169 [09/08/16 17:17]
  # User space Infiniband verbs require memlock permissions
  # if desired you can limit these permissions to the users permitted to use OPA
  # and/or reduce the limits. Keep in mind this limit is per user
  # (not per process)
  * hard memlock unlimited
  * soft memlock unlimited
  # -- All OPA Settings End here --

> You can verify that slurmd is running with no restriction with a bash line
> like:
>
> cat /proc/$(pgrep -u 0 slurmd)/limits|grep locked

This looks fine:

  # cat /proc/$(pgrep -u 0 slurmd)/limits|grep locked
  Max locked memory         unlimited            unlimited            bytes
(In reply to Tim Wickberg from comment #2)
> For PSM2 - would you mind installing the 1.10.5 release? There were a few
> commits the OpenMPI developers made recently, and I know at least one bug
> related to PSM2 interoperability was fixed then.

Yes, OpenMPI 1.10.5 should be tried out. We use EasyBuild to create software modules, and I need to work out a module file for 1.10.5.

Please note that Intel's OpenMPI 1.10.2 as well as my own 1.10.3 work correctly when I log in to the nodes interactively and run (for example) this mpirun using PSM2:

  mpirun -H x069,x060,x061,x062 -np 4 -mca mtl psm2 -mca btl ^openib,sm ./mpi_hello_world
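For completeness, outside EasyBuild the build corresponds roughly to the following configure line (the prefix is only an example; --with-psm2 is the flag we pass through configopts, and --with-slurm/--with-pmi are what one would add to launch directly with srun):

  ./configure --prefix=/sw/openmpi/1.10.5 \
              --with-psm2 \
              --with-slurm --with-pmi
  make -j 8 && make install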
(In reply to Tim Wickberg from comment #3)
> > You can verify that slurmd is running with no restriction with a bash line
> > like:
> >
> > cat /proc/$(pgrep -u 0 slurmd)/limits|grep locked
>
> Actually - can you run that command first? If you're using the systemd
> service file we provide you shouldn't need to set this through limits.conf;
> if you're using the older init script you may have to.

Yes, the slurmd service files on our CentOS 7.2 nodes are installed by the slurm RPM:

  # rpm -qf /usr/lib/systemd/system/slurmd.service
  slurm-16.05.6-1.el7.centos.x86_64

and contain the desired settings:

  ...
  [Service]
  Type=forking
  EnvironmentFile=-/etc/sysconfig/slurmd
  ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
  ExecReload=/bin/kill -HUP $MAINPID
  PIDFile=/var/run/slurmd.pid
  KillMode=process
  LimitNOFILE=51200
  LimitMEMLOCK=infinity
  LimitSTACK=infinity
  ...
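One can also ask systemd what limits the unit itself is configured with, as opposed to the limits of the running process shown by the /proc check above (property names as used by systemd on EL7):

  systemctl show slurmd -p LimitMEMLOCK -p LimitSTACK -p LimitNOFILE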
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #5)
> (In reply to Tim Wickberg from comment #2)
> > For PSM2 - would you mind installing the 1.10.5 release? There were a few
> > commits the OpenMPI developers made recently, and I know at least one bug
> > related to PSM2 interoperability was fixed then.
>
> Yes, OpenMPI 1.10.5 should be tried out. We use EasyBuild to create
> software modules, and I need to work out a module file for 1.10.5.
>
> Please note that Intel's OpenMPI 1.10.2 as well as my own 1.10.3 work
> correctly when I log in to the nodes interactively and run (for example)
> this mpirun using PSM2:
> mpirun -H x069,x060,x061,x062 -np 4 -mca mtl psm2 -mca btl ^openib,sm
> ./mpi_hello_world

I've built OpenMPI 1.10.5 with EasyBuild now, and I get the same srun as well as mpirun errors as with 1.10.3 on 4 nodes (1 task on each node):

srun --mpi=pmi2
x049.nifl.fysik.dtu.dk.25497PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
PSM2 was unable to open an endpoint.
Please make sure that the network link is active on
the node and the hardware is functioning.
  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
x049.nifl.fysik.dtu.dk.25497hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable
x050.nifl.fysik.dtu.dk.23098
x051.nifl.fysik.dtu.dk.23107
x052.nifl.fysik.dtu.dk.31625
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
PSM2 was unable to open an endpoint.
Please make sure that the network link is active on
the node and the hardware is functioning.
  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
PSM2 was unable to open an endpoint.
Please make sure that the network link is active on
the node and the hardware is functioning.
  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
PSM2 can't open hfi unit: -1 (err=23)
--------------------------------------------------------------------------
PSM2 was unable to open an endpoint.
Please make sure that the network link is active on
the node and the hardware is functioning.
  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
*** on a NULL communicator
x051.nifl.fysik.dtu.dk.23107hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable
x050.nifl.fysik.dtu.dk.23098hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable
x052.nifl.fysik.dtu.dk.31625hfi_userinit: mmap of rcvhdrq at dabbad0004030000 failed: Resource temporarily unavailable
Problem SOLVED: The runtime error

  PSM2 can't open hfi unit: -1 (err=23)

is caused by slurmd's configured limit LimitMEMLOCK=infinity being overridden by the user's default limits propagated from the login node. By default, Slurm propagates all resource limits to the batch job, as described in https://slurm.schedmd.com/faq.html#memlock.

One can diagnose this error by adding a line to the Slurm job script:

  ulimit -l

which must return "unlimited".

One must add PropagateResourceLimitsExcept=MEMLOCK to slurm.conf in order to avoid this problem, as explained in the FAQ.

I have verified that MPI jobs now work correctly when the locked memory limit has been set to unlimited. The PSM2 runtime error message is rather unintelligible, and I'll take this issue up with my Intel representative.
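To spell the fix out as configuration (the slurm.conf parameter is exactly as in the FAQ; the ulimit line is the diagnostic from our test job script):

  # slurm.conf on the controller and all nodes:
  PropagateResourceLimitsExcept=MEMLOCK

  # make the daemons pick up the change after distributing the file, e.g.:
  scontrol reconfigure

  # diagnostic at the top of the batch script; must print "unlimited":
  ulimit -l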
Added to the solution: The slurmd daemon is started at boot time by /etc/init.d/slurm with only the system default limits. Verify this by:

  cat "/proc/$(pgrep -u 0 slurmd)/limits"

In https://bugs.schedmd.com/show_bug.cgi?id=3371 I suggest the workaround:

  echo ulimit -l unlimited -s unlimited -n 51200 >> /etc/sysconfig/slurm

in order to duplicate at boot time the limits set in /usr/lib/systemd/system/slurmd.service.
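After the workaround /etc/sysconfig/slurm simply contains the ulimit line, which the init script sources before starting the daemons; restarting slurmd and re-checking /proc confirms it took effect (the grep pattern is just an example):

  # /etc/sysconfig/slurm now contains:
  ulimit -l unlimited -s unlimited -n 51200

  /etc/init.d/slurm restart
  cat "/proc/$(pgrep -u 0 slurmd)/limits" | grep -E 'locked memory|stack size|open files'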
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #9)
> Added to the solution: The slurmd daemon is started at boot time by
> /etc/init.d/slurm with only the system default limits. Verify this by:
> cat "/proc/$(pgrep -u 0 slurmd)/limits"
>
> In https://bugs.schedmd.com/show_bug.cgi?id=3371 I suggest the workaround:
>
> echo ulimit -l unlimited -s unlimited -n 51200 >> /etc/sysconfig/slurm
>
> in order to duplicate at boot time the limits set in
> /usr/lib/systemd/system/slurmd.service.

Are you using both the init script and the systemd service file on the compute nodes? You should only use one or the other; although I'm unsure of what exactly happens if you have both in place.
(In reply to Tim Wickberg from comment #10)
> (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #9)
> > Added to the solution: The slurmd daemon is started at boot time by
> > /etc/init.d/slurm with only the system default limits. Verify this by:
> > cat "/proc/$(pgrep -u 0 slurmd)/limits"
> >
> > In https://bugs.schedmd.com/show_bug.cgi?id=3371 I suggest the workaround:
> >
> > echo ulimit -l unlimited -s unlimited -n 51200 >> /etc/sysconfig/slurm
> >
> > in order to duplicate at boot time the limits set in
> > /usr/lib/systemd/system/slurmd.service.
>
> Are you using both the init script and the systemd service file on the
> compute nodes? You should only use one or the other; although I'm unsure of
> what exactly happens if you have both in place.

Both, I think. On RHEL/CentOS 7 I assumed that systemd was used, but I just discovered that the slurm RPM installs /etc/init.d/slurm which seems to take precedence at boot time. The /etc/init.d/slurm as installed will start by default at boot time:

  # chkconfig --list slurm
  slurm           0:off   1:off   2:on    3:on    4:on    5:on    6:off

We need to figure out the best practices for Systemd based systems: Should we use /etc/init.d/slurm or systemctl?? My preference would be to use systemctl for consistency with EL7. The /etc/init.d/slurm should then be reconfigured on EL7 systems to not start by default, or the service shouldn't be added in the first place! What's your opinion?

I note that the slurm.spec file adds the service thus:

  %post
  if [ -x /sbin/ldconfig ]; then
      /sbin/ldconfig %{_libdir}
      if [ $1 = 1 ]; then
          [ -x /sbin/chkconfig ] && /sbin/chkconfig --add slurm
      fi
  fi

This could perhaps be omitted on EL7 systems?
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #11)
> (In reply to Tim Wickberg from comment #10)
> > (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #9)
> > > Added to the solution: The slurmd daemon is started at boot time by
> > > /etc/init.d/slurm with only the system default limits. Verify this by:
> > > cat "/proc/$(pgrep -u 0 slurmd)/limits"
> > >
> > > In https://bugs.schedmd.com/show_bug.cgi?id=3371 I suggest the workaround:
> > >
> > > echo ulimit -l unlimited -s unlimited -n 51200 >> /etc/sysconfig/slurm
> > >
> > > in order to duplicate at boot time the limits set in
> > > /usr/lib/systemd/system/slurmd.service.
> >
> > Are you using both the init script and the systemd service file on the
> > compute nodes? You should only use one or the other; although I'm unsure of
> > what exactly happens if you have both in place.
>
> Both, I think. On RHEL/CentOS 7 I assumed that systemd was used, but I just
> discovered that the slurm RPM installs /etc/init.d/slurm which seems to take
> precedence at boot time. The /etc/init.d/slurm as installed will start by
> default at boot time:
> chkconfig --list slurm
> slurm 0:off 1:off 2:on 3:on 4:on 5:on 6:off
>
> We need to figure out the best practices for Systemd based systems: Should
> we use /etc/init.d/slurm or systemctl?? My preference would be to use
> systemctl for consistency with EL7. The /etc/init.d/slurm should then be
> reconfigured on EL7 systems to not start by default, or the service
> shouldn't be added in the first place! What's your opinion?

Use systemctl and the service file. Ignore the init scripts going forward, and I'd suggest removing them from the compute nodes if you can to prevent this.

There does seem to be an issue with the slurm.spec file where both are installed, leading to this slightly-confusing behavior. I'm going to see what we can do to mitigate that.

> I note that the slurm.spec file adds the service thus:
>
> %post
> if [ -x /sbin/ldconfig ]; then
>     /sbin/ldconfig %{_libdir}
>     if [ $1 = 1 ]; then
>         [ -x /sbin/chkconfig ] && /sbin/chkconfig --add slurm
>     fi
> fi
>
> This could perhaps be omitted on EL7 systems?

I'm looking through it, and this definitely seems to be an oversight. Using both service files and the systemd init compatibility mechanisms is not intentional.
(In reply to Tim Wickberg from comment #12)
> > We need to figure out the best practices for Systemd based systems: Should
> > we use /etc/init.d/slurm or systemctl?? My preference would be to use
> > systemctl for consistency with EL7. The /etc/init.d/slurm should then be
> > reconfigured on EL7 systems to not start by default, or the service
> > shouldn't be added in the first place! What's your opinion?
>
> Use systemctl and the service file. Ignore the init scripts going forward,
> and I'd suggest removing them from the compute nodes if you can to prevent
> this.

I agree that this would be the most consistent approach. On EL7 systems one might make the change thus:

  chkconfig slurm off
  systemctl enable slurmd

I have just verified this on a compute node: the slurmd service is running correctly after a reboot. I wouldn't remove the file /etc/init.d/slurm, because it has been installed by the slurm-16.05.6-1.el7.centos.x86_64 RPM.

> There does seem to be an issue with the slurm.spec file where both are
> installed, leading to this slightly-confusing behavior. I'm going to see
> what we can do to mitigate that.
>
> > I note that the slurm.spec file adds the service thus:
> >
> > %post
> > if [ -x /sbin/ldconfig ]; then
> >     /sbin/ldconfig %{_libdir}
> >     if [ $1 = 1 ]; then
> >         [ -x /sbin/chkconfig ] && /sbin/chkconfig --add slurm
> >     fi
> > fi
> >
> > This could perhaps be omitted on EL7 systems?
>
> I'm looking through it, and this definitely seems to be an oversight. Using
> both service files and the systemd init compatibility mechanisms is not
> intentional.

I agree that it's better to omit the slurm service on EL7 systems and rely on systemd instead. This does call for documentation, however! I don't believe the Slurm web pages have OS-specific instructions for EL7, for example. That's why I've been documenting my work with Slurm on CentOS 7 in a Wiki page (which has grown too long):
https://wiki.fysik.dtu.dk/niflheim/SLURM
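For other EL7 sites the complete switch-over on a compute node thus amounts to the following; the last line just repeats the verification from earlier in this ticket:

  chkconfig slurm off
  systemctl enable slurmd
  systemctl restart slurmd
  cat /proc/$(pgrep -u 0 slurmd)/limits | grep locked   # should show "unlimited"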
I think we may close this case as a duplicate of https://bugs.schedmd.com/show_bug.cgi?id=3371
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #14)
> I think we may close this case as a duplicate of
> https://bugs.schedmd.com/show_bug.cgi?id=3371

Yes, it appears so. Marking as a duplicate.

*** This ticket has been marked as a duplicate of ticket 3371 ***
I'm out of the office until August 13.

Best regards,
Ole Holm Nielsen