Ticket 13340 - Issues with MATLAB jobs in SLURM and inconsistent ulimit values
Summary: Issues with MATLAB jobs in SLURM and inconsistent ulimit values
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 20.11.7
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Director of Support
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-02-04 13:56 MST by Misha Ahmadian
Modified: 2022-02-24 12:16 MST

See Also:
Site: TTU
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm.conf (6.82 KB, text/plain)
2022-02-09 12:10 MST, Misha Ahmadian
Slurm Interactive Script file (8.12 KB, application/x-shellscript)
2022-02-24 07:46 MST, Misha Ahmadian

Description Misha Ahmadian 2022-02-04 13:56:45 MST
Hello,

For quite a while we have been experiencing an issue when running parallel MATLAB jobs with SLURM, in both interactive and batch modes, on our cluster, and we have been working closely with the MathWorks folks (included in this ticket) to find the source of the problem:

[Failed Parallel MATLAB jobs]

1) Let's say each node in one of our cluster partitions (Nocona) has 128 CPU cores.
2) A normal user would request 1 node and 128 tasks per node with 1 CPU per task in Slurm.
3) Then user tries to open up a ParPool of 128 processes in MATLAB
4) MATLAB starts spawning the processes gradually, but it fails when it reaches roughly 100 processes!
5) This result is identical in batch and interactive modes with Slurm.
In other words, MATLAB cannot spawn all the 128 ParPool processes inside a Slurm job with 128 tasks.

[Successful Parallel MATLAB runs]

The user found another approach that runs the MATLAB ParPool successfully:
1) The user uses the salloc command to grab 1 node with 128 tasks and 1 CPU per task.
2) They SSH to the allocated node.
3) They run MATLAB and try to open up a ParPool of 128 processes.
4) This time MATLAB spawns all the processes successfully and works fine.

So, initially we thought there was something wrong with the "cgroups". We investigated the cgroups per job by looking closely at the CPU core and memory limits per job and couldn't find anything wrong there. This is still a gray area that we're trying to figure out. As you know, an interactive session combines (salloc + srun), which applies all cgroup limits to the interactive session. But (salloc + ssh) doesn't come with those limits.
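To make the two paths directly comparable, the same probe can be run through srun (under Slurm's control) and through ssh. A minimal sketch; "probe.sh" is a hypothetical name, and the job ID and node name are the ones from the session below:

```shell
#!/bin/sh
# probe.sh - print the two limits that turn out to differ between the
# salloc+ssh and salloc+srun paths (max processes and open files).
grep -E 'Max (processes|open files)' /proc/self/limits
```

Running it both ways under the same allocation, e.g. `srun --jobid=4565948 sh probe.sh` versus `ssh cpu-23-22 sh probe.sh`, and diffing the output isolates exactly what the two entry paths change.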


Then, we looked into something else which was interesting:


[ Salloc + SSH ]:

1) We use salloc to grab a whole node to make sure no one else would be there:

$ salloc -p nocona -N1 -n128
salloc: Granted job allocation 4565948
salloc: Waiting for resource configuration
salloc: Nodes cpu-23-22 are ready for job

2) Then we SSHed into the node and ran ulimit for "open files" and "max user processes". (Please note that only these two numbers change in all the cases I describe below.)

$ ssh cpu-23-22
cpu-23-22:$ ulimit -n -u
open files           (-n) 8192
max user processes       (-u) 2061717

This output is correct because that's exactly how we did the ulimit setup for each worker node:

cpu-23-22:$ cat /etc/security/limits.conf | tail -n 6
* soft memlock unlimited
* hard memlock unlimited
* hard nofile 8192
* soft nofile 8192
* soft core 2097152
* hard core 4194304

3) Then we run MATLAB without leaving the node:

cpu-23-22:$ matlab
>>

4) and try to check the ulimit from MATLAB:
>> !ulimit -n -u
open files           (-n) 8192
max user processes       (-u) 2061717

As you see above, the output is consistent with the ulimit command outside MATLAB. 
Okay, everything looks normal when we SSH to a node and run MATLAB outside SLURM.


[ SLURM Interactive session: salloc + srun --pty ]:

1) We use "interactive" script, which uses salloc + srun under the hood to grab a whole node with 128 cores:

$ interactive -c 128 -p nocona
[CPUs=128 NNodes=1 Name=INTERACTIVE Account=default Reservation=nocona_test Partition=nocona X11=NO]

salloc: Granted job allocation 4565965
salloc: Waiting for resource configuration
salloc: Nodes cpu-23-22 are ready for job
cpu-23-22:$

2) Then, we run ulimit for "open files" and "max user processes" inside the SLURM interactive session:

cpu-23-22:$ ulimit -n -u
open files           (-n) 8192
max user processes       (-u) 1029644

Ah, the "max user processes" number is lower than what we've defined in /etc/security/limits.conf; it should be "2061717" instead of "1029644".

3) Okay, then we go ahead and open the MATLAB inside the SLURM interactive session:

cpu-23-22:$ matlab
>>

4) and try to check the ulimit from MATLAB:
>> !ulimit -n -u
open files           (-n) 131072
max user processes       (-u) 1029644

Alright, this time, MATLAB shows different numbers for both "open files" and "max user processes". Please note that this only happens inside SLURM sessions!

***Now, I have a guess about the "open files" number, which is "131072" vs "8192". I think this number comes from the "LimitNOFILE" setting in slurmd.service:

cpu-23-22:$ cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity

[Install]
WantedBy=multi-user.target
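As a quick check of this guess, the limits systemd is configured to hand the unit can be compared with what the running daemon actually has. A small sketch, assuming the slurmd unit shown above, with fallbacks so it degrades gracefully on a host where slurmd is not running:

```shell
#!/bin/sh
# What systemd is configured to give the service:
systemctl show slurmd -p LimitNOFILE -p LimitNPROC 2>/dev/null \
    || echo "systemctl not available here"
# What the running daemon actually has:
pid=$(pidof slurmd 2>/dev/null) \
    && grep -E 'Max (processes|open files)' "/proc/$pid/limits" \
    || echo "slurmd is not running on this host"
```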

However, still, two main questions remain unanswered to us at this point:

 1- Why is the "max user processes" inside SLURM sessions inconsistent with what we've defined in "/etc/security/limits.conf"? Where does this 1029644 limit come from inside SLURM sessions?

 2- Why is the "open files" limit in SLURM consistent with what we defined in "/etc/security/limits.conf" outside MATLAB, but changes to what is specified in "slurmd.service" once we run MATLAB? Why does the interactive session (salloc + srun) not inherit the "LimitNOFILE" from slurmd.service while the MATLAB process does?

These are the mysterious things we have found so far. We think that if we could keep the "max user processes" consistent with what we've defined in "/etc/security/limits.conf", that would resolve the issue with MATLAB. However, the MathWorks folks are still investigating why MATLAB needs such a large "max user processes" value to spawn all the processes successfully. The current problem with MATLAB parallel jobs is that each process opens thousands of dynamic libraries (.jar and .so files), which results in a huge number of open files on each node. We're still not sure whether the problem is the "max user processes" limit, a limitation on the cgroup side, or some limitation inside SLURM that we're not aware of.
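Given the thousands of .jar/.so files each worker opens, one cheap data point is a per-process file-descriptor count, to see how close the workers get to the nofile limit. A rough sketch; the process name "MATLAB" is an assumption, and it falls back to the current shell so the probe runs anywhere:

```shell
#!/bin/sh
# Count open file descriptors per MATLAB process for the current user;
# compare the counts against `ulimit -n`.
pids=$(pgrep -u "$(id -un)" MATLAB 2>/dev/null || echo $$)
for pid in $pids; do
    printf '%s: %s open fds\n' "$pid" "$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"
done
```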

Sorry for such a long explanation. I was trying to clarify what we discovered so far and what we cannot understand and may need your assistance.

Best Regards,
Misha
Comment 1 Misha Ahmadian 2022-02-04 14:01:12 MST
It looks like I'm not allowed to add one of the Mathworks specialists into this ticket to follow up since he is not on the email list.
Comment 2 Jason Booth 2022-02-04 14:11:13 MST
>It looks like I'm not allowed to add one of the Mathworks specialists into this ticket to follow up since he is not on the email list.

They will need to create an account before they can be added to the CC field.

https://bugs.schedmd.com/createaccount.cgi
Comment 3 Misha Ahmadian 2022-02-04 14:51:38 MST
(In reply to Jason Booth from comment #2)
> >It looks like I'm not allowed to add one of the Mathworks specialists into this ticket to follow up since he is not on the email list.
> 
> They will need to create an account before they can be added to the CC field.
> 
> https://bugs.schedmd.com/createaccount.cgi

Thank you, Jason.

I did let Damian from MathWorks know and just added him to the ticket after he joined the email list.

Best,
Misha
Comment 4 Michael Hinton 2022-02-04 16:01:15 MST
Misha,

Could you attach your most recent slurm.conf?

(In reply to Misha Ahmadian from comment #0)
> It's been a while that we have been experiencing a special issue
How long, exactly? Have you experienced this issue since moving to 20.11?

> 4) MATLAB starts by spawning the processes gradually, but it fails when it
> reaches around ~100 processes!
> 5) This result is identical in batch and interactive modes with Slurm.
> In other words, MATLAB cannot spawn all the 128 ParPool processes inside a
> Slurm job with 128 tasks.
Can you supply matlab and slurm error logs for these failures?

> These are the mysterious things we found so far, and we think if we could
> keep the "max user processes" consistent with what we've defined in
> "/etc/security/limits.conf" then that will resolve the issue with MATLAB.
> However, Mathworks folks are still investigating why MATLAB may need such a
> large "max user processes" to spawn all the processes successfully. The
> current problem with MATLAB parallel jobs is that each process opens 1000s
> of dynamic libraries (.jar and .so files) which ends up with a huge number
> of open files on each node. We're still not sure if the problem is the "max
> user processes", any limitation on the cgroup side, or any possible
> limitation inside SLURM that we're not aware of.
How confident are you that the max user processes limit is being hit? Are you able to get diagnostic information on the 99th ParPool process before it crashes (perhaps by inserting a sleep and then checking the current limits of the parent process)?
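One way to act on this suggestion, while the pool is spawning, is to snapshot the limits and thread count of the MATLAB client (or any worker) from a second terminal. A hedged sketch; pass the PID to inspect (it defaults to the current shell purely for illustration):

```shell
#!/bin/sh
# snapshot_limits.sh - record the effective limits and thread count of
# one process, e.g. while parpool is partway through spawning workers.
pid=${1:-$$}
date
grep -E 'Max (processes|open files)' "/proc/$pid/limits"
ps -o pid,ppid,nlwp,comm -p "$pid"
```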

Let me look into your other questions and get back to you.

Thanks!
-Michael
Comment 5 Misha Ahmadian 2022-02-09 12:09:30 MST
(In reply to Michael Hinton from comment #4)

Hi Michael,

Sorry for the late response.

> Could you attach you most recent slurm.conf?
Sure, please find the attached slurm.conf file.

> How long, exactly? Have you experienced this issue since moving to 20.11?
I'm not sure exactly. It's been a few months (~6 months or more) since we deployed the MATLAB Parallel Server, and the MathWorks folks have been working with us closely to customize the MATLAB scripts and get the installation done correctly. I don't think this is related to the Slurm version (but I could be wrong).

> Can you supply matlab and slurm error logs for these failures?
So, collecting logs with all the details from MATLAB has been an issue for us (TTU and MathWorks), since MATLAB does not emit all the details we would like to see. However, below is what I get from MATLAB inside an interactive session:

$ interactive -p nocona -c 128
[CPUs=128 NNodes=1 Name=INTERACTIVE Account=default Reservation=nocona_test Partition=nocona X11=NO]

salloc: Granted job allocation 4588750
salloc: Waiting for resource configuration
salloc: Nodes cpu-23-22 are ready for job
cpu-23-22:$ cd /path/to/matlab/R2021b/bin/
cpu-23-22:$ ./matlab
MATLAB is selecting SOFTWARE OPENGL rendering.

                   < M A T L A B (R) >
           Copyright 1984-2021 The MathWorks, Inc.
          R2021b (9.11.0.1769968) 64-bit (glnxa64)
                    September 17, 2021


To get started, type doc.
For product information, visit www.mathworks.com.

>> c = parcluster('local');
>> c.NumWorkers = 128;
>> p = c.parpool(128);
Starting parallel pool (parpool) using the 'local' profile ...

Error using parallel.Cluster/parpool (line 88)
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'local' in the Cluster Profile Manager.

Caused by:
    Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305)
    Failed to initialize the interactive session.
        Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 426)
        The interactive communicating job failed with no message.

>>


> How confident are you that the max user processes limit is being hit? Are
> you able to get diagnostic information on the 99th ParPool process before it
> crashes (perhaps by inserting a sleep and then checking the current limits
> of the parent process)?

So, we're not entirely confident at this point. The difference in the max user process limit inside and outside Slurm is the most obvious thing we have found so far. Damian is working with the developers to see how he can collect further logs to help us get more info. However, if you have a better way of collecting the parent process limits in Slurm, please let me know the steps and I'll work on it.


> 
> Let me look into your other questions and get back to you.

Thank you very much.

Best Regards,
misha
Comment 6 Misha Ahmadian 2022-02-09 12:10:47 MST
Created attachment 23387 [details]
slurm.conf
Comment 9 Misha Ahmadian 2022-02-15 10:51:29 MST
Hi Michael,

I just wanted to follow up with you and see any update on this case. Please let me know if you need more info from our side.

Best Regards,
Misha
Comment 10 Michael Hinton 2022-02-15 12:11:15 MST
(In reply to Misha Ahmadian from comment #9)
> I just wanted to follow up with you and see any update on this case. Please
> let me know if you need more info from our side.
Based on the data you provided so far, I don't see any holes in your analysis. The only run-time adjustment I see Slurm make to slurmd ulimits is in slurm_rlimits_info.c --> rlimits_adjust_nofile(), and that only affects open files. So I think we need more data on node cpu-23-22.

Are all nodes showing this behavior, or just node cpu-23-22? What Linux distro is it running?

Is systemd setting any global limits that we are not aware of (in /etc/systemd/system.conf or anywhere else)? Could you attach the entire /etc/security/limits.conf?

If you run slurmd outside of systemd, do you get the same behavior? (kill the current slurmd service on that node and just run `sudo /usr/sbin/slurmd -D` in a terminal).

What are the limits of slurmd under systemd vs. not under systemd, when no job is running? Attach both the `ulimit -a` output as well as `cat /proc/$(pidof slurmd)/limits`.

Can you do the same thing above, but when you have an interactive job running? Can you also do it for the stepd (`cat /proc/$(pidof slurmstepd)/limits`)?

Can you do the same thing for the interactive matlab under the interactive job?

To summarize: Get `ulimit -a` and `cat /proc/<pid-of-prog>/limits` for the following processes:

Under systemd
---------------------
* latent slurmd (no interactive job)
* slurmd w/ interactive job
* slurmstepd w/ interactive job
* matlab, called in interactive job

Not under systemd
---------------------
* latent slurmd (no interactive job)
* slurmd w/ interactive job
* slurmstepd w/ interactive job
* matlab, called in interactive job
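The checklist above can be gathered in one pass with a small collection script. A sketch assuming the default process names (slurmd, slurmstepd, MATLAB), with a fallback line wherever a process is not running:

```shell
#!/bin/sh
# collect_limits.sh - dump /proc/<pid>/limits for each process of
# interest, plus the invoking shell's own ulimits.
for name in slurmd slurmstepd MATLAB; do
    echo "== $name =="
    pids=$(pidof "$name" 2>/dev/null)
    if [ -n "$pids" ]; then
        for pid in $pids; do cat "/proc/$pid/limits"; done
    else
        echo "(not running)"
    fi
done
echo "== current shell =="
ulimit -a
```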

I still don't know the answers to your questions in comment 0, but I think that this should help us better see when the limits change, and perhaps why.

-Michael

P.S. Starting in 20.11, the preferred method for interactive jobs is to set `use_interactive_step` in LaunchParameters in slurm.conf, and then simply use `salloc` to start up the interactive job. `salloc` with no arguments will internally call `srun --interactive --preserve-env --pty $SHELL`, and this can be modified with InteractiveStepOptions.
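For reference, a minimal slurm.conf sketch of that setup (the InteractiveStepOptions line just restates the default described above):

```
# slurm.conf (20.11+)
LaunchParameters=use_interactive_step
# Optional; this is the built-in default:
#InteractiveStepOptions="--interactive --preserve-env --pty $SHELL"
```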
Comment 11 Michael Hinton 2022-02-15 12:13:36 MST
Note under systemd
-->
*Not under systemd
Comment 12 Michael Hinton 2022-02-15 12:30:03 MST
I also notice that you have this configured:

    PropagateResourceLimitsExcept=CORE,MEMLOCK

That means that all the soft limits of the login shell except for core size and locked memory are being propagated when you start a Slurm job. (see https://slurm.schedmd.com/slurm.conf.html#OPT_PropagateResourceLimits)

Could you attach the `ulimit -a` and `cat /proc/self/limits` and `cat /etc/security/limits.conf` of the login/submission machine before you do an interactive job?

Perhaps the solution is to avoid propagating an errant max process limit from the submission node.
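If the errant limit does turn out to be propagated from the submission host, one possible mitigation (a sketch only; verify against your site's policies) is to add NPROC and NOFILE to the exception list, so jobs keep the compute node's own limits for those:

```
# slurm.conf - current setting:
#PropagateResourceLimitsExcept=CORE,MEMLOCK
# Possible alternative, so nproc/nofile are NOT propagated from the
# login node:
PropagateResourceLimitsExcept=CORE,MEMLOCK,NPROC,NOFILE
```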
Comment 13 Michael Hinton 2022-02-15 12:39:05 MST
(In reply to Misha Ahmadian from comment #5)
> So, we're not entirely confident at this point. The differences in max user
> process limit inside and outside the Slurm are the most obvious thing we
> have found so far.
Could you reduce Matlab's max processes in the SSH session to match what it is in Slurm and see if you can force the failure? If not, perhaps the max processes is a red herring.
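That experiment can be sketched as follows in the (working) salloc + ssh session, using the value observed inside Slurm sessions; the subshell keeps the lowered limit local:

```shell
#!/bin/sh
# Lower the soft nproc limit to what Slurm sessions show, then start
# MATLAB; if the pool now fails at ~100 workers, nproc is implicated.
(
    ulimit -S -u 1029644 2>/dev/null || echo "hard limit too low here"
    ulimit -u            # confirm the soft limit now in effect
    # matlab -nodisplay  # then: c.parpool(128)
)
```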
Comment 14 Michael Hinton 2022-02-15 12:42:53 MST
Also, can you explain how Matlab integrates with Slurm? Is it using MPI under the hood? If so, what MPI and what version?
Comment 15 Damian Pietrus 2022-02-15 18:09:00 MST
Apologies for my delay in joining this thread.

(In reply to Michael Hinton from comment #4)

> How confident are you that the max user processes limit is being hit? Are
> you able to get diagnostic information on the 99th ParPool process before it
> crashes (perhaps by inserting a sleep and then checking the current limits
> of the parent process)?

Typically, I'd be very confident that the max user process limit is being hit. When MATLAB fails to start a local pool of workers, a java crash log is often generated with the message "java.lang.OutOfMemoryError: unable to create new native thread". With MATLAB, this almost always means the user process limit is too low. However, the limits are set to such a high value that we shouldn't be running into this issue. This discrepancy is something I'm trying to look into further from my side.
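Worth noting for anyone reproducing this: RLIMIT_NPROC counts threads, not just processes, so a Java-heavy MATLAB session can exhaust it well before the process count looks alarming. A quick, generic (not MATLAB-specific) way to compare the user's thread count against the soft limit:

```shell
#!/bin/sh
# Total threads owned by the current user vs. the soft nproc limit.
printf 'threads for %s: %s\n' "$(id -un)" \
    "$(ps -L -u "$(id -un)" --no-headers 2>/dev/null | wc -l)"
printf 'soft nproc limit: %s\n' "$(ulimit -u)"
```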

(In reply to Michael Hinton from comment #14)
> Also, can you explain how Matlab integrates with Slurm? Is it using MPI
> under the hood? If so, what MPI and what version?

There are two primary methods by which we can run MATLAB on the cluster. The first (which we are testing here) is to request resources from the scheduler, manually open the client, and run your code. With this method, MATLAB uses its "local" profile, and starting a parallel pool uses only the resources we were assigned on that one particular node.

In comparison, using a Slurm cluster profile will submit a secondary job to Slurm that can span multiple nodes using the MATLAB Parallel Server product. I believe this uses MPICH3, though the mechanism for starting these workers is slightly different from that for starting local workers.

Please let me know if there is any specific information I can provide from my side.
Comment 16 Misha Ahmadian 2022-02-21 13:20:39 MST
Hi Michael,

Please find the answers below (BTW, we experience this on every worker node):

===================================================================
(Under systemd)
===================================================================
1) latent Slurmd info:
-----------------------

# ssh cpu-25-8

[root@cpu-25-8 ~]# cat  /usr/lib/systemd/system/slurmd.service

[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity

[Install]
WantedBy=multi-user.target

[root@cpu-25-8 ~]# grep -v '#' /etc/security/limits.conf

* soft memlock unlimited
* hard memlock unlimited
* hard nofile 8192
* soft nofile 8192
* soft core 2097152
* hard core 4194304

[root@cpu-25-8 ~]# cat /proc/$(pidof slurmd)/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             2061717              2061717              processes
Max open files            131072               131072               files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2061717              2061717              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us


[root@cpu-25-8 ~]# su - user1
[user1@cpu-25-8 ~]$ ulimit -a
core file size          (blocks, -c) 2097152
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

--------------------
2) Interactive Job:
--------------------
$ interactive -p nocona -c 128 -r nocona_test -w cpu-25-8

cpu-25-8:$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 527826944
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1029626
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


cpu-25-8:$ cat /proc/self/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          540494790656         540494790656         bytes
Max processes             1029626              2061717              processes
Max open files            8192                 131072               files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2061717              2061717              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

cpu-25-8:$ matlab -nodisplay

>> !ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 527826944
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1029626
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

>> !cat /proc/self/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          540494790656         540494790656         bytes
Max processes             1029626              2061717              processes
Max open files            131072               131072               files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2061717              2061717              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us


===================================================================
(Not under systemd)
===================================================================
1) latent Slurmd info:
-----------------------

# ssh cpu-25-8

[root@cpu-25-8 ~]# systemctl stop slurmd

[root@cpu-25-8 ~]# slurmd -D --conf-server 10.100.21.250:6817 &

[root@cpu-25-8 ~]# cat /proc/$(pidof slurmd)/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        4294967296           4294967296           bytes
Max resident set          unlimited            unlimited            bytes
Max processes             2061717              2061717              processes
Max open files            8192                 8192                 files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2061717              2061717              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

[root@cpu-25-8 ~]# su - user1
Last login: Mon Feb 21 12:36:33 CST 2022 on pts/0
[user1@cpu-25-8 ~]$ ulimit -a
core file size          (blocks, -c) 2097152
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


--------------------
2) Interactive Job:
--------------------
$ interactive -p nocona -c 128 -r nocona_test -w cpu-25-8

cpu-25-8:$ ulimit -a

core file size          (blocks, -c) 4194304
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 527826944
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1029626
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

cpu-25-8:$ cat /proc/self/limits

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        4294967296           4294967296           bytes
Max resident set          540494790656         540494790656         bytes
Max processes             1029626              2061717              processes
Max open files            8192                 8192                 files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2061717              2061717              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us


cpu-25-8:$ matlab -nodisplay

>> !ulimit -a
core file size          (blocks, -c) 4194304
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 527826944
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1029626
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

>> !cat /proc/self/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        4294967296           4294967296           bytes
Max resident set          540494790656         540494790656         bytes
Max processes             1029626              2061717              processes
Max open files            8192                 8192                 files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       2061717              2061717              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us


>> c = parcluster('local');
>> c.NumWorkers = 128;
>> p = c.parpool(128);
Starting parallel pool (parpool) using the 'local' profile ...

Error using parallel.Cluster/parpool (line 88)
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'local' in the Cluster Profile Manager.

Caused by:
    Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305)
    Failed to initialize the interactive session.
        Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 426)
        The interactive communicating job failed with no message.



===================================================================

As you can see above, there is no difference between running slurmd under systemd or as a standalone process: in both cases the ulimits are the same, and MATLAB crashes in a similar way. However, the ulimits inside and outside a Slurm session on the same node are not the same.

Please let me know if you need more info from me or Damian. 

Best Regards,
Misha
Comment 17 Michael Hinton 2022-02-21 14:04:37 MST
Misha,

Have you tried temporarily increasing the limit inside an interactive Slurm job before running Matlab to verify that it can work if only that limit is increased properly?

-Michael
Comment 18 Michael Hinton 2022-02-21 16:14:07 MST
You were apparently printing out the user limits on the compute node itself (cpu-25-8). But I believe that limits are propagated *from the submission node*. What is the submission node (where you call `$ interactive ...`), and what are the user limits before submission?

Could we also do a sanity check to see if turning off limit propagation in slurm.conf for NPROC solves the issue?

Change:

    PropagateResourceLimitsExcept=CORE,MEMLOCK

to:

    PropagateResourceLimitsExcept=CORE,MEMLOCK,NPROC

just to be sure that we aren't somehow propagating a user limit.
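The propagation model can also be sanity-checked locally: rlimits are inherited across fork/exec, which is how srun carries the submitter's values into the task. A minimal illustration that needs no Slurm at all (the 1024 value is arbitrary):

```shell
# A child process inherits the parent's rlimits, just as a job step
# inherits the submitting shell's. The subshell lowers its soft nofile
# limit; the child bash then reports the inherited value.
( ulimit -S -n 1024; bash -c 'ulimit -Sn' )   # prints 1024
```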

Thanks!
-Michael
Comment 19 Misha Ahmadian 2022-02-23 11:11:31 MST
Hi Michael,

> You were apparently printing out the user limits on the compute node itself
> (cpu-25-8). But I believe that limits are propagated *from the submission
> node*. What is the submission node (where you call `$ interactive ...`), and
> what are the user limits before submission?

Hmm, that's a very interesting point. I totally forgot about limit propagation by srun in this case. OK, now I'm having a hard time understanding something; maybe you can help:

I call the "interactive" command from a login node. Below are the ulimits and limit.conf files on the login node:

login-20-26:$ ulimit -a
core file size          (blocks, -c) 2097152
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1029626
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1029626
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

login-20-26:$ grep -v '#' /etc/security/limits.conf

* soft memlock unlimited
* hard memlock unlimited
* hard nofile 8192
* soft nofile 8192

So, I'm wondering why the "max user processes" limit on the login nodes is different from the one on the worker nodes! We never specified an "nproc" limit in the limits.conf files on either the login or the worker nodes, yet the "max user processes" values are not the same. I checked the /etc/security/limits.d directory on both the login and worker nodes, but they're empty. I also didn't find anything in /etc/sysctl.conf or /etc/sysctl.d/* that would set such a limit. What on earth might have changed the "max user processes" on the worker nodes, then? Is there any other location I'm missing? (I checked ~/.bashrc as well.)

> 
> Could we also do a sanity check to see if turning off limit propagation in
> slurm.conf for NPROC solves the issue?
> 
> Change:
> 
>     PropagateResourceLimitsExcept=CORE,MEMLOCK
> 
> to:
> 
>     PropagateResourceLimitsExcept=CORE,MEMLOCK,NPROC
> 
> just to be sure that we aren't somehow propagating a user limit.


So, would this change affect running jobs? Would it require a full Slurm restart, or will "scontrol reconfigure" take care of it?

Best Regards,
Misha
Comment 20 Michael Hinton 2022-02-23 12:24:50 MST
(In reply to Misha Ahmadian from comment #19)
> hmm, that's a very interesting point. I totally forgot the limit propagation
> by srun in this case. Ok, now I'm having a hard time understanding something
> maybe you can help:
> 
> I call the "interactive" command from a login node. Below are the ulimits
> and limit.conf files on the login node:
> 
> login-20-26:$ ulimit -a
> core file size          (blocks, -c) 2097152
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 1029626
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 8192
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 1029626
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
Ok, great! That's where the 1029626 limit is coming from.

> login-20-26:$ grep -v '#' /etc/security/limits.conf
> 
> * soft memlock unlimited
> * hard memlock unlimited
> * hard nofile 8192
> * soft nofile 8192
> 
> So, I'm wondering why the "max user processes" limit on the login nodes is
> different than the worker nodes! we never specified a "nproc" limit on the
> limit.conf files on both login and worker nodes but thes "max user
> processes" is not the same. I checked /etc/security/limit.d directory on
> both login and worker nodes but they're empty. I also didn't find anything
> in the /etc/sysctl.conf or /etc/sysctl.d/* to set such a limit. What on
> earth might've changed the "max user processes" on the worker nodes then? Is
> there any other location I'm missing? (I check the ~/.bashrc as well).
Are the login nodes and worker nodes running the same distro at the same version? Perhaps there is a different default for max user processes depending on the distro. Also, is there a login .bashrc or something similar? Perhaps something is changing the ulimit at runtime for the login shell.

This answer indicates that there are multiple sources of setting limits other than systemd, including PAM. See https://serverfault.com/a/485277.

Even though you don't have any systemd limit configs, I believe systemd still sets a default limit. Do `cat /proc/1/limits` to see the defaults of the init process, and perhaps that will account for the difference.
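As a quick way to compare these values across environments (login node, compute node, inside a job step, or even from MATLAB via `!python ...`), the check can be scripted. A small sketch using Python's standard `resource` module — purely illustrative, reporting the same number that `cat /proc/self/limits` shows for "Max processes":

```python
# Report the current process's RLIMIT_NPROC, the value shown as
# "Max processes" in /proc/self/limits.
import resource

def nproc_limit():
    """Return (soft, hard) RLIMIT_NPROC for this process.

    resource.RLIM_INFINITY (-1) means 'unlimited'.
    """
    return resource.getrlimit(resource.RLIMIT_NPROC)

if __name__ == "__main__":
    soft, hard = nproc_limit()
    print(f"Max processes: soft={soft} hard={hard}")
```

Running it in each environment and diffing the output pinpoints exactly where the limit changes.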

At any rate, disabling NPROC limit propagation in Slurm should solve the issue.

> So, Would this change affect the running jobs? Would it requires a full
> Slurm restart or "scontrol configure" will take care of that?
I think an `scontrol reconfigure` will suffice.

In my testing, though, I didn't even need to reconfigure for the changes to take effect; I just updated slurm.conf. I think this is because salloc reads slurm.conf when it starts and freshly parses the new limit-propagation settings, regardless of whether `scontrol reconfigure` was called beforehand.

Thanks,
-Michael
Comment 21 Michael Hinton 2022-02-23 12:28:35 MST
Also from https://serverfault.com/a/485277:

"At boot time, Linux sets default limits to the init (or systemd) process, which are then inherited by all the other (children) processes. To see these limits: cat /proc/1/limits."

So I guess the limits initially given to the systemd init process are determined by the Linux kernel, not systemd itself. I'm not sure if there is a corresponding Linux config parameter that sets the default limit, or if it is hard-coded and can change between kernel versions.
Comment 22 Misha Ahmadian 2022-02-23 13:31:56 MST
Hi Michael and Damian,

So, there is no difference between the OS distros on the login nodes and worker nodes; both are running CentOS 8.1.1911. And you're right: the "Max processes" limits on the init process are not equal between the login and worker nodes. I have no idea what causes this, but I can look into that later, since it's not a big deal now:

[root@cpu-25-8 ~]# cat /proc/1/limits | grep "Max processes"
Max processes             2061717              2061717              processes

[root@login-20-25 ~]# cat /proc/1/limits | grep "Max processes"
Max processes             1029644              1029644              processes

However, adding NPROC to the PropagateResourceLimitsExcept worked perfectly!

Now, this is what happened after I made the changes in the Slurm configuration:

==================================
(Ulimits on a node out of Slurm)
==================================
[root@cpu-25-8 ~]# ulimit -a
core file size          (blocks, -c) 2097152
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

=============================================
(Ulimits inside a Slurm interactive session)
=============================================
login-20-26:$ interactive -p nocona -r nocona_test

cpu-25-8:$ ulimit
unlimited
cpu-25-8:$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 4123648
open files                      (-n) 8192
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

==========================================================
(Ulimits of MATLAB process inside an interactive session)
==========================================================
cpu-25-8:$ matlab -nodisplay

>> !ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2061717
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) 527826944
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2061717
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

As you can see, there is a difference between the "max memory size" of the interactive session and that of the MATLAB process. However, that should be OK, because MATLAB is getting the limits we want to see!

So, everything looks good for MATLAB! Right? Then I went ahead and started a Parallel pool in MATLAB:


>> c = parcluster('local');
>> c.NumWorkers = 128;
>> p = c.parpool(128);
Starting parallel pool (parpool) using the 'local' profile ...

Error using parallel.Cluster/parpool (line 88)
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'local' in the Cluster Profile Manager.

Caused by:
    Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305)
    Failed to initialize the interactive session.
        Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 399)
        The interactive communicating job failed with no message.



It crashed again!! This time, once I called "p = c.parpool(128)", I logged in to cpu-25-8 as root and ran "ps aux | grep -i matlab | wc -l". That showed 128 MATLAB processes had been spawned successfully (it used to crash at around ~90 processes before the changes I made). However, for some other reason, MATLAB failed again. I'm not sure what else might have caused this.

Whatever it is, it's tied to Slurm, because parallel MATLAB is still fine outside of Slurm.

I think Damian can help us a bit at this point.

Best Regards,
Misha
Comment 23 Michael Hinton 2022-02-23 14:31:50 MST
Ok, great. At least it is spawning all 128 processes. Sounds like we just need to debug the Slurm-MATLAB integration some more.

Have you ever gotten a similar MATLAB job working in Slurm in the past? In other words, is this a new failure, or has this type of job never worked before?

Can you attach the `interactive` script/command code? I want to see how it is calling salloc.

It sounds like you already confirmed that cgroups for the job and step are what you expect, but it would be good to double-check.

(In reply to Michael Hinton from comment #10)
> P.S. Starting in 20.11, the preferred method for interactive jobs is to set
> `use_interactive_step` in LaunchParameters in slurm.conf, and then simply
> use `salloc` to start up the interactive job. `salloc` with no arguments
> will internally call `srun --interactive --preserve-env --pty $SHELL`, and
> this can be modified with InteractiveStepOptions.
I would also try out `LaunchParameters=use_interactive_step` and then create an interactive job with just `salloc` to see if that gives different results.

-Michael
Comment 24 Misha Ahmadian 2022-02-23 15:58:21 MST
(In reply to Michael Hinton from comment #23)

Hi Michael,

> (In reply to Michael Hinton from comment #10)
> > P.S. Starting in 20.11, the preferred method for interactive jobs is to set
> > `use_interactive_step` in LaunchParameters in slurm.conf, and then simply
> > use `salloc` to start up the interactive job. `salloc` with no arguments
> > will internally call `srun --interactive --preserve-env --pty $SHELL`, and
> > this can be modified with InteractiveStepOptions.
> I would also try out `LaunchParameters=use_interactive_step` and then create
> an interactive job with just `salloc` to see if that gives different results.

Wow! I totally missed your P.S in comment #10! That actually did the final magic for me and resolved the whole problem!! Now I'm able to run MATLAB parpool successfully:

>> c = parcluster('local');
>> c.NumWorkers = 128;
>> p = c.parpool(128);
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 128).


Thank you, Michael. Let me check with Damian and do some further tests and will let you know if we have further questions for you.

Best Regards,
Misha
Comment 25 Michael Hinton 2022-02-23 16:06:24 MST
(In reply to Misha Ahmadian from comment #24)
> (In reply to Michael Hinton from comment #23)
> 
> Hi Michael,
> 
> > (In reply to Michael Hinton from comment #10)
> > > P.S. Starting in 20.11, the preferred method for interactive jobs is to set
> > > `use_interactive_step` in LaunchParameters in slurm.conf, and then simply
> > > use `salloc` to start up the interactive job. `salloc` with no arguments
> > > will internally call `srun --interactive --preserve-env --pty $SHELL`, and
> > > this can be modified with InteractiveStepOptions.
> > I would also try out `LaunchParameters=use_interactive_step` and then create
> > an interactive job with just `salloc` to see if that gives different results.
> 
> Wow! I totally missed your P.S in comment #10! That actually did the final
> magic for me and resolved the whole problem!!
Great!

So are you using a plain `salloc`, or are you still using your `interactive` wrapper? I would still like to know what the `interactive` script does so I can better understand how use_interactive_step fixed things for you.

Thanks!
-Michael
Comment 26 Misha Ahmadian 2022-02-24 07:46:21 MST
Created attachment 23616 [details]
Slurm Interactive Script file

Hi Michael,

Sure, please find the attached "interactive" script file. I modified the last line to match the new settings. We intend to keep the interactive script, since most of our users still use this command, but there is no difference now between using "interactive" and "salloc".

I also added the following lines into the slurm.conf file to reflect the new changes:

>LaunchParameters=use_interactive_step
>InteractiveStepOptions="--interactive --preserve-env --pty /bin/bash -c 'source /etc/slurm/scripts/slurm_fix_modules.sh && /bin/bash -l -i'"

The "InteractiveStepOptions" may sound weird, but I need that trick to run the "slurm_fix_modules.sh" script before each interactive session. This script fixes the incompatibility issue with Lmod versions and contents across various partitions. It's a harmless script and doesn't mess with the current settings.

Please let me know if you need anything else from me.

Best Regards,
Misha
Comment 27 Michael Hinton 2022-02-24 12:16:38 MST
(In reply to Misha Ahmadian from comment #26)
> Please let me know if you need anything else from me.
Well, if it works, it works, so I guess we can mark this as resolved!

I'm still not 100% certain how use_interactive_step fixed the issue - I would need to understand better what MATLAB is complaining about - but my guess is that it has to do with a change in 20.11 where steps no longer overlap. This change means that in 20.11, a simple `srun --pty` would create a step with all resources, and that step would not allow other steps to run unless the new --overlap arg was specified.

From https://slurm.schedmd.com/faq.html#prompt:

"By default, use_interactive_step creates an interactive step on a node in the allocation and runs the shell in that step. An interactive step is to an interactive shell what a batch step is to a batch script - both have access to all resources in the allocation on the node they are running on, but do not "consume" them.

"Note that beginning in 20.11, steps created by srun are now exclusive. This means that the previously-recommended way to get an interactive shell, srun --pty $SHELL, will no longer work, as the shell's step will now consume all resources on the node and cause subsequent srun calls to pend."

Also, from 20.11 RELEASE_NOTES:

 -- By default, a step started with srun will [now] be granted exclusive (or non-
    overlapping) access to the resources assigned to that step. No other
    parallel step will be allowed to run on the same resources at the same
    time. This replaces one facet of the '--exclusive' option's behavior, but
    does not imply the '--exact' option described below. To get the previous
    default behavior - which allowed parallel steps to share all resources -
    use the new srun '--overlap' option.
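Putting the two quotes together, there are two ways to get a non-consuming interactive shell on 20.11+; this sketch is a summary, not a tested recipe, and the exact salloc/srun arguments are site-specific:

```
# Per-invocation: give the shell's step --overlap so later sruns
# (e.g. MATLAB worker launches) are not blocked by it
srun --overlap --preserve-env --pty $SHELL

# Cluster-wide in slurm.conf (what fixed it here): a plain `salloc`
# then creates a non-consuming interactive step automatically
LaunchParameters=use_interactive_step
```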

So it is good that you are now using use_interactive_step - without it, I imagine you would eventually have run into other non-MATLAB interactive job issues as well.

Thanks!
-Michael