| Summary: | Issues with MATLAB jobs in SLURM and inconsistent ulimit values | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Misha Ahmadian <misha.ahmadian> |
| Component: | Limits | Assignee: | Director of Support <support> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | ||
| Priority: | --- | CC: | dpietrus |
| Version: | 20.11.7 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | TTU | Slinky Site: | --- |
| Attachments: | slurm.conf; Slurm Interactive Script file | ||
Description
Misha Ahmadian
2022-02-04 13:56:45 MST
It looks like I'm not allowed to add one of the MathWorks specialists to this ticket to follow up, since he is not on the email list.

> It looks like I'm not allowed to add one of the Mathworks specialists into
> this ticket to follow up since he is not on the email list.

They will need to create an account before they can be added to the CC field:

https://bugs.schedmd.com/createaccount.cgi

(In reply to Jason Booth from comment #2)
> They will need to create an account before they can be added to the CC field.
> https://bugs.schedmd.com/createaccount.cgi

Thank you, Jason. I let Damian from MathWorks know, and added him to the ticket after he joined the email list.

Best,
Misha

Misha,

Could you attach your most recent slurm.conf?

(In reply to Misha Ahmadian from comment #0)
> It's been a while that we have been experiencing a special issue

How long, exactly? Have you experienced this issue since moving to 20.11?

> 4) MATLAB starts by spawning the processes gradually, but it fails when it
> reaches around ~100 processes!
> 5) This result is identical in batch and interactive modes with Slurm.
> In other words, MATLAB cannot spawn all the 128 ParPool processes inside a
> Slurm job with 128 tasks.

Can you supply MATLAB and Slurm error logs for these failures?

> These are the mysterious things we found so far, and we think if we could
> keep the "max user processes" consistent with what we've defined in
> /etc/security/limits.conf then that will resolve the issue with MATLAB.
> However, Mathworks folks are still investigating why MATLAB may need such a
> large "max user processes" to spawn all the processes successfully. The
> current problem with MATLAB parallel jobs is that each process opens 1000s
> of dynamic libraries (.jar and .so files) which ends up with a huge number
> of open files on each node. We're still not sure if the problem is the "max
> user processes", any limitation on the cgroup side, or any possible
> limitation inside SLURM that we're not aware of.

How confident are you that the max user processes limit is being hit? Are you able to get diagnostic information on the 99th ParPool process before it crashes (perhaps by inserting a sleep and then checking the current limits of the parent process)?

Let me look into your other questions and get back to you.

Thanks!
-Michael

(In reply to Michael Hinton from comment #4)

Hi Michael,

Sorry for the late response.

> Could you attach your most recent slurm.conf?

Sure, please find the attached slurm.conf file.

> How long, exactly? Have you experienced this issue since moving to 20.11?

I'm not sure exactly. It's been a few months (~6 months or more) since we deployed the MATLAB Parallel Server, and the MathWorks folks have been working with us closely to customize the MATLAB scripts and get the installation done correctly. I don't think this is related to the Slurm version (but I could be wrong).

> Can you supply matlab and slurm error logs for these failures?

Collecting logs with all the details from MATLAB has been an issue for us (TTU and MathWorks), since MATLAB does not yield all the details we would like to see. However, below is what I get from MATLAB inside an interactive session:

$ interactive -p nocona -c 128
[CPUs=128 NNodes=1 Name=INTERACTIVE Account=default Reservation=nocona_test Partition=nocona X11=NO]
salloc: Granted job allocation 4588750
salloc: Waiting for resource configuration
salloc: Nodes cpu-23-22 are ready for job
cpu-23-22:$ cd /path/to/matlab/R2021b/bin/
cpu-23-22:$ ./matlab
MATLAB is selecting SOFTWARE OPENGL rendering.

                          < M A T L A B (R) >
                Copyright 1984-2021 The MathWorks, Inc.
                R2021b (9.11.0.1769968) 64-bit (glnxa64)
                          September 17, 2021

To get started, type doc. For product information, visit www.mathworks.com.

>> c = parcluster('local');
>> c.NumWorkers = 128;
>> p = c.parpool(128);
Starting parallel pool (parpool) using the 'local' profile ...
Error using parallel.Cluster/parpool (line 88)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile 'local' in the Cluster Profile Manager.

Caused by:
    Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305)
    Failed to initialize the interactive session.
        Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 426)
        The interactive communicating job failed with no message.
>>

> How confident are you that the max user processes limit is being hit? Are
> you able to get diagnostic information on the 99th ParPool process before it
> crashes (perhaps by inserting a sleep and then checking the current limits
> of the parent process)?

We're not entirely confident at this point. The differences in the max user process limit inside and outside Slurm are the most obvious thing we have found so far. Damian is working with the developers to see how he can collect further logs to get us more info. However, if you have a better idea for collecting the parent process limits in Slurm, please let me know the steps and I'll work on it.

> Let me look into your other questions and get back to you.

Thank you very much.

Best Regards,
Misha

Created attachment 23387 [details]
slurm.conf
Hi Michael,

I just wanted to follow up with you and see if there is any update on this case. Please let me know if you need more info from our side.

Best Regards,
Misha

(In reply to Misha Ahmadian from comment #9)
> I just wanted to follow up with you and see if there is any update on this
> case. Please let me know if you need more info from our side.

Based on the data you provided so far, I don't see any holes in your analysis. The only run-time adjustment I see Slurm make to slurmd ulimits is in slurm_rlimits_info.c --> rlimits_adjust_nofile(), and that only affects open files. So I think we need more data on node cpu-23-22.

Are all nodes showing this behavior, or just node cpu-23-22? What Linux distro is it running? Is systemd setting any global limits that we are not aware of (in /etc/systemd/system.conf or anywhere else)? Could you attach the entire /etc/security/limits.conf?

If you run slurmd outside of systemd, do you get the same behavior? (Kill the current slurmd service on that node and just run `sudo /usr/sbin/slurmd -D` in a terminal.) What are the limits of slurmd under systemd vs. not under systemd, when no job is running? Attach both the `ulimit -a` output as well as `cat /proc/$(pidof slurmd)/limits`.

Can you do the same thing above, but when you have an interactive job running? Can you also do it for the stepd (`cat /proc/$(pidof slurmstepd)/limits`)? Can you do the same thing for the interactive matlab under the interactive job?

To summarize: get `ulimit -a` and `cat /proc/<pid-of-prog>/limits` for the following processes:

Under systemd
---------------------
* latent slurmd (no interactive job)
* slurmd w/ interactive job
* slurmstepd w/ interactive job
* matlab, called in interactive job

Not under systemd
---------------------
* latent slurmd (no interactive job)
* slurmd w/ interactive job
* slurmstepd w/ interactive job
* matlab, called in interactive job

I still don't know the answers to your questions in comment 0, but I think that this should help us better see when the limits change, and perhaps why.

-Michael

P.S. Starting in 20.11, the preferred method for interactive jobs is to set `use_interactive_step` in LaunchParameters in slurm.conf, and then simply use `salloc` to start up the interactive job. `salloc` with no arguments will internally call `srun --interactive --preserve-env --pty $SHELL`, and this can be modified with InteractiveStepOptions.

I also notice that you have this configured:
PropagateResourceLimitsExcept=CORE,MEMLOCK
That means that all the soft limits of the login shell except for core size and locked memory are being propagated when you start a Slurm job. (see https://slurm.schedmd.com/slurm.conf.html#OPT_PropagateResourceLimits)
Could you attach the `ulimit -a` and `cat /proc/self/limits` and `cat /etc/security/limits.conf` of the login/submission machine before you do an interactive job?
Perhaps the solution is to avoid propagating an errant max process limit from the submission node.
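The limits data requested above can be gathered with a small helper. This is a sketch, not part of the ticket: `show_limits` is a hypothetical function name, and it defaults to the current shell's PID so it can be exercised anywhere; on a node you would pass the slurmd/slurmstepd/matlab PIDs instead.

```shell
# Sketch: print the kernel's view of a process's limits (/proc/<pid>/limits).
# Defaults to the current shell; pass a PID to inspect another process.
show_limits() {
    pid="${1:-$$}"
    echo "=== /proc/${pid}/limits ==="
    cat "/proc/${pid}/limits"
}

show_limits

# On a compute node, one would run e.g.:
#   show_limits "$(pidof slurmd)"
#   show_limits "$(pidof slurmstepd)"
```

Pairing this with `ulimit -a` from the same shell shows both the shell's and the kernel's view, which is what the checklist above asks for.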
(In reply to Misha Ahmadian from comment #5)
> So, we're not entirely confident at this point. The differences in max user
> process limit inside and outside Slurm are the most obvious thing we have
> found so far.

Could you reduce MATLAB's max processes in the SSH session to match what it is in Slurm, and see if you can force the failure? If not, perhaps the max processes limit is a red herring.

Also, can you explain how MATLAB integrates with Slurm? Is it using MPI under the hood? If so, which MPI and which version?

Apologies for my delay in joining this thread.

(In reply to Michael Hinton from comment #4)
> How confident are you that the max user processes limit is being hit? Are
> you able to get diagnostic information on the 99th ParPool process before it
> crashes (perhaps by inserting a sleep and then checking the current limits
> of the parent process)?

Typically, I'd be very confident that the max user process limit is being hit. When my local pool of workers fails to start, a Java crash log is often generated with the message "java.lang.OutOfMemoryError: unable to create new native thread". With MATLAB, this almost always means the user process limit is too low. However, the limits here are set to such a high value that we shouldn't be running into this issue. This discrepancy is something I'm trying to look into further from my side.

(In reply to Michael Hinton from comment #14)
> Also, can you explain how MATLAB integrates with Slurm? Is it using MPI
> under the hood? If so, which MPI and which version?

There are two primary methods by which we can run MATLAB on the cluster. The first (which we are testing here) is to request resources from the scheduler, manually open the client, and run your code. With this method, MATLAB is using its "local" profile, and starting a parallel pool uses only the resources that were assigned to us on that one particular node.

In comparison, using a Slurm cluster profile will submit a secondary job to Slurm that can span multiple nodes using the MATLAB Parallel Server product. I believe this uses MPICH3, though the mechanism for starting these workers is slightly different than for starting local workers.

Please let me know if there is any specific information from my side that I can provide.

Hi Michael,

Please find the answers below (BTW, we experience this on every worker node):

===================================================================
(Under systemd)
===================================================================

1) Latent slurmd info:
-----------------------

# ssh cpu-25-8
[root@cpu-25-8 ~]# cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes
TasksMax=infinity

[Install]
WantedBy=multi-user.target

[root@cpu-25-8 ~]# grep -v '#' /etc/security/limits.conf
* soft memlock unlimited
* hard memlock unlimited
* hard nofile 8192
* soft nofile 8192
* soft core 2097152
* hard core 4194304

[root@cpu-25-8 ~]# cat /proc/$(pidof slurmd)/limits
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            unlimited     unlimited     bytes
Max core file size        unlimited     unlimited     bytes
Max resident set          unlimited     unlimited     bytes
Max processes             2061717       2061717       processes
Max open files            131072        131072        files
Max locked memory         unlimited     unlimited     bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       2061717       2061717       signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us

[root@cpu-25-8 ~]# su - user1
[user1@cpu-25-8 ~]$ ulimit -a
core file size (blocks, -c) 2097152
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2061717
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 8192
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 2061717
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

--------------------
2) Interactive job:
--------------------

$ interactive -p nocona -c 128 -r nocona_test -w cpu-25-8
cpu-25-8:$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2061717
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) 527826944
open files (-n) 8192
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1029626
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

cpu-25-8:$ cat /proc/self/limits
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            unlimited     unlimited     bytes
Max core file size        unlimited     unlimited     bytes
Max resident set          540494790656  540494790656  bytes
Max processes             1029626       2061717       processes
Max open files            8192          131072        files
Max locked memory         unlimited     unlimited     bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       2061717       2061717       signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us

cpu-25-8:$ matlab -nodisplay
>> !ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2061717
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) 527826944
open files (-n) 131072
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1029626
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

>> !cat /proc/self/limits
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            unlimited     unlimited     bytes
Max core file size        unlimited     unlimited     bytes
Max resident set          540494790656  540494790656  bytes
Max processes             1029626       2061717       processes
Max open files            131072        131072        files
Max locked memory         unlimited     unlimited     bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       2061717       2061717       signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us

===================================================================
(Not under systemd)
===================================================================

1) Latent slurmd info:
-----------------------

# ssh cpu-25-8
[root@cpu-25-8 ~]# systemctl stop slurmd
[root@cpu-25-8 ~]# slurmd -D --conf-server 10.100.21.250:6817 &
[root@cpu-25-8 ~]# cat /proc/$(pidof slurmd)/limits
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            8388608       unlimited     bytes
Max core file size        4294967296    4294967296    bytes
Max resident set          unlimited     unlimited     bytes
Max processes             2061717       2061717       processes
Max open files            8192          8192          files
Max locked memory         unlimited     unlimited     bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       2061717       2061717       signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us

[root@cpu-25-8 ~]# su - user1
Last login: Mon Feb 21 12:36:33 CST 2022 on pts/0
[user1@cpu-25-8 ~]$ ulimit -a
core file size (blocks, -c) 2097152
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2061717
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 8192
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 2061717
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

--------------------
2) Interactive job:
--------------------

$ interactive -p nocona -c 128 -r nocona_test -w cpu-25-8
cpu-25-8:$ ulimit -a
core file size (blocks, -c) 4194304
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2061717
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) 527826944
open files (-n) 8192
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1029626
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

cpu-25-8:$ cat /proc/self/limits
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            unlimited     unlimited     bytes
Max core file size        4294967296    4294967296    bytes
Max resident set          540494790656  540494790656  bytes
Max processes             1029626       2061717       processes
Max open files            8192          8192          files
Max locked memory         unlimited     unlimited     bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       2061717       2061717       signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us

cpu-25-8:$ matlab -nodisplay
>> !ulimit -a
core file size (blocks, -c) 4194304
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2061717
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) 527826944
open files (-n) 8192
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1029626
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

>> !cat /proc/self/limits
Limit                     Soft Limit    Hard Limit    Units
Max cpu time              unlimited     unlimited     seconds
Max file size             unlimited     unlimited     bytes
Max data size             unlimited     unlimited     bytes
Max stack size            unlimited     unlimited     bytes
Max core file size        4294967296    4294967296    bytes
Max resident set          540494790656  540494790656  bytes
Max processes             1029626       2061717       processes
Max open files            8192          8192          files
Max locked memory         unlimited     unlimited     bytes
Max address space         unlimited     unlimited     bytes
Max file locks            unlimited     unlimited     locks
Max pending signals       2061717       2061717       signals
Max msgqueue size         819200        819200        bytes
Max nice priority         0             0
Max realtime priority     0             0
Max realtime timeout      unlimited     unlimited     us

>> c = parcluster('local');
>> c.NumWorkers = 128;
>> p = c.parpool(128);
Starting parallel pool (parpool) using the 'local' profile ...
Error using parallel.Cluster/parpool (line 88)
Parallel pool failed to start with the following error. For more detailed
information, validate the profile 'local' in the Cluster Profile Manager.

Caused by:
    Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305)
    Failed to initialize the interactive session.
        Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 426)
        The interactive communicating job failed with no message.

===================================================================

As you can see above, there are no differences between running slurmd under systemd and running it as a standalone process: in both cases the ulimits are the same, and MATLAB crashes in the same way. However, the ulimits inside and outside a Slurm session on the same node are not the same.

Please let me know if you need more info from me or Damian.

Best Regards,
Misha

Misha,

Have you tried temporarily increasing the limit inside an interactive Slurm job before running MATLAB, to verify that it can work if only that limit is increased properly?

-Michael

You were apparently printing out the user limits on the compute node itself (cpu-25-8). But I believe that limits are propagated *from the submission node*. What is the submission node (where you call `$ interactive ...`), and what are the user limits before submission?
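The suggested check of temporarily raising the limit inside the interactive job can be sketched in shell. This is illustrative only; a process may always raise its soft limit up to (but not beyond) its hard limit, so the MATLAB launch is shown commented out:

```shell
# Sketch: inside the interactive job, raise the soft "max user processes"
# limit to the hard limit before launching MATLAB.
ulimit -S -u "$(ulimit -H -u)"
ulimit -u               # print the soft limit to verify it was raised
# matlab -nodisplay     # then start MATLAB with the raised limit
```

If the parpool then succeeds, that would confirm the soft NPROC limit as the culprit; if it still fails, the limit is a red herring.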
Could we also do a sanity check to see if turning off limit propagation in slurm.conf for NPROC solves the issue?
Change:
PropagateResourceLimitsExcept=CORE,MEMLOCK
to:
PropagateResourceLimitsExcept=CORE,MEMLOCK,NPROC
just to be sure that we aren't somehow propagating a user limit.
Thanks!
-Michael
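The propagation behavior being probed here can be reproduced without Slurm, since srun essentially re-applies the submitting shell's soft limits on the compute node (subject to PropagateResourceLimitsExcept). A minimal sketch, using the open-files limit because it is always safe to lower (the mechanism for NPROC is identical):

```shell
# A child process inherits the parent's soft limits, just as a Slurm job
# inherits (propagated) soft limits from the submission shell.
( ulimit -S -n 1024          # pretend this is the login node's soft limit
  sh -c 'ulimit -S -n' )     # the child reports 1024, not the hard limit
```

This is why the job's "max user processes" can differ from what limits.conf sets on the compute node itself.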
Hi Michael,

> You were apparently printing out the user limits on the compute node itself
> (cpu-25-8). But I believe that limits are propagated *from the submission
> node*. What is the submission node (where you call `$ interactive ...`), and
> what are the user limits before submission?

Hmm, that's a very interesting point. I totally forgot about the limit propagation by srun in this case. OK, now I'm having a hard time understanding something; maybe you can help.

I call the "interactive" command from a login node. Below are the ulimits and the limits.conf file on the login node:

login-20-26:$ ulimit -a
core file size (blocks, -c) 2097152
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1029626
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 8192
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1029626
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

login-20-26:$ grep -v '#' /etc/security/limits.conf
* soft memlock unlimited
* hard memlock unlimited
* hard nofile 8192
* soft nofile 8192

So, I'm wondering why the "max user processes" limit on the login nodes is different from the one on the worker nodes. We never specified an "nproc" limit in limits.conf on either the login or the worker nodes, yet the "max user processes" is not the same. I checked the /etc/security/limits.d directory on both login and worker nodes, but they're empty. I also didn't find anything in /etc/sysctl.conf or /etc/sysctl.d/* that sets such a limit. What on earth might have changed the "max user processes" on the worker nodes, then? Is there any other location I'm missing? (I checked ~/.bashrc as well.)

> Could we also do a sanity check to see if turning off limit propagation in
> slurm.conf for NPROC solves the issue?

Would this change affect running jobs? Would it require a full Slurm restart, or will "scontrol reconfigure" take care of it?

Best Regards,
Misha

(In reply to Misha Ahmadian from comment #19)
> max user processes (-u) 1029626

Ok, great! That's where the 1029626 limit is coming from.

> So, I'm wondering why the "max user processes" limit on the login nodes is
> different from the one on the worker nodes.

Are the login nodes and worker nodes running the same distro at the same version? Perhaps there is a different default for max user processes depending on the distro. Also, is there a login .bashrc or something similar? Perhaps something is changing the ulimit at runtime for the login shell.

This answer indicates that there are multiple sources of limits other than systemd, including PAM: https://serverfault.com/a/485277. Even though you don't have any systemd limit configs, I believe systemd still sets a default limit. Run `cat /proc/1/limits` to see the defaults of the init process; perhaps that will account for the difference.

At any rate, disabling NPROC limit propagation in Slurm should solve the issue.

> Would this change affect running jobs? Would it require a full Slurm
> restart, or will "scontrol reconfigure" take care of it?

I think an `scontrol reconfigure` will suffice. In my testing, though, I didn't even need to reconfigure for the changes to take effect; I just updated slurm.conf. I think this is because salloc reads slurm.conf when it starts and freshly parses the new limit propagation settings, regardless of whether `scontrol reconfigure` was called beforehand.

Thanks,
-Michael

Also from https://serverfault.com/a/485277: "At boot time, Linux sets default limits to the init (or systemd) process, which are then inherited by all the other (children) processes. To see these limits: cat /proc/1/limits." So I guess the limits initially given to the systemd init process are determined by the Linux kernel, not systemd itself. I'm not sure if there is a corresponding Linux config parameter that sets the default limit, or if it is hard-coded and can change between kernel versions.
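The inheritance chain described above can be inspected directly. A small sketch (assuming a Linux /proc; reading PID 1's limits may require matching privileges on some systems):

```shell
# PID 1 receives its initial rlimits from the kernel at boot; every other
# process inherits from it unless PAM, a systemd unit, or Slurm overrides
# them along the way. Comparing PID 1 against the current process shows
# which limits were changed in between.
grep "Max processes" /proc/self/limits
grep "Max processes" /proc/1/limits || echo "(/proc/1/limits not readable here)"
```

Differing values between the two lines point at whatever sits between init and the shell (PAM limits, a systemd unit's Limit* settings, or Slurm's propagation).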
Hi Michael and Damian, So, there are no differences between the OS distro on login nodes and worker nodes. Both are running CenoOS 8.1.1911. And, you're right; the "Max processes" limits on the init process on login and worker nodes are not equal and I have no idea what causes this, but I can look into that later since it's not a big deal now: [root@cpu-25-8 ~]# cat /proc/1/limits | grep "Max processes" Max processes 2061717 2061717 processes [root@login-20-25 ~]# cat /proc/1/limits | grep "Max processes" Max processes 1029644 1029644 processes However, adding NPROC to the PropagateResourceLimitsExcept worked perfectly! Now, this is what happened after I made the changes in the Slurm configuration: ================================== (Ulimits on a node out of Slurm) ================================== [root@cpu-25-8 ~]# ulimit -a core file size (blocks, -c) 2097152 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 2061717 max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) unlimited open files (-n) 8192 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 2061717 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited ============================================= (Ulimits inside a Slurm interactive session) ============================================= login-20-26:$ interactive -p nocona -r nocona_test cpu-25-8:$ ulimit unlimited cpu-25-8:$ ulimit -a core file size (blocks, -c) unlimited data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 2061717 max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) 4123648 open files (-n) 8192 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) unlimited cpu time 
(seconds, -t) unlimited max user processes (-u) 2061717 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited ========================================================== (Ulimits of MATLAB process inside an interactive session) ========================================================== cpu-25-8:$ matlab -nodisplay >> !ulimit -a core file size (blocks, -c) unlimited data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 2061717 max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) 527826944 open files (-n) 131072 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) unlimited cpu time (seconds, -t) unlimited max user processes (-u) 2061717 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited As you can see, there is a difference between the "max memory size" of the interactive session and the MATLAB process. However, that should be ok, because MATLAB is getting the limits we desire to see! So, everything looks good for MATLAB! Right? Then I went ahead and started a Parallel pool in MATLAB: >> c = parcluster('local'); >> c.NumWorkers = 128; >> p = c.parpool(128); Starting parallel pool (parpool) using the 'local' profile ... Error using parallel.Cluster/parpool (line 88) Parallel pool failed to start with the following error. For more detailed information, validate the profile 'local' in the Cluster Profile Manager. Caused by: Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305) Failed to initialize the interactive session. Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 399) The interactive communicating job failed with no message. It crashed again!! Now once I called the "p = c.parpool(128)" I login to the cpu-25-8 as root and tried the "ps aux | grep -i matlab | wc -l". 
That showed me that 128 MATLAB processes were spawned successfully (it used to crash at around ~90 processes before the changes I made). However, for some other reason, MATLAB failed again. I'm not sure what else might have caused this. Whatever it is, it's tied to Slurm, because parallel MATLAB is still fine outside of Slurm. I think Damian can help us a bit at this point.

Best Regards,
Misha

Ok, great. At least it is spawning all 128 processes. Sounds like we just need to debug the Slurm-MATLAB integration some more.

Have you ever gotten a similar MATLAB job working in Slurm in the past? In other words, is this a new failure, or has this type of job never worked before?

Can you attach the `interactive` script/command code? I want to see how it is calling salloc.

It sounds like you already confirmed that the cgroups for the job and step are what you expect, but it would be good to double-check.

(In reply to Michael Hinton from comment #10)
> P.S. Starting in 20.11, the preferred method for interactive jobs is to set
> `use_interactive_step` in LaunchParameters in slurm.conf, and then simply
> use `salloc` to start up the interactive job. `salloc` with no arguments
> will internally call `srun --interactive --preserve-env --pty $SHELL`, and
> this can be modified with InteractiveStepOptions.
I would also try out `LaunchParameters=use_interactive_step` and then create an interactive job with just `salloc` to see if that gives different results.
-Michael

(In reply to Michael Hinton from comment #23)
Hi Michael,
> (In reply to Michael Hinton from comment #10)
> > P.S. Starting in 20.11, the preferred method for interactive jobs is to set
> > `use_interactive_step` in LaunchParameters in slurm.conf, and then simply
> > use `salloc` to start up the interactive job. `salloc` with no arguments
> > will internally call `srun --interactive --preserve-env --pty $SHELL`, and
> > this can be modified with InteractiveStepOptions.
> I would also try out `LaunchParameters=use_interactive_step` and then create
> an interactive job with just `salloc` to see if that gives different results.

Wow! I totally missed your P.S. in comment #10! That actually did the final magic for me and resolved the whole problem!! Now I'm able to run MATLAB parpool successfully:

>> c = parcluster('local');
>> c.NumWorkers = 128;
>> p = c.parpool(128);
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 128).

Thank you, Michael. Let me check with Damian, do some further tests, and I will let you know if we have further questions for you.

Best Regards,
Misha

(In reply to Misha Ahmadian from comment #24)
> (In reply to Michael Hinton from comment #23)
> Hi Michael,
> > (In reply to Michael Hinton from comment #10)
> > > P.S. Starting in 20.11, the preferred method for interactive jobs is to set
> > > `use_interactive_step` in LaunchParameters in slurm.conf, and then simply
> > > use `salloc` to start up the interactive job. `salloc` with no arguments
> > > will internally call `srun --interactive --preserve-env --pty $SHELL`, and
> > > this can be modified with InteractiveStepOptions.
> > I would also try out `LaunchParameters=use_interactive_step` and then create
> > an interactive job with just `salloc` to see if that gives different results.
>
> Wow! I totally missed your P.S. in comment #10! That actually did the final
> magic for me and resolved the whole problem!!

Great! So are you using a plain `salloc`, or are you still using your `interactive` wrapper? I would still like to know what the `interactive` script does so I can better understand how use_interactive_step fixed things for you.

Thanks!
-Michael

Created attachment 23616 [details]
Slurm Interactive Script file

Hi Michael,

Sure. Please find the attached "interactive" script file. I modified the last line to match the new settings.
We intend to keep the interactive script, since most of our users are still using this command, but there is no difference between using "interactive" and "salloc" now. I also added the following lines to the slurm.conf file to reflect the new changes:

>LaunchParameters=use_interactive_step
>InteractiveStepOptions="--interactive --preserve-env --pty /bin/bash -c 'source /etc/slurm/scripts/slurm_fix_modules.sh && /bin/bash -l -i'"

The "InteractiveStepOptions" may sound weird, but I need that trick to run the "slurm_fix_modules.sh" script before each interactive session. This script fixes the incompatibility issue with Lmod versions and contents across various partitions. It's a harmless script and doesn't mess with the current settings.

Please let me know if you need anything else from me.

Best Regards,
Misha

(In reply to Misha Ahmadian from comment #26)
> Please let me know if you need anything else from me.
Well, if it works, it works, so I guess we can mark this as resolved!

I'm still not 100% certain how use_interactive_step fixed the issue - I would need to understand better what MATLAB is complaining about - but my guess is that it has to do with a change in 20.11 where steps no longer overlap. This change means that in 20.11, a simple `srun --pty` would create a step with all resources, and that step would not allow other steps to run unless the new --overlap arg was specified.

From https://slurm.schedmd.com/faq.html#prompt:

"By default, use_interactive_step creates an interactive step on a node in the allocation and runs the shell in that step. An interactive step is to an interactive shell what a batch step is to a batch script - both have access to all resources in the allocation on the node they are running on, but do not "consume" them.

"Note that beginning in 20.11, steps created by srun are now exclusive.
This means that the previously-recommended way to get an interactive shell, srun --pty $SHELL, will no longer work, as the shell's step will now consume all resources on the node and cause subsequent srun calls to pend."

Also, from the 20.11 RELEASE_NOTES:

 -- By default, a step started with srun will [now] be granted exclusive (or
    non-overlapping) access to the resources assigned to that step. No other
    parallel step will be allowed to run on the same resources at the same
    time. This replaces one facet of the '--exclusive' option's behavior,
    but does not imply the '--exact' option described below. To get the
    previous default behavior - which allowed parallel steps to share all
    resources - use the new srun '--overlap' option.

So it is good that you are now using use_interactive_step - without it, I imagine you would have eventually seen other non-MATLAB interactive job issues as well.

Thanks!
-Michael
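The explanation above boils down to a small amount of configuration. Here is a sketch of the relevant pieces, using the option names discussed in the thread; the example commands are illustrative of the 20.11 behavior described in the FAQ and release notes, not commands actually run at TTU:

```shell
# slurm.conf (20.11+): have salloc itself launch the interactive step
LaunchParameters=use_interactive_step

# With that set, an interactive job is simply:
#   $ salloc -p nocona -n 128
# salloc then internally runs:
#   srun --interactive --preserve-env --pty $SHELL
# and the shell's step has access to, but does not consume, the allocation.

# Without use_interactive_step, the old pattern creates an exclusive step:
#   $ srun --pty $SHELL        # this step claims the step resources in 20.11
#   $ srun --overlap hostname  # parallel steps must now opt in to sharing
```

This is why MATLAB's parpool workers could fail inside the old-style interactive shell: the shell's own step held the resources, so further steps pended.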