| Summary: | how to increase max locked mem | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Wei Feinstein <wfeinstein> |
| Component: | Limits | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 22.05.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | LBNL - Lawrence Berkeley National Laboratory | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
| Attachments: | slurm.conf | | |
Created attachment 31189 [details]
slurm.conf
Please see attached slurm.conf
Wei,

Take a look at the ulimit that is set in the environment/shell where slurmd is launched. Slurmd inherits what is set in the Linux environment as it starts, and so do the slurmstepds. You would need to change the limit before starting slurmd; in this case, you would do so in the Linux environment on that compute node. If you are using systemd, you could probably set it there as well.

Before running slurmd:
> ulimit -l 4064952

Start slurmd and check its ulimits:
> # cat /proc/<SLURMD_PID>/limits
> Limit                Soft Limit    Hard Limit    Units
> ...
> Max locked memory    4162510848    4162510848    bytes

Changing the limit by hand will demonstrate what I mean. Stop slurmd, modify the limit, and start it again:
> $# ulimit -l unlimited
> $# slurmd
> $# cat /proc/33493/limits
> Limit                Soft Limit    Hard Limit    Units
> ...
> Max locked memory    unlimited     unlimited     bytes

Hi Jason,

Thank you. Please take a look here: max locked memory is shown as unlimited on n0001.lr7.

[root@n0001.lr7 ~]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1028443
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 262144
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1028443
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

But when I do srun, there is only 32GB:

[wfeinstein@n0003 ~]$ srun -p lr7 -A scs -t 1:0:0 -q lr_normal --exclusive --pty bash
srun: job 62302638 queued and waiting for resources
srun: job 62302638 has been allocated resources
[wfeinstein@n0001 ~]$ ulimit -l
32000000
[wfeinstein@n0001 ~]$ hostname
n0001.lr7

I also followed your instructions as below:

[root@n0001.lr7 ~]# ps aux | grep slurmd
root 24109 0.0 0.0 281948 8424 ? S 10:50 0:00 /usr/sbin/slurmd
root 24154 0.0 0.0 9092 680 pts/1 S+ 10:56 0:00 grep --color=auto slurmd
[root@n0001.lr7 ~]# cat /proc/24109/limits
Limit                    Soft Limit     Hard Limit     Units
Max cpu time             unlimited      unlimited      seconds
Max file size            unlimited      unlimited      bytes
Max data size            unlimited      unlimited      bytes
Max stack size           unlimited      unlimited      bytes
Max core file size       unlimited      unlimited      bytes
Max resident set         unlimited      unlimited      bytes
Max processes            1028443        1028443        processes
Max open files           51200          51200          files
Max locked memory        32768000000    32768000000    bytes
Max address space        unlimited      unlimited      bytes
Max file locks           unlimited      unlimited      locks
Max pending signals      1028443        1028443        signals
Max msgqueue size        819200         819200         bytes
Max nice priority        0              0
Max realtime priority    0              0
Max realtime timeout     unlimited      unlimited      us
[root@n0001.lr7 ~]# systemctl stop slurmd
[root@n0001.lr7 ~]# ulimit -l unlimited
[root@n0001.lr7 ~]# systemctl start slurmd
[root@n0001.lr7 ~]# ps aux | grep slurmd
root 24174 0.0 0.0 215384 8104 ? S 10:57 0:00 /usr/sbin/slurmd
root 24188 0.0 0.0 9092 676 pts/1 S+ 10:58 0:00 grep --color=auto slurmd
[root@n0001.lr7 ~]# cat /proc/24174/limits
(identical to the listing above; Max locked memory is still 32768000000 / 32768000000 bytes)

Max locked memory did NOT get changed. Advise?

Thank you,
Wei

Wei, the explanation was just an explanation, not a solution. This is outside of Slurm's control.
It is set in your Linux environment, so you will need to set this either through the slurmd.service script or the calling environment. It is also entirely possible that PAM is overwriting limits. You will need to investigate what works for your Linux environment.

[1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Process%20Properties
[2] /etc/security/limits.conf
[3] https://ss64.com/bash/limits.conf.html

Hi Jason,
I talked to my system admins, they checked everything they could think of.
[root@perceus-00 ~]# cat /etc/security/limits.conf
* soft memlock unlimited
* hard memlock unlimited
* soft nofile 10240
* hard nofile 1000000
[root@master ~]# cat /etc/sysconfig/slurm
ulimit -l unlimited
ulimit -n 100000
ulimit -u unlimited
ulimit -s unlimited
salloc looks fine shown unlimited below:
[wfeinstein@n0003 ~]$ salloc -p lr7 -A scs -t 1:0:0 -q lr_normal --exclusive
salloc: Pending job allocation 62307334
salloc: job 62307334 queued and waiting for resources
salloc: job 62307334 has been allocated resources
salloc: Granted job allocation 62307334
[wfeinstein@n0003 ~]$ squeue -u wfeinstein
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
62307334 lr7 interact wfeinste R 0:10 1 n0001.lr7
[wfeinstein@n0003 ~]$ ssh n0001.lr7
Last login: Tue Jul 11 10:41:13 2023 from 10.0.2.3
[wfeinstein@n0001 ~]$ ulimit -l
unlimited
[wfeinstein@n0001 ~]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1028443
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 262144
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 32768
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
However, srun and sbatch have problems.
[wfeinstein@n0003 ~]$ srun -p lr7 -A scs -t 1:0:0 -q lr_normal --exclusive --pty bash
[wfeinstein@n0001 ~]$ ulimit -l
32000000
Users are waiting on us for solutions in order to run MPI jobs.
Can you please guide us for further debug, where/what to look and check? If needed, I will bring the system admins to the ticket.
Highly appreciated.
Wei
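[Editor's note] Jason's point about inheritance can be reproduced on any Linux box without Slurm at all: an rlimit set in a process is inherited by every child it launches, which is exactly how slurmd's startup limit reaches each slurmstepd and the job shell. A minimal sketch (the value 8 is arbitrary):

```shell
# rlimits propagate across fork/exec: whatever the parent process has
# is what every descendant starts with. Lower the soft memlock limit
# in this shell, then observe a child shell report the same value.
ulimit -S -l 8       # soft "max locked memory", in kbytes
bash -c 'ulimit -l'  # → 8
```

This is also why restarting slurmd from a root shell after `ulimit -l unlimited` had no effect above: `systemctl start` hands the request to PID 1, so slurmd inherits systemd's limits (from the unit file), not the shell's.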
What flavor of Linux is your site using?

centos 7.9

> centos 7.9

I was not able to duplicate this.

[root@controller ~]# salloc --exclusive
salloc: Granted job allocation 5004
[root@controller ~]# srun --exclusive --pty bash
[root@n2 ~]# ulimit -l
unlimited
[root@n2 ~]# exit
[root@controller ~]#

> Can you please guide us for further debug, where/what to look and check? If needed, I will bring the system admins to the ticket.

Slurm uses the limits provided by the operating system. These can be set/defined in more than one location. I am not sure where this limit is being brought in from, but there is something in your environment setting this. If you run your srun with "--propagate=NONE", does that change the outcome?

https://slurm.schedmd.com/srun.html#OPT_propagate

Hi Jason,
It is tricky.
[wfeinstein@n0003 ~]$ srun -p lr7 -A scs -t 1:0:0 -q lr_normal --exclusive --propagate=NONE --pty bash
srun: job 62458274 queued and waiting for resources
srun: job 62458274 has been allocated resources
[wfeinstein@n0004 ~]$ ulimit -l
32000000
[wfeinstein@n0003 ~]$ squeue -u wfeinstein
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
62458337 lr7 interact wfeinste R 0:11 1 n0004.lr7
[wfeinstein@n0003 ~]$ ssh n0004.lr7
[wfeinstein@n0004 ~]$ ulimit -l
unlimited
It has to be somewhere.
Thanks,
Wei
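[Editor's note] The asymmetry Wei shows (unlimited via ssh, capped via srun) is a strong hint that the cap is attached to slurmd itself rather than to PAM or limits.conf: ssh sessions pass through pam_limits, while job steps inherit their limits from slurmd. A quick way to see the limits a daemon is actually running with (a sketch; substitute slurmd's PID for `$$` on a compute node):

```shell
# /proc/<pid>/limits shows the limits a process really has, regardless
# of what any login shell reports. On the "Max locked memory" line,
# field 4 is the soft limit and field 5 the hard limit. Demonstrated
# on the current shell; use slurmd's PID on a real node.
awk '/Max locked memory/ {print "soft:", $4, "hard:", $5}' "/proc/$$/limits"
```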
Please attach your slurmd.service file and also the output from the following command while the srun is active.
> $ ps aux | grep "slurm"
Hi Jason,

[root@n0001.lr7 ~]# cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
Wants=nvidia-modprobe.service
After=network.target nvidia-modprobe.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm/slurmd.pid
KillMode=process
LimitNOFILE=51200
#LimitMEMLOCK=infinity
LimitMEMLOCK=32768000000
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target

[root@master ~]# ps aux | grep "slurm"
slurm 11060 0.3 0.0 1031288 103912 ? Sl Jun29 71:33 /usr/sbin/slurmdbd
slurm 11074 18.7 8.0 17185056 10683232 ? Sl Jun29 3812:25 /usr/sbin/slurmctld
slurm 11075 0.0 0.0 1109824 1996 ? S Jun29 0:05 slurmctld: slurmscriptd
root 11533 2.4 0.0 2337696 25756 ? Ssl Jul06 235:17 /usr/local/bin/prometheus-slurm-exporter
root 11851 0.0 0.0 112812 980 pts/0 S+ 15:02 0:00 grep --color=auto slurm

This might be it.

Wei

Thank you Jason!

Wei

Hi Jason,

One more question about slurmd.service: what do you think about the config in this file?

[root@n0001.lr7 ~]# cat /usr/lib/systemd/system/slurmd.service
(same slurmd.service contents as above, including LimitMEMLOCK=32768000000)

Thanks,
Wei

Please also provide the output of `systemctl start slurmd`.

Hi Nate,

systemctl start slurmd has no return. Did you mean status?
[root@n0001.lr7 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-07-11 11:32:21 PDT; 2 days ago
 Main PID: 24372 (slurmd)
   CGroup: /system.slice/slurmd.service
           ├─24372 /usr/sbin/slurmd
           ├─41517 slurmstepd: [62412180.batch]
           ├─41532 /bin/bash /var/spool/slurmd/job62412180/slurm_script
           ├─41548 mpirun -np 56 lmp -in in.mismatch
           ├─41556 lmp -in in.mismatch
           ├─... (56 "lmp -in in.mismatch" MPI ranks, PIDs 41556–41645)
           └─41645 lmp -in in.mismatch

Jul 12 15:33:03 n0001.lr7 su[36417]: (to jdvorak2) root on none
Jul 12 15:33:03 n0001.lr7 su[36417]: pam_unix(su-l:session): session opened for user jdvorak2 by (uid=0)
Jul 12 15:33:03 n0001.lr7 su[36417]: pam_unix(su-l:session): session closed for user jdvorak2
Jul 12 15:42:52 n0001.lr7 jdvorak2[37043]: n0001.lr7 jdvorak2: module load valgrind
Jul 12 15:45:09 n0001.lr7 su[37119]: (to jdvorak2) root on none
Jul 12 15:45:09 n0001.lr7 su[37119]: pam_unix(su-l:session): session opened for user jdvorak2 by (uid=0)
Jul 12 15:45:09 n0001.lr7 su[37119]: pam_unix(su-l:session): session closed for user jdvorak2
Jul 12 15:49:56 n0001.lr7 su[39147]: (to jdvorak2) root on none
Jul 12 15:49:56 n0001.lr7 su[39147]: pam_unix(su-l:session): session opened for user jdvorak2 by (uid=0)
Jul 12 15:49:56 n0001.lr7 su[39147]: pam_unix(su-l:session): session closed for user jdvorak2

Thanks,
Wei

Nate,

Wanted to add:

[root@n0001.lr7 ~]# systemctl start slurmd
Warning: slurmd.service changed on disk. Run 'systemctl daemon-reload' to reload units.

Wei

(In reply to Wei Feinstein from comment #15)
> systemctl start slurmd has no return. Did you mean status?

Yes, but I actually needed both based on comment#16.

(In reply to Wei Feinstein from comment #16)
> [root@n0001.lr7 ~]# systemctl start slurmd
> Warning: slurmd.service changed on disk. Run 'systemctl daemon-reload' to reload units.

Please add or append "SLURMD_OPTIONS=-M" to /etc/sysconfig/slurmd, and please add 'LaunchParameters=slurmstepd_memlock' to slurm.conf. This is outlined in https://slurm.schedmd.com/slurmd.html under "-M" if you want more details.

Then please call the following:
> systemctl daemon-reload
> systemctl restart slurmd
> systemctl status slurmd
and attach the output.

Hi Nate,

I did the following only on a compute node.
slurm.conf: LaunchParameters=enable_nss_slurm,slurmstepd_memlock

/etc/sysconfig/slurmd does not exist on compute nodes.

I also commented out LimitMEMLOCK=32768000000 in slurmd.service:

[root@n0001.lr7 ~]# cat /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
Wants=nvidia-modprobe.service
After=network.target nvidia-modprobe.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
#LimitMEMLOCK=32768000000
LimitSTACK=infinity
...

[root@n0001.lr7 ~]# systemctl daemon-reload
[root@n0001.lr7 ~]# systemctl restart slurmd
[root@n0001.lr7 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2023-07-14 07:31:18 PDT; 6s ago
  Process: 53116 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 53118 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─53118 /usr/sbin/slurmd

Jul 14 07:31:18 n0001.lr7 systemd[1]: Starting Slurm node daemon...
Jul 14 07:31:18 n0001.lr7 systemd[1]: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
Jul 14 07:31:18 n0001.lr7 systemd[1]: Started Slurm node daemon.

I also tested on a separate testbed: once LimitMEMLOCK=infinity is set, srun does see the ulimit setting.

Thank you,
Wei

(In reply to Wei Feinstein from comment #18)
> /etc/sysconfig/slurmd does not exist on compute nodes

Was one created? Setting 'SLURMD_OPTIONS=-M' is required to get slurmd to lock itself into memory.

> Also commented out LimitMEMLOCK=32768000000 in slurmd.service

Directly modifying the systemd unit files is not recommended, as they have a bad habit of getting overwritten. I suggest creating the following drop-in instead:

/usr/lib/systemd/system/slurmd.service.d/local.conf:
> [Service]
> LimitMEMLOCK=infinity
> Environment=SLURMD_OPTIONS=-M

Setting the rlimit for locking memory to infinity may have unintended consequences, causing other system daemons to fail if a job should decide to lock most of the possible system memory. For example, if slurmd is not started with '-M' and it ends up getting swapped out, it may not be able to get paged back in if all the other memory is locked by the job processes, effectively causing a deadlock.

> I also tested on a separate testbed, once LimitMEMLOCK=infinity is set,
> srun does see ulimit setting.

I assume by this comment that you mean the rlimit is now unlimited, as requested? Also, the slurmd unit file appears to be out of date from the suggested template. Please update it to the version generated at compile time in the etc dir.

Nate,

I understand the changes need to be pushed to all the nodes, but for now, in the testing stage, please bear with me. I did the following on a compute node:

echo "SLURMD_OPTIONS=-M" > /etc/sysconfig/slurmd

"Setting rlimit for locking memory to infinity may have unintended consequences causing other system daemons to fail if a job should decide to lock most of the possible system memory."

Based on this comment, I suspect it might be the reason why we had LimitMEMLOCK=32768000000 in the first place. If I understand right, I should avoid "LimitMEMLOCK=infinity" and replace it with a number < RAM on the node.

"the slurmd unit file appears to be out of date from the suggested template. Please update it to the version generated at compile time in the etc dir."

The file was changed by the system admin yesterday across the board, although slurmd has not been restarted since we have a downtime coming up next week.
To summarize, I need to have:
- /etc/sysconfig/slurmd (SLURMD_OPTIONS=-M) on all compute nodes
- 'LaunchParameters=slurmstepd_memlock' in slurm.conf
- LimitMEMLOCK=(not infinity, but a number < RAM) in /usr/lib/systemd/system/slurmd.service

Comments?
Wei

(In reply to Wei Feinstein from comment #20)
> I understand the changes need be pushed to all the nodes.

Yes, but it is always suggested to test on a single node (admin reserved) to verify the configuration changes first.

> I did the following on a compute node:
> echo "SLURMD_OPTIONS=-M" > /etc/sysconfig/slurmd

Understood. That should be sufficient.

> Based on this comment, I suspect it might be the reason why we had
> LimitMEMLOCK=32768000000 in the first place. If I understand right, I should
> avoid "LimitMEMLOCK=infinity", and replace it with a number < RAM on the node.

I'm not sure why that exact number was set, but I suggest giving the system daemons a good safety margin of memory to avoid node failures.

> The file was changed by the system admin yesterday across the board,
> although slurmd have not been restarted since we have a downtime coming up
> next week.

Understood. The latest unit files switch to the non-forking version, which allows better process tracking and logging by systemd.

> To summarize, I need to have
> /etc/sysconfig/slurmd (SLURMD_OPTIONS=-M) on all compute nodes
> LaunchParameters=slurmstepd_memlock' in slurm.conf
> LimitMEMLOCK=(not infinity but a number < RAM) in
> /usr/lib/systemd/system/slurmd.service

That looks correct, but I always like to verify that the settings took with the output of `systemctl status slurmd` and a test job.

Nate,

[root@perceus-00 ~]# ssh n0059.lr7
Last login: Sun Jun 25 09:34:35 2023 from 10.0.0.10
[root@n0059.lr7 ~]# vi /etc/slurm/slurm.conf
[root@n0059.lr7 ~]# vi /usr/lib/systemd/system/slurmd.service
LimitMEMLOCK=200000000000 (RAM = 256G)
[root@n0059.lr7 ~]# echo "SLURMD_OPTIONS=-M" > /etc/sysconfig/slurmd
[root@n0059.lr7 ~]# systemctl daemon-reload
[root@n0059.lr7 ~]# systemctl restart slurmd
[root@n0059.lr7 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2023-07-14 09:41:59 PDT; 6s ago
  Process: 11686 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 11688 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─11688 /usr/sbin/slurmd -M

Jul 14 09:41:59 n0059.lr7 systemd[1]: Starting Slurm node daemon...
Jul 14 09:41:59 n0059.lr7 systemd[1]: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
Jul 14 09:41:59 n0059.lr7 systemd[1]: Started Slurm node daemon.

Then I submitted a job to the node:

[wfeinstein@n0003 ~]$ srun -p lr7 -A scs -t 1:0:0 -q lr_normal --exclusive --nodelist=n0059.lr7 --pty bash
srun: job 62485516 queued and waiting for resources
srun: job 62485516 has been allocated resources
[wfeinstein@n0059 ~]$ ulimit -l
195312500

It looks right to me. What do you think?

Wei

(In reply to Wei Feinstein from comment #22)
> It looks right to me. What do you think?

Agreed. Are there any more questions?

Hi Nate,

Excellent! Thank you so very much, you can close this ticket.
I have a standing ticket (Bug 17191), submitted three days ago, about GraceTime for low-priority jobs; can you please take a look if at all possible?

Thanks,
Wei

(In reply to Wei Feinstein from comment #24)
> Thank you so very much, you can close this ticket.

Closing out ticket.

> I have a standing ticket (Bug 17191) submitted three days ago about
> GraceTime for low prio jobs, can you please take a look if all possible?

We are a little backed up with tickets right now, so the SEV4s are having delayed responses.
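[Editor's note] The fix that resolved this thread can be captured as a systemd drop-in, as Nate suggested, rather than edits to the packaged unit file. A sketch; the 200000000000-byte cap is the value Wei chose for 256 GiB nodes, and `DROPIN_DIR` defaults to a scratch directory here so the fragment can be rehearsed without root (on a real node it would be /usr/lib/systemd/system/slurmd.service.d):

```shell
# Write the drop-in Nate recommended: a finite memlock cap plus the -M
# flag so slurmd locks itself into memory. A variable path lets this be
# dry-run; point it at the real slurmd.service.d directory on a node.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
mkdir -p "$DROPIN_DIR"
cat > "$DROPIN_DIR/local.conf" <<'EOF'
[Service]
LimitMEMLOCK=200000000000
Environment=SLURMD_OPTIONS=-M
EOF
cat "$DROPIN_DIR/local.conf"
# On the node, follow with: systemctl daemon-reload && systemctl restart slurmd
```

Per the thread, 'LaunchParameters=slurmstepd_memlock' also goes in slurm.conf.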
Dear support team,

I cannot change the max locked memory size on a compute node.

[wfeinstein@n0002 ~]$ srun -p lr7 -A scs -t 1:0:0 -q lr_normal --exclusive --pty bash
srun: job 62247353 queued and waiting for resources
srun: job 62247353 has been allocated resources
[wfeinstein@n0001 ~]$ sacct -j 62247353 --format=alloctres%45
                                    AllocTRES
---------------------------------------------
         billing=56,cpu=56,mem=257040M,node=1
                    cpu=56,mem=257040M,node=1
[wfeinstein@n0001 ~]$ free -h
       total   used   free   shared   buff/cache   available
Mem:    251G   4.9G   237G     4.3G         9.0G        241G
Swap:   8.0G   1.5G   6.5G

As you can see, the entire node's RAM is allocated.

[wfeinstein@n0001 ~]$ ulimit -l
32000000
[wfeinstein@n0001 ~]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1028443
max locked memory (kbytes, -l) 32000000
max memory size (kbytes, -m) 263208960
open files (-n) 51200
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 32768
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

And ulimit -s unlimited has NO effect:

[wfeinstein@n0001 ~]$ ulimit -s unlimited
[wfeinstein@n0001 ~]$ ulimit -l
32000000

Is there anything else I can do to increase the max locked memory size?

Thank you,
Wei
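[Editor's note] In hindsight, the numbers in this first report already identified the culprit: `ulimit -l` reports kbytes, while systemd's LimitMEMLOCK= and /proc/<pid>/limits use bytes, and the job's cap is exactly slurmd's unit-file value after unit conversion. A quick check:

```shell
# 32768000000 bytes (the LimitMEMLOCK= in slurmd.service) expressed in
# the kbytes that `ulimit -l` prints inside a job step:
echo $((32768000000 / 1024))   # → 32000000
```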