Ticket 17170

Summary: how to increase max locked mem
Product: Slurm
Reporter: Wei Feinstein <wfeinstein>
Component: Limits
Assignee: Nate Rini <nate>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Version: 22.05.6
Hardware: Linux
OS: Linux
Site: LBNL - Lawrence Berkeley National Laboratory
Attachments: slurm.conf

Description Wei Feinstein 2023-07-10 19:05:26 MDT
Dear support team,

I cannot change the max locked memory size on a compute node.

[wfeinstein@n0002 ~]$ srun -p lr7 -A scs -t 1:0:0 -q lr_normal --exclusive --pty bash
srun: job 62247353 queued and waiting for resources
srun: job 62247353 has been allocated resources

[wfeinstein@n0001 ~]$ sacct -j 62247353 --format=alloctres%45
                                    AllocTRES 
--------------------------------------------- 
         billing=56,cpu=56,mem=257040M,node=1 
                    cpu=56,mem=257040M,node=1 

[wfeinstein@n0001 ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           251G        4.9G        237G        4.3G        9.0G        241G
Swap:          8.0G        1.5G        6.5G

As you can see, the entire node's RAM is allocated.

[wfeinstein@n0001 ~]$ ulimit -l
32000000

[wfeinstein@n0001 ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1028443
max locked memory       (kbytes, -l) 32000000
max memory size         (kbytes, -m) 263208960
open files                      (-n) 51200
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 32768
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

And ulimit -s unlimited has NO effect!

[wfeinstein@n0001 ~]$ ulimit -s unlimited

[wfeinstein@n0001 ~]$ ulimit -l
32000000

Is there anything else I can do to increase the max locked memory size?

Thank you,
Wei
Comment 1 Wei Feinstein 2023-07-11 11:19:26 MDT
Created attachment 31189 [details]
slurm.conf

Please see attached slurm.conf
Comment 2 Jason Booth 2023-07-11 11:30:49 MDT
Wei,

Take a look at the ulimits set in the environment/shell where slurmd is launched.

Slurmd inherits whatever is set in the Linux environment when it is started, and so do the slurmstepds. You would need to change the limit before starting slurmd; in this case, you would do so in the Linux environment on that compute node. If you are using systemd, you could probably set it there as well.


Before running slurmd
> ulimit -l 4064952

Start slurmd and check its ulimits.
> # cat /proc/<SLURMD_PID>/limits 
> Limit                     Soft Limit           Hard Limit           Units     
> ...
> ..
> . 
> Max locked memory         4162510848           4162510848           bytes     
 
Changing the limit by hand will demonstrate what I mean.

Stop slurmd, then modify the limit and start it again.

> $# ulimit -l unlimited
> $# slurmd 
> $# cat /proc/33493/limits 
> Limit                     Soft Limit           Hard Limit           Units     
> ...
> ..
> .  
> Max locked memory         unlimited            unlimited            bytes
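The inheritance described above can be seen without slurmd at all. This minimal sketch lowers the soft memlock limit in a subshell and shows that a child process (here, grep reading its own /proc entry) inherits the lowered value, just as slurmstepd inherits slurmd's limits:

```shell
#!/bin/bash
# A child process inherits its parent's RLIMIT_MEMLOCK, just as
# slurmstepd inherits slurmd's. Lower the soft limit to 16 KB in a
# subshell and let grep (the child process) report what it sees.
( ulimit -S -l 16; grep 'Max locked memory' /proc/self/limits )
# The soft-limit column shows 16384 bytes (16 KB).
```

Lowering a soft limit never requires privilege, so this runs as any user; raising one back above the hard limit would not.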
Comment 3 Wei Feinstein 2023-07-11 12:02:24 MDT
Hi Jason,

Thank you.

Please take a look here: max locked memory is shown as unlimited on n0001.lr7. 

[root@n0001.lr7 ~]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1028443
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 262144
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1028443
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

But when I do srun, there is only 32GB.

[wfeinstein@n0003 ~]$  srun -p lr7 -A scs -t 1:0:0 -q lr_normal --exclusive --pty bash
srun: job 62302638 queued and waiting for resources
srun: job 62302638 has been allocated resources
[wfeinstein@n0001 ~]$ ulimit -l
32000000
[wfeinstein@n0001 ~]$ hostname
n0001.lr7


I also followed your instructions as below:

[root@n0001.lr7 ~]# ps aux |grep slurmd
root     24109  0.0  0.0 281948  8424 ?        S    10:50   0:00 /usr/sbin/slurmd
root     24154  0.0  0.0   9092   680 pts/1    S+   10:56   0:00 grep --color=auto slurmd
[root@n0001.lr7 ~]# cat /proc/24109/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            unlimited            unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             1028443              1028443              processes 
Max open files            51200                51200                files     
Max locked memory         32768000000          32768000000          bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       1028443              1028443              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
[root@n0001.lr7 ~]# systemctl stop slurmd

[root@n0001.lr7 ~]# ulimit -l unlimited
[root@n0001.lr7 ~]# systemctl start slurmd
[root@n0001.lr7 ~]# ps aux |grep slurmd 
root     24174  0.0  0.0 215384  8104 ?        S    10:57   0:00 /usr/sbin/slurmd
root     24188  0.0  0.0   9092   676 pts/1    S+   10:58   0:00 grep --color=auto slurmd
[root@n0001.lr7 ~]# cat /proc/24174/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            unlimited            unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             1028443              1028443              processes 
Max open files            51200                51200                files     
Max locked memory         32768000000          32768000000          bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       1028443              1028443              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us       

Max locked memory did NOT change.

Any advice?

Thank you,
Wei
Comment 4 Jason Booth 2023-07-11 12:15:58 MDT
Wei, the explanation was just an explanation, not a solution. This is outside of Slurm's control: the limit is set in your Linux environment, so you will need to set it either through the slurmd.service script or the calling environment.

It is also entirely possible that PAM is overwriting limits. You will need to investigate what works for your Linux environment.  

[1] https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Process%20Properties
[2] /etc/security/limits.conf
[3] https://ss64.com/bash/limits.conf.html
Comment 5 Wei Feinstein 2023-07-12 19:35:56 MDT
Hi Jason,

I talked to my system admins; they checked everything they could think of.

[root@perceus-00 ~]# cat /etc/security/limits.conf
* soft memlock unlimited
* hard memlock unlimited
*     soft   nofile  10240
*     hard   nofile  1000000

[root@master ~]# cat  /etc/sysconfig/slurm
ulimit -l unlimited
ulimit -n 100000
ulimit -u unlimited
ulimit -s unlimited

salloc looks fine; it shows unlimited below:

[wfeinstein@n0003 ~]$ salloc -p lr7 -A scs -t 1:0:0 -q lr_normal --exclusive
salloc: Pending job allocation 62307334
salloc: job 62307334 queued and waiting for resources
salloc: job 62307334 has been allocated resources
salloc: Granted job allocation 62307334
[wfeinstein@n0003 ~]$ squeue -u wfeinstein
       JOBID PARTITION   NAME   USER ST    TIME NODES NODELIST(REASON)
     62307334    lr7 interact wfeinste R    0:10   1 n0001.lr7
[wfeinstein@n0003 ~]$ ssh n0001.lr7
Last login: Tue Jul 11 10:41:13 2023 from 10.0.2.3
[wfeinstein@n0001 ~]$ ulimit -l
unlimited
[wfeinstein@n0001 ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1028443
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 262144
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 32768
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


However, srun and sbatch have problems.

[wfeinstein@n0003 ~]$ srun -p lr7 -A scs -t 1:0:0 -q lr_normal --exclusive --pty bash
[wfeinstein@n0001 ~]$ ulimit -l
32000000

Users are waiting on us for solutions in order to run MPI jobs. 

Can you please guide us with further debugging: where and what should we look at and check? If needed, I will bring the system admins into the ticket.

Highly appreciated.
Wei
Comment 6 Jason Booth 2023-07-13 13:08:58 MDT
What flavor of Linux is your site using?
Comment 7 Wei Feinstein 2023-07-13 13:21:42 MDT
centos 7.9
Comment 8 Jason Booth 2023-07-13 15:27:48 MDT
> centos 7.9

I was not able to duplicate this.

[root@controller ~]# salloc --exclusive
salloc: Granted job allocation 5004
[root@controller ~]# srun --exclusive --pty bash
[root@n2 ~]# ulimit -l
unlimited
[root@n2 ~]# 
exit
[root@controller ~]# 


> Can you please guide us for further debug, where/what to look and check? If needed, I will bring the system admins to the ticket.

Slurm uses the limits provided by the operating system. These can be set/defined in more than one location. I am not sure where this limit is being brought in from, but something in your environment is setting it.

If you run your srun with "--propagate=NONE", does that change the outcome?

https://slurm.schedmd.com/srun.html#OPT_propagate
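As a sketch of how to hunt for where the limit is coming from (the paths below are CentOS 7 conventions and assumptions, not a definitive list):

```shell
#!/bin/sh
# Sketch: grep the usual suspects for a memlock setting on CentOS 7.
# Paths that do not exist are skipped; no output just means none of
# these locations mention memlock (grep -i also matches LimitMEMLOCK).
for f in /etc/security/limits.conf /etc/security/limits.d/*.conf \
         /etc/sysconfig/slurm /usr/lib/systemd/system/slurmd.service; do
    [ -e "$f" ] && grep -Hi memlock "$f"
done
exit 0  # informational only; no matches is not an error
```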
Comment 9 Wei Feinstein 2023-07-13 15:35:20 MDT
Hi Jason,

It is tricky.

[wfeinstein@n0003 ~]$ srun -p lr7 -A scs -t 1:0:0 -q lr_normal --exclusive --propagate=NONE --pty bash
srun: job 62458274 queued and waiting for resources
srun: job 62458274 has been allocated resources
[wfeinstein@n0004 ~]$ ulimit -l
32000000

[wfeinstein@n0003 ~]$ squeue -u wfeinstein
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          62458337       lr7 interact wfeinste  R       0:11      1 n0004.lr7
[wfeinstein@n0003 ~]$ ssh n0004.lr7
[wfeinstein@n0004 ~]$ ulimit -l
unlimited

It has to be somewhere.

Thanks,

Wei
Comment 10 Jason Booth 2023-07-13 15:57:56 MDT
Please attach your slurmd.service file and also the output from the following command while the srun is active.

> $ ps aux | grep "slurm"
Comment 11 Wei Feinstein 2023-07-13 16:53:48 MDT
Hi Jason,

[root@n0001.lr7 ~]# cat /usr/lib/systemd/system/slurmd.service 
[Unit]
Description=Slurm node daemon
Wants=nvidia-modprobe.service
After=network.target nvidia-modprobe.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm/slurmd.pid
KillMode=process
LimitNOFILE=51200
#LimitMEMLOCK=infinity
LimitMEMLOCK=32768000000
LimitSTACK=infinity

[Install]
WantedBy=multi-user.target


[root@master ~]# ps aux | grep "slurm"
slurm    11060  0.3  0.0 1031288 103912 ?      Sl   Jun29  71:33 /usr/sbin/slurmdbd
slurm    11074 18.7  8.0 17185056 10683232 ?   Sl   Jun29 3812:25 /usr/sbin/slurmctld
slurm    11075  0.0  0.0 1109824 1996 ?        S    Jun29   0:05 slurmctld: slurmscriptd
root     11533  2.4  0.0 2337696 25756 ?       Ssl  Jul06 235:17 /usr/local/bin/prometheus-slurm-exporter
root     11851  0.0  0.0 112812   980 pts/0    S+   15:02   0:00 grep --color=auto slurm

This might be it.

Wei
Comment 12 Wei Feinstein 2023-07-13 16:57:37 MDT
Thank you Jason!

Wei
Comment 13 Wei Feinstein 2023-07-13 17:10:10 MDT
Hi Jason,

One more question about slurmd.service below, what do you think about the config in this file? 

[root@n0001.lr7 ~]# cat /usr/lib/systemd/system/slurmd.service 
[Unit]
Description=Slurm node daemon
Wants=nvidia-modprobe.service
After=network.target nvidia-modprobe.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm/slurmd.pid
KillMode=process
LimitNOFILE=51200
#LimitMEMLOCK=infinity
LimitMEMLOCK=32768000000
LimitSTACK=infinity

Thanks,
Wei
Comment 14 Nate Rini 2023-07-13 17:33:31 MDT
Please also provide the output of `systemctl start slurmd`
Comment 15 Wei Feinstein 2023-07-13 17:37:03 MDT
Hi Nate,

systemctl start slurmd has no output. Did you mean status?

[root@n0001.lr7 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2023-07-11 11:32:21 PDT; 2 days ago
 Main PID: 24372 (slurmd)
   CGroup: /system.slice/slurmd.service
           ├─24372 /usr/sbin/slurmd
           ├─41517 slurmstepd: [62412180.batch]
           ├─41532 /bin/bash /var/spool/slurmd/job62412180/slurm_script
           ├─41548 mpirun -np 56 lmp -in in.mismatch
           ├─41556 lmp -in in.mismatch
           ├─41557 lmp -in in.mismatch
           ├─41558 lmp -in in.mismatch
           ├─41559 lmp -in in.mismatch
           ├─41560 lmp -in in.mismatch
           ├─41561 lmp -in in.mismatch
           ├─41562 lmp -in in.mismatch
           ├─41563 lmp -in in.mismatch
           ├─41564 lmp -in in.mismatch
           ├─41565 lmp -in in.mismatch
           ├─41566 lmp -in in.mismatch
           ├─41567 lmp -in in.mismatch
           ├─41568 lmp -in in.mismatch
           ├─41569 lmp -in in.mismatch
           ├─41570 lmp -in in.mismatch
           ├─41571 lmp -in in.mismatch
           ├─41572 lmp -in in.mismatch
           ├─41573 lmp -in in.mismatch
           ├─41574 lmp -in in.mismatch
           ├─41575 lmp -in in.mismatch
           ├─41576 lmp -in in.mismatch
           ├─41577 lmp -in in.mismatch
           ├─41578 lmp -in in.mismatch
           ├─41579 lmp -in in.mismatch
           ├─41580 lmp -in in.mismatch
           ├─41581 lmp -in in.mismatch
           ├─41582 lmp -in in.mismatch
           ├─41583 lmp -in in.mismatch
           ├─41584 lmp -in in.mismatch
           ├─41585 lmp -in in.mismatch
           ├─41586 lmp -in in.mismatch
           ├─41587 lmp -in in.mismatch
           ├─41591 lmp -in in.mismatch
           ├─41592 lmp -in in.mismatch
           ├─41595 lmp -in in.mismatch
           ├─41596 lmp -in in.mismatch
           ├─41599 lmp -in in.mismatch
           ├─41601 lmp -in in.mismatch
           ├─41604 lmp -in in.mismatch
           ├─41606 lmp -in in.mismatch
           ├─41607 lmp -in in.mismatch
           ├─41608 lmp -in in.mismatch
           ├─41611 lmp -in in.mismatch
           ├─41612 lmp -in in.mismatch
           ├─41613 lmp -in in.mismatch
           ├─41615 lmp -in in.mismatch
           ├─41616 lmp -in in.mismatch
           ├─41619 lmp -in in.mismatch
           ├─41621 lmp -in in.mismatch
           ├─41622 lmp -in in.mismatch
           ├─41627 lmp -in in.mismatch
           ├─41629 lmp -in in.mismatch
           ├─41632 lmp -in in.mismatch
           ├─41636 lmp -in in.mismatch
           ├─41640 lmp -in in.mismatch
           └─41645 lmp -in in.mismatch

Jul 12 15:33:03 n0001.lr7 su[36417]: (to jdvorak2) root on none
Jul 12 15:33:03 n0001.lr7 su[36417]: pam_unix(su-l:session): session opened for user jdvorak2 by (uid=0)
Jul 12 15:33:03 n0001.lr7 su[36417]: pam_unix(su-l:session): session closed for user jdvorak2
Jul 12 15:42:52 n0001.lr7 jdvorak2[37043]: n0001.lr7 jdvorak2: module load valgrind
Jul 12 15:45:09 n0001.lr7 su[37119]: (to jdvorak2) root on none
Jul 12 15:45:09 n0001.lr7 su[37119]: pam_unix(su-l:session): session opened for user jdvorak2 by (uid=0)
Jul 12 15:45:09 n0001.lr7 su[37119]: pam_unix(su-l:session): session closed for user jdvorak2
Jul 12 15:49:56 n0001.lr7 su[39147]: (to jdvorak2) root on none
Jul 12 15:49:56 n0001.lr7 su[39147]: pam_unix(su-l:session): session opened for user jdvorak2 by (uid=0)
Jul 12 15:49:56 n0001.lr7 su[39147]: pam_unix(su-l:session): session closed for user jdvorak2

Thanks,
Wei
Comment 16 Wei Feinstein 2023-07-13 17:41:49 MDT
Nate,

Wanted to add:

[root@n0001.lr7 ~]# systemctl start slurmd 
Warning: slurmd.service changed on disk. Run 'systemctl daemon-reload' to reload units.

Wei
Comment 17 Nate Rini 2023-07-13 17:57:36 MDT
(In reply to Wei Feinstein from comment #15)
> systemctl start slurmd has no return. Did you mean status? 

Yes, but I actually needed both, based on comment #16.

(In reply to Wei Feinstein from comment #16)
> [root@n0001.lr7 ~]# systemctl start slurmd 
> Warning: slurmd.service changed on disk. Run 'systemctl daemon-reload' to
> reload units.

Please add or append "SLURMD_OPTIONS=-M" to /etc/sysconfig/slurmd
Please add 'LaunchParameters=slurmstepd_memlock' to slurm.conf.

This is outlined in https://slurm.schedmd.com/slurmd.html under "-M" if you want more details.

Then please call the following:
> systemctl daemon-reload
> systemctl restart slurmd
> systemctl status slurmd

and attach the output
Comment 18 Wei Feinstein 2023-07-14 08:34:31 MDT
Hi Nate,

I did the following only on a compute node.

slurm.conf:
LaunchParameters=enable_nss_slurm,slurmstepd_memlock

/etc/sysconfig/slurmd does not exist on compute nodes

Also commented out LimitMEMLOCK=32768000000 in slurmd.service

[root@n0001.lr7 ~]# cat /usr/lib/systemd/system/slurmd.service 
[Unit]
Description=Slurm node daemon
Wants=nvidia-modprobe.service
After=network.target nvidia-modprobe.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurm/slurmd.pid
KillMode=process
LimitNOFILE=51200
LimitMEMLOCK=infinity
#LimitMEMLOCK=32768000000
LimitSTACK=infinity
...
[root@n0001.lr7 ~]# systemctl daemon-reload
[root@n0001.lr7 ~]# systemctl restart slurmd
[root@n0001.lr7 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2023-07-14 07:31:18 PDT; 6s ago
  Process: 53116 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 53118 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─53118 /usr/sbin/slurmd

Jul 14 07:31:18 n0001.lr7 systemd[1]: Starting Slurm node daemon...
Jul 14 07:31:18 n0001.lr7 systemd[1]: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
Jul 14 07:31:18 n0001.lr7 systemd[1]: Started Slurm node daemon.

I also tested on a separate testbed; once LimitMEMLOCK=infinity is set,
srun does see the ulimit setting.

Thank you,
Wei
Comment 19 Nate Rini 2023-07-14 09:35:48 MDT
(In reply to Wei Feinstein from comment #18)
> /etc/sysconfig/slurmd does not exist on compute nodes
Was one created? Setting 'SLURMD_OPTIONS=-M' is required to get slurmd to lock itself into memory.
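One way to verify that the locking actually took is to read VmLck from /proc/<pid>/status, which should be nonzero for a slurmd started with -M. The sketch below runs against the current shell as a stand-in (any PID works; an ordinary unlocked process reports 0 kB):

```shell
#!/bin/bash
# VmLck in /proc/<pid>/status reports how much of a process's address
# space is locked. For a slurmd started with -M it should be nonzero;
# an ordinary shell, used as a stand-in here, reports 0 kB.
grep VmLck "/proc/$$/status"
```

Substituting slurmd's PID (e.g. from `systemctl status slurmd`) checks the real daemon.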
 
> Also commented out LimitMEMLOCK=32768000000 in slurmd.service

Directly modifying the systemd unit files is not recommended, as they have a bad habit of getting overwritten. I suggest creating the following drop-in instead:

/usr/lib/systemd/system/slurmd.service.d/local.conf:
> [Service]
> LimitMEMLOCK=infinity
> Environment=SLURMD_OPTIONS=-M

Setting the rlimit for locked memory to infinity may have unintended consequences, causing other system daemons to fail if a job decides to lock most of the system memory. For example, if slurmd is not started with '-M' and it ends up getting swapped out, it may not be able to get paged back in if all the other memory is locked by the job processes, effectively causing a deadlock.

> I also tested on a separate testbed, once LimitMEMLOCK=infinity is set, 
> srun does see ulimit setting. 

I assume by this comment, that you mean the rlimit is now unlimited as requested?

Also, the slurmd unit file appears to be out of date from the suggested template. Please update it to the version generated at compile time in the etc dir.
Comment 20 Wei Feinstein 2023-07-14 10:15:32 MDT
Nate,

I understand the changes need be pushed to all the nodes.

But for now, in the testing stage, please bear with me.

I did the following on a compute node:

echo "SLURMD_OPTIONS=-M" > /etc/sysconfig/slurmd

"Setting rlimit for locking memory to infinity may have unintended consequences causing other system daemons to fail if a job should decide to lock most of the possible system memory." 

Based on this comment, I suspect it might be the reason why we had LimitMEMLOCK=32768000000 in the first place. If I understand right, I should avoid "LimitMEMLOCK=infinity", and replace it with a number < RAM on the node.


"the slurmd unit file appears to be out of date from the suggested template. Please update it to the version generated at compile time in the etc dir."

The file was changed by the system admin yesterday across the board, although slurmd has not been restarted since we have a downtime coming up next week.

To summarize, I need to have 
/etc/sysconfig/slurmd (SLURMD_OPTIONS=-M) on all compute nodes
LaunchParameters=slurmstepd_memlock in slurm.conf
LimitMEMLOCK=(not infinity but a number < RAM) in /usr/lib/systemd/system/slurmd.service

Comments?

Wei
Comment 21 Nate Rini 2023-07-14 10:31:05 MDT
(In reply to Wei Feinstein from comment #20)
> I understand the changes need be pushed to all the nodes.

Yes, but it is always suggested to test on a single node (admin reserved) to verify the configuration changes first.

> I did the following on a compute node:
> 
> echo "SLURMD_OPTIONS=-M" > /etc/sysconfig/slurmd

Understood. That should be sufficient.

> "Setting rlimit for locking memory to infinity may have unintended
> consequences causing other system daemons to fail if a job should decide to
> lock most of the possible system memory." 
> 
> Based on this comment, I suspect it might be the reason why we had
> LimitMEMLOCK=32768000000 in the first place. If I understand right, I should
> avoid "LimitMEMLOCK=infinity", and replace it with a number < RAM on the
> node.

I'm not sure why that exact number was set, but I suggest giving the system daemons a good safety margin of memory to avoid node failures.
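One way to derive such a value is from MemTotal in /proc/meminfo. This is only a sketch; the 80% fraction is an arbitrary illustration of "a good safety margin", not a SchedMD recommendation:

```shell
#!/bin/sh
# Compute a LimitMEMLOCK value as roughly 80% of physical RAM,
# leaving the remaining 20% as headroom for system daemons.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
limit_bytes=$(( mem_kb * 1024 * 8 / 10 ))
echo "LimitMEMLOCK=${limit_bytes}"
```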
 
> "the slurmd unit file appears to be out of date from the suggested template.
> Please update it to the version generated at compile time in the etc dir."
> 
> The file was changed by the system admin yesterday across the board,
> although slurmd   have not been restarted since we have a downtime coming up
> next week. 

Understood. The latest unit files switch to the non-forking version, which allows better process tracking and logging by systemd.
 
> To summarize, I need to have 
> /etc/sysconfig/slurmd (SLURMD_OPTIONS=-M) on all compute nodes
> LaunchParameters=slurmstepd_memlock' in slurm.conf
> LimitMEMLOCK=(not infinity but a number < RAM) in
> /usr/lib/systemd/system/slurmd.service

That looks correct, but I always like to verify that the settings took with the output of `systemctl status slurmd` and verification with a test job.
Comment 22 Wei Feinstein 2023-07-14 10:49:02 MDT
Nate,

[root@perceus-00 ~]# ssh n0059.lr7
Last login: Sun Jun 25 09:34:35 2023 from 10.0.0.10
[root@n0059.lr7 ~]# vi /etc/slurm/slurm.conf 

[root@n0059.lr7 ~]# vi /usr/lib/systemd/system/slurmd.service 
LimitMEMLOCK=200000000000 (RAM=256G)

[root@n0059.lr7 ~]# echo "SLURMD_OPTIONS=-M" > /etc/sysconfig/slurmd
[root@n0059.lr7 ~]# systemctl daemon-reload
[root@n0059.lr7 ~]# systemctl restart slurmd
[root@n0059.lr7 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2023-07-14 09:41:59 PDT; 6s ago
  Process: 11686 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 11688 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─11688 /usr/sbin/slurmd -M

Jul 14 09:41:59 n0059.lr7 systemd[1]: Starting Slurm node daemon...
Jul 14 09:41:59 n0059.lr7 systemd[1]: Can't open PID file /var/run/slurm/slurmd.pid (yet?) after start: No such file or directory
Jul 14 09:41:59 n0059.lr7 systemd[1]: Started Slurm node daemon.

Then I submit a job to the node:

[wfeinstein@n0003 ~]$ srun -p lr7 -A scs -t 1:0:0 -q lr_normal --exclusive --nodelist=n0059.lr7  --pty bash
srun: job 62485516 queued and waiting for resources
srun: job 62485516 has been allocated resources
[wfeinstein@n0059 ~]$ ulimit -l
195312500


It looks right to me. What do you think?

Wei
Comment 23 Nate Rini 2023-07-14 10:54:05 MDT
(In reply to Wei Feinstein from comment #22)
> It looks right to me. What do you think?

Agreed. Are there any more questions?
Comment 24 Wei Feinstein 2023-07-14 10:58:55 MDT
Hi Nate,

Excellent! 

Thank you so very much, you can close this ticket.

I have a standing ticket (Bug 17191), submitted three days ago, about GraceTime for low-priority jobs; can you please take a look if at all possible?

Thanks,
Wei
Comment 25 Nate Rini 2023-07-14 11:00:17 MDT
(In reply to Wei Feinstein from comment #24)
> Thank you so very much, you can close this ticket.
Closing out ticket.


> I have a standing ticket (Bug 17191) submitted three days ago about
> GraceTime for low prio jobs, can you please take a look if all possible? 
We are a little backed up with tickets right now so the SEV4s are having delayed responses.