Ticket 17064 - setting the process memory limit to a lower value
Summary: setting the process memory limit to a lower value
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: User Commands
Version: 23.02.1
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Felip Moll
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2023-06-28 03:37 MDT by Yann
Modified: 2023-07-04 08:55 MDT

See Also:
Site: Université de Genève
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Yann 2023-06-28 03:37:00 MDT
Dear team,

we have a user who wants to set the memory limit of a process to a value lower than the cgroup limit set by Slurm. The idea is to prevent his interactive job from being killed by the OOM killer when he tries to allocate a larger array than the cgroup allows.

I checked the documentation and didn't find a way to do so. The idea would then be to use "ulimit -m" and/or "ulimit -v". I tried adding that to an srun --task-prolog script, but it doesn't seem to be propagated to the process.

What are the options for setting such a limit?

Best

Yann
Comment 2 Felip Moll 2023-06-28 10:09:43 MDT
(In reply to Yann from comment #0)

Hi Yann,

Can you clarify what an 'interactive' job means for you? Is it an 'salloc' or an 'srun --pty /bin/bash'?
If it is an salloc where you get an interactive console on the node and then run commands or steps with srun, then:
a) If you just run commands inside the salloc, the OOM killer will normally kill the process that exceeds the memory limit.
b) If you launch steps with srun, specifying --mem with a value lower than the allocation's would do the trick.

If it is an 'srun --pty /bin/bash', I expect the OOM killer to kill the offending process and not bash, so the job should continue.
So, generally, the OOM kill affects the step or the specific processes, not the entire job.

In any case, you're trying to limit a process to a certain amount of memory. What should happen when the process exceeds this limit? Should it be killed? Isn't that exactly the same as having the OOM killer and the normal cgroup limit?

The ulimit method won't work, especially if the process forks new children: they inherit the limits, but each child's accounting starts over from zero.
The only reliable way is with a cgroup, but we currently don't support creating sub-cgroups inside our internal cgroup tree.
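
To illustrate the problem, here is a minimal sketch with made-up numbers (a 2 GiB per-process cap, 1 GiB allocated per child): each forked child inherits the cap, but its address space is accounted separately, so the job as a whole is not bounded by it:

import resource
import multiprocessing as mp

CAP = 2 * 1024**3                  # hypothetical 2 GiB per-process cap

def child(i):
    soft, _ = resource.getrlimit(resource.RLIMIT_AS)   # the cap was inherited
    buf = bytearray(1024**3)                           # 1 GiB, allowed per process
    print(f"child {i}: cap={soft}, allocated {len(buf)} bytes")

if __name__ == "__main__":
    resource.setrlimit(resource.RLIMIT_AS, (CAP, CAP))
    procs = [mp.Process(target=child, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Four children x 1 GiB = ~4 GiB in total, well above the intended
    # 2 GiB cap; only the cgroup limit set by Slurm bounds the whole job.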

Please clarify which behavior you are looking for when this limit is imposed on a process, and what you mean by interactive.

Thanks!
Comment 3 Yann 2023-07-04 03:06:15 MDT
Hi,

thanks for your answer.

The user submits a job using "srun --pty /bin/bash" and then starts an ipython session.

In this ipython session, he opens hdf5 files etc. and does something like:

In [3]: tlist = []
   ...: while 1:
   ...:     ar = np.ones((2**36))
   ...:     tlist.append(ar)

In this case, as he is requesting far more memory than is available, there is an error, but the Python session isn't killed because the memory is never actually allocated.

The issue is that if he requests something realistic but higher than the allocated memory, the whole Python session is killed and the hdf5 files are potentially corrupted.

It seems that if he sets a limit with ulimit -v to a value lower than what was requested by "srun --pty /bin/bash", it works: numpy refuses to allocate more memory than this limit. If the process were still to use more memory than allocated through the cgroup, it would be killed, which is fine.
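
For reference, roughly the same cap can be set from inside Python instead of the shell. This is only a sketch; the 4 GiB value is made up and simply has to be lower than the job's cgroup limit:

import resource
import numpy as np

cap = 4 * 1024**3                           # hypothetical cap below the cgroup limit
resource.setrlimit(resource.RLIMIT_AS, (cap, cap))

try:
    ar = np.ones(2**36)                     # ~512 GiB of float64, far above the cap
except MemoryError:
    # numpy fails cleanly instead of the whole session being OOM-killed
    print("allocation refused, session still alive")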

Best
Comment 4 Felip Moll 2023-07-04 06:39:11 MDT
(In reply to Yann from comment #3)

So I think the behavior you describe is all expected.

The user should take care of not using more memory than requested in the python script. The job itself will normally not be killed, only the 'offending' process, which in this case is python, though even that is not guaranteed, since the kernel's OOM algorithm decides which processes to kill based on a scoring system.

I would recommend using salloc combined with LaunchParameters=use_interactive_step in slurm.conf instead of 'srun --pty /bin/bash'; this allows you to start new steps from within the allocation, as opposed to when using the srun option.

Do you have any other questions? Can we resolve the ticket?
Comment 5 Felip Moll 2023-07-04 06:40:38 MDT
> The user should take care of not using more memory than requested in the
> python script. 

Typo, I wanted to say:

The user should not use more memory in the python script than the memory requested in the job request.
Comment 6 Felip Moll 2023-07-04 06:49:01 MDT
I also want to suggest something the user can do:

The user could read, from the python script, the limits of its cgroup by programmatically doing the equivalent of the following (a short sketch is shown after the list):

1. Check its cgroup: cat /proc/self/cgroup
2. Check its step limit by prepending /sys/fs/cgroup (if on cgroup v2) or /sys/fs/cgroup/memory (if on cgroup v1) to the path from 1) and reading:

in v1: memory.limit_in_bytes
in v2: memory.max

Example in v2:
]$ cat /sys/fs/cgroup/system.slice/gamba1_slurmstepd.scope/job_207/step_interactive/user/memory.max 
104857600

3. Once the limit is determined, do not request more than this value.
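
A minimal sketch of steps 1-3 for cgroup v2 (assuming a v2-only system; on v1 the mount point and file name differ as noted above). It walks up the hierarchy because the limit may be set on an ancestor of the process's own cgroup:

import os

def cgroup_v2_memory_limit():
    # /proc/self/cgroup on v2 contains a single line like "0::/system.slice/..."
    with open("/proc/self/cgroup") as f:
        cgroup_path = f.read().strip().split("::", 1)[1]
    path = os.path.join("/sys/fs/cgroup", cgroup_path.lstrip("/"))
    while path.startswith("/sys/fs/cgroup"):
        limit_file = os.path.join(path, "memory.max")
        if os.path.exists(limit_file):
            value = open(limit_file).read().strip()
            if value != "max":
                return int(value)          # limit in bytes
        path = os.path.dirname(path)       # look at the parent cgroup
    return None                            # no memory limit found

print(cgroup_v2_memory_limit())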

--
Another option is to read the SLURM_* environment variables, for example SLURM_MEM_PER_NODE, and act accordingly as in step 3.
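
A minimal sketch of that second option (assuming SLURM_MEM_PER_NODE is present, which it only is when a per-node memory limit applies to the job; the value is in megabytes):

import os

mem_mb = os.environ.get("SLURM_MEM_PER_NODE")
if mem_mb is not None:
    limit_bytes = int(mem_mb) * 1024 * 1024
    print(f"job memory limit: {limit_bytes} bytes")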
Comment 7 Yann 2023-07-04 08:00:13 MDT
> I would recommend using salloc combined with LaunchParameters=use_interactive_step
> in slurm.conf instead of 'srun --pty /bin/bash'; this allows you to start new steps
> from within the allocation, as opposed to when using the srun option.

Indeed, this is what we already do :)

Thanks for the suggestion about reading the cgroup limits from Python; I'll pass the information on to the user. You can close the ticket then.

Best

Yann
Comment 8 Felip Moll 2023-07-04 08:55:24 MDT
Thanks Yann,

Closing the issue.