Ticket 2142

Summary: Make swap a selectable resource
Product: Slurm
Reporter: John Hanks <john.hanks>
Component: Limits
Assignee: Moe Jette <jette>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 5 - Enhancement
Priority: ---
Version: 15.08.2
Hardware: Linux
OS: Linux
Site: KAUST

Description John Hanks 2015-11-14 16:08:19 MST
Hello,

We have a subset of applications, primarily genome assembly but also other assorted annotation or alignment tools, which require large amounts of memory. But we have relatively few large-memory machines. To accommodate these codes, we've enabled zram as swap on our nodes. For many of these codes, this works great. We trade a little CPU, which we weren't able to fully utilize anyway, for fast swapping, and in some cases we see zram compressing by 10x or more. We get to run relatively large memory jobs on relatively small memory nodes. Yay for us.

Except, some stuff doesn't compress well in zram and immediately spills over onto disk. Before we started using zram, we were using cgroups to enforce swap limits, limiting each app to swap equal to 10% of its requested RAM. This caused anything eating a lot of swap to fall over quickly. But now that we have had to raise that limit to make effective use of zram, these non-swap-friendly apps send the nodes into swap-thrash-death.

Is it possible, or could it be made possible, to have a parameter like --swap/--swap-per-cpu so that jobs can select the amount of swap they want to attempt to use? This would allow us to set a low default which would prevent swap-thrash-death and allow jobs that can effectively use zram/swap to set a much higher amount. 

Thanks,

jbh
Comment 1 David Bigagli 2015-11-15 18:34:34 MST
Hi,
As you know, Slurm currently does not have this feature. Development will evaluate this and get back to you.

David
Comment 2 Moe Jette 2016-08-02 14:31:02 MDT
Perhaps GRES (Generic RESources) could be used for this purpose. You can define a GRES count per node and jobs requesting it would consume those resources. Node configurations would look something like this:

NodeName=nid[00000-01000] Gres=swap:1g ....

gres.conf would include:
Name=swap Count=1g

Job requests would look something like this:

sbatch --gres=swap:100m ...

A job submit plugin could set default swap values if desired.
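
As a rough illustration of what a default swap value could look like, here is a minimal client-side sketch in shell. It is not the job submit plugin mentioned above; it only builds the --gres argument shown in this comment. The function name swap_gres_arg and the 10% default are assumptions made for the example (the percentage is borrowed from the cgroup limit described earlier in the ticket).

```shell
#!/bin/sh
# Hypothetical helper: derive a default swap GRES request from a job's
# memory request. The name and the 10% default are illustrative only.
# Usage: swap_gres_arg MEM_MB [PERCENT]
swap_gres_arg() {
    mem_mb="$1"
    percent="${2:-10}"                      # assumed default: 10% of memory
    swap_mb=$(( mem_mb * percent / 100 ))   # integer division, rounds down
    echo "--gres=swap:${swap_mb}m"
}

# Possible use on the submit side (job.sh is a placeholder):
#   sbatch --mem=4096 "$(swap_gres_arg 4096)" job.sh
```

A wrapper like this only helps users who go through it; a job submit plugin would be needed to apply a default to every submission.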

More information about GRES is available here:
http://slurm.schedmd.com/gres.html

Let me know if this addresses your needs.
Comment 3 John Hanks 2016-08-03 11:00:11 MDT
What we wound up doing was similar to your suggestion, except we applied it to zram. On the nodes where we allow this we added a "zram" feature, then wrote a submit plugin which checks for the feature and, if found, sets --mem=0 and --exclusive. Prolog and epilog scripts then enable zram for the job and disable it once the job is complete. Now we can simply lower the available disk-based swap to some amount that is general purpose, and people running large jobs can activate zram as needed.
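
The prolog/epilog half of that setup might look roughly like the sketch below. The actual scripts are not included in this ticket, so everything here is an assumption: the function names, the device /dev/zram0, the 8G size, the swap priority, and the DRY_RUN switch (added only so the commands can be inspected without root).

```shell
#!/bin/sh
# Hypothetical prolog/epilog helpers for per-job zram swap.
# Device, size and priority are assumed values, not taken from the ticket.
ZRAM_DEV="${ZRAM_DEV:-/dev/zram0}"
ZRAM_SIZE="${ZRAM_SIZE:-8G}"

# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Prolog side: set up the zram device and enable it as high-priority swap,
# so it is used before any disk-based swap.
zram_enable() {
    run modprobe zram &&
    run zramctl --size "$ZRAM_SIZE" "$ZRAM_DEV" &&
    run mkswap "$ZRAM_DEV" &&
    run swapon --priority 100 "$ZRAM_DEV"
}

# Epilog side: disable the swap device and release its memory.
zram_disable() {
    run swapoff "$ZRAM_DEV" &&
    run zramctl --reset "$ZRAM_DEV"
}
```

A prolog would call zram_enable and the matching epilog zram_disable; setting DRY_RUN=1 prints the commands instead of running them.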

My original idea was that selectable swap amounts would allow jobs to run on the same node with different swap limits, but upon further pondering I realized that almost all jobs either want all the swap they can get or no swap at all. We still let all jobs that set a memory amount go over it by 10% into swap, and that seems to be a fairly good boundary.
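
To make the 10% boundary concrete, the arithmetic can be sketched in shell as below. The function name memsw_limit_mb is made up for the example. The ticket does not say how the limit is configured; if it is enforced through Slurm's cgroup support, the cgroup.conf settings involved would presumably be ConstrainSwapSpace and AllowedSwapSpace, but that is an assumption.

```shell
#!/bin/sh
# Sketch: combined RAM+swap ceiling for a job that may overshoot its
# memory request by a percentage into swap. Name is illustrative only.
# Usage: memsw_limit_mb MEM_MB [SWAP_PERCENT]
memsw_limit_mb() {
    mem_mb="$1"
    swap_percent="${2:-10}"     # 10% is the boundary described above
    # Integer arithmetic: the swap allowance is rounded down.
    echo $(( mem_mb + mem_mb * swap_percent / 100 ))
}
```

For a job requesting 8192 MB, the swap allowance is 819 MB, so memsw_limit_mb 8192 gives a ceiling of 9011 MB.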

I think you can close this request: yeah, it would be neat, but it's really unnecessary. If it turns out we do want to allow selectable swap in the future, I will probably follow the same approach and have a prolog/epilog add and remove a swap zvol for the duration of the job.

Thank you,

jbh
Comment 4 Moe Jette 2016-08-03 11:06:39 MDT
Resolved using GRES.