Ticket 2549

Summary: Submitting jobs using large tmp disk resources
Product: Slurm             Reporter: Kolbeinn Josepsson <kolbeinn.josepsson>
Component: Configuration   Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN
Severity: 4 - Minor Issue
Priority: ---              CC: alex, hjalti.sveinsson
Version: 14.11.6
Hardware: Linux
OS: Linux
Site: deCODE
Attachments: SLURM config

Description Kolbeinn Josepsson 2016-03-15 01:28:21 MDT
Created attachment 2854 [details]
SLURM config

Hello support,

We need to run specific jobs where a single job will use nearly all of the tmp disk resources on a node. We have two nodes with larger local disks for /tmp, about 1.2TB each for this case.

The user wants to submit many jobs but have only one job run at a time on each node; the other jobs need to wait in the queue until the running job finishes. He submits jobs with --tmp=1100000.

The problem is that all the jobs start running in parallel on the nodes. Is this working as designed? And if it is, is there any workaround you can share with us?

squeue -u eirikurh -o '%.10i %.6C %.7d %.6m %.9l %.9M %.9L %R'
JOBID CPUS MIN_TMP MIN_ME TIME_LIMI TIME TIME_LEFT NODELIST(REASON)
39088431 1 10 55000 15-00:00:00 4:59:34 14-19:00:26 lhpc-652
39101933 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-200
39101934 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-200
39101935 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-200
39101936 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-200
39101937 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-200
39101938 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-200
39101939 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-200
39101940 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-200
39101941 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-200
39101942 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-200
39101943 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-200
39101944 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-200
39101945 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-733
39101946 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-733
39101947 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-733
39101948 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-733
39101949 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-733
39101950 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-733
39101951 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-733
39101952 1 1100000 55000 15-00:00:00 1:38 14-23:58:22 lhpc-733

Attached is the slurm.conf file.

Best regards,
Kolbeinn Josepsson
Comment 1 Hjalti Sveinsson 2016-03-15 01:28:49 MDT
Hjalti Þór Sveinsson will be out of the office for an unspecified period of time.

Please contact Kolbeinn, kolbeinn.josepson@decode.is, or Örn, orn.asgeirsson@decode.is, if you need assistance.
Comment 2 Alejandro Sanchez 2016-03-15 02:08:13 MDT
Hi Kolbeinn,

You can create a partition containing these two nodes with larger local disks for /tmp and set Shared=EXCLUSIVE on that partition.

EXCLUSIVE: Allocates entire nodes to jobs even with select/cons_res configured. Jobs that run in partitions with "Shared=EXCLUSIVE" will have exclusive access to all allocated nodes.
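
A minimal sketch of what that could look like in slurm.conf, assuming the two large-/tmp nodes are lhpc-200 and lhpc-733 (the partition name here is made up for illustration):

```
# Partition containing only the two large-/tmp nodes; jobs get whole nodes
PartitionName=bigtmp Nodes=lhpc-[200,733] Shared=EXCLUSIVE
```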
Comment 3 Alejandro Sanchez 2016-03-15 02:13:39 MDT
Furthermore, you can set up conditional dependencies between jobs so that a user's jobs do not start until certain conditions are satisfied. Please refer to the sbatch/salloc/srun man pages and look for the --dependency parameter.
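For example, the user could chain jobs by hand with --dependency=afterany, or use --dependency=singleton so at most one job with a given name per user runs at a time (job script and job name below are illustrative):

```
# each job waits until the previous one terminates (in any state);
# suppose the first submission was assigned job id 100
sbatch --dependency=afterany:100 --tmp=1100000 job.sh

# or: only one "bigtmp" job per user runs at a time
sbatch --job-name=bigtmp --dependency=singleton --tmp=1100000 job.sh
```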
Comment 4 Alejandro Sanchez 2016-03-15 03:13:31 MDT
Kolbeinn,

I've been talking to our teammate Tim, and we agreed that what you might actually want is to configure tmp as a GRES.

Using --tmp and TmpFS only constrains node selection; it does not guarantee that the job will actually have enough free space on that filesystem. So you might be interested in configuring GRES[1] for that purpose.

As an example you could configure something like this in your slurm.conf:

GresTypes=tmp
Add to the desired NodeName lines: Gres=tmp:1T

Then in gres.conf add this line:
NodeName=<desired_nodes> Name=tmp Count=1T

Then users should be able to add --gres=tmp:500G or whatever quantity to their request.

[1] http://slurm.schedmd.com/gres.html
Comment 9 Tim Wickberg 2016-03-15 05:51:53 MDT
The --tmp and TmpFS mechanisms only act as a node constraint - they do not allocate any resources to a given job; they only ensure that a node is selected that has at least that much space available in /tmp. That space may be completely consumed by the time the job starts.

What it sounds like you'd like to do is allocate space under /tmp for each job to ensure multiple /tmp-hogging jobs do not try to run simultaneously. The GRES mechanism is a good fit for this use-case. http://slurm.schedmd.com/gres.html

To briefly outline a possible solution for you, you'd want to set in slurm.conf:

GresTypes=tmp
NodeName=lhpc-200 CoresPerSocket=12 RealMemory=757000 Weight=5 Gres=tmp:1200GB
NodeName=lhpc-733 CoresPerSocket=14 RealMemory=757000 Weight=5 Gres=tmp:1200GB

You'll also need a gres.conf file that looks like:
NodeName=lhpc-[200,733] Name=tmp Count=1200G

After that you'll need to restart slurmctld and slurmd to pick up the changes.
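
Assuming systemd-managed daemons (a common but not universal setup), the restarts might look like:

```
# on the controller node
systemctl restart slurmctld
# on lhpc-200 and lhpc-733 (and any other node whose config changed)
systemctl restart slurmd
```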

Then, you'll be able to specify your job requests as:

sbatch --gres=tmp:1G 

and have slurm track the amount of space that has been allocated. (Note that, unlike the GPU plugin, your 'tmp' GRES won't enforce anything on the node itself. You'd need to build a 'tmp' plugin to do that, or you could optionally manage this through an epilog script.)
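
If you go the epilog route, a minimal sketch could look like the following. This is a hypothetical script, not something Slurm ships; the per-job directory convention slurm_job_<id> is an assumption for illustration:

```shell
# Hypothetical epilog helper: remove a per-job scratch directory under
# /tmp when the job ends. The slurm_job_<id> naming is an assumption.
cleanup_job_tmp() {
    job_id="$1"
    base="${2:-/tmp}"
    # Refuse to run without a job id so we never remove the base dir itself
    [ -n "$job_id" ] || return 1
    rm -rf "${base}/slurm_job_${job_id}"
}

# In an actual Epilog= script you would call it with Slurm's environment:
# cleanup_job_tmp "$SLURM_JOB_ID"
```

The guard on the empty job id matters: in an epilog, an unset SLURM_JOB_ID would otherwise expand the path to the base directory itself.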

Let me know if that makes sense.

cheers,
- Tim
Comment 10 Kolbeinn Josepsson 2016-03-15 19:51:07 MDT
Many thanks Alejandro and Tim.

The GRES mechanism sounds like the perfect solution for our case.

One question, do we need the gres.conf only on the head node or is it also needed on the lhpc-[200,733] or all nodes in the cluster?

Rgds, Kolbeinn
Comment 11 Alejandro Sanchez 2016-03-15 21:34:43 MDT
Each node must contain a gres.conf file if generic resources are to be scheduled by Slurm. Closing this as resolved/infogiven. Please re-open if you have more questions.
Comment 12 Hjalti Sveinsson 2017-01-13 03:40:56 MST
We have this configured now at deCODE, but what is the default value if a user does not use the --gres=tmp:<count> parameter?

regards,
Hjalti Sveinsson
Comment 13 Alejandro Sanchez 2017-01-13 08:06:14 MST
(In reply to Hjalti Sveinsson from comment #12)
> We have this configure now at deCODE but what is the default value if a user
> does not use the --gres=tmp: parameter?
> 
> regards,
> Hjalti Sveinsson

If the request doesn't contain "--gres=tmp:count", then nodes without gres=tmp configured might also be considered for the allocation. If the request contains "--gres=tmp" without a <count>, then the default <count> value is 1, as defined for the --gres parameter in the sbatch man page:

"The count is the number of those resources with a default value of 1."