| Summary: | Submitting jobs using large tmp disk resources | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Kolbeinn Josepsson <kolbeinn.josepsson> |
| Component: | Configuration | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | alex, hjalti.sveinsson |
| Version: | 14.11.6 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | deCODE | Slinky Site: | --- |
| Attachments: | SLURM config | | |
**Out-of-office auto-reply:** Hjalti Þór Sveinsson will be out of office for an unspecified period of time. Please direct your inquiries to Kolbeinn, kolbeinn.josepson@decode.is, or Örn, orn.asgeirsson@decode.is, if you need assistance.

Hi Kolbeinn,

You can create a partition containing these two nodes with larger local disks for /tmp and set Shared=EXCLUSIVE on that partition. From the slurm.conf man page:

> EXCLUSIVE: Allocates entire nodes to jobs even with select/cons_res configured. Jobs that run in partitions with "Shared=EXCLUSIVE" will have exclusive access to all allocated nodes.

Furthermore, you can set up conditional dependencies between jobs so that a user's jobs do not start until certain conditions are satisfied. Please refer to the sbatch/salloc/srun man pages and look for the --dependency parameter.

Kolbeinn, I have been talking to our teammate Tim, and we agreed that what you might actually want is to configure tmp as a GRES. By using --tmp and TmpFS you are not guaranteeing that the job will have enough free space on that filesystem, so you might be interested in configuring GRES [1] for that purpose. As an example, you could configure something like this in your slurm.conf:

```
GresTypes=tmp
```

Add to the desired NodeName lines:

```
Gres=tmp:1T
```

Then in gres.conf add this line:

```
NodeName=<desired_nodes> Name=tmp Count=1T
```

Then users should be able to add --gres=tmp:500G (or whatever quantity) to their requests.

[1] http://slurm.schedmd.com/gres.html

The --tmp and TmpDir mechanisms only act as a node constraint: they do not allocate any resources to a given job, they only ensure a node is selected that has at least that much space available in /tmp. That space may be completely consumed by the time the job starts. What it sounds like you'd like to do is allocate space under /tmp for each job to ensure multiple /tmp-hogging jobs do not try to run simultaneously.
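A minimal sketch of the partition-plus-dependency approach described above, using the two node names from this report. The partition name "bigtmp" and the job name "tmpheavy" are illustrative assumptions, not anything configured at the site:

```
# slurm.conf fragment (sketch): a dedicated partition over the two
# large-/tmp nodes with whole-node exclusive allocation.
# The partition name "bigtmp" is an assumption for illustration.
PartitionName=bigtmp Nodes=lhpc-[200,733] Shared=EXCLUSIVE

# Matching submissions: with --dependency=singleton, only one job
# carrying the same --job-name may run at a time for this user, so
# later submissions wait in the queue.
# sbatch --partition=bigtmp --job-name=tmpheavy --tmp=1100000 job.sh
# sbatch --partition=bigtmp --job-name=tmpheavy --dependency=singleton --tmp=1100000 job.sh
```

Note that singleton serializes all same-named jobs for the user, which is stricter than one job per node; other --dependency forms (for example afterany:<jobid>) can chain jobs explicitly instead.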
The GRES mechanism is a good fit for this use-case: http://slurm.schedmd.com/gres.html

To briefly outline a possible solution, you'd want to set in slurm.conf:

```
GresTypes=tmp
NodeName=lhpc-200 CoresPerSocket=12 RealMemory=757000 Weight=5 Gres=tmp:1200GB
NodeName=lhpc-733 CoresPerSocket=14 RealMemory=757000 Weight=5 Gres=tmp:1200GB
```

You'll also need a gres.conf file that looks like:

```
NodeName=lhpc-[200,733] Name=tmp Count=1200G
```

After that you'll need to restart slurmctld and slurmd to pick up the changes. Then you'll be able to specify your job requests as:

```
sbatch --gres=tmp:1G
```

and have Slurm track the amount of space that has been allocated. (Note that, unlike the GPU plugin, your 'tmp' GRES won't enforce anything on the node itself. You'd need to build a 'tmp' plugin to do that, or you could optionally manage this through an epilog script.)

Let me know if that makes sense.

cheers,
- Tim

Many thanks Alejandro and Tim. The GRES mechanism sounds like the perfect solution for our case. One question: do we need the gres.conf only on the head node, or is it also needed on lhpc-[200,733], or on all nodes in the cluster?

Rgds,
Kolbeinn

Each node must contain a gres.conf file if generic resources are to be scheduled by Slurm. Closing this as resolved/infogiven. Please re-open if you have more questions.

We have this configured now at deCODE, but what is the default value if a user does not use the --gres=tmp: parameter?

regards,
Hjalti Sveinsson

(In reply to Hjalti Sveinsson from comment #12)
> We have this configured now at deCODE but what is the default value if a user
> does not use the --gres=tmp: parameter?
>
> regards,
> Hjalti Sveinsson

If the request doesn't contain "--gres=tmp:count", then nodes without gres=tmp configured might also be considered for allocation.
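Along the lines of the epilog-script option Tim mentions: since a 'tmp' GRES does not enforce or clean anything on the node, an epilog could remove a per-job scratch directory when the job ends. The following is only a sketch; the per-job directory layout and the `cleanup_job_tmp` helper are assumptions for illustration, not part of Slurm.

```shell
#!/bin/sh
# Epilog sketch: remove a per-job scratch directory under /tmp once the
# job finishes. The /tmp/slurm_job_$SLURM_JOB_ID naming is an assumed
# site convention, not something Slurm creates itself.

cleanup_job_tmp() {
    jobdir="$1"
    # Only act on an existing directory; refuse empty arguments.
    [ -n "$jobdir" ] && [ -d "$jobdir" ] && rm -rf -- "$jobdir"
}

# Demonstration with a throwaway directory standing in for the job's tmp dir:
demo="$(mktemp -d)/slurm_job_demo"
mkdir -p "$demo"
cleanup_job_tmp "$demo"
[ -d "$demo" ] || echo "removed"
```

In a real epilog the call would be something like `cleanup_job_tmp "/tmp/slurm_job_${SLURM_JOB_ID}"`, since Slurm exports SLURM_JOB_ID into the epilog environment.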
If the request contains "--gres=tmp" without a <count>, then the default <count> value is 1, as defined for the --gres parameter in the sbatch man page: "The count is the number of those resources with a default value of 1."
Created attachment 2854 [details]
SLURM config

Hello support,

We need to run specific jobs where one job will use nearly all the temp disk resource on the node. We have two nodes with larger local disks for /tmp, about 1.2 TB in this case. The user wants to submit many jobs with only one job running at a time on each node; other jobs need to wait in the queue until the running job finishes. He submits jobs with --tmp=1100000.

The problem is that all the jobs start to run in parallel on the nodes. Is this working as designed? And if it is, is there any workaround you can share with us?

```
squeue -u eirikurh -o '%.10i %.6C %.7d %.6m %.9l %.9M %.9L %R'
     JOBID   CPUS MIN_TMP MIN_ME   TIME_LIMI      TIME   TIME_LEFT NODELIST(REASON)
  39088431      1      10  55000 15-00:00:00   4:59:34 14-19:00:26 lhpc-652
  39101933      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-200
  39101934      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-200
  39101935      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-200
  39101936      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-200
  39101937      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-200
  39101938      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-200
  39101939      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-200
  39101940      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-200
  39101941      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-200
  39101942      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-200
  39101943      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-200
  39101944      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-200
  39101945      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-733
  39101946      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-733
  39101947      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-733
  39101948      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-733
  39101949      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-733
  39101950      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-733
  39101951      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-733
  39101952      1 1100000  55000 15-00:00:00      1:38 14-23:58:22 lhpc-733
```

Attached is the slurm.conf file.

Best regards,
Kolbeinn Josepsson
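The --tmp value in the report is given in megabytes (sbatch's default unit for --tmp), so --tmp=1100000 requests roughly 1.1 TB of temporary disk. A small helper for converting suffixed sizes to the megabyte counts --tmp expects, using binary (1024-based) units; this is purely illustrative and not a Slurm tool:

```shell
#!/bin/sh
# to_mb: convert a size like 500G or 1.2T to whole megabytes (1024-based),
# the unit sbatch --tmp expects by default. Illustrative helper only.
to_mb() {
    size="$1"
    num="${size%[KkMmGgTt]}"      # numeric part
    suffix="${size#$num}"         # trailing unit letter, if any
    case "$suffix" in
        [Kk])    awk -v n="$num" 'BEGIN { printf "%d\n", n / 1024 }' ;;
        [Mm]|"") awk -v n="$num" 'BEGIN { printf "%d\n", n }' ;;
        [Gg])    awk -v n="$num" 'BEGIN { printf "%d\n", n * 1024 }' ;;
        [Tt])    awk -v n="$num" 'BEGIN { printf "%d\n", n * 1024 * 1024 }' ;;
        *)       echo "unknown suffix: $suffix" >&2; return 1 ;;
    esac
}

to_mb 1100G   # the 1.1 TB request from this report, in MB
```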