| Summary: | How to run a heterogeneous Slurm job on a system with a single node | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Shaheer KM <shaheer> |
| Component: | Heterogeneous Jobs | Assignee: | Chad Vizino <chad> |
| Status: | RESOLVED TIMEDOUT | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 23.02.x | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Cerebras | | |
Are you running these srun commands outside a job allocation? Het jobs are normally run with sbatch. However, looking over your requirement, why not just request an entire allocation with sbatch and then use srun to define job steps inside of that single job? Based on what you have in your example, this node has 356 CPUs? One way to approach this would be to request the entire node:

```
sbatch -N 1 -n <number of tasks/CPUs> --mem=0
```

> NOTE: A memory size specification of zero is treated as a special case and
> grants the job access to all of the memory on each node.

https://slurm.schedmd.com/sbatch.html#OPT_mem

Then, inside your batch script, you would call each srun with the requested task placement for your needs. Can you also let me know if this is an MPI job and whether those tasks need to be part of the same MPI comm world?

---

Thanks for the info. This is not an MPI job. I will try out the suggestion and get back to you if we need more info on this.

---

Hello, I was trying the following to get this working. I created a bash script (csrun.sh) that contains the srun command below:

```
srun --unbuffered --kill-on-bad-exit --ntasks=1 --cpus-per-task=28 --mem-per-cpu=32gb : \
     --ntasks=1 --cpus-per-task=28 --mem-per-cpu=32gb : \
     --distribution=cyclic --ntasks=45 python exec.py
```

Then called this script via sbatch:

```
sbatch -N 1 --nodelist sdf-2 -n 47 --mem=0 csrun.sh
```

This errors out:

```
srun: error: Allocation failure of 1 nodes: job size of 1, already allocated 1 nodes to previous components.
```

Our goal is to get 500Gb of memory for 2 out of the 47 tasks. Do you have a suggestion to make this happen on a single-node Slurm setup?

---

(In reply to Shaheer KM from comment #3)

Hi. A heterogeneous srun within a non-het job as you list above requires at least 2 nodes (see https://slurm.schedmd.com/heterogeneous_jobs.html#het_steps). The docs also note that het jobs typically require one node per component (see https://slurm.schedmd.com/heterogeneous_jobs.html#limitations).

As Jason suggested in comment 1, could you just request 1 node and then run parallel sruns using --overlap (using this option is important or the sruns will block and run serially) within your job script?

Can you share your slurm.conf file so we can see your configuration?

---

(In reply to Chad Vizino from comment #4)

Hi. Any update on this? Will plan to close in a couple of days unless you'd like to continue to pursue this.

Closing for now. If you have more questions, feel free to reopen.

---
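The approach suggested in this thread (one full-node allocation, then parallel overlapping steps) might look roughly like the job script below. This is an untested sketch: the task counts, memory flags, node script name, and `exec.py` invocation are taken from comment 3 or are illustrative assumptions to adjust for your site.

```sh
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 47
#SBATCH --mem=0    # zero is a special case: the job gets all memory on the node

# Two higher-memory steps. --overlap lets the steps share the allocation's
# resources; without it the sruns would block and run one after another.
srun --overlap --ntasks=1 --cpus-per-task=28 --mem-per-cpu=32gb python exec.py &
srun --overlap --ntasks=1 --cpus-per-task=28 --mem-per-cpu=32gb python exec.py &

# The remaining 45 tasks with default memory.
srun --overlap --distribution=cyclic --ntasks=45 python exec.py &

wait    # keep the batch script alive until all steps finish
```

With the resource requests inside the script, it could be submitted simply as `sbatch csrun.sh`.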
We have a special setup with a very beefy node with a lot of CPUs and a huge amount of memory. We need a way to run a Slurm job such that 2 of the tasks get 800GB of memory each and the rest of the tasks get the remaining memory allocated as usual.

We tried the following command, but it tries to spin up the job on 3 nodes and gets stuck waiting for resources:

```
srun --unbuffered --kill-on-bad-exit --ntasks=1 --cpus-per-task=64 --mem-per-cpu=800gb : \
     --ntasks=1 --cpus-per-task=64 --mem-per-cpu=800gb : \
     --distribution=cyclic --ntasks=14 --cpus-per-task=16
```

```
JOBID     PARTITION  NAME      USER  ST  TIME  NODES  NODELIST(REASON)
220018+2  sdf        singular  lab   PD  0:00  1      (Resources)
220018+1  sdf        singular  lab   PD  0:00  1      (Resources)
220018+0  sdf        singular  lab   PD  0:00  1      (Resources)
```

Any help in getting a working command here is highly appreciated.
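For context, each `:`-separated part of that srun is a separate het-job component (hence the `220018+0/+1/+2` job IDs), and het-job components are typically placed on distinct nodes, which is why all three sit pending with `(Resources)` on a one-node system. One way to double-check what the node actually offers is to query it directly; the node name `sdf-2` here is taken from elsewhere in this ticket:

```sh
# Show the node's CPU and memory totals as slurmctld knows them.
scontrol show node sdf-2 | grep -Eo 'CPUTot=[0-9]+|RealMemory=[0-9]+'
```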