Ticket 13632

Summary: Can scheduler and database be run on a VM?
Product: Slurm
Reporter: Elijah Gagne <elijah.w.gagne>
Component: Other
Assignee: Nate Rini <nate>
Status: RESOLVED INFOGIVEN
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: ben, nate
Version: 22.05.x
Hardware: Linux
OS: Linux
Site: Dartmouth
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---

Description Elijah Gagne 2022-03-16 09:15:47 MDT
Hi SchedMD, 

We need to replace the hardware our scheduler and DB run on. Is there any inherent reason it could not be installed on VMs? We can scale VMs with lots of CPU, memory, and fast disks these days and a VM is just a lot nicer to work with. 

Thanks,
EWG
Comment 1 Nate Rini 2022-03-16 09:22:05 MDT
(In reply to Elijah Gagne from comment #0)
> We need to replace the hardware our scheduler and DB run on. Is there any
> inherent reason it could not be installed on VMs? We can scale VMs with lots
> of CPU, memory, and fast disks these days and a VM is just a lot nicer to
> work with. 

slurmd, slurmctld, slurmdbd, and MySQL/MariaDB can run on VMs. It is up to the site to ensure that the VMs have sufficient CPUs/Memory/Network for their cluster's load and/or jobs. The most common issue we see with VMs is a clock source with insufficient precision; it is common enough that `sdiag` now warns when this is detected. Most VM vendors now provide client kernel drivers that supply a high-precision clock source, but these usually have to be enabled manually. Please also note that slurmctld, slurmdbd, and MySQL/MariaDB can be run inside containers if a site doesn't want to pay the performance penalty for virtualization.
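
As a rough way to see which clock source a Linux VM is currently using, the sketch below reads the kernel's standard sysfs clocksource entries. It only reports what the kernel exposes; it does not reproduce whatever check `sdiag` performs, and which clock source counts as precise enough depends on the hypervisor. The function name `read_clocksources` is a hypothetical helper for illustration.

```python
from pathlib import Path

# Standard Linux sysfs location for the active clocksource device.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base: Path = CLOCKSOURCE_DIR):
    """Return (current, available) clock sources, or (None, []) if unreadable.

    Hypothetical helper: it reports the kernel's view only and makes no
    judgment about whether the clock source is precise enough for Slurm.
    """
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        # Not Linux, or sysfs is not mounted or readable in this environment.
        return None, []
    return current, available


if __name__ == "__main__":
    current, available = read_clocksources()
    if current is None:
        print("Could not read clocksource information from sysfs.")
    else:
        print(f"current clocksource:    {current}")
        print(f"available clocksources: {', '.join(available)}")
```

Running this inside the VM that would host slurmctld or slurmdbd shows the active clock source and the alternatives the guest kernel offers; how to switch to a vendor-provided one is hypervisor specific.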

Do you have any more questions?
Comment 4 Nate Rini 2022-03-16 10:22:03 MDT
I wanted to emphasize this part:
> It is up to the site to ensure that the VMs have sufficient CPUs/Memory/Network for their cluster's load and/or jobs.

Many sites have ended up pulling Slurm off their VM systems because the VMs were too slow or had other performance issues. While these issues are not specific to Slurm, they do have an unfortunate tendency to make Slurm look slow. Most VMs carry a performance penalty of 10%, so the hardware needs to be that much faster to meet our suggestions here (slides 18-20):
> https://slurm.schedmd.com/SLUG21/Field_Notes_5.pdf
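
To make the sizing arithmetic concrete, here is a minimal sketch. It assumes the penalty is a uniform fractional slowdown, so that a VM delivers `(1 - penalty)` of the host's performance; that model and the function name `required_speedup` are illustrative assumptions, not something stated in the slides.

```python
def required_speedup(penalty: float) -> float:
    """Factor by which host hardware must exceed the target performance.

    Assumes (for illustration only) that a VM delivers (1 - penalty) of the
    underlying hardware's performance.
    """
    if not 0.0 <= penalty < 1.0:
        raise ValueError("penalty must be in the range [0, 1)")
    return 1.0 / (1.0 - penalty)


if __name__ == "__main__":
    factor = required_speedup(0.10)
    print(f"Host must be about {factor:.2f}x the recommended spec")
```

Under this model a 10% penalty means the host needs to be roughly 11% faster than the recommended bare-metal specification, slightly more than the penalty itself.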

If a VM is used, the CPUs and memory must be pinned for Slurm's VMs.
Comment 6 Elijah Gagne 2022-03-16 14:06:10 MDT
Thanks. I think we're good to close this out.
-EWG