13920 – SLURM on GCP Issues - srun/sbatch hanging with new compute node images?

Summary: SLURM on GCP Issues - srun/sbatch hanging with new compute node images?

Status:	RESOLVED INVALID

Alias:	None

Product:	Slurm
Classification:	Unclassified
Component:	Cloud
Version:	22.05.x
Hardware:	Linux Linux

Severity:	6 - No support contract
Assignee:	Jacob Jenson
QA Contact:

URL:

Depends on:
Blocks:

Reported:	2022-04-24 18:16 MDT by skaramcheti
Modified:	2022-04-24 18:16 MDT
CC List:	0 users

See Also:
Site:	-Other-

Description skaramcheti 2022-04-24 18:16:58 MDT

Hey folks,

I've followed the SchedMD instructions to set up a 5-partition SLURM cluster on GCP (SLURM on Google Cloud). Both my login and controller nodes are running the SchedMD image:

`projects/schedmd-slurm-public/global/images/family/schedmd-slurm-21-08-6-debian-10` 

Each compute node in my cluster is running an image based on the Deep Learning VMs (Debian + NVIDIA drivers + Python).

I'm having the following problem: I can submit jobs just fine via `sbatch` or `srun`, but `sbatch` jobs never return any output and `srun` hangs indefinitely.

On further inspection, when I submit a simple `hostname` job via `sbatch`, I can see the appropriate number of machines spin up and then spin down, after which the job is requeued (because SLURM never sees the job results).
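For anyone trying to reproduce this, a minimal sketch of that kind of test job could look like the following. The file name `hostname_test.sbatch`, the job name, and the output pattern are illustrative assumptions, not the exact script used here.

```shell
#!/usr/bin/env bash
# Minimal reproduction sketch: a batch job that only prints the hostname
# of the allocated compute node.
# NOTE: the file name, job name, and output pattern are assumptions.
set -eu

job_file="hostname_test.sbatch"

# Write the batch script; the quoted heredoc keeps its contents literal.
cat > "$job_file" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hostname-test
#SBATCH --nodes=1
#SBATCH --output=hostname-test-%j.out
hostname
EOF

# Submit only where the Slurm client tools are installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_file"
else
    echo "sbatch not found; wrote $job_file without submitting"
fi
```

The interactive equivalent would be something like `srun --nodes=1 hostname`, which is the form that hangs.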

Similarly, when I run `srun`, I can manually SSH into the spun-up node, and things work fine there.

Do I need to run the HPC images on the compute nodes as well? If so, what's the best way to install custom dependencies on my compute nodes (e.g., for GPU/Python)?