Ticket 13920 - SLURM on GCP Issues - srun/sbatch hanging with new compute node images?
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: Cloud
Version: 22.05.x
Hardware: Linux Linux
: 6 - No support contract
Assignee: Jacob Jenson
 
Reported: 2022-04-24 18:16 MDT by skaramcheti
Modified: 2022-04-24 18:16 MDT

Site: -Other-


Description skaramcheti 2022-04-24 18:16:58 MDT
Hey folks,

I've followed the SchedMD instructions to set up a 5-partition SLURM cluster on GCP (SLURM on Google Cloud). Both my login and controller nodes are running the SchedMD images: 

`projects/schedmd-slurm-public/global/images/family/schedmd-slurm-21-08-6-debian-10` 

Each compute node in my cluster is running an image based on the Deep Learning VM images (Debian + NVIDIA drivers + Python).

I'm having the following problem: I can submit jobs just fine via `sbatch` or `srun`, but `sbatch` jobs never return anything and `srun` hangs indefinitely.

Upon further inspection, when I submit a simple `hostname` job with `sbatch`, I can see the appropriate number of machines spin up and then spin down, after which the job is requeued (because SLURM never sees the job results).

Similarly, when I run `srun`, I can manually `ssh` into the spun-up node, and things work fine there.
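
For reference, the behavior above can be reproduced with something like the following sketch, run from the login node. The partition name `debug`, the function name `repro`, and the 600-second timeout are placeholders for illustration rather than values from my actual setup; the Slurm commands and flags themselves are standard.

```shell
#!/usr/bin/env bash
# Minimal reproduction sketch for the sbatch/srun behavior described above.
# PARTITION is a hypothetical placeholder; substitute a real partition name.
PARTITION="${PARTITION:-debug}"

repro() {
  # Bail out early if the Slurm client tools are not on PATH.
  if ! command -v sbatch >/dev/null 2>&1; then
    echo "Slurm client tools not found; run this on the login node" >&2
    return 1
  fi

  # Batch case: submit a one-line job. --parsable prints only the job ID
  # (optionally followed by ";cluster", which is stripped here).
  jobid=$(sbatch --parsable --partition="$PARTITION" --wrap="hostname") || return 1
  jobid=${jobid%%;*}

  # Job ID, state, and pending/requeue reason as the controller sees them.
  squeue --jobs="$jobid" --format="%i %T %r"

  # Per-node state for the partition, plus recorded reasons for any
  # down or drained nodes.
  sinfo --partition="$PARTITION" --Node --long
  sinfo -R

  # Interactive case: bound the wait so a hang shows up as exit status 124
  # from timeout instead of blocking forever (600 s is an arbitrary choice).
  timeout 600 srun --partition="$PARTITION" hostname
}
```

Calling `repro` (optionally with `PARTITION=<name>` set first) submits the batch job, prints the scheduler's view of the job and nodes, and then attempts the `srun`.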

Do I need to run the HPC images on the compute nodes as well? If so, what's the best way to install custom dependencies on my compute nodes (e.g., for GPU/Python)?