Ticket 17363

Summary: GCP a2-highgpu-8g nodes unable to be used with hpc toolkit
Product: Slurm Reporter: Chris Raynor <chris>
Component: GCP Assignee: Jacob Jenson <jacob>
Status: OPEN --- QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 23.11.x   
Hardware: Linux   
OS: Linux   
Site: -Other- Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Chris Raynor 2023-08-07 08:17:46 MDT
Since https://github.com/SchedMD/slurm-gcp/commit/c5095d0bc2f38054d678e9a32568883dec0132c5, nodes of machine type a2-highgpu-8g (or similar) cannot be created. Controller setup fails with:

2023-08-07 12:04:08,247 INFO: Setting up controller
2023-08-07 12:04:08,249 INFO: installing custom scripts: compute.d/ghpc_startup.sh,controller.d/ghpc_startup.sh,partition.d/a2/ghpc_startup.sh,partition.d/dev/ghpc_startup.sh
2023-08-07 12:04:08,249 DEBUG: install_custom_scripts: compute.d/ghpc_startup.sh
2023-08-07 12:04:08,251 DEBUG: install_custom_scripts: controller.d/ghpc_startup.sh
2023-08-07 12:04:08,252 DEBUG: install_custom_scripts: partition.d/a2/ghpc_startup.sh
2023-08-07 12:04:08,254 DEBUG: install_custom_scripts: partition.d/dev/ghpc_startup.sh
2023-08-07 12:04:08,259 DEBUG: compute_service: Using version=v1 of Google Compute Engine API
2023-08-07 12:04:44,512 WARNING: core count in machine type a2-highgpu-8g is not an integer. Default to 1 socket.
2023-08-07 12:04:44,512 ERROR: invalid literal for int() with base 10: '8g'
--
  File "/slurm/scripts/setup.py", line 1071, in setup_controller
    gen_cloud_conf()
  File "/slurm/scripts/setup.py", line 341, in gen_cloud_conf
    content = make_cloud_conf(lkp, cloud_parameters=cloud_parameters)
  File "/slurm/scripts/setup.py", line 330, in make_cloud_conf
    lines = [
  File "/slurm/scripts/setup.py", line 333, in <genexpr>
    *(partitionlines(p, lkp) for p in lkp.cfg.partitions.values()),
  File "/slurm/scripts/setup.py", line 272, in partitionlines
    group_lines = [
  File "/slurm/scripts/setup.py", line 273, in <listcomp>
    node_group_lines(group, part_name, lkp)
  File "/slurm/scripts/setup.py", line 210, in node_group_lines
    machine_conf = lkp.template_machine_conf(node_group.instance_template)
  File "/slurm/scripts/util.py", line 1540, in template_machine_conf
    _div = 2 if getThreadsPerCore(template) == 1 else 1
  File "/slurm/scripts/util.py", line 1117, in getThreadsPerCore
    if not isSmt(template):
  File "/slurm/scripts/util.py", line 1099, in isSmt
    machineTypeCore: int = int(matches["core"])
ValueError: invalid literal for int() with base 10: '8g'

https://github.com/SchedMD/slurm-gcp/commit/6cb0fd3f706ee5fc742a7c243bf2648ee45a9729 also seems to make the same assumption, namely that the last component of the machine type name is an integer core count.
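
For illustration, here is a minimal sketch of a more tolerant parse. This is not the actual slurm-gcp code: the regex and the function name parse_core_count are my own assumptions, and only the "core" group name is taken from the traceback above. It returns None instead of raising when the trailing component is not an integer, as with the '8g' in a2-highgpu-8g.

```python
import re
from typing import Optional

# Hypothetical pattern (not the one used in util.py): three hyphen-separated
# components, e.g. "n2-standard-16" or "a2-highgpu-8g".
_MACHINE_TYPE_RE = re.compile(
    r"^(?P<family>[^-]+)-(?P<group>[^-]+)-(?P<core>[^-]+)$"
)


def parse_core_count(machine_type: str) -> Optional[int]:
    """Return the trailing integer of a machine type name, or None.

    None means the name does not carry an integer core count, so the
    caller has to obtain the core count some other way.
    """
    match = _MACHINE_TYPE_RE.match(machine_type)
    if match is None:
        # Name does not have three components, e.g. "e2-medium".
        return None
    try:
        return int(match.group("core"))
    except ValueError:
        # Trailing component such as "8g" is not an integer.
        return None


if __name__ == "__main__":
    for name in ("n2-standard-16", "a2-highgpu-8g", "e2-medium"):
        print(name, parse_core_count(name))
```

With something along these lines, a2-highgpu-8g would yield None rather than a ValueError, and the caller could fall back to another source for the core count, for example the machine type details returned by the Compute Engine API (assuming those are available at that point in setup).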