Ticket 14749

Summary: Slurm-gcp v5 errors when enable_reconfigure is enabled
Product: Slurm
Reporter: Simon Gao <simon.gao>
Component: GCP
Assignee: Jacob Jenson <jacob>
Status: RESOLVED INVALID
QA Contact:
Severity: 6 - No support contract    
Priority: ---    
Version: 22.05.0   
Hardware: Linux   
OS: Linux   
Site: -Other- Slinky Site: ---
Attachments: Terraform error log
terraform var-file

Description Simon Gao 2022-08-12 16:40:52 MDT
Created attachment 26318
Terraform error log

When "enable_reconfigure" is set to true, deploying the cloud full-cluster example on GCP fails with the following errors (see the attached files for detailed error information):


 Error: local-exec provisioner error
│ 
│   with module.slurm_cluster.module.slurm_controller_instance[0].module.reconfigure_notify[0].null_resource.notify_cluster,
│   on ../../../../../slurm_cluster/modules/slurm_notify_cluster/main.tf line 51, in resource "null_resource" "notify_cluster":
│   51:   provisioner "local-exec" {
│ 
│ Error running command '/home/luser/Documents/git/test/slurm-gcp/scripts/notify_cluster.py --type='reconfig' 'g2-slurm-events-WrPqCf9D'': exit status 1. Output:
│ Traceback (most recent call last):
│   File "/usr/local/lib/python3.9/site-packages/google/api_core/grpc_helpers.py", line 50, in error_remapped_callable
│     return callable_(*args, **kwargs)
│   File "/usr/local/lib64/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
│     return _end_unary_response_blocking(state, call, False, None)
│   File "/usr/local/lib64/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
│     raise _InactiveRpcError(state)
│ grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
│       status = StatusCode.NOT_FOUND
│       details = "Resource not found (resource=g2-slurm-events-WrPqCf9D)."
│       debug_error_string = "{"created":"@1660343082.509674303","description":"Error received from peer
│ ipv4:142.250.217.74:443","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Resource not found
│ (resource=g2-slurm-events-WrPqCf9D).","grpc_status":5}"
│ >
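
For triage, the two useful facts in the output above are the gRPC status code and the name of the resource that could not be found. The following is a minimal sketch, not part of slurm-gcp, that pulls both out of such a log with the standard library only. The function name `parse_grpc_error` and the `LOG` excerpt variable are hypothetical and exist only for this illustration.

```python
import re

# Excerpt copied from the error output above (illustrative input only).
LOG = """\
│ grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
│       status = StatusCode.NOT_FOUND
│       details = "Resource not found (resource=g2-slurm-events-WrPqCf9D)."
"""


def parse_grpc_error(text):
    """Return (status, resource) found in a gRPC error dump.

    Hypothetical helper: either element is None when the log
    does not contain the corresponding field.
    """
    status = re.search(r"status = StatusCode\.(\w+)", text)
    resource = re.search(r"\(resource=([^)\s]+)\)", text)
    return (
        status.group(1) if status else None,
        resource.group(1) if resource else None,
    )


status, resource = parse_grpc_error(LOG)
print(status, resource)
```

The extracted resource name is the same string that the failing command passes to notify_cluster.py, so the lookup of that name is what returned NOT_FOUND.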

When "enable_reconfigure" is set to false, the same Terraform command completes successfully.

The Terraform user has all the suggested permissions.
Comment 1 Simon Gao 2022-08-12 16:41:37 MDT
Created attachment 26319
terraform var-file