Ticket 14749 - Slurm-gcp v5 errors when enable_reconfigure is enabled
Summary: Slurm-gcp v5 errors when enable_reconfigure is enabled
Status: RESOLVED INVALID
Alias: None
Product: Slurm
Classification: Unclassified
Component: GCP
Version: 22.05.0
Hardware: Linux Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
 
Reported: 2022-08-12 16:40 MDT by Simon Gao
Modified: 2022-08-12 16:41 MDT

Site: -Other-


Attachments
Terraform error log (20.49 KB, text/plain)
2022-08-12 16:40 MDT, Simon Gao
terraform var-file (9.38 KB, text/x-csrc)
2022-08-12 16:41 MDT, Simon Gao

Description Simon Gao 2022-08-12 16:40:52 MDT
Created attachment 26318
Terraform error log

With "enable_reconfigure" set to true, deploying the full cloud example cluster on GCP failed with the following errors (for detailed error information, see the attached files):


 Error: local-exec provisioner error
│ 
│   with module.slurm_cluster.module.slurm_controller_instance[0].module.reconfigure_notify[0].null_resource.notify_cluster,
│   on ../../../../../slurm_cluster/modules/slurm_notify_cluster/main.tf line 51, in resource "null_resource" "notify_cluster":
│   51:   provisioner "local-exec" {
│ 
│ Error running command '/home/luser/Documents/git/test/slurm-gcp/scripts/notify_cluster.py --type='reconfig' 'g2-slurm-events-WrPqCf9D'': exit status 1. Output:
│ Traceback (most recent call last):
│   File "/usr/local/lib/python3.9/site-packages/google/api_core/grpc_helpers.py", line 50, in error_remapped_callable
│     return callable_(*args, **kwargs)
│   File "/usr/local/lib64/python3.9/site-packages/grpc/_channel.py", line 946, in __call__
│     return _end_unary_response_blocking(state, call, False, None)
│   File "/usr/local/lib64/python3.9/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
│     raise _InactiveRpcError(state)
│ grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
│       status = StatusCode.NOT_FOUND
│       details = "Resource not found (resource=g2-slurm-events-WrPqCf9D)."
│       debug_error_string = "{"created":"@1660343082.509674303","description":"Error received from peer
│ ipv4:142.250.217.74:443","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Resource not found
│ (resource=g2-slurm-events-WrPqCf9D).","grpc_status":5}"
│ >

With "enable_reconfigure" set to false, the same terraform command completed successfully.

The Terraform user has all the suggested permissions.
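
For triaging a log like the one attached, the following is a minimal sketch that pulls the gRPC status code and the name of the missing resource out of the traceback text. It is not part of slurm-gcp; the function name summarize_grpc_error is hypothetical, and the regular expressions assume the error dump has the same shape as the output quoted above.

```python
import re


def summarize_grpc_error(log_text):
    """Return (status, resource) parsed from a gRPC error dump, or None.

    Hypothetical helper: assumes lines shaped like
    'status = StatusCode.NOT_FOUND' and '(resource=<name>)'.
    """
    status = re.search(r"status\s*=\s*StatusCode\.(\w+)", log_text)
    if status is None:
        return None
    resource = re.search(r"\(resource=([^)]+)\)", log_text)
    return status.group(1), (resource.group(1) if resource else None)


if __name__ == "__main__":
    sample = (
        "status = StatusCode.NOT_FOUND\n"
        'details = "Resource not found (resource=g2-slurm-events-WrPqCf9D)."\n'
    )
    print(summarize_grpc_error(sample))
```

Run against the excerpt above, this reports NOT_FOUND for g2-slurm-events-WrPqCf9D, the value passed to notify_cluster.py on its command line.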
Comment 1 Simon Gao 2022-08-12 16:41:37 MDT
Created attachment 26319
terraform var-file