Ticket 13596 - XCC Plugin unrealistic value
Summary: XCC Plugin unrealistic value
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Accounting
Version: 20.11.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Carlos Tripiana Montes
QA Contact:
URL:
Depends on: 13778
Blocks:
Reported: 2022-03-10 02:32 MST by Pascal Schuhmacher
Modified: 2022-10-17 09:12 MDT

See Also:
Site: KIT
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02.0pre1
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
debug patch (4.11 KB, patch) - 2022-04-05 07:35 MDT, Carlos Tripiana Montes
Logs cpuonly Node (3.51 MB, text/x-log) - 2022-05-20 02:18 MDT, Pascal Schuhmacher
Logs GPU Node (239.46 KB, text/x-log) - 2022-05-20 02:18 MDT, Pascal Schuhmacher
error last 4 weeks (161.84 KB, text/plain) - 2022-05-24 01:19 MDT, Pascal Schuhmacher
Error occurring (6.35 KB, text/plain) - 2022-06-17 05:36 MDT, Pascal Schuhmacher
slurmd.log (1.85 MB, text/x-log) - 2022-06-17 05:37 MDT, Pascal Schuhmacher

Description Pascal Schuhmacher 2022-03-10 02:32:12 MST
Hello,

The XCC plugin for Slurm runs reliably most of the time, and a realistic value is written to the consumed energy field.
But we have started to notice that for some jobs the value is way too high.

sacct --start 2022-03-09 --format User,Account,JobName,ConsumedEnergy,NodeList,Elapsed,JobID,Start | grep ?
          haicore-p+      batch         18.45?         hkn1701 2-00:00:11 1614781.bat+ 2022-03-07T14:13:09 
          haicore-p+     extern         18.45?         hkn1701 2-00:00:04 1614781.ext+ 2022-03-07T14:13:09 
   mc4465 hk-projec+ NiO/CsPbI+         18.45?         hkn1209 3-00:00:10 1619575      2022-03-07T09:47:35 
          hk-projec+      batch         18.45?         hkn1209 3-00:00:11 1619575.bat+ 2022-03-07T09:47:35 
          hk-projec+     extern         18.45?         hkn1209 3-00:00:11 1619575.ext+ 2022-03-07T09:47:35 
          hk-projec+ hydra_bst+         18.45?         hkn1209 2-23:59:53 1619575.0    2022-03-07T09:47:55 
   mc4465 hk-projec+ NiO/CsPbI+         18.45?         hkn0006 3-00:00:20 1619576      2022-03-07T09:48:25 
          hk-projec+      batch         18.45?         hkn0006 3-00:00:22 1619576.bat+ 2022-03-07T09:48:25 
          hk-projec+     extern         18.45?         hkn0006 3-00:00:21 1619576.ext+ 2022-03-07T09:48:25 
          hk-projec+ hydra_bst+         18.45?         hkn0006 3-00:00:03 1619576.0    2022-03-07T09:48:45 
   mc4465 hk-projec+ NiO/CsPbI+         18.45?         hkn0303 3-00:00:29 1619584      2022-03-07T10:01:46 
          hk-projec+      batch         18.45?         hkn0303 3-00:00:30 1619584.bat+ 2022-03-07T10:01:46 
          hk-projec+     extern         18.45?         hkn0303 3-00:00:29 1619584.ext+ 2022-03-07T10:01:46 
          hk-projec+ hydra_bst+         18.45?         hkn0303 3-00:00:13 1619584.0    2022-03-07T10:02:05 
   mc4465 hk-projec+ NiO/CsPbI+         18.45?         hkn1235 3-00:00:18 1619585      2022-03-07T10:02:27 
          hk-projec+      batch         18.45?         hkn1235 3-00:00:19 1619585.bat+ 2022-03-07T10:02:27 
          hk-projec+     extern         18.45?         hkn1235 3-00:00:18 1619585.ext+ 2022-03-07T10:02:27 
          hk-projec+ hydra_bst+         18.45?         hkn1235 3-00:00:02 1619585.0    2022-03-07T10:02:47 
   mc4465 hk-projec+ CsI_0.25-+         18.45?         hkn1202 2-21:20:51 1619588      2022-03-07T10:10:05 
          hk-projec+      batch         18.45?         hkn1202 2-21:20:51 1619588.bat+ 2022-03-07T10:10:05 
          hk-projec+     extern         18.45?         hkn1202 2-21:20:51 1619588.ext+ 2022-03-07T10:10:05 
          hk-projec+ hydra_bst+         18.45?         hkn1202 2-21:20:30 1619588.0    2022-03-07T10:10:26 
   dp7106 hk-projec+       EMAC         18.45? hkn[0227-0228,+ 2-04:23:51 1619616      2022-03-07T10:50:12 
          hk-projec+      batch         18.45?         hkn0227 2-04:23:51 1619616.bat+ 2022-03-07T10:50:12 
          hk-projec+     extern         18.45? hkn[0227-0228,+ 2-04:23:51 1619616.ext+ 2022-03-07T10:50:12 
          hk-projec+ hydra_bst+         18.45? hkn[0227-0228,+ 2-04:22:00 1619616.0    2022-03-07T10:50:36 
   fy2374 haicore-p+ interacti+         18.45?         hkn0717 1-21:46:41 1620006      2022-03-07T13:15:40 
          haicore-p+ interacti+         18.45?         hkn0717 1-21:47:11 1620006.int+ 2022-03-07T13:15:40 
          haicore-p+     extern         18.45?         hkn0717 1-21:46:41 1620006.ext+ 2022-03-07T13:15:40
          
In the database the following value is written: 'consumed_energy' => '18446744071445903664'.
To me it looks like an over/underflow of a variable.
The question now is whether it is a bug in the Slurm XCC plugin, or whether the hardware returns a wrong value (way too high, or maybe negative or something) at some point, since it seems to happen mostly on long-running jobs.
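For illustration only (this is not the plugin's code, just a minimal C sketch of the suspected wrap-around, with the numbers taken from the database entry above): a value of that magnitude is exactly what you get when a small negative difference ends up stored in an unsigned 64-bit field.

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    /* The consumed_energy value found in the database. */
    uint64_t reported = 18446744071445903664ULL;

    /* Reinterpreted as signed (two's complement assumed), it is a small negative number ... */
    printf("as signed: %" PRId64 "\n", (int64_t) reported);     /* -2263647952 */

    /* ... i.e. the wrap-around of an unsigned subtraction that went negative. */
    uint64_t wrapped = (uint64_t) 0 - (uint64_t) 2263647952ULL;
    printf("wrapped:   %" PRIu64 "\n", wrapped);                 /* 18446744071445903664 */
    return 0;
}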


Regards,
Pascal
Comment 1 Carlos Tripiana Montes 2022-03-10 06:26:22 MST
Hi there Pascal,

I think it could be Bug 12170. Have you been issuing scontrol reconfig or restarting the slurmd service/process?

Cheers,
Carlos.
Comment 2 Carlos Tripiana Montes 2022-03-14 04:53:58 MDT
Hi Pascal,

Have you had time to check what I mentioned in Comment 1?

Thanks,
Carlos.
Comment 3 Pascal Schuhmacher 2022-03-14 09:36:42 MDT
Hello Carlos,

We do not frequently use scontrol reconfig, nor do we restart the slurmd service/process often. 


Regards,
Pascal
Comment 4 Carlos Tripiana Montes 2022-03-15 02:35:18 MDT
Hi Pascal,

In the XCC plugin I don't see any possible way to get a negative number for consumed energy, unless "scontrol reconfig" or slurmd process restart has happened.

I understand you don't usually issue these, but I was suggesting that, by chance, this had happened for the previously mentioned jobs.

To double check the issue, please send us the slurmd.log for a 1-node job, along with the particular jobid we are looking at.

I want to have a look at the logs before making further assumptions, like a faulty BMC (something completely out of Slurm's control, but I'm keen to know how it happens, *if it happens*, and provide a way to mitigate it).

Cheers,
Carlos.
Comment 5 Pascal Schuhmacher 2022-03-15 03:47:36 MDT
Hello Carlos,

There seem to be 2 different cases:

Logs for job(1623719):
tl6613 hk-projec+      mtd16         18.45?         hkn0322   12:15:44 1623719      2022-03-12T06:42:14


[2022-03-12T05:13:56.959] [1623487.0] done with job
[2022-03-12T06:42:14.440] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1623719
[2022-03-12T06:42:14.440] task/affinity: batch_bind: job 1623719 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2022-03-12T06:42:14.440] task/affinity: batch_bind: job 1623719 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2022-03-12T06:42:32.763] [1623719.extern] task/cgroup: _memcg_initialize: /slurm/uid_242096/job_1623719: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-12T06:42:32.763] [1623719.extern] task/cgroup: _memcg_initialize: /slurm/uid_242096/job_1623719/step_extern: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-12T06:42:32.774] Launching batch job 1623719 for UID 242096
[2022-03-12T06:42:32.805] [1623719.batch] task/cgroup: _memcg_initialize: /slurm/uid_242096/job_1623719: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-12T06:42:32.805] [1623719.batch] task/cgroup: _memcg_initialize: /slurm/uid_242096/job_1623719/step_batch: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-12T06:42:33.836] launch task StepId=1623719.0 request from UID:242096 GID:501437 HOST:10.0.0.238 PORT:34582
[2022-03-12T06:42:33.836] task/affinity: lllp_distribution: JobId=1623719 auto binding off: mask_cpu
[2022-03-12T06:42:33.869] [1623719.0] task/cgroup: _memcg_initialize: /slurm/uid_242096/job_1623719: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-12T06:42:33.869] [1623719.0] task/cgroup: _memcg_initialize: /slurm/uid_242096/job_1623719/step_0: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-12T16:07:43.164] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2)
[2022-03-12T16:07:43.164] error: _thread_update_node_energy could not read XCC ipmi values
[2022-03-12T16:07:52.805] [1623719.extern] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2022-03-12T16:07:52.805] [1623719.extern] error: _get_joules_task: can't get info from slurmd
[2022-03-12T16:07:52.888] [1623719.batch] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2022-03-12T16:07:52.888] [1623719.batch] error: _get_joules_task: can't get info from slurmd
[2022-03-12T16:07:53.922] [1623719.0] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2022-03-12T16:07:53.922] [1623719.0] error: _get_joules_task: can't get info from slurmd
[2022-03-12T16:08:38.647] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2)
[2022-03-12T16:08:38.647] error: _thread_update_node_energy could not read XCC ipmi values
[2022-03-12T16:09:33.245] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2)
[2022-03-12T16:09:33.245] error: _thread_update_node_energy could not read XCC ipmi values
[2022-03-12T16:09:44.014] [1623719.0] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2022-03-12T16:09:44.014] [1623719.0] error: _get_joules_task: can't get info from slurmd
[2022-03-12T18:57:58.345] [1623719.0] done with job
[2022-03-12T18:57:58.449] [1623719.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-03-12T18:57:58.474] [1623719.batch] done with job
[2022-03-12T18:57:58.501] [1623719.extern] done with job
[2022-03-12T20:18:52.406] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1623093
[2022-03-12T20:18:52.406] task/affinity: batch_bind: job 1623093 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2022-03-12T20:18:52.406] task/affinity: batch_bind: job 1623093 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2022-03-12T20:19:10.646] [1623093.extern] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1623093: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-12T20:19:10.646] [1623093.extern] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1623093/step_extern: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-12T20:19:10.657] Launching batch job 1623093 for UID 946468
[2022-03-12T20:19:10.687] [1623093.batch] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1623093: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-12T20:19:10.687] [1623093.batch] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1623093/step_batch: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-13T11:29:07.088] [1623093.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-03-13T11:29:07.111] [1623093.batch] done with job
[2022-03-13T11:29:07.145] [1623093.extern] done with job
[2022-03-13T11:29:40.880] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1624540


Logs for job(1622936):
st_st142+ hk-projec+         fe         18.45?         hkn0245   22:32:58 1622936      2022-03-10T13:59:16


[2022-03-10T13:12:40.999] [1622894.extern] done with job
[2022-03-10T13:59:18.262] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1622936
[2022-03-10T13:59:18.262] task/affinity: batch_bind: job 1622936 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2022-03-10T13:59:18.262] task/affinity: batch_bind: job 1622936 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2022-03-10T13:59:37.020] [1622936.extern] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1622936: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-10T13:59:37.020] [1622936.extern] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1622936/step_extern: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-10T13:59:37.031] Launching batch job 1622936 for UID 946468
[2022-03-10T13:59:37.062] [1622936.batch] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1622936: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-10T13:59:37.063] [1622936.batch] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1622936/step_batch: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-10T16:25:37.720] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2)
[2022-03-10T16:25:37.720] error: _thread_update_node_energy could not read XCC ipmi values
[2022-03-10T16:25:47.055] [1622936.extern] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2022-03-10T16:25:47.055] [1622936.extern] error: _get_joules_task: can't get info from slurmd
[2022-03-10T16:25:47.105] [1622936.batch] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2022-03-10T16:25:47.105] [1622936.batch] error: _get_joules_task: can't get info from slurmd
[2022-03-10T16:26:32.736] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2)
[2022-03-10T16:26:32.736] error: _thread_update_node_energy could not read XCC ipmi values
[2022-03-10T16:26:57.055] [1622936.extern] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2022-03-10T16:26:57.055] [1622936.extern] error: _get_joules_task: can't get info from slurmd
[2022-03-10T16:26:57.105] [1622936.batch] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2022-03-10T16:26:57.105] [1622936.batch] error: _get_joules_task: can't get info from slurmd
[2022-03-10T16:27:27.752] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2)
[2022-03-10T16:27:27.752] error: _thread_update_node_energy could not read XCC ipmi values
[2022-03-10T16:28:07.055] [1622936.extern] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2022-03-10T16:28:07.055] [1622936.extern] error: _get_joules_task: can't get info from slurmd
[2022-03-10T16:28:07.105] [1622936.batch] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2022-03-10T16:28:07.105] [1622936.batch] error: _get_joules_task: can't get info from slurmd
[2022-03-10T16:28:18.771] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2)
[2022-03-10T16:28:18.771] error: _thread_update_node_energy could not read XCC ipmi values
[2022-03-11T12:32:14.401] [1622936.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0
[2022-03-11T12:32:14.449] [1622936.batch] done with job
[2022-03-11T12:32:14.450] [1622936.extern] done with job
[2022-03-11T13:23:23.591] [1623470.extern] task/cgroup: _memcg_initialize: /slurm/uid_225557/job_1623470: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-11T13:23:23.591] [1623470.extern] task/cgroup: _memcg_initialize: /slurm/uid_225557/job_1623470/step_extern: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-11T13:23:27.346] launch task StepId=1623470.0 request from UID:225557 GID:501340 HOST:10.0.0.74 PORT:57970
[2022-03-11T13:23:27.346] task/affinity: lllp_distribution: JobId=1623470 manual binding: none
[2022-03-11T13:23:27.397] [1623470.0] task/cgroup: _memcg_initialize: /slurm/uid_225557/job_1623470: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-11T13:23:27.397] [1623470.0] task/cgroup: _memcg_initialize: /slurm/uid_225557/job_1623470/step_0: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-11T23:02:31.445] [1623470.0] done with job

It seems that at some point the IPMI interface returns a response with a length of 2 bytes. The following jobs on the node have the correct energy consumption.




The other case:

Logs for job(1620006):
fy2374 haicore-p+ interacti+         18.45?         hkn0717 1-21:46:41 1620006      2022-03-07T13:15:40

[2022-03-06T11:30:03.970] [1618990.extern] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1618990: alloc=0MB mem.limit=514600MB memsw.limit=unlimited
[2022-03-06T11:30:03.970] [1618990.extern] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1618990/step_extern: alloc=0MB mem.limit=514600MB memsw.limit=unlimited
[2022-03-06T11:30:05.311] launch task StepId=1618990.interactive request from UID:239403 GID:501442 HOST:10.0.7.200 PORT:41328
[2022-03-06T11:30:05.311] task/affinity: lllp_distribution: JobId=1618990 manual binding: none
[2022-03-06T11:30:05.345] [1618990.interactive] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1618990: alloc=0MB mem.limit=514600MB memsw.limit=unlimited
[2022-03-06T11:30:05.345] [1618990.interactive] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1618990/step_interactive: alloc=0MB mem.limit=514600MB memsw.limit=unlimited
[2022-03-06T11:30:05.388] [1618990.interactive] in _window_manager
[2022-03-07T13:15:36.780] [1618990.interactive] error: *** STEP 1618990.interactive ON hkn0717 CANCELLED AT 2022-03-07T13:15:36 ***
[2022-03-07T13:15:37.963] [1618990.interactive] done with job
[2022-03-07T13:16:00.414] [1620006.extern] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1620006: alloc=0MB mem.limit=514600MB memsw.limit=unlimited
[2022-03-07T13:16:00.414] [1620006.extern] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1620006/step_extern: alloc=0MB mem.limit=514600MB memsw.limit=unlimited
[2022-03-07T13:16:01.948] launch task StepId=1620006.interactive request from UID:239403 GID:501442 HOST:10.0.7.200 PORT:33706
[2022-03-07T13:16:01.948] task/affinity: lllp_distribution: JobId=1620006 manual binding: none
[2022-03-07T13:16:01.983] [1620006.interactive] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1620006: alloc=0MB mem.limit=514600MB memsw.limit=unlimited
[2022-03-07T13:16:01.983] [1620006.interactive] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1620006/step_interactive: alloc=0MB mem.limit=514600MB memsw.limit=unlimited
[2022-03-07T13:16:02.024] [1620006.interactive] in _window_manager
[2022-03-07T16:43:12.703] error: _find_node_record(764): lookup failure for hkn1998
[2022-03-07T16:43:12.703] error: WARNING: Invalid hostnames in switch configuration: hkn1998
[2022-03-09T11:02:21.923] [1620006.interactive] error: *** STEP 1620006.interactive ON hkn0717 CANCELLED AT 2022-03-09T11:02:21 ***
[2022-03-09T11:02:21.926] [1620006.extern] done with job
[2022-03-09T11:02:51.603] [1620006.interactive] done with job
[2022-03-09T11:03:14.526] [1622319.extern] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1622319: alloc=0MB mem.limit=514600MB memsw.limit=unlimited


Logs for job(1620043):
mc4465 hk-projec+ NiO/CsPbI+         18.45?         hkn1267 3-00:00:05 1620043      2022-03-07T15:34:11



[2022-03-07T15:00:44.031] [1619672.0] done with job
[2022-03-07T15:34:12.815] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1620043
[2022-03-07T15:34:12.815] task/affinity: batch_bind: job 1620043 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2022-03-07T15:34:12.815] task/affinity: batch_bind: job 1620043 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
[2022-03-07T15:34:31.463] [1620043.extern] task/cgroup: _memcg_initialize: /slurm/uid_240854/job_1620043: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-07T15:34:31.463] [1620043.extern] task/cgroup: _memcg_initialize: /slurm/uid_240854/job_1620043/step_extern: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-07T15:34:31.494] Launching batch job 1620043 for UID 240854
[2022-03-07T15:34:31.545] [1620043.batch] task/cgroup: _memcg_initialize: /slurm/uid_240854/job_1620043: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-07T15:34:31.545] [1620043.batch] task/cgroup: _memcg_initialize: /slurm/uid_240854/job_1620043/step_batch: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-07T15:34:33.011] launch task StepId=1620043.0 request from UID:240854 GID:501329 HOST:10.0.2.180 PORT:52044
[2022-03-07T15:34:33.011] task/affinity: lllp_distribution: JobId=1620043 manual binding: none
[2022-03-07T15:34:33.067] [1620043.0] task/cgroup: _memcg_initialize: /slurm/uid_240854/job_1620043: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-07T15:34:33.067] [1620043.0] task/cgroup: _memcg_initialize: /slurm/uid_240854/job_1620043/step_0: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited
[2022-03-07T16:43:12.772] error: _find_node_record(764): lookup failure for hkn1998
[2022-03-07T16:43:12.772] error: WARNING: Invalid hostnames in switch configuration: hkn1998
[2022-03-10T15:34:16.868] [1620043.batch] error: *** JOB 1620043 ON hkn1267 CANCELLED AT 2022-03-10T15:34:16 DUE TO TIME LIMIT ***
[2022-03-10T15:34:16.869] [1620043.0] error: *** STEP 1620043.0 ON hkn1267 CANCELLED AT 2022-03-10T15:34:16 DUE TO TIME LIMIT ***
[2022-03-10T15:34:17.103] [1620043.extern] done with job
[2022-03-10T15:34:18.081] [1620043.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:15
[2022-03-10T15:34:18.082] [1620043.batch] done with job
[2022-03-10T15:34:19.478] [1620043.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused
[2022-03-10T15:34:19.479] [1620043.0] done with job
[2022-03-10T15:44:38.849] [1622908.extern] task/cgroup: _memcg_initialize: /slurm/uid_3729/job_1622908: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited

In these logs I do not see an IPMI error message.

Best regards,
Pascal
Comment 6 Carlos Tripiana Montes 2022-03-15 04:34:53 MDT
I think the error:

> error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2)

could belong to the fixed Bug 12868. Check if you are experiencing the same issue with open FDs getting leaked, and don't forget to apply the patch there if it is needed. I hope this will be possible for 20.11.

But this error shouldn't affect things in a way that produces negative numbers. If the power read fails, the sample is ignored, and no consumed energy calculation is done.

Another thing I see is that these jobs are very long ones. I wonder whether the counter is overflowing or it's another issue.

Short question: do all those jobs belong to GPU nodes?

Thanks,
Carlos.
Comment 7 Pascal Schuhmacher 2022-03-15 04:42:22 MDT
Hello Carlos,

It happens for jobs on CPU only nodes and nodes containing GPUs.

Regards,
Pascal
Comment 8 Carlos Tripiana Montes 2022-03-15 06:27:12 MDT
Wait,

The number you posted as the result of sacct is 18446744071445903664.

This number has 20 digits, and that matches the MySQL bigint unsigned (unsigned long long, ULLONG_MAX) length/limit. ULLONG_MAX is 18446744073709551615.

18446744073709551615-18446744071445903664=2263647951

Let's see if I'm able to understand why it overflows. But for consumed energy we don't use unsigned long long, but uint64_t. Its UINT64_MAX range is 9223372036854775807 (19 digits).

I think this might be a clue. I'll let you know about any updates.

Thanks for letting us know,
Carlos.
Comment 9 Carlos Tripiana Montes 2022-03-18 09:24:44 MDT
My fault: uint64_t's range is that of unsigned long long, ULLONG_MAX.
Comment 10 Carlos Tripiana Montes 2022-03-23 02:19:39 MDT
Pascal,

Have you checked this error?

> error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2)

I think it belongs to the fixed Bug 12868. The patch there should mitigate this error.

Regarding the energy overflow (which is a separate issue, and the important one here): how often is this error triggered? Very often? Only for very, very long jobs? I'm not able to reproduce it on my end so far, so any additional clues would be much welcome.

Regards,
Carlos.
Comment 11 Pascal Schuhmacher 2022-03-24 02:54:50 MDT
Hello Carlos,

The error occurs about once or twice a day. Here are the jobs with the unrealistic energy values for the last 10 days:

sacct --start 2022-03-14 --format Account,partition,ConsumedEnergy,NodeList,Elapsed,end | grep ?
hk-projec+    cpuonly         18.45?  hkn[0150-0151]   23:50:50 2022-03-14T10:01:51 
hk-projec+                    18.45?         hkn0150   23:50:50 2022-03-14T10:01:51 
hk-projec+                    18.45?  hkn[0150-0151]   23:50:50 2022-03-14T10:01:51 
hk-projec+                    18.45?  hkn[0150-0151]   23:50:30 2022-03-14T10:01:51 
hk-projec+                    18.45?         hkn1651 1-12:00:19 2022-03-15T05:50:22 
hk-projec+                    18.45?         hkn1651 1-12:00:10 2022-03-15T05:50:13 
hk-projec+    cpuonly         18.45?         hkn1219 1-13:58:56 2022-03-15T23:17:40 
hk-projec+                    18.45?         hkn1219 1-13:58:56 2022-03-15T23:17:40 
hk-projec+                    18.45?         hkn1219 1-13:58:56 2022-03-15T23:17:40 
hk-projec+    cpuonly         18.45?         hkn0119 1-05:58:13 2022-03-17T04:40:56 
hk-projec+                    18.45?         hkn0119 1-05:58:13 2022-03-17T04:40:56 
hk-projec+                    18.45?         hkn0119 1-05:58:13 2022-03-17T04:40:56 
hk-projec+    cpuonly         18.45?         hkn1664 3-00:00:15 2022-03-17T09:32:30 
hk-projec+                    18.45?         hkn1664 3-00:00:17 2022-03-17T09:32:32 
hk-projec+                    18.45?         hkn1664 3-00:00:16 2022-03-17T09:32:31 
hk-projec+                    18.45?         hkn1664 2-23:59:58 2022-03-17T09:32:33 
hk-projec+    cpuonly         18.45?         hkn0845   18:16:10 2022-03-15T19:48:20 
hk-projec+                    18.45?         hkn0845   18:16:10 2022-03-15T19:48:20 
hk-projec+                    18.45?         hkn0845   18:16:10 2022-03-15T19:48:20 
hk-projec+                    18.45?         hkn0522 1-00:00:09 2022-03-16T11:23:53 
hk-projec+                    18.45?         hkn0522 1-00:00:08 2022-03-16T11:23:52 
hk-projec+                    18.45?         hkn0523 1-00:00:09 2022-03-16T11:23:53 
hk-projec+                    18.45?         hkn0523 1-00:00:08 2022-03-16T11:23:52 
hk-projec+    cpuonly         18.45?         hkn0015 1-12:00:12 2022-03-16T23:50:57 
hk-projec+                    18.45?         hkn0015 1-12:00:13 2022-03-16T23:50:58 
hk-projec+                    18.45?         hkn0015 1-12:00:12 2022-03-16T23:50:57 
hk-projec+                    18.45?         hkn0015 1-11:59:54 2022-03-16T23:51:00 
hk-projec+    cpuonly         18.45?  hkn[0315-0318]   16:18:14 2022-03-17T09:53:30 
hk-projec+                    18.45?         hkn0315   16:18:14 2022-03-17T09:53:30 
hk-projec+                    18.45?  hkn[0315-0318]   16:18:14 2022-03-17T09:53:30 
hk-projec+                    18.45?         hkn0128 2-23:59:05 2022-03-20T10:34:00 
hk-projec+ accelerat+         18.45?         hkn0719   09:06:28 2022-03-18T00:44:05 
hk-projec+                    18.45?         hkn0719   09:06:28 2022-03-18T00:44:05 
hk-projec+                    18.45?         hkn0719   09:06:28 2022-03-18T00:44:05 
hk-projec+                    18.45? hkn[1201,1206,+   00:00:05 2022-03-17T14:23:59 
hk-projec+                    18.45? hkn[1201,1206,+   00:00:04 2022-03-17T14:34:35 
hk-projec+    cpuonly         18.45? hkn[0229,0231-+ 1-20:13:26 2022-03-20T07:09:16 
hk-projec+                    18.45? hkn[0229,0231-+ 1-20:13:27 2022-03-20T07:09:17 
hk-projec+                    18.45?  hkn[0231-0234] 1-20:12:33 2022-03-20T07:09:18 
hk-projec+    cpuonly         18.45?  hkn[0321-0325]   23:46:27 2022-03-19T16:17:55 
hk-projec+                    18.45?  hkn[0321-0325]   23:46:27 2022-03-19T16:17:55 
hk-projec+                    18.45?  hkn[0321-0325]   23:45:47 2022-03-19T16:17:35 
hk-projec+    cpuonly         18.45? hkn[1141,1143,+   23:46:27 2022-03-21T20:00:23 
hk-projec+                    18.45? hkn[1141,1143,+   23:46:27 2022-03-21T20:00:23 
hk-projec+                    18.45? hkn[1141,1143,+   23:45:48 2022-03-21T20:00:03 
hk-projec+    cpuonly         18.45? hkn[0149-0150,+   00:08:08 2022-03-21T17:05:06 
hk-projec+                    18.45? hkn[0149-0150,+   00:08:08 2022-03-21T17:05:06 
hk-projec+                    18.45? hkn[0150,0153-+   00:07:51 2022-03-21T17:05:11 
hk-projec+                    18.45?         hkn0337   02:30:27 2022-03-21T20:48:58 
hk-projec+    cpuonly         18.45? hkn[0007-0016,+   01:34:50 2022-03-23T09:44:48 
hk-projec+                    18.45? hkn[0007-0016,+   01:34:51 2022-03-23T09:44:49 
hk-projec+                    18.45? hkn[0008-0016,+   01:34:29 2022-03-23T09:44:50 
hk-projec+                    18.45?         hkn0111   09:39:43 2022-03-23T20:53:16 
hk-projec+    cpuonly         18.45? hkn[0008-0016,+   01:33:45 2022-03-23T14:04:53 
hk-projec+                    18.45? hkn[0008-0016,+   01:33:46 2022-03-23T14:04:54 
hk-projec+                    18.45? hkn[0009-0016,+   01:33:23 2022-03-23T14:04:55 
hk-projec+    cpuonly         18.45? hkn[0149-0153,+   01:33:19 2022-03-23T19:21:16 
hk-projec+                    18.45? hkn[0149-0153,+   01:33:19 2022-03-23T19:21:16 
hk-projec+                    18.45? hkn[0150-0153,+   01:32:59 2022-03-23T19:21:17 
hk-projec+    cpuonly         18.45?  hkn[1621-1625]   00:29:21 2022-03-24T04:07:34 
hk-projec+                    18.45?  hkn[1621-1625]   00:29:21 2022-03-24T04:07:34

The job length does not seem to have an influence on the occurrence.

What I also noticed is that the AveWatts field does not work correctly for most nodes:
most of them display AveWatts=0.

NodeName=hkn1669 Arch=x86_64 CoresPerSocket=38 
   CPUAlloc=0 CPUTot=152 CPULoad=59.05
   AvailableFeatures=BEEOND,BEEOND_4MDS,BEEOND_MAXMDS,LSDF,ELCH,CPU
   ActiveFeatures=BEEOND,BEEOND_4MDS,BEEOND_MAXMDS,LSDF,ELCH,CPU
   Gres=(null)
   NodeAddr=hkn1669 NodeHostName=hkn1669 Version=20.11.5
   OS=Linux 4.18.0-193.60.2.el8_2.x86_64 #1 SMP Wed Jul 7 07:08:14 EDT 2021 
   RealMemory=243200 AllocMem=0 FreeMem=241452 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=cpuonly 
   BootTime=2021-08-06T18:01:26 SlurmdStartTime=2022-01-13T13:09:56
   CfgTRES=cpu=152,mem=237.50G,billing=152
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=406 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

Some have a value that is too high; on these nodes the wrong reading probably happened:

NodeName=hkn1707 Arch=x86_64 CoresPerSocket=38 
   CPUAlloc=152 CPUTot=152 CPULoad=240.67
   AvailableFeatures=BEEOND,BEEOND_4MDS,BEEOND_MAXMDS,LSDF,ELCH,ACC
   ActiveFeatures=BEEOND,BEEOND_4MDS,BEEOND_MAXMDS,LSDF,ELCH,ACC
   Gres=gpu:4
   NodeAddr=hkn1707 NodeHostName=hkn1707 Version=20.11.5
   OS=Linux 4.18.0-193.60.2.el8_2.x86_64 #1 SMP Wed Jul 7 07:08:14 EDT 2021 
   RealMemory=501600 AllocMem=501600 FreeMem=464304 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=haicore-gpu4 
   BootTime=2022-01-31T16:52:31 SlurmdStartTime=2022-01-31T16:53:44
   CfgTRES=cpu=152,mem=501600M,billing=152,gres/gpu=4
   AllocTRES=cpu=152,mem=501600M,gres/gpu=4
   CapWatts=n/a
   CurrentWatts=713 AveWatts=36645
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Comment=(null)

And some nodes have a reasonable entry.

Regards,
Pascal
Comment 12 Carlos Tripiana Montes 2022-04-04 02:40:00 MDT
Pascal,

I'll send you a small debug patch soon, to apply on top of your production Slurm, so you can have a bit more info without increasing the debug level; raising it is not recommended in production because of the performance impact.

This way, I expect to narrow down the issue and hopefully catch the bug.

Cheers,
Carlos.
Comment 13 Carlos Tripiana Montes 2022-04-05 07:35:28 MDT
Created attachment 24228 [details]
debug patch

Pascal,

If you can, give this debug patch a try.

Wait for the error to reproduce, collect the slurmd.log file from an affected node, and get back to us with a jobid, the scontrol show job [ID] output, and the slurmd.log.

I think that, whatever is happening, it should be reflected with this patch. If you feel it's rather easy to reproduce the issue, don't apply the patch to the whole cluster but only to one node, and make it fail.

Cheers,
Carlos.
Comment 14 Carlos Tripiana Montes 2022-04-14 02:17:06 MDT
Please Pascal,

Give us some feedback so we can keep our work here alive.

Thanks,
Carlos.
Comment 15 Carlos Tripiana Montes 2022-04-17 00:59:37 MDT
Pascal,

I'm sorry we need to close the issue as timed out.

Please get back to us if you still have issues, whenever you have time to.

Regards,
Carlos.
Comment 17 Pascal Schuhmacher 2022-05-20 02:18:06 MDT
Created attachment 25132 [details]
Logs cpuonly Node
Comment 18 Pascal Schuhmacher 2022-05-20 02:18:45 MDT
Created attachment 25133 [details]
Logs GPU Node
Comment 19 Pascal Schuhmacher 2022-05-20 02:19:39 MDT
Hello Carlos,

Sorry for the very late response; we had a lot to do in the last few weeks. I installed your debug patch and attached logs of the bug happening on compute nodes. We also upgraded to Slurm 21.08.

Here are the timestamps of the bug occurring.

sacct  --starttime 2022-04-26 --nodelist hkn0712 --format partition,ConsumedEnergy,NodeList,Elapsed,start,end,jobname,state,jobid | grep "?"
accelerat+         18.45?  hkn[0709-0712]   01:01:22 2022-05-05T15:20:37 2022-05-05T16:21:59 interacti+  COMPLETED 1716871      
                   18.45?  hkn[0709-0712]   01:01:22 2022-05-05T15:20:37 2022-05-05T16:21:59     extern  COMPLETED 1716871.ext+ 
accelerat+         18.45?         hkn0712   00:20:51 2022-05-09T17:54:30 2022-05-09T18:15:21 966_DDDDP+  COMPLETED 1736826      
                   18.45?         hkn0712   00:20:51 2022-05-09T17:54:30 2022-05-09T18:15:21      batch  COMPLETED 1736826.bat+ 
                   18.45?         hkn0712   00:20:51 2022-05-09T17:54:30 2022-05-09T18:15:21     extern  COMPLETED 1736826.ext+ 
accelerat+         18.45?         hkn0712   00:18:27 2022-05-09T18:33:56 2022-05-09T18:52:23 984_DDDDP+  COMPLETED 1737575      
                   18.45?         hkn0712   00:18:27 2022-05-09T18:33:56 2022-05-09T18:52:23      batch  COMPLETED 1737575.bat+ 
                   18.45?         hkn0712   00:18:27 2022-05-09T18:33:56 2022-05-09T18:52:23     extern  COMPLETED 1737575.ext+
                   
                   
sacct  --starttime 2022-04-26 --nodelist hkn0319 --format partition,ConsumedEnergy,NodeList,Elapsed,start,end,jobname,state,jobid | grep "?"
   cpuonly         18.45? hkn[0318-0320,+   02:56:32 2022-04-26T15:26:43 2022-04-26T18:23:15       rd_5  COMPLETED 1661179      
                   18.45? hkn[0318-0320,+   02:56:32 2022-04-26T15:26:43 2022-04-26T18:23:15     extern  COMPLETED 1661179.ext+ 
                   18.45? hkn[0319-0320,+   02:56:13 2022-04-26T15:27:03 2022-04-26T18:23:16      orted  COMPLETED 1661179.0    
   cpuonly         18.45?         hkn0319   06:57:47 2022-05-09T14:53:51 2022-05-09T21:51:38 atomistica  COMPLETED 1733052      
                   18.45?         hkn0319   06:57:47 2022-05-09T14:53:51 2022-05-09T21:51:38      batch  COMPLETED 1733052.bat+ 
                   18.45?         hkn0319   06:57:47 2022-05-09T14:53:51 2022-05-09T21:51:38     extern  COMPLETED 1733052.ext+ 
   cpuonly         18.45? hkn[0239,0260-+ 1-20:38:33 2022-05-17T17:54:30 2022-05-19T14:33:03  E8R1.5e12  COMPLETED 1743934      
                   18.45? hkn[0239,0260-+ 1-20:38:33 2022-05-17T17:54:30 2022-05-19T14:33:03     extern  COMPLETED 1743934.ext+ 
                   18.45? hkn[0260-0261,+ 1-20:38:15 2022-05-17T17:54:49 2022-05-19T14:33:04      orted  COMPLETED 1743934.0
                   
Best Regards,
Pascal
Comment 20 Carlos Tripiana Montes 2022-05-23 02:02:33 MDT
That's fairly interesting.

I don't really know why yet, but it seems that the node energy from the node(s?) gets reset to ~0 every time I see this in the logs:

error: _find_node_record: lookup failure for node "hkn1998"
error: WARNING: Invalid hostnames in switch configuration: hkn1998

At least, there's a correlation in the logs between these 2 events. Need to see whether it's causation or coincidence.

Thanks,
Carlos.
Comment 21 Carlos Tripiana Montes 2022-05-23 02:37:48 MDT
Would you mind collecting the same type of logs with DebugFlags=Energy in slurm.conf for at least 1 node? I don't really need the whole cluster. If you know more or less how to make it fail, doing so for one node is enough.
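For reference, the slurm.conf lines I mean are just something like the following (an illustrative excerpt; everything else stays as you already have it):

# slurm.conf (excerpt): enable energy gathering debug output
DebugFlags=Energy
# optionally, more verbose slurmd logging as well:
SlurmdDebug=debug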

Thanks,
Carlos.
Comment 22 Pascal Schuhmacher 2022-05-24 01:18:59 MDT
Hello Carlos,

I will set DebugFlags=Energy on a partition, since I do not know how to trigger the bug.
On another node I noticed that the bug seems to occur in bursts rather than at completely random times. I added an attachment with the errors occurring over the last 4 weeks.

Regards,
Pascal
Comment 23 Pascal Schuhmacher 2022-05-24 01:19:32 MDT
Created attachment 25175 [details]
error last 4 weeks
Comment 24 Carlos Tripiana Montes 2022-05-24 01:40:36 MDT
> since I do not know how to trigger the bug.

Sounds good, don't worry. I can truly understand you, because this issue is kinda subtle and has me a bit puzzled right now.
Comment 26 Carlos Tripiana Montes 2022-05-26 01:38:47 MDT
Pascal,

Have you been able to set DebugFlags=Energy and collect the log I need?

Thanks,
Carlos.
Comment 27 Pascal Schuhmacher 2022-06-07 01:29:11 MDT
Hello,

A small update:
I have set SlurmdDebug=debug and DebugFlags=Energy on all nodes of the cluster since 27.05. The bug has not appeared since then, which is a bit strange. I will send the logs as soon as it happens.

Regards,
Pascal
Comment 28 Carlos Tripiana Montes 2022-06-15 04:21:11 MDT
Hi Pascal,

Please update us whenever possible. It's been more than a week since our last communication. I hope you get this reproduced soon.

Regards,
Carlos.
Comment 29 Pascal Schuhmacher 2022-06-15 04:32:54 MDT
Hello Carlos,

The debug flag has been turned on for over 2 weeks now and the error has not appeared; it seems as if it does not happen while the flag is turned on.

Regards,
Pascal
Comment 30 Carlos Tripiana Montes 2022-06-15 05:48:20 MDT
> it seems as if it does not happen while the flag is turned on.

That's fairly strange, as the debug flags only enable log printing, nothing else. It's fairly easy to check by looking at the plugin's source, because it's not a big file.

If you aren't able to get this reproduced with debug enabled, I wonder whether the slurmd restart fixed the issue or it's just a mere coincidence. All in all, keep this running with debug for as long as you need/can to get it reproduced. If you aren't able to, revert the debug flags and check whether the issue starts happening again.

Even without being able to reproduce the issue, this bit of information is useful as well. I'll take it into account when trying to understand why this happens.

Thanks,
Carlos.
Comment 31 Pascal Schuhmacher 2022-06-17 05:36:52 MDT
Created attachment 25557 [details]
Error occurring
Comment 32 Pascal Schuhmacher 2022-06-17 05:37:12 MDT
Created attachment 25558 [details]
slurmd.log
Comment 33 Pascal Schuhmacher 2022-06-17 05:37:47 MDT
Hello Carlos,

Today at 10:50 the Slurm debug mode was turned off, and the error immediately occurred again, as you can see in the attached file sacct_output.
I also uploaded a slurmd.log of the occurrence.

Best Regards,
Pascal
Comment 34 Carlos Tripiana Montes 2022-06-17 06:45:37 MDT
Debug might have been turned off, but I still see DebugFlags "ENERGY" logs, and those are helpful. Let's see if I'm able to spot the issue now.

Thanks,
Carlos.
Comment 35 Carlos Tripiana Montes 2022-06-17 07:00:01 MDT
According to the logs:

Joules, elapsed time: 5 seconds, first read energy counter val: 2374876713 ave watts: 463
[2022-06-17T10:50:41.416] [1756422.interactive] acct_gather_energy/xcc: _get_joules_task: ENERGY: consumed 3323243 Joules (received 6938(692 watts) from slurmd)
[2022-06-17T10:50:42.576] got reconfigure request
[2022-06-17T10:50:42.576] all threads complete
[2022-06-17T10:50:42.577] debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
[2022-06-17T10:50:42.583] debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
[2022-06-17T10:50:42.584] [1756422.extern] debug:  Handling REQUEST_STEP_RECONFIGURE
[2022-06-17T10:50:42.584] [1756422.extern] debug:  _handle_reconfigure for StepId=1756422.extern successful
[2022-06-17T10:50:42.584] [1756422.interactive] debug:  Handling REQUEST_STEP_RECONFIGURE
[2022-06-17T10:50:42.585] [1756422.interactive] debug:  _handle_reconfigure for StepId=1756422.interactive successful
[2022-06-17T10:50:42.588] error: _find_node_record: lookup failure for node "hkn1998"
[2022-06-17T10:50:42.588] error: WARNING: Invalid hostnames in switch configuration: hkn1998
[2022-06-17T10:50:42.596] [1756422.extern] debug:  Handling REQUEST_STATE
[2022-06-17T10:50:42.596] [1756422.interactive] debug:  Handling REQUEST_STATE
[2022-06-17T10:50:48.357] [1756422.extern] acct_gather_energy/xcc: _get_joules_task: _get_joules_task: CurrEn<1stEn (2818<1625951988)
[2022-06-17T10:50:48.357] [1756422.extern] acct_gather_energy/xcc: _get_joules_task: ENERGY: consumed 18446744072083602446 Joules (received 18446744072080279129(564 watts) from slurmd)

The error was indeed caused by a reconfig. XCC doesn't have support for reconfig or slurmd restart: when you fire a reconfig, the plugin's structures are freed, so the next reading can be lower than the first one and the subtraction wraps around into these huge numbers. This is an ongoing improvement internally tracked in Bug 13778. If you are keen to test it, I can send you an early access patch for 22.05. I encourage you to upgrade to the latest released 22.05 version as soon as possible, so you can test it.
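Purely as an illustration of the arithmetic (not the plugin source itself), the logged value follows directly from doing the "current minus first" subtraction in unsigned 64-bit arithmetic once the counter has restarted below the first reading:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    /* Values from the log line above: CurrEn < 1stEn (2818 < 1625951988). */
    uint64_t first_read   = 1625951988;  /* energy counter at step start */
    uint64_t current_read = 2818;        /* counter after the reconfig reset */

    /* Unsigned subtraction wraps around instead of going negative. */
    uint64_t consumed = current_read - first_read;

    printf("%" PRIu64 " Joules\n", consumed);  /* 18446744072083602446, as in the log */
    return 0;
}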

Regards,
Carlos.
Comment 36 Pascal Schuhmacher 2022-06-21 01:08:06 MDT
Hello Carlos,

Indeed, it seems to be triggered by the Slurm reconfigure, and I can reliably reproduce the error. I would like to test the early access patch.
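For completeness, the reproduction boils down to something like this (just a sketch; the job only has to be running on an XCC-monitored node when the reconfigure hits):

# while a job is running on the node:
scontrol reconfigure

# after the job finishes, the consumed energy for it is unrealistic:
sacct -j <jobid> --format JobID,NodeList,ConsumedEnergy,Elapsed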

Best Regards,
Pascal
Comment 37 Carlos Tripiana Montes 2022-07-13 02:21:57 MDT
> I would like to test the early access patch.

For this to happen you need your cluster on the latest 22.05 release.

I'll deliver the patch to you as early access, or with the commit IDs if the review finishes first.

Please let us know whether you are already on 22.05 or have planned to move to it.

Regards,
Carlos.
Comment 38 Pascal Schuhmacher 2022-07-18 01:25:05 MDT
Hello,

We plan to go to 22.05 as soon as possible. For testing purposes our test cluster can be upgraded at any time.

Best Regards,
Pascal
Comment 39 Carlos Tripiana Montes 2022-08-11 01:29:55 MDT
Hi Pascal,

We are tracking the improvement of some energy plugins in an internal bug, and this is part of that work.

Once it is ready for disclosure, I'll provide you with the fix. Right now it is under review.

Thanks for your patience.

Regards.
Comment 40 Carlos Tripiana Montes 2022-10-17 09:12:53 MDT
Hi Pascal,

This issue has been solved as part of this series of commits:

https://github.com/SchedMD/slurm/compare/ca032023fe89...78fe518aa095

Already in master.

I'm going to close the issue as fixed.