| Field | Value |
|---|---|
| Summary | XCC Plugin unrealistic value |
| Product | Slurm |
| Component | Accounting |
| Version | 20.11.5 |
| Status | RESOLVED FIXED |
| Severity | 4 - Minor Issue |
| Priority | --- |
| Reporter | Pascal Schuhmacher <pascal.schuhmacher> |
| Assignee | Carlos Tripiana Montes <tripiana> |
| Hardware | Linux |
| OS | Linux |
| Site | KIT |
| See Also | https://bugs.schedmd.com/show_bug.cgi?id=11323 |
| Version Fixed | 23.02.0pre1 |
| Ticket Depends on | 13778 |

Attachments:
- debug patch
- Logs cpuonly Node
- Logs GPU Node
- error last 4 weeks
- Error occurring
- slurmd.log
Description
Pascal Schuhmacher
2022-03-10 02:32:12 MST
Hi there Pascal, I think it could be Bug 12170. Have you been issuing scontrol reconfig or restarting the slurmd service/process? Cheers, Carlos. Hi Pascal, Have you got time to check what I told in Comment 1? Thanks, Carlos. Hello Carlos, We do not frequently use scontrol reconfig, nor do we restart the slurmd service/process often. Regards, Pascal Hi Pascal, In the XCC plugin I don't see any possible way to get a negative number for consumed energy, unless "scontrol reconfig" or slurmd process restart has happened. I understand you don't usually issue these, but I was suggesting that, by chance, this had happened for the previously mentioned jobs. To double check the issue, please send us the slurmd.log for 1-node job, along with the particular jobid we are looking after. I want to have a look at the logs before making further assumptions like a faulty BMC (something completely of of Slurm control, but keen to know how it happens, *if it happens*, and provide a way to mitigate it). Cheers, Carlos. Hello Carlos, There seem to be 2 different cases: Logs for job(1623719): tl6613 hk-projec+ mtd16 18.45? hkn0322 12:15:44 1623719 2022-03-12T06:42:14 [2022-03-12T05:13:56.959] [1623487.0] done with job [2022-03-12T06:42:14.440] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1623719 [2022-03-12T06:42:14.440] task/affinity: batch_bind: job 1623719 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [2022-03-12T06:42:14.440] task/affinity: batch_bind: job 1623719 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [2022-03-12T06:42:32.763] [1623719.extern] task/cgroup: _memcg_initialize: /slurm/uid_242096/job_1623719: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-12T06:42:32.763] [1623719.extern] task/cgroup: _memcg_initialize: /slurm/uid_242096/job_1623719/step_extern: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-12T06:42:32.774] Launching batch job 1623719 for UID 242096 [2022-03-12T06:42:32.805] [1623719.batch] task/cgroup: _memcg_initialize: /slurm/uid_242096/job_1623719: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-12T06:42:32.805] [1623719.batch] task/cgroup: _memcg_initialize: /slurm/uid_242096/job_1623719/step_batch: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-12T06:42:33.836] launch task StepId=1623719.0 request from UID:242096 GID:501437 HOST:10.0.0.238 PORT:34582 [2022-03-12T06:42:33.836] task/affinity: lllp_distribution: JobId=1623719 auto binding off: mask_cpu [2022-03-12T06:42:33.869] [1623719.0] task/cgroup: _memcg_initialize: /slurm/uid_242096/job_1623719: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-12T06:42:33.869] [1623719.0] task/cgroup: _memcg_initialize: /slurm/uid_242096/job_1623719/step_0: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-12T16:07:43.164] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2) [2022-03-12T16:07:43.164] error: _thread_update_node_energy could not read XCC ipmi values [2022-03-12T16:07:52.805] [1623719.extern] error: slurm_get_node_energy: Socket timed out on send/recv operation [2022-03-12T16:07:52.805] [1623719.extern] error: _get_joules_task: can't get info from slurmd [2022-03-12T16:07:52.888] [1623719.batch] error: slurm_get_node_energy: Socket timed out on send/recv operation [2022-03-12T16:07:52.888] [1623719.batch] error: _get_joules_task: can't get info from slurmd [2022-03-12T16:07:53.922] [1623719.0] error: 
slurm_get_node_energy: Socket timed out on send/recv operation [2022-03-12T16:07:53.922] [1623719.0] error: _get_joules_task: can't get info from slurmd [2022-03-12T16:08:38.647] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2) [2022-03-12T16:08:38.647] error: _thread_update_node_energy could not read XCC ipmi values [2022-03-12T16:09:33.245] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2) [2022-03-12T16:09:33.245] error: _thread_update_node_energy could not read XCC ipmi values [2022-03-12T16:09:44.014] [1623719.0] error: slurm_get_node_energy: Socket timed out on send/recv operation [2022-03-12T16:09:44.014] [1623719.0] error: _get_joules_task: can't get info from slurmd [2022-03-12T18:57:58.345] [1623719.0] done with job [2022-03-12T18:57:58.449] [1623719.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0 [2022-03-12T18:57:58.474] [1623719.batch] done with job [2022-03-12T18:57:58.501] [1623719.extern] done with job [2022-03-12T20:18:52.406] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1623093 [2022-03-12T20:18:52.406] task/affinity: batch_bind: job 1623093 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [2022-03-12T20:18:52.406] task/affinity: batch_bind: job 1623093 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [2022-03-12T20:19:10.646] [1623093.extern] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1623093: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-12T20:19:10.646] [1623093.extern] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1623093/step_extern: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-12T20:19:10.657] Launching batch job 1623093 for UID 946468 [2022-03-12T20:19:10.687] [1623093.batch] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1623093: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-12T20:19:10.687] [1623093.batch] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1623093/step_batch: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-13T11:29:07.088] [1623093.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0 [2022-03-13T11:29:07.111] [1623093.batch] done with job [2022-03-13T11:29:07.145] [1623093.extern] done with job [2022-03-13T11:29:40.880] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1624540 Logs for job(1622936): st_st142+ hk-projec+ fe 18.45? 
hkn0245 22:32:58 1622936 2022-03-10T13:59:16 [2022-03-10T13:12:40.999] [1622894.extern] done with job [2022-03-10T13:59:18.262] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1622936 [2022-03-10T13:59:18.262] task/affinity: batch_bind: job 1622936 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [2022-03-10T13:59:18.262] task/affinity: batch_bind: job 1622936 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [2022-03-10T13:59:37.020] [1622936.extern] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1622936: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-10T13:59:37.020] [1622936.extern] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1622936/step_extern: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-10T13:59:37.031] Launching batch job 1622936 for UID 946468 [2022-03-10T13:59:37.062] [1622936.batch] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1622936: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-10T13:59:37.063] [1622936.batch] task/cgroup: _memcg_initialize: /slurm/uid_946468/job_1622936/step_batch: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-10T16:25:37.720] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2) [2022-03-10T16:25:37.720] error: _thread_update_node_energy could not read XCC ipmi values [2022-03-10T16:25:47.055] [1622936.extern] error: slurm_get_node_energy: Socket timed out on send/recv operation [2022-03-10T16:25:47.055] [1622936.extern] error: _get_joules_task: can't get info from slurmd [2022-03-10T16:25:47.105] [1622936.batch] error: slurm_get_node_energy: Socket timed out on send/recv operation [2022-03-10T16:25:47.105] [1622936.batch] error: _get_joules_task: can't get info from slurmd [2022-03-10T16:26:32.736] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2) [2022-03-10T16:26:32.736] error: _thread_update_node_energy could not read XCC ipmi values [2022-03-10T16:26:57.055] [1622936.extern] error: slurm_get_node_energy: Socket timed out on send/recv operation [2022-03-10T16:26:57.055] [1622936.extern] error: _get_joules_task: can't get info from slurmd [2022-03-10T16:26:57.105] [1622936.batch] error: slurm_get_node_energy: Socket timed out on send/recv operation [2022-03-10T16:26:57.105] [1622936.batch] error: _get_joules_task: can't get info from slurmd [2022-03-10T16:27:27.752] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2) [2022-03-10T16:27:27.752] error: _thread_update_node_energy could not read XCC ipmi values [2022-03-10T16:28:07.055] [1622936.extern] error: slurm_get_node_energy: Socket timed out on send/recv operation [2022-03-10T16:28:07.055] [1622936.extern] error: _get_joules_task: can't get info from slurmd [2022-03-10T16:28:07.105] [1622936.batch] error: slurm_get_node_energy: Socket timed out on send/recv operation [2022-03-10T16:28:07.105] [1622936.batch] error: _get_joules_task: can't get info from slurmd [2022-03-10T16:28:18.771] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2) [2022-03-10T16:28:18.771] error: _thread_update_node_energy could not read XCC ipmi values [2022-03-11T12:32:14.401] [1622936.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:0 [2022-03-11T12:32:14.449] [1622936.batch] done with job [2022-03-11T12:32:14.450] [1622936.extern] done with job 
[2022-03-11T13:23:23.591] [1623470.extern] task/cgroup: _memcg_initialize: /slurm/uid_225557/job_1623470: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-11T13:23:23.591] [1623470.extern] task/cgroup: _memcg_initialize: /slurm/uid_225557/job_1623470/step_extern: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-11T13:23:27.346] launch task StepId=1623470.0 request from UID:225557 GID:501340 HOST:10.0.0.74 PORT:57970 [2022-03-11T13:23:27.346] task/affinity: lllp_distribution: JobId=1623470 manual binding: none [2022-03-11T13:23:27.397] [1623470.0] task/cgroup: _memcg_initialize: /slurm/uid_225557/job_1623470: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-11T13:23:27.397] [1623470.0] task/cgroup: _memcg_initialize: /slurm/uid_225557/job_1623470/step_0: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-11T23:02:31.445] [1623470.0] done with job It seems at some point the ipmi gives back a value with the length of 2 bytes. The following jobs on the node have the correct Energy consumption. The other case: Logs for job(1620006): fy2374 haicore-p+ interacti+ 18.45? hkn0717 1-21:46:41 1620006 2022-03-07T13:15:40 [2022-03-06T11:30:03.970] [1618990.extern] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1618990: alloc=0MB mem.limit=514600MB memsw.limit=unlimited [2022-03-06T11:30:03.970] [1618990.extern] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1618990/step_extern: alloc=0MB mem.limit=514600MB memsw.limit=unlimited [2022-03-06T11:30:05.311] launch task StepId=1618990.interactive request from UID:239403 GID:501442 HOST:10.0.7.200 PORT:41328 [2022-03-06T11:30:05.311] task/affinity: lllp_distribution: JobId=1618990 manual binding: none [2022-03-06T11:30:05.345] [1618990.interactive] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1618990: alloc=0MB mem.limit=514600MB memsw.limit=unlimited [2022-03-06T11:30:05.345] [1618990.interactive] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1618990/step_interactive: alloc=0MB mem.limit=514600MB memsw.limit=unlimited [2022-03-06T11:30:05.388] [1618990.interactive] in _window_manager [2022-03-07T13:15:36.780] [1618990.interactive] error: *** STEP 1618990.interactive ON hkn0717 CANCELLED AT 2022-03-07T13:15:36 *** [2022-03-07T13:15:37.963] [1618990.interactive] done with job [2022-03-07T13:16:00.414] [1620006.extern] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1620006: alloc=0MB mem.limit=514600MB memsw.limit=unlimited [2022-03-07T13:16:00.414] [1620006.extern] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1620006/step_extern: alloc=0MB mem.limit=514600MB memsw.limit=unlimited [2022-03-07T13:16:01.948] launch task StepId=1620006.interactive request from UID:239403 GID:501442 HOST:10.0.7.200 PORT:33706 [2022-03-07T13:16:01.948] task/affinity: lllp_distribution: JobId=1620006 manual binding: none [2022-03-07T13:16:01.983] [1620006.interactive] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1620006: alloc=0MB mem.limit=514600MB memsw.limit=unlimited [2022-03-07T13:16:01.983] [1620006.interactive] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1620006/step_interactive: alloc=0MB mem.limit=514600MB memsw.limit=unlimited [2022-03-07T13:16:02.024] [1620006.interactive] in _window_manager [2022-03-07T16:43:12.703] error: _find_node_record(764): lookup failure for hkn1998 [2022-03-07T16:43:12.703] error: WARNING: Invalid hostnames in switch configuration: hkn1998 [2022-03-09T11:02:21.923] [1620006.interactive] error: *** STEP 
1620006.interactive ON hkn0717 CANCELLED AT 2022-03-09T11:02:21 *** [2022-03-09T11:02:21.926] [1620006.extern] done with job [2022-03-09T11:02:51.603] [1620006.interactive] done with job [2022-03-09T11:03:14.526] [1622319.extern] task/cgroup: _memcg_initialize: /slurm/uid_239403/job_1622319: alloc=0MB mem.limit=514600MB memsw.limit=unlimited Logs for job(1620006): mc4465 hk-projec+ NiO/CsPbI+ 18.45? hkn1267 3-00:00:05 1620043 2022-03-07T15:34:11 [2022-03-07T15:00:44.031] [1619672.0] done with job [2022-03-07T15:34:12.815] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 1620043 [2022-03-07T15:34:12.815] task/affinity: batch_bind: job 1620043 CPU input mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [2022-03-07T15:34:12.815] task/affinity: batch_bind: job 1620043 CPU final HW mask for node: 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [2022-03-07T15:34:31.463] [1620043.extern] task/cgroup: _memcg_initialize: /slurm/uid_240854/job_1620043: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-07T15:34:31.463] [1620043.extern] task/cgroup: _memcg_initialize: /slurm/uid_240854/job_1620043/step_extern: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-07T15:34:31.494] Launching batch job 1620043 for UID 240854 [2022-03-07T15:34:31.545] [1620043.batch] task/cgroup: _memcg_initialize: /slurm/uid_240854/job_1620043: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-07T15:34:31.545] [1620043.batch] task/cgroup: _memcg_initialize: /slurm/uid_240854/job_1620043/step_batch: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-07T15:34:33.011] launch task StepId=1620043.0 request from UID:240854 GID:501329 HOST:10.0.2.180 PORT:52044 [2022-03-07T15:34:33.011] task/affinity: lllp_distribution: JobId=1620043 manual binding: none [2022-03-07T15:34:33.067] [1620043.0] task/cgroup: _memcg_initialize: /slurm/uid_240854/job_1620043: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-07T15:34:33.067] [1620043.0] task/cgroup: _memcg_initialize: /slurm/uid_240854/job_1620043/step_0: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited [2022-03-07T16:43:12.772] error: _find_node_record(764): lookup failure for hkn1998 [2022-03-07T16:43:12.772] error: WARNING: Invalid hostnames in switch configuration: hkn1998 [2022-03-10T15:34:16.868] [1620043.batch] error: *** JOB 1620043 ON hkn1267 CANCELLED AT 2022-03-10T15:34:16 DUE TO TIME LIMIT *** [2022-03-10T15:34:16.869] [1620043.0] error: *** STEP 1620043.0 ON hkn1267 CANCELLED AT 2022-03-10T15:34:16 DUE TO TIME LIMIT *** [2022-03-10T15:34:17.103] [1620043.extern] done with job [2022-03-10T15:34:18.081] [1620043.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:15 [2022-03-10T15:34:18.082] [1620043.batch] done with job [2022-03-10T15:34:19.478] [1620043.0] error: Failed to send MESSAGE_TASK_EXIT: Connection refused [2022-03-10T15:34:19.479] [1620043.0] done with job [2022-03-10T15:44:38.849] [1622908.extern] task/cgroup: _memcg_initialize: /slurm/uid_3729/job_1622908: alloc=243200MB mem.limit=243200MB memsw.limit=unlimited In these logs i do not see a ipmi error message. Best regard, Pascal I think the error: > error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2) could belong to the fixed Bug 12868. Check if you are experiencing the same issue with open FDs getting leaked. And don't forget to apply the patch there if is needed. I hope this will be possible for 20.11. 
But this error shouldn't affect in a way that we get negative numbers. If the power read fails, the sample is ignored, and no consumed energy calculation is done. Another thing I see it that these jobs are very long ones. I wonder whether the counter is getting overflowed or it's another issue. Short question, are all those jobs belonging to GPU nodes? Thanks, Carlos. Hello Carlos, It happens for jobs on CPU only nodes and nodes containing GPUs. Regards, Pascal Wait, The number you posted as result of sacct is 18446744071445903664. This number has 20 digits. And this matches mysql bigint unsigned (unsigned long long ULLONG_MAX) length/limit. ULLONG_MAX is 18446744073709551615. 18446744073709551615-18446744071445903664=2263647951 Let's see if I'm able to understand why it overflows. But for consumed energy we don't use unsigned long long, but uint64_t. It UINT64_MAX range is 9223372036854775807 (19 digits length). I think this might be a clue. I'll let you know about any updates. Thanks for letting us know, Carlos. My fault uint64_t range is unsigned long long ULLONG_MAX. Pascal, Have you checked this error? > error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16 (SD650) or 40 (SD650V2) I think it belongs to the fixed Bug 12868. The patch there should mitigate this error. Regarding the energy overflow (which is separated issue, and the one important here): How often this error is triggered? Very often? Only very very long jobs? I'm not able to reproduce it on my end by now, so any additional clues would be much welcomed. Regards, Carlos. Hello Carlos, The error occurs about once or twice a day. Here are the jobs with the unrealistic energy values for the last 10 days: sacct --start 2022-03-14 --format Account,partition,ConsumedEnergy,NodeList,Elapsed,end | grep ? hk-projec+ cpuonly 18.45? hkn[0150-0151] 23:50:50 2022-03-14T10:01:51 hk-projec+ 18.45? hkn0150 23:50:50 2022-03-14T10:01:51 hk-projec+ 18.45? hkn[0150-0151] 23:50:50 2022-03-14T10:01:51 hk-projec+ 18.45? hkn[0150-0151] 23:50:30 2022-03-14T10:01:51 hk-projec+ 18.45? hkn1651 1-12:00:19 2022-03-15T05:50:22 hk-projec+ 18.45? hkn1651 1-12:00:10 2022-03-15T05:50:13 hk-projec+ cpuonly 18.45? hkn1219 1-13:58:56 2022-03-15T23:17:40 hk-projec+ 18.45? hkn1219 1-13:58:56 2022-03-15T23:17:40 hk-projec+ 18.45? hkn1219 1-13:58:56 2022-03-15T23:17:40 hk-projec+ cpuonly 18.45? hkn0119 1-05:58:13 2022-03-17T04:40:56 hk-projec+ 18.45? hkn0119 1-05:58:13 2022-03-17T04:40:56 hk-projec+ 18.45? hkn0119 1-05:58:13 2022-03-17T04:40:56 hk-projec+ cpuonly 18.45? hkn1664 3-00:00:15 2022-03-17T09:32:30 hk-projec+ 18.45? hkn1664 3-00:00:17 2022-03-17T09:32:32 hk-projec+ 18.45? hkn1664 3-00:00:16 2022-03-17T09:32:31 hk-projec+ 18.45? hkn1664 2-23:59:58 2022-03-17T09:32:33 hk-projec+ cpuonly 18.45? hkn0845 18:16:10 2022-03-15T19:48:20 hk-projec+ 18.45? hkn0845 18:16:10 2022-03-15T19:48:20 hk-projec+ 18.45? hkn0845 18:16:10 2022-03-15T19:48:20 hk-projec+ 18.45? hkn0522 1-00:00:09 2022-03-16T11:23:53 hk-projec+ 18.45? hkn0522 1-00:00:08 2022-03-16T11:23:52 hk-projec+ 18.45? hkn0523 1-00:00:09 2022-03-16T11:23:53 hk-projec+ 18.45? hkn0523 1-00:00:08 2022-03-16T11:23:52 hk-projec+ cpuonly 18.45? hkn0015 1-12:00:12 2022-03-16T23:50:57 hk-projec+ 18.45? hkn0015 1-12:00:13 2022-03-16T23:50:58 hk-projec+ 18.45? hkn0015 1-12:00:12 2022-03-16T23:50:57 hk-projec+ 18.45? hkn0015 1-11:59:54 2022-03-16T23:51:00 hk-projec+ cpuonly 18.45? hkn[0315-0318] 16:18:14 2022-03-17T09:53:30 hk-projec+ 18.45? hkn0315 16:18:14 2022-03-17T09:53:30 hk-projec+ 18.45? 
hkn[0315-0318] 16:18:14 2022-03-17T09:53:30 hk-projec+ 18.45? hkn0128 2-23:59:05 2022-03-20T10:34:00 hk-projec+ accelerat+ 18.45? hkn0719 09:06:28 2022-03-18T00:44:05 hk-projec+ 18.45? hkn0719 09:06:28 2022-03-18T00:44:05 hk-projec+ 18.45? hkn0719 09:06:28 2022-03-18T00:44:05 hk-projec+ 18.45? hkn[1201,1206,+ 00:00:05 2022-03-17T14:23:59 hk-projec+ 18.45? hkn[1201,1206,+ 00:00:04 2022-03-17T14:34:35 hk-projec+ cpuonly 18.45? hkn[0229,0231-+ 1-20:13:26 2022-03-20T07:09:16 hk-projec+ 18.45? hkn[0229,0231-+ 1-20:13:27 2022-03-20T07:09:17 hk-projec+ 18.45? hkn[0231-0234] 1-20:12:33 2022-03-20T07:09:18 hk-projec+ cpuonly 18.45? hkn[0321-0325] 23:46:27 2022-03-19T16:17:55 hk-projec+ 18.45? hkn[0321-0325] 23:46:27 2022-03-19T16:17:55 hk-projec+ 18.45? hkn[0321-0325] 23:45:47 2022-03-19T16:17:35 hk-projec+ cpuonly 18.45? hkn[1141,1143,+ 23:46:27 2022-03-21T20:00:23 hk-projec+ 18.45? hkn[1141,1143,+ 23:46:27 2022-03-21T20:00:23 hk-projec+ 18.45? hkn[1141,1143,+ 23:45:48 2022-03-21T20:00:03 hk-projec+ cpuonly 18.45? hkn[0149-0150,+ 00:08:08 2022-03-21T17:05:06 hk-projec+ 18.45? hkn[0149-0150,+ 00:08:08 2022-03-21T17:05:06 hk-projec+ 18.45? hkn[0150,0153-+ 00:07:51 2022-03-21T17:05:11 hk-projec+ 18.45? hkn0337 02:30:27 2022-03-21T20:48:58 hk-projec+ cpuonly 18.45? hkn[0007-0016,+ 01:34:50 2022-03-23T09:44:48 hk-projec+ 18.45? hkn[0007-0016,+ 01:34:51 2022-03-23T09:44:49 hk-projec+ 18.45? hkn[0008-0016,+ 01:34:29 2022-03-23T09:44:50 hk-projec+ 18.45? hkn0111 09:39:43 2022-03-23T20:53:16 hk-projec+ cpuonly 18.45? hkn[0008-0016,+ 01:33:45 2022-03-23T14:04:53 hk-projec+ 18.45? hkn[0008-0016,+ 01:33:46 2022-03-23T14:04:54 hk-projec+ 18.45? hkn[0009-0016,+ 01:33:23 2022-03-23T14:04:55 hk-projec+ cpuonly 18.45? hkn[0149-0153,+ 01:33:19 2022-03-23T19:21:16 hk-projec+ 18.45? hkn[0149-0153,+ 01:33:19 2022-03-23T19:21:16 hk-projec+ 18.45? hkn[0150-0153,+ 01:32:59 2022-03-23T19:21:17 hk-projec+ cpuonly 18.45? hkn[1621-1625] 00:29:21 2022-03-24T04:07:34 hk-projec+ 18.45? hkn[1621-1625] 00:29:21 2022-03-24T04:07:34 The job length does not seem to have an influence on the occurance. 
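
As an aside on the arithmetic Carlos walked through above: the reported 18446744071445903664 J is exactly what an unsigned 64-bit subtraction produces when the current counter falls short of the stored baseline by 2,263,647,952 J (one more than the ULLONG_MAX difference Carlos computed, since the wrap is modulo 2^64). A minimal C illustration follows; the two counter values are hypothetical, chosen only to reproduce that figure, and this is not Slurm's accounting code:

```c
/* Illustration of the unsigned wrap-around described above.
 * The two counter values are hypothetical; they are chosen only so that
 * (current - baseline) mod 2^64 reproduces the 18446744071445903664 J
 * reported by sacct. This is NOT Slurm's accounting code. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t baseline = 2263647952;  /* energy counter at the step's first read */
    uint64_t current  = 0;           /* counter value after it "went backwards" */

    uint64_t consumed = current - baseline;  /* wraps modulo 2^64 */

    printf("consumed = %" PRIu64 " J\n", consumed);
    /* prints: consumed = 18446744071445903664 J */
    return 0;
}
```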
What i also noticed is the aveWatts field for most nodes do not work correctly: Most nodes display aveWatts=0 NodeName=hkn1669 Arch=x86_64 CoresPerSocket=38 CPUAlloc=0 CPUTot=152 CPULoad=59.05 AvailableFeatures=BEEOND,BEEOND_4MDS,BEEOND_MAXMDS,LSDF,ELCH,CPU ActiveFeatures=BEEOND,BEEOND_4MDS,BEEOND_MAXMDS,LSDF,ELCH,CPU Gres=(null) NodeAddr=hkn1669 NodeHostName=hkn1669 Version=20.11.5 OS=Linux 4.18.0-193.60.2.el8_2.x86_64 #1 SMP Wed Jul 7 07:08:14 EDT 2021 RealMemory=243200 AllocMem=0 FreeMem=241452 Sockets=2 Boards=1 State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A Partitions=cpuonly BootTime=2021-08-06T18:01:26 SlurmdStartTime=2022-01-13T13:09:56 CfgTRES=cpu=152,mem=237.50G,billing=152 AllocTRES= CapWatts=n/a CurrentWatts=406 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s Comment=(null) Some have a to high value on these nodes the wrong reading propably happened: NodeName=hkn1707 Arch=x86_64 CoresPerSocket=38 CPUAlloc=152 CPUTot=152 CPULoad=240.67 AvailableFeatures=BEEOND,BEEOND_4MDS,BEEOND_MAXMDS,LSDF,ELCH,ACC ActiveFeatures=BEEOND,BEEOND_4MDS,BEEOND_MAXMDS,LSDF,ELCH,ACC Gres=gpu:4 NodeAddr=hkn1707 NodeHostName=hkn1707 Version=20.11.5 OS=Linux 4.18.0-193.60.2.el8_2.x86_64 #1 SMP Wed Jul 7 07:08:14 EDT 2021 RealMemory=501600 AllocMem=501600 FreeMem=464304 Sockets=2 Boards=1 State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A Partitions=haicore-gpu4 BootTime=2022-01-31T16:52:31 SlurmdStartTime=2022-01-31T16:53:44 CfgTRES=cpu=152,mem=501600M,billing=152,gres/gpu=4 AllocTRES=cpu=152,mem=501600M,gres/gpu=4 CapWatts=n/a CurrentWatts=713 AveWatts=36645 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s Comment=(null) And some nodes have a reasonable entry. Regards, Pascal Pascal, I'll send you a small debug patch soon, to apply on top of your production slurm so you can have a bit more info w/o increasing the debug level; something that is not recommended in production to avoid perf impact. This way, I expect to narrow the issue and luckily hit the bug. Cheers, Carlos. Created attachment 24228 [details]
debug patch
Pascal,
If you can, please give this debug patch a try.
Wait until the error reproduces, collect the slurmd.log file from an affected node, and get back to us with a jobid, the scontrol show job [ID] output, and the slurmd.log.
I think that, whatever is happening, it should show up with this patch. If you feel the issue is fairly easy to reproduce, don't apply the patch to the whole cluster; apply it to a single node and make that node fail.
Cheers,
Carlos.
Please, Pascal, give us some feedback so we can keep our work here alive. Thanks, Carlos.

Pascal, I'm sorry, but we need to close the issue as timed out. Please get back to us if you still have issues, whenever you have time. Regards, Carlos.

Created attachment 25132 [details]
Logs cpuonly Node
Created attachment 25133 [details]
Logs GPU Node
Hello Carlos, Sorry for the very late response. we had a lot to do the last few weeks. I installed your debugg patch and included logs of the bug happening on computenodes. We also upgraded to slurm 21.08. Here are the timestamps of the bug occuring. sacct --starttime 2022-04-26 --nodelist hkn0712 --format partition,ConsumedEnergy,NodeList,Elapsed,start,end,jobname,state,jobid | grep "?" accelerat+ 18.45? hkn[0709-0712] 01:01:22 2022-05-05T15:20:37 2022-05-05T16:21:59 interacti+ COMPLETED 1716871 18.45? hkn[0709-0712] 01:01:22 2022-05-05T15:20:37 2022-05-05T16:21:59 extern COMPLETED 1716871.ext+ accelerat+ 18.45? hkn0712 00:20:51 2022-05-09T17:54:30 2022-05-09T18:15:21 966_DDDDP+ COMPLETED 1736826 18.45? hkn0712 00:20:51 2022-05-09T17:54:30 2022-05-09T18:15:21 batch COMPLETED 1736826.bat+ 18.45? hkn0712 00:20:51 2022-05-09T17:54:30 2022-05-09T18:15:21 extern COMPLETED 1736826.ext+ accelerat+ 18.45? hkn0712 00:18:27 2022-05-09T18:33:56 2022-05-09T18:52:23 984_DDDDP+ COMPLETED 1737575 18.45? hkn0712 00:18:27 2022-05-09T18:33:56 2022-05-09T18:52:23 batch COMPLETED 1737575.bat+ 18.45? hkn0712 00:18:27 2022-05-09T18:33:56 2022-05-09T18:52:23 extern COMPLETED 1737575.ext+ sacct --starttime 2022-04-26 --nodelist hkn0319 --format partition,ConsumedEnergy,NodeList,Elapsed,start,end,jobname,state,jobid | grep "?" cpuonly 18.45? hkn[0318-0320,+ 02:56:32 2022-04-26T15:26:43 2022-04-26T18:23:15 rd_5 COMPLETED 1661179 18.45? hkn[0318-0320,+ 02:56:32 2022-04-26T15:26:43 2022-04-26T18:23:15 extern COMPLETED 1661179.ext+ 18.45? hkn[0319-0320,+ 02:56:13 2022-04-26T15:27:03 2022-04-26T18:23:16 orted COMPLETED 1661179.0 cpuonly 18.45? hkn0319 06:57:47 2022-05-09T14:53:51 2022-05-09T21:51:38 atomistica COMPLETED 1733052 18.45? hkn0319 06:57:47 2022-05-09T14:53:51 2022-05-09T21:51:38 batch COMPLETED 1733052.bat+ 18.45? hkn0319 06:57:47 2022-05-09T14:53:51 2022-05-09T21:51:38 extern COMPLETED 1733052.ext+ cpuonly 18.45? hkn[0239,0260-+ 1-20:38:33 2022-05-17T17:54:30 2022-05-19T14:33:03 E8R1.5e12 COMPLETED 1743934 18.45? hkn[0239,0260-+ 1-20:38:33 2022-05-17T17:54:30 2022-05-19T14:33:03 extern COMPLETED 1743934.ext+ 18.45? hkn[0260-0261,+ 1-20:38:15 2022-05-17T17:54:49 2022-05-19T14:33:04 orted COMPLETED 1743934.0 Best Regards, Pascal That's fairly interesting. I don't really know why -by now, but it seems that node energy from node(s?) gets reseted to ~0 every time I see this in the logs: error: _find_node_record: lookup failure for node "hkn1998" error: WARNING: Invalid hostnames in switch configuration: hkn1998 At least, there's a correlation in the logs with these 2 events. Need to see if it's causality or casualty. Thanks, Carlos. Would you mind to collect same type of logs with DebugFlags=Energy in slurm.conf for at least 1 node. I don't really need all cluster. If you know more or less how to make it fail, doing so for a node is enough. Thanks, Carlos. Hello Carlos, I will set the DebugFlags=Energy on a partition, since i do not know how to trigger the bug. On another node i noticed the bug seems to often occur in bursts rather than at completely random times. I added an attachment with the errors occurring over the last 4 weeks. Regards, Pascal Created attachment 25175 [details]
error last 4 weeks
> since i do not know how to trigger the bug.
Sounds good, don't worry. I completely understand; this issue is rather subtle and has me a bit puzzled right now.
Pascal, have you been able to set DebugFlags=Energy and collect the log I need? Thanks, Carlos.

Hello, a small update: I have had SlurmdDebug=debug and DebugFlags=Energy set on all nodes of the cluster since 27.05. The bug has not appeared since then, which is a bit strange. I will send the logs as soon as it happens. Regards, Pascal

Hi Pascal, please update us whenever possible. It's been more than a week since our last communication. I hope you get this reproduced soon. Regards, Carlos.

Hello Carlos, the debug flags have been turned on for over two weeks now and the error has not appeared; it seems as if it does not happen while they are turned on. Regards, Pascal

> it seems as if it does not happen while they are turned on.
That's fairly strange, as the debug flags only enable log printing, nothing else. That is easy to verify by looking at the plugin's source, which is not a big file.
If you aren't able to get this reproduced with debug enabled, I wonder whether the slurmd restart fixed the issue or whether it's just a coincidence. All in all, keep running with debug enabled for as long as you need/can to get this reproduced. If you can't reproduce it, revert the debug flags and check whether the issue starts happening again.
Even if you can't reproduce the issue, this bit of information is useful as well; I'll take it into account when trying to understand why this happens.
Thanks,
Carlos.
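
To illustrate Carlos's point that a DebugFlags bit only gates whether a message is printed, and cannot change the measurement itself, here is a minimal sketch in the same spirit. The flag name, variables, and function are hypothetical, not Slurm's actual logging code:

```c
/* Minimal sketch: a debug flag gates only the printing of a log line.
 * The energy computation is identical whether or not the flag is set.
 * Flag name, variable names, and function are hypothetical, not Slurm's. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define DEBUG_ENERGY 0x1u
static unsigned debug_flags = 0;  /* e.g. set from DebugFlags=Energy */

static uint64_t account_sample(uint64_t current, uint64_t baseline)
{
    uint64_t consumed = current - baseline;   /* same math in both cases */

    if (debug_flags & DEBUG_ENERGY)           /* only the print is gated */
        printf("ENERGY: consumed %" PRIu64 " J\n", consumed);

    return consumed;
}

int main(void)
{
    uint64_t quiet = account_sample(6938, 3615);   /* flag off: silent         */
    debug_flags = DEBUG_ENERGY;
    uint64_t loud  = account_sample(6938, 3615);   /* flag on: prints the line */

    printf("results identical: %s\n", quiet == loud ? "yes" : "no");
    return 0;
}
```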
Created attachment 25557 [details]
Error occurring
Created attachment 25558 [details]
slurmd.log
Hello Carlos, Today at 10:50 the Slurm debug mode was turned off and the error immediately occurred again, as you can see in the attached file sacct_output. I also uploaded a slurmd.log of the occurrence. Best Regards, Pascal

Debug may have been turned off, but I still see DebugFlags "ENERGY" logs, and those are helpful. Let's see if I'm able to spot the issue now. Thanks, Carlos.

According to the logs:

Joules, elapsed time: 5 seconds, first read energy counter val: 2374876713 ave watts: 463
[2022-06-17T10:50:41.416] [1756422.interactive] acct_gather_energy/xcc: _get_joules_task: ENERGY: consumed 3323243 Joules (received 6938(692 watts) from slurmd)
[2022-06-17T10:50:42.576] got reconfigure request
[2022-06-17T10:50:42.576] all threads complete
[2022-06-17T10:50:42.577] debug: Reading slurm.conf file: /etc/slurm/slurm.conf
[2022-06-17T10:50:42.583] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
[2022-06-17T10:50:42.584] [1756422.extern] debug: Handling REQUEST_STEP_RECONFIGURE
[2022-06-17T10:50:42.584] [1756422.extern] debug: _handle_reconfigure for StepId=1756422.extern successful
[2022-06-17T10:50:42.584] [1756422.interactive] debug: Handling REQUEST_STEP_RECONFIGURE
[2022-06-17T10:50:42.585] [1756422.interactive] debug: _handle_reconfigure for StepId=1756422.interactive successful
[2022-06-17T10:50:42.588] error: _find_node_record: lookup failure for node "hkn1998"
[2022-06-17T10:50:42.588] error: WARNING: Invalid hostnames in switch configuration: hkn1998
[2022-06-17T10:50:42.596] [1756422.extern] debug: Handling REQUEST_STATE
[2022-06-17T10:50:42.596] [1756422.interactive] debug: Handling REQUEST_STATE
[2022-06-17T10:50:48.357] [1756422.extern] acct_gather_energy/xcc: _get_joules_task: _get_joules_task: CurrEn<1stEn (2818<1625951988)
[2022-06-17T10:50:48.357] [1756422.extern] acct_gather_energy/xcc: _get_joules_task: ENERGY: consumed 18446744072083602446 Joules (received 18446744072080279129(564 watts) from slurmd)

The error was indeed caused by a reconfig. XCC doesn't have support for reconfig or slurmd restart: when you fire a reconfig, the plugin's structures are freed, and thus negative numbers arise. This is an ongoing improvement internally tracked in Bug 13778. If you are keen to test it, I can send you an early access patch for 22.05. I encourage you to upgrade to the latest released 22.05 version as soon as possible, so you can test it. Regards, Carlos.

Hello Carlos, Indeed it seems to be triggered by the Slurm reconfigure, and I can reliably reproduce the error. I would like to test the early access patch. Best Regards, Pascal
For this to happen, you need your cluster on the latest 22.05 release.
I'll deliver the patch to you as early access or, if the review finishes first, as the commit IDs.
Please let us know whether you are already on 22.05 or have planned to move to it.
Regards,
Carlos.
Hello, We plan to go to 22.05 as soon as possible. For testing purposes, our test cluster can be upgraded at any time. Best Regards, Pascal

Hi Pascal, We are tracking the improvement of some energy plugins in an internal bug, and this is part of that work. Once it is ready for disclosure, I'll provide you with the fix. Right now it is under review. Thanks for your patience. Regards.

Hi Pascal, This issue has been solved as part of this series of commits: https://github.com/SchedMD/slurm/compare/ca032023fe89...78fe518aa095 (already in master). I'm going to close the issue as fixed.
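
Editorial note: the root cause identified above (the baseline energy reading is lost on scontrol reconfig, so the next sample sees CurrEn < 1stEn and the unsigned subtraction wraps) can be sketched with a defensive guard like the one below. This is only an illustration with assumed names; the actual fix is the commit series linked in the closing comment.

```c
/* Illustrative guard against the CurrEn < 1stEn case seen in the slurmd log
 * (2818 < 1625951988). Struct and function names are hypothetical; this is
 * not the code that was committed for Bug 13778. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct energy_state {
    uint64_t first_read;  /* baseline counter value at step start */
    uint64_t consumed;    /* accumulated joules reported so far   */
};

static void update_consumed(struct energy_state *st, uint64_t current)
{
    if (current < st->first_read) {
        /* The counter appears to have gone backwards, e.g. because the
         * baseline was lost across a reconfig. Re-baseline instead of
         * letting the unsigned subtraction wrap to ~1.8e19. */
        fprintf(stderr, "energy counter reset detected, re-baselining\n");
        st->first_read = current;
        return;
    }
    st->consumed = current - st->first_read;
}

int main(void)
{
    struct energy_state st = { .first_read = 1625951988, .consumed = 0 };

    update_consumed(&st, 2818);  /* would have wrapped without the guard */
    printf("consumed = %" PRIu64 " J\n", st.consumed);  /* 0, not ~1.8e19 */
    return 0;
}
```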