Summary: | acct_gather_energy/xcc fails on Lenovo SD650v2/SD650-Nv2 nodes | |
---|---|---|---
Product: | Slurm | Reporter: | Simon Raffeiner <simon.raffeiner>
Component: | slurmd | Assignee: | Carlos Tripiana Montes <tripiana>
Status: | RESOLVED FIXED | QA Contact: |
Severity: | 4 - Minor Issue | |
Priority: | --- | CC: | alex, cinek, felip.moll, pascal.schuhmacher, tim
Version: | 20.11.5 | |
Hardware: | Linux | |
OS: | Linux | |
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6743, https://bugs.schedmd.com/show_bug.cgi?id=13596 | |
Site: | KIT | Version Fixed: | 22.05.0pre1
Ticket Depends on: | 12868 | |
Attachments: | Single energy command; v3; style changes already merged; v4; IPMI open FDs quickfix test | |
Description
Simon Raffeiner 2021-04-08 01:18:05 MDT
Hi Simon,

We don't have access to such a HW environment right now, but until now 16 was the expected value. It would be good to know Lenovo's answer about this as well. Is there any way we can get access to test HW with a reproduction environment?

Thanks for reporting.

Carlos.

Created attachment 18954 [details]
Single energy command
Please check the attached image. You will see the expected return of the faulty command.

Additionally, you can send us your slurm.conf and your xcc plugin configuration (acct_gather.conf), plus some logs, and we'll see if we spot something else. But I'm pretty sure the root of the issue is the new HW returning the data for this command in a different format.

[Out-of-office auto-reply from Simon Raffeiner: on vacation until May 24, 2021; technical questions via the bwSupportPortal, https://bw-support.scc.kit.edu/; organisational questions to Dr. Jennifer Buchmüller (jennifer.buchmueller@kit.edu).]

Dear Simon,

We have a patch proposal that must be tested. Please forward this to Lenovo and ask them to give us access to some test HW on which I can compile and test Slurm as part of the test stage for this patch. Thanks.

Hi Simon,

We have a patch proposal but we can't test it on the real HW, since we're unable to get access to it. I tried to get an account on HoreKA 2 months ago (before my paternity leave), but I was unable to get the credentials, and I'm still unable to get them. I'm closing this issue for now. If the interested party wakes up again and provides us access to the HW, we can test it. Thanks for your understanding.

Regards,
Carlos.

Hello Carlos,

Sorry, we had a lot going on the past months. I'm reopening the issue, so we can get the patch tested on the HW. How far did you get with getting access to the HW? As far as I was informed, an account had already been created for you. I'm happy to provide support for getting access to the HW.

Regards,
Pascal

Hi Pascal,

I've been unable to get access since I couldn't get the password. Please get in touch with me by email at tripiana@schedmd.com so we can get access to HoreKA.

Thanks,
Carlos.

Good news is that I'm able to access. Good news is that I've been able to download Slurm master from git, patch it, configure it, and compile it. Good news is that I've been able to configure Slurm and start slurmctld. Bad news is, of course, that I'm not able to run slurmd because it needs root. So now I'm not sure how to proceed to test the patched version. I'm ic6849, Slurm is in $HOME/slurm, and slurmctld is running now on hkn1991. I'll wait for your feedback before doing anything else here.

Thanks,
Carlos.

By now I'm able to run slurmd as a normal user as well, but I need to check whether I'm able (have rights) to gather power metrics from the device. I'll keep you posted.

Regards,
Carlos.
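For context, here is a sketch of how the two formats could be told apart, assuming Carlos's "16 was the expected value" refers to the length of the raw IPMI response the plugin reads back (all names below are hypothetical, not the plugin's actual code; the real changes are in the attachments on this ticket):

```c
/* Hypothetical sketch: dispatch on the raw response length so the old
 * SD650 format (16 bytes expected until now) and the new SD650v2 format
 * can coexist instead of treating the new layout as a failed read. */
enum xcc_fmt { XCC_FMT_UNKNOWN, XCC_FMT_SD650, XCC_FMT_SD650V2 };

static enum xcc_fmt classify_response(int rs_len)
{
	if (rs_len == 16)	/* the value expected until now */
		return XCC_FMT_SD650;
	if (rs_len > 0)		/* assume anything else is the new layout */
		return XCC_FMT_SD650V2;
	return XCC_FMT_UNKNOWN;	/* blank or failed read */
}
```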
As I was thinking:

[2021-11-22T17:31:45.720] debug2: No acct_gather.conf file (/home/hk-project-abnahme/ic6849/slurm/inst/etc/acct_gather.conf)
[2021-11-22T17:31:45.720] acct_gather_energy/xcc: acct_gather_energy_p_conf_set: AcctGatherEnergy XCC plugin loaded
[2021-11-22T17:31:45.721] error: _init_ipmi_config: error : must be root to open ipmi devices
[2021-11-22T17:31:45.721] error: AcctGatherEnergy XCC plugin threads failed to start in a timely manner
[2021-11-22T17:31:45.724] debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2021-11-22T17:32:13.559] debug3: in the service_connection

I can't work with IPMI as a normal user.

Hello Carlos,

I already wrote to you per e-mail, but I'm going to write here as well. The node you were working on was a login node. Two nodes have been reserved for you: hkn0372 and hkn0602. You can log in to them via ssh from the login node (e.g. ssh hkn0372). They are of type Lenovo_SD650_v2 and Lenovo_SD650-N_v2 respectively. Here you should have the permissions to use IPMI. If you need more permissions, don't hesitate to ask.

Best regards,
Pascal

Thanks, Pascal, for the efforts from your side. What I need now is to execute the following with sudo:

[ic6849@hkn0372 inst]$ pwd
/home/hk-project-abnahme/ic6849/slurm/inst
[ic6849@hkn0372 inst]$ . setup.sh
[ic6849@hkn0372 inst]$ sudo slurmd -Dvvvvvv -N n1 <<-----

I'll do it on the 2 nodes reserved for this purpose.

Cheers,
Carlos.

Reasoning for asking for sudo:

[2021-11-26T09:15:49.814] debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2021-11-26T09:15:49.814] debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2021-11-26T09:15:49.814] error: _init_ipmi_config: error : must be root to open ipmi devices
[2021-11-26T09:15:49.814] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2021-11-26T09:15:49.814] [2.batch] acct_gather_energy/xcc: _get_joules_task: we got a blank

And similar errors.

[Out-of-office auto-reply from Simon Raffeiner: back on November 29; technical questions via the NHR Support Portal (https://support.nhr.kit.edu) or the bwSupportPortal (https://bw-support.scc.kit.edu/); very urgent cases to Dr. René Caspart (rene.caspart@kit.edu).]

Hello,

You should now be able to use: sudo slurmd -Dvvvvvv -N n1. Right now you can only use this exact command; if you need more, tell us.

Regards,
Pascal

Maybe this is something you should be aware of, but it's not working yet:
[ic6849@hkn0372 inst]$ sudo slurmd -Dvvvvvv -N n1
>>> /etc/sudoers.d/sudoers_schedmd: syntax error near line 4 <<<
sudo: parse error in /etc/sudoers.d/sudoers_schedmd near line 4
sudo: no valid sudoers sources found, quitting
sudo: unable to initialize policy plugin
Thanks,
Carlos.
Hello,

I fixed the error. These two commands should work:

/usr/sbin/slurmd -Dvvvvvv -N n1
/sbin/slurmd -Dvvvvvv -N n1

Regards,
Pascal

Ah... no, no. It's not:

/usr/sbin/slurmd -Dvvvvvv -N n1
/sbin/slurmd -Dvvvvvv -N n1

It is:

/home/hk-project-abnahme/ic6849/slurm/inst/sbin/slurmd

because my installation is local and I'm working there.

OK, you should be able to use slurmd -Dvvvvvv -N n1 with your local installation.

The patched version seems to be working, or at least it looks promising:

[ic6849@hkn0372 inst]$ scontrol show nodes
NodeName=n1 Arch=x86_64 CoresPerSocket=1 CPUAlloc=0 CPUTot=2 CPULoad=0.03
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:tesla:1,gpu:p100:1,gpu:k80:1
   NodeAddr=127.0.0.1 NodeHostName=n1 Port=17001 Version=22.05.0-0pre1
   OS=Linux 4.18.0-193.60.2.el8_2.x86_64 #1 SMP Wed Jul 7 07:08:14 EDT 2021
   RealMemory=30000 AllocMem=0 FreeMem=242986 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2021-08-06T18:19:43 SlurmdStartTime=2021-11-26T13:42:57
   LastBusyTime=2021-11-26T13:45:38
   CfgTRES=cpu=2,mem=30000M,billing=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=429760 AveWatts=298601
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Since CurrentWatts=429760 AveWatts=298601, there's some value here... Well, it seems a quite high value, doesn't it? I need further debugging, but the point now is that I'm able to iterate over the code and test it until we have a functional patch.

Cheers,
Carlos.

[2021-11-26T14:18:39.883] acct_gather_energy/xcc: _sd650v2_update_node_energy: ENERGY: _sd650v2_update_node_energy: XCC current_watts: 427 consumed energy last interval: 12720(current reading 137955) Joules, elapsed time: 30 Seconds, ave watts: 380

That is making a lot more sense, I suppose. Could you please check that these values make sense?

Hello Carlos,

Thank you for your work. The values make sense since it's a node of type Lenovo SD650v2 without an accelerator. If you want, you can try it on the other node hkn0602 of type Lenovo SD650-Nv2, which has GPU accelerators and should show a higher value.

Best regards,
Pascal

Go for it. Do I have sudo on hkn0602 as well for the slurmd command?

Thanks,
Carlos.

Yes, you have the same permissions on both nodes.

On hkn0602 I see:

[2021-11-26T15:09:20.534] acct_gather_energy/xcc: _sd650_update_node_energy: ENERGY: _sd650_update_node_energy: XCC current_watts: 636 consumed energy last interval: 19092(current reading 306002) Joules, elapsed time: 30 Seconds, first read energy counter val: 2310595332 ave watts: 635

But I must say that I'm not able to see any available GPU there. Still, it's twice the consumption :). Anyway, it seems like the patch is almost ready, if you agree with what you are seeing.

Cheers,
Carlos.

Hello,

The values are as expected. Thanks again for your work.

Cheers,
Pascal

I'm going to double-check everything then and ask for formal QA review on this. Thanks for your time.

Created attachment 22414 [details]
v3
This is the final reviewed version of the new functionality, which works after testing it on HoreKA.
I'm going to start the reviewing process now. Thanks.
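As a sanity check on the numbers above: reading the log line, the XCC energy counter is cumulative, so the interval power can be recomputed from the deltas. A minimal sketch (values copied from the log; variable names are ours, not the plugin's):

```c
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
	/* Values copied from the _sd650v2_update_node_energy log line. */
	uint64_t curr_joules = 137955;	/* cumulative "current reading" */
	uint64_t interval_j  = 12720;	/* "consumed energy last interval" */
	uint64_t elapsed_s   = 30;	/* "elapsed time: 30 Seconds" */

	uint64_t prev_joules = curr_joules - interval_j;
	double interval_w = (double) interval_j / (double) elapsed_s;

	printf("previous reading: %" PRIu64 " J\n", prev_joules);
	/* ~424 W, consistent with the instantaneous current_watts of 427;
	 * "ave watts: 380" is the running average since the first read. */
	printf("interval power: %.0f W\n", interval_w);
	return 0;
}
```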
Hello Carlos,

How long does the review process take, and when can a patched version be expected? We would like to test the patch as soon as possible.

PS: we are currently using Slurm version 20.11.5.

Regards,
Pascal

Hello Carlos,

We tested your patch on our cluster. The energy readings work fine, but on the CPU-only nodes (Lenovo SD650v2) a kernel error occurs randomly and the slurmd.service gets killed:

kernel: traps: slurmd[2543817] trap divide error ip:1467f19dd595 sp:1467ebffecc0 error:0 in acct_gather_energy_xcc.so[1467f19db000+6000]
systemd[1]: Started Process Core Dump (PID 2550389/UID 0).
systemd[1]: slurmd.service: Main process exited, code=killed, status=8/FPE
systemd[1]: slurmd.service: Failed with result 'signal'.

The reason for the error seems to be line 501: xcc_reading->w /= count;

root@hktn1671:/# addr2line -e /usr/lib64/slurm/acct_gather_energy_xcc.so 0x2595
/root/rpmbuild/BUILD/slurm-20.11.5-2/src/plugins/acct_gather_energy/xcc/acct_gather_energy_xcc.c:501

The error only occurs on the Lenovo SD650v2 nodes. On nodes with accelerators (Lenovo SD650-Nv2) the patch works fine without any errors.

Regards,
Pascal

[Out-of-office auto-reply from Simon Raffeiner: back on January 3rd, 2022; technical questions via the NHR Support Portal (https://support.nhr.kit.edu) or the bwSupportPortal (https://bw-support.scc.kit.edu/); very urgent cases to Dr. Jennifer Buchmüller (jennifer.buchmueller@kit.edu).]

Hi,

Carlos is currently on vacation till January 10th. I'll see if I can hand you a workaround for the division-by-zero issue on Thursday this week, or otherwise early next week, so that you can test locally. In any case, what Carlos handed over as a local test patch won't be the definitive changes that will be merged into Slurm. With regards to the review process, we're still doing some last iterations on the patchset, and I'll need further feedback on some suggested changes once Carlos is back.

Hi,

I'm attaching two patch sets. The first set is only style changes, and they are already merged in Slurm master as commits:

eeca14d9d9 acct_gather_energy/xcc - add comment to clarify how response is parsed.
37eed1dab8 acct_gather_energy/xcc - comment xcc_raw_single_data_t members.
1bdaa16760 acct_gather_energy/xcc - fix operands style.
a10e829abc acct_gather_energy/xcc - fix spaces style.

Then there's a second patch set, not yet merged (still in the review process), that should work around the division-by-zero issue; it should be applied on top of the previous style patch set.

Please test them and let us know if the reported metrics look as expected for both the old and the new hardware formats.
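We can't inline the attached patch here, but the shape of the workaround for the SIGFPE is clear from the addr2line output above: guard the averaging division. A hypothetical sketch (struct and names mimic the report, not the actual patch):

```c
#include <stdint.h>

/* Mimics the averaging at acct_gather_energy_xcc.c:501. `w` holds summed
 * watt samples and `count` the number of samples; on SD650v2 (CPU-only)
 * nodes an interval can end with count == 0, and the unguarded division
 * then raises SIGFPE ("trap divide error" in the kernel log above). */
typedef struct {
	uint32_t w;	/* summed watt samples; becomes the average */
} xcc_reading_t;

static void average_watts(xcc_reading_t *xcc_reading, uint32_t count)
{
	if (count)
		xcc_reading->w /= count;
	/* else: no samples this interval; keep the previous value
	 * instead of dividing by zero. */
}
```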
Created attachment 22817 [details]
style changes already merged
First patch set with style changes. Apply this first.
Created attachment 22818 [details]
v4
v4 patch set with actual changes. To be applied on top of the previous style changes patch set.
Hi,

Thank you for the patches. The gathering of energy seems to work and the reported metrics look as expected; the prior kernel errors do not happen anymore. But with the plugin enabled, slurmd randomly gets killed on nodes during jobs with a buffer overflow:

Jan 5 22:15:58 hktn1671 slurmd[3827728]: *** buffer overflow detected ***: /usr/sbin/slurmd terminated
Jan 5 22:15:58 hktn1671 abrt-hook-ccpp[3865415]: Process 3827728 (slurmd) of user 0 killed by SIGABRT - dumping core

The corresponding core dump backtrace on the node:

(gdb) thread apply all bt full

Thread 3 (Thread 0x15498ec7a700 (LWP 3827731)):
#0  0x0000154997a3d7da in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00001549946a90cf in _thread_ipmi_run (no_data=<optimized out>) at acct_gather_energy_xcc.c:783
        err = <optimized out>
        tvnow = {tv_sec = 1641391314, tv_usec = 541545}
        abs = {tv_sec = 1641417384, tv_nsec = 541545000}
        __func__ = "_thread_ipmi_run"
#2  0x0000154997a372de in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x0000154997768e83 in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x154999884780 (LWP 3827728)):
#0  0x0000154997a40cb7 in accept () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x0000154998f8d75b in slurm_accept_msg_conn (fd=<optimized out>, addr=<optimized out>) at slurm_protocol_socket.c:452
        len = 128
#2  0x0000000000411f83 in _msg_engine () at slurmd.c:473
        cli = 0x20c7890
        sock = <optimized out>
        cli = <optimized out>
        sock = <optimized out>
        __func__ = "_msg_engine"
        rpc_wait = <optimized out>
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
#3  main (argc=<optimized out>, argv=<optimized out>) at slurmd.c:394
        i = <optimized out>
        pidfd = 6
        blocked_signals = {13, 0}
        cc = <optimized out>
        oom_value = <optimized out>
        curr_uid = <optimized out>
        time_stamp = "Wed, 05 Jan 2022 15:01:54 +0100\000\240\233\017\227\000\000\000\000\377\377\377\377\377\377\377\377\000\000\000\000\000\000\000\000\270\267\n\317\377\037\000\000\240\337*<\377\177\000\000P.\002\002\000\000\000\000\340\336*<\377\177\000\000`\337*<\377\177\000\000\253\064\244\230I\025\000\000\003+\244\230I\025\000\000\003\000\000\000\000\000\000\000@\001\000\000\000\000\000\000\200\000\000\000\000\000\000\000\000\026\023jRg\016֠\"\002\002\000\000\000\000\240\"\002\002", '\000' <repeats 20 times>, "\030\241ĘI\025\000\000\360/\002\002\000\000\000\000\f0\002\002\000\000\000\000\003\273"...
        lopts = <optimized out>
        __func__ = "main"

Thread 1 (Thread 0x154999883700 (LWP 3865414)):
#0  0x00001549976a470f in raise () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x000015499768eb25 in abort () from /usr/lib64/libc.so.6
No symbol table info available.
#2  0x00001549976e7897 in __libc_message () from /usr/lib64/libc.so.6
No symbol table info available.
#3  0x000015499777a905 in __fortify_fail_abort () from /usr/lib64/libc.so.6
No symbol table info available.
#4  0x000015499777a937 in __fortify_fail () from /usr/lib64/libc.so.6
No symbol table info available.
#5  0x00001549977788f6 in __chk_fail () from /usr/lib64/libc.so.6
No symbol table info available.
#6  0x000015499777a82b in __fdelt_warn () from /usr/lib64/libc.so.6
No symbol table info available.
#7  0x000015498fa6b604 in _openipmi_read () from /usr/lib64/libfreeipmi.so.17
No symbol table info available.
#8  0x000015498fa6bc2f in ipmi_openipmi_cmd () from /usr/lib64/libfreeipmi.so.17
No symbol table info available.
#9  0x000015498fa4a48f in api_openipmi_cmd () from /usr/lib64/libfreeipmi.so.17
No symbol table info available.
#10 0x000015498fa4a58e in api_openipmi_cmd_raw () from /usr/lib64/libfreeipmi.so.17
No symbol table info available.
#11 0x000015498fa36ef3 in ipmi_cmd_raw () from /usr/lib64/libfreeipmi.so.17
No symbol table info available.
#12 0x00001549946a83b9 in _read_ipmi_values () at acct_gather_energy_xcc.c:444
        xcc_reading = <optimized out>
        buf_rs = "\020\000\000\000I\025\000\000\320)\210\231I\025\000\000\220)\210\231I\025\000\000\000\026\023jRg\016\326\000\000y\n\326a1\002\306\062o\227I\025\000\000\002\b\000\000\000\000\000\000\233\023o\227I\025\000\000\200\231*\210I\025\000\000\a\000\000\000\000\000\000\000\232\062\002\210I\025\000\000\360\002\000\000\000\000\000\000\340\002\000\000\000\000\000\000/\000\000\000I\025\000\000\250\063\002\210I\025\000\000\020\003\000\000\000\000\000\000\250\377\377\377\377\377\377\377\v\000\000\000\000\000\000\000/\000\000\000;\000\000\000\\\000\000\000|\000\000\000\001", '\000' <repeats 23 times>, "w\000\000\000n\000\000\000\b\000\000\000\000\000\000\000"...
        rs_len = 0
        __func__ = "_read_ipmi_values"
#13 0x00001549946a8648 in _thread_update_node_energy () at acct_gather_energy_xcc.c:663
        xcc_raw = <optimized out>
        __func__ = "_thread_update_node_energy"
#14 0x00001549946a9577 in acct_gather_energy_p_get_data (data_type=<optimized out>, data=0x15498827e8f0) at acct_gather_energy_xcc.c:954
        rc = 0
        energy = 0x15498827e8f0
        last_poll = 0x15498827e8f0
        sensor_cnt = 0x15498827e8f0
        __func__ = "acct_gather_energy_p_get_data"
#15 0x0000154998f35e38 in acct_gather_energy_g_get_data (context_id=0, data_type=data_type@entry=ENERGY_DATA_JOULES_TASK, data=0x15498827e8f0) at slurm_acct_gather_energy.c:369
        retval = -1
        __func__ = "acct_gather_energy_g_get_data"
#16 0x0000000000417357 in _rpc_acct_gather_energy (msg=msg@entry=0x154988025b10) at req.c:3315
        resp_msg = {address = {ss_family = 39312, __ss_padding = "*\210I\025\000\000 \000\000\210I\025\000\000\024\000\000\000\000\000\000\000\320Q\000\000\000\000\000\000P梗I\025\000\000\060\236*\210I\025\000\000\370[\002\210I\025\000\000\306\062o\227I\025", '\000' <repeats 11 times>, ",\210\231I\025\000\000CI\377\230I\025\000\000t\004\000\000\000\000\000\000\004\000\000\000\000\000\000\000\355w\374\230I\025\000\000\000\061(\210I\025\000", __ss_align = 9216}, auth_cred = 0x154988025bc0, auth_index = -2010631936, auth_uid = 5449, auth_uid_set = 64, body_offset = 5449, buffer = 0x154998f6a529 <_unpack_acct_gather_energy_req+70>, conn = 0x154998ff5870 <__func__.19392>, conn_fd = -1728666239, data = 0x154988283100, data_size = 2281855760, flags = 5449, msg_index = 0, msg_type = 12544, protocol_version = 34856, forward = {cnt = 54064, init = 34857, nodelist = 0x154999882cc0 "\200.\210\231I\025", timeout = 1779635712, tree_width = 26450}, forward_struct = 0x15498829d34c, orig_addr = {ss_family = 11448, __ss_padding = "\210\231I\025\000\000\240,\210\231I\025\000\000\000\000\000\000\000\000\000\000\020[\002\210I\025\000\000\060\323)\210I\025\000\000\370[\002\210I\025\000\000\202y\374\230I\025\000\000\060\323)\210I\025\000\000\000\061(\210I\025\000\000\300,\210\231I\025\000\000u\371\363\230I\025\000\000\060\323)\210I\025\000\000\000\061(\210I\025\000\000\200.\210\231I\025\000\000`F\365\230I\025\000", __ss_align = 0}, ret_list = 0x154988139530}
        data_type = 0
        sensor_cnt = 1
        acct_msg = {energy = 0x15498827e8f0, node_name = 0x0, sensor_cnt = 1}
        now = <optimized out>
        last_poll = 1641417337
        req = 0x1549882a9180
        rc = 0
        req_uid = <optimized out>
        first_msg = false
#17 0x000000000041ea80 in slurmd_req (msg=msg@entry=0x154988025b10) at req.c:349
        __func__ = "slurmd_req"
#18 0x000000000040f49e in _service_connection (arg=<optimized out>) at slurmd.c:581
        con = 0x2040230
        msg = 0x154988025b10
        __func__ = "_service_connection"
        rc = 0
#19 0x0000154997a372de in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#20 0x0000154997768e83 in clone () from /usr/lib64/libc.so.6
No symbol table info available.
(gdb)

Regards,
Pascal

Have you recently updated your freeipmi version? We have found a similar issue with libfreeipmi.so.17 at another customer, and that site is using neither v2 hardware nor the modified XCC plugin, so they have the same problem without any other change than an OS upgrade that came with a new freeipmi version. And, as you might have guessed, the site was running OK before the upgrade. So we suspect a freeipmi bug or something like that, but we couldn't find the root of the issue until now.

Cheers,
Carlos.

Would you mind sending us the output, as root, of these commands:

ulimit -a
sysctl fs.file-nr
ls -l /proc/$(pidof slurmd)/fd | wc -l

on a node affected by this problem? We have found another possible explanation, which relates to the number of open files.

Thanks,
Carlos.

We suspect that the freeipmi version is compiled with -D_FORTIFY_SOURCE=2. This enables:

#define FD_SET(fd, fdsetp) __FD_SET (fd, fdsetp)

#define __FD_SET(d, s) \
  ((void) (__FDS_BITS (s)[__FD_ELT(d)] |= __FD_MASK(d)))

#define __FD_ELT(d) \
  __extension__ \
  ({ long int __d = (d); \
     (__builtin_constant_p (__d) \
      ? (0 <= __d && __d < __FD_SETSIZE \
         ? (__d / __NFDBITS) \
         : __fdelt_warn (__d)) \
      : __fdelt_chk (__d)); })

instead of this:

#define __FD_ELT(d) ((d) / __NFDBITS)

And the fortified path fails with a buffer overflow if the FD number is not below:

#define __FD_SETSIZE 1024

When Slurm opens the connection with the XCC BMC, it will fail with a buffer overflow if the FD number is 1024 or greater. The _openipmi_read snippet that triggers the issue, I think:

static int
_openipmi_read (ipmi_openipmi_ctx_t ctx, fiid_obj_t obj_cmd_rs)
{
  uint8_t rs_buf_temp[IPMI_OPENIPMI_BUFLEN];
  uint8_t rs_buf[IPMI_OPENIPMI_BUFLEN];
  struct ipmi_system_interface_addr rs_addr;
  struct ipmi_recv rs_packet;
  fd_set read_fds;
  struct timeval tv, tv_orig, start, end, delta;
  int n;

  assert (ctx);
  assert (ctx->magic == IPMI_OPENIPMI_CTX_MAGIC);
  assert (fiid_obj_valid (obj_cmd_rs));

  rs_packet.addr = (unsigned char *)&rs_addr;
  rs_packet.addr_len = sizeof (struct ipmi_system_interface_addr);
  rs_packet.msg.data = rs_buf_temp;
  rs_packet.msg.data_len = IPMI_OPENIPMI_BUFLEN;

  FD_ZERO (&read_fds);
  FD_SET (ctx->device_fd, &read_fds); <------------ HERE, this is a macro
  [...]
}
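The abort is easy to reproduce outside Slurm. Here is a standalone demo of the fortified FD_SET path (not Slurm or freeipmi code):

```c
/* Build with fortification, e.g.:
 *     gcc -O2 -D_FORTIFY_SOURCE=2 fdset_demo.c -o fdset_demo
 * With an fd of FD_SETSIZE (1024) or more, the fortified __FD_ELT goes
 * through __fdelt_chk()/__fdelt_warn() -> __chk_fail() and glibc aborts
 * with "*** buffer overflow detected ***", matching frames #4-#6 of the
 * backtrace above. Without fortification, the same FD_SET is a silent
 * out-of-bounds write. */
#include <stdio.h>
#include <sys/select.h>

int main(void)
{
	fd_set read_fds;
	int fd = 1500;	/* stand-in for a /dev/ipmi0 fd past 1023 */

	FD_ZERO(&read_fds);
	printf("FD_SETSIZE = %d, fd = %d\n", FD_SETSIZE, fd);
	FD_SET(fd, &read_fds);	/* aborts here when built as shown */
	puts("not reached when fortified");
	return 0;
}
```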
Hello,

Here are the outputs of the requested commands:

ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1025138
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 16308
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

sysctl fs.file-nr
fs.file-nr = 12832 0 26228996

As for ls -l /proc/$(pidof slurmd)/fd | wc -l: the number of files keeps going up until the limit of 1024 is reached and the buffer overflow happens. All files are symbolic links to /dev/ipmi0. A small extract from ls -l /proc/$(pidof slurmd)/fd:

lrwx------ 1 root root 64 10. Jan 16:06 40 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 41 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 42 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 43 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 45 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 46 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 47 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 48 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 49 -> /dev/ipmi0

This was tested with freeipmi versions 1.6.6 and 1.6.8.

Regards,
Pascal

Wow, wow, wow. I'd have expected an FD number greater than 1024 being used to open communication with ipmi0, but not a long list of open ipmi0 FDs. This seems like a bug, although the original code seems to suffer from the same issue, because we didn't change this part. Give us a while to determine why those FDs got leaked.

Thanks

Created attachment 22925 [details]
IPMI open FDs quickfix test
Hi,
Is it possible to test this quickfix on an out-of-prod node in your cluster?
We need access to real HW to see if this solves the open FD leak or not.
It's still valid that, in the event of lots of FDs opened at OS level, if freeipmi tries to open a connection with the XCC BMC and the returned FD number is 1024 or greater, it will fail in the same way. But this patch should mitigate the issue, by fixing our side and avoiding contributing to the number of system-wide open files.
The final solution, once our side is fixed, is to recompile the freeipmi library without _FORTIFY_SOURCE.
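The quickfix itself is the attachment above; as a rough idea of the pattern such a fix has to enforce, assuming the leak came from an in-band IPMI context being opened per poll and not torn down on every exit path (freeipmi public API names; illustrative only):

```c
#include <freeipmi/freeipmi.h>
#include <stdio.h>

/* Sketch, not the attached patch: every code path that creates/opens an
 * in-band IPMI context must close and destroy it, otherwise each poll
 * leaves one /dev/ipmi0 fd behind, as seen in /proc/<pid>/fd above. */
static int poll_xcc_once(void)
{
	ipmi_ctx_t ctx = ipmi_ctx_create();
	int rc = -1;

	if (!ctx)
		return -1;

	if (ipmi_ctx_open_inband(ctx, IPMI_DEVICE_OPENIPMI,
				 0,	/* disable_auto_probe */
				 0, 0,	/* driver_address, register_spacing */
				 NULL,	/* driver_device (default) */
				 0, 0) < 0) {	/* workaround_flags, flags */
		fprintf(stderr, "ipmi_ctx_open_inband: %s\n",
			ipmi_ctx_errormsg(ctx));
		goto done;
	}

	/* ... ipmi_cmd_raw(ctx, ...) and parsing, as in _read_ipmi_values() ... */
	rc = 0;

done:
	/* The essential part: close and destroy on *every* path, success or
	 * error, so no /dev/ipmi0 fd outlives the poll. */
	ipmi_ctx_close(ctx);
	ipmi_ctx_destroy(ctx);
	return rc;
}
```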
Hi,

I have reserved one node for you: hkn0372. You should be able to log in to it via "ssh hkn0372" from the login node with your account ic6849. Right now you can execute:

/home/hk-project-abnahme/ic6849/slurm/inst/sbin/slurmd -Dvvvvvv -N n1

If you need more privileges, don't hesitate to let me know. Basically everything should be the same as last time.

Regards,
Pascal

Pascal,

I'm still waiting for Harmut to answer my request to reset/grant my access to the VPN again, so I can access HoreKA. If you want to test the patch yourself, you can do so and provide feedback, so we can speed up the testing. I can't do anything here until I have access or you test it yourself.

Thanks,
Carlos.

Hello,

I tested the patch on a test cluster. Open FDs are no longer leaked. slurmd has been running on the test cluster for a day by now without any problems. The gathering of energy also works.

Cheers,
Pascal

Hi Pascal,

Regarding the crash patch (buffer overflow error): after a few more days, I need to know if any problems arose, or if it's still running fine. If everything is going well, I want to start the QA review for this patch right away, so your feedback will be much appreciated.

Thanks,
Carlos.

Hello Carlos,

Everything is running fine now. The patch seems to have fixed the issue.

Regards,
Pascal

This is great news, Pascal. I'm going to start the formal review right now.

Thanks,
Carlos.

Hi Pascal,

Pushed commits 34f5d7086f, 87d8a85667, eef9272b90 to both the 21.08 and master branches; they include the reviewed patch for the *buffer overflow error*. The next stable 21.08.6 release will include this fix. Don't forget that there's still a chance to get the same error in the event an open FD has a number of 1024 or greater. This can only be worked around by compiling freeipmi without source fortification.

Now we are going to continue the QA review for the new XCC support. I'll let you know once this is ready.

Best regards,
Carlos.

Hi Pascal,

I'm glad to say our work here has finished. Pushed commits into master: 0ea9f6c847..e91b316c6e address the addition of this feature for SD650v2/SD650-Nv2 hardware. I'm going to close the bug as fixed now, but please don't hesitate to reopen it if you need to.

Best regards,
Carlos.