Summary: | acct_gather_energy/xcc fails on Lenovo SD650v2/SD650-Nv2 nodes | |
---|---|---|---
Product: | Slurm | Reporter: | Simon Raffeiner <simon.raffeiner>
Component: | slurmd | Assignee: | Carlos Tripiana Montes <tripiana>
Status: | RESOLVED FIXED | QA Contact: |
Severity: | 4 - Minor Issue | |
Priority: | --- | CC: | alex, cinek, felip.moll, pascal.schuhmacher, tim
Version: | 20.11.5 | |
Hardware: | Linux | |
OS: | Linux | |
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6743, https://bugs.schedmd.com/show_bug.cgi?id=13596 | |
Site: | KIT | Version Fixed: | 22.05.0pre1
Ticket Depends on: | 12868 | |
Attachments: | Single energy command; v3; style changes already merged; v4; IPMI open FDs quickfix test | |
Description
Simon Raffeiner 2021-04-08 01:18:05 MDT
Hi Simon,

We don't have access to such a HW environment right now, but until now 16 was the expected value. It would be good to know Lenovo's answer about this as well. Is there any way we can get access to test HW with a reproduction environment?

Thanks for reporting.

Carlos.

Created attachment 18954 [details]
Single energy command
Please check the attached image. You will see the expected return of the faulty command.

Additionally, you can send us your slurm.conf and your xcc plugin configuration (acct_gather.conf), plus some logs, and we'll see if we spot something else. But I'm pretty sure the root of the issue is the new HW returning the data for this command in a different format.

[Out-of-office auto-reply from Simon Raffeiner: on vacation until May 24, 2021; technical questions via the bwSupportPortal, https://bw-support.scc.kit.edu/; organisational questions to Dr. Jennifer Buchmüller (jennifer.buchmueller@kit.edu).]

Dear Simon,

We have a patch proposal that must be tested. Please forward this to Lenovo and ask them to give us access to some test HW on which I can compile and test Slurm as part of the test stage for this patch. Thanks.

Hi Simon,

We have a patch proposal but we can't test it on the real HW, since we're unable to get access to it. I tried to get an account on HoreKA 2 months ago (before my paternity leave), but I was unable to get the credentials, and I'm still unable to get them. I'm closing this issue for now. If the interested party wakes up again and provides us access to the HW, we can test it. Thanks for your understanding.

Regards,
Carlos.

Hello Carlos,

Sorry, we had a lot going on the past months. I'm reopening the issue, so we can get the patch tested on the HW. How far did you get with getting access to the HW? As far as I was informed, an account had already been created for you. I'm happy to provide support for getting access to the HW.

Regards,
Pascal

Hi Pascal,

I've been unable to get access since I couldn't get the password. Please get in touch with me by email at tripiana@schedmd.com so we can get access to HoreKA.

Thanks,
Carlos.

Good news is that I'm able to access. Good news is that I've been able to download Slurm master from git, patch it, configure it, and compile it. Good news is that I've been able to configure Slurm and start slurmctld. Bad news is, of course, that I'm not able to run slurmd because it needs root. So now I'm not sure how to proceed to test the patched version. I'm ic6849, Slurm is in $HOME/slurm, and slurmctld is running now on hkn1991. I'll wait for your feedback before doing anything else here.

Thanks,
Carlos.

By now I'm able to run slurmd as a normal user as well, but I need to check whether I'm able (have rights) to gather power metrics from the device. I'll keep you posted.

Regards,
Carlos.
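For context, here is a sketch of how the two formats could be told apart, assuming Carlos's "16 was the expected value" refers to the length of the raw IPMI response the plugin reads back (all names below are hypothetical, not the plugin's actual code; the real changes are in the attachments on this ticket):

```c
/* Hypothetical sketch: dispatch on the raw response length so the old
 * SD650 format (16 bytes expected until now) and the new SD650v2 format
 * can coexist instead of treating the new layout as a failed read. */
enum xcc_fmt { XCC_FMT_UNKNOWN, XCC_FMT_SD650, XCC_FMT_SD650V2 };

static enum xcc_fmt classify_response(int rs_len)
{
	if (rs_len == 16)	/* the value expected until now */
		return XCC_FMT_SD650;
	if (rs_len > 0)		/* assume anything else is the new layout */
		return XCC_FMT_SD650V2;
	return XCC_FMT_UNKNOWN;	/* blank or failed read */
}
```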
As I was thinking:

[2021-11-22T17:31:45.720] debug2: No acct_gather.conf file (/home/hk-project-abnahme/ic6849/slurm/inst/etc/acct_gather.conf)
[2021-11-22T17:31:45.720] acct_gather_energy/xcc: acct_gather_energy_p_conf_set: AcctGatherEnergy XCC plugin loaded
[2021-11-22T17:31:45.721] error: _init_ipmi_config: error : must be root to open ipmi devices
[2021-11-22T17:31:45.721] error: AcctGatherEnergy XCC plugin threads failed to start in a timely manner
[2021-11-22T17:31:45.724] debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.
[2021-11-22T17:32:13.559] debug3: in the service_connection

I can't work with IPMI as a normal user.

Hello Carlos,

I already wrote to you per e-mail, but I'm going to write here as well. The node you were working on was a login node. Two nodes have been reserved for you: hkn0372 and hkn0602. You can log in to them via ssh from the login node (e.g. ssh hkn0372). They are of type Lenovo_SD650_v2 and Lenovo_SD650-N_v2 respectively. Here you should have the permissions to use IPMI. If you need more permissions, don't hesitate to ask.

Best regards,
Pascal

Thanks, Pascal, for the efforts from your side. What I need now is to execute the following with sudo:

[ic6849@hkn0372 inst]$ pwd
/home/hk-project-abnahme/ic6849/slurm/inst
[ic6849@hkn0372 inst]$ . setup.sh
[ic6849@hkn0372 inst]$ sudo slurmd -Dvvvvvv -N n1 <<-----

I'll do it on the 2 nodes reserved for this purpose.

Cheers,
Carlos.

Reasoning for asking for sudo:

[2021-11-26T09:15:49.814] debug2: Start processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2021-11-26T09:15:49.814] debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2021-11-26T09:15:49.814] error: _init_ipmi_config: error : must be root to open ipmi devices
[2021-11-26T09:15:49.814] debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2021-11-26T09:15:49.814] [2.batch] acct_gather_energy/xcc: _get_joules_task: we got a blank

And similar errors.

[Out-of-office auto-reply from Simon Raffeiner: back on November 29; technical questions via the NHR Support Portal (https://support.nhr.kit.edu) or the bwSupportPortal (https://bw-support.scc.kit.edu/); very urgent cases to Dr. René Caspart (rene.caspart@kit.edu).]

Hello,

You should now be able to use: sudo slurmd -Dvvvvvv -N n1. Right now you can only use this exact command; if you need more, tell us.

Regards,
Pascal

Maybe this is something you should be aware of, but it's not working yet:
[ic6849@hkn0372 inst]$ sudo slurmd -Dvvvvvv -N n1
>>> /etc/sudoers.d/sudoers_schedmd: syntax error near line 4 <<<
sudo: parse error in /etc/sudoers.d/sudoers_schedmd near line 4
sudo: no valid sudoers sources found, quitting
sudo: unable to initialize policy plugin
Thanks,
Carlos.
Hello,

I fixed the error. These two commands should work:

/usr/sbin/slurmd -Dvvvvvv -N n1
/sbin/slurmd -Dvvvvvv -N n1

Regards,
Pascal

Ah... no, no. It's not:

/usr/sbin/slurmd -Dvvvvvv -N n1
/sbin/slurmd -Dvvvvvv -N n1

It is:

/home/hk-project-abnahme/ic6849/slurm/inst/sbin/slurmd

because my installation is local and I'm working there.

OK, you should be able to use slurmd -Dvvvvvv -N n1 with your local installation.

The patched version seems to be working, or at least it looks promising:

[ic6849@hkn0372 inst]$ scontrol show nodes
NodeName=n1 Arch=x86_64 CoresPerSocket=1 CPUAlloc=0 CPUTot=2 CPULoad=0.03
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:tesla:1,gpu:p100:1,gpu:k80:1
   NodeAddr=127.0.0.1 NodeHostName=n1 Port=17001 Version=22.05.0-0pre1
   OS=Linux 4.18.0-193.60.2.el8_2.x86_64 #1 SMP Wed Jul 7 07:08:14 EDT 2021
   RealMemory=30000 AllocMem=0 FreeMem=242986 Sockets=1 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2021-08-06T18:19:43 SlurmdStartTime=2021-11-26T13:42:57
   LastBusyTime=2021-11-26T13:45:38
   CfgTRES=cpu=2,mem=30000M,billing=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=429760 AveWatts=298601
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Since CurrentWatts=429760 AveWatts=298601, there's some value here... Well, it seems a quite high value, doesn't it? I need further debugging, but the point now is that I'm able to iterate over the code and test it until we have a functional patch.

Cheers,
Carlos.

[2021-11-26T14:18:39.883] acct_gather_energy/xcc: _sd650v2_update_node_energy: ENERGY: _sd650v2_update_node_energy: XCC current_watts: 427 consumed energy last interval: 12720(current reading 137955) Joules, elapsed time: 30 Seconds, ave watts: 380

That is making a lot more sense, I suppose. Could you please check that these values make sense?

Hello Carlos,

Thank you for your work. The values make sense since it's a node of type Lenovo SD650v2 without an accelerator. If you want, you can try it on the other node hkn0602 of type Lenovo SD650-Nv2, which has GPU accelerators and should show a higher value.

Best regards,
Pascal

Go for it. Do I have sudo on hkn0602 as well for the slurmd command?

Thanks,
Carlos.

Yes, you have the same permissions on both nodes.

On hkn0602 I see:

[2021-11-26T15:09:20.534] acct_gather_energy/xcc: _sd650_update_node_energy: ENERGY: _sd650_update_node_energy: XCC current_watts: 636 consumed energy last interval: 19092(current reading 306002) Joules, elapsed time: 30 Seconds, first read energy counter val: 2310595332 ave watts: 635

But I must say that I'm not able to see any available GPU there. Still, it's twice the consumption :). Anyway, it seems like the patch is almost ready, if you agree with what you are seeing.

Cheers,
Carlos.

Hello,

The values are as expected. Thanks again for your work.

Cheers,
Pascal

I'm going to double-check everything then and ask for formal QA review on this. Thanks for your time.

Created attachment 22414 [details]
v3
This is the final reviewed version of the new functionality, which works after testing it on HoreKA.
I'm going to start the reviewing process now. Thanks.
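As a sanity check on the numbers above: reading the log line, the XCC energy counter is cumulative, so the interval power can be recomputed from the deltas. A minimal sketch (values copied from the log; variable names are ours, not the plugin's):

```c
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
	/* Values copied from the _sd650v2_update_node_energy log line. */
	uint64_t curr_joules = 137955;	/* cumulative "current reading" */
	uint64_t interval_j  = 12720;	/* "consumed energy last interval" */
	uint64_t elapsed_s   = 30;	/* "elapsed time: 30 Seconds" */

	uint64_t prev_joules = curr_joules - interval_j;
	double interval_w = (double) interval_j / (double) elapsed_s;

	printf("previous reading: %" PRIu64 " J\n", prev_joules);
	/* ~424 W, consistent with the instantaneous current_watts of 427;
	 * "ave watts: 380" is the running average since the first read. */
	printf("interval power: %.0f W\n", interval_w);
	return 0;
}
```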
Hello Carlos,

How long does the review process take, and when can a patched version be expected? We would like to test the patch as soon as possible.

PS: we are currently using Slurm version 20.11.5.

Regards,
Pascal

Hello Carlos,

We tested your patch on our cluster. The energy readings work fine, but on the CPU-only nodes (Lenovo SD650v2) a kernel error occurs randomly and the slurmd.service gets killed:

kernel: traps: slurmd[2543817] trap divide error ip:1467f19dd595 sp:1467ebffecc0 error:0 in acct_gather_energy_xcc.so[1467f19db000+6000]
systemd[1]: Started Process Core Dump (PID 2550389/UID 0).
systemd[1]: slurmd.service: Main process exited, code=killed, status=8/FPE
systemd[1]: slurmd.service: Failed with result 'signal'.

The reason for the error seems to be line 501: xcc_reading->w /= count;

root@hktn1671:/# addr2line -e /usr/lib64/slurm/acct_gather_energy_xcc.so 0x2595
/root/rpmbuild/BUILD/slurm-20.11.5-2/src/plugins/acct_gather_energy/xcc/acct_gather_energy_xcc.c:501

The error only occurs on the Lenovo SD650v2 nodes. On nodes with accelerators (Lenovo SD650-Nv2) the patch works fine without any errors.

Regards,
Pascal

[Out-of-office auto-reply from Simon Raffeiner: back on January 3rd, 2022; technical questions via the NHR Support Portal (https://support.nhr.kit.edu) or the bwSupportPortal (https://bw-support.scc.kit.edu/); very urgent cases to Dr. Jennifer Buchmüller (jennifer.buchmueller@kit.edu).]

Hi,

Carlos is currently on vacation till January 10th. I'll see if I can hand you a workaround for the division-by-zero issue on Thursday this week, or otherwise early next week, so that you can test locally. In any case, what Carlos handed over as a local test patch won't be the definitive changes that will be merged into Slurm. With regards to the review process, we're still doing some last iterations on the patchset, and I'll need further feedback on some suggested changes once Carlos is back.

Hi,

I'm attaching two patch sets. The first set is only style changes, and they are already merged in Slurm master as commits:

eeca14d9d9 acct_gather_energy/xcc - add comment to clarify how response is parsed.
37eed1dab8 acct_gather_energy/xcc - comment xcc_raw_single_data_t members.
1bdaa16760 acct_gather_energy/xcc - fix operands style.
a10e829abc acct_gather_energy/xcc - fix spaces style.

Then there's a second patch set, not yet merged (still in the review process), that should work around the division-by-zero issue; it should be applied on top of the previous style patch set.

Please test them and let us know if the reported metrics look as expected for both the old and the new hardware formats.
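We can't inline the attached patch here, but the shape of the workaround for the SIGFPE is clear from the addr2line output above: guard the averaging division. A hypothetical sketch (struct and names mimic the report, not the actual patch):

```c
#include <stdint.h>

/* Mimics the averaging at acct_gather_energy_xcc.c:501. `w` holds summed
 * watt samples and `count` the number of samples; on SD650v2 (CPU-only)
 * nodes an interval can end with count == 0, and the unguarded division
 * then raises SIGFPE ("trap divide error" in the kernel log above). */
typedef struct {
	uint32_t w;	/* summed watt samples; becomes the average */
} xcc_reading_t;

static void average_watts(xcc_reading_t *xcc_reading, uint32_t count)
{
	if (count)
		xcc_reading->w /= count;
	/* else: no samples this interval; keep the previous value
	 * instead of dividing by zero. */
}
```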
Created attachment 22817 [details]
style changes already merged
First patch set with style changes. Apply this first.
Created attachment 22818 [details]
v4
v4 patch set with actual changes. To be applied on top of the previous style changes patch set.
Hi,

Thank you for the patches. The gathering of energy seems to work and the reported metrics look as expected; the prior kernel errors do not happen anymore. But with the plugin enabled, slurmd randomly gets killed on nodes during jobs with a buffer overflow:

Jan 5 22:15:58 hktn1671 slurmd[3827728]: *** buffer overflow detected ***: /usr/sbin/slurmd terminated
Jan 5 22:15:58 hktn1671 abrt-hook-ccpp[3865415]: Process 3827728 (slurmd) of user 0 killed by SIGABRT - dumping core

The corresponding core dump backtrace on the node:

(gdb) thread apply all bt full

Thread 3 (Thread 0x15498ec7a700 (LWP 3827731)):
#0  0x0000154997a3d7da in pthread_cond_timedwait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x00001549946a90cf in _thread_ipmi_run (no_data=<optimized out>) at acct_gather_energy_xcc.c:783
        err = <optimized out>
        tvnow = {tv_sec = 1641391314, tv_usec = 541545}
        abs = {tv_sec = 1641417384, tv_nsec = 541545000}
        __func__ = "_thread_ipmi_run"
#2  0x0000154997a372de in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#3  0x0000154997768e83 in clone () from /usr/lib64/libc.so.6
No symbol table info available.

Thread 2 (Thread 0x154999884780 (LWP 3827728)):
#0  0x0000154997a40cb7 in accept () from /usr/lib64/libpthread.so.0
No symbol table info available.
#1  0x0000154998f8d75b in slurm_accept_msg_conn (fd=<optimized out>, addr=<optimized out>) at slurm_protocol_socket.c:452
        len = 128
#2  0x0000000000411f83 in _msg_engine () at slurmd.c:473
        cli = 0x20c7890
        sock = <optimized out>
        cli = <optimized out>
        sock = <optimized out>
        __func__ = "_msg_engine"
        rpc_wait = <optimized out>
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
        tv1 = <optimized out>
        tv2 = <optimized out>
        tv_str = <optimized out>
        delta_t = <optimized out>
#3  main (argc=<optimized out>, argv=<optimized out>) at slurmd.c:394
        i = <optimized out>
        pidfd = 6
        blocked_signals = {13, 0}
        cc = <optimized out>
        oom_value = <optimized out>
        curr_uid = <optimized out>
        time_stamp = "Wed, 05 Jan 2022 15:01:54 +0100\000\240\233\017\227\000\000\000\000\377\377\377\377\377\377\377\377\000\000\000\000\000\000\000\000\270\267\n\317\377\037\000\000\240\337*<\377\177\000\000P.\002\002\000\000\000\000\340\336*<\377\177\000\000`\337*<\377\177\000\000\253\064\244\230I\025\000\000\003+\244\230I\025\000\000\003\000\000\000\000\000\000\000@\001\000\000\000\000\000\000\200\000\000\000\000\000\000\000\000\026\023jRg\016֠\"\002\002\000\000\000\000\240\"\002\002", '\000' <repeats 20 times>, "\030\241ĘI\025\000\000\360/\002\002\000\000\000\000\f0\002\002\000\000\000\000\003\273"...
        lopts = <optimized out>
        __func__ = "main"

Thread 1 (Thread 0x154999883700 (LWP 3865414)):
#0  0x00001549976a470f in raise () from /usr/lib64/libc.so.6
No symbol table info available.
#1  0x000015499768eb25 in abort () from /usr/lib64/libc.so.6
No symbol table info available.
#2  0x00001549976e7897 in __libc_message () from /usr/lib64/libc.so.6
No symbol table info available.
#3  0x000015499777a905 in __fortify_fail_abort () from /usr/lib64/libc.so.6
No symbol table info available.
#4  0x000015499777a937 in __fortify_fail () from /usr/lib64/libc.so.6
No symbol table info available.
#5  0x00001549977788f6 in __chk_fail () from /usr/lib64/libc.so.6
No symbol table info available.
#6  0x000015499777a82b in __fdelt_warn () from /usr/lib64/libc.so.6
No symbol table info available.
#7  0x000015498fa6b604 in _openipmi_read () from /usr/lib64/libfreeipmi.so.17
No symbol table info available.
#8  0x000015498fa6bc2f in ipmi_openipmi_cmd () from /usr/lib64/libfreeipmi.so.17
No symbol table info available.
#9  0x000015498fa4a48f in api_openipmi_cmd () from /usr/lib64/libfreeipmi.so.17
No symbol table info available.
#10 0x000015498fa4a58e in api_openipmi_cmd_raw () from /usr/lib64/libfreeipmi.so.17
No symbol table info available.
#11 0x000015498fa36ef3 in ipmi_cmd_raw () from /usr/lib64/libfreeipmi.so.17
No symbol table info available.
#12 0x00001549946a83b9 in _read_ipmi_values () at acct_gather_energy_xcc.c:444
        xcc_reading = <optimized out>
        buf_rs = "\020\000\000\000I\025\000\000\320)\210\231I\025\000\000\220)\210\231I\025\000\000\000\026\023jRg\016\326\000\000y\n\326a1\002\306\062o\227I\025\000\000\002\b\000\000\000\000\000\000\233\023o\227I\025\000\000\200\231*\210I\025\000\000\a\000\000\000\000\000\000\000\232\062\002\210I\025\000\000\360\002\000\000\000\000\000\000\340\002\000\000\000\000\000\000/\000\000\000I\025\000\000\250\063\002\210I\025\000\000\020\003\000\000\000\000\000\000\250\377\377\377\377\377\377\377\v\000\000\000\000\000\000\000/\000\000\000;\000\000\000\\\000\000\000|\000\000\000\001", '\000' <repeats 23 times>, "w\000\000\000n\000\000\000\b\000\000\000\000\000\000\000"...
        rs_len = 0
        __func__ = "_read_ipmi_values"
#13 0x00001549946a8648 in _thread_update_node_energy () at acct_gather_energy_xcc.c:663
        xcc_raw = <optimized out>
        __func__ = "_thread_update_node_energy"
#14 0x00001549946a9577 in acct_gather_energy_p_get_data (data_type=<optimized out>, data=0x15498827e8f0) at acct_gather_energy_xcc.c:954
        rc = 0
        energy = 0x15498827e8f0
        last_poll = 0x15498827e8f0
        sensor_cnt = 0x15498827e8f0
        __func__ = "acct_gather_energy_p_get_data"
#15 0x0000154998f35e38 in acct_gather_energy_g_get_data (context_id=0, data_type=data_type@entry=ENERGY_DATA_JOULES_TASK, data=0x15498827e8f0) at slurm_acct_gather_energy.c:369
        retval = -1
        __func__ = "acct_gather_energy_g_get_data"
#16 0x0000000000417357 in _rpc_acct_gather_energy (msg=msg@entry=0x154988025b10) at req.c:3315
        resp_msg = {address = {ss_family = 39312, __ss_padding = "*\210I\025\000\000 \000\000\210I\025\000\000\024\000\000\000\000\000\000\000\320Q\000\000\000\000\000\000P梗I\025\000\000\060\236*\210I\025\000\000\370[\002\210I\025\000\000\306\062o\227I\025", '\000' <repeats 11 times>, ",\210\231I\025\000\000CI\377\230I\025\000\000t\004\000\000\000\000\000\000\004\000\000\000\000\000\000\000\355w\374\230I\025\000\000\000\061(\210I\025\000", __ss_align = 9216}, auth_cred = 0x154988025bc0, auth_index = -2010631936, auth_uid = 5449, auth_uid_set = 64, body_offset = 5449, buffer = 0x154998f6a529 <_unpack_acct_gather_energy_req+70>, conn = 0x154998ff5870 <__func__.19392>, conn_fd = -1728666239, data = 0x154988283100, data_size = 2281855760, flags = 5449, msg_index = 0, msg_type = 12544, protocol_version = 34856, forward = {cnt = 54064, init = 34857, nodelist = 0x154999882cc0 "\200.\210\231I\025", timeout = 1779635712, tree_width = 26450}, forward_struct = 0x15498829d34c, orig_addr = {ss_family = 11448, __ss_padding = "\210\231I\025\000\000\240,\210\231I\025\000\000\000\000\000\000\000\000\000\000\020[\002\210I\025\000\000\060\323)\210I\025\000\000\370[\002\210I\025\000\000\202y\374\230I\025\000\000\060\323)\210I\025\000\000\000\061(\210I\025\000\000\300,\210\231I\025\000\000u\371\363\230I\025\000\000\060\323)\210I\025\000\000\000\061(\210I\025\000\000\200.\210\231I\025\000\000`F\365\230I\025\000", __ss_align = 0}, ret_list = 0x154988139530}
        data_type = 0
        sensor_cnt = 1
        acct_msg = {energy = 0x15498827e8f0, node_name = 0x0, sensor_cnt = 1}
        now = <optimized out>
        last_poll = 1641417337
        req = 0x1549882a9180
        rc = 0
        req_uid = <optimized out>
        first_msg = false
#17 0x000000000041ea80 in slurmd_req (msg=msg@entry=0x154988025b10) at req.c:349
        __func__ = "slurmd_req"
#18 0x000000000040f49e in _service_connection (arg=<optimized out>) at slurmd.c:581
        con = 0x2040230
        msg = 0x154988025b10
        __func__ = "_service_connection"
        rc = 0
#19 0x0000154997a372de in start_thread () from /usr/lib64/libpthread.so.0
No symbol table info available.
#20 0x0000154997768e83 in clone () from /usr/lib64/libc.so.6
No symbol table info available.
(gdb)

Regards,
Pascal

Have you recently updated your freeipmi version? We have found a similar issue with libfreeipmi.so.17 at another customer, and that site is using neither v2 hardware nor the modified XCC plugin, so they have the same problem without any other change than an OS upgrade that came with a new freeipmi version. And, as you might have guessed, the site was running OK before the upgrade. So we suspect a freeipmi bug or something like that, but we couldn't find the root of the issue until now.

Cheers,
Carlos.

Would you mind sending us the output, as root, of these commands:

ulimit -a
sysctl fs.file-nr
ls -l /proc/$(pidof slurmd)/fd | wc -l

on a node affected by this problem? We have found another possible explanation, which relates to the number of open files.

Thanks,
Carlos.

We suspect that the freeipmi version is compiled with -D_FORTIFY_SOURCE=2. This enables:

#define FD_SET(fd, fdsetp) __FD_SET (fd, fdsetp)

#define __FD_SET(d, s) \
  ((void) (__FDS_BITS (s)[__FD_ELT(d)] |= __FD_MASK(d)))

#define __FD_ELT(d) \
  __extension__ \
  ({ long int __d = (d); \
     (__builtin_constant_p (__d) \
      ? (0 <= __d && __d < __FD_SETSIZE \
         ? (__d / __NFDBITS) \
         : __fdelt_warn (__d)) \
      : __fdelt_chk (__d)); })

instead of this:

#define __FD_ELT(d) ((d) / __NFDBITS)

And the fortified path fails with a buffer overflow if the FD number is not below:

#define __FD_SETSIZE 1024

When Slurm opens the connection with the XCC BMC, it will fail with a buffer overflow if the FD number is 1024 or greater. The _openipmi_read snippet that triggers the issue, I think:

static int
_openipmi_read (ipmi_openipmi_ctx_t ctx, fiid_obj_t obj_cmd_rs)
{
  uint8_t rs_buf_temp[IPMI_OPENIPMI_BUFLEN];
  uint8_t rs_buf[IPMI_OPENIPMI_BUFLEN];
  struct ipmi_system_interface_addr rs_addr;
  struct ipmi_recv rs_packet;
  fd_set read_fds;
  struct timeval tv, tv_orig, start, end, delta;
  int n;

  assert (ctx);
  assert (ctx->magic == IPMI_OPENIPMI_CTX_MAGIC);
  assert (fiid_obj_valid (obj_cmd_rs));

  rs_packet.addr = (unsigned char *)&rs_addr;
  rs_packet.addr_len = sizeof (struct ipmi_system_interface_addr);
  rs_packet.msg.data = rs_buf_temp;
  rs_packet.msg.data_len = IPMI_OPENIPMI_BUFLEN;

  FD_ZERO (&read_fds);
  FD_SET (ctx->device_fd, &read_fds); <------------ HERE, this is a macro
  [...]
}
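The abort is easy to reproduce outside Slurm. Here is a standalone demo of the fortified FD_SET path (not Slurm or freeipmi code):

```c
/* Build with fortification, e.g.:
 *     gcc -O2 -D_FORTIFY_SOURCE=2 fdset_demo.c -o fdset_demo
 * With an fd of FD_SETSIZE (1024) or more, the fortified __FD_ELT goes
 * through __fdelt_chk()/__fdelt_warn() -> __chk_fail() and glibc aborts
 * with "*** buffer overflow detected ***", matching frames #4-#6 of the
 * backtrace above. Without fortification, the same FD_SET is a silent
 * out-of-bounds write. */
#include <stdio.h>
#include <sys/select.h>

int main(void)
{
	fd_set read_fds;
	int fd = 1500;	/* stand-in for a /dev/ipmi0 fd past 1023 */

	FD_ZERO(&read_fds);
	printf("FD_SETSIZE = %d, fd = %d\n", FD_SETSIZE, fd);
	FD_SET(fd, &read_fds);	/* aborts here when built as shown */
	puts("not reached when fortified");
	return 0;
}
```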
Hello,

Here are the outputs of the requested commands:

ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1025138
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 16308
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

sysctl fs.file-nr
fs.file-nr = 12832 0 26228996

As for ls -l /proc/$(pidof slurmd)/fd | wc -l: the number of files keeps going up until the limit of 1024 is reached and the buffer overflow happens. All files are symbolic links to /dev/ipmi0. A small extract from ls -l /proc/$(pidof slurmd)/fd:

lrwx------ 1 root root 64 10. Jan 16:06 40 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 41 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 42 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 43 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 45 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 46 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 47 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 48 -> /dev/ipmi0
lrwx------ 1 root root 64 10. Jan 16:06 49 -> /dev/ipmi0

This was tested with freeipmi versions 1.6.6 and 1.6.8.

Regards,
Pascal

Wow, wow, wow. I'd have expected an FD number greater than 1024 being used to open communication with ipmi0, but not a long list of open ipmi0 FDs. This seems like a bug, although the original code seems to suffer from the same issue, because we didn't change this part. Give us a while to determine why those FDs got leaked.

Thanks

Created attachment 22925 [details]
IPMI open FDs quickfix test
Hi,
Is it possible to test this quickfix on an out-of-prod node in your cluster?
We need access to real HW to see if this solves the open FD leak or not.
It's still valid that, in the event of lots of FDs opened at OS level, if freeipmi tries to open a connection with the XCC BMC and the returned FD number is 1024 or greater, it will fail in the same way. But this patch should mitigate the issue, by fixing our side and avoiding contributing to the number of system-wide open files.
The final solution, once our side is fixed, is to recompile the freeipmi library without _FORTIFY_SOURCE.
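The quickfix itself is the attachment above; as a rough idea of the pattern such a fix has to enforce, assuming the leak came from an in-band IPMI context being opened per poll and not torn down on every exit path (freeipmi public API names; illustrative only):

```c
#include <freeipmi/freeipmi.h>
#include <stdio.h>

/* Sketch, not the attached patch: every code path that creates/opens an
 * in-band IPMI context must close and destroy it, otherwise each poll
 * leaves one /dev/ipmi0 fd behind, as seen in /proc/<pid>/fd above. */
static int poll_xcc_once(void)
{
	ipmi_ctx_t ctx = ipmi_ctx_create();
	int rc = -1;

	if (!ctx)
		return -1;

	if (ipmi_ctx_open_inband(ctx, IPMI_DEVICE_OPENIPMI,
				 0,	/* disable_auto_probe */
				 0, 0,	/* driver_address, register_spacing */
				 NULL,	/* driver_device (default) */
				 0, 0) < 0) {	/* workaround_flags, flags */
		fprintf(stderr, "ipmi_ctx_open_inband: %s\n",
			ipmi_ctx_errormsg(ctx));
		goto done;
	}

	/* ... ipmi_cmd_raw(ctx, ...) and parsing, as in _read_ipmi_values() ... */
	rc = 0;

done:
	/* The essential part: close and destroy on *every* path, success or
	 * error, so no /dev/ipmi0 fd outlives the poll. */
	ipmi_ctx_close(ctx);
	ipmi_ctx_destroy(ctx);
	return rc;
}
```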
Hi,

I have reserved one node for you: hkn0372. You should be able to log in to it via "ssh hkn0372" from the login node with your account ic6849. Right now you can execute:

/home/hk-project-abnahme/ic6849/slurm/inst/sbin/slurmd -Dvvvvvv -N n1

If you need more privileges, don't hesitate to let me know. Basically everything should be the same as last time.

Regards,
Pascal

Pascal,

I'm still waiting for Harmut to answer my request to reset/grant my access to the VPN again, so I can access HoreKA. If you want to test the patch yourself, you can do so and provide feedback, so we can speed up the testing. I can't do anything here until I have access or you test it yourself.

Thanks,
Carlos.

Hello,

I tested the patch on a test cluster. Open FDs are no longer leaked. slurmd has been running on the test cluster for a day by now without any problems. The gathering of energy also works.

Cheers,
Pascal

Hi Pascal,

Regarding the crash patch (buffer overflow error): after a few more days, I need to know if any problems arose, or if it's still running fine. If everything is going well, I want to start the QA review for this patch right away, so your feedback will be much appreciated.

Thanks,
Carlos.

Hello Carlos,

Everything is running fine now. The patch seems to have fixed the issue.

Regards,
Pascal

This is great news, Pascal. I'm going to start the formal review right now.

Thanks,
Carlos.

Hi Pascal,

Pushed commits 34f5d7086f, 87d8a85667, eef9272b90 to both the 21.08 and master branches; they include the reviewed patch for the *buffer overflow error*. The next stable 21.08.6 release will include this fix. Don't forget that there's still a chance to get the same error in the event an open FD has a number of 1024 or greater. This can only be worked around by compiling freeipmi without source fortification.

Now we are going to continue the QA review for the new XCC support. I'll let you know once this is ready.

Best regards,
Carlos.

Hi Pascal,

I'm glad to say our work here has finished. Pushed commits into master: 0ea9f6c847..e91b316c6e address the addition of this feature for SD650v2/SD650-Nv2 hardware. I'm going to close the bug as fixed now, but please don't hesitate to reopen it if you need to.

Best regards,
Carlos.