Ticket 17639

Summary: slurmd with buffer overflow - bug in freeipmi (FD_SETSIZE > 1024)
Product: Slurm Reporter: Frank Heckes <heckes>
Component: slurmd Assignee: Marshall Garey <marshall>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: aelmazaty, felip.moll, mark.titorenko, nate, nils.kanning, Ole.H.Nielsen
Version: 23.02.4   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=17464
https://bugs.schedmd.com/show_bug.cgi?id=12868
Site: Max Planck Institute for Solar System Research
Linux Distro: SUSE
Version Fixed: 23.02.7, 23.11.0rc1
Attachments: tar archive containing the files mentioned in the ticket
requested info from comments #3 and #4
file collection after crash on Sep. 18th
gdb output as requested in item 18
gdb output as requested in comment #22
Fixes file descriptor leak in IPMI DCMI energy gathering plugin

Description Frank Heckes 2023-09-07 03:10:49 MDT
Dear SlurmSupport, here's a description of the compute cluster where the error occurred:
1. 56 SuperMicro Nodes, 1TB Ram, AMD 7763, NDR@200GBit HCAs
2. OS: SLES 15.4
3. Single NDR switch in split mode; HCA and switch OS at the latest version
4. MLNX ofed version: MLNX_OFED_LINUX-5.9-0.5.6.
5. slurm Version 23.02.4
   compile environment: gcc-7.5.0 (Full list of RPMs can be attached on demand)
6. The error happened during an HPL version 2.3 compile with EasyBuild modules
   GCC-11.2.0, openMPI-4.1.1
   alpha for N (N = alpha * sqrt(#nodes * mem_per_node / 8)) was 0.85

The job stopped twice with a buffer overflow, which happened on
1. 2 nodes the first time (swan[03,04])
2. 1 node the second time (swan28)

Although the job was marked as FAIL, Slurm restarted it automatically. Could you please also comment on this?

Attached files
a. build log for slurmd RPMs; yes, munge is installed and operational
b. slurmctld, slurmd logfile and coredumps written
c. slurmd.conf, cgroups.conf, acct_gather.conf, gres.conf (although not of interest here, since the
   GPUs weren't part of the job)

Please let me know if some information or file is missing for your analysis. Many thanks in advance.
Cheers,
-Frank Heckes
P.S.: I'll upload the attachments in a second step, as I'm afraid my session might expire.
Comment 1 Frank Heckes 2023-09-07 03:37:34 MDT
Created attachment 32142 [details]
tar archive containing the files mentioned in the ticket
Comment 2 Frank Heckes 2023-09-07 05:57:23 MDT
I forgot to include the limits settings for the slurmd processes. All nodes use:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            unlimited            unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             4127344              4127344              processes 
Max open files            131072               131072               files     
Max locked memory         unlimited            unlimited            bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       4127344              4127344              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us

besides some minor differences of O(100) for 'Max pending signals'
Comment 3 Jason Booth 2023-09-07 09:50:36 MDT
Please also include the slurmd.log, the output of "slurmd -C" and the backtrace from those core files.


> gdb -ex 't a a bt' -batch slurmd core.12345

Core files are rarely useful to us, and most of the time we are unable to read them due to differences between the Linux environments sites use and ours.
Comment 4 Marshall Garey 2023-09-07 10:10:54 MDT
I don't understand all of the questions.

(In reply to Frank Heckes from comment #0)
> Dear SlurmSupport, here's a description of our compute cluster the error
> occurred:
> 1. 56 SuperMicro Nodes, 1TB Ram, AMD 7763, NDR@200GBit HCAs
> 2. OS: SLES 15.4
> 3. Single NDR switch in Spit mode, HCA, switch OS latest version
> 4. MLNX ofed version: MLNX_OFED_LINUX-5.9-0.5.6.
> 5. slurm Version 23.02.4
>    compile environment: gcc-7.5.0 (Full list of RPMs can be attached on
> demand)
> 6. The error happened during a HPL version 2.3 compile with easybuild modules
>    GCC-11.2.0, openMPI-4.1.1
>    alpha for N (N=alpha*sqrt(#nodes * Mem/per nodes / 8) ) was 0.85
> 
> The job executed stopped two times with a buffer overflow that happened on
> 1. 2 nodes in first time (swan[03,04])
> 2. 1 node at the second time (swan28)

Are you saying that the job had a buffer overflow? Or that slurmd did? You imply that slurmd crashed because you attached core dumps for slurmd.
 
> Although the job was marked as FAIL slurm restarted it automatically. Could
> you please also comment on this fact 

What job id (or job id's if multiple jobs)? Did the job requeue itself in the job script? Will you upload the output of the following sacct command?

sacct -j <jobid> -D --long -p


> Attached files
> a. build log for slurmd RPMs; yes, munge is installed and operational
> b. slurmctld, slurmd logfile and coredumps written

I don't see the slurmd log files. Are you talking about the dmesg output?
As Jason mentioned, we cannot make use of the coredumps. We need the backtraces generated from the coredumps.

> c. slurmd.conf, cgroups.conf, acct_gather.con, gres.conf (althought not of
> interest here, cause the
>    GPU weren't part of the job)
> 
> Please let me know if some information or file is missing for your analysis.

I think the slurmd log files are missing.
Comment 5 Frank Heckes 2023-09-07 11:31:19 MDT
@Jason: 
I put everything into the tar archive of the latest upload.

Details:
* I'm sorry about the missing slurmd.log files. Included in attachment.
* Same for captured slurmd -C
* The coredumps are handled via systemd-coredump on all nodes. Dumps are stored in 
  compressed format after the event. The interface to access them is coredumpctl 
  (or z
swan03:~ # coredumpctl list
TIME                           PID  UID  GID SIG     COREFILE  EXE                                SIZE
Wed 2023-09-06 17:25:18 CEST 28102    0    0 SIGABRT present   /usr/sbin/slurmd                   620.4K

coredumpctl debug --debugger=gdb --debugger-arguments="-ex 't a a bt' -batch slurmd" /usr/sbin/slurmd > /tmp/swan03.stacktrace.via-coredumpctl 2>&1

or in two steps:
coredumpctl dump -o /tmp/hurz /usr/sbin/slurmd
gdb -ex 't a a bt' -batch slurmd /tmp/hurz > swan03.stacktrace.gdb-direct 2>&1

I used this command sequence on all nodes. The 'coredumpctl debug' output contains
some more info, but the same stacktrace as far as I can see. All files are attached in the
second upload.


@Marshall: 
I'm very sorry for any ambiguity. The slurmd dumped core (see the messages files). I would never bother you with obvious, directly visible HPC application problems.
The job requeued itself after the crash. Actually, I read the isotime for the buffer overflow incorrectly. This one was also caused by job 244.


The sacct info is included in the tar archive.
Comment 6 Frank Heckes 2023-09-07 11:32:54 MDT
Created attachment 32152 [details]
requested info from comments #3 and #4
Comment 7 Frank Heckes 2023-09-07 11:39:04 MDT
Sorry for the typos; the last three sentences should read as:
Actually I read the isotime string for the buffer overflow on node swan28 incorrectly. This incident is also caused by job 244.

The sacct info requested has been included in tar archive.
Comment 9 Marshall Garey 2023-09-08 11:42:07 MDT
Thanks for the updated details. The backtrace shows that the ipmi library is aborting (assertion failure).

We have another private bug (bug 17464) with the same backtrace. In that bug, the problem started happening after they upgraded Slurm but had not recompiled the ipmi library. Quoting our response on that bug:


> The IPMI library crash is triggered by the internal memory consistency
> functionality. Considering this worked before the upgrade, this strongly
> suggests that the library is not compatible with Slurm or glibc via some
> unknown incompatibility. I have seen a few times with similar corruption
> (but not this exact issue) that the library (libfreeipmi.so.17 in this case)
> needs to be recompiled on the system and relinked against Slurm. Upgrading
> to the latest version of libfreeipmi.so and recompiling Slurm against the
> upgraded library is my suggested path to avoid debugging an older version
> and likely unsupported version of libfreeipmi.


Did you recently upgrade Slurm? Did this problem start happening after the upgrade?

If you upgrade the ipmi library and recompile Slurm against the upgraded library, does the crash go away? Perhaps try it on a single node if the crash is easily reproducible.
Comment 10 Frank Heckes 2023-09-11 01:39:46 MDT
Many thanks for the analysis. Indeed there was an initial version without freeipmi support. I recompiled Slurm against the freeipmi delivered with the SLES 15.4 distribution. I was pretty sure that this worked fine and that I rolled out this version to all nodes. Anyway, something must be wrong. I will double-check the compilation and the installed RPMs.
Comment 11 Frank Heckes 2023-09-11 14:43:55 MDT
* I verified the installation is consistent, i.e. the RPMs on the build nodes are 
  those installed on the db and ctl servers and the compute nodes.
  (if it is of interest I can upload the script logs)
  
* CTL node and compute nodes run different OS distro:
    NODE TYPE   OS      INSTALLED RPMs
slurmctl        el7 	freeipmi-1.5.7-3.el7.x86_64  
build node	el7	freeipmi-1.5.7-3.el7.x86_64
                        freeipmi-devel-1.5.7-3.el7.x86_64
##  I use only the ctl, example, restapi and slurm RPMs on the ctl node

compute	        sles15	freeipmi-1.6.8-150400.1.11.x86_64 
			libfreeipmi17-1.6.8-150400.1.11.x86_64
			libipmiconsole2-1.6.8-150400.1.11.x86_64 
			libipmidetect0-1.6.8-150400.1.11.x86_64
build node	sles15	freeipmi-1.6.8-150400.1.11.x86_64
			freeipmi-devel-1.6.8-150400.1.11.x86_64
			libfreeipmi17-1.6.8-150400.1.11.x86_64
			libipmiconsole2-1.6.8-150400.1.11.x86_64

Installation is consistent, especially for the compute nodes.
My current understanding is that freeipmi is used only on the
compute nodes to gather the power perf-counters, so the
different freeipmi versions on the CTL and compute nodes should not matter(?).
					
* I also verified the checksums of the installed ipmi libs between the build nodes for 
  ctl and compute nodes. They're all the same.
  
* Is this maybe a bug in the freeipmi version 1.6.8 used on the compute nodes?
  Should I use the latest version of freeipmi instead?

(* The error is reproducible. --> HPL test crashed over the weekend)
Comment 12 Marshall Garey 2023-09-11 15:03:25 MDT
Yes, upgrading to the latest version of ipmi was part of the instructions given to fix the other ticket:

> needs to be recompiled on the system and relinked against Slurm. Upgrading
> to the latest version of libfreeipmi.so and recompiling Slurm against the
> upgraded library is my suggested path to avoid debugging an older version
> and likely unsupported version of libfreeipmi.


If after upgrading freeipmi and recompiling Slurm against it you still see slurmd crash, will you provide an updated backtrace so we can see if it is any different?
Comment 13 Frank Heckes 2023-09-12 00:43:01 MDT
Okay, I'll work on that this morning and take the chance to update PMIx and the IB accounting.
Comment 14 Frank Heckes 2023-09-18 02:11:19 MDT
I recompiled Slurm 23.02.04 utilizing freeipmi-1.6.11 and re-installed. I started an HPL run. I'll let you know as soon as there is some news.
Comment 15 Frank Heckes 2023-09-19 00:03:31 MDT
The suggestion exacerbated the situation. Now the slurmd's on all nodes crashed almost in sync, as if orchestrated.
I added the following content to the tar archive:
freeipmi.content        - rpm listing freeipmi-1.6.11
slurm-status-20230919   - sinfo output and the buffer overflow messages of all 
                          affected nodes, captured from the messages log
swan56.rpmlist          - RPM list
swan56.stacktrace.via-coredumpctl  - the stack trace of one node, extracted
                                     using the instruction you specified above

I used the src file https://ftp.gnu.org/gnu/freeipmi/freeipmi-1.6.11.tar.gz to 
recompile slurm.
Please let me know if you need any additional information. Thank you.
Comment 16 Frank Heckes 2023-09-19 00:05:07 MDT
Created attachment 32307 [details]
file collection after crash on Sep. 18th
Comment 17 Frank Heckes 2023-09-19 06:45:24 MDT
I forgot to mention that all nodes have been installed using the same image, i.e. they're all identical concerning their installed RPMs and configuration.
Comment 18 Marshall Garey 2023-09-19 08:13:01 MDT
From your most recent backtrace (on swan56), can you run the following in gdb?


t 1
f 12
info locals
p *(struct fiid_obj *)dcmi_rs
p ipmi_dcmi_ctx
p *(struct ipmi_ctx *)ipmi_dcmi_ctx
Comment 19 Frank Heckes 2023-09-19 08:34:56 MDT
Hello Marshall, I captured the output in the attached file gdbSessionOutputAsDescribeInItem18.txt.
Comment 20 Frank Heckes 2023-09-19 08:35:36 MDT
Created attachment 32321 [details]
gdb output as requested in item 18
Comment 22 Marshall Garey 2023-09-19 09:18:18 MDT
Thank you. Can you get some additional info from gdb?

set print pretty on
t 1
f 12
p *(((struct ipmi_ctx *)ipmi_dcmi_ctx)->io.inband.openipmi_ctx)
f 6
info locals
info args
p *ctx
p ctx->device_fd


https://github.com/elitak/freeipmi/blob/0e52de4bdea4ca0b37ec9fde25711abfb41835cb/libfreeipmi/driver/ipmi-openipmi-driver.c#L510

We are trying to identify if the ctx->device_fd is valid here, which it seems like it is not because of the assertion failure in glibc.
Comment 24 Frank Heckes 2023-09-19 11:04:12 MDT
The output has been captured in the attached file gdbSessionOutputAsDescribeInComment22.txt.
Comment 25 Frank Heckes 2023-09-19 11:04:55 MDT
Created attachment 32324 [details]
gdb output as requested in comment #22
Comment 30 Marshall Garey 2023-09-19 14:11:22 MDT
Thanks for your help in debugging the issue. It showed that the problem is a limitation in freeipmi.

(gdb) f 6 
#6  0x00001520445252c4 in _openipmi_read (ctx=ctx@entry=0x15203c26deb0, obj_cmd_rs=obj_cmd_rs@entry=0x15203c25c8e0) at driver/ipmi-openipmi-driver.c:517
517     driver/ipmi-openipmi-driver.c: No such file or directory.

(gdb) p *ctx
$2 = {magic = 3490697231, errnum = 0, flags = 0, driver_device = 0x0, device_fd = 1024, io_init = 1}


device_fd is 1024.

This code passes device_fd+1:

https://github.com/elitak/freeipmi/blob/0e52de4bdea4ca0b37ec9fde25711abfb41835cb/libfreeipmi/driver/ipmi-openipmi-driver.c#L510
> if ((n = select (ctx->device_fd + 1,


glibc checks for fd < 0 || fd >= FD_SETSIZE

https://github.com/lattera/glibc/blob/master/debug/fdelt_chk.c#L24

According to select(2), FD_SETSIZE is 1024:

https://man7.org/linux/man-pages/man2/select.2.html

```
       WARNING: select() can monitor only file descriptors numbers that
       are less than FD_SETSIZE (1024)—an unreasonably low limit for
       many modern applications—and this limitation will not change.
       All modern applications should instead use poll(2) or epoll(7),
       which do not suffer this limitation.
```


My current theory is that slurmd has a bunch of connections open such that fd's 0-1023 are taken, so when ipmi opens a new file it gets fd = 1024.

We are discussing internally how we can work around this limitation. In the meantime, unfortunately the only sure way to avoid this is to not use freeipmi.
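
To make the failure mode concrete, here is a minimal standalone C sketch (an illustration only, not Slurm or freeipmi code). It places a descriptor at FD_SETSIZE and waits on it with poll(), which works; the comments show why the equivalent select()/FD_SET path, as used in the old _openipmi_read(), aborts with glibc's "*** buffer overflow detected ***" message, which is the abort seen in the slurmd backtraces. It assumes the open-files soft limit is above 1024, as it is on these nodes.

```
/* Standalone demo: poll() vs. select() with a descriptor at FD_SETSIZE.
 * Build with e.g.:  gcc -O2 -D_FORTIFY_SOURCE=2 -o fd_demo fd_demo.c
 * Requires "ulimit -n" above 1024 (the nodes here allow 131072). */
#include <poll.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
	/* Mimic a slurmd that already has fds 0..1023 in use: duplicate
	 * stdin onto descriptor number FD_SETSIZE (1024). */
	int fd = dup2(STDIN_FILENO, FD_SETSIZE);
	if (fd < 0) {
		perror("dup2");
		return 1;
	}

	/* poll() takes an array of descriptors, so the numeric value of the
	 * fd does not matter. This is what the freeipmi fix switches to. */
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int rc = poll(&pfd, 1, 0);
	printf("poll() on fd %d returned %d\n", fd, rc);

	/* The old _openipmi_read() path would instead do roughly:
	 *     fd_set rfds;
	 *     FD_ZERO(&rfds);
	 *     FD_SET(fd, &rfds);              <-- fd >= FD_SETSIZE
	 *     select(fd + 1, &rfds, NULL, NULL, &timeout);
	 * With _FORTIFY_SOURCE enabled, FD_SET() goes through glibc's
	 * __fdelt_chk(), which aborts with "*** buffer overflow detected ***"
	 * for any fd >= 1024. */
	close(fd);
	return 0;
}
```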
Comment 31 Frank Heckes 2023-09-19 23:32:53 MDT
Many thanks for the analysis and explanation. Great that the root cause has been identified. We currently run Slurm without the ipmi plugin enabled.
Comment 32 Marshall Garey 2023-09-20 15:16:26 MDT
Is it okay if I change this bug to a sev-3? It sounds like it may not be a sev-2 situation anymore, though obviously having to disable Slurm's ipmi plugin is not ideal.
Comment 33 Frank Heckes 2023-09-21 01:11:06 MDT
Yes, of course, please go ahead. Sorry, I forgot to adjust it on my own.
Comment 34 Nate Rini 2023-09-21 11:15:50 MDT
*** Ticket 17464 has been marked as a duplicate of this ticket. ***
Comment 38 Ole.H.Nielsen@fysik.dtu.dk 2023-09-27 05:16:53 MDT
FYI: I've asked a question about the use of select() to the freeipmi-users mailing list: https://lists.gnu.org/archive/html/freeipmi-users/2023-09/msg00004.html

Let's see what the FreeIPMI developers think about it.

Best regards,
Ole
Comment 40 Ahmed Elmazaty 2023-09-27 07:04:14 MDT
Hello
I understand it's a FreeIPMI bug, but why did it only appear after upgrading Slurm to 23.02.4?
We never had such issues on previous versions (our last upgrade was from 23.02.0).
Comment 41 Felip Moll 2023-09-27 07:29:25 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #38)
> FYI: I've asked a question about the use of select() to the freeipmi-users
> mailing list:
> https://lists.gnu.org/archive/html/freeipmi-users/2023-09/msg00004.html
> 
> Let's see what the FreeIPMI developers think about it.
> 
> Best regards,
> Ole

Hi,

For the moment I have filed a GNU bug with an (untested) patch proposal; I don't know if this is the right way. You could mention it on the mailing list if you want.

Frank, Ole, are any of you able to test this (unverified) patch in your freeipmi installation?

http://savannah.gnu.org/bugs/index.php?64721
Comment 42 Felip Moll 2023-09-27 07:34:55 MDT
(In reply to Ahmed Elmazaty from comment #40)
> Hello
> I understand it's a FreeIPMI bug. But why did it only appear after upgrading
> SLURM to 23.02.4?
> We never had such issues on Previuos versions (Our last upgrade was from
> 23.02.0)

Ahmed, 

This happens when using DCMI in the acct_gather_energy/ipmi plugin (it may also happen with the XCC plugin), which in that case requires issuing raw ipmi commands via the freeipmi libraries. When not using DCMI, the plugin uses the ipmi monitoring library, which doesn't suffer from this issue.

I understand you're now trying DCMI, which is a new feature added in 23.02.0 as part of bug 9629, and until now it hadn't been tested thoroughly. The fact that you haven't hit this bug before must be because you weren't using DCMI or because your slurmd never hit > 1024 open fds.

Unless there's something else we haven't detected, the backtraces in this ticket indicate this is clearly the issue.
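
For reference, the DCMI code path Felip describes is selected by configuration roughly as below (the same options that appear later in this ticket); this is only an illustrative fragment, not a complete configuration:

```
# slurm.conf
AcctGatherEnergyType=acct_gather_energy/ipmi

# acct_gather.conf -- Node=DCMI selects the raw-command (freeipmi library) path;
# without it the plugin uses the ipmi monitoring library instead.
EnergyIPMIPowerSensors=Node=DCMI
EnergyIPMIFrequency=60
```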
Comment 43 Felip Moll 2023-09-27 10:58:48 MDT
Hi, the maintainer of freeipmi agreed to include a fix I proposed in freeipmi 1.6.x.
If any of you are willing to test it, it would be very valuable to the freeipmi team.
Please check:

http://savannah.gnu.org/bugs/index.php?64721

Thank you.
Comment 44 Ole.H.Nielsen@fysik.dtu.dk 2023-09-28 00:52:15 MDT
(In reply to Felip Moll from comment #43)
> Hi, the maintainer of freeipmi agreed to include a fix I proposed in
> freeipmi 1.6.x.
> If some of you is willing to test it, it would be very valuable to freeipmi
> team.
> Please check:
> 
> http://savannah.gnu.org/bugs/index.php?64721

After some searching I found the location of the master branch tar file:
http://git.savannah.gnu.org/cgit/freeipmi.git/snapshot/freeipmi-master.tar.gz

The INSTALL file is missing an important piece of information!  The first thing to do in the freeipmi-master folder is "autoreconf -i"!  I opened a bug on that at https://savannah.gnu.org/bugs/index.php?64723

If you want to build freeipmi RPM packages there are additional steps:

1. Do "autoreconf -i" and "./configure".

2. Edit the generated freeipmi.spec file and modify this line to indicate the master version:

   %define version master

3. Create a new tar-ball from the configured freeipmi-master folder:

   tar czf freeipmi-master.tar.gz freeipmi-master

4. Build the RPM packages for Systemd, see https://lists.gnu.org/archive/html/freeipmi-users/2023-08/msg00001.html

   rpmbuild -ta --with systemd freeipmi-master.tar.gz

  You will get these RPMs (on a CentOS 7 system):

  freeipmi-bmc-watchdog-master-1.el7.x86_64.rpm  freeipmi-debuginfo-master-1.el7.x86_64.rpm  freeipmi-devel-master-1.el7.x86_64.rpm  freeipmi-ipmidetectd-master-1.el7.x86_64.rpm  freeipmi-ipmiseld-master-1.el7.x86_64.rpm  freeipmi-master-1.el7.x86_64.rpm

I don't know if having "master" in the package name will cause conflicts.  Maybe there is a better way to build RPMs from the master branch and have a proper version number instead?

/Ole
Comment 45 Frank Heckes 2023-09-28 02:50:28 MDT
I'm willing to do that, but we have a production go-live milestone for our cluster on Oct 2nd, 2023, and mgmt decreed a software freeze. I'll have a discussion on that, because, ironically, the power mgmt is a mgmt requirement. 
I'll let you know about the outcome asap.

Anyway, I'm not sure whether the fix in patch file http://savannah.gnu.org/bugs/download.php?file_id=55173 is already included by packing the master branch into a previously autoconf'ed tar archive. I think the patch has to be applied manually to the master branch first, as it hasn't been merged yet, and the tar archive or rpmbuild then created from within the changed repo.
Sorry if I overlooked something in the procedure described in #44 or in the freeipmi change-mgmt workflow.
Comment 46 Ole.H.Nielsen@fysik.dtu.dk 2023-09-28 03:03:07 MDT
Hi Frank,

(In reply to Frank Heckes from comment #45)
> Anyway I'm not sure whether the fix in patch file
> http://savannah.gnu.org/bugs/download.php?file_id=55173 is already included
> by packing the master branch into a previously autoconf'ed tar archive. I
> think the patch has to applied manually to the master branch first as it
> haven't been merged, yet and execute create a tar archive or rpmbuild from
> within the changed repo afterwards.
> Sorry I overlooked something in the procedure described in #44 or the
> freeipmi changemgmt workflow.

It would appear that the patch "Use poll() instead of select() to avoid segfaults with high number of open fd" is in the master branch:
https://git.savannah.gnu.org/cgit/freeipmi.git/log/

Are you going to build the master branch as RPMs, or do you do "make install"?  I'm curious about the best way to build RPMs.

Best regards,
Ole
Comment 47 Ole.H.Nielsen@fysik.dtu.dk 2023-09-28 03:06:03 MDT
(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #46)
> It would appear that the patch "Use poll() instead of select() to avoid
> segfaults with high number of open fd" is in the master branch:
> https://git.savannah.gnu.org/cgit/freeipmi.git/log/

I downloaded the freeipmi-master.tar.gz and here the ChangeLog file contains:

2023-09-05 Felip Moll <felip.moll@schedmd.com>

        * libfreeipmi/driver/ipmi-openipmi-driver.c (_openipmi_read):
        Convert from select() to poll() to avoid issues with high number
        of open file descriptors when next fd > 1024. (GNU bug #64721).
Comment 48 Frank Heckes 2023-09-28 03:21:52 MDT
Okay, I should have checked the git repo, my bad.  If the log contains the change then your procedure is fine. I'll try that in the afternoon. Normally I download a tar archive from the freeIPMI release page and execute 
rpmbuild --define '_with_systemd 1' -ba freeipmi.spec 
to build the RPMs from within the unpacked tar archive directory, after the tar archive has been copied to ~/rpmbuild/SOURCES on the build node.
Mgmt permission to apply the change is still pending.
Comment 49 John SchedMD Gilmore 2023-09-28 17:02:39 MDT
It seems like there are two errors here, interacting:

(1)  IPMI couldn't handle FD's >1024.  (Probably fixed, needs testing.)
(2)  Upgrading to a later slurm caused more than 1024 file descriptors to be used.

My guess is that #2 indicates that some code in the new release is leaking file descriptors (e.g. opening them and failing to close them when done).  Such errors have happened before (https://groups.google.com/g/slurm-users/c/VJiRi7n7B0s and https://www.suse.com/support/update/announcement/2021/suse-su-20210773-1/ are easy-to-find examples).  So far I haven't seen any discussion of that possibility in this bug report.
Comment 51 Ole.H.Nielsen@fysik.dtu.dk 2023-09-29 03:34:55 MDT
FYI: Albert Chu has updated the freeipmi master version to 1.7.0, see https://savannah.gnu.org/bugs/?64724.  You can download the master tar-ball (as well as other versions) from https://git.savannah.gnu.org/cgit/freeipmi.git/

The master version includes the patch required by Slurm: "Use poll() instead of select() to avoid segfaults with high number of open fd."

The build instructions have now been documented in the master README file as:

./autogen.sh
./configure
make

and furthermore in https://savannah.gnu.org/bugs/?64724 these steps to build RPMs:

make dist
rpmbuild -ta --with systemd *.tar.gz

I have tested this, and the freeipmi 1.7.0-1 RPMs get built as expected.

It remains to test the freeipmi 1.7.0 with Slurm.

/Ole
Comment 54 Frank Heckes 2023-10-16 08:42:52 MDT
I'm compiling based on 23.02.06 and freeIPMI 1.7.0 (in sync with Ole's comments) right now. My plan is to update our cluster today or tomorrow and verify the per job power accounting.
Comment 55 Ole.H.Nielsen@fysik.dtu.dk 2023-10-18 03:07:38 MDT
I would like to report a successful test of Slurm power monitoring using freeipmi 1.7.0:

We have installed on all our compute nodes and slurmctld server the custom built freeipmi 1.7.0-1 RPMs built from the development master branch as described in comment 51.

I have also built all the Slurm 23.02.6 RPMs including the freeipmi 1.7.0 libraries, and we upgraded most of our cluster from 23.02.5 to 23.02.6 last week.  FYI: We've been running with Configless Slurm for about 2 years now.

Regarding power monitoring we have been running with RAPL using this line in slurm.conf:

AcctGatherEnergyType=acct_gather_energy/rapl

In order to test IPMI power monitoring, I copied the /etc/slurm/ folder from the slurmctld server to a SuperMicro test node still running 23.02.5 on AlmaLinux 8.8.  I changed /etc/slurm/slurm.conf locally on that node into:

AcctGatherEnergyType=acct_gather_energy/ipmi

and activated the corresponding lines in /etc/slurm/acct_gather.conf:

EnergyIPMIPowerSensors=Node=DCMI
EnergyIPMIFrequency=60
EnergyIPMICalcAdjustment=yes

Then I upgraded slurm RPMs from 23.02.5 to 23.02.6, and slurmd is working correctly after the upgrade and restart.  Naturally, in slurmd.log I see a warning that the node runs with local config files:

[2023-10-18T10:29:00.012] warning: Running with local config file despite slurmctld having been setup for configless operation

Now the node power monitoring shows CurrentWatts=354:

$ scontrol show node s003
NodeName=s003 Arch=x86_64 CoresPerSocket=10 
(lines deleted)
   CurrentWatts=354 AveWatts=341
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

At the same time the IPMI command on node s003 reports Current Power = 360 Watts:

s003$ ipmi-dcmi --get-system-power-statistics
Current Power                        : 360 Watts
Minimum Power over sampling duration : 332 watts
Maximum Power over sampling duration : 5278 watts
Average Power over sampling duration : 2417 watts
Time Stamp                           : 10/18/2023 - 08:55:06
Statistics reporting time period     : 2522203704 milliseconds
Power Measurement                    : Active

An identical SuperMicro node with RAPL power monitoring shows only the CPU+DIMM value, CurrentWatts=74.  So IPMI gives much more realistic power monitoring than RAPL.

I conclude that:

1. slurmd doesn't crash when the freeipmi 1.7.0 library is used, even during an upgrade of slurm RPMs.

2. IPMI power monitoring is working as expected with EnergyIPMIPowerSensors=Node=DCMI

3. The power (Watts) values reported by slurmd are the same as the value from the ipmi-dcmi CLI command (the power sampling intervals may be slightly different in these two cases).
Comment 56 Frank Heckes 2023-10-18 03:53:31 MDT
Hi Ole, I just completed my compilation, installation and re-enabling of the energy accounting, too. Concerning the compilation of freeIPMI, I followed the instructions you posted and those from the freeIPMI git and change log. I also double-checked that the change is contained in the git clone I used.
I can confirm your results. 
As in my case the problem occurred after ~8 hours (with active jobs), I restarted the HPL. I'll update the ticket as soon as I have a result from this test.

@Slurm team: Please let me know if you need the compile and configuration files or logs.

I have one question about the 'logic' of the select -> poll change applied. What I don't understand is why the limit of 1024 file descriptors is exceeded on (each) compute node running the freeipmi code. I wonder why such a simple task as creating a connection to the mgmt card and reading some bytes needs more than 1024 file descriptors. This would be a rather high price / jitter effect to make our directors happy. Does anyone have an idea why this is the case?
Comment 57 Ole.H.Nielsen@fysik.dtu.dk 2023-10-18 06:18:05 MDT
I have tested the AcctGatherEnergyType=acct_gather_energy/ipmi also on one of our old Huawei compute nodes whose BMC seems to lack the IPMI DCMI extensions.

The new problem which I reported in bug 17938 is that slurmd annoyingly logs an error every 60 seconds: _get_dcmi_power_reading: get DCMI power reading failed

So if you have nodes where the command "ipmi-dcmi --get-system-power-statistics" fails, then you should wait for the bug 17938 to be fixed.

/Ole
Comment 58 Frank Heckes 2023-10-18 07:35:20 MDT
The HPL is running and I see the following effect:
* the number of open fd's to /dev/ipmi0 for each forked slurmd instance assigned 
  to each MPI task increases monotonically:

swan01:~ # lsof | grep '34160' | grep ipmi | wc -l
236
swan01:~ # lsof | grep '34160' | grep ipmi | wc -l
240

swan01:~ # lsof | less -i
... 
slurmd     34160 34164 slurmd          root  116u      CHR              241,0       0t0        586 /dev/ipmi0
slurmd     34160 34164 slurmd          root  117u      CHR              241,0       0t0        586 /dev/ipmi0
slurmd     34160 34164 slurmd          root  118u      CHR              241,0       0t0        586 /dev/ipmi0
slurmd     34160 34164 slurmd          root  119u      CHR              241,0       0t0        586 /dev/ipmi0
...

* Each slurmd shows this behavior. Is the actual error maybe in the Slurm code?
Comment 59 Frank Heckes 2023-10-18 07:40:35 MDT
* sorry, to be clear, the open fd's to /dev/ipmi0 are allocated by the slurmd started by systemd.
Comment 60 Marshall Garey 2023-10-18 13:18:29 MDT
(In reply to Frank Heckes from comment #56)
> I have one question about the 'logic' of the select-> pull change applied.
> What I don't understand is why the number of 1024 file descriptors are
> exceeded on the (each) compute node running the freeipmi process. I wonder
> why such a simple task as creating a connection to the mgmt card reading
> some bytes need more than 1024 file descriptors. This would be rather high
> price / jitter effect to make our directors happy. Does anyone has an idea
> why this is the case?

It is not that freeipmi needs 1024 file descriptors. The file descriptors are shared among the entire slurmd process. Various things cause slurmd to open file descriptors, including RPC handling, starting slurmstepd's, freeipmi, running scripts such as prolog or epilog or NodeHealthCheck, etc.

(In reply to Frank Heckes from comment #58)
> The HPL is running and I see the following effect:
...
> swan01:~ # lsof | grep '34160' | grep ipmi | wc -l
> 236
> swan01:~ # lsof | grep '34160' | grep ipmi | wc -l
> 240
> * Each slurmd shows this behavior. Does the actual error is maybe in the
> slurm code?


Do you see clear evidence of a file descriptor leak in slurmd? I cannot see an fd leak in my tests.

I am running slurmd with ipmi enabled.
I am monitoring the fd count with the following command:

sudo ls /proc/<slurmd pid>/fd | wc -l

As I run jobs, steps, and other commands, I see the fd count changing but always going back down to the "resting" number of open file descriptors (in my case, 7 open fd's).

Do you have a reproducer where you see this number of open fd's go up and not go back down again?
Comment 61 Mark Titorenko 2023-10-19 04:03:27 MDT
Hi all,

The `acct_gather_energy_ipmi` plugin leaks file descriptors (and memory) when using DCMI; this happens as the thread local `ipmi_dcmi_ctx` struct is not destroyed when threads end.

This occurs both during RPCs from `slurmctld` to `slurmd` for job polling (RPC threads) as well as when `scontrol reconfigure` is issued (a new background polling thread is created).

We've corrected this in our builds by reworking the implementation to only create a DCMI IPMI context when performing a sensor read and removing the thread local approach entirely. This ensures that `/dev/ipmi0` is only open while it needs to be accessed.
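
As an editorial illustration of the two patterns described above, here is a hedged, self-contained C sketch. This is not Mark's patch and not the real `acct_gather_energy_ipmi.c` code; the `dcmi_ctx_t` type and `dcmi_*` helpers are hypothetical stand-ins for freeipmi's context handling, and they open `/dev/null` instead of `/dev/ipmi0` so the sketch runs anywhere.

```
/* Sketch: thread-local (leaky) vs. scoped (fixed) IPMI context lifetime.
 * Build with e.g.:  gcc -O2 -o leak_demo leak_demo.c */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

typedef struct { int fd; } dcmi_ctx_t;          /* stand-in for the real ctx */

static int dcmi_open(dcmi_ctx_t *ctx)
{
	ctx->fd = open("/dev/null", O_RDWR);    /* real plugin: /dev/ipmi0 */
	return (ctx->fd < 0) ? -1 : 0;
}

static int dcmi_read_watts(dcmi_ctx_t *ctx, unsigned *watts)
{
	(void) ctx;
	*watts = 354;                           /* dummy sensor reading */
	return 0;
}

static void dcmi_close(dcmi_ctx_t *ctx)
{
	if (ctx->fd >= 0)
		close(ctx->fd);
	ctx->fd = -1;
}

/* Leaky pattern: a thread-local context created lazily by whichever thread
 * asks for a reading and never destroyed when that thread exits, so each
 * short-lived RPC thread can leave one open descriptor behind. */
static __thread dcmi_ctx_t tls_ctx = { .fd = -1 };

static int read_power_leaky(unsigned *watts)
{
	if (tls_ctx.fd < 0 && dcmi_open(&tls_ctx) < 0)
		return -1;
	return dcmi_read_watts(&tls_ctx, watts); /* fd leaks at thread exit */
}

/* Reworked pattern: the context (and its descriptor) lives only for the
 * duration of one sensor read, no matter which thread calls it. */
static int read_power_scoped(unsigned *watts)
{
	dcmi_ctx_t ctx;

	if (dcmi_open(&ctx) < 0)
		return -1;
	int rc = dcmi_read_watts(&ctx, watts);
	dcmi_close(&ctx);                       /* fd released on every path */
	return rc;
}

int main(void)
{
	unsigned w = 0;

	read_power_leaky(&w);
	read_power_scoped(&w);
	printf("last reading: %u W\n", w);
	return 0;
}
```

The structural point is the second variant: no descriptor to the IPMI device can outlive a single read, regardless of which thread performed it.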

To reproduce: you can exacerbate the issue by increasing the value for `EnergyIPMIFrequency` in `acct_gather.conf` to something large (say 120s), but retain a smaller frequency for `JobAcctGatherFrequency` (say 30s); the logic in `slurmd` will cause a read to be made within the RPC thread, rather than relying on a recent value from the background polling thread.  When running a job, you should see a brand new leaked FD for `/dev/ipmi0` every 30 seconds or so (refer to `lsof /dev/ipmi0`). Also try an `scontrol reconfigure` or two to see more leaks. In our experience/on our hardware, after ~30 of these have been opened, the kernel driver refuses to allow any more FDs to be opened to `/dev/ipmi0` which causes `slurmd` (and anything else that wants to talk IPMI, for example `ipmitool`) to experience extreme unpleasantness.

Somebody might want to open a new bug for this as, while it's tangentially related to the original issue here (the segfault/buffer overflow/crash during energy readings), it's a different issue really. I'll leave it to somebody who pays for support from SchedMD to open that, as anything I open is assigned "importance 6 - no support contract" -- happy to add my findings there too. If anybody is interested in our patch please let me know and I'll either attach it here or in the new bug.

Cheers,

Mark.
Comment 62 Ole.H.Nielsen@fysik.dtu.dk 2023-10-19 05:35:36 MDT
Hi Mark,

Thanks so much for sharing your insights (and maybe even your patch?):

(In reply to Mark Titorenko from comment #61)
> The `acct_gather_energy_ipmi` plugin leaks file descriptors (and memory)
> when using DCMI; this happens as the thread local `ipmi_dcmi_ctx` struct is
> not destroyed when threads end.
> 
> This occurs both during RPCs from `slurmctld` to `slurmd` for job polling
> (RPC threads) as well as when `scontrol reconfigure` is issued (a new
> background polling thread is created).
> 
> We've corrected this in our builds by reworking the implementation to only
> create a DCMI IPMI context when performing a sensor read and removing the
> thread local approach entirely. This ensures that `/dev/ipmi0` is only open
> while it needs to be accessed.
> 
> To reproduce: you can exacerbate the issue by increasing the value for
> `EnergyIPMIFrequency` in `acct_gather.conf` to something large (say 120s),
> but retain a smaller frequency for `JobAcctGatherFrequency` (say 30s); the
> logic in `slurmd` will cause a read to be made within the RPC thread, rather
> than relying on a recent value from the background polling thread.  When
> running a job, you should see a brand new leaked FD for `/dev/ipmi0` every
> 30 seconds or so (refer to `lsof /dev/ipmi0`). Also try an `scontrol
> reconfigure` or two to see more leaks. In our experience/on our hardware,
> after ~30 of these have been opened, the kernel driver refuses to allow any
> more FDs to be opened to `/dev/ipmi0` which causes `slurmd` (and anything
> else that wants to talk IPMI, for example `ipmitool`) to experience extreme
> unpleasantness.

We're running with EnergyIPMIFrequency=60 and JobAcctGatherFrequency=30 and I didn't see the issue initially when activating AcctGatherEnergyType=acct_gather_energy/ipmi with EnergyIPMIPowerSensors=Node=DCMI on a node which was already running a job.  The slurmd file descriptors looked fine for several hours:

$ ls -l /proc/$(pidof slurmd)/fd
total 0
lr-x------ 1 root root 64 Oct 19 08:10 0 -> /dev/null
lrwx------ 1 root root 64 Oct 19 08:10 1 -> socket:[2249162]
lrwx------ 1 root root 64 Oct 19 08:10 2 -> socket:[2249162]
l-wx------ 1 root root 64 Oct 19 08:10 3 -> /var/log/slurm/slurmd.log
lrwx------ 1 root root 64 Oct 19 08:10 4 -> /dev/null
lrwx------ 1 root root 64 Oct 19 08:10 5 -> socket:[2234879]
lrwx------ 1 root root 64 Oct 19 08:10 6 -> /dev/ipmi0
l-wx------ 1 root root 64 Oct 19 08:10 7 -> /run/slurmd.pid

But later a new slurmstepd process was started by the job, and things began to go haywire!  Now we see new FDs being created every few minutes, also when doing "scontrol reconfig":

$ ls -l /proc/$(pidof slurmd)/fd
total 0
lr-x------ 1 root root 64 Oct 19 08:10 0 -> /dev/null
lrwx------ 1 root root 64 Oct 19 08:10 1 -> socket:[2249162]
lrwx------ 1 root root 64 Oct 19 08:10 10 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 11 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 12 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 13 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 14 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 15 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 16 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 17 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 18 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 19 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 2 -> socket:[2249162]
lrwx------ 1 root root 64 Oct 19 08:10 20 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 21 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 22 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 23 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 24 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 25 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 26 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 27 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 28 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 29 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 3 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 30 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 31 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 32 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 33 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 34 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 35 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 36 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 37 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 38 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 39 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 4 -> /dev/null
lrwx------ 1 root root 64 Oct 19 08:10 40 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 41 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 42 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 43 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 44 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 45 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 46 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 47 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 48 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 49 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 5 -> socket:[2234879]
lrwx------ 1 root root 64 Oct 19 08:10 50 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 51 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 52 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 53 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 54 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 55 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 56 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 57 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 58 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 59 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 6 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 60 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 08:10 61 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 62 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 63 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 64 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 65 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 66 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 67 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 68 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 12:57 69 -> /dev/ipmi0
l-wx------ 1 root root 64 Oct 19 08:10 7 -> /run/slurmd.pid
lrwx------ 1 root root 64 Oct 19 12:58 70 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:00 71 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:00 72 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:01 73 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:05 74 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:05 75 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:05 76 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:08 78 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:08 79 -> /dev/ipmi0
l-wx------ 1 root root 64 Oct 19 08:10 8 -> /var/log/slurm/slurmd.log
lrwx------ 1 root root 64 Oct 19 13:08 80 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:09 81 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:09 82 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:16 83 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:16 84 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:16 85 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:16 86 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:16 87 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:16 88 -> /dev/ipmi0
lrwx------ 1 root root 64 Oct 19 13:16 89 -> /dev/ipmi0

We're not seeing any problems with slurmd just yet, but this may be due to our use of freeipmi 1.7.0 (development master) which replaces the select() by poll() in the FreeIPMI driver/ipmi-openipmi-driver.c code, see https://savannah.gnu.org/bugs/?64721.

But obviously we're now suffering from slurmd leaking file descriptors every few minutes as you have described it!

The "ipmi-dcmi --get-system-power-statistics" command continues to work correctly so far.  Our test node runs CentOS 7.9 with kernel 3.10.0-1160.99.1.el7.x86_64.

> Somebody might want to open a new bug for this as, while it's tangentially
> related to the original issue here (the segfault/buffer overflow/crash
> during energy readings), it's a different issue really. I'll leave it to
> somebody who pays for support from SchedMD to open that, as anything I open
> is assigned "importance 6 - no support contract" -- happy to add my findings
> there too. If anybody is interested in our patch please let me know and I'll
> either attach it here or in the new bug.

Either Frank or I (we both pay for Slurm support) could open a new bug report.  But in comment 30 it was shown that >1024 file descriptors were used, and in comment 49 it was suggested that slurmd may leak file descriptors.

So is there a good reason to open a separate bug report, and could you suggest a good description of such a bug?

In any case it would be great if you would be willing to upload your patch to the present bug!  I'm hoping that it could make it into Slurm 23.11, which is only a few weeks away from release, so that all of us would benefit from your work starting with 23.11!

Thanks a lot,
Ole
Comment 63 Marshall Garey 2023-10-19 09:31:38 MDT
Don't open a new bug for the leaking file descriptor. I'll use your more detailed descriptions to try to reproduce.
Comment 64 Mark Titorenko 2023-10-19 09:45:31 MDT
Hi Ole,

We're on an EL8 kernel here (4.18.0-477.15.1.el8_8.x86_64) so it's quite possible that the IPMI driver within the EL7 kernel is less strict about how many FDs can be opened for `/dev/ipmi0` before things start going south -- I haven't investigated further the reasons for the IPMI driver behaving this way, it may even be hardware-specific I guess?

We had previously seen the initially reported buffer overflow issue on an EL7 system, but we backed out all our power monitoring changes there as they weren't mission critical for that system.  However, looking back now, there's definitely a higher than average chance that the FD number did reach 1024+ because of this FD leak issue. :)  My thanks to both yourself and Frank for pursuing a resolution to that one BTW!

Happy to share our patch to `acct_gather_energy_ipmi.c` -- I'll attach it in a moment.  And, indeed, hopefully SchedMD can adapt it (or roll their own preferred approach) in time for 23.11 indeed!

Cheers,

Mark.
Comment 65 Mark Titorenko 2023-10-19 09:47:33 MDT
Created attachment 32832 [details]
Fixes file descriptor leak in IPMI DCMI energy gathering plugin

Patch attached as promised. Whitespace does not conform to code base standards I'm afraid as it wasn't a priority. :)

Cheers,

Mark.
Comment 66 Frank Heckes 2023-10-19 10:20:03 MDT
(In reply to Marshall Garey from comment #60)
First of all, some good news from my side: the HPL completed successfully with AcctGatherEnergyType=acct_gather_energy/ipmi enabled. The energy consumption has been written to the slurmDB and seems to be in sync with the values gathered on the PDUs with a different tool. The HPL ran exclusively on the partition (54 nodes). No other jobs were running at the same time, i.e. there were no jitter effects on FDs from other jobs.


Concerning the open fd's:
 * For lsof I counted only the slurmd's FDs on ipmi0 (-> grep ipmi0). 
   This number never decreased while the job was running. The method
   you used counts ALL FDs via the /proc FS.
 * The numbers measured via lsof were the same on all nodes, except on the
   one node starting the job, which showed 4 fd's less.
 * I haven't measured it, but the (monotonic) increase of 
   opened fd's correlates with the measurement interval of 10s.
 * Indeed the number of open fds in slurmd's proc FS was different from the
   one reported by lsof. The last measurement I took reported 1751 fd's via lsof.
   (I didn't check the proc count before comment #60 arrived.)
   Also:
   after the HPL the FDs are 
    node;date;#fdPROC;#fdLSOF
    swan56;20231019T1600;770;770
   Restarting slurmd brings the lsof-reported FD count down to 1!!:
    swan56;20231019T1609;1;1
   When taking these measurements the reservation was still active, i.e. the 
   partition was empty.

   I'm not sure whether the discrepancy is really a problem or a tool effect.

The reproducer is the HPL2.3 based on foss-2021b.eb (EasyBuild: 
gcc '11.2.0', ; 'OpenMPI', '4.1.1', 'FlexiBLAS', '3.0.4', '', 'FFTW', '3.3.10',
'ScaLAPACK', '2.1.0') executed on 54 Nodes with alpha=0.9; N=2584479 P=18, Q=24

As I have to re-run the HPL for a perf and energy baseline, I'll monitor the procFD and lsofFD with the command attached below and capture the output. I don't know whether this is of any value for you. Please let me know if I should attach the file to this ticket.

Otherwise I don't really have a code-based assumption that would justify keeping the ticket open. Anyway, it's strange that lsof reports so many FDs on /dev/ipmi0.

Cheers,
-Frank 
P.S.: Sorry if I overlooked or haven't considered something from the postings that followed #60.


----------------- FD meas command
this counts the ipmi0 fds in /proc (via a long listing) and via the lsof command for the slurmd pid:
pdsh -g swan 'mtime=$(date "+%Y%m%dT%H%M"); pid=$(ps aux | grep -E slurmd | grep -v grep | awk "{printf(\"%s\", \$2 );}");  echo -e "$mtime;\c" ; if [ -n "${pid}" ] ; then procFDc=$(ls -l /proc/${pid}/fd | grep ipmi0 | wc -l); lsOFc=$(lsof -p ${pid} | grep ipmi0 | wc -l) ; echo "$procFDc;$lsOFc" ; else echo "-;-"; fi' | sort | sed 's/:[[:space:]]\+/;/'
Comment 67 Ole.H.Nielsen@fysik.dtu.dk 2023-10-19 12:00:52 MDT
Mark mentioned in comment 64 that they use an EL8 kernel, so I decided to test a node running AlmaLinux 8.8 with kernel 4.18.0-477.27.2.el8_8.x86_64 as well.

Again, initially there are just a few open file descriptors:

$ ls -l /proc/$(pidof slurmd)/fd
total 0
lr-x------. 1 root root 64 Oct 19 18:35 0 -> /dev/null
lrwx------. 1 root root 64 Oct 19 18:35 1 -> 'socket:[923294]'
lrwx------. 1 root root 64 Oct 19 18:35 2 -> 'socket:[923294]'
lrwx------. 1 root root 64 Oct 19 18:35 4 -> /dev/null
lrwx------. 1 root root 64 Oct 19 18:35 5 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 6 -> 'socket:[934533]'
l-wx------. 1 root root 64 Oct 19 18:35 7 -> /run/slurmd.pid
l-wx------. 1 root root 64 Oct 19 18:35 8 -> /var/log/slurm/slurmd.log
lrwx------. 1 root root 64 Oct 19 18:35 9 -> /dev/ipmi0

But then the running job terminated and a new job was started at 19:19:34, and now there are many messages in slurmd.log:

[2023-10-19T19:19:34.003] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 6771167
[2023-10-19T19:19:34.003] task/affinity: batch_bind: job 6771167 CPU input mask for node: 0xFFFFFFFFFFFFFF
[2023-10-19T19:19:34.003] task/affinity: batch_bind: job 6771167 CPU final HW mask for node: 0xFFFFFFFFFFFFFF
[2023-10-19T19:19:34.027] [6771167.extern] task/cgroup: _memcg_initialize: job: alloc=487200MB mem.limit=487200MB memsw.limit=487200MB job_swappiness=18446744073709551614
[2023-10-19T19:19:34.027] [6771167.extern] task/cgroup: _memcg_initialize: step: alloc=487200MB mem.limit=487200MB memsw.limit=487200MB job_swappiness=18446744073709551614
[2023-10-19T19:19:34.041] Launching batch job 6771167 for UID 261375
[2023-10-19T19:19:34.049] [6771167.batch] task/cgroup: _memcg_initialize: job: alloc=487200MB mem.limit=487200MB memsw.limit=487200MB job_swappiness=18446744073709551614
[2023-10-19T19:19:34.050] [6771167.batch] task/cgroup: _memcg_initialize: step: alloc=487200MB mem.limit=487200MB memsw.limit=487200MB job_swappiness=18446744073709551614
[2023-10-19T19:46:44.048] [6771167.extern] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2023-10-19T19:46:44.048] [6771167.extern] error: _get_joules_task: can't get info from slurmd
[2023-10-19T19:46:44.063] [6771167.batch] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2023-10-19T19:46:44.063] [6771167.batch] error: _get_joules_task: can't get info from slurmd
[2023-10-19T19:47:14.037] [6771167.extern] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2023-10-19T19:47:14.037] [6771167.extern] error: _get_joules_task: can't get info from slurmd
[2023-10-19T19:47:14.072] [6771167.batch] error: slurm_get_node_energy: Socket timed out on send/recv operation
[2023-10-19T19:47:14.072] [6771167.batch] error: _get_joules_task: can't get info from slurmd
[2023-10-19T19:47:34.134] error: _get_dcmi_power_reading: get DCMI power reading failed
[2023-10-19T19:47:34.145] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_ENERGY) failed: Unexpected missing socket error
[2023-10-19T19:47:34.148] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_UPDATE) failed: Unexpected missing socket error
[2023-10-19T19:47:34.148] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_ENERGY) failed: Unexpected missing socket error
[2023-10-19T19:47:34.157] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_ENERGY) failed: Unexpected missing socket error
[2023-10-19T19:47:34.161] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_ENERGY) failed: Unexpected missing socket error

As you can see, there are new types of error from slurmd which we didn't see on the CentOS 7.9 node!  Furthermore, lots of leaking file descriptors now pop up:

$ ls -l /proc/$(pidof slurmd)/fd
total 0
lr-x------. 1 root root 64 Oct 19 18:35 0 -> /dev/null
lrwx------. 1 root root 64 Oct 19 18:35 1 -> 'socket:[923294]'
lrwx------. 1 root root 64 Oct 19 18:35 11 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 12 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 13 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 14 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 15 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 16 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 17 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 18 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 19 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 2 -> 'socket:[923294]'
lrwx------. 1 root root 64 Oct 19 18:35 20 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 21 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 22 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 23 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 24 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 25 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 26 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 27 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 28 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 29 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 30 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 31 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 32 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 33 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 34 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 35 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 36 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 37 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 4 -> /dev/null
lrwx------. 1 root root 64 Oct 19 18:35 5 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 6 -> 'socket:[934533]'
l-wx------. 1 root root 64 Oct 19 18:35 7 -> /run/slurmd.pid
l-wx------. 1 root root 64 Oct 19 18:35 8 -> /var/log/slurm/slurmd.log
lrwx------. 1 root root 64 Oct 19 18:35 9 -> /dev/ipmi0

Strangely, all the file descriptors have time stamps equal to the starting time of slurmd.  File descriptors 10 and up definitely weren't listed before the new job was started at 19:19:34!
Comment 68 Ole.H.Nielsen@fysik.dtu.dk 2023-10-19 12:22:56 MDT
Note added: After we see 37 slurmd file descriptors, the node's BMC has now stopped responding completely:

$ ipmi-dcmi --get-system-power-statistics
ipmi_cmd_dcmi_get_power_reading: driver timeout

At this point it would seem that we need to disable IPMI power monitoring and reboot the node :-(

FYI: This node is a Dell PowerEdge R650 with iDRAC9 BMC.  The previous CentOS 7.9 node was a Dell C6420 also with iDRAC9.
Comment 69 Mark Titorenko 2023-10-19 12:50:10 MDT
Hi Ole,

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #68)
> Note added: After we see 37 slurmd file descriptors, the node's BMC has now
> stopped responding completely:

29 open FDs to `/dev/ipmi0` appears to be the magic number of doom, indeed. ;-)

> $ ipmi-dcmi --get-system-power-statistics
> ipmi_cmd_dcmi_get_power_reading: driver timeout
> 
> At this point it would seem that we need to disable IPMI power monitoring
> and reboot the node :-(

We found that restarting (or stopping) `slurmd` to cause all the open FDs to be closed allowed the BMC to start responding in-band again -- our hypothesis is that the kernel driver is only allowing a maximum number of FDs of `/dev/ipmi0` to be open at any one time, at least under EL8 kernels (unverified though).

Cheers,

Mark.
Comment 71 Ole.H.Nielsen@fysik.dtu.dk 2023-10-20 00:32:26 MDT
(In reply to Mark Titorenko from comment #69)
> Hi Ole,
> 
> (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #68)
> > Note added: After we see 37 slurmd file descriptors, the node's BMC has now
> > stopped responding completely:
> 
> 29 open FDs to `/dev/ipmi0` appears to be the magic number of doom, indeed.
> ;-)
> 
> > $ ipmi-dcmi --get-system-power-statistics
> > ipmi_cmd_dcmi_get_power_reading: driver timeout
> > 
> > At this point it would seem that we need to disable IPMI power monitoring
> > and reboot the node :-(
> 
> We found that restarting (or stopping) `slurmd`, which closes all of its
> open FDs, allowed the BMC to start responding in-band again -- our
> hypothesis is that the kernel driver only allows a limited number of FDs
> on `/dev/ipmi0` to be open at any one time, at least under EL8 kernels
> (unverified, though).

I have now confirmed your findings.  Restarting slurmd (which took 10-15 seconds, much longer than usual) frees up the file descriptors, and the BMC can be queried again:

Frozen BMC state:

[root@d033 ~]# ipmi-dcmi --get-system-power-statistics
ipmi_cmd_dcmi_get_power_reading: driver busy
[root@d033 ~]# ls -l /proc/$(pidof slurmd)/fd
total 0
lr-x------. 1 root root 64 Oct 19 18:35 0 -> /dev/null
lrwx------. 1 root root 64 Oct 19 18:35 1 -> 'socket:[923294]'
lrwx------. 1 root root 64 Oct 19 18:35 10 -> 'socket:[1112493]'
lrwx------. 1 root root 64 Oct 19 18:35 11 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 12 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 13 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 14 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 15 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 16 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 17 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 18 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 19 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 2 -> 'socket:[923294]'
lrwx------. 1 root root 64 Oct 19 18:35 20 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 21 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 22 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 23 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 24 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 25 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 26 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 27 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 28 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 29 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 3 -> 'socket:[1112492]'
lrwx------. 1 root root 64 Oct 19 18:35 30 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 31 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 32 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 33 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 34 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 35 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 36 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 37 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 38 -> 'socket:[1135495]'
lrwx------. 1 root root 64 Oct 19 18:35 39 -> 'socket:[1102809]'
lrwx------. 1 root root 64 Oct 19 18:35 4 -> /dev/null
lrwx------. 1 root root 64 Oct 19 18:35 40 -> 'socket:[1112496]'
lrwx------. 1 root root 64 Oct 19 18:35 5 -> /dev/ipmi0
lrwx------. 1 root root 64 Oct 19 18:35 6 -> 'socket:[934533]'
l-wx------. 1 root root 64 Oct 19 18:35 7 -> /run/slurmd.pid
l-wx------. 1 root root 64 Oct 19 18:35 8 -> /var/log/slurm/slurmd.log
lrwx------. 1 root root 64 Oct 19 18:35 9 -> /dev/ipmi0

Restarting slurmd fixes the issue, at least temporarily:

[root@d033 ~]# systemctl restart slurmd
[root@d033 ~]# ls -l /proc/$(pidof slurmd)/fd
total 0
lr-x------. 1 root root 64 Oct 20 08:29 0 -> /dev/null
lrwx------. 1 root root 64 Oct 20 08:29 1 -> 'socket:[1105603]'
lrwx------. 1 root root 64 Oct 20 08:29 2 -> 'socket:[1105603]'
l-wx------. 1 root root 64 Oct 20 08:29 3 -> /var/log/slurm/slurmd.log
lrwx------. 1 root root 64 Oct 20 08:29 4 -> /dev/null
lrwx------. 1 root root 64 Oct 20 08:29 5 -> 'socket:[1136614]'
l-wx------. 1 root root 64 Oct 20 08:29 6 -> /run/slurmd.pid
lrwx------. 1 root root 64 Oct 20 08:29 8 -> /dev/ipmi0
[root@d033 ~]# ipmi-dcmi --get-system-power-statistics
Current Power                        : 738 Watts
Minimum Power over sampling duration : 10 watts
Maximum Power over sampling duration : 890 watts
Average Power over sampling duration : 623 watts
Time Stamp                           : 10/20/2023 - 06:29:26
Statistics reporting time period     : 1000 milliseconds
Power Measurement                    : Active
Comment 72 Ole.H.Nielsen@fysik.dtu.dk 2023-10-20 00:54:02 MDT
In comment 71 a restart of slurmd cleared up BMC communications.  Alas, after some 15 minutes the deadlocked BMC situation reappeared.  :-(
Comment 73 Ole.H.Nielsen@fysik.dtu.dk 2023-10-20 01:02:24 MDT
In comment 62 the number of open file descriptors was reported on a CentOS 7.9 node.  The EL7 kernel doesn't seem to deadlock the BMC communications the way we have seen with an EL8 kernel.  After some 24 hours of operation slurmd has accumulated 2150 open file descriptors and still seems to work OK:

$ ls -l /proc/$(pidof slurmd)/fd | wc -l
2150

We use freeipmi 1.7.0 with the patch that permits >1024 open file descriptors.

The node's BMC communication continues to work correctly:

$ ipmi-dcmi --get-system-power-statistics
Current Power                        : 428 Watts
Minimum Power over sampling duration : 116 watts
Maximum Power over sampling duration : 496 watts
Average Power over sampling duration : 387 watts
Time Stamp                           : 10/20/2023 - 06:54:33
Statistics reporting time period     : 1000 milliseconds
Power Measurement                    : Active
Comment 74 Mark Titorenko 2023-10-20 02:03:01 MDT
Hi Ole,

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #73)
> In comment 62 the number of open file descriptors was reported on a CentOS
> 7.9 node.  The EL7 kernel doesn't seem to deadlock the BMC communications
> the way we have seen with an EL8 kernel.  After some 24 hours of operation
> slurmd has accumulated 2150 open file descriptors and still seems to work OK:
> 
> $ ls -l /proc/$(pidof slurmd)/fd | wc -l
> 2150
> 
> We use freeipmi 1.7.0 with the patch that permits >1024 open file
> descriptors.

Thanks for confirming this, and for verifying that the recent update to FreeIPMI mitigates the buffer overflow crash.  I think this supports the hypothesis that it's the FD leak that surfaces the buffer overflow issue on EL7 systems. Without the FreeIPMI select->poll patch, `slurmd` would have crashed quite some time ago with that many FDs!

Worth noting that, by default, the `slurmd` process is limited to a maximum of 131072 open files (via `https://github.com/SchedMD/slurm/blob/d365fbc205f3f3919bb34cd8885206fe7fe86337/etc/slurmd.service.in#L14`), so this will give `slurmd` around 60 days to live if the FD leak continues unabated at this rate.
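
A quick check of that estimate, using the ~2150 FDs accumulated over 24 hours reported in comment 73:

$ awk 'BEGIN { printf "%.0f days\n", 131072 / (2150 / 24.0) / 24.0 }'
61 days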

Cheers,

Mark.
Comment 75 Frank Heckes 2023-10-20 07:02:00 MDT
I re-ran the HPL successfully, without buffer overflows. The consumed energy was written to the Slurm DB successfully. Verification of this value against the PDU measurement will take place on Monday.
This time the number of FDs increased only by 1 on all nodes:
772 -> 773 on nodes where slurmd had not been restarted.
3   -> 4   on swan56, where slurmd was restarted before the second HPL run.
It is unclear why 772 FDs (-> proc-slurmFDswan55.log) were created after the first run and never deallocated.
In any case, the file descriptor was not released after the job completed, and again no other jobs ran due to the exclusive reservation.
I attached a log counting the allocated FDs on all nodes during the HPL test (-> slurmFdMeasurment.dat).
Even if the increase is slow, the FD limit will be reached very soon in production on our cluster.

We use SLES 15.4 and kernel 5.14.21-150400.24.55-default on the compute nodes.
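
A sketch of the kind of per-node counting behind slurmFdMeasurment.dat (the pdsh invocation and node range are assumptions, not the actual script used):

# count slurmd FDs on every compute node; adjust the node range to the cluster
$ pdsh -w swan[01-56] 'ls /proc/$(pidof -s slurmd)/fd | wc -l' | sort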
Comment 93 Marshall Garey 2023-10-25 11:13:11 MDT
We've fixed the file descriptor leak in the following commits:

* 8795696092 NEWS for the previous two commits
* 14f26eabe3 Fix slurmd fd leak due to not closing dcmi ctx
* 3d734bccca acct_gather_energy/ipmi plugin - move code to a function

I'm closing this bug as fixed ahead of 23.02.7. Thanks for the bug report and helping us identify the root causes.
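
After upgrading to a release containing these commits, the fix can be confirmed by sampling the slurmd FD count across a few energy-gathering intervals and checking that it no longer grows -- a minimal sketch (the sampling interval is arbitrary):

# the count should stay flat between samples once the leak is fixed
$ for i in 1 2 3; do ls /proc/$(pidof -s slurmd)/fd | wc -l; sleep 120; done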
Comment 95 Ole.H.Nielsen@fysik.dtu.dk 2024-02-02 01:59:11 MST
An update about FreeIPMI: Version 1.6.12 has fixed this issue:

o Use poll() over select() to avoid fd limit in openipmi driver. 

The latest version is currently 1.6.14, available from https://git.savannah.gnu.org/cgit/freeipmi.git/, and you can build EL8/EL9 RPMs from the tarball with:

$ rpmbuild -ta --with systemd freeipmi-1.6.14.tar.gz
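
After installing the rebuilt RPMs, a quick sanity check that the new library is in place and linked by the energy plugin (the plugin path is an assumption for a default RPM install):

$ rpm -q freeipmi
$ ldd /usr/lib64/slurm/acct_gather_energy_ipmi.so | grep -i ipmi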

The acct_gather_energy/ipmi accounting plugin in Slurm 23.02.7 is working correctly for us.

/Ole