Description
Frank Heckes
2023-09-07 03:10:49 MDT
Created attachment 32142 [details]
tar archive containing the files mentioned in the ticket
I forgot to include the limits settings for the slurmd processes. All of them use:

    Limit                     Soft Limit  Hard Limit  Units
    Max cpu time              unlimited   unlimited   seconds
    Max file size             unlimited   unlimited   bytes
    Max data size             unlimited   unlimited   bytes
    Max stack size            unlimited   unlimited   bytes
    Max core file size        unlimited   unlimited   bytes
    Max resident set          unlimited   unlimited   bytes
    Max processes             4127344     4127344     processes
    Max open files            131072      131072      files
    Max locked memory         unlimited   unlimited   bytes
    Max address space         unlimited   unlimited   bytes
    Max file locks            unlimited   unlimited   locks
    Max pending signals       4127344     4127344     signals
    Max msgqueue size         819200      819200      bytes
    Max nice priority         0           0
    Max realtime priority     0           0
    Max realtime timeout      unlimited   unlimited   us

aside from some minor differences of O(100) in 'Max pending signals'.

Please also include the slurmd.log, the output of "slurmd -C", and the backtraces from those core files.
> gdb -ex 't a a bt' -batch slurmd core.12345
Core files are rarely useful to us and most of the time we are unable to read them due to the differences between the Linux environments sites use versus ours.
I don't understand all of the questions.

(In reply to Frank Heckes from comment #0)
> Dear SlurmSupport, here's a description of the compute cluster where the error occurred:
> 1. 56 SuperMicro nodes, 1 TB RAM, AMD 7763, NDR@200GBit HCAs
> 2. OS: SLES 15.4
> 3. Single NDR switch in split mode, HCA and switch OS at the latest version
> 4. MLNX OFED version: MLNX_OFED_LINUX-5.9-0.5.6.
> 5. Slurm version 23.02.4
>    compile environment: gcc-7.5.0 (full list of RPMs can be attached on demand)
> 6. The error happened during a run of HPL version 2.3, built with the EasyBuild modules GCC-11.2.0 and OpenMPI-4.1.1;
>    alpha for N (N = alpha * sqrt(#nodes * mem_per_node / 8)) was 0.85
>
> The job stopped two times with a buffer overflow, which happened on
> 1. 2 nodes the first time (swan[03,04])
> 2. 1 node the second time (swan28)

Are you saying that the job had a buffer overflow? Or that slurmd did? You imply that slurmd crashed because you attached core dumps for slurmd.

> Although the job was marked as FAIL slurm restarted it automatically. Could
> you please also comment on this fact

What job id (or job ids, if multiple jobs)? Did the job requeue itself in the job script? Will you upload the output of the following sacct command?

    sacct -j <jobid> -D --long -p

> Attached files
> a. build log for slurmd RPMs; yes, munge is installed and operational
> b. slurmctld, slurmd logfile and coredumps written

I don't see the slurmd log files. Are you talking about the dmesg output? As Jason mentioned, we cannot make use of the coredumps. We need the backtraces generated from the coredumps.

> c. slurmd.conf, cgroups.conf, acct_gather.conf, gres.conf (although not of
>    interest here, because the GPUs weren't part of the job)
>
> Please let me know if some information or file is missing for your analysis.

I think the slurmd log files are missing.

@Jason: I put everything into the tar archive of the latest upload. Details:
* I'm sorry about the missing slurmd.log files. They are included in the attachment.
* Same for the captured "slurmd -C" output.
* The coredumps are handled via systemd-coredump on all nodes; dumps are stored in compressed format after the event. The interface to access them is coredumpctl:

    swan03:~ # coredumpctl list
    TIME                            PID  UID  GID  SIG      COREFILE  EXE               SIZE
    Wed 2023-09-06 17:25:18 CEST  28102    0    0  SIGABRT  present   /usr/sbin/slurmd  620.4K

  To extract the backtraces I used either

    coredumpctl debug --debugger=gdb --debugger-arguments="-ex 't a a bt' -batch slurmd" /usr/sbin/slurmd > /tmp/swan03.stacktrace.via-coredumpctl 2>&1

  or, in two steps:

    coredumpctl dump -o /tmp/hurz /usr/sbin/slurmd
    gdb -ex 't a a bt' -batch slurmd /tmp/hurz > swan03.stacktrace.gdb-direct 2>&1

  I used this command sequence on all nodes. The 'coredumpctl debug' output contains some more info, but the same stack trace as far as I can see. All files are attached in the second upload.

@Marshall: I'm very sorry for any ambiguity. The slurmd dumped core (see the messages files implicitly). I would never bother you with obvious, directly visible HPC application problems. The job requeued itself after the crash. Actually I made read the isotime for the buffer overflow incorrectely. This one is also caused by job 244. sacct info include is include in tar archive.

Created attachment 32152 [details]
requested info's from comment #3, #4

Sorry for the typos, the last three sentences should read as: Actually I read the isotime string for the buffer overflow on node swan28 incorrectly. This incident is also caused by job 244. The sacct info requested has been included in the tar archive.

Thanks for the updated details.
The backtrace shows that the ipmi library is aborting (assertion failure). We have another private bug (bug 17464) with the same backtrace. In that bug, the problem started happening after they upgraded Slurm, but they had not recompiled the ipmi library and there was an issue. Quoting our response on that bug:

> The IPMI library crash is triggered by the internal memory consistency
> functionality. Considering this worked before the upgrade, this strongly
> suggests that the library is not compatible with Slurm or glibc via some
> unknown incompatibility. I have seen a few times with similar corruption
> (but not this exact issue) that the library (libfreeipmi.so.17 in this case)
> needs to be recompiled on the system and relinked against Slurm. Upgrading
> to the latest version of libfreeipmi.so and recompiling Slurm against the
> upgraded library is my suggested path to avoid debugging an older version
> and likely unsupported version of libfreeipmi.

Did you recently upgrade Slurm? Did this problem start happening after the upgrade? If you upgrade the ipmi library and recompile Slurm against the upgraded library, does the crash go away? Perhaps try it on a single node if the crash is easily reproducible.

Many thanks for the analysis. Indeed, there was an initial version without freeipmi support. I recompiled Slurm against the freeipmi delivered with the SLES 15.4 distribution. I was pretty sure that this worked fine and that I rolled this version out to all nodes. Anyway, something must be wrong. I double-checked the compilation and the installed RPMs:

* I verified the installation is consistent, i.e. the RPMs on the build nodes are those installed on the db and ctl servers and on the compute nodes. (If it is of interest I can upload the script logs.)
* The CTL node and the compute nodes run different OS distros:

    NODE TYPE    OS      INSTALLED RPMs
    slurmctl     el7     freeipmi-1.5.7-3.el7.x86_64
    build node   el7     freeipmi-1.5.7-3.el7.x86_64, freeipmi-devel-1.5.7-3.el7.x86_64
                         ## I use only the ctl, example, restapi and slurm RPM on the ctl
    compute      sles15  freeipmi-1.6.8-150400.1.11.x86_64, libfreeipmi17-1.6.8-150400.1.11.x86_64,
                         libipmiconsole2-1.6.8-150400.1.11.x86_64, libipmidetect0-1.6.8-150400.1.11.x86_64
    build node   sles15  freeipmi-1.6.8-150400.1.11.x86_64, freeipmi-devel-1.6.8-150400.1.11.x86_64,
                         libfreeipmi17-1.6.8-150400.1.11.x86_64, libipmiconsole2-1.6.8-150400.1.11.x86_64

  The installation is consistent, especially for the compute nodes. My current understanding is that freeipmi is only used on the compute nodes to gather the power perf-counters, so the different freeipmi versions on the CTL and compute nodes should not matter(?).
* I also verified the checksums of the installed ipmi libs between the build nodes and the ctl and compute nodes. They're all the same.
* Is this maybe a bug in the freeipmi version 1.6.8 used on the compute nodes? Should I use the latest version of freeipmi instead?
* (The error is reproducible -- the HPL test crashed over the weekend.)

Yes, upgrading to the latest version of ipmi was part of the instructions given to fix the other ticket:
> needs to be recompiled on the system and relinked against Slurm. Upgrading
> to the latest version of libfreeipmi.so and recompiling Slurm against the
> upgraded library is my suggested path to avoid debugging an older version
> and likely unsupported version of libfreeipmi.
If after upgrading freeipmi and recompiling Slurm against it you still see slurmd crash, will you provide an updated backtrace so we can see if it is any different?
Okay, I'll work on that this morning and take the chance to update PMIx and the IB accounting.

I recompiled Slurm 23.02.04 utilizing freeipmi-1.6.11 and re-installed it. I started an HPL run. I'll let you know as soon as there is some news.

The suggestion exacerbated the situation. Now the slurmd's on all nodes crashed almost in sync, as if orchestrated. I added the following content to the tar archive:

* freeipmi.content - RPM listing for freeipmi-1.6.11
* slurm-status-20230919 - sinfo output and the buffer overflow messages of all affected nodes, captured from the messages files
* swan56.rpmlist - RPM list
* swan56.stacktrace.via-coredumpctl - the stack trace of one node, extracted using the instructions you specified above

I used the source file https://ftp.gnu.org/gnu/freeipmi/freeipmi-1.6.11.tar.gz to recompile Slurm. Please let me know if you need any additional information. Thank you.

Created attachment 32307 [details]
file collection after the crash on Sep. 18th
I forgot to mention that all nodes have been installed utilizing the same image, aka they're all identical concerning their installed RPMs and configuration.

From your most recent backtrace (on swan56), can you run the following in gdb?

    t 1
    f 12
    info locals
    p *(struct fiid_obj *)dcmi_rs
    p ipmi_dcmi_ctx
    p *(struct ipmi_ctx *)ipmi_dcmi_ctx

Hello Marshall, I captured the output in the attached file gdbSessionOutputAsDescribeInItem18.txt.

Created attachment 32321 [details]
gdb output as requested in item 18
Thank you. Can you get some additional info from gdb?

    set print pretty on
    t 1
    f 12
    p *(((struct ipmi_ctx *)ipmi_dcmi_ctx)->io.inband.openipmi_ctx)
    f 6
    info locals
    info args
    p *ctx
    p ctx->device_fd

https://github.com/elitak/freeipmi/blob/0e52de4bdea4ca0b37ec9fde25711abfb41835cb/libfreeipmi/driver/ipmi-openipmi-driver.c#L510

We are trying to identify if the ctx->device_fd is valid here; it seems like it is not, because of the assertion failure in glibc.

The output has been captured in the attached file gdbSessionOutputAsDescribeInComment22.txt.

Created attachment 32324 [details]
gdb output as requested in comment #22

Thanks for your help in debugging the issue. It showed that the problem is a limitation in freeipmi.

    (gdb) f 6
    #6  0x00001520445252c4 in _openipmi_read (ctx=ctx@entry=0x15203c26deb0, obj_cmd_rs=obj_cmd_rs@entry=0x15203c25c8e0) at driver/ipmi-openipmi-driver.c:517
    517     driver/ipmi-openipmi-driver.c: No such file or directory.
    (gdb) p *ctx
    $2 = {magic = 3490697231, errnum = 0, flags = 0, driver_device = 0x0, device_fd = 1024, io_init = 1}

device_fd is 1024. This code passes device_fd+1:
https://github.com/elitak/freeipmi/blob/0e52de4bdea4ca0b37ec9fde25711abfb41835cb/libfreeipmi/driver/ipmi-openipmi-driver.c#L510

> if ((n = select (ctx->device_fd + 1,

glibc checks for fd < 0 || fd >= FD_SETSIZE:
https://github.com/lattera/glibc/blob/master/debug/fdelt_chk.c#L24

According to select(2), FD_SETSIZE is 1024:
https://man7.org/linux/man-pages/man2/select.2.html

```
WARNING: select() can monitor only file descriptors numbers that are less
than FD_SETSIZE (1024)—an unreasonably low limit for many modern
applications—and this limitation will not change. All modern applications
should instead use poll(2) or epoll(7), which do not suffer this limitation.
```

My current theory is that slurmd has a bunch of connections open such that fd's 0-1023 are taken, so when ipmi opens a new file it gets fd = 1024. We are discussing internally how we can work around this limitation. In the meantime, unfortunately, the only sure way to avoid this is to not use freeipmi.

Many thanks for the analysis and explanation. Great that the root cause has been identified. We currently run Slurm without the ipmi plugin enabled.

Is it okay if I change this bug to a sev-3? It sounds like it may not be a sev-2 situation anymore, though obviously having to disable Slurm's ipmi plugin is not ideal.

Yes, of course, please go ahead; sorry, I forgot to adjust it on my own.

*** Ticket 17464 has been marked as a duplicate of this ticket. ***

FYI: I've asked a question about the use of select() to the freeipmi-users mailing list:
https://lists.gnu.org/archive/html/freeipmi-users/2023-09/msg00004.html

Let's see what the FreeIPMI developers think about it.

Best regards,
Ole

Hello, I understand it's a FreeIPMI bug. But why did it only appear after upgrading Slurm to 23.02.4? We never had such issues on previous versions (our last upgrade was from 23.02.0).

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #38)
> FYI: I've asked a question about the use of select() to the freeipmi-users
> mailing list:
> https://lists.gnu.org/archive/html/freeipmi-users/2023-09/msg00004.html
>
> Let's see what the FreeIPMI developers think about it.
>
> Best regards,
> Ole

Hi, for the moment I added a GNU bug here with an (untested) patch proposal. I don't know if this is the right way. You could indicate that in the mailing list if you want. Frank, Ole, are some of you able to test this (unverified) patch in your freeipmi installation?
http://savannah.gnu.org/bugs/index.php?64721
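For readers who want to see the failure mode concretely: below is a minimal, self-contained C sketch (illustrative only, not freeipmi or Slurm code; the high descriptor number is simulated by duplicating stdin) of the FD_SETSIZE limitation analysed above and removed by the select() -> poll() patch proposed in the savannah bug. Built with glibc fortification, it aborts inside FD_SET() with the same "buffer overflow detected" failure seen in the slurmd backtraces, while poll() accepts the very same descriptor.

```
/*
 * Illustration only -- not freeipmi or Slurm code.
 * Shows why select()/FD_SET() cannot handle descriptors >= FD_SETSIZE (1024)
 * while poll() can.
 *
 * Build:  gcc -O2 -D_FORTIFY_SOURCE=2 -o fd_limit_demo fd_limit_demo.c
 * Note:   the soft open-files limit usually has to be raised first,
 *         e.g. "ulimit -n 2048", or the F_DUPFD call below fails.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
    /* Mimic a process (like slurmd) whose descriptors 0-1023 are already
     * taken: duplicate stdin onto the first free fd at or above FD_SETSIZE. */
    int fd = fcntl(STDIN_FILENO, F_DUPFD, FD_SETSIZE);
    if (fd < 0) {
        perror("fcntl(F_DUPFD)");
        return 1;
    }
    printf("got fd %d, FD_SETSIZE is %d\n", fd, FD_SETSIZE);

    /* poll() takes an array of pollfd entries and has no FD_SETSIZE cap. */
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    printf("poll() on fd %d returned %d\n", fd, poll(&pfd, 1, 0));

    /* FD_SET() with fd >= FD_SETSIZE writes outside the fixed-size fd_set.
     * With _FORTIFY_SOURCE, glibc's __fdelt_chk() catches this and aborts
     * with "buffer overflow detected" -- the crash reported in this ticket. */
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);                        /* aborts here when fortified */
    select(fd + 1, &rfds, NULL, NULL, NULL);  /* not reached when fortified */

    return 0;
}
```

This is why converting the _openipmi_read() code path from select() to poll(), as the proposed patch does, removes the 1024-fd ceiling without changing the driver's behaviour otherwise.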
(In reply to Ahmed Elmazaty from comment #40)
> Hello
> I understand it's a FreeIPMI bug. But why did it only appear after upgrading
> Slurm to 23.02.4?
> We never had such issues on previous versions (our last upgrade was from
> 23.02.0)

Ahmed, this happens when using DCMI in the acct_gather_energy/ipmi plugin (it may also happen with the XCC plugin), which in that case requires the use of ipmi raw commands via the freeipmi libraries. When not using DCMI it will use the ipmi monitoring library, which doesn't suffer from this issue. I understand you're now trying DCMI, which is a new feature added in 23.02.0 as part of bug 9629, and until now it hadn't been tested thoroughly. The fact that you haven't hit this bug before must be because you weren't using DCMI or because your slurmd never hit > 1024 open fds. Unless there's something else we haven't detected, but the backtraces in this bug indicate this is clearly the issue.

Hi, the maintainer of freeipmi agreed to include a fix I proposed in freeipmi 1.6.x. If any of you are willing to test it, it would be very valuable to the freeipmi team. Please check:

http://savannah.gnu.org/bugs/index.php?64721

Thank you.

(In reply to Felip Moll from comment #43)
> Hi, the maintainer of freeipmi agreed to include a fix I proposed in
> freeipmi 1.6.x.
> If any of you are willing to test it, it would be very valuable to the
> freeipmi team.
> Please check:
>
> http://savannah.gnu.org/bugs/index.php?64721

After some searching I found the location of the master branch tar file:
http://git.savannah.gnu.org/cgit/freeipmi.git/snapshot/freeipmi-master.tar.gz

The INSTALL file is missing an important piece of information! The first thing to do in the freeipmi-master folder is "autoreconf -i"! I opened a bug on that at https://savannah.gnu.org/bugs/index.php?64723

If you want to build freeipmi RPM packages there are additional steps:

1. Do "autoreconf -i" and "./configure".
2. Edit the generated freeipmi.spec file and modify this line to indicate the master version: %define version master
3. Create a new tar-ball from the configured freeipmi-master folder: tar czf freeipmi-master.tar.gz freeipmi-master
4. Build the RPM packages for systemd (see https://lists.gnu.org/archive/html/freeipmi-users/2023-08/msg00001.html): rpmbuild -ta --with systemd freeipmi-master.tar.gz

You will get these RPMs (on a CentOS 7 system):

    freeipmi-bmc-watchdog-master-1.el7.x86_64.rpm
    freeipmi-debuginfo-master-1.el7.x86_64.rpm
    freeipmi-devel-master-1.el7.x86_64.rpm
    freeipmi-ipmidetectd-master-1.el7.x86_64.rpm
    freeipmi-ipmiseld-master-1.el7.x86_64.rpm
    freeipmi-master-1.el7.x86_64.rpm

I don't know if having "master" in the package name will cause conflicts. Maybe there is a better way to build RPMs from the master branch and have a proper version number instead?

/Ole

I'm willing to do that, but we have a production go-live milestone for our cluster on Oct 2nd, 2023, and management has decreed a software freeze. I'll have a discussion on that, because ironically the power management is a management requirement. I'll let you know about the outcome asap. Anyway, I'm not sure whether the fix in the patch file http://savannah.gnu.org/bugs/download.php?file_id=55173 is already included when packing the master branch into a previously autoconf'ed tar archive. I think the patch has to be applied manually to the master branch first, as it hasn't been merged yet, and the tar archive or rpmbuild has to be created from within the changed repo afterwards.
Sorry if I overlooked something in the procedure described in comment #44 or in the freeipmi change management workflow.

Hi Frank,

(In reply to Frank Heckes from comment #45)
> Anyway, I'm not sure whether the fix in the patch file
> http://savannah.gnu.org/bugs/download.php?file_id=55173 is already included
> when packing the master branch into a previously autoconf'ed tar archive. I
> think the patch has to be applied manually to the master branch first, as it
> hasn't been merged yet, and the tar archive or rpmbuild has to be created
> from within the changed repo afterwards.
> Sorry if I overlooked something in the procedure described in comment #44 or
> in the freeipmi change management workflow.

It would appear that the patch "Use poll() instead of select() to avoid segfaults with high number of open fd" is in the master branch:
https://git.savannah.gnu.org/cgit/freeipmi.git/log/

Are you going to build the master branch as RPMs, or do you do "make install"? I'm curious about the best way to build RPMs.

Best regards,
Ole

(In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #46)
> It would appear that the patch "Use poll() instead of select() to avoid
> segfaults with high number of open fd" is in the master branch:
> https://git.savannah.gnu.org/cgit/freeipmi.git/log/

I downloaded the freeipmi-master.tar.gz and here the ChangeLog file contains:

    2023-09-05  Felip Moll <felip.moll@schedmd.com>
        * libfreeipmi/driver/ipmi-openipmi-driver.c (_openipmi_read):
          Convert from select() to poll() to avoid issues with high number
          of open file descriptors when next fd > 1024. (GNU bug #64721).

Okay, I should have checked the git repo, my bad. If the log contains the change then your procedure is fine. I'll try that in the afternoon. Normally I download a tar archive from the freeIPMI release page and execute

    rpmbuild --define '_with_systemd 1' -ba freeipmi.spec

to build the RPMs from within the unpacked tar archive directory, after the tar archive has been copied to ~/rpmbuild/SOURCES on the build node. Management permission to apply the change is still pending.

It seems like there are two errors here, interacting:
(1) IPMI couldn't handle FDs > 1024. (Probably fixed, needs testing.)
(2) Upgrading to a later Slurm caused more than 1024 file descriptors to be used.

My guess is that #2 indicates that some code in the new release is leaking file descriptors (e.g. opening them and failing to close them when done). Such errors have happened before (https://groups.google.com/g/slurm-users/c/VJiRi7n7B0s and https://www.suse.com/support/update/announcement/2021/suse-su-20210773-1/ are easy-to-find examples). So far I haven't seen any discussion of that possibility in this bug report.

FYI: Albert Chu has updated the freeipmi master version to 1.7.0, see https://savannah.gnu.org/bugs/?64724. You can download the master tar-ball (as well as other versions) from https://git.savannah.gnu.org/cgit/freeipmi.git/

The master version includes the patch required by Slurm: "Use poll() instead of select() to avoid segfaults with high number of open fd."

The build instructions have now been documented in the master README file as:

    ./autogen.sh
    ./configure
    make

and furthermore in https://savannah.gnu.org/bugs/?64724 these steps to build RPMs:

    make dist
    rpmbuild -ta --with systemd *.tar.gz

I have tested this, and the freeipmi 1.7.0-1 RPMs get built as expected. It remains to test freeipmi 1.7.0 with Slurm.

/Ole

I'm compiling based on 23.02.06 and freeIPMI 1.7.0 (in sync with Ole's comments) right now.
My plan is to update our cluster today or tomorrow and verify the per-job power accounting.

I would like to report a successful test of Slurm power monitoring using freeipmi 1.7.0:

We have installed on all our compute nodes and the slurmctld server the custom-built freeipmi 1.7.0-1 RPMs built from the development master branch as described in comment 51. I have also built all the Slurm 23.02.6 RPMs including the freeipmi 1.7.0 libraries, and we upgraded most of our cluster from 23.02.5 to 23.02.6 last week. FYI: We've been running with Configless Slurm for about 2 years now.

Regarding power monitoring we have been running with RAPL using this line in slurm.conf:

    AcctGatherEnergyType=acct_gather_energy/rapl

In order to test IPMI power monitoring, I copied the /etc/slurm/ folder from the slurmctld server to a SuperMicro test node still running 23.02.5 on AlmaLinux 8.8. I changed /etc/slurm/slurm.conf locally on that node into:

    AcctGatherEnergyType=acct_gather_energy/ipmi

and activated the corresponding lines in /etc/slurm/acct_gather.conf:

    EnergyIPMIPowerSensors=Node=DCMI
    EnergyIPMIFrequency=60
    EnergyIPMICalcAdjustment=yes

Then I upgraded the slurm RPMs from 23.02.5 to 23.02.6, and slurmd is working correctly after the upgrade and restart. Naturally, in slurmd.log I see a warning that the node runs with local config files:

    [2023-10-18T10:29:00.012] warning: Running with local config file despite slurmctld having been setup for configless operation

Now the node power monitoring shows CurrentWatts=354:

    $ scontrol show node s003
    NodeName=s003 Arch=x86_64 CoresPerSocket=10
    (lines deleted)
    CurrentWatts=354 AveWatts=341
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

At the same time the IPMI command on node s003 reports Current Power = 360 Watts:

    s003$ ipmi-dcmi --get-system-power-statistics
    Current Power                        : 360 Watts
    Minimum Power over sampling duration : 332 watts
    Maximum Power over sampling duration : 5278 watts
    Average Power over sampling duration : 2417 watts
    Time Stamp                           : 10/18/2023 - 08:55:06
    Statistics reporting time period     : 2522203704 milliseconds
    Power Measurement                    : Active

An identical SuperMicro node with RAPL power monitoring shows only the CPU+DIMMs value, CurrentWatts=74. So IPMI gives much more realistic power monitoring than RAPL.

I conclude that:
1. slurmd doesn't crash when the freeipmi 1.7.0 library is used, even during an upgrade of the slurm RPMs.
2. IPMI power monitoring is working as expected with EnergyIPMIPowerSensors=Node=DCMI.
3. The power Watts values reported by slurmd are the same as the values from the ipmi-dcmi CLI command (the power sampling intervals may be slightly different in these two cases).

Hi Ole, I just completed my compilation, installation and re-enabling of the energy accounting, too. Concerning the compilation of freeIPMI, I followed the instructions you posted and those from the freeIPMI git and change log. I also double-checked that the change is contained in the git clone I used. I can confirm your results. As in my case the problem occurred after ~8 hours (per active job), I restarted the HPL. I'll update the ticket as soon as I have a result from this test.

@Slurm team: Please let me know if you need the compile and configuration files or logs.

I have one question about the 'logic' of the select -> poll change applied. What I don't understand is why the limit of 1024 file descriptors is exceeded on each compute node running the freeipmi process.
I wonder why such a simple task as creating a connection to the management card and reading some bytes needs more than 1024 file descriptors. This would be a rather high price / jitter effect just to make our directors happy. Does anyone have an idea why this is the case?

I have tested AcctGatherEnergyType=acct_gather_energy/ipmi also on one of our old Huawei compute nodes whose BMC seems to lack the IPMI DCMI extensions. The new problem, which I reported in bug 17938, is that slurmd annoyingly logs an error every 60 seconds:

    _get_dcmi_power_reading: get DCMI power reading failed

So if you have nodes where the command "ipmi-dcmi --get-system-power-statistics" fails, then you should wait for bug 17938 to be fixed.

/Ole

The HPL is running and I see the following effect:

* The number of open fds to /dev/ipmi0 for each forked slurmd instance assigned to each MPI task increases monotonically:

    swan01:~ # lsof | grep '34160' | grep ipmi | wc -l
    236
    swan01:~ # lsof | grep '34160' | grep ipmi | wc -l
    240
    swan01:~ # lsof | less -i
    ...
    slurmd 34160 34164 slurmd root 116u CHR 241,0 0t0 586 /dev/ipmi0
    slurmd 34160 34164 slurmd root 117u CHR 241,0 0t0 586 /dev/ipmi0
    slurmd 34160 34164 slurmd root 118u CHR 241,0 0t0 586 /dev/ipmi0
    slurmd 34160 34164 slurmd root 119u CHR 241,0 0t0 586 /dev/ipmi0
    ...

* Each slurmd shows this behavior. Is the actual error maybe in the Slurm code?
* Sorry, to be clear: the open fds to /dev/ipmi0 are allocated by the slurmd started by systemd.

(In reply to Frank Heckes from comment #56)
> I have one question about the 'logic' of the select -> poll change applied.
> What I don't understand is why the limit of 1024 file descriptors is
> exceeded on each compute node running the freeipmi process. I wonder why
> such a simple task as creating a connection to the management card and
> reading some bytes needs more than 1024 file descriptors. This would be a
> rather high price / jitter effect just to make our directors happy. Does
> anyone have an idea why this is the case?

It is not that freeipmi needs 1024 file descriptors. The file descriptors are shared among the entire slurmd process. Various things cause slurmd to open file descriptors, including RPC handling, starting slurmstepd's, freeipmi, running scripts such as prolog or epilog or NodeHealthCheck, etc.

(In reply to Frank Heckes from comment #58)
> The HPL is running and I see the following effect:
...
> swan01:~ # lsof | grep '34160' | grep ipmi | wc -l
> 236
> swan01:~ # lsof | grep '34160' | grep ipmi | wc -l
> 240
> * Each slurmd shows this behavior. Is the actual error maybe in the Slurm
> code?

Do you see sure evidence of a file descriptor leak in slurmd? I cannot see an fd leak in my tests. I am running slurmd with ipmi enabled. I am monitoring the fd count with the following command:

    sudo ls /proc/<slurmd pid>/fd | wc -l

As I run jobs, steps, and other commands, I see the fd count changing but always going back down to the "resting" number of open file descriptors (in my case, 7 open fd's). Do you have a reproducer where you see this number of open fd's go up and not go back down again?

Hi all,

The `acct_gather_energy_ipmi` plugin leaks file descriptors (and memory) when using DCMI; this happens as the thread-local `ipmi_dcmi_ctx` struct is not destroyed when threads end. This occurs both during RPCs from `slurmctld` to `slurmd` for job polling (RPC threads) as well as when `scontrol reconfigure` is issued (a new background polling thread is created).
We've corrected this in our builds by reworking the implementation to only create a DCMI IPMI context when performing a sensor read and removing the thread local approach entirely. This ensures that `/dev/ipmi0` is only open while it needs to be accessed. To reproduce: you can exacerbate the issue by increasing the value for `EnergyIPMIFrequency` in `acct_gather.conf` to something large (say 120s), but retain a smaller frequency for `JobAcctGatherFrequency` (say 30s); the logic in `slurmd` will cause a read to be made within the RPC thread, rather than relying on a recent value from the background polling thread. When running a job, you should see a brand new leaked FD for `/dev/ipmi0` every 30 seconds or so (refer to `lsof /dev/ipmi0`). Also try an `scontrol reconfigure` or two to see more leaks. In our experience/on our hardware, after ~30 of these have been opened, the kernel driver refuses to allow any more FDs to be opened to `/dev/ipmi0` which causes `slurmd` (and anything else that wants to talk IPMI, for example `ipmitool`) to experience extreme unpleasantness. Somebody might want to open a new bug for this as, while it's tangentially related to the original issue here (the segfault/buffer overflow/crash during energy readings), it's a different issue really. I'll leave it to somebody who pays for support from SchedMD to open that, as anything I open is assigned "importance 6 - no support contract" -- happy to add my findings there too. If anybody is interested in our patch please let me know and I'll either attach it here or in the new bug. Cheers, Mark. Hi Mark, Thanks so much for sharing your insights (and maybe even your patch?): (In reply to Mark Titorenko from comment #61) > The `acct_gather_energy_ipmi` plugin leaks file descriptors (and memory) > when using DCMI; this happens as the thread local `ipmi_dcmi_ctx` struct is > not destroyed when threads end. > > This occurs both during RPCs from `slurmctld` to `slurmd` for job polling > (RPC threads) as well as when `scontrol reconfigure` is issued (a new > background polling thread is created). > > We've corrected this in our builds by reworking the implementation to only > create a DCMI IPMI context when performing a sensor read and removing the > thread local approach entirely. This ensures that `/dev/ipmi0` is only open > while it needs to be accessed. > > To reproduce: you can exacerbate the issue by increasing the value for > `EnergyIPMIFrequency` in `acct_gather.conf` to something large (say 120s), > but retain a smaller frequency for `JobAcctGatherFrequency` (say 30s); the > logic in `slurmd` will cause a read to be made within the RPC thread, rather > than relying on a recent value from the background polling thread. When > running a job, you should see a brand new leaked FD for `/dev/ipmi0` every > 30 seconds or so (refer to `lsof /dev/ipmi0`). Also try an `scontrol > reconfigure` or two to see more leaks. In our experience/on our hardware, > after ~30 of these have been opened, the kernel driver refuses to allow any > more FDs to be opened to `/dev/ipmi0` which causes `slurmd` (and anything > else that wants to talk IPMI, for example `ipmitool`) to experience extreme > unpleasantness. We're running with EnergyIPMIFrequency=60 and JobAcctGatherFrequency=30 and I didn't see the issue initially when activating AcctGatherEnergyType=acct_gather_energy/ipmi with EnergyIPMIPowerSensors=Node=DCMI on a node which was already running a job. 
The slurmd file descriptors looked fine for several hours: $ ls -l /proc/$(pidof slurmd)/fd total 0 lr-x------ 1 root root 64 Oct 19 08:10 0 -> /dev/null lrwx------ 1 root root 64 Oct 19 08:10 1 -> socket:[2249162] lrwx------ 1 root root 64 Oct 19 08:10 2 -> socket:[2249162] l-wx------ 1 root root 64 Oct 19 08:10 3 -> /var/log/slurm/slurmd.log lrwx------ 1 root root 64 Oct 19 08:10 4 -> /dev/null lrwx------ 1 root root 64 Oct 19 08:10 5 -> socket:[2234879] lrwx------ 1 root root 64 Oct 19 08:10 6 -> /dev/ipmi0 l-wx------ 1 root root 64 Oct 19 08:10 7 -> /run/slurmd.pid But later a new slurmstepd process was started by the job, and things began to go haywire! Now we see new FDs being created every few minutes, also when doing "scontrol reconfig": $ ls -l /proc/$(pidof slurmd)/fd total 0 lr-x------ 1 root root 64 Oct 19 08:10 0 -> /dev/null lrwx------ 1 root root 64 Oct 19 08:10 1 -> socket:[2249162] lrwx------ 1 root root 64 Oct 19 08:10 10 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 11 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 12 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 13 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 14 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 15 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 16 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 17 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 18 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 19 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 2 -> socket:[2249162] lrwx------ 1 root root 64 Oct 19 08:10 20 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 21 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 22 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 23 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 24 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 25 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 26 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 27 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 28 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 29 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 3 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 30 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 31 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 32 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 33 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 34 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 35 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 36 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 37 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 38 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 39 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 4 -> /dev/null lrwx------ 1 root root 64 Oct 19 08:10 40 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 41 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 42 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 43 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 44 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 45 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 46 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 47 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 48 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 49 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 5 -> socket:[2234879] lrwx------ 1 root root 64 Oct 19 08:10 50 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 51 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 52 -> /dev/ipmi0 
lrwx------ 1 root root 64 Oct 19 12:57 53 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 54 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 55 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 56 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 57 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 58 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 59 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 6 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 60 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 08:10 61 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 62 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 63 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 64 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 65 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 66 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 67 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 68 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 12:57 69 -> /dev/ipmi0 l-wx------ 1 root root 64 Oct 19 08:10 7 -> /run/slurmd.pid lrwx------ 1 root root 64 Oct 19 12:58 70 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:00 71 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:00 72 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:01 73 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:05 74 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:05 75 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:05 76 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:08 78 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:08 79 -> /dev/ipmi0 l-wx------ 1 root root 64 Oct 19 08:10 8 -> /var/log/slurm/slurmd.log lrwx------ 1 root root 64 Oct 19 13:08 80 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:09 81 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:09 82 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:16 83 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:16 84 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:16 85 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:16 86 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:16 87 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:16 88 -> /dev/ipmi0 lrwx------ 1 root root 64 Oct 19 13:16 89 -> /dev/ipmi0 We're not seeing any problems with slurmd just yet, but this may be due to our use of freeipmi 1.7.0 (development master) which replaces the select() by poll() in the FreeIPMI driver/ipmi-openipmi-driver.c code, see https://savannah.gnu.org/bugs/?64721. But obviously we're now suffering from slurmd leaking file descriptors every few minutes as you have described it! The "ipmi-dcmi --get-system-power-statistics" command continues to work correctly so far. Our test node runs CentOS 7.9 with kernel 3.10.0-1160.99.1.el7.x86_64. > Somebody might want to open a new bug for this as, while it's tangentially > related to the original issuewho here (the segfault/buffer overflow/crash > during energy readings), it's a different issue really. I'll leave it to > somebody who pays for support from SchedMD to open that, as anything I open > is assigned "importance 6 - no support contract" -- happy to add my findings > there too. If anybody is interested in our patch please let me know and I'll > either attach it here or in the new bug. Either Frank or I (we both pay for Slurm support) could open a new bug report. But in comment 30 it was shown that >1024 file descriptors was used, and in comment 49 it was suggested that slurmd may leak file descriptors. 
So is there a good reason to open a separate bug report, and could you suggest a good description of such a bug? In any case it would be great if you would be willing to upload your patch to the present bug! I'm hoping that it could make it into Slurm 23.11, which is only a few weeks away from release, so that all of us would benefit from your work starting with 23.11!

Thanks a lot,
Ole

Don't open a new bug for the leaking file descriptor. I'll use your more detailed descriptions to try to reproduce.

Hi Ole,

We're on an EL8 kernel here (4.18.0-477.15.1.el8_8.x86_64) so it's quite possible that the IPMI driver within the EL7 kernel is less strict about how many FDs can be opened for `/dev/ipmi0` before things start going south -- I haven't investigated further the reasons for the IPMI driver behaving this way, it may even be hardware-specific I guess? We had previously seen the initially reported buffer overflow issue on an EL7 system, but we backed out all our power monitoring changes there as they weren't mission critical for that system. However, looking back now, there's definitely a higher than average chance that the FD number did reach 1024+ because of this FD leak issue. :) My thanks to both yourself and Frank for pursuing a resolution to that one BTW!

Happy to share our patch to `acct_gather_energy_ipmi.c` -- I'll attach it in a moment. And hopefully SchedMD can indeed adapt it (or roll their own preferred approach) in time for 23.11!

Cheers,
Mark.

Created attachment 32832 [details]
Fixes file descriptor leak in IPMI DCMI energy gathering plugin
Patch attached as promised. Whitespace does not conform to code base standards I'm afraid as it wasn't a priority. :)
Cheers,
Mark.
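As a side note for readers following the thread: the leak pattern Mark describes in comment 61 (a thread-local context that opens /dev/ipmi0 and is never destroyed when short-lived RPC threads exit) can be reproduced in miniature with the hypothetical sketch below. It is not the Slurm plugin code and not Mark's patch; /dev/null stands in for /dev/ipmi0 and all names are made up.

```
/*
 * Hypothetical sketch of the leak pattern described in comment 61 --
 * NOT the acct_gather_energy/ipmi plugin code and not Mark's patch.
 * Each short-lived worker thread lazily opens its own "context" (here just
 * an fd on a stand-in device) and caches it in thread-local storage.
 * Nothing closes it when the thread exits, so every thread leaks one fd in
 * the process, exactly like the growing /dev/ipmi0 entries seen with
 * "lsof /dev/ipmi0" or "ls /proc/<pid>/fd".
 *
 * Build: gcc -O2 -pthread -o tls_fd_leak tls_fd_leak.c
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define FAKE_DEVICE "/dev/null"          /* stand-in for /dev/ipmi0 */

static __thread int tls_ctx_fd = -1;     /* per-thread "IPMI context" */

static void fake_sensor_read(void)
{
    if (tls_ctx_fd < 0)
        tls_ctx_fd = open(FAKE_DEVICE, O_RDWR);  /* opened once per thread */
    /* ... a real plugin would talk to the BMC through this context ... */
}

static void *rpc_thread(void *arg)
{
    (void)arg;
    fake_sensor_read();
    return NULL;   /* thread exits; tls_ctx_fd is lost but stays open: leak */
}

int main(void)
{
    char cmd[64];

    for (int i = 0; i < 20; i++) {       /* simulate 20 energy-polling RPCs */
        pthread_t t;
        pthread_create(&t, NULL, rpc_thread, NULL);
        pthread_join(t, NULL);
    }

    /* Show the damage: roughly 20 more descriptors than we started with. */
    snprintf(cmd, sizeof(cmd), "ls /proc/%d/fd | wc -l", (int)getpid());
    return system(cmd) < 0;
}
```

Both Mark's patch and the eventual 23.02.7 fix noted at the end of this ticket avoid this by not tying the context's lifetime to a thread.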
(In reply to Marshall Garey from comment #60)

First of all, some good news from my side: the HPL completed successfully with AcctGatherEnergyType=acct_gather_energy/ipmi enabled. The energy consumption has been written to the slurmDB and seems to be in sync with the values gathered on the PDUs with a different tool. The HPL ran exclusively on the partition (54 nodes). No other jobs were running at the same time, i.e. there were no jitter effects on FDs from other jobs.

Concerning the open fd's:
* For lsof I counted only the slurmd's FDs on ipmi0 (-> grep ipmi0). This number never decreased while the job was running. The method you used count(ed) ALL FDs via the /proc FS.
* The number measured via lsof was the same on all nodes, besides on the one node starting the job, which showed 4 fd's less.
* I haven't measured it, but the (monotonic) increase of opened fd's correlates with the measurement interval of 10s.
* Indeed, the number of open fds in slurmd's proc FS was different from the one reported by lsof. The last measurement I took reported 1751 fd's via lsof. (I didn't check the proc count, as comment #60 had arrived.)

Also, after the HPL the FDs are:

    node;date;#fdPROC;#fdLSOF
    swan56;20231019T1600;770;770

Restarting the slurmd resets the lsof-reported FDs to 1!!:

    swan56;20231019T1609;1;1

When taking these measurements the reservation was still active, i.e. the partition was empty. I'm not sure whether the discrepancy is really a problem or a tool effect.

The reproducer is HPL 2.3 based on foss-2021b.eb (EasyBuild: gcc '11.2.0'; 'OpenMPI', '4.1.1', 'FlexiBLAS', '3.0.4', 'FFTW', '3.3.10', 'ScaLAPACK', '2.1.0') executed on 54 nodes with alpha=0.9; N=2584479, P=18, Q=24.

As I have to re-run the HPL for a perf and energy baseline, I'll monitor the procFD and lsofFD with the command attached below and capture the output. I don't know whether this is of any value for you. Please let me know if I should attach the file to this ticket. Otherwise I don't really have a code-based assumption that would justify keeping the ticket open. Anyway, it's strange that lsof reports so many FDs on /dev/ipmi0.

Cheers,
-Frank

P.S.: Sorry if I overlooked something or haven't considered something from the postings that followed #60.

----------------- FD meas command
This counts the ipmi0 fds in /proc (via long listing) and via lsof for the slurmd pid:

    pdsh -g swan 'mtime=$(date "+%Y%m%dT%H%M"); pid=$(ps aux | grep -E slurmd | grep -v grep | awk "{printf(\"%s\", \$2 );}"); echo -e "$mtime;\c" ; if [ -n "${pid}" ] ; then procFDc=$(ls -l /proc/${pid}/fd | grep ipmi0 | wc -l); lsOFc=$(lsof -p ${pid} | grep ipmi0 | wc -l) ; echo "$procFDc;$lsOFc" ; else echo "-;-"; fi' | sort | sed 's/:[[:space:]]\+/;/'

Mark mentioned in comment 64 that they use an EL8 kernel, so I decided to test a node running AlmaLinux 8.8 with kernel 4.18.0-477.27.2.el8_8.x86_64 as well. Again, initially there are just a few open file descriptors:

    $ ls -l /proc/$(pidof slurmd)/fd
    total 0
    lr-x------. 1 root root 64 Oct 19 18:35 0 -> /dev/null
    lrwx------. 1 root root 64 Oct 19 18:35 1 -> 'socket:[923294]'
    lrwx------. 1 root root 64 Oct 19 18:35 2 -> 'socket:[923294]'
    lrwx------. 1 root root 64 Oct 19 18:35 4 -> /dev/null
    lrwx------. 1 root root 64 Oct 19 18:35 5 -> /dev/ipmi0
    lrwx------. 1 root root 64 Oct 19 18:35 6 -> 'socket:[934533]'
    l-wx------. 1 root root 64 Oct 19 18:35 7 -> /run/slurmd.pid
    l-wx------. 1 root root 64 Oct 19 18:35 8 -> /var/log/slurm/slurmd.log
    lrwx------.
1 root root 64 Oct 19 18:35 9 -> /dev/ipmi0 But then the running job terminated and a new job was started at 19:19:34, and now there are many messages in slurmd.log: [2023-10-19T19:19:34.003] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 6771167 [2023-10-19T19:19:34.003] task/affinity: batch_bind: job 6771167 CPU input mask for node: 0xFFFFFFFFFFFFFF [2023-10-19T19:19:34.003] task/affinity: batch_bind: job 6771167 CPU final HW mask for node: 0xFFFFFFFFFFFFFF [2023-10-19T19:19:34.027] [6771167.extern] task/cgroup: _memcg_initialize: job: alloc=487200MB mem.limit=487200MB memsw.limit=487200MB job_swappiness=18446744073709551614 [2023-10-19T19:19:34.027] [6771167.extern] task/cgroup: _memcg_initialize: step: alloc=487200MB mem.limit=487200MB memsw.limit=487200MB job_swappiness=18446744073709551614 [2023-10-19T19:19:34.041] Launching batch job 6771167 for UID 261375 [2023-10-19T19:19:34.049] [6771167.batch] task/cgroup: _memcg_initialize: job: alloc=487200MB mem.limit=487200MB memsw.limit=487200MB job_swappiness=18446744073709551614 [2023-10-19T19:19:34.050] [6771167.batch] task/cgroup: _memcg_initialize: step: alloc=487200MB mem.limit=487200MB memsw.limit=487200MB job_swappiness=18446744073709551614 [2023-10-19T19:46:44.048] [6771167.extern] error: slurm_get_node_energy: Socket timed out on send/recv operation [2023-10-19T19:46:44.048] [6771167.extern] error: _get_joules_task: can't get info from slurmd [2023-10-19T19:46:44.063] [6771167.batch] error: slurm_get_node_energy: Socket timed out on send/recv operation [2023-10-19T19:46:44.063] [6771167.batch] error: _get_joules_task: can't get info from slurmd [2023-10-19T19:47:14.037] [6771167.extern] error: slurm_get_node_energy: Socket timed out on send/recv operation [2023-10-19T19:47:14.037] [6771167.extern] error: _get_joules_task: can't get info from slurmd [2023-10-19T19:47:14.072] [6771167.batch] error: slurm_get_node_energy: Socket timed out on send/recv operation [2023-10-19T19:47:14.072] [6771167.batch] error: _get_joules_task: can't get info from slurmd [2023-10-19T19:47:34.134] error: _get_dcmi_power_reading: get DCMI power reading failed [2023-10-19T19:47:34.145] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_ENERGY) failed: Unexpected missing socket error [2023-10-19T19:47:34.148] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_UPDATE) failed: Unexpected missing socket error [2023-10-19T19:47:34.148] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_ENERGY) failed: Unexpected missing socket error [2023-10-19T19:47:34.157] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_ENERGY) failed: Unexpected missing socket error [2023-10-19T19:47:34.161] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_ENERGY) failed: Unexpected missing socket error As you can see, there are new types of error from slurmd which we didn't see on the CentOS 7.9 node! Furthermore, lots of leaking file descriptors now pop up: $ ls -l /proc/$(pidof slurmd)/fd total 0 lr-x------. 1 root root 64 Oct 19 18:35 0 -> /dev/null lrwx------. 1 root root 64 Oct 19 18:35 1 -> 'socket:[923294]' lrwx------. 1 root root 64 Oct 19 18:35 11 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 12 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 13 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 14 -> /dev/ipmi0 lrwx------. 
1 root root 64 Oct 19 18:35 15 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 16 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 17 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 18 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 19 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 2 -> 'socket:[923294]' lrwx------. 1 root root 64 Oct 19 18:35 20 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 21 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 22 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 23 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 24 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 25 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 26 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 27 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 28 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 29 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 30 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 31 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 32 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 33 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 34 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 35 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 36 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 37 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 4 -> /dev/null lrwx------. 1 root root 64 Oct 19 18:35 5 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 6 -> 'socket:[934533]' l-wx------. 1 root root 64 Oct 19 18:35 7 -> /run/slurmd.pid l-wx------. 1 root root 64 Oct 19 18:35 8 -> /var/log/slurm/slurmd.log lrwx------. 1 root root 64 Oct 19 18:35 9 -> /dev/ipmi0 Strangely, all the file descriptors have time stamps equal to the starting time of slurmd. File descriptors 10 and up definitely weren't listed before the new job was started at 19:19:34! Note added: After we see 37 slurmd file descriptors, the node's BMC has now stopped responding completely: $ ipmi-dcmi --get-system-power-statistics ipmi_cmd_dcmi_get_power_reading: driver timeout At this point it would seem that we need to disable IPMI power monitoring and reboot the node :-( FYI: This node is a Dell PowerEdge R650 with iDRAC9 BMC. The previous CentOS 7.9 node was a Dell C6420 also with iDRAC9. Hi Ole, (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #68) > Note added: After we see 37 slurmd file descriptors, the node's BMC has now > stopped responding completely: 29 open FDs to `/dev/ipmi0` appears to be the magic number of doom, indeed. ;-) > $ ipmi-dcmi --get-system-power-statistics > ipmi_cmd_dcmi_get_power_reading: driver timeout > > At this point it would seem that we need to disable IPMI power monitoring > and reboot the node :-( We found that restarting (or stopping) `slurmd` to cause all the open FDs to be closed allowed the BMC to start responding in-band again -- our hypothesis is that the kernel driver is only allowing a maximum number of FDs of `/dev/ipmi0` to be open at any one time, at least under EL8 kernels (unverified though). Cheers, Mark. (In reply to Mark Titorenko from comment #69) > Hi Ole, > > (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #68) > > Note added: After we see 37 slurmd file descriptors, the node's BMC has now > > stopped responding completely: > > 29 open FDs to `/dev/ipmi0` appears to be the magic number of doom, indeed. 
> ;-) > > > $ ipmi-dcmi --get-system-power-statistics > > ipmi_cmd_dcmi_get_power_reading: driver timeout > > > > At this point it would seem that we need to disable IPMI power monitoring > > and reboot the node :-( > > We found that restarting (or stopping) `slurmd` to cause all the open FDs to > be closed allowed the BMC to start responding in-band again -- our > hypothesis is that the kernel driver is only allowing a maximum number of > FDs of `/dev/ipmi0` to be open at any one time, at least under EL8 kernels > (unverified though). I have confirmed your findings now. Restarting slurmd (which took 10-15 seconds, much more than usual) frees up the file descriptors and the BMC can now be queried again: Frozen BMC state: [root@d033 ~]# ipmi-dcmi --get-system-power-statistics ipmi_cmd_dcmi_get_power_reading: driver busy [root@d033 ~]# ls -l /proc/$(pidof slurmd)/fd total 0 lr-x------. 1 root root 64 Oct 19 18:35 0 -> /dev/null lrwx------. 1 root root 64 Oct 19 18:35 1 -> 'socket:[923294]' lrwx------. 1 root root 64 Oct 19 18:35 10 -> 'socket:[1112493]' lrwx------. 1 root root 64 Oct 19 18:35 11 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 12 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 13 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 14 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 15 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 16 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 17 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 18 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 19 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 2 -> 'socket:[923294]' lrwx------. 1 root root 64 Oct 19 18:35 20 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 21 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 22 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 23 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 24 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 25 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 26 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 27 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 28 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 29 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 3 -> 'socket:[1112492]' lrwx------. 1 root root 64 Oct 19 18:35 30 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 31 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 32 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 33 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 34 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 35 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 36 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 37 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 38 -> 'socket:[1135495]' lrwx------. 1 root root 64 Oct 19 18:35 39 -> 'socket:[1102809]' lrwx------. 1 root root 64 Oct 19 18:35 4 -> /dev/null lrwx------. 1 root root 64 Oct 19 18:35 40 -> 'socket:[1112496]' lrwx------. 1 root root 64 Oct 19 18:35 5 -> /dev/ipmi0 lrwx------. 1 root root 64 Oct 19 18:35 6 -> 'socket:[934533]' l-wx------. 1 root root 64 Oct 19 18:35 7 -> /run/slurmd.pid l-wx------. 1 root root 64 Oct 19 18:35 8 -> /var/log/slurm/slurmd.log lrwx------. 1 root root 64 Oct 19 18:35 9 -> /dev/ipmi0 Restarting slurmd fixes the issue, at least temporarily: [root@d033 ~]# systemctl restart slurmd [root@d033 ~]# ls -l /proc/$(pidof slurmd)/fd total 0 lr-x------. 1 root root 64 Oct 20 08:29 0 -> /dev/null lrwx------. 
1 root root 64 Oct 20 08:29 1 -> 'socket:[1105603]' lrwx------. 1 root root 64 Oct 20 08:29 2 -> 'socket:[1105603]' l-wx------. 1 root root 64 Oct 20 08:29 3 -> /var/log/slurm/slurmd.log lrwx------. 1 root root 64 Oct 20 08:29 4 -> /dev/null lrwx------. 1 root root 64 Oct 20 08:29 5 -> 'socket:[1136614]' l-wx------. 1 root root 64 Oct 20 08:29 6 -> /run/slurmd.pid lrwx------. 1 root root 64 Oct 20 08:29 8 -> /dev/ipmi0 [root@d033 ~]# ipmi-dcmi --get-system-power-statistics Current Power : 738 Watts Minimum Power over sampling duration : 10 watts Maximum Power over sampling duration : 890 watts Average Power over sampling duration : 623 watts Time Stamp : 10/20/2023 - 06:29:26 Statistics reporting time period : 1000 milliseconds Power Measurement : Active In comment 71 a restart of slurmd cleared up BMC communications. Alas, after some 15 minutes the deadlocked BMC situation reappeared. :-( In comment 62 the number of open file descriptors was reported on a CentOS 7.9 node. The EL7 kernel doesn't seem to deadlock the BMC communications, as we have seen with an EL8 kernel. After some 24 hours of operation slurmd has accumulated 2150 open file descriptors and still seems to work OK: $ ls -l /proc/$(pidof slurmd)/fd | wc -l 2150 We use freeipmi 1.7.0 with the patch that permits >1024 open file descriptors. The node's BMC communication continues to work correctly: $ ipmi-dcmi --get-system-power-statistics Current Power : 428 Watts Minimum Power over sampling duration : 116 watts Maximum Power over sampling duration : 496 watts Average Power over sampling duration : 387 watts Time Stamp : 10/20/2023 - 06:54:33 Statistics reporting time period : 1000 milliseconds Power Measurement : Active Hi Ole, (In reply to Ole.H.Nielsen@fysik.dtu.dk from comment #73) > In comment 62 the number of open file descriptors was reported on a CentOS > 7.9 node. The EL7 kernel doesn't seem to deadlock the BMC communications, > as we have seen with an EL8 kernel. After some 24 hours of operation slurmd > has accumulated 2150 open file descriptors and still seems to work OK: > > $ ls -l /proc/$(pidof slurmd)/fd | wc -l > 2150 > > We use freeipmi 1.7.0 with the patch that permits >1024 open file > descriptors. Thanks for confirming this, and for verifying that the recent update to FreeIPMI mitigates the buffer overflow crash. I think this supports the hypothesis that it's the FD leak that surfaces the buffer overflow issue on EL7 systems. Without the FreeIPMI select->poll patch, `slurmd` would have crashed quite some time ago with that many FDs! Worth noting that, by default, the `slurmd` process is limited to a maximum of 131072 open files (via `https://github.com/SchedMD/slurm/blob/d365fbc205f3f3919bb34cd8885206fe7fe86337/etc/slurmd.service.in#L14`) so this will give `slurmd` around 60 days to live if the FD leak continues at this rate unabated. Cheers, Mark. I re-run the HPL successful without buffer overflows. The consumed energy was written to slurmDB successful. Verification of this value against the PDU measurement will take place on Monday. This time the number of FD's increased only by 1 on all nodes. 772 -> 773 on nodes were slurm haven't been restared. 3 -> 4 on swan56 were slurm was restarted before the second HPL run. It's unclear, why after the first run 772 FD (-> proc-slurmFDswan55.log) were created and not de-allocated. Anyway the file descriptor wasn't released after the job completed and again no other jobs ran due to the exclusive reservation. 
I attached the log counting the allocated FDs on all nodes during the HPL test (-> slurmFdMeasurment.dat). Even if the increase is slow, the fd limit will be reached very soon in production on our cluster. We use SLES 15.4 and kernel 5.14.21-150400.24.55-default on the compute nodes.

We've fixed the file descriptor leak in the following commits:

* 8795696092 NEWS for the previous two commits
* 14f26eabe3 Fix slurmd fd leak due to not closing dcmi ctx
* 3d734bccca acct_gather_energy/ipmi plugin - move code to a function

I'm closing this bug as fixed ahead of 23.02.7. Thanks for the bug report and helping us identify the root causes.

An update about FreeIPMI: Version 1.6.12 has fixed this issue:

    o Use poll() over select() to avoid fd limit in openipmi driver.

The latest version is currently 1.6.14 from https://git.savannah.gnu.org/cgit/freeipmi.git/ and you can build EL8/EL9 RPMs from the tar-ball with:

    $ rpmbuild -ta --with systemd freeipmi-1.6.14.tar.gz

The acct_gather_energy/ipmi accounting plugin in Slurm 23.02.7 is working correctly for us.

/Ole
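To close the loop, here is a hedged sketch (hypothetical names, stand-in device; not the actual Slurm plugin code or the commits listed above) of the leak-free direction both Mark's patch and the 23.02.7 fix describe: the DCMI context only exists for the duration of a single read, so no descriptor outlives the call that created it.

```
/*
 * Hypothetical sketch of the leak-free pattern -- not the actual Slurm
 * acct_gather_energy/ipmi code. The "context" (an fd on a stand-in device)
 * is created, used and destroyed inside the read function itself, so its
 * lifetime is not tied to whichever RPC or polling thread happens to call it.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define FAKE_DEVICE "/dev/null"        /* stand-in for /dev/ipmi0 */

/* Pretend DCMI power reading; returns watts, or -1 on error. */
static int fake_dcmi_read_power(void)
{
    int watts;
    int fd = open(FAKE_DEVICE, O_RDWR);   /* context exists only in this call */

    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* ... a real implementation would issue the DCMI Get Power Reading
     *     command through the context here ... */
    watts = 0;                            /* placeholder result */

    close(fd);                            /* always released before returning */
    return watts;
}

int main(void)
{
    /* However many threads or polls call this, the process fd count stays
     * flat, which is what "lsof /dev/ipmi0" should show after the fix. */
    for (int i = 0; i < 5; i++)
        printf("reading %d -> %d W\n", i, fake_dcmi_read_power());
    return 0;
}
```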