Ticket 12868

Summary: slurmd crashes with acct_gather_energy/xcc
Product: Slurm
Reporter: AB <Alexander.Block>
Component: slurmd
Assignee: Carlos Tripiana Montes <tripiana>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Version: 20.11.7
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=17639
Site: LRZ
Version Fixed: 21.08.6, 22.05.0pre1
Ticket Blocks: 11323
Attachments: config+log files
debug+log files
build log
specifications for xcc and impi
configure, make, build, slurmd dmesg logs
slurmd log file
slurmd log file 2
debug print patch
IPMI open FDs quickfix test

Description AB 2021-11-17 04:50:02 MST
Created attachment 22294 [details]
config+log files

I want to report an issue with slurmd when using the config

AcctGatherEnergyType=acct_gather_energy/xcc

on our system.

We have Lenovo ThinkSystem SD650 DWC Dual Node Tray nodes. In June we did a maintenance and upgraded our software stack from SLES 12.3 to SLES 15.1 and Slurm from 18.08 to 20.11.7.

We compile and build the Slurm RPMs ourselves using rpmbuild on our small test system.

After the upgrade we saw that slurmd on the nodes is dying. No hints of problems are visible in the slurmd logs; only in the syslogs do we find entries like the following:

i08r11c04s04: Nov 15 14:08:08 i08r11c04s04 slurmd[2239000]: *** buffer overflow detected ***: /usr/sbin/slurmd terminated
i08r11c04s04: Nov 15 14:08:41 i08r11c04s04 systemd[1]: slurmd.service: Main process exited, code=killed, status=6/ABRT
i08r11c04s04: Nov 15 14:08:41 i08r11c04s04 systemd[1]: slurmd.service: Unit entered failed state.
i08r11c04s04: Nov 15 14:08:41 i08r11c04s04 systemd[1]: slurmd.service: Failed with result 'signal'.

This error appears some time (usually within one day) after slurmd is started. Before that, correct energy values for the jobs are written to the DB.

When AcctGatherEnergyType is not set, slurmd runs fine.

We have kept energy measurement switched off since June, but now we need to enable it again.
 
I will attach our configs and the slurmd log file of the node mentioned. Since NO information shows up in slurmctld, I am not adding that file, but I could provide it if needed.

I also had a look at bug 11323, which seems related but has no solution.
Comment 1 Carlos Tripiana Montes 2021-11-17 09:47:23 MST
Hi Alexander,

It doesn't seem quite the same as Bug 11323. First of all, this one was about v2 HW with some underlying changes in how IPMI sends data back after queried.

Your error seems more related to a potential bug in slurmd while trying to get energy data. Something like a leak.

I'm not sure whether it's feasible on your side, but it would be best if you could post here a valgrind log of the slurmd crash, like:

valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=~/valgrind-out.txt -- slurmd -vvvvvv [MORE_FLAGS]

Then please send us both the valgrind output and the super-verbose slurmd log.

Tell us if this is feasible from your side.

Thanks,
Carlos.
Comment 2 AB 2021-11-17 12:27:15 MST
Hi Carlos,

ok, I will try to set it up on our test system and get back to you.
Comment 3 AB 2021-11-18 09:02:32 MST
Created attachment 22316 [details]
debug+log files
Comment 4 AB 2021-11-18 09:06:45 MST
Hi Carlos,

please find attached the requested log files for two nodes.

Since we are in a maintenance window at the moment, I ran valgrind on our main system. The command used was:

NODE=`hostname -s`
LOGNAME="/dss/sngslurm/log/tmp/${NODE}.val.log"
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose --log-file=${LOGNAME} -- slurmd -vvvvvv -D &

The valgrind/slurmd process was gone on the nodes but I did not see any *buffer overflow* message in the syslogs this time.

Thanks,
Alexander
Comment 5 Carlos Tripiana Montes 2021-11-19 03:03:31 MST
Hi,

> The valgrind/slurmd process was gone on the nodes but I did not see any *buffer overflow* message in the syslogs this time.

Did you stress the cluster with jobs?

From the logs, it seems to only happen if:

AcctGatherEnergyType=acct_gather_energy/xcc

is enabled and you let the cluster digest job after job until, at some point, something overflows. That suggested to me that it could be a leak, but I need a valgrind log that reproduces the overflow so I can point right at the problem.

In any case, I see quite a lot of entries in the valgrind log related to ipmi and/or acct_gather_energy. I'll have a look and see if I find anything interesting.

Regards,
Carlos.
Comment 6 AB 2021-11-19 03:47:35 MST
Hi Carlos,

yes, the problem occurs only when we activate 

AcctGatherEnergyType=acct_gather_energy/xcc

on our system. If we just comment it out, no issues are seen. But then we also don't have any energy values for the jobs in the DB.

As you might have seen in our config, we have a cluster with 6480 nodes, consisting of 9 islands. I didn't want to run valgrind on all nodes, so I picked 3 racks (i01r0[2-4], 216 nodes) and started slurmd with valgrind in the way I described before. On all other nodes I started slurmd as a service with systemctl. Then we filled the system with jobs.
In keeping with Murphy's law, the first slurmd crashes with buffer overflow in the syslog appeared on all islands other than i01, where I had started slurmd under valgrind. But eventually the slurmd processes on the valgrind nodes died as well. That should be caused by the same trigger as on the other nodes. However, since slurmd was not under the control of systemd, I assume the *buffer overflow* message was not picked up.
Valgrind may have biased the behavior a bit, but in the end the slurmd process was gone, just as on the other nodes under systemd control.
If you have any other suggestion for how to set up the valgrind slurmd run, please let me know.

Thanks,
Alexander
Comment 7 Carlos Tripiana Montes 2021-11-19 07:54:23 MST
Yes, I saw in the logs that slurmd did die in the end.

After having a look at the provided valgrind log, I can't conclude that the problem is a leak inside Slurm.

The entries detected as direct or indirect (but ultimately confirmed) leaks point to a few lines in acct_gather_energy_xcc.c, all of which call functions in libfreeipmi.so.17.2.1. Looking at the in/out parameters of these calls, I can't see any variable leaking right now.

As an example, the worst leak in the provided valgrind log is:

==3706825== 2,746,568 (744,832 direct, 2,001,736 indirect) bytes in 1,012 blocks are definitely lost in loss record 629 of 630
==3706825==    at 0x4C4422A: calloc (vg_replace_malloc.c:762)
==3706825==    by 0xCFD15E2: ipmi_ctx_create (in /usr/lib64/libfreeipmi.so.17.2.1)
==3706825==    by 0xCA43DED: _init_ipmi_config (acct_gather_energy_xcc.c:308)
==3706825==    by 0xCA450FA: acct_gather_energy_p_get_data (acct_gather_energy_xcc.c:846)
==3706825==    by 0x51619D8: acct_gather_energy_g_get_data (slurm_acct_gather_energy.c:369)
==3706825==    by 0x41634B: _rpc_acct_gather_energy (req.c:3325)
==3706825==    by 0x41DB32: slurmd_req (req.c:349)
==3706825==    by 0x40E7AD: _service_connection (slurmd.c:581)
==3706825==    by 0x64114F8: start_thread (in /lib64/libpthread-2.26.so)
==3706825==    by 0x6723FBE: clone (in /lib64/libc-2.26.so)

This means a leak of 2.62 MiB. Not a big deal; at least not a reason to overflow. But note that the code around this is:

acct_gather_energy_xcc:static int _init_ipmi_config (void)
{
	int ret = 0;
	unsigned int workaround_flags_mask =
		(IPMI_WORKAROUND_FLAGS_INBAND_ASSUME_IO_BASE_ADDRESS
		 | IPMI_WORKAROUND_FLAGS_INBAND_SPIN_POLL);

	if (ipmi_ctx) { <-- IF IT'S NOT NULL
		debug("ipmi_ctx already initialized\n");
		return SLURM_SUCCESS;
	}

L308:	if (!(ipmi_ctx = ipmi_ctx_create())) { <-- IF IT'S NULL
		error("ipmi_ctx_create: %s\n", strerror(errno));
		goto cleanup;
	}
[...]

This function will not leak if we already have a context. It also cleans up correctly if it fails:

[...]
	return SLURM_SUCCESS;

cleanup:
	ipmi_ctx_close(ipmi_ctx);
	ipmi_ctx_destroy(ipmi_ctx);
	return SLURM_ERROR;
}

Additionally, it cleans up properly before the plugin is closed:

extern int fini(void)
{
	if (!running_in_slurmd_stepd())
		return SLURM_SUCCESS;

	flag_energy_accounting_shutdown = true;

	/* clean up the launch thread */
	slurm_cond_signal(&launch_cond);

	if (thread_ipmi_id_launcher)
		pthread_join(thread_ipmi_id_launcher, NULL);

	/* clean up the run thread */
	slurm_cond_signal(&ipmi_cond);

	slurm_mutex_lock(&ipmi_mutex);

	if (ipmi_ctx)
		ipmi_ctx_destroy(ipmi_ctx);
	_reset_slurm_ipmi_conf(&slurm_ipmi_conf);

	slurm_mutex_unlock(&ipmi_mutex);

	if (thread_ipmi_id_run)
		pthread_join(thread_ipmi_id_run, NULL);

	return SLURM_SUCCESS;
}

I also see some indirect leaks like:

==3706825== 862,224 bytes in 1,012 blocks are indirectly lost in loss record 627 of 630
==3706825==    at 0x4C4422A: calloc (vg_replace_malloc.c:762)
==3706825==    by 0xD009100: fiid_obj_create (in /usr/lib64/libfreeipmi.so.17.2.1)
==3706825==    by 0xCFD2395: ipmi_ctx_open_inband (in /usr/lib64/libfreeipmi.so.17.2.1)
==3706825==    by 0xCFD2877: ipmi_ctx_find_inband (in /usr/lib64/libfreeipmi.so.17.2.1)
==3706825==    by 0xCA4402A: _init_ipmi_config (acct_gather_energy_xcc.c:339)
==3706825==    by 0xCA450FA: acct_gather_energy_p_get_data (acct_gather_energy_xcc.c:846)
==3706825==    by 0x51619D8: acct_gather_energy_g_get_data (slurm_acct_gather_energy.c:369)
==3706825==    by 0x41634B: _rpc_acct_gather_energy (req.c:3325)
==3706825==    by 0x41DB32: slurmd_req (req.c:349)
==3706825==    by 0x40E7AD: _service_connection (slurmd.c:581)
==3706825==    by 0x64114F8: start_thread (in /lib64/libpthread-2.26.so)
==3706825==    by 0x6723FBE: clone (in /lib64/libc-2.26.so)

But these are even stranger, since:

acct_gather_energy_xcc:static int _init_ipmi_config (void)
{
[...]
	if (slurm_ipmi_conf.driver_type == NO_VAL) {
L339:		if ((ret = ipmi_ctx_find_inband( <-- THAT ONLY HAPPENS IF driver_type == NO_VAL which is the default.
			     ipmi_ctx,
			     NULL,
			     slurm_ipmi_conf.disable_auto_probe,
			     slurm_ipmi_conf.driver_address,
			     slurm_ipmi_conf.register_spacing,
			     slurm_ipmi_conf.driver_device,
			     slurm_ipmi_conf.workaround_flags,
			     slurm_ipmi_conf.ipmi_flags)) <= 0) {
			error("%s: error on ipmi_ctx_find_inband: %s",
			      __func__, ipmi_ctx_errormsg(ipmi_ctx));

			debug("slurm_ipmi_conf.driver_type=%u\n"
			      "slurm_ipmi_conf.disable_auto_probe=%u\n"
			      "slurm_ipmi_conf.driver_address=%u\n"
			      "slurm_ipmi_conf.register_spacing=%u\n"
			      "slurm_ipmi_conf.driver_device=%s\n"
			      "slurm_ipmi_conf.workaround_flags=%u\n"
			      "slurm_ipmi_conf.ipmi_flags=%u",
			      slurm_ipmi_conf.driver_type,
			      slurm_ipmi_conf.disable_auto_probe,
			      slurm_ipmi_conf.driver_address,
			      slurm_ipmi_conf.register_spacing,
			      slurm_ipmi_conf.driver_device,
			      slurm_ipmi_conf.workaround_flags,
			      slurm_ipmi_conf.ipmi_flags);

			goto cleanup;
		}

This call doesn't hand any pointer back for Slurm to keep, so I can't see what would be leaking here.

As you are using a freeipmi package like https://opensuse.pkgs.org/15.2/opensuse-oss-x86_64/libfreeipmi17-1.6.2-lp152.3.6.x86_64.rpm.html, I've downloaded the source and I can't find anything that leaks near where valgrind is pointing.

----

Looking at your logs, my guess for when slurmd crashes is:

[2021-11-18T16:18:25.386] debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2021-11-18T16:18:26.976] [1480049.0] debug:  slurm_recv_timeout at 0 of 4, recv zero bytes
[2021-11-18T16:18:26.976] [1480049.0] error: slurm_get_node_energy: Zero Bytes were transmitted or received
[2021-11-18T16:18:26.976] [1480049.0] error: _get_joules_task: can't get info from slurmd


[2021-11-18T15:25:11.783] debug2: Processing RPC: REQUEST_ACCT_GATHER_ENERGY
[2021-11-18T15:25:12.403] [1480040.extern] debug:  slurm_recv_timeout at 0 of 4, recv zero bytes
[2021-11-18T15:25:12.403] [1480040.extern] error: slurm_get_node_energy: Zero Bytes were transmitted or received
[2021-11-18T15:25:12.403] [1480040.extern] error: _get_joules_task: can't get info from slurmd

It goes to grab another energy sample and overflows. That is consistent with the valgrind results.

----

Is it possible to recompile Slurm with --disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer, enable core dumps (e.g. ulimit -c unlimited), then run slurmd until it crashes and send us the core file?

It would be super helpful to open this core file in GDB and get a backtrace from the moment slurmd crashed.

The idea is to see if there's any relation to the valgrind entries you posted.

You can also relaunch slurmd after the core is generated in a valgrind session, make sure it crashes, and then post another valgrind log with the results.

With these two valuable clues (core dump and valgrind log, from a Slurm debug build) we can search for the issue more effectively.

As a side note, it would be even better if you upgraded to 21.08.4 before investigating further, although I can't see any important XCC update since late 2020: https://github.com/SchedMD/slurm/commits/master/src/plugins/acct_gather_energy/xcc

Tell me what you think.

Regards,
Carlos.
Comment 8 AB 2021-11-22 05:03:17 MST
Hi Carlos,

I will try to do what you suggested, but on our test system. However, I will first try with our current version, 20.11.7.

I'll get back once I have results (or further questions).

Regards,
Alexander
Comment 9 AB 2021-11-23 07:29:56 MST
Dear Carlos,

I have a question about the build.
Since we need to build RPMs, I use rpmbuild as documented. However, I don't see how I could pass the options you suggested to rpmbuild. In the slurm.spec file only these options are available via rpmmacros:

# build options         .rpmmacros options      change to default action
# ====================  ====================    ========================
# --prefix              %_prefix path           install path for commands, libraries, etc.
# --with cray           %_with_cray 1           build for a Cray Aries system
# --with cray_network   %_with_cray_network 1   build for a non-Cray system with a Cray network
# --with cray_shasta    %_with_cray_shasta 1    build for a Cray Shasta system
# --with slurmrestd     %_with_slurmrestd 1     build slurmrestd
# --with slurmsmwd      %_with_slurmsmwd 1      build slurmsmwd
# --without debug       %_without_debug 1       don't compile with debugging symbols
# --with hdf5           %_with_hdf5 path        require hdf5 support
# --with hwloc          %_with_hwloc 1          require hwloc support
# --with lua            %_with_lua path         build Slurm lua bindings
# --with mysql          %_with_mysql 1          require mysql/mariadb support
# --with numa           %_with_numa 1           require NUMA support
# --without pam         %_without_pam 1         don't require pam-devel RPM to be installed
# --without x11         %_without_x11 1         disable internal X11 support
# --with ucx            %_with_ucx path         require ucx support
# --with pmix           %_with_pmix path        require pmix support

Can you please give me a hint on how to build RPMs with the suggested options?

Thanks,
Alexander
Comment 10 Carlos Tripiana Montes 2021-11-25 05:49:48 MST
Just do:

wget https://github.com/SchedMD/slurm/archive/refs/tags/slurm-20-11-7-1.tar.gz
tar -xf slurm-20-11-7-1.tar.gz
mv slurm-slurm-20-11-7-1 slurm-20.11.7
cd slurm-20.11.7
vim slurm.spec
:351 # go to the last configure line
i # insert

Then insert the extra " \" and desired options like this:

        %{?_with_cflags} \
        --disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer

make %{?_smp_mflags}

:x #save
cd ..
tar -jcf slurm-20.11.7.tar.bz2 slurm-20.11.7
rpmbuild -tc -v slurm-20.11.7.tar.bz2

I've tested this to work on my Fedora testbed.

Or just use configure instead of rpmbuild.

Cheers,
Carlos.
Comment 11 AB 2021-11-25 06:56:03 MST
Hi Carlos,

I have inserted this line 

	%{?_with_cflags:--disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer} \

and it built the rpms.

However, since I was also testing a bit with configure and make in the meantime, I'm not sure whether these options really change anything in the build process. Is there any check I can do to see whether the options have really been applied and changed something? I compared against the build logs of the original version and can't find any substantial changes.

Thanks,
Alexander
Comment 12 Carlos Tripiana Montes 2021-11-25 07:10:12 MST
  --disable-optimizations disable optimizations (sets -O0)                      
  --enable-developer      enable developer options (asserts, -Werror - also     
                          sets --enable-debug as well)
  --enable-memory-leak-debug                                                    
                          enable memory leak debugging code for development 

Yeah, in fact --enable-developer also enables --enable-debug, but anyway, TL;DR:

-O0
-Werror
asserts as fatal
-ggdb3
#define NDEBUG 1
#define MEMORY_LEAK_DEBUG 1

The idea is to ensure your reproduction doesn't trigger anything extra in debug mode, and to give me all the flags I need for the best possible debugging of the core file I expect to receive, plus stack traces in valgrind that are as complete as possible. It also makes sure the valgrind analysis doesn't fail in its guesses.

Thanks,
Carlos.
Comment 13 AB 2021-11-25 07:26:44 MST
Hi Carlos,

that's what I meant: e.g. I don't see -O0 anywhere in the make log... I'm attaching the build log for your inspection.

Thanks,
Alexander
Comment 14 AB 2021-11-25 07:27:26 MST
Created attachment 22405 [details]
build log
Comment 15 Carlos Tripiana Montes 2021-11-25 08:00:19 MST
You aren't setting things right because:

+ ./configure --host=x86_64-suse-linux-gnu --build=x86_64-suse-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc/slurm --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/lib --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-dependency-tracking --disable-slurmrestd

I can't see the desired options in your configure line.

I see your modification to slurm.spec is:

%{?_with_cflags:--disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer} \

meanwhile I'd suggested:

 %{?_with_cflags} \
        --disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer

They don't quite mean the same thing. Try changing it to the above.

Additionally, if -D_FORTIFY_SOURCE=2 (added automatically by the RPM machinery) collides with -O0 in your distro, set CFLAGS to empty like this:

%build
%define _with_cflags "CFLAGS= "
%configure \
[...]

Overall, in my Fedora, build block is:

%build
%define _with_cflags "CFLAGS= "
%configure \
        %{?_without_debug:--disable-debug} \
        %{?_with_pam_dir} \
        %{?_with_cpusetdir} \
        %{?_with_mysql_config} \
        %{?_with_ssl} \
        %{?_without_cray:--enable-really-no-cray}\
        %{?_with_cray_network:--enable-cray-network}\
        %{?_with_multiple_slurmd:--enable-multiple-slurmd} \
        %{?_with_pmix} \
        %{?_with_freeipmi} \
        %{?_with_hdf5} \
        %{?_with_shared_libslurm} \
        %{!?_with_slurmrestd:--disable-slurmrestd} \
        %{?_without_x11:--disable-x11} \
        %{?_with_ucx} \
        %{?_with_cflags} \
        --disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer

make %{?_smp_mflags}

Don't forget to repack the tar.bz2 that rpmbuild is going to use as my steps in Comment 10.

My logs from rpmbuild then said:

+ ./configure --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc/slurm --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-slurmrestd 'CFLAGS= ' --disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer

Regards,
Carlos.
Comment 16 AB 2021-11-25 08:31:45 MST
Hi Carlos,

these are my steps:

Edit slurm.spec in /etc/slurm/src/tmp/slurm-20.11.7 to 
...
	%{?_with_ucx} \
	%{?_with_cflags:} \
        --disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer 

make %{?_smp_mflags}
...

tsl1:/etc/slurm/src/tmp # tar cvfj slurm-20.11.7.tar.bz2 slurm-20.11.7
tsl1:/etc/slurm/src/tmp # rpmbuild -ta slurm-20.11.7.tar.bz2 > build-slurm.txt.slurm-20.11.7.debug.2 2>&1
tsl1:/etc/slurm/src/tmp # tail -n15 build-slurm.txt.slurm-20.11.7.debug.2
checking for lua5.3... yes
checking for whether we can link to liblua... no lua5.3
configure: WARNING: unable to link against lua libraries
checking whether man2html is available... checking for man2html... no
configure: WARNING: unable to build man page html files without man2html
checking for support of printf("%s", NULL)... configure: error: printf("%s", NULL) results in abort
error: Bad exit status from /var/tmp/rpm-tmp.96i4m5 (%build)


RPM build errors:
    Macro expanded in comment on line 22: %_prefix path		install path for commands, libraries, etc.

    Macro expanded in comment on line 170: %define _unpackaged_files_terminate_build      0

    Bad exit status from /var/tmp/rpm-tmp.96i4m5 (%build)
tsl1:/etc/slurm/src/tmp # 

So the change you suggested leads to an error and aborts the rpmbuild process. That's why I put the options into the macro definition.
Comment 17 AB 2021-11-25 08:35:49 MST
Addendum - it also throws the error when removing the ":", i.e.

...
        %{?_with_ucx} \
        %{?_with_cflags} \
        --disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer

make %{?_smp_mflags}
...
Comment 18 Carlos Tripiana Montes 2021-11-25 09:08:05 MST
Yeah, that's why I told you:

> Additionally, if -D_FORTIFY_SOURCE=2 (added automatically by RPM stuff) collides with -O0 in your distro, set CFLAGS to empty like this:

%build
%define _with_cflags "CFLAGS= "
%configure \
[...]

in Comment 15.

You need to remove some RPM default flags that collide with debug mode. In any case, you are only dealing with this because you are sticking to rpmbuild; plain configure doesn't pose this kind of problem.

Cheers,
Carlos.
Comment 19 AB 2021-11-25 12:41:39 MST
Hi Carlos,

what I did now is the following:

tsl1:/etc/slurm/src/tmp/slurm-20.11.7 # ./configure --disable-optimizations --enable-debug --enable-memory-leak-debug --enable-developer

tsl1:/etc/slurm/src/tmp/slurm-20.11.7 # make > make.out.txt 2>&1

Within this make step I found -O0 in the compile commands, which suggests the options were applied successfully.

After that I edited the slurm.spec in /etc/slurm/src/tmp/slurm-20.11.7 and commented out the %build, %configure and make sections:
...
#%build
#%configure \
#       %{?_without_debug:--disable-debug} \
#       %{?_with_pam_dir} \
#       %{?_with_cpusetdir} \
#       %{?_with_mysql_config} \
#       %{?_with_ssl} \
#       %{?_without_cray:--enable-really-no-cray}\
#       %{?_with_cray_network:--enable-cray-network}\
#       %{?_with_multiple_slurmd:--enable-multiple-slurmd} \
#       %{?_with_pmix} \
#       %{?_with_freeipmi} \
#       %{?_with_hdf5} \
#       %{?_with_shared_libslurm} \
#       %{!?_with_slurmrestd:--disable-slurmrestd} \
#       %{?_without_x11:--disable-x11} \
#       %{?_with_ucx} \
#       %{?_with_cflags}

#make %{?_smp_mflags}
...

And then

tsl1:/etc/slurm/src/tmp # tar cvfj slurm-20.11.7.tar.bz2 slurm-20.11.7
tsl1:/etc/slurm/src/tmp # rpmbuild -ta slurm-20.11.7.1.tar.bz2 > build-slurm.txt.slurm-20.11.7.1.debug 2>&1

This created the RPMs I needed. However, they are all a bit smaller than the ones built without debug enabled.
I will install these RPMs on the nodes now and see what happens.

However, I must state that I am only dealing with this because I reported a bug and SchedMD asked the customer to use Slurm build options that are obviously not fully officially supported.
Comment 20 Carlos Tripiana Montes 2021-11-25 15:12:06 MST
We do officially support every option that configure documents in its --help.

What's happening is just that we want to create RPMs with developer options, and slurm.spec is only intended for production builds.

That's why I suggested just using plain configure and avoiding the extra effort of dealing with rpmbuild.

Sorry if anything was not properly explained. I'm very happy to help you discover the root of this issue, and the effort from your side is much appreciated, as it's the key to delivering a solution as quickly as possible. I hope to find the clues we need in the results of your test run.

Finally, apart from -O0, see my Comment 12 for the other flags that should have been activated, as well as some defines in config.h.

Thanks,
Carlos.
Comment 21 AB 2021-11-26 03:12:04 MST
Hi Carlos,

I did now install the debug version on a node in our test system.
The first error was

[2021-11-26T10:49:27.988] fatal: Unable to find slurmstepd file at /usr/local/sbin/slurmstepd

So that kind of build seems to have changed the slurmstepd location, but it was easy to solve with a symlink.
Then slurmd started, and now I see a lot of errors in slurmd.log like this:

[2021-11-26T11:06:38.576] debug3: acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: success
[2021-11-26T11:06:38.576] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16
[2021-11-26T11:06:38.576] error: _thread_update_node_energy could not read XCC ipmi values

Which brings us back to the ticket I also mentioned at the very beginning (bug 11323).

I will submit a job to the node and see what happens next.
Comment 22 Carlos Tripiana Montes 2021-11-26 03:43:00 MST
Wow, that's weird.

The error is not the same as in Bug 11323: that one was about a new IPMI firmware returning values in a different format. There, the new format is 40 bytes long while the implemented format is 16 bytes long.

Your problem seems worse. While querying the IPMI interface, the response was just 2 bytes long, which is smaller than anything that could hold a real power metric:

Yours:
raw command: 2 bytes, expected 16

Bug 11323:
raw command: 40 bytes, expected 16

I need to see your new slurmd's log to see if there's anything else of interest.

It would be nice to check the system logs from this node (journal, dmesg, ...) for clues about the IPMI interface, the kernel module, or the freeipmi version doing something bad...

In any case, I'm going to see if the fact that Slurm receives something like 2 bytes can cause a memory leak... But that's weird again, since in the earlier logs it seemed to be working until some overflow happened:

[2021-11-18T14:11:33.093] debug3: acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: success

At least I have a new clue to continue digging into the issue :).

Post the log whenever you have time.

Thanks,
Carlos.
Comment 23 Carlos Tripiana Montes 2021-11-26 03:51:21 MST
There's no leak here:

static xcc_raw_single_data_t *_read_ipmi_values(void)
{
	xcc_raw_single_data_t *xcc_reading;
	uint8_t buf_rs[IPMI_RAW_MAX_ARGS];
	int rs_len = 0;

	if (!IPMI_NET_FN_RQ_VALID(cmd_rq[1])) {
                error("Invalid netfn value\n");
		return NULL;
	}

        rs_len = ipmi_cmd_raw(ipmi_ctx,
                              cmd_rq[0], // Lun (logical unit number)
                              cmd_rq[1], // Net Function
                              &cmd_rq[2], // Command number + request data
                              cmd_rq_len - 2, // Length (in bytes)
                              &buf_rs, // response buffer
                              IPMI_RAW_MAX_ARGS // max response length
                );

        debug3("ipmi_cmd_raw: %s", ipmi_ctx_errormsg(ipmi_ctx));

        if (rs_len != XCC_EXPECTED_RSPLEN) {
		error("Invalid ipmi response length for XCC raw command: "
		      "%d bytes, expected %d", rs_len, XCC_EXPECTED_RSPLEN);
		return NULL;
	}
[...]

ipmi_cmd_raw returns "success", but then rs_len is only 2 bytes long?

My freeipmi headers show:

/* for request/response, byte #1 = cmd */
/* for response, byte #2 (typically) = completion code */
/* returns length written into buf_fs on success, -1 on error */
int ipmi_cmd_raw (ipmi_ctx_t ctx,
                  uint8_t lun,
                  uint8_t net_fn,
                  const void *buf_rq,
                  unsigned int buf_rq_len,
                  void *buf_rs,
                  unsigned int buf_rs_len);

It doesn't quite make sense, does it? I'm going to have a look at the freeipmi bug reports as well to see if I spot something.
Comment 24 Carlos Tripiana Montes 2021-11-26 04:13:32 MST
Created attachment 22410 [details]
specifications for xcc and impi

I've dug back into the past to grab the specification of the IPMI raw command and its return value.

Take a look at slide 5.

The first 2 bytes are fixed and exist for debugging reasons, so the response to this raw command is effectively empty on this node.

I'm not really sure this problem is the same one that's causing the overflow you reported initially, because the debug compilation flags affect neither what the code is doing here nor the log messages you are now seeing.

Regards,
Carlos.
Comment 25 AB 2021-11-26 04:49:57 MST
Hi Carlos,

I would suggest waiting a bit with further analysis at the moment.
We will create a new image and install the same (debug) version on all management and compute nodes to have a clean installation for debugging. And maybe there is also a problem with the XCC on the compute nodes, which I have to clarify first.
Comment 26 AB 2021-11-26 05:10:55 MST
Created attachment 22411 [details]
configure, make, build, slurmd dmesg logs

Please find in the meantime the logs of the build and slurmd on two nodes.
Comment 27 AB 2021-11-26 05:16:23 MST
And I got a script from Lenovo to test the XCC/IPMI:

#!/bin/bash
energy1=$((echo ibase=16; ipmi-raw 0x0 0x3a 0x32 4 2 0 0 0 | awk '{print $9 $8 $7 $6}') | bc -l)
date1=$((echo ibase=16; ipmi-raw 0x0 0x3a 0x32 4 2 0 0 0 | awk '{print $15 $14 $13 $12}') | bc -l)
sleep 20
energy2=$((echo ibase=16; ipmi-raw 0x0 0x3a 0x32 4 2 0 0 0 | awk '{print $9 $8 $7 $6}') | bc -l)
date2=$((echo ibase=16; ipmi-raw 0x0 0x3a 0x32 4 2 0 0 0 | awk '{print $15 $14 $13 $12}') | bc -l)
energy=$((energy2-energy1))
interval=$((date2-date1))
echo $energy1 $energy2 $energy $date1 $date2 $interval # $((energy/interval))W

A run on one of the nodes looks fine...

thin2:~ # ./energy.sh 
279485028 279488614 3586 1637928871 1637928891 20
Comment 28 AB 2021-11-26 07:41:32 MST
Hi Carlos,

slurmd on one node (thin1) has now crashed with a buffer overflow.

thin1:/var/log/slurm # systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: signal) since Fri 2021-11-26 13:57:20 CET; 1h 43min ago
  Process: 17082 ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS (code=killed, signal=ABRT)
 Main PID: 17082 (code=killed, signal=ABRT)
    Tasks: 24
   CGroup: /slurm.slice/slurmd.service
           ├─19721 slurmstepd: [1110.extern]
           ├─19726 sleep 100000000
           ├─19729 slurmstepd: [1110.batch]
           ├─19734 /bin/bash /var/spool/slurm/job01110/slurm_script
           ├─19741 srun xhpl
           ├─19742 srun xhpl
           ├─19756 slurmstepd: [1110.0]
           ├─19763 ./xhpl
           └─19764 ./xhpl

Nov 26 13:56:19 thin1 slurmd[17082]: slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
Nov 26 13:56:29 thin1 slurmd[17082]: slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
Nov 26 13:56:29 thin1 slurmd[17082]: slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
Nov 26 13:56:30 thin1 slurmd[17082]: slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
Nov 26 13:56:49 thin1 slurmd[17082]: slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_UPDATE
Nov 26 13:56:59 thin1 slurmd[17082]: slurmd: debug2: Finish processing RPC: REQUEST_ACCT_GATHER_ENERGY
Nov 26 13:56:59 thin1 slurmd[17082]: *** buffer overflow detected ***: /usr/sbin/slurmd terminated
Nov 26 13:57:20 thin1 systemd[1]: slurmd.service: Main process exited, code=killed, status=6/ABRT
Nov 26 13:57:20 thin1 systemd[1]: slurmd.service: Unit entered failed state.
Nov 26 13:57:20 thin1 systemd[1]: slurmd.service: Failed with result 'signal'.

I will attach the slurmd log file.
Comment 29 AB 2021-11-26 07:41:54 MST
Created attachment 22412 [details]
slurmd log file
Comment 30 AB 2021-11-26 08:19:45 MST
Created attachment 22413 [details]
slurmd log file 2

On the other node slurmd crashed as well. Please find slurmd.log attached.
Comment 31 Carlos Tripiana Montes 2021-11-26 09:01:57 MST
From your valgrind log:

==3748094== 2900 errors in context 20 of 578:
==3748094== Syscall param ioctl(generic) points to uninitialised byte(s)
==3748094==    at 0x671AAC7: ioctl (in /lib64/libc-2.26.so)
==3748094==    by 0xD0061ED: ??? (in /usr/lib64/libfreeipmi.so.17.2.1)
==3748094==    by 0xD00676E: ipmi_openipmi_cmd (in /usr/lib64/libfreeipmi.so.17.2.1)
==3748094==    by 0xCFE594A: ??? (in /usr/lib64/libfreeipmi.so.17.2.1)
==3748094==    by 0xCFE5A2D: ??? (in /usr/lib64/libfreeipmi.so.17.2.1)
==3748094==    by 0xCFD3712: ipmi_cmd_raw (in /usr/lib64/libfreeipmi.so.17.2.1)
==3748094==    by 0xCA44242: _read_ipmi_values (acct_gather_energy_xcc.c:434)
==3748094==    by 0xCA4440E: _thread_update_node_energy (acct_gather_energy_xcc.c:491)
==3748094==    by 0xCA44C2E: _thread_ipmi_run (acct_gather_energy_xcc.c:672)
==3748094==    by 0x64114F8: start_thread (in /lib64/libpthread-2.26.so)
==3748094==    by 0x6723FBE: clone (in /lib64/libc-2.26.so)
==3748094==  Address 0xe1d8f50 is on thread 4's stack
==3748094==  Uninitialised value was created by a stack allocation
==3748094==    at 0xD006107: ??? (in /usr/lib64/libfreeipmi.so.17.2.1)

==3724395== 1,997 (56 direct, 1,941 indirect) bytes in 1 blocks are definitely lost in loss record 578 of 648
==3724395==    at 0x4C41FDB: malloc (vg_replace_malloc.c:309)
==3724395==    by 0xD00900B: fiid_obj_create (in /usr/lib64/libfreeipmi.so.17.2.1)
==3724395==    by 0xCFE5A01: ??? (in /usr/lib64/libfreeipmi.so.17.2.1)
==3724395==    by 0xCFD3712: ipmi_cmd_raw (in /usr/lib64/libfreeipmi.so.17.2.1)
==3724395==    by 0xCA44242: _read_ipmi_values (acct_gather_energy_xcc.c:434)
==3724395==    by 0xCA4440E: _thread_update_node_energy (acct_gather_energy_xcc.c:491)
==3724395==    by 0xCA44C2E: _thread_ipmi_run (acct_gather_energy_xcc.c:672)
==3724395==    by 0x64114F8: start_thread (in /lib64/libpthread-2.26.so)
==3724395==    by 0x6723FBE: clone (in /lib64/libc-2.26.so)

----

From the code:

static xcc_raw_single_data_t *_read_ipmi_values(void)
{
	xcc_raw_single_data_t *xcc_reading;
	uint8_t buf_rs[IPMI_RAW_MAX_ARGS];
	int rs_len = 0;

	if (!IPMI_NET_FN_RQ_VALID(cmd_rq[1])) {
                error("Invalid netfn value\n");
		return NULL;
	}

        rs_len = ipmi_cmd_raw(ipmi_ctx,
                              cmd_rq[0], // Lun (logical unit number)
                              cmd_rq[1], // Net Function
                              &cmd_rq[2], // Command number + request data
                              cmd_rq_len - 2, // Length (in bytes)
                              &buf_rs, // response buffer
                              IPMI_RAW_MAX_ARGS // max response length
                );

        debug3("ipmi_cmd_raw: %s", ipmi_ctx_errormsg(ipmi_ctx));

        if (rs_len != XCC_EXPECTED_RSPLEN) {
		error("Invalid ipmi response length for XCC raw command: "
		      "%d bytes, expected %d", rs_len, XCC_EXPECTED_RSPLEN);
		return NULL;
	}
[...]

----

From the logs, you see tons of similar messages like:

185338:[2021-11-26T13:56:48.572] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16
185339:[2021-11-26T13:56:48.572] error: _thread_update_node_energy could not read XCC ipmi values
185345:[2021-11-26T13:56:53.572] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16
185346:[2021-11-26T13:56:53.572] error: _thread_update_node_energy could not read XCC ipmi values
185348:[2021-11-26T13:56:58.573] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16
185349:[2021-11-26T13:56:58.573] error: _thread_update_node_energy could not read XCC ipmi values
185358:[2021-11-26T13:56:59.662] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16
185359:[2021-11-26T13:56:59.662] error: _thread_update_node_energy could not read XCC ipmi values

That means ipmi_cmd_raw returns "success", but rs_len is only 2 bytes. This repeats until slurmd eventually crashes and systemd detects the issue:

Nov 26 13:56:59 thin1 slurmd[17082]: *** buffer overflow detected ***: /usr/sbin/slurmd terminated

The slurmd logs then repeatedly show:

185382:[2021-11-26T13:57:05.098] [1110.batch] error: slurm_get_node_energy: Zero Bytes were transmitted or received
185383:[2021-11-26T13:57:05.098] [1110.batch] error: _get_joules_task: can't get info from slurmd
185387:[2021-11-26T13:57:20.342] [1110.0] error: slurm_get_node_energy: Transport endpoint is not connected
185388:[2021-11-26T13:57:20.342] [1110.0] error: _get_joules_task: can't get info from slurmd
185443:[2021-11-26T13:57:29.659] [1110.extern] debug2: Error connecting slurm stream socket at 192.168.192.32:6818: Connection refused

----

Everything points to a leak inside freeipmi by now. I'm digging into the code valgrind is flagging, and I'll keep you posted with more info ASAP. As a side note, because I'm not certain about the root cause, I can't guarantee a rollback of freeipmi would help; the problem may sit deeper, e.g. a glibc change breaking freeipmi, so it might well require a freeipmi upgrade instead. I just need some time to dig into it calmly.

Cheers,
Carlos.
Comment 32 Carlos Tripiana Montes 2021-11-26 09:33:42 MST
Nevertheless, I still think the best action to take right now is to enable:

ulimit -c unlimited

for a slurmd compiled with the debug flags. Then let slurmd crash (it seems to happen after many hours, which means it's a small leak that overflows over time).

Then send us the generated core file for inspection. If the core file's backtrace at crash time still goes into freeipmi, we have confirmation of the root of the problem, and also more clues about where the leak is generated.

Thanks for your time,
Carlos.
Comment 33 AB 2021-11-29 05:49:12 MST
Hi Carlos,

yes, we are working on getting an image with core dumps enabled. I will get back to you once we have results.

Thanks,
Alexander
Comment 35 Carlos Tripiana Montes 2021-12-16 08:32:09 MST
We found that the first 2 bytes you receive correspond to:

Byte 0: the same command code that is sent in the request.
Byte 1: the completion code from the XCC response.

And then the data response part is empty.

I think we can print byte 1 and see the value.

It might be in this list too:

https://sysmgt.lenovofiles.com/help/index.jsp?topic=%2Fcom.lenovo.systems.management.xcc.amd.doc%2Foem_ipmi_commands.html

Tell us what you think.

Thanks.
Comment 36 AB 2021-12-17 01:26:57 MST
Hi Carlos,

I'm not really familiar with the ipmi-raw commands, and in the link you sent I don't see the command we used. What I did now is run the command and capture the full output:

thin2:~ # ipmi-raw 0x0 0x3a 0x32 4 2 0 0 0
rcvd: 32 00 00 00 CC 9C 41 18 32 00 35 46 BC 61 DC 00

Is that what you mean?
If you think of something else I need to contact the guys from Lenovo.

FYI - since we have a hardware problem on our test system, I am not able to run valgrind there before the Christmas holidays.

Thanks!
Comment 37 Carlos Tripiana Montes 2021-12-17 01:46:42 MST
Hi,

Your command is OK, it matches the command slurm sends:

/* LUN, NetFN, CMD, Data[n]*/
static uint8_t cmd_rq[8] = { 0x00, 0x3A, 0x32, 4, 2, 0, 0, 0 };

I really expect a command like the one in Comment 36 to work well.

Remember that slurmd issues the command and gets the result fine until, at some point, it starts to receive only 2 bytes. Some time later slurmd crashes.

My idea was, if you want, to patch Slurm to add a line that prints this byte:

Byte 1: the completion code sent from the XCC response.

So when slurmd gets a 2-byte result from the command, we will know the code value.

Also, I asked in the past to set ulimit -c unlimited and let the slurmd crash to get a core file and analyse it with GDB.

IMHO, maybe at some point the XCC board starts failing to send the result back. I'm not sure, but it seems to fail in a way freeipmi doesn't expect, triggering some corner-case memory leak.

I'm afraid I'm not able to help more without a core dump or some new information, like the "byte 1" value.

Tell me what you think.

Thanks,
Carlos.
Comment 38 AB 2021-12-17 05:18:15 MST
Hi Carlos,

ok, I get your point now.

If you let me know how/where I can patch slurm I could install this version on the nodes.
Comment 39 Carlos Tripiana Montes 2021-12-20 03:43:48 MST
Created attachment 22732 [details]
debug print patch

Hi,

I'm attaching a debug print patch that should always print 3 lines in a row at info level.

I intend this to be deployed only until the error is reproduced; then you can go back to the production release.

If you encounter any issue with the patch please tell me so ASAP.

Thanks,
Carlos.
Comment 40 AB 2021-12-21 03:53:39 MST
Hi Carlos,

I have applied the patch and built new rpms.
Installation was successful and slurmd is running. However, I can't start a job on the node. The only error I see is in the node's logfile:

[2021-12-21T11:48:12.416] error: Error binding slurm stream socket: Address already in use
[2021-12-21T11:48:12.416] error: Unable to bind listen port (6818): Address already in use


Any ideas?!

The other message I see is this (maybe already what you wanted):

[2021-12-21T11:48:12.440] debug:  JobId=1128: sent epilog complete msg: rc = 0
[2021-12-21T11:48:12.440] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2021-12-21T11:48:17.190] acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: success
[2021-12-21T11:48:17.190] acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: CMD: 00 3A 32 04 02 00 00 00
[2021-12-21T11:48:17.190] acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: RSP: 32 D5 00 00 00 00 00 00 08 00 00 00 00 00 00 00
[2021-12-21T11:48:17.190] error: Invalid ipmi response length for XCC raw command: 2 bytes, expected 16
[2021-12-21T11:48:17.190] error: _thread_update_node_energy could not read XCC ipmi values
Comment 41 AB 2021-12-21 04:12:03 MST
Hi Carlos,

sorry, the last comment may lead us in a false direction.
I installed the patched version on a different node than before, and that one seems not to have a correct XCC installation. I will try another node.

Thanks.
Comment 42 AB 2021-12-21 06:06:28 MST
I have installed the patched code on another node. The behavior with respect to jobs is the same (canceled). But here is the output in the log file:

[2021-12-21T12:52:12.474] error: Error binding slurm stream socket: Address already in use
[2021-12-21T12:52:12.474] error: Unable to bind listen port (6818): Address already in use
[2021-12-21T12:52:12.475] debug3: _rpc_batch_job: return from _forkexec_slurmstepd: -1
[2021-12-21T12:52:12.485] debug:  Finished wait for JobId=1130, all steps
[2021-12-21T12:52:12.491] debug2: No steps in jobid 1130 to send signal 993
[2021-12-21T12:52:12.494] debug2: No steps in jobid 1130 to send signal 18
[2021-12-21T12:52:12.497] debug2: No steps in jobid 1130 to send signal 15
[2021-12-21T12:52:12.501] debug2: set revoke expiration for jobid 1130 to 1640087652 UTS
[2021-12-21T12:52:12.502] debug2: Finish processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2021-12-21T12:52:12.509] debug:  Waiting for job 1130's prolog to complete
[2021-12-21T12:52:12.509] debug:  Finished wait for job 1130's prolog to complete
[2021-12-21T12:52:12.520] debug:  [job 1130] attempting to run epilog [/etc/slurm/scripts/Epilog]
[2021-12-21T12:52:14.800] debug:  Note large processing time from prep_epilog: usec=2289864 began=12:52:12.509
[2021-12-21T12:52:14.803] debug:  completed epilog for jobid 1130
[2021-12-21T12:52:14.817] debug:  JobId=1130: sent epilog complete msg: rc = 0
[2021-12-21T12:52:14.817] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2021-12-21T12:52:16.202] acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: success
[2021-12-21T12:52:16.203] acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: CMD: 00 3A 32 04 02 00 00 00
[2021-12-21T12:52:16.204] acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: RSP: 32 00 F4 00 CE 9F 69 05 36 01 E4 BF C1 61 BF 01


I also have one node running now with valgrind and ulimit -c. I'll let you know when/if I get a core dump.
Comment 43 Carlos Tripiana Montes 2021-12-21 08:10:15 MST
This seems to be a rogue process taking over the port:

[2021-12-21T12:52:12.474] error: Error binding slurm stream socket: Address already in use
[2021-12-21T12:52:12.474] error: Unable to bind listen port (6818): Address already in use

It may be an old process, but I can't tell. It is for sure not related to the patch itself, because the patch only adds a couple of prints.


Regarding the patch, a correct response looks like:

[2021-12-21T12:52:16.202] acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: success
[2021-12-21T12:52:16.203] acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: CMD: 00 3A 32 04 02 00 00 00
[2021-12-21T12:52:16.204] acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: RSP: 32 00 F4 00 CE 9F 69 05 36 01 E4 BF C1 61 BF 01

Whereas a faulty one looks like:

[2021-12-21T11:48:17.190] acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: success
[2021-12-21T11:48:17.190] acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: CMD: 00 3A 32 04 02 00 00 00
[2021-12-21T11:48:17.190] acct_gather_energy/xcc: _read_ipmi_values: ipmi_cmd_raw: RSP: 32 D5 00 00 00 00 00 00 08 00 00 00 00 00 00 00

Completion code: D5.


Your HW is: https://lenovopress.com/lp0636-thinksystem-sd650-direct-water-cooled-server-xeon-sp-gen-1:

"Remote server management
The SD650 contains an integrated service processor, XClarity Controller (XCC), which provides advanced service-processor control, monitoring, and alerting functions. The XCC is based on the Pilot4 XE401 baseboard management controller (BMC) using a dual-core ARM Cortex A9 service processor."

"Remote server management is provided through industry-standard interfaces:
* Intelligent Platform Management Interface (IPMI) Version 2.0"

Referring to IPMI 2.0 spec: https://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/second-gen-interface-spec-v2-rev1-3.pdf:

5.2 Completion Codes
Table 5-2, Completion Codes 
D5h Cannot execute command. Command, or request parameter(s), not supported in
present state. 

For now, that's what I was able to find.


As I was suspecting at Comment 37: "IMHO, maybe at some point in time XCC board starts to fail to send the result back. And not sure, but seems like it fails in a way freepmi doesn't expect and triggers some corner case of mem leak."

At least I think we have confirmed the IPMI is failing. The leak is clear, but the exact source remains to be confirmed.

I'd suggest you tackle this issue more broadly, as it's slowly getting bigger than just Slurm:

1. Contact the HW vendor on how to address problems with IPMI completion code D5, i.e. ask for insight into why a command as common as reading the power consumption fails. It could simply be that the IPMI is somehow rebooted and you try to get data from it while it's unavailable. I'm afraid I can't find more information on the Internet.

2. Once the underlying reasons are clarified, you can try a recent freeipmi release and see what happens, or roll back to your previous one.

3. I'll still keep an eye on any data/feedback you provide (the core file, any valgrind log, etc.), because we really want to know the source of the error and see if we can do something to mitigate it, or at least provide better support in the future if any user hits the same problem.

Thanks for keeping an eye on this,
Carlos.
Comment 44 AB 2021-12-21 08:34:37 MST
Hi Carlos,

thanks for your comments. I will contact Lenovo and let them know about the status of the bug report.

Regarding the error, I have a different opinion, since I got it on two different nodes and only after installing the patched version. Also, netstat only showed slurmd using this port, so I guess the debug code maybe tries to write to some wrong channel. Anyway, since the needed information has been obtained, we don't need to think about that anymore.

I get back to you when I have some news.
Comment 45 Carlos Tripiana Montes 2021-12-21 09:10:30 MST
> netstat only showed slurmd using this port so I guess that the debug maybe
> tries to write to some wrong channel. Whatever, since the needed information
> is given we don't need to think about that anymore.

Triggered in: slurm_init_msg_engine:_create_msg_socket:main:
[2021-12-21T12:52:12.474] error: Error binding slurm stream socket: Address already in use

Triggered in _create_msg_socket:main:
[2021-12-21T12:52:12.474] error: Unable to bind listen port (6818): Address already in use

From your slurm.conf I see SlurmdPort=6818, which is the one already in use.

https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdPort

At that moment there was another process holding the port, but it eventually died and thus slurmd continued. Most probably it was the old slurmd while shutting down.

These 2 additional log lines just write to the slurmd log in the same way the others do; they don't add any network activity.

A colleague who was involved in the XCC development told me Lenovo has known about this concrete code (D5) since 2018. He doesn't know the underlying reasons, though.

Cheers,
Carlos.
Comment 46 AB 2021-12-22 06:04:26 MST
Hi Carlos,

as I mentioned in comment 41, please ignore comment 40 and its output, since energy measurement via XCC does not work at all on that node. So the D5 return code is expected for that one.
Comment 47 Carlos Tripiana Montes 2021-12-22 06:14:17 MST
Have we seen any failures yet on the node from Comment 42 that is reporting fine?

Was the type of error reported in Comment 21 then caused by testing on a node with a bad XCC?

Or is the Comment 21 error going to happen on the node from Comment 42 as well, at some point?

Cheers,
Carlos.
Comment 48 Carlos Tripiana Montes 2022-01-10 04:52:20 MST
Would you mind sending us the output, as root, of the following commands:

ulimit -a

sysctl fs.file-nr

ls -l /proc/$(pidof slurmd)/fd | wc -l

in a node affected by this problem?

We have found another possible explanation, which would relate to the number of open files.

Thanks,
Carlos.
Comment 49 AB 2022-01-10 05:20:14 MST
Hi Carlos,

here is the output you requested.

thin1:~ # ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 384108
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 384108
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
thin1:~ # sysctl fs.file-nr
fs.file-nr = 2720	0	9832927
thin1:~ # ls -l /proc/62125/fd |wc -l
21


The one node that is running the slurmd with valgrind still seems to have a slurmd process running. In the valgrind log I can find however the last lines like this, I don't know if this helps?!

==62972== 
==62972== 
==62972== 39 errors in context 22 of 591:
==62972== Syscall param ioctl(generic) points to uninitialised byte(s)
==62972==    at 0x67C0AC7: ioctl (in /lib64/libc-2.26.so)
==62972==    by 0xA6BE1ED: ??? (in /usr/lib64/libfreeipmi.so.17.2.1)
==62972==    by 0xA6BE76E: ipmi_openipmi_cmd (in /usr/lib64/libfreeipmi.so.17.2.1)
==62972==    by 0xA69D94A: ??? (in /usr/lib64/libfreeipmi.so.17.2.1)
==62972==    by 0xA69DA2D: ??? (in /usr/lib64/libfreeipmi.so.17.2.1)
==62972==    by 0xA68B712: ipmi_cmd_raw (in /usr/lib64/libfreeipmi.so.17.2.1)
==62972==    by 0xA0FC4C1: _read_ipmi_values (acct_gather_energy_xcc.c:434)
==62972==    by 0xA0FC6C7: _thread_update_node_energy (acct_gather_energy_xcc.c:491)
==62972==    by 0xA0FCF05: _thread_ipmi_run (acct_gather_energy_xcc.c:672)
==62972==    by 0x64B74F8: start_thread (in /lib64/libpthread-2.26.so)
==62972==    by 0x67C9FBE: clone (in /lib64/libc-2.26.so)
==62972==  Address 0xb991f50 is on thread 4's stack
==62972==  Uninitialised value was created by a stack allocation
==62972==    at 0xA6BE107: ??? (in /usr/lib64/libfreeipmi.so.17.2.1)
==62972== 
==62972== ERROR SUMMARY: 961 errors from 591 contexts (suppressed: 0 from 0)
--62125-- memcheck GC: 1029 nodes, 162 survivors (15.7%)
--62125-- memcheck GC: 1044 new table size (driftup)
--62125-- memcheck GC: 1044 nodes, 168 survivors (16.1%)
--62125-- memcheck GC: 1059 new table size (driftup)
--62125-- memcheck GC: 1059 nodes, 183 survivors (17.3%)
--62125-- memcheck GC: 1074 new table size (driftup)
--62125-- memcheck GC: 1074 nodes, 186 survivors (17.3%)
--62125-- memcheck GC: 1090 new table size (driftup)
--62125-- memcheck GC: 1090 nodes, 138 survivors (12.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
--62125-- memcheck GC: 1090 nodes, 149 survivors (13.7%)
Comment 50 Carlos Tripiana Montes 2022-01-10 06:11:19 MST
Type just:

thin1:~ # ls -l /proc/62125/fd

and send the list, thanks.

We have found freeipmi using FD_SET, which is quite dated design today and fails if the FD number is equal to or greater than FD_SETSIZE (a fixed value of 1024) and the compilation is fortified (-D_FORTIFY_SOURCE>0):

The thing is that the OS is now in a state like:

thin1:~ # sysctl fs.file-nr
fs.file-nr = 2720	0	9832927

So it seems that, when opening a connection (from a job step, or from slurmd itself) against the BMC, if the FD number is at or above FD_SETSIZE it will fail with a buffer overflow like the one you initially posted.

We have found another customer with similar error, and after posting a GDB trace the issue is pointing to this problem.

If we confirm this error, I think for now the only way to work around the issue is to disable fortification. This could, for example, explain why the Comment 19 compilation (made with ./configure), which doesn't add fortify flags, hasn't failed so far. The compilation made by rpmbuild adds -D_FORTIFY_SOURCE=2 by default on your system, if I remember correctly.
Comment 51 Carlos Tripiana Montes 2022-01-10 06:16:08 MST
Oh, my fault, I think I was wrong. It's freeipmi that needs to be recompiled without fortification. The FD_SET call is inside the library, so it doesn't matter how Slurm is compiled. Discard that part of Comment 50.
Comment 52 Carlos Tripiana Montes 2022-01-10 06:33:51 MST
The summary for this new discovery is that we suspect this freeipmi version is compiled with -D_FORTIFY_SOURCE=2.

This is enabling:

#define	FD_SET(fd, fdsetp)	__FD_SET (fd, fdsetp)

#define __FD_SET(d, s) \
  ((void) (__FDS_BITS (s)[__FD_ELT(d)] |= __FD_MASK(d)))

#define	__FD_ELT(d) \
  __extension__								    \
  ({ long int __d = (d);						    \
     (__builtin_constant_p (__d)					    \
      ? (0 <= __d && __d < __FD_SETSIZE					    \
	 ? (__d / __NFDBITS)						    \
	 : __fdelt_warn (__d))						    \
      : __fdelt_chk (__d)); })

instead of this:

#define	__FD_ELT(d)	((d) / __NFDBITS)

And this fails with a buffer overflow if the FD number is equal to or greater than:

#define __FD_SETSIZE		1024

When Slurm opens a connection to the XCC BMC, it will fail with a buffer overflow if the FD number is 1024 or higher.
Comment 53 Carlos Tripiana Montes 2022-01-10 06:37:34 MST
Here is the _openipmi_read snippet that triggers the issue, I think:

static int
_openipmi_read (ipmi_openipmi_ctx_t ctx,
                fiid_obj_t obj_cmd_rs)
{
  uint8_t rs_buf_temp[IPMI_OPENIPMI_BUFLEN];
  uint8_t rs_buf[IPMI_OPENIPMI_BUFLEN];
  struct ipmi_system_interface_addr rs_addr;
  struct ipmi_recv rs_packet;
  fd_set read_fds;
  struct timeval tv, tv_orig, start, end, delta;
  int n;

  assert (ctx);
  assert (ctx->magic == IPMI_OPENIPMI_CTX_MAGIC);
  assert (fiid_obj_valid (obj_cmd_rs));

  rs_packet.addr = (unsigned char *)&rs_addr;
  rs_packet.addr_len = sizeof (struct ipmi_system_interface_addr);
  rs_packet.msg.data = rs_buf_temp;
  rs_packet.msg.data_len = IPMI_OPENIPMI_BUFLEN;

  FD_ZERO (&read_fds);
  FD_SET (ctx->device_fd, &read_fds); <------------ HERE, this is a macro
[...]
}
Comment 54 AB 2022-01-10 06:40:33 MST
Here is the output 

thin1:~ # ls -l /proc/62125/fd
total 0
lrwx------ 1 root root 64 Jan 10 13:08 0 -> '/dev/pts/0 (deleted)'
lrwx------ 1 root root 64 Jan 10 13:08 1 -> '/dev/pts/0 (deleted)'
lr-x------ 1 root root 64 Dec 21 12:07 10 -> /var/lib/sss/mc/passwd
lr-x------ 1 root root 64 Jan 10 13:08 1024 -> /usr/sbin/slurmd
lrwx------ 1 root root 64 Jan 10 13:08 1025 -> '/tmp/valgrind_proc_62125_cmdline_347fd7b8 (deleted)'
lrwx------ 1 root root 64 Jan 10 13:08 1026 -> '/tmp/valgrind_proc_62125_auxv_347fd7b8 (deleted)'
l-wx------ 1 root root 64 Jan 10 13:08 1027 -> /dss/dsshome1/slurm/log/thin1.val.log
lr-x------ 1 root root 64 Jan 10 13:08 1028 -> 'pipe:[314461948]'
l-wx------ 1 root root 64 Jan 10 13:08 1029 -> 'pipe:[314461948]'
lr-x------ 1 root root 64 Jan 10 13:08 1030 -> /tmp/vgdb-pipe-from-vgdb-to-62125-by-root-on-thin1
lrwx------ 1 root root 64 Jan 10 13:08 11 -> 'socket:[313881599]'
lr-x------ 1 root root 64 Dec 21 12:07 12 -> /var/lib/sss/mc/initgroups
lrwx------ 1 root root 64 Jan 10 13:08 13 -> /dev/ipmi0
lrwx------ 1 root root 64 Jan 10 13:08 16 -> /dev/ipmi0
lrwx------ 1 root root 64 Jan 10 13:08 17 -> /dev/ipmi0
lrwx------ 1 root root 64 Jan 10 13:08 2 -> '/dev/pts/0 (deleted)'
l-wx------ 1 root root 64 Dec 21 12:04 3 -> /dss/dsshome1/slurm/log/slurmd.thin1.log-20211228
lrwx------ 1 root root 64 Dec 21 12:04 4 -> /dev/null
lrwx------ 1 root root 64 Dec 21 12:04 5 -> 'socket:[313608689]'
lrwx------ 1 root root 64 Dec 21 12:04 7 -> /dev/ipmi0
Comment 55 AB 2022-01-10 06:41:25 MST
Ok, is there anything we can do about the freeipmi build?
Comment 56 Carlos Tripiana Montes 2022-01-10 09:13:41 MST
If you want, you can download and compile:

https://download.opensuse.org/source/distribution/leap/15.2/repo/oss/src/freeipmi-1.6.2-lp152.3.6.src.rpm

in the meantime, to help us get this fixed soon. Most probably rpmbuild will add -D_FORTIFY_SOURCE=2 by default, and you'd need to tweak the spec in a similar fashion to what you did for Slurm.

We are working with the other customer in the same direction to test if this gets fixed with this approach.

Cheers,
Carlos.
Comment 57 Carlos Tripiana Montes 2022-01-12 02:10:55 MST
We have found a potential way to mitigate this issue from the Slurm side. And we are right now testing it.

It's still true that it will fail if, by chance, the connection to the XCC BMC gets an FD number above 1024, but that will be much less likely.

To *really* avoid the issue, recompiling freeipmi without _FORTIFY_SOURCE is still necessary.

I'll post any updates as soon as we have more information.

Thanks for your efforts; they have helped us a lot in tracking down this error, not only in providing you with a solution but other customers as well.

As a final note: we are going to land this fix not only in master but also in 21.08. We know you are using 20.11, but that version only receives security patches right now. The good news is that you could easily backport the commit once pushed, as the XCC code hasn't changed for a while. But note: we don't officially provide backports for 20.11 or older versions; our recommended procedure is to upgrade to a newer Slurm version. That said, you can backport and apply the fix at your own risk.
Comment 58 AB 2022-01-12 05:29:08 MST
Hi Carlos,

thanks for the news.
We will discuss how to handle the solution, i.e. whether we upgrade or patch at our own risk. But it's good to know that there are now options to circumvent the problem.
Comment 59 Carlos Tripiana Montes 2022-01-13 07:56:05 MST
Created attachment 22974 [details]
IPMI open FDs quickfix test

Hi Alexander,

Would you like to help us test the patch proposal?

It's still not officially published, but our other customer is reporting this patch fixes the issue.

If you look at your "ls" output from Comment 54, there are several open FDs for ipmi0. After applying the patch you should see only *one* of those FDs against ipmi0.

Thanks,
Carlos.
Comment 60 AB 2022-01-13 08:11:10 MST
Hi Carlos,

ok thanks for the patch.
I will try to install it on our testsystem and let you know about the results.
Comment 62 AB 2022-01-14 02:58:04 MST
Hi Carlos,

I have installed the patch yesterday and submitted jobs to four nodes. So far all slurmd are running and I can see only one ipmi fd, i.e.

thin1:~ # ls -l /proc/60724/fd
total 0
lr-x------ 1 root root 64 Jan 13 20:33 0 -> /dev/null
lrwx------ 1 root root 64 Jan 13 20:33 1 -> 'socket:[95113]'
lr-x------ 1 root root 64 Jan 13 20:33 10 -> /var/lib/sss/mc/passwd
lrwx------ 1 root root 64 Jan 13 20:33 11 -> 'socket:[274805]'
lr-x------ 1 root root 64 Jan 13 20:33 12 -> /var/lib/sss/mc/initgroups
lrwx------ 1 root root 64 Jan 13 20:33 2 -> 'socket:[95113]'
l-wx------ 1 root root 64 Jan 13 20:33 3 -> /dss/dsshome1/slurm/log/slurmd.thin1.log
lrwx------ 1 root root 64 Jan 13 20:33 4 -> /dev/null
lrwx------ 1 root root 64 Jan 13 20:33 5 -> 'socket:[273965]'
lrwx------ 1 root root 64 Jan 13 20:33 8 -> /dev/ipmi0

So it looks promising for the moment. I'll get back to you with any news.

Thanks,
Alexander
Comment 63 Carlos Tripiana Montes 2022-01-14 04:04:17 MST
Thank you,

please keep sending test jobs there for a while. Also try to execute some scontrol reconfig commands, if possible.

But yes, it looks promising.

Thanks,
Carlos.
Comment 64 AB 2022-01-18 00:57:32 MST
Hi Carlos,

slurmd is running on the nodes since Friday without any crash. I also had some slurmd restarts and scontrol reconfigs and still everything runs smoothly.

So it looks like the patch does its job well.

In which version will this patch be included in the future? We will do an update once we have a maintenance (not planned yet).

Thanks,
Alexander
Comment 65 Carlos Tripiana Montes 2022-01-18 01:33:56 MST
Good morning Alexander,

This is great news.

We still have to wait for the other customer to fully agree on the solution provided. Once that is done, I'll trigger the peer review of the patch in our QA pipeline.

I guess this QA review won't take a lot of time, and I think the reviewer is going to accept the patch for 21.08 onwards. The next stable release is 21.08.6.

In any case, I'll let you know the commit ID(s) and pushed branch(es) once we merge the patch into the repository.

Cheers,
Carlos.
Comment 67 Carlos Tripiana Montes 2022-01-18 02:45:41 MST
Created attachment 23021 [details]
v1 for 21.08
Comment 75 Carlos Tripiana Montes 2022-01-18 07:49:50 MST
Hi Alexander,

The pushed commits 34f5d7086f, 87d8a85667, and eef9272b90, in both the 21.08 and master branches, include the reviewed patch.

Next stable 21.08.6 release will include this fix.

Don't forget that there's still a chance of hitting the same error, in the event an open FD has a number >1024. This can only be worked around by compiling freeipmi without source fortification.

If you don't mind, I'm going to close the bug as resolved.

Best regards,
Carlos.
Comment 77 AB 2022-01-19 00:50:34 MST
Hi Carlos,

thanks for the additional information and resolving this bug.

Best regards,
Alexander