Ticket 4469 - 17.11 unit test failures (make check)
Summary: 17.11 unit test failures (make check)
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Contributions (show other tickets)
Version: 17.11.0
Hardware: Linux Linux
: 4 - Minor Issue
Assignee: Dominik Bartkiewicz
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2017-12-04 16:32 MST by Philip Kovacs
Modified: 2018-01-10 04:17 MST (History)
3 users (show)

See Also:
Site: -Other-
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 17.11.3
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
slurm common/log.c vxstrfmt() patch (4.71 KB, patch)
2017-12-06 19:54 MST, Philip Kovacs
Details | Diff
slurm common/log.c vxstrfmt() patch (v3) (13.60 KB, patch)
2017-12-07 09:15 MST, Jeff Frey
Details | Diff
slurm log_test unit tests (718 bytes, patch)
2017-12-07 10:53 MST, Philip Kovacs
Details | Diff
slurm protocol pack test patch (926 bytes, patch)
2018-01-04 16:58 MST, Philip Kovacs
Details | Diff

Note You need to log in before you can comment on or make changes to this ticket.
Description Philip Kovacs 2017-12-04 16:32:32 MST
I'm seeing a few unit test failures on 17.11 with the `check` package installed.  

-- testsuite/slurm_unit/common/slurm_protocol_pack/pack_job_alloc_info_msg-test.c

I get a SIGSEGV and not a SIGABRT on this test:

if (srunner_fork_status(sr) != CK_NOFORK)
   tcase_add_test_raise_signal(tc_core, pack_null_req, SIGABRT);

Changing the signal to SIGSEGV clears this one.

-- testsuite/slurm_unit/common/log-test.c

I am getting a segmentation fault on this line on the i686 (32-bit) architecture:

Works on 64-bit arch, e.g. x86_64.

info   ("testing double: %18.7f int: %05i string `%s'", f, i, string);

-- testsuite/slurm_unit/common/bitstring-test.c

failing on some architectures.
Comment 2 Philip Kovacs 2017-12-04 21:34:45 MST
Also seeing a problem with 

-- testsuite/slurm_unit/common/slurmdb_pack/pack_assoc_rec_with_usage-test.c

in the invalid_protocol test.  Getting a SIGABRT when compiled with --enable-debug.  This activate the xassert() debug code and causes the abort in the invalid protocol test.   Without --enable-debug, the test does not abort and passes.

with --enable-debug

./pack_assoc_rec_with_usage-test
Running suite(s): Pack slurmdb_assoc_rec_t
pack_assoc_rec_with_usage-test: error: slurmdb_pack.c:1643: slurmdb_pack_assoc_usage(): Assertion (usage) failed.
Aborted (core dumped)

with --disable-debug

./pack_assoc_rec_with_usage-test
Running suite(s): Pack slurmdb_assoc_rec_t
pack_assoc_rec_with_usage-test: error: slurmdb_pack_assoc_rec: protocol_version 0 not supported
pack_assoc_rec_with_usage-test: error: slurmdb_pack_assoc_usage: protocol_version 0 not supported
pack_assoc_rec_with_usage-test: error: slurmdb_pack_assoc_rec_with_usage: protocol_version 0 not supported
pack_assoc_rec_with_usage-test: error: slurmdb_unpack_assoc_rec_members: protocol_version 0 not supported
100%: Checks: 2, Failures: 0, Errors: 0
pack_assoc_rec_with_usage-test.c:29:P:Pack slurmdb_assoc_rec_t:invalid_protocol:0: Passed
pack_assoc_rec_with_usage-test.c:177:P:Pack slurmdb_assoc_rec_t:pack_1702_assoc_rec:0: Passed
Comment 3 Philip Kovacs 2017-12-04 21:46:46 MST
Also in:

-- testsuite/slurm_unit/common/slurmdb_pack/pack_assoc_rec_with_usage-test.c

slurmdb_pack_assoc_rec_with_usage((void **)&assoc_rec, 0, buf);

should be:

slurmdb_pack_assoc_rec_with_usage((void *)assoc_rec, 0, buf);

since you want cast the buffer pointer to a void pointer, not the address of a void pointer, although the receiving function will unwind the cast properly anyway.
Comment 4 Philip Kovacs 2017-12-05 16:48:26 MST
For the log-test.c problem:

info   ("testing double: %18.7f int: %05i string `%s'", f, i, string);

I think this commit is the culprit:

5e5b115927 - Major re-write of vxstrfmt() 

In vxstrfmt() (common/log.c) each format specifier is inspected and either handled by the local code or sent to vsnprintf() for handling.  I notice that the local handlers do not handle complex format specifiers, e.g. %05i.  By changing the info line in log-test/c above to 

info   ("testing double: %f int: %d string `%s'", f, i, string);

the local handlers are used and the test passes.  This just masks the problem however.  *Something* is going wrong when vsnprintf() is used to handle complex specifiers.

As for the other problems, in general, you would want the checks to pass whether or not the assertion code is active or not (--enable-debug and --disable-debug).
Comment 5 Philip Kovacs 2017-12-06 19:52:10 MST
I've looked closely at the vxstrfmt() code in common/log.c and have written a patch for that module.  This commit is the problem as I mentioned earlier:

5e5b115927 - Major re-write of vxstrfmt() 

That rewrite takes the format specifier string and slices it iteratively into chunks -- essentially from one '%' character to just before the next '%'.  Each chunk is examined and if the specifier is simple, e.g. '%d', it handles it locally and appends a new section to the final string.

If the format specifier chunk is complex, however, e.g. '%05d', it hands it to vsnprintf and tries to get a new string section for that chunk.  Then it moves on to subsequent chunks until the format sepcifier is completely consumed.

The problem is the variadic list va_list cannot be advanced with va_arg, as required, in any reliable way once it is established that the chunk contains a complex specifer.  When a specifier like '%05d' or '%18.7f' is encountered, all we really know is that we cannot handle it; we do not know its type, thus vsnprintf must be used. 

What has to happen there is the _entire remainder_ of the format string must be handed to vsnprintf after the first complex one is encountered.  The code must not continue chunking specifiers sections because we can no longer advance the va_list -- we do not have reliable type information for va_arg().

For example:

info   ("testing double: %18.7f int: %05i string `%s'", f, i, string);

Once %18.7f is seen, we have to hand the entire remainder of the format
specifier to vsnprintf.  Trying to process the subsequent specifiers
is not possible since we cannot advance the va_list with va_arg.

Reproduce the crash by building on 32-bit i686 and running the testsuite/slurm_unit/common/log-test.
Comment 6 Philip Kovacs 2017-12-06 19:54:00 MST
Created attachment 5683 [details]
slurm common/log.c vxstrfmt() patch
Comment 8 Jeff Frey 2017-12-07 08:08:26 MST
My personal tests of the rewrite in

  5e5b115927 - Major re-write of vxstrfmt() 

was solely on x86_64 systems.  Documentation of the v*printf() functions typically indicates that:

- va_end() is not called on the passed va_list
- the state of the passed va_list is indeterminate when the v*printf() function returns

The indeterminacy usually appears to stem from something that hasn't been mentioned (by me or you) in the context of the vxstrfmt() function:  positional specifiers in the format (e.g. "%4$s").  The rewritten function is even more likely to fail when positional specifiers are present in the format string.

Another issue remains with your revision to the rewrite:  what if any of the Slurm-specific formats (e.g. "%T") occur _after_ an unhandled format:

   "%18.7f is incorrect (%m)"

Since the format string is processed from left to right, the "%18.7f" will shift the function immediately into calling vsnprintf() before the "%m" has been expanded.


There's also a major problem with your rewrite:  you reuse the ap list across multiple calls to vsnprintf() in a loop:


                                /* If stack buffer is too small, grow a buffer
                                 * in multiples of LINEBUFBIZE for vsnprintf. */
                                do {
                                        if (grow_iter == 0) {
                                                /* first time use stack buf */
                                                buf_sz = tmp_len;
                                                len = vsnprintf(tmp, buf_sz, p-1, ap);
                                        } else {
                                                /* subsequent times use grow buf */
                                                buf_sz = grow_iter * LINEBUFSIZE;
                                                grow_buf = xrealloc(grow_buf, buf_sz);
                                                len = vsnprintf(grow_buf, buf_sz, p-1, ap);
                                        }
                                        ++grow_iter;
                                } while ((len >= 0) && (len >= buf_sz));


The moment you've got something longer than LINEBUFSIZE this will fail badly.


It's unclear what the best solution is.  Doing a first pass to handle ONLY the format sequences introduced by vxstrfmt() and passing an intermediate format string to vsnprintf() to handle all the rest seems the most appropriate way to handle this (luckily none of the vxstrfmt() sequences take an argument from the va_list).  Any "%" characters in the substitutions would have to be doubled-up ("%%") to avoid the same issue that I originally cited.  I will work on a patch in that vein and upload to this ticket.
Comment 9 Philip Kovacs 2017-12-07 08:40:03 MST
Before I posted my patch I tested on many architectures with format specifiers up in the thousands of characters. No failures with a very recent glibc.  For safety I will wrap my vnsprintf calls with va_copy.

The larger question about what format specifiers are actually supported remains.

Along the architecture support lines, I notice also that the 17.11.0 bitstring code is now broken on several architectures.  I amreferring to the bitstring unit test now failing.  I'm also looking at a that.
Comment 10 Jeff Frey 2017-12-07 09:15:49 MST
Created attachment 5690 [details]
slurm common/log.c vxstrfmt() patch (v3)

Another rewrite of vxstrfmt() that expands all vxstrfmt()-specific formats (%m, %t, %T, %M) into an intermediate format string that is passed to vsnprintf() with the ap va_list to expand all formats not native to vxstrfmt().
Comment 11 Philip Kovacs 2017-12-07 10:45:36 MST
Jeff, 

I like what you did with v3 of vxstrfmt(), separating the custom slurm specifiers out from the rest.  The code is more focused and robust that way.  My multi-arch testing infrastructure is down atm, but on local vms, it is passing on x86_64 and i686.  Will test further when things are running again.

I added a few more tests to the unit test log_test.c to exercise the slurm specifiers.  I'll attach that as a patch.

log-test: testing slurm format (T): Thu, 07 Dec 2017 12:33:09 -0500
log-test: testing slurm format (t): 12/07/17 12:33:09
log-test: testing slurm format (M): 2017-12-07T12:33:09.490
log-test: testing slurm format (m): No error

Phil
Comment 12 Philip Kovacs 2017-12-07 10:53:46 MST
Created attachment 5691 [details]
slurm log_test unit tests

Added more checks to testsuite/slurm_unit/common/log_test.c
Comment 13 Jeff Frey 2017-12-07 10:57:26 MST
FWIW (w.r.t. the v3 patch), some of the time formats in %M possibly would NEVER have a "%" show up in them.  If that's a valid assumption, then some of those x*timecat() calls could be altered to append directly to the intermediate_fmt string and skip the "%" double-up checks.  E.g.


case 'M':
	if (!log)
		xiso8601timecat(substitute, true);
	else {
		switch (log->fmt) {
		case LOG_FMT_ISO8601_MS:
			/* "%M" => "yyyy-mm-ddThh:mm:ss.fff"  */
			xiso8601timecat(intermediate_fmt, true);
			break;
		case LOG_FMT_ISO8601:
			/* "%M" => "yyyy-mm-ddThh:mm:ss.fff"  */
			xiso8601timecat(intermediate_fmt, false);
			break;


Probably a negligible difference in performance versus always being 100% sure no unwanted "%" characters are showing up in intermediate_fmt.
Comment 15 Philip Kovacs 2017-12-08 19:09:52 MST
I found the problem causing failures in the bitstring unit test.  It's a core problem uncovered via the test, so I filed a separate bug.

https://bugs.schedmd.com/show_bug.cgi?id=4494
Comment 18 Philip Kovacs 2017-12-16 16:16:49 MST
On i686 (32-bit), the current vxstrfmt segfaults under certain conditions, as demonstrated by either the log-test or this simple program below.  Please push the v3 patch to correct the problem.  It is not a minor issue.


---crash.c----
#include "spank.h"

int main()
{
	slurm_info("double: %18.7f int: %05i string `%s'",  
                   9876543210.0123456, 67890, "test string"); 
	return 0;
}

---compiled with---
gcc -g -O0 -I slurm-17.11.0/slurm slurm-17.11.0/src/api/libslurm.o -o crash crash.c -lpthread -ldl

./crash
Segmentation fault (core dumped)

gdb ./crash
(gdb) r
Starting program: /home/phil/slurm/crash 
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.26.9000-28.fc28.i686
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0xb7e82b7b in strlen () from /lib/libc.so.6

(gdb) bt
#0  0xb7e82b7b in strlen () from /lib/libc.so.6
#1  0x08064dde in _xstrcat (str1=str1@entry=0xbffff1b0, str2=0xb7501949 <error: Cannot access memory at address 0xb7501949>)
    at xstring.c:132
#2  0x0806ee51 in vxstrfmt (fmt=<optimized out>, fmt@entry=0x81cda64 "double: %18.7f int: %05i string `%s'", 
    ap=0xbffff3b8 "\200e\002B2\t\001", ap@entry=0xbffff3b4 "I\031P\267\200e\002B2\t\001") at log.c:786
#3  0x0806fcbd in log_msg (level=level@entry=LOG_LEVEL_INFO, fmt=fmt@entry=0x81cda64 "double: %18.7f int: %05i string `%s'", 
    args=args@entry=0xbffff3b4 "I\031P\267\200e\002B2\t\001") at log.c:1160
#4  0x08070ea5 in info (fmt=0x81cda64 "double: %18.7f int: %05i string `%s'") at log.c:1268
#5  0x081a0c89 in main () at crash.c:5
Comment 34 Dominik Bartkiewicz 2018-01-04 08:22:15 MST
Hi, Jeff

Thanks for your contribution.
After reviewing this code, we changed formatting and it is committed as
https://github.com/SchedMD/slurm/commit/f44b1d89e36ad420a04a0e75746e8d7998b3d4ff

Dominik
Comment 35 Philip Kovacs 2018-01-04 16:58:56 MST
Created attachment 5846 [details]
slurm protocol pack test patch

Attached is a fix for the protocol pack test problem I mentioned in the first comment:

testsuite/slurm_unit/common/slurm_protocol_pack/pack_job_alloc_info_msg-test.c

You will get a segfault if not using a debug build (--disable-debug).  One of the tests wants to trap the SIGABRT thrown by xassert() in _pack_job_alloc_info_msg() in

src/common/slurm_protocol_pack.c

If using a non-debug build, however, xassert() does nothing and _pack_job_alloc_info_msg() will dereference a zero pointer:

pack32(job_desc_ptr->job_id, buffer);

and segfault.  To reproduce the problem make sure `check` is installed, configure --disable-debug and `make check`
Comment 41 Dominik Bartkiewicz 2018-01-10 04:17:11 MST
Hi, Philip

First, thank you for your contribution.
We committed your patch in:
https://github.com/SchedMD/slurm/commit/7d9a75dfaddb86cb383050eb91f5fd8ae02f2a7f

Brian found and fixed two small issues with test
https://github.com/SchedMD/slurm/commit/037e68c9413441ab3b94092ba87102f8ab3fcf80
https://github.com/SchedMD/slurm/commit/b41ce4df69c42a1e84882e845bf73c82edad6b9d

I'm going to go ahead and mark this as Resolved.

Dominik