I'm seeing a few unit test failures on 17.11 with the `check` package installed. -- testsuite/slurm_unit/common/slurm_protocol_pack/pack_job_alloc_info_msg-test.c I get a SIGSEGV and not a SIGABRT on this test: if (srunner_fork_status(sr) != CK_NOFORK) tcase_add_test_raise_signal(tc_core, pack_null_req, SIGABRT); Changing the signal to SIGSEGV clears this one. -- testsuite/slurm_unit/common/log-test.c I am getting a segmentation fault on this line on the i686 (32-bit) architecture: Works on 64-bit arch, e.g. x86_64. info ("testing double: %18.7f int: %05i string `%s'", f, i, string); -- testsuite/slurm_unit/common/bitstring-test.c failing on some architectures.
Also seeing a problem with -- testsuite/slurm_unit/common/slurmdb_pack/pack_assoc_rec_with_usage-test.c in the invalid_protocol test. Getting a SIGABRT when compiled with --enable-debug. This activate the xassert() debug code and causes the abort in the invalid protocol test. Without --enable-debug, the test does not abort and passes. with --enable-debug ./pack_assoc_rec_with_usage-test Running suite(s): Pack slurmdb_assoc_rec_t pack_assoc_rec_with_usage-test: error: slurmdb_pack.c:1643: slurmdb_pack_assoc_usage(): Assertion (usage) failed. Aborted (core dumped) with --disable-debug ./pack_assoc_rec_with_usage-test Running suite(s): Pack slurmdb_assoc_rec_t pack_assoc_rec_with_usage-test: error: slurmdb_pack_assoc_rec: protocol_version 0 not supported pack_assoc_rec_with_usage-test: error: slurmdb_pack_assoc_usage: protocol_version 0 not supported pack_assoc_rec_with_usage-test: error: slurmdb_pack_assoc_rec_with_usage: protocol_version 0 not supported pack_assoc_rec_with_usage-test: error: slurmdb_unpack_assoc_rec_members: protocol_version 0 not supported 100%: Checks: 2, Failures: 0, Errors: 0 pack_assoc_rec_with_usage-test.c:29:P:Pack slurmdb_assoc_rec_t:invalid_protocol:0: Passed pack_assoc_rec_with_usage-test.c:177:P:Pack slurmdb_assoc_rec_t:pack_1702_assoc_rec:0: Passed
Also in: -- testsuite/slurm_unit/common/slurmdb_pack/pack_assoc_rec_with_usage-test.c slurmdb_pack_assoc_rec_with_usage((void **)&assoc_rec, 0, buf); should be: slurmdb_pack_assoc_rec_with_usage((void *)assoc_rec, 0, buf); since you want cast the buffer pointer to a void pointer, not the address of a void pointer, although the receiving function will unwind the cast properly anyway.
For the log-test.c problem: info ("testing double: %18.7f int: %05i string `%s'", f, i, string); I think this commit is the culprit: 5e5b115927 - Major re-write of vxstrfmt() In vxstrfmt() (common/log.c) each format specifier is inspected and either handled by the local code or sent to vsnprintf() for handling. I notice that the local handlers do not handle complex format specifiers, e.g. %05i. By changing the info line in log-test/c above to info ("testing double: %f int: %d string `%s'", f, i, string); the local handlers are used and the test passes. This just masks the problem however. *Something* is going wrong when vsnprintf() is used to handle complex specifiers. As for the other problems, in general, you would want the checks to pass whether or not the assertion code is active or not (--enable-debug and --disable-debug).
I've looked closely at the vxstrfmt() code in common/log.c and have written a patch for that module. This commit is the problem as I mentioned earlier: 5e5b115927 - Major re-write of vxstrfmt() That rewrite takes the format specifier string and slices it iteratively into chunks -- essentially from one '%' character to just before the next '%'. Each chunk is examined and if the specifier is simple, e.g. '%d', it handles it locally and appends a new section to the final string. If the format specifier chunk is complex, however, e.g. '%05d', it hands it to vsnprintf and tries to get a new string section for that chunk. Then it moves on to subsequent chunks until the format sepcifier is completely consumed. The problem is the variadic list va_list cannot be advanced with va_arg, as required, in any reliable way once it is established that the chunk contains a complex specifer. When a specifier like '%05d' or '%18.7f' is encountered, all we really know is that we cannot handle it; we do not know its type, thus vsnprintf must be used. What has to happen there is the _entire remainder_ of the format string must be handed to vsnprintf after the first complex one is encountered. The code must not continue chunking specifiers sections because we can no longer advance the va_list -- we do not have reliable type information for va_arg(). For example: info ("testing double: %18.7f int: %05i string `%s'", f, i, string); Once %18.7f is seen, we have to hand the entire remainder of the format specifier to vsnprintf. Trying to process the subsequent specifiers is not possible since we cannot advance the va_list with va_arg. Reproduce the crash by building on 32-bit i686 and running the testsuite/slurm_unit/common/log-test.
Created attachment 5683 [details] slurm common/log.c vxstrfmt() patch
My personal tests of the rewrite in 5e5b115927 - Major re-write of vxstrfmt() was solely on x86_64 systems. Documentation of the v*printf() functions typically indicates that: - va_end() is not called on the passed va_list - the state of the passed va_list is indeterminate when the v*printf() function returns The indeterminacy usually appears to stem from something that hasn't been mentioned (by me or you) in the context of the vxstrfmt() function: positional specifiers in the format (e.g. "%4$s"). The rewritten function is even more likely to fail when positional specifiers are present in the format string. Another issue remains with your revision to the rewrite: what if any of the Slurm-specific formats (e.g. "%T") occur _after_ an unhandled format: "%18.7f is incorrect (%m)" Since the format string is processed from left to right, the "%18.7f" will shift the function immediately into calling vsnprintf() before the "%m" has been expanded. There's also a major problem with your rewrite: you reuse the ap list across multiple calls to vsnprintf() in a loop: /* If stack buffer is too small, grow a buffer * in multiples of LINEBUFBIZE for vsnprintf. */ do { if (grow_iter == 0) { /* first time use stack buf */ buf_sz = tmp_len; len = vsnprintf(tmp, buf_sz, p-1, ap); } else { /* subsequent times use grow buf */ buf_sz = grow_iter * LINEBUFSIZE; grow_buf = xrealloc(grow_buf, buf_sz); len = vsnprintf(grow_buf, buf_sz, p-1, ap); } ++grow_iter; } while ((len >= 0) && (len >= buf_sz)); The moment you've got something longer than LINEBUFSIZE this will fail badly. It's unclear what the best solution is. Doing a first pass to handle ONLY the format sequences introduced by vxstrfmt() and passing an intermediate format string to vsnprintf() to handle all the rest seems the most appropriate way to handle this (luckily none of the vxstrfmt() sequences take an argument from the va_list). Any "%" characters in the substitutions would have to be doubled-up ("%%") to avoid the same issue that I originally cited. I will work on a patch in that vein and upload to this ticket.
Before I posted my patch I tested on many architectures with format specifiers up in the thousands of characters. No failures with a very recent glibc. For safety I will wrap my vnsprintf calls with va_copy. The larger question about what format specifiers are actually supported remains. Along the architecture support lines, I notice also that the 17.11.0 bitstring code is now broken on several architectures. I amreferring to the bitstring unit test now failing. I'm also looking at a that.
Created attachment 5690 [details] slurm common/log.c vxstrfmt() patch (v3) Another rewrite of vxstrfmt() that expands all vxstrfmt()-specific formats (%m, %t, %T, %M) into an intermediate format string that is passed to vsnprintf() with the ap va_list to expand all formats not native to vxstrfmt().
Jeff, I like what you did with v3 of vxstrfmt(), separating the custom slurm specifiers out from the rest. The code is more focused and robust that way. My multi-arch testing infrastructure is down atm, but on local vms, it is passing on x86_64 and i686. Will test further when things are running again. I added a few more tests to the unit test log_test.c to exercise the slurm specifiers. I'll attach that as a patch. log-test: testing slurm format (T): Thu, 07 Dec 2017 12:33:09 -0500 log-test: testing slurm format (t): 12/07/17 12:33:09 log-test: testing slurm format (M): 2017-12-07T12:33:09.490 log-test: testing slurm format (m): No error Phil
Created attachment 5691 [details] slurm log_test unit tests Added more checks to testsuite/slurm_unit/common/log_test.c
FWIW (w.r.t. the v3 patch), some of the time formats in %M possibly would NEVER have a "%" show up in them. If that's a valid assumption, then some of those x*timecat() calls could be altered to append directly to the intermediate_fmt string and skip the "%" double-up checks. E.g. case 'M': if (!log) xiso8601timecat(substitute, true); else { switch (log->fmt) { case LOG_FMT_ISO8601_MS: /* "%M" => "yyyy-mm-ddThh:mm:ss.fff" */ xiso8601timecat(intermediate_fmt, true); break; case LOG_FMT_ISO8601: /* "%M" => "yyyy-mm-ddThh:mm:ss.fff" */ xiso8601timecat(intermediate_fmt, false); break; Probably a negligible difference in performance versus always being 100% sure no unwanted "%" characters are showing up in intermediate_fmt.
I found the problem causing failures in the bitstring unit test. It's a core problem uncovered via the test, so I filed a separate bug. https://bugs.schedmd.com/show_bug.cgi?id=4494
On i686 (32-bit), the current vxstrfmt segfaults under certain conditions, as demonstrated by either the log-test or this simple program below. Please push the v3 patch to correct the problem. It is not a minor issue. ---crash.c---- #include "spank.h" int main() { slurm_info("double: %18.7f int: %05i string `%s'", 9876543210.0123456, 67890, "test string"); return 0; } ---compiled with--- gcc -g -O0 -I slurm-17.11.0/slurm slurm-17.11.0/src/api/libslurm.o -o crash crash.c -lpthread -ldl ./crash Segmentation fault (core dumped) gdb ./crash (gdb) r Starting program: /home/phil/slurm/crash Missing separate debuginfos, use: dnf debuginfo-install glibc-2.26.9000-28.fc28.i686 [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/libthread_db.so.1". Program received signal SIGSEGV, Segmentation fault. 0xb7e82b7b in strlen () from /lib/libc.so.6 (gdb) bt #0 0xb7e82b7b in strlen () from /lib/libc.so.6 #1 0x08064dde in _xstrcat (str1=str1@entry=0xbffff1b0, str2=0xb7501949 <error: Cannot access memory at address 0xb7501949>) at xstring.c:132 #2 0x0806ee51 in vxstrfmt (fmt=<optimized out>, fmt@entry=0x81cda64 "double: %18.7f int: %05i string `%s'", ap=0xbffff3b8 "\200e\002B2\t\001", ap@entry=0xbffff3b4 "I\031P\267\200e\002B2\t\001") at log.c:786 #3 0x0806fcbd in log_msg (level=level@entry=LOG_LEVEL_INFO, fmt=fmt@entry=0x81cda64 "double: %18.7f int: %05i string `%s'", args=args@entry=0xbffff3b4 "I\031P\267\200e\002B2\t\001") at log.c:1160 #4 0x08070ea5 in info (fmt=0x81cda64 "double: %18.7f int: %05i string `%s'") at log.c:1268 #5 0x081a0c89 in main () at crash.c:5
Hi, Jeff Thanks for your contribution. After reviewing this code, we changed formatting and it is committed as https://github.com/SchedMD/slurm/commit/f44b1d89e36ad420a04a0e75746e8d7998b3d4ff Dominik
Created attachment 5846 [details] slurm protocol pack test patch Attached is a fix for the protocol pack test problem I mentioned in the first comment: testsuite/slurm_unit/common/slurm_protocol_pack/pack_job_alloc_info_msg-test.c You will get a segfault if not using a debug build (--disable-debug). One of the tests wants to trap the SIGABRT thrown by xassert() in _pack_job_alloc_info_msg() in src/common/slurm_protocol_pack.c If using a non-debug build, however, xassert() does nothing and _pack_job_alloc_info_msg() will dereference a zero pointer: pack32(job_desc_ptr->job_id, buffer); and segfault. To reproduce the problem make sure `check` is installed, configure --disable-debug and `make check`
Hi, Philip First, thank you for your contribution. We committed your patch in: https://github.com/SchedMD/slurm/commit/7d9a75dfaddb86cb383050eb91f5fd8ae02f2a7f Brian found and fixed two small issues with test https://github.com/SchedMD/slurm/commit/037e68c9413441ab3b94092ba87102f8ab3fcf80 https://github.com/SchedMD/slurm/commit/b41ce4df69c42a1e84882e845bf73c82edad6b9d I'm going to go ahead and mark this as Resolved. Dominik