Ticket 7996

Summary: sshare -l dumps core
Product: Slurm
Reporter: Randy Smith <rsmith>
Component: User Commands
Assignee: Gavin D. Howard <gavin>
Status: RESOLVED FIXED
QA Contact:
Severity: 4 - Minor Issue
Priority: ---
CC: cinek, eckert2
Version: 19.05.2
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8305
Site: TGen
Version Fixed: 19.05.6
Attachments: sshare.process.txt
slurm.conf
slurmdbd.conf

Description Randy Smith 2019-10-24 17:44:08 MDT
We believe we have encountered a bug when running sshare -l on our recently upgraded 19.05.2 Slurm environment. Below is the output from the command. Please advise.

 sshare -l
             Account       User  RawShares  NormShares    RawUsage   NormUsage  EffectvUsage  FairShare    LevelFS                    GrpTRESMins                    TRESRunMins
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ---------- ---------- ------------------------------ ------------------------------
root                                          0.000000    26327672                  1.000000                                                      cpu=2588254,mem=26165807462,e+
*** Error in `sshare': free(): invalid next size (fast): 0x000000000061f650 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7d023)[0x2aaaac0fd023]
/opt/slurm/19.05.02/lib/slurm/libslurmfull.so(slurm_xfree+0x1d)[0x2aaaaae44972]
/opt/slurm/19.05.02/lib/slurm/libslurmfull.so(print_fields_double+0x27e)[0x2aaaaad9325f]
sshare(process+0x50d)[0x40245d]
sshare[0x402866]
sshare(main+0x9f3)[0x40326b]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaac0a1b15]
sshare[0x401d79]
======= Memory map: ========
00400000-00405000 r-xp 00000000 00:29 270855015                          /opt/slurm/19.05.02/bin/sshare
00604000-00605000 r--p 00004000 00:29 270855015                          /opt/slurm/19.05.02/bin/sshare
00605000-00606000 rw-p 00005000 00:29 270855015                          /opt/slurm/19.05.02/bin/sshare
00606000-00627000 rw-p 00000000 00:00 0                                  [heap]
2aaaaaaab000-2aaaaaacc000 r-xp 00000000 08:03 269755771                  /usr/lib64/ld-2.17.so
2aaaaaacc000-2aaaaaace000 r-xp 00000000 00:00 0                          [vdso]
2aaaaaace000-2aaaaaad1000 rw-p 00000000 00:00 0
2aaaaaaee000-2aaaaaaf4000 rw-p 00000000 00:00 0
2aaaaaccc000-2aaaaaccd000 r--p 00021000 08:03 269755771                  /usr/lib64/ld-2.17.so
2aaaaaccd000-2aaaaacce000 rw-p 00022000 08:03 269755771                  /usr/lib64/ld-2.17.so
2aaaaacce000-2aaaaaccf000 rw-p 00000000 00:00 0
2aaaaaccf000-2aaaaaeab000 r-xp 00000000 00:29 3680300                    /opt/slurm/19.05.02/lib/slurm/libslurmfull.so
2aaaaaeab000-2aaaab0ab000 ---p 001dc000 00:29 3680300                    /opt/slurm/19.05.02/lib/slurm/libslurmfull.so
2aaaab0ab000-2aaaab0ad000 r--p 001dc000 00:29 3680300                    /opt/slurm/19.05.02/lib/slurm/libslurmfull.so
2aaaab0ad000-2aaaab0b8000 rw-p 001de000 00:29 3680300                    /opt/slurm/19.05.02/lib/slurm/libslurmfull.so
2aaaab0b8000-2aaaab0be000 rw-p 00000000 00:00 0
2aaaab0be000-2aaaab0c1000 r-xp 00000000 08:03 269970243                  /usr/lib64/libdl-2.17.so
2aaaab0c1000-2aaaab2c0000 ---p 00003000 08:03 269970243                  /usr/lib64/libdl-2.17.so
2aaaab2c0000-2aaaab2c1000 r--p 00002000 08:03 269970243                  /usr/lib64/libdl-2.17.so
2aaaab2c1000-2aaaab2c2000 rw-p 00003000 08:03 269970243                  /usr/lib64/libdl-2.17.so
2aaaab2c2000-2aaaab3c3000 r-xp 00000000 08:03 270136694                  /usr/lib64/libm-2.17.so
2aaaab3c3000-2aaaab5c2000 ---p 00101000 08:03 270136694                  /usr/lib64/libm-2.17.so
2aaaab5c2000-2aaaab5c3000 r--p 00100000 08:03 270136694                  /usr/lib64/libm-2.17.so
2aaaab5c3000-2aaaab5c4000 rw-p 00101000 08:03 270136694                  /usr/lib64/libm-2.17.so
2aaaab5c4000-2aaaab600000 r-xp 00000000 08:03 270267744                  /usr/lib64/libreadline.so.6.2
2aaaab600000-2aaaab800000 ---p 0003c000 08:03 270267744                  /usr/lib64/libreadline.so.6.2
2aaaab800000-2aaaab802000 r--p 0003c000 08:03 270267744                  /usr/lib64/libreadline.so.6.2
2aaaab802000-2aaaab808000 rw-p 0003e000 08:03 270267744                  /usr/lib64/libreadline.so.6.2
2aaaab808000-2aaaab80a000 rw-p 00000000 00:00 0
2aaaab80a000-2aaaab812000 r-xp 00000000 08:03 270117785                  /usr/lib64/libhistory.so.6.2
2aaaab812000-2aaaaba11000 ---p 00008000 08:03 270117785                  /usr/lib64/libhistory.so.6.2
2aaaaba11000-2aaaaba12000 r--p 00007000 08:03 270117785                  /usr/lib64/libhistory.so.6.2
2aaaaba12000-2aaaaba13000 rw-p 00008000 08:03 270117785                  /usr/lib64/libhistory.so.6.2
2aaaaba13000-2aaaaba39000 r-xp 00000000 08:03 270160703                  /usr/lib64/libncurses.so.5.9
2aaaaba39000-2aaaabc38000 ---p 00026000 08:03 270160703                  /usr/lib64/libncurses.so.5.9
2aaaabc38000-2aaaabc39000 r--p 00025000 08:03 270160703                  /usr/lib64/libncurses.so.5.9
2aaaabc39000-2aaaabc3a000 rw-p 00026000 08:03 270160703                  /usr/lib64/libncurses.so.5.9
2aaaabc3a000-2aaaabc5f000 r-xp 00000000 08:03 270334697                  /usr/lib64/libtinfo.so.5.9
2aaaabc5f000-2aaaabe5f000 ---p 00025000 08:03 270334697                  /usr/lib64/libtinfo.so.5.9
2aaaabe5f000-2aaaabe63000 r--p 00025000 08:03 270334697                  /usr/lib64/libtinfo.so.5.9
2aaaabe63000-2aaaabe64000 rw-p 00029000 08:03 270334697                  /usr/lib64/libtinfo.so.5.9
2aaaabe64000-2aaaabe7a000 r-xp 00000000 08:03 270255875                  /usr/lib64/libpthread-2.17.so
2aaaabe7a000-2aaaac07a000 ---p 00016000 08:03 270255875                  /usr/lib64/libpthread-2.17.so
2aaaac07a000-2aaaac07b000 r--p 00016000 08:03 270255875                  /usr/lib64/libpthread-2.17.so
2aaaac07b000-2aaaac07c000 rw-p 00017000 08:03 270255875                  /usr/lib64/libpthread-2.17.so
2aaaac07c000-2aaaac080000 rw-p 00000000 00:00 0
2aaaac080000-2aaaac236000 r-xp 00000000 08:03 269840384                  /usr/lib64/libc-2.17.so
2aaaac236000-2aaaac436000 ---p 001b6000 08:03 269840384                  /usr/lib64/libc-2.17.so
2aaaac436000-2aaaac43a000 r--p 001b6000 08:03 269840384                  /usr/lib64/libc-2.17.so
2aaaac43a000-2aaaac43c000 rw-p 001ba000 08:03 269840384                  /usr/lib64/libc-2.17.so
2aaaac43c000-2aaaac441000 rw-p 00000000 00:00 0
2aaaac441000-2aaaac44d000 r-xp 00000000 08:03 270224039                  /usr/lib64/libnss_files-2.17.so
2aaaac44d000-2aaaac64c000 ---p 0000c000 08:03 270224039                  /usr/lib64/libnss_files-2.17.so
2aaaac64c000-2aaaac64d000 r--p 0000b000 08:03 270224039                  /usr/lib64/libnss_files-2.17.so
2aaaac64d000-2aaaac64e000 rw-p 0000c000 08:03 270224039                  /usr/lib64/libnss_files-2.17.so
2aaaac64e000-2aaaac654000 rw-p 00000000 00:00 0
2aaaac654000-2aaaac65c000 r-xp 00000000 08:03 270224046                  /usr/lib64/libnss_sss.so.2
2aaaac65c000-2aaaac85b000 ---p 00008000 08:03 270224046                  /usr/lib64/libnss_sss.so.2
2aaaac85b000-2aaaac85c000 r--p 00007000 08:03 270224046                  /usr/lib64/libnss_sss.so.2
2aaaac85c000-2aaaac85d000 rw-p 00008000 08:03 270224046                  /usr/lib64/libnss_sss.so.2
2aaaac85d000-2aaaac869000 r-xp 00000000 08:03 270224041                  /usr/lib64/libnss_ldap.so.2
2aaaac869000-2aaaaca68000 ---p 0000c000 08:03 270224041                  /usr/lib64/libnss_ldap.so.2
2aaaaca68000-2aaaaca69000 r--p 0000b000 08:03 270224041                  /usr/lib64/libnss_ldap.so.2
2aaaaca69000-2aaaaca6a000 rw-p 0000c000 08:03 270224041                  /usr/lib64/libnss_ldap.so.2
2aaaaca6a000-2aaaaca6c000 r-xp 00000000 00:29 2860787                    /opt/slurm/19.05.02/lib/slurm/auth_munge.so
2aaaaca6c000-2aaaacc6c000 ---p 00002000 00:29 2860787                    /opt/slurm/19.05.02/lib/slurm/auth_munge.so
2aaaacc6c000-2aaaacc6d000 r--p 00002000 00:29 2860787                    /opt/slurm/19.05.02/lib/slurm/auth_munge.so
2aaaacc6d000-2aaaacc6e000 rw-p 00003000 00:29 2860787                    /opt/slurm/19.05.02/lib/slurm/auth_munge.so
2aaaacc6e000-2aaaacc77000 r-xp 00000000 08:03 270160696                  /usr/lib64/libmunge.so.2.0.0
2aaaacc77000-2aaaace76000 ---p 00009000 08:03 270160696                  /usr/lib64/libmunge.so.2.0.0
2aaaace76000-2aaaace77000 r--p 00008000 08:03 270160696                  /usr/lib64/libmunge.so.2.0.0
2aaaace77000-2aaaace78000 rw-p 00009000 08:03 270160696                  /usr/lib64/libmunge.so.2.0.0
2aaaace78000-2aaaace8d000 r-xp 00000000 08:03 270095360                  /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2aaaace8d000-2aaaad08c000 ---p 00015000 08:03 270095360                  /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2aaaad08c000-2aaaad08d000 r--p 00014000 08:03 270095360                  /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2aaaad08d000-2aaaad08e000 rw-p 00015000 08:03 270095360                  /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2aaab0000000-2aaab0021000 rw-p 00000000 00:00 0
2aaab0021000-2aaab4000000 ---p 00000000 00:00 0
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
 coh                                     1    0.020833           0    0.000000      0.000000            3.9818e+22 Aborted (core dumped)
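
For triage, the glibc backtrace frames above can be split into module, symbol, offset and address. The following is a minimal Python sketch and not part of Slurm; the names FRAME_RE and parse_frame are illustrative assumptions, and the pattern only covers the three frame shapes that appear in this report.

```python
import re

# Frame shapes seen in the backtrace above:
#   /lib64/libc.so.6(+0x7d023)[0x2aaaac0fd023]   (offset only, no symbol)
#   sshare(process+0x50d)[0x40245d]              (symbol plus offset)
#   sshare[0x402866]                             (address only)
FRAME_RE = re.compile(
    r"^(?P<module>[^(\[]+)"                   # binary or shared object path
    r"(?:\((?P<symbol>[^+)]*)"                # optional symbol name, may be empty
    r"(?:\+(?P<offset>0x[0-9a-fA-F]+))?\))?"  # optional +offset inside the parens
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"       # absolute address
)


def parse_frame(line):
    """Return a dict describing one backtrace frame, or None if the line is not a frame."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    return {
        "module": match.group("module"),
        "symbol": match.group("symbol") or None,  # empty symbol becomes None
        "offset": match.group("offset"),
        "address": match.group("address"),
    }


if __name__ == "__main__":
    frames = [
        "/lib64/libc.so.6(+0x7d023)[0x2aaaac0fd023]",
        "/opt/slurm/19.05.02/lib/slurm/libslurmfull.so(slurm_xfree+0x1d)[0x2aaaaae44972]",
        "/opt/slurm/19.05.02/lib/slurm/libslurmfull.so(print_fields_double+0x27e)[0x2aaaaad9325f]",
        "sshare(process+0x50d)[0x40245d]",
        "sshare[0x402866]",
    ]
    for frame in frames:
        print(parse_frame(frame))
```

Listing only the named symbols gives the call chain process, print_fields_double, slurm_xfree, which is the path that ends in the free() abort.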
Comment 1 Marcin Stolarek 2019-10-25 02:24:37 MDT
Randy,

Could you please load the generated core file into gdb together with the sshare binary and check the backtrace:

# gdb $(which sshare) /path/to/core
(gdb) thread apply all bt full

Please attach the result of the above commands to the bug report.

cheers,
Marcin
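
The same backtrace can also be collected non-interactively. The sketch below is an assumption about a convenient workflow, not something requested in this ticket: it builds a gdb command line using gdb's -batch and -ex options and writes the output to a file. The function names and the output file name are hypothetical, and /path/to/core remains a placeholder.

```python
import subprocess


def gdb_backtrace_command(binary, core):
    """Build the argv list that prints a full backtrace of every thread."""
    return [
        "gdb",
        "-batch",                           # run the given commands, then exit
        "-ex", "thread apply all bt full",  # long form of "t a a bt full"
        binary,
        core,
    ]


def save_backtrace(binary, core, out_path):
    """Run gdb on the core file and save its output (requires gdb to be installed)."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        capture_output=True,
        text=True,
        check=False,  # gdb may return non-zero; keep whatever output it produced
    )
    with open(out_path, "w") as handle:
        handle.write(result.stdout)
        handle.write(result.stderr)
    return result.returncode
```

For example, save_backtrace("/opt/slurm/19.05.02/bin/sshare", "/path/to/core", "sshare.backtrace.txt") would produce a text file suitable for attaching to the report.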
Comment 2 Randy Smith 2019-10-25 08:50:40 MDT
Created attachment 12089
sshare.process.txt

Attached is the output you requested.

Comment 3 Gavin D. Howard 2019-10-25 16:24:27 MDT
There are a few things I need to solve this bug:

1. Your slurm.conf
2. Your slurmdbd.conf
3. The Linux distro and version on the node where `sshare` was run.
4. The Linux distro and version on the node where `slurmctld` is running.
5. The Linux distro and version on the node where `slurmdbd` is running.
6. If at all possible, your database. If you do send it, please compress it first.
Comment 5 Randy Smith 2019-10-28 10:08:32 MDT
Created attachment 12124
slurm.conf

Here you go.

sshare: Linux hpc-utility 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
slurmctld: Linux dback-slurm 3.10.0-327.3.1.el7.x86_64 #1 SMP Wed Dec 9 14:09:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
slurmdbd: Linux dback-slurmdb 3.10.0-327.3.1.el7.x86_64 #1 SMP Wed Dec 9 14:09:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

I'll upload the database dump via your web interface.


Comment 6 Randy Smith 2019-10-28 10:08:32 MDT
Created attachment 12125
slurmdbd.conf
Comment 7 Randy Smith 2019-10-28 10:34:11 MDT
The file is too large to upload here, so here is a link to a Google Drive location:
 slurm_acct_db_dump_10_28_19.sql.gz
<https://drive.google.com/a/tgen.org/file/d/1hrzQUgcim0EYoOm2mRWwV5nuQeTcXKQw/view?usp=drive_web>


Comment 8 Gavin D. Howard 2019-10-28 17:32:26 MDT
Thank you. I have been looking into it, but I have not found a definite fix yet.
Comment 11 Randy Smith 2019-10-29 14:42:09 MDT
Thanks for the update.

Comment 12 Douglas Wightman 2020-02-05 11:55:01 MST
*** Ticket 8305 has been marked as a duplicate of this ticket. ***
Comment 14 Gavin D. Howard 2020-02-21 14:46:51 MST
Randy,

We have a fix, and it has been committed to 19.05 (so it will be in the next dot release of 19.05) and into 20.02.

Thanks again. Closing.