Ticket 2007 - hanging comma in the sacct joblist causes slurmdbd to crash
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmdbd
Version: 14.11.8
Hardware: Linux Linux
Severity: 3 - Medium Impact
Assignee: Danny Auble
Reported: 2015-10-04 14:35 MDT by Jeff Tan
Modified: 2015-10-07 08:31 MDT
Site: VLSCI
Version Fixed: 14.11.10 15.08.2 16.05.0-pre1

Description Jeff Tan 2015-10-04 14:35:59 MDT
Dear SchedMD,

When I pass a joblist to sacct with a hanging comma, slurmdbd crashes, e.g.,

$ sacct -X -j 132423,

and we get this in slurmdbd.log within a few minutes:

slurmdb_defs.c:385: : malloc failed

This is with slurmdbd 14.11.8 and sacct from Slurm 14.03.11. When the sacct is run from Slurm 14.11.9 then the behavior is odd but not problematic (see far below).

We were producing this when we nested squeue within sacct, with tr to convert the job list into a comma-separated list:

$ sacct -X -j `squeue -u myuser -tPD -h -r -o %A | tr '/\n/' ','`

If I were to remove that hanging comma, e.g., .. | sed 's/,$//', then the problem goes away.
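
One way to avoid producing the hanging comma in the first place is to join the job IDs with paste instead of tr, since paste only puts the delimiter between lines. The following is a minimal sketch, not a tested site recipe: join_ids is a hypothetical helper name, and the squeue options in the usage comment are simply the ones from the command above.

```shell
#!/bin/sh
# join_ids: read one job ID per line on stdin and print them as a single
# comma-separated list with no trailing comma.
# Hypothetical helper name, used here for illustration only.
join_ids() {
    # -s joins all input lines into one line, -d, uses a comma as the separator
    paste -sd, -
}

# Intended usage (needs a working Slurm installation, so not executed here):
#   sacct -X -j "$(squeue -u myuser -tPD -h -r -o %A | join_ids)"
```

If squeue prints nothing, the result is an empty string, so a caller may want to check for that before invoking sacct.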

While no one would deliberately set out to add a hanging comma like that, it could mean that slurmdbd is keeping a connection held up indefinitely. 

The strace shows three threads: a series of poll() timeouts that turn into recv() calls failing with EAGAIN, another thread doing the same, and then a series of mmap2() calls that hit ENOMEM before the process is aborted.

We also noticed that after such an sacct command is issued from Slurm 14.03, even if it is interrupted with Ctrl-C, slurmdbd is left in a state where its memory utilization starts climbing quickly within 10 minutes, perhaps in reaction to other slurmdbd queries coming from scripts and users on our systems. It passes 1 GB and reaches 3 GB within 25 minutes or so, which is enough on our 4 GB host for slurmdbd to fail malloc and die.

Slurm 14.11.9: odd but not problematic:

If sacct as above is run from Slurm 14.11.9, sacct appears to resolve the hanging comma and the query runs successfully. However, it is not clear how it resolves it. If I feed it a nonexistent jobid with a hanging comma, it appears to fetch all jobs in the database. Likewise if I feed no jobid but provide a hanging comma, e.g.,

$ sacct -j ,

This is not the same behavior as running sacct without parameters, presumably because no jobs have run since the most recent midnight.

Just to summarize: SlurmDBD is running 14.11.8, the crash is induced by sacct with the hanging comma in the joblist from Slurm 14.03.11, no crash but odd results via sacct from Slurm 14.11.9.

Regards
Jeff
Comment 1 David Bigagli 2015-10-04 23:11:46 MDT
I cannot reproduce the core dump in either 14.03 or 14.11.8.
Perhaps the memory issue is somewhere else. Could you append your slurmdbd.conf?

14.11.9 does indeed print all jobs, but this problem appears to be fixed in 15.08.

David
Comment 2 Danny Auble 2015-10-05 11:59:15 MDT
My guess is you have many jobs in your system.  You might want to consider looking at using the purging/archiving functionality of the DBD, http://slurm.schedmd.com/slurmdbd.conf.html
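
As a rough illustration of the purging and archiving settings referred to here, a slurmdbd.conf fragment could look like the one below. The parameter names are from the slurmdbd.conf documentation linked above. The retention periods and the archive directory are placeholder assumptions, not recommendations for this site, so check the man page for the installed version before applying anything similar.

```
# slurmdbd.conf fragment (illustrative values only)
# Assumed archive location; pick a directory that exists on the DBD host.
ArchiveDir=/var/spool/slurm/archive
ArchiveJobs=yes
ArchiveSteps=yes
# A plain number is interpreted as months.
PurgeJobAfter=12
PurgeStepAfter=12
PurgeEventAfter=12
PurgeSuspendAfter=12
```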

I can reproduce the issue of all jobs being returned.  I made a commit, 2646e7615885ad4, that will fix scenarios like

sacct -X -j 132423,

You will need to upgrade to 15.08 for

sacct -X -j,

to be fixed, though.  The real fix has to be made in sacct, so any older version of the code will have this anomaly.

FYI, in 15.08

sacct -X -j 132423,

will be rejected with sacct: fatal: Bad job/step specified.

We can probably change that to just not accept the empty one though which would probably be better.  I'll see if I can alter that in 15.08.
Comment 3 Jeff Tan 2015-10-05 14:40:31 MDT
Thanks, David and Danny. I'm guessing David is unable to replicate this behavior because our database has never been purged. I hesitate to open up the job tables via mysql directly these days, but these jobs go way back.

We'll give commit 2646e7615885ad4 a go and perhaps craft something extra for the empty joblist with just the comma given. An upgrade to 15.08 is probably not happening for us until January.
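
Until sacct itself is upgraded, a site wrapper could refuse job lists that contain an empty entry before they ever reach slurmdbd. This is a minimal sketch under stated assumptions: valid_joblist is a hypothetical helper, it only looks for empty elements (a leading, trailing, or doubled comma, or an empty list), and it makes no attempt to check that each element is a well-formed job or step ID.

```shell
#!/bin/sh
# valid_joblist: succeed only if the comma-separated list in $1 is non-empty
# and contains no empty element. Hypothetical helper for illustration.
valid_joblist() {
    case "$1" in
        # empty list, leading comma, trailing comma, or doubled comma
        ""|,*|*,|*,,*) return 1 ;;
        *) return 0 ;;
    esac
}

# Intended usage in a wrapper script (not executed here):
#   valid_joblist "$joblist" || { echo "empty entry in job list" >&2; exit 1; }
#   sacct -X -j "$joblist"
```

Rejecting the list outright, rather than silently dropping the empty entry, keeps a typo from turning into a query for a different set of jobs than the user intended.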

Thanks again!

Regards
Jeff
Comment 4 David Bigagli 2015-10-05 19:48:46 MDT
Author: Danny Auble <da@schedmd.com>
Date:   Mon Oct 5 16:50:43 2015 -0700

    Fix sacct to not return all jobs if the -j option is given with a trailing
    ','.


David
Comment 5 Danny Auble 2015-10-06 00:54:17 MDT
I would still like to look at this more from the sacct side.
Comment 6 David Bigagli 2015-10-06 00:56:50 MDT
Ah ok, you mean fix the syntax on the sacct side. 

David
Comment 7 Danny Auble 2015-10-07 08:31:26 MDT
This is now fixed in commit 2dcc2732c1bca for 15.08.  I also added commit d5979ef68c24 to 14.11, which will fix sacct in 14.11 if you are interested in it there. It will be in 14.11.10 if that ever gets tagged.