Ticket 6755

Summary: Sacct different behaviour after update
Product: Slurm
Reporter: Ahmed Essam ElMazaty <ahmed.mazaty>
Component: Accounting
Assignee: Albert Gil <albert.gil>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: pawel.dziekonski, pedmon
Version: 18.08.6   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=5717
Site: KAUST
Attachments: slurmdbd.log
slurmctld.log
slurm.conf
slurmdbd.conf

Description Ahmed Essam ElMazaty 2019-03-26 03:09:22 MDT
Good afternoon,
After upgrading Slurm to 18.08.6-2, the 'sacct' command seems to behave differently.
Previously, 'sacct -j <job ID>' immediately displayed information about the job, but now it returns nothing, e.g.:
# sacct -j 1668219
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
#

Now it does not work unless I specify a start time with the '-S' option:
#  sacct -j 1668219 -S 2019-03-01
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1668219        gsvSlow4      batch    default         12  COMPLETED      0:0 
1668219.bat+      batch               default         12  COMPLETED      0:0 
1668219.ext+     extern               default         12  COMPLETED      0:0
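
A minimal sketch of the workaround described above, in Python: query a job with plain 'sacct -j' and retry with '-S' only when no rows come back. The helper names (job_rows, run_sacct, query_job) are hypothetical, and the parsing assumes the default table layout shown in this ticket. It is an illustration, not part of Slurm.

```python
import subprocess
from typing import Callable, List


def job_rows(sacct_output: str) -> List[str]:
    """Return the data rows that follow the dashed separator line."""
    lines = [ln.rstrip() for ln in sacct_output.splitlines() if ln.strip()]
    for i, ln in enumerate(lines):
        # The separator line consists only of dashes and spaces.
        if set(ln) <= {"-", " "}:
            # Drop any "sacct: ..." debug lines that may be mixed in.
            return [row for row in lines[i + 1:] if not row.startswith("sacct:")]
    return []


def run_sacct(args: List[str]) -> str:
    """Run sacct with the given arguments and return its stdout."""
    result = subprocess.run(
        ["sacct", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def query_job(
    job_id: str, start: str, run: Callable[[List[str]], str] = run_sacct
) -> List[str]:
    """Try 'sacct -j <id>' first; fall back to adding '-S <start>' if it is empty."""
    rows = job_rows(run(["-j", job_id]))
    if rows:
        return rows
    return job_rows(run(["-j", job_id, "-S", start]))
```

For example, query_job("1668219", "2019-03-01") would issue the two commands shown in the description, in that order, and return the rows from whichever one produced output.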
Comment 1 Albert Gil 2019-03-26 09:56:48 MDT
Hi Ahmed,

Yes, this looks like a regression from the work done in bug 5717.
Let me check it further and I'll let you know.

Albert
Comment 2 Albert Gil 2019-03-27 04:39:28 MDT
Hi Ahmed,

I haven't been able to replicate your issue.
Could you please post the following information:
- Your slurm.conf
- Your slurmdbd.conf (without any passwords)

Also, could you please change your slurmdbd.conf to add these values?
DebugLevel=debug2
DebugFlags=DB_QUERY,DB_JOB,DB_STEP

Then, after restarting slurmdbd, could you run the same commands you did, but with "-vvv", and post the slurmdbd and slurmctld logs?

$ sacct -j 1668219 -vvv
$ sacct -j 1668219 -vvv -S 2019-03-01

And finally, is this happening for every job ID?
Could you try the same commands for jobs that:
- are running or completed today (same day of the command execution)
- were run and completed yesterday
- were run and completed before your update to 18.08.6

The slurmdbd and slurmctld logs captured while executing all the above commands will be very useful.

Thanks,
Albert
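
The checks requested in this comment can be listed mechanically. The following Python sketch only builds the command lines (it does not run them), using the '-j', '-vvv' and '-S' options already shown in this ticket. The function name diagnostic_commands is hypothetical.

```python
import shlex
from typing import List, Optional


def diagnostic_commands(job_ids: List[str], start: Optional[str] = None) -> List[str]:
    """Build the verbose sacct command lines for each job ID.

    For every job, emit 'sacct -j <id> -vvv' and, if a start date is
    given, the same command again with '-S <start>'.
    """
    commands = []
    for job_id in job_ids:
        base = ["sacct", "-j", job_id, "-vvv"]
        commands.append(shlex.join(base))
        if start is not None:
            commands.append(shlex.join(base + ["-S", start]))
    return commands


if __name__ == "__main__":
    # Job ID and start date taken from the description of this ticket.
    for line in diagnostic_commands(["1668219"], start="2019-03-01"):
        print(line)
```

Passing one job ID per case (running or completed today, completed yesterday, completed before the upgrade) gives the full set of commands whose output and logs are requested above.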
Comment 3 Ahmed Essam ElMazaty 2019-03-28 00:33:07 MDT
Created attachment 9721
slurmdbd.log
Comment 4 Ahmed Essam ElMazaty 2019-03-28 00:34:28 MDT
(In reply to Albert Gil from comment #2)

Hello Albert,
Please find my comments inline 

> Hi Ahmed,
> 
> I haven't been able to replicate your issue.
> Could you please post the following information:
> - Your slurm.conf
attached

> - Your slurmdbd.conf (without any passwords)
attached
> 
> Also, could you please change your slurmdbd.conf to add these values?
> DebugLevel=debug2
> DebugFlags=DB_QUERY,DB_JOB,DB_STEP

changed and restarted

> 
> Then, after restarting slurmdbd, could you run the same commands you did,
> but with "-vvv", and post the slurmdbd and slurmctld logs?
> 
> $ sacct -j 1668219 -vvv
Here's the output; the logs are attached.

# sacct -j 1668219 -vvv
sacct: Jobs Eligible in the time window from Epoch 0 to Thu Mar 28 09:06:57 2019
sacct: debug:  Options selected:
	opt_completion=0
	opt_dup=0
	opt_field_list=(null)
	opt_help=0
	opt_no_steps=0
	opt_whole_hetjob=(null)
sacct: Accounting storage SLURMDBD plugin loaded
sacct: debug:  Munge authentication plugin loaded
sacct: debug:  slurmdbd: Sent PersistInit msg
sacct: debug2: Clusters requested:	dragon
sacct: debug2: Userids requested:	all
sacct: debug2: Jobs requested:
sacct: debug2: 	: 1668219
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
sacct: debug:  slurmdbd: Sent fini msg


> $ sacct -j 1668219 -vvv -S 2019-03-01
Here is the output; the logs are attached.

# sacct -j 1668219 -vvv -S 2019-03-01
sacct: Jobs Eligible in the time window from Fri Mar 01 00:00:00 2019 to Thu Mar 28 09:07:40 2019
sacct: debug:  Options selected:
	opt_completion=0
	opt_dup=0
	opt_field_list=(null)
	opt_help=0
	opt_no_steps=0
	opt_whole_hetjob=(null)
sacct: Accounting storage SLURMDBD plugin loaded
sacct: debug:  Munge authentication plugin loaded
sacct: debug:  slurmdbd: Sent PersistInit msg
sacct: debug2: Clusters requested:	dragon
sacct: debug2: Userids requested:	all
sacct: debug2: Jobs requested:
sacct: debug2: 	: 1668219
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1668219        gsvSlow4      batch    default         12  COMPLETED      0:0 
1668219.bat+      batch               default         12  COMPLETED      0:0 
1668219.ext+     extern               default         12  COMPLETED      0:0 
sacct: debug:  slurmdbd: Sent fini msg



> 
> And finally, is this happening for any jobid?
> Could you try the same commands for jobs that:
> - are running or completed today (same day of the command execution)
It works for running jobs:

# sacct -j 1650895
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1650895          MkSmby      batch    default         32    RUNNING      0:0 
1650895.ext+     extern               default         32    RUNNING      0:0 

It also works for jobs that completed today (this one ended 5 hours ago):

# sacct -j 1799897
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1799897      W05H30d0r+      batch    default         20  COMPLETED      0:0 
1799897.bat+      batch               default         20  COMPLETED      0:0 
1799897.ext+     extern               default         20  COMPLETED      0:0 


> - were run and completed yesterday
It doesn't work for these:

# sacct -j 1750453
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 


> - were run and completed before your update to 18.08.6
No, it doesn't show anything without -S.

> 
> The slurmdbd and slurmctld logs captured while executing all the above
> commands will be very useful.
attached

> 
> Thanks,
> Albert

Thanks for your help.
Ahmed
Comment 5 Ahmed Essam ElMazaty 2019-03-28 00:35:05 MDT
Created attachment 9722
slurmctld.log
Comment 6 Ahmed Essam ElMazaty 2019-03-28 00:36:38 MDT
Created attachment 9723
slurm.conf
Comment 7 Ahmed Essam ElMazaty 2019-03-28 00:37:18 MDT
Created attachment 9724
slurmdbd.conf
Comment 8 Albert Gil 2019-03-28 12:42:26 MDT
Sorry Ahmed,
I don't know how I didn't realize this before.
This bug has already been reported and fixed in bug 6697.

Albert

*** This ticket has been marked as a duplicate of ticket 6697 ***
Comment 9 Albert Gil 2019-04-11 02:02:59 MDT
*** Ticket 6830 has been marked as a duplicate of this ticket. ***