Summary: | Sacct different behaviour after update | ||
---|---|---|---|
Product: | Slurm | Reporter: | Ahmed Essam ElMazaty <ahmed.mazaty> |
Component: | Accounting | Assignee: | Albert Gil <albert.gil> |
Status: | RESOLVED DUPLICATE | QA Contact: | |
Severity: | 3 - Medium Impact | ||
Priority: | --- | CC: | pawel.dziekonski, pedmon |
Version: | 18.08.6 | ||
Hardware: | Linux | ||
OS: | Linux | ||
See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5717 | ||
Site: | KAUST | Slinky Site: | --- |
Attachments: | slurmdbd.log, slurmctld.log, slurm.conf, slurmdbd.conf |
Description
Ahmed Essam ElMazaty
2019-03-26 03:09:22 MDT
Hi Ahmed,

Yes, this looks like a regression of the work done in bug 5717. Let me check it further and I'll let you know.

Albert

Hi Ahmed,

I've not been able to replicate your issue. Could you please post the following information:
- Your slurm.conf
- Your slurmdbd.conf (without any passwords)

Also, could you please change your slurmdbd.conf to add these values (see the sketch below)?
DebugLevel=debug2
DebugFlags=DB_QUERY,DB_JOB,DB_STEP

Then, after restarting slurmdbd, could you run the same commands you did but with "-vvv" and post the logs of slurmdbd and slurmctld?

$ sacct -j 1668219 -vvv
$ sacct -j 1668219 -vvv -S 2019-03-01

And finally, is this happening for any jobid? Could you try the same commands for jobs that:
- are running or completed today (same day of the command execution)
- were run and completed yesterday
- were run and completed before your update to 18.08.6

The logs of slurmdbd and slurmctld while executing all the above commands will be very useful.

Thanks,
Albert

Created attachment 9721 [details]
slurmdbd.log
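For reference, a minimal sketch of the slurmdbd.conf change requested above; the config path and the use of systemd are assumptions, since sites may manage slurmdbd differently:

# add to /etc/slurm/slurmdbd.conf (path is an assumption)
DebugLevel=debug2
DebugFlags=DB_QUERY,DB_JOB,DB_STEP

# then restart the daemon so the new settings take effect
systemctl restart slurmdbd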
(In reply to Albert Gil from comment #2)

Hello Albert,

Please find my comments inline.

> Hi Ahmed,
>
> I've not been able to replicate your issue.
> Please could you post the following information:
> - Your slurm.conf

Attached.

> - Your slurmdbd.conf (without any passwords)

Attached.

> Also, could you please change your slurmdbd.conf to add these values?
> DebugLevel=debug2
> DebugFlags=DB_QUERY,DB_JOB,DB_STEP

Changed and restarted.

> Then, after restarting slurmdbd, could you run the same commands you did
> but with "-vvv" and post the logs of slurmdbd and slurmctld?
>
> $ sacct -j 1668219 -vvv

Here's the output; the logs are attached.

# sacct -j 1668219 -vvv
sacct: Jobs Eligible in the time window from Epoch 0 to Thu Mar 28 09:06:57 2019
sacct: debug: Options selected: opt_completion=0 opt_dup=0 opt_field_list=(null) opt_help=0 opt_no_steps=0 opt_whole_hetjob=(null)
sacct: Accounting storage SLURMDBD plugin loaded
sacct: debug: Munge authentication plugin loaded
sacct: debug: slurmdbd: Sent PersistInit msg
sacct: debug2: Clusters requested: dragon
sacct: debug2: Userids requested: all
sacct: debug2: Jobs requested:
sacct: debug2: : 1668219
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
sacct: debug: slurmdbd: Sent fini msg

> $ sacct -j 1668219 -vvv -S 2019-03-01

Here is the output; the logs are attached.

# sacct -j 1668219 -vvv -S 2019-03-01
sacct: Jobs Eligible in the time window from Fri Mar 01 00:00:00 2019 to Thu Mar 28 09:07:40 2019
sacct: debug: Options selected: opt_completion=0 opt_dup=0 opt_field_list=(null) opt_help=0 opt_no_steps=0 opt_whole_hetjob=(null)
sacct: Accounting storage SLURMDBD plugin loaded
sacct: debug: Munge authentication plugin loaded
sacct: debug: slurmdbd: Sent PersistInit msg
sacct: debug2: Clusters requested: dragon
sacct: debug2: Userids requested: all
sacct: debug2: Jobs requested:
sacct: debug2: : 1668219
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1668219 gsvSlow4 batch default 12 COMPLETED 0:0
1668219.bat+ batch default 12 COMPLETED 0:0
1668219.ext+ extern default 12 COMPLETED 0:0
sacct: debug: slurmdbd: Sent fini msg

> And finally, is this happening for any jobid?
> Could you try the same commands for jobs that:
> - are running or completed today (same day of the command execution)

It works for running jobs:

# sacct -j 1650895
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1650895 MkSmby batch default 32 RUNNING 0:0
1650895.ext+ extern default 32 RUNNING 0:0

and it works for jobs that completed today (this one ended 5 hours ago):

# sacct -j 1799897
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1799897 W05H30d0r+ batch default 20 COMPLETED 0:0
1799897.bat+ batch default 20 COMPLETED 0:0
1799897.ext+ extern default 20 COMPLETED 0:0

> - were run and completed yesterday

It doesn't:

# sacct -j 1750453
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------

> - were run and completed before your update to 18.08.6

No, it doesn't show anything without -S.

> The logs of slurmdbd and slurmctld while executing all the above commands will
> be very useful.

Attached.

> Thanks,
> Albert

Thanks for your help.
Ahmed

Created attachment 9722 [details]
slurmctld.log
Created attachment 9723 [details]
slurm.conf
Created attachment 9724 [details]
slurmdbd.conf
Sorry Ahmed, I don't know how I didn't realize this before. This bug has already been reported and fixed in bug 6697.

Albert

*** This ticket has been marked as a duplicate of ticket 6697 ***

*** Ticket 6830 has been marked as a duplicate of this ticket. ***
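Until the fix referenced in bug 6697 is deployed, the results above suggest a workaround: passing an explicit start time with -S restores the pre-update behaviour for jobs older than the current day. For example, for the job from yesterday that returned nothing above (the date is only illustrative; it just needs to be early enough to cover when the job ran):

# same query as above, but with an explicit start time
sacct -j 1750453 -S 2019-03-01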