Ticket 6817

Summary: sacct returns no data
Product: Slurm
Reporter: Jurij Pečar <jurij.pecar>
Component: Accounting
Assignee: Felip Moll <felip.moll>
Status: RESOLVED FIXED
Severity: 4 - Minor Issue
CC: albert.gil
Version: 18.08.6
Hardware: Linux
OS: Linux
See Also: https://bugs.schedmd.com/show_bug.cgi?id=6697
          https://bugs.schedmd.com/show_bug.cgi?id=6862
Site: EMBL
Version Fixed: 19.05.0rc2
Ticket Depends on: 6862

Description Jurij Pečar 2019-04-09 02:02:30 MDT
After the upgrade to 18.08.6-2 (from 18.08.4), users started complaining that their workflow management tools were failing because they could not get information about past jobs. I investigated and found that sacct indeed returns nothing for some jobs.

An example:

# sacct -Pj 31699893                                                                                                                                                            
JobID|JobName|Partition|Account|AllocCPUS|State|ExitCode

But the job is in the DB:
# mysql -E slurm
MariaDB [slurm]> select * from hof_job_table where id_job=31699893;
*************************** 1. row ***************************
        job_db_inx: 62027302
          mod_time: 1554379958
           deleted: 0
           account: zeller
     admin_comment: NULL
    array_task_str: NULL
   array_max_tasks: 0
array_task_pending: 0
          cpus_req: 1
        derived_ec: 0
        derived_es: NULL
         exit_code: 0
          job_name: dada2.sh
          id_assoc: 4758
      id_array_job: 0
     id_array_task: 4294967294
          id_block: NULL
            id_job: 31699893
... etc

As a consequence, seff also reports nonsense for this job:
# seff 31699893                                                                                                                                                                 
Job ID: 31699893
Cluster: hof
User/Group: hchen/zeller
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 1-22:09:15
CPU Efficiency: 182.96% of 1-01:13:33 core-walltime
Job Wall-clock time: 1-01:13:33
Memory Utilized: 16.00 EB
Memory Efficiency: 17179869184.00% of 100.00 GB

Some other jobs appear fine:

# sacct -j 31912919
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
31912919     Extract/j+        htc   wilmanns          8  COMPLETED      0:0 
31912919.ba+      batch              wilmanns          8  COMPLETED      0:0 
31912919.ex+     extern              wilmanns          8  COMPLETED      0:0 
# seff 31912919                                                                                                                                                                 
Job ID: 31912919
Cluster: hof
User/Group: lciccarelli/wilmanns
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 01:03:08
CPU Efficiency: 48.86% of 02:09:12 core-walltime
Job Wall-clock time: 00:16:09
Memory Utilized: 1.09 GB
Memory Efficiency: 9.09% of 12.00 GB

What worries me is that the sequence of JobIDs returned by sacct is totally unexpected:

# sacct -o JobID
       JobID 
------------ 
431291       
431292       
963843_646   
963843_647   
964268_84    
964268_85    
963843_648   
963843_650   
963843_652   
963843_653   
963843_654   
963843_655   
963843_656   
963843_657   
963843_658   
963843_659   
963843_660   
963843_661   
963843_662   
963843_663   
963843_664   
963843_665   
963843_666   
963843_667   
963843_668   
963843_669   
963843_670   
963843_671   
963843_672   
963843_673   
963843_674   
963843_675   
963843_676   
963843_677   
963843_678   
963843_679   
963843_680   
963843_681   
963843_682   
963843_683   
963843_684   
963843_685   
963843_686   
963843_687   
963843_688   
963843_689   
963843_690   
963843_691   
963843_692   
963843_693   
963843_694   
963843_695   
963843_696   
963843_697   
963843_698   
963843_699   
963843_700   
963843_701   
963843_702   
963843_703   
963843_704   
963843_705   
963843_706   
963843_707   
963843_708   
963843_709   
963843_710   
963843_711   
963843_712   
963843_713   
963843_714   
963843_715   
963843_716   
963843_717   
963843_718   
963843_719   
963843_720   
963843_721   
963843_722   
963843_723   
963843_724   
963843_725   
963843_726   
963843_727   
963843_728   
963843_729   
963843_730   
963843_731   
963843_732   
963843_733   
963843_734   
963843_735   
963843_736   
963843_737   
963843_738   
963843_739   
963843_740   
963843_741   
963843_742   
963843_743   
963843_744   
963843_745   
963843_746   
963843_747   
963843_748   
963843_749   
963843_750   
963843_751   
963843_752   
963843_753   
963843_754   
963843_755   
963843_756   
963843_757   
963843_758   
963843_759   
963843_760   
963843_761   
963843_762   
963843_763   
963843_764   
963843_765   
963843_766   
963843_767   
963843_768   
963843_769   
963843_770   
963843_771   
963843_772   
963843_773   
963843_774   
963843_775   
963843_776   
963843_777   
963843_778   
963843_779   
963843_780   
963843_781   
963843_782   
963843_783   
963843_784   
963843_785   
963843_786   
963843_787   
963843_788   
963843_789   
963843_790   
963843_791   
963843_792   
963843_793   
963843_794   
963843_795   
963843_796   
963843_797   
963843_798   
963843_799   
963843_800   
963843_801   
963843_802   
963843_803   
963843_804   
963843_805   
963843_806   
963843_807   
963843_808   
963843_809   
963843_810   
963843_811   
963843_812   
963843_813   
963843_814   
963843_815   
963843_816   
963843_817   
963843_818   
963843_819   
963843_820   
963843_821   
963843_822   
963843_823   
963843_824   
963843_825   
963843_826   
963843_827   
963843_828   
963843_829   
963843_830   
963843_831   
963843_832   
963843_833   
963843_834   
963843_835   
963843_836   
963843_837   
963843_838   
963843_839   
963843_840   
963843_841   
963843_842   
963843_843   
963843_844   
963843_845   
963843_846   
963843_847   
963843_848   
963843_849   
963843_850   
963843_851   
963843_852   
963843_853   
963843_854   
963843_855   
963843_856   
963843_857   
963843_858   
963843_859   
963843_860   
963843_861   
963843_862   
963843_863   
963843_864   
963843_865   
963843_866   
963843_867   
963843_868   
963843_869   
963843_870   
963843_871   
963843_872   
963843_873   
963843_874   
963843_875   
963843_876   
963843_877   
963843_878   
963843_879   
963843_880   
963843_881   
963843_882   
963843_883   
963843_884   
963843_885   
963843_886   
963843_887   
963843_888   
963843_889   
963843_890   
963843_891   
963843_892   
963843_893   
963843_894   
963843_895   
963843_896   
963843_897   
963843_898   
963843_899   
963843_900   
963843_901   
963843_902   
963843_903   
963843_904   
963843_905   
963843_906   
963843_907   
963843_908   
963843_909   
963843_910   
963843_911   
963843_912   
963843_913   
963843_914   
963843_915   
963843_916   
963843_917   
963843_918   
963843_919   
963843_920   
963843_921   
963843_922   
963843_923   
963843_924   
963843_925   
963843_926   
963843_927   
963843_928   
963843_929   
963843_930   
963843_931   
963843_932   
963843_933   
963843_934   
963843_935   
963843_936   
963843_937   
963843_938   
963843_939   
963843_940   
963843_941   
963843_942   
963843_943   
963843_944   
967310       
967310.0     
964268_86    
963843_945   
963843_946   
963843_947   
963843_948   
963843_949   
963843_950   
963843_951   
963843_952   
963843_953   
963843_954   
963843_955   
963843_956   
963843_957   
963843_958   
963843_959   
963843_960   
963843_961   
963843_962   
963843_963   
963843_964   
963843_965   
963843_966   
963843_967   
963843_968   
963843_969   
963843_970   
963843_971   
963843_972   
963843_973   
963843_974   
963843_975   
963843_976   
963843_977   
963843_978   
963843_979   
963843_980   
963843_981   
963843_982   
963843_983   
963843_984   
963843_985   
963843_986   
963843_987   
963843_988   
963843_989   
963843_990   
963843_991   
963843_992   
964268_88    
964268_89    
967365       
2827304_45   
2827304_46   
4556401_[17+ 
4555133_0    
4555133_1    
4555133_2    
4555133_3    
4555133_4    
4555133_5    
4555133_6    
4555133_7    
4555133_8    
4555133_9    
4555133_10   
4555133_11   
4555133_12   
4555133_13   
4555133_14   
4555133_15   
4555133_16   
4555133_17   
4555133_18   
4555133_19   
4555133_20   
4555133_21   
4555133_22   
4555133_23   
4555133_24   
4555133_25   
4555133_26   
4555133_27   
4555133_28   
4555133_29   
4555133_30   
4555133_31   
4555133_32   
4555133_33   
4555133_34   
4555133_35   
4555133_36   
4555133_37   
4555133_38   
4555133_39   
4555133_40   
4555133_41   
4555133_42   
4555133_43   
4555133_44   
4555133_45   
4555133_46   
4555133_47   
4555133_48   
4555133_49   
4555133_50   
4555133_51   
4555133_52   
4555133_53   
4555133_54   
4555133_55   
4555133_56   
4555133_57   
4555133_58   
4555133_59   
4555133_60   
4555133_61   
4555133_62   
4555133_63   
4555133_64   
4555133_65   
4555133_66   
4555133_67   
4555133_68   
4555133_69   
4555133_70   
4555133_71   
4555133_72   
4555133_73   
4555133_74   
4555133_75   
4555133_76   
4555133_77   
4555133_78   
4555133_79   
4555133_80   
4555133_81   
4555133_82   
4555133_83   
4555133_84   
4555133_85   
4555133_86   
4555133_87   
4555133_88   
4555133_89   
4555133_90   
4555133_91   
4555133_92   
4555133_93   
4555133_94   
4555133_95   
4555133_96   
4555133_97   
4555133_98   
4554970_2075 
4554970_2076 
4554970_2077 
4554970_2078 
4554970_2079 
4554970_2080 
4554970_2081 
4554970_2082 
4554970_2083 
4554970_2084 
4554970_2085 
4554970_2086 
4554970_2087 
4554970_2088 
4554970_2089 
4554970_2090 
4554970_2091 
4554970_2092 
4554970_2093 
4554970_2094 
4554970_2095 
4554970_2096 
4554970_2097 
4554970_2098 
4554970_2099 
4554970_2100 
4554970_2101 
4554970_2102 
4554970_2103 
4554970_2104 
4554970_2105 
4554970_2106 
4554970_2107 
4554970_2108 
4554970_2109 
4554970_2110 
4554970_2111 
4554970_2112 
4554970_2113 
4554970_2114 
4554970_2115 
4554970_2116 
4554970_2117 
4554970_2118 
4554970_2119 
4554970_2120 
4554970_2121 
4554970_2122 
4554970_2123 
4554970_2124 
4554970_2125 
6846863_162  
6846863_163  
6749926_1984 
6846863_164  
6846863_165  
6846863_166  
6749926_1985 
6846863_167  
6846863_168  
6846863_169  
6749926_1986 
6749926_1987 
6749926_1988 
6846863_171  
6846863_172  
6749926_1989 
6749926_1990 
6846863_173  
6846863_174  
6749926_1991 
6846863_175  
6846863_176  
6846863_177  
6846863_178  
6846863_179  
6846863_180  
6846863_181  
6846863_182  
6846863_183  
6846863_184  
6846863_185  
6884615      
6884630      
6884646      
6884651      
6884667      
6884695      
6884708      
6884710      
6884711      
6884723      
6884734      
6883076_4291 
6884768      
6883076_4300 
6884773      
6884779      
6884789      
6883076_4316 
6884792      
6884797      
6883076_4330 
6883076_4331 
6846863_186  
6865115_9997 
6883076_4338 
6883076_4339 
6883076_4340 
6883076_4341 
6883076_4342 
6883076_4343 
6883076_4347 
6883076_4350 
6883076_4352 
6883076_4362 
6883076_4364 
6883076_4366 
6883076_4367 
6883076_4370 
6883076_4371 
6883076_4372 
6883076_4373 
6883076_4374 
6883076_4375 
6883076_4376 
6883076_4377 
6883076_4378 
6883076_4379 
6883076_4380 
6883076_4381 
6883076_4382 
6846863_187  
6883076_4384 
6883076_4385 
6883076_4386 
6883076_4387 
6883076_4388 
6883076_4389 
8948593      
8948593.bat+ 
8948735      
8948735.0    
8948736      
8948737      
8948739      
8948739.0    
8948741      
8948741.0    
8948743      
8948743.0    
8948744      
11163094     
11163095     
11163096     
11163097     
11163098     
11163099     
11163100     
11163101     
11163102     
11163103     
11163104     
11163105     
11163106     
11803874     
15762688     
15991888     
15991889     
15991890     
15991891     
15991892     
15991893     
15991894     
15991895     
15991896     
15991897     
15991898     
15991899     
15991900     
15991901     
15991902     
15991903     
15991904     
15991905     
15991906     
15991907     
15991908     
15991909     
15991910     
15991911     
15991912     
15991913     
15991914     
15991915     
15991916     
15991917     
15991918     
15991919     
15991920     
15991921     
15991922     
15991923     
15991924     
15991925     
15991926     
15991927     
15991928     
15991929     
15991930     
15991931     
15991932     
15991933     
15991934     
21117807     
21117808     
21117809     
21117810     
21117811     
21117812     
21117813     
21117814     
21117815     
21117816     
21117817     
21117818     
21117819     
21117820     
21117821     
21117822     
21117823     
21117824     
21117825     
21117826     
21117827     
21117828     
21117829     
21117830     
21117831     
21117832     
21117833     
21117834     
21117835     
21117836     
21117837     
21117838     
21117839     
21117840     
21117841     
21117842     
21117843     
21117844     
21117845     
21117846     
21117847     
21117848     
21117849     
21117850     
21117851     
21117852     
21117853     
21117854     
21117855     
21117856     
21128693     
21128694     
21128695     
21128696     
21128697     
21128698     
21128699     
21128700     
21128701     
21128702     
21128703     
21128704     
21128705     
21128706     
21128707     
21128708     
21128709     
21128710     
21128711     
21128712     
21128713     
21128714     
21128715     
21128716     
21128717     
21128718     
21128719     
21128720     
21128721     
21128722     
21128723     
21128724     
21128725     
21128726     
21128727     
21128728     
21128729     
21128730     
21128731     
21128732     
21128733     
21128734     
21128735     
21128736     
21128737     
21128738     
21128739     
21128740     
21128741     
21128742     
21128743     
21128744     
21128745     
21128746     
21128747     
21128748     
21128749     
21128750     
21128751     
21128752     
21128753     
21128754     
21128755     
21128756     
21128757     
21128758     
21128759     
21128760     
21128761     
21128762     
21128763     
21128764     
21128765     
21128766     
21128767     
21128768     
21128769     
21128770     
21128771     
21128772     
21128773     
21128774     
21128775     
21128776     
21128777     
21128778     
21128779     
21128780     
21128781     
21128782     
21128783     
21128784     
21128785     
21128786     
21128787     
21128788     
21128789     
21128790     
21128791     
21128792     
21128793     
21128794     
21128795     
21128796     
21128797     
21128798     
21128799     
21128800     
21128801     
21128802     
21128803     
21128804     
21128805     
21128806     
21128807     
21128808     
21128809     
21128810     
21128811     
21128812     
21128813     
21128814     
21128815     
21128816     
21128817     
21128818     
21128819     
21128820     
21128821     
21128822     
21128823     
21128824     
21128825     
21128826     
21128827     
21128828     
21128829     
21128830     
21128831     
21128832     
21128833     
21128834     
21128835     
21128836     
21128837     
21128838     
21128839     
21128840     
21128841     
21128842     
21128843     
21128844     
21128845     
21128846     
21128847     
21995350     
28033437     
28033438     
28209075     
28209075.ex+ 
28220303     
28220303.ex+ 
28220304     
28220304.ex+ 
28678522     
28681980     
28681981     
31511434     
31511434.ex+ 
31511434.0   
31611184_184 
31611184_18+ 
31611184_187 
31611184_18+ 
31611184_189 
31611184_18+ 
31799492     
31799492.ex+ 
31799544     
31799544.ex+ 
31807635     
31807635.ex+ 
31833439     
31833439.ex+ 
31838902     
31838902.ex+ 
31847269     
31847269.ex+ 
31852719     
31852719.ex+ 
31862357     
31862357.ex+ 
31862358     
31862358.ex+ 
31862360     
31862360.ba+ 
31862360.ex+ 
31862362     
31862362.ba+ 
31862362.ex+ 
31862992     
31862992.ex+ 
31865504     
31865504.ex+ 
31886565     
31886565.ba+ 
31886565.ex+ 
31886566     
31886566.ex+ 
31889194     
31889194.ba+ 
31889194.ex+ 
31889195     
31889195.ex+ 
31889589     
31889589.ex+ 
31889590     
31889590.ex+ 
31889591     
31889591.ex+ 
31889592     
31889592.ex+ 
31889593     
31889593.ex+ 
31889594     
31889594.ex+ 
31889595     
31889595.ex+ 
31889596     
31889596.ex+ 
31889597     
31889597.ex+ 
31889598     
31889598.ex+ 
31889599     
31889599.ex+ 
31889600     
31889600.ex+ 
31889601     
31889601.ex+ 
31889602     
31889602.ex+ 
31889603     
31889603.ex+ 
31889604     
31889604.ex+ 
31889605     
31889605.ex+ 
31889606     
31889606.ex+ 
31889607     
31889607.ex+ 
31889608     
31889608.ex+ 
31889609     
31889609.ex+ 
31889610     
31889610.ex+ 
31889611     
31889611.ex+ 
31889612     
31889612.ex+ 
31889613     
31889613.ex+ 
31889614     
31889614.ex+ 
31889615     
31889615.ex+ 
31889616     
31889616.ex+ 
31889617     
31889617.ex+ 
31889618     
31889618.ex+ 
31889619     
31889619.ex+ 
31889620     
31889620.ex+ 
31889621     
31889621.ex+ 
31889622     
31889622.ex+ 
31889623     
31889623.ex+ 
31889624     
31889624.ex+ 
31889625     
31889625.ex+ 
31889626     
31889626.ex+ 
31889627     
31889627.ex+ 
31889628     
31889628.ex+ 
31889629     
31889629.ex+ 
31889630     
31889630.ex+ 
31889631     
31889631.ex+ 
31889632     
31889632.ex+ 
31889633     
31889633.ex+ 
31889634     
31889634.ex+ 
31889635     
31889635.ex+ 
31889636     
31889636.ex+ 
31889637     
31889637.ex+ 
31889638     
31889638.ex+ 
31889639     
31889639.ex+ 
31889640     
31889640.ex+ 
31889641     
31889641.ex+ 
31889642     
31889642.ex+ 
31889643     
31889643.ex+ 
31889644     
31889644.ex+ 
31889645     
31889645.ex+ 
31889646     
31889646.ex+ 
31889647     
31889647.ex+ 
31889648     
31889648.ex+ 
31889649     
31889649.ex+ 
31889650     
31889650.ex+ 
31889651     
31889651.ex+ 
31889652     
31889652.ex+ 
31889653     
31889653.ex+ 
31889654     
31889654.ex+ 
31889655     
31889655.ex+ 
31889656     
31889656.ex+ 
31889657     
31889657.ex+ 
31889658     
31889658.ex+ 
31889659     
31889659.ex+ 
31889660     
31889660.ex+ 
31889661     
31889661.ex+ 
31889662     
31889662.ex+ 
31889663     
31889663.ex+ 
31889664     
31889664.ex+ 
31889665     
31889665.ex+ 
31889666     
31889666.ex+ 
31889667     
31889667.ex+ 
31889669     
31889669.ex+ 
31889670     
31889670.ex+ 
31889671     
31889671.ex+ 
31889672     
31889672.ex+ 
31889673     
31889673.ex+ 
31889674     
31889674.ex+ 
31889675     
31889675.ex+ 
31889676     
31889676.ex+ 
31889677     
31889677.ex+ 
31889678     
31889678.ex+ 
31889679     
31889679.ex+ 
31889680     
31889680.ex+ 
31889681     
31889681.ex+ 
31889682     
31889682.ex+ 
31889683     
31889683.ex+ 
31889684     
31889684.ex+ 
31889685     
31889685.ex+ 
31889686     
31889686.ex+ 
31889687     
31889687.ex+ 
31889688     
31889688.ex+ 
31889689     
31889689.ex+ 
31889690     
31889690.ex+ 
31889691     
31889691.ex+ 
31889692     
31889692.ex+ 
31889693     
31889693.ex+ 
31889694     
31889694.ex+ 
31889695     
31889695.ex+ 
31889696     
31889696.ex+ 
31889697     
31889697.ex+ 
31889698     
31889698.ex+ 
31889699     
31889699.ex+ 
31889700     
31889700.ex+ 
31889701     
31889701.ba+ 
31889701.ex+ 
31889702     
31889702.ex+ 
31889703     
31889703.ex+ 
31889704     
31889704.ex+ 
31889705     
31889705.ex+ 
31889706     
31889706.ex+ 
31889707     
31889707.ex+ 
31889708     
31889708.ex+ 
31889709     
31889709.ex+ 
31889710     
31889710.ex+ 
31889711     
31889711.ex+ 
31889712     
31889712.ex+ 
31889713     
31889713.ex+ 
31889714     
31889714.ex+ 
31889715     
31889715.ex+ 
31889716     
31889716.ex+ 
31889717     
31889717.ex+ 
31889718     
31889718.ex+ 
31889719     
31889719.ex+ 
31889720     
31889720.ex+ 
31889721     
31889721.ex+ 
31889722     
31889722.ex+ 
31889723     
31889723.ex+ 
31889724     
31889724.ex+ 
31889725     
31889725.ex+ 
31889726     
31889726.ex+ 
31889727     
31889727.ex+ 
31889728     
31889728.ex+ 
31889729     
31889729.ex+ 
31889730     
31889730.ex+ 
31889731     
31889731.ex+ 
31889732     
31889732.ex+ 
31889733     
31889733.ex+ 
31889734     
31889734.ex+ 
31889735     
31889735.ex+ 
31889736     
31889736.ex+ 
31889737     
31889737.ex+ 
31889738     
31889738.ex+ 
31889739     
31889739.ex+ 
31889740     
31889740.ex+ 
31889741     
31889741.ex+ 
31889742     
31889742.ex+ 
31889743     
31889743.ex+ 
31889744     
31889744.ex+ 
31889745     
31889745.ex+ 
31889746     
31889746.ex+ 
31889747     
31889747.ex+ 
31889748     
31889748.ex+ 
31889749     
31889749.ex+ 
31889750     
31889750.ex+ 
31889751     
31889751.ex+ 
31889752     
31889752.ex+ 
31889753     
31889753.ex+ 
31889754     
31889754.ex+ 
31889755     
31889755.ex+ 
31889756     
31889756.ex+ 
31889757     
31889757.ex+ 
31889758     
31889758.ex+ 
31889759     
31889759.ex+ 
31889760     
31889760.ex+ 
31889761     
31889761.ex+ 
31889762     
31889762.ex+ 
31889763     
31889763.ex+ 
31889764     
31889764.ex+ 
31889765     
31889765.ex+ 
31889766     
31889766.ex+ 
31889767     
31889767.ex+ 
31889768     
31889768.ex+ 
31889769     
31889769.ex+ 
31889770     
31889770.ex+ 
31889771     
31889771.ex+ 
31889773     
31889773.ex+ 
31889775     
31889775.ex+ 
31889776     
31889776.ex+ 
31889777     
31889777.ex+ 
31889778     
31889778.ex+ 
31889779     
31889779.ex+ 
31889780     
31889780.ex+ 
31889781     
31889781.ex+ 
31889782     
31889782.ex+ 
31889783     
31889783.ex+ 
31889784     
31889784.ex+ 
31889785     
31889785.ex+ 
31889786     
31889786.ex+ 
31889787     
31889787.ex+ 
31889788     
31889788.ex+ 
31889789     
31889789.ex+ 
31889790     
31889790.ex+ 
31889791     
31889791.ex+ 
31889792     
31889792.ex+ 
31889793     
31889793.ex+ 
31889794     
31889794.ex+ 
31889795     
31889795.ex+ 
31889796     
31889796.ex+ 
31889797     
31889797.ex+ 
31889798     
31889798.ex+ 
31889799     
31889799.ex+ 
31889800     
31889800.ex+ 
31889801     
... etc
It includes a bunch of really old jobs for no obvious reason, jumps across large ranges of job IDs, and only becomes sequential again in the 31.8M range.

Any idea why that would be? It almost feels like some counter wrapped around somewhere. Could it be related to the fact that we don't purge old job data from the DB?

Anyway, I'd like to get back the ability to query all jobs in the DB through sacct.
Comment 3 Felip Moll 2019-04-10 09:45:01 MDT
Hi Jurij,

Can you show me the complete mysql output for the affected job?

Specifically, I need these fields:

time_eligible 
time_end 
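
For reference, a query along these lines should return them (a minimal sketch, assuming the hof cluster prefix visible in your output):

MariaDB [slurm]> select time_eligible, time_end from hof_job_table where id_job=31699893;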

Also, does 'sacctmgr show runaway' show you anything?

There have been recent changes in the query logic to fix some issues, so this may be related to those changes rather than to something specific to your site.
Comment 6 Felip Moll 2019-04-11 11:00:59 MDT
Jurij, without more information I cannot diagnose it for sure, but it looks like you are suffering from a known regression introduced in 18.08.6. There we changed the behavior of a couple of SQL queries, which had unexpected collateral effects.

One of these effects is that you cannot query old jobs.

The next release, 18.08.7, will have the fix in place, and it should be out very soon (in the next few days).

If you want to apply the fix right now, the commit id is 3361bf611c61de3bb90f8cadbacf58b4d1dc8707.

I've also discovered another issue where sacct cannot show jobs that have EligibleTime=Unknown. We're working on this issue too.
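
As a quick check for whether you are hitting that one as well, and assuming (an assumption on my side, to be confirmed) that an unknown EligibleTime is stored as a zero time_eligible, something like this would list candidates:

MariaDB [slurm]> select id_job, time_eligible, time_end from hof_job_table where time_eligible=0;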

When you have a chance, please send me the requested information to confirm all of this.
Comment 9 Jurij Pečar 2019-04-12 06:15:19 MDT
Sorry, got drowned by sales drones buzzing around here yesterday. 

'sacctmgr show runaway' gives me some thousands of jobs going back to 2017. I assumed I could safely fix them. This also fixed the unexpected sacct output showing really old jobs. One more trick learned, good :)
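
For the record, a rough way to spot such entries in the database, assuming runaway jobs are the ones that never got an end time recorded (my assumption), is:

MariaDB [slurm]> select id_job, time_submit, time_start from hof_job_table where time_end=0;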

I just updated to 18.08.7 and indeed sacct is now returning jobs again as expected. Thanks.

However, seff is still reporting some nonsense:

# sacct -Plj 31699893
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot
31699893|31699893|dada2.sh|lvic||||||||||||||||||1|1-01:13:33|COMPLETED|0:0||Unknown|Unknown|Unknown|100Gn|0|||||||||||billing=1,cpu=1,mem=100G,node=1|billing=1,cpu=1,mem=100G,node=1|||||||||||||
31699893.batch|31699893.batch|batch||34537244K|nile|0|34537244K|32007060K|nile|0|32007060K|71|nile|0|71|1-22:08:47|nile|0|1-22:08:47|1|1|1-01:13:33|COMPLETED|0:0|17K|0|0|0|100Gn|19.38M|27465.52M|nile|0|27465.52M|2367.46M|nile|0|2367.46M||||cpu=1,mem=100G,node=1|cpu=1-22:08:47,energy=19375074,fs/disk=28799688565,mem=32007060K,pages=71,vmem=34537244K|cpu=1-22:08:47,energy=19369272,fs/disk=28799688565,mem=32007060K,pages=71,vmem=34537244K|cpu=nile,energy=nile,fs/disk=nile,mem=nile,pages=nile,vmem=nile|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=1-22:08:47,energy=19369272,fs/disk=28799688565,mem=32007060K,pages=71,vmem=34537244K|cpu=nile,energy=nile,fs/disk=nile,mem=nile,pages=nile,vmem=nile|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=1-22:08:47,energy=19375074,fs/disk=28799688565,mem=32007060K,pages=71,vmem=34537244K|energy=227,fs/disk=2482458912|energy=nile,fs/disk=nile|fs/disk=0|energy=6170,fs/disk=2482458912|energy=6170,fs/disk=2482458912
31699893.extern|31699893.extern|extern||||||||||||||||||1|1|1-01:13:33|CANCELLED||0|0|0|0|100Gn|0||||||||||||billing=1,cpu=1,mem=100G,node=1|||||||||||||

# seff 31699893
Job ID: 31699893
Cluster: hof
User/Group: hchen/zeller
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 1-22:09:15
CPU Efficiency: 182.96% of 1-01:13:33 core-walltime
Job Wall-clock time: 1-01:13:33
Memory Utilized: 16.00 EB
Memory Efficiency: 17179869184.00% of 100.00 GB

I assume some fixes are needed in the seff script or in the Perl API as well? I reclassified this as a minor issue.
Comment 11 Felip Moll 2019-04-12 10:59:42 MDT
(In reply to Jurij Pečar from comment #9)
> Sorry, got drowned by sales drones buzzing around here yesterday. 
> 
> 'sacctmgr show runaway' gives me some thousands of jobs going back to 2017.
> I assumed I could safely fix them. This also fixed the unexpected sacct
> output showing really old jobs. One more trick learned, good :)

Good. Diagnosing why these jobs went wrong is trickier. If this fixed the issue, let's keep it this way, but please monitor whether new runaway jobs appear at some point, and open a new bug for that if they do.

> I just updated to 18.08.7 and indeed sacct is now returning jobs again as
> expected. Thanks.

Neat.

> However, seff is still reporting some nonsense:
>
> [sacct -Plj and seff output for job 31699893, quoted from comment #9]

I guess you're talking about:

>Memory Utilized: 16.00 EB
>Memory Efficiency: 17179869184.00% of 100.00 GB


Is that it? Can you show me the full mysql select for this job?

> I assume some fixes are needed in the seff script or in the Perl API as
> well?

Not sure yet.


> I reclassified this as a minor issue.

Cool.
Comment 12 Jurij Pečar 2019-04-15 08:17:19 MDT
Job with two steps:

> select * from hof_job_table where id_job=31699893 \G
*************************** 1. row ***************************
        job_db_inx: 62027302
          mod_time: 1554379958
           deleted: 0
           account: zeller
     admin_comment: NULL
    array_task_str: NULL
   array_max_tasks: 0
array_task_pending: 0
          cpus_req: 1
        derived_ec: 0
        derived_es: NULL
         exit_code: 0
          job_name: dada2.sh
          id_assoc: 4758
      id_array_job: 0
     id_array_task: 4294967294
          id_block: NULL
            id_job: 31699893
            id_qos: 1
           id_resv: 0
          id_wckey: 0
           id_user: 23873
          id_group: 702
       pack_job_id: 0
   pack_job_offset: 4294967294
       kill_requid: -1
         mcs_label: NULL
           mem_req: 102400
          nodelist: nile
       nodes_alloc: 1
          node_inx: 211
         partition: lvic
          priority: 3550
             state: 3
         timelimit: 4320
       time_submit: 1554289144
     time_eligible: 1554289144
        time_start: 1554289145
          time_end: 1554379958
    time_suspended: 0
          gres_req: 
        gres_alloc: 
         gres_used: 
             wckey: 
          work_dir: /g/scb2/zeller/SHARED/DATA/16S/Mavis_test
    system_comment: NULL
       track_steps: 0
        tres_alloc: 1=1,2=102400,4=1,5=1
          tres_req: 1=1,2=102400,4=1,5=1

> select * from hof_step_table where job_db_inx=62027302 \G
*************************** 1. row ***************************
               job_db_inx: 62027302
                  deleted: 0
                exit_code: 0
                  id_step: -2
              kill_requid: -1
                 nodelist: nile
              nodes_alloc: 1
                 node_inx: 211
                    state: 3
                step_name: batch
                 task_cnt: 1
                task_dist: 0
               time_start: 1554289145
                 time_end: 1554379958
           time_suspended: 0
                 user_sec: 165464
                user_usec: 713318
                  sys_sec: 690
                 sys_usec: 325511
              act_cpufreq: 17
          consumed_energy: 19375074
          req_cpufreq_min: 0
              req_cpufreq: 0
          req_cpufreq_gov: 0
               tres_alloc: 1=1,2=102400,4=1
        tres_usage_in_ave: 1=166127920,2=32775229440,3=19375074,6=28799688565,7=35366137856,8=71
        tres_usage_in_max: 1=166127920,2=32775229440,3=19369272,6=28799688565,7=35366137856,8=71
 tres_usage_in_max_taskid: 1=0,2=0,6=0,7=0,8=0
 tres_usage_in_max_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
        tres_usage_in_min: 1=166127920,2=32775229440,3=19369272,6=28799688565,7=35366137856,8=71
 tres_usage_in_min_taskid: 1=0,2=0,6=0,7=0,8=0
 tres_usage_in_min_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
        tres_usage_in_tot: 1=166127920,2=32775229440,3=19375074,6=28799688565,7=35366137856,8=71
       tres_usage_out_ave: 3=6170,6=2482458912
       tres_usage_out_max: 3=227,6=2482458912
tres_usage_out_max_taskid: 6=0
tres_usage_out_max_nodeid: 3=0,6=0
       tres_usage_out_min: 3=227,6=2482458912
tres_usage_out_min_taskid: 6=0
tres_usage_out_min_nodeid: 3=0,6=0
       tres_usage_out_tot: 3=6170,6=2482458912
*************************** 2. row ***************************
               job_db_inx: 62027302
                  deleted: 0
                exit_code: -2
                  id_step: -1
              kill_requid: -1
                 nodelist: nile
              nodes_alloc: 1
                 node_inx: 211
                    state: 4
                step_name: extern
                 task_cnt: 1
                task_dist: 0
               time_start: 1554289145
                 time_end: 1554379958
           time_suspended: 0
                 user_sec: 0
                user_usec: 0
                  sys_sec: 0
                 sys_usec: 0
              act_cpufreq: 0
          consumed_energy: 0
          req_cpufreq_min: 0
              req_cpufreq: 0
          req_cpufreq_gov: 0
               tres_alloc: 1=1,2=102400,3=18446744073709551614,4=1,5=1
        tres_usage_in_ave: 
        tres_usage_in_max: 
 tres_usage_in_max_taskid: 
 tres_usage_in_max_nodeid: 
        tres_usage_in_min: 
 tres_usage_in_min_taskid: 
 tres_usage_in_min_nodeid: 
        tres_usage_in_tot: 
       tres_usage_out_ave: 
       tres_usage_out_max: 
tres_usage_out_max_taskid: 
tres_usage_out_max_nodeid: 
       tres_usage_out_min: 
tres_usage_out_min_taskid: 
tres_usage_out_min_nodeid: 
       tres_usage_out_tot:
Comment 13 Felip Moll 2019-04-15 10:15:41 MDT
(In reply to Jurij Pečar from comment #12)
> Job with two steps:
>
> [full hof_job_table and hof_step_table dump for job 31699893, quoted from
> comment #12]

This is a different issue.

MariaDB [slurm_acct_db_master]> select id_step,step_name,tres_usage_in_max from llagosti_step_table where job_db_inx=62027302;
+---------+-----------+-----------------------------------------------------------------------+
| id_step | step_name | tres_usage_in_max                                                     |
+---------+-----------+-----------------------------------------------------------------------+
|      -2 | batch     | 1=166127920,2=32775229440,3=19369272,6=28799688565,7=35366137856,8=71 |
|      -1 | extern    |                                                                       |
+---------+-----------+-----------------------------------------------------------------------+
2 rows in set (0.001 sec)


For some reason, the extern step doesn't have a tres_usage_in_max value recorded. This makes the SlurmDB API return an overflowed number when 'find_tres_count_in_string' is called: when the requested TRES is not found in the string, the function returns a 64-bit "not found" sentinel, which is where the ~16 EB figure comes from.

Seff does:

my $lmem = Slurmdb::find_tres_count_in_string($step->{'stats'}{'tres_usage_in_max'}, TRES_MEM) / 1024;

which turns into a bogus lmem value, which is what you then see in the output.

I am investigating why the extern step may not have this field filled in. I wonder whether you see anything in the slurmctld or slurmdbd logs related to this job. If this is only seen on one job, it could be some issue when updating this field (a mysql restart?). In my internal test database I also see a couple of steps without this information (extern or not).

Can you do:

select job_db_inx,step_name,tres_usage_in_max from <your_cluster>_step_table;

and see which steps don't have this field filled in, then correlate with your job table to get the job IDs, and then check what seff reports for those jobs? A combined query sketch follows below.
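
A one-query version of that correlation (a sketch, assuming the hof table prefix and that the missing values are stored as empty strings rather than NULL):

MariaDB [slurm]> select j.id_job, s.step_name from hof_step_table s join hof_job_table j on s.job_db_inx=j.job_db_inx where s.tres_usage_in_max='';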

I would suggest opening a new issue, because this is totally different from your initial sacct problem. You can link it to this bug.
Comment 16 Felip Moll 2019-04-15 11:22:21 MDT
Jurij, I opened an internal issue for the seff problem, so no need for you to open it.

I will let you know when it is fixed.
Comment 18 Felip Moll 2019-05-20 03:30:36 MDT
Hi, the issue with seff is fixed in commit bab13dfde6d691a26b581eea20ef2f52e0c600a9, released in 19.05.0rc2.

I am closing this bug now since everything should be fine at the moment.

Please mark it as OPEN again if you still encounter issues related to your case.

Thank you,
Felip