After upgrading to 18.08.6-2 (from 18.08.4), users started complaining that their workflow management tools are failing because they can't get information about past jobs. I investigated and found that sacct indeed returns nothing for some jobs. An example:

# sacct -Pj 31699893
JobID|JobName|Partition|Account|AllocCPUS|State|ExitCode

But the job is in the db:

# mysql -E slurm
MariaDB [slurm]> select * from hof_job_table where id_job=31699893;
*************************** 1. row ***************************
job_db_inx: 62027302
mod_time: 1554379958
deleted: 0
account: zeller
admin_comment: NULL
array_task_str: NULL
array_max_tasks: 0
array_task_pending: 0
cpus_req: 1
derived_ec: 0
derived_es: NULL
exit_code: 0
job_name: dada2.sh
id_assoc: 4758
id_array_job: 0
id_array_task: 4294967294
id_block: NULL
id_job: 31699893
... etc

As a consequence, seff also reports nonsense for this job:

# seff 31699893
Job ID: 31699893
Cluster: hof
User/Group: hchen/zeller
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 1-22:09:15
CPU Efficiency: 182.96% of 1-01:13:33 core-walltime
Job Wall-clock time: 1-01:13:33
Memory Utilized: 16.00 EB
Memory Efficiency: 17179869184.00% of 100.00 GB

Some other jobs appear fine:

# sacct -j 31912919
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
31912919     Extract/j+        htc   wilmanns          8  COMPLETED      0:0
31912919.ba+      batch              wilmanns          8  COMPLETED      0:0
31912919.ex+     extern              wilmanns          8  COMPLETED      0:0

# seff 31912919
Job ID: 31912919
Cluster: hof
User/Group: lciccarelli/wilmanns
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 01:03:08
CPU Efficiency: 48.86% of 02:09:12 core-walltime
Job Wall-clock time: 00:16:09
Memory Utilized: 1.09 GB
Memory Efficiency: 9.09% of 12.00 GB

What worries me is that the sequence of JobIDs returned by sacct is totally unexpected:

# sacct -o JobID
JobID
------------
431291 431292 963843_646 963843_647 964268_84 964268_85 963843_648 963843_650 963843_652 963843_653 963843_654 963843_655 963843_656 963843_657 963843_658 963843_659 963843_660 963843_661 963843_662 963843_663 963843_664 963843_665 963843_666 963843_667 963843_668 963843_669 963843_670 963843_671 963843_672 963843_673 963843_674 963843_675 963843_676 963843_677 963843_678 963843_679 963843_680 963843_681 963843_682 963843_683 963843_684 963843_685 963843_686 963843_687 963843_688 963843_689 963843_690 963843_691 963843_692 963843_693 963843_694 963843_695 963843_696 963843_697 963843_698 963843_699 963843_700 963843_701 963843_702 963843_703 963843_704 963843_705 963843_706 963843_707 963843_708 963843_709 963843_710 963843_711 963843_712 963843_713 963843_714 963843_715 963843_716 963843_717 963843_718 963843_719 963843_720 963843_721 963843_722 963843_723 963843_724 963843_725 963843_726 963843_727 963843_728 963843_729 963843_730 963843_731 963843_732 963843_733 963843_734 963843_735 963843_736 963843_737 963843_738 963843_739 963843_740 963843_741 963843_742 963843_743 963843_744 963843_745 963843_746 963843_747 963843_748 963843_749 963843_750 963843_751 963843_752 963843_753 963843_754 963843_755 963843_756 963843_757 963843_758 963843_759 963843_760 963843_761 963843_762 963843_763 963843_764 963843_765 963843_766 963843_767 963843_768 963843_769 963843_770 963843_771 963843_772 963843_773 963843_774 963843_775 963843_776 963843_777 963843_778 963843_779 963843_780 963843_781 963843_782 963843_783 963843_784 963843_785 963843_786 963843_787 963843_788 963843_789 963843_790 963843_791 963843_792
963843_793 963843_794 963843_795 963843_796 963843_797 963843_798 963843_799 963843_800 963843_801 963843_802 963843_803 963843_804 963843_805 963843_806 963843_807 963843_808 963843_809 963843_810 963843_811 963843_812 963843_813 963843_814 963843_815 963843_816 963843_817 963843_818 963843_819 963843_820 963843_821 963843_822 963843_823 963843_824 963843_825 963843_826 963843_827 963843_828 963843_829 963843_830 963843_831 963843_832 963843_833 963843_834 963843_835 963843_836 963843_837 963843_838 963843_839 963843_840 963843_841 963843_842 963843_843 963843_844 963843_845 963843_846 963843_847 963843_848 963843_849 963843_850 963843_851 963843_852 963843_853 963843_854 963843_855 963843_856 963843_857 963843_858 963843_859 963843_860 963843_861 963843_862 963843_863 963843_864 963843_865 963843_866 963843_867 963843_868 963843_869 963843_870 963843_871 963843_872 963843_873 963843_874 963843_875 963843_876 963843_877 963843_878 963843_879 963843_880 963843_881 963843_882 963843_883 963843_884 963843_885 963843_886 963843_887 963843_888 963843_889 963843_890 963843_891 963843_892 963843_893 963843_894 963843_895 963843_896 963843_897 963843_898 963843_899 963843_900 963843_901 963843_902 963843_903 963843_904 963843_905 963843_906 963843_907 963843_908 963843_909 963843_910 963843_911 963843_912 963843_913 963843_914 963843_915 963843_916 963843_917 963843_918 963843_919 963843_920 963843_921 963843_922 963843_923 963843_924 963843_925 963843_926 963843_927 963843_928 963843_929 963843_930 963843_931 963843_932 963843_933 963843_934 963843_935 963843_936 963843_937 963843_938 963843_939 963843_940 963843_941 963843_942 963843_943 963843_944 967310 967310.0 964268_86 963843_945 963843_946 963843_947 963843_948 963843_949 963843_950 963843_951 963843_952 963843_953 963843_954 963843_955 963843_956 963843_957 963843_958 963843_959 963843_960 963843_961 963843_962 963843_963 963843_964 963843_965 963843_966 963843_967 963843_968 963843_969 963843_970 963843_971 963843_972 963843_973 963843_974 963843_975 963843_976 963843_977 963843_978 963843_979 963843_980 963843_981 963843_982 963843_983 963843_984 963843_985 963843_986 963843_987 963843_988 963843_989 963843_990 963843_991 963843_992 964268_88 964268_89 967365 2827304_45 2827304_46 4556401_[17+ 4555133_0 4555133_1 4555133_2 4555133_3 4555133_4 4555133_5 4555133_6 4555133_7 4555133_8 4555133_9 4555133_10 4555133_11 4555133_12 4555133_13 4555133_14 4555133_15 4555133_16 4555133_17 4555133_18 4555133_19 4555133_20 4555133_21 4555133_22 4555133_23 4555133_24 4555133_25 4555133_26 4555133_27 4555133_28 4555133_29 4555133_30 4555133_31 4555133_32 4555133_33 4555133_34 4555133_35 4555133_36 4555133_37 4555133_38 4555133_39 4555133_40 4555133_41 4555133_42 4555133_43 4555133_44 4555133_45 4555133_46 4555133_47 4555133_48 4555133_49 4555133_50 4555133_51 4555133_52 4555133_53 4555133_54 4555133_55 4555133_56 4555133_57 4555133_58 4555133_59 4555133_60 4555133_61 4555133_62 4555133_63 4555133_64 4555133_65 4555133_66 4555133_67 4555133_68 4555133_69 4555133_70 4555133_71 4555133_72 4555133_73 4555133_74 4555133_75 4555133_76 4555133_77 4555133_78 4555133_79 4555133_80 4555133_81 4555133_82 4555133_83 4555133_84 4555133_85 4555133_86 4555133_87 4555133_88 4555133_89 4555133_90 4555133_91 4555133_92 4555133_93 4555133_94 4555133_95 4555133_96 4555133_97 4555133_98 4554970_2075 4554970_2076 4554970_2077 4554970_2078 4554970_2079 4554970_2080 4554970_2081 4554970_2082 4554970_2083 4554970_2084 4554970_2085 4554970_2086 4554970_2087 4554970_2088 
4554970_2089 4554970_2090 4554970_2091 4554970_2092 4554970_2093 4554970_2094 4554970_2095 4554970_2096 4554970_2097 4554970_2098 4554970_2099 4554970_2100 4554970_2101 4554970_2102 4554970_2103 4554970_2104 4554970_2105 4554970_2106 4554970_2107 4554970_2108 4554970_2109 4554970_2110 4554970_2111 4554970_2112 4554970_2113 4554970_2114 4554970_2115 4554970_2116 4554970_2117 4554970_2118 4554970_2119 4554970_2120 4554970_2121 4554970_2122 4554970_2123 4554970_2124 4554970_2125 6846863_162 6846863_163 6749926_1984 6846863_164 6846863_165 6846863_166 6749926_1985 6846863_167 6846863_168 6846863_169 6749926_1986 6749926_1987 6749926_1988 6846863_171 6846863_172 6749926_1989 6749926_1990 6846863_173 6846863_174 6749926_1991 6846863_175 6846863_176 6846863_177 6846863_178 6846863_179 6846863_180 6846863_181 6846863_182 6846863_183 6846863_184 6846863_185 6884615 6884630 6884646 6884651 6884667 6884695 6884708 6884710 6884711 6884723 6884734 6883076_4291 6884768 6883076_4300 6884773 6884779 6884789 6883076_4316 6884792 6884797 6883076_4330 6883076_4331 6846863_186 6865115_9997 6883076_4338 6883076_4339 6883076_4340 6883076_4341 6883076_4342 6883076_4343 6883076_4347 6883076_4350 6883076_4352 6883076_4362 6883076_4364 6883076_4366 6883076_4367 6883076_4370 6883076_4371 6883076_4372 6883076_4373 6883076_4374 6883076_4375 6883076_4376 6883076_4377 6883076_4378 6883076_4379 6883076_4380 6883076_4381 6883076_4382 6846863_187 6883076_4384 6883076_4385 6883076_4386 6883076_4387 6883076_4388 6883076_4389 8948593 8948593.bat+ 8948735 8948735.0 8948736 8948737 8948739 8948739.0 8948741 8948741.0 8948743 8948743.0 8948744 11163094 11163095 11163096 11163097 11163098 11163099 11163100 11163101 11163102 11163103 11163104 11163105 11163106 11803874 15762688 15991888 15991889 15991890 15991891 15991892 15991893 15991894 15991895 15991896 15991897 15991898 15991899 15991900 15991901 15991902 15991903 15991904 15991905 15991906 15991907 15991908 15991909 15991910 15991911 15991912 15991913 15991914 15991915 15991916 15991917 15991918 15991919 15991920 15991921 15991922 15991923 15991924 15991925 15991926 15991927 15991928 15991929 15991930 15991931 15991932 15991933 15991934 21117807 21117808 21117809 21117810 21117811 21117812 21117813 21117814 21117815 21117816 21117817 21117818 21117819 21117820 21117821 21117822 21117823 21117824 21117825 21117826 21117827 21117828 21117829 21117830 21117831 21117832 21117833 21117834 21117835 21117836 21117837 21117838 21117839 21117840 21117841 21117842 21117843 21117844 21117845 21117846 21117847 21117848 21117849 21117850 21117851 21117852 21117853 21117854 21117855 21117856 21128693 21128694 21128695 21128696 21128697 21128698 21128699 21128700 21128701 21128702 21128703 21128704 21128705 21128706 21128707 21128708 21128709 21128710 21128711 21128712 21128713 21128714 21128715 21128716 21128717 21128718 21128719 21128720 21128721 21128722 21128723 21128724 21128725 21128726 21128727 21128728 21128729 21128730 21128731 21128732 21128733 21128734 21128735 21128736 21128737 21128738 21128739 21128740 21128741 21128742 21128743 21128744 21128745 21128746 21128747 21128748 21128749 21128750 21128751 21128752 21128753 21128754 21128755 21128756 21128757 21128758 21128759 21128760 21128761 21128762 21128763 21128764 21128765 21128766 21128767 21128768 21128769 21128770 21128771 21128772 21128773 21128774 21128775 21128776 21128777 21128778 21128779 21128780 21128781 21128782 21128783 21128784 21128785 21128786 21128787 21128788 21128789 21128790 21128791 21128792 21128793 
21128794 21128795 21128796 21128797 21128798 21128799 21128800 21128801 21128802 21128803 21128804 21128805 21128806 21128807 21128808 21128809 21128810 21128811 21128812 21128813 21128814 21128815 21128816 21128817 21128818 21128819 21128820 21128821 21128822 21128823 21128824 21128825 21128826 21128827 21128828 21128829 21128830 21128831 21128832 21128833 21128834 21128835 21128836 21128837 21128838 21128839 21128840 21128841 21128842 21128843 21128844 21128845 21128846 21128847 21995350 28033437 28033438 28209075 28209075.ex+ 28220303 28220303.ex+ 28220304 28220304.ex+ 28678522 28681980 28681981 31511434 31511434.ex+ 31511434.0 31611184_184 31611184_18+ 31611184_187 31611184_18+ 31611184_189 31611184_18+ 31799492 31799492.ex+ 31799544 31799544.ex+ 31807635 31807635.ex+ 31833439 31833439.ex+ 31838902 31838902.ex+ 31847269 31847269.ex+ 31852719 31852719.ex+ 31862357 31862357.ex+ 31862358 31862358.ex+ 31862360 31862360.ba+ 31862360.ex+ 31862362 31862362.ba+ 31862362.ex+ 31862992 31862992.ex+ 31865504 31865504.ex+ 31886565 31886565.ba+ 31886565.ex+ 31886566 31886566.ex+ 31889194 31889194.ba+ 31889194.ex+ 31889195 31889195.ex+ 31889589 31889589.ex+ 31889590 31889590.ex+ 31889591 31889591.ex+ 31889592 31889592.ex+ 31889593 31889593.ex+ 31889594 31889594.ex+ 31889595 31889595.ex+ 31889596 31889596.ex+ 31889597 31889597.ex+ 31889598 31889598.ex+ 31889599 31889599.ex+ 31889600 31889600.ex+ 31889601 31889601.ex+ 31889602 31889602.ex+ 31889603 31889603.ex+ 31889604 31889604.ex+ 31889605 31889605.ex+ 31889606 31889606.ex+ 31889607 31889607.ex+ 31889608 31889608.ex+ 31889609 31889609.ex+ 31889610 31889610.ex+ 31889611 31889611.ex+ 31889612 31889612.ex+ 31889613 31889613.ex+ 31889614 31889614.ex+ 31889615 31889615.ex+ 31889616 31889616.ex+ 31889617 31889617.ex+ 31889618 31889618.ex+ 31889619 31889619.ex+ 31889620 31889620.ex+ 31889621 31889621.ex+ 31889622 31889622.ex+ 31889623 31889623.ex+ 31889624 31889624.ex+ 31889625 31889625.ex+ 31889626 31889626.ex+ 31889627 31889627.ex+ 31889628 31889628.ex+ 31889629 31889629.ex+ 31889630 31889630.ex+ 31889631 31889631.ex+ 31889632 31889632.ex+ 31889633 31889633.ex+ 31889634 31889634.ex+ 31889635 31889635.ex+ 31889636 31889636.ex+ 31889637 31889637.ex+ 31889638 31889638.ex+ 31889639 31889639.ex+ 31889640 31889640.ex+ 31889641 31889641.ex+ 31889642 31889642.ex+ 31889643 31889643.ex+ 31889644 31889644.ex+ 31889645 31889645.ex+ 31889646 31889646.ex+ 31889647 31889647.ex+ 31889648 31889648.ex+ 31889649 31889649.ex+ 31889650 31889650.ex+ 31889651 31889651.ex+ 31889652 31889652.ex+ 31889653 31889653.ex+ 31889654 31889654.ex+ 31889655 31889655.ex+ 31889656 31889656.ex+ 31889657 31889657.ex+ 31889658 31889658.ex+ 31889659 31889659.ex+ 31889660 31889660.ex+ 31889661 31889661.ex+ 31889662 31889662.ex+ 31889663 31889663.ex+ 31889664 31889664.ex+ 31889665 31889665.ex+ 31889666 31889666.ex+ 31889667 31889667.ex+ 31889669 31889669.ex+ 31889670 31889670.ex+ 31889671 31889671.ex+ 31889672 31889672.ex+ 31889673 31889673.ex+ 31889674 31889674.ex+ 31889675 31889675.ex+ 31889676 31889676.ex+ 31889677 31889677.ex+ 31889678 31889678.ex+ 31889679 31889679.ex+ 31889680 31889680.ex+ 31889681 31889681.ex+ 31889682 31889682.ex+ 31889683 31889683.ex+ 31889684 31889684.ex+ 31889685 31889685.ex+ 31889686 31889686.ex+ 31889687 31889687.ex+ 31889688 31889688.ex+ 31889689 31889689.ex+ 31889690 31889690.ex+ 31889691 31889691.ex+ 31889692 31889692.ex+ 31889693 31889693.ex+ 31889694 31889694.ex+ 31889695 31889695.ex+ 31889696 31889696.ex+ 31889697 31889697.ex+ 31889698 31889698.ex+ 31889699 
31889699.ex+ 31889700 31889700.ex+ 31889701 31889701.ba+ 31889701.ex+ 31889702 31889702.ex+ 31889703 31889703.ex+ 31889704 31889704.ex+ 31889705 31889705.ex+ 31889706 31889706.ex+ 31889707 31889707.ex+ 31889708 31889708.ex+ 31889709 31889709.ex+ 31889710 31889710.ex+ 31889711 31889711.ex+ 31889712 31889712.ex+ 31889713 31889713.ex+ 31889714 31889714.ex+ 31889715 31889715.ex+ 31889716 31889716.ex+ 31889717 31889717.ex+ 31889718 31889718.ex+ 31889719 31889719.ex+ 31889720 31889720.ex+ 31889721 31889721.ex+ 31889722 31889722.ex+ 31889723 31889723.ex+ 31889724 31889724.ex+ 31889725 31889725.ex+ 31889726 31889726.ex+ 31889727 31889727.ex+ 31889728 31889728.ex+ 31889729 31889729.ex+ 31889730 31889730.ex+ 31889731 31889731.ex+ 31889732 31889732.ex+ 31889733 31889733.ex+ 31889734 31889734.ex+ 31889735 31889735.ex+ 31889736 31889736.ex+ 31889737 31889737.ex+ 31889738 31889738.ex+ 31889739 31889739.ex+ 31889740 31889740.ex+ 31889741 31889741.ex+ 31889742 31889742.ex+ 31889743 31889743.ex+ 31889744 31889744.ex+ 31889745 31889745.ex+ 31889746 31889746.ex+ 31889747 31889747.ex+ 31889748 31889748.ex+ 31889749 31889749.ex+ 31889750 31889750.ex+ 31889751 31889751.ex+ 31889752 31889752.ex+ 31889753 31889753.ex+ 31889754 31889754.ex+ 31889755 31889755.ex+ 31889756 31889756.ex+ 31889757 31889757.ex+ 31889758 31889758.ex+ 31889759 31889759.ex+ 31889760 31889760.ex+ 31889761 31889761.ex+ 31889762 31889762.ex+ 31889763 31889763.ex+ 31889764 31889764.ex+ 31889765 31889765.ex+ 31889766 31889766.ex+ 31889767 31889767.ex+ 31889768 31889768.ex+ 31889769 31889769.ex+ 31889770 31889770.ex+ 31889771 31889771.ex+ 31889773 31889773.ex+ 31889775 31889775.ex+ 31889776 31889776.ex+ 31889777 31889777.ex+ 31889778 31889778.ex+ 31889779 31889779.ex+ 31889780 31889780.ex+ 31889781 31889781.ex+ 31889782 31889782.ex+ 31889783 31889783.ex+ 31889784 31889784.ex+ 31889785 31889785.ex+ 31889786 31889786.ex+ 31889787 31889787.ex+ 31889788 31889788.ex+ 31889789 31889789.ex+ 31889790 31889790.ex+ 31889791 31889791.ex+ 31889792 31889792.ex+ 31889793 31889793.ex+ 31889794 31889794.ex+ 31889795 31889795.ex+ 31889796 31889796.ex+ 31889797 31889797.ex+ 31889798 31889798.ex+ 31889799 31889799.ex+ 31889800 31889800.ex+ 31889801 ... etc

It includes a bunch of really old jobs for no obvious reason, jumps across large ranges of job IDs, and is only sequential in the current 31.8M range. Any idea why that would be? It almost feels like some counter wrapped around somewhere. Could it be related to the fact that we don't purge old job data from the db? In any case, I'd like to get back the ability to query all jobs in the db through sacct.
Hi Jurij,

Can you show me the complete mysql record for the affected job? I specifically need the fields:

time_eligible
time_end

Also, does 'sacctmgr show runaway' show you anything?

There have been recent changes in the query logic to fix some issues, so this may be related to them rather than to something specific to your site.
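For reference, something along these lines should show me what I need (a sketch using the 'hof' table prefix from your paste):

# mysql -E slurm
MariaDB [slurm]> select time_eligible, time_end from hof_job_table where id_job=31699893;

# sacctmgr show runaway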
Jurij, without more information I cannot diagnose it for sure, but it looks like you are suffering from a known regression introduced in 18.08.6. There we changed the behavior of a couple of sql queries, which had unexpected collateral effects. One of these effects is that you cannot query old jobs. The next release, 18.08.7, will have the fix in place, and it should be out very soon (in the next few days). If you want to apply the fix right now, the commit id is 3361bf611c61de3bb90f8cadbacf58b4d1dc8707.

I've also discovered another issue where sacct cannot show jobs that have EligibleTime=Unknown. We're working on that issue too.

When you have a chance, please send me the requested information so we can confirm all of this.
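If you do want to backport it before 18.08.7 is out, a rough sketch of the cherry-pick (this assumes you build from a git checkout of your current 18.08.6 source; adjust to however you normally package Slurm):

git fetch                       # make sure the fix commit is available locally
git cherry-pick 3361bf611c61de3bb90f8cadbacf58b4d1dc8707
# then rebuild and reinstall slurmdbd the way you normally do, e.g.
./configure && make && make install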
Sorry, got drowned by sales drones buzzing around here yesterday.

'sacctmgr show runaway' gives me some thousands of jobs going back to 2017. I assumed I could safely fix them. This also fixed the unexpected output of sacct showing really old jobs. One more trick I learned, good :)

I just updated to 18.08.7 and indeed sacct is now returning jobs as expected again. Thanks.

However, seff is still reporting some nonsense:

# sacct -Plj 31699893
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot
31699893|31699893|dada2.sh|lvic||||||||||||||||||1|1-01:13:33|COMPLETED|0:0||Unknown|Unknown|Unknown|100Gn|0|||||||||||billing=1,cpu=1,mem=100G,node=1|billing=1,cpu=1,mem=100G,node=1|||||||||||||
31699893.batch|31699893.batch|batch||34537244K|nile|0|34537244K|32007060K|nile|0|32007060K|71|nile|0|71|1-22:08:47|nile|0|1-22:08:47|1|1|1-01:13:33|COMPLETED|0:0|17K|0|0|0|100Gn|19.38M|27465.52M|nile|0|27465.52M|2367.46M|nile|0|2367.46M||||cpu=1,mem=100G,node=1|cpu=1-22:08:47,energy=19375074,fs/disk=28799688565,mem=32007060K,pages=71,vmem=34537244K|cpu=1-22:08:47,energy=19369272,fs/disk=28799688565,mem=32007060K,pages=71,vmem=34537244K|cpu=nile,energy=nile,fs/disk=nile,mem=nile,pages=nile,vmem=nile|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=1-22:08:47,energy=19369272,fs/disk=28799688565,mem=32007060K,pages=71,vmem=34537244K|cpu=nile,energy=nile,fs/disk=nile,mem=nile,pages=nile,vmem=nile|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=1-22:08:47,energy=19375074,fs/disk=28799688565,mem=32007060K,pages=71,vmem=34537244K|energy=227,fs/disk=2482458912|energy=nile,fs/disk=nile|fs/disk=0|energy=6170,fs/disk=2482458912|energy=6170,fs/disk=2482458912
31699893.extern|31699893.extern|extern||||||||||||||||||1|1|1-01:13:33|CANCELLED||0|0|0|0|100Gn|0||||||||||||billing=1,cpu=1,mem=100G,node=1|||||||||||||

# seff 31699893
Job ID: 31699893
Cluster: hof
User/Group: hchen/zeller
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 1-22:09:15
CPU Efficiency: 182.96% of 1-01:13:33 core-walltime
Job Wall-clock time: 1-01:13:33
Memory Utilized: 16.00 EB
Memory Efficiency: 17179869184.00% of 100.00 GB

I assume there will need to be some fixes in the seff script or in the perl api as well?

I reclassified this as a minor issue.
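In the meantime, the quantities seff is trying to report (memory used vs. requested) can be read straight out of sacct as a stopgap, e.g. something along these lines (the .batch step is the one carrying the MaxRSS sample in the output above):

sacct -j 31699893.batch -n -P -o JobID,MaxRSS,ReqMem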
(In reply to Jurij Pečar from comment #9)
> 'sacctmgr show runaway' gives me some thousands of jobs going back to 2017.
> I assumed I could safely fix them. This also fixed the unexpected output of
> sacct showing really old jobs. One more trick I learned, good :)

Good. Diagnosing why these jobs were wrong is trickier. If this fixed the issue, let's keep it this way, but please monitor whether new runaway jobs appear at some point, and open a new bug for that if they do.

> I just updated to 18.08.7 and indeed sacct is now returning jobs as expected
> again. Thanks.

Neat.

> However, seff is still reporting some nonsense:
> [sacct -Plj and seff output for job 31699893 snipped; see comment #9]

I guess you're talking about:

> Memory Utilized: 16.00 EB
> Memory Efficiency: 17179869184.00% of 100.00 GB

Is that it? Can you show me the full mysql select for this job?

> I assume there will need to be some fixes in the seff script or in the perl
> api as well?

Not sure yet.

> I reclassified this as a minor issue.

Cool.
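For the mysql select, something like this should give me everything (a sketch; the job_db_inx value comes from your earlier paste):

# mysql -E slurm
MariaDB [slurm]> select * from hof_job_table where id_job=31699893;
MariaDB [slurm]> select * from hof_step_table where job_db_inx=62027302;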
Job with two steps:

> select * from hof_job_table where id_job=31699893 \G
*************************** 1. row ***************************
job_db_inx: 62027302
mod_time: 1554379958
deleted: 0
account: zeller
admin_comment: NULL
array_task_str: NULL
array_max_tasks: 0
array_task_pending: 0
cpus_req: 1
derived_ec: 0
derived_es: NULL
exit_code: 0
job_name: dada2.sh
id_assoc: 4758
id_array_job: 0
id_array_task: 4294967294
id_block: NULL
id_job: 31699893
id_qos: 1
id_resv: 0
id_wckey: 0
id_user: 23873
id_group: 702
pack_job_id: 0
pack_job_offset: 4294967294
kill_requid: -1
mcs_label: NULL
mem_req: 102400
nodelist: nile
nodes_alloc: 1
node_inx: 211
partition: lvic
priority: 3550
state: 3
timelimit: 4320
time_submit: 1554289144
time_eligible: 1554289144
time_start: 1554289145
time_end: 1554379958
time_suspended: 0
gres_req:
gres_alloc:
gres_used:
wckey:
work_dir: /g/scb2/zeller/SHARED/DATA/16S/Mavis_test
system_comment: NULL
track_steps: 0
tres_alloc: 1=1,2=102400,4=1,5=1
tres_req: 1=1,2=102400,4=1,5=1

> select * from hof_step_table where job_db_inx=62027302 \G
*************************** 1. row ***************************
job_db_inx: 62027302
deleted: 0
exit_code: 0
id_step: -2
kill_requid: -1
nodelist: nile
nodes_alloc: 1
node_inx: 211
state: 3
step_name: batch
task_cnt: 1
task_dist: 0
time_start: 1554289145
time_end: 1554379958
time_suspended: 0
user_sec: 165464
user_usec: 713318
sys_sec: 690
sys_usec: 325511
act_cpufreq: 17
consumed_energy: 19375074
req_cpufreq_min: 0
req_cpufreq: 0
req_cpufreq_gov: 0
tres_alloc: 1=1,2=102400,4=1
tres_usage_in_ave: 1=166127920,2=32775229440,3=19375074,6=28799688565,7=35366137856,8=71
tres_usage_in_max: 1=166127920,2=32775229440,3=19369272,6=28799688565,7=35366137856,8=71
tres_usage_in_max_taskid: 1=0,2=0,6=0,7=0,8=0
tres_usage_in_max_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
tres_usage_in_min: 1=166127920,2=32775229440,3=19369272,6=28799688565,7=35366137856,8=71
tres_usage_in_min_taskid: 1=0,2=0,6=0,7=0,8=0
tres_usage_in_min_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
tres_usage_in_tot: 1=166127920,2=32775229440,3=19375074,6=28799688565,7=35366137856,8=71
tres_usage_out_ave: 3=6170,6=2482458912
tres_usage_out_max: 3=227,6=2482458912
tres_usage_out_max_taskid: 6=0
tres_usage_out_max_nodeid: 3=0,6=0
tres_usage_out_min: 3=227,6=2482458912
tres_usage_out_min_taskid: 6=0
tres_usage_out_min_nodeid: 3=0,6=0
tres_usage_out_tot: 3=6170,6=2482458912
*************************** 2. row ***************************
job_db_inx: 62027302
deleted: 0
exit_code: -2
id_step: -1
kill_requid: -1
nodelist: nile
nodes_alloc: 1
node_inx: 211
state: 4
step_name: extern
task_cnt: 1
task_dist: 0
time_start: 1554289145
time_end: 1554379958
time_suspended: 0
user_sec: 0
user_usec: 0
sys_sec: 0
sys_usec: 0
act_cpufreq: 0
consumed_energy: 0
req_cpufreq_min: 0
req_cpufreq: 0
req_cpufreq_gov: 0
tres_alloc: 1=1,2=102400,3=18446744073709551614,4=1,5=1
tres_usage_in_ave:
tres_usage_in_max:
tres_usage_in_max_taskid:
tres_usage_in_max_nodeid:
tres_usage_in_min:
tres_usage_in_min_taskid:
tres_usage_in_min_nodeid:
tres_usage_in_tot:
tres_usage_out_ave:
tres_usage_out_max:
tres_usage_out_max_taskid:
tres_usage_out_max_nodeid:
tres_usage_out_min:
tres_usage_out_min_taskid:
tres_usage_out_min_nodeid:
tres_usage_out_tot:
(In reply to Jurij Pečar from comment #12)
> Job with two steps:
> [hof_job_table and hof_step_table rows for job 31699893 snipped; see comment #12]

This is a different issue.

MariaDB [slurm_acct_db_master]> select id_step,step_name,tres_usage_in_max from llagosti_step_table where job_db_inx=62027302;
+---------+-----------+------------------------------------------------------------------------+
| id_step | step_name | tres_usage_in_max                                                      |
+---------+-----------+------------------------------------------------------------------------+
|      -2 | batch     | 1=166127920,2=32775229440,3=19369272,6=28799688565,7=35366137856,8=71 |
|      -1 | extern    |                                                                        |
+---------+-----------+------------------------------------------------------------------------+
2 rows in set (0.001 sec)

For some reason the extern step doesn't have a tres_usage_in_max value recorded. This makes the SlurmDB API return an overflowed number when 'find_tres_count_in_string' is called. seff does:

my $lmem = Slurmdb::find_tres_count_in_string($step->{'stats'}{'tres_usage_in_max'}, TRES_MEM) / 1024;

which produces the bogus lmem value you then see in the output.

I am investigating why the extern step may not have this field filled in. Do you see anything in the slurmctld or slurmdbd logs related to this job? If this is only seen on one job, it could be some issue when updating this field (a mysql restart, perhaps?). In my internal test database I also see a couple of steps without this information (extern or otherwise).

Can you run:

select job_db_inx,step_name,tres_usage_in_max from <your_cluster>_step_table;

see which steps don't have this field filled in, correlate them with your job table to get the job ids, and then check what seff reports for those jobs?

I would suggest opening a new bug for this, because it is completely different from the initial problem in this report. You can link it to this one.
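For the correlation above, something along these lines should list the affected jobs directly (a sketch; it assumes your 'hof' table prefix and that the missing value is stored as an empty string or NULL):

MariaDB [slurm]> select j.id_job, s.step_name
    -> from hof_step_table s
    -> join hof_job_table j on j.job_db_inx = s.job_db_inx
    -> where s.tres_usage_in_max = '' or s.tres_usage_in_max is null;

Then run seff on a few of the returned job ids and compare.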
Jurij, I opened an internal issue for the seff problem, so there is no need for you to open one. I will let you know when it is fixed.
Hi,

The seff issue is fixed in commit bab13dfde6d691a26b581eea20ef2f52e0c600a9, included in release 19.05.0rc2.

I am closing this bug now, since everything should be fine at the moment. Please mark it as OPEN again if you still encounter issues related to your case.

Thank you,
Felip