| Summary: | sacct returns no data |
|---|---|
| Product: | Slurm |
| Reporter: | Jurij Pečar <jurij.pecar> |
| Component: | Accounting |
| Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED FIXED |
| Severity: | 4 - Minor Issue |
| CC: | albert.gil |
| Version: | 18.08.6 |
| Hardware: | Linux |
| OS: | Linux |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6697, https://bugs.schedmd.com/show_bug.cgi?id=6862 |
| Site: | EMBL |
| Version Fixed: | 19.05.0rc2 |
| Ticket Depends on: | 6862 |
Description

Jurij Pečar, 2019-04-09 02:02:30 MDT

Felip Moll:

Hi Jurij,

Can you show me the complete mysql command for the affected job? I specifically need the fields:

    time_eligible
    time_end

Also, does 'sacctmgr show runaway' show you anything? There have been recent changes in the query logic to fix some issues, so this could be related to those rather than something specific to your site.

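For reference, these fields can be read straight from the accounting database. A minimal sketch of such a query (the hof_ table prefix is an assumption taken from output later in this ticket; substitute your cluster's name):

```sql
-- Sketch only: inspect the timestamps for the affected job directly.
-- hof_ is the cluster table prefix seen later in this ticket.
SELECT time_eligible, time_end
FROM hof_job_table
WHERE id_job = 31699893;
```
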
Felip Moll:

Jurij, without more information I cannot diagnose this for sure, but it looks like you are suffering from a known regression introduced in 18.08.6. There we changed the behavior of a couple of SQL queries, and that change had unexpected collateral effects. One of those effects is that you cannot query old jobs. The next release, 18.08.7, will have the fix in place, and it should be out very soon (in the next few days). If you want to apply the fix right now, the commit id is 3361bf611c61de3bb90f8cadbacf58b4d1dc8707.

I have also discovered another issue, where sacct cannot show jobs that have EligibleTime=unknown. We are working on that one too.

When you have a chance, please send me the requested information so I can confirm all of this.

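Applying that commit ahead of the release would look roughly like this (a sketch, assuming you build from a git checkout of the official Slurm repository):

```sh
# Sketch only: cherry-pick the fix onto a local 18.08.6 source tree.
git fetch origin
git cherry-pick 3361bf611c61de3bb90f8cadbacf58b4d1dc8707
# Rebuild and reinstall, then restart slurmdbd so the corrected
# query logic takes effect.
```
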
Jurij Pečar:

Sorry, got drowned by sales drones buzzing around here yesterday.

'sacctmgr show runaway' gives me some thousands of jobs going back to 2017. I assumed I could safely fix them. Doing so also fixed the unexpected sacct output showing really old jobs. One more trick I learned, good :)

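For readers following along, the cleanup flow referred to here is roughly the following (a sketch; the exact confirmation prompt varies by Slurm version):

```sh
# 'sacctmgr show runaway' lists jobs the accounting database still
# believes are running and offers to fix them interactively.
sacctmgr show runaway
# Answering yes sets an end time on the listed jobs, which is what
# cleared the stale entries going back to 2017 in this case.
```
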
Jurij Pečar:

I just updated to 18.08.7 and indeed sacct is returning jobs again as expected. Thanks.

However, seff is still reporting some nonsense:

```
# sacct -Plj 31699893
JobID|JobIDRaw|JobName|Partition|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|AveVMSize|MaxRSS|MaxRSSNode|MaxRSSTask|AveRSS|MaxPages|MaxPagesNode|MaxPagesTask|AvePages|MinCPU|MinCPUNode|MinCPUTask|AveCPU|NTasks|AllocCPUS|Elapsed|State|ExitCode|AveCPUFreq|ReqCPUFreqMin|ReqCPUFreqMax|ReqCPUFreqGov|ReqMem|ConsumedEnergy|MaxDiskRead|MaxDiskReadNode|MaxDiskReadTask|AveDiskRead|MaxDiskWrite|MaxDiskWriteNode|MaxDiskWriteTask|AveDiskWrite|AllocGRES|ReqGRES|ReqTRES|AllocTRES|TRESUsageInAve|TRESUsageInMax|TRESUsageInMaxNode|TRESUsageInMaxTask|TRESUsageInMin|TRESUsageInMinNode|TRESUsageInMinTask|TRESUsageInTot|TRESUsageOutMax|TRESUsageOutMaxNode|TRESUsageOutMaxTask|TRESUsageOutAve|TRESUsageOutTot
31699893|31699893|dada2.sh|lvic||||||||||||||||||1|1-01:13:33|COMPLETED|0:0||Unknown|Unknown|Unknown|100Gn|0|||||||||||billing=1,cpu=1,mem=100G,node=1|billing=1,cpu=1,mem=100G,node=1|||||||||||||
31699893.batch|31699893.batch|batch||34537244K|nile|0|34537244K|32007060K|nile|0|32007060K|71|nile|0|71|1-22:08:47|nile|0|1-22:08:47|1|1|1-01:13:33|COMPLETED|0:0|17K|0|0|0|100Gn|19.38M|27465.52M|nile|0|27465.52M|2367.46M|nile|0|2367.46M||||cpu=1,mem=100G,node=1|cpu=1-22:08:47,energy=19375074,fs/disk=28799688565,mem=32007060K,pages=71,vmem=34537244K|cpu=1-22:08:47,energy=19369272,fs/disk=28799688565,mem=32007060K,pages=71,vmem=34537244K|cpu=nile,energy=nile,fs/disk=nile,mem=nile,pages=nile,vmem=nile|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=1-22:08:47,energy=19369272,fs/disk=28799688565,mem=32007060K,pages=71,vmem=34537244K|cpu=nile,energy=nile,fs/disk=nile,mem=nile,pages=nile,vmem=nile|cpu=0,fs/disk=0,mem=0,pages=0,vmem=0|cpu=1-22:08:47,energy=19375074,fs/disk=28799688565,mem=32007060K,pages=71,vmem=34537244K|energy=227,fs/disk=2482458912|energy=nile,fs/disk=nile|fs/disk=0|energy=6170,fs/disk=2482458912|energy=6170,fs/disk=2482458912
31699893.extern|31699893.extern|extern||||||||||||||||||1|1|1-01:13:33|CANCELLED||0|0|0|0|100Gn|0||||||||||||billing=1,cpu=1,mem=100G,node=1|||||||||||||
```

```
# seff 31699893
Job ID: 31699893
Cluster: hof
User/Group: hchen/zeller
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 1-22:09:15
CPU Efficiency: 182.96% of 1-01:13:33 core-walltime
Job Wall-clock time: 1-01:13:33
Memory Utilized: 16.00 EB
Memory Efficiency: 17179869184.00% of 100.00 GB
```

I assume some fixes are needed in the seff script or in the Perl API as well?

I have reclassified this as a minor issue.

Felip Moll:

(In reply to Jurij Pečar from comment #9)
> 'sacctmgr show runaway' gives me some thousands of jobs going back to 2017.
> I assumed I could safely fix them. Doing so also fixed the unexpected sacct
> output showing really old jobs. One more trick I learned, good :)

Good. Diagnosing why those jobs were wrong is trickier. If this fixed the issue, let's keep it this way, but please monitor whether new runaway jobs appear at some point, and open a new bug if they do.

> I just updated to 18.08.7 and indeed sacct is returning jobs again as
> expected. Thanks.

Neat.

> However, seff is still reporting some nonsense:

I guess you are talking about:

> Memory Utilized: 16.00 EB
> Memory Efficiency: 17179869184.00% of 100.00 GB

Is that it? Can you show me the full mysql select for this job?

> I assume some fixes are needed in the seff script or in the Perl API as well?

Not sure yet.

> I have reclassified this as a minor issue.

Cool.

Jurij Pečar:

Job with two steps:

```
> select * from hof_job_table where id_job=31699893 \G
*************************** 1. row ***************************
        job_db_inx: 62027302
          mod_time: 1554379958
           deleted: 0
           account: zeller
     admin_comment: NULL
    array_task_str: NULL
   array_max_tasks: 0
array_task_pending: 0
          cpus_req: 1
        derived_ec: 0
        derived_es: NULL
         exit_code: 0
          job_name: dada2.sh
          id_assoc: 4758
      id_array_job: 0
     id_array_task: 4294967294
          id_block: NULL
            id_job: 31699893
            id_qos: 1
           id_resv: 0
          id_wckey: 0
           id_user: 23873
          id_group: 702
       pack_job_id: 0
   pack_job_offset: 4294967294
       kill_requid: -1
         mcs_label: NULL
           mem_req: 102400
          nodelist: nile
       nodes_alloc: 1
          node_inx: 211
         partition: lvic
          priority: 3550
             state: 3
         timelimit: 4320
       time_submit: 1554289144
     time_eligible: 1554289144
        time_start: 1554289145
          time_end: 1554379958
    time_suspended: 0
          gres_req:
        gres_alloc:
         gres_used:
             wckey:
          work_dir: /g/scb2/zeller/SHARED/DATA/16S/Mavis_test
    system_comment: NULL
       track_steps: 0
        tres_alloc: 1=1,2=102400,4=1,5=1
          tres_req: 1=1,2=102400,4=1,5=1

> select * from hof_step_table where job_db_inx=62027302 \G
*************************** 1. row ***************************
               job_db_inx: 62027302
                  deleted: 0
                exit_code: 0
                  id_step: -2
              kill_requid: -1
                 nodelist: nile
              nodes_alloc: 1
                 node_inx: 211
                    state: 3
                step_name: batch
                 task_cnt: 1
                task_dist: 0
               time_start: 1554289145
                 time_end: 1554379958
           time_suspended: 0
                 user_sec: 165464
                user_usec: 713318
                  sys_sec: 690
                 sys_usec: 325511
              act_cpufreq: 17
          consumed_energy: 19375074
          req_cpufreq_min: 0
              req_cpufreq: 0
          req_cpufreq_gov: 0
               tres_alloc: 1=1,2=102400,4=1
        tres_usage_in_ave: 1=166127920,2=32775229440,3=19375074,6=28799688565,7=35366137856,8=71
        tres_usage_in_max: 1=166127920,2=32775229440,3=19369272,6=28799688565,7=35366137856,8=71
 tres_usage_in_max_taskid: 1=0,2=0,6=0,7=0,8=0
 tres_usage_in_max_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
        tres_usage_in_min: 1=166127920,2=32775229440,3=19369272,6=28799688565,7=35366137856,8=71
 tres_usage_in_min_taskid: 1=0,2=0,6=0,7=0,8=0
 tres_usage_in_min_nodeid: 1=0,2=0,3=0,6=0,7=0,8=0
        tres_usage_in_tot: 1=166127920,2=32775229440,3=19375074,6=28799688565,7=35366137856,8=71
       tres_usage_out_ave: 3=6170,6=2482458912
       tres_usage_out_max: 3=227,6=2482458912
tres_usage_out_max_taskid: 6=0
tres_usage_out_max_nodeid: 3=0,6=0
       tres_usage_out_min: 3=227,6=2482458912
tres_usage_out_min_taskid: 6=0
tres_usage_out_min_nodeid: 3=0,6=0
       tres_usage_out_tot: 3=6170,6=2482458912
*************************** 2. row ***************************
               job_db_inx: 62027302
                  deleted: 0
                exit_code: -2
                  id_step: -1
              kill_requid: -1
                 nodelist: nile
              nodes_alloc: 1
                 node_inx: 211
                    state: 4
                step_name: extern
                 task_cnt: 1
                task_dist: 0
               time_start: 1554289145
                 time_end: 1554379958
           time_suspended: 0
                 user_sec: 0
                user_usec: 0
                  sys_sec: 0
                 sys_usec: 0
              act_cpufreq: 0
          consumed_energy: 0
          req_cpufreq_min: 0
              req_cpufreq: 0
          req_cpufreq_gov: 0
               tres_alloc: 1=1,2=102400,3=18446744073709551614,4=1,5=1
        tres_usage_in_ave:
        tres_usage_in_max:
 tres_usage_in_max_taskid:
 tres_usage_in_max_nodeid:
        tres_usage_in_min:
 tres_usage_in_min_taskid:
 tres_usage_in_min_nodeid:
        tres_usage_in_tot:
       tres_usage_out_ave:
       tres_usage_out_max:
tres_usage_out_max_taskid:
tres_usage_out_max_nodeid:
       tres_usage_out_min:
tres_usage_out_min_taskid:
tres_usage_out_min_nodeid:
       tres_usage_out_tot:
```

Felip Moll:

(In reply to Jurij Pečar from comment #12)
> Job with two steps:

This is a different issue.

```
MariaDB [slurm_acct_db_master]> select id_step,step_name,tres_usage_in_max from llagosti_step_table where job_db_inx=62027302;
+---------+-----------+------------------------------------------------------------------------+
| id_step | step_name | tres_usage_in_max                                                      |
+---------+-----------+------------------------------------------------------------------------+
|      -2 | batch     | 1=166127920,2=32775229440,3=19369272,6=28799688565,7=35366137856,8=71  |
|      -1 | extern    |                                                                        |
+---------+-----------+------------------------------------------------------------------------+
2 rows in set (0.001 sec)
```

For some reason, the extern step does not have a tres_usage_in_max value recorded. This makes the SlurmDB API return an overflowed number when 'find_tres_count_in_string' is called. seff does:

```perl
my $lmem = Slurmdb::find_tres_count_in_string($step->{'stats'}{'tres_usage_in_max'}, TRES_MEM) / 1024;
```

which produces a bogus lmem value, and that is what you then see in the output.

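A minimal defensive sketch around that call, written as it might sit inside seff's per-step loop ($step and TRES_MEM come from the surrounding script). This is hypothetical, not the actual upstream fix; it assumes the sentinel returned for a missing TRES entry is a huge 64-bit value, which matches the 16.00 EB readout above:

```perl
# Hypothetical guard, not the committed fix: skip steps with no
# tres_usage_in_max recorded instead of dividing a sentinel by 1024.
use constant NO_DATA_CUTOFF => 2**60;   # assumption: sentinel is close to 2**64

my $lmem;
my $tres_in_max = $step->{'stats'}{'tres_usage_in_max'} // '';
if ($tres_in_max ne '') {
    my $count = Slurmdb::find_tres_count_in_string($tres_in_max, TRES_MEM);
    if ($count >= 0 && $count < NO_DATA_CUTOFF) {
        $lmem = $count / 1024;    # memory high-water mark for this step
    }
}
```
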
I am investigating why the extern step might not have this field filled in. I wonder if you see anything in the slurmctld or slurmdbd logs related to this job. If this is only seen on one job, it could be some issue when this field was updated (a mysql restart, perhaps?). In my internal test database I also see a couple of steps without this information (extern or not).

Can you run:

```
select job_db_inx,step_name,tres_usage_in_max from <your_cluster>_step_table;
```

see which steps do not have this field filled in, correlate with your job table to find the job ids, and then check what seff reports for those jobs?

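That correlation can also be done in one query. A hypothetical sketch (it assumes the hof_ prefix from earlier output, and that a missing value is stored as an empty string, as the extern row above suggests):

```sql
-- Hypothetical sketch: jobs whose steps lack a recorded tres_usage_in_max.
SELECT j.id_job, s.step_name
FROM hof_step_table AS s
JOIN hof_job_table AS j ON j.job_db_inx = s.job_db_inx
WHERE s.tres_usage_in_max = '';
```
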
I would suggest opening a new ticket for this, because it is entirely different from your initial problem. You can link it to this bug.

Felip Moll:

Jurij, I opened an internal issue for the seff problem, so there is no need for you to open one. I will let you know when it is fixed.

Felip Moll:

Hi, the seff issue is fixed in bab13dfde6d691a26b581eea20ef2f52e0c600a9, released in 19.05.0rc2.

I am closing this bug now, since everything should be fine at the moment. Please set it back to OPEN if you still encounter issues related to your case.

Thank you,
Felip