Dear Team,

Can you please help us understand this sreport query:

$ sreport -t percent -T gres/gpu,cpu,gres/gpu:v100,gres/gpu:gtx1080ti,gres/gpu:rtx2080ti cluster Utilization start="2020-04-19T00:00:00" end="2020-04-26T00:00:00"
--------------------------------------------------------------------------------
Cluster Utilization 2020-04-19T00:00:00 - 2020-04-25T23:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name  Allocated       Down  PLND Down      Idle   Reserved     Reported
--------- -------------- ---------- ---------- ---------- --------- ---------- ------------
   dragon            cpu     47.82%     28.19%      4.45%     3.63%     15.91%       99.99%
   dragon       gres/gpu     68.34%      2.17%      0.00%    29.49%      0.00%      100.00%
   dragon gres/gpu:gtx1+     99.94%      0.06%      0.00%     0.00%      0.00%  410170400.0
   dragon gres/gpu:rtx2+     25.89%     74.11%      0.00%     0.00%      0.00%      138.02%
   dragon  gres/gpu:v100     99.69%      0.31%      0.00%     0.00%      0.00%     1431.29%

Issues:
- the headings of this report don't match the description in the sreport manual (Cluster Utilization)
- gres/gpu:rtx2+ is showing 74% down, but this is nowhere near correct
- the 'Reported' column shows values over 100%

thanks,
-greg
Hi Greg,

The first issue you mentioned looks like it matches the documentation to me. In https://slurm.schedmd.com/sreport.html, under "Cluster Utilization", the second note says:

"The default view for the "Cluster Utilization" report includes the following fields: Cluster, Allocated, Down, PlannedDown, Idle, Reserved, Reported."

If this isn't what you mean, please let me know.

The other two issues look like bugs to me, but I will have to do more research. I'll get back to you on them once I know the answer.

-Scott
Regarding the headings - not sure why I got that wrong.

This may help too:

$ sreport -T gres/gpu,cpu,gres/gpu:v100,gres/gpu:gtx1080ti,gres/gpu:rtx2080ti cluster Utilization start="2020-04-19T00:00:00" end="2020-04-26T00:00:00"
--------------------------------------------------------------------------------
Cluster Utilization 2020-04-19T00:00:00 - 2020-04-25T23:59:59
Usage reported in TRES Minutes
--------------------------------------------------------------------------------
  Cluster      TRES Name  Allocated       Down  PLND Down      Idle   Reserved    Reported
--------- -------------- ---------- ---------- ---------- --------- ---------- -----------
   dragon            cpu   93153566   54902197    8669504   7077713   30981166   194784146
   dragon       gres/gpu    2734732      86964          0   1180063          0     4001760
   dragon gres/gpu:gtx1+      68318         44          0         0          0       68362
   dragon gres/gpu:rtx2+      28817      82486          0         0          0      111303
   dragon  gres/gpu:v100    1006816       3104          0         0          0     1009920
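Putting the two reports side by side suggests where the inflated 'Reported' percentages come from. Below is a minimal back-of-the-envelope sketch - it assumes, hypothetically, that sreport computes the Reported percentage as reported TRES-minutes divided by (stored TRES count x minutes in the period); the stored counts of 7 and 8 are inferred, not taken from the database:

```python
# Hypothetical back-calculation from the two reports above. If the
# Reported percentage is (reported TRES-minutes) / (stored TRES count
# * minutes in the reporting period), a wrong stored count would
# inflate the percentage well past 100%.
MINUTES_IN_WEEK = 7 * 24 * 60  # 10080

def reported_percent(reported_minutes, stored_count):
    return round(100.0 * reported_minutes / (stored_count * MINUTES_IN_WEEK), 2)

# gres/gpu:v100 reported 1009920 TRES-minutes; a stored count of 7
# reproduces the 1431.29% shown in the percent report:
print(reported_percent(1009920, 7))  # -> 1431.29
# gres/gpu:rtx2+ reported 111303 TRES-minutes; a stored count of 8
# reproduces the 138.02% figure:
print(reported_percent(111303, 8))   # -> 138.02
```

If that guess is right, the over-100% values are a symptom of a too-small stored TRES count rather than of bad usage records.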
Hi Greg,

Thanks for the other report - that helped. To fully identify the issue, could you share a database dump? I'm not sure whether the issue is in the database values or in the sreport function.

For reference, this is how you make a database dump:

mysqldump --databases <your_database> > <file>

Please compress the file as well:

gzip <file>

-Scott
I've taken a dump of the DB. Compressed, it's 1.1 GB.

Can it be uploaded somewhere? (Or do you want it attached to this ticket?)

-greg
Greg,

That is bigger than the maximum file size for this site, so it won't work here. You could share it via Google Drive (or a similar service) with scott@schedmd.com.

-Scott
Greg,

How do you know that 74% down is not correct?

Could you send me your slurm.conf and any included .conf files? I want to see how the nodes are set up, to help me identify patterns in what is reported as down.

Also, I got your database working and have reproduced the same issues on my machine now.

I'm still not sure what is causing Reported to be greater than 100%, but I am looking into it.

Thanks,
Scott
Greg,

Also send your gres.conf if you can.

Thanks,
Scott
(In reply to Scott Hilton from comment #7)
> Greg,
>
> How do you know that 74% down is not correct?
>
> Could you send me your slurm.conf and any included .conf files. I want to
> see how the nodes are set up to help me identify patterns in what is
> reported as down.
>
> Also, I got your database working and have reproduced same issues on my
> machine now.
>
> I'm still not sure what is causing reported to be greater than 100% but I am
> looking into it.
>
> Thanks, Scott

I think I forgot to enable notifications on this comment. Can you send me your slurm.conf and included files?

Thanks,
Scott
Greg,

I believe part of the fix for bug 8958 will also fix the issue where you were getting percentages much larger than 100%. This is the commit:

https://github.com/SchedMD/slurm/commit/010d752b275a1ccb1c9537238233c938f2412ec2

Could you explain how you know that 74% down for the RTX GPUs is not correct? Also, do you have any other info about the RTX setup that may be relevant?

Take care,
Scott
Hi Scott,

The evidence that the 2080s were up is stored in our Slurm database (we use Slurm to report when nodes are down, etc.).

We have 32 x RTX 2080 Ti in 4 nodes (8 GPUs per node). During the week in question one node was down (all week) due to a GPU fault. The remaining three nodes were processing jobs (all week).

For Slurm to report that the 2080 Ti was down 74.11% would have meant that three of the four nodes were unavailable for virtually the entire week (just shy of 75% down). We are confident this was not the case.

(BTW - I sent all our configs as a Dropbox link - did you get it?)

-greg
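Greg's expectation above can be written out explicitly - a trivial sketch of the arithmetic in his description, with no Slurm internals assumed:

```python
# 4 nodes x 8 RTX 2080 Ti each; one full node down for the whole week.
total_gpus = 4 * 8          # 32 GPUs in the rtx2080ti pool
down_gpus = 1 * 8           # one node's worth of GPUs

expected_down_pct = 100.0 * down_gpus / total_gpus
print(expected_down_pct)    # -> 25.0, far from the 74.11% sreport shows
```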
Greg,

Yes, I did get your configs. They were quite helpful in diagnosing the problems for this bug and bug 8958. Thanks for the info.

I'll keep you updated on that last issue.

-Scott
Greg,

It looks like the total down time recorded for the failed graphics cards is correct. The percentage was off because we had an incorrect TRES count (8 instead of 24). This should also be fixed by the change in 8958. If you had had the fix from 8958, I would guess you would get something like:

>Cluster  TRES Name       Allocated  Down  PLND Down  Idle  Reserved  Reported
>dragon   gres/gpu:rtx2+  ~9%        ~26%  0.00%      ~65%  0.00%     100%

That still means, though, that only 9% of the time was allocated. This could mean that the GPUs really were idle the rest of the time, or that there is a bug counting the allocated time (which I am looking into). If it was idle time, it could result from nodes running CPU-only jobs, or from jobs reserving the nodes until they have all the resources they need to run.

Have you encountered any similar issues to this? Also, have you updated to 19.05.7 or 20.02.3?

-Scott
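For what it's worth, the estimated row above can be reproduced from the TRES-minute report earlier in the thread, assuming the full 32-GPU count Greg described and that all unaccounted time becomes idle. This is a hypothetical reconstruction, not sreport's actual code path:

```python
# Reconstruct the "fixed" rtx2080ti row from the TRES-minute report,
# assuming a correct pool of 32 GPUs over a 10080-minute week.
MINUTES_IN_WEEK = 10080
capacity = 32 * MINUTES_IN_WEEK       # 322560 TRES-minutes available
allocated, down = 28817, 82486        # from the TRES-minute report
idle = capacity - allocated - down    # assume the remainder counts as idle

print(round(100 * allocated / capacity, 1))  # -> 8.9  (~9%)
print(round(100 * down / capacity, 1))       # -> 25.6 (~26%)
print(round(100 * idle / capacity, 1))       # -> 65.5 (~65%)
```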
Hi Scott,

I ran a simple script over the output of sacct -X for the week in question. The results are:

gpu504-37        0 GPU seconds  (0% of period)
gpu510-02  4224835 GPU seconds (87% of period)
gpu510-12  3199137 GPU seconds (66% of period)
gpu510-17  2431046 GPU seconds (50% of period)

(GPU seconds is the number of seconds during the week that a GPU was allocated to a job. The percentage is the ratio of GPU seconds used to GPU seconds available [8 GPUs x the number of seconds in the week].)

Our Slurm version is currently:

$ scontrol -V
slurm 19.05.5

We plan on upgrading to 20 in the next maintenance session (in three weeks). In the interim we can upgrade to a later version of 19 if it is recommended.
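The per-node percentages above can be reproduced in a few lines. This is a sketch of the calculation Greg describes, not his actual script:

```python
# Per-node GPU utilization: GPU-seconds allocated to jobs, divided by
# GPU-seconds available on the node (8 GPUs x seconds in the week).
SECONDS_IN_WEEK = 7 * 24 * 3600  # 604800

def pct_of_period(gpu_seconds, gpus_per_node=8):
    return round(100 * gpu_seconds / (gpus_per_node * SECONDS_IN_WEEK))

print(pct_of_period(4224835))  # -> 87  (gpu510-02)
print(pct_of_period(3199137))  # -> 66  (gpu510-12)
print(pct_of_period(2431046))  # -> 50  (gpu510-17)
```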
Greg,

Could you send me that script?

Thanks,
Scott
Hi Scott,

The script is at https://gitlab.com/greg.wickham/python-slurm

When the script runs it identifies a job that spans more than one node. Looking at this job further (# 10342997), I'm not sure how to update the script to keep track of the correct allocations.

Maybe the following questions should be a separate ticket, but they are related to the data for this ticket. (Let me know if I should open a separate ticket.)

For job 10342997:

$ sacct -P -j 10342997 --format jobidraw,start,end,nodelist,reqtres,reqgres,alloctres,allocgres
JobIDRaw|Start|End|NodeList|ReqTRES|ReqGRES|AllocTRES|AllocGRES
10342997|2020-04-21T18:33:34|2020-04-21T18:33:50|gpu504-37,gpu510-17|billing=20,cpu=20,gres/gpu=4,mem=120G,node=1|gpu:8|billing=20,cpu=20,gres/gpu=8,mem=120G,node=2|gpu:8
10342997.batch|2020-04-21T18:33:34|2020-04-21T18:33:50|gpu504-37||gpu:8|cpu=10,mem=60G,node=1|gpu:8
10342997.extern|2020-04-21T18:33:34|2020-04-21T18:33:51|gpu504-37,gpu510-17||gpu:8|billing=20,cpu=20,gres/gpu=8,mem=120G,node=2|gpu:8
10342997.0|2020-04-21T18:33:36|2020-04-21T18:33:52|gpu510-17||gpu:8|cpu=1,gres/gpu:gtx1080ti=4,gres/gpu=4,mem=6G,node=1|gpu:8

$ sinfo -n gpu510-17,gpu504-37 -o "%n,%G"
HOSTNAMES,GRES
gpu504-37,gpu:gtx1080ti:8(S:0-1)
gpu510-17,gpu:rtx2080ti:8(S:0-1)

Q1: The step "10342997.0" indicates using "gres/gpu:gtx1080ti=4" on node gpu510-17, but that node has "rtx2080ti".
Q2: 10342997.0 shows AllocTRES with 4 GPUs, while AllocGRES shows 8 GPUs.
Q3: 10342997.batch and 10342997.0 are running at the same time on different nodes - what is the total number of GPUs allocated?
Greg,

Could you post those questions in a new ticket? We will just focus on the original question involving missing allocation time in this thread. This will help us keep track of the issues.

Thanks,
Scott
created https://bugs.schedmd.com/show_bug.cgi?id=9159
Greg,

The numbers from your script look like they are in the right ballpark to me. This tells me that the direct accounting records are probably correct; the issue is probably in how the time is added up.

I'm not sure, but I think there is a good chance that the previous fix (see 8958) could also fix this discrepancy where we are missing allocation time, leading to 74% down.

Once you update to 20.02.3 or 19.05.7, if you see any new issues like this let me know. I'm going to close this ticket for now, but if the same issue appears again please reopen it so we can address the bug.

Good luck,
Scott