| Summary: | slurmctld locked up - max threads; possible deadlock for job completion when BB non functional | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmctld | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | CC: | dpaul, tinad |
| Version: | 15.08.8 | ||
| Hardware: | Cray XC | ||
| OS: | Linux | ||
| Site: | NERSC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | 15.08.9 16.05.0-pre2 | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: |
backtrace of all threads
slurmctld log another (larger) backtrace |
||
Created attachment 2895 [details]
slurmctld log
I believe that the dwrest service became unreliable as early as 3:30AM today based on some comments from Dave Paul. He's attached and can give more details on that when he is available. My read of the backtrace is that 30 threads were trying to get a read or write lock on the JOB_LOCK. Some of these were BB threads. Many others threads seemed to be throttled based on the existing count of RPCs of the same type present in the system. My hypothesis is that when dwrest became unreliable BB jobs failed to teardown correctly, and I'm guessing the RPC stayed alive in the slurmctld memory. At some point the number of jobs trying to complete exceeded the threshold for RPC throttling and then all completing jobs got stuck waiting on the deadlock to break. Clearly this didn't happen. If true, this suggests that SLURM is strongly dependent on DW tearing down, and thus that the dwrest service is operating correctly. Is that true? Are there options for effectively dealing with this situation? (e.g., putting the dw commands into a work-queue for periodic inspection, rather than having an RPC stay alive?) Thanks, Doug Doug, this is fixed in commit https://github.com/SchedMD/slurm/commit/dcfa6ec034ad4 It should happen fairly rarely though. Thanks for the back trace it was very helpful finding the issue. You should be able to restart the slurmctld and all should be good. Please reopen if you notice this happen again after the patch. Hi Danny, I applied the patch, but we're still hitting max threads pretty consistently. In fact right now slurmctld is crawling and we're at 513 pthreads (guessing not all of these correspond to slurm's idea of its legislated threads). Attaching a backtrace for all 513 threads. Will work on getting logs updated. -Doug Created attachment 2897 [details]
another (larger) backtrace
some output a few minutes later. No child processes (e.g. xtcleanup_after or dw_wlm commands) nid00837:/tmp # ps --ppid $(pgrep slurmctld) PID TTY TIME CMD nid00837:/tmp # sdiag ******************************************************* sdiag output at Mon Mar 21 21:44:27 2016 Data since Mon Mar 21 17:00:00 2016 ******************************************************* Server thread count: 209 Agent queue size: 0 Jobs submitted: 1651 Jobs started: 1439 Jobs completed: 1422 Jobs canceled: 111 Jobs failed: 0 Main schedule statistics (microseconds): Last cycle: 1816821 Max cycle: 2000040 Total cycles: 1629 Mean cycle: 1485685 Mean depth cycle: 400 Cycles per minute: 5 Last queue length: 3185 Backfilling stats Total backfilled jobs (since last slurm start): 1906 Total backfilled jobs (since last stats cycle start): 1197 Total cycles: 109 Last cycle when: Mon Mar 21 21:42:58 2016 Last cycle: 1176137 Max cycle: 6588614 Mean cycle: 2597680 Last depth cycle: 3168 Last depth cycle (try sched): 143 Depth Mean: 2538 Depth Mean (try depth): 164 Last queue length: 3189 Queue length mean: 3077 Remote Procedure Call statistics by message type REQUEST_JOB_STEP_CREATE ( 5001) count:66560 ave_time:5368977 total_time:357359170262 REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:30662 ave_time:24198760 total_time:741982400772 REQUEST_CANCEL_JOB_STEP ( 5005) count:30393 ave_time:2430176 total_time:73860364348 REQUEST_STEP_COMPLETE ( 5016) count:26932 ave_time:1754264 total_time:47245852420 REQUEST_PARTITION_INFO ( 2009) count:20354 ave_time:2993 total_time:60921843 REQUEST_COMPLETE_PROLOG ( 6018) count:14327 ave_time:491854 total_time:7046796518 MESSAGE_EPILOG_COMPLETE ( 6012) count:13989 ave_time:526917 total_time:7371052923 REQUEST_JOB_USER_INFO ( 2039) count:12692 ave_time:1050204 total_time:13329199978 MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:12377 ave_time:206522 total_time:2556128584 REQUEST_SUBMIT_BATCH_JOB ( 4003) count:6343 ave_time:1094288 total_time:6941070745 REQUEST_JOB_INFO ( 2003) count:5712 ave_time:2209448 total_time:12620372632 REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:3702 ave_time:371836 total_time:1376540279 REQUEST_JOB_INFO_SINGLE ( 2021) count:2168 ave_time:1725353 total_time:3740566161 REQUEST_PING ( 1008) count:647 ave_time:262 total_time:169725 REQUEST_NODE_INFO ( 2007) count:546 ave_time:143091 total_time:78127793 REQUEST_NODE_INFO_SINGLE ( 2040) count:257 ave_time:169860 total_time:43654145 REQUEST_UPDATE_NODE ( 3002) count:251 ave_time:38674 total_time:9707230 REQUEST_KILL_JOB ( 5032) count:233 ave_time:387665 total_time:90325969 REQUEST_JOB_READY ( 4019) count:175 ave_time:14612203 total_time:2557135581 REQUEST_JOB_NOTIFY ( 4022) count:169 ave_time:19833 total_time:3351930 REQUEST_BURST_BUFFER_INFO ( 2037) count:97 ave_time:2688 total_time:260775 REQUEST_RESOURCE_ALLOCATION ( 4001) count:88 ave_time:1537222 total_time:135275613 REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:73 ave_time:221638 total_time:16179630 REQUEST_PRIORITY_FACTORS ( 2026) count:43 ave_time:5833953 total_time:250860005 REQUEST_UPDATE_JOB ( 3001) count:28 ave_time:503150 total_time:14088211 REQUEST_JOB_SBCAST_CRED ( 4023) count:11 ave_time:17435 total_time:191795 REQUEST_UPDATE_PARTITION ( 3005) count:10 ave_time:5469 total_time:54694 REQUEST_CREATE_RESERVATION ( 3006) count:4 ave_time:271436 total_time:1085744 REQUEST_RESERVATION_INFO ( 2024) count:3 ave_time:199 total_time:598 ACCOUNTING_UPDATE_MSG (10001) count:3 ave_time:970258 total_time:2910774 REQUEST_STATS_INFO ( 2035) count:3 ave_time:176 total_time:528 REQUEST_SHARE_INFO ( 2022) count:1 ave_time:8620 total_time:8620 REQUEST_DELETE_RESERVATION ( 3008) count:1 ave_time:39489 total_time:39489 Remote Procedure Call statistics by user root ( 0) count:73832 ave_time:906379 total_time:66919835928 chhabra ( 70608) count:18797 ave_time:327674 total_time:6159293226 aoliu ( 56679) count:8885 ave_time:732082 total_time:6504554785 wyang ( 31382) count:4256 ave_time:874210 total_time:3720641525 kononenk ( 48195) count:2731 ave_time:353837 total_time:966329931 racah ( 61228) count:2566 ave_time:680061 total_time:1745038493 gabinder ( 57927) count:1970 ave_time:306812 total_time:604421472 pmlee ( 61777) count:1862 ave_time:101310 total_time:188639915 sball ( 66005) count:1775 ave_time:3163801 total_time:5615747083 yaolilbl ( 69607) count:1246 ave_time:197956 total_time:246653517 bwoo ( 69657) count:967 ave_time:1336571 total_time:1292465115 ptfproc ( 62098) count:927 ave_time:286530 total_time:265614068 hou ( 51464) count:899 ave_time:63302 total_time:56908903 gpzhang ( 41964) count:898 ave_time:1437999 total_time:1291323183 polkituser ( 108) count:748 ave_time:2333949 total_time:1745793883 rwelsch ( 69678) count:399 ave_time:68831 total_time:27463862 gprodri ( 60805) count:398 ave_time:5021682 total_time:1998629505 xuchen ( 54724) count:368 ave_time:197118 total_time:72539700 jialin ( 58179) count:365 ave_time:84877 total_time:30980370 pengcui ( 59255) count:326 ave_time:2161984 total_time:704807061 zli ( 12843) count:298 ave_time:129391 total_time:38558592 chunjing ( 46250) count:296 ave_time:1454093 total_time:430411723 pccampb2 ( 60631) count:266 ave_time:5515142 total_time:1467027975 luisruiz ( 69207) count:263 ave_time:1903120 total_time:500520613 sagiyama ( 58709) count:248 ave_time:1313515 total_time:325751736 tabarnes ( 49295) count:218 ave_time:199611 total_time:43515411 mlobet ( 70991) count:194 ave_time:276424 total_time:53626308 nweeks ( 68921) count:192 ave_time:195082 total_time:37455838 jiqiang ( 13500) count:184 ave_time:3544256 total_time:652143266 fangyong ( 53958) count:181 ave_time:1753062 total_time:317304302 operator ( 34510) count:181 ave_time:52546 total_time:9510833 dmj ( 56094) count:174 ave_time:397903 total_time:69235293 xiazhu ( 71334) count:171 ave_time:773954 total_time:132346247 usgweb ( 33442) count:170 ave_time:3447555 total_time:586084382 zarija ( 52379) count:154 ave_time:779228 total_time:120001122 hwpeng ( 48169) count:151 ave_time:2767853 total_time:417945822 yxl232 ( 69375) count:148 ave_time:685545 total_time:101460746 emmas ( 70248) count:143 ave_time:1033376 total_time:147772835 sleak ( 70817) count:131 ave_time:4524048 total_time:592650360 haobin ( 12588) count:131 ave_time:132749 total_time:17390135 hshan ( 32726) count:129 ave_time:3035696 total_time:391604843 yixia ( 55113) count:119 ave_time:6934798 total_time:825240989 xiaobao ( 70051) count:116 ave_time:702947 total_time:81541906 dfeldman ( 56837) count:115 ave_time:3377018 total_time:388357134 startsev ( 16891) count:115 ave_time:5138783 total_time:590960121 leinoaa ( 71110) count:115 ave_time:60011 total_time:6901309 lsqphot ( 62521) count:109 ave_time:5076617 total_time:553351303 jdeslip ( 34014) count:101 ave_time:4315867 total_time:435902592 alsdata ( 57074) count:97 ave_time:140253 total_time:13604585 loring ( 46040) count:95 ave_time:159724 total_time:15173838 aspencer ( 45648) count:94 ave_time:273280 total_time:25688383 preeti ( 61846) count:90 ave_time:152703 total_time:13743340 pselv ( 69907) count:89 ave_time:7482772 total_time:665966772 lpinol ( 71088) count:88 ave_time:92900 total_time:8175202 truong ( 64901) count:88 ave_time:568261 total_time:50007040 glock ( 69615) count:88 ave_time:14336 total_time:1261650 asarje ( 54453) count:86 ave_time:127997 total_time:11007784 ndk ( 31805) count:80 ave_time:1117608 total_time:89408659 jzhou ( 64642) count:79 ave_time:1869041 total_time:147654245 cwpark ( 57215) count:76 ave_time:6902578 total_time:524595991 dpaul ( 15448) count:71 ave_time:363675 total_time:25820992 monoue ( 56205) count:70 ave_time:126622 total_time:8863594 cschreck ( 60571) count:70 ave_time:48541 total_time:3397908 lonardon ( 52122) count:68 ave_time:2420672 total_time:164605746 friesen ( 52244) count:66 ave_time:263566 total_time:17395370 aike ( 34983) count:66 ave_time:189840 total_time:12529460 bajdich ( 48661) count:66 ave_time:170350 total_time:11243116 ddkang ( 31318) count:65 ave_time:12163004 total_time:790595304 shlufl ( 71339) count:65 ave_time:110570 total_time:7187081 trangel ( 59525) count:64 ave_time:6181114 total_time:395591336 gibertif ( 69901) count:64 ave_time:285937 total_time:18300012 harklee ( 61092) count:63 ave_time:9742506 total_time:613777938 ygao ( 69865) count:63 ave_time:153749 total_time:9686229 yhuang ( 59011) count:63 ave_time:741637 total_time:46723173 vat ( 61855) count:63 ave_time:212218 total_time:13369766 yunhe ( 18456) count:61 ave_time:155157 total_time:9464596 byue ( 71176) count:60 ave_time:204478 total_time:12268706 jlestz ( 56372) count:59 ave_time:1947180 total_time:114883656 u10198 ( 10198) count:56 ave_time:7707897 total_time:431642270 tkoskela ( 68467) count:54 ave_time:163866 total_time:8848814 psteinbr ( 62610) count:54 ave_time:243474 total_time:13147643 aswingna ( 70318) count:53 ave_time:252489 total_time:13381957 djbard ( 61692) count:52 ave_time:416093 total_time:21636849 lpyu ( 47763) count:48 ave_time:329448 total_time:15813531 yy642 ( 70062) count:48 ave_time:4897489 total_time:235079483 tiros ( 71138) count:46 ave_time:1143526 total_time:52602239 jyc ( 57038) count:46 ave_time:1298059 total_time:59710741 u6338 ( 6338) count:45 ave_time:1122589 total_time:50516518 mgduda ( 63169) count:45 ave_time:767321 total_time:34529477 mwhite ( 31845) count:43 ave_time:1005575 total_time:43239734 ameisner ( 68391) count:43 ave_time:6295021 total_time:270685928 loryza ( 68452) count:41 ave_time:31752186 total_time:1301839656 mgervais ( 69893) count:41 ave_time:138412 total_time:5674926 ruchen ( 62289) count:40 ave_time:1563576 total_time:62543061 zxiao ( 70165) count:38 ave_time:85461 total_time:3247546 warndt ( 70744) count:37 ave_time:80793 total_time:2989368 cmelni ( 60881) count:35 ave_time:9861767 total_time:345161851 zz217 ( 32858) count:32 ave_time:80924 total_time:2589592 cgrambow ( 71102) count:32 ave_time:11449036 total_time:366369177 eriof ( 69343) count:30 ave_time:224081 total_time:6722430 pgautam ( 70609) count:30 ave_time:85035 total_time:2551063 emiliord ( 56872) count:29 ave_time:125148 total_time:3629312 scottcal ( 70511) count:28 ave_time:135297 total_time:3788336 gbassi ( 42023) count:28 ave_time:8344745 total_time:233652883 whitlocb ( 41892) count:26 ave_time:1844120 total_time:47947143 jcarron ( 68411) count:25 ave_time:249559 total_time:6238983 bnlcat ( 43966) count:24 ave_time:5053102 total_time:121274458 kenmc ( 59827) count:23 ave_time:127668 total_time:2936378 liling ( 33901) count:23 ave_time:176928 total_time:4069348 gghosh ( 45100) count:22 ave_time:75656 total_time:1664436 ajamal4 ( 71127) count:22 ave_time:223721 total_time:4921882 donghwa ( 59746) count:20 ave_time:94974 total_time:1899489 qpzhang ( 69031) count:20 ave_time:329291 total_time:6585839 cho ( 12842) count:18 ave_time:35107 total_time:631936 vfung ( 69515) count:17 ave_time:13437867 total_time:228443751 dsambit ( 70700) count:17 ave_time:250120 total_time:4252051 gull ( 54393) count:16 ave_time:4235792 total_time:67772672 jdonate ( 48896) count:16 ave_time:5999414 total_time:95990634 binchen ( 69475) count:16 ave_time:117590 total_time:1881440 oluk ( 53193) count:15 ave_time:55353 total_time:830298 zoglauer ( 61194) count:15 ave_time:2227478 total_time:33412183 aepolin ( 68665) count:15 ave_time:87909 total_time:1318639 wangsh ( 55213) count:14 ave_time:335806 total_time:4701295 dhossain ( 54800) count:14 ave_time:425296 total_time:5954151 aditik ( 60733) count:14 ave_time:47000 total_time:658003 yaowang ( 56583) count:14 ave_time:3440315 total_time:48164414 wwan ( 31570) count:14 ave_time:414765 total_time:5806715 peterm ( 33161) count:14 ave_time:3766890 total_time:52736468 lweston ( 61085) count:14 ave_time:366062 total_time:5124880 myeung ( 58737) count:13 ave_time:4227265 total_time:54954457 toshiyan ( 63150) count:13 ave_time:663640 total_time:8627320 mycoy ( 62389) count:13 ave_time:111195 total_time:1445535 lge ( 32490) count:13 ave_time:72239 total_time:939112 dbin ( 57774) count:12 ave_time:105912 total_time:1270947 gbisht ( 56407) count:12 ave_time:34114 total_time:409379 jpsauppe ( 52364) count:11 ave_time:166567 total_time:1832243 totoni ( 60898) count:11 ave_time:238961 total_time:2628572 zhuhua ( 70641) count:11 ave_time:114249 total_time:1256743 jobst ( 57803) count:11 ave_time:112420 total_time:1236624 chenb ( 68552) count:11 ave_time:3019377 total_time:33213150 fgygi ( 40699) count:11 ave_time:110804 total_time:1218853 mchalabi ( 69764) count:11 ave_time:237250 total_time:2609750 bigang ( 70359) count:10 ave_time:293309 total_time:2933098 cyang ( 15392) count:10 ave_time:93349 total_time:933491 ycao ( 54162) count:10 ave_time:181726 total_time:1817264 rthomas ( 32331) count:10 ave_time:4460958 total_time:44609583 mtnguyen ( 70592) count:9 ave_time:47680 total_time:429125 bhatele ( 52003) count:9 ave_time:246899 total_time:2222095 djacobse ( 54879) count:9 ave_time:102742 total_time:924685 galexv ( 62848) count:9 ave_time:449489 total_time:4045405 arjundey ( 56032) count:8 ave_time:216976 total_time:1735815 cslage ( 69803) count:8 ave_time:53253 total_time:426028 paul_m ( 61225) count:7 ave_time:58409 total_time:408868 cdiaz ( 58183) count:6 ave_time:479239 total_time:2875439 yshin ( 68699) count:6 ave_time:438357 total_time:2630144 arhoades ( 57963) count:6 ave_time:31425 total_time:188555 songzhou ( 68569) count:6 ave_time:125845 total_time:755073 sqwu ( 45413) count:6 ave_time:4805903 total_time:28835423 ifampiou ( 59399) count:6 ave_time:131266 total_time:787599 hfeng ( 60864) count:5 ave_time:688948 total_time:3444741 pboyle ( 13178) count:5 ave_time:11474517 total_time:57372585 nkern ( 70626) count:5 ave_time:500884 total_time:2504420 wangyu ( 49739) count:4 ave_time:274 total_time:1098 mtreagan ( 55441) count:4 ave_time:382552 total_time:1530209 jpritch ( 69097) count:4 ave_time:193164 total_time:772656 danielsf ( 58711) count:4 ave_time:259891 total_time:1039566 zrsun ( 63135) count:4 ave_time:224 total_time:896 mniesen ( 58371) count:4 ave_time:10098 total_time:40392 sivanr ( 55792) count:4 ave_time:548452 total_time:2193808 adjones ( 49217) count:4 ave_time:9857 total_time:39431 jl137 ( 70952) count:4 ave_time:46778 total_time:187113 aparnak2 ( 70948) count:4 ave_time:649 total_time:2597 ppesante ( 70035) count:4 ave_time:9318 total_time:37273 alexmir ( 59893) count:4 ave_time:436006 total_time:1744024 gxlin ( 44516) count:4 ave_time:6470 total_time:25880 drpahls ( 71049) count:4 ave_time:21796 total_time:87187 nox ( 70602) count:4 ave_time:225 total_time:900 bfaber ( 56450) count:4 ave_time:246248 total_time:984995 porazik ( 70982) count:3 ave_time:340621 total_time:1021864 askinner ( 59483) count:2 ave_time:173818 total_time:347637 thjoshi ( 63019) count:2 ave_time:244 total_time:489 ssn492 ( 68082) count:2 ave_time:96543 total_time:193086 aovsyann ( 69266) count:2 ave_time:10272 total_time:20545 kirbym ( 70593) count:2 ave_time:666518 total_time:1333037 adamvh ( 63294) count:2 ave_time:741546 total_time:1483092 bkolb ( 58267) count:2 ave_time:21798 total_time:43597 mana ( 60853) count:2 ave_time:98406 total_time:196813 weijie ( 68877) count:2 ave_time:564250 total_time:1128501 regan ( 16695) count:2 ave_time:239 total_time:479 nrcavana ( 63095) count:2 ave_time:10969 total_time:21939 lattanzi ( 56409) count:2 ave_time:1024422 total_time:2048845 greght ( 70769) count:2 ave_time:1184 total_time:2369 shn898 ( 69409) count:2 ave_time:7810 total_time:15621 agreiner ( 50931) count:2 ave_time:6199 total_time:12398 wbhimji ( 68441) count:2 ave_time:10901 total_time:21803 botana ( 59613) count:2 ave_time:7455 total_time:14910 mahines ( 70259) count:2 ave_time:247 total_time:494 dbrout ( 58732) count:1 ave_time:51771 total_time:51771 yiwang62 ( 45039) count:1 ave_time:528 total_time:528 bthakur ( 41337) count:1 ave_time:154 total_time:154 nid00837:/tmp # sorry, nevermind -- looks like this is the work of an abuser, not a return of the locking issue. someone starting THOUSANDS of job steps. Will find them and stop them Some background for the dwrest/gunicorn/nginx outage from earlier. Due to large datawarp (DW) allocation (~25TB) failing to teardown correctly the dwrest daemons were logging copious amounts of messages about trying to complete the teardown. We currently have 12 nodes running dwrest/gunicorn/nginx (soon to be reduced) which all right to the Cray persistent-shared /var disk partition. The shared-var partition ran out of space at ~3:40am, causing all dwrest/gunicorn daemons to terminate. The nginx daemons were still present on the nodes. dwstat functionality stopped (no daemons to respond), and I expect that Slurm's ability to get DW status also stopped in this time frame. After cleaning up the huge dwrest.log files (~5.5GB each) the dwrest/gunicorn services were restarted along with nginx (to be safe). At this point dwstat functionality was restored. |
Created attachment 2894 [details] backtrace of all threads Hello, This morning the DW system stopped giving good results to queries, possibly because of some issue around the dwrest system not working correctly -- unsure about that. Attaching Dave P to comment. In any case at 9:50AM slurmctld became unresponsive because it reached its max thread limit. I dumped a core file and restarted slurmcltd (with all partitions down). Thus, we are not starting new jobs right now while we investigate to understand if this will just happen again. I'm still working my way through the core file to try to determine what each of the threads might have been doing. I'm attaching the backtrace of all the threads here so you can see it. And will also attach a copy of the logs. -Doug