Ticket 2576

Summary: slurmctld locked up - max threads; possible deadlock for job completion when BB non functional
Product: Slurm Reporter: Doug Jacobsen <dmjacobsen>
Component: slurmctldAssignee: Danny Auble <da>
Status: RESOLVED FIXED QA Contact:
Severity: 3 - Medium Impact    
Priority: --- CC: dpaul, tinad
Version: 15.08.8   
Hardware: Cray XC   
OS: Linux   
Site: NERSC Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: 15.08.9 16.05.0-pre2 Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---
Attachments: backtrace of all threads
slurmctld log
another (larger) backtrace

Description Doug Jacobsen 2016-03-21 05:09:31 MDT
Created attachment 2894 [details]
backtrace of all threads

Hello,

This morning the DW system stopped giving good results to queries, possibly because of some issue around the dwrest system not working correctly -- unsure about that.  Attaching Dave P to comment.

In any case at 9:50AM slurmctld became unresponsive because it reached its max thread limit. I dumped a core file and restarted slurmcltd (with all partitions down).  Thus, we are not starting new jobs right now while we investigate to understand if this will just happen again.

I'm still working my way through the core file to try to determine what each of the threads might have been doing.

I'm attaching the backtrace of all the threads here so you can see it. And will also attach a copy of the logs.

-Doug
Comment 1 Doug Jacobsen 2016-03-21 05:21:16 MDT
Created attachment 2895 [details]
slurmctld log
Comment 2 Doug Jacobsen 2016-03-21 06:12:00 MDT
I believe that the dwrest service became unreliable as early as 3:30AM today based on some comments from Dave Paul.  He's attached and can give more details on that when he is available.

My read of the backtrace is that 30 threads were trying to get a read or write lock on the JOB_LOCK.  Some of these were BB threads.  Many others threads seemed to be throttled based on the existing count of RPCs of the same type present in the system.

My hypothesis is that when dwrest became unreliable BB jobs failed to teardown correctly, and I'm guessing the RPC stayed alive in the slurmctld memory.  At some point the number of jobs trying to complete exceeded the threshold for RPC throttling and then all completing jobs got stuck waiting on the deadlock to break.  Clearly this didn't happen.

If true, this suggests that SLURM is strongly dependent on DW tearing down, and thus that the dwrest service is operating correctly. Is that true?  Are there options for effectively dealing with this situation? (e.g., putting the dw commands into a work-queue for periodic inspection, rather than having an RPC stay alive?)

Thanks,
Doug
Comment 3 Danny Auble 2016-03-21 06:23:01 MDT
Doug, this is fixed in commit https://github.com/SchedMD/slurm/commit/dcfa6ec034ad4

It should happen fairly rarely though.  Thanks for the back trace it was very helpful finding the issue.

You should be able to restart the slurmctld and all should be good.

Please reopen if you notice this happen again after the patch.
Comment 4 Doug Jacobsen 2016-03-21 15:43:00 MDT
Hi Danny, 

I applied the patch, but we're still hitting max threads pretty consistently.  In fact right now slurmctld is crawling and we're at 513 pthreads (guessing not all of these correspond to slurm's idea of its legislated threads).

Attaching a backtrace for all 513 threads.  Will work on getting logs updated.

-Doug
Comment 5 Doug Jacobsen 2016-03-21 15:43:25 MDT
Created attachment 2897 [details]
another (larger) backtrace
Comment 6 Doug Jacobsen 2016-03-21 15:45:12 MDT
some output a few minutes later.  No child processes (e.g. xtcleanup_after or dw_wlm commands)

nid00837:/tmp # ps --ppid $(pgrep slurmctld)
  PID TTY          TIME CMD
nid00837:/tmp # sdiag
*******************************************************
sdiag output at Mon Mar 21 21:44:27 2016
Data since      Mon Mar 21 17:00:00 2016
*******************************************************
Server thread count: 209
Agent queue size:    0

Jobs submitted: 1651
Jobs started:   1439
Jobs completed: 1422
Jobs canceled:  111
Jobs failed:    0

Main schedule statistics (microseconds):
	Last cycle:   1816821
	Max cycle:    2000040
	Total cycles: 1629
	Mean cycle:   1485685
	Mean depth cycle:  400
	Cycles per minute: 5
	Last queue length: 3185

Backfilling stats
	Total backfilled jobs (since last slurm start): 1906
	Total backfilled jobs (since last stats cycle start): 1197
	Total cycles: 109
	Last cycle when: Mon Mar 21 21:42:58 2016
	Last cycle: 1176137
	Max cycle:  6588614
	Mean cycle: 2597680
	Last depth cycle: 3168
	Last depth cycle (try sched): 143
	Depth Mean: 2538
	Depth Mean (try depth): 164
	Last queue length: 3189
	Queue length mean: 3077

Remote Procedure Call statistics by message type
	REQUEST_JOB_STEP_CREATE                 ( 5001) count:66560  ave_time:5368977 total_time:357359170262
	REQUEST_JOB_ALLOCATION_INFO_LITE        ( 4016) count:30662  ave_time:24198760 total_time:741982400772
	REQUEST_CANCEL_JOB_STEP                 ( 5005) count:30393  ave_time:2430176 total_time:73860364348
	REQUEST_STEP_COMPLETE                   ( 5016) count:26932  ave_time:1754264 total_time:47245852420
	REQUEST_PARTITION_INFO                  ( 2009) count:20354  ave_time:2993   total_time:60921843
	REQUEST_COMPLETE_PROLOG                 ( 6018) count:14327  ave_time:491854 total_time:7046796518
	MESSAGE_EPILOG_COMPLETE                 ( 6012) count:13989  ave_time:526917 total_time:7371052923
	REQUEST_JOB_USER_INFO                   ( 2039) count:12692  ave_time:1050204 total_time:13329199978
	MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:12377  ave_time:206522 total_time:2556128584
	REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:6343   ave_time:1094288 total_time:6941070745
	REQUEST_JOB_INFO                        ( 2003) count:5712   ave_time:2209448 total_time:12620372632
	REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:3702   ave_time:371836 total_time:1376540279
	REQUEST_JOB_INFO_SINGLE                 ( 2021) count:2168   ave_time:1725353 total_time:3740566161
	REQUEST_PING                            ( 1008) count:647    ave_time:262    total_time:169725
	REQUEST_NODE_INFO                       ( 2007) count:546    ave_time:143091 total_time:78127793
	REQUEST_NODE_INFO_SINGLE                ( 2040) count:257    ave_time:169860 total_time:43654145
	REQUEST_UPDATE_NODE                     ( 3002) count:251    ave_time:38674  total_time:9707230
	REQUEST_KILL_JOB                        ( 5032) count:233    ave_time:387665 total_time:90325969
	REQUEST_JOB_READY                       ( 4019) count:175    ave_time:14612203 total_time:2557135581
	REQUEST_JOB_NOTIFY                      ( 4022) count:169    ave_time:19833  total_time:3351930
	REQUEST_BURST_BUFFER_INFO               ( 2037) count:97     ave_time:2688   total_time:260775
	REQUEST_RESOURCE_ALLOCATION             ( 4001) count:88     ave_time:1537222 total_time:135275613
	REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:73     ave_time:221638 total_time:16179630
	REQUEST_PRIORITY_FACTORS                ( 2026) count:43     ave_time:5833953 total_time:250860005
	REQUEST_UPDATE_JOB                      ( 3001) count:28     ave_time:503150 total_time:14088211
	REQUEST_JOB_SBCAST_CRED                 ( 4023) count:11     ave_time:17435  total_time:191795
	REQUEST_UPDATE_PARTITION                ( 3005) count:10     ave_time:5469   total_time:54694
	REQUEST_CREATE_RESERVATION              ( 3006) count:4      ave_time:271436 total_time:1085744
	REQUEST_RESERVATION_INFO                ( 2024) count:3      ave_time:199    total_time:598
	ACCOUNTING_UPDATE_MSG                   (10001) count:3      ave_time:970258 total_time:2910774
	REQUEST_STATS_INFO                      ( 2035) count:3      ave_time:176    total_time:528
	REQUEST_SHARE_INFO                      ( 2022) count:1      ave_time:8620   total_time:8620
	REQUEST_DELETE_RESERVATION              ( 3008) count:1      ave_time:39489  total_time:39489

Remote Procedure Call statistics by user

	root            (       0) count:73832  ave_time:906379 total_time:66919835928
	chhabra         (   70608) count:18797  ave_time:327674 total_time:6159293226
	aoliu           (   56679) count:8885   ave_time:732082 total_time:6504554785
	wyang           (   31382) count:4256   ave_time:874210 total_time:3720641525
	kononenk        (   48195) count:2731   ave_time:353837 total_time:966329931
	racah           (   61228) count:2566   ave_time:680061 total_time:1745038493
	gabinder        (   57927) count:1970   ave_time:306812 total_time:604421472
	pmlee           (   61777) count:1862   ave_time:101310 total_time:188639915
	sball           (   66005) count:1775   ave_time:3163801 total_time:5615747083
	yaolilbl        (   69607) count:1246   ave_time:197956 total_time:246653517
	bwoo            (   69657) count:967    ave_time:1336571 total_time:1292465115
	ptfproc         (   62098) count:927    ave_time:286530 total_time:265614068
	hou             (   51464) count:899    ave_time:63302  total_time:56908903
	gpzhang         (   41964) count:898    ave_time:1437999 total_time:1291323183
	polkituser      (     108) count:748    ave_time:2333949 total_time:1745793883
	rwelsch         (   69678) count:399    ave_time:68831  total_time:27463862
	gprodri         (   60805) count:398    ave_time:5021682 total_time:1998629505
	xuchen          (   54724) count:368    ave_time:197118 total_time:72539700
	jialin          (   58179) count:365    ave_time:84877  total_time:30980370
	pengcui         (   59255) count:326    ave_time:2161984 total_time:704807061
	zli             (   12843) count:298    ave_time:129391 total_time:38558592
	chunjing        (   46250) count:296    ave_time:1454093 total_time:430411723
	pccampb2        (   60631) count:266    ave_time:5515142 total_time:1467027975
	luisruiz        (   69207) count:263    ave_time:1903120 total_time:500520613
	sagiyama        (   58709) count:248    ave_time:1313515 total_time:325751736
	tabarnes        (   49295) count:218    ave_time:199611 total_time:43515411
	mlobet          (   70991) count:194    ave_time:276424 total_time:53626308
	nweeks          (   68921) count:192    ave_time:195082 total_time:37455838
	jiqiang         (   13500) count:184    ave_time:3544256 total_time:652143266
	fangyong        (   53958) count:181    ave_time:1753062 total_time:317304302
	operator        (   34510) count:181    ave_time:52546  total_time:9510833
	dmj             (   56094) count:174    ave_time:397903 total_time:69235293
	xiazhu          (   71334) count:171    ave_time:773954 total_time:132346247
	usgweb          (   33442) count:170    ave_time:3447555 total_time:586084382
	zarija          (   52379) count:154    ave_time:779228 total_time:120001122
	hwpeng          (   48169) count:151    ave_time:2767853 total_time:417945822
	yxl232          (   69375) count:148    ave_time:685545 total_time:101460746
	emmas           (   70248) count:143    ave_time:1033376 total_time:147772835
	sleak           (   70817) count:131    ave_time:4524048 total_time:592650360
	haobin          (   12588) count:131    ave_time:132749 total_time:17390135
	hshan           (   32726) count:129    ave_time:3035696 total_time:391604843
	yixia           (   55113) count:119    ave_time:6934798 total_time:825240989
	xiaobao         (   70051) count:116    ave_time:702947 total_time:81541906
	dfeldman        (   56837) count:115    ave_time:3377018 total_time:388357134
	startsev        (   16891) count:115    ave_time:5138783 total_time:590960121
	leinoaa         (   71110) count:115    ave_time:60011  total_time:6901309
	lsqphot         (   62521) count:109    ave_time:5076617 total_time:553351303
	jdeslip         (   34014) count:101    ave_time:4315867 total_time:435902592
	alsdata         (   57074) count:97     ave_time:140253 total_time:13604585
	loring          (   46040) count:95     ave_time:159724 total_time:15173838
	aspencer        (   45648) count:94     ave_time:273280 total_time:25688383
	preeti          (   61846) count:90     ave_time:152703 total_time:13743340
	pselv           (   69907) count:89     ave_time:7482772 total_time:665966772
	lpinol          (   71088) count:88     ave_time:92900  total_time:8175202
	truong          (   64901) count:88     ave_time:568261 total_time:50007040
	glock           (   69615) count:88     ave_time:14336  total_time:1261650
	asarje          (   54453) count:86     ave_time:127997 total_time:11007784
	ndk             (   31805) count:80     ave_time:1117608 total_time:89408659
	jzhou           (   64642) count:79     ave_time:1869041 total_time:147654245
	cwpark          (   57215) count:76     ave_time:6902578 total_time:524595991
	dpaul           (   15448) count:71     ave_time:363675 total_time:25820992
	monoue          (   56205) count:70     ave_time:126622 total_time:8863594
	cschreck        (   60571) count:70     ave_time:48541  total_time:3397908
	lonardon        (   52122) count:68     ave_time:2420672 total_time:164605746
	friesen         (   52244) count:66     ave_time:263566 total_time:17395370
	aike            (   34983) count:66     ave_time:189840 total_time:12529460
	bajdich         (   48661) count:66     ave_time:170350 total_time:11243116
	ddkang          (   31318) count:65     ave_time:12163004 total_time:790595304
	shlufl          (   71339) count:65     ave_time:110570 total_time:7187081
	trangel         (   59525) count:64     ave_time:6181114 total_time:395591336
	gibertif        (   69901) count:64     ave_time:285937 total_time:18300012
	harklee         (   61092) count:63     ave_time:9742506 total_time:613777938
	ygao            (   69865) count:63     ave_time:153749 total_time:9686229
	yhuang          (   59011) count:63     ave_time:741637 total_time:46723173
	vat             (   61855) count:63     ave_time:212218 total_time:13369766
	yunhe           (   18456) count:61     ave_time:155157 total_time:9464596
	byue            (   71176) count:60     ave_time:204478 total_time:12268706
	jlestz          (   56372) count:59     ave_time:1947180 total_time:114883656
	u10198          (   10198) count:56     ave_time:7707897 total_time:431642270
	tkoskela        (   68467) count:54     ave_time:163866 total_time:8848814
	psteinbr        (   62610) count:54     ave_time:243474 total_time:13147643
	aswingna        (   70318) count:53     ave_time:252489 total_time:13381957
	djbard          (   61692) count:52     ave_time:416093 total_time:21636849
	lpyu            (   47763) count:48     ave_time:329448 total_time:15813531
	yy642           (   70062) count:48     ave_time:4897489 total_time:235079483
	tiros           (   71138) count:46     ave_time:1143526 total_time:52602239
	jyc             (   57038) count:46     ave_time:1298059 total_time:59710741
	u6338           (    6338) count:45     ave_time:1122589 total_time:50516518
	mgduda          (   63169) count:45     ave_time:767321 total_time:34529477
	mwhite          (   31845) count:43     ave_time:1005575 total_time:43239734
	ameisner        (   68391) count:43     ave_time:6295021 total_time:270685928
	loryza          (   68452) count:41     ave_time:31752186 total_time:1301839656
	mgervais        (   69893) count:41     ave_time:138412 total_time:5674926
	ruchen          (   62289) count:40     ave_time:1563576 total_time:62543061
	zxiao           (   70165) count:38     ave_time:85461  total_time:3247546
	warndt          (   70744) count:37     ave_time:80793  total_time:2989368
	cmelni          (   60881) count:35     ave_time:9861767 total_time:345161851
	zz217           (   32858) count:32     ave_time:80924  total_time:2589592
	cgrambow        (   71102) count:32     ave_time:11449036 total_time:366369177
	eriof           (   69343) count:30     ave_time:224081 total_time:6722430
	pgautam         (   70609) count:30     ave_time:85035  total_time:2551063
	emiliord        (   56872) count:29     ave_time:125148 total_time:3629312
	scottcal        (   70511) count:28     ave_time:135297 total_time:3788336
	gbassi          (   42023) count:28     ave_time:8344745 total_time:233652883
	whitlocb        (   41892) count:26     ave_time:1844120 total_time:47947143
	jcarron         (   68411) count:25     ave_time:249559 total_time:6238983
	bnlcat          (   43966) count:24     ave_time:5053102 total_time:121274458
	kenmc           (   59827) count:23     ave_time:127668 total_time:2936378
	liling          (   33901) count:23     ave_time:176928 total_time:4069348
	gghosh          (   45100) count:22     ave_time:75656  total_time:1664436
	ajamal4         (   71127) count:22     ave_time:223721 total_time:4921882
	donghwa         (   59746) count:20     ave_time:94974  total_time:1899489
	qpzhang         (   69031) count:20     ave_time:329291 total_time:6585839
	cho             (   12842) count:18     ave_time:35107  total_time:631936
	vfung           (   69515) count:17     ave_time:13437867 total_time:228443751
	dsambit         (   70700) count:17     ave_time:250120 total_time:4252051
	gull            (   54393) count:16     ave_time:4235792 total_time:67772672
	jdonate         (   48896) count:16     ave_time:5999414 total_time:95990634
	binchen         (   69475) count:16     ave_time:117590 total_time:1881440
	oluk            (   53193) count:15     ave_time:55353  total_time:830298
	zoglauer        (   61194) count:15     ave_time:2227478 total_time:33412183
	aepolin         (   68665) count:15     ave_time:87909  total_time:1318639
	wangsh          (   55213) count:14     ave_time:335806 total_time:4701295
	dhossain        (   54800) count:14     ave_time:425296 total_time:5954151
	aditik          (   60733) count:14     ave_time:47000  total_time:658003
	yaowang         (   56583) count:14     ave_time:3440315 total_time:48164414
	wwan            (   31570) count:14     ave_time:414765 total_time:5806715
	peterm          (   33161) count:14     ave_time:3766890 total_time:52736468
	lweston         (   61085) count:14     ave_time:366062 total_time:5124880
	myeung          (   58737) count:13     ave_time:4227265 total_time:54954457
	toshiyan        (   63150) count:13     ave_time:663640 total_time:8627320
	mycoy           (   62389) count:13     ave_time:111195 total_time:1445535
	lge             (   32490) count:13     ave_time:72239  total_time:939112
	dbin            (   57774) count:12     ave_time:105912 total_time:1270947
	gbisht          (   56407) count:12     ave_time:34114  total_time:409379
	jpsauppe        (   52364) count:11     ave_time:166567 total_time:1832243
	totoni          (   60898) count:11     ave_time:238961 total_time:2628572
	zhuhua          (   70641) count:11     ave_time:114249 total_time:1256743
	jobst           (   57803) count:11     ave_time:112420 total_time:1236624
	chenb           (   68552) count:11     ave_time:3019377 total_time:33213150
	fgygi           (   40699) count:11     ave_time:110804 total_time:1218853
	mchalabi        (   69764) count:11     ave_time:237250 total_time:2609750
	bigang          (   70359) count:10     ave_time:293309 total_time:2933098
	cyang           (   15392) count:10     ave_time:93349  total_time:933491
	ycao            (   54162) count:10     ave_time:181726 total_time:1817264
	rthomas         (   32331) count:10     ave_time:4460958 total_time:44609583
	mtnguyen        (   70592) count:9      ave_time:47680  total_time:429125
	bhatele         (   52003) count:9      ave_time:246899 total_time:2222095
	djacobse        (   54879) count:9      ave_time:102742 total_time:924685
	galexv          (   62848) count:9      ave_time:449489 total_time:4045405
	arjundey        (   56032) count:8      ave_time:216976 total_time:1735815
	cslage          (   69803) count:8      ave_time:53253  total_time:426028
	paul_m          (   61225) count:7      ave_time:58409  total_time:408868
	cdiaz           (   58183) count:6      ave_time:479239 total_time:2875439
	yshin           (   68699) count:6      ave_time:438357 total_time:2630144
	arhoades        (   57963) count:6      ave_time:31425  total_time:188555
	songzhou        (   68569) count:6      ave_time:125845 total_time:755073
	sqwu            (   45413) count:6      ave_time:4805903 total_time:28835423
	ifampiou        (   59399) count:6      ave_time:131266 total_time:787599
	hfeng           (   60864) count:5      ave_time:688948 total_time:3444741
	pboyle          (   13178) count:5      ave_time:11474517 total_time:57372585
	nkern           (   70626) count:5      ave_time:500884 total_time:2504420
	wangyu          (   49739) count:4      ave_time:274    total_time:1098
	mtreagan        (   55441) count:4      ave_time:382552 total_time:1530209
	jpritch         (   69097) count:4      ave_time:193164 total_time:772656
	danielsf        (   58711) count:4      ave_time:259891 total_time:1039566
	zrsun           (   63135) count:4      ave_time:224    total_time:896
	mniesen         (   58371) count:4      ave_time:10098  total_time:40392
	sivanr          (   55792) count:4      ave_time:548452 total_time:2193808
	adjones         (   49217) count:4      ave_time:9857   total_time:39431
	jl137           (   70952) count:4      ave_time:46778  total_time:187113
	aparnak2        (   70948) count:4      ave_time:649    total_time:2597
	ppesante        (   70035) count:4      ave_time:9318   total_time:37273
	alexmir         (   59893) count:4      ave_time:436006 total_time:1744024
	gxlin           (   44516) count:4      ave_time:6470   total_time:25880
	drpahls         (   71049) count:4      ave_time:21796  total_time:87187
	nox             (   70602) count:4      ave_time:225    total_time:900
	bfaber          (   56450) count:4      ave_time:246248 total_time:984995
	porazik         (   70982) count:3      ave_time:340621 total_time:1021864
	askinner        (   59483) count:2      ave_time:173818 total_time:347637
	thjoshi         (   63019) count:2      ave_time:244    total_time:489
	ssn492          (   68082) count:2      ave_time:96543  total_time:193086
	aovsyann        (   69266) count:2      ave_time:10272  total_time:20545
	kirbym          (   70593) count:2      ave_time:666518 total_time:1333037
	adamvh          (   63294) count:2      ave_time:741546 total_time:1483092
	bkolb           (   58267) count:2      ave_time:21798  total_time:43597
	mana            (   60853) count:2      ave_time:98406  total_time:196813
	weijie          (   68877) count:2      ave_time:564250 total_time:1128501
	regan           (   16695) count:2      ave_time:239    total_time:479
	nrcavana        (   63095) count:2      ave_time:10969  total_time:21939
	lattanzi        (   56409) count:2      ave_time:1024422 total_time:2048845
	greght          (   70769) count:2      ave_time:1184   total_time:2369
	shn898          (   69409) count:2      ave_time:7810   total_time:15621
	agreiner        (   50931) count:2      ave_time:6199   total_time:12398
	wbhimji         (   68441) count:2      ave_time:10901  total_time:21803
	botana          (   59613) count:2      ave_time:7455   total_time:14910
	mahines         (   70259) count:2      ave_time:247    total_time:494
	dbrout          (   58732) count:1      ave_time:51771  total_time:51771
	yiwang62        (   45039) count:1      ave_time:528    total_time:528
	bthakur         (   41337) count:1      ave_time:154    total_time:154
nid00837:/tmp #
Comment 7 Doug Jacobsen 2016-03-21 15:51:00 MDT
sorry, nevermind -- looks like this is the work of an abuser, not a return of the locking issue.  someone starting THOUSANDS of job steps.

Will find them and stop them
Comment 8 David Paul 2016-03-21 16:03:56 MDT
Some background for the dwrest/gunicorn/nginx outage from earlier.

Due to large datawarp (DW) allocation (~25TB) failing to teardown correctly the dwrest daemons were logging copious amounts of messages about trying to complete the teardown.  We currently have 12 nodes running dwrest/gunicorn/nginx (soon to be reduced) which all right to the Cray persistent-shared /var disk partition.

The shared-var partition ran out of space at ~3:40am, causing all dwrest/gunicorn daemons to terminate. The nginx daemons were still present on the nodes. dwstat functionality stopped (no daemons to respond), and I expect that Slurm's ability to get DW status also stopped in this time frame.

After cleaning up the huge dwrest.log files (~5.5GB each) the dwrest/gunicorn services were restarted along with nginx (to be safe).

At this point dwstat functionality was restored.