Description
Paul Edmon
2020-05-01 18:33:35 MDT
Finally got an sdiag to return. Here is what it looks like:
[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Fri May 01 21:00:08 2020 (1588381208)
Data since Fri May 01 20:56:54 2020 (1588381014)
*******************************************************
Server thread count: 256
Agent queue size: 0
Agent count: 0
DBD Agent queue size: 4
Jobs submitted: 0
Jobs started: 0
Jobs completed: 46
Jobs canceled: 0
Jobs failed: 0
Job states ts: Fri May 01 20:59:50 2020 (1588381190)
Jobs pending: 41684
Jobs running: 3038
Main schedule statistics (microseconds):
Last cycle: 30326
Max cycle: 42371
Total cycles: 5
Mean cycle: 32196
Mean depth cycle: 0
Cycles per minute: 1
Last queue length: 0
Backfilling stats
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total backfilled heterogeneous job components: 0
Total cycles: 0
Last cycle when: Wed Dec 31 19:00:00 1969 (0)
Last cycle: 0
Max cycle: 0
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 28 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2119 ave_time:1403473 total_time:2973960568
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:333 ave_time:65274240 total_time:21736322001
REQUEST_STEP_COMPLETE ( 5016) count:294 ave_time:15850743 total_time:4660118512
MESSAGE_EPILOG_COMPLETE ( 6012) count:237 ave_time:6835661 total_time:1620051867
REQUEST_PARTITION_INFO ( 2009) count:154 ave_time:12923 total_time:1990195
REQUEST_JOB_INFO_SINGLE ( 2021) count:151 ave_time:3479016 total_time:525331456
REQUEST_NODE_INFO ( 2007) count:145 ave_time:118578 total_time:17193917
REQUEST_FED_INFO ( 2049) count:118 ave_time:180 total_time:21257
REQUEST_JOB_INFO ( 2003) count:104 ave_time:38759825 total_time:4031021849
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:12 ave_time:49075701 total_time:588908420
REQUEST_JOB_USER_INFO ( 2039) count:5 ave_time:13642526 total_time:68212631
REQUEST_SHARE_INFO ( 2022) count:4 ave_time:40979 total_time:163918
REQUEST_RESOURCE_ALLOCATION ( 4001) count:4 ave_time:20502351 total_time:82009404
REQUEST_UPDATE_JOB ( 3001) count:4 ave_time:4157839 total_time:16631356
REQUEST_STATS_INFO ( 2035) count:3 ave_time:3824 total_time:11473
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:2 ave_time:12488591 total_time:24977183
REQUEST_CANCEL_JOB_STEP ( 5005) count:1 ave_time:79234 total_time:79234
Remote Procedure Call statistics by user
root ( 0) count:3283 ave_time:10681802 total_time:35068357100
sgprasath ( 21136) count:53 ave_time:652570 total_time:34586236
dingma ( 40306) count:47 ave_time:367572 total_time:17275926
kholton ( 20682) count:22 ave_time:15233 total_time:335131
simonx ( 12821) count:20 ave_time:865928 total_time:17318569
mengzhang ( 12484) count:20 ave_time:847308 total_time:16946176
acoran ( 22029) count:19 ave_time:910703 total_time:17303371
gtorri ( 57089) count:19 ave_time:929695 total_time:17664212
ballen ( 5899723) count:15 ave_time:2306049 total_time:34590743
jingnandu ( 20922) count:11 ave_time:1541796 total_time:16959758
mpolansky ( 13716) count:10 ave_time:4227 total_time:42276
u_219006_g_65249( 6000076) count:10 ave_time:1712869 total_time:17128692
jseeley ( 14063) count:10 ave_time:3362528 total_time:33625287
bostdiek ( 20817) count:10 ave_time:3388593 total_time:33885933
jaehwan ( 555445) count:9 ave_time:1866485 total_time:16798366
nthor ( 20593) count:9 ave_time:1913447 total_time:17221023
sychen ( 13448) count:9 ave_time:1879294 total_time:16913648
ncolombi ( 21080) count:9 ave_time:3924203 total_time:35317829
agomezlievano ( 12518) count:8 ave_time:30274 total_time:242198
jyang12 ( 12394) count:8 ave_time:2121857 total_time:16974858
zmorris ( 558739) count:8 ave_time:2107202 total_time:16857619
fseixas ( 14101) count:7 ave_time:2421317 total_time:16949222
atrevino ( 10871) count:7 ave_time:21673 total_time:151717
ajcukierman ( 12585) count:7 ave_time:2433861 total_time:17037030
ctedijanto ( 21988) count:7 ave_time:2931 total_time:20517
chsu ( 20796) count:7 ave_time:2412941 total_time:16890587
rmattheis ( 13299) count:6 ave_time:2889130 total_time:17334785
lsepulvedaduran ( 10976) count:6 ave_time:6096 total_time:36578
dkakkar ( 20839) count:5 ave_time:19894847 total_time:99474237
tfranzmeyer ( 20760) count:4 ave_time:56098417 total_time:224393671
pacargile ( 556471) count:4 ave_time:4157839 total_time:16631356
daldarondo ( 13304) count:3 ave_time:26914508 total_time:80743525
lshen ( 58176) count:3 ave_time:38485790 total_time:115457370
pryke ( 40838) count:3 ave_time:1947 total_time:5843
qxz125 ( 556345) count:2 ave_time:49872903 total_time:99745806
atsang ( 14422) count:2 ave_time:42334 total_time:84668
bmatejek ( 5900266) count:2 ave_time:37281327 total_time:74562655
cclapp ( 57329) count:2 ave_time:19433481 total_time:38866963
fpinter ( 21371) count:1 ave_time:25116489 total_time:25116489
ruzhang ( 558295) count:1 ave_time:16089529 total_time:16089529
e88bu2009 ( 21880) count:1 ave_time:24971749 total_time:24971749
jang ( 12650) count:1 ave_time:16095993 total_time:16095993
Pending RPC statistics
No pending RPCs
Created attachment 14083 [details]
Current slurm.conf
Please increase your logging level:

> scontrol setdebug debug3
> scontrol setdebugflags +agent

Please upload new logs after 15 min and revert:

> scontrol setdebug info
> scontrol setdebugflags -agent

Thanks.

I've managed to stabilize things, and right now I'm testing out individual partitions and jobs to see what is causing the storm. I will enable the debug settings when I have a moment.

My suspicion is that a user (or more likely several users) submitted thousands of jobs to multiple partitions that all failed immediately, causing the completion storm. Is there a way for Slurm to automatically put users in a penalty box if they have job failures? (i.e. all of their jobs in the pending queue are held until the admin releases them?)

-Paul Edmon-

(In reply to Paul Edmon from comment #4)
> My suspicion is that a user (or more likely several users) submitted thousands
> of jobs to multiple partitions that all failed immediately, causing the
> completion storm. Is there a way for Slurm to automatically put users
> in a penalty box if they have job failures? (i.e. all of their jobs in the
> pending queue are held until the admin releases them?)

The job submit throttle plugin is the closest to this request I'm aware of:
https://github.com/SchedMD/slurm/blob/master/src/plugins/job_submit/throttle/job_submit_throttle.c

Yeah, this feature would need to apply when the job ran, not at submission time. It would be handy, as it would prevent completion storms like this one: if a user submits thousands of jobs, or a job array of a thousand tasks, and the first fails, odds are all or most of them will. So it would be great to be able to put a hold on the rest of that user's jobs.

I've put the additional debug flags on and I will let them run for another 15 min before turning them off. I will then send you the log. I only have a few partitions running out of the 140 that we have configured, as I'm trying to see if there are any problematic jobs. Already found one user.

-Paul Edmon-
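For context on the plugin linked above: enabling it is a slurm.conf change. A minimal sketch (the limit value here is arbitrary, and jobs_per_user_per_hour is the SchedulerParameters option the throttle plugin reads; it would need to be merged with the site's existing SchedulerParameters):

# Hypothetical slurm.conf excerpt: load the throttle job_submit plugin
# and cap each user's submission rate (the 100/hour value is arbitrary).
JobSubmitPlugins=throttle
SchedulerParameters=jobs_per_user_per_hour=100

Note this throttles at submission time only, which is exactly the limitation discussed above: it would not hold a user's already-pending jobs after run-time failures.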
Created attachment 14084 [details]
slurmctld log
The period with higher debugging is from 21:30-21:45
Some additional forensic work. I've tracked down 4 users thus far whose jobs exited basically immediately, either due to failure, OOM, or because the job simply ran instantly and completed. So I think what we ended up with was a perfect completion storm: a ton of jobs continually thrashing the scheduler, as the scheduler would throw them in, they would exit immediately, and the resulting room would let the scheduler throw in more, which would also exit.

(In reply to Paul Edmon from comment #8)
> So I think what we ended up with
> was a perfect completion storm of a ton of jobs just continually
> thrashing the scheduler as the scheduler would throw them in and then
> they would exit immediately, creating room for the scheduler to throw
> more in and then exit.

Is the cluster currently working?

In fits and starts. I'm canceling the culprits' jobs, which is wedging the scheduler due to the completion storm of the cancellation... I should have used a bash loop instead of a bulk scancel. If I can clear the backlog of RPCs, things seem to work as expected. It's just a matter of getting off the mountain. That said, I don't have it working at full capacity yet, where everything is back to where it was prior to the RPC storm.

-Paul Edmon-

Please also take a core of slurmctld:

> pgrep slurmctld | xargs -i gcore {}

I assume there is only 1 slurmctld running, but this will grab any. Please then provide the following from gdb loaded with the core:

> set pagination off
> set print pretty on
> t a a bt full

Please save the core for later; we may ask for more info from gdb. Please also attach the whole slurmctld log during the incident.
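For the backtrace step, the same gdb commands can also be run non-interactively against the saved core. A sketch (the slurmctld binary path and core file name are assumptions; gcore writes core.<pid> by default, and "t a a bt full" is short for "thread apply all bt full"):

# Hypothetical one-shot capture of a full backtrace from the core.
gdb -batch \
    -ex 'set pagination off' \
    -ex 'set print pretty on' \
    -ex 'thread apply all bt full' \
    /usr/sbin/slurmctld core.12345 > slurmctld-bt.txt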
(In reply to Paul Edmon from comment #10)
> If I can clear the backlog of RPCs, things seem to work as expected.
> It's just a matter of getting off the mountain. That said, I don't
> have it working at full capacity yet, where everything is back to
> where it was prior to the RPC storm.

Please add these to SchedulerParameters in slurm.conf:

> bf_max_job_start=100 #(update instead of add)
> bf_max_job_user=50
> bf_max_job_user_part=10
> sched_max_job_start=100

This should stop the scheduler from launching so many jobs, to give it a chance to recover from the rush of job completes. Please upload new slurmctld logs and sdiag after the change.
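Since SchedulerParameters is a single comma-separated slurm.conf option, the values above would be folded into the existing line rather than added as separate lines. A sketch (<existing options> stands for whatever is already configured at the site):

# Hypothetical merged SchedulerParameters line.
SchedulerParameters=<existing options>,bf_max_job_start=100,bf_max_job_user=50,bf_max_job_user_part=10,sched_max_job_start=100

An "scontrol reconfigure" should then pick up the change without a full slurmctld restart.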
Created attachment 14085 [details]
backtrace for core dump of slurmctld
Added those updates. I will get you the sdiag once things settle down.

-Paul Edmon-

Created attachment 14086 [details]
full day slurmctld log
Incident starts around 12:30
Here is the current sdiag; mind you, I'm in an RPC storm right now due to my bulk canceling of a user's jobs:

[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Fri May 01 22:40:24 2020 (1588387224)
Data since Fri May 01 22:35:11 2020 (1588386911)
*******************************************************
Server thread count: 256
Agent queue size: 14
Agent count: 0
DBD Agent queue size: 2
Jobs submitted: 0
Jobs started: 0
Jobs completed: 54
Jobs canceled: 0
Jobs failed: 0
Job states ts: Fri May 01 22:39:55 2020 (1588387195)
Jobs pending: 28879
Jobs running: 3779
Main schedule statistics (microseconds):
Last cycle: 19297
Max cycle: 32428
Total cycles: 4
Mean cycle: 25048
Mean depth cycle: 0
Cycles per minute: 0
Last queue length: 0
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total backfilled heterogeneous job components: 0
Total cycles: 0
Last cycle when: Wed Dec 31 19:00:00 1969 (0)
Last cycle: 0
Max cycle: 0
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 29 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:3052 ave_time:1136570 total_time:3468814243
REQUEST_STEP_COMPLETE ( 5016) count:586 ave_time:110719367 total_time:64881549280
REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:273 ave_time:1181923 total_time:322665212
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:174 ave_time:15460826 total_time:2690183831
REQUEST_NODE_INFO ( 2007) count:119 ave_time:130587 total_time:15539894
REQUEST_JOB_STEP_CREATE ( 5001) count:116 ave_time:7865802 total_time:912433049
REQUEST_FED_INFO ( 2049) count:111 ave_time:494 total_time:54878
REQUEST_JOB_INFO_SINGLE ( 2021) count:99 ave_time:1159243 total_time:114765057
MESSAGE_EPILOG_COMPLETE ( 6012) count:76 ave_time:2672367 total_time:203099956
REQUEST_PARTITION_INFO ( 2009) count:47 ave_time:7680 total_time:360978
REQUEST_JOB_INFO ( 2003) count:31 ave_time:27205848 total_time:843381313
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:10 ave_time:15663028 total_time:156630284
REQUEST_SHARE_INFO ( 2022) count:3 ave_time:38724 total_time:116174
REQUEST_STATS_INFO ( 2035) count:3 ave_time:253 total_time:761
REQUEST_JOB_USER_INFO ( 2039) count:2 ave_time:1895804 total_time:3791609
REQUEST_RESOURCE_ALLOCATION ( 4001) count:1 ave_time:296898 total_time:296898
REQUEST_KILL_JOB ( 5032) count:1 ave_time:27150565 total_time:27150565
Remote Procedure Call statistics by user
root ( 0) count:4056 ave_time:17769687 total_time:72073854002
schlafly ( 35075) count:389 ave_time:3175059 total_time:1235098261
kholton ( 20682) count:19 ave_time:392498 total_time:7457479
acoran ( 22029) count:17 ave_time:6340 total_time:107794
sgprasath ( 21136) count:17 ave_time:1649 total_time:28033
dingma ( 40306) count:16 ave_time:371161 total_time:5938588
dkakkar ( 20839) count:15 ave_time:606991 total_time:9104868
simonx ( 12821) count:15 ave_time:513206 total_time:7698100
gtorri ( 57089) count:14 ave_time:266562 total_time:3731880
ballen ( 5899723) count:13 ave_time:525850 total_time:6836052
mengzhang ( 12484) count:13 ave_time:40928 total_time:532074
jingnandu ( 20922) count:11 ave_time:633657 total_time:6970235
nthor ( 20593) count:10 ave_time:120426 total_time:1204269
ajcukierman ( 12585) count:9 ave_time:495968 total_time:4463715
jfroberg ( 13751) count:9 ave_time:649581 total_time:5846232
jaehwan ( 555445) count:9 ave_time:152185 total_time:1369671
jyang12 ( 12394) count:9 ave_time:725685 total_time:6531172
mpolansky ( 13716) count:8 ave_time:556519 total_time:4452155
chsu ( 20796) count:7 ave_time:797738 total_time:5584171
sychen ( 13448) count:7 ave_time:649582 total_time:4547080
bostdiek ( 20817) count:7 ave_time:473564 total_time:3314949
atrevino ( 10871) count:6 ave_time:198376 total_time:1190259
rmattheis ( 13299) count:6 ave_time:506129 total_time:3036775
jseeley ( 14063) count:6 ave_time:19659 total_time:117956
tfranzmeyer ( 20760) count:3 ave_time:17658674 total_time:52976022
bmatejek ( 5900266) count:2 ave_time:24987538 total_time:49975076
jeastman ( 5899955) count:2 ave_time:28925574 total_time:57851148
daldarondo ( 13304) count:2 ave_time:12527300 total_time:25054600
jtaylor ( 558629) count:1 ave_time:331058 total_time:331058
benmergui ( 99690) count:1 ave_time:296898 total_time:296898
jianxiangqiu ( 22008) count:1 ave_time:27123857 total_time:27123857
e88bu2046 ( 21907) count:1 ave_time:353497 total_time:353497
jpegues ( 21791) count:1 ave_time:27150565 total_time:27150565
ehand ( 14058) count:1 ave_time:670431 total_time:670431
pkrastev ( 56139) count:1 ave_time:35060 total_time:35060
xlu123 ( 10076) count:0 ave_time:0 total_time:0
ghzhang17 ( 13095) count:0 ave_time:0 total_time:0
rweinberger ( 13263) count:0 ave_time:0 total_time:0
Pending RPC statistics
REQUEST_TERMINATE_JOB ( 6011) count:14
Pending RPCs
1: REQUEST_TERMINATE_JOB holy7c08102
2: REQUEST_TERMINATE_JOB holy7c06405
3: REQUEST_TERMINATE_JOB holy7c06405
4: REQUEST_TERMINATE_JOB holy7c06405
5: REQUEST_TERMINATE_JOB shakgpu41
6: REQUEST_TERMINATE_JOB shakgpu41
7: REQUEST_TERMINATE_JOB seasmicro23
8: REQUEST_TERMINATE_JOB seasmicro23
9: REQUEST_TERMINATE_JOB holy2c01316
10: REQUEST_TERMINATE_JOB holy2c01316
11: REQUEST_TERMINATE_JOB holy2c01316
12: REQUEST_TERMINATE_JOB holy2c01316
13: REQUEST_TERMINATE_JOB holy2c01316
14: REQUEST_TERMINATE_JOB holy2c01316

-Paul Edmon-
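On the bulk-cancel point: one way to avoid this kind of RPC storm is to cancel jobs one at a time with a pause between calls, rather than in a single bulk scancel. A rough sketch (the username and sleep interval are placeholders):

# Cancel a user's pending jobs one by one so the controller is not
# hit with a single huge burst of terminate RPCs.
for j in $(squeue -h -u baduser -t PENDING -o %A); do
    scancel "$j"
    sleep 1
done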
When I restart it looks like this:

May 1 22:43:00 holy-slurm02 slurmctld[44855]: _sync_nodes_to_comp_job: JobId=54737409_14(54737434) in completing state
May 1 22:43:00 holy-slurm02 slurmctld[44855]: select/cons_tres: rm_job_res: plugin still initializing
May 1 22:43:00 holy-slurm02 slurmctld[44855]: _sync_nodes_to_comp_job: JobId=54737409_15(54737435) in completing state
May 1 22:43:00 holy-slurm02 slurmctld[44855]: select/cons_tres: rm_job_res: plugin still initializing
May 1 22:43:00 holy-slurm02 slurmctld[44855]: _sync_nodes_to_comp_job: JobId=54737409_16(54737436) in completing state
May 1 22:43:00 holy-slurm02 slurmctld[44855]: select/cons_tres: rm_job_res: plugin still initializing
May 1 22:43:00 holy-slurm02 slurmctld[44855]: _sync_nodes_to_comp_job: JobId=54737409_17(54737437) in completing state
May 1 22:43:00 holy-slurm02 slurmctld[44855]: select/cons_tres: rm_job_res: plugin still initializing
May 1 22:43:00 holy-slurm02 slurmctld[44855]: _sync_nodes_to_comp_job: JobId=54737409_18(54737438) in completing state
May 1 22:43:00 holy-slurm02 slurmctld[44855]: select/cons_tres: rm_job_res: plugin still initializing
May 1 22:43:00 holy-slurm02 slurmctld[44855]: _sync_nodes_to_comp_job: completing 321 jobs
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Warning: Note very large processing time from load_part_uid_allow_list: usec=13362758 began=22:43:00.167
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Recovered state of reservation doshi-velez
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Recovered state of reservation doshi-velez-gpu
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Recovered state of 2 reservations
May 1 22:43:13 holy-slurm02 slurmctld[44855]: State of 0 triggers recovered
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _preserve_plugins: backup_controller not specified
May 1 22:43:13 holy-slurm02 slurmctld[44855]: select/cons_tres: select_p_reconfigure
May 1 22:43:13 holy-slurm02 slurmctld[44855]: select/cons_tres: select_p_node_init
May 1 22:43:13 holy-slurm02 slurmctld[44855]: select/cons_tres: preparing for 145 partitions
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Warning: Note very large processing time from read_slurm_conf: usec=36807998 began=22:42:36.828
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Running as primary controller
May 1 22:43:13 holy-slurm02 slurmctld[44855]: Registering slurmctld at port 6820 with slurmdbd.
May 1 22:43:13 holy-slurm02 slurmctld[44855]: retry_list retry_list_size:321 msg_type=REQUEST_TERMINATE_JOB,REQUEST_TERMINATE_JOB,REQUEST_TERMINATE_JOB,REQUEST_TERMINATE_JOB,REQUEST_TERMINATE_JOB
May 1 22:43:13 holy-slurm02 slurmctld[44855]: mcs: MCSParameters = ondemand,ondemandselect,privatedata. ondemand set.
May 1 22:43:13 holy-slurm02 slurmctld[44855]: step_partial_comp: JobId=54480668 StepID=110 invalid
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _job_complete: JobId=54733400 WEXITSTATUS 0
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _job_complete: JobId=54733400 done
May 1 22:43:13 holy-slurm02 slurmctld[44855]: step_partial_comp: JobId=54708002_348(54736510) StepID=4294967295 invalid
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _job_complete: JobId=54586861 WEXITSTATUS 0
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _job_complete: JobId=54586861 done
May 1 22:43:13 holy-slurm02 slurmctld[44855]: cleanup_completing: JobId=54586418 completion process took 1757 seconds
May 1 22:43:13 holy-slurm02 slurmctld[44855]: step_partial_comp: JobId=54733398 StepID=4294967295 invalid
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _job_complete: JobId=54586592 WEXITSTATUS 0
May 1 22:43:13 holy-slurm02 slurmctld[44855]: _job_complete: JobId=54586592 done
May 1 22:43:14 holy-slurm02 slurmctld[44855]: server_thread_count over limit (256), waiting
May 1 22:43:14 holy-slurm02 slurmctld[44855]: step_partial_comp: JobId=54480668 StepID=326 invalid
May 1 22:43:14 holy-slurm02 slurmctld[44855]: error: _slurm_rpc_complete_batch_script: Could not find batch step for JobId=54552845, this should never happen
May 1 22:43:14 holy-slurm02 slurmctld[44855]: step_partial_comp: JobId=54648048_25(54729445) StepID=4294967295 invalid

-Paul Edmon-

Please upload a new log in 10 minutes. I want to see if the agent queue will clear up. Please also upload slurmd log from holy7c08102.

Created attachment 14087 [details]
Log for slurmd on holy7c08102
Created attachment 14088 [details]
slurmctld log for 2200
Looks like things have settled down. I'm going to try reopening single partitions now to see if I see any other problematic jobs:

[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Fri May 01 23:07:35 2020 (1588388855)
Data since Fri May 01 23:05:34 2020 (1588388734)
*******************************************************
Server thread count: 16
Agent queue size: 0
Agent count: 1
DBD Agent queue size: 0
Jobs submitted: 0
Jobs started: 0
Jobs completed: 44
Jobs canceled: 91
Jobs failed: 0
Job states ts: Fri May 01 23:07:22 2020 (1588388842)
Jobs pending: 28879
Jobs running: 3416
Main schedule statistics (microseconds):
Last cycle: 24080
Max cycle: 29775
Total cycles: 5
Mean cycle: 24481
Mean depth cycle: 0
Cycles per minute: 2
Last queue length: 0
Backfilling stats
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total backfilled heterogeneous job components: 0
Total cycles: 0
Last cycle when: Wed Dec 31 19:00:00 1969 (0)
Last cycle: 0
Max cycle: 0
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 28 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2151 ave_time:452820 total_time:974017819
REQUEST_PARTITION_INFO ( 2009) count:480 ave_time:13369 total_time:6417400
REQUEST_JOB_INFO_SINGLE ( 2021) count:474 ave_time:213231 total_time:101071951
REQUEST_JOB_STEP_CREATE ( 5001) count:471 ave_time:7833530 total_time:3689592669
REQUEST_STEP_COMPLETE ( 5016) count:465 ave_time:1292113 total_time:600832713
REQUEST_FED_INFO ( 2049) count:444 ave_time:259 total_time:115060
MESSAGE_EPILOG_COMPLETE ( 6012) count:392 ave_time:7299999 total_time:2861599919
REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:247 ave_time:74224 total_time:18333474
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:140 ave_time:16196175 total_time:2267464639
REQUEST_KILL_JOB ( 5032) count:91 ave_time:183127 total_time:16664643
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:47 ave_time:5436944 total_time:255536411
REQUEST_STATS_INFO ( 2035) count:17 ave_time:346 total_time:5890
REQUEST_JOB_USER_INFO ( 2039) count:9 ave_time:141924 total_time:1277320
REQUEST_NODE_INFO ( 2007) count:7 ave_time:157950 total_time:1105653
REQUEST_JOB_INFO ( 2003) count:4 ave_time:18466464 total_time:73865858
REQUEST_SHARE_INFO ( 2022) count:3 ave_time:41067 total_time:123203
REQUEST_RESOURCE_ALLOCATION ( 4001) count:1 ave_time:20086781 total_time:20086781
Remote Procedure Call statistics by user
root ( 0) count:3293 ave_time:2063940 total_time:6796556091
schlafly ( 35075) count:718 ave_time:5164242 total_time:3707926143
dingma ( 40306) count:208 ave_time:7542 total_time:1568761
sgprasath ( 21136) count:195 ave_time:4926 total_time:960580
kholton ( 20682) count:89 ave_time:9910 total_time:881997
mengzhang ( 12484) count:88 ave_time:18747 total_time:1649758
ballen ( 5899723) count:87 ave_time:5180 total_time:450718
gtorri ( 57089) count:87 ave_time:9662 total_time:840594
simonx ( 12821) count:84 ave_time:10278 total_time:863372
acoran ( 22029) count:74 ave_time:18433 total_time:1364108
jingnandu ( 20922) count:46 ave_time:11661 total_time:536433
nthor ( 20593) count:41 ave_time:35060 total_time:1437465
jaehwan ( 555445) count:41 ave_time:15745 total_time:645555
sychen ( 13448) count:41 ave_time:5565 total_time:228190
rmattheis ( 13299) count:40 ave_time:34110 total_time:1364406
bostdiek ( 20817) count:40 ave_time:5441 total_time:217679
jyang12 ( 12394) count:40 ave_time:5885 total_time:235401
chsu ( 20796) count:40 ave_time:3056 total_time:122276
atrevino ( 10871) count:38 ave_time:19668 total_time:747412
ajcukierman ( 12585) count:38 ave_time:5325 total_time:202356
mpolansky ( 13716) count:35 ave_time:5377 total_time:188205
tfranzmeyer ( 20760) count:29 ave_time:5591371 total_time:162149786
bmatejek ( 5900266) count:16 ave_time:4812152 total_time:76994438
jseeley ( 14063) count:14 ave_time:288302 total_time:4036233
daldarondo ( 13304) count:10 ave_time:8897164 total_time:88971640
dchan ( 558943) count:6 ave_time:79832 total_time:478992
jfroberg ( 13751) count:2 ave_time:6923 total_time:13846
ruzhang ( 558295) count:1 ave_time:20086781 total_time:20086781
yaowang ( 11810) count:1 ave_time:16115066 total_time:16115066
froy ( 12706) count:1 ave_time:277121 total_time:277121
jpegues ( 21791) count:0 ave_time:0 total_time:0
Pending RPC statistics
No pending RPCs

-Paul Edmon-

Here is with 3 partitions open:

[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Fri May 01 23:18:31 2020 (1588389511)
Data since Fri May 01 23:05:34 2020 (1588388734)
*******************************************************
Server thread count: 248
Agent queue size: 0
Agent count: 8
DBD Agent queue size: 4
Jobs submitted: 108
Jobs started: 418
Jobs completed: 155
Jobs canceled: 91
Jobs failed: 0
Job states ts: Fri May 01 23:18:06 2020 (1588389486)
Jobs pending: 28578
Jobs running: 3709
Main schedule statistics (microseconds):
Last cycle: 215507
Max cycle: 300186
Total cycles: 26
Mean cycle: 111743
Mean depth cycle: 12
Cycles per minute: 2
Last queue length: 3404
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 274
Total backfilled jobs (since last stats cycle start): 274
Total backfilled heterogeneous job components: 0
Total cycles: 7
Last cycle when: Fri May 01 23:18:04 2020 (1588389484)
Last cycle: 27759586
Max cycle: 88216689
Mean cycle: 32782808
Last depth cycle: 2629
Last depth cycle (try sched): 52
Depth Mean: 2471
Depth Mean (try depth): 67
Last queue length: 3422
Queue length mean: 3437
Latency for 1000 calls to gettimeofday(): 28 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2187 ave_time:453748 total_time:992347916
REQUEST_JOB_STEP_CREATE ( 5001) count:2147 ave_time:25123331 total_time:53939793028
REQUEST_PARTITION_INFO ( 2009) count:1870 ave_time:100690 total_time:188290996
REQUEST_FED_INFO ( 2049) count:1712 ave_time:273 total_time:468316
REQUEST_JOB_INFO_SINGLE ( 2021) count:1707 ave_time:514885 total_time:878909349
REQUEST_STEP_COMPLETE ( 5016) count:597 ave_time:1992293 total_time:1189399345
MESSAGE_EPILOG_COMPLETE ( 6012) count:516 ave_time:7081250 total_time:3653925371
REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:471 ave_time:103458 total_time:48729023
REQUEST_COMPLETE_PROLOG ( 6018) count:462 ave_time:6962299 total_time:3216582553
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:253 ave_time:15511429 total_time:3924391777
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:239 ave_time:9470308 total_time:2263403822
REQUEST_KILL_JOB ( 5032) count:91 ave_time:183127 total_time:16664643
REQUEST_NODE_INFO_SINGLE ( 2040) count:87 ave_time:120367 total_time:10471935
REQUEST_NODE_INFO ( 2007) count:48 ave_time:173608 total_time:8333203
REQUEST_STATS_INFO ( 2035) count:38 ave_time:362 total_time:13777
REQUEST_JOB_USER_INFO ( 2039) count:37 ave_time:528677 total_time:19561062
REQUEST_JOB_INFO ( 2003) count:20 ave_time:18827043 total_time:376540868
REQUEST_SHARE_INFO ( 2022) count:14 ave_time:45118 total_time:631662
REQUEST_UPDATE_NODE ( 3002) count:8 ave_time:2739537 total_time:21916302
REQUEST_RESOURCE_ALLOCATION ( 4001) count:4 ave_time:16192802 total_time:64771211
REQUEST_UPDATE_PARTITION ( 3005) count:3 ave_time:22761824 total_time:68285474
REQUEST_JOB_READY ( 4019) count:2 ave_time:269 total_time:539
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:1 ave_time:939 total_time:939
Remote Procedure Call statistics by user
root ( 0) count:4538 ave_time:2999670 total_time:13612506105
schlafly ( 35075) count:2575 ave_time:20858703 total_time:53711161254
dingma ( 40306) count:577 ave_time:51809 total_time:29893957
sgprasath ( 21136) count:561 ave_time:6678 total_time:3746765
ballen ( 5899723) count:349 ave_time:163438 total_time:57039982
kholton ( 20682) count:346 ave_time:75262 total_time:26040928
mengzhang ( 12484) count:346 ave_time:77776 total_time:26910505
simonx ( 12821) count:336 ave_time:18391 total_time:6179416
acoran ( 22029) count:334 ave_time:84926 total_time:28365423
gtorri ( 57089) count:327 ave_time:72709 total_time:23775844
jingnandu ( 20922) count:183 ave_time:111679 total_time:20437382
jyang12 ( 12394) count:175 ave_time:131350 total_time:22986272
nthor ( 20593) count:174 ave_time:172635 total_time:30038528
bostdiek ( 20817) count:174 ave_time:130671 total_time:22736889
sychen ( 13448) count:172 ave_time:162721 total_time:27988049
chsu ( 20796) count:172 ave_time:141293 total_time:24302532
jaehwan ( 555445) count:170 ave_time:129712 total_time:22051120
atrevino ( 10871) count:166 ave_time:25389 total_time:4214732
ajcukierman ( 12585) count:160 ave_time:192365 total_time:30778496
mpolansky ( 13716) count:151 ave_time:145428 total_time:21959765
rmattheis ( 13299) count:106 ave_time:70802 total_time:7505094
tfranzmeyer ( 20760) count:95 ave_time:11302072 total_time:1073696895
lshen ( 58176) count:57 ave_time:1756328 total_time:100110742
kbystrom ( 20619) count:54 ave_time:622443 total_time:33611923
bmatejek ( 5900266) count:51 ave_time:7631683 total_time:389215875
daldarondo ( 13304) count:48 ave_time:9188387 total_time:441042588
stacchella ( 11809) count:24 ave_time:6230649 total_time:149535597
mlovell ( 5899810) count:16 ave_time:7964527 total_time:127432435
jseeley ( 14063) count:14 ave_time:288302 total_time:4036233
zeyuanhu ( 11515) count:11 ave_time:10381427 total_time:114195704
dchan ( 558943) count:6 ave_time:79832 total_time:478992
yaowang ( 11810) count:5 ave_time:14677543 total_time:73387715
rhong ( 21011) count:5 ave_time:18807801 total_time:94039006
namluu ( 20825) count:4 ave_time:31107978 total_time:124431912
bursley ( 59719) count:4 ave_time:10590207 total_time:42360829
dkakkar ( 20839) count:4 ave_time:3827 total_time:15311
gloewinger ( 13020) count:3 ave_time:549612 total_time:1648836
mprasad ( 20409) count:3 ave_time:26905909 total_time:80717729
jdu ( 13910) count:3 ave_time:24400632 total_time:73201898
jeastman ( 5899955) count:2 ave_time:20745138 total_time:41490276
jfroberg ( 13751) count:2 ave_time:6923 total_time:13846
jpegues ( 21791) count:2 ave_time:21390403 total_time:42780807
francesco ( 11040) count:2 ave_time:7817828 total_time:15635656
jang ( 12650) count:1 ave_time:278634 total_time:278634
froy ( 12706) count:1 ave_time:277121 total_time:277121
bwilder ( 21982) count:1 ave_time:7069726 total_time:7069726
ghzhang17 ( 13095) count:1 ave_time:28174604 total_time:28174604
ruzhang ( 558295) count:1 ave_time:20086781 total_time:20086781
e88bu2024 ( 21897) count:1 ave_time:24097034 total_time:24097034
rjsingh ( 12725) count:1 ave_time:19749368 total_time:19749368
Pending RPC statistics
REQUEST_BATCH_JOB_LAUNCH ( 4005) count:7
REQUEST_LAUNCH_PROLOG ( 6017) count:6
REQUEST_TERMINATE_JOB ( 6011) count:1
Pending RPCs
1: REQUEST_BATCH_JOB_LAUNCH shakgpu05
2: REQUEST_LAUNCH_PROLOG shakgpu05
3: REQUEST_BATCH_JOB_LAUNCH shakgpu05
4: REQUEST_LAUNCH_PROLOG shakgpu04
5: REQUEST_BATCH_JOB_LAUNCH shakgpu04
6: REQUEST_LAUNCH_PROLOG shakgpu04
7: REQUEST_BATCH_JOB_LAUNCH shakgpu04
8: REQUEST_LAUNCH_PROLOG shakgpu02
9: REQUEST_BATCH_JOB_LAUNCH shakgpu02
10: REQUEST_LAUNCH_PROLOG shakgpu02
11: REQUEST_BATCH_JOB_LAUNCH shakgpu02
12: REQUEST_LAUNCH_PROLOG shakgpu01
13: REQUEST_BATCH_JOB_LAUNCH shakgpu01
14: REQUEST_TERMINATE_JOB holy7c26501

I will attach the log.

-Paul Edmon-

Created attachment 14089 [details]
Slurmctld log for 2300-2320
Current state:

[root@holy-slurm02 log]# sdiag
*******************************************************
sdiag output at Fri May 01 23:37:21 2020 (1588390641)
Data since Fri May 01 23:05:34 2020 (1588388734)
*******************************************************
Server thread count: 248
Agent queue size: 0
Agent count: 2
DBD Agent queue size: 40
Jobs submitted: 210
Jobs started: 1220
Jobs completed: 716
Jobs canceled: 187
Jobs failed: 0
Job states ts: Fri May 01 23:35:57 2020 (1588390557)
Jobs pending: 27869
Jobs running: 3871
Main schedule statistics (microseconds):
Last cycle: 233916
Max cycle: 317117
Total cycles: 74
Mean cycle: 183082
Mean depth cycle: 19
Cycles per minute: 2
Last queue length: 3239
Backfilling stats
Total backfilled jobs (since last slurm start): 511
Total backfilled jobs (since last stats cycle start): 511
Total backfilled heterogeneous job components: 0
Total cycles: 20
Last cycle when: Fri May 01 23:35:55 2020 (1588390555)
Last cycle: 85258338
Max cycle: 88216689
Mean cycle: 42888560
Last depth cycle: 2459
Last depth cycle (try sched): 50
Depth Mean: 2443
Depth Mean (try depth): 63
Last queue length: 3250
Queue length mean: 3191
Latency for 1000 calls to gettimeofday(): 28 microseconds
Remote Procedure Call statistics by message type
REQUEST_JOB_STEP_CREATE ( 5001) count:5710 ave_time:36180328 total_time:206589677163
REQUEST_PARTITION_INFO ( 2009) count:4311 ave_time:121717 total_time:524722651
REQUEST_FED_INFO ( 2049) count:3196 ave_time:275 total_time:880948
REQUEST_JOB_INFO_SINGLE ( 2021) count:3128 ave_time:712903 total_time:2229963569
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2190 ave_time:453169 total_time:992441442
REQUEST_STEP_COMPLETE ( 5016) count:1300 ave_time:4698500 total_time:6108050692
REQUEST_COMPLETE_PROLOG ( 6018) count:1277 ave_time:6000919 total_time:7663174133
MESSAGE_EPILOG_COMPLETE ( 6012) count:1110 ave_time:9394388 total_time:10427770876
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:923 ave_time:12989108 total_time:11988947335
REQUEST_NODE_INFO_SINGLE ( 2040) count:910 ave_time:126811 total_time:115398566
REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:511 ave_time:204723 total_time:104613771
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:378 ave_time:9990699 total_time:3776484365
REQUEST_NODE_INFO ( 2007) count:234 ave_time:539748 total_time:126301155
REQUEST_KILL_JOB ( 5032) count:214 ave_time:2356655 total_time:504324334
REQUEST_JOB_USER_INFO ( 2039) count:84 ave_time:842773 total_time:70792985
REQUEST_STATS_INFO ( 2035) count:58 ave_time:356 total_time:20691
REQUEST_CANCEL_JOB_STEP ( 5005) count:46 ave_time:35663429 total_time:1640517757
REQUEST_UPDATE_NODE ( 3002) count:36 ave_time:2802281 total_time:100882124
REQUEST_JOB_INFO ( 2003) count:36 ave_time:24105303 total_time:867790934
REQUEST_SHARE_INFO ( 2022) count:32 ave_time:48537 total_time:1553184
REQUEST_JOB_READY ( 4019) count:12 ave_time:211823 total_time:2541885
REQUEST_RESOURCE_ALLOCATION ( 4001) count:9 ave_time:16984702 total_time:152862321
REQUEST_UPDATE_PARTITION ( 3005) count:4 ave_time:17085514 total_time:68342058
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:3 ave_time:8397524 total_time:25192574
REQUEST_JOB_REQUEUE ( 5023) count:2 ave_time:361 total_time:723
ACCOUNTING_UPDATE_MSG (10001) count:1 ave_time:131584 total_time:131584
Remote Procedure Call statistics by user
root ( 0) count:9692 ave_time:4066232 total_time:39409929887
schlafly ( 35075) count:6218 ave_time:33460498 total_time:208057381535
dingma ( 40306) count:982 ave_time:80883 total_time:79427183
sgprasath ( 21136) count:967 ave_time:35252 total_time:34089243
ballen ( 5899723) count:667 ave_time:289552 total_time:193131196
kholton ( 20682) count:631 ave_time:97914 total_time:61784082
mengzhang ( 12484) count:629 ave_time:101925 total_time:64111115
acoran ( 22029) count:622 ave_time:95397 total_time:59337531
simonx ( 12821) count:621 ave_time:52437 total_time:32563915
gtorri ( 57089) count:603 ave_time:88341 total_time:53269775
jingnandu ( 20922) count:346 ave_time:197415 total_time:68305857
jyang12 ( 12394) count:332 ave_time:136911 total_time:45454685
bostdiek ( 20817) count:332 ave_time:136451 total_time:45301798
chsu ( 20796) count:332 ave_time:203221 total_time:67469662
sychen ( 13448) count:331 ave_time:152377 total_time:50437078
nthor ( 20593) count:329 ave_time:160890 total_time:52932920
atrevino ( 10871) count:315 ave_time:126401 total_time:39816558
ajcukierman ( 12585) count:309 ave_time:190847 total_time:58971795
mpolansky ( 13716) count:286 ave_time:154179 total_time:44095373
jaehwan ( 555445) count:271 ave_time:299567 total_time:81182770
dkakkar ( 20839) count:162 ave_time:136002 total_time:22032362
tfranzmeyer ( 20760) count:151 ave_time:9757618 total_time:1473400360
bmatejek ( 5900266) count:108 ave_time:9725661 total_time:1050371420
rmattheis ( 13299) count:106 ave_time:70802 total_time:7505094
daldarondo ( 13304) count:98 ave_time:11673132 total_time:1143966977
lshen ( 58176) count:57 ave_time:1756328 total_time:100110742
kbystrom ( 20619) count:54 ave_time:622443 total_time:33611923
stacchella ( 11809) count:24 ave_time:6230649 total_time:149535597
zeyuanhu ( 11515) count:19 ave_time:11172843 total_time:212284028
mlovell ( 5899810) count:16 ave_time:7964527 total_time:127432435
jeastman ( 5899955) count:15 ave_time:9365269 total_time:140479039
jseeley ( 14063) count:14 ave_time:288302 total_time:4036233
gloewinger ( 13020) count:7 ave_time:14513580 total_time:101595064
dchan ( 558943) count:7 ave_time:437885 total_time:3065195
benmergui ( 99690) count:7 ave_time:343504 total_time:2404531
yaowang ( 11810) count:5 ave_time:14677543 total_time:73387715
rhong ( 21011) count:5 ave_time:18807801 total_time:94039006
sgaynor ( 5900197) count:5 ave_time:136195 total_time:680978
jpegues ( 21791) count:4 ave_time:25368430 total_time:101473721
namluu ( 20825) count:4 ave_time:31107978 total_time:124431912
bursley ( 59719) count:4 ave_time:10590207 total_time:42360829
mprasad ( 20409) count:3 ave_time:26905909 total_time:80717729
jdu ( 13910) count:3 ave_time:24400632 total_time:73201898
kmattioli ( 558161) count:3 ave_time:25504835 total_time:76514505
francesco ( 11040) count:2 ave_time:7817828 total_time:15635656
jgortmaker ( 21075) count:2 ave_time:589 total_time:1179
jfroberg ( 13751) count:2 ave_time:6923 total_time:13846
jang ( 12650) count:2 ave_time:8283866 total_time:16567733
e88bu2024 ( 21897) count:2 ave_time:12079052 total_time:24158104
rjsingh ( 12725) count:1 ave_time:19749368 total_time:19749368
froy ( 12706) count:1 ave_time:277121 total_time:277121
slurm ( 57812) count:1 ave_time:131584 total_time:131584
cbischoff ( 40921) count:1 ave_time:22678 total_time:22678
ghzhang17 ( 13095) count:1 ave_time:28174604 total_time:28174604
isvetlizky ( 13998) count:1 ave_time:42834247 total_time:42834247
bwilder ( 21982) count:1 ave_time:7069726 total_time:7069726
ruzhang ( 558295) count:1 ave_time:20086781 total_time:20086781
xlu123 ( 10076) count:1 ave_time:71023942 total_time:71023942
Pending RPC statistics
No pending RPCs

-Paul Edmon-

(In reply to Paul Edmon from comment #25)
> Server thread count: 248
> Agent queue size: 0
> Agent count: 2
> DBD Agent queue size: 40

Has the thread count stayed below 256? Is the DBD agent queue size still rising?

(In reply to Paul Edmon from comment #23)
> Here is with 3 partitions open:

I take it normal jobs are now running again?

(In reply to Paul Edmon from comment #24)
> Created attachment 14089 [details]
> Slurmctld log for 2300-2320

These are not expected:

> May 1 23:03:53 holy-slurm02 slurmctld[26507]: error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
> May 1 23:04:11 holy-slurm02 slurmctld[26507]: error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential
> May 1 23:04:11 holy-slurm02 slurmctld[26507]: error: slurm_unpack_received_msg: REQUEST_JOB_STEP_CREATE has authentication error: Invalid authentication credential
> May 1 23:04:11 holy-slurm02 slurmctld[26507]: error: slurm_unpack_received_msg: MESSAGE_NODE_REGISTRATION_STATUS has authentication error: Invalid authentication credential

Can you please make sure the munge key on all of your nodes is synced? munged might need to be restarted on at least one node.

The munge key is the same on all nodes; we have not changed it in years. I did send out a global restart for munge. From past experience, those munge errors can occur when the RPC load builds up and slurmctld starts reading through old RPCs with old creds, which then appear time-shifted.

The thread count is going back and forth between 256 and lower. It keeps stalling. The DBD agent queue fluctuates up and down. Here's the current snapshot:

[root@holy-slurm02 log]# sdiag
*******************************************************
sdiag output at Fri May 01 23:58:53 2020 (1588391933)
Data since Fri May 01 23:55:43 2020 (1588391743)
*******************************************************
Server thread count: 119
Agent queue size: 0
Agent count: 0
DBD Agent queue size: 0
Jobs submitted: 12
Jobs started: 307
Jobs completed: 44
Jobs canceled: 0
Jobs failed: 0
Job states ts: Fri May 01 23:57:31 2020 (1588391851)
Jobs pending: 27737
Jobs running: 3736
Main schedule statistics (microseconds):
Last cycle: 714794
Max cycle: 777899
Total cycles: 5
Mean cycle: 465795
Mean depth cycle: 142
Cycles per minute: 1
Last queue length: 4284
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 57
Total backfilled jobs (since last stats cycle start): 57
Total backfilled heterogeneous job components: 0
Total cycles: 1
Last cycle when: Fri May 01 23:57:31 2020 (1588391851)
Last cycle: 62435039
Max cycle: 62435039
Mean cycle: 62435039
Last depth cycle: 873
Last depth cycle (try sched): 112
Depth Mean: 3704
Depth Mean (try depth): 97
Last queue length: 4308
Queue length mean: 8681
Latency for 1000 calls to gettimeofday(): 28 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2149 ave_time:6418642 total_time:13793662474
REQUEST_PARTITION_INFO ( 2009) count:513 ave_time:13428 total_time:6889074
REQUEST_JOB_INFO_SINGLE ( 2021) count:509 ave_time:508450 total_time:258801068
REQUEST_FED_INFO ( 2049) count:504 ave_time:260 total_time:131165
REQUEST_COMPLETE_PROLOG ( 6018) count:243 ave_time:643353 total_time:156334898
MESSAGE_EPILOG_COMPLETE ( 6012) count:121 ave_time:51655963 total_time:6250371603
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:106 ave_time:52728963 total_time:5589270179
REQUEST_STEP_COMPLETE ( 5016) count:98 ave_time:40404250 total_time:3959616513
REQUEST_STATS_INFO ( 2035) count:17 ave_time:363 total_time:6173
REQUEST_JOB_USER_INFO ( 2039) count:16 ave_time:20884 total_time:334151
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:12 ave_time:30059163 total_time:360709964
REQUEST_JOB_STEP_CREATE ( 5001) count:7 ave_time:4066294 total_time:28464060
REQUEST_NODE_INFO ( 2007) count:6 ave_time:96975 total_time:581855
REQUEST_SHARE_INFO ( 2022) count:6 ave_time:42626 total_time:255761
REQUEST_JOB_PACK_ALLOC_INFO ( 4027) count:5 ave_time:2383283 total_time:11916417
REQUEST_JOB_READY ( 4019) count:4 ave_time:177 total_time:709
REQUEST_RESOURCE_ALLOCATION ( 4001) count:3 ave_time:42577422 total_time:127732266
REQUEST_JOB_INFO ( 2003) count:2 ave_time:49763000 total_time:99526000
REQUEST_JOB_ALLOCATION_INFO ( 4014) count:1 ave_time:325 total_time:325
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:0 ave_time:0 total_time:0
Remote Procedure Call statistics by user
root ( 0) count:2784 ave_time:10721983 total_time:29850001226
dingma ( 40306) count:170 ave_time:49644 total_time:8439544
sgprasath ( 21136) count:161 ave_time:28255 total_time:4549095
mengzhang ( 12484) count:119 ave_time:41288 total_time:4913306
kholton ( 20682) count:119 ave_time:5409 total_time:643712
ballen ( 5899723) count:118 ave_time:33120 total_time:3908259
simonx ( 12821) count:113 ave_time:20460 total_time:2312002
gtorri ( 57089) count:106 ave_time:117094 total_time:12412022
sychen ( 13448) count:59 ave_time:7907 total_time:466557
jingnandu ( 20922) count:59 ave_time:64345 total_time:3796386
bostdiek ( 20817) count:59 ave_time:11734 total_time:692327
chsu ( 20796) count:59 ave_time:60402 total_time:3563761
nthor ( 20593) count:59 ave_time:8925 total_time:526607
jyang12 ( 12394) count:59 ave_time:57415 total_time:3387487
jaehwan ( 555445) count:59 ave_time:9494 total_time:560203
atrevino ( 10871) count:56 ave_time:74506 total_time:4172341
ajcukierman ( 12585) count:56 ave_time:46444 total_time:2600904
mpolansky ( 13716) count:47 ave_time:6205 total_time:291652
acoran ( 22029) count:23 ave_time:2176 total_time:50060
sgaynor ( 5900197) count:10 ave_time:12773453 total_time:127734533
daldarondo ( 13304) count:9 ave_time:24144511 total_time:217300606
bmatejek ( 5900266) count:8 ave_time:22954769 total_time:183638155
mkelp ( 13432) count:4 ave_time:882833 total_time:3531334
rweinberger ( 13263) count:2 ave_time:14020383 total_time:28040767
cbaker ( 41029) count:2 ave_time:56949119 total_time:113898239
jianxiangqiu ( 22008) count:1 ave_time:15805403 total_time:15805403
isvetlizky ( 13998) count:1 ave_time:47368167 total_time:47368167
Pending RPC statistics
No pending RPCs

What seems to happen is that it stalls on something, builds up RPCs until it hits 256, becomes unresponsive, then unwedges and releases a bunch of stuff at once.

It's midnight here, so here is my plan. There is no danger to the jobs, and if I leave the scheduler running overnight it will just be intermittently available to users, and it will still schedule jobs when it unwedges. I've put all our partitions as UP with the settings you recommended. We can let it run overnight and see if it magically fixes itself by processing through whatever jobs are in it.

That said, I don't know that it will. I suspect there is something else that broke which is causing the stalls, but I cannot figure it out. Any ideas would be appreciated. Happy to dial up the log verbosity to get more data. I will send you the latest logs I have before I go to bed.

-Paul Edmon-

Created attachment 14090 [details]
slurmctld log for 2300-0012
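On the munge-key question above, one quick way to verify the key really is identical everywhere is to compare checksums across nodes. A sketch (assumes root ssh access; the nodes.txt host list is a placeholder):

# Any node printing a different checksum has a stale munge key,
# and its munged should be restarted after the key is re-synced.
for h in $(cat nodes.txt); do
    printf '%s: ' "$h"
    ssh "$h" md5sum /etc/munge/munge.key
done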
(In reply to Paul Edmon from comment #28)
> Server thread count: 119
> Agent queue size: 0
> Agent count: 0
> DBD Agent queue size: 0

This looks a lot better, but still too many threads.

> Pending RPC statistics
> No pending RPCs

Looks like we at least cleared one of the log jams here.

> What seems to happen is that it stalls on something, builds up RPCs
> until it hits 256, becomes unresponsive, then unwedges and releases a
> bunch of stuff at once.

It might be required to lower this field in SchedulerParameters in slurm.conf:

> sched_max_job_start=100

Looks like most jobs are getting started by the main scheduler, and this would slow that down.

> It's midnight here, so here is my plan. There is no danger to the jobs,
> and if I leave the scheduler running overnight it will just be
> intermittently available to users, and it will still schedule jobs when
> it unwedges.

I'm going to lower this ticket to SEV2 per your response.

> That said, I don't know that it will. I suspect there is something else
> that broke which is causing the stalls, but I cannot figure it out. Any
> ideas would be appreciated. Happy to dial up the log verbosity to get
> more data. I will send you the latest logs I have before I go to bed.

Please keep them coming. Please also take another gcore per comment #11.

Created attachment 14091 [details]
core dump at 0018 on 05/02
Created attachment 14092 [details]
Current slurm.conf as of the morning of May 2nd
Created attachment 14093 [details]
Core dump at 0840 on 05/02
Created attachment 14094 [details]
slurmctld log for 05/02 0000-0830
Good morning. The night did not solve the problem, so whatever is wedging the scheduler is still wedging it. I'm digging back in to see if I can figure out what is happening. I've dropped in fresh logs; let me know if you need anything else or want me to make any changes.

Some additional probing this morning shows that when I enable a partition, the RPC count just starts stacking up, even for a partition that is relatively idle. This also concerns me:

May 2 10:06:40 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=278722116 began=10:02:01.375
May 2 10:06:41 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=279759753 began=10:02:01.974
May 2 10:06:42 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=280193002 began=10:02:01.934
May 2 10:06:44 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=282985658 began=10:02:01.470
May 2 10:06:48 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=278315968 began=10:02:09.687
May 2 10:06:48 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=286450800 began=10:02:01.683
May 2 10:06:48 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=286808882 began=10:02:01.389
May 2 10:06:52 holy-slurm02 slurmctld[35237]: Warning: Note very large processing time from _slurm_rpc_dump_jobs: usec=291082914 began=10:02:01.198

I'm not really sure why the job dump would take so long. We are also seeing users poll fairly frequently for job info, but that should be pretty matter-of-course for the scheduler; nothing has changed with that so far as I can tell.

Created attachment 14095 [details]
Core dump at 1030 on 5/2
This is a core dump taken in the middle of a stall. The scheduler sits there unresponsive for a while and then somehow unwedges and continues.
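For anyone reproducing this kind of capture: a core like the ones attached here can be taken from the live daemon without killing it. A minimal sketch, with illustrative output paths (gcore ships with gdb):

# snapshot the running slurmctld
gcore -o /tmp/slurmctld.core $(pgrep -o slurmctld)
# pull all thread backtraces out of the snapshot for quick inspection
gdb -batch -ex 'thread apply all bt' /usr/sbin/slurmctld /tmp/slurmctld.core.* > /tmp/slurmctld-bt.txt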
Created attachment 14096 [details]
slurmctld coredump 1033 05/02
A few minutes later it hit an even larger stall, with the slurmctld load sitting in the 4000s.
Managed to get the scheduler settled down again, but all the partitions are set to DOWN. Every time I try to reopen it stalls. An strace shows:
[pid 18451] connect(1045, {sa_family=AF_LOCAL, sun_path="/var/run/munge/munge.socket.2"}, 110) = 0
[pid 18451] writev(1045, [{"\0`mK\4\2\0\0\0\0\24", 11}, {"\1\1\1\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0", 20}], 2) = 31
[pid 18451] read(1045, "\0`mK\4\3\0\0\0\0\206", 11) = 11
[pid 18451] read(1045, "\0\0\0\0\0\200MUNGE:AwQDAAA+z2cVavs9bkf6"..., 134) = 134
[pid 18451] close(1045) = 0
[pid 18451] rt_sigaction(SIGALRM, {SIG_DFL, [ALRM], SA_RESTORER, 0x7f9996bb7630}, {SIG_DFL, [ALRM], SA_RESTORER, 0x7f9996bb7630}, 8) = 0
[pid 18451] rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER, 0x7f9996bb7630}, {SIG_IGN, [PIPE], SA_RESTORER, 0x7f9996bb7630}, 8) = 0
[pid 18451] fcntl(1044, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 18451] fcntl(1044, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 18451] fcntl(1044, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 18451] poll([{fd=1044, events=POLLOUT}], 1, 100000) = 1 ([{fd=1044, revents=POLLOUT}])
[pid 18451] recvfrom(1044, 0x7f98ce80d8d0, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18451] sendto(1044, "\0\1\357`", 4, 0, NULL, 0) = 4
[pid 18451] fcntl(1044, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 18451] fcntl(1044, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 18451] fcntl(1044, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 18451] fcntl(1044, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 18451] poll([{fd=1044, events=POLLOUT}], 1, 100000) = 1 ([{fd=1044, revents=POLLOUT}])
[pid 18451] recvfrom(1044, 0x7f98ce80d8d0, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18451] sendto(1044, "\"\0\0\0\0\0\7\332\0\1\356\302\0\0\0\0\0\0\0\0\0\0\0\0\0e\0\0\0\200MU"..., 126816, 0, NULL, 0) = 83984
[pid 18451] poll([{fd=1044, events=POLLOUT}], 1, 100000) = 1 ([{fd=1044, revents=POLLOUT}])
[pid 18451] recvfrom(1044, 0x7f98ce80d8d0, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18451] sendto(1044, "b,davis_lab,hepl,hoekstra_lab,ir"..., 42832, 0, NULL, 0) = 42832
[pid 18451] fcntl(1044, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 18451] rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER, 0x7f9996bb7630}, {SIG_IGN, [PIPE], SA_RESTORER, 0x7f9996bb7630}, 8) = 0
[pid 18451] close(1044) = 0
[pid 18451] madvise(0x7f98ce70e000, 1028096, MADV_DONTNEED) = 0
[pid 18451] exit(0) = ?
[pid 18451] +++ exited with 0 +++
[pid 3015] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 3014] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2892] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 3015] futex(0x6ef800, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 3014] futex(0x6efa00, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 3015] <... futex resumed> ) = 0
[pid 3014] <... futex resumed> ) = 0
[pid 2892] futex(0x6ef1e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 3015] futex(0x6ef844, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 2207, {1588432667, 0}, ffffffff <unfinished ...>
[pid 3014] futex(0x6efa44, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 2207, {1588432667, 0}, ffffffff <unfinished ...>
[pid 2892] <... futex resumed> ) = 0
I'm not sure what those timeouts and unavailable resources are related to, but they are likely the source of the deadlocks.
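For the record, a trace like the one above can be captured with a plain strace attach; a sketch, with an illustrative output path (-f follows all of slurmctld's threads, -tt adds timestamps):

strace -f -tt -o /tmp/slurmctld.strace -p $(pgrep slurmctld | head -1)

The EAGAIN returns on the non-blocking sockets are routine; the futex ETIMEDOUT lines are timed waits, as explained in the next comment.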
(In reply to Paul Edmon from comment #40)
> Managed to get the scheduler settled down again, but all the partitions are
> set to DOWN. Every time I try to reopen it stalls. An strace shows:
Returning to SEV1 per your response.
> I'm not sure what those timeouts and
Slurm uses mutexes with timeouts as internal timers.
> unavailable resources are related to,
> but they are likely the source of the deadlocks.
Please run these:
> sudo lsof -p $(pgrep slurmctld|head -1)
> ls -la /proc/$(pgrep slurmctld|head -1)/fd

(In reply to Paul Edmon from comment #40)
> [pid 18451] connect(1045, {sa_family=AF_LOCAL,
> sun_path="/var/run/munge/munge.socket.2"}, 110) = 0
Can you please verify how many threads munge is currently running with?
> munged --num-threads=10

(In reply to Paul Edmon from comment #43)
Please update your UPDATE_SQUEUE_PASSWORD.
> /usr/sbin/munged --num-threads 10
Can you increase that to 15?

Created attachment 14097 [details]
lsof
lsof -p $(pgrep slurmctld|head -1)
Created attachment 14098 [details]
fd
ls -la /proc/$(pgrep slurmctld|head -1)/fd
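A quick way to summarize what these two listings show is to count slurmctld's TCP connections by state; a throwaway sketch, assuming the stock lsof output layout where the state is the last field:

lsof -a -p $(pgrep slurmctld|head -1) -i TCP | awk 'NR>1 {print $NF}' | sort | uniq -c | sort -rn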
Blegh, forgot to sanitize that. Sorry.

-Paul Edmon-

Do you want that at 15 everywhere, or just on the master?

-Paul Edmon-

(In reply to Paul Edmon from comment #45)
> Created attachment 14097 [details]
> slurmctld 24649 slurm 1044u IPv4 13313937 0t0 TCP holy-slurm02.rc.fas.harvard.edu:7298->holy7c16404.rc.fas.harvard.edu:59408 (CLOSE_WAIT)
(In reply to Paul Edmon from comment #40)
> [pid 18451] recvfrom(1044, 0x7f98ce80d8d0, 1, 0, NULL, NULL) = -1 EAGAIN
> (Resource temporarily unavailable)
This is a connection to holy7c16404 that is waiting for more incoming data. I doubt it is the cause of the problem. Do you see any errors in dmesg about dropped packets or otherwise?

(In reply to Paul Edmon from comment #47)
> Blegh, forgot to sanitize that. Sorry.
I tagged the post private, but this ticket is public, so someone else could have seen it.

(In reply to Paul Edmon from comment #48)
> Do you want that at 15 everywhere, or just on the master?
Just the slurmctld node should be sufficient.

Thanks. I've upped munge to 15 on the master.
I did notice this:

[root@holy-slurm02 ~]# ifconfig
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
        inet 10.31.20.151 netmask 255.255.255.128 broadcast 10.31.20.255
        inet6 fe80::1618:77ff:fe2f:5dfd prefixlen 64 scopeid 0x20<link>
        ether 14:18:77:2f:5d:fd txqueuelen 1000 (Ethernet)
        RX packets 24301328 bytes 3654469329 (3.4 GiB)
        RX errors 0 dropped 2 overruns 0 frame 0
        TX packets 47629163 bytes 51083307242 (47.5 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 172.17.0.1 netmask 255.255.0.0 broadcast 0.0.0.0
        inet6 fe80::42:53ff:fe8d:acec prefixlen 64 scopeid 0x20<link>
        ether 02:42:53:8d:ac:ec txqueuelen 0 (Ethernet)
        RX packets 2506775 bytes 3221673745 (3.0 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 2610484 bytes 4951298627 (4.6 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

em1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 1500
        ether 14:18:77:2f:5d:fd txqueuelen 1000 (Ethernet)
        RX packets 14200613 bytes 1559840151 (1.4 GiB)
        RX errors 0 dropped 1 overruns 0 frame 0
        TX packets 26682796 bytes 30077933325 (28.0 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 18

em3: flags=6147<UP,BROADCAST,SLAVE,MULTICAST> mtu 1500
        ether 14:18:77:2f:5d:fd txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 19

em4: flags=6147<UP,BROADCAST,SLAVE,MULTICAST> mtu 1500
        ether 14:18:77:2f:5d:fd txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device interrupt 16

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 460506 bytes 3147757537 (2.9 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 460506 bytes 3147757537 (2.9 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

vetha1f8bc7: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet6 fe80::50e7:82ff:fe35:54be prefixlen 64 scopeid 0x20<link>
        ether 52:e7:82:35:54:be txqueuelen 0 (Ethernet)
        RX packets 2506775 bytes 3256768595 (3.0 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 2610492 bytes 4951299283 (4.6 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Nothing in dmesg about dropped packets, though it has at least dropped 2. I will ping our network engineer. On the holy7c16404 side I didn't see any dropped packets in dmesg or in ifconfig.

-Paul Edmon-
(In reply to Paul Edmon from comment #37)
> I'm not really sure why the job dump would take so long.
The core dump shows 158 threads in _slurm_rpc_dump_jobs() at once. Generally, _slurm_rpc_dump_jobs() is not slow, but it does have to grab a few read locks, which is a blocking event.
> We are also seeing
> users ping fairly frequently for job info, but that should be pretty matter
> of course for the scheduler.
It doesn't look like a user is flooding the scheduler with dump requests. root, on the other hand, has 230 pending RPCs. Are there any scripts run as root that would call squeue or `scontrol show jobs`?
Please attach /usr/local/bin/slurm_epilog

(In reply to Paul Edmon from comment #51)
> Nothing in dmesg about dropped packets though but it has at least
> dropped 2. I will ping our network engineer.
That is an exceptionally low drop count; I doubt this is a problem.

Created attachment 14099 [details]
slurm_epilog
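One plausible way to reproduce the 158-thread count quoted above from any of the attached cores (the core path is illustrative):

# count threads currently inside _slurm_rpc_dump_jobs
gdb -batch -ex 'thread apply all bt' /usr/sbin/slurmctld /path/to/core 2>/dev/null | grep -c _slurm_rpc_dump_jobs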
Yeah, but I'm still going to have him look in case it is symptomatic of some larger underlying problem, as it would explain a lot. We can operate, though, as if networking isn't the problem.

As for root running squeue: we try to avoid it. We have scripts that ping sacct for data, and our epilog looks at the local cgroups but doesn't touch squeue. We do have a squeue buffer we built to protect against direct invocation of squeue (which was that docker container you saw earlier).

-Paul Edmon-

Just an update: even with the munge thread change it still stalls and then eventually throws munge errors. I don't think the issue is munge per se, but rather that something is stalling/slowing down slurmctld's responses to RPCs. It's either blocking on a specific RPC or it's just slow in general. At least that's my thinking.

-Paul Edmon-

Finally got it unwedged. Here is the current sdiag:

[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Sat May 02 13:04:46 2020 (1588439086)
Data since Sat May 02 13:04:13 2020 (1588439053)
*******************************************************
Server thread count: 32
Agent queue size: 0
Agent count: 0
DBD Agent queue size: 0
Jobs submitted: 5
Jobs started: 0
Jobs completed: 39
Jobs canceled: 0
Jobs failed: 0
Job states ts: Sat May 02 13:04:13 2020 (1588439053)
Jobs pending: 26207
Jobs running: 3219
Main schedule statistics (microseconds):
Last cycle: 17985
Max cycle: 22985
Total cycles: 4
Mean cycle: 17120
Mean depth cycle: 0
Last queue length: 0
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total backfilled heterogeneous job components: 0
Total cycles: 0
Last cycle when: Wed Dec 31 19:00:00 1969 (0)
Last cycle: 0
Max cycle: 0
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 27 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2138 ave_time:63224 total_time:135175011
MESSAGE_EPILOG_COMPLETE ( 6012) count:374 ave_time:5359051 total_time:2004285165
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:262 ave_time:42058785 total_time:11019401869
REQUEST_STEP_COMPLETE ( 5016) count:226 ave_time:5937922 total_time:1341970415
REQUEST_PARTITION_INFO ( 2009) count:177 ave_time:15757 total_time:2788990
REQUEST_JOB_INFO_SINGLE ( 2021) count:176 ave_time:323436 total_time:56924842
REQUEST_FED_INFO ( 2049) count:154 ave_time:736 total_time:113366
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:7 ave_time:16344643 total_time:114412502
REQUEST_STATS_INFO ( 2035) count:7 ave_time:364 total_time:2553
REQUEST_JOB_USER_INFO ( 2039) count:5 ave_time:7130 total_time:35653
REQUEST_RESOURCE_ALLOCATION ( 4001) count:2 ave_time:28680703 total_time:57361407
REQUEST_NODE_INFO ( 2007) count:2 ave_time:26570 total_time:53140
REQUEST_SHARE_INFO ( 2022) count:2 ave_time:58590 total_time:117181
REQUEST_JOB_INFO ( 2003) count:1 ave_time:57082435 total_time:57082435
REQUEST_KILL_JOB ( 5032) count:0 ave_time:0 total_time:0
Remote Procedure Call statistics by user
root ( 0) count:3025 ave_time:4812612 total_time:14558151642
gchaudhary ( 20991) count:111 ave_time:7369 total_time:818010
sgprasath ( 21136) count:92 ave_time:6468 total_time:595129
dingma ( 40306) count:92 ave_time:5403 total_time:497083
mengzhang ( 12484) count:38 ave_time:6429 total_time:244302
kholton ( 20682) count:37 ave_time:5656 total_time:209295
ballen ( 5899723) count:32 ave_time:5508 total_time:176273
eprotzer ( 21969) count:16 ave_time:4557 total_time:72916
agomezlievano ( 12518) count:14 ave_time:6645 total_time:93041
mpolansky ( 13716) count:14 ave_time:5197 total_time:72761
zmorris ( 558739) count:14 ave_time:5157 total_time:72210
ajcukierman ( 12585) count:14 ave_time:5183 total_time:72572
jyang12 ( 12394) count:14 ave_time:6321 total_time:88499
jtaylor ( 558629) count:4 ave_time:2715 total_time:10862
simonx ( 12821) count:4 ave_time:7099 total_time:28397
bmatejek ( 5900266) count:4 ave_time:14268212 total_time:57072848
daldarondo ( 13304) count:3 ave_time:18915876 total_time:56747628
swilson ( 556463) count:2 ave_time:28510706 total_time:57021412
stat115u20025 ( 21481) count:1 ave_time:57108257 total_time:57108257
yitongqi ( 21637) count:1 ave_time:318242 total_time:318242
vknutson ( 558621) count:1 ave_time:253150 total_time:253150
dongyuyang ( 54022) count:0 ave_time:0 total_time:0
Pending RPC statistics
No pending RPCs

All the partitions are set to DOWN. I'm going to suspend all the jobs that are running so we can rule out any job traffic.

-Paul Edmon-
Please attach your /etc/sssd/sssd.conf. There are 82 connections in the dump waiting on a reply from /var/lib/sss/mc/initgroups.

(In reply to Paul Edmon from comment #57)
> Just an update, even with the munge thread change it still stalls and
> then eventually throws munge errors.
Understood.
> rather something is stalling/slowing down the slurmctld
> responding to RPC's
Currently looking for that right now.

Created attachment 14100 [details]
sssd config
(In reply to Paul Edmon from comment #61)
> Created attachment 14100 [details]
> sssd config
Please also call:
> systemctl status sssd
> time getent passwd >/dev/null
> time getent group >/dev/null
> time getent hosts >/dev/null

[root@holy-slurm02 sssd]# systemctl status sssd
● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2020-05-01 16:52:08 EDT; 20h ago
 Main PID: 3295 (sssd)
    Tasks: 5
   Memory: 95.0M
   CGroup: /system.slice/sssd.service
           ├─3295 /usr/sbin/sssd -i --logger=files
           ├─3305 /usr/libexec/sssd/sssd_be --domain rc.domain --uid 0 --gid 0 --logger=files
           ├─3306 /usr/libexec/sssd/sssd_be --domain academic.domain --uid 0 --gid 0 --logger=files
           ├─3338 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
           └─3339 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files

May 01 16:52:08 holy-slurm02.rc.fas.harvard.edu sssd[be[academic.domain]][3306]: Starting up
May 01 16:52:08 holy-slurm02.rc.fas.harvard.edu sssd[be[rc.domain]][3305]: Starting up
May 01 16:52:08 holy-slurm02.rc.fas.harvard.edu sssd[pam][3339]: Starting up
May 01 16:52:08 holy-slurm02.rc.fas.harvard.edu sssd[nss][3338]: Starting up
May 01 16:52:08 holy-slurm02.rc.fas.harvard.edu systemd[1]: Started System Security Services Daemon.
May 01 16:52:39 holy-slurm02.rc.fas.harvard.edu sssd[be[academic.domain]][3306]: Backend is offline
May 01 16:52:39 holy-slurm02.rc.fas.harvard.edu sssd[be[rc.domain]][3305]: Backend is offline
May 01 16:52:44 holy-slurm02.rc.fas.harvard.edu sssd[nss][3338]: Enumeration requested but not enabled
May 01 16:52:46 holy-slurm02.rc.fas.harvard.edu sssd[be[academic.domain]][3306]: Backend is online
May 01 16:52:46 holy-slurm02.rc.fas.harvard.edu sssd[be[rc.domain]][3305]: Backend is online

[root@holy-slurm02 sssd]# time getent passwd > /dev/null
real 0m0.003s
user 0m0.002s
sys 0m0.002s
[root@holy-slurm02 sssd]# time getent group > /dev/null
real 0m0.004s
user 0m0.000s
sys 0m0.003s
[root@holy-slurm02 sssd]# time getent hosts > /dev/null
real 0m0.003s
user 0m0.003s
sys 0m0.000s

Looking at the backtrace and the logs, dumping the jobs is taking ~20s (_slurm_rpc_dump_jobs). Is it possible to decrease the number of jobs in the system?

(In reply to Paul Edmon from comment #58)
> Jobs pending: 26207
> Jobs running: 3219
Current settings:
> MaxJobCount=300000
> MinJobAge=600
Can we change this to:
> MaxJobCount=30000
> MinJobAge=60
This shouldn't kill any submitted jobs, but it should cause slurmctld to purge its internal list of completed jobs. It will very likely stop responding to RPCs for a while after this change while it writes all the jobs out.

Yes, I will make those changes. The MinJobAge=600 was for an HTCondor setup we don't run anymore, so that can be dropped as low as we want.
As for MaxJobCount, we will need to go back up to 300000 when we are in production, as we have roughly 100,000 cores and are serial-job dominated. For now, while we figure this out, I will reduce it to 30000.

-Paul Edmon-

With both those changes, the scheduler still stalls periodically. Here is the latest sdiag:

[root@holy-slurm02 slurm]# sdiag
*******************************************************
sdiag output at Sat May 02 13:50:48 2020 (1588441848)
Data since Sat May 02 13:46:26 2020 (1588441586)
*******************************************************
Server thread count: 105
Agent queue size: 0
Agent count: 0
DBD Agent queue size: 0
Jobs submitted: 43
Jobs started: 0
Jobs completed: 30
Jobs canceled: 1
Jobs failed: 0
Job states ts: Sat May 02 13:47:34 2020 (1588441654)
Jobs pending: 26300
Jobs running: 1822
Main schedule statistics (microseconds):
Last cycle: 19484
Max cycle: 36271
Total cycles: 8
Mean cycle: 21381
Mean depth cycle: 0
Cycles per minute: 2
Last queue length: 0
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total backfilled heterogeneous job components: 0
Total cycles: 0
Last cycle when: Wed Dec 31 19:00:00 1969 (0)
Last cycle: 0
Max cycle: 0
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 27 microseconds
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2138 ave_time:2040193 total_time:4361934122
REQUEST_PARTITION_INFO ( 2009) count:1595 ave_time:10651 total_time:16988863
REQUEST_FED_INFO ( 2049) count:1286 ave_time:259 total_time:333690
REQUEST_JOB_INFO_SINGLE ( 2021) count:1278 ave_time:268734 total_time:343442068
REQUEST_SUSPEND ( 5014) count:288 ave_time:1196416 total_time:344567982
REQUEST_STATS_INFO ( 2035) count:225 ave_time:342 total_time:77062
MESSAGE_EPILOG_COMPLETE ( 6012) count:61 ave_time:71827286 total_time:4381464453
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:55 ave_time:32937763 total_time:1811577013
REQUEST_STEP_COMPLETE ( 5016) count:53 ave_time:21116692 total_time:1119184683
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:50 ave_time:18395768 total_time:919788427
REQUEST_JOB_USER_INFO ( 2039) count:22 ave_time:13417 total_time:295181
REQUEST_NODE_INFO ( 2007) count:10 ave_time:36381 total_time:363817
REQUEST_SHARE_INFO ( 2022) count:8 ave_time:46348 total_time:370785
REQUEST_JOB_INFO ( 2003) count:5 ave_time:57311490 total_time:286557454
REQUEST_KILL_JOB ( 5032) count:3 ave_time:86745980 total_time:260237941
REQUEST_RESOURCE_ALLOCATION ( 4001) count:2 ave_time:51794471 total_time:103588943
REQUEST_JOB_ALLOCATION_INFO ( 4014) count:2 ave_time:219 total_time:438
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:0 ave_time:0 total_time:0
Remote Procedure Call statistics by user
root ( 0) count:2899 ave_time:4245000 total_time:12306257331
gchaudhary ( 20991) count:503 ave_time:4886 total_time:2457915
sgprasath ( 21136) count:374 ave_time:4902 total_time:1833554
vsemenova ( 13255) count:368 ave_time:4946 total_time:1820475
dingma ( 40306) count:368 ave_time:4918 total_time:1809926
wji ( 13315) count:310 ave_time:86234 total_time:26732845
jtaylor ( 558629) count:267 ave_time:2582 total_time:689548
mengzhang ( 12484) count:257 ave_time:5084 total_time:1306588
kholton ( 20682) count:257 ave_time:4895 total_time:1258220
ballen ( 5899723) count:256 ave_time:4995 total_time:1278894
u_219006_g_65249( 6000076) count:198 ave_time:1316492 total_time:260665532
yitongqi ( 21637) count:134 ave_time:5176 total_time:693632
jyang12 ( 12394) count:128 ave_time:5000 total_time:640118
eprotzer ( 21969) count:128 ave_time:4915 total_time:629243
agomezlievano ( 12518) count:128 ave_time:5087 total_time:651155
zmorris ( 558739) count:128 ave_time:4966 total_time:635672
ajcukierman ( 12585) count:119 ave_time:5085 total_time:605226
mpolansky ( 13716) count:113 ave_time:5015 total_time:566761
dkakkar ( 20839) count:78 ave_time:5098 total_time:397652
yisoe ( 11692) count:37 ave_time:1035729 total_time:38321999
daldarondo ( 13304) count:13 ave_time:26376497 total_time:342894470
bmatejek ( 5900266) count:9 ave_time:37364346 total_time:336279114
dhur ( 20296) count:3 ave_time:86531605 total_time:259594815
svc-cmdbquery ( 13174) count:2 ave_time:18689 total_time:37378
dongyuyang ( 54022) count:1 ave_time:161173996 total_time:161173996
stat115u20072 ( 21528) count:1 ave_time:15424280 total_time:15424280
stat115u20076 ( 21532) count:1 ave_time:186116347 total_time:186116347
yuntian ( 11729) count:1 ave_time:236 total_time:236
Pending RPC statistics
No pending RPCs

Created attachment 14102 [details]
core dump at 1354 on 05/02

This was taken after the recommendations of comment #66 were implemented, during a stall.

Just an update: I'm back in the 256-thread stall zone and trying to get off of it. When I do a restart it just stalls immediately and goes to 256 threads. The slurmctld load goes to 4800%, while the load on the box goes up to around 120. It usually then starts to unwedge a bit and the load drops, but the stall at very high load lasts about 5 minutes. Even after the load drops it still sits at 256 threads.

Created attachment 14103 [details]
slurmctld log for 05/02 for 0000-1530
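A quick way to see which RPC handlers dominate the warnings in a log like this one; the pattern is taken from the messages quoted earlier, and the log path is illustrative:

grep -o 'very large processing time from [_a-z]*' /var/log/slurm/slurmctld.log | sort | uniq -c | sort -rn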
Managed to suspend all the jobs. However the stall persists. So its not any specific job. Here is the latest sdiag:
[root@holy-slurm02 ~]# sdiag
*******************************************************
sdiag output at Sat May 02 15:28:51 2020 (1588447731)
Data since Sat May 02 15:06:01 2020 (1588446361)
*******************************************************
Server thread count: 20
Agent queue size: 0
Agent count: 0
DBD Agent queue size: 0
Jobs submitted: 142
Jobs started: 0
Jobs completed: 60
Jobs canceled: 9
Jobs failed: 0
Job states ts: Sat May 02 15:24:19 2020 (1588447459)
Jobs pending: 27048
Jobs running: 0
Main schedule statistics (microseconds):
Last cycle: 19239
Max cycle: 156645
Total cycles: 45
Mean cycle: 35135
Mean depth cycle: 0
Cycles per minute: 2
Last queue length: 0
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 0
Total backfilled jobs (since last stats cycle start): 0
Total backfilled heterogeneous job components: 0
Total cycles: 0
Last cycle when: Wed Dec 31 19:00:00 1969 (0)
Last cycle: 0
Max cycle: 0
Last depth cycle: 0
Last depth cycle (try sched): 0
Last queue length: 0
Latency for 1000 calls to gettimeofday(): 28 microseconds
Remote Procedure Call statistics by message type
REQUEST_PARTITION_INFO ( 2009) count:3658 ave_time:456506 total_time:1669899695
REQUEST_FED_INFO ( 2049) count:2739 ave_time:1596 total_time:4373005
REQUEST_JOB_INFO_SINGLE ( 2021) count:2691 ave_time:677461 total_time:1823050189
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:2177 ave_time:3626470 total_time:7894826940
REQUEST_SUSPEND ( 5014) count:1120 ave_time:786604 total_time:880996772
REQUEST_NODE_INFO_SINGLE ( 2040) count:818 ave_time:3222 total_time:2636219
REQUEST_STATS_INFO ( 2035) count:580 ave_time:1584 total_time:918945
MESSAGE_EPILOG_COMPLETE ( 6012) count:174 ave_time:2138754 total_time:372143307
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:160 ave_time:51374209 total_time:8219873469
REQUEST_STEP_COMPLETE ( 5016) count:159 ave_time:1284302 total_time:204204069
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:151 ave_time:14300013 total_time:2159302070
REQUEST_NODE_INFO ( 2007) count:127 ave_time:1639406 total_time:208204616
REQUEST_UPDATE_NODE ( 3002) count:95 ave_time:43641858 total_time:4145976587
REQUEST_JOB_USER_INFO ( 2039) count:67 ave_time:565256 total_time:37872199
REQUEST_KILL_JOB ( 5032) count:10 ave_time:32622486 total_time:326224869
REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:9 ave_time:78333787 total_time:705004088
REQUEST_RESOURCE_ALLOCATION ( 4001) count:8 ave_time:62078130 total_time:496625046
REQUEST_JOB_INFO ( 2003) count:8 ave_time:58218129 total_time:465745038
REQUEST_JOB_ALLOCATION_INFO ( 4014) count:2 ave_time:1333 total_time:2667
REQUEST_SHARE_INFO ( 2022) count:2 ave_time:36794 total_time:73588
Remote Procedure Call statistics by user
root ( 0) count:6507 ave_time:3643257 total_time:23706676488
gchaudhary ( 20991) count:1453 ave_time:35487 total_time:51563785
dingma ( 40306) count:757 ave_time:20715 total_time:15681613
ballen ( 5899723) count:698 ave_time:135215 total_time:94380140
kholton ( 20682) count:682 ave_time:66109 total_time:45086354
mengzhang ( 12484) count:678 ave_time:68578 total_time:46495968
sgprasath ( 21136) count:591 ave_time:84162 total_time:49739818
yitongqi ( 21637) count:359 ave_time:135091 total_time:48497780
jyang12 ( 12394) count:341 ave_time:127801 total_time:43580388
mpolansky ( 13716) count:341 ave_time:129903 total_time:44296944
fernandes ( 556465) count:335 ave_time:145685 total_time:48804673
eprotzer ( 21969) count:333 ave_time:270713 total_time:90147616
ajcukierman ( 12585) count:320 ave_time:138905 total_time:44449666
barkley ( 558696) count:315 ave_time:370040 total_time:116562790
dkakkar ( 20839) count:300 ave_time:262018 total_time:78605437
u_159900_g_65249( 6000078) count:247 ave_time:563578 total_time:139203854
jcornelison ( 5899776) count:151 ave_time:233926 total_time:35322939
jwalker ( 20775) count:124 ave_time:3709167 total_time:459936759
ejones ( 20412) count:50 ave_time:626788 total_time:31339445
agomezlievano ( 12518) count:40 ave_time:31172 total_time:1246882
daldarondo ( 13304) count:39 ave_time:28450317 total_time:1109562391
bmatejek ( 5900266) count:34 ave_time:27072757 total_time:920473750
mengdong ( 14125) count:25 ave_time:37434697 total_time:935867437
cechavarria ( 558213) count:12 ave_time:65728708 total_time:788744500
jtaylor ( 558629) count:11 ave_time:2477174 total_time:27248914
sstanley ( 21855) count:2 ave_time:90418654 total_time:180837309
sgaynor ( 5900197) count:1 ave_time:37601466 total_time:37601466
e88bu2004 ( 21874) count:1 ave_time:36053763 total_time:36053763
bgoulet ( 10657) count:1 ave_time:124476679 total_time:124476679
benmergui ( 99690) count:1 ave_time:47096860 total_time:47096860
junjiema2 ( 21868) count:1 ave_time:53584760 total_time:53584760
zzhang01 ( 58426) count:1 ave_time:26583108 total_time:26583108
qxz125 ( 556345) count:1 ave_time:112538440 total_time:112538440
atsang ( 14422) count:1 ave_time:25591074 total_time:25591074
tstgermaine ( 558600) count:1 ave_time:51370 total_time:51370
cbischoff ( 40921) count:1 ave_time:22218 total_time:22218
Pending RPC statistics
No pending RPCs
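As an aside, the per-user table in output like this can be ranked by total RPC time with a throwaway one-liner (assuming the sdiag layout shown above):

sdiag | awk '/statistics by user/{f=1;next} /Pending RPC/{f=0} f && /total_time/ {split($NF,t,":"); print t[2], $1}' | sort -rn | head

root dominating the others by orders of magnitude is consistent with the root-owned RPC backlog noted earlier in the ticket.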
Hi,

Could you send me the output of these gdb commands, run on the last backtrace?

t 119
2 f
p *job_ptr

I suspect that jobs submitted by this user generate this problem.

Dominik

[root@holy-slurm02 bug-8978]# gdb /usr/sbin/slurmctld core.41849
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-115.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/slurmctld...done.
[New LWP 41851] [New LWP 41852] [New LWP 41854] [New LWP 41856] [New LWP 41918] [New LWP 41919] [New LWP 41920] [New LWP 41921] [New LWP 41922] [New LWP 41923] [New LWP 41925] [New LWP 1429] [New LWP 1442] [New LWP 1657] [New LWP 1662] [New LWP 1669] [New LWP 1755] [New LWP 1904] [New LWP 1933] [New LWP 1937] [New LWP 2083] [New LWP 2366] [New LWP 2451] [New LWP 2452] [New LWP 2453] [New LWP 2454] [New LWP 2458] [New LWP 2460] [New LWP 2461] [New LWP 2494] [New LWP 2498] [New LWP 2500] [New LWP 2506] [New LWP 2511] [New LWP 2516] [New LWP 2517] [New LWP 2518] [New LWP 2519] [New LWP 2521] [New LWP 2524] [New LWP 2525] [New LWP 2526] [New LWP 2528] [New LWP 2529] [New LWP 2531] [New LWP 2533] [New LWP 2536] [New LWP 2537] [New LWP 2538] [New LWP 2539] [New LWP 2540] [New LWP 2544] [New LWP 2548] [New LWP 2549] [New LWP 2550] [New LWP 2553] [New LWP 2555] [New LWP 2558] [New LWP 2561] [New LWP 2566] [New LWP 2569] [New LWP 2570] [New LWP 2571] [New LWP 2576] [New LWP 2579] [New LWP 2582] [New LWP 2587] [New LWP 2591] [New LWP 2595] [New LWP 2598] [New LWP 2600] [New LWP 2601] [New LWP 2602] [New LWP 2603] [New LWP 2605] [New LWP 2611] [New LWP 2615] [New LWP 2619] [New LWP 2621] [New LWP 2623] [New LWP 2628] [New LWP 2635] [New LWP 2637] [New LWP 2639] [New LWP 2644] [New LWP 2648] [New LWP 2649] [New LWP 2650] [New LWP 2652] [New LWP 2653] [New LWP 2654] [New LWP 2655] [New LWP 2656] [New LWP 2667] [New LWP 2690] [New LWP 2783] [New LWP 2931] [New LWP 2933] [New LWP 3069] [New LWP 3081] [New LWP 3113] [New LWP 3206] [New LWP 3215] [New LWP 3356] [New LWP 3361] [New LWP 3521] [New LWP 3541] [New LWP 3559] [New LWP 3620] [New LWP 3646] [New LWP 3662] [New LWP 3765] [New LWP 3787] [New LWP 3902] [New LWP 3961] [New LWP 3962] [New LWP 4020] [New LWP 4069] [New LWP 4073] [New LWP 4089] [New LWP 4110] [New LWP 4189] [New LWP 4201] [New LWP 4255] [New LWP 4263] [New LWP 4270] [New LWP 4276] [New LWP 4394] [New LWP 4489] [New LWP 4492] [New LWP 4494] [New LWP 4506] [New LWP 41849]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/slurmctld'.
#0 0x00007fc7f8d61de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-19.05.5-1fasrc01.el7.x86_64
(gdb) t 119
[Switching to thread 119 (Thread 0x7fc7d8943700 (LWP 4073))]
#0 0x00007fc7f92f62b3 in _makespace (str=str@entry=0x7fc7d89428c0, str_len=str_len@entry=-1, needed=needed@entry=1) at xstring.c:110
110 xstring.c: No such file or directory.
(gdb) 2 f
Undefined command: "2". Try "help".
(gdb) p *job_ptr
No symbol "job_ptr" in current context.
(gdb) quit

I think this is the command you wanted:

[root@holy-slurm02 bug-8978]# gdb /usr/sbin/slurmctld core.41849
[... same startup output as above ...]
Core was generated by `/usr/sbin/slurmctld'.
#0 0x00007fc7f8d61de2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install slurm-slurmctld-19.05.5-1fasrc01.el7.x86_64
(gdb) t 119
[Switching to thread 119 (Thread 0x7fc7d8943700 (LWP 4073))]
#0 0x00007fc7f92f62b3 in _makespace (str=str@entry=0x7fc7d89428c0, str_len=str_len@entry=-1, needed=needed@entry=1) at xstring.c:110
110 xstring.c: No such file or directory.
(gdb) f 2
#2 0x000000000044037b in _pack_default_job_details (job_ptr=job_ptr@entry=0x67aeff0, buffer=buffer@entry=0x7fc5d4000c90, protocol_version=protocol_version@entry=8704) at job_mgr.c:10624
10624 job_mgr.c: No such file or directory.
(gdb) p *job_ptr
$1 = {magic = 4038539564, account = 0x67af630 "olveczky_lab", admin_comment = 0x0, alias_list = 0x0, alloc_node = 0x67af600 "boslogin01", alloc_resp_port = 0, alloc_sid = 241239, array_job_id = 54727756, array_task_id = 1594, array_recs = 0x0, assoc_id = 8941, assoc_ptr = 0x2079990, batch_features = 0x0, batch_flag = 2, batch_host = 0x67af6a0 "holystat12", billable_tres = 2.0869140625, bit_flags = 444596224, burst_buffer = 0x0, burst_buffer_state = 0x0, check_job = 0x0, ckpt_interval = 0, ckpt_time = 0, clusters = 0x0, comment = 0x0, cpu_cnt = 0, cpus_per_tres = 0x0, cr_enabled = 0, db_flags = 4, db_index = 765607413, deadline = 0, delay_boot = 0, derived_ec = 0, details = 0x67af400, direct_set_prio = 0, end_time = 0, end_time_exp = 0, epilog_running = false, exit_code = 1, fed_details = 0x0, front_end_ptr = 0x0, gres_list = 0x0, gres_alloc = 0x67af660 "", gres_detail_cnt = 0, gres_detail_str = 0x0, gres_req = 0x67af680 "", gres_used = 0x0, group_id = 40013, job_id = 54739752, job_next = 0x0, job_array_next_j = 0x674e1e0, job_array_next_t = 0x0, job_resrcs = 0x67af700, job_state = 0, kill_on_node_fail = 1, last_sched_eval = 1588436032, licenses = 0x0, license_list = 0x0, limit_set = {qos = 0, time = 0, tres = 0x67aefc0}, mail_type = 0, mail_user = 0x0, mem_per_tres = 0x0, mcs_label = 0x0, name = 0x67aee70 "compute_stac", network = 0x0, next_step_id = 0, nodes = 0x0, node_addr = 0x0, node_bitmap = 0x0, node_bitmap_cg = 0x0, node_cnt = 0, node_cnt_wag = 1, nodes_completing = 0x0, origin_cluster = 0x0, other_port = 0, pack_details = 0x0, pack_job_id = 0, pack_job_id_set = 0x0, pack_job_offset = 0, pack_job_list = 0x0, partition = 0x67aedf0 "serial_requeue,olveczky,shared", part_ptr_list = 0x61fa050, part_nodes_missing = false, part_ptr = 0x2732f70, power_flags = 0 '\000', pre_sus_time = 0, preempt_time = 0, preempt_in_progress = false, priority = 2517554, priority_array = 0x7fc7c8006980, prio_factors = 0x67aef20, profile = 0, qos_id = 1, qos_ptr = 0x1f3b720, qos_blocking_ptr = 0x0, reboot = 0 '\000', restart_cnt = 1, resize_time = 0, resv_id = 0, resv_name = 0x0, resv_ptr = 0x0, requid = 4294967295, resp_host = 0x0, sched_nodes = 0x0, select_jobinfo = 0x67af6d0, site_factor = 2147483648, spank_job_env = 0x0, spank_job_env_size = 0, start_protocol_ver = 8704, start_time = 0, state_desc = 0x0, state_reason = 6, state_reason_prev = 6, state_reason_prev_db = 6, step_list = 0x61fa0a0, suspend_time = 0, system_comment = 0x0, time_last_active = 1588441391, time_limit = 300, time_min = 0, tot_sus_time = 0, total_cpus = 0, total_nodes = 0, tres_bind = 0x0, tres_freq = 0x0, tres_per_job = 0x0, tres_per_node = 0x0, tres_per_socket = 0x0, tres_per_task = 0x0, tres_req_cnt = 0x680fcc0, tres_req_str = 0x680fb90 "1=1,2=13000,4=1,5=4", tres_fmt_req_str = 0x680fbc0 "cpu=1,mem=13000M,node=1,billing=4", tres_alloc_cnt = 0x0, tres_alloc_str = 0x0, tres_fmt_alloc_str = 0x0, user_id = 13304, user_name = 0x67af5d0 "daldarondo", wait_all_nodes = 0, warn_flags = 0, warn_signal = 0, warn_time = 0, wckey = 0x0, req_switch = 0, wait4switch = 0, best_switch = true, wait4switch_start = 0}
(gdb)

Created attachment 14104 [details]
scontrol show job 54739752
This looks all sorts of bad. What do you want me to do? Cancel?
Sorry for the typo. Could you send me also detail_ptr?

t 119
2 f
p *detail_ptr

(gdb) t 119
[Switching to thread 119 (Thread 0x7fc7d8943700 (LWP 4073))]
#0 0x00007fc7f92f62b3 in _makespace (str=str@entry=0x7fc7d89428c0,
str_len=str_len@entry=-1, needed=needed@entry=1) at xstring.c:110
110 xstring.c: No such file or directory.
(gdb) f 2
#2 0x000000000044037b in _pack_default_job_details
(job_ptr=job_ptr@entry=0x67aeff0, buffer=buffer@entry=0x7fc5d4000c90,
protocol_version=protocol_version@entry=8704) at job_mgr.c:10624
10624 job_mgr.c: No such file or directory.
(gdb) p *detail_ptr
$1 = {magic = 0, acctg_freq = 0x0, accrue_time = 1588402586, argc = 7806, argv = 0x67afb60, begin_time = 1588402586, ckpt_dir = 0x67afae0 "/n/holylfs02/LABS/olveczky_lab/Diego/code/dm/stac", cluster_features = 0x0, contiguous = 0, core_spec = 65534, cpu_bind = 0x0, cpu_bind_type = 65534, cpu_freq_min = 4294967294, cpu_freq_max = 4294967294, cpu_freq_gov = 4294967294, cpus_per_task = 1, orig_cpus_per_task = 1, depend_list = 0x0, dependency = 0x0, orig_dependency = 0x0, env_cnt = 4, env_sup = 0x680fa70, exc_node_bitmap = 0x0, exc_nodes = 0x0, expanding_jobid = 0, extra = 0x0, feature_list = 0x0, features = 0x0, max_cpus = 4294967294, orig_max_cpus = 4294967294, max_nodes = 1, mc_ptr = 0x67afb30, mem_bind = 0x0, mem_bind_type = 0, min_cpus = 1, orig_min_cpus = 1, min_gres_cpu = 0, min_nodes = 1, nice = 2147483648, ntasks_per_node = 0, num_tasks = 1, open_mode = 0 '\000', overcommit = 0 '\000', plane_size = 0, pn_min_cpus = 1, orig_pn_min_cpus = 1, pn_min_memory = 13000, orig_pn_min_memory = 13000, pn_min_tmp_disk = 0, prolog_running = 0 '\000', reserved_resources = 0, req_node_bitmap = 0x0, preempt_start_time = 0, req_nodes = 0x0, requeue = 1, restart_dir = 0x0, share_res = 254 '\376', std_err = 0x67af980 "/n/holylfs02/LABS/olveczky_lab/Diego/code/dm/stac/logs/Job.compute_stac.%N.%j.err", std_in = 0x67af9f0 "/dev/null", std_out = 0x67afa20 "/n/holylfs02/LABS/olveczky_lab/Diego/code/dm/stac/logs/Job.compute_stac.%N.%j.out", submit_time = 1588402465, task_dist = 8192, usable_nodes = 0, whole_node = 0 '\000', work_dir = 0x67afa90 "/n/holylfs02/LABS/olveczky_lab/Diego/code/dm/stac", x11 = 0, x11_magic_cookie = 0x0, x11_target = 0x0, x11_target_port = 0}
(gdb)
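The argc = 7806 above is the smoking gun. A rough way to hunt for other jobs with pathologically large records, using scontrol's one-line output mode as a proxy for record size; this walks every queued job, so use it sparingly on an already-wedged controller, and the 100 KB threshold is arbitrary:

# flag jobs whose full job record is suspiciously large
for id in $(squeue -h -o %A); do
    n=$(scontrol -o show job "$id" | wc -c)
    [ "$n" -gt 100000 ] && echo "$id $n"
done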
How many jobs of this type are in the queue now? It looks like those jobs have crazy sizes of argv:

argc = 7806, argv = 0x67afb60

Slurmctld packs this in a suboptimal way, and I suspect this can be the root of the problem.

The user in question:

[root@holy-slurm02 bug-8978]# /usr/bin/squeue -u daldarondo
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
54727756_[99-105,1 serial_re compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_278 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_279 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_272 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_282 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_269 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_270 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_275 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_277 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_280 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_283 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_281 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_268 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_285 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_276 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_273 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_271 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_284 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_274 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)
54727756_313 olveczky, compute_ daldaron PD 0:00 1 (PartitionDown)

I can cancel all those and have a conversation with him. Obviously that submission line is crazy; I have never seen one like it in our environment. Let me know if you want me to axe those jobs, as I want to make sure you are done looking. Once they clear I can try restarting the scheduler. I'm sure it will perform much better once these are purged. Please confirm I can cancel, as I will wait for your go-ahead.

-Paul Edmon-

Yes, please cancel those jobs. If this is the source of the problem, we have enough data to analyze.

Dominik

Yes, that did it. We are back at full capacity. Thank you so much for your assistance. I would have never found those jobs unless you dug into that. Those were just bizarre.

Can you add a limit to the size of argv and that line to prevent this from happening again? I don't really have a good way of stopping this sort of abuse, and if it can grind the scheduler to a halt, it should be prevented by the scheduler.

Thank you again!
-Paul Edmon-

I'm glad to hear that. For sure, this can be blocked by a submit plugin.

We will also try to make the packing code smarter. We will discuss this internally on Monday, and I will let you know what we decide to do with it.

Can we drop the severity of this bug now?

Yes, this is resolved.

-Paul Edmon-

(In reply to Paul Edmon from comment #88)
> Yes, this is resolved.
Dropping to sev4 per response.

Hi,

This commit should prevent a similar problem in the future. It will be included in 20.02.3 and above:
https://github.com/SchedMD/slurm/commit/f17cf91ccc56ccf87
Let me know if you have any additional questions about this or if we can close this ticket.

Dominik

Thanks. Appreciate the fix. You can close this.

-Paul Edmon-

This looks similar to https://bugs.schedmd.com/show_bug.cgi?id=8246. At least the fix in https://github.com/SchedMD/slurm/commit/f17cf91ccc56ccf87 looks like it should fix #8246 as well (except for the poor performance with lots of mmaps of small files, though my patch for #8246 didn't address that either).

*** Ticket 9217 has been marked as a duplicate of this ticket. ***

*** Ticket 10178 has been marked as a duplicate of this ticket. ***
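Dominik's note that this "can be blocked by submit plugin" refers to a job_submit plugin; as a site-side stopgap in the same spirit, a hypothetical sbatch wrapper could refuse absurd argument counts before they ever reach slurmctld. Everything here (path, threshold) is made up for illustration:

#!/bin/sh
# hypothetical site wrapper installed ahead of the real sbatch in PATH
if [ "$#" -gt 1024 ]; then
    echo "sbatch wrapper: refusing submission with $# arguments" >&2
    exit 1
fi
exec /usr/bin/sbatch "$@"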