Ticket 8061 - Two Users Cancelling Thousands of Jobs
Summary: Two Users Cancelling Thousands of Jobs
Status: RESOLVED DUPLICATE of ticket 7928
Alias: None
Product: Slurm
Classification: Unclassified
Component: Scheduling
Version: 19.05.3
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Nate Rini
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2019-11-06 11:09 MST by Paul Edmon
Modified: 2019-12-04 09:10 MST

See Also:
Site: Harvard University
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments
Job Submit Lua Script (14.85 KB, text/plain)
2019-11-12 08:56 MST, Paul Edmon

Description Paul Edmon 2019-11-06 11:09:58 MST
We had a situation this afternoon where two users tried to cancel all their jobs.  This caused slurmctld to lock up due to high thread count as it tried to purge all those jobs while also continuing normal operations (i.e. answering queries about job state, scheduling, etc.).  Obviously this is not optimal, as I had to set all partitions to INACTIVE and go into defer mode to get the scheduler responsive enough to purge all these jobs.

Could bulk job cancellation be throttled in some way, or given a lower priority?  In some cases you certainly want to cancel ASAP; in others the cancellation can work itself out over the next several minutes.  When thousands of jobs all cancel at the same time on a system with hundreds of users and hundreds of thousands of jobs, a simultaneous cancellation can be a killer and lock up the scheduler.  (In this case the scheduler was still cancelling these jobs even 10 minutes after I managed to get it responsive again.)
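
For illustration, a client-side throttle might look roughly like the sketch below (just a sketch: BATCH and PAUSE are made-up values, not Slurm options; it simply feeds a user's job IDs to scancel in batches rather than all at once):

#!/bin/bash
# Hypothetical throttled bulk-cancel: cancel jobs in batches with a
# pause between batches instead of one mass scancel.
BATCH=200
PAUSE=5
squeue -h -u "$USER" -o '%A' |
while mapfile -t -n "$BATCH" ids && (( ${#ids[@]} )); do
    scancel "${ids[@]}"   # cancel one batch of job IDs
    sleep "$PAUSE"        # give slurmctld time to drain its queues
done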

Clearly, cancellation needs to be reworked for the large-scale case.  We can't have one or two users locking up the whole scheduler just because they decide to nuke all their jobs.

Thanks.

-Paul Edmon-
Comment 1 Nate Rini 2019-11-06 11:28:34 MST
Paul,

Please attach your slurm.conf.

Thanks,
--Nate
Comment 2 Paul Edmon 2019-11-06 11:50:53 MST
Created attachment 12240 [details]
slurm.conf
Comment 3 Paul Edmon 2019-11-06 11:51:10 MST
Created attachment 12241 [details]
topology.conf
Comment 4 Paul Edmon 2019-11-06 11:51:24 MST
I've attached them.
Comment 5 Nate Rini 2019-11-07 10:45:13 MST
(In reply to Paul Edmon from comment #0)
> We had a situation this afternoon where two users tried to cancel all their
> jobs.

Were these jobs RUNNING or PENDING at the time?
Comment 6 Paul Edmon 2019-11-07 11:00:55 MST
Both.  Some of the jobs were running, others were pending.

-Paul Edmon-

Comment 7 Nate Rini 2019-11-07 11:02:49 MST
(In reply to Paul Edmon from comment #6)
> Both.  Some of the jobs were running, others were pending.

Is it possible to get your slurmctld log during the event? How long does /usr/local/bin/slurm_epilog take to execute?
Comment 8 Paul Edmon 2019-11-07 11:03:59 MST
You mean you want the log?  I can provide that if you want.

As for the epilog: it should be pretty quick. I will attach it.

-Paul Edmon-

Comment 9 Paul Edmon 2019-11-07 11:05:23 MST
Created attachment 12264 [details]
slurm epilog
Comment 10 Nate Rini 2019-11-07 11:07:30 MST
(In reply to Paul Edmon from comment #8)
> You mean you want the log?  I can provide that if you want.
Yes, the slurmctld log during the event.

Slurm should have no issue cancelling PENDING jobs, but killing RUNNING jobs can be very slow since it requires communication with, and responses from, the compute nodes. The logs will be helpful in figuring out what is going slow.
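
As an aside, sdiag shows where that backlog accumulates on the controller; watching it during an event can be telling:

$ sdiag | grep -Ei 'server thread|agent'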
Comment 11 Paul Edmon 2019-11-07 11:19:53 MST
Created attachment 12266 [details]
Slurm log for November 6th 11 - 13
Comment 12 Nate Rini 2019-11-07 12:12:05 MST
> holy-slurm02 slurmctld[22405]: error: Munge decode failed: Expired credential
Can you please verify that the system clock is in sync on all nodes?
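
For reference, a rough way to spot-check skew from the controller (a sketch; the hostlist is a placeholder for your nodes):

#!/bin/bash
# Compare each node's epoch time against the controller's. Munge rejects
# credentials whose timestamps fall outside its TTL window, so skew of
# more than a few minutes produces "Expired credential" errors.
for host in $(scontrol show hostnames 'holy7c[22501-22524]'); do
    remote=$(ssh -o ConnectTimeout=5 "$host" date +%s)
    echo "$host skew=$(( remote - $(date +%s) ))s"
done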
Comment 14 Paul Edmon 2019-11-07 13:12:39 MST
Yes, the system clock is synced.  That munge decode issue tends to happen when the thread count gets high and Slurm is backlogged with traffic.  Essentially it is replaying late messages.

-Paul Edmon-

Comment 15 Nate Rini 2019-11-07 14:03:46 MST
(In reply to Paul Edmon from comment #14)
> Yes system clock is synced.
We always prefer to double-check when this error shows up.

Examining your logs.
Comment 17 Nate Rini 2019-11-07 14:17:45 MST
> Nov  6 12:28:46 holy-slurm02 slurmctld[34262]: Warning: Note very large processing time from read_slurm_conf: usec=38812931 began=12:28:07.635

That read of slurm.conf took nearly 39 seconds. What kind of filesystem is the slurm.conf stored on? If it is a network filesystem (Lustre, NFS, GPFS, etc.) were there any waiters/stalls around this time?
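
If it is NFS, stalls often leave traces in the kernel log; something like this on the controller around that window can confirm (a sketch):

$ dmesg -T | grep -Ei 'nfs|hung task|blocked for more than'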
Comment 18 Nate Rini 2019-11-07 14:36:50 MST
Looks like you're using Puppet for configuration management. Does Puppet keep slurm.conf in sync before starting the slurmd daemons?
Comment 19 Nate Rini 2019-11-07 14:44:27 MST
> Nov  6 12:59:48 holy-slurm02 slurmctld[79605]: job_submit.lua: sacctmgr failed to add account salomon_lab with exit status 256
> Nov  6 12:59:48 holy-slurm02 slurmctld[79605]: job_submit.lua: added association to salomon_lab for faye16

Does your job_submit lua plugin call sacctmgr?
Comment 20 Nate Rini 2019-11-07 14:59:07 MST
Please also run this:
> $ scontrol show config|grep -i debug
Comment 21 Paul Edmon 2019-11-08 08:17:53 MST
[root@holy7c22501 ~]# scontrol show config | grep -i debug
DebugFlags              = (null)
SlurmctldDebug          = info
SlurmctldSyslogDebug    = verbose
SlurmdDebug             = info
SlurmdSyslogDebug       = verbose

As for your questions:

1. Our slurm.conf is hosted on an NFS mount exported from our Slurm master.

2. Puppet does not manage slurm.conf except to lay it down on the master, which then propagates it via NFS out to the cluster.

3. job_submit.lua does call sacctmgr to see if a user is in the database, and if not, adds them to the database (roughly the logic sketched below).
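
The gist of what the script shells out to is roughly this (a sketch with placeholder names, not the actual plugin code):

# Check whether a user exists in the accounting database; if not, add
# the account and association. $user and $account are placeholders.
if [ -z "$(sacctmgr -n show user "$user")" ]; then
    sacctmgr -i add account "$account"
    sacctmgr -i add user "$user" account="$account"
fi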

-Paul Edmon-

Comment 22 Nate Rini 2019-11-08 21:19:12 MST
(In reply to Paul Edmon from comment #21)
> 1. Our slurm.conf is hosted as a NFS mount from our slurm master.
There are a good number of hash mismatch errors in the logs. Are all of the Slurm daemons being restarted when slurm.conf is changed?
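
Since the file is NFS-shared, every node sees the same bytes, so hash mismatches usually mean a daemon is still running with the copy it read at startup. One rough heuristic for flagging those (a sketch; the path and hostlist are placeholders, and it ignores reconfigures that re-read the file without a restart):

#!/bin/bash
# Flag nodes whose slurmd started before the last slurm.conf change.
conf_mtime=$(stat -c %Y /etc/slurm/slurm.conf)
for host in $(scontrol show hostnames 'holy7c[22501-22524]'); do
    started=$(ssh "$host" 'stat -c %Y "/proc/$(pgrep -o slurmd)"')
    (( started < conf_mtime )) && echo "$host: slurmd older than slurm.conf"
done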
 
> 3. job_submit.lua does call sacctmgr to see if a user is in the 
> database, and if not add it to the database.
I would expect this operation to have a potential race condition where the user addition is known by scontrol before the job is examined. I didn't see any evidence of this in the logs though.

Are there any scontrol commands calling the job_submit.lua? How often do new users need to be added?
Comment 23 Paul Edmon 2019-11-08 21:59:34 MST
1. We usually run scontrol reconfigure or a global restart when we update the config.

2. I've never seen an issue with it.

job_submit.lua does pull information from the scheduler about partitions and fairshare to do some gating logic.  It does this through the API.  We probably add about 2-3 users a day.  The script itself caches the list of users it has already looked up, so the load on the slurmdbd is low.  I can send a copy of the script if you like.  Let me know.

-Paul Edmon-

Comment 24 Nate Rini 2019-11-11 11:51:44 MST
(In reply to Paul Edmon from comment #23)
> 1. We usually run scontrol reconfigure or a global restart when we 
> update the config.

As long as everything is restarted, it should be fine.
 
> 2. I've never seen an issue with it.

This ticket may be related.
 
> job_submit.lua does pull information from the scheduler about partitions 
> and fairshare to do some gating logic.  It does it through the API.  We 
> probably add about 2-3 users a day.  The script itself caches the list 
> of users that it has looked up already so the load on the slurmdb is 
> low.  I can send a copy of the script if you like.  Let me know.
Please attach it.

Can you please also run the following on your slurmctld node:
> lsmem
> lscpu
> cat /proc/meminfo
Comment 26 Paul Edmon 2019-11-12 08:56:09 MST
Created attachment 12304 [details]
Job Submit Lua Script
Comment 27 Paul Edmon 2019-11-12 08:56:50 MST
[root@holy-slurm02 ~]# lsmem
RANGE                                  SIZE  STATE REMOVABLE     BLOCK
0x0000000000000000-0x0000000077ffffff  1.9G online        no      0-14
0x0000000078000000-0x000000007fffffff  128M online       yes        15
0x0000000100000000-0x00000001c7ffffff  3.1G online        no     32-56
0x00000001c8000000-0x00000001cfffffff  128M online       yes        57
0x00000001d0000000-0x000000021fffffff  1.3G online        no     58-67
0x0000000220000000-0x000000022fffffff  256M online       yes     68-69
0x0000000230000000-0x000000023fffffff  256M online        no     70-71
0x0000000240000000-0x000000024fffffff  256M online       yes     72-73
0x0000000250000000-0x0000000267ffffff  384M online        no     74-76
0x0000000268000000-0x000000028fffffff  640M online       yes     77-81
0x0000000290000000-0x00000002afffffff  512M online        no     82-85
0x00000002b0000000-0x00000002b7ffffff  128M online       yes        86
0x00000002b8000000-0x00000002bfffffff  128M online        no        87
0x00000002c0000000-0x00000002c7ffffff  128M online       yes        88
0x00000002c8000000-0x00000002cfffffff  128M online        no        89
0x00000002d0000000-0x00000002d7ffffff  128M online       yes        90
0x00000002d8000000-0x0000000317ffffff    1G online        no     91-98
0x0000000318000000-0x0000000327ffffff  256M online       yes    99-100
0x0000000328000000-0x0000000367ffffff    1G online        no   101-108
0x0000000368000000-0x000000036fffffff  128M online       yes       109
0x0000000370000000-0x0000000387ffffff  384M online        no   110-112
0x0000000388000000-0x0000000397ffffff  256M online       yes   113-114
0x0000000398000000-0x000000039fffffff  128M online        no       115
0x00000003a0000000-0x00000003a7ffffff  128M online       yes       116
0x00000003a8000000-0x00000005cfffffff  8.6G online        no   117-185
0x00000005d0000000-0x00000005d7ffffff  128M online       yes       186
0x00000005d8000000-0x0000000617ffffff    1G online        no   187-194
0x0000000618000000-0x000000061fffffff  128M online       yes       195
0x0000000620000000-0x0000000627ffffff  128M online        no       196
0x0000000628000000-0x000000062fffffff  128M online       yes       197
0x0000000630000000-0x0000000637ffffff  128M online        no       198
0x0000000638000000-0x000000063fffffff  128M online       yes       199
0x0000000640000000-0x000000064fffffff  256M online        no   200-201
0x0000000650000000-0x0000000657ffffff  128M online       yes       202
0x0000000658000000-0x000000067fffffff  640M online        no   203-207
0x0000000680000000-0x0000000687ffffff  128M online       yes       208
0x0000000688000000-0x00000006b7ffffff  768M online        no   209-214
0x00000006b8000000-0x00000006bfffffff  128M online       yes       215
0x00000006c0000000-0x00000006f7ffffff  896M online        no   216-222
0x00000006f8000000-0x0000000707ffffff  256M online       yes   223-224
0x0000000708000000-0x000000071fffffff  384M online        no   225-227
0x0000000720000000-0x0000000727ffffff  128M online       yes       228
0x0000000728000000-0x0000000757ffffff  768M online        no   229-234
0x0000000758000000-0x000000075fffffff  128M online       yes       235
0x0000000760000000-0x000000078fffffff  768M online        no   236-241
0x0000000790000000-0x0000000797ffffff  128M online       yes       242
0x0000000798000000-0x00000007f7ffffff  1.5G online        no   243-254
0x00000007f8000000-0x00000007ffffffff  128M online       yes       255
0x0000000800000000-0x0000000837ffffff  896M online        no   256-262
0x0000000838000000-0x0000000847ffffff  256M online       yes   263-264
0x0000000848000000-0x000000087fffffff  896M online        no   265-271
0x0000000880000000-0x0000000887ffffff  128M online       yes       272
0x0000000888000000-0x0000000897ffffff  256M online        no   273-274
0x0000000898000000-0x000000089fffffff  128M online       yes       275
0x00000008a0000000-0x00000008bfffffff  512M online        no   276-279
0x00000008c0000000-0x00000008c7ffffff  128M online       yes       280
0x00000008c8000000-0x0000000937ffffff  1.8G online        no   281-294
0x0000000938000000-0x000000093fffffff  128M online       yes       295
0x0000000940000000-0x000000097fffffff    1G online        no   296-303
0x0000000980000000-0x0000000987ffffff  128M online       yes       304
0x0000000988000000-0x00000009a7ffffff  512M online        no   305-308
0x00000009a8000000-0x00000009afffffff  128M online       yes       309
0x00000009b0000000-0x00000009b7ffffff  128M online        no       310
0x00000009b8000000-0x00000009bfffffff  128M online       yes       311
0x00000009c0000000-0x0000000a07ffffff  1.1G online        no   312-320
0x0000000a08000000-0x0000000a0fffffff  128M online       yes       321
0x0000000a10000000-0x0000000a27ffffff  384M online        no   322-324
0x0000000a28000000-0x0000000a3fffffff  384M online       yes   325-327
0x0000000a40000000-0x0000000aafffffff  1.8G online        no   328-341
0x0000000ab0000000-0x0000000ab7ffffff  128M online       yes       342
0x0000000ab8000000-0x0000000bc7ffffff  4.3G online        no   343-376
0x0000000bc8000000-0x0000000bcfffffff  128M online       yes       377
0x0000000bd0000000-0x0000000be7ffffff  384M online        no   378-380
0x0000000be8000000-0x0000000befffffff  128M online       yes       381
0x0000000bf0000000-0x0000000c6fffffff    2G online        no   382-397
0x0000000c70000000-0x0000000c7fffffff  256M online       yes   398-399
0x0000000c80000000-0x0000000cd7ffffff  1.4G online        no   400-410
0x0000000cd8000000-0x0000000cdfffffff  128M online       yes       411
0x0000000ce0000000-0x0000000cf7ffffff  384M online        no   412-414
0x0000000cf8000000-0x0000000cffffffff  128M online       yes       415
0x0000000d00000000-0x0000000d7fffffff    2G online        no   416-431
0x0000000d80000000-0x0000000d87ffffff  128M online       yes       432
0x0000000d88000000-0x0000000d9fffffff  384M online        no   433-435
0x0000000da0000000-0x0000000da7ffffff  128M online       yes       436
0x0000000da8000000-0x0000000dcfffffff  640M online        no   437-441
0x0000000dd0000000-0x0000000dd7ffffff  128M online       yes       442
0x0000000dd8000000-0x0000000de7ffffff  256M online        no   443-444
0x0000000de8000000-0x0000000defffffff  128M online       yes       445
0x0000000df0000000-0x0000000df7ffffff  128M online        no       446
0x0000000df8000000-0x0000000dffffffff  128M online       yes       447
0x0000000e00000000-0x0000000e17ffffff  384M online        no   448-450
0x0000000e18000000-0x0000000e1fffffff  128M online       yes       451
0x0000000e20000000-0x0000000e37ffffff  384M online        no   452-454
0x0000000e38000000-0x0000000e3fffffff  128M online       yes       455
0x0000000e40000000-0x0000000e97ffffff  1.4G online        no   456-466
0x0000000e98000000-0x0000000e9fffffff  128M online       yes       467
0x0000000ea0000000-0x0000000ec7ffffff  640M online        no   468-472
0x0000000ec8000000-0x0000000ecfffffff  128M online       yes       473
0x0000000ed0000000-0x0000000ee7ffffff  384M online        no   474-476
0x0000000ee8000000-0x0000000eefffffff  128M online       yes       477
0x0000000ef0000000-0x0000000f37ffffff  1.1G online        no   478-486
0x0000000f38000000-0x0000000f47ffffff  256M online       yes   487-488
0x0000000f48000000-0x0000000f7fffffff  896M online        no   489-495
0x0000000f80000000-0x0000000f87ffffff  128M online       yes       496
0x0000000f88000000-0x0000000fb7ffffff  768M online        no   497-502
0x0000000fb8000000-0x0000000fc7ffffff  256M online       yes   503-504
0x0000000fc8000000-0x000000100fffffff  1.1G online        no   505-513
0x0000001010000000-0x0000001017ffffff  128M online       yes       514
0x0000001018000000-0x0000001107ffffff  3.8G online        no   515-544
0x0000001108000000-0x000000110fffffff  128M online       yes       545
0x0000001110000000-0x0000001117ffffff  128M online        no       546
0x0000001118000000-0x000000111fffffff  128M online       yes       547
0x0000001120000000-0x00000011d7ffffff  2.9G online        no   548-570
0x00000011d8000000-0x00000011dfffffff  128M online       yes       571
0x00000011e0000000-0x000000120fffffff  768M online        no   572-577
0x0000001210000000-0x0000001217ffffff  128M online       yes       578
0x0000001218000000-0x0000001227ffffff  256M online        no   579-580
0x0000001228000000-0x000000122fffffff  128M online       yes       581
0x0000001230000000-0x0000001247ffffff  384M online        no   582-584
0x0000001248000000-0x0000001257ffffff  256M online       yes   585-586
0x0000001258000000-0x000000125fffffff  128M online        no       587
0x0000001260000000-0x0000001267ffffff  128M online       yes       588
0x0000001268000000-0x000000126fffffff  128M online        no       589
0x0000001270000000-0x0000001277ffffff  128M online       yes       590
0x0000001278000000-0x000000128fffffff  384M online        no   591-593
0x0000001290000000-0x000000129fffffff  256M online       yes   594-595
0x00000012a0000000-0x00000012a7ffffff  128M online        no       596
0x00000012a8000000-0x00000012b7ffffff  256M online       yes   597-598
0x00000012b8000000-0x00000012cfffffff  384M online        no   599-601
0x00000012d0000000-0x00000012d7ffffff  128M online       yes       602
0x00000012d8000000-0x00000012dfffffff  128M online        no       603
0x00000012e0000000-0x00000012e7ffffff  128M online       yes       604
0x00000012e8000000-0x000000134fffffff  1.6G online        no   605-617
0x0000001350000000-0x0000001357ffffff  128M online       yes       618
0x0000001358000000-0x0000001367ffffff  256M online        no   619-620
0x0000001368000000-0x0000001377ffffff  256M online       yes   621-622
0x0000001378000000-0x000000139fffffff  640M online        no   623-627
0x00000013a0000000-0x00000013a7ffffff  128M online       yes       628
0x00000013a8000000-0x00000013e7ffffff    1G online        no   629-636
0x00000013e8000000-0x00000013f7ffffff  256M online       yes   637-638
0x00000013f8000000-0x00000013ffffffff  128M online        no       639
0x0000001400000000-0x0000001407ffffff  128M online       yes       640
0x0000001408000000-0x000000146fffffff  1.6G online        no   641-653
0x0000001470000000-0x0000001477ffffff  128M online       yes       654
0x0000001478000000-0x00000014c7ffffff  1.3G online        no   655-664
0x00000014c8000000-0x00000014d7ffffff  256M online       yes   665-666
0x00000014d8000000-0x000000150fffffff  896M online        no   667-673
0x0000001510000000-0x0000001517ffffff  128M online       yes       674
0x0000001518000000-0x000000151fffffff  128M online        no       675
0x0000001520000000-0x0000001527ffffff  128M online       yes       676
0x0000001528000000-0x0000001537ffffff  256M online        no   677-678
0x0000001538000000-0x000000153fffffff  128M online       yes       679
0x0000001540000000-0x000000156fffffff  768M online        no   680-685
0x0000001570000000-0x0000001577ffffff  128M online       yes       686
0x0000001578000000-0x0000001597ffffff  512M online        no   687-690
0x0000001598000000-0x000000159fffffff  128M online       yes       691
0x00000015a0000000-0x00000015ffffffff  1.5G online        no   692-703
0x0000001600000000-0x0000001607ffffff  128M online       yes       704
0x0000001608000000-0x000000160fffffff  128M online        no       705
0x0000001610000000-0x0000001617ffffff  128M online       yes       706
0x0000001618000000-0x000000167fffffff  1.6G online        no   707-719
0x0000001680000000-0x000000168fffffff  256M online       yes   720-721
0x0000001690000000-0x00000016dfffffff  1.3G online        no   722-731
0x00000016e0000000-0x00000016e7ffffff  128M online       yes       732
0x00000016e8000000-0x00000016ffffffff  384M online        no   733-735
0x0000001700000000-0x0000001707ffffff  128M online       yes       736
0x0000001708000000-0x000000179fffffff  2.4G online        no   737-755
0x00000017a0000000-0x00000017a7ffffff  128M online       yes       756
0x00000017a8000000-0x00000017f7ffffff  1.3G online        no   757-766
0x00000017f8000000-0x00000017ffffffff  128M online       yes       767
0x0000001800000000-0x0000001837ffffff  896M online        no   768-774
0x0000001838000000-0x000000183fffffff  128M online       yes       775
0x0000001840000000-0x0000001897ffffff  1.4G online        no   776-786
0x0000001898000000-0x00000018a7ffffff  256M online       yes   787-788
0x00000018a8000000-0x0000001927ffffff    2G online        no   789-804
0x0000001928000000-0x000000192fffffff  128M online       yes       805
0x0000001930000000-0x000000195fffffff  768M online        no   806-811
0x0000001960000000-0x0000001967ffffff  128M online       yes       812
0x0000001968000000-0x0000001a3fffffff  3.4G online        no   813-839
0x0000001a40000000-0x0000001a47ffffff  128M online       yes       840
0x0000001a48000000-0x0000001ae7ffffff  2.5G online        no   841-860
0x0000001ae8000000-0x0000001aefffffff  128M online       yes       861
0x0000001af0000000-0x0000001b2fffffff    1G online        no   862-869
0x0000001b30000000-0x0000001b37ffffff  128M online       yes       870
0x0000001b38000000-0x0000001c5fffffff  4.6G online        no   871-907
0x0000001c60000000-0x0000001c67ffffff  128M online       yes       908
0x0000001c68000000-0x0000001cffffffff  2.4G online        no   909-927
0x0000001d00000000-0x0000001d07ffffff  128M online       yes       928
0x0000001d08000000-0x0000001d37ffffff  768M online        no   929-934
0x0000001d38000000-0x0000001d3fffffff  128M online       yes       935
0x0000001d40000000-0x0000001d5fffffff  512M online        no   936-939
0x0000001d60000000-0x0000001d67ffffff  128M online       yes       940
0x0000001d68000000-0x0000001d97ffffff  768M online        no   941-946
0x0000001d98000000-0x0000001da7ffffff  256M online       yes   947-948
0x0000001da8000000-0x0000002087ffffff 11.5G online        no  949-1040
0x0000002088000000-0x000000209fffffff  384M online       yes 1041-1043
0x00000020a0000000-0x00000020b7ffffff  384M online        no 1044-1046
0x00000020b8000000-0x00000020c7ffffff  256M online       yes 1047-1048
0x00000020c8000000-0x0000002127ffffff  1.5G online        no 1049-1060
0x0000002128000000-0x0000002137ffffff  256M online       yes 1061-1062
0x0000002138000000-0x0000002147ffffff  256M online        no 1063-1064
0x0000002148000000-0x000000214fffffff  128M online       yes      1065
0x0000002150000000-0x000000216fffffff  512M online        no 1066-1069
0x0000002170000000-0x000000218fffffff  512M online       yes 1070-1073
0x0000002190000000-0x00000021b7ffffff  640M online        no 1074-1078
0x00000021b8000000-0x00000021bfffffff  128M online       yes      1079
0x00000021c0000000-0x00000021cfffffff  256M online        no 1080-1081
0x00000021d0000000-0x00000021d7ffffff  128M online       yes      1082
0x00000021d8000000-0x0000002207ffffff  768M online        no 1083-1088
0x0000002208000000-0x000000220fffffff  128M online       yes      1089
0x0000002210000000-0x0000002237ffffff  640M online        no 1090-1094
0x0000002238000000-0x000000223fffffff  128M online       yes      1095
0x0000002240000000-0x0000002277ffffff  896M online        no 1096-1102
0x0000002278000000-0x000000227fffffff  128M online       yes      1103
0x0000002280000000-0x0000002287ffffff  128M online        no      1104
0x0000002288000000-0x0000002297ffffff  256M online       yes 1105-1106
0x0000002298000000-0x000000232fffffff  2.4G online        no 1107-1125
0x0000002330000000-0x000000233fffffff  256M online       yes 1126-1127
0x0000002340000000-0x0000002347ffffff  128M online        no      1128
0x0000002348000000-0x000000234fffffff  128M online       yes      1129
0x0000002350000000-0x00000023afffffff  1.5G online        no 1130-1141
0x00000023b0000000-0x00000023b7ffffff  128M online       yes      1142
0x00000023b8000000-0x00000023cfffffff  384M online        no 1143-1145
0x00000023d0000000-0x00000023e7ffffff  384M online       yes 1146-1148
0x00000023e8000000-0x00000023ffffffff  384M online        no 1149-1151
0x0000002400000000-0x0000002417ffffff  384M online       yes 1152-1154
0x0000002418000000-0x0000002427ffffff  256M online        no 1155-1156
0x0000002428000000-0x000000242fffffff  128M online       yes      1157
0x0000002430000000-0x0000002487ffffff  1.4G online        no 1158-1168
0x0000002488000000-0x000000248fffffff  128M online       yes      1169
0x0000002490000000-0x00000024a7ffffff  384M online        no 1170-1172
0x00000024a8000000-0x00000024afffffff  128M online       yes      1173
0x00000024b0000000-0x00000024c7ffffff  384M online        no 1174-1176
0x00000024c8000000-0x00000024cfffffff  128M online       yes      1177
0x00000024d0000000-0x00000024efffffff  512M online        no 1178-1181
0x00000024f0000000-0x00000024ffffffff  256M online       yes 1182-1183
0x0000002500000000-0x000000250fffffff  256M online        no 1184-1185
0x0000002510000000-0x0000002517ffffff  128M online       yes      1186
0x0000002518000000-0x000000252fffffff  384M online        no 1187-1189
0x0000002530000000-0x0000002537ffffff  128M online       yes      1190
0x0000002538000000-0x000000253fffffff  128M online        no      1191
0x0000002540000000-0x000000254fffffff  256M online       yes 1192-1193
0x0000002550000000-0x0000002587ffffff  896M online        no 1194-1200
0x0000002588000000-0x000000258fffffff  128M online       yes      1201
0x0000002590000000-0x0000002597ffffff  128M online        no      1202
0x0000002598000000-0x000000259fffffff  128M online       yes      1203
0x00000025a0000000-0x000000263fffffff  2.5G online        no 1204-1223
0x0000002640000000-0x0000002647ffffff  128M online       yes      1224
0x0000002648000000-0x0000002677ffffff  768M online        no 1225-1230
0x0000002678000000-0x000000267fffffff  128M online       yes      1231
0x0000002680000000-0x0000002687ffffff  128M online        no      1232
0x0000002688000000-0x0000002697ffffff  256M online       yes 1233-1234
0x0000002698000000-0x000000277fffffff  3.6G online        no 1235-1263
0x0000002780000000-0x0000002787ffffff  128M online       yes      1264
0x0000002788000000-0x000000279fffffff  384M online        no 1265-1267
0x00000027a0000000-0x00000027a7ffffff  128M online       yes      1268
0x00000027a8000000-0x00000027b7ffffff  256M online        no 1269-1270
0x00000027b8000000-0x00000027bfffffff  128M online       yes      1271
0x00000027c0000000-0x0000002807ffffff  1.1G online        no 1272-1280
0x0000002808000000-0x000000280fffffff  128M online       yes      1281
0x0000002810000000-0x0000002827ffffff  384M online        no 1282-1284
0x0000002828000000-0x000000282fffffff  128M online       yes      1285
0x0000002830000000-0x000000283fffffff  256M online        no 1286-1287
0x0000002840000000-0x0000002847ffffff  128M online       yes      1288
0x0000002848000000-0x000000286fffffff  640M online        no 1289-1293
0x0000002870000000-0x0000002877ffffff  128M online       yes      1294
0x0000002878000000-0x0000002887ffffff  256M online        no 1295-1296
0x0000002888000000-0x000000288fffffff  128M online       yes      1297
0x0000002890000000-0x0000002897ffffff  128M online        no      1298
0x0000002898000000-0x000000289fffffff  128M online       yes      1299
0x00000028a0000000-0x00000028d7ffffff  896M online        no 1300-1306
0x00000028d8000000-0x00000028dfffffff  128M online       yes      1307
0x00000028e0000000-0x0000002937ffffff  1.4G online        no 1308-1318
0x0000002938000000-0x0000002947ffffff  256M online       yes 1319-1320
0x0000002948000000-0x00000029afffffff  1.6G online        no 1321-1333
0x00000029b0000000-0x00000029b7ffffff  128M online       yes      1334
0x00000029b8000000-0x0000002a2fffffff  1.9G online        no 1335-1349
0x0000002a30000000-0x0000002a37ffffff  128M online       yes      1350
0x0000002a38000000-0x0000002acfffffff  2.4G online        no 1351-1369
0x0000002ad0000000-0x0000002ad7ffffff  128M online       yes      1370
0x0000002ad8000000-0x0000002ae7ffffff  256M online        no 1371-1372
0x0000002ae8000000-0x0000002aefffffff  128M online       yes      1373
0x0000002af0000000-0x0000002b77ffffff  2.1G online        no 1374-1390
0x0000002b78000000-0x0000002b7fffffff  128M online       yes      1391
0x0000002b80000000-0x0000002b87ffffff  128M online        no      1392
0x0000002b88000000-0x0000002b97ffffff  256M online       yes 1393-1394
0x0000002b98000000-0x0000002bc7ffffff  768M online        no 1395-1400
0x0000002bc8000000-0x0000002bcfffffff  128M online       yes      1401
0x0000002bd0000000-0x0000002d67ffffff  6.4G online        no 1402-1452
0x0000002d68000000-0x0000002d6fffffff  128M online       yes      1453
0x0000002d70000000-0x0000002f7fffffff  8.3G online        no 1454-1519
0x0000002f80000000-0x0000002f87ffffff  128M online       yes      1520
0x0000002f88000000-0x00000033a7ffffff 16.5G online        no 1521-1652
0x00000033a8000000-0x00000033afffffff  128M online       yes      1653
0x00000033b0000000-0x00000033d7ffffff  640M online        no 1654-1658
0x00000033d8000000-0x00000033dfffffff  128M online       yes      1659
0x00000033e0000000-0x0000003427ffffff  1.1G online        no 1660-1668
0x0000003428000000-0x000000342fffffff  128M online       yes      1669
0x0000003430000000-0x0000003457ffffff  640M online        no 1670-1674
0x0000003458000000-0x000000345fffffff  128M online       yes      1675
0x0000003460000000-0x0000003537ffffff  3.4G online        no 1676-1702
0x0000003538000000-0x000000353fffffff  128M online       yes      1703
0x0000003540000000-0x000000358fffffff  1.3G online        no 1704-1713
0x0000003590000000-0x0000003597ffffff  128M online       yes      1714
0x0000003598000000-0x00000035b7ffffff  512M online        no 1715-1718
0x00000035b8000000-0x00000035bfffffff  128M online       yes      1719
0x00000035c0000000-0x00000035c7ffffff  128M online        no      1720
0x00000035c8000000-0x00000035cfffffff  128M online       yes      1721
0x00000035d0000000-0x00000035dfffffff  256M online        no 1722-1723
0x00000035e0000000-0x00000035e7ffffff  128M online       yes      1724
0x00000035e8000000-0x0000003607ffffff  512M online        no 1725-1728
0x0000003608000000-0x000000360fffffff  128M online       yes      1729
0x0000003610000000-0x000000362fffffff  512M online        no 1730-1733
0x0000003630000000-0x0000003637ffffff  128M online       yes      1734
0x0000003638000000-0x000000366fffffff  896M online        no 1735-1741
0x0000003670000000-0x0000003677ffffff  128M online       yes      1742
0x0000003678000000-0x00000038cfffffff  9.4G online        no 1743-1817
0x00000038d0000000-0x00000038d7ffffff  128M online       yes      1818
0x00000038d8000000-0x0000003987ffffff  2.8G online        no 1819-1840
0x0000003988000000-0x000000398fffffff  128M online       yes      1841
0x0000003990000000-0x00000039ffffffff  1.8G online        no 1842-1855
0x0000003a00000000-0x0000003a07ffffff  128M online       yes      1856
0x0000003a08000000-0x0000003a1fffffff  384M online        no 1857-1859
0x0000003a20000000-0x0000003a27ffffff  128M online       yes      1860
0x0000003a28000000-0x0000003b7fffffff  5.4G online        no 1861-1903
0x0000003b80000000-0x0000003b87ffffff  128M online       yes      1904
0x0000003b88000000-0x0000003c17ffffff  2.3G online        no 1905-1922
0x0000003c18000000-0x0000003c1fffffff  128M online       yes      1923
0x0000003c20000000-0x0000003d2fffffff  4.3G online        no 1924-1957
0x0000003d30000000-0x0000003d37ffffff  128M online       yes      1958
0x0000003d38000000-0x0000003d57ffffff  512M online        no 1959-1962
0x0000003d58000000-0x0000003d5fffffff  128M online       yes      1963
0x0000003d60000000-0x0000003eb7ffffff  5.4G online        no 1964-2006
0x0000003eb8000000-0x0000003ebfffffff  128M online       yes      2007
0x0000003ec0000000-0x000000407fffffff    7G online        no 2008-2063

Memory block size:       128M
Total online memory:     256G
Total offline memory:      0B
[root@holy-slurm02 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping:              2
CPU MHz:               1235.571
CPU max MHz:           3500.0000
CPU min MHz:           1200.0000
BogoMIPS:              5194.51
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
[root@holy-slurm02 ~]# cat /proc/meminfo
MemTotal:       263721992 kB
MemFree:        46590816 kB
MemAvailable:   181597940 kB
Buffers:            3628 kB
Cached:         129252828 kB
SwapCached:        22956 kB
Active:         162071260 kB
Inactive:       41296804 kB
Active(anon):   72430076 kB
Inactive(anon):  5941972 kB
Active(file):   89641184 kB
Inactive(file): 35354832 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4194300 kB
SwapFree:        3921156 kB
Dirty:              1052 kB
Writeback:             0 kB
AnonPages:      74090476 kB
Mapped:           213196 kB
Shmem:           4260440 kB
Slab:           11392396 kB
SReclaimable:   10895188 kB
SUnreclaim:       497208 kB
KernelStack:       19632 kB
PageTables:       168496 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    136055296 kB
Committed_AS:   100149100 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      814772 kB
VmallocChunk:   34224547700 kB
HardwareCorrupted:     0 kB
AnonHugePages:   3641344 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      551632 kB
DirectMap2M:    14950400 kB
DirectMap1G:    254803968 kB
Comment 28 Paul Edmon 2019-11-12 08:57:34 MST
As a note, we are planning on replacing our Slurm master next year as the box running it is nearing the end of its warranty.
Comment 29 Nate Rini 2019-11-13 12:57:01 MST
(In reply to Paul Edmon from comment #28)
> As a note, we are planning on replacing our Slurm master next year as the box
> running it is nearing the end of its warranty.

Please see the slides labeled "On performance, time, and such matters" in https://slurm.schedmd.com/SLUG17/FieldNotes.pdf
Comment 30 Paul Edmon 2019-11-13 13:24:18 MST
Yup, I recall that talk.  I plan on following its suggestions for the hardware spec.  Our current box is pretty good, but we can do better.

-Paul Edmon-

Comment 34 Nate Rini 2019-12-02 10:22:31 MST
Paul,

We are still working on this bug. Can we reduce this to a SEV4 since this is not an active issue but more of a research issue at this point?

Thanks,
--Nate
Comment 35 Paul Edmon 2019-12-02 10:23:27 MST
Yup, go right ahead.  We haven't had a recurrence of this issue either.  I'm guessing this type of situation is rare.

-Paul Edmon-

Comment 36 Nate Rini 2019-12-02 10:29:11 MST
(In reply to Paul Edmon from comment #35)
> Yup, go right ahead.
Lowering severity per your response.
Comment 37 Nate Rini 2019-12-04 09:10:25 MST
Paul,

I'm going to mark this as a duplicate of bug#7928 (and bug#7141). In bug#7928 there is already work on a patchset to more gracefully handle large numbers of agent RPCs, which is the same underlying issue as this ticket.

Please respond if you have any questions.

Thanks,
--Nate

*** This ticket has been marked as a duplicate of ticket 7928 ***