We had a situation this afternoon where two users tried to cancel all their jobs. This caused slurmctld to lock up due to a high thread count as it tried to purge all those jobs while also continuing normal operations (answering queries about job state, scheduling, etc.). Obviously this is not optimal: I had to set all partitions to INACTIVE and put the scheduler into defer mode just to get it responsive enough to purge all these jobs.

Could bulk job cancellation be throttled in some way, or given a lower priority? In some cases you certainly want to cancel ASAP; in others the cancellation can work itself out over the next several minutes. When thousands of jobs all cancel at the same time on a system with hundreds of users and hundreds of thousands of jobs, the simultaneous cancellation can be a killer and lock up the scheduler. (In this case the scheduler was still cancelling these jobs 10 minutes after I managed to get it responsive again.) Clearly cancellation needs to be reworked for large-scale use; we can't have one or two users locking up the whole scheduler just because they decide to nuke all their jobs. Thanks.
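In the meantime we will probably script around it on our side. Here is a minimal sketch of the client-side throttling I have in mind (a hypothetical wrapper, not something we run today; the batch size and delay are arbitrary):

#!/bin/bash
# Hypothetical wrapper: cancel one user's jobs in small batches instead of
# all at once, so slurmctld is not hit with thousands of simultaneous
# kill RPCs.
user=${1:?usage: $0 <username>}
batch_size=50   # jobs per scancel invocation (arbitrary)
delay=5         # seconds between batches (arbitrary)

# -h: no header; -o %A: print only job IDs (pending and running alike)
squeue -h -u "$user" -o '%A' |
while mapfile -t -n "$batch_size" ids && ((${#ids[@]})); do
    scancel "${ids[@]}"   # cancel this batch
    sleep "$delay"        # let the controller drain its agent queue
done

Something like that keeps a single user from dumping tens of thousands of kill requests on slurmctld at once, but a server-side throttle would obviously be more robust.

-Paul Edmon-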
Paul,

Please attach your slurm.conf.

Thanks,
--Nate
Created attachment 12240 [details] slurm.conf
Created attachment 12241 [details] topology.conf
I've attached them.
(In reply to Paul Edmon from comment #0)
> We had a situation this afternoon where two users tried to cancel all their
> jobs.

Were these jobs RUNNING or PENDING at the time?
Both. Some of the jobs were running, others were pending.

-Paul Edmon-
(In reply to Paul Edmon from comment #6)
> Both. Some of the jobs were running, others were pending.

Is it possible to get your slurmctld log during the event? How long does /usr/local/bin/slurm_epilog take to execute?
You mean you want the log? I can provide that if you want. As for the epilog, it should be pretty quick; I will attach it.

-Paul Edmon-
Created attachment 12264 [details] slurm epilog
(In reply to Paul Edmon from comment #8)
> You mean you want the log? I can provide that if you want.

Yes, the slurmctld log during the event. Slurm should have no issue cancelling PENDING jobs, but killing RUNNING jobs can be very slow since it requires communication with, and responses from, the compute nodes. The logs will be helpful to figure out what is going slow.
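While you gather that, sdiag can show whether the controller is backed up on these RPCs. A quick sketch (the grep pattern assumes the field names printed by recent sdiag releases; adjust for your version):

# Poll the controller's thread count and outstanding agent work every 10s.
while true; do
    date '+%H:%M:%S'
    sdiag | grep -E 'Server thread count|Agent queue size|Agent count'
    sleep 10
done

If "Agent queue size" stays large for minutes while the cancellations drain, that matches the lockup you described.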
Created attachment 12266 [details] Slurm log for November 6th 11 - 13
> holy-slurm02 slurmctld[22405]: error: Munge decode failed: Expired credential

Can you please verify that the system clock is in sync on all nodes?
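Something like this makes that quick to check, assuming pdsh is installed and a "compute" node group is defined (both are assumptions on my part):

# Print each node's clock as epoch seconds and eyeball the spread.
pdsh -g compute 'date +%s.%N'
# If chrony is in use, the measured offset is more direct:
pdsh -g compute 'chronyc tracking | grep "System time"'

Skew beyond the munge TTL (5 minutes by default) will expire credentials outright, though under a heavy backlog the same error can appear with synced clocks if messages sit queued longer than the TTL.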
Yes, the system clock is synced. That munge decode issue tends to happen when the thread count gets high and Slurm is backlogged with traffic; essentially it is replaying late messages.

-Paul Edmon-
(In reply to Paul Edmon from comment #14)
> Yes, the system clock is synced.

I always prefer to make sure when this error shows up. Examining your logs now.
> Nov 6 12:28:46 holy-slurm02 slurmctld[34262]: Warning: Note very large processing time from read_slurm_conf: usec=38812931 began=12:28:07.635

What kind of filesystem is the slurm.conf stored on? If it is a network filesystem (Lustre, NFS, GPFS, etc.), were there any waiters/stalls around this time?
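If it is on NFS, putting a number on the read latency would help. A minimal sketch (the config path and the sample interval/count are assumptions):

# How long does a single read of slurm.conf off the NFS mount take?
time cat /etc/slurm/slurm.conf > /dev/null
# Watch average NFS round-trip time and retransmit counts (nfs-utils):
nfsiostat 1 5

A 38-second read_slurm_conf suggests the controller may have been stuck waiting on the filesystem (or on internal locks) rather than on CPU.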
It looks like you're using Puppet for configuration management. Does Puppet keep slurm.conf in sync before starting the slurmd daemons?
> Nov 6 12:59:48 holy-slurm02 slurmctld[79605]: job_submit.lua: sacctmgr failed to add account salomon_lab with exit status 256
> Nov 6 12:59:48 holy-slurm02 slurmctld[79605]: job_submit.lua: added association to salomon_lab for faye16

Does your job_submit Lua plugin call sacctmgr?
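From those two log lines it looks like the plugin shells out to roughly the following check-then-add sequence (a hypothetical reconstruction on my part; the account and user names are taken from the log; -n suppresses the header, -P gives parsable output, -i commits without prompting):

# Does the account already exist? If not, create it, then add the user.
if [ -z "$(sacctmgr -nP show account salomon_lab format=Account)" ]; then
    sacctmgr -i add account salomon_lab
fi
sacctmgr -i add user faye16 account=salomon_lab

If that runs inside job_submit(), each forked sacctmgr blocks slurmctld while it holds internal locks, which would compound an event like your cancellation storm.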
Please also call this:

> $ scontrol show config | grep -i debug
[root@holy7c22501 ~]# scontrol show config | grep -i debug
DebugFlags              = (null)
SlurmctldDebug          = info
SlurmctldSyslogDebug    = verbose
SlurmdDebug             = info
SlurmdSyslogDebug       = verbose

As for your questions:

1. Our slurm.conf is hosted as an NFS mount from our slurm master.
2. Puppet does not manage slurm.conf except to lay it down on the master, which then propagates it via NFS out to the cluster.
3. job_submit.lua does call sacctmgr to see if a user is in the database and, if not, to add them to the database.

-Paul Edmon-
(In reply to Paul Edmon from comment #21)
> 1. Our slurm.conf is hosted as an NFS mount from our slurm master.

There are a good number of hash mismatch errors in the logs. Are all of the Slurm daemons being restarted when slurm.conf is changed?

> 3. job_submit.lua does call sacctmgr to see if a user is in the
> database and, if not, to add them to the database.

I would expect this operation to have a potential race condition where the user addition is known by scontrol before the job is examined. I didn't see any evidence of this in the logs, though.

Are there any scontrol commands calling job_submit.lua? How often do new users need to be added?
1. We usually run scontrol reconfigure or do a global restart when we update the config.

2. I've never seen an issue with it. job_submit.lua does pull information from the scheduler about partitions and fairshare to do some gating logic; it does this through the API. We probably add about 2-3 users a day. The script itself caches the list of users it has already looked up, so the load on the slurmdbd is low. I can send a copy of the script if you like. Let me know.

-Paul Edmon-
(In reply to Paul Edmon from comment #23)
> 1. We usually run scontrol reconfigure or do a global restart when we
> update the config.

As long as everything is restarted, it should be fine.

> 2. I've never seen an issue with it.

This ticket may be related.

> job_submit.lua does pull information from the scheduler about partitions
> and fairshare to do some gating logic; it does this through the API. We
> probably add about 2-3 users a day. The script itself caches the list
> of users it has already looked up, so the load on the slurmdbd is low.
> I can send a copy of the script if you like. Let me know.

Please attach it. Can you please also call this on your slurmctld node:

> lsmem
> lscpu
> cat /proc/meminfo
Created attachment 12304 [details] Job Submit Lua Script
[root@holy-slurm02 ~]# lsmem
RANGE                                  SIZE  STATE REMOVABLE  BLOCK
0x0000000000000000-0x0000000077ffffff  1.9G online        no   0-14
0x0000000078000000-0x000000007fffffff  128M online       yes     15
[...several hundred similar online ranges snipped...]
0x0000003d60000000-0x0000003eb7ffffff  5.4G online        no 1964-2006
0x0000003eb8000000-0x0000003ebfffffff  128M online       yes   2007
0x0000003ec0000000-0x000000407fffffff    7G online        no 2008-2063

Memory block size:       128M
Total online memory:     256G
Total offline memory:      0B

[root@holy-slurm02 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                48
On-line CPU(s) list:   0-47
Thread(s) per core:    2
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping:              2
CPU MHz:               1235.571
CPU max MHz:           3500.0000
CPU min MHz:           1200.0000
BogoMIPS:              5194.51
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

[root@holy-slurm02 ~]# cat /proc/meminfo
MemTotal:       263721992 kB
MemFree:         46590816 kB
MemAvailable:   181597940 kB
Buffers:             3628 kB
Cached:         129252828 kB
SwapCached:         22956 kB
Active:         162071260 kB
Inactive:        41296804 kB
Active(anon):    72430076 kB
Inactive(anon):   5941972 kB
Active(file):    89641184 kB
Inactive(file):  35354832 kB
Unevictable:            0 kB
Mlocked:                0 kB
SwapTotal:        4194300 kB
SwapFree:         3921156 kB
Dirty:               1052 kB
Writeback:              0 kB
AnonPages:       74090476 kB
Mapped:            213196 kB
Shmem:            4260440 kB
Slab:            11392396 kB
SReclaimable:    10895188 kB
SUnreclaim:        497208 kB
KernelStack:        19632 kB
PageTables:        168496 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:    136055296 kB
Committed_AS:   100149100 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       814772 kB
VmallocChunk:   34224547700 kB
HardwareCorrupted:      0 kB
AnonHugePages:    3641344 kB
CmaTotal:               0 kB
CmaFree:                0 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
DirectMap4k:       551632 kB
DirectMap2M:     14950400 kB
DirectMap1G:    254803968 kB
As a note, we are planning on replacing our slurm master next year, as the box running it is nearing the end of its warranty.
(In reply to Paul Edmon from comment #28)
> As a note, we are planning on replacing our slurm master next year, as the
> box running it is nearing the end of its warranty.

Please see the slides labeled "On performance, time, and such matters" in https://slurm.schedmd.com/SLUG17/FieldNotes.pdf
Yup, I recall that talk. I plan on following its suggestions for the hardware spec. Our current box is pretty good, but we can do better.

-Paul Edmon-
Paul,

We are still working on this bug. Can we reduce this to a SEV4, since this is not an active issue but more of a research issue at this point?

Thanks,
--Nate
Yup, go right ahead. We haven't had a recurrence of this issue either; I'm guessing this type of situation is rare.

-Paul Edmon-
(In reply to Paul Edmon from comment #35)
> Yup, go right ahead.

Lowering severity per your response.
Paul,

I'm going to mark this as a duplicate of bug#7928 (and bug#7141). In bug#7928 there is already work on a patchset to more gracefully handle large numbers of agent RPCs, which is the same issue as in this ticket. Please respond if you have any questions.

Thanks,
--Nate

*** This ticket has been marked as a duplicate of ticket 7928 ***