| Summary: | select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_5(6031087) | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Jenny Williams <jennyw> |
| Component: | slurmctld | Assignee: | Alejandro Sanchez <alex> |
| Status: | RESOLVED CANNOTREPRODUCE | QA Contact: | |
| Severity: | 4 - Minor Issue | Priority: | --- |
| Version: | 18.08.4 | CC: | bart, marshall, regine.gaudin |
| Hardware: | Linux | OS: | Linux |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=6639, https://bugs.schedmd.com/show_bug.cgi?id=6769 | | |
| Site: | University of North Carolina at Chapel Hill | | |
| Attachments: | slurmctld log, slurm conf | | |
**Description (Jenny Williams <jennyw>, 2019-01-07 11:49:37 MST)**
```
# sinfo -Nl -p spill
Tue Jan 8 10:38:50 2019
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
b1001 1 spill mixed 24 2:12:1 186391 0 1 (null) none
b1002 1 spill mixed 24 2:12:1 186391 0 1 (null) none
b1003 1 spill allocated 24 2:12:1 186391 0 1 (null) none
b1004 1 spill mixed 24 2:12:1 186391 0 1 (null) none
b1005 1 spill mixed 24 2:12:1 186391 0 1 (null) none
b1006 1 spill allocated 24 2:12:1 186391 0 1 (null) none
b1007 1 spill allocated 24 2:12:1 186391 0 1 (null) none
b1008 1 spill allocated 24 2:12:1 186391 0 1 (null) none
b1009 1 spill allocated 24 2:12:1 186391 0 1 (null) none
b1010 1 spill allocated 24 2:12:1 186391 0 1 (null) none
b1011 1 spill mixed 24 2:12:1 186391 0 1 (null) none
b1012 1 spill mixed 24 2:12:1 186391 0 1 (null) none
b1013 1 spill mixed 24 2:12:1 186391 0 1 (null) none
b1014 1 spill mixed 24 2:12:1 186391 0 1 (null) none
b1015 1 spill allocated 24 2:12:1 186391 0 1 (null) none
b1016 1 spill mixed 24 2:12:1 186391 0 1 (null) none
b1017 1 spill allocated 24 2:12:1 186391 0 1 (null) none
b1018 1 spill allocated 24 2:12:1 186391 0 1 (null) none
b1019 1 spill allocated 24 2:12:1 186391 0 1 (null) none
b1020 1 spill allocated 24 2:12:1 186391 0 1 (null) none
b1021 1 spill mixed 24 2:12:1 186391 0 1 (null) none
b1022 1 spill drained* 24 2:12:1 186391 0 1 (null) borrow_nic
b1023 1 spill drained* 24 2:12:1 186391 0 1 (null) borrow_nic
b1024 1 spill drained* 24 2:12:1 186391 0 1 (null) borrow_nic
b1025 1 spill drained* 24 2:12:1 186391 0 1 (null) borrow_nic
b1026 1 spill drained* 24 2:12:1 186391 0 1 (null) borrow_nic
b1027 1 spill drained* 24 2:12:1 186391 0 1 (null) borrow_nic
c0301 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0302 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0303 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0304 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0305 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0306 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0307 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0308 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0309 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0310 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0311 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0312 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0313 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0314 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0315 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0316 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0317 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0318 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0319 1 spill allocated 72 2:36:1 750452 0 1 (null) none
c0320 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0401 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0402 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0403 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0404 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0405 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0406 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0407 1 spill allocated 72 2:36:1 750452 0 1 (null) none
c0408 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0409 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0410 1 spill mixed 72 2:36:1 750452 0 1 (null) none
c0501 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0502 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0503 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0504 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0505 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0506 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0507 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0508 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0509 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0510 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0511 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0512 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0513 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0514 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0515 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0516 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0517 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0518 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0519 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0520 1 spill allocated 56 2:28:1 235520 0 1 (null) none
c0521 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0522 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0523 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0524 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0525 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0526 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0527 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0528 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0529 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0530 1 spill allocated 56 2:28:1 235520 0 1 (null) none
c0531 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0532 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0533 1 spill draining 56 2:28:1 235520 0 1 (null) replace_dimmA8
c0534 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0535 1 spill mixed 56 2:28:1 235520 0 1 (null) none
c0536 1 spill allocated 56 2:28:1 235520 0 1 (null) none
c0537 1 spill allocated 56 2:28:1 235520 0 1 (null) none
c0538 1 spill allocated 56 2:28:1 235520 0 1 (null) none
c0539 1 spill allocated 56 2:28:1 235520 0 1 (null) none
c0540 1 spill allocated 56 2:28:1 235520 0 1 (null) none
c0802 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0803 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0804 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0805 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0806 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0807 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0808 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0809 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0810 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0811 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0812 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0813 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0814 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0815 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0816 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0817 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0818 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0819 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0820 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0821 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0822 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0823 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0824 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0825 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0826 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0827 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0828 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0829 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0830 1 spill drained* 48 2:24:1 235520 0 1 (null) mem_upgrade
c0831 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0832 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0833 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0834 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0835 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0836 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0837 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0838 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0839 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0840 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0901 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0902 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0903 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0904 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0905 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0906 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0907 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0908 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0909 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0910 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0911 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0912 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0913 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0914 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0915 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0916 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0917 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0918 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0919 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0920 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0921 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0922 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0923 1 spill allocated@ 48 2:24:1 364536 0 1 (null) none
c0924 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0925 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0926 1 spill allocated 48 2:24:1 364536 0 1 (null) none
c0927 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0928 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0929 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0930 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0931 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0932 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0933 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0934 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0935 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0936 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0937 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0938 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0939 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c0940 1 spill mixed@ 48 2:24:1 364536 0 1 (null) none
c1101 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1102 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1103 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1104 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1105 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1106 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1107 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1108 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1109 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1110 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1111 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1112 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1113 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1114 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1115 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1116 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1117 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1118 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1119 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1120 1 spill mixed 48 2:24:1 364536 0 1 (null) none
c1121 1 spill mixed 48 2:24:1 235520 0 1 (null) none
c1122 1 spill mixed 48 2:24:1 235520 0 1 (null) none
c1123 1 spill mixed 48 2:24:1 235520 0 1 (null) none
c1124 1 spill mixed 48 2:24:1 235520 0 1 (null) none
c1125 1 spill mixed 48 2:24:1 235520 0 1 (null) none
c1126 1 spill allocated 48 2:24:1 235520 0 1 (null) none
c1127 1 spill allocated 48 2:24:1 235520 0 1 (null) none
c1128 1 spill mixed 48 2:24:1 235520 0 1 (null) none
c1129 1 spill allocated 48 2:24:1 235520 0 1 (null) none
c1130 1 spill allocated 48 2:24:1 235520 0 1 (null) none
c1131 1 spill allocated 48 2:24:1 235520 0 1 (null) none
c1132 1 spill allocated 48 2:24:1 235520 0 1 (null) none
c1133 1 spill allocated 48 2:24:1 235520 0 1 (null) none
c1134 1 spill draining 48 2:24:1 235520 0 1 (null) Kill task failed
c1135 1 spill mixed 48 2:24:1 235520 0 1 (null) none
c1136 1 spill allocated 48 2:24:1 235520 0 1 (null) none
c1137 1 spill allocated 48 2:24:1 235520 0 1 (null) none
c1138 1 spill allocated 48 2:24:1 235520 0 1 (null) none
c1139 1 spill mixed 48 2:24:1 235520 0 1 (null) none
c1140 1 spill allocated 48 2:24:1 235520 0 1 (null) none
```

```
# scontrol show config
Configuration data as of 2019-01-08T10:39:59
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost = m1006
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthInfo = (null)
AuthType = auth/munge
BatchStartTimeout = 10 sec
BOOT_TIME = 2019-01-02T17:13:46
BurstBufferType = (null)
CheckpointType = checkpoint/none
ClusterName = longleaf
CommunicationParameters = (null)
CompleteWait = 0 sec
CoreSpecPlugin = core_spec/none
CpuFreqDef = Unknown
CpuFreqGovernors = Performance,OnDemand,UserSpace
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerNode = UNLIMITED
DisableRootJobs = No
EioTimeout = 60
EnforcePartLimits = ANY
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FastSchedule = 1
FederationParameters = (null)
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = gpu
GroupUpdateForce = 1
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 65533 sec
JobAcctGatherFrequency = task=15
JobAcctGatherType = jobacct_gather/cgroup
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = lua,all_partitions
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 0
KillWait = 30 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = mplus:1,nonmem:32
LicensesUsed = nonmem:0/32,mplus:0/1
LogTimeFormat = iso8601_ms
MailDomain = (null)
MailProg = /bin/mail
MaxArraySize = 40001
MaxJobCount = 350000
MaxJobId = 67043328
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 512
MCSPlugin = mcs/none
MCSParameters = (null)
MemLimitEnforce = No
MessageTimeout = 60 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
MsgAggregationParams = (null)
NEXT_JOB_ID = 6170348
NodeFeaturesPlugins = (null)
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = OFF
PreemptType = preempt/none
PriorityParameters = (null)
PriorityDecayHalfLife = 14-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags = SMALL_RELATIVE_TO_TIME,CALCULATE_RUNNING,FAIR_TREE,MAX_TRES
PriorityMaxAge = 60-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 1000
PriorityWeightFairShare = 10000
PriorityWeightJobSize = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS = 1000
PriorityWeightTRES = CPU=1000,Mem=4000,GRES/gpu=3000
PrivateData = none
ProctrackType = proctrack/cgroup
Prolog = (null)
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = Alloc,Contain
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = /usr/sbin/reboot
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeFailProgram = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 2
RoutePlugin = route/default
SallocDefaultCommand = srun -n1 -N1 --gres=gpu:0 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL
SbcastParameters = (null)
SchedulerParameters = kill_invalid_depend,batch_sched_delay=10,bf_continue,bf_max_job_part=5000,bf_max_job_test=10000,bf_max_job_user=300,bf_resolution=300,bf_window=10080,bf_yield_interval=1000000,default_queue_depth=1000,partition_job_depth=600,sched_min_interval=2000000,defer,max_rpc_cnt=80
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cons_res
SelectTypeParameters = CR_CPU_MEMORY
SlurmUser = slurm(47)
SlurmctldAddr = (null)
SlurmctldDebug = info
SlurmctldHost[0] = longleaf-sched(172.26.113.4)
SlurmctldLogFile = /pine/EX/root/slurm-log/slurmctld.log
SlurmctldPort = 6820-6824
SlurmctldSyslogDebug = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg = (null)
SlurmctldTimeout = 65530 sec
SlurmctldParameters = (null)
SlurmdDebug = info
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdParameters = (null)
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdSyslogDebug = unknown
SlurmdTimeout = 65530 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 18.08.4
SrunEpilog = (null)
SrunPortRange = 0-0
SrunProlog = (null)
StateSaveLocation = /pine/EX/root/slurm-log/slurmctld
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/cgroup
TaskPluginParam = (null type)
TaskProlog = (null)
TCPTimeout = 2 sec
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 50
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 120 sec
VSizeFactor = 0 percent
WaitTime = 0 sec
X11Parameters = (null)

Cgroup Support Configuration:
AllowedDevicesFile = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace = (null)
AllowedRAMSpace = 100.0%
AllowedSwapSpace = 0.0%
CgroupAutomount = yes
CgroupMountpoint = /sys/fs/cgroup
ConstrainCores = yes
ConstrainDevices = no
ConstrainKmemSpace = no
ConstrainRAMSpace = yes
ConstrainSwapSpace = no
MaxKmemPercent = 100.0%
MaxRAMPercent = 100.0%
MaxSwapPercent = 100.0%
MemLimitThreshold = 100.0%
MemoryLimitEnforcement = no
MemorySwappiness = (null)
MinKmemSpace = 30 MB
MinRAMSpace = 30 MB
TaskAffinity = yes

Slurmctld(primary) at longleaf-sched is UP
```
**Comment (Alejandro Sanchez <alex>)**

Hi Jenny,

Each job record has an associated struct job_resources [1] that tracks the resources allocated to the job, including memory_allocated on each of its nodes. Similarly, each node has an associated struct node_use_record [2] that tracks the resources allocated on that node, including the alloc_memory reserved by jobs.

The "memory is under-allocated" error is logged by the select/cons_res plugin when it deallocates resources previously reserved for a given job. Specifically, it fires when the job's job_resources.memory_allocated for a node is higher than that node's node_use_record.alloc_memory, meaning there is a mismatch between the job's viewpoint and the node's in terms of memory allocation. Slurm then logs this message to note the mismatch and, instead of subtracting the job's view of the allocation from the node (which would underflow below zero), sets the node's alloc_memory to zero [3].

This deallocation function is called in several scenarios: when a job finishes, is suspended, is expanded, or is preempted, and when the scheduler builds hypothetical future scenarios by [de]allocating resources to see if/where/when a job will run. The mismatch shouldn't happen; jobs and nodes should share the same view of what is allocated.

In order to try to reproduce this:

- Could you please attach your slurm.conf? (scontrol show config doesn't show the node/partition definitions.)
- Could you please attach a slurmctld.log including all the log messages related to one of the afflicted jobs, for instance JobId=3906238 and/or JobId=6030929_5(6031087)?
- I'm curious which use-case from the list above triggered the deallocation of resources. I also suspect jobs may be getting allocated nodes with different hardware, specifically different CPU/memory counts.

Thanks.

[1] https://github.com/SchedMD/slurm/blob/slurm-18-08-4-1/src/common/job_resources.h#L103
[2] https://github.com/SchedMD/slurm/blob/slurm-18-08-4-1/src/plugins/select/cons_res/select_cons_res.h#L99
[3] https://github.com/SchedMD/slurm/blob/slurm-18-08-4-1/src/plugins/select/cons_res/select_cons_res.c#L1227
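The check described above (see [3]) amounts to the following. This is a simplified, self-contained sketch rather than the verbatim 18.08 source: the struct definitions are minimal stand-ins for those in [1] and [2], and the two numbers in the error message, e.g. "(0-3048)", are the node's and the job's view of the allocation in MB.

```c
#include <inttypes.h>
#include <stdio.h>

/* Minimal stand-ins for the Slurm structs; the real definitions live in
 * job_resources.h [1] and select_cons_res.h [2]. */
struct job_resources {
	uint64_t *memory_allocated;	/* MB the job holds per allocated node */
};
struct node_use_record {
	uint64_t alloc_memory;		/* MB reserved on this node, all jobs */
};

/* Sketch of the deallocation step [3]: remove one job's memory from one
 * node; "n" indexes the node within the job's allocation. */
static void rm_job_memory(struct node_use_record *node_usage,
			  struct job_resources *job_res, int n,
			  const char *node_name, uint32_t job_id)
{
	if (node_usage->alloc_memory < job_res->memory_allocated[n]) {
		/* The job's view exceeds the node's view; the message prints
		 * both, e.g. "(0-3048)" = node says 0 MB, job says 3048 MB. */
		fprintf(stderr, "error: select/cons_res: node %s memory is "
			"under-allocated (%" PRIu64 "-%" PRIu64 ") for "
			"JobId=%u\n", node_name, node_usage->alloc_memory,
			job_res->memory_allocated[n], job_id);
		/* Subtracting would underflow the unsigned counter, so
		 * clamp the node's figure to zero instead. */
		node_usage->alloc_memory = 0;
	} else
		node_usage->alloc_memory -= job_res->memory_allocated[n];
}

int main(void)
{
	uint64_t job_mem[1] = { 3048 };		 /* job's view: 3048 MB */
	struct job_resources job_res = { job_mem };
	struct node_use_record c0519 = { 0 };	 /* node's view: 0 MB */

	/* Reproduces the style of message from this ticket's summary. */
	rm_job_memory(&c0519, &job_res, 0, "c0519", 6031087);
	return 0;
}
```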
**Comment (Jenny Williams <jennyw>)**

Created attachment 8974 [details]: slurmctld log
Created attachment 8975 [details]: slurm conf
The most recent job with all of its messages included in the attached slurmctld.log:

```
       JobID      User        NodeList                             ReqTRES    Elapsed              Submit               Start
------------ --------- --------------- ----------------------------------- ---------- ------------------- -------------------
     8245711     dg144           b1010       billing=1,cpu=1,mem=1G,node=1   04:30:15 2019-01-22T02:52:29 2019-01-22T05:02:42
```

**Comment (Alejandro Sanchez <alex>)**

Jenny, I'm still trying to reproduce this. Did you add or remove nodes while these under-allocated errors were being reported?

**Comment (Jenny Williams <jennyw>)**

I am almost certain we were not adding or removing nodes when these errors were generated. I cannot replicate the condition myself at this point; the only information I have on this is contained in the Slurm log files. As far as my own needs are concerned, I'm willing to close this case at this point.

*** Ticket 6769 has been marked as a duplicate of this ticket. ***

**Comment (Regine Gaudin)**

Hi, I'm updating this bug because CEA is also encountering the memory under-allocated errors mentioned above, which are filling slurmctld.log:

```
error: select/cons_res: node machine1234 memory is under-allocated (0-188800) for JobID=XXXXXX
```

In bug 6879 it is written that "there are proposed fixes for both issues I mentioned (accrue_cnt underflow and memory under-allocated errors)", so please know that CEA would also be interested in the proposed fixes. Our Slurm controller is on 18.08.06 and the clients are on 17.11.6, but the clients will soon be upgraded to 18.08.06.

Thanks,
Regine
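For reference, the invariant violated whenever this error fires is: a node's alloc_memory must equal the sum of memory_allocated over all running jobs on that node. The sketch below expresses that check; the types and the function are illustrative stand-ins, not Slurm API.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-node bookkeeping entry: one running job's share of a
 * node's memory, mirroring job_resources.memory_allocated [1]. */
struct tracked_job {
	uint32_t job_id;
	uint64_t memory_allocated;	/* MB this job holds on the node */
};

/* Returns true when the node's recorded alloc_memory (node_use_record [2])
 * matches the sum of the jobs' views; false indicates the mismatch that
 * produces the "memory is under-allocated" error on deallocation. */
static bool node_memory_consistent(const char *node_name,
				   uint64_t node_alloc_memory,
				   const struct tracked_job *jobs,
				   size_t njobs)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < njobs; i++)
		sum += jobs[i].memory_allocated;

	if (sum != node_alloc_memory) {
		fprintf(stderr, "node %s: jobs claim %" PRIu64 " MB but node "
			"records %" PRIu64 " MB allocated\n",
			node_name, sum, node_alloc_memory);
		return false;
	}
	return true;
}

int main(void)
{
	/* Mirrors the reported case: JobId 6031087 believes it holds
	 * 3048 MB on c0519 while the node records 0 MB allocated. */
	struct tracked_job jobs[] = { { 6031087, 3048 } };

	node_memory_consistent("c0519", 0, jobs, 1);
	return 0;
}
```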