I am seeing this error fairly frequently. When a batch of these errors occurs, it cycles over the jobs on a given node, as in the first output here. The second command below shows that the error happens on all nodes. Any insight as to why? The only references to this error that I can find date back to around v16. Separately, I have a condition where resources (memory and CPU) are available but jobs are still pending; I am troubleshooting that.

[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_5(6031087)
[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_6(6031088)
[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_7(6031089)
[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_13(6031095)
[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_14(6031096)
[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_15(6030929)
[2019-01-07T13:43:03.588] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030989_15(6030989)
[2019-01-07T13:43:03.608] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030998_7(6038834)
[2019-01-07T13:43:03.616] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6031003_8(6046914)
[2019-01-07T13:43:03.621] error: select/cons_res: node c0519 memory is under-allocated (0-1024) for JobId=5302915
[2019-01-07T13:43:03.640] error: select/cons_res: node c0519 memory is under-allocated (0-2048) for JobId=5685057
[2019-01-07T13:43:03.661] error: select/cons_res: node c0519 memory is under-allocated (0-4800) for JobId=3906238
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (1400-3048) for JobId=6030928_14(6031081)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030928_15(6030928)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_5(6031087)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_6(6031088)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_7(6031089)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_13(6031095)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_14(6031096)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_15(6030929)
[2019-01-07T13:43:03.865] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030989_15(6030989)
[2019-01-07T13:43:03.885] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030998_7(6038834)
[2019-01-07T13:43:03.893] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6031003_8(6046914)
[2019-01-07T13:43:03.898] error: select/cons_res: node c0519 memory is under-allocated (0-1024) for JobId=5302915
[2019-01-07T13:43:03.918] error: select/cons_res: node c0519 memory is under-allocated (0-2048) for JobId=5685057
[2019-01-07T13:43:03.938] error: select/cons_res: node c0519 memory is under-allocated (0-4800) for JobId=3906238
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (1400-3048) for JobId=6030928_14(6031081)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030928_15(6030928)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_5(6031087)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_6(6031088)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_7(6031089)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_13(6031095)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_14(6031096)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_15(6030929)
[2019-01-07T13:43:04.121] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030989_15(6030989)
[2019-01-07T13:43:04.140] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030998_7(6038834)
[2019-01-07T13:43:04.148] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6031003_8(6046914)
[2019-01-07T13:43:04.153] error: select/cons_res: node c0519 memory is under-allocated (0-1024) for JobId=5302915
[2019-01-07T13:43:04.172] error: select/cons_res: node c0519 memory is under-allocated (0-2048) for JobId=5685057
[2019-01-07T13:43:04.192] error: select/cons_res: node c0519 memory is under-allocated (0-4800) for JobId=3906238

# egrep 2019-01-07T /pine/EX/root/slurm-log/slurmctld.log|egrep "memory is under-allocated" |awk '{ print $5 }' |sort|uniq -c
17 b1001 24 b1002 76 b1003 10 b1004 33 b1005 6 b1006 18 b1007 5 b1008 598 b1009 7 b1010 6 b1011 64 b1012 38 b1013 2 b1014 17 b1015 3 b1016 10 b1017 12 b1018 11 b1019 8 b1020 91 b1021 44 c0302 4 c0304 1 c0305 127 c0311 2 c0312 10 c0313 13 c0315 8 c0401 1 c0402 43 c0407 53 c0501 61 c0502 81 c0503 124 c0504 124 c0505 18 c0506 21 c0507 8 c0508 18 c0509 200 c0510 143 c0511 24 c0512 11 c0513 6 c0514 251 c0515 65 c0516 356 c0517 19 c0518 31 c0519 42 c0520 400 c0521 271 c0522 191 c0523 51 c0524 33 c0525 296 c0526 404 c0527 35 c0528 129 c0529 18 c0530 337 c0531 704 c0532 54 c0534 124 c0535 145 c0536 27 c0537 213 c0538 14 c0539 39 c0540 7 c0802 122 c0803 14 c0804 605 c0805 24 c0806 35 c0807 190 c0808 254 c0809 35 c0810 29 c0811 210 c0812 4 c0813 44 c0814 74 c0815 15 c0816 23 c0817 89 c0818 20 c0819 243 c0820 52 c0821 227 c0822 117 c0823 2 c0824 99 c0825 41 c0826 215 c0827 458 c0828 65 c0829 61 c0831 219 c0832 9 c0833 18 c0834 14 c0835 123 c0836 278 c0837 260 c0838 302 c0839 33 c0840 84 c0901 8 c0902 16 c0903 57 c0904 84 c0905 139 c0906 182 c0907 239 c0908 40 c0909 172 c0910 406 c0911 115 c0912 249 c0913 183 c0914 104 c0915 33 c0916 6 c0917 59 c0919 61 c0921 71 c0922 123 c0923 22 c0924 10 c0925 215 c0926 134 c0927 341 c0928 88 c0929 35 c0930 26 c0931 68 c0932 121 c0933 62 c0934 66 c0935 45 c0936 31 c0937 428 c0938 148 c0939 67 c0940 17 c1101 68 c1102 4 c1103 45 c1104 7 c1105 277 c1106 427 c1107 519 c1108 440 c1109 468 c1110 350 c1111 524 c1112 675 c1113 265 c1114 197 c1115 699 c1116 15 c1121 17 c1123 96 c1124 70 c1125 126 c1127 28 c1128 18 c1129 3 c1131 129 c1132 29 c1133 5 c1134 17 c1137 110 c1138 16 c1139 36 c1140 6 g0307 24
g0309 11 g0601 23 longleaf-login1 8 off01 7 t0601 15 t0602 6 t0603 412 t0604 15 t0605
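For reference, the same log can also be summarized per job instead of per node. This is only a sketch using the same log file; it relies on the JobId= field being the last field of these messages:

# egrep 2019-01-07T /pine/EX/root/slurm-log/slurmctld.log | egrep "memory is under-allocated" | awk '{ print $NF }' | sort | uniq -c | sort -rn | head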
# sinfo -Nl -p spill Tue Jan 8 10:38:50 2019 NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON b1001 1 spill mixed 24 2:12:1 186391 0 1 (null) none b1002 1 spill mixed 24 2:12:1 186391 0 1 (null) none b1003 1 spill allocated 24 2:12:1 186391 0 1 (null) none b1004 1 spill mixed 24 2:12:1 186391 0 1 (null) none b1005 1 spill mixed 24 2:12:1 186391 0 1 (null) none b1006 1 spill allocated 24 2:12:1 186391 0 1 (null) none b1007 1 spill allocated 24 2:12:1 186391 0 1 (null) none b1008 1 spill allocated 24 2:12:1 186391 0 1 (null) none b1009 1 spill allocated 24 2:12:1 186391 0 1 (null) none b1010 1 spill allocated 24 2:12:1 186391 0 1 (null) none b1011 1 spill mixed 24 2:12:1 186391 0 1 (null) none b1012 1 spill mixed 24 2:12:1 186391 0 1 (null) none b1013 1 spill mixed 24 2:12:1 186391 0 1 (null) none b1014 1 spill mixed 24 2:12:1 186391 0 1 (null) none b1015 1 spill allocated 24 2:12:1 186391 0 1 (null) none b1016 1 spill mixed 24 2:12:1 186391 0 1 (null) none b1017 1 spill allocated 24 2:12:1 186391 0 1 (null) none b1018 1 spill allocated 24 2:12:1 186391 0 1 (null) none b1019 1 spill allocated 24 2:12:1 186391 0 1 (null) none b1020 1 spill allocated 24 2:12:1 186391 0 1 (null) none b1021 1 spill mixed 24 2:12:1 186391 0 1 (null) none b1022 1 spill drained* 24 2:12:1 186391 0 1 (null) borrow_nic b1023 1 spill drained* 24 2:12:1 186391 0 1 (null) borrow_nic b1024 1 spill drained* 24 2:12:1 186391 0 1 (null) borrow_nic b1025 1 spill drained* 24 2:12:1 186391 0 1 (null) borrow_nic b1026 1 spill drained* 24 2:12:1 186391 0 1 (null) borrow_nic b1027 1 spill drained* 24 2:12:1 186391 0 1 (null) borrow_nic c0301 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0302 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0303 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0304 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0305 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0306 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0307 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0308 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0309 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0310 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0311 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0312 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0313 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0314 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0315 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0316 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0317 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0318 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0319 1 spill allocated 72 2:36:1 750452 0 1 (null) none c0320 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0401 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0402 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0403 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0404 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0405 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0406 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0407 1 spill allocated 72 2:36:1 750452 0 1 (null) none c0408 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0409 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0410 1 spill mixed 72 2:36:1 750452 0 1 (null) none c0501 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0502 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0503 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0504 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0505 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0506 1 spill mixed 56 2:28:1 235520 
0 1 (null) none c0507 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0508 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0509 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0510 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0511 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0512 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0513 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0514 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0515 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0516 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0517 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0518 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0519 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0520 1 spill allocated 56 2:28:1 235520 0 1 (null) none c0521 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0522 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0523 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0524 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0525 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0526 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0527 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0528 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0529 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0530 1 spill allocated 56 2:28:1 235520 0 1 (null) none c0531 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0532 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0533 1 spill draining 56 2:28:1 235520 0 1 (null) replace_dimmA8 c0534 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0535 1 spill mixed 56 2:28:1 235520 0 1 (null) none c0536 1 spill allocated 56 2:28:1 235520 0 1 (null) none c0537 1 spill allocated 56 2:28:1 235520 0 1 (null) none c0538 1 spill allocated 56 2:28:1 235520 0 1 (null) none c0539 1 spill allocated 56 2:28:1 235520 0 1 (null) none c0540 1 spill allocated 56 2:28:1 235520 0 1 (null) none c0802 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0803 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0804 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0805 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0806 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0807 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0808 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0809 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0810 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0811 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0812 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0813 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0814 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0815 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0816 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0817 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0818 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0819 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0820 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0821 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0822 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0823 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0824 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0825 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0826 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0827 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0828 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0829 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0830 1 spill drained* 48 2:24:1 235520 0 1 (null) mem_upgrade c0831 1 spill allocated 48 2:24:1 364536 0 1 
(null) none c0832 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0833 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0834 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0835 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0836 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0837 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0838 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0839 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0840 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0901 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0902 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0903 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0904 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0905 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0906 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0907 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0908 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0909 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0910 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0911 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0912 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0913 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0914 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0915 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0916 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0917 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0918 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0919 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0920 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0921 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0922 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0923 1 spill allocated@ 48 2:24:1 364536 0 1 (null) none c0924 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0925 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0926 1 spill allocated 48 2:24:1 364536 0 1 (null) none c0927 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0928 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0929 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0930 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0931 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0932 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0933 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0934 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0935 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0936 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0937 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0938 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0939 1 spill mixed 48 2:24:1 364536 0 1 (null) none c0940 1 spill mixed@ 48 2:24:1 364536 0 1 (null) none c1101 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1102 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1103 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1104 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1105 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1106 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1107 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1108 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1109 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1110 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1111 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1112 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1113 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1114 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1115 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1116 1 spill mixed 48 2:24:1 364536 
0 1 (null) none c1117 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1118 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1119 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1120 1 spill mixed 48 2:24:1 364536 0 1 (null) none c1121 1 spill mixed 48 2:24:1 235520 0 1 (null) none c1122 1 spill mixed 48 2:24:1 235520 0 1 (null) none c1123 1 spill mixed 48 2:24:1 235520 0 1 (null) none c1124 1 spill mixed 48 2:24:1 235520 0 1 (null) none c1125 1 spill mixed 48 2:24:1 235520 0 1 (null) none c1126 1 spill allocated 48 2:24:1 235520 0 1 (null) none c1127 1 spill allocated 48 2:24:1 235520 0 1 (null) none c1128 1 spill mixed 48 2:24:1 235520 0 1 (null) none c1129 1 spill allocated 48 2:24:1 235520 0 1 (null) none c1130 1 spill allocated 48 2:24:1 235520 0 1 (null) none c1131 1 spill allocated 48 2:24:1 235520 0 1 (null) none c1132 1 spill allocated 48 2:24:1 235520 0 1 (null) none c1133 1 spill allocated 48 2:24:1 235520 0 1 (null) none c1134 1 spill draining 48 2:24:1 235520 0 1 (null) Kill task failed c1135 1 spill mixed 48 2:24:1 235520 0 1 (null) none c1136 1 spill allocated 48 2:24:1 235520 0 1 (null) none c1137 1 spill allocated 48 2:24:1 235520 0 1 (null) none c1138 1 spill allocated 48 2:24:1 235520 0 1 (null) none c1139 1 spill mixed 48 2:24:1 235520 0 1 (null) none c1140 1 spill allocated 48 2:24:1 235520 0 1 (null) none
# scontrol show config Configuration data as of 2019-01-08T10:39:59 AccountingStorageBackupHost = (null) AccountingStorageEnforce = associations,limits,qos,safe AccountingStorageHost = m1006 AccountingStorageLoc = N/A AccountingStoragePort = 6819 AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu AccountingStorageType = accounting_storage/slurmdbd AccountingStorageUser = N/A AccountingStoreJobComment = Yes AcctGatherEnergyType = acct_gather_energy/none AcctGatherFilesystemType = acct_gather_filesystem/none AcctGatherInterconnectType = acct_gather_interconnect/none AcctGatherNodeFreq = 0 sec AcctGatherProfileType = acct_gather_profile/none AllowSpecResourcesUsage = 0 AuthInfo = (null) AuthType = auth/munge BatchStartTimeout = 10 sec BOOT_TIME = 2019-01-02T17:13:46 BurstBufferType = (null) CheckpointType = checkpoint/none ClusterName = longleaf CommunicationParameters = (null) CompleteWait = 0 sec CoreSpecPlugin = core_spec/none CpuFreqDef = Unknown CpuFreqGovernors = Performance,OnDemand,UserSpace CryptoType = crypto/munge DebugFlags = (null) DefMemPerNode = UNLIMITED DisableRootJobs = No EioTimeout = 60 EnforcePartLimits = ANY Epilog = (null) EpilogMsgTime = 2000 usec EpilogSlurmctld = (null) ExtSensorsType = ext_sensors/none ExtSensorsFreq = 0 sec FairShareDampeningFactor = 1 FastSchedule = 1 FederationParameters = (null) FirstJobId = 1 GetEnvTimeout = 2 sec GresTypes = gpu GroupUpdateForce = 1 GroupUpdateTime = 600 sec HASH_VAL = Match HealthCheckInterval = 0 sec HealthCheckNodeState = ANY HealthCheckProgram = (null) InactiveLimit = 65533 sec JobAcctGatherFrequency = task=15 JobAcctGatherType = jobacct_gather/cgroup JobAcctGatherParams = (null) JobCheckpointDir = /var/slurm/checkpoint JobCompHost = localhost JobCompLoc = /var/log/slurm_jobcomp.log JobCompPort = 0 JobCompType = jobcomp/none JobCompUser = root JobContainerType = job_container/none JobCredentialPrivateKey = (null) JobCredentialPublicCertificate = (null) JobDefaults = (null) JobFileAppend = 0 JobRequeue = 1 JobSubmitPlugins = lua,all_partitions KeepAliveTime = SYSTEM_DEFAULT KillOnBadExit = 0 KillWait = 30 sec LaunchParameters = (null) LaunchType = launch/slurm Layouts = Licenses = mplus:1,nonmem:32 LicensesUsed = nonmem:0/32,mplus:0/1 LogTimeFormat = iso8601_ms MailDomain = (null) MailProg = /bin/mail MaxArraySize = 40001 MaxJobCount = 350000 MaxJobId = 67043328 MaxMemPerNode = UNLIMITED MaxStepCount = 40000 MaxTasksPerNode = 512 MCSPlugin = mcs/none MCSParameters = (null) MemLimitEnforce = No MessageTimeout = 60 sec MinJobAge = 300 sec MpiDefault = none MpiParams = (null) MsgAggregationParams = (null) NEXT_JOB_ID = 6170348 NodeFeaturesPlugins = (null) OverTimeLimit = 0 min PluginDir = /usr/lib64/slurm PlugStackConfig = /etc/slurm/plugstack.conf PowerParameters = (null) PowerPlugin = PreemptMode = OFF PreemptType = preempt/none PriorityParameters = (null) PriorityDecayHalfLife = 14-00:00:00 PriorityCalcPeriod = 00:05:00 PriorityFavorSmall = No PriorityFlags = SMALL_RELATIVE_TO_TIME,CALCULATE_RUNNING,FAIR_TREE,MAX_TRES PriorityMaxAge = 60-00:00:00 PriorityUsageResetPeriod = NONE PriorityType = priority/multifactor PriorityWeightAge = 1000 PriorityWeightFairShare = 10000 PriorityWeightJobSize = 1000 PriorityWeightPartition = 1000 PriorityWeightQOS = 1000 PriorityWeightTRES = CPU=1000,Mem=4000,GRES/gpu=3000 PrivateData = none ProctrackType = proctrack/cgroup Prolog = (null) PrologEpilogTimeout = 65534 PrologSlurmctld = (null) PrologFlags = Alloc,Contain PropagatePrioProcess = 0 
PropagateResourceLimits = ALL PropagateResourceLimitsExcept = (null) RebootProgram = /usr/sbin/reboot ReconfigFlags = (null) RequeueExit = (null) RequeueExitHold = (null) ResumeFailProgram = (null) ResumeProgram = (null) ResumeRate = 300 nodes/min ResumeTimeout = 60 sec ResvEpilog = (null) ResvOverRun = 0 min ResvProlog = (null) ReturnToService = 2 RoutePlugin = route/default SallocDefaultCommand = srun -n1 -N1 --gres=gpu:0 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL SbcastParameters = (null) SchedulerParameters = kill_invalid_depend,batch_sched_delay=10,bf_continue,bf_max_job_part=5000,bf_max_job_test=10000,bf_max_job_user=300,bf_resolution=300,bf_window=10080,bf_yield_interval=1000000,default_queue_depth=1000,partition_job_depth=600,sched_min_interval=2000000,defer,max_rpc_cnt=80 SchedulerTimeSlice = 30 sec SchedulerType = sched/backfill SelectType = select/cons_res SelectTypeParameters = CR_CPU_MEMORY SlurmUser = slurm(47) SlurmctldAddr = (null) SlurmctldDebug = info SlurmctldHost[0] = longleaf-sched(172.26.113.4) SlurmctldLogFile = /pine/EX/root/slurm-log/slurmctld.log SlurmctldPort = 6820-6824 SlurmctldSyslogDebug = unknown SlurmctldPrimaryOffProg = (null) SlurmctldPrimaryOnProg = (null) SlurmctldTimeout = 65530 sec SlurmctldParameters = (null) SlurmdDebug = info SlurmdLogFile = /var/log/slurm/slurmd.log SlurmdParameters = (null) SlurmdPidFile = /var/run/slurmd.pid SlurmdPort = 6818 SlurmdSpoolDir = /var/spool/slurmd SlurmdSyslogDebug = unknown SlurmdTimeout = 65530 sec SlurmdUser = root(0) SlurmSchedLogFile = (null) SlurmSchedLogLevel = 0 SlurmctldPidFile = /var/run/slurmctld.pid SlurmctldPlugstack = (null) SLURM_CONF = /etc/slurm/slurm.conf SLURM_VERSION = 18.08.4 SrunEpilog = (null) SrunPortRange = 0-0 SrunProlog = (null) StateSaveLocation = /pine/EX/root/slurm-log/slurmctld SuspendExcNodes = (null) SuspendExcParts = (null) SuspendProgram = (null) SuspendRate = 60 nodes/min SuspendTime = NONE SuspendTimeout = 30 sec SwitchType = switch/none TaskEpilog = (null) TaskPlugin = task/cgroup TaskPluginParam = (null type) TaskProlog = (null) TCPTimeout = 2 sec TmpFS = /tmp TopologyParam = (null) TopologyPlugin = topology/none TrackWCKey = No TreeWidth = 50 UsePam = 0 UnkillableStepProgram = (null) UnkillableStepTimeout = 120 sec VSizeFactor = 0 percent WaitTime = 0 sec X11Parameters = (null) Cgroup Support Configuration: AllowedDevicesFile = /etc/slurm/cgroup_allowed_devices_file.conf AllowedKmemSpace = (null) AllowedRAMSpace = 100.0% AllowedSwapSpace = 0.0% CgroupAutomount = yes CgroupMountpoint = /sys/fs/cgroup ConstrainCores = yes ConstrainDevices = no ConstrainKmemSpace = no ConstrainRAMSpace = yes ConstrainSwapSpace = no MaxKmemPercent = 100.0% MaxRAMPercent = 100.0% MaxSwapPercent = 100.0% MemLimitThreshold = 100.0% MemoryLimitEnforcement = no MemorySwappiness = (null) MinKmemSpace = 30 MB MinRAMSpace = 30 MB TaskAffinity = yes Slurmctld(primary) at longleaf-sched is UP
Hi Jenny,

Each job record has an associated job_resources struct[1] that tracks the resources allocated to the job, including memory_allocated per node. Similarly, each node has an associated node_use_record struct[2] that tracks the resources allocated on that node, including the alloc_memory reserved by jobs.

The "memory is under-allocated" error is logged by the select/cons_res plugin while deallocating resources previously reserved for a given job[3]. More specifically, it happens when a job's job_resources.memory_allocated on a node is higher than that node's node_use_record.alloc_memory, meaning there is a mismatch between the job's view and the node's view of the memory allocation. Slurm then logs this message to record the mismatch and, instead of subtracting the job's amount of memory from the node (which would underflow below zero), it sets that node's alloc_memory to zero.

This deallocation function is called in several scenarios: when a job finishes, is suspended, is expanded, or is preempted, and when the scheduler builds hypothetical future scenarios by [de]allocating resources to see if/where/when a job will run. The mismatch shouldn't happen; jobs and nodes should always have the same view of what is allocated.

To try to reproduce this:
- Could you please attach your slurm.conf? (scontrol show config doesn't show the node/partition definitions.)
- Could you please attach the slurmctld.log messages related to one of the afflicted jobs, for instance JobId=3906238 and/or JobId=6030929_5(6031087)? (A rough grep sketch follows after the references below.)
- I'm curious which of the use-cases listed above triggered the deallocation. I also suspect jobs may be getting allocated nodes with different hardware, specifically different CPU/memory counts.

Thanks.

[1] https://github.com/SchedMD/slurm/blob/slurm-18-08-4-1/src/common/job_resources.h#L103
[2] https://github.com/SchedMD/slurm/blob/slurm-18-08-4-1/src/plugins/select/cons_res/select_cons_res.h#L99
[3] https://github.com/SchedMD/slurm/blob/slurm-18-08-4-1/src/plugins/select/cons_res/select_cons_res.c#L1227
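A rough sketch for pulling those messages (the log path is taken from your config output above; the patterns are the plain job id, the array task, and its raw id, so a few false positives on other ids are possible):

# grep -E '3906238|6030929_5|6031087' /pine/EX/root/slurm-log/slurmctld.log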
Created attachment 8974 [details] slurmctld log
Created attachment 8975 [details] slurm conf
The most recent job that has all of its messages included in the attached slurmctld.log:

JobID        User      NodeList        ReqTRES                               Elapsed    Submit               Start
------------ --------- --------------- ------------------------------------- ---------- -------------------- --------------------
8245711      dg144     b1010           billing=1,cpu=1,mem=1G,node=1         04:30:15   2019-01-22T02:52:29  2019-01-22T05:02:42
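For reference, a table of this shape can be produced with a query along these lines (a sketch only; sacct field names assumed, adjust the job id and column widths as needed):

# sacct -j 8245711 -X --format=JobID,User,NodeList,ReqTRES%37,Elapsed,Submit,Start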
Jenny, I'm still trying to reproduce this. Did you add/remove nodes while these under-allocated errors were reported?
I am almost certain we were not adding or removing nodes when these errors were generated. I cannot replicate the condition myself at this point, and the only information I have on it is contained in the Slurm log files. As far as my own needs are concerned, I'm willing to close this case at this point.
*** Ticket 6769 has been marked as a duplicate of this ticket. ***
Hi, I'm updating this bug because CEA is also encountering the memory under-allocated errors you have mentioned, which are filling slurmctld.log:

error: select/cons_res: node machine1234 memory is under-allocated (0-188800) for JobID=XXXXXX

In bug 6879 it is written that "there are proposed fixes for both issues I mentioned (accrue_cnt underflow and memory under-allocated errors)". So I'm letting you know that CEA would also be interested in the proposed fixes. The Slurm controller is on 18.08.06 and the clients are on 17.11.6, but they will soon be upgraded to 18.08.06.

Thanks,
Regine