Ticket 6323 - select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_5(6031087)
Status: RESOLVED CANNOTREPRODUCE
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 18.08.4
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Alejandro Sanchez
Reported: 2019-01-07 11:49 MST by Jenny Williams
Modified: 2019-06-04 03:27 MDT
Site: University of North Carolina at Chapel Hill


Attachments
slurmctld log (23.31 MB, application/x-compressed)
2019-01-22 09:05 MST, Jenny Williams
slurm conf (4.77 KB, text/plain)
2019-01-22 09:05 MST, Jenny Williams

Description Jenny Williams 2019-01-07 11:49:37 MST
I am seeing this error fairly frequently. When a batch of these errors occurs, it cycles over the jobs on a given node, as in the first output below. The second command below shows that the error happens on all nodes. Any insight as to why? The only references to this error that I can find date back to version 16.

I also have a condition where resources (memory and CPU) are available but jobs are still pending. I am troubleshooting that separately.

[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_5(6031087)
[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_6(6031088)
[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_7(6031089)
[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_13(6031095)
[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_14(6031096)
[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_15(6030929)
[2019-01-07T13:43:03.588] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030989_15(6030989)
[2019-01-07T13:43:03.608] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030998_7(6038834)
[2019-01-07T13:43:03.616] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6031003_8(6046914)
[2019-01-07T13:43:03.621] error: select/cons_res: node c0519 memory is under-allocated (0-1024) for JobId=5302915
[2019-01-07T13:43:03.640] error: select/cons_res: node c0519 memory is under-allocated (0-2048) for JobId=5685057
[2019-01-07T13:43:03.661] error: select/cons_res: node c0519 memory is under-allocated (0-4800) for JobId=3906238
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (1400-3048) for JobId=6030928_14(6031081)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030928_15(6030928)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_5(6031087)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_6(6031088)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_7(6031089)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_13(6031095)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_14(6031096)
[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_15(6030929)
[2019-01-07T13:43:03.865] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030989_15(6030989)
[2019-01-07T13:43:03.885] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030998_7(6038834)
[2019-01-07T13:43:03.893] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6031003_8(6046914)
[2019-01-07T13:43:03.898] error: select/cons_res: node c0519 memory is under-allocated (0-1024) for JobId=5302915
[2019-01-07T13:43:03.918] error: select/cons_res: node c0519 memory is under-allocated (0-2048) for JobId=5685057
[2019-01-07T13:43:03.938] error: select/cons_res: node c0519 memory is under-allocated (0-4800) for JobId=3906238
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (1400-3048) for JobId=6030928_14(6031081)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030928_15(6030928)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_5(6031087)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_6(6031088)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_7(6031089)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_13(6031095)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_14(6031096)
[2019-01-07T13:43:04.101] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_15(6030929)
[2019-01-07T13:43:04.121] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030989_15(6030989)
[2019-01-07T13:43:04.140] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030998_7(6038834)
[2019-01-07T13:43:04.148] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6031003_8(6046914)
[2019-01-07T13:43:04.153] error: select/cons_res: node c0519 memory is under-allocated (0-1024) for JobId=5302915
[2019-01-07T13:43:04.172] error: select/cons_res: node c0519 memory is under-allocated (0-2048) for JobId=5685057
[2019-01-07T13:43:04.192] error: select/cons_res: node c0519 memory is under-allocated (0-4800) for JobId=3906238
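
The repetition in the lines above (the same jobs on c0519 reported again at 13:43:03.5, 13:43:03.8 and 13:43:04.1) can be checked mechanically. The following is a minimal Python sketch, not part of the original report, that parses lines of this form and counts how often each (node, job) pair is reported. The helper names `parse_line` and `repeats_per_job` and the field names `first` and `second` are hypothetical; the ticket does not state what the two numbers in parentheses mean, so the sketch only extracts them.

```python
import re
from collections import Counter

# Regex for the error lines quoted above. The group names "first" and
# "second" are placeholders for the two numbers in parentheses; the
# ticket does not say what they mean.
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] error: select/cons_res: node (?P<node>\S+) "
    r"memory is under-allocated \((?P<first>\d+)-(?P<second>\d+)\) "
    r"for JobId=(?P<job>\S+)"
)


def parse_line(line):
    """Return a dict of fields for one error line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["first"] = int(rec["first"])
    rec["second"] = int(rec["second"])
    return rec


def repeats_per_job(lines):
    """Count how many times each (node, job) pair is reported."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            counts[(rec["node"], rec["job"])] += 1
    return counts


# Three lines copied from the log excerpt above.
sample = [
    "[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_5(6031087)",
    "[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_5(6031087)",
    "[2019-01-07T13:43:03.844] error: select/cons_res: node c0519 memory is under-allocated (1400-3048) for JobId=6030928_14(6031081)",
]
for (node, job), n in sorted(repeats_per_job(sample).items()):
    print(node, job, n)
```

A pair with a count above one indicates the same job being reported on the same node more than once, which matches the cycling described in the report.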



# egrep 2019-01-07T /pine/EX/root/slurm-log/slurmctld.log|egrep "memory is under-allocated" |awk '{ print $5 }' |sort|uniq -c
     17 b1001
     24 b1002
     76 b1003
     10 b1004
     33 b1005
      6 b1006
     18 b1007
      5 b1008
    598 b1009
      7 b1010
      6 b1011
     64 b1012
     38 b1013
      2 b1014
     17 b1015
      3 b1016
     10 b1017
     12 b1018
     11 b1019
      8 b1020
     91 b1021
     44 c0302
      4 c0304
      1 c0305
    127 c0311
      2 c0312
     10 c0313
     13 c0315
      8 c0401
      1 c0402
     43 c0407
     53 c0501
     61 c0502
     81 c0503
    124 c0504
    124 c0505
     18 c0506
     21 c0507
      8 c0508
     18 c0509
    200 c0510
    143 c0511
     24 c0512
     11 c0513
      6 c0514
    251 c0515
     65 c0516
    356 c0517
     19 c0518
     31 c0519
     42 c0520
    400 c0521
    271 c0522
    191 c0523
     51 c0524
     33 c0525
    296 c0526
    404 c0527
     35 c0528
    129 c0529
     18 c0530
    337 c0531
    704 c0532
     54 c0534
    124 c0535
    145 c0536
     27 c0537
    213 c0538
     14 c0539
     39 c0540
      7 c0802
    122 c0803
     14 c0804
    605 c0805
     24 c0806
     35 c0807
    190 c0808
    254 c0809
     35 c0810
     29 c0811
    210 c0812
      4 c0813
     44 c0814
     74 c0815
     15 c0816
     23 c0817
     89 c0818
     20 c0819
    243 c0820
     52 c0821
    227 c0822
    117 c0823
      2 c0824
     99 c0825
     41 c0826
    215 c0827
    458 c0828
     65 c0829
     61 c0831
    219 c0832
      9 c0833
     18 c0834
     14 c0835
    123 c0836
    278 c0837
    260 c0838
    302 c0839
     33 c0840
     84 c0901
      8 c0902
     16 c0903
     57 c0904
     84 c0905
    139 c0906
    182 c0907
    239 c0908
     40 c0909
    172 c0910
    406 c0911
    115 c0912
    249 c0913
    183 c0914
    104 c0915
     33 c0916
      6 c0917
     59 c0919
     61 c0921
     71 c0922
    123 c0923
     22 c0924
     10 c0925
    215 c0926
    134 c0927
    341 c0928
     88 c0929
     35 c0930
     26 c0931
     68 c0932
    121 c0933
     62 c0934
     66 c0935
     45 c0936
     31 c0937
    428 c0938
    148 c0939
     67 c0940
     17 c1101
     68 c1102
      4 c1103
     45 c1104
      7 c1105
    277 c1106
    427 c1107
    519 c1108
    440 c1109
    468 c1110
    350 c1111
    524 c1112
    675 c1113
    265 c1114
    197 c1115
    699 c1116
     15 c1121
     17 c1123
     96 c1124
     70 c1125
    126 c1127
     28 c1128
     18 c1129
      3 c1131
    129 c1132
     29 c1133
      5 c1134
     17 c1137
    110 c1138
     16 c1139
     36 c1140
      6 g0307
     24 g0309
     11 g0601
     23 longleaf-login1
      8 off01
      7 t0601
     15 t0602
      6 t0603
    412 t0604
     15 t0605
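
For sites that prefer to post-process the log in a script, the egrep/awk/sort/uniq pipeline above can be approximated in Python. This is a minimal sketch, not part of the original report; the function name `count_by_node` is hypothetical, and it assumes the same whitespace-separated layout that `awk '{ print $5 }'` relies on, where the fifth field is the node name.

```python
from collections import Counter


def count_by_node(lines, day="2019-01-07T"):
    """Approximate: egrep DAY | egrep "memory is under-allocated" | awk '{print $5}' | sort | uniq -c."""
    counts = Counter()
    for line in lines:
        # Keep only lines for the given day that carry the error text.
        if day not in line or "memory is under-allocated" not in line:
            continue
        fields = line.split()  # whitespace split, like awk's default
        if len(fields) >= 5:
            counts[fields[4]] += 1  # awk's $5 is index 4 here
    return counts


# Sample lines in the format shown in the description; in practice the
# lines would come from iterating over the open slurmctld log file.
sample = [
    "[2019-01-07T13:43:03.567] error: select/cons_res: node c0519 memory is under-allocated (0-3048) for JobId=6030929_5(6031087)",
    "[2019-01-07T13:43:03.621] error: select/cons_res: node c0519 memory is under-allocated (0-1024) for JobId=5302915",
    "[2019-01-08T09:00:00.000] error: select/cons_res: node c0519 memory is under-allocated (0-1024) for JobId=5302915",
]
for node, n in sorted(count_by_node(sample).items()):
    print(f"{n:7d} {node}")
```

Sorting the resulting counter by value instead of by node name makes the most affected nodes easier to spot than in the alphabetical listing.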
Comment 1 Jenny Williams 2019-01-08 08:39:51 MST
# sinfo -Nl -p spill
Tue Jan  8 10:38:50 2019
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
b1001          1     spill       mixed   24   2:12:1 186391        0      1   (null) none                
b1002          1     spill       mixed   24   2:12:1 186391        0      1   (null) none                
b1003          1     spill   allocated   24   2:12:1 186391        0      1   (null) none                
b1004          1     spill       mixed   24   2:12:1 186391        0      1   (null) none                
b1005          1     spill       mixed   24   2:12:1 186391        0      1   (null) none                
b1006          1     spill   allocated   24   2:12:1 186391        0      1   (null) none                
b1007          1     spill   allocated   24   2:12:1 186391        0      1   (null) none                
b1008          1     spill   allocated   24   2:12:1 186391        0      1   (null) none                
b1009          1     spill   allocated   24   2:12:1 186391        0      1   (null) none                
b1010          1     spill   allocated   24   2:12:1 186391        0      1   (null) none                
b1011          1     spill       mixed   24   2:12:1 186391        0      1   (null) none                
b1012          1     spill       mixed   24   2:12:1 186391        0      1   (null) none                
b1013          1     spill       mixed   24   2:12:1 186391        0      1   (null) none                
b1014          1     spill       mixed   24   2:12:1 186391        0      1   (null) none                
b1015          1     spill   allocated   24   2:12:1 186391        0      1   (null) none                
b1016          1     spill       mixed   24   2:12:1 186391        0      1   (null) none                
b1017          1     spill   allocated   24   2:12:1 186391        0      1   (null) none                
b1018          1     spill   allocated   24   2:12:1 186391        0      1   (null) none                
b1019          1     spill   allocated   24   2:12:1 186391        0      1   (null) none                
b1020          1     spill   allocated   24   2:12:1 186391        0      1   (null) none                
b1021          1     spill       mixed   24   2:12:1 186391        0      1   (null) none                
b1022          1     spill    drained*   24   2:12:1 186391        0      1   (null) borrow_nic          
b1023          1     spill    drained*   24   2:12:1 186391        0      1   (null) borrow_nic          
b1024          1     spill    drained*   24   2:12:1 186391        0      1   (null) borrow_nic          
b1025          1     spill    drained*   24   2:12:1 186391        0      1   (null) borrow_nic          
b1026          1     spill    drained*   24   2:12:1 186391        0      1   (null) borrow_nic          
b1027          1     spill    drained*   24   2:12:1 186391        0      1   (null) borrow_nic          
c0301          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0302          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0303          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0304          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0305          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0306          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0307          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0308          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0309          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0310          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0311          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0312          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0313          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0314          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0315          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0316          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0317          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0318          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0319          1     spill   allocated   72   2:36:1 750452        0      1   (null) none                
c0320          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0401          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0402          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0403          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0404          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0405          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0406          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0407          1     spill   allocated   72   2:36:1 750452        0      1   (null) none                
c0408          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0409          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0410          1     spill       mixed   72   2:36:1 750452        0      1   (null) none                
c0501          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0502          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0503          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0504          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0505          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0506          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0507          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0508          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0509          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0510          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0511          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0512          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0513          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0514          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0515          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0516          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0517          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0518          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0519          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0520          1     spill   allocated   56   2:28:1 235520        0      1   (null) none                
c0521          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0522          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0523          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0524          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0525          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0526          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0527          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0528          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0529          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0530          1     spill   allocated   56   2:28:1 235520        0      1   (null) none                
c0531          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0532          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0533          1     spill    draining   56   2:28:1 235520        0      1   (null) replace_dimmA8      
c0534          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0535          1     spill       mixed   56   2:28:1 235520        0      1   (null) none                
c0536          1     spill   allocated   56   2:28:1 235520        0      1   (null) none                
c0537          1     spill   allocated   56   2:28:1 235520        0      1   (null) none                
c0538          1     spill   allocated   56   2:28:1 235520        0      1   (null) none                
c0539          1     spill   allocated   56   2:28:1 235520        0      1   (null) none                
c0540          1     spill   allocated   56   2:28:1 235520        0      1   (null) none                
c0802          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0803          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0804          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0805          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0806          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0807          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0808          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0809          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0810          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0811          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0812          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0813          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0814          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0815          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0816          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0817          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0818          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0819          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0820          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0821          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0822          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0823          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0824          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0825          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0826          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0827          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0828          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0829          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0830          1     spill    drained*   48   2:24:1 235520        0      1   (null) mem_upgrade         
c0831          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0832          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0833          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0834          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0835          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0836          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0837          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0838          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0839          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0840          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0901          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0902          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0903          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0904          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0905          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0906          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0907          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0908          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0909          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0910          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0911          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0912          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0913          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0914          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0915          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0916          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0917          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0918          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0919          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0920          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0921          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0922          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0923          1     spill  allocated@   48   2:24:1 364536        0      1   (null) none                
c0924          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0925          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0926          1     spill   allocated   48   2:24:1 364536        0      1   (null) none                
c0927          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0928          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0929          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0930          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0931          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0932          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0933          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0934          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0935          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0936          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0937          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0938          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0939          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c0940          1     spill      mixed@   48   2:24:1 364536        0      1   (null) none                
c1101          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1102          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1103          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1104          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1105          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1106          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1107          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1108          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1109          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1110          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1111          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1112          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1113          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1114          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1115          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1116          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1117          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1118          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1119          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1120          1     spill       mixed   48   2:24:1 364536        0      1   (null) none                
c1121          1     spill       mixed   48   2:24:1 235520        0      1   (null) none                
c1122          1     spill       mixed   48   2:24:1 235520        0      1   (null) none                
c1123          1     spill       mixed   48   2:24:1 235520        0      1   (null) none                
c1124          1     spill       mixed   48   2:24:1 235520        0      1   (null) none                
c1125          1     spill       mixed   48   2:24:1 235520        0      1   (null) none                
c1126          1     spill   allocated   48   2:24:1 235520        0      1   (null) none                
c1127          1     spill   allocated   48   2:24:1 235520        0      1   (null) none                
c1128          1     spill       mixed   48   2:24:1 235520        0      1   (null) none                
c1129          1     spill   allocated   48   2:24:1 235520        0      1   (null) none                
c1130          1     spill   allocated   48   2:24:1 235520        0      1   (null) none                
c1131          1     spill   allocated   48   2:24:1 235520        0      1   (null) none                
c1132          1     spill   allocated   48   2:24:1 235520        0      1   (null) none                
c1133          1     spill   allocated   48   2:24:1 235520        0      1   (null) none                
c1134          1     spill    draining   48   2:24:1 235520        0      1   (null) Kill task failed    
c1135          1     spill       mixed   48   2:24:1 235520        0      1   (null) none                
c1136          1     spill   allocated   48   2:24:1 235520        0      1   (null) none                
c1137          1     spill   allocated   48   2:24:1 235520        0      1   (null) none                
c1138          1     spill   allocated   48   2:24:1 235520        0      1   (null) none                
c1139          1     spill       mixed   48   2:24:1 235520        0      1   (null) none                
c1140          1     spill   allocated   48   2:24:1 235520        0      1   (null) none
Comment 2 Jenny Williams 2019-01-08 08:40:17 MST
# scontrol show config
Configuration data as of 2019-01-08T10:39:59
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost   = m1006
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2019-01-02T17:13:46
BurstBufferType         = (null)
CheckpointType          = checkpoint/none
ClusterName             = longleaf
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CryptoType              = crypto/munge
DebugFlags              = (null)
DefMemPerNode           = UNLIMITED
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = ANY
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FastSchedule            = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 65533 sec
JobAcctGatherFrequency  = task=15
JobAcctGatherType       = jobacct_gather/cgroup
JobAcctGatherParams     = (null)
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = lua,all_partitions
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 = 
Licenses                = mplus:1,nonmem:32
LicensesUsed            = nonmem:0/32,mplus:0/1
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 40001
MaxJobCount             = 350000
MaxJobId                = 67043328
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MemLimitEnforce         = No
MessageTimeout          = 60 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 6170348
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PowerParameters         = (null)
PowerPlugin             = 
PreemptMode             = OFF
PreemptType             = preempt/none
PriorityParameters      = (null)
PriorityDecayHalfLife   = 14-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = SMALL_RELATIVE_TO_TIME,CALCULATE_RUNNING,FAIR_TREE,MAX_TRES
PriorityMaxAge          = 60-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 1000
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS       = 1000
PriorityWeightTRES      = CPU=1000,Mem=4000,GRES/gpu=3000
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc,Contain
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = /usr/sbin/reboot
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 2
RoutePlugin             = route/default
SallocDefaultCommand    = srun -n1 -N1 --gres=gpu:0 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL
SbcastParameters        = (null)
SchedulerParameters     = kill_invalid_depend,batch_sched_delay=10,bf_continue,bf_max_job_part=5000,bf_max_job_test=10000,bf_max_job_user=300,bf_resolution=300,bf_window=10080,bf_yield_interval=1000000,default_queue_depth=1000,partition_job_depth=600,sched_min_interval=2000000,defer,max_rpc_cnt=80
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CPU_MEMORY
SlurmUser               = slurm(47)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = longleaf-sched(172.26.113.4)
SlurmctldLogFile        = /pine/EX/root/slurm-log/slurmctld.log
SlurmctldPort           = 6820-6824
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 65530 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 65530 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 18.08.4
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /pine/EX/root/slurm-log/slurmctld
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 120 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainKmemSpace      = no
ConstrainRAMSpace       = yes
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemLimitThreshold       = 100.0%
MemoryLimitEnforcement  = no
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = yes

Slurmctld(primary) at longleaf-sched is UP
Comment 3 Alejandro Sanchez 2019-01-18 04:30:10 MST
Hi Jenny,

Each job record has an associated struct job_resources[1] that tracks its allocated resources, including memory_allocated per node.

Similarly, each node is associated with another struct node_use_record[2] to track resources allocated to nodes, including alloc_memory reserved by jobs.

The "memory is under-allocated" error is logged by the select/cons_res plugin when it deallocates resources previously reserved for a given job. More specifically, it is logged when job_resources.memory_allocated for a node is higher than that node's node_use_record.alloc_memory, meaning the job's view and the node's view of the memory allocation disagree.

Slurm logs this message to note the mismatch and, instead of subtracting the job's memory amount from the node's total (which would underflow below zero), sets alloc_memory in that node's struct to zero.
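
To make the check concrete, here is a minimal sketch of the compare-and-clamp logic described above. It is not the actual Slurm source: the struct names, the helper release_job_memory() and its return value are hypothetical, and only the two fields named in this comment (alloc_memory and memory_allocated) are modelled. The message format is copied from the log lines in this ticket.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for the real Slurm structs (assumption: only the
 * two fields discussed in this comment are modelled). */
struct node_use_sketch {
    uint64_t alloc_memory;      /* memory the node believes jobs hold */
};

struct job_res_sketch {
    uint64_t memory_allocated;  /* memory the job believes it holds on this node */
};

/* Hypothetical helper: release a job's memory from one node.
 * Returns 1 if a mismatch was detected (and clamped), 0 otherwise. */
static int release_job_memory(struct node_use_sketch *node,
                              const struct job_res_sketch *job,
                              const char *node_name, const char *job_name)
{
    if (node->alloc_memory < job->memory_allocated) {
        /* The two views disagree: log it and clamp to zero rather than
         * letting the unsigned subtraction wrap around. */
        fprintf(stderr,
                "error: node %s memory is under-allocated "
                "(%" PRIu64 "-%" PRIu64 ") for %s\n",
                node_name, node->alloc_memory, job->memory_allocated,
                job_name);
        node->alloc_memory = 0;
        return 1;
    }
    node->alloc_memory -= job->memory_allocated;
    return 0;
}
```

Under this reading, a message such as "(0-3048)" would mean the node's record showed 0 allocated while the job's record still claimed 3048 when the release was attempted.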

This deallocation function[3] is called in several scenarios: when a job finishes, is suspended, is expanded, or is preempted, or when the scheduler builds hypothetical future scenarios by [de]allocating resources to see if/where/when a job will run.

The mismatch shouldn't happen; both jobs and nodes should have the same view about what they allocate/are allocated. In order to try to reproduce:

- could you please attach your slurm.conf? (scontrol show conf isn't showing the node/part definitions)
- could you please attach slurmctld.log including all the log messages related to one of the afflicted jobs? for instance JobId=3906238 and/or JobId=6030929_5(6031087).
- I'm curious to know which use case from the list above triggered the deallocation of resources; I also suspect that jobs may be getting allocated nodes with different hardware, specifically different cpu/memory counts.

Thanks.

[1] https://github.com/SchedMD/slurm/blob/slurm-18-08-4-1/src/common/job_resources.h#L103

[2] https://github.com/SchedMD/slurm/blob/slurm-18-08-4-1/src/plugins/select/cons_res/select_cons_res.h#L99

[3] https://github.com/SchedMD/slurm/blob/slurm-18-08-4-1/src/plugins/select/cons_res/select_cons_res.c#L1227
Comment 4 Jenny Williams 2019-01-22 09:05:01 MST
Created attachment 8974 [details]
slurmctld log
Comment 5 Jenny Williams 2019-01-22 09:05:52 MST
Created attachment 8975 [details]
slurm conf
Comment 6 Jenny Williams 2019-01-22 09:08:15 MST
The most recent job has all messages included in the attached slurmctld.log -
       JobID      User        NodeList                             ReqTRES    Elapsed              Submit               Start 
------------ --------- --------------- ----------------------------------- ---------- ------------------- ------------------- 
8245711          dg144           b1010       billing=1,cpu=1,mem=1G,node=1   04:30:15 2019-01-22T02:52:29 2019-01-22T05:02:42
Comment 7 Alejandro Sanchez 2019-02-14 05:31:38 MST
Jenny, I'm still trying to reproduce this. Did you add/remove nodes while these under-allocated errors were reported?
Comment 8 Jenny Williams 2019-02-14 08:20:25 MST
I am almost certain we were not adding/removing nodes when these errors were generated.  I cannot replicate the condition myself at this point.  The only information I have on this is in the slurm log files.

As far as my own needs are concerned, I'm willing to close this case at this point.
Comment 9 Marshall Garey 2019-03-27 16:42:10 MDT
*** Ticket 6769 has been marked as a duplicate of this ticket. ***
Comment 10 Regine Gaudin 2019-06-04 03:27:32 MDT
Hi

I'm updating this bug because CEA is also encountering the memory under-allocated errors you mentioned, which are filling slurmctld.log:
error: select/cons_res: node machine1234 memory is under-allocated (0-188800) for JobID=XXXXXX

Bug 6879 states that "there are proposed fixes for both issues I mentioned (accrue_cnt underflow and memory under-allocated errors)", so I wanted to let you know that CEA would also be interested in the proposed fixes. Our slurm controller is on 18.08.06 and the clients are on 17.11.6, but they will soon be upgraded to 18.08.06.

Thanks

Regine