Hello, for some reason our main partition has not been scheduling new jobs since 11:46 this morning. Here is output from the sacct command:

17536718.ba+  2023-03-29T11:46:46  2023-03-29T11:59:41  COMPLETED
17536719      2023-03-29T11:46:46  2023-03-29T11:47:03  COMPLETED
17536719.ba+  2023-03-29T11:46:46  2023-03-29T11:47:03  COMPLETED
17536721      2023-03-29T11:46:46  2023-03-29T11:47:00  COMPLETED
17536721.ba+  2023-03-29T11:46:46  2023-03-29T11:47:00  COMPLETED
17536722      2023-03-29T11:46:46  2023-03-29T12:11:13  COMPLETED
17536722.ba+  2023-03-29T11:46:46  2023-03-29T12:11:13  COMPLETED
17536731      2023-03-29T11:46:46  2023-03-29T12:13:33  COMPLETED
17536731.ba+  2023-03-29T11:46:46  2023-03-29T12:13:33  COMPLETED
17536735      2023-03-29T11:46:46  2023-03-29T11:47:09  COMPLETED
17536735.ba+  2023-03-29T11:46:46  2023-03-29T11:47:09  COMPLETED
17536738      2023-03-29T11:46:46  2023-03-29T12:12:56  COMPLETED
17536738.ba+  2023-03-29T11:46:46  2023-03-29T12:12:56  COMPLETED
17536742      2023-03-29T11:46:46  2023-03-29T12:10:58  COMPLETED
17536742.ba+  2023-03-29T11:46:46  2023-03-29T12:10:58  COMPLETED
17536746      2023-03-29T11:46:46  2023-03-29T11:46:59  COMPLETED
17536746.ba+  2023-03-29T11:46:46  2023-03-29T11:46:59  COMPLETED
17536751      2023-03-29T11:46:46  2023-03-29T12:11:27  COMPLETED
17536751.ba+  2023-03-29T11:46:46  2023-03-29T12:11:27  COMPLETED
17536753      2023-03-29T11:46:47  2023-03-29T11:46:54  COMPLETED
17536753.ba+  2023-03-29T11:46:47  2023-03-29T11:46:54  COMPLETED
17536755      2023-03-29T11:46:47  2023-03-29T12:17:10  COMPLETED
17536755.ba+  2023-03-29T11:46:47  2023-03-29T12:17:10  COMPLETED
17536759      Unknown              Unknown              PENDING
17536762      Unknown              Unknown              PENDING
17536770      Unknown              Unknown              PENDING
17536772      Unknown              Unknown              PENDING
17536775      Unknown              Unknown              PENDING
17536777      Unknown              Unknown              PENDING
17536783      Unknown              Unknown              PENDING
17536786      Unknown              Unknown              PENDING
17536790      Unknown              Unknown              PENDING
17536794      Unknown              Unknown              PENDING
17536802      Unknown              Unknown              PENDING
17536803      Unknown              Unknown              PENDING

and the list goes on. I will attach the slurmctld log file, the slurm.conf config, and the sdiag output. We did not make any changes to our cluster at 11:46 this morning, but we did move some nodes between partitions after 15:00 today.
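For reference, a listing like the one above can be reproduced with something along these lines (the start time and field list here are illustrative, not the exact command we ran):

sacct --starttime 2023-03-29T11:40:00 --format=JobID,Start,End,State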
Created attachment 29586: slurmctld.log.tgz
Created attachment 29587: slurmctld.log.old.tgz
Created attachment 29588: sdiag output
Created attachment 29589: slurm.conf
Would you also be able to provide the output of the following?

> squeue -l
> sprio

Also, are you referring to the partition "DEFAULT" or "cpu_hog"?

> PartitionName=DEFAULT Nodes=ru-hpc-[0240-0311],ru-hpc-[0321-0392,0402-0410,0412-0419,0421,0423-0428,0430,0432-0436,0438-0477,0480,0482-0484,0486-0489,0491-0494,0496-0510,0519-0540,0562-0563,0567,0569,0571-0582,0585,0588-0590,0592-0594,0598-0599,0604-0606,0612-0635,0638-0640],ru-hpc-[0901-1044],ru-hpc-[1300-1308,1310-1318,1320-1322,1325-1329,1331-1333,1335-1358,1360-1379,1381-1391,1394,1395,1398,1399],ru-hpc-[1405-1425,1427-1429,1430-1431,1433-1442],ru-hpc-[1045-1059,1070-1115,1120-1155,1168-1175],ru-hpc-[2001-2099,2101-2188,2193-2207,2209-2222,2224,2226-2238,2243-2246,2248-2265,2267-2269,2272-2273,2276,2279-2283,2285-2290,2292-2295,2297-2302,2304,2311,2314-2320,2323-2324,2326-2327,2329-2337,2339-2400,2411,2416-2432],ru-hpc-[1221-1237,1240-1255,1258-1275,1278-1299] DefaultTime=01:00:00 MaxTime=20-00:00:00 State=UP
> ...
> PartitionName=cpu_hog Default=YES Priority=1000
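If it is easier, the partition priorities can also be checked directly on the controller; this is a generic example, not tied to your specific config:

scontrol show partition | grep -E 'PartitionName|Priority'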
We seem to have found a possible cause for this; these jobs were pending:

17493333  big_calc  2023-03-29T08:40:23  PENDING  392000M
17493332  big_calc  2023-03-29T08:40:23  PENDING  392000M
17493330  big_calc  2023-03-29T08:40:22  PENDING  392000M
17493328  big_calc  2023-03-29T08:40:22  PENDING  392000M
17493327  big_calc  2023-03-29T08:40:22  PENDING  392000M
17493325  big_calc  2023-03-29T08:40:22  PENDING  392000M

The partition these jobs were submitted to has a higher priority than the default partition. We moved these jobs to the mem_hog partition, and now jobs are getting through.
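In case it is useful to others, a pending job can be moved to a different partition without resubmitting, roughly like this (the job ID below is a placeholder):

scontrol update JobId=<jobid> Partition=mem_hog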
Created attachment 29591: sprio output
Created attachment 29592: squeue output
That is good to hear. I will have Tim follow up with you should you need more work done on this. For now, I will have him hold onto this issue for the day.
I just wanted to check in and make sure that things have been running correctly after moving those jobs! Thanks! --Tim
Hi Hjalti, I'm going to resolve this ticket for now since it seems like the issue is resolved, but please let us know if you need further assistance! Thanks! --Tim
Yes, this was the cause. You can close this non-issue. Thank you for the support.