Ticket 16404

Summary:	Jobs not starting in main production cluster since ca. 11:46 AM 29-03-2023
Product:	Slurm	Reporter:	Hjalti Sveinsson <hjalti.sveinsson>
Component:	Scheduling	Assignee:	Tim McMullan <mcmullan>
Status:	RESOLVED INFOGIVEN	QA Contact:
Severity:	3 - Medium Impact
Priority:	---
Version:	22.05.6
Hardware:	Linux
OS:	Linux
Site:	deCODE	Slinky Site:	---
Attachments:	slurmctld.log.tgz slurmctld.log.old.tgz sdiag output slurm.conf sprio output squeue output

Description Hjalti Sveinsson 2023-03-29 12:21:59 MDT

Hello,

For some reason, our main partition has not scheduled any new jobs since 11:46 AM this morning.

Here is the output from the sacct command:

17536718.ba+ 2023-03-29T11:46:46 2023-03-29T11:59:41  COMPLETED
17536719     2023-03-29T11:46:46 2023-03-29T11:47:03  COMPLETED
17536719.ba+ 2023-03-29T11:46:46 2023-03-29T11:47:03  COMPLETED
17536721     2023-03-29T11:46:46 2023-03-29T11:47:00  COMPLETED
17536721.ba+ 2023-03-29T11:46:46 2023-03-29T11:47:00  COMPLETED
17536722     2023-03-29T11:46:46 2023-03-29T12:11:13  COMPLETED
17536722.ba+ 2023-03-29T11:46:46 2023-03-29T12:11:13  COMPLETED
17536731     2023-03-29T11:46:46 2023-03-29T12:13:33  COMPLETED
17536731.ba+ 2023-03-29T11:46:46 2023-03-29T12:13:33  COMPLETED
17536735     2023-03-29T11:46:46 2023-03-29T11:47:09  COMPLETED
17536735.ba+ 2023-03-29T11:46:46 2023-03-29T11:47:09  COMPLETED
17536738     2023-03-29T11:46:46 2023-03-29T12:12:56  COMPLETED
17536738.ba+ 2023-03-29T11:46:46 2023-03-29T12:12:56  COMPLETED
17536742     2023-03-29T11:46:46 2023-03-29T12:10:58  COMPLETED
17536742.ba+ 2023-03-29T11:46:46 2023-03-29T12:10:58  COMPLETED
17536746     2023-03-29T11:46:46 2023-03-29T11:46:59  COMPLETED
17536746.ba+ 2023-03-29T11:46:46 2023-03-29T11:46:59  COMPLETED
17536751     2023-03-29T11:46:46 2023-03-29T12:11:27  COMPLETED
17536751.ba+ 2023-03-29T11:46:46 2023-03-29T12:11:27  COMPLETED
17536753     2023-03-29T11:46:47 2023-03-29T11:46:54  COMPLETED
17536753.ba+ 2023-03-29T11:46:47 2023-03-29T11:46:54  COMPLETED
17536755     2023-03-29T11:46:47 2023-03-29T12:17:10  COMPLETED
17536755.ba+ 2023-03-29T11:46:47 2023-03-29T12:17:10  COMPLETED
17536759                 Unknown             Unknown    PENDING
17536762                 Unknown             Unknown    PENDING
17536770                 Unknown             Unknown    PENDING
17536772                 Unknown             Unknown    PENDING
17536775                 Unknown             Unknown    PENDING
17536777                 Unknown             Unknown    PENDING
17536783                 Unknown             Unknown    PENDING
17536786                 Unknown             Unknown    PENDING
17536790                 Unknown             Unknown    PENDING
17536794                 Unknown             Unknown    PENDING
17536802                 Unknown             Unknown    PENDING
17536803                 Unknown             Unknown    PENDING

and the list goes on.....

I will attach the slurmctld log file, the slurm.conf config, and the sdiag output.

We did not make any changes to our cluster at 11:46 this morning, but we did move some nodes between partitions after 15:00 today.
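
For reference, the point at which scheduling stalled can be read off a listing like the sacct output above with a few lines of Python. This is only a minimal sketch: it assumes four whitespace-separated columns (JobID, Start, End, State) as in the excerpt, and the function name `last_start_and_pending` is made up for illustration.

```python
from datetime import datetime


def last_start_and_pending(rows):
    """Return (latest start time, number of pending jobs) from sacct-style rows.

    Assumes each row has exactly four columns: JobID, Start, End, State.
    Job steps (IDs containing a dot, e.g. "17536719.ba+") are skipped.
    """
    latest = None
    pending = 0
    for line in rows:
        fields = line.split()
        if len(fields) != 4:
            continue  # not a data row in the assumed format
        job_id, start, _end, state = fields
        if "." in job_id:
            continue  # job step, not the job itself
        if state == "PENDING":
            pending += 1
        elif start != "Unknown":
            ts = datetime.strptime(start, "%Y-%m-%dT%H:%M:%S")
            if latest is None or ts > latest:
                latest = ts
    return latest, pending


# A few rows copied from the listing above.
sample = [
    "17536753     2023-03-29T11:46:47 2023-03-29T11:46:54  COMPLETED",
    "17536753.ba+ 2023-03-29T11:46:47 2023-03-29T11:46:54  COMPLETED",
    "17536755     2023-03-29T11:46:47 2023-03-29T12:17:10  COMPLETED",
    "17536759                 Unknown             Unknown    PENDING",
    "17536762                 Unknown             Unknown    PENDING",
]
latest, pending = last_start_and_pending(sample)
print(latest.isoformat(), pending)
```

On these sample rows the latest start is 11:46:47 and two jobs are pending, which matches the time at which jobs stopped starting.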

Comment 1 Hjalti Sveinsson 2023-03-29 12:25:37 MDT

Created attachment 29586 [details]
slurmctld.log.tgz

Comment 2 Hjalti Sveinsson 2023-03-29 12:26:31 MDT

Created attachment 29587 [details]
slurmctld.log.old.tgz

Comment 3 Hjalti Sveinsson 2023-03-29 12:26:52 MDT

Created attachment 29588 [details]
sdiag output

Comment 4 Hjalti Sveinsson 2023-03-29 12:27:19 MDT

Created attachment 29589 [details]
slurm.conf

Comment 5 Jason Booth 2023-03-29 12:53:59 MDT

Would you also be able to provide the output of the following?

> squeue -l
> sprio

Also, are you referring to the partition "DEFAULT" or "cpu_hog"?

>PartitionName=DEFAULT Nodes=ru-hpc-[0240-0311],ru-hpc-[0321-0392,0402-0410,0412-0419,0421,0423-0428,0430,0432-0436,0438-0477,0480,0482-0484,0486-0489,0491-0494,0496-0510,0519-0540,0562-0563,0567,0569,0571-0582,0585,0588-0590,0592-0594,0598-0599,0604-0606,0612-0635,0638-0640],ru-hpc-[0901-1044],ru-hpc-[1300-1308,1310-1318,1320-1322,1325-1329,1331-1333,1335-1358,1360-1379,1381-1391,1394,1395,1398,1399],ru-hpc-[1405-1425,1427-1429,1430-1431,1433-1442],ru-hpc-[1045-1059,1070-1115,1120-1155,1168-1175],ru-hpc-[2001-2099,2101-2188,2193-2207,2209-2222,2224,2226-2238,2243-2246,2248-2265,2267-2269,2272-2273,2276,2279-2283,2285-2290,2292-2295,2297-2302,2304,2311,2314-2320,2323-2324,2326-2327,2329-2337,2339-2400,2411,2416-2432],ru-hpc-[1221-1237,1240-1255,1258-1275,1278-1299] DefaultTime=01:00:00 MaxTime=20-00:00:00 State=UP
> ...
>PartitionName=cpu_hog Default=YES Priority=1000
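
As an aside, partition lines like the ones quoted above can be split into key/value pairs to compare settings such as Priority across partitions. The snippet below is a rough sketch, not how Slurm itself parses slurm.conf: the helper name `parse_partition_line` is hypothetical, and it ignores comments, line continuations, and the fact that slurm.conf keys are case-insensitive.

```python
def parse_partition_line(line):
    """Split one slurm.conf partition line into a dict of Key=Value options.

    Simplified: tokens without '=' are ignored, and values are kept as strings.
    """
    opts = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            opts[key] = value
    return opts


# The cpu_hog line quoted above, without the leading '>'.
cpu_hog = parse_partition_line("PartitionName=cpu_hog Default=YES Priority=1000")
print(cpu_hog["PartitionName"], cpu_hog["Priority"])
```

Note that a `PartitionName=DEFAULT` line in slurm.conf sets default values for the partition definitions that follow it, whereas `Default=YES` marks the partition that jobs land in when none is requested.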

Comment 7 Hjalti Sveinsson 2023-03-29 12:56:58 MDT

We seem to have found a possible cause for this. These jobs were pending:

17493333            big_calc            2023-03-29T08:40:23 PENDING             392000M
17493332            big_calc            2023-03-29T08:40:23 PENDING             392000M
17493330            big_calc            2023-03-29T08:40:22 PENDING             392000M
17493328            big_calc            2023-03-29T08:40:22 PENDING             392000M
17493327            big_calc            2023-03-29T08:40:22 PENDING             392000M
17493325            big_calc            2023-03-29T08:40:22 PENDING             392000M 

The big_calc partition has a higher priority than the default partition. We moved these jobs to the mem_hog partition, and now jobs are getting through.
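
A check like the one described above (looking for pending jobs in a partition that outranks the default one) could be sketched as follows. The column layout (JobID, Partition, SubmitTime, State, Memory) is taken from the listing in this comment; the priority value for big_calc is a placeholder, since only cpu_hog's Priority=1000 is quoted in this ticket, and the function name `blocking_candidates` is hypothetical. Whether such a job actually holds up lower-priority work depends on the scheduler configuration and on which nodes the partitions share, so this only flags candidates.

```python
def blocking_candidates(rows, priorities, default_partition):
    """Return (job_id, partition, memory) for pending jobs in partitions whose
    priority is higher than that of the default partition.

    Assumes five whitespace-separated columns per row:
    JobID, Partition, SubmitTime, State, Memory.
    """
    base = priorities[default_partition]
    found = []
    for line in rows:
        fields = line.split()
        if len(fields) != 5:
            continue  # not a data row in the assumed format
        job_id, partition, _submit, state, memory = fields
        if state == "PENDING" and priorities.get(partition, 0) > base:
            found.append((job_id, partition, memory))
    return found


# Placeholder priorities: only cpu_hog=1000 comes from the quoted slurm.conf;
# the big_calc value is an assumption for illustration.
priorities = {"cpu_hog": 1000, "big_calc": 2000}
rows = [
    "17493333            big_calc            2023-03-29T08:40:23 PENDING             392000M",
    "17493332            big_calc            2023-03-29T08:40:23 PENDING             392000M",
    "17536759            cpu_hog             2023-03-29T11:46:47 PENDING             4000M",
]
candidates = blocking_candidates(rows, priorities, "cpu_hog")
print(candidates)
```

The third row is an invented cpu_hog job, included only to show that jobs in the default partition are not flagged.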

Comment 8 Hjalti Sveinsson 2023-03-29 12:58:41 MDT

Created attachment 29591 [details]
sprio output

Comment 9 Hjalti Sveinsson 2023-03-29 12:59:13 MDT

Created attachment 29592 [details]
squeue output

Comment 10 Jason Booth 2023-03-29 13:01:39 MDT

That is good to hear. I will have Tim follow up with you should you need more work done on this. For now, I will have him hold onto this issue for the day.

Comment 11 Tim McMullan 2023-04-03 11:37:24 MDT

I just wanted to check in and make sure that after moving these jobs things have been running correctly!

Thanks!
--Tim

Comment 12 Tim McMullan 2023-04-05 12:38:47 MDT

Hi Hjalti,

I'm going to resolve this ticket for now since it seems like the issue is resolved, but please let us know if you need further assistance!

Thanks!
--Tim

Comment 13 Hjalti Sveinsson 2023-04-12 06:36:26 MDT

Yes, this was the cause. You can close this non-issue. Thank you for the support.