| Summary: | Jobs not starting in main production cluster since ca. 11:46 AM 29-03-2023 | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Hjalti Sveinsson <hjalti.sveinsson> |
| Component: | Scheduling | Assignee: | Tim McMullan <mcmullan> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 22.05.6 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | deCODE | Slinky Site: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | Version Fixed: | ||
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: |
slurmctld.log.tgz
slurmctld.log.old.tgz
sdiag output
slurm.conf
sprio output
squeue output |
||
|
Description
Hjalti Sveinsson
2023-03-29 12:21:59 MDT
Created attachment 29586 [details]
slurmctld.log.tgz
Created attachment 29587 [details]
slurmctld.log.old.tgz
Created attachment 29588 [details]
sdiag output
Created attachment 29589 [details]
slurm.conf
Would you also be able to provide the output of the following?

> squeue -l
> sprio

Also, are you referring to the partition "DEFAULT" or "cpu_hog"?

> PartitionName=DEFAULT Nodes=ru-hpc-[0240-0311],ru-hpc-[0321-0392,0402-0410,0412-0419,0421,0423-0428,0430,0432-0436,0438-0477,0480,0482-0484,0486-0489,0491-0494,0496-0510,0519-0540,0562-0563,0567,0569,0571-0582,0585,0588-0590,0592-0594,0598-0599,0604-0606,0612-0635,0638-0640],ru-hpc-[0901-1044],ru-hpc-[1300-1308,1310-1318,1320-1322,1325-1329,1331-1333,1335-1358,1360-1379,1381-1391,1394,1395,1398,1399],ru-hpc-[1405-1425,1427-1429,1430-1431,1433-1442],ru-hpc-[1045-1059,1070-1115,1120-1155,1168-1175],ru-hpc-[2001-2099,2101-2188,2193-2207,2209-2222,2224,2226-2238,2243-2246,2248-2265,2267-2269,2272-2273,2276,2279-2283,2285-2290,2292-2295,2297-2302,2304,2311,2314-2320,2323-2324,2326-2327,2329-2337,2339-2400,2411,2416-2432],ru-hpc-[1221-1237,1240-1255,1258-1275,1278-1299] DefaultTime=01:00:00 MaxTime=20-00:00:00 State=UP
> ...
> PartitionName=cpu_hog Default=YES Priority=1000

We seem to have found a possible cause for this. These jobs were pending:

17493333 big_calc 2023-03-29T08:40:23 PENDING 392000M
17493332 big_calc 2023-03-29T08:40:23 PENDING 392000M
17493330 big_calc 2023-03-29T08:40:22 PENDING 392000M
17493328 big_calc 2023-03-29T08:40:22 PENDING 392000M
17493327 big_calc 2023-03-29T08:40:22 PENDING 392000M
17493325 big_calc 2023-03-29T08:40:22 PENDING 392000M

This partition has a higher priority than the default partition. We moved these jobs to the mem_hog partition, and now jobs are getting through.

Created attachment 29591 [details]
sprio output
Created attachment 29592 [details]
squeue output
That is good to hear. I will have Tim follow up with you should you need more work done on this. For now, I will have him hold onto this issue for the day.

I just wanted to check in and make sure that after moving these jobs things have been running correctly!

Thanks!
--Tim

Hi Hjalti,

I'm going to resolve this ticket for now since it seems like the issue is resolved, but please let us know if you need further assistance!

Thanks!
--Tim

Yes, this was the cause. You can close this non-issue. Thank you for the support. |
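The workaround reported in this ticket, moving the blocking pending jobs to the mem_hog partition, could be scripted roughly as below. This is only a sketch under stated assumptions: it parses the pending-job listing quoted in the ticket and prints `scontrol update JobId=<id> Partition=<name>` commands without running them. The helper names (`parse_rows`, `move_commands`) and the 300000 MB threshold are hypothetical, and the ticket does not say whether the second column of the listing is a partition or a job name, so the code only carries it along as a label.

```python
import re
from typing import NamedTuple


class PendingJob(NamedTuple):
    job_id: int
    label: str           # second column; partition or job name (not stated in the ticket)
    submit_time: str
    state: str
    min_memory_mb: int


# One row looks like: 17493333 big_calc 2023-03-29T08:40:23 PENDING 392000M
ROW = re.compile(r"^(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)M$")


def parse_rows(text: str) -> list[PendingJob]:
    """Parse listing rows; lines that do not match the row shape are skipped."""
    jobs = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, blank line, or unexpected format
        job_id, label, submitted, state, mem = match.groups()
        jobs.append(PendingJob(int(job_id), label, submitted, state, int(mem)))
    return jobs


def move_commands(jobs, target_partition, min_memory_mb=0):
    """Build (but do not run) one scontrol command per large pending job."""
    return [
        f"scontrol update JobId={job.job_id} Partition={target_partition}"
        for job in jobs
        if job.state == "PENDING" and job.min_memory_mb >= min_memory_mb
    ]


# The six rows quoted in the ticket.
LISTING = """\
17493333 big_calc 2023-03-29T08:40:23 PENDING 392000M
17493332 big_calc 2023-03-29T08:40:23 PENDING 392000M
17493330 big_calc 2023-03-29T08:40:22 PENDING 392000M
17493328 big_calc 2023-03-29T08:40:22 PENDING 392000M
17493327 big_calc 2023-03-29T08:40:22 PENDING 392000M
17493325 big_calc 2023-03-29T08:40:22 PENDING 392000M
"""

if __name__ == "__main__":
    # The threshold is an illustrative value, not one taken from the ticket.
    for command in move_commands(parse_rows(LISTING), "mem_hog", min_memory_mb=300_000):
        print(command)
```

Printing the commands rather than executing them leaves room to review the list before any job is changed, which seems prudent on a production cluster.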