Ticket 11628

Summary: Slurm 20.11.4 support for Forge 19
Product: Slurm Reporter: Tony Racho <antonio-ii.racho>
Component: Other Assignee: Marcin Stolarek <cinek>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 4 - Minor Issue    
Priority: --- CC: antonio-ii.racho
Version: 20.11.4   
Hardware: Linux   
OS: Linux   
Site: CRAY Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: NIWA/WELLINGTON
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA Site: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Linux Distro: ---
Machine Name: CLE Version:
Version Fixed: Target Release: ---
DevPrio: --- Emory-Cloud Sites: ---

Description Tony Racho 2021-05-16 17:58:35 MDT
Hello:

We've upgraded to Slurm 20.11.4 (from 19.05.x), and when we try to run our ARM applications we get the following error:

$ ml forge/19.0
$ ddt --offline srun ls
srun: error: Unable to create step for job 20054593: Requested nodes are busy

They used to work with our 19.05 versions.

Could this be a support issue with Slurm 20.11.4 on Forge 19, or just a Slurm configuration issue?

Cheers,
Tony
Comment 2 Marcin Stolarek 2021-05-18 06:15:33 MDT
Tony,

Before we jump into more detailed debugging, could you please check whether adding
>export SLURM_OVERLAP=1

before the other commands changes the behavior?
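For reference, the check could be run roughly as in the sketch below. It reuses the commands from the original report and assumes that the Forge module is already loaded (`ml forge/19.0`) and that the session is inside the same job allocation. The `command -v` guard is only an illustrative addition so that the snippet does nothing harmful on a machine without the tools.

```shell
# Workaround sketch: SLURM_OVERLAP is read by srun as an input environment
# variable and has the same effect as passing --overlap on the command line.
export SLURM_OVERLAP=1

# Same command as in the report. The guard is an assumption added for
# illustration; it skips the run when ddt or srun is not in PATH.
if command -v ddt >/dev/null 2>&1 && command -v srun >/dev/null 2>&1; then
    ddt --offline srun ls
else
    echo "ddt or srun not found in PATH; skipping" >&2
fi
```

If the step is created without the "Requested nodes are busy" error, the change in step overlap defaults is the likely cause.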

cheers,
Marcin
Comment 4 Tony Racho 2021-05-18 18:23:20 MDT
Marcin:

That worked.

Thanks,
Tony
Comment 5 Marcin Stolarek 2021-05-19 01:34:53 MDT
Tony,

I'm guessing that the issue is that `ddt` (probably) calls srun behind the scenes, and one of the major changes in Slurm 20.11 is that step resources no longer overlap by default. If you can verify how ddt works (one option is to use strace and check whether srun is executed), this will help us understand the case.
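A minimal sketch of the strace check could look like the following. The log path and the helper name `srun_execs` are hypothetical, chosen only for illustration, and the guard skips the run when strace or ddt is not installed.

```shell
# Hypothetical location for the trace output.
trace_log="${TMPDIR:-/tmp}/ddt-exec-trace.log"

# Print the execve lines of an strace log whose executable path ends in "srun".
# PATH lookups that failed also match; they are marked "= -1 ENOENT" in the log.
srun_execs() {
    grep -E 'execve\("[^"]*srun"' "$1"
}

if command -v strace >/dev/null 2>&1 && command -v ddt >/dev/null 2>&1; then
    # -f follows child processes, -e trace=execve records only program
    # launches, -o writes the trace to a file instead of stderr.
    strace -f -e trace=execve -o "$trace_log" ddt --offline srun ls || true
    srun_execs "$trace_log" || echo "no srun execve found in $trace_log"
else
    echo "strace or ddt not found in PATH; skipping" >&2
fi
```

More than one successful srun launch in the output would suggest that ddt starts its own srun in addition to the one given on the command line.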

Another thing to check is whether you need srun in the `ddt` arguments at all; maybe just `ddt ls` will work if srun is already called by ddt.

You may want to check Bug 11341 for more details. From the 20.11 change perspective this may be called a duplicate, but I'd like to make sure that we fully understand what's happening in your case so that we have the best long-term solution for you. Exporting SLURM_OVERLAP is something I'd call a workaround for now.

Let me know your thoughts.

cheers,
Marcin
Comment 6 Tony Racho 2021-05-19 20:42:38 MDT
Will check this out.

Thanks,
Tony
Comment 7 Marcin Stolarek 2021-05-27 03:46:47 MDT
Tony,

Were you able to check the details?

cheers,
Marcin
Comment 8 Tony Racho 2021-05-30 22:47:25 MDT
Hi Marcin:

Apologies.

I've been distracted by other things lately.

Will find that out and update.

Cheers,
Tony
Comment 9 Marcin Stolarek 2021-06-10 02:04:13 MDT
Tony,

Were you able to get back to the case?

cheers,
Marcin
Comment 10 Marcin Stolarek 2021-06-18 04:26:31 MDT
Tony,

Let me know if you want to continue working on this. In case of no reply I'll close the bug as info given.

cheers,
Marcin
Comment 11 Tony Racho 2021-06-18 04:28:31 MDT
Hi Marcin:

Apologies. 

A bit busy at the site at the moment due to hardware expansions. 

Please close the ticket.

Much appreciated. 

Cheers,
Tony