Ticket 11628 - Slurm 20.11.4 support for Forge 19
Summary: Slurm 20.11.4 support for Forge 19
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.11.4
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marcin Stolarek
Reported: 2021-05-16 17:58 MDT by Tony Racho
Modified: 2021-06-18 05:15 MDT
Site: CRAY
Cray Sites: NIWA/WELLINGTON


Description Tony Racho 2021-05-16 17:58:35 MDT
Hello:

We've upgraded to 20.11.4 (from 19.05.x), and we now get the following error when we try to run our ARM applications:

$ ml forge/19.0
$ ddt --offline srun ls
srun: error: Unable to create step for job 20054593: Requested nodes are busy

They used to work with our 19.05 versions.

Could this be a support issue with Slurm 20.11.4 on Forge 19, or just a Slurm configuration issue?

Cheers,
Tony
Comment 2 Marcin Stolarek 2021-05-18 06:15:33 MDT
Tony,

Before we jump into more detailed debugging, could you please check whether adding
>export SLURM_OVERLAP=1

before running the other commands changes the behavior?
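
A minimal sketch of that check, for readers who would rather not change the whole shell session. The helper name `with_overlap` is hypothetical and not part of Slurm or Forge; it only sets SLURM_OVERLAP=1 (which srun reads as the equivalent of `--overlap`) for a single command.

```shell
# Hypothetical helper (not part of Slurm or Forge): run one command with
# SLURM_OVERLAP=1 in its environment, leaving the calling shell untouched.
with_overlap() {
    env SLURM_OVERLAP=1 "$@"
}

# Intended use inside a job allocation (not executed here, since it
# needs the Forge module and a Slurm cluster):
#   ml forge/19.0
#   with_overlap ddt --offline srun ls
```

Plain `export SLURM_OVERLAP=1` has the same effect for every later command in the session; the wrapper only narrows the scope.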

cheers,
Marcin
Comment 4 Tony Racho 2021-05-18 18:23:20 MDT
Marcin:

That worked.

Thanks,
Tony
Comment 5 Marcin Stolarek 2021-05-19 01:34:53 MDT
Tony,

I'm guessing the issue is that `ddt` (probably) calls srun behind the scenes. One of the major changes in Slurm 20.11 is that step resources no longer overlap by default. If you can verify how ddt works (one option is to run it under strace and check whether srun is executed), that will help us understand the case.
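
One way to carry out the strace check could look like the sketch below. It assumes strace is installed on the node; the log file name `ddt.strace` and the helper `count_srun_execs` are made up for illustration. The helper counts the `execve` calls in an strace log whose target binary is named `srun`.

```shell
# Hypothetical helper: count execve() calls for a binary named "srun"
# in an strace log. Prints 0 when srun was never executed.
count_srun_execs() {
    grep -E -c 'execve\("([^"]*/)?srun"' "$1"
}

# Intended use on the cluster (not executed here):
#   strace -f -e trace=execve -o ddt.strace ddt --offline ls
#   count_srun_execs ddt.strace
```

A non-zero count for a plain `ddt --offline ls` run would suggest that ddt launches srun itself, so that a second srun on the command line creates a step inside a step.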

Another thing to check is whether you need srun in the `ddt` arguments at all. If srun is already called by ddt, maybe just `ddt ls` will work.

You may want to check Bug 11341 for more details. From the perspective of the 20.11 change this ticket could be called a duplicate, but I'd like to make sure we fully understand what's happening in your case so we can find the best long-term solution for you. Exporting SLURM_OVERLAP is something I'd call a workaround for now.

Let me know your thoughts.

cheers,
Marcin
Comment 6 Tony Racho 2021-05-19 20:42:38 MDT
Will check this out.

Thanks,
Tony
Comment 7 Marcin Stolarek 2021-05-27 03:46:47 MDT
Tony,

Were you able to check the details?

cheers,
Marcin
Comment 8 Tony Racho 2021-05-30 22:47:25 MDT
Hi Marcin:

Apologies.

I was distracted by other things lately.

Will find that out and update.

Cheers,
Tony
Comment 9 Marcin Stolarek 2021-06-10 02:04:13 MDT
Tony,

Were you able to get back to the case?

cheers,
Marcin
Comment 10 Marcin Stolarek 2021-06-18 04:26:31 MDT
Tony,

Let me know if you want to continue working on this. In case of no reply I'll close the bug as info given.

cheers,
Marcin
Comment 11 Tony Racho 2021-06-18 04:28:31 MDT
Hi Marcin:

Apologies. 

A bit busy at the site at the moment due to hardware expansions. 

Please close the ticket.

Much appreciated. 

Cheers,
Tony