Ticket 21825

Summary: Apptainer container won't connect to multiple processors
Product: Slurm Reporter: Grace <gm3128>
Component: Limits Assignee: Jacob Jenson <jacob>
Status: OPEN --- QA Contact:
Severity: 6 - No support contract    
Priority: --- CC: oscar.hernandez
Version: 25.05.x   
Hardware: Linux   
OS: Linux   
Site: Columbia University Alineos Sites: ---
Atos/Eviden Sites: --- Confidential Site: ---
Coreweave sites: --- Cray Sites: ---
DS9 clusters: --- HPCnow Sites: ---
HPE Sites: --- IBM Sites: ---
NOAA SIte: --- NoveTech Sites: ---
Nvidia HWinf-CS Sites: --- OCF Sites: ---
Recursion Pharma Sites: --- SFW Sites: ---
SNIC sites: --- Tzag Elita Sites: ---
Linux Distro: --- Machine Name:
CLE Version: Version Fixed:
Target Release: --- DevPrio: ---
Emory-Cloud Sites: ---
Attachments: Apptainer job not able to connect to multiple processors
ErrorFromScript3.png
ScriptWittenWithNate.png
Script3-N3n3.png
ErrorFromScript2.png
Script2-N1n1.png

Description Grace 2025-01-15 19:18:12 MST
Created attachment 40404 [details]
Apptainer job not able to connect to multiple processors

Hello! Following up on some training provided earlier this morning. I am trying to run a job using an Apptainer container; the job requires at least two processors or it will not go forward.

Nathan from your team kindly helped me set up a script, but alas (as far as I can tell), it refuses to accept more than one processor to run the job -- and subsequently will not begin to run. I am fairly certain this is not a limitation of resources (I own my own node, and even if I make the problem super tiny, it can't connect to that second processor). Though I can buy more resources if necessary.

I am really not familiar with containers or HPCs, so it is likely I am doing something dumb on my end. Any suggestions are most welcome! Thank you!
Comment 2 Patrick Wigger 2025-01-16 12:06:45 MST
Hi Grace,

That initial error suggests a mismatch between the resources Slurm provides and the resources mpirun expects. For some more information, could you please attach:
1. The job script that you are working with
2. Other errors (if any) from your job submission that aren't included in the original screenshot.

If you are using mpirun in your script, my first thought is that the underlying node may have hyper-threading enabled. If so, adding --use-hwthread-cpus to the mpirun line lets MPI treat each hardware thread, rather than each core, as a processing element.
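
To illustrate the slot arithmetic behind that suggestion, here is a minimal Python sketch. It is a simplified model, not Open MPI's actual logic: it assumes mpirun counts one slot per physical core by default and one slot per hardware thread when --use-hwthread-cpus is given. The function names (slots_visible_to_mpi, check_request) and the example node (1 core, 2 threads per core) are hypothetical and chosen only for illustration.

```python
def slots_visible_to_mpi(cores: int, threads_per_core: int,
                         use_hwthread_cpus: bool = False) -> int:
    """Simplified model (an assumption, not Open MPI source code):
    count physical cores by default, hardware threads when the flag is set."""
    if use_hwthread_cpus:
        return cores * threads_per_core
    return cores


def check_request(requested_ranks: int, cores: int, threads_per_core: int,
                  use_hwthread_cpus: bool = False) -> str:
    """Report whether the requested number of MPI ranks fits in the visible slots."""
    slots = slots_visible_to_mpi(cores, threads_per_core, use_hwthread_cpus)
    if requested_ranks <= slots:
        return f"ok: {requested_ranks} ranks requested, slots visible: {slots}"
    return f"mismatch: {requested_ranks} ranks requested, slots visible: {slots}"


# Hypothetical allocation: 1 physical core with 2 hardware threads,
# and a job that needs 2 MPI ranks.
print(check_request(2, cores=1, threads_per_core=2))
print(check_request(2, cores=1, threads_per_core=2, use_hwthread_cpus=True))
```

In this model, a job that fits only in the second call points toward hyper-threading as a plausible cause. If the request does not fit in either case, the allocation itself is probably too small for the requested rank count.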

Best,
Patrick
Comment 3 Grace 2025-01-16 12:37:04 MST
Created attachment 40420 [details]
ErrorFromScript3.png

Hi Patrick! Thanks for the quick response. A couple of things are attached here;
hopefully you can see the titles of the PNGs on your end.

The first, 'ScriptWrittenWithNate', is what your colleague helped me set up.
The software executes and runs the very first few processes, but then fails
as soon as it needs to connect to a Slave processor, as none are available.

I tried two ways to give it more resources. Adding your
--use-hwthread-cpus did not make a difference in the error message.
Script2-N1n1 produces the error I initially uploaded to your site
(ErrorFromScript2). In Script3-N3n3, I tried increasing the number of
resources in sbatch and got a message that srun was not
available (ErrorFromScript3).

Thanks,
Grace McIlvain


Comment 4 Grace 2025-01-16 12:37:04 MST
Created attachment 40421 [details]
ScriptWittenWithNate.png
Comment 5 Grace 2025-01-16 12:37:04 MST
Created attachment 40422 [details]
Script3-N3n3.png
Comment 6 Grace 2025-01-16 12:37:04 MST
Created attachment 40423 [details]
ErrorFromScript2.png
Comment 7 Grace 2025-01-16 12:37:04 MST
Created attachment 40424 [details]
Script2-N1n1.png
Comment 8 Jacob Jenson 2025-01-16 13:19:35 MST
Grace,

We currently do not have you listed as a supported user for Slurm support
for Columbia. Typically these questions need to go through your university
help desk system. If the help desk or system admins are not able to provide
answers, then they can forward questions to us for assistance.

Thank you,
Jacob

Jacob Jenson

*COO*

+1 925.695.7782

www.schedmd.com


Comment 9 Grace 2025-01-16 13:37:33 MST
Hi Jacob,

They were unable to fix my problem and referred me to you guys. Nate was
helping me out yesterday, and I am just following up with him/you on this last
issue. Thanks so much!

Thanks,
Grace McIlvain

