Ticket 21825 - Apptainer container won't connect to multiple processors
Summary: Apptainer container won't connect to multiple processors
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Limits
Version: 25.05.x
Hardware: Linux
OS: Linux
Severity: 6 - No support contract
Assignee: Jacob Jenson
Reported: 2025-01-15 19:18 MST by Grace
Modified: 2025-01-16 13:37 MST
Site: Columbia University


Attachments
Apptainer job not able to connect to multiple processors (84.77 KB, image/png)
2025-01-15 19:18 MST, Grace
ErrorFromScript3.png (28.04 KB, image/png)
2025-01-16 12:37 MST, Grace
ScriptWittenWithNate.png (86.55 KB, image/png)
2025-01-16 12:37 MST, Grace
Script3-N3n3.png (122.25 KB, image/png)
2025-01-16 12:37 MST, Grace
ErrorFromScript2.png (86.27 KB, image/png)
2025-01-16 12:37 MST, Grace
Script2-N1n1.png (111.69 KB, image/png)
2025-01-16 12:37 MST, Grace

Description Grace 2025-01-15 19:18:12 MST
Created attachment 40404 [details]
Apptainer job not able to connect to multiple processors

Hello! Following up on some training provided earlier this morning. I am trying to run a job using an Apptainer container; the job requires at least two processors or it will not go forward.

Nathan from your team kindly helped me set up a script, but alas (as far as I can tell), it refuses to accept more than one processor to run the job, and subsequently the job will not begin to run. I am fairly certain this is not a limitation of resources (I own my own node, and even if I make the problem super tiny, it can't connect to that second processor), though I can buy more resources if necessary.

I am really not familiar with containers or HPC systems, so it is likely I am doing something dumb on my end. Any suggestions are most welcome! Thank you!
Comment 2 Patrick Wigger 2025-01-16 12:06:45 MST
Hi Grace,

That initial error suggests a mismatch between the resources provided by Slurm and what is being expected by mpirun. For some more information, could you please attach:
1. The job script that you are working with
2. Other errors (if any) from your job submission that aren't included in the original screenshot.

If you are using mpirun in your script, my first thought is that the underlying node may have hyper-threading enabled. If so, adding --use-hwthread-cpus to the mpirun line will make MPI treat each hardware thread, rather than each core, as a processing element.
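
As a rough illustration of the mismatch described above (a sketch, not taken from the attached scripts): mpirun can only start as many ranks as the Slurm allocation gives it slots for. This assumes Open MPI's mpirun launching the container once per rank. The helper name build_mpirun_cmd, the image app.sif, and the program ./solver are hypothetical placeholders.

```shell
#!/bin/bash
# Sketch only: app.sif and ./solver are hypothetical placeholders.

# Build an mpirun command line whose rank count fits the Slurm allocation.
# $1 = ranks the application needs, $2 = tasks Slurm granted (defaults to 1).
build_mpirun_cmd() {
    local wanted="$1"
    local granted="${2:-1}"
    if [ "$wanted" -gt "$granted" ]; then
        echo "error: want $wanted ranks but Slurm granted $granted task(s)" >&2
        return 1
    fi
    # --use-hwthread-cpus asks Open MPI to count hardware threads as slots.
    echo "mpirun -np $wanted --use-hwthread-cpus apptainer exec app.sif ./solver"
}

# SLURM_NTASKS is set in the job when tasks are requested with -n/--ntasks;
# fall back to 1 when it is unset (for example, outside a job).
build_mpirun_cmd 2 "${SLURM_NTASKS:-1}" || echo "request more tasks, e.g. sbatch -n 2"
```

If the helper reports an error, the batch request (-n/--ntasks) is smaller than the rank count passed to mpirun, which is one way the two can disagree.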

Best,
Patrick
Comment 3 Grace 2025-01-16 12:37:04 MST
Created attachment 40420 [details]
ErrorFromScript3.png

Hi Patrick! Thanks for the quick response. A couple of things are attached here;
hopefully you can see the titles of the pngs on your end.

The first, 'ScriptWrittenWithNate', is what your colleague helped me set up.
The software starts and runs the very first few processes, but then fails
as soon as it needs to connect to a slave processor, as none are available.

I tried two ways to give it more resources. The addition of your
--use-hwthread-cpus did not make a difference in the error message.
Script2-N1n1 produces the error I initially uploaded to your site
(ErrorFromScript2). In Script3-N3n3, I tried increasing the number of
resources requested in sbatch and got a message that srun was not
available (ErrorFromScript3).
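
One possible way to narrow down the "srun was not available" message, assuming it means the launcher could not be found on PATH (the screenshot text is not reproduced in this thread, so that is a guess). The helper name check_launcher is hypothetical. The snippet could be run from the batch script and again inside the container to compare the two environments.

```shell
#!/bin/bash
# Sketch only: reports whether a launcher is reachable from this environment.
check_launcher() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found on PATH"
    fi
}

check_launcher srun
check_launcher mpirun

# Show what Slurm actually granted; both variables are unset outside a job.
echo "nodes=${SLURM_JOB_NUM_NODES:-unset} tasks=${SLURM_NTASKS:-unset}"
```

If srun is found in the batch script but not inside the container, the launcher is probably being looked up from within the container image, where the host's Slurm commands are usually not installed.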

Thanks,
Grace McIlvain


Comment 4 Grace 2025-01-16 12:37:04 MST
Created attachment 40421 [details]
ScriptWittenWithNate.png
Comment 5 Grace 2025-01-16 12:37:04 MST
Created attachment 40422 [details]
Script3-N3n3.png
Comment 6 Grace 2025-01-16 12:37:04 MST
Created attachment 40423 [details]
ErrorFromScript2.png
Comment 7 Grace 2025-01-16 12:37:04 MST
Created attachment 40424 [details]
Script2-N1n1.png
Comment 8 Jacob Jenson 2025-01-16 13:19:35 MST
Grace,

We currently do not have you listed as a supported user for Slurm support
for Columbia. Typically these questions need to go through your university
help desk system. If the help desk or system admins are not able to provide
answers, then they can forward questions to us for assistance.

Thank you,
Jacob

Jacob Jenson

*COO*

+1 925.695.7782

www.schedmd.com


Comment 9 Grace 2025-01-16 13:37:33 MST
Hi Jacob,

They were unable to fix my problem and referred me to you guys. Nate was
helping me out yesterday, and I am just following up with him/you on this last
issue! Thanks so much!

Thanks,
Grace McIlvain

