Created attachment 40404 [details] Apptainer job not able to connect to multiple processors

Hello! Following up on some training provided earlier this morning. I am trying to run a job using an Apptainer container; the job requires at least two processors or it will not go forward.

Nathan from your team kindly helped me set up a script, but alas (as far as I can tell) it refuses to accept more than one processor for the job, and consequently will not begin to run. I am fairly certain this is not a limitation of resources (I own my own node, and even if I make the problem very small, it still can't connect to that second processor), though I can buy more resources if necessary.

I am really not familiar with containers or HPC, so it is likely I am doing something wrong on my end. Any suggestions are most welcome! Thank you!
Hi Grace,

That initial error suggests a mismatch between the resources provided by Slurm and what mpirun expects. For some more information, could you please attach:

1. The job script that you are working with
2. Any other errors from your job submission that aren't included in the original screenshot

If you are using mpirun in your script, my first thought is that the underlying node may have hyper-threading enabled. If so, adding --use-hwthread-cpus to the mpirun line will allow MPI to treat each hardware thread, rather than each core, as a processing element.

Best,
Patrick
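For reference, a minimal batch script along these lines might look like the sketch below. This is an illustrative assumption, not the actual script from this ticket: the image name (my_image.sif) and executable path (/path/to/solver) are placeholders.

```shell
#!/bin/bash
#SBATCH --job-name=apptainer-mpi
#SBATCH --nodes=1
#SBATCH --ntasks=2          # request two MPI processes
#SBATCH --cpus-per-task=1

# Placeholder image and binary names -- substitute your own.
# --use-hwthread-cpus lets Open MPI count hardware threads (rather than
# physical cores) as slots when the node has hyper-threading enabled.
mpirun --use-hwthread-cpus -np 2 \
    apptainer exec my_image.sif /path/to/solver
```

The key point of the sketch is that the --ntasks value requested from Slurm should match the -np value handed to mpirun, so the two layers agree on how many processing elements exist.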
Created attachment 40420 [details] ErrorFromScript3.png

Hi Patrick! Thanks for the quick response. A couple of things are attached here; hopefully you can see the titles of the PNGs on your end.

The first, 'ScriptWrittenWithNate', is what your colleague helped me set up. The software executes and runs the very first few processes, but then fails as soon as it needs to connect to a worker processor, as none are available. I tried two ways to give it more resources; adding your --use-hwthread-cpus did not make a difference in the error message.

Script2-N1n1 produced the error I initially uploaded to your site (ErrorFromScript2). For Script3-N3n3, I tried increasing the number of resources in sbatch and got a message that srun was not available (ErrorFromScript3).

Thanks,
Grace McIlvain
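If the "srun was not available" error comes from invoking srun inside the container (where Slurm's client commands are usually not installed), one common pattern is to invert the nesting: let srun launch the tasks on the host, and run the container once per task. The sketch below is an assumption about that setup, with placeholder image and executable names, not the script from the attachments:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=2

# srun runs on the host, where the Slurm commands exist; only the
# application runs inside the container, so the container image does
# not need srun installed. Image and binary names are placeholders.
srun --ntasks=2 apptainer exec my_image.sif /path/to/solver
```

Whether this applies depends on what the failing script actually calls, which the attached screenshots would confirm.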
Created attachment 40421 [details] ScriptWittenWithNate.png
Created attachment 40422 [details] Script3-N3n3.png
Created attachment 40423 [details] ErrorFromScript2.png
Created attachment 40424 [details] Script2-N1n1.png
Grace,

We currently do not have you listed as a supported user for Slurm support for Columbia. Typically these questions need to go through your university help desk system. If the help desk or system admins are not able to provide answers, then they can forward questions to us for assistance.

Thank you,
Jacob

Jacob Jenson
COO
+1 925.695.7782
www.schedmd.com
Hi Jacob,

They were unable to fix my problem and referred me to you. Nate was helping me out yesterday, and I am just following up with him/you on this last issue! Thanks so much!

Thanks,
Grace McIlvain