Ticket 16849 - unable to run MPICH 3.4.1 related jobs on Azure CycleCloud
Summary: unable to run MPICH 3.4.1 related jobs on Azure CycleCloud
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Cloud
Version: 22.05.3
Hardware: Linux Linux
: 6 - No support contract
Assignee: Jess
Reported: 2023-05-29 04:14 MDT by Shraddha Kiran
Modified: 2023-11-28 17:00 MST
Site: AMAT


Description Shraddha Kiran 2023-05-29 04:14:24 MDT
Hello,

We are trying to run an application built with MPICH 3.4.1 and are encountering the issues below when running it on Azure CycleCloud.

CycleCloud version: 8.2.2-1902


Loading mpi version 2021.2.0
****************************************************************************
* hwloc 2.0.3rc2-git received invalid information from the operating system.
*
* Group0 (cpuset 0x00ffffff,0xfc000000,,0x0) intersects with Package (P#1 cpuset 0x0fffffff,0xf0000000,0x0) without inclusion!
* Error occurred in topology.c line 1386
*
* The following FAQ entry in the hwloc documentation may help:
*   What should I do when hwloc reports "operating system" warnings?
* Otherwise please report this error message to the hwloc user's mailing list,
* along with the files generated by the hwloc-gather-topology script.
*
* hwloc will now ignore this invalid topology information and continue.



thread_monitor Resource temporarily unavailable in pthread_create
thread_monitor Resource temporarily unavailable in pthread_create
thread_monitor Resource temporarily unavailable in pthread_create
thread_monitor Resource temporarily unavailable in pthread_create
thread_monitor Resource temporarily unavailable in pthread_create
thread_monitor Resource temporarily unavailable in pthread_create
*** Error in `./ginestra-core-sim.run': double free or corruption (!prev): 0x000055ff3e109e00 ***
*** Error in `./ginestra-core-sim.run': double free or corruption (!prev): 0x000055ff3dfe3cd0 ***
thread_monitor Resource temporarily unavailable in pthread_create
thread_monitor Resource temporarily unavailable in pthread_create


Please guide.

Thank You
Shraddha
Comment 2 Shraddha Kiran 2023-05-30 10:15:37 MDT
Hello,

Please let me know if any more logs or data need to be shared to investigate this further.

Thank You
Shraddha
Comment 3 Jess 2023-11-28 17:00:23 MST
Hi Shraddha,

SchedMD needs the cloud nodes/GPUs to be covered under AMAT's support agreement before these support tickets can be assigned to the Slurm engineers.

May I get the annual average total of instances + GPUs being consumed in AWS?


-Jess Arrington