Created attachment 5131 [details]
MPI hello world example

When testing heterogeneous MPI programs on Slurm 17.11.0-pre2, I noticed that they don't share the MPI_COMM_WORLD communicator. I'm not sure if this is by design or not, but it works differently on ALPS.

dgloe@opal-p1:~> aprun -n 1 ./hello : -n 2 ./hello
Hello world from processor nid00036, rank 0 out of 3 processors
Hello world from processor nid00037, rank 1 out of 3 processors
Hello world from processor nid00037, rank 2 out of 3 processors
Application 84989 resources: utime ~0s, stime ~0s, Rss ~9156, inblocks ~0, outblocks ~0

dgloe@opal-p2:~> srun -n 1 ./hello : -n 2 ./hello
srun: job 21240 queued and waiting for resources
srun: job 21240 has been allocated resources
Hello world from processor nid00032, rank 0 out of 1 processors
Hello world from processor nid00033, rank 0 out of 2 processors
Hello world from processor nid00033, rank 1 out of 2 processors

I'll try to figure out how ALPS sets this up.
That is still under development. About all that is available in Slurm version 17.11.0-pre2 is the job allocation and some of the infrastructure (account, squeue, scancel, etc.). I am working on the application launch side of things (slurm job steps) now. That might be ready for beta testing in mid-September.
(In reply to Moe Jette from comment #1)
> That is still under development. About all that is available in Slurm
> version 17.11.0-pre2 is the job allocation and some of the infrastructure
> (account, squeue, scancel, etc.). I am working on the application launch
> side of things (slurm job steps) now. That might be ready for beta testing
> in mid-September.

Ah, ok. I'll stay tuned then.
(In reply to David Gloe from comment #0)
> I'll try to figure out how ALPS sets this up.

That would be great if you could get that information for me. I'm concentrating on OpenMPI right now, but we'll definitely want this to work for Cray MPI too. Unfortunately, there are a multitude of MPI implementations that I'll need to deal with.
(In reply to Moe Jette from comment #3)
> That would be great if you could get that information for me. I'm
> concentrating on OpenMPI right now, but we'll definitely want this to work
> for Cray MPI too. Unfortunately, there are a multitude of MPI
> implementations that I'll need to deal with.

I'm guessing you'll have to set the cmdIndex and peCmdMapArray in the alpsc_write_placement_file call.

ALPS has a restriction where only one MPMD command can run on a node, so for example "aprun command1 : command2" must use 2 different nodes. Are you planning on having that same restriction?
(In reply to David Gloe from comment #4)
> I'm guessing you'll have to set the cmdIndex and peCmdMapArray in the
> alpsc_write_placement_file call.
>
> ALPS has a restriction where only one MPMD command can run on a node, so for
> example aprun command1 : command2 must use 2 different nodes. Are you
> planning on having that same restriction?

There is no such restriction in Slurm.
(In reply to David Gloe from comment #4)
> ALPS has a restriction where only one MPMD command can run on a node, so for
> example aprun command1 : command2 must use 2 different nodes. Are you
> planning on having that same restriction?

Based upon my recent work with MPI, it's more an issue of the MPI interface rather than ALPS. Slurm will have the same limitation as ALPS.

It also appears unlikely that Slurm version 17.11 will support an application spanning multiple components of a heterogeneous job on Cray systems, but that will come later. A separate srun command will be required for each component (at least unless a recent version of OpenMPI is used). For example:

salloc --mem=40g -n1 : --mem=10g -n16 bash
srun --pack-group=0 master &
srun --pack-group=1 slave
Changing to severity 5. Planned for Slurm version 18.08
(In reply to Moe Jette from comment #7)
> Changing to severity 5. Planned for Slurm version 18.08

I am currently not expecting this to land in Slurm version 18.08, although the logic in Slurm v18.08 may work on Cray's ethernet-based Shasta system.
Hi,

Are you still expecting to delay the unified MPI_COMM_WORLD past 18.08, or will it be integrated?

Regards,
Matthieu
(In reply to Matthieu Hautreux from comment #10)
> Are you still expecting to delay the unified MPI COMM WORLD after 18.08 or
> will it be integrated ?

It should work everywhere EXCEPT on a Cray network.
Great, thanks for the clarification.
*** Ticket 4322 has been marked as a duplicate of this ticket. ***
*** Ticket 6428 has been marked as a duplicate of this ticket. ***
David, how has the testing been going?
*** Ticket 6561 has been marked as a duplicate of this ticket. ***
There is renewed interest in this feature. What can Cray provide to help this move forward? My understanding is that David is no longer working on non-Shasta issues.
Certainly. I am hoping we only need a bit of time from someone familiar with the Cray switch, container, or cookie logic. It might involve PMI or MPI, but at this point we are not sure.

David was given the cray_pack branch (https://github.com/SchedMD/slurm/tree/cray_pack) that contains how far we have gotten. At the moment we can run non-MPI jobs (hostname) without issue, but MPI jobs just hang. My guess is that, since Slurm uses different job ids for each of the job components and the Cray infrastructure expects only one jobid, there is a place where we have missed setting the jobid to that of the parent job.

Let me know what you find out. Thanks!
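The hypothesis above (each pack component carrying its own job id while the Cray infrastructure expects a single one) can be sketched as follows. This is purely illustrative: `effective_cray_jobid` is a hypothetical helper, not Slurm's actual API, and the NO_VAL value here only mirrors Slurm's sentinel convention.

```c
#include <stdint.h>

/* Illustrative sentinel meaning "not part of a pack"; Slurm defines a
 * similar NO_VAL constant for unset 32-bit fields. */
#define NO_VAL 0xfffffffeu

/* Hypothetical helper: choose the job id to report to the Cray
 * infrastructure. Pack components report the leader's id so the whole
 * het job appears as a single apid; plain jobs keep their own id. */
static uint32_t effective_cray_jobid(uint32_t jobid, uint32_t pack_jobid)
{
	return (pack_jobid != NO_VAL) ? pack_jobid : jobid;
}
```

If every slurmstepd in the het job derived its Cray-facing id this way, all components would program the network under one id, which is what the ALPS-era infrastructure appears to expect.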
Has Cray R&D made contact on this issue yet?
They have, just this week.
FYI, the cray_pack branch has been simplified and rebased on master (19.05).
Doug, I think we might want to close bug 6825 as a duplicate of this bug. For some reason you weren't on this.
Brian G, is there any update from Cray on this? I know Blaine was getting someone at Cray to look at this, but haven't heard much from him on the matter.
Created attachment 9931 [details]
approaching functionality?

Testing Danny's patch from yesterday, I found that NICs were not getting programmed on non-offset-0 steps (non-pack-leader?). The attached patch does two things -- and neither very well; these are just attempts to build a proof of concept that may eventually work.

(1) on all non-pack-leader offset steps, the credential is copied; this is achieved by adding a switch_p_duplicate_jobinfo() call to the switch plugin API, and assuming that the pack-leader was processed first.

(2) the switch/cray pe_info.c is modified to at least see the pack steps and be aware of those ranks when calculating placements.

Note that the modifications for (2) do indicate there are some important components missing from the stepd_step_rec_t data structure to achieve this:
(a) no global_task_ids equivalent for hetjobs
(b) no per-pack cpus_per_task value (unsure of the value of this)
(c) mpmd seems to have special setup; it may not be possible (or make sense) to mix mpmd and hetjob
(d) the cmd index may or may not be correct

Can see we are now getting cookies set:

```
boot-gerty:~ # pcmd -s -n ALL_COMPUTE "cat /sys/class/gni/kgni0/resources | grep PTag"
Output from 205:
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
--- PTag: 4 PKey: 0x80 JobId: 0x0 RefCount: 2 Suspend: Idle ---
--- PTag: 183 PKey: 0x15f7 JobId: 0x7e3 RefCount: 1 Suspend: Idle ---
--- PTag: 184 PKey: 0x15f8 JobId: 0x7e3 RefCount: 1 Suspend: Idle ---
Output from 21-23,28-30,32-41,43-54,56-63,200-202,206-207,212-223,225-255,384-390,392-404,406-423,428-435,440-447:
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
--- PTag: 4 PKey: 0x80 JobId: 0x0 RefCount: 2 Suspend: Idle ---
Output from 204:
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
--- PTag: 4 PKey: 0x80 JobId: 0x0 RefCount: 2 Suspend: Idle ---
--- PTag: 35 PKey: 0x15f7 JobId: 0x19f8 RefCount: 1 Suspend: Idle ---
--- PTag: 36 PKey: 0x15f8 JobId: 0x19f8 RefCount: 1 Suspend: Idle ---
Node(s) 21-23,28-30,32-41,43-54,56-63,200-202,204-207,212-223,225-255,384-390,392-404,406-423,428-435,440-447 had exit code 0
boot-gerty:~ #
```

Ranks on the node are getting to pmi_init without any errors now:

```
Core was generated by `/global/homes/d/dmj/mpi/hello_gerty'.
#0  0x000000000064eef0 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x000000000064eef0 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x00000000005d2800 in inet_accept_with_address (ip_addr=<optimized out>, data=<optimized out>, sock=<optimized out>) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c:735
#2  _pmi_inet_setup (net_info=0x9bc760 <_pmi_base_net_info>, retry=300) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c:1026
#3  0x00000000005c5900 in _pmi_init (spawned=0x7fffffff7048) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_init.c:1262
#4  0x00000000005c4af2 in _pmi_constructor () at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_init.c:343
#5  0x00000000005f1e57 in __libc_csu_init (argc=argc@entry=1, argv=argv@entry=0x7fffffff7198, envp=0x7fffffff71a8) at elf-init.c:88
#6  0x00000000005f18cd in __libc_start_main (main=0x40a190 <main>, argc=1, argv=0x7fffffff7198, init=0x5f1de0 <__libc_csu_init>, fini=0x5f1e70 <__libc_csu_fini>, rtld_fini=0x0, stack_end=0x7fffffff7188) at libc-start.c:245
#7  0x000000000040a0a9 in _start () at ../sysdeps/x86_64/start.S:118
(gdb) quit
dmj@nid00204:~> logout
nid00204:~ #
```

It seems to be getting stuck. I don't have source to Cray PMI, so I really can't go much further.
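The credential copy described in (1) above can be sketched in miniature. Everything here is hypothetical: the real switch/cray jobinfo is an opaque structure carrying the cookies/PTags shown in the pcmd output, and `fake_switch_jobinfo_t` / `duplicate_jobinfo` are stand-in names, not the branch's actual code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the opaque switch jobinfo; the real one
 * holds Cray cookies/ptags like the 183/184 pair seen above. */
typedef struct {
	int num_cookies;
	unsigned int ptags[2];
} fake_switch_jobinfo_t;

/* Sketch of a switch_p_duplicate_jobinfo()-style call: deep-copy the
 * pack leader's credential so every het component programs the same
 * PTags, rather than each component getting its own. */
static fake_switch_jobinfo_t *
duplicate_jobinfo(const fake_switch_jobinfo_t *leader)
{
	fake_switch_jobinfo_t *copy = malloc(sizeof(*copy));

	if (copy)
		memcpy(copy, leader, sizeof(*copy));
	return copy;
}

/* Demo: duplicate a credential resembling the PTag 183/184 pair above
 * and verify the copy matches the leader's fields. */
static int demo_duplicate_ok(void)
{
	fake_switch_jobinfo_t leader = { 2, { 183u, 184u } };
	fake_switch_jobinfo_t *copy = duplicate_jobinfo(&leader);
	int ok = copy && copy->num_cookies == 2 &&
		 copy->ptags[0] == 183u && copy->ptags[1] == 184u;

	free(copy);
	return ok;
}
```

The assumption baked in here is the same one stated in (1): the pack leader's step is processed first, so its credential exists to be copied.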
I think the key takeaways here are:

(1) some protocol augmentations are likely needed to get the required data into the slurmstepd. I hope that we can at least review the results from this patch and the other related bug for possible protocol changes going into 19.05.

(2) access to those data is not sufficient yet; something still needs to be rearranged to make this work. I think Cray will need to get involved to move this further.
Comment on attachment 9931 [details]
approaching functionality?

A slight variant of this has been added to the cray_pack branch.
Danny, Kim McMahon volunteered to address any PMI questions that you have. kmcmahon@cray.com
Thanks Brian, Could you get Kim to add themselves to this bug?
I wasn't able to add Kim to the bug because she doesn't have access. Her e-mail needs to be added.
That is what I should have asked: can you please have her get an account here?
*** Ticket 6825 has been marked as a duplicate of this ticket. ***
FYI, a few minor modifications needed to get current cray_pack branch to build:

```
diff --git a/src/plugins/switch/cray/pe_info.c b/src/plugins/switch/cray/pe_info.c
index c38c60f095..7a1a933dc4 100644
--- a/src/plugins/switch/cray/pe_info.c
+++ b/src/plugins/switch/cray/pe_info.c
@@ -50,7 +50,7 @@ typedef struct {
 } local_step_rec_t;

 // Static functions
-static void _get_het_info(local_step_rec_t *step_rec, stepd_step_rec_t *job);
+static void _setup_local_step_rec(local_step_rec_t *step_rec, stepd_step_rec_t *job);
 static int _get_first_pe(stepd_step_rec_t *job);
 static int _get_cmd_index(stepd_step_rec_t *job);
 static int *_get_cmd_map(stepd_step_rec_t *job);
@@ -133,7 +133,7 @@
 	return SLURM_SUCCESS;
 }

-static void _get_het_info(local_step_rec_t *step_rec, stepd_step_rec_t *job)
+static void _setup_local_step_rec(local_step_rec_t *step_rec, stepd_step_rec_t *job)
 {
 	xassert(step_rec);
 	xassert(job);
@@ -142,13 +142,13 @@ static void _get_het_info(local_step_rec_t *step_rec, stepd_step_rec_t *job)
 	if (job->pack_jobid != NO_VAL) {
 		step_rec->nnodes = job->pack_nnodes;
 		step_rec->ntasks = job->pack_ntasks;
-		step_rec->complete_nodelist = job->pack_node_list;
+		step_rec->nodelist = job->pack_node_list;
 		step_rec->tasks_to_launch = job->pack_task_cnts;
 	} else {
 		step_rec->nnodes = job->nnodes;
 		step_rec->ntasks = job->ntasks;
-		step_rec->complete_nodelist = job->complete_nodelist;
-		step_rec->tasks_to_launch = job->tasks_to_launch;
+		step_rec->nodelist = job->msg->complete_nodelist;
+		step_rec->tasks_to_launch = job->msg->tasks_to_launch;
 	}
 }
```
after making and deploying the modifications in comment #37, I'm looking at:

dmj@gerty:~> /global/gscratch1/sd/dmj/cray_pack/bin/srun ./mpi/hello_gerty : ./mpi/hello_gerty
....

Both jobs start. Job 44+0 runs on nid00206, 44+1 on nid00207:

boot-gerty:~ # squeue
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
 44+0 debug     hello_ge nobody R  1:44 1     nid00206
 44+1 debug     hello_ge nobody R  1:44 1     nid00207
boot-gerty:~ #

The hello_gerty processes hang, but in different spots.

The process on nid00206 (i.e., the pack leader) hangs during the initial _pmi_constructor call before main() is reached:

```
Core was generated by `/global/u2/d/dmj/./mpi/hello_gerty'.
#0  0x000000000064eef0 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x000000000064eef0 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x00000000005d2800 in inet_accept_with_address (ip_addr=<optimized out>, data=<optimized out>, sock=<optimized out>) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c:735
#2  _pmi_inet_setup (net_info=0x9bc760 <_pmi_base_net_info>, retry=300) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c:1026
#3  0x00000000005c5900 in _pmi_init (spawned=0x7fffffff70e8) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_init.c:1262
#4  0x00000000005c4af2 in _pmi_constructor () at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_init.c:343
#5  0x00000000005f1e57 in __libc_csu_init (argc=argc@entry=1, argv=argv@entry=0x7fffffff7238, envp=0x7fffffff7248) at elf-init.c:88
#6  0x00000000005f18cd in __libc_start_main (main=0x40a190 <main>, argc=1, argv=0x7fffffff7238, init=0x5f1de0 <__libc_csu_init>, fini=0x5f1e70 <__libc_csu_fini>, rtld_fini=0x0, stack_end=0x7fffffff7228) at libc-start.c:245
#7  0x000000000040a0a9 in _start () at ../sysdeps/x86_64/start.S:118
(gdb) up
#1  0x00000000005d2800 in inet_accept_with_address (ip_addr=<optimized out>, data=<optimized out>, sock=<optimized out>) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c:735
735     /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c: No such file or directory.
(gdb) up
#2  _pmi_inet_setup (net_info=0x9bc760 <_pmi_base_net_info>, retry=300) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c:1026
1026    in /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c
(gdb) print *net_info
$1 = {have_controller = 0, num_targets = 1, my_inet_id = 206, portnum = 63005, listen_sock = 4, last_downed_node_count = 0, controller_nid = -1, control_net_id = 0, target_nids = 0x9befc0, controller_hostname = '\000' <repeats 63 times>}
(gdb) print *net_info->target_nids
$2 = 207
(gdb)
```

It seems to have the idea that it would like to talk with nid00207 though, so that is nice.

There are some interesting looking symbols in frame #4 (_pmi_constructor), but I can't tell if they have been initialized yet, so the fact that they are populated with apparent garbage may be meaningless:

```
(gdb) frame 4
#4  0x00000000005c4af2 in _pmi_constructor () at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_init.c:343
343     in /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_init.c
(gdb) print my_rank
$6 = 31
(gdb) print pg_info
$7 = {pg_id = 8241994560065789796, size = 1601398121, my_rank = 942747698, napps = 825110830, appnum = 1815032630, pes_this_node = 2020961897, pg_pes_per_smp = 1818979631, base_pe_on_node = 1651076143, my_lrank = 1953392943, base_pe_in_app = 0x5f82b2 <getenv+194>, pes_in_app = 0x37ffffa00, pes_in_app_this_smp = 7549634, pes_in_app_this_smp_list = 0x400000003, my_app_lrank = 5, apps_share_node = 6}
(gdb)
```

Job 44+1 (node nid00207), however, hangs in a different place, much further along (I think), past the start of main() and into PMI_Init() somewhere:

```
(gdb) bt
#0  __strcmp_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:41
#1  0x00000000005c5d9c in _pmi2_info_getjobattr (name=<optimized out>, value=0x7fffffff6700 "\377", valuelen=1024, found=0x7fffffff66ec) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_jobattr.c:524
#2  0x00000000005c0a79 in PMI2_Info_GetJobAttr (name=0x2aaaaaab41c0 "", value=0x6d39a5 "PMI_process_mapping", valuelen=0, found=0xf) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/api/kvs/jobattr.c:51
#3  0x0000000000443b69 in MPIDI_Populate_vc_node_ids ()
#4  0x000000000043fa66 in MPID_Init ()
#5  0x0000000000411ed9 in MPIR_Init_thread ()
#6  0x000000000041193e in PMPI_Init ()
#7  0x000000000040a1e3 in main ()
(gdb)
```
This looks to me like the Cray PMI ranks might be getting conflicting information at startup, between the two executables. I'll provide some background as to what info CrayPMI needs, and how it accesses that data from slurm.

PMI calls only a few 'alps-plugin' functions to get the info it needs for startup. As far as I know, these alps-plugin functions should all be contained within the slurm source tree:

1) alps_get_placement_info()
2) alps_app_lli_* functions (lli_get_response, lli_set_response, lli_get_response_bytes)

Each rank calls these functions.

To get the APID, PMI calls the alps_app_lli function with option ALPS_APP_LLI_ALPS_REQ_APID. For an MPMD job, there can be only one APID.

To get the rank value, PMI reads an env variable. There are a couple of options, but I believe slurm should set "ALPS_APP_PE".

To get the network credentials, PMI calls the alps_app_lli function with option ALPS_APP_LLI_ALPS_REQ_GNI. There should be one set of credentials for the entire MPMD job.

To get the full job placement data, PMI calls alps_get_placement_info(). I think this is where the current problem is. There are several parameters to this function. CrayPMI ignores a few, but the others are important.

```
alps_get_placement_info(
    uint64_t apid,
    alpsAppLayout_t *appLayout,
    int **placementList,
    int **targetNids,           // CrayPMI ignores
    int **targetPes,            // CrayPMI ignores
    int **targetLen,            // CrayPMI ignores
    struct in_addr **targetIps, // CrayPMI ignores
    int **startPe,
    int **totalPes,
    int **nodePes,
    int **peCpus);              // CrayPMI ignores
```

Some of these parameters (placementList for example) should be the same for every rank in the MPMD job.

Here's one suggestion on how to debug this. Since we know the --multi-prog format for MPMD works with slurm and CrayPMI, perhaps you could run an MPMD test using the multi-prog format and look at the placement_info that slurm provides to PMI via alps_get_placement_info(). Running that same example with the new ":" syntax should provide clues as to what data is not getting set correctly.
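The comparison suggested above amounts to diffing the placement data between the two launch modes. A trivial helper for that check might look like the following sketch; the demo arrays are made up (a two-node layout loosely resembling the 44+0/44+1 run), and in practice the data would come from alps_get_placement_info(), not literals.

```c
/* Sketch of the debugging comparison: given placementList arrays from a
 * --multi-prog run and a ":"-syntax run of the same MPMD layout, report
 * the first index where they diverge, or -1 if they are identical. */
static int first_placement_mismatch(const int *a, const int *b, int n)
{
	for (int i = 0; i < n; i++)
		if (a[i] != b[i])
			return i;
	return -1;
}

/* Hypothetical demo data: four ranks over nids 206/207, with the
 * ":"-syntax run reporting a wrong nid for the last rank. */
static const int demo_multiprog_place[4] = { 206, 206, 207, 207 };
static const int demo_colon_place[4]    = { 206, 206, 207, 255 };
```

The index of the first mismatch then points directly at which rank's placement data Slurm is populating incorrectly for the ":" syntax.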
Awesome! Based on that I suspect a couple of changes to the task/cray plugin and we can get the test code to function.
@kmcmahon that was precisely the advice needed:

dmj@gerty:~> /global/gscratch1/sd/dmj/cray_pack/bin/srun ~/mpi/hello_gerty : ~/mpi/hello_gerty
srun: job 55 queued and waiting for resources
srun: job 55 has been allocated resources
hello from 0 of 2 on nid00206
hello from 1 of 2 on nid00207
dmj@gerty:~>

I've hard-coded specific rank ids to get exactly that test working; however, now it can be extended to be fully generic. Thank you!
Excellent news! I'm very happy to hear that. Thank you Doug for working the slurm side of this.
Created attachment 9975 [details]
cray/mpich/pmi working with hetjob

Hello,

The attached patch adds all the taskids for the entire set of job packs to the protocol transmitting data from srun to slurmstepd. This is needed because Cray's PMI requires each node to know where every other task is. Other related adjustments are made. Node identification, in particular, is challenging, because the nodes are ordered in an arbitrary order (really alphabetical, but since this isn't by pack, it can be surprising in the implementation).

I still have not addressed either the command index for Cray PMI, or the cpus_per_task varying by pack. Both of these would require additional changes to the protocol to support; however, it is not clear that work is required since (1) it's functional without it, and (2) it would seem Cray PMI may not need it based on Kim's statements earlier. To support the command index, I think we would need a way to map any given task back to its pack (and then simply use the pack_offset as the command index). To support the cpus_per_task, I suppose we would need to simply send all the cpus_per_task values for all the packs in the protocol; however, it's unclear if srun always knows the correct values for this - and - it may not be used anyway.

I have not addressed allowing MPMD (--multi-prog) to work within a hetjob, but I think it's probably reasonable to defer that since it seems like an unlikely case (why use MPMD when you have the flexibility of hetjob at the same time?). It would probably be good to have the srun CLI generate an error in that case.

Kim, can you please comment on whether or not the command index or cpus_per_task need to be addressed?

Danny, is there a chance this will be possible to merge for 19.05?
Oh, I forgot to post the evidence. The first test is just a basic non-het-job srun to be sure I didn't break it; the second varies between haswell and knl. The third does the same but adds a third component which selects haswell based on memory requirements (not a tagged constraint):

dmj@gerty:/global/gscratch1/sd/dmj> ./cray_pack/bin/srun -n 32 --ntasks-per-node=8 ~/mpi/hello_gerty
hello from 16 of 32 on nid00214
hello from 17 of 32 on nid00214
hello from 18 of 32 on nid00214
hello from 19 of 32 on nid00214
hello from 20 of 32 on nid00214
hello from 21 of 32 on nid00214
hello from 22 of 32 on nid00214
hello from 23 of 32 on nid00214
hello from 25 of 32 on nid00215
hello from 26 of 32 on nid00215
hello from 27 of 32 on nid00215
hello from 28 of 32 on nid00215
hello from 29 of 32 on nid00215
hello from 30 of 32 on nid00215
hello from 31 of 32 on nid00215
hello from 24 of 32 on nid00215
hello from 8 of 32 on nid00213
hello from 9 of 32 on nid00213
hello from 10 of 32 on nid00213
hello from 11 of 32 on nid00213
hello from 12 of 32 on nid00213
hello from 13 of 32 on nid00213
hello from 14 of 32 on nid00213
hello from 15 of 32 on nid00213
hello from 0 of 32 on nid00212
hello from 1 of 32 on nid00212
hello from 2 of 32 on nid00212
hello from 3 of 32 on nid00212
hello from 4 of 32 on nid00212
hello from 5 of 32 on nid00212
hello from 6 of 32 on nid00212
hello from 7 of 32 on nid00212

dmj@gerty:/global/gscratch1/sd/dmj> ./cray_pack/bin/srun -C haswell -n 32 --ntasks-per-node=8 ~/mpi/hello_gerty : -C knl -n 68 ~/mpi/hello_gerty
srun: job 619 queued and waiting for resources
srun: job 619 has been allocated resources
hello from 0 of 100 on nid00021
hello from 1 of 100 on nid00021
hello from 2 of 100 on nid00021
hello from 3 of 100 on nid00021
hello from 4 of 100 on nid00021
hello from 5 of 100 on nid00021
hello from 24 of 100 on nid00028
hello from 6 of 100 on nid00021
hello from 25 of 100 on nid00028
hello from 7 of 100 on nid00021
hello from 26 of 100 on nid00028
hello from 28 of 100 on nid00028
hello from 29 of 100 on nid00028
hello from 30 of 100 on nid00028
hello from 31 of 100 on nid00028
hello from 27 of 100 on nid00028
hello from 8 of 100 on nid00022
hello from 9 of 100 on nid00022
hello from 10 of 100 on nid00022
hello from 11 of 100 on nid00022
hello from 12 of 100 on nid00022
hello from 13 of 100 on nid00022
hello from 14 of 100 on nid00022
hello from 15 of 100 on nid00022
hello from 17 of 100 on nid00023
hello from 19 of 100 on nid00023
hello from 20 of 100 on nid00023
hello from 21 of 100 on nid00023
hello from 22 of 100 on nid00023
hello from 23 of 100 on nid00023
hello from 16 of 100 on nid00023
hello from 18 of 100 on nid00023
hello from 36 of 100 on nid00200
hello from 37 of 100 on nid00200
hello from 50 of 100 on nid00200
hello from 64 of 100 on nid00200
hello from 75 of 100 on nid00200
hello from 77 of 100 on nid00200
hello from 80 of 100 on nid00200
hello from 32 of 100 on nid00200
hello from 33 of 100 on nid00200
hello from 34 of 100 on nid00200
hello from 35 of 100 on nid00200
hello from 38 of 100 on nid00200
hello from 39 of 100 on nid00200
hello from 40 of 100 on nid00200
hello from 41 of 100 on nid00200
hello from 42 of 100 on nid00200
hello from 43 of 100 on nid00200
hello from 44 of 100 on nid00200
hello from 45 of 100 on nid00200
hello from 46 of 100 on nid00200
hello from 47 of 100 on nid00200
hello from 48 of 100 on nid00200
hello from 49 of 100 on nid00200
hello from 51 of 100 on nid00200
hello from 52 of 100 on nid00200
hello from 53 of 100 on nid00200
hello from 54 of 100 on nid00200
hello from 55 of 100 on nid00200
hello from 56 of 100 on nid00200
hello from 57 of 100 on nid00200
hello from 58 of 100 on nid00200
hello from 59 of 100 on nid00200
hello from 60 of 100 on nid00200
hello from 61 of 100 on nid00200
hello from 62 of 100 on nid00200
hello from 63 of 100 on nid00200
hello from 65 of 100 on nid00200
hello from 66 of 100 on nid00200
hello from 67 of 100 on nid00200
hello from 68 of 100 on nid00200
hello from 69 of 100 on nid00200
hello from 70 of 100 on nid00200
hello from 71 of 100 on nid00200
hello from 72 of 100 on nid00200
hello from 73 of 100 on nid00200
hello from 74 of 100 on nid00200
hello from 76 of 100 on nid00200
hello from 78 of 100 on nid00200
hello from 79 of 100 on nid00200
hello from 81 of 100 on nid00200
hello from 82 of 100 on nid00200
hello from 83 of 100 on nid00200
hello from 84 of 100 on nid00200
hello from 85 of 100 on nid00200
hello from 86 of 100 on nid00200
hello from 87 of 100 on nid00200
hello from 88 of 100 on nid00200
hello from 89 of 100 on nid00200
hello from 90 of 100 on nid00200
hello from 91 of 100 on nid00200
hello from 92 of 100 on nid00200
hello from 93 of 100 on nid00200
hello from 94 of 100 on nid00200
hello from 95 of 100 on nid00200
hello from 96 of 100 on nid00200
hello from 97 of 100 on nid00200
hello from 98 of 100 on nid00200
hello from 99 of 100 on nid00200

dmj@gerty:/global/gscratch1/sd/dmj> ./cray_pack/bin/srun -C haswell -n 8 --ntasks-per-node=2 ~/mpi/hello_gerty : -C knl -n 2 ~/mpi/hello_gerty : --mem=100G -n 4 ~/mpi/hello_gerty
srun: job 656 queued and waiting for resources
srun: job 656 has been allocated resources
hello from 5 of 14 on nid00023
hello from 4 of 14 on nid00023
hello from 6 of 14 on nid00028
hello from 7 of 14 on nid00028
hello from 2 of 14 on nid00022
hello from 3 of 14 on nid00022
hello from 11 of 14 on nid00206
hello from 12 of 14 on nid00206
hello from 13 of 14 on nid00206
hello from 10 of 14 on nid00206
hello from 0 of 14 on nid00021
hello from 1 of 14 on nid00021
hello from 8 of 14 on nid00200
hello from 9 of 14 on nid00200
dmj@gerty:/global/gscratch1/sd/dmj>

The regression testsuite is running at present.
Hi Doug,

CrayPMI does not care about cpus_per_task, so I think we are fine there.

If you truly are running an MPMD job (multiple different executables), then CrayPMI needs the alps_appLayout.numCmds to be accurate. This is the total number of executables launched together. Because slurm supports multiple executables on the *same* node (something alps does not support), we've previously made some adjustments in CrayPMI to handle that case. The alps-plugin interface via cmdNumber did not conveniently support this case. Because of this, CrayPMI determines the correct cmdNumber (index) based on the start_pes[i] and total_pes[i] provided in the call to alps_get_placement_info().

However, if you are using the hetjob feature simply to split one executable into multiple 'pieces' for placement reasons only, then I believe you can get by with telling PMI you have:

alps_appLayout.numCmds = 1
alps_appLayout.cmdNumber = 0
alps_appLayout.total_pes[0] = total-number-of-processes-launched

since that would accurately reflect what is being launched.

-Kim
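The cmdNumber derivation described above — mapping a rank to its command via the per-command start/total arrays — might look roughly like this. This is a sketch of the idea only, not Cray PMI's actual code; the demo arrays mirror the 32-haswell-plus-68-knl hetjob example from this ticket.

```c
/* Sketch: find which MPMD command owns a given global rank by scanning
 * the per-command startPe/totalPes arrays that
 * alps_get_placement_info() returns. */
static int cmd_index_for_rank(int rank, int num_cmds,
			      const int *start_pe, const int *total_pes)
{
	for (int i = 0; i < num_cmds; i++) {
		if (rank >= start_pe[i] &&
		    rank < start_pe[i] + total_pes[i])
			return i;
	}
	return -1;	/* rank not covered by any command */
}

/* Demo layout: two components, ranks 0-31 (haswell) then 32-99 (knl). */
static const int demo_start_pe[2]  = { 0, 32 };
static const int demo_total_pes[2] = { 32, 68 };
```

This also illustrates why consistent startPe/totalPes values across all ranks matter: every rank runs the same lookup, so divergent placement data would yield divergent command indices.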
Comment on attachment 9975 [details]
cray/mpich/pmi working with hetjob

Thanks Doug, variations of this have been pushed to the cray_pack branch.
Created attachment 9993 [details]
set the command map based on job pack

Hello,

I'm making an assumption that each hetjob component is running a separate executable (whether or not that is actually true), in order to avoid further complicating this -- however, if we did want to identify the executable by hetjob component, we would _still_ need the change in this patch.

This patch adds a mapping of taskids to originating pack_offset numbers. This allows the slurmstepd and associated plugins to identify which hetjob component any given task belongs to. From that, I'm configuring switch/cray to set up the command map to take the pack_offset as the command index. This effectively tells ALPS/PMI that each hetjob component is running a separate executable.

For trivial cases (mpi helloworld) it doesn't seem to matter if I use different or the same executables in these cases (whether or not the command map is populated). I have not tried a large variety of MPI capabilities though, so it may matter.

Kim, if we make this assumption that each job component is a separate executable, will that be sufficient? If it is sufficient, I think we've got basic functionality.

Danny, what needs to be done to move this forward from here?
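The taskid-to-pack_offset mapping described above can be sketched as follows. `build_pack_cmd_map` is a hypothetical helper (the branch's actual code differs), and it assumes global task ids are numbered contiguously per component, as in the hetjob runs shown earlier in this ticket.

```c
/* Sketch: fill cmd_map so cmd_map[gtid] is the het component (used as
 * the ALPS command index) owning global task id gtid, given the
 * per-component task counts. Assumes contiguous numbering per
 * component. */
static void build_pack_cmd_map(int ncomponents, const int *pack_ntasks,
			       int *cmd_map)
{
	int gtid = 0;

	for (int p = 0; p < ncomponents; p++)
		for (int t = 0; t < pack_ntasks[p]; t++)
			cmd_map[gtid++] = p;
}

/* Demo mirroring the 8 + 2 + 4 task example from this ticket:
 * ranks 0-7 -> component 0, 8-9 -> component 1, 10-13 -> component 2. */
static int demo_cmd_map_ok(void)
{
	int counts[3] = { 8, 2, 4 };
	int map[14];

	build_pack_cmd_map(3, counts, map);
	return map[0] == 0 && map[7] == 0 && map[8] == 1 &&
	       map[9] == 1 && map[10] == 2 && map[13] == 2;
}
```

With a map like this in hand, each component's pack_offset drops straight into the peCmdMapArray that alpsc_write_placement_file expects.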
(In reply to Danny Auble from comment #46)
> Comment on attachment 9975 [details]
> cray/mpich/pmi working with hetjob
>
> Thanks Doug, variations of this has been pushed to the cray_pack branch.

Danny asked me to review the work to this point. The only problem I discovered is that if an old version of srun (say Slurm version 18.08) tries to launch a heterogeneous job, then the pack_tids array will be NULL and cause aborts in the slurmd and/or slurmstepd daemons. I added some checks for NULL pack_tids. An old srun will fail for switch/cray, and it should all work fine for non-Cray systems. The commit is here:

https://github.com/SchedMD/slurm/commit/219272ddd696e3fe1a5be72d1bc1c91ed63507c3

I'm wondering if we even want to build the pack_tids array for non-Cray systems, or perhaps leave it there for possible future use.
> Kim, if we make this assumption that each job component is a separate
> executable will that be sufficient?

Yes, I think that should work. Assuming N job components, tell PMI it has N commands, and have N values for start_pe and total_pes that each reflect their individual component.
Comment on attachment 9993 [details]
set the command map based on job pack

Doug, a slight variation of this is now in the cray_pack branch.
OK, I am calling this closed. The cray_pack branch has been merged into master.

Thanks for everyone's help, especially Kim and Doug. Without your help and tips this would have been nigh impossible. Thanks!