Ticket 4105

Summary: Heterogeneous jobs don't share MPI_COMM_WORLD on a Cray
Product: Slurm Reporter: David Gloe <david.gloe>
Component: slurmd    Assignee: Danny Auble <da>
Status: RESOLVED FIXED QA Contact:
Severity: 5 - Enhancement    
Priority: --- CC: brian.gilmer, bsantos, csamuel, dmjacobsen, ezellma, fullop, iryan, kmcmahon, lena, matthieu.hautreux, sts, tim
Version: 19.05.x   
Hardware: Cray XC   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=4322
https://bugs.schedmd.com/show_bug.cgi?id=7039
https://bugs.schedmd.com/show_bug.cgi?id=8329
Site: CRAY
Cray Sites: Cray Internal
Version Fixed: 19.05.0pre4
Target Release: future    DevPrio: 3 - High
Attachments: MPI hello world example
approaching functionality?
cray/mpich/pmi working with hetjob
set the command map based on job pack

Description David Gloe 2017-08-23 09:26:07 MDT
Created attachment 5131 [details]
MPI hello world example

When testing heterogeneous MPI programs on Slurm 17.11.0-pre2, I noticed that they don't share the MPI_COMM_WORLD communicator. I'm not sure if this is by design or not, but it works differently on ALPS.

dgloe@opal-p1:~> aprun -n 1 ./hello : -n 2 ./hello
Hello world from processor nid00036, rank 0 out of 3 processors
Hello world from processor nid00037, rank 1 out of 3 processors
Hello world from processor nid00037, rank 2 out of 3 processors
Application 84989 resources: utime ~0s, stime ~0s, Rss ~9156, inblocks ~0, outblocks ~0

dgloe@opal-p2:~> srun -n 1 ./hello : -n 2 ./hello
srun: job 21240 queued and waiting for resources
srun: job 21240 has been allocated resources
Hello world from processor nid00032, rank 0 out of 1 processors
Hello world from processor nid00033, rank 0 out of 2 processors
Hello world from processor nid00033, rank 1 out of 2 processors

I'll try to figure out how ALPS sets this up.
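For reference, the desired behavior is the ALPS one shown above: the components' ranks are concatenated into a single MPI_COMM_WORLD (ranks 0..2 of 3), not numbered independently per component. That numbering can be sketched as follows (plain C, illustrative only, not Slurm or MPI code):

```c
/* Sketch: with a shared MPI_COMM_WORLD, a task's global rank is its
 * component's starting offset plus its component-local rank.
 * comp_sizes[] holds the task count of each hetjob component,
 * e.g. {1, 2} for "srun -n 1 ./hello : -n 2 ./hello". */
int global_rank(const int *comp_sizes, int component, int local_rank)
{
    int offset = 0;
    for (int i = 0; i < component; i++)
        offset += comp_sizes[i];    /* tasks in all earlier components */
    return offset + local_rank;
}
```

With sizes {1, 2}, the second component's two tasks become global ranks 1 and 2, matching the aprun output above.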
Comment 1 Moe Jette 2017-08-23 09:30:33 MDT
That is still under development. About all that is available in Slurm version 17.11.0-pre2 is the job allocation and some of the infrastructure (account, squeue, scancel, etc.). I am working on the application launch side of things (slurm job steps) now. That might be ready for beta testing in mid-September.
Comment 2 David Gloe 2017-08-23 09:32:34 MDT
(In reply to Moe Jette from comment #1)
> That is still under development. About all that is available in Slurm
> version 17.11.0-pre2 is the job allocation and some of the infrastructure
> (account, squeue, scancel, etc.). I am working on the application launch
> side of things (slurm job steps) now. That might be ready for beta testing
> in mid-September.

Ah, ok. I'll stay tuned then.
Comment 3 Moe Jette 2017-08-23 09:42:45 MDT
(In reply to David Gloe from comment #0)
> I'll try to figure out how ALPS sets this up.

That would be great if you could get that information for me. I'm concentrating on OpenMPI right now, but we'll definitely want this to work for Cray MPI too. Unfortunately, there are a multitude of MPI implementations that I'll need to deal with.
Comment 4 David Gloe 2017-08-23 09:56:40 MDT
(In reply to Moe Jette from comment #3)
> (In reply to David Gloe from comment #0)
> > I'll try to figure out how ALPS sets this up.
> 
> That would be great if you could get that information for me. I'm
> concentrating on OpenMPI right now, but we'll definitely want this to work
> for Cray MPI too. Unfortunately, there are a multitude of MPI
> implementations that I'll need to deal with.

I'm guessing you'll have to set the cmdIndex and peCmdMapArray in the alpsc_write_placement_file call.

ALPS has a restriction where only one MPMD command can run on a node, so for example aprun command1 : command2 must use 2 different nodes. Are you planning on having that same restriction?
Comment 5 Moe Jette 2017-08-23 09:58:31 MDT
(In reply to David Gloe from comment #4)
> (In reply to Moe Jette from comment #3)
> > (In reply to David Gloe from comment #0)
> > > I'll try to figure out how ALPS sets this up.
> > 
> > That would be great if you could get that information for me. I'm
> > concentrating on OpenMPI right now, but we'll definitely want this to work
> > for Cray MPI too. Unfortunately, there are a multitude of MPI
> > implementations that I'll need to deal with.
> 
> I'm guessing you'll have to set the cmdIndex and peCmdMapArray in the
> alpsc_write_placement_file call.
> 
> ALPS has a restriction where only one MPMD command can run on a node, so for
> example aprun command1 : command2 must use 2 different nodes. Are you
> planning on having that same restriction?

There is no such restriction in Slurm.
Comment 6 Moe Jette 2017-09-07 15:37:12 MDT
(In reply to David Gloe from comment #4)
> ALPS has a restriction where only one MPMD command can run on a node, so for
> example aprun command1 : command2 must use 2 different nodes. Are you
> planning on having that same restriction?

Based upon my recent work with MPI, it's more an issue of the MPI interface than of ALPS. Slurm will have the same limitation as ALPS.

It also appears unlikely that Slurm version 17.11 will support an application spanning multiple components of a heterogeneous job on Cray systems, but that will come later. A separate srun command will be required for each component (unless a recent version of OpenMPI is used). For example:

salloc --mem=40g -n1 : --mem=10g -n16  bash
srun --pack-group=0 master &
srun --pack-group=1 slave
Comment 7 Moe Jette 2017-09-11 17:03:01 MDT
Changing to severity 5. Planned for Slurm version 18.08
Comment 9 Moe Jette 2018-04-12 07:53:03 MDT
(In reply to Moe Jette from comment #7)
> Changing to severity 5. Planned for Slurm version 18.08

I am currently not expecting this to land in Slurm version 18.08, although the logic in Slurm v18.08 may work on Cray's ethernet-based Shasta system.
Comment 10 Matthieu Hautreux 2018-05-23 02:12:37 MDT
Hi,

Are you still expecting to delay the unified MPI_COMM_WORLD until after 18.08, or will it be integrated?

Regards,
Matthieu
Comment 11 Moe Jette 2018-05-23 07:21:05 MDT
(In reply to Matthieu Hautreux from comment #10)
> Hi,
> 
> Are you still expecting to delay the unified MPI COMM WORLD after 18.08 or
> will it be integrated ?

It should work everywhere EXCEPT on a Cray network.
Comment 12 Matthieu Hautreux 2018-05-25 11:50:47 MDT
Great, thanks for the clarification

Comment 18 Danny Auble 2018-11-06 13:57:32 MST
*** Ticket 4322 has been marked as a duplicate of this ticket. ***
Comment 19 Tim Wickberg 2019-01-30 11:38:41 MST
*** Ticket 6428 has been marked as a duplicate of this ticket. ***
Comment 20 Danny Auble 2019-02-15 16:23:17 MST
David, how has the testing been going?
Comment 21 Tim Wickberg 2019-02-22 19:51:43 MST
*** Ticket 6561 has been marked as a duplicate of this ticket. ***
Comment 22 Brian F Gilmer 2019-03-13 09:50:05 MDT
There is renewed interest in this feature. What can Cray provide to help this move forward? My understanding is that David is no longer working on non-Shasta issues.
Comment 23 Danny Auble 2019-03-13 10:01:47 MDT
Certainly. I am hoping we only need a bit of time from someone familiar with the Cray switch, container, or cookie logic. It might involve PMI or MPI, but at this point we are not sure.

David was given the cray_pack branch (https://github.com/SchedMD/slurm/tree/cray_pack), which contains our progress so far.

At the moment we can run non-mpi jobs (hostname) without issue, but mpi jobs just hang.

My guess is that, since Slurm uses a different job id for each of the job components while the Cray infrastructure expects only one, there is a place where we have missed setting the jobid to that of the parent job.

Let me know what you find out.

Thanks!
Comment 24 Brian F Gilmer 2019-04-05 13:35:35 MDT
Has Cray R&D made contact on this issue yet?
Comment 25 Danny Auble 2019-04-05 14:50:04 MDT
They have, just this week.
Comment 26 Danny Auble 2019-04-13 12:41:44 MDT
FYI, the cray_pack branch has been simplified and rebased on master (19.05).
Comment 27 Danny Auble 2019-04-13 12:43:03 MDT
Doug, I think we might want to close bug 6825 as a duplicate of this bug. For some reason you weren't CC'd on this one.
Comment 28 Danny Auble 2019-04-13 12:53:25 MDT
Brian G, is there any update from Cray on this?  I know Blaine was getting someone at Cray to look at this, but haven't heard much from him on the matter.
Comment 29 Doug Jacobsen 2019-04-17 09:23:00 MDT
Created attachment 9931 [details]
approaching functionality?

Testing Danny's patch from yesterday, I found that NICs were not getting programmed on non-offset-0 steps (non-pack-leader?). The attached patch does two things, neither very well; these are just attempts to build a proof of concept that may eventually work.


(1) on all non-pack-leader offset steps, the credential is copied;  this is achieved by adding a switch_p_duplicate_jobinfo() call to the switch plugin API, and assuming that the pack-leader was processed first.
(2) the switch/cray pe_info.c is modified to at least see the pack steps and be aware of those ranks when calculating placements

Note that the modifications for (2) do indicate there are some important components missing from the stepd_step_rec_t data structure to achieve this.  

(a) no global_task_ids equivalent for hetjobs
(b) no per-pack cpus_per_task value (unsure of the value of this)
(c) mpmd seems to have special setup, may not be possible (or make sense) to mix mpmd and hetjob
(d) the cmd index may or may not be correct
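To illustrate point (1) above, here is a hypothetical sketch of the deep copy a switch_p_duplicate_jobinfo() call would need to perform so that non-leader steps reuse the leader's credential. The struct fields (ptag, pkey, ncookies, cookies) are illustrative only, not the real slurm_cray_jobinfo_t:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the switch plugin's per-step credential. */
typedef struct {
    int ptag;
    int pkey;
    int ncookies;
    int *cookies;    /* owned array, deep-copied below */
} jobinfo_t;

/* Duplicate the pack leader's jobinfo for a non-leader step, so all
 * components program the NIC with the same cookies/ptags. */
jobinfo_t *duplicate_jobinfo(const jobinfo_t *src)
{
    jobinfo_t *dst = malloc(sizeof(*dst));
    if (!dst)
        return NULL;
    *dst = *src;                          /* copy scalar fields */
    dst->cookies = malloc(src->ncookies * sizeof(int));
    if (!dst->cookies) {
        free(dst);
        return NULL;
    }
    memcpy(dst->cookies, src->cookies, src->ncookies * sizeof(int));
    return dst;
}
```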

Can see we are now getting cookies set:

boot-gerty:~ # pcmd -s -n ALL_COMPUTE "cat /sys/class/gni/kgni0/resources | grep PTag"
Output from 205:
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
--- PTag: 4 PKey: 0x80 JobId: 0x0 RefCount: 2 Suspend: Idle ---
--- PTag: 183 PKey: 0x15f7 JobId: 0x7e3 RefCount: 1 Suspend: Idle ---
--- PTag: 184 PKey: 0x15f8 JobId: 0x7e3 RefCount: 1 Suspend: Idle ---
Output from 21-23,28-30,32-41,43-54,56-63,200-202,206-207,212-223,225-255,384-390,392-404,406-423,428-435,440-447:
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
--- PTag: 4 PKey: 0x80 JobId: 0x0 RefCount: 2 Suspend: Idle ---
Output from 204:
--- PTag: 1 PKey: 0x0 JobId: 0x0 RefCount: 1 Suspend: Disabled ---
--- PTag: 4 PKey: 0x80 JobId: 0x0 RefCount: 2 Suspend: Idle ---
--- PTag: 35 PKey: 0x15f7 JobId: 0x19f8 RefCount: 1 Suspend: Idle ---
--- PTag: 36 PKey: 0x15f8 JobId: 0x19f8 RefCount: 1 Suspend: Idle ---
Node(s) 21-23,28-30,32-41,43-54,56-63,200-202,204-207,212-223,225-255,384-390,392-404,406-423,428-435,440-447 had exit code 0
boot-gerty:~ #



Ranks on the node are getting to pmi_init without any errors now:

Core was generated by `/global/homes/d/dmj/mpi/hello_gerty'.
#0  0x000000000064eef0 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
84	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x000000000064eef0 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x00000000005d2800 in inet_accept_with_address (ip_addr=<optimized out>, data=<optimized out>, sock=<optimized out>)
    at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c:735
#2  _pmi_inet_setup (net_info=0x9bc760 <_pmi_base_net_info>, retry=300) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c:1026
#3  0x00000000005c5900 in _pmi_init (spawned=0x7fffffff7048) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_init.c:1262
#4  0x00000000005c4af2 in _pmi_constructor () at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_init.c:343
#5  0x00000000005f1e57 in __libc_csu_init (argc=argc@entry=1, argv=argv@entry=0x7fffffff7198, envp=0x7fffffff71a8) at elf-init.c:88
#6  0x00000000005f18cd in __libc_start_main (main=0x40a190 <main>, argc=1, argv=0x7fffffff7198, init=0x5f1de0 <__libc_csu_init>,
    fini=0x5f1e70 <__libc_csu_fini>, rtld_fini=0x0, stack_end=0x7fffffff7188) at libc-start.c:245
#7  0x000000000040a0a9 in _start () at ../sysdeps/x86_64/start.S:118
(gdb) quit
dmj@nid00204:~> logout
nid00204:~ #


It seems to be stuck there. I don't have the source to Cray PMI, so I really can't go much further.
Comment 30 Doug Jacobsen 2019-04-17 09:25:44 MDT
I think the key takeaways here are that

(1): some protocol augmentations are likely needed to get the required data into the slurmstepd. I hope that we can at least review the results from this patch and the other related bug for possible protocol changes going into 19.05.

(2): access to that data is not sufficient yet; something still needs to be rearranged to make this work. I think Cray will need to get involved to move this further.
Comment 31 Danny Auble 2019-04-17 16:31:54 MDT
Comment on attachment 9931 [details]
approaching functionality?

A slight variant of this has been added to the cray_pack branch.
Comment 32 Brian F Gilmer 2019-04-18 11:36:12 MDT
Danny,

Kim McMahon volunteered to address any PMI questions that you have. 

kmcmahon@cray.com
Comment 33 Danny Auble 2019-04-18 11:58:36 MDT
Thanks Brian. Could you get Kim to add herself to this bug?
Comment 34 Brian F Gilmer 2019-04-18 12:34:49 MDT
I wasn't able to add Kim to the bug because she doesn't have access. Her e-mail needs to be added.
Comment 35 Danny Auble 2019-04-18 12:48:44 MDT
That is what I should have asked: can you please have her get an account here?
Comment 36 Danny Auble 2019-04-18 15:38:07 MDT
*** Ticket 6825 has been marked as a duplicate of this ticket. ***
Comment 37 Doug Jacobsen 2019-04-19 06:36:52 MDT
FYI, a few minor modifications needed to get current cray_pack branch to build:

diff --git a/src/plugins/switch/cray/pe_info.c b/src/plugins/switch/cray/pe_info.c
index c38c60f095..7a1a933dc4 100644
--- a/src/plugins/switch/cray/pe_info.c
+++ b/src/plugins/switch/cray/pe_info.c
@@ -50,7 +50,7 @@ typedef struct {
 } local_step_rec_t;

 // Static functions
-static void _get_het_info(local_step_rec_t *step_rec, stepd_step_rec_t *job);
+static void _setup_local_step_rec(local_step_rec_t *step_rec, stepd_step_rec_t *job);
 static int _get_first_pe(stepd_step_rec_t *job);
 static int _get_cmd_index(stepd_step_rec_t *job);
 static int *_get_cmd_map(stepd_step_rec_t *job);
@@ -133,7 +133,7 @@ int build_alpsc_pe_info(stepd_step_rec_t *job,
 	return SLURM_SUCCESS;
 }

-static void _get_het_info(local_step_rec_t *step_rec, stepd_step_rec_t *job)
+static void _setup_local_step_rec(local_step_rec_t *step_rec, stepd_step_rec_t *job)
 {
 	xassert(step_rec);
 	xassert(job);
@@ -142,13 +142,13 @@ static void _get_het_info(local_step_rec_t *step_rec, stepd_step_rec_t *job)
 	if (job->pack_jobid != NO_VAL) {
 		step_rec->nnodes = job->pack_nnodes;
 		step_rec->ntasks = job->pack_ntasks;
-		step_rec->complete_nodelist = job->pack_node_list;
+		step_rec->nodelist = job->pack_node_list;
 		step_rec->tasks_to_launch = job->pack_task_cnts;
 	} else {
 		step_rec->nnodes = job->nnodes;
 		step_rec->ntasks = job->ntasks;
-		step_rec->complete_nodelist = job->complete_nodelist;
-		step_rec->tasks_to_launch = job->tasks_to_launch;
+		step_rec->nodelist = job->msg->complete_nodelist;
+		step_rec->tasks_to_launch = job->msg->tasks_to_launch;
 	}
 }
Comment 38 Doug Jacobsen 2019-04-19 06:57:44 MDT
After making and deploying the modifications in comment #37, I'm looking at:

dmj@gerty:~> /global/gscratch1/sd/dmj/cray_pack/bin/srun ./mpi/hello_gerty : ./mpi/hello_gerty
....


Both jobs start.  Job 44+0 runs on nid00206, 44+1 on nid00207:
boot-gerty:~ # squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              44+0     debug hello_ge   nobody  R       1:44      1 nid00206
              44+1     debug hello_ge   nobody  R       1:44      1 nid00207
boot-gerty:~ #



The hello_gerty processes hang, but in different spots:

The process on nid00206 (i.e., the pack leader) hangs during the initial _pmi_constructor call before main() is reached:

```
Core was generated by `/global/u2/d/dmj/./mpi/hello_gerty'.
#0  0x000000000064eef0 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
84	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x000000000064eef0 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x00000000005d2800 in inet_accept_with_address (ip_addr=<optimized out>, data=<optimized out>, sock=<optimized out>)
    at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c:735
#2  _pmi_inet_setup (net_info=0x9bc760 <_pmi_base_net_info>, retry=300) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c:1026
#3  0x00000000005c5900 in _pmi_init (spawned=0x7fffffff70e8) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_init.c:1262
#4  0x00000000005c4af2 in _pmi_constructor () at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_init.c:343
#5  0x00000000005f1e57 in __libc_csu_init (argc=argc@entry=1, argv=argv@entry=0x7fffffff7238, envp=0x7fffffff7248) at elf-init.c:88
#6  0x00000000005f18cd in __libc_start_main (main=0x40a190 <main>, argc=1, argv=0x7fffffff7238, init=0x5f1de0 <__libc_csu_init>,
    fini=0x5f1e70 <__libc_csu_fini>, rtld_fini=0x0, stack_end=0x7fffffff7228) at libc-start.c:245
#7  0x000000000040a0a9 in _start () at ../sysdeps/x86_64/start.S:118
(gdb) up
#1  0x00000000005d2800 in inet_accept_with_address (ip_addr=<optimized out>, data=<optimized out>, sock=<optimized out>)
    at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c:735
735	/notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c: No such file or directory.
(gdb) up
#2  _pmi_inet_setup (net_info=0x9bc760 <_pmi_base_net_info>, retry=300) at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c:1026
1026	in /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/pmi_inet.c
(gdb) print *net_info
$1 = {have_controller = 0, num_targets = 1, my_inet_id = 206, portnum = 63005, listen_sock = 4, last_downed_node_count = 0, controller_nid = -1,
  control_net_id = 0, target_nids = 0x9befc0, controller_hostname = '\000' <repeats 63 times>}
(gdb) print *net_info->target_nids
$2 = 207
(gdb)
```

It seems to have the idea that it would like to talk with nid00207 though, so that is nice.  There are some interesting looking symbols in frame #4 (_pmi_constructor), but I can't tell if they have been initialized yet, so the fact that they are populated with apparent garbage may be meaningless:
```
(gdb) frame 4
#4  0x00000000005c4af2 in _pmi_constructor () at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_init.c:343
343	in /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_init.c
(gdb) print my_rank
$6 = 31
(gdb) print pg_info
$7 = {pg_id = 8241994560065789796, size = 1601398121, my_rank = 942747698, napps = 825110830, appnum = 1815032630, pes_this_node = 2020961897,
  pg_pes_per_smp = 1818979631, base_pe_on_node = 1651076143, my_lrank = 1953392943, base_pe_in_app = 0x5f82b2 <getenv+194>, pes_in_app = 0x37ffffa00,
  pes_in_app_this_smp = 7549634, pes_in_app_this_smp_list = 0x400000003, my_app_lrank = 5, apps_share_node = 6}
(gdb)
```




Job 44+1 (node nid00207), however, hangs in a different place, much further along (I think): past the start of main() and somewhere inside PMI_Init():
```
(gdb) bt
#0  __strcmp_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:41
#1  0x00000000005c5d9c in _pmi2_info_getjobattr (name=<optimized out>, value=0x7fffffff6700 "\377", valuelen=1024, found=0x7fffffff66ec)
    at /notbackedup/users/sko/mpt/mpt_base/pmi/src/pmi_core/_pmi_jobattr.c:524
#2  0x00000000005c0a79 in PMI2_Info_GetJobAttr (name=0x2aaaaaab41c0 "", value=0x6d39a5 "PMI_process_mapping", valuelen=0, found=0xf)
    at /notbackedup/users/sko/mpt/mpt_base/pmi/src/api/kvs/jobattr.c:51
#3  0x0000000000443b69 in MPIDI_Populate_vc_node_ids ()
#4  0x000000000043fa66 in MPID_Init ()
#5  0x0000000000411ed9 in MPIR_Init_thread ()
#6  0x000000000041193e in PMPI_Init ()
#7  0x000000000040a1e3 in main ()
(gdb)
```
Comment 39 Kim McMahon 2019-04-19 09:01:46 MDT
This looks to me like the Cray PMI ranks might be getting conflicting information at startup, between the two executables.  I'll provide some background as to what info CrayPMI needs, and how it accesses that data from slurm.

PMI calls only a few 'alps-plugin' functions to get the info it needs for startup. As far as I know, these alps-plugin functions should all be contained within the slurm source tree.

 1) alps_get_placement_info()

 2) alps_app_lli_* functions
     (lli_get_response, lli_set_response, lli_get_response_bytes)

Each rank calls these functions.

To get the APID, PMI calls the alps_app_lli function with option ALPS_APP_LLI_ALPS_REQ_APID. For an MPMD job, there can be only one APID.

To get the rank value, PMI reads an env variable.  There are a couple of options, but I believe slurm should set: "ALPS_APP_PE".

To get the network credentials, PMI calls the alps_app_lli function with option ALPS_APP_LLI_ALPS_REQ_GNI. There should be one set of credentials for the entire MPMD job.

To get the full job placement data, PMI calls alps_get_placement_info(). I think this is where the current problem is.  There are several parameters to this function.  CrayPMI ignores a few, but the others are important.

alps_get_placement_info(
    uint64_t apid,
    alpsAppLayout_t *appLayout,
    int **placementList,
    int **targetNids,           // CrayPMI ignores
    int **targetPes,            // CrayPMI ignores
    int **targetLen,            // CrayPMI ignores
    struct in_addr **targetIps, // CrayPMI ignores
    int **startPe,
    int **totalPes,
    int **nodePes,
    int **peCpus);              // CrayPMI ignores

Some of these parameters (placementList for example) should be the same for every rank in the MPMD job.

Here's one suggestion on how to debug this.  Since we know the --multi-prog format for MPMD works with slurm and CrayPMI, perhaps you could run a MPMD test using the multi-prog format and look at the placement_info that slurm provides to PMI via alps_get_placement_info().  Running that same example with the new ":" syntax should provide clues as to what data is not getting set correctly.
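For anyone reproducing that comparison, the known-good MPMD path uses a --multi-prog configuration file, something like the following (file name and executables are placeholders):

```
# mpmd.conf: task rank(s) on the left, executable to run on the right
0    ./master
1-2  ./slave
```

launched with "srun -n 3 --multi-prog mpmd.conf", versus the hetjob form "srun -n 1 ./master : -n 2 ./slave". Diffing the placement data Slurm hands to alps_get_placement_info() in the two cases should localize the discrepancy.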
Comment 40 Doug Jacobsen 2019-04-19 09:38:30 MDT
Awesome! Based on that, I suspect a couple of changes to the task/cray plugin will get the test code to function.
Comment 41 Doug Jacobsen 2019-04-19 23:33:50 MDT
@kmcmahon that was precisely the advice needed:

dmj@gerty:~> /global/gscratch1/sd/dmj/cray_pack/bin/srun ~/mpi/hello_gerty : ~/mpi/hello_gerty
srun: job 55 queued and waiting for resources
srun: job 55 has been allocated resources
hello from 0 of 2 on nid00206
hello from 1 of 2 on nid00207
dmj@gerty:~>


I've hard-coded specific rank ids to get exactly that test working, however, now it can be extended to be fully generic.

Thank you!
Comment 42 Kim McMahon 2019-04-21 17:29:51 MDT
Excellent news!  I'm very happy to hear that.  Thank you Doug for working the slurm side of this.
Comment 43 Doug Jacobsen 2019-04-22 06:44:25 MDT
Created attachment 9975 [details]
cray/mpich/pmi working with hetjob

Hello,

The attached patch adds all the taskids for the entire set of job packs to the protocol transmitting data from srun to slurmstepd. This is needed because Cray's PMI requires each node to know where every other task is. Other related adjustments are made. Node identification, in particular, is challenging, because the nodes are ordered arbitrarily (really alphabetically, but since that ordering isn't by pack, it can be surprising in the implementation).

I still have not addressed either the command index for Cray PMI or the cpus_per_task varying by pack. Both will require additional protocol changes to support; however, it is not clear that work is required, since (1) it's functional without it, and (2) it would seem Cray PMI may not need it, based on Kim's statements earlier.

To support the command index, I think we would need a way to map any given task back to its pack (and then simply use the pack_offset as the command index).
To support the cpus_per_task, I suppose we would need to send the cpus_per_task values for all the packs in the protocol; however, it's unclear whether srun always knows the correct values for this, and it may not be used anyway.

I have not addressed allowing MPMD (--multi-prog) to work within a hetjob, but I think it's reasonable to defer that, since it seems like an unlikely case (why use MPMD when you have the flexibility of a hetjob?). It would probably be good to have the srun CLI generate an error in that case.

Kim, can you please comment on whether or not the command index or cpus_per_task need to be addressed?

Danny, is there a chance this will be possible to merge for 19.05?
Comment 44 Doug Jacobsen 2019-04-22 07:49:25 MDT
Oh, I forgot to post the evidence.  The first test is just a basic non-het-job srun to be sure I didn't break it, the second varies between haswell, and knl.  The third does the same but adds a third component which selects haswell based on memory requirements (not a tagged constraint):

dmj@gerty:/global/gscratch1/sd/dmj> ./cray_pack/bin/srun -n 32 --ntasks-per-node=8 ~/mpi/hello_gerty
hello from 16 of 32 on nid00214
hello from 17 of 32 on nid00214
hello from 18 of 32 on nid00214
hello from 19 of 32 on nid00214
hello from 20 of 32 on nid00214
hello from 21 of 32 on nid00214
hello from 22 of 32 on nid00214
hello from 23 of 32 on nid00214
hello from 25 of 32 on nid00215
hello from 26 of 32 on nid00215
hello from 27 of 32 on nid00215
hello from 28 of 32 on nid00215
hello from 29 of 32 on nid00215
hello from 30 of 32 on nid00215
hello from 31 of 32 on nid00215
hello from 24 of 32 on nid00215
hello from 8 of 32 on nid00213
hello from 9 of 32 on nid00213
hello from 10 of 32 on nid00213
hello from 11 of 32 on nid00213
hello from 12 of 32 on nid00213
hello from 13 of 32 on nid00213
hello from 14 of 32 on nid00213
hello from 15 of 32 on nid00213
hello from 0 of 32 on nid00212
hello from 1 of 32 on nid00212
hello from 2 of 32 on nid00212
hello from 3 of 32 on nid00212
hello from 4 of 32 on nid00212
hello from 5 of 32 on nid00212
hello from 6 of 32 on nid00212
hello from 7 of 32 on nid00212
dmj@gerty:/global/gscratch1/sd/dmj> ./cray_pack/bin/srun -C haswell -n 32 --ntasks-per-node=8 ~/mpi/hello_gerty : -C knl -n 68 ~/mpi/hello_gerty
srun: job 619 queued and waiting for resources
srun: job 619 has been allocated resources
hello from 0 of 100 on nid00021
hello from 1 of 100 on nid00021
hello from 2 of 100 on nid00021
hello from 3 of 100 on nid00021
hello from 4 of 100 on nid00021
hello from 5 of 100 on nid00021
hello from 24 of 100 on nid00028
hello from 6 of 100 on nid00021
hello from 25 of 100 on nid00028
hello from 7 of 100 on nid00021
hello from 26 of 100 on nid00028
hello from 28 of 100 on nid00028
hello from 29 of 100 on nid00028
hello from 30 of 100 on nid00028
hello from 31 of 100 on nid00028
hello from 27 of 100 on nid00028
hello from 8 of 100 on nid00022
hello from 9 of 100 on nid00022
hello from 10 of 100 on nid00022
hello from 11 of 100 on nid00022
hello from 12 of 100 on nid00022
hello from 13 of 100 on nid00022
hello from 14 of 100 on nid00022
hello from 15 of 100 on nid00022
hello from 17 of 100 on nid00023
hello from 19 of 100 on nid00023
hello from 20 of 100 on nid00023
hello from 21 of 100 on nid00023
hello from 22 of 100 on nid00023
hello from 23 of 100 on nid00023
hello from 16 of 100 on nid00023
hello from 18 of 100 on nid00023
hello from 36 of 100 on nid00200
hello from 37 of 100 on nid00200
hello from 50 of 100 on nid00200
hello from 64 of 100 on nid00200
hello from 75 of 100 on nid00200
hello from 77 of 100 on nid00200
hello from 80 of 100 on nid00200
hello from 32 of 100 on nid00200
hello from 33 of 100 on nid00200
hello from 34 of 100 on nid00200
hello from 35 of 100 on nid00200
hello from 38 of 100 on nid00200
hello from 39 of 100 on nid00200
hello from 40 of 100 on nid00200
hello from 41 of 100 on nid00200
hello from 42 of 100 on nid00200
hello from 43 of 100 on nid00200
hello from 44 of 100 on nid00200
hello from 45 of 100 on nid00200
hello from 46 of 100 on nid00200
hello from 47 of 100 on nid00200
hello from 48 of 100 on nid00200
hello from 49 of 100 on nid00200
hello from 51 of 100 on nid00200
hello from 52 of 100 on nid00200
hello from 53 of 100 on nid00200
hello from 54 of 100 on nid00200
hello from 55 of 100 on nid00200
hello from 56 of 100 on nid00200
hello from 57 of 100 on nid00200
hello from 58 of 100 on nid00200
hello from 59 of 100 on nid00200
hello from 60 of 100 on nid00200
hello from 61 of 100 on nid00200
hello from 62 of 100 on nid00200
hello from 63 of 100 on nid00200
hello from 65 of 100 on nid00200
hello from 66 of 100 on nid00200
hello from 67 of 100 on nid00200
hello from 68 of 100 on nid00200
hello from 69 of 100 on nid00200
hello from 70 of 100 on nid00200
hello from 71 of 100 on nid00200
hello from 72 of 100 on nid00200
hello from 73 of 100 on nid00200
hello from 74 of 100 on nid00200
hello from 76 of 100 on nid00200
hello from 78 of 100 on nid00200
hello from 79 of 100 on nid00200
hello from 81 of 100 on nid00200
hello from 82 of 100 on nid00200
hello from 83 of 100 on nid00200
hello from 84 of 100 on nid00200
hello from 85 of 100 on nid00200
hello from 86 of 100 on nid00200
hello from 87 of 100 on nid00200
hello from 88 of 100 on nid00200
hello from 89 of 100 on nid00200
hello from 90 of 100 on nid00200
hello from 91 of 100 on nid00200
hello from 92 of 100 on nid00200
hello from 93 of 100 on nid00200
hello from 94 of 100 on nid00200
hello from 95 of 100 on nid00200
hello from 96 of 100 on nid00200
hello from 97 of 100 on nid00200
hello from 98 of 100 on nid00200
hello from 99 of 100 on nid00200
dmj@gerty:/global/gscratch1/sd/dmj> ./cray_pack/bin/srun -C haswell -n 8 --ntasks-per-node=2 ~/mpi/hello_gerty : -C knl -n 2 ~/mpi/hello_gerty : --mem=100G -n 4 ~/mpi/hello_gerty
srun: job 656 queued and waiting for resources
srun: job 656 has been allocated resources
hello from 5 of 14 on nid00023
hello from 4 of 14 on nid00023
hello from 6 of 14 on nid00028
hello from 7 of 14 on nid00028
hello from 2 of 14 on nid00022
hello from 3 of 14 on nid00022
hello from 11 of 14 on nid00206
hello from 12 of 14 on nid00206
hello from 13 of 14 on nid00206
hello from 10 of 14 on nid00206
hello from 0 of 14 on nid00021
hello from 1 of 14 on nid00021
hello from 8 of 14 on nid00200
hello from 9 of 14 on nid00200
dmj@gerty:/global/gscratch1/sd/dmj>


The regression testsuite is running at present.
Comment 45 Kim McMahon 2019-04-22 09:11:19 MDT
Hi Doug,

CrayPMI does not care about cpus_per_task, so I think we are fine there.

If you truly are running an MPMD job (multiple different executables), then CrayPMI needs alps_appLayout.numCmds to be accurate. This is the total number of executables launched together. Because Slurm supports multiple executables on the *same* node (something ALPS does not), we've previously made some adjustments in CrayPMI to handle that case. The alps-plugin interface via cmdNumber did not conveniently support it, so CrayPMI instead determines the correct cmdNumber (index) from the start_pes[i] and total_pes[i] provided in the call to alps_get_placement_info().
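The range check Kim describes can be sketched as follows (plain C; parameter names are borrowed from the alps_get_placement_info() signature above, but this is not Cray PMI source):

```c
/* Recover a PE's command index from the placement arrays alone:
 * command i owns the PE range [start_pe[i], start_pe[i] + total_pes[i]). */
int cmd_number(int my_pe, const int *start_pe, const int *total_pes,
               int num_cmds)
{
    for (int i = 0; i < num_cmds; i++)
        if (my_pe >= start_pe[i] && my_pe < start_pe[i] + total_pes[i])
            return i;
    return -1;    /* my_pe is not in any command's range */
}
```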

However, if you are using the hetjob feature simply to split one executable into multiple 'pieces' for placement reasons only, then I believe you can get by with telling PMI you have:
 alps_appLayout.numCmds=1
 alps_appLayout.cmdNumber=0
 alps_appLayout.total_pes[0] = total-number-of-processes-launched

since that would accurately reflect what is being launched.

-Kim
Comment 46 Danny Auble 2019-04-22 14:44:52 MDT
Comment on attachment 9975 [details]
cray/mpich/pmi working with hetjob

Thanks Doug, a variation of this has been pushed to the cray_pack branch.
Comment 47 Doug Jacobsen 2019-04-23 07:19:49 MDT
Created attachment 9993 [details]
set the command map based on job pack

Hello,

I'm assuming that each hetjob component runs a separate executable (whether or not that is actually true) in order to avoid further complicating this; however, even if we did want to identify the executable by hetjob component, we would _still_ need the change in this patch.

This patch adds a mapping of taskids to originating pack_offset numbers.  This allows the slurmstepd and associated plugins to identify which hetjob component any given task belongs to.

From that, I'm configuring switch/cray to set up the command map to take the pack_offset as the command index.  This effectively tells ALPS/PMI that each hetjob component is running a separate executable.
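A sketch of that mapping (plain C, simplified from what the patch actually does; the names are illustrative, not Slurm identifiers):

```c
/* Assign each global task id the pack_offset of its hetjob component,
 * so ALPS/PMI treats each component as a distinct MPMD command.
 * pack_task_cnts[] is the per-component task count; cmd_map[] must
 * have room for the total task count. */
void build_cmd_map(const int *pack_task_cnts, int num_packs, int *cmd_map)
{
    int task = 0;
    for (int pack = 0; pack < num_packs; pack++)
        for (int i = 0; i < pack_task_cnts[pack]; i++)
            cmd_map[task++] = pack;    /* pack_offset as command index */
}
```

For "srun -n 8 a.out : -n 2 b.out" this yields a map of eight 0s followed by two 1s.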

For trivial cases (mpi helloworld) it doesn't seem to matter if I use different or same executables in these cases (whether or not the command map is populated).  I have not tried a large variety of MPI capabilities though, so it may matter.

Kim, if we make this assumption that each job component is a separate executable will that be sufficient?

If it is sufficient, I think we've got basic functionality. Danny, what needs to be done to move this forward from here?
Comment 48 Moe Jette 2019-04-23 09:41:17 MDT
(In reply to Danny Auble from comment #46)
> Comment on attachment 9975 [details]
> cray/mpich/pmi working with hetjob
> 
> Thanks Doug, variations of this has been pushed to the cray_pack branch.

Danny asked me to review the work to this point. The only problem I discovered is that if an old version of srun (say Slurm version 18.08) tries to launch a heterogeneous job, then the pack_tids array will be NULL and cause aborts in the slurmd and/or slurmstepd daemons. I added some checks for NULL pack_tids. An old srun will fail for switch/cray, and it should all work fine for non-Cray systems. The commit is here:
https://github.com/SchedMD/slurm/commit/219272ddd696e3fe1a5be72d1bc1c91ed63507c3

I'm wondering if we want to even build the pack_tids array for non-Cray systems or perhaps leave it there for possible future use.
Comment 49 Kim McMahon 2019-04-23 10:02:39 MDT
>  Kim, if we make this assumption that each job component is a separate executable will that be sufficient?

Yes, I think that should work.  Assuming N job components, tell PMI it has N commands, and have N values for start_pe and total_pes that each reflect their individual component.
Comment 52 Danny Auble 2019-04-23 14:59:08 MDT
Comment on attachment 9993 [details]
set the command map based on job pack

Doug, a slight variation of this is now in the cray_pack branch.
Comment 53 Danny Auble 2019-04-24 14:59:33 MDT
OK, I am calling this closed.  The cray_pack branch has been merged into master.

Thanks for everyone's help, especially Kim and Doug.  Without your help and tips this would have been nigh impossible.

Thanks!