Doug, have you tested this out yet?

Ping.

Doug, any help here would be great. This was just reported internally yesterday as well.

My thought here is that we've tied everything to pack id 0 for getting the credential, which is great for the main purpose of the work. However, if pack id 0 isn't being used in a given allocation then the needed credentials won't be set up. My basic thought for addressing this is to set up the credential in the _lowest_ pack id represented by a discrete srun. Next week, I plan to see if the data sent to the switch plugin includes the full set of requested packs for this overall srun (at one point I was playing with that, I think), and if not, we may need to enhance it to get those data there -- i.e., possibly an RPC change, thus targeting the master branch and perhaps 19.05 patches.

(In reply to Doug Jacobsen from comment #5)
> ...
> used in a given allocation then the needed credentials won't be setup. my
> ...

By "allocation" I meant a given job step.

There are also some reported oddities in the cgroup setup:

+ srun --pack-group=0,1 --cpu-bind=cores xthi.intel
srun: Job 1480411 step creation temporarily disabled, retrying
srun: Step created for job 1480411
slurmstepd: error: Detected zonesort setup failure: Could not open job cpuset (1480411.2)
[previous error line repeated 8 times in total]
Hello from rank 0, thread 3, on nid00204. (core affinity = 12)
Hello from rank 0, thread 0, on nid00204. (core affinity = 0)
Hello from rank 0, thread 1, on nid00204. (core affinity = 4)
Hello from rank 0, thread 2, on nid00204. (core affinity = 8)
....

I.e., our zonesort spank plugin is expecting the 1480411.2 cpuset cgroup to exist, and it doesn't; will need to look into that as well.

*** Ticket 7446 has been marked as a duplicate of this ticket. ***

I've marked bug 7446 as a duplicate of this one. SallocDefaultCommand is the primary culprit there. I have several private comments over there with some analysis of the problem. I think that 7446 will likely be solved by fixing this bug. If you think that's a mistake, let me know. - Marshall

Doug, I am guessing you haven't had any more time to look at this? If you do, let me know :).

I'm hopeful to look at it before SC. I suspect some protocol reorganizations may be needed, but I'm unsure.

Hi, I suspect that bug #8329 (private) is suffering from the same issues as the ones described here. Is there any intention to keep working on this bug in the near future?
One issue in #8329 is that for pack id 1 they're receiving:

slurmstepd: error: (switch_cray_aries.c: 752: switch_p_job_fini) jobinfo pointer was NULL

Also, when they're using mpi4py, do you know if this is expected, or does it happen on your systems too?

xxx@xxx:~/packjob> ml -t
Currently Loaded Modulefiles:
modules/3.2.11.4
slurm/19.05.4-1
cray-python/3.7.3.2
xxx@xxx:~/packjob> cat job.sh
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --time='00:02:00'
#SBATCH --nodes=2
#SBATCH --output=pp.out
#SBATCH --error=pp.err
#SBATCH packjob
#SBATCH --time='00:02:00'
#SBATCH --nodes=1

srun --pack-group=0 -n 2 --exclusive python -c 'from mpi4py import MPI; print("Hello")' &
srun --pack-group=1 -n 1 --exclusive python -c 'from mpi4py import MPI; print("Hola")' &
wait
xxx@xxx:~/packjob> sbatch job.sh
Submitted batch job 624729
xxx@xxx:~/packjob> ll
total 12
-rw-r--r-- 1 xxx craypri 374 Jan 14 14:06 job.sh
-rw-r--r-- 1 xxx craypri 774 Jan 14 14:07 pp.err
-rw-r--r-- 1 xxx craypri  12 Jan 14 14:07 pp.out
xxx@xxx:~/packjob> more pp.err
Tue Jan 14 14:07:43 2020: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
Tue Jan 14 14:07:43 2020: [unset]:_pmi_init:_pmi_alps_init returned -1
[Tue Jan 14 14:07:43 2020] [c1-0c0s8n2] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1
aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1
slurmstepd: error: (switch_cray_aries.c: 752: switch_p_job_fini) jobinfo pointer was NULL
srun: error: nid00226: task 0: Exited with exit code 255
srun: Terminating job step 624730.0
xxx@xxx:~/packjob>

*** Ticket 8329 has been marked as a duplicate of this ticket.
***

Yeah, I think the key here is that we need to identify the lowest-numbered pack that a particular step is starting in, and have that pack do the needed interactions with the Cray stack (rather than fixating on pack 0). To do this, we'll need all of the information about all packs starting in the step provided to the switch plugin interface. I haven't had time to look at what data is available at that stage.

This bug is marked urgent in the HPE/Cray case. I raised the priority of this bug to keep the effort moving forward.

The pack information needed to identify the lowest-numbered pack doesn't seem to be available in the controller. Since only one of the job steps in the job needs to allocate credentials, any pack group will work. A flag marking the first pack group encountered by srun would work: it does not have to be the lowest-numbered pack group, it just has to be one and only one. If the flag is set by default and then cleared after the first pack group, it would work for all jobs.
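To make the two candidate policies concrete -- Doug's "lowest pack id in the step" versus the "first pack group encountered" flag -- here is a minimal sketch of the lowest-id selection. This is plain illustrative C with stand-in names, not actual Slurm code; the pack groups of a step are modeled as a simple array.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative only: given the pack (het-job) group ids that a single
 * srun is launching a step in, pick the group that should perform the
 * switch-credential setup. Choosing the lowest id present works even
 * when group 0 is not part of the step.
 */
static int credential_pack_group(const int *groups, size_t n)
{
	int lowest = groups[0];

	for (size_t i = 1; i < n; i++)
		if (groups[i] < lowest)
			lowest = groups[i];
	return lowest;
}
```

With `--pack-group=1,2` this would hand the credential setup to group 1 rather than waiting for a group 0 step that may never exist.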
In srun_job.c:create_srun_job():
	first_pack = true;
	while ((opt_local = get_next_opt(pack_offset))) {
		srun_opt_t *srun_opt = opt_local->srun_opt;
		xassert(srun_opt);
		...
		/* one and only one */
		first_pack = false;
	}
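The one-shot flag pattern above can be sketched in isolation. This is a self-contained stand-in, not Slurm code: get_next_opt() and the srun option types are replaced by a plain array of pack group ids, and the function reports which group ended up doing the setup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Stand-in for the create_srun_job() loop: walk the pack groups this
 * srun launches and perform the credential setup for exactly one of
 * them -- the first encountered -- clearing the flag afterwards.
 * Returns the group that did the setup, or -1 if there were none.
 */
static int setup_credential_once(const int *pack_groups, size_t n)
{
	bool first_pack = true;
	int setup_group = -1;

	for (size_t i = 0; i < n; i++) {
		if (first_pack) {
			/* one and only one */
			setup_group = pack_groups[i];
			first_pack = false;
		}
		/* per-group launch work would continue here */
	}
	return setup_group;
}
```

Note this deliberately does not require group 0, or even the lowest group, to be present: whichever group the loop visits first takes on the setup.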
In step_mgr.c:step_create():
/*
* We only want to set up the Aries switch for the first
* job with all the nodes in the total allocation along
* with that node count.
*/
//if (job_ptr->job_id == job_ptr->pack_job_id) {
if (first_pack(job_id)) {
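A hypothetical first_pack() with the semantics assumed above might look like the following. This is illustrative C only, not the actual controller code: the pack job list is modeled as an array of job ids in component order, where the real implementation would consult the job record's pack job list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: report whether job_id is the first component in
 * the pack job list, so that only that component sets up the Aries
 * switch credential regardless of whether it is pack-group 0.
 */
static bool first_pack(uint32_t job_id, const uint32_t *pack_job_ids,
		       size_t n)
{
	return n > 0 && pack_job_ids[0] == job_id;
}
```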
Brian, we haven't been able to get time with the right people to make any progress on this. Without that time we will not be able to progress. If you do happen to get time with them, we would be happy to look over a working patch.

Danny, do you need support from the HPE/Cray side? There seems to be a problem getting access to Kachina; is that still the case?

It is more than just system access. The only way we were able to get to this point was Doug J helping look at logs (and knowing which logs to look at) while we were doing trial and error. I am guessing the end patch will be fairly small, but at this point this hasn't been a big enough problem for us to make cycles available.

Created attachment 13351 [details]
patch
Proposed solution. This patch adds the pack_job_list to all of the jobs in the pack. This allows testing against the first job rather than just a job in pack-group=0. The logic in proc_req.c and step_mgr.c has been tested against a VM cluster. The changes in step_mgr.c have not been checked against an XC yet.
Brian, thanks for the idea; I am not sure how it would make a difference, though. In my testing of your patch, het_job_ptr->pack_job_id is always the same as job_ptr->pack_job_id, so it doesn't seem this does anything different than what was there. I would expect the pack_job_id to be the same on all the parts of the hetjob.

Keep in mind pack_job_id (or het_job_id in 20.02+ Slurm code) is the job id of the lead het job (which sets up all the switch cookies and whatnot).

This may be the right spot in the code (step_mgr.c), as you will see the 'else' on the clause you are editing has as its first line...

/* assume that job offset 0 has already run! */

This is at least part of the problem. Meaning it appears the real bug is in situations where we never involve component 0 in the step: the cookies are never made for the job.

It looks like I can build the switch info once for the job and store it in the job_ptr. The problem is we currently don't have a dynamic_plugin_data_t *switch_job in the job structure as the step does, meaning if this is the fix, state wouldn't persist across a slurmctld restart. But that might not be that big a deal. It is hard to say, but based on what I am seeing I am guessing not saving state will not be that big an issue, as we can just rebuild the pointer when the potential next step comes through.

Could someone test this for me before I go and look to actually fix it? I would expect this kind of thing to be a valid test...

salloc -N1 -n1 : -N1 -n1

srun --pack-group=1 mpihelloworld

will fail.

srun --pack-group=0 sleep 1000 &
srun --pack-group=1 mpihelloworld

will at least get set up with the correct switch info.

I just tested this on Cori and it does work as expected, so there is more to the problem, assuming my idea works as expected (on a non-Cray system at least the code is executed correctly). Having access to a Cray system where I can mess around with stuff like this would be helpful.
Does anyone have a system I can play on with root?

(In reply to Danny Auble from comment #22)
> In my testings of your patch
>
> het_job_ptr->pack_job_id is always the same as job_ptr->pack_job_id so it
> doesn't seem this does anything different than what was there.
>
> I would expect the pack_job_id to be the same on all the parts of the hetjob.

"het_job_ptr->pack_job_id is always the same as job_ptr->pack_job_id" -- this is not the case. The change makes sure that this happens only once for the het jobs. If you are not seeing this, then it is likely something went wrong in translating my non-XC code to XC code.

> Keep in mind pack_job_id (or het_job_id in 20.02+ Slurm code) is the job id
> of the lead het job (which sets up all the switch cookies and what not).
>
> This may be the right spot in the code (step_mgr.c) as you will see the
> 'else' on the clause you are editing has as its first line...
>
> /* assume that job offset 0 has already run! */
>
> This is at least part of the problem.
>
> Meaning it appears the real bug is in situations where we never involve
> component 0 in the step the cookies are never made for the job.

Yes, this is the bug. The approach I took was to rely on the job list and not the pack leader. The change to proc_req.c makes sure the job list is defined for all of the jobs in the job pack. Then just request credentials on the first job regardless of whether it is the pack leader or pack-group==0.

> It looks like I can build the switch info once for the job and store it
> in the job_ptr, the problem is we currently don't have a
> dynamic_plugin_data_t *switch_job as the step does in the structure, meaning
> if this is the fix state wouldn't persist from restarting the slurmctld. But
> that might not be that big a deal. It is hard to say, but based on what I
> am seeing I am guessing it will not be that big an issue not saving state as
> we can just rebuild the pointer when the potential next step comes through.
This is not a problem. The process for acquiring credentials always happens on a job step. The issue with the het jobs is that it should only be done once. So any fix for this bug would eliminate the case of _not_ acquiring credentials.

> Could someone test this for me before I go and look to actually fix it?
>
> I would expect this kind of thing to be a valid test...
>
> salloc -N1 -n1 : -N1 -n1
>
> srun --pack-group=1 mpihelloworld
>
> will fail.

This is the current situation.

> srun --pack-group=0 sleep 1000&
> srun --pack-group=1 mpihelloworld
>
> will at least get set up with the correct switch info.

Yes, this is also the current situation.

> I just tested this on cori though and it does work as expected though so
> there is more to the problem assuming my idea works as expected (on a
> non-cray system at least the code is executed correctly).

You understand the bug correctly. It is limited to not passing control through the code in step_mgr.c at least once per job. That is the case for all jobs except when there is no pack-group==0.

> Having access to a cray system where I can mess around with stuff like this
> would be helpful. Does anyone have a system I can play on with root?

I will check to see if a system can be made available.

(In reply to Brian F Gilmer from comment #23)
> (In reply to Danny Auble from comment #22)
> > In my testings of your patch
> >
> > het_job_ptr->pack_job_id is always the same as job_ptr->pack_job_id so it
> > doesn't seem this does anything different than what was there.
> >
> > I would expect the pack_job_id to be the same on all the parts of the hetjob.
>
> "het_job_ptr->pack_job_id is always the same as job_ptr->pack_job_id" this
> is not the case. The change makes sure that this happens only once for the
> het-jobs. If you are not seeing this then it is likely something went wrong
> in translating my non-XC code to XC code.
I would be interested in how you are seeing what you are seeing (note your change does not affect batch jobs, only srun/salloc). Perhaps I am not understanding what you mean, but if the pack_job_id doesn't point to the head component of the hetjob then all sorts of things would be wrong.

> > Keep in mind pack_job_id (or het_job_id in 20.02+ Slurm code) is the job id
> > of the lead het job (which sets up all the switch cookies and what not).
> >
> > This may be the right spot in the code (step_mgr.c) as you will see the
> > 'else' on the clause you are editing has as its first line...
> >
> > /* assume that job offset 0 has already run! */
> >
> > This is at least part of the problem.
> >
> > Meaning it appears the real bug is in situations where we never involve
> > component 0 in the step the cookies are never made for the job.
>
> Yes, this is the bug. The approach I took was to rely on the job list and
> not the pack-leader. The change to proc_req.c makes sure the job-list is
> defined for all of the jobs in the job-pack. Then just request credentials
> on the first job regardless of whether it is the pack-leader or
> pack-group==0.
>
> > It looks like I can build the switch info once for the job and store it
> > in the job_ptr, the problem is we currently don't have a
> > dynamic_plugin_data_t *switch_job as the step does in the structure, meaning
> > if this is the fix state wouldn't persist from restarting the slurmctld. But
> > that might not be that big a deal. It is hard to say, but based on what I
> > am seeing I am guessing it will not be that big an issue not saving state as
> > we can just rebuild the pointer when the potential next step comes through.
>
> This is not a problem. The process for acquiring credentials always happens
> on a job step. The issue with the het-jobs is that it should only be done
> once. So any fix for this bug would eliminate the case of _not_ acquiring
> credentials.

I am not sure you follow this.
If we don't have state, we would need to acquire credentials multiple times. I am now seeing issues with the task plugin, as we need to handle the step id translation to the apid. At the moment I am thinking we could just use step id 0 for all hetjobs and we should be set. This is all theoretical at the moment. I received access to Kachina and am working on getting it booted.

> > Could someone test this for me before I go and look to actually fix it?
> >
> > I would expect this kind of thing to be a valid test...
> >
> > salloc -N1 -n1 : -N1 -n1
> >
> > srun --pack-group=1 mpihelloworld
> >
> > will fail.
>
> This is the current situation.
>
> > srun --pack-group=0 sleep 1000&
> > srun --pack-group=1 mpihelloworld
> >
> > will at least get set up with the correct switch info.
>
> Yes, this is also the current situation.

Then you are seeing what I am seeing.

> > I just tested this on cori though and it does work as expected though so
> > there is more to the problem assuming my idea works as expected (on a
> > non-cray system at least the code is executed correctly).
>
> You understand the bug correctly. It is limited to not passing control
> through the code in step_mgr.c at least once per job. That is the case for
> all jobs except when there is no pack-group==0.

I don't think this is stated correctly. The problem here is we don't have a component 0 of the step to grab the switch info from. There is always a head component in the job, but since we don't have a step for that component, the cookies never get made (and as the test above shows, even when they are made they are not done correctly for the step).

> > Having access to a cray system where I can mess around with stuff like this
> > would be helpful. Does anyone have a system I can play on with root?
>
> I will check to see if a system can be made available.

Gracias. I am hoping Kachina will work out, but it is sort of slow going.
(In reply to Danny Auble from comment #24)
> (In reply to Brian F Gilmer from comment #23)
> > (In reply to Danny Auble from comment #22)
>
> I would be interested in how are you seeing what you are seeing (note your
> change does not affect batch jobs only srun/salloc). Perhaps I am not
> understanding what you mean, but if the pack_job_id doesn't point to the
> head component of the hetjob then all sorts of things would be wrong.

OK, I see my mistake. What I am looking at is the first job in the list, which would have job_id==pack_job_id; _if_ so, the result is the same: if group 0 is not used, then the portion of the code that acquires the credentials is not invoked.

I need to take a look at how the credentials are shared with a batch job. That would be the same in a "srun ... & ; srun ... &" scenario.

(In reply to Brian F Gilmer from comment #26)
> I need to take a look at how the credentials are shared with a batch job.
> The would be the same in a srun ... & ; srun ... & scenario.

There is no need. I am going down a different path. I don't believe your patch is needed, but it did point me to the spot that does need changing. I have a patch now that will hopefully get us closer; just waiting for Kachina to boot.

Created attachment 13402 [details]
Patch to make hetjobs work for non component 0 steps.
Hey guys, try this patch and see if it does what you would expect.
I would like to get this into 20.02.1 which we plan to tag next week, so testing sooner would be better than later.
I'm running Slurm 20.02.0 with this patch applied on gerty (an XC40 system). There seems to be an issue with salloc and our default salloc command:

dmj@nid00388:/global/gscratch1/sd/dmj/slurm2002> srun hostname
nid00389
nid00388
dmj@nid00388:/global/gscratch1/sd/dmj/slurm2002> exit
salloc: Relinquishing job allocation 8
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/salloc -N 2 -p system -C haswell : -N 4 -p system -C knl
salloc: Pending job allocation 9
salloc: job 9 queued and waiting for resources
salloc: job 9 has been allocated resources
salloc: Granted job allocation 9
salloc: Waiting for resource configuration
salloc: Nodes nid00[388-389] are ready for job
srun: error: task 0 launch failed: Error configuring interconnect
salloc: Relinquishing job allocation 9
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002>

However, trying again and specifying /bin/bash for salloc, I can then control things a bit better:

salloc: Nodes nid00[388-389] are ready for job
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun hostname
nid00388
nid00389
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun --pack 0 hostname
nid00388
nid00389
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun --pack 1 hostname
nid00032
nid00035
nid00034
nid00033
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun --pack 0,1 hostname
nid00388
nid00389
nid00035
nid00033
nid00032
nid00034
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> cp ~/mpi/helloworld.c .
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> cc helloworld.c -o hello
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun --pack 0,1 ./hello
hello from 2 of 6 on nid00389
hello from 1 of 6 on nid00388
hello from 6 of 6 on nid00035
hello from 3 of 6 on nid00032
hello from 4 of 6 on nid00033
hello from 5 of 6 on nid00034
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun --pack 0 ./hello
hello from 2 of 2 on nid00389
hello from 1 of 2 on nid00388
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun --pack 1 ./hello
hello from 4 of 4 on nid00035
hello from 1 of 4 on nid00032
hello from 3 of 4 on nid00034
hello from 2 of 4 on nid00033
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002>

So one thing that seems to be new is that in order to run in any pack other than pack 0, I have to specify it explicitly; I think in 19.05 the default was to run in all packs, so maybe that is an intentional change in 20.02. The only problem I see right now is that a default salloc command of "srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --gres=craynetwork:0 --mpi=none $SHELL" fails with this patch.

My new e-mail address is brian.gilmer@hpe.com

Thanks for testing, Doug. Can you confirm this works without the patch, or is this just a situation with 20.02 in general?

I've only tested 20.02 with this patch.

Doug,
It appears this issue can happen (and always has) if you don't request the entire component on a step.
So in your situation you were requesting 4 nodes in the first component of the hetjob and then the default command ran only requested 1 node.
This would have happened on a regular srun as well.
This patch
diff --git a/src/srun/srun.c b/src/srun/srun.c
index 0ec2b215cf..c681c23c06 100644
--- a/src/srun/srun.c
+++ b/src/srun/srun.c
@@ -559,7 +559,9 @@ static void _launch_app(srun_job_t *job, List srun_job_list, bool got_alloc)
sizeof(uint32_t *));
memcpy(job->het_job_tids, tmp_tids,
sizeof(uint32_t *) * job->het_job_nnodes);
- job->het_job_node_list = xstrdup(job->nodelist);
+ (void) slurm_step_ctx_get(job->step_ctx,
+ SLURM_STEP_CTX_NODE_LIST,
+ &job->het_job_node_list);
job->het_job_tid_offsets = xcalloc(job->ntasks,
sizeof(uint32_t));
fixes this problem as well. At the moment I am thinking this will probably go into 19.05, so you can at least run hetjobs there until you upgrade. It is clear, though, that no one on your system is running hetjobs in salloc ;).
This along with the already attached patch appears to fix everything as you would expect.
Let me know if you find differently. We are looking to tag on Thursday, so any amount of extra testing would be very welcome if you would like this in the next version of Slurm.
Danny, Doug, many thanks for working on this. Danny, we don't have an upgrade window for 20.02 until May/June (at least) and some of our users are constantly hitting this; do you know already if it'll land on 19.05 as well?

Hey Miguel, sorry, this patch will not be going into 19.05, only 20.02+. This patch will most likely work with 19.05, but at the moment we would prefer to avoid patching that version. You are welcome to use this as a local patch if you would like until you update to 20.02.

Hi Danny, waiting for 20.02 was discussed with the customer today. This is not an option for them. It sounds like they are not committed to moving to 20.02; disruption and stability of the version were both mentioned by them. We are looking at moving forward with a 19.05 fix. The customer wants a 'certified' version, so HPE/Cray has to go through the HPE/Cray testing regime. At this point, do you have any areas of concern we should be aware of?

Sorry Brian, while we think this will work fine with 19.05, the change is too great this late in the release and could potentially break the current hetjob functionality that does work. The current options are to run with a local patch or wait until moving to 20.02 in the May/June time frame as they have indicated. At the moment no tagged version of Slurm has these changes in it. Something to understand is that there very well may never be another 19.05 release, so even if this patch were put there, there may never be a blessed version of 19.05 containing it. At the moment, major and security-related fixes are the only things being considered for 19.05.

Danny, for understanding: the reason for not including this in 19.05 is that the change has a high risk of breaking current functionality and the version is in an end-of-life maintenance phase. This patched version of 19.05 would potentially fall outside of Slurm support since it would not be an officially released version of Slurm.
If HPE/Cray were to move ahead with providing a Slurm variant to the customer, that would not be covered by our support contract with SchedMD. This also deviates from the Cray support model, which is to provide support only for released versions of Slurm. Cray wanted to avoid the maintenance tail associated with a Cray variant of Slurm. This will be part of the discussion within HPE/Cray. Thanks

(In reply to Brian F Gilmer from comment #41)
> Danny,
>
> For understanding: the reason for not including this in 19.05 is that the
> change has a high risk of breaking current functionality and the version is
> in an end-of-life maintenance phase. This patched version of 19.05 would
> potentially fall outside of Slurm support since it would not be an
> officially released version of Slurm.

Correct, this patch is seen as an enhancement more than a bug fix, since 19.05 does work with hetjobs, just not in the cases listed here. As such, this patch would not be considered for a release as mature as 19.05. This has always been our policy. While we would do our best to support a patched version of the code, the odds of a blessed 19.05 are rather low.

> If HPE/Cray were to move ahead with providing a Slurm variant to the
> customer that would not be covered by our support contract with SchedMD.
> This also deviates from the Cray support model which is only to provide
> support for released versions of Slurm. Cray wanted to avoid the maintenance
> tail associated with a Cray variant of Slurm. This will be part of the
> discussion within HPE/Cray.

As mentioned above, we would do our best to support them, but the request for a blessed version outside of 20.02+ is most likely not a reality. While in most cases I support the HPE/Cray stance on only supporting released versions, there may need to be wiggle room on your end to allow testing of patches and such. As they plan to move to 20.02 in just a couple of months, this doesn't seem like a horrible stopgap.
This problem has existed since the beginning and only now has gained any traction, almost a year later.

> Thanks

This patch will be in 20.02.2. Commit 6ee504895e0809. Thanks to all those who helped get this going. Please reopen if needed.
Created attachment 10245 [details]
my configuration on kachina test system, likely not important though

This is part of regression test38.7. Starting a step on pack group 0 is fine. Starting a step on pack group 1 results in no communications. Starting a step across pack groups 0 and 1 was not tested, due to the test needing an update so that Cray hetjobs can be tested (already done in commit 9661022c3, but not tested since). I'll update the bug when I have updated information about that. Here's a log:

TEST: 38.7
spawn /opt/cray/pe/craype/default/bin/cc -o test38.7.prog test38.7.prog.c
No supported cpu target is set, CRAY_CPU_TARGET=x86-64 will be used.
Load a valid targeting module or set CRAY_CPU_TARGET
TEST OF PACK GROUP 0
spawn /home/users/n15158/slurm/19.05/kachina//bin/sbatch -N1 -n2 --output=test38.7.output --error=test38.7.error -t1 : -N1 -n2 ./test38.7.input
Submitted batch job 4644
Job 4644 is in state PENDING, desire DONE
Job 4644 is in state RUNNING, desire DONE
Job 4644 is in state RUNNING, desire DONE
Job 4644 is DONE (COMPLETED)
spawn cat test38.7.output
Wed May 15 21:47:13 CDT 2019
PACK GROUP 0
Rank[0] on nid00040 just received msg from Rank 1 on nid00040
Rank[1] on nid00040 just received msg from Rank 0 on nid00040
Wed May 15 21:47:15 CDT 2019
TEST_COMPLETE
TEST OF PACK GROUP 1
spawn /home/users/n15158/slurm/19.05/kachina//bin/sbatch -N1 -n2 --output=test38.7.output --error=test38.7.error -t1 : -N1 -n2 ./test38.7.input
Submitted batch job 4646
Job 4646 is in state PENDING, desire DONE
Job 4646 is in state PENDING, desire DONE
Job 4646 is in state PENDING, desire DONE
Job 4646 is in state RUNNING, desire DONE
Job 4646 is DONE (COMPLETED)
spawn cat test38.7.output
Wed May 15 21:47:28 CDT 2019
PACK GROUP 1
Wed May 15 21:47:31 CDT 2019
TEST_COMPLETE
FAILURE: No MPI communications occurred