Ticket 7039 - [CAST-22101] Unable to start step on pack group != 0 for cray system
Summary: [CAST-22101] Unable to start step on pack group != 0 for cray system
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Heterogeneous Jobs
Version: 19.05.x
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: Danny Auble
Duplicates: 7446 8329
 
Reported: 2019-05-16 10:08 MDT by Moe Jette
Modified: 2020-03-31 13:19 MDT (History)
7 users

See Also:
Site: SchedMD
Version Fixed: 20.02.2 20.11.0pre1


Attachments
my configuration on kachina test system, likely not important though (20.00 KB, application/x-tar)
2019-05-16 10:08 MDT, Moe Jette
patch (1.54 KB, text/plain)
2020-03-11 13:46 MDT, Brian F Gilmer
Patch to make hetjobs work for non component 0 steps. (8.80 KB, patch)
2020-03-17 15:29 MDT, Danny Auble

Description Moe Jette 2019-05-16 10:08:09 MDT
Created attachment 10245 [details]
my configuration on kachina test system, likely not important though

This is part of regression test38.7.
Starting a step on pack group 0 is fine.
Starting a step on pack group 1 results in no communications.
Starting a step across pack groups 0 and 1 was not tested, since the test needs an update so that cray hetjobs can be tested (already done in commit 9661022c3, but not tested since). I'll update the bug when I have updated information about that.

Here's a log:

TEST: 38.7
spawn /opt/cray/pe/craype/default/bin/cc -o test38.7.prog test38.7.prog.c
No supported cpu target is set, CRAY_CPU_TARGET=x86-64 will be used.
Load a valid targeting module or set CRAY_CPU_TARGET


TEST OF PACK GROUP 0

spawn /home/users/n15158/slurm/19.05/kachina//bin/sbatch -N1 -n2 --output=test38.7.output --error=test38.7.error -t1 : -N1 -n2 ./test38.7.input
Submitted batch job 4644
Job 4644 is in state PENDING, desire DONE
Job 4644 is in state RUNNING, desire DONE
Job 4644 is in state RUNNING, desire DONE
Job 4644 is DONE (COMPLETED)
spawn cat test38.7.output
Wed May 15 21:47:13 CDT 2019
PACK GROUP 0
Rank[0] on nid00040 just received msg from Rank 1 on nid00040
Rank[1] on nid00040 just received msg from Rank 0 on nid00040
Wed May 15 21:47:15 CDT 2019
TEST_COMPLETE


TEST OF PACK GROUP 1

spawn /home/users/n15158/slurm/19.05/kachina//bin/sbatch -N1 -n2 --output=test38.7.output --error=test38.7.error -t1 : -N1 -n2 ./test38.7.input
Submitted batch job 4646
Job 4646 is in state PENDING, desire DONE
Job 4646 is in state PENDING, desire DONE
Job 4646 is in state PENDING, desire DONE
Job 4646 is in state RUNNING, desire DONE
Job 4646 is DONE (COMPLETED)
spawn cat test38.7.output
Wed May 15 21:47:28 CDT 2019
PACK GROUP 1
Wed May 15 21:47:31 CDT 2019
TEST_COMPLETE

FAILURE: No MPI communications occurred
Comment 2 Danny Auble 2019-06-11 12:05:16 MDT
Doug have you tested this out yet?
Comment 3 Danny Auble 2019-07-12 10:21:24 MDT
Doug: Ping
Comment 4 Danny Auble 2019-07-26 10:23:20 MDT
Doug, any help here would be great.
Comment 5 Doug Jacobsen 2019-07-26 10:31:54 MDT
this was just reported internally yesterday as well.  my thought here is that we've tied everything to pack id 0 for getting the credential, which is great for the main purpose of the work.  However, if pack id 0 isn't being used in a given allocation then the needed credentials won't be set up.  my basic thought for addressing this is to set up the credential in the _lowest_ pack id represented by a discrete srun.

Next week, I plan to see if the data sent to the switch plugin includes the full set of requested packs for this overall srun (at one point i was playing with that, i think), and if not, we may need to enhance it to get those data there -- i.e., an rpc change, possibly, thus targeting master branch and brave 19.05 patchers.
Comment 6 Doug Jacobsen 2019-07-26 10:32:47 MDT
(In reply to Doug Jacobsen from comment #5)
> ...
> used in a given allocation then the needed credentials won't be setup.  my
> ...

allocation -- i meant a given job step.
Comment 7 Doug Jacobsen 2019-07-26 10:35:06 MDT
there are also some reported oddities in the cgroup setup:

+ srun --pack-group=0,1 --cpu-bind=cores xthi.intel
srun: Job 1480411 step creation temporarily disabled, retrying
srun: Step created for job 1480411
slurmstepd: error: Detected zonesort setup failure: Could not open job cpuset (1480411.2)
slurmstepd: error: Detected zonesort setup failure: Could not open job cpuset (1480411.2)
slurmstepd: error: Detected zonesort setup failure: Could not open job cpuset (1480411.2)
slurmstepd: error: Detected zonesort setup failure: Could not open job cpuset (1480411.2)
slurmstepd: error: Detected zonesort setup failure: Could not open job cpuset (1480411.2)
slurmstepd: error: Detected zonesort setup failure: Could not open job cpuset (1480411.2)
slurmstepd: error: Detected zonesort setup failure: Could not open job cpuset (1480411.2)
slurmstepd: error: Detected zonesort setup failure: Could not open job cpuset (1480411.2)
Hello from rank 0, thread 3, on nid00204. (core affinity = 12)
Hello from rank 0, thread 0, on nid00204. (core affinity = 0)
Hello from rank 0, thread 1, on nid00204. (core affinity = 4)
Hello from rank 0, thread 2, on nid00204. (core affinity = 8)

....



i.e., our zonesort spank plugin is expecting the 1480411.2 cpuset cgroup to exist, and it doesn't;  will need to look into that as well.
Comment 8 Marshall Garey 2019-07-30 13:40:56 MDT
*** Ticket 7446 has been marked as a duplicate of this ticket. ***
Comment 9 Marshall Garey 2019-07-30 13:42:54 MDT
I've marked bug 7446 as a duplicate of this one. SallocDefaultCommand is the primary culprit there. I have several private comments over there with some analysis of the problem. I think that 7446 will likely be solved by fixing this bug.

If you think that's a mistake, let me know.

- Marshall
Comment 10 Danny Auble 2019-11-07 14:13:27 MST
Doug, I am guessing you haven't had any more time to look at this?  If you do let me know :).
Comment 11 Doug Jacobsen 2019-11-07 14:21:30 MST
i'm hopeful to look at it before SC.  I suspect some protocol reorganizations may be needed but unsure
Comment 12 Felip Moll 2020-02-13 08:56:29 MST
Hi, I suspect that bug #8329 (private) is suffering from the same issues as the ones described here.

Is there any intention to keep working on this bug in the near future?

one issue in #8329 is that for the pack id 1 they're receiving:

slurmstepd: error: (switch_cray_aries.c: 752: switch_p_job_fini) jobinfo pointer was NULL


Also, when they're using mpi4py, do you know if this is expected, or does it happen on your systems too?:

xxx@xxx:~/packjob> ml -t
Currently Loaded Modulefiles:
modules/3.2.11.4
slurm/19.05.4-1
cray-python/3.7.3.2

xxx@xxx:~/packjob> cat job.sh
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --time='00:02:00'
#SBATCH --nodes=2
#SBATCH --output=pp.out
#SBATCH --error=pp.err
#SBATCH packjob
#SBATCH --time='00:02:00'
#SBATCH --nodes=1
srun --pack-group=0 -n 2 --exclusive python -c 'from mpi4py import MPI; print("Hello")' &
srun --pack-group=1 -n 1 --exclusive python -c 'from mpi4py import MPI; print("Hola")' &
wait

xxx@xxx:~/packjob> sbatch job.sh
Submitted batch job 624729

xxx@xxx:~/packjob> ll
total 12
-rw-r--r-- 1 xxx craypri 374 Jan 14 14:06 job.sh
-rw-r--r-- 1 xxx craypri 774 Jan 14 14:07 pp.err
-rw-r--r-- 1 xxx craypri  12 Jan 14 14:07 pp.out

xxx@xxx:~/packjob> more pp.err
Tue Jan 14 14:07:43 2020: [unset]:_pmi_alps_init:alps_get_placement_info returned with error -1
Tue Jan 14 14:07:43 2020: [unset]:_pmi_init:_pmi_alps_init returned -1
[Tue Jan 14 14:07:43 2020] [c1-0c0s8n2] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1
aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1
slurmstepd: error: (switch_cray_aries.c: 752: switch_p_job_fini) jobinfo pointer was NULL
srun: error: nid00226: task 0: Exited with exit code 255
srun: Terminating job step 624730.0
xxx@xxx:~/packjob>
Comment 14 Felip Moll 2020-02-19 08:41:12 MST
*** Ticket 8329 has been marked as a duplicate of this ticket. ***
Comment 15 Doug Jacobsen 2020-02-19 10:36:14 MST
yeah, i think the key here is that we need to identify the lowest numbered pack that a particular step is starting, and have that pack do the needed interactions with the cray stack (rather than fixating on pack 0)

to do this, we'll need all of the information about all packs starting in the step provided to the switch plugin interface.  I haven't had time to look at what data is available at that stage.
Comment 16 Brian F Gilmer 2020-02-25 06:01:11 MST
This bug is marked urgent in the HPE/Cray case. I raised the priority of this bug to keep the effort moving forward.
Comment 17 Brian F Gilmer 2020-02-25 06:15:04 MST
The pack information needed to identify the lowest-numbered pack doesn't seem to be available in the controller. Since only one of the job steps in the job needs to allocate credentials, any pack group will work. A flag marking the first pack group encountered by srun would work; it does not have to be the lowest-numbered pack group, it just has to be one and only one. If the flag is set by default and then cleared after the first pack group, it would work for all jobs.

in srun_job.c:create_srun_job()

		first_pack = true;
		while ((opt_local = get_next_opt(pack_offset))) {
			srun_opt_t *srun_opt = opt_local->srun_opt;
			xassert(srun_opt);
			….
			/* one and only one */
			first_pack = false;
		}


in step_mgr.c:step_create()

		/*
		 * We only want to set up the Aries switch for the first
		 * job with all the nodes in the total allocation along
		 * with that node count.
		 */
		//if (job_ptr->job_id == job_ptr->pack_job_id) {
                if (first_pack(job_id)) {
Comment 18 Danny Auble 2020-02-25 08:55:29 MST
Brian, we haven't been able to get time with the right people to make any progress on this.  Without this time we will not be able to progress.  If you do happen to have this time with them we would be happy to look over a working patch.
Comment 19 Brian F Gilmer 2020-02-25 09:06:57 MST
Danny,

Do you need support from the HPE/Cray side? There seems to be a problem getting access to Kachina, is that still the case?
Comment 20 Danny Auble 2020-02-25 10:15:21 MST
It is more than just system access, the only way we were able to get to this point was Doug J helping look at logs (and knowing what logs to look at) while we were doing trial and error.

I am guessing the end patch will be fairly small, but at this point we haven't had this be that big a problem for it to make cycles available.
Comment 21 Brian F Gilmer 2020-03-11 13:46:19 MDT
Created attachment 13351 [details]
patch

Proposed solution. This patch adds the pack_job_list to all of the jobs in the pack. This allows for a test against the first job and not just a job in pack-group=0. The logic in proc_req.c and step_mgr.c has been tested against a VM cluster. The changes in step_mgr.c have not been checked against an XC yet.
Comment 22 Danny Auble 2020-03-11 16:12:43 MDT
Brian, thanks for the idea, but I am not sure how it would make a difference.

In my testing of your patch, het_job_ptr->pack_job_id is always the same as job_ptr->pack_job_id, so it doesn't seem this does anything different than what was there.

I would expect the pack_job_id to be the same on all the parts of the hetjob.

Keep in mind pack_job_id (or het_job_id in 20.02+ Slurm code) is the job id of the lead het job (which sets up all the switch cookies and what not).

This may be the right spot in the code (step_mgr.c), as you will see the 'else' on the clause you are editing has as its first line...

/* assume that job offset 0 has already run! */

This is at least part of the problem.

Meaning it appears the real bug is that in situations where we never involve component 0 in the step, the cookies are never made for the job.

It looks like I can build the switch info once for the job and store it in the job_ptr. The problem is we currently don't have a dynamic_plugin_data_t *switch_job in the job structure as the step does, meaning if this is the fix, state wouldn't persist across a slurmctld restart. But that might not be that big a deal. It is hard to say, but based on what I am seeing, I am guessing not saving state will not be that big an issue, as we can just rebuild the pointer when the potential next step comes through.

Could someone test this for me before I go and look to actually fix it?

I would expect this kind of thing to be a valid test...

salloc -N1 -n1 : -N1 -n1

srun --pack-group=1 mpihelloworld

will fail.

srun --pack-group=0 sleep 1000&
srun --pack-group=1 mpihelloworld

will at least get set up with the correct switch info.

I just tested this on cori, though, and it does work as expected, so there is more to the problem, assuming my idea works as expected (on a non-cray system at least the code is executed correctly).

Having access to a cray system where I can mess around with stuff like this would be helpful.  Does anyone have a system I can play on with root?
Comment 23 Brian F Gilmer 2020-03-12 09:16:23 MDT
(In reply to Danny Auble from comment #22)
> In my testings of your patch
> 
> het_job_ptr->pack_job_id is always the same as job_ptr->pack_job_id so it
> doesn't seem this does anything different than what was there.
> 
> I would expect the pack_job_id to be the same on all the parts of the hetjob.

"het_job_ptr->pack_job_id is always the same as job_ptr->pack_job_id" this is not the case. The change makes sure that this happens only once for the het-jobs. If you are not seeing this then it is likely something went wrong in translating my non-XC code to XC code.

> 
> Keep in mind pack_job_id (or het_job_id in 20.02+ Slurm code) is the job id
> of the lead het job (which sets up all the switch cookies and what not).
> 
> This may be the right spot in the code (step_mgr.c) as you will see the
> 'else' on the clause you are editing has as it's first line...
> 
> /* assume that job offset 0 has already run! */
> 
> This is at least part of the problem.
> 
> Meaning it appears the real bug is in situations where we never involve
> component 0 in the step the cookies are never made for the job.

Yes, this is the bug. The approach I took was to rely on the job list and not the pack-leader. The change to proc_req.c makes sure the job-list is defined for all of the jobs in the job-pack. Then just request credentials on the first job regardless of whether it is the pack-leader or pack-group==0.

> 
> It looks like I can put build the switch info once for the job and store it
> in the job_ptr, the problem is we currently don't have a
> dynamic_plugin_data_t *switch_job as the step does in the structure, meaning
> if this is the fix state wouldn't persist from restarting the slurmctld. But
> that might not be that big a deal.  It is hard to say, but based on what I
> am seeing I am guessing it will not be that big an issue not saving state as
> we can just rebuild the pointer when the potential next step comes through.

This is not a problem. The process for acquiring credentials always happens on a job step. The issue with the het-jobs is that it should only be done once. So any fix for this bug would eliminate the case of _not_ acquiring credentials.

> 
> Could someone test this for me before I go and look to actually fix it?
> 
> I would expect this kind of thing to be a valid test...
> 
> salloc -N1 -n1 : -N1 -n1
> 
> srun --pack-group=1 mpihelloworld
> 
> will fail.
> 

This is the current situation.

> srun --pack-group=0 sleep 1000&
> srun --pack-group=1 mpihelloworld
> 
> will at least get set up with the correct switch info.

Yes, this is also the current situation.

> 
> I just tested this on cori though and it does work as expected though so
> there is more to the problem assuming my idea works as expected (on a
> non-cray system at least the code is executed correctly).
> 

You understand the bug correctly. It is limited to not passing control through the code in step_mgr.c at least once per job. That is the case for all jobs except when there is no pack-group==0.

> Having access to a cray system where I can mess around with stuff like this
> would be helpful.  Does anyone have a system I can play on with root?

I will check to see if a system can be made available.
Comment 24 Danny Auble 2020-03-12 10:47:02 MDT
(In reply to Brian F Gilmer from comment #23)
> (In reply to Danny Auble from comment #22)
> > In my testings of your patch
> > 
> > het_job_ptr->pack_job_id is always the same as job_ptr->pack_job_id so it
> > doesn't seem this does anything different than what was there.
> > 
> > I would expect the pack_job_id to be the same on all the parts of the hetjob.
> 
> "het_job_ptr->pack_job_id is always the same as job_ptr->pack_job_id" this
> is not the case. The change makes sure that this happens only once for the
> het-jobs. If you are not seeing this then it is likely something went wrong
> in translating my non-XC code to XC code.

I would be interested in how you are seeing what you are seeing (note your change does not affect batch jobs, only srun/salloc).  Perhaps I am not understanding what you mean, but if the pack_job_id doesn't point to the head component of the hetjob then all sorts of things would be wrong.

> 
> > 
> > Keep in mind pack_job_id (or het_job_id in 20.02+ Slurm code) is the job id
> > of the lead het job (which sets up all the switch cookies and what not).
> > 
> > This may be the right spot in the code (step_mgr.c) as you will see the
> > 'else' on the clause you are editing has as it's first line...
> > 
> > /* assume that job offset 0 has already run! */
> > 
> > This is at least part of the problem.
> > 
> > Meaning it appears the real bug is in situations where we never involve
> > component 0 in the step the cookies are never made for the job.
> 
> Yes, this is the bug. The approach I took was to rely on the job list and
> not the pack-leader. The change to proc_req.c make sure the job-list is
> defined for all of the jobs in the job-pack. Then just request credentials
> on the first job regardless of whether it is the pack-leader or
> pack-group==0. 
> 
> > 
> > It looks like I can put build the switch info once for the job and store it
> > in the job_ptr, the problem is we currently don't have a
> > dynamic_plugin_data_t *switch_job as the step does in the structure, meaning
> > if this is the fix state wouldn't persist from restarting the slurmctld. But
> > that might not be that big a deal.  It is hard to say, but based on what I
> > am seeing I am guessing it will not be that big an issue not saving state as
> > we can just rebuild the pointer when the potential next step comes through.
> 
> This is not a problem. The process for acquiring credentials always happens
> on a job step. The issue with the het-jobs is that it should only be done
> once. So any fix for this bug would eliminate the case of _not_ acquiring
> credentials.

I am not sure you follow this.  If we don't have state, we would need to acquire multiple times.  I am now seeing issues with the task plugin, as we need to handle the stepid translation to the apid.  At the moment I am thinking we could just use stepid 0 for all hetjobs and we should be set.

This is all theoretical at the moment.  I received access to Kachina and am working on getting it booted.

> 
> > 
> > Could someone test this for me before I go and look to actually fix it?
> > 
> > I would expect this kind of thing to be a valid test...
> > 
> > salloc -N1 -n1 : -N1 -n1
> > 
> > srun --pack-group=1 mpihelloworld
> > 
> > will fail.
> > 
> 
> This is the the current situation. 
> 
> > srun --pack-group=0 sleep 1000&
> > srun --pack-group=1 mpihelloworld
> > 
> > will at least get set up with the correct switch info.
> 
> Yes, this is also the current sitation.

Then you are seeing what I am seeing.

> 
> > 
> > I just tested this on cori though and it does work as expected though so
> > there is more to the problem assuming my idea works as expected (on a
> > non-cray system at least the code is executed correctly).
> > 
> 
> You understand the bug correctly. It is limited to not passing control
> through the code in step_mgr.c at least once per job. That is the case for
> all job except when there is no pack-group==0.

I don't think this is stated correctly.  The problem here is we don't have a component 0 of the step to grab the switch info from.  There is always a head component in the job, but since we don't have a step for that component the cookies never get made (and as the test above shows, even when they are made they are not done correctly for the step).

> 
> > Having access to a cray system where I can mess around with stuff like this
> > would be helpful.  Does anyone have a system I can play on with root?
> 
> I will check to see if a system can be made available.

Gracias, I am hoping Kachina will work out, but it is sort of slow going.
Comment 25 Brian F Gilmer 2020-03-12 11:37:39 MDT
(In reply to Danny Auble from comment #24)
> (In reply to Brian F Gilmer from comment #23)
> > (In reply to Danny Auble from comment #22)
> 
> I would be interested in how are you seeing what you are seeing (note your
> change does not affect batch jobs only srun/salloc).  Perhaps I am not
> understanding what you mean, but if the pack_job_id doesn't point to the
> head component of the hetjob then all sorts of things would be wrong.
> 

OK, I see my mistake. What I am looking at is the first job in the list, which would have job_id == pack_job_id, so the result is the same: if group 0 is not used, then the portion of the code that acquires the credentials is not invoked.
Comment 26 Brian F Gilmer 2020-03-12 11:40:49 MDT
I need to take a look at how the credentials are shared with a batch job. That would be the same in a srun ... & ; srun ... & scenario.
Comment 27 Danny Auble 2020-03-12 11:54:44 MDT
(In reply to Brian F Gilmer from comment #26)
> I need to take a look at how the credentials are shared with a batch job.
> The would be the same in a srun ... & ; srun ... & scenario.

There is no need.  I am going down a different path.  I don't believe your patch is needed, but did point me to the spot that does need changing.  I have a patch now that will hopefully get us closer, just waiting for kachina to boot.
Comment 28 Danny Auble 2020-03-17 15:29:12 MDT
Created attachment 13402 [details]
Patch to make hetjobs work for non component 0 steps.

Hey guys, try this patch and see if it does what you would expect.

I would like to get this into 20.02.1 which we plan to tag next week, so testing sooner would be better than later.
Comment 30 Doug Jacobsen 2020-03-21 23:15:22 MDT
I'm running slurm20.02.0 with this patch applied on gerty (XC40 system).  

There seems to be an issue with salloc and our default salloc command:

dmj@nid00388:/global/gscratch1/sd/dmj/slurm2002> srun hostname
nid00389
nid00388
dmj@nid00388:/global/gscratch1/sd/dmj/slurm2002> exit
salloc: Relinquishing job allocation 8
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/salloc -N 2 -p system -C haswell : -N 4 -p system -C knl
salloc: Pending job allocation 9
salloc: job 9 queued and waiting for resources
salloc: job 9 has been allocated resources
salloc: Granted job allocation 9
salloc: Waiting for resource configuration
salloc: Nodes nid00[388-389] are ready for job
srun: error: task 0 launch failed: Error configuring interconnect
salloc: Relinquishing job allocation 9
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002>

However, trying again and specifying /bin/bash for salloc, I can then control things a bit better:

salloc: Nodes nid00[388-389] are ready for job
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun hostname
nid00388
nid00389
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun --pack 0 hostname
nid00388
nid00389
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun --pack 1 hostname
nid00032
nid00035
nid00034
nid00033
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun --pack 0,1 hostname
nid00388
nid00389
nid00035
nid00033
nid00032
nid00034
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> cp ~/mpi/helloworld.c .
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> cc helloworld.c -o hello
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun --pack 0,1 ./hello
hello from 2 of 6 on nid00389
hello from 1 of 6 on nid00388
hello from 6 of 6 on nid00035
hello from 3 of 6 on nid00032
hello from 4 of 6 on nid00033
hello from 5 of 6 on nid00034
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun --pack 0 ./hello
hello from 2 of 2 on nid00389
hello from 1 of 2 on nid00388
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002> ./bin/srun --pack 1 ./hello
hello from 4 of 4 on nid00035
hello from 1 of 4 on nid00032
hello from 3 of 4 on nid00034
hello from 2 of 4 on nid00033
dmj@gerty:/global/gscratch1/sd/dmj/slurm2002>



So one thing that seems to be new is that in order to run in any pack other than pack 0, I have to specify it; I think in 19.05 the default was to run in all packs. Maybe that is an intentional change in 20.02.

So the only problem I see right now is that a default salloc command of "srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --gres=craynetwork:0 --mpi=none $SHELL" fails with this patch.
Comment 31 Brian F Gilmer 2020-03-21 23:15:37 MDT
My new e-mail address is brian.gilmer@hpe.com
Comment 32 Danny Auble 2020-03-23 07:27:46 MDT
Thanks for testing Doug,

Can you confirm this works without the patch, or is this just a situation with 20.02 in general?
Comment 33 Doug Jacobsen 2020-03-23 09:39:08 MDT
I've only tested 20.02 with this patch.



Comment 34 Danny Auble 2020-03-24 16:42:33 MDT
Doug,

It appears this issue can happen (and always has) if you don't request the entire component on a step.

So in your situation you were requesting 4 nodes in the first component of the hetjob, and then the default command that ran requested only 1 node.

This would have happened on a regular srun as well.

This patch

diff --git a/src/srun/srun.c b/src/srun/srun.c
index 0ec2b215cf..c681c23c06 100644
--- a/src/srun/srun.c
+++ b/src/srun/srun.c
@@ -559,7 +559,9 @@ static void _launch_app(srun_job_t *job, List srun_job_list, bool got_alloc)
                                                    sizeof(uint32_t *));
                        memcpy(job->het_job_tids, tmp_tids,
                               sizeof(uint32_t *) * job->het_job_nnodes);
-                       job->het_job_node_list = xstrdup(job->nodelist);
+                       (void) slurm_step_ctx_get(job->step_ctx,
+                                                 SLURM_STEP_CTX_NODE_LIST,
+                                                 &job->het_job_node_list);
                        job->het_job_tid_offsets = xcalloc(job->ntasks,
                                                           sizeof(uint32_t));

Fixes this problem as well.  At the moment I am thinking this will probably go into 19.05 so you can at least run hetjobs there until you upgrade.  It is clear, though, that no one on your system is running hetjobs in salloc ;).

This along with the already attached patch appears to fix everything as you would expect.

Let me know if you find differently.  We are looking to tag on Thursday, so any amount of extra testing would be very welcome if you would like this in the next version of Slurm.
Comment 37 Miguel Gila 2020-03-27 03:33:57 MDT
Danny, Doug, many thanks for working on this. 

Danny, we don't have an upgrade window for 20.02 until May/June (at least) and some of our users are constantly hitting this, do you know already if it'll land on 19.05 as well?
Comment 38 Danny Auble 2020-03-30 14:20:31 MDT
Hey Miguel, sorry this patch will not be going into 19.05, only 20.02+.  This patch will most likely work with 19.05 though but at the moment we would prefer to avoid patching that version.  You are welcome to use this as a local patch if you would like until you update to 20.02.
Comment 39 Brian F Gilmer 2020-03-31 08:48:13 MDT
Hi Danny,

Waiting for 20.02 was discussed with the customer today. This is not an option for them. It sounds like they are not committed to moving to 20.02; disruption and stability of the version were both mentioned by them.

We are looking at moving forward with a 19.05 fix. The customer wants a 'certified' version, so HPE/Cray has to go through the testing regime in HPE/Cray. At this point, do you have any areas of concern we should be aware of?
Comment 40 Danny Auble 2020-03-31 09:03:49 MDT
Sorry Brian, while we think this will work fine with 19.05, the change is too great this late in the release and could potentially break the current hetjob functionality that does work.

The current options are run with a local patch or wait until moving to 20.02 in the May/June time frame as they have indicated. At the moment no tagged version of Slurm has these changes in it.

Something to understand is there very well may never be another 19.05 release, so even if this patch were put there, there may never be a blessed version of 19.05 containing it. At the moment, only major and security-related fixes are being considered for 19.05.
Comment 41 Brian F Gilmer 2020-03-31 09:24:53 MDT
Danny,

For understanding: the reason for not including this in 19.05 is that the change has a high risk of breaking current functionality and the version is in an end-of-life maintenance phase. This patched version of 19.05 would potentially fall outside of Slurm support since it would not be an officially released version of Slurm.

If HPE/Cray were to move ahead with providing a Slurm variant to the customer that would not be covered by our support contract with SchedMD. This also deviates from the Cray support model which is only to provide support for released versions of Slurm. Cray wanted to avoid the maintenance tail associated with a Cray variant of Slurm. This will be part of the discussion within HPE/Cray.

Thanks
Comment 42 Danny Auble 2020-03-31 09:52:52 MDT
(In reply to Brian F Gilmer from comment #41)
> Danny,
> 
> For understanding: the reason for not including this in 19.05 is that the
> change has a high risk of breaking current functionality and the version is
> in an end-of-life maintenance phase. This patched version of 19.05 would
> potentially fall outside of Slurm support since it would not be an
> officially released version of Slurm.

Correct, this patch is seen as an enhancement more than a bug fix, since 19.05 does work with hetjobs, just not in the cases listed here.  As such this patch would not be considered for a release as mature as 19.05 is.  This has always been our policy.

While we would do our best to support a patched version of the code the odds of a blessed 19.05 are rather low.

> 
> If HPE/Cray were to move ahead with providing a Slurm variant to the
> customer that would not be covered by our support contract with SchedMD.
> This also deviates from the Cray support model which is only to provide
> support for released versions of Slurm. Cray wanted to avoid the maintenance
> tail associated with a Cray variant of Slurm. This will be part of the
> discussion within HPE/Cray.

As mentioned above we would do our best to support them, but the request of a blessed version outside of 20.02+ is most likely not a reality.  While in most cases I support the HPE/Cray stance on only supporting released versions, there may need to be wiggle room on your end to allow testing of patches and such.  As they plan to move to 20.02 in just a couple of months, this doesn't seem like a horrible stopgap.  This problem has existed since the beginning and only now has gained any traction, almost a year later.

> 
> Thanks
Comment 45 Danny Auble 2020-03-31 13:19:22 MDT
This patch will be in 20.02.2.  Commit 6ee504895e0809.

Thanks for all those who helped get this going.

Please reopen if needed.