Ticket 3983

Summary:	Question: likely source of error message: ""poe: error: task 4 launch failed: Error configuring interconnect"
Product:	Slurm	Reporter:	Alan Benner <bennera>
Component:	slurmstepd	Assignee:	Danny Auble <da>
Status:	RESOLVED TIMEDOUT	QA Contact:
Severity:	3 - Medium Impact
Priority:	---
Version:	15.08.13
Hardware:	IBM PERCS
OS:	Linux
Site:	IBM (US)	Slinky Site:	---
Alineos Sites:	---	Atos/Eviden Sites:	---
Confidential Site:	---	Coreweave sites:	---
Cray Sites:	---	DS9 clusters:	---
Google sites:	---	HPCnow Sites:	---
HPE Sites:	---	IBM Sites:	---
NOAA SIte:	---	NoveTech Sites:	---
Nvidia HWinf-CS Sites:	---	OCF Sites:	---
Recursion Pharma Sites:	---	SFW Sites:	---
SNIC sites:	---	Tzag Elita Sites:	---
Linux Distro:	---	Machine Name:
CLE Version:		Version Fixed:
Target Release:	---	DevPrio:	---
Emory-Cloud Sites:	---

Description Alan Benner 2017-07-11 15:22:09 MDT

May not be a SLURM bug, but it's a puzzling error message, for a pretty serious problem (some nodes not usable) and we're having trouble figuring out the source of the problem.    

This is on a Power775/PERCS system, with the "Torrent" interconnect.

We're trying to do the simplest test, running "$ srun hostname"  on various nodes of the system.  

On most of the 100s of nodes, it runs perfectly normally and well. However, on few nodes (5 of them, at the moment), seemingly randomly distributed, this job fails, with this error message:
        "poe: "task 0 launch failed: Error configuring interconnect"

If one of these odd nodes were task 4 in the job, for example, we'd see the message: "poe: "task 4 launch failed: Error configuring interconnect"

We get the error message whenever a job includes these particular 5 nodes -- even when the job is run serially, with just the single nodes in the job.

I can see that this error message is generated here:

api/step_launch.c-1133-                return;
api/step_launch.c-1134-        }
api/step_launch.c-1135-
api/step_launch.c-1136-        if (msg->return_code) {
api/step_launch.c-1137-                for (i = 0; i < msg->count_of_pids; i++) {
api/step_launch.c:1138:                        error("task %u launch failed: %s",
api/step_launch.c-1139-                              msg->task_ids[i],
api/step_launch.c-1140-                              slurm_strerror(msg->return_code));
api/step_launch.c-1141-                        bit_set(sls->tasks_started, msg->task_ids[i]);
api/step_launch.c-1142-                        bit_set(sls->tasks_exited, msg->task_ids[i]);


As far as we can tell, the system is healthy - no hardware is broken, no other jobs are running, no zombie tasks are active. The "scontrol" command doesn't show anything unusual for these 5-6 nodes, from what we've seen.    It's just that "srun hostname" fails with this error message for these few nodes of the many-node system.

A couple other things that may or may not be clues:
- We've restarted the slurm daemon on the failing nodes - no help.
- There's no slurmstepd process being generated.
- We noticed that, when this fails, no "pmdv12" process gets started.
- It's version 1508 of SLURM.
- We've rebooted one of the failing nodes, and we're still getting the error.
- We've traced a bit of the code - it seems to be related to a flag ESLURM_INTERCONNECT_FAILURE  - out of the switch_g_build_jobinfo() function  -- which is odd, I thought that on Power775 systems, we were calling the switch_p_build_jobinfo()   --- perhaps we have to call both of them.

The fact that this is happening even after a reboot of the compute nodes is most surprising to us.

Comment 1 Alan Benner 2017-07-12 07:37:03 MDT

Importance of this should be 2   -- we do have a support contract. 

Not sure why it's overriding to "6 - No support contract".

Comment 2 Danny Auble 2017-07-12 09:44:15 MDT

Alan, this is strange, but does seem to point to something outside of Slurm's control.

Do you happen to see any messages in the slurmd log pointing to anything?

If you run a single node job on these 5-6 nodes does it fail as well?

If you can't find anything in the slurmd log you might find something of interest in the pmd/pnsd log files.

Also exporting

export MP_INFOLEVEL=4 

before your srun might give you more information, but I'm guessing the slurmd log will be the more helpful.

Comment 3 Danny Auble 2017-07-13 10:52:18 MDT

Alan, any update on this?

Comment 4 Alan Benner 2017-07-13 11:40:39 MDT

Thanks for the response.
One note: I'm working with an administrator of the system - unfortunately, I'm not able to look at the logs myself (system is in another state, and information can't be exported outside the building), so I am, to some extent, passing along information 2nd-hand - when I say "we" I mean "the system administrator did this, while I was on the phone with him".
- We looked in the slurmd log, couldn't find anything informative on this problem.
- The jobs did fail on both single-node jobs and multi-node jobs that included the 5-6 troublesome nodes.
- We ran with MP_INFOLEVEL=5 - got a lot of output, but couldn't find anything informative on this problem.

It seems likely that the problem is somehow in the interaction between slurm and poe (including pmd/pnsd), rather than in slurm itself.

This is the best clue:
We *were* able to resolve the problem - without finding root cause - by doing a "clean" slurm restart (/etc/init.d/slurm startclean). (Previous calls to /etc/init.d/slurm restart didn't resolve the problem). After calling ../slurm startclean, we could start jobs without problem, and without the "...Error configuring interconnect" message.

This would seem to indicate a problem between consistency of state between slurm and poe regarding the 5-6 troublesome nodes. With the "..slurm startclean" command run, these 5-6 nodes are working normally, so we won't be able to gather any new data on the problem.

Since system is now running, the importance of this is reduced - I've changed to 3 - but any ideas for an explanation for how we could have gotten in this state would be appreciated, to possibly prevent recurrence.

We *do* still have the problem of reduced job run performance on simple "hostname" test - on some *other* (i.e., different) specific nodes - and that problem was not resolved by the "..slurm startclean" command. This is why we opened the 3987 bug.

Comment 5 Danny Auble 2017-07-13 11:52:47 MDT

Thanks Alan, I am wondering if the windows on the switch were in a strange state.  Having the slurmd logs (if they had sufficient debug level, info+).  Even if the logs didn't seem to produce something relevant I would still like to see them.  If this comes back (probably won't) turning on the DebugFlag=switch would probably give you quite a bit of information.

The only place in the code I can see potential problems is if the state was somehow stored incorrectly.  But that doesn't seem to matter on the slurmd level, only on the slurmctld.

Could you please send me the slurmctld log from when the slurmd's were restarted?

Comment 6 Danny Auble 2017-07-13 12:54:37 MDT

Alan, I am also assuming slurm startclean was only done on the compute nodes in question and not on the slurmctld node or any of the other compute nodes.

Comment 7 Danny Auble 2017-07-17 10:45:23 MDT

Any more information?

Comment 8 Danny Auble 2017-07-21 15:37:37 MDT

Ping?

Comment 9 Danny Auble 2017-07-24 11:11:38 MDT

Please reopen if any more is required on this.