| Summary: | salloc segfaulted in _init_task_layout | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Steve Ford <fordste5> |
| Component: | User Commands | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | | |
| Version: | 17.11.7 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=5457 | | |
| Site: | MSU | | |
| Attachments: | Slurm config file, Slurmctld log, "thread apply all bt full" | | |
Created attachment 7395 [details]
Slurmctld log
Hi Steve,

I am seeing a lot of errors like this in your logs:

```
[2018-07-24T12:54:35.045] error: _find_node_record(751): lookup failure for lac-354.i
```

I understand that you changed the node's address or DNS entry, or removed the node completely. Have you *restarted* the slurmd and slurmctld daemons after that change? That must be done to regenerate the node bitmaps.

I suspect this because the crash seems to happen when accessing:
```c
for (i = 0; i < step_layout->node_cnt; i++) {
	...
	cpus[i] = ...
	...
}
```

and some bitmap could be wrong.
Also, I would need the output of gdb's "thread apply all bt full" if possible.

Are you able to reproduce it consistently?
Steve,

This is probably related to the corruption seen in bug 5457. Have you tried restarting all the daemons manually, without Puppet?

Created attachment 7424 [details]
thread apply all bt full
I'm not sure what the difference could be with lac-354. This node has not been removed, nor has its IP address/DNS entry been changed. I still see that message in the logs after restarting the slurmd daemons and the slurmctld daemon. Do I need to restart slurmdbd when slurm.conf changes?

I have not tried restarting the daemons outside of Puppet. I don't expect this to make a difference, but I manually restarted them now just in case.

We have not found a combination of parameters for salloc that consistently causes this issue.

(In reply to Steve Ford from comment #5)
> I'm not sure what the difference could be with lac-354. This node has not
> been removed, nor has its IP address/DNS entry been changed. I still see
> that message in the logs after restarting the slurmd daemons and the
> slurmctld daemon. Do I need to restart slurmdbd when slurm.conf changes?

No, you don't need to restart slurmdbd, but we certainly should fix this error.

Note that the error refers to "lac-354.i", not to lac-354. Is there any chance that this name appears in some script or crontab, or on any other server that is sending commands (squeue, sacct, whatever...) to slurmctld? Is it possible that there's still some slurm.conf (i.e. on the login nodes or the management server) with an old configuration?

*** From the host that you are issuing salloc on, is slurm.conf also up-to-date? ***

> I have not tried restarting the daemons outside of Puppet. I don't expect
> this to make a difference, but I manually restarted them now just in case.

Let me know if this makes any difference.

What I see is that salloc calls _init_task_layout -> _task_layout_hostfile -> find_node_record. It is possible that this call path generates the lookup failure error in the slurmctld log, and that it then crashes, producing the segfault.

Running salloc with "-vvvv" would also be useful.

Hi Steve,

Are there any news on this issue? Can you provide feedback on my last comment?
Thanks!,
Felip

(In reply to Felip Moll from comment #7)
> Hi Steve,
>
> Are there any news on this issue?
>
> Can you provide feedback on my last comment?

Felip,

We have not seen another salloc segfault since we applied the patch recommended in https://bugs.schedmd.com/show_bug.cgi?id=5457.

I checked on lac-354. I expected the node to be on an older image in our Torque/Moab cluster. The node was actually on our newer image and was trying to check in to Slurm with the hostname 'lac-354.i' instead of 'lac-354'. I re-imaged that node back to our old cluster and the error disappeared.

Thanks,
Steve

Thanks Steve,

I would bet that the issue won't appear anymore after these two changes. Is it OK with you to close this issue now? You can reopen it and re-mark it as unresolved if you see it happening again.

(In reply to Felip Moll from comment #9)
> Is it OK with you to close this issue now? You can reopen it and re-mark it
> as unresolved if you see it happening again.

Felip,

This can be closed. I'll re-open it if I see this issue again.

Thanks,
Steve

Okay, closing the issue then!
Created attachment 7394 [details]
Slurm config file

We've seen the salloc command segfault a couple of times today. Salloc was invoked with 'salloc -N 1 --ntasks=8 -t 4:00:00 --mem-per-cpu=2048M -C intel14' when it crashed. Here is the backtrace:

```
Thread 2 (Thread 0x7ff78081c700 (LWP 39340)):
#0  0x00007ff77fb69a3d in poll () from /lib64/libc.so.6
#1  0x00007ff7803b6563 in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x7ff7780008d0) at /usr/include/bits/poll2.h:46
#2  _poll_internal (shutdown_time=<optimized out>, nfds=2, pfds=0x7ff7780008d0) at eio.c:364
#3  eio_handle_mainloop (eio=eio@entry=0xa599e0) at eio.c:328
#4  0x00007ff78029b51c in _msg_thr_internal (arg=0xa599e0) at allocate_msg.c:88
#5  0x00007ff77fe46e25 in start_thread () from /lib64/libpthread.so.0
#6  0x00007ff77fb7434d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7ff78081d740 (LWP 39339)):
#0  _init_task_layout (arbitrary_nodes=0x0, step_layout=0xa526d0, step_layout_req=0x7fff59451510) at slurm_step_layout.c:420
#1  slurm_step_layout_create (step_layout_req=step_layout_req@entry=0x7fff594515d0) at slurm_step_layout.c:116
#2  0x00007ff7802d3eef in env_array_for_job (dest=dest@entry=0x7fff594516a0, alloc=alloc@entry=0xa603a0, desc=desc@entry=0xa595b0, pack_offset=pack_offset@entry=-1) at env.c:1106
#3  0x0000000000405e6f in main (argc=<optimized out>, argv=<optimized out>) at salloc.c:525
```