Ticket 1342 - slurmctld segfaults in find_job_record
Summary: slurmctld segfaults in find_job_record
Status: RESOLVED DUPLICATE of ticket 1309
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.11.2
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: David Bigagli
Reported: 2014-12-29 00:22 MST by John Hanks
Modified: 2014-12-29 01:06 MST

Site: KAUST


Description John Hanks 2014-12-29 00:22:31 MST
Over the last couple of weeks we've seen slurmctld segfault about a half dozen times. We enabled core dumps and caught the most recent occurrence; the logs leading up to the crash are:

Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: sched: update_job: setting dependency to (null) for job_id 2968452
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_update_job complete JobId=4294967294 uid=507001 usec=185
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: 4294967294: Invalid job id specified
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: 4294967294: Invalid job id specified
Dec 29 17:02:18 slurm slurmctld[4720]: requeue batch job 2968199
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: 4294967294: Invalid job id specified
Dec 29 17:02:18 slurm slurmctld[4720]: sched: update_job: setting dependency to (null) for job_id 2967892
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_update_job complete JobId=4294967294 uid=507001 usec=292
Dec 29 17:02:18 slurm slurmctld[4720]: job_complete: JobID=2968614 State=0x8000 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: sched: update_job: setting dependency to (null) for job_id 2968528
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_update_job complete JobId=4294967294 uid=507001 usec=157
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: job_complete: JobID=2967660 State=0x8000 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm kernel: slurmctld[4720]: segfault at 415 ip 00000000004390e0 sp 00007fff00d7a398 error 4 in slurmctld[400000+1c8000]
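
(Aside: the JobId=4294967294 in these messages is 0xFFFFFFFE, which matches NO_VAL, the "no value" sentinel defined in slurm.h, so these look like requeue/update RPCs logged without a resolved job id. A minimal sketch of why the log prints that number:)

#include <stdint.h>
#include <stdio.h>

#define NO_VAL (0xfffffffe)  /* "no value" sentinel, as defined in slurm.h */

int main(void)
{
    uint32_t job_id = NO_VAL;          /* job id that was never resolved */
    printf("JobId=%u\n", job_id);      /* prints JobId=4294967294 */
    return 0;
}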


Loading the core into gdb produces:

Program terminated with signal 11, Segmentation fault.
#0  0x00000000004390e0 in find_job_record (job_id=2966982) at job_mgr.c:2627
2627			if (job_ptr->job_id == job_id)

The backtrace from the core dump is:

(gdb) bt
#0  0x00000000004390e0 in find_job_record (job_id=2966982) at job_mgr.c:2627
#1  0x0000000000458509 in test_job_dependency (job_ptr=0x2ba4fc0012d0) at job_scheduler.c:1907
#2  0x000000000043e381 in purge_old_job () at job_mgr.c:8130
#3  0x00000000004328ff in _slurmctld_background (no_data=<value optimized out>) at controller.c:1610
#4  0x000000000043507f in main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:561
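
For context, find_job_record() walks the global job list and dereferences every record to compare ids, so a list entry that was freed but never unlinked (for example, purged while test_job_dependency() was still resolving dependencies) faults exactly at the comparison in frame #0. A minimal sketch of that pattern, using simplified stand-in types rather than Slurm's real list API:

#include <stddef.h>
#include <stdint.h>

/* Stand-ins for illustration; Slurm's actual job_record and List differ. */
struct job_record {
    uint32_t job_id;
    struct job_record *next;
};

static struct job_record *job_list;  /* head of the sketch's job list */

struct job_record *find_job_record(uint32_t job_id)
{
    struct job_record *job_ptr;

    for (job_ptr = job_list; job_ptr; job_ptr = job_ptr->next) {
        /* If job_ptr points at freed memory, this read faults,
         * matching "if (job_ptr->job_id == job_id)" at job_mgr.c:2627. */
        if (job_ptr->job_id == job_id)
            return job_ptr;
    }
    return NULL;
}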

A lot of small jobs were churning when this segfault occurred, and mysqld was using a lot of CPU; I don't have comparable details for the earlier segfaults.

Please let me know what I can do or upload to troubleshoot this further.

Thanks,

jbh
Comment 1 Brian Christiansen 2014-12-29 00:51:08 MST
This is most likely the same bug as in ticket 1309. The problem is fixed by this commit:
https://github.com/SchedMD/slurm/commit/f293ce7ccdef10bfd4a0d0b92d40f59a81b3b13b
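
(The linked commit's details aren't reproduced here, but the general defensive pattern for this class of bug is to stop caching raw job pointers across a purge and instead re-resolve the job by id on every use. An illustrative sketch only, not the commit itself, reusing the stand-in types from the sketch above and the hypothetical name depend_spec:)

#include <stdbool.h>
#include <stdint.h>

/* Illustrative only -- not the linked commit. Cache the dependency's
 * job id rather than a struct job_record *, so a record freed by
 * purge_old_job() can never be dereferenced after the fact. */
struct depend_spec {
    uint32_t job_id;  /* id of the job this one depends on */
};

static bool dependency_still_pending(const struct depend_spec *dep)
{
    /* Re-resolve by id on every test; a purged (freed) record simply
     * fails the lookup instead of being used after free. */
    return find_job_record(dep->job_id) != NULL;
}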
Comment 2 John Hanks 2014-12-29 01:06:49 MST
We've applied this patch and will follow up on 1309 if we have any more related segfaults. 

Thanks,

jbh 

(In reply to Brian Christiansen from comment #1)
> This is most likely the same bug found in bug 1309. The problem is fixed
> with this commit:
> https://github.com/SchedMD/slurm/commit/f293ce7ccdef10bfd4a0d0b92d40f59a81b3b13b

*** This ticket has been marked as a duplicate of ticket 1309 ***