Ticket 1342

Summary: slurmctld segfaults in find_job_record
Product: Slurm
Reporter: John Hanks <john.hanks>
Component: slurmctld
Assignee: David Bigagli <david>
Status: RESOLVED DUPLICATE
QA Contact:
Severity: 3 - Medium Impact
Priority: ---
CC: brian, da
Version: 14.11.2
Hardware: Linux
OS: Linux
Site: KAUST

Description John Hanks 2014-12-29 00:22:31 MST
Over the last couple of weeks we've seen slurmctld segfault about half a dozen times. We enabled core dumps and caught the most recent occurrence. The logs leading up to the crash are:

Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: sched: update_job: setting dependency to (null) for job_id 2968452
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_update_job complete JobId=4294967294 uid=507001 usec=185
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: 4294967294: Invalid job id specified
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: 4294967294: Invalid job id specified
Dec 29 17:02:18 slurm slurmctld[4720]: requeue batch job 2968199
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: 4294967294: Invalid job id specified
Dec 29 17:02:18 slurm slurmctld[4720]: sched: update_job: setting dependency to (null) for job_id 2967892
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_update_job complete JobId=4294967294 uid=507001 usec=292
Dec 29 17:02:18 slurm slurmctld[4720]: job_complete: JobID=2968614 State=0x8000 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: sched: update_job: setting dependency to (null) for job_id 2968528
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_update_job complete JobId=4294967294 uid=507001 usec=157
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: job_complete: JobID=2967660 State=0x8000 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm kernel: slurmctld[4720]: segfault at 415 ip 00000000004390e0 sp 00007fff00d7a398 error 4 in slurmctld[400000+1c8000]


Loading the core into gdb produces:

Program terminated with signal 11, Segmentation fault.
#0  0x00000000004390e0 in find_job_record (job_id=2966982) at job_mgr.c:2627
2627			if (job_ptr->job_id == job_id)

Backtrace from core dump is:

(gdb) bt
#0  0x00000000004390e0 in find_job_record (job_id=2966982) at job_mgr.c:2627
#1  0x0000000000458509 in test_job_dependency (job_ptr=0x2ba4fc0012d0) at job_scheduler.c:1907
#2  0x000000000043e381 in purge_old_job () at job_mgr.c:8130
#3  0x00000000004328ff in _slurmctld_background (no_data=<value optimized out>) at controller.c:1610
#4  0x000000000043507f in main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:561

A lot of small jobs were churning when this segfault occurred and mysqld was using a lot of CPU, but I don't have details about the earlier segfaults. 

Please let me know what I can do or upload to troubleshoot this further.

Thanks,

jbh
Comment 1 Brian Christiansen 2014-12-29 00:51:08 MST
This is most likely the same bug found in bug 1309. The problem is fixed with this commit:
https://github.com/SchedMD/slurm/commit/f293ce7ccdef10bfd4a0d0b92d40f59a81b3b13b
Comment 2 John Hanks 2014-12-29 01:06:49 MST
We've applied this patch and will follow up on 1309 if we have any more related segfaults. 

Thanks,

jbh 

(In reply to Brian Christiansen from comment #1)
> This is most likely the same bug found in bug 1309. The problem is fixed
> with this commit:
> https://github.com/SchedMD/slurm/commit/f293ce7ccdef10bfd4a0d0b92d40f59a81b3b13b

*** This ticket has been marked as a duplicate of ticket 1309 ***