Ticket 1342 - slurmctld segfaults in find_job_record
Summary: slurmctld segfaults in find_job_record
Status: RESOLVED DUPLICATE of ticket 1309
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 14.11.2
Hardware: Linux
Severity: 3 - Medium Impact
Assignee: David Bigagli
Reported: 2014-12-29 00:22 MST by John Hanks
Modified: 2014-12-29 01:06 MST

Site: KAUST


Description John Hanks 2014-12-29 00:22:31 MST
Over the last couple of weeks we've seen slurmctld segfault about a half dozen times. We enabled core dumps and caught the most recent occurrence; the logs leading up to the crash are:

Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: sched: update_job: setting dependency to (null) for job_id 2968452
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_update_job complete JobId=4294967294 uid=507001 usec=185
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: 4294967294: Invalid job id specified
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: 4294967294: Invalid job id specified
Dec 29 17:02:18 slurm slurmctld[4720]: requeue batch job 2968199
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: 4294967294: Invalid job id specified
Dec 29 17:02:18 slurm slurmctld[4720]: sched: update_job: setting dependency to (null) for job_id 2967892
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_update_job complete JobId=4294967294 uid=507001 usec=292
Dec 29 17:02:18 slurm slurmctld[4720]: job_complete: JobID=2968614 State=0x8000 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: sched: update_job: setting dependency to (null) for job_id 2968528
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_update_job complete JobId=4294967294 uid=507001 usec=157
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: job_complete: JobID=2967660 State=0x8000 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm kernel: slurmctld[4720]: segfault at 415 ip 00000000004390e0 sp 00007fff00d7a398 error 4 in slurmctld[400000+1c8000]
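
(Aside: the JobId=4294967294 in these messages is 0xFFFFFFFE, which matches NO_VAL, the "no value" sentinel defined in slurm.h, so these look like requeue/update RPCs logged without a resolved job id. A minimal sketch of why the log prints that number:)

#include <stdint.h>
#include <stdio.h>

#define NO_VAL (0xfffffffe)  /* "no value" sentinel, as defined in slurm.h */

int main(void)
{
    uint32_t job_id = NO_VAL;          /* job id that was never resolved */
    printf("JobId=%u\n", job_id);      /* prints JobId=4294967294 */
    return 0;
}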


Loading the core into gdb produces:

Program terminated with signal 11, Segmentation fault.
#0  0x00000000004390e0 in find_job_record (job_id=2966982) at job_mgr.c:2627
2627			if (job_ptr->job_id == job_id)

The backtrace from the core dump is:

(gdb) bt
#0  0x00000000004390e0 in find_job_record (job_id=2966982) at job_mgr.c:2627
#1  0x0000000000458509 in test_job_dependency (job_ptr=0x2ba4fc0012d0) at job_scheduler.c:1907
#2  0x000000000043e381 in purge_old_job () at job_mgr.c:8130
#3  0x00000000004328ff in _slurmctld_background (no_data=<value optimized out>) at controller.c:1610
#4  0x000000000043507f in main (argc=<value optimized out>, argv=<value optimized out>) at controller.c:561
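
For context, find_job_record() walks the global job list and dereferences every record to compare ids, so a list entry that was freed but never unlinked (for example, purged while test_job_dependency() was still resolving dependencies) faults exactly at the comparison in frame #0. A minimal sketch of that pattern, using simplified stand-in types rather than Slurm's real list API:

#include <stddef.h>
#include <stdint.h>

/* Stand-ins for illustration; Slurm's actual job_record and List differ. */
struct job_record {
    uint32_t job_id;
    struct job_record *next;
};

static struct job_record *job_list;  /* head of the sketch's job list */

struct job_record *find_job_record(uint32_t job_id)
{
    struct job_record *job_ptr;

    for (job_ptr = job_list; job_ptr; job_ptr = job_ptr->next) {
        /* If job_ptr points at freed memory, this read faults,
         * matching "if (job_ptr->job_id == job_id)" at job_mgr.c:2627. */
        if (job_ptr->job_id == job_id)
            return job_ptr;
    }
    return NULL;
}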

A lot of small jobs were churning when this segfault occurred, and mysqld was using a lot of CPU; I don't have comparable details for the earlier segfaults.

Please let me know what I can do or upload to troubleshoot this further.

Thanks,

jbh
Comment 1 Brian Christiansen 2014-12-29 00:51:08 MST
This is most likely the same bug as in ticket 1309. The problem is fixed by this commit:
https://github.com/SchedMD/slurm/commit/f293ce7ccdef10bfd4a0d0b92d40f59a81b3b13b
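
(The linked commit's details aren't reproduced here, but the general defensive pattern for this class of bug is to stop caching raw job pointers across a purge and instead re-resolve the job by id on every use. An illustrative sketch only, not the commit itself, reusing the stand-in types from the sketch above and the hypothetical name depend_spec:)

#include <stdbool.h>
#include <stdint.h>

/* Illustrative only -- not the linked commit. Cache the dependency's
 * job id rather than a struct job_record *, so a record freed by
 * purge_old_job() can never be dereferenced after the fact. */
struct depend_spec {
    uint32_t job_id;  /* id of the job this one depends on */
};

static bool dependency_still_pending(const struct depend_spec *dep)
{
    /* Re-resolve by id on every test; a purged (freed) record simply
     * fails the lookup instead of being used after free. */
    return find_job_record(dep->job_id) != NULL;
}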
Comment 2 John Hanks 2014-12-29 01:06:49 MST
We've applied this patch and will follow up on 1309 if we have any more related segfaults. 

Thanks,

jbh 

(In reply to Brian Christiansen from comment #1)
> This is most likely the same bug found in bug 1309. The problem is fixed
> with this commit:
> https://github.com/SchedMD/slurm/commit/f293ce7ccdef10bfd4a0d0b92d40f59a81b3b13b

*** This ticket has been marked as a duplicate of ticket 1309 ***