Over the last couple of weeks we've seen slurmctld segfault about a half dozen times. We have since enabled core dumps and caught the most recent occurrence. The logs leading up to the crash are:

Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: sched: update_job: setting dependency to (null) for job_id 2968452
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_update_job complete JobId=4294967294 uid=507001 usec=185
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: 4294967294: Invalid job id specified
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: 4294967294: Invalid job id specified
Dec 29 17:02:18 slurm slurmctld[4720]: requeue batch job 2968199
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: 4294967294: Invalid job id specified
Dec 29 17:02:18 slurm slurmctld[4720]: sched: update_job: setting dependency to (null) for job_id 2967892
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_update_job complete JobId=4294967294 uid=507001 usec=292
Dec 29 17:02:18 slurm slurmctld[4720]: job_complete: JobID=2968614 State=0x8000 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: sched: update_job: setting dependency to (null) for job_id 2968528
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_update_job complete JobId=4294967294 uid=507001 usec=157
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm slurmctld[4720]: job_complete: JobID=2967660 State=0x8000 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Dec 29 17:02:18 slurm slurmctld[4720]: _slurm_rpc_requeue: Processing RPC: REQUEST_JOB_REQUEUE from uid=507001
Dec 29 17:02:18 slurm kernel: slurmctld[4720]: segfault at 415 ip 00000000004390e0 sp 00007fff00d7a398 error 4 in slurmctld[400000+1c8000]

Loading the core into gdb produces:

Program terminated with signal 11, Segmentation fault.
#0  0x00000000004390e0 in find_job_record (job_id=2966982) at job_mgr.c:2627
2627            if (job_ptr->job_id == job_id)

The backtrace from the core dump is:

(gdb) bt
#0  0x00000000004390e0 in find_job_record (job_id=2966982) at job_mgr.c:2627
#1  0x0000000000458509 in test_job_dependency (job_ptr=0x2ba4fc0012d0) at job_scheduler.c:1907
#2  0x000000000043e381 in purge_old_job () at job_mgr.c:8130
#3  0x00000000004328ff in _slurmctld_background (no_data=&lt;value optimized out&gt;) at controller.c:1610
#4  0x000000000043507f in main (argc=&lt;value optimized out&gt;, argv=&lt;value optimized out&gt;) at controller.c:561

A lot of small jobs were churning when this segfault occurred, and mysqld was using a lot of CPU, but I don't have details about the earlier segfaults. Please let me know what I can do or upload to troubleshoot this further.

Thanks,
jbh
This is most likely the same bug found in bug 1309. The problem is fixed with this commit: https://github.com/SchedMD/slurm/commit/f293ce7ccdef10bfd4a0d0b92d40f59a81b3b13b
We've applied this patch and will follow up on bug 1309 if we have any more related segfaults.

Thanks,
jbh

(In reply to Brian Christiansen from comment #1)
> This is most likely the same bug found in bug 1309. The problem is fixed
> with this commit:
> https://github.com/SchedMD/slurm/commit/f293ce7ccdef10bfd4a0d0b92d40f59a81b3b13b

*** This ticket has been marked as a duplicate of ticket 1309 ***