Created attachment 2598 [details]
slurmctld log
Created attachment 2599 [details]
slurmd log from dcalph055
Could you look at the parents of those "crtrs_mpi_dev_s" processes (follow the parents to the top if possible)? I'm guessing they are children of init (process 1) and not slurmstepd. If the slurmstepd process is gone, those processes are orphaned and completely removed from Slurm's management. Other than the SIGBUS, do you have a handle on what happened on that node?

Hi, Yes. Their parent is init.

-sh-4.1$ for P in {108101..108107}; do pstree -n -p $P; done
crtrs_mpi_dev_s(108102)
crtrs_mpi_dev_s(108103)
crtrs_mpi_dev_s(108104)
crtrs_mpi_dev_s(108105)
crtrs_mpi_dev_s(108106)
crtrs_mpi_dev_s(108107)
-sh-4.1$

dcalph055 went through a series of memory-pressure episodes and one OOM kill, but they were unrelated to any slurm* process or to the job in question. I will shortly attach /var/log/messages here.

Created attachment 2600 [details]
/var/log/messages from dcalph055
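As an aside, a minimal sketch for spotting such orphans on a node is to list non-root processes that have been reparented to init (PPID 1). The name filter below is just this job's binary, and the root exclusion is a heuristic (legitimate non-root daemons may still appear), so review the output before acting on it:

# Show candidate orphans: non-root processes whose parent is init.
ps -eo pid,ppid,user,comm --no-headers | awk '$2 == 1 && $3 != "root"'
# Or filter for a specific binary:
ps -eo pid,ppid,user,comm --no-headers | awk '$2 == 1 && $4 == "crtrs_mpi_dev_s"'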
The same thing also happened on another node, dcalph051, for job 1840 (though the customer killed the leftover processes afterwards). I will attach the slurmd log from dcalph051 here.

-sh-4.1$ sacct -j 1840
       JobID    JobName  Partition    Account  AllocCPUS      State               Start        NodeList ExitCode
------------ ---------- ---------- ---------- ---------- ---------- ------------------- --------------- --------
1840           icp_test etch_norm+ e108292_e+         16  NODE_FAIL         Dec 8 19:58       dcalph051      1:0
1840.0       mira_mpi_+            e108292_e+         16  CANCELLED         Dec 8 19:58       dcalph051     -2:0
-sh-4.1$

Created attachment 2601 [details]
slurmd log from dcalph051
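To check whether other nodes were affected the same way, one option is to sweep the accounting records for jobs that ended in NODE_FAIL; a minimal sketch (the start date is illustrative):

# List all jobs that ended in NODE_FAIL since Dec 1, with the nodes they ran on.
sacct -a -X -s NODE_FAIL -S 2015-12-01 -o JobID,User,NodeList,Start,End,State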
Do you see any core files from the slurmstepd process on either node (likely in /tmp)? The Slurm code is built to deal gracefully with OOM, but I'm wondering if there is some condition that is not handled. That could cause the process to die and orphan the remaining processes. If there's a core file, perhaps you could supply the backtrace so we can identify and fix the problem.

Something you might consider is using Linux cgroups to manage the memory allocated to each job and enforce that allocation. Multiple jobs can run on each node with your current configuration, and one job could adversely affect everything on the node.

I don't see any core file from slurmstepd. I suppose their crashes would show up in the slurmd log anyway, but I don't see any. So far I've seen only one process killed by the oom_killer, and it is not slurmd or slurmstepd.

[root@dcalph055 cm]# grep 'Out of memory: ' /var/log/messages
Jan 11 14:20:17 dcalph055 kernel: [4577577.233000] Out of memory: Kill process 79630 (crtrs_mpi.exe) score 65 or sacrifice child
[root@dcalph055 cm]#

So as long as Slurm is properly checking errors from memory allocators/mmap and friends, that should not cause any issue even under severe memory pressure.

(In reply to Sergey Meirovich from comment #9)
> So as long as Slurm is properly checking errors from memory allocators/mmap
> and friends, that should not cause any issue even under severe memory pressure.

Slurm is checking for OOM on malloc, and all of the other calls should handle OOM. The most likely scenario is that the slurmstepd received some error related to OOM, could not log it (due to the OOM), retries of the call kept failing, and it eventually quit. That's just a guess though.

Hi, I have tried several times to reproduce that behavior by deliberately pushing the system to the edge of OOM (mounting a huge tmpfs filesystem and filling it with zero-filled files). I was only able to trigger a series of oom_kills and could not replicate the possible scenario: "slurmstepd received some error related to OOM, could not log it (due to the OOM), retries of the call kept failing, and it eventually quit." In my experience it is generally much easier to trigger an oom_kill due to overcommit, but you are right: if slurmstepd frequently allocates in small chunks, that scenario is still possible in theory.

(In reply to Sergey Meirovich from comment #11)
> In my experience it is generally much easier to trigger an oom_kill due to
> overcommit, but you are right: if slurmstepd frequently allocates in small
> chunks, that scenario is still possible in theory.

In terms of what slurmstepd is doing after launching the application, typically it would only be collecting accounting information (every 30 seconds with your configuration) and processing I/O for the application. Unless the application is doing lots of unbuffered I/O (srun --unbuffered option), slurmstepd is going to be sleeping 99.999% of the time. I'll also add that I've never encountered this problem before.

I've reviewed the OOM handling in slurmstepd and the accounting plugin. There are definitely places where a lack of memory would cause slurmstepd to log the event (if possible) and exit. The only improvement I can think of is changing the accounting logic to stop collecting data rather than exit on an OOM condition. I have created a new bug (a feature request actually) to perform that development work, see: http://bugs.schedmd.com/show_bug.cgi?id=2341 I don't believe that we can do much more regarding this bug.
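For reference, a minimal sketch of the tmpfs-based memory-pressure test described above (the mount point and size are illustrative):

# Mount a tmpfs sized near total RAM and fill it to push the node toward OOM.
mkdir -p /mnt/oomtest
mount -t tmpfs -o size=90% tmpfs /mnt/oomtest
dd if=/dev/zero of=/mnt/oomtest/fill bs=1M   # runs until the tmpfs is full
# Afterwards, check what the oom_killer actually hit:
grep 'Out of memory:' /var/log/messages
umount /mnt/oomtest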
Hi Moe, We are not using the srun --unbuffered option. It looks like we have a couple of internal codes that tend to run very close to Linux's oom_kill threshold, so it is quite probable that we triggered the scenario you described. I understand that close to OOM it is more or less a no-win game, but the strategy you've described should still relax that somewhat, I believe. Thank you!

I've fixed the most likely cause, a malloc failure when collecting accounting data (see http://bugs.schedmd.com/show_bug.cgi?id=2341). I don't think we can do more with this bug unless we get more logs or a core file, so I'll close this. Please re-open as needed.

*** This ticket has been marked as a duplicate of ticket 2341 ***

Created attachment 2846 [details]
slurm.conf
That is unrelated to http://bugs.schedmd.com/show_bug.cgi?id=2341, hence reopening. The Slurm version has been updated. Created attachment 2847 [details]
log of stray processes job
Created attachment 2848 [details]
slurmd log (dcalph029)
Created attachment 2849 [details]
new slurmctld log
[root@dcalph029 log]# ps aux | grep vasp_parallel
root 37827 0.0 0.0 103304 916 pts/36 S+ 05:11 0:00 grep vasp_parallel
e153547 85244 61.7 1.3 2596404 1743444 ? R Feb29 8433:08 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85245 61.7 1.3 2596648 1735788 ? R Feb29 8431:25 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85246 61.3 1.2 2565652 1713716 ? R Feb29 8375:13 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85247 61.5 1.3 2596648 1740604 ? R Feb29 8405:18 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85248 61.4 1.2 2565604 1711860 ? R Feb29 8394:48 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85249 61.6 1.3 2598120 1752132 ? R Feb29 8429:13 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85250 61.6 1.3 2597384 1743984 ? R Feb29 8418:09 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85251 61.6 1.3 2565652 1719660 ? R Feb29 8428:14 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85252 61.3 1.3 2598172 1729628 ? R Feb29 8387:34 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85253 61.4 1.2 2565632 1715804 ? R Feb29 8399:05 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85254 62.0 1.3 2598100 1745596 ? R Feb29 8480:02 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85255 61.9 1.3 2565628 1717344 ? R Feb29 8466:30 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85256 61.0 1.3 2598152 1735468 ? R Feb29 8345:36 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85257 61.7 1.2 2565628 1711204 ? R Feb29 8438:27 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85258 61.7 1.2 2567156 1714116 ? R Feb29 8437:30 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85259 61.3 1.3 2599672 1751160 ? R Feb29 8381:47 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85260 60.8 1.3 2566560 1716928 ? R Feb29 8309:42 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85261 56.2 1.3 2599080 1730668 ? R Feb29 7689:18 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85262 55.8 1.2 2566560 1691976 ? R Feb29 7630:16 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85263 53.3 1.3 2598292 1722064 ? R Feb29 7285:43 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85264 56.3 1.3 2597664 1732276 ? R Feb29 7695:10 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85265 55.0 1.3 2598048 1724240 ? R Feb29 7522:33 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85266 54.6 1.3 2596332 1732020 ? R Feb29 7471:47 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85267 54.8 1.3 2596332 1722116 ? R Feb29 7493:08 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85268 55.9 1.2 2565336 1690628 ? R Feb29 7647:10 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85269 55.2 1.2 2565336 1684488 ? R Feb29 7541:32 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85270 53.6 1.2 2565336 1688656 ? R Feb29 7334:20 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85271 56.1 1.3 2597160 1719696 ? R Feb29 7674:30 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85272 56.2 1.2 2564092 1694036 ? R Feb29 7690:59 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85273 56.3 1.2 2564092 1697836 ? R Feb29 7692:08 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85274 55.5 1.2 2564692 1695264 ? R Feb29 7589:12 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85275 56.3 1.2 2564436 1698160 ? R Feb29 7697:21 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85276 55.0 1.2 2564880 1694860 ? R Feb29 7522:03 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85277 56.3 1.3 2597352 1728408 ? R Feb29 7704:13 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
e153547 85278 56.3 1.2 2565620 1696548 ? R Feb29 7692:07 /hpc_lsf/application/vasp/vasp-5.4.1/bin/vasp_parallel
[root@dcalph029 log]#

However, dcalph029 fully belongs to only:

[root@dcalph029 log]# ps aux | grep step
root 5263 0.0 0.0 282800 3592 ? Sl Mar09 0:09 slurmstepd: [14775]
root 37954 0.0 0.0 103308 916 pts/36 S+ 05:12 0:00 grep step
[root@dcalph029 log]#

[root@DCALPH000 ~]# squeue -j 14775
JOBID USER ST PARTITION NAME COMMAND SUBMIT_TIME CPUS NODES NODELIST(REASON)
14775 e104725 R normal cu_001 /dat/usr/eMar 9 20:13 36 1 dcalph029
[root@DCALPH000 ~]#

This looks like the same failure. I believe the slurmstepd died with an OOM and orphaned its child processes. There seem to be two orphaned jobs on dcalph029. These slurmd log entries identify them:

[2016-03-02T03:17:10.979] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job11805/slurm_script
[2016-03-02T03:17:10.979] _handle_stray_script: Purging vestigial job script /cm/local/apps/slurm/var/spool/job11359/slurm_script

You might look for messages of that type on other nodes to identify orphan processes. You might be able to remove some more vestigial processes and free some resources. We are still investigating options. Does your Prolog or Epilog script (e.g. Epilog=/cm/local/apps/cmd/scripts/epilog) make any effort to identify and kill orphan processes? That might help (a sketch follows below).

There is no evidence that slurmstepd died from an OOM this time; there was no OOM on that host in the last 90+ days. We will look into writing a daemon that monitors for the presence of orphan processes. But it still looks to me like something is going wrong.

(In reply to Sergey Meirovich from comment #24)
> We will look into writing a daemon that monitors for the presence of orphan
> processes. But it still looks to me like something is going wrong.

I concur, but have very little to go on. There are no other reports of this happening, so it has to be some very unusual situation. Are there any core files? Look in SlurmdSpoolDir=/cm/local/apps/slurm/var/spool

No cores in /cm/local/apps/slurm/var/spool. I think it was a vasp process that got the SIGBUS, as I see only 35 stray vasp processes and there used to be 36 initially.

I am puzzled by a lot of:

srun: got SIGCONT
srun: got SIGCONT
srun: got SIGCONT
srun: got SIGCONT
srun: got SIGCONT
srun: got SIGCONT
srun: got SIGCONT
srun: got SIGCONT
srun: got SIGCONT
srun: got SIGCONT

present in the job log file: https://bugs.schedmd.com/attachment.cgi?id=2847

(In reply to Sergey Meirovich from comment #28)
> I am puzzled by a lot of:
>
> srun: got SIGCONT
> srun: got SIGCONT
> srun: got SIGCONT
> srun: got SIGCONT
> srun: got SIGCONT
> srun: got SIGCONT
> srun: got SIGCONT
> srun: got SIGCONT
> srun: got SIGCONT
> srun: got SIGCONT
>
> present in the job log file: https://bugs.schedmd.com/attachment.cgi?id=2847

That's triggered by the gang scheduling logic (from your slurm.conf: "PreemptMode=suspend,gang"). With only one job running at a time on that node, it's just making sure that what should be running is in fact running.

Do you have any idea how slurmstepd could be debugged or traced ad hoc?
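Following up on the Epilog suggestion above, a minimal hedged sketch of orphan cleanup in an epilog script (the policy is an assumption, not the site's actual epilog; SLURM_JOB_UID is set by slurmd in the epilog environment):

#!/bin/sh
# Hypothetical Epilog fragment: after the job ends, TERM any processes owned
# by the job's user that have been reparented to init (PPID 1).
# Caution: if the same user can run several jobs on one node, this could
# also kill processes belonging to the user's other jobs.
for pid in $(ps -eo pid,ppid,uid --no-headers | \
             awk -v u="$SLURM_JOB_UID" '$2 == 1 && $3 == u {print $1}'); do
    kill -TERM "$pid"
done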
I do not see any possibility of oom_killer influence (there was no OOM kill event), hence I conclude that slurmstepd exited on its own. Is there a possibility to print a backtrace on each exit?

We found that the slurmstepds won't generate cores on some systems without some extra configuration, since they call setuid. Will you set these options on the compute nodes and see if you get any core files? Set /proc/sys/fs/suid_dumpable to 2. This can be set permanently in sysctl.conf with:

fs.suid_dumpable = 2

or temporarily with:

sysctl fs.suid_dumpable=2

On CentOS 6, I also had to set "ProcessUnpackaged = yes" in /etc/abrt/abrt-action-save-package-data.conf. Once these were set and I aborted the stepd, I saw these messages in /var/log/messages:

Oct 15 11:31:20 knc abrt[21489]: Saved core dump of pid 21477 (/localhome/brian/slurm/14.11/knc/sbin/slurmstepd) to /var/spool/abrt/ccpp-2015-10-15-11:31:20-21477 (6639616 bytes)
Oct 15 11:31:20 knc abrtd: Directory 'ccpp-2015-10-15-11:31:20-21477' creation detected

There is a coredump file inside the directory. On a 3.6 kernel (Ubuntu), fs.suid_dumpable requires a fully qualified path in the core_pattern, e.g. sysctl kernel.core_pattern=/tmp/core.%e.%p

[root@DCALPH000 ~]# pdsh -g computenode sysctl fs.suid_dumpable=2
dcalph003: fs.suid_dumpable = 2
dcalph005: fs.suid_dumpable = 2
dcalph006: fs.suid_dumpable = 2
dcalph011: fs.suid_dumpable = 2
dcalph004: fs.suid_dumpable = 2
dcalph010: fs.suid_dumpable = 2
dcalph014: fs.suid_dumpable = 2
dcalph017: fs.suid_dumpable = 2
dcalph012: fs.suid_dumpable = 2
dcalph020: fs.suid_dumpable = 2
dcalph019: fs.suid_dumpable = 2
dcalph016: fs.suid_dumpable = 2
dcalph021: fs.suid_dumpable = 2
dcalph013: fs.suid_dumpable = 2
dcalph030: fs.suid_dumpable = 2
dcalph027: fs.suid_dumpable = 2
dcalph023: fs.suid_dumpable = 2
dcalph026: fs.suid_dumpable = 2
dcalph001: fs.suid_dumpable = 2
dcalph024: fs.suid_dumpable = 2
dcalph032: fs.suid_dumpable = 2
dcalph034: fs.suid_dumpable = 2
dcalph038: fs.suid_dumpable = 2
dcalph035: fs.suid_dumpable = 2
dcalph041: fs.suid_dumpable = 2
dcalph037: fs.suid_dumpable = 2
dcalph044: fs.suid_dumpable = 2
dcalph042: fs.suid_dumpable = 2
dcalph040: fs.suid_dumpable = 2
dcalph048: fs.suid_dumpable = 2
dcalph043: fs.suid_dumpable = 2
dcalph039: fs.suid_dumpable = 2
dcalph047: fs.suid_dumpable = 2
dcalph050: fs.suid_dumpable = 2
dcalph052: fs.suid_dumpable = 2
dcalph049: fs.suid_dumpable = 2
dcalph007: fs.suid_dumpable = 2
dcalph053: fs.suid_dumpable = 2
dcalph054: fs.suid_dumpable = 2
dcalph061: fs.suid_dumpable = 2
dcalph056: fs.suid_dumpable = 2
dcalph058: fs.suid_dumpable = 2
dcalph059: fs.suid_dumpable = 2
dcalph057: fs.suid_dumpable = 2
dcalph062: fs.suid_dumpable = 2
dcalph055: fs.suid_dumpable = 2
dcalph060: fs.suid_dumpable = 2
dcalph033: fs.suid_dumpable = 2
dcalph031: fs.suid_dumpable = 2
dcalph029: fs.suid_dumpable = 2
dcalph069: fs.suid_dumpable = 2
dcalph070: fs.suid_dumpable = 2
dcalph008: fs.suid_dumpable = 2
dcalph002: fs.suid_dumpable = 2
dcalph075: fs.suid_dumpable = 2
dcalph015: fs.suid_dumpable = 2
dcalph009: fs.suid_dumpable = 2
dcalph080: fs.suid_dumpable = 2
dcalph018: fs.suid_dumpable = 2
dcalph051: fs.suid_dumpable = 2
dcalph028: fs.suid_dumpable = 2
dcalph045: fs.suid_dumpable = 2
dcalph022: fs.suid_dumpable = 2
dcalph025: fs.suid_dumpable = 2
dcalph046: fs.suid_dumpable = 2
dcalph081: fs.suid_dumpable = 2
dcalph082: fs.suid_dumpable = 2
dcalph085: fs.suid_dumpable = 2
dcalph066: fs.suid_dumpable = 2
dcalph036: fs.suid_dumpable = 2
dcalph063: fs.suid_dumpable = 2
dcalph094: fs.suid_dumpable = 2
dcalph095: fs.suid_dumpable = 2
dcalph079: fs.suid_dumpable = 2
dcalph096: fs.suid_dumpable = 2
dcalph086: fs.suid_dumpable = 2
dcalph064: fs.suid_dumpable = 2
dcalph073: fs.suid_dumpable = 2
dcalph072: fs.suid_dumpable = 2
dcalph087: fs.suid_dumpable = 2
dcalph088: fs.suid_dumpable = 2
dcalph065: fs.suid_dumpable = 2
dcalph105: fs.suid_dumpable = 2
dcalph106: fs.suid_dumpable = 2
dcalph089: fs.suid_dumpable = 2
dcalph097: fs.suid_dumpable = 2
dcalph101: fs.suid_dumpable = 2
dcalph103: fs.suid_dumpable = 2
dcalph102: fs.suid_dumpable = 2
dcalph098: fs.suid_dumpable = 2
dcalph078: fs.suid_dumpable = 2
dcalph067: fs.suid_dumpable = 2
dcalph068: fs.suid_dumpable = 2
dcalph071: fs.suid_dumpable = 2
dcalph104: fs.suid_dumpable = 2
dcalph083: fs.suid_dumpable = 2
dcalph074: fs.suid_dumpable = 2
dcalph093: fs.suid_dumpable = 2
dcalph092: fs.suid_dumpable = 2
dcalph084: fs.suid_dumpable = 2
dcalph077: fs.suid_dumpable = 2
dcalph076: fs.suid_dumpable = 2
dcalph090: fs.suid_dumpable = 2
dcalph091: fs.suid_dumpable = 2
dcalph099: fs.suid_dumpable = 2
dcalph100: fs.suid_dumpable = 2
[root@DCALPH000 ~]#

We are packaging Slurm as an rpm. Anyway, I have also set ProcessUnpackaged = yes, as it is useful for handling application crashes.

[root@DCALPH000 ~]# pdsh -g computenode grep ProcessUnpackaged /etc/abrt/abrt-action-save-package-data.conf
dcalph003: ProcessUnpackaged = yes
dcalph006: ProcessUnpackaged = yes
dcalph005: ProcessUnpackaged = yes
dcalph010: ProcessUnpackaged = yes
dcalph011: ProcessUnpackaged = yes
dcalph014: ProcessUnpackaged = yes
dcalph017: ProcessUnpackaged = yes
dcalph020: ProcessUnpackaged = yes
dcalph019: ProcessUnpackaged = yes
dcalph013: ProcessUnpackaged = yes
dcalph004: ProcessUnpackaged = yes
dcalph016: ProcessUnpackaged = yes
dcalph021: ProcessUnpackaged = yes
dcalph012: ProcessUnpackaged = yes
dcalph030: ProcessUnpackaged = yes
dcalph023: ProcessUnpackaged = yes
dcalph027: ProcessUnpackaged = yes
dcalph001: ProcessUnpackaged = yes
dcalph026: ProcessUnpackaged = yes
dcalph024: ProcessUnpackaged = yes
dcalph032: ProcessUnpackaged = yes
dcalph007: ProcessUnpackaged = yes
dcalph038: ProcessUnpackaged = yes
dcalph029: ProcessUnpackaged = yes
dcalph041: ProcessUnpackaged = yes
dcalph042: ProcessUnpackaged = yes
dcalph037: ProcessUnpackaged = yes
dcalph035: ProcessUnpackaged = yes
dcalph034: ProcessUnpackaged = yes
dcalph044: ProcessUnpackaged = yes
dcalph040: ProcessUnpackaged = yes
dcalph048: ProcessUnpackaged = yes
dcalph039: ProcessUnpackaged = yes
dcalph043: ProcessUnpackaged = yes
dcalph047: ProcessUnpackaged = yes
dcalph050: ProcessUnpackaged = yes
dcalph049: ProcessUnpackaged = yes
dcalph028: ProcessUnpackaged = yes
dcalph046: ProcessUnpackaged = yes
dcalph052: ProcessUnpackaged = yes
dcalph054: ProcessUnpackaged = yes
dcalph053: ProcessUnpackaged = yes
dcalph061: ProcessUnpackaged = yes
dcalph056: ProcessUnpackaged = yes
dcalph062: ProcessUnpackaged = yes
dcalph057: ProcessUnpackaged = yes
dcalph058: ProcessUnpackaged = yes
dcalph055: ProcessUnpackaged = yes
dcalph059: ProcessUnpackaged = yes
dcalph060: ProcessUnpackaged = yes
dcalph008: ProcessUnpackaged = yes
dcalph045: ProcessUnpackaged = yes
dcalph069: ProcessUnpackaged = yes
dcalph018: ProcessUnpackaged = yes
dcalph033: ProcessUnpackaged = yes
dcalph063: ProcessUnpackaged = yes
dcalph022: ProcessUnpackaged = yes
dcalph009: ProcessUnpackaged = yes
dcalph002: ProcessUnpackaged = yes
dcalph015: ProcessUnpackaged = yes
dcalph075: ProcessUnpackaged = yes
dcalph031: ProcessUnpackaged = yes
dcalph051: ProcessUnpackaged = yes
dcalph068: ProcessUnpackaged = yes
dcalph083: ProcessUnpackaged = yes
dcalph084: ProcessUnpackaged = yes
dcalph066: ProcessUnpackaged = yes
dcalph085: ProcessUnpackaged = yes
dcalph025: ProcessUnpackaged = yes
dcalph064: ProcessUnpackaged = yes
dcalph092: ProcessUnpackaged = yes
dcalph036: ProcessUnpackaged = yes
dcalph093: ProcessUnpackaged = yes
dcalph094: ProcessUnpackaged = yes
dcalph095: ProcessUnpackaged = yes
dcalph096: ProcessUnpackaged = yes
dcalph071: ProcessUnpackaged = yes
dcalph072: ProcessUnpackaged = yes
dcalph073: ProcessUnpackaged = yes
dcalph076: ProcessUnpackaged = yes
dcalph105: ProcessUnpackaged = yes
dcalph106: ProcessUnpackaged = yes
dcalph079: ProcessUnpackaged = yes
dcalph065: ProcessUnpackaged = yes
dcalph078: ProcessUnpackaged = yes
dcalph101: ProcessUnpackaged = yes
dcalph102: ProcessUnpackaged = yes
dcalph067: ProcessUnpackaged = yes
dcalph077: ProcessUnpackaged = yes
dcalph103: ProcessUnpackaged = yes
dcalph104: ProcessUnpackaged = yes
dcalph080: ProcessUnpackaged = yes
dcalph070: ProcessUnpackaged = yes
dcalph074: ProcessUnpackaged = yes
dcalph091: ProcessUnpackaged = yes
dcalph090: ProcessUnpackaged = yes
dcalph081: ProcessUnpackaged = yes
dcalph086: ProcessUnpackaged = yes
dcalph082: ProcessUnpackaged = yes
dcalph099: ProcessUnpackaged = yes
dcalph100: ProcessUnpackaged = yes
dcalph087: ProcessUnpackaged = yes
dcalph089: ProcessUnpackaged = yes
dcalph088: ProcessUnpackaged = yes
dcalph097: ProcessUnpackaged = yes
dcalph098: ProcessUnpackaged = yes
[root@DCALPH000 ~]#

Do you have any update on this? Any core files? If so, a full backtrace would be helpful.

Do you have any update on this? Any core files? If so, a full backtrace would be helpful. Please reopen the ticket with updated information if problems persist.

Hi Moe, Sorry for the delay. No new core files so far. However, I found something strange that might prevent core file generation for slurmd.

[root@dcalph001 ~]# cat /proc/59877/limits | head -n 1
Limit                     Soft Limit           Hard Limit           Units
[root@dcalph001 ~]# cat /proc/59877/limits | grep core
Max core file size        0                    unlimited            bytes
[root@dcalph001 ~]#

Hence the core-size soft limit for slurmd is set to zero. You can also see the same in the slurmd log I attached earlier, e.g. from "slurmd log from dcalph055 (543.63 KB, text/plain) 2016-01-12 12:47 PST, Sergey Meirovich":

[2016-01-12T10:40:52.647] Warning: Core limit is only 0 KB

[root@dcalph001 ~]# sysctl fs.suid_dumpable
fs.suid_dumpable = 2
[root@dcalph001 ~]#

Do you have any idea what is causing the core-size soft limit to be zero? I don't see a single reason for it.

[root@dcalph001 ~]# ulimit -Hc
unlimited
[root@dcalph001 ~]# grep core /etc/security/limits.conf
# - core - limits the core file size (KB)
* soft core unlimited
* hard core unlimited
[root@dcalph001 ~]#

(In reply to Sergey Meirovich from comment #39)
> Do you have any idea what is causing the core-size soft limit to be zero?
> I don't see a single reason for it.
>
> [root@dcalph001 ~]# ulimit -Hc
> unlimited
> [root@dcalph001 ~]# grep core /etc/security/limits.conf
> # - core - limits the core file size (KB)
> * soft core unlimited
> * hard core unlimited

Do you have a /etc/sysconfig/slurm file setting the limit? If it isn't clearing the limit, perhaps you could use that file to set an unlimited limit.

Hi Moe, That is something else.
[root@dcalph001 ~]# cat /etc/sysconfig/slurm
cat: /etc/sysconfig/slurm: No such file or directory
[root@dcalph001 ~]# cat /etc/security/limits.d/*
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.
* soft nproc 1024
root soft nproc unlimited
#BCM: added for OpenIB gen2 stack.
* - memlock unlimited
#BCM: added for making sure stack size is not limited.
* - stack unlimited
#BCM: added for making sure nofile number is set to 65536.
* - nofile 65536
[root@dcalph001 ~]#

Setting /etc/sysconfig/slurm didn't help:

[root@dcalph001 ~]# cat /etc/sysconfig/slurm
ulimit -HSc unlimited
[root@dcalph001 ~]# /etc/init.d/slurm restart
stopping slurmd: [ OK ]
slurmd is stopped
starting slurmd: [ OK ]
[root@dcalph001 ~]# ps aux | grep slurmd
root 42440 0.0 0.0 135132 2212 ? S 06:56 0:00 /cm/shared/apps/slurm/current/sbin/slurmd
root 42444 0.0 0.0 103304 912 pts/0 S+ 06:56 0:00 grep slurmd
[root@dcalph001 ~]# cat /proc/42440/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             515715               515715               processes
Max open files            65536                65536                files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       515715               515715               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
[root@dcalph001 ~]#

(In reply to Sergey Meirovich from comment #42)
> [root@dcalph001 ~]# /etc/init.d/slurm restart
> stopping slurmd: [ OK ]
> slurmd is stopped
> starting slurmd: [ OK ]

If you start slurmd using the /etc/init.d/slurm script (rather than the system starting the daemon directly), it's your limits that get propagated. Try:

ulimit -c unlimited
/etc/init.d/slurm restart

It may not have been clear, but I have already tried that. Anyway:

[root@dcalph001 ~]# ulimit -c unlimited
[root@dcalph001 ~]# /etc/init.d/slurm restart
stopping slurmd: [ OK ]
slurmd is stopped
starting slurmd: [ OK ]
[root@dcalph001 ~]# ps aux | grep /slurmd
root 53822 0.0 0.0 135132 2152 ? S 09:44 0:00 /cm/shared/apps/slurm/current/sbin/slurmd
root 53826 0.0 0.0 103304 912 pts/0 S+ 09:44 0:00 grep /slurmd
[root@dcalph001 ~]# cat /proc/53822/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             515715               515715               processes
Max open files            65536                65536                files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       515715               515715               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
[root@dcalph001 ~]#

The /etc/init.d/slurm script isn't directly setting any limits, but there are a few other scripts that it includes which might do so. These files should just include some functions and not set any limits:

if [ -f /etc/rc.status ]; then
   . /etc/rc.status
else
   . /etc/rc.d/init.d/functions
fi

This file might contain limits:

if [ -f /etc/sysconfig/slurm ] ; then
   . /etc/sysconfig/slurm
fi

In a worst-case scenario, you could just directly execute slurmd on the node (it will automatically cancel any previously running version, so the "stop" operation isn't required):

/usr/sbin/slurmd &

Hi Moe, Found what it was. On RHEL6:

[root@dcalph103 etc]# grep -r ulimit | grep func
rc.d/init.d/functions: corelimit="ulimit -S -c ${DAEMON_COREFILE_LIMIT:-0}"
[root@dcalph103 etc]#

Hence, to work around it, I set DAEMON_COREFILE_LIMIT=unlimited:

[root@dcalph001 ~]# cat /etc/sysconfig/slurm
export DAEMON_COREFILE_LIMIT=unlimited
[root@dcalph001 ~]#

Now the limits work as expected.

[root@dcalph001 ~]# ps aux | grep slurmd
root 3789 0.0 0.0 135132 2160 ? S 02:57 0:00 /cm/shared/apps/slurm/current/sbin/slurmd
root 3952 0.0 0.0 103304 912 pts/0 S+ 02:59 0:00 grep slurmd
[root@dcalph001 ~]# cat /proc/3789/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             515715               515715               processes
Max open files            65536                65536                files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       515715               515715               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
[root@dcalph001 ~]#

I have updated our FAQ with your information. Thanks for sharing that. Did you get any core files yet? When you do, please send a backtrace.

Did you get any core files yet?

Hi Moe, No core files so far. The last stray processes happened on April 15, and I only enabled core dumping on April 21. I guess sooner or later we will see cores...

Hi Moe, I have tried to validate slurmstepd core dumping. I did the following:

-bash-4.1$ sbatch --wrap="sleep 10000"
Submitted batch job 58445
-bash-4.1$ sacct -j 58445
       JobID    JobName  Partition    Account  AllocCPUS      State               Start        NodeList ExitCode
------------ ---------- ---------- ---------- ---------- ---------- ------------------- --------------- --------
58445              wrap       open e154466_o+          1    RUNNING         May 4 8:25       dcalph105      0:0
-bash-4.1$

[root@dcalph105 abrt]# ps aux | grep 58445
root 47795 0.0 0.0 148412 3240 ? Sl 08:25 0:00 slurmstepd: [58445]
e154466 47799 0.0 0.0 106064 1296 ? S 08:25 0:00 /bin/sh /cm/local/apps/slurm/var/spool/job58445/slurm_script
root 47802 0.0 0.0 103304 912 pts/5 S+ 08:25 0:00 grep 58445
[root@dcalph105 abrt]# kill -SIGBUS 47795
[root@dcalph105 abrt]#

That didn't produce any core. I implemented everything from http://slurm.schedmd.com/faq.html#core_dump, but it looks like something is still wrong with core dumping in the case of slurmstepd.
In order to rule out a possible issue on the machine side, I tried the following:

[root@dcalph105 tmp]# vim suid.c
[root@dcalph105 tmp]# cat suid.c
#include <stdio.h>
#include <unistd.h>

int main () {
    int real = getuid();
    int euid = geteuid();
    printf("The REAL UID =: %d\n", real);
    printf("The EFFECTIVE UID =: %d\n", euid);
    sleep(1000);
}
[root@dcalph105 tmp]# cp suid /home/e154466
[root@dcalph105 tmp]# chmod +s /home/e154466/suid
[root@dcalph105 tmp]# gcc -o suid suid.c
[root@dcalph105 tmp]# gcc -o suid suid.c
[root@dcalph105 tmp]# cp suid /home/e154466/suid
[root@dcalph105 tmp]# chmod +s /home/e154466/suid
[root@dcalph105 tmp]# ps aux | grep sudi
root 50090 0.0 0.0 103304 912 pts/4 S+ 09:01 0:00 grep sudi
[root@dcalph105 tmp]# ps aux | grep suid
root 50058 0.0 0.0 3920 432 pts/5 S+ 09:01 0:00 /home/e154466/suid
root 50092 0.0 0.0 103304 912 pts/4 S+ 09:01 0:00 grep suid
[root@dcalph105 tmp]# kill -SIGBUS 50058
[root@dcalph105 tmp]#

-bash-4.1$ /home/e154466/suid
The REAL UID =: 19383
The EFFECTIVE UID =: 0
-bash-4.1$ /home/e154466/suid
The REAL UID =: 19383
The EFFECTIVE UID =: 0
-bash-4.1$ /home/e154466/suid
The REAL UID =: 19383
The EFFECTIVE UID =: 0
Bus error (core dumped)
-bash-4.1$ ls -al core.*
-rw------- 1 e154466 boks_users 131072 May 3 13:15 core.119143
-bash-4.1$

So core dumping works just fine for suid programs, but not for slurmstepd.

The failure to dump is a problem that we see on a subset of Linux distros, including RHEL6. The fix is to issue the call that makes slurmstepd dumpable a second time, after all of the effective-user-ID calls complete, as shown in the commit below:

https://github.com/SchedMD/slurm/commit/e293734507e01509a659780f86c0beaa6c0174a5

After you apply that patch, we'll need to wait again for a core file.

Hi Moe, I'll apply that once I am back in the office on Sunday.

A bug was just identified and fixed which can result in slurmstepd failing with an invalid memory reference. This might be what you are experiencing. The patch with that fix is here: https://github.com/SchedMD/slurm/commit/db0fe22e8aae7869d106d099dd8fc7307757c5bf

I have upgraded Slurm to v15.08.11 plus:
a) https://github.com/SchedMD/slurm/commit/e293734507e01509a659780f86c0beaa6c0174a5
b) https://github.com/SchedMD/slurm/commit/db0fe22e8aae7869d106d099dd8fc7307757c5bf
Let's see...

Three different backtraces collected (from 4 cores) so far:

Core was generated by `slurmstepd: [46541.0] '.
Program terminated with signal 7, Bus error.
#0  0x00000000004e7fd5 in slurm_init_msg_engine (addr=0x4aaca5) at slurm_protocol_socket_implementation.c:431
431     {
Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00000000004e7fd5 in slurm_init_msg_engine (addr=0x4aaca5) at slurm_protocol_socket_implementation.c:431
#1  0x00000000004ad868 in slurm_send_recv_node_msg (req=0x4ad868, resp=0x7fffffffe8e0, timeout=32767) at slurm_protocol_api.c:4129
#2  0x000000000042c27a in _send_srun_resp_msg (resp_msg=0x7fffffffe980, nnodes=2) at mgr.c:265
#3  0x000000000042d1c4 in _send_exit_msg (job=0x80df00, tid=0x7fffffffea10, n=1, status=9) at mgr.c:710
#4  0x000000000042f644 in _send_pending_exit_msgs (job=0x80df00) at mgr.c:1875
#5  0x000000000042fcfb in _wait_for_all_tasks (job=0x80df00) at mgr.c:2045
#6  0x000000000042e25d in job_manager (job=0x80df00) at mgr.c:1244
#7  0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146
(gdb)

Core was generated by `slurmstepd: [64179.0] '.
Program terminated with signal 7, Bus error.
#0  0x000000000047cc38 in bit_copybits (dest=0x47d118, src=0x572fb848) at bitstring.c:692
692     _assert_bitstr_valid(src);
Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x000000000047cc38 in bit_copybits (dest=0x47d118, src=0x572fb848) at bitstring.c:692
#1  0x000000000042d312 in _wait_for_children_slurmstepd (job=0x80df30) at mgr.c:748
#2  0x000000000042e4ac in job_manager (job=0x80df30) at mgr.c:1336
#3  0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146
(gdb)

Core was generated by `slurmstepd: [46541] '.
Program terminated with signal 7, Bus error.
#0  0x000000000043cffe in _handle_signal_task_local (fd=10922, job=0x2aaab80ffdc0, uid=0) at req.c:725
725     safe_write(fd, &rc, sizeof(int));
Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x000000000043cffe in _handle_signal_task_local (fd=10922, job=0x2aaab80ffdc0, uid=0) at req.c:725
#1  0x000000000043be1c in _handle_request (fd=9, job=0x80ded0, uid=0, gid=0) at req.c:493
#2  0x000000000043baa7 in _handle_accept (arg=0x0) at req.c:446
#3  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#4  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6
(gdb)

Do you need me to attach cores?

The most interesting backtrace is this one:

Core was generated by `slurmstepd: [46541] '.
Program terminated with signal 7, Bus error.
#0  0x000000000043cffe in _handle_signal_task_local (fd=10922, job=0x2aaab80ffdc0, uid=0) at req.c:725
725     safe_write(fd, &rc, sizeof(int));
(gdb) bt
#0  0x000000000043cffe in _handle_signal_task_local (fd=10922, job=0x2aaab80ffdc0, uid=0) at req.c:725
#1  0x000000000043be1c in _handle_request (fd=9, job=0x80ded0, uid=0, gid=0) at req.c:493
#2  0x000000000043baa7 in _handle_accept (arg=0x0) at req.c:446
#3  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#4  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

The really interesting part is that _handle_request() is passed a file descriptor value of 9. It then uses that same variable as an argument to _handle_signal_task_local(), but the value there is 10922. Same variable name, so it should have the same contents (there is no manipulation of the variable in any of these functions), but the contents have changed. My best guess for the cause is that some thread is going through its entire 1 MB stack and corrupting data in the stack of another thread. Could you run "thread apply all bt" with gdb on that same core dump? We can easily increase the per-thread stack size, but I'd like to know what might be consuming that much of the stack and fix the underlying problem. If necessary, increasing the stack size can be done with the one-line patch below:

diff --git a/src/common/macros.h b/src/common/macros.h
index 0b83726..4ddee58 100644
--- a/src/common/macros.h
+++ b/src/common/macros.h
@@ -227,7 +227,7 @@
 	/* we want 1:1 threads if there is a choice */ \
 	if (pthread_attr_setscope(attr, PTHREAD_SCOPE_SYSTEM)) \
 		error("pthread_attr_setscope: %m"); \
-	if (pthread_attr_setstacksize(attr, 1024*1024)) \
+	if (pthread_attr_setstacksize(attr, 10*1024*1024)) \
 		error("pthread_attr_setstacksize: %m"); \
 } _STMT_END
 # else

I'd rather have the root cause. That works the best in the long term.

(gdb) thread apply all btthread apply all bt
Thread 6 (Thread 0x2aaaae358700 (LWP 109224)):
Undefined command: "btthread".  Try "help".
(gdb) thread apply all bt

Thread 6 (Thread 0x2aaaae358700 (LWP 109224)):
#0  0x00002aaaab9f1a7d in nanosleep () from /lib64/libc.so.6
#1  0x00002aaaaba26c24 in usleep () from /lib64/libc.so.6
#2  0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155
#3  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#4  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x2aaaaf085700 (LWP 109226)):
#0  0x00002aaaaba24113 in poll () from /lib64/libc.so.6
#1  0x000000000044d97a in _poll_internal (pfds=0x2aaab4002190, nfds=38, shutdown_time=0) at eio.c:344
#2  0x000000000044d7f0 in eio_handle_mainloop (eio=0x2aaab40008d0) at eio.c:304
#3  0x00002aaaaec71a9e in _agent (unused=0x0) at agent.c:327
#4  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#5  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x2aaab8323700 (LWP 109227)):
#0  0x00002aaaaba24113 in poll () from /lib64/libc.so.6
#1  0x000000000044d97a in _poll_internal (pfds=0x2aaabc0008d0, nfds=67, shutdown_time=0) at eio.c:344
#2  0x000000000044d7f0 in eio_handle_mainloop (eio=0x81a890) at eio.c:304
#3  0x0000000000437f2d in _io_thr (arg=0x80df00) at io.c:1504
#4  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#5  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x2aaaae459700 (LWP 109225)):
#0  0x00002aaaaba24113 in poll () from /lib64/libc.so.6
#1  0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344
#2  0x000000000044d7f0 in eio_handle_mainloop (eio=0x81f4d0) at eio.c:304
#3  0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229
#4  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#5  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x2aaaae257700 (LWP 109223)):
#0  0x00002aaaab73363c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005248d4 in _poll_data (profile=false) at slurm_jobacct_gather.c:178
#2  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2aaaac909220 (LWP 109222)):
#0  0x00000000004e7fd5 in slurm_init_msg_engine (addr=0x4aaca5) at slurm_protocol_socket_implementation.c:431
#1  0x00000000004ad868 in slurm_send_recv_node_msg (req=0x4ad868, resp=0x7fffffffe8e0, timeout=32767) at slurm_protocol_api.c:4129
#2  0x000000000042c27a in _send_srun_resp_msg (resp_msg=0x7fffffffe980, nnodes=2) at mgr.c:265
#3  0x000000000042d1c4 in _send_exit_msg (job=0x80df00, tid=0x7fffffffea10, n=1, status=9) at mgr.c:710
#4  0x000000000042f644 in _send_pending_exit_msgs (job=0x80df00) at mgr.c:1875
#5  0x000000000042fcfb in _wait_for_all_tasks (job=0x80df00) at mgr.c:2045
#6  0x000000000042e25d in job_manager (job=0x80df00) at mgr.c:1244
#7  0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146
(gdb)

(In reply to Sergey Meirovich from comment #57)
> I'd rather have the root cause. That works the best in the long term.

I agree. I don't see anything very interesting in that backtrace. Could you send full backtraces from the other core files? [In order for me to use your core files, I would also need your plugins and executables, so sending the backtraces will probably be easier for both of us.]
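To gather those as attachable files, a batch-mode gdb run works; a minimal sketch (the slurmstepd path under /cm/shared is an assumption based on the slurmd path shown earlier, and the core file name is illustrative):

# Write the full backtrace of every thread to a text file, non-interactively.
gdb --batch -ex 'thread apply all bt full' \
    /cm/shared/apps/slurm/current/sbin/slurmstepd /tmp/core.12345 > bt-12345.txt 2>&1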
(gdb) thread apply all bt

Thread 5 (Thread 0x2aaaae257700 (LWP 74356)):
#0  0x00002aaaab9f1a7d in nanosleep () from /lib64/libc.so.6
#1  0x00002aaaab9f18f0 in sleep () from /lib64/libc.so.6
#2  0x000000000042fd31 in _wait_for_all_tasks (job=0x82bf60) at mgr.c:2054
#3  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#4  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x2aaaaf085700 (LWP 56287)):
#0  0x00002aaaaba24113 in poll () from /lib64/libc.so.6
#1  0x000000000044d97a in _poll_internal (pfds=0x2aaab4000b30, nfds=2, shutdown_time=0) at eio.c:344
#2  0x000000000044d7f0 in eio_handle_mainloop (eio=0x2aaab40008d0) at eio.c:304
#3  0x00002aaaaec71a9e in _agent (unused=0x0) at agent.c:327
#4  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#5  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x2aaaae459700 (LWP 56286)):
#0  0x00002aaaaba24113 in poll () from /lib64/libc.so.6
#1  0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344
#2  0x000000000044d7f0 in eio_handle_mainloop (eio=0x81db50) at eio.c:304
#3  0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229
#4  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#5  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x2aaaae358700 (LWP 56285)):
#0  0x00002aaaab9f1a7d in nanosleep () from /lib64/libc.so.6
#1  0x00002aaaaba26c24 in usleep () from /lib64/libc.so.6
#2  0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155
#3  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#4  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2aaaac909220 (LWP 56283)):
#0  0x000000000047cc38 in bit_copybits (dest=0x47d118, src=0x572fb848) at bitstring.c:692
#1  0x000000000042d312 in _wait_for_children_slurmstepd (job=0x80df30) at mgr.c:748
#2  0x000000000042e4ac in job_manager (job=0x80df30) at mgr.c:1336
#3  0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146
(gdb)

(gdb) bt
#0  0x000000000043cffe in _handle_signal_task_local (fd=10922, job=0x2aaab80ffdc0, uid=0) at req.c:725
#1  0x000000000043be1c in _handle_request (fd=9, job=0x80ded0, uid=0, gid=0) at req.c:493
#2  0x000000000043baa7 in _handle_accept (arg=0x0) at req.c:446
#3  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#4  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 5 (Thread 0x2aaaac909220 (LWP 7325)):
#0  0x00002aaaab9f170a in wait4 () from /lib64/libc.so.6
#1  0x000000000042f7d0 in _wait_for_any_task (job=0x80ded0, waitflag=true) at mgr.c:1928
#2  0x000000000042fca6 in _wait_for_all_tasks (job=0x80ded0) at mgr.c:2036
#3  0x000000000042e25d in job_manager (job=0x80ded0) at mgr.c:1244
#4  0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146

Thread 4 (Thread 0x2aaaae459700 (LWP 7328)):
#0  0x00002aaaaba24113 in poll () from /lib64/libc.so.6
#1  0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344
#2  0x000000000044d7f0 in eio_handle_mainloop (eio=0x810900) at eio.c:304
#3  0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229
#4  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#5  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x2aaaae358700 (LWP 7327)):
#0  0x00002aaaab9f1a7d in nanosleep () from /lib64/libc.so.6
#1  0x00002aaaaba26c24 in usleep () from /lib64/libc.so.6
#2  0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155
#3  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#4  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x2aaaae257700 (LWP 7326)):
#0  0x00002aaaab73363c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005248d4 in _poll_data (profile=false) at slurm_jobacct_gather.c:178
#2  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2aaab8100700 (LWP 99646)):
#0  0x000000000043cffe in _handle_signal_task_local (fd=10922, job=0x2aaab80ffdc0, uid=0) at req.c:725
#1  0x000000000043be1c in _handle_request (fd=9, job=0x80ded0, uid=0, gid=0) at req.c:493
#2  0x000000000043baa7 in _handle_accept (arg=0x0) at req.c:446
#3  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#4  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6
(gdb)

I do not see any explanation for the failures yet, but I do have three findings. First, I did move one variable from the stack to the heap, which will help eliminate a possible problem, but that is very unlikely to be related (see this commit: https://github.com/SchedMD/slurm/commit/ba2fc67a6987387527da7bd9e242e50b51725e3d). All other variables I see are either a modest fixed size on the stack or are created in the heap. Second, I did notice that you have a job submit plugin named "amat". Could you confirm that it is being rebuilt and installed whenever you install a new version of Slurm, so that it tracks any data structure changes? Third, one backtrace looks like this:

Thread 1 (Thread 0x2aaaac909220 (LWP 56283)):
#0  0x000000000047cc38 in bit_copybits (dest=0x47d118, src=0x572fb848) at bitstring.c:692
#1  0x000000000042d312 in _wait_for_children_slurmstepd (job=0x80df30) at mgr.c:748
#2  0x000000000042e4ac in job_manager (job=0x80df30) at mgr.c:1336
#3  0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146

The unusual thing about that is that bit_copybits() is never called by _wait_for_children_slurmstepd(), so I am not sure how to interpret that backtrace. The other backtraces seem reasonable to me, but offer no explanation for the corruption of the stack.

Hi Moe, Yes. We always recompile the amat job_submit plugin for a new installation, and we did that this time as well. Here is one more slurmstepd crash.

Core was generated by `slurmstepd: [64131.0] '.
Program terminated with signal 7, Bus error.
#0  0x00000000004e7fd5 in slurm_init_msg_engine (addr=0x1c00468bf7) at slurm_protocol_socket_implementation.c:431
431     {
Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00000000004e7fd5 in slurm_init_msg_engine (addr=0x1c00468bf7) at slurm_protocol_socket_implementation.c:431
#1  0x00002aaaaec73c7e in temp_kvs_send () at kvs.c:222
#2  0x00002aaaaec74cdc in _handle_barrier_in (fd=74, lrank=32, req=0x2aaab4007dd0) at pmi1.c:187
#3  0x00002aaaaec75afb in _handle_pmi1_cmd_buf (fd=74, lrank=32, buf_len=15, buf=0x2aaab40151d0 "cmd=barrier_in") at pmi1.c:538
#4  0x00002aaaaec75f13 in handle_pmi1_cmd (fd=74, lrank=32) at pmi1.c:630
#5  0x00002aaaaec712cf in _handle_task_request (fd=74, lrank=32) at agent.c:107
#6  0x00002aaaaec7177b in _task_read (obj=0x2aaab4001db0, objs=0x81a330) at agent.c:241
#7  0x000000000044de20 in _poll_handle_event (revents=1, obj=0x2aaab4001db0, objList=0x81a330) at eio.c:470
#8  0x000000000044dbe5 in _poll_dispatch (pfds=0x2aaa0000000e, nfds=0, map=0x1ffffffff, objList=0x2aaab4001db0) at eio.c:429
#9  0x00002aaab40022e0 in ?? ()
#10 0x0000002500000138 in ?? ()
#11 0x00002aaab4002190 in ?? ()
#12 0x000000000081c6e0 in ?? ()
#13 0x00000021ffffffff in ?? ()
#14 0x00002aaaaf084e70 in ?? ()
#15 0x000000000044d843 in eio_handle_mainloop (eio=0x4502ae) at eio.c:318
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thread apply all bt

Thread 6 (Thread 0x2aaaac909220 (LWP 4534)):
#0  0x00002aaaab9f170a in wait4 () from /lib64/libc.so.6
#1  0x000000000042f7d0 in _wait_for_any_task (job=0x80df30, waitflag=true) at mgr.c:1928
#2  0x000000000042fca6 in _wait_for_all_tasks (job=0x80df30) at mgr.c:2036
#3  0x000000000042e25d in job_manager (job=0x80df30) at mgr.c:1244
#4  0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146

Thread 5 (Thread 0x2aaab8323700 (LWP 4539)):
#0  0x00002aaaaba24113 in poll () from /lib64/libc.so.6
#1  0x000000000044d97a in _poll_internal (pfds=0x2aaabc0008d0, nfds=74, shutdown_time=0) at eio.c:344
#2  0x000000000044d7f0 in eio_handle_mainloop (eio=0x81eee0) at eio.c:304
#3  0x0000000000437f2d in _io_thr (arg=0x80df30) at io.c:1504
#4  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#5  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x2aaaae257700 (LWP 4535)):
#0  0x00002aaaab73363c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000005248d4 in _poll_data (profile=false) at slurm_jobacct_gather.c:178
#2  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x2aaaae358700 (LWP 4536)):
#0  0x00002aaaab9f1a7d in nanosleep () from /lib64/libc.so.6
#1  0x00002aaaaba26c24 in usleep () from /lib64/libc.so.6
#2  0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155
#3  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#4  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x2aaaae459700 (LWP 4537)):
#0  0x00002aaaaba24113 in poll () from /lib64/libc.so.6
#1  0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344
#2  0x000000000044d7f0 in eio_handle_mainloop (eio=0x823b40) at eio.c:304
#3  0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229
#4  0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0
#5  0x00002aaaaba2d93d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2aaaaf085700 (LWP 4538)):
#0  0x00000000004e7fd5 in slurm_init_msg_engine (addr=0x1c00468bf7) at slurm_protocol_socket_implementation.c:431
#1  0x00002aaaaec73c7e in temp_kvs_send () at kvs.c:222
#2  0x00002aaaaec74cdc in _handle_barrier_in (fd=74, lrank=32, req=0x2aaab4007dd0) at pmi1.c:187
#3  0x00002aaaaec75afb in _handle_pmi1_cmd_buf (fd=74, lrank=32, buf_len=15, buf=0x2aaab40151d0 "cmd=barrier_in") at pmi1.c:538
#4  0x00002aaaaec75f13 in handle_pmi1_cmd (fd=74, lrank=32) at pmi1.c:630
#5  0x00002aaaaec712cf in _handle_task_request (fd=74, lrank=32) at agent.c:107
#6  0x00002aaaaec7177b in _task_read (obj=0x2aaab4001db0, objs=0x81a330) at agent.c:241
#7  0x000000000044de20 in _poll_handle_event (revents=1, obj=0x2aaab4001db0, objList=0x81a330) at eio.c:470
#8  0x000000000044dbe5 in _poll_dispatch (pfds=0x2aaa0000000e, nfds=0, map=0x1ffffffff, objList=0x2aaab4001db0) at eio.c:429
#9  0x00002aaab40022e0 in ?? ()
#10 0x0000002500000138 in ?? ()
#11 0x00002aaab4002190 in ?? ()
#12 0x000000000081c6e0 in ?? ()
#13 0x00000021ffffffff in ?? ()
#14 0x00002aaaaf084e70 in ?? ()
#15 0x000000000044d843 in eio_handle_mainloop (eio=0x4502ae) at eio.c:318
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

Moe, Do you want me to apply https://github.com/SchedMD/slurm/commit/ba2fc67a6987387527da7bd9e242e50b51725e3d ?

(In reply to Sergey Meirovich from comment #63)
> Moe,
>
> Do you want me to apply
> https://github.com/SchedMD/slurm/commit/ba2fc67a6987387527da7bd9e242e50b51725e3d ?

The chances of that patch making any difference are almost zero, but it was all that I could find. If somehow the number of tasks on the node was calculated to be something like -1 (no signs of that happening, and the variable is 32-bit unsigned, so that would be 0xffffffff), Slurm would almost certainly die somewhere else where it allocates an array based upon that variable. Do you have any more core files? If so, could you send full backtraces? That would probably be the most helpful thing.

By "full backtrace", do you mean "thread apply all bt" or "thread apply all bt full"?

(In reply to Sergey Meirovich from comment #65)
> By "full backtrace", do you mean "thread apply all bt" or
> "thread apply all bt full"?

How about: thread apply all bt full. You might attach output files rather than pasting them inline.

Hi Moe, I also have several slurmd cores.
The most simple is: Thread 1 (Thread 44533): #0 0x0000000000482278 in plugin_context_create (plugin_type=0x2aaaab3ca7ad <Address 0x2aaaab3ca7ad out of bounds>, uler_type=0x7fffffffea90 " \354\377\377\377\177", ptrs=0x1aa2, names=0x0, names_size=876706) at plugin.c:454 plugin_dir = 0x429e8c "\360\211\020\271\375\063Y" c = 0x837120 n_names = 0 #1 0x000000000042dbd7 in _restore_cred_state (ctx=0x2aaaab3ca7ad) at slurmd.c:1662 file_name = 0x43fc26 "c\322H\301\342\003H\001\320H\213" data = 0x837120 "\320S\203" data_size = 0 cred_fd = 0 data_allocated = 0 data_read = 1462730932 buffer = 0x429e8c #2 0x0000000000429888 in main (argc=1, argv=0x7fffffffed08) at slurmd.c:378 i = 65536 pidfd = 6 blocked_signals = {13, 0} cc = 65536 oom_value = 0x0 slurmd_uid = 0 curr_uid = 0 time_stamp = "Sun, 24 Apr 2016 02:57:52 -0700\000\001\000\000\000\071\000\000\000\210\301̪\252*\000\000Ȣ\200\217\071", '\000' <repeats 11 times>"\270, \246\200\217\071\000\000\000\240\032\177", '\000' <repeats 13 times>"\340, \304̪\252*\000\000\360\353\377\377\377\177\000\000\000\000\000\000\000\000\000\000\b\354\377\377\377\177\000\000\020\016`\217\001\000\000\000\001\000\000\000\000\000\000\000MoA", '\000' <repeats 13 times>, "\017\000\000\000\000\000\000\000`\030\177\000\000\000\000\000\365\370\253\252\252*\000\000\001\000\000\000\000\000\000\000?\000\000\000\000\000\000\000\377\377\377\377\377\377\377\177\300\060Y", '\000' <repeats 13 times>"\263, \201B\000\000\000\000\000\002\000\000\000\000\000\000\000\005\061Y\000\000\000\000\000\240\333\063\253\252*\000\000\300\060Y\000\000\000\000" lopts = {stderr_level = LOG_LEVEL_INFO, syslog_level = LOG_LEVEL_INFO, logfile_level = LOG_LEVEL_INFO, prefix_level = 1, buffered = 0} (gdb) Looks like the stack is corrupted. Shall I open separate case for slurmd crashes? Your slurmd core dump shows the daemons failing while in the process of starting. Opening a new ticket probably would not be beneficial. I expect that the failures of slurmd and slurmstepd have the same root cause. Would you happen to be building Slurm using optimization flags? The reason that I ask is 1. The line numbers in comment #67 don't match the code, although that could be due to stack corruption. 2. We've seen compilers optimize Slurm to the point of generating bad logic (bug 1550 is one example) Another thing to check on one of these nodes with a slurmd or slurmstepd core dump is that all of the Slurm files were made at the same time including libslurm and lib/slurm/*.so (the plugins). My concern is there is a vestigial plugin that has a different concept of the data structure contents and/or format. Slurm logic should check that and report "Incompatible Slurm plugin version", but it would be good to check that anyway. How frequently are the slurmd daemons dying? Running under valgrind will slow them down a lot, but that might be a way to trap the failure when it happens. We are not using any optimization. I.e. we use O0. Below is snippet form compile time: "libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../.. -I../../slurm -I../.. -DNUMA_VERSION1_COMPATIBILITY -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -pthread -fno-gcse -Wall -g -O0 -fno-strict-aliasing -MT assoc_mgr.lo -MD -MP -MF .deps/assoc_mgr.Tpo -c assoc_mgr.c -fPIC -DPIC -o .libs/assoc_mgr.o" There are two O* options that you could see. 
The latest one, -O0, wins (as the gcc docs say: "If you use multiple `-O' options, with or without level numbers, the last such option is the one that is effective."). Slurm is installed on an NFS share. The amat job_submit plugin is built as per http://slurm.schedmd.com/job_submit_plugins.html from exactly the same code base. The final compile strings for the amat plugin:
/bin/sh ../../../../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I../../../.. -I../../../../slurm -I../../../.. -I../../../../src/common -I /usr/include -DNUMA_VERSION1_COMPATIBILITY -g -O2 -pthread -fno-gcse -Wall -g -O0 -fno-strict-aliasing -MT job_submit_amat.lo -MD -MP -MF .deps/job_submit_amat.Tpo -c -o job_submit_amat.lo job_submit_amat.c
/bin/sh ../../../../libtool --tag=CC --mode=link gcc -DNUMA_VERSION1_COMPATIBILITY -g -O2 -pthread -fno-gcse -Wall -g -O0 -fno-strict-aliasing -module -avoid-version --export-dynamic -o job_submit_amat.la -rpath /usr/local/lib/slurm job_submit_amat.lo
All in all, I don't see how a miscompilation or a mix of differently built binaries could happen. But maybe I am missing something. I haven't seen any new crash of slurmd since yesterday, and it looks like the slurmd crashes occur at the very start of slurmd's startup. Not sure how to hook valgrind in here.
I've been running the slurmd and slurmstepd under valgrind using configuration files as close to yours as possible and have seen no problems other than a one-time memory leak in a library that Slurm uses, which is outside of our control and does not actually corrupt memory. I've tried repeatedly, both with and without active jobs on the compute node.
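One quick way to rule out the mixed-vintage plugin concern from my earlier comment is to compare build timestamps. This is only a sketch; the /usr/local prefix is inferred from your plugin's -rpath above, so adjust it to your actual NFS install path:

    # Hypothetical spot-check: list libslurm and every plugin by modification time.
    # Any .so noticeably older (or newer) than the rest would be suspect.
    ls -l --time-style=full-iso /usr/local/lib/libslurm*.so* /usr/local/lib/slurm/*.so | sort -k6,7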
> I haven't seen any new crash of slurmd since yesterday, and it looks like
> the slurmd crashes occur at the very start of slurmd's startup. Not sure
> how to hook valgrind in here.
You would need to start the daemon under valgrind's control. It will impact performance, so you might try this a few times on an idle node first, with a command line of this sort. You will see a bunch of memory leaks reported with the default build, as Slurm does not normally take the time to free memory when the daemon shuts down (we have a special build option for that, to make it easy to confirm that everything does get freed).
valgrind --tool=memcheck --leak-check=yes --num-callers=40 --leak-resolution=high /usr/sbin/slurmd -D >>valg.slurmd.out 2>&1
I'm running out of ideas to pursue though...
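If it ever becomes useful to get slurmstepd itself under valgrind, one generic approach is a shim in place of the binary. This is only a sketch and untested here; it assumes slurmstepd is installed at /usr/sbin/slurmstepd (slurmd execs that path for each new step, so a wrapper dropped there should be picked up):

    mv /usr/sbin/slurmstepd /usr/sbin/slurmstepd.real
    cat > /usr/sbin/slurmstepd <<'EOF'
    #!/bin/sh
    # Hypothetical shim: run every new step under valgrind, one log file per PID.
    exec /usr/bin/valgrind --tool=memcheck --leak-check=yes --num-callers=40 \
         --log-file=/tmp/valg.slurmstepd.%p.out /usr/sbin/slurmstepd.real "$@"
    EOF
    chmod 755 /usr/sbin/slurmstepd

Remember to undo the shim afterwards; every step on the node pays valgrind's slowdown while it is in place.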
Hi Moe, I tried to crash slurmd (started via valgrind) about 100 times and did not succeed. After that I checked all the cores from slurmd and found that all of them were created when I simultaneously restarted the slurmds on all 106 nodes via pdsh. All the slurmd cores have the very same creation time. On the other hand, the slurmstepd cores were all created at different times. I believe the slurmd and slurmstepd issues are unrelated. Is there any way to auto-provision starting slurmstepd under valgrind, along the lines of the shim sketched above? As was stated: "... Unless the application is doing lots of unbuffered I/O (srun --unbuffered option), slurmstepd is going to be sleeping 99.999% of the time." This is exactly our case (since we do not use `--unbuffered'). We could easily live with all slurmstepds running under valgrind.
Ok. I am starting to post the new backtraces one per comment (to make referring to them easier).
Core was generated by `slurmstepd: [58442] '. Program terminated with signal 7, Bus error. #0 0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155 155 slurm_mutex_lock(&acct_gather_profile_timer[i]. Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 4 (Thread 0x2aaaae257700 (LWP 105170)): #0 0x00002aaaab73363c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00000000005248d4 in _poll_data (profile=false) at slurm_jobacct_gather.c:178 err = 0 #2 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2aaaac909220 (LWP 105169)): #0 0x00002aaaab9f170a in wait4 () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000042f7d0 in _wait_for_any_task (job=0x80ded0, waitflag=true) at mgr.c:1928 t = 0x0 status = 0 pid = 5692464 completed = 0 jobacct = 0x0 rusage = {ru_utime = {tv_sec = 140737488350416, tv_usec = 4649478}, ru_stime = {tv_sec = 5884248, tv_usec = 8466080}, ru_maxrss = 8485440, ru_ixrss = 0, ru_idrss = 140737488350000, ru_isrss = 4392465, ru_minflt = 4294961856, ru_majflt = 8445648, ru_nswap = 140737488350416, ru_inblock = 0, ru_oublock = 0, ru_msgsnd = 4624298, ru_msgrcv = 6019680, ru_nsignals = 416611822512, ru_nvcsw = 8485440, ru_nivcsw = 8315856} tmp_env = 0x7fffffffeae0 #2 0x000000000042fca6 in _wait_for_all_tasks (job=0x80ded0) at mgr.c:2036 tasks_left = 1 i = 0 #3 0x000000000042e25d in job_manager (job=0x80ded0) at mgr.c:1244 rc = 0 io_initialized = true ckpt_type = 0x8193b0 "checkpoint/none" #4 0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146 cli = 0x803120 self = 0x0 msg = 0x80f8c0 job = 0x80ded0 ngids = 0 gids = 0x0 rc = 0 Thread 2 (Thread 0x2aaaae459700 (LWP 105172)): #0 0x00002aaaaba24113 in poll () from /lib64/libc.so.6 ---Type <return> to continue, or q <return> to quit--- No symbol table info available.
#1 0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344 n = 16777217 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x818b10) at eio.c:304 retval = 0 pollfds = 0x2aaab00008d0 map = 0x2aaab0000900 maxnfds = 1 nfds = 2 n = 1 __func__ = "eio_handle_mainloop" #3 0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229 No locals. #4 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2aaaae358700 (LWP 105171)): #0 0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155 err = 0 i = 4 now = 1463010791 diff = 1463010791 tv1 = {tv_sec = 1463010791, tv_usec = 141898} tv2 = {tv_sec = 1463010791, tv_usec = 141898} tv_str = "usec=0", '\000' <repeats 13 times> delta_t = 0 #1 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #2 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. (gdb) Core was generated by `slurmstepd: [58442.0] '. Program terminated with signal 7, Bus error. #0 0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155 155 slurm_mutex_lock(&acct_gather_profile_timer[i]. Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 6 (Thread 0x2aaaac909220 (LWP 105183)): #0 0x00002aaaab9f170a in wait4 () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000042f7d0 in _wait_for_any_task (job=0x80df00, waitflag=true) at mgr.c:1928 t = 0x0 status = 0 pid = -131072 completed = 0 jobacct = 0x0 rusage = {ru_utime = {tv_sec = 5863754, tv_usec = 9891309677232}, ru_stime = {tv_sec = 5863755, tv_usec = 140737488349816}, ru_maxrss = 8669952, ru_ixrss = 8672064, ru_idrss = 140737488350000, ru_isrss = 4393060, ru_minflt = 4294961856, ru_majflt = 8445696, ru_nswap = 140733193388032, ru_inblock = 0, ru_oublock = 25769803776, ru_msgsnd = 0, ru_msgrcv = 0, ru_nsignals = 416611822512, ru_nvcsw = 8486560, ru_nivcsw = 25778119632} tmp_env = 0x0 #2 0x000000000042fca6 in _wait_for_all_tasks (job=0x80df00) at mgr.c:2036 tasks_left = 6 i = 0 #3 0x000000000042e25d in job_manager (job=0x80df00) at mgr.c:1244 rc = 0 io_initialized = true ckpt_type = 0x81b520 "checkpoint/none" #4 0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146 cli = 0x803120 self = 0x80cfa0 msg = 0x80fdd0 job = 0x80df00 ngids = 0 gids = 0x0 rc = 0 Thread 5 (Thread 0x2aaaae257700 (LWP 105184)): #0 0x00002aaaab73363c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00000000005248d4 in _poll_data (profile=false) at slurm_jobacct_gather.c:178 err = 0 #2 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2aaab8323700 (LWP 105188)): #0 0x00002aaaaba24113 in poll () from /lib64/libc.so.6 ---Type <return> to continue, or q <return> to quit--- No symbol table info available. 
#1 0x000000000044d97a in _poll_internal (pfds=0x2aaabc0008d0, nfds=14, shutdown_time=0) at eio.c:344 n = 16777235 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x817d40) at eio.c:304 retval = 0 pollfds = 0x2aaabc0008d0 map = 0x2aaabc000990 maxnfds = 19 nfds = 14 n = 19 __func__ = "eio_handle_mainloop" #3 0x0000000000437f2d in _io_thr (arg=0x80df00) at io.c:1504 job = 0x80df00 set = {__val = {4097, 0 <repeats 15 times>}} rc = 0 #4 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2aaaae459700 (LWP 105186)): #0 0x00002aaaaba24113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344 n = 16777217 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x81b4e0) at eio.c:304 retval = 0 pollfds = 0x2aaab00008d0 map = 0x2aaab0000900 maxnfds = 1 nfds = 2 n = 1 __func__ = "eio_handle_mainloop" #3 0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229 No locals. #4 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2aaaaf085700 (LWP 105187)): ---Type <return> to continue, or q <return> to quit--- #0 0x00002aaaaba24113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaab4000dd0, nfds=8, shutdown_time=0) at eio.c:344 n = 16777223 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x2aaab40008d0) at eio.c:304 retval = 0 pollfds = 0x2aaab4000dd0 map = 0x2aaab4000e30 maxnfds = 7 nfds = 8 n = 7 __func__ = "eio_handle_mainloop" #3 0x00002aaaaec71a9e in _agent (unused=0x0) at agent.c:327 tree_listen_obj = 0x2aaab4000910 task_obj = 0x2aaab4000cd0 i = 6 #4 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2aaaae358700 (LWP 105185)): #0 0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155 err = 0 i = 4 now = 1463010791 diff = 1463010791 tv1 = {tv_sec = 1463010791, tv_usec = 536386} tv2 = {tv_sec = 1463010791, tv_usec = 536386} tv_str = "usec=0", '\000' <repeats 13 times> delta_t = 0 #1 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #2 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. (gdb) Core was generated by `slurmstepd: [63735] '. Program terminated with signal 7, Bus error. #0 slurm_auth_unpack (buf=0x0) at auth_munge.c:415 415 { Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 5 (Thread 0x2aaaae257700 (LWP 99786)): #0 0x00002aaaab73363c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. 
#1 0x00000000005248d4 in _poll_data (profile=false) at slurm_jobacct_gather.c:178 err = 0 #2 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2aaaae459700 (LWP 99788)): #0 0x00002aaaaba24113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344 n = 16777217 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x818ae0) at eio.c:304 retval = 0 pollfds = 0x2aaab00008d0 map = 0x2aaab0000900 maxnfds = 1 nfds = 2 n = 1 __func__ = "eio_handle_mainloop" #3 0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229 No locals. #4 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2aaaac909220 (LWP 99785)): #0 0x00002aaaab9f170a in wait4 () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000042f7d0 in _wait_for_any_task (job=0x80ded0, waitflag=true) at mgr.c:1928 t = 0x0 status = 0 pid = 5692464 completed = 0 jobacct = 0x0 rusage = {ru_utime = {tv_sec = 140737488350416, tv_usec = 4649478}, ru_stime = {tv_sec = 5884248, tv_usec = 8466048}, ru_maxrss = 8485392, ru_ixrss = 0, ru_idrss = 140737488350000, ru_isrss = 4392465, ru_minflt = 4294961856, ru_majflt = 8445648, ---Type <return> to continue, or q <return> to quit--- ru_nswap = 140737488350416, ru_inblock = 0, ru_oublock = 0, ru_msgsnd = 4624298, ru_msgrcv = 6019680, ru_nsignals = 416611822512, ru_nvcsw = 8485392, ru_nivcsw = 8315856} tmp_env = 0x7fffffffeae0 #2 0x000000000042fca6 in _wait_for_all_tasks (job=0x80ded0) at mgr.c:2036 tasks_left = 1 i = 0 #3 0x000000000042e25d in job_manager (job=0x80ded0) at mgr.c:1244 rc = 0 io_initialized = true ckpt_type = 0x819380 "checkpoint/none" #4 0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146 cli = 0x803120 self = 0x0 msg = 0x80f8b0 job = 0x80ded0 ngids = 0 gids = 0x0 rc = 0 Thread 2 (Thread 0x2aaaae358700 (LWP 99787)): #0 0x00002aaaab9f1a7d in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00002aaaaba26c24 in usleep () from /lib64/libc.so.6 No symbol table info available. #2 0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155 err = 0 i = 4 now = 1463102589 diff = 1463102589 tv1 = {tv_sec = 1463102589, tv_usec = 489958} tv2 = {tv_sec = 1463102589, tv_usec = 489960} tv_str = "usec=2", '\000' <repeats 13 times> delta_t = 2 #3 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. 
Thread 1 (Thread 0x2aaab8100700 (LWP 90184)): #0 slurm_auth_unpack (buf=0x0) at auth_munge.c:415 cred = 0x2aaab80ffe00 type = 0xbc000900 <Address 0xbc000900 out of bounds> size = 0 ---Type <return> to continue, or q <return> to quit--- version = 0 #1 0x0000000000522696 in g_slurm_auth_verify (cred=0x522696, hosts=0x2aaab80ffe00, timeout=0, auth_info=0x2aaab80ffe00 "\260\376\017\270\252*") at slurm_auth.c:248 ret = 10922 #2 0x000000000043b8c2 in _handle_accept (arg=0x0) at req.c:419 remaining = 0 ptr = 0x2aaabc004900 "" rc = 16384 fd = 9 job = 0x80ded0 req = 0 len = 16384 buffer = 0x2aaabc0008d0 auth_cred = 0x2aaabc004a20 rc = -1412630440 uid = 10922 gid = 2882336848 #3 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. (gdb) Core was generated by `slurmstepd: [63735.0] '. Program terminated with signal 7, Bus error. #0 slurm_auth_unpack (buf=0x0) at auth_munge.c:415 415 { Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 7 (Thread 0x2aaab8323700 (LWP 99804)): #0 0x00002aaaaba24113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaabc0008d0, nfds=18, shutdown_time=0) at eio.c:344 n = 16777241 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x817d80) at eio.c:304 retval = 0 pollfds = 0x2aaabc0008d0 map = 0x2aaabc0009c0 maxnfds = 25 nfds = 18 n = 25 __func__ = "eio_handle_mainloop" #3 0x0000000000437f2d in _io_thr (arg=0x80df00) at io.c:1504 job = 0x80df00 set = {__val = {4097, 0 <repeats 15 times>}} rc = 0 #4 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 6 (Thread 0x2aaaaf085700 (LWP 99803)): #0 0x00002aaaaba24113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaab4000f30, nfds=10, shutdown_time=0) at eio.c:344 n = 16777225 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x2aaab40008d0) at eio.c:304 retval = 0 pollfds = 0x2aaab4000f30 map = 0x2aaab4000fa0 maxnfds = 9 nfds = 10 n = 9 __func__ = "eio_handle_mainloop" #3 0x00002aaaaec71a9e in _agent (unused=0x0) at agent.c:327 tree_listen_obj = 0x2aaab4000910 task_obj = 0x2aaab4000e10 i = 8 #4 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- No symbol table info available. #5 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 5 (Thread 0x2aaaae459700 (LWP 99802)): #0 0x00002aaaaba24113 in poll () from /lib64/libc.so.6 No symbol table info available. 
#1 0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344 n = 16777217 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x81b660) at eio.c:304 retval = 0 pollfds = 0x2aaab00008d0 map = 0x2aaab0000900 maxnfds = 1 nfds = 2 n = 1 __func__ = "eio_handle_mainloop" #3 0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229 No locals. #4 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2aaaae358700 (LWP 99801)): #0 0x00002aaaab9f1a7d in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00002aaaaba26c24 in usleep () from /lib64/libc.so.6 No symbol table info available. #2 0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155 err = 0 i = 4 now = 1463102589 diff = 1463102589 tv1 = {tv_sec = 1463102589, tv_usec = 495269} tv2 = {tv_sec = 1463102589, tv_usec = 495271} tv_str = "usec=2", '\000' <repeats 13 times> delta_t = 2 #3 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. ---Type <return> to continue, or q <return> to quit--- Thread 3 (Thread 0x2aaaae257700 (LWP 99800)): #0 0x00002aaaab73363c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00000000005248d4 in _poll_data (profile=false) at slurm_jobacct_gather.c:178 err = 0 #2 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2aaaac909220 (LWP 99799)): #0 0x00002aaaab9f170a in wait4 () from /lib64/libc.so.6 No symbol table info available. 
#1 0x000000000042f7d0 in _wait_for_any_task (job=0x80df00, waitflag=true) at mgr.c:1928 t = 0x0 status = 0 pid = -131072 completed = 0 jobacct = 0x0 rusage = {ru_utime = {tv_sec = 5863754, tv_usec = 9891309677232}, ru_stime = {tv_sec = 5863755, tv_usec = 140737488349816}, ru_maxrss = 8684576, ru_ixrss = 8685040, ru_idrss = 140737488350000, ru_isrss = 4393060, ru_minflt = 4294961856, ru_majflt = 8445696, ru_nswap = 140733193388032, ru_inblock = 0, ru_oublock = 34359738368, ru_msgsnd = 0, ru_msgrcv = 0, ru_nsignals = 416611822512, ru_nvcsw = 8486624, ru_nivcsw = 34368054224} tmp_env = 0x0 #2 0x000000000042fca6 in _wait_for_all_tasks (job=0x80df00) at mgr.c:2036 tasks_left = 8 i = 0 #3 0x000000000042e25d in job_manager (job=0x80df00) at mgr.c:1244 rc = 0 io_initialized = true ckpt_type = 0x819c70 "checkpoint/none" #4 0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146 cli = 0x803120 self = 0x80cfa0 msg = 0x80fdf0 job = 0x80df00 ngids = 0 gids = 0x0 rc = 0 Thread 1 (Thread 0x2aaab8625700 (LWP 90186)): #0 slurm_auth_unpack (buf=0x0) at auth_munge.c:415 ---Type <return> to continue, or q <return> to quit--- cred = 0x2aaab8624e00 type = 0xc4000900 <Address 0xc4000900 out of bounds> size = 0 version = 0 #1 0x0000000000522696 in g_slurm_auth_verify (cred=0x522696, hosts=0x2aaab8624e00, timeout=0, auth_info=0x2aaab8624e00 "\260Nb\270\252*") at slurm_auth.c:248 ret = 10922 #2 0x000000000043b8c2 in _handle_accept (arg=0x0) at req.c:419 remaining = 0 ptr = 0x2aaac4004900 "" rc = 16384 fd = 26 job = 0x80df00 req = 0 len = 16384 buffer = 0x2aaac40008d0 auth_cred = 0x2aaac4004a20 rc = -1412630440 uid = 10922 gid = 2882336848 #3 0x00002aaaab72fa51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00002aaaaba2d93d in clone () from /lib64/libc.so.6 No symbol table info available. (gdb) Core was generated by `slurmstepd: [50330] '. Program terminated with signal 7, Bus error. #0 fini () at mpi_pmi2.c:184 184 { Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 2 (Thread 0x2aaaae357700 (LWP 61343)): #0 0x00002aaaab9f0a7d in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00002aaaaba25c24 in usleep () from /lib64/libc.so.6 No symbol table info available. #2 0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155 err = 0 i = 4 now = 1463179557 diff = 1463179557 tv1 = {tv_sec = 1463179557, tv_usec = 775258} tv2 = {tv_sec = 1463179557, tv_usec = 775259} tv_str = "usec=1", '\000' <repeats 13 times> delta_t = 1 #3 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2aaaac908220 (LWP 61341)): #0 fini () at mpi_pmi2.c:184 No locals. 
#1 0x000000000048678e in plugin_load_and_link (type_name=0x0, n_syms=0, names=0x0, ptrs=0x0) at plugin.c:282 got_colon = false plug = 0x48678e st = {st_dev = 0, st_ino = 0, st_nlink = 46912511462392, st_mode = 2874292624, st_uid = 10922, st_gid = 4294962000, __pad0 = 32767, st_rdev = 4294967295, st_size = 15, st_blksize = 5, st_blocks = 46912509219360, st_atim = {tv_sec = 0, tv_nsec = 8429312}, st_mtim = {tv_sec = 46912500720284, tv_nsec = 46912498348032}, st_ctim = {tv_sec = 0, tv_nsec = 4364640}, __unused = {46912500719759, 8508464, 5913691}} head = 0x7fffffffeb30 "`\353\377\377\377\177" dir_array = 0x7fffffffebb0 "" so_name = 0x7fff00000005 <Address 0x7fff00000005 out of bounds> file_name = 0x2aaaaec7022e "UH\211\345\350\021\364\377\377\350,\361\377\377\270" i = 8508464 err = 32767 #2 0x0000000000486c68 in plugin_context_create (plugin_type=0x2aaaaec7022e "UH\211\345\350\021\364\377\377\350,\361\377\377\270", uler_type=0x2aaaae458700 "", ptrs=0x81d430, names=0x7fffffffeb40, names_size=4745102) at plugin.c:456 plugin_dir = 0x486c68 "\330H\213U\330H\213E\340H\213@\bH\211\326H\211\307\350\376\005" c = 0x2aaaab72f180 n_names = 0 #3 0x000000000047e8fa in mpi_hook_slurmstepd_task (job=0x7fffffffecd0, env=0x429960) at mpi.c:176 No locals. ---Type <return> to continue, or q <return> to quit--- #4 0x0000000000429c50 in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:163 cli = 0x8030d0 self = 0x0 msg = 0x810a40 job = 0x80deb0 ngids = 0 gids = 0x0 rc = 0 (gdb) Core was generated by `slurmstepd: [50330.0] '. Program terminated with signal 7, Bus error. #0 core_spec_p_clear (cont_id=4364640) at core_spec_none.c:142 142 { Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 5 (Thread 0x2aaaaf084700 (LWP 61735)): #0 0x00002aaaaba23113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaab4001470, nfds=18, shutdown_time=0) at eio.c:344 n = 16777233 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x2aaab40008d0) at eio.c:304 retval = 0 pollfds = 0x2aaab4001470 map = 0x2aaab4001520 maxnfds = 17 nfds = 18 n = 17 __func__ = "eio_handle_mainloop" #3 0x00002aaaaec70a9e in _agent (unused=0x0) at agent.c:327 tree_listen_obj = 0x2aaab4000910 task_obj = 0x2aaab4001310 i = 16 #4 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2aaaae458700 (LWP 61734)): #0 0x00002aaaaba23113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344 n = 16777217 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x8209d0) at eio.c:304 retval = 0 pollfds = 0x2aaab00008d0 map = 0x2aaab0000900 maxnfds = 1 nfds = 2 n = 1 __func__ = "eio_handle_mainloop" #3 0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229 No locals. #4 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. 
#5 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 ---Type <return> to continue, or q <return> to quit--- No symbol table info available. Thread 3 (Thread 0x2aaaae256700 (LWP 23170)): #0 0x00002aaaab9f0a7d in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00002aaaab9f08f0 in sleep () from /lib64/libc.so.6 No symbol table info available. #2 0x000000000042fd31 in _wait_for_all_tasks (job=0x84dec0) at mgr.c:2054 rc = 0 tasks_left = 300 i = 0 #3 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2aaaae357700 (LWP 61733)): #0 0x00002aaaab9f0a7d in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00002aaaaba25c24 in usleep () from /lib64/libc.so.6 No symbol table info available. #2 0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155 err = 0 i = 4 now = 1463179557 diff = 1463179557 tv1 = {tv_sec = 1463179557, tv_usec = 284747} tv2 = {tv_sec = 1463179557, tv_usec = 284748} tv_str = "usec=1", '\000' <repeats 13 times> delta_t = 1 #3 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2aaaac908220 (LWP 61731)): #0 core_spec_p_clear (cont_id=4364640) at core_spec_none.c:142 No locals. #1 0x000000000044473f in core_spec_g_set (cont_id=61731, core_count=0) at core_spec_plugin.c:184 i = 0 rc = 0 #2 0x000000000042e4dd in job_manager (job=0x80dee0) at mgr.c:1340 rc = 0 ---Type <return> to continue, or q <return> to quit--- io_initialized = true ckpt_type = 0x820b40 "checkpoint/none" #3 0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146 cli = 0x8030d0 self = 0x80cf80 msg = 0x811e00 job = 0x80dee0 ngids = 0 gids = 0x0 rc = 0 (gdb) Core was generated by `slurmstepd: [50335] '. Program terminated with signal 7, Bus error. #0 task_p_post_term (job=0x0, task=0x0) at task_affinity.c:484 484 { Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 4 (Thread 0x2aaaae357700 (LWP 64436)): #0 0x00002aaaab9f0a7d in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00002aaaaba25c24 in usleep () from /lib64/libc.so.6 No symbol table info available. #2 0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155 err = 0 i = 4 now = 1463179589 diff = 1463179589 tv1 = {tv_sec = 1463179589, tv_usec = 824232} tv2 = {tv_sec = 1463179589, tv_usec = 824232} tv_str = "usec=0", '\000' <repeats 13 times> delta_t = 0 #3 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2aaaae256700 (LWP 64435)): #0 0x00002aaaab73263c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. 
#1 0x00000000005248d4 in _poll_data (profile=false) at slurm_jobacct_gather.c:178 err = 0 #2 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2aaaae458700 (LWP 64437)): #0 0x00002aaaaba23113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344 n = 16777217 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x81ad60) at eio.c:304 retval = 0 pollfds = 0x2aaab00008d0 map = 0x2aaab0000900 maxnfds = 1 nfds = 2 n = 1 ---Type <return> to continue, or q <return> to quit--- __func__ = "eio_handle_mainloop" #3 0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229 No locals. #4 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2aaaac908220 (LWP 64434)): #0 task_p_post_term (job=0x0, task=0x0) at task_affinity.c:484 base = "\177\353\377\377\377\177\000\000\260Þ€\000\000\000\000\000\000\000\233\253\000\000\000\000\260Þ€\000\000\000\000\000 \232\201", '\000' <repeats 33 times>"\266, \373\000\000\001", '\000' <repeats 24 times>"\353, \377\377\377\177\000\000\272\264D\000\000\000\000\000\220\341\177\000\000\000\000\000\200$\202\000\000\000\000\000\000\323~\000\000\000\000\000`\323~\000\000\000\000\000include\000/etc/slurm/plugstack.conf.d/*\000\000\000t\245\201\000\000\000\000\000\260\332\377\377\377\177\000\000>\232F\000\000\000\000\000\320\354\377\377\377\177\000\000\340\332\377\377\377\177\000\000\000\000\000\000\000\000\000\000@\245\201\000\000\000\000\000\064\000\000\000\000\000\000\000\065\000\000\000\000\000\000\000\360\352\377\377\377\177\000\000\366\002C\000\000\000\000\000\300\341\377\377\377\177\000\000\260\341\377\377\377\177\000\000\027\000\000\000\000\000\000\000\060\343\377\377\377\177\000\000\215"... path = '\000' <repeats 192 times>, "p\321\377\377\377\177\000\000`\321\377\377\377\177\000\000\027\000\000\000\000\000\000\000\340\322\377\377\377\177\000\000\215\027Z\000\000\000\000\000g\201\230\253\252*\000\000\215\027Z", '\000' <repeats 13 times>, "\003", '\000' <repeats 47 times>, "P\320\377\377\377\177\000\000\000\000\000\000\000\000\000\000\360\321\377\377\377\177\000\000\340\321\377\377\377\177\000\000\027\000\000\000\000\000\000\000`\323\377\377\377\177\000\000\215\027Z\000\000\000\000\000g\201\230\253\252*\000\000(slurmstepd)\000\000\000\000\003", '\000' <repeats 47 times>"\320, \320\377\377\377\177\000\000@\314\377\377\377\177\000\000\205\027Z\000\000\000\000\000\260\320\377\377\377\177", '\000' <repeats 18 times>, "<\321\377\377\377\177\000\000X\321\377\377\377\177\000\000\000\000\000\000\000\000\000\000|\237\253\252\252*\000\000\000\000\000\000\000\000\000\000\374\234\225\253\252*\000\000\220\071R\253\252*\000\000\202\062\253\252\252*\000\000\000\000\000\000\000\000\000\000"... 
__func__ = "task_p_post_term" #1 0x0000000000448dfe in task_g_post_term (job=0x80deb0, task=0x81a820) at task_plugin.c:397 err = 0 i = 0 rc = 0 #2 0x000000000042fbe3 in _wait_for_any_task (job=0x80deb0, waitflag=true) at mgr.c:2014 t = 0x81a820 status = 15 pid = 64438 completed = 1 jobacct = 0x81e2d0 rusage = {ru_utime = {tv_sec = 3, tv_usec = 430478}, ru_stime = {tv_sec = 5, tv_usec = 160215}, ru_maxrss = 7920, ru_ixrss = 0, ru_idrss = 0, ru_isrss = 0, ru_minflt = 102295, ru_majflt = 35, ru_nswap = 0, ru_inblock = 6816, ru_oublock = 21472, ru_msgsnd = 0, ru_msgrcv = 0, ru_nsignals = 0, ru_nvcsw = 148522, ru_nivcsw = 231} tmp_env = 0x85aec0 #3 0x000000000042fca6 in _wait_for_all_tasks (job=0x80deb0) at mgr.c:2036 tasks_left = 1 i = 0 ---Type <return> to continue, or q <return> to quit--- #4 0x000000000042e25d in job_manager (job=0x80deb0) at mgr.c:1244 rc = 0 io_initialized = true ckpt_type = 0x81aaf0 "checkpoint/none" #5 0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146 cli = 0x8030d0 self = 0x0 msg = 0x810a40 job = 0x80deb0 ngids = 0 gids = 0x0 rc = 0 (gdb) Core was generated by `slurmstepd: [50335.0] '. Program terminated with signal 7, Bus error. #0 task_p_post_term (job=0x0, task=0x0) at task_affinity.c:484 484 { Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 6 (Thread 0x2aaaaf084700 (LWP 64828)): #0 0x00002aaaaba23113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaab4001470, nfds=18, shutdown_time=0) at eio.c:344 n = 16777233 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x2aaab40008d0) at eio.c:304 retval = 0 pollfds = 0x2aaab4001470 map = 0x2aaab4001520 maxnfds = 17 nfds = 18 n = 17 __func__ = "eio_handle_mainloop" #3 0x00002aaaaec70a9e in _agent (unused=0x0) at agent.c:327 tree_listen_obj = 0x2aaab4000910 task_obj = 0x2aaab4001310 i = 16 #4 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 5 (Thread 0x2aaab8322700 (LWP 64829)): #0 0x00002aaaaba23113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaabc0008d0, nfds=29, shutdown_time=0) at eio.c:344 n = 49 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x81cbc0) at eio.c:304 retval = 0 pollfds = 0x2aaabc0008d0 map = 0x2aaabc000a80 maxnfds = 49 nfds = 29 n = 49 __func__ = "eio_handle_mainloop" #3 0x0000000000437f2d in _io_thr (arg=0x80dee0) at io.c:1504 job = 0x80dee0 set = {__val = {4097, 0 <repeats 15 times>}} rc = 0 #4 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- No symbol table info available. #5 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2aaaae256700 (LWP 64825)): #0 0x00002aaaab73263c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. 
#1 0x00000000005248d4 in _poll_data (profile=false) at slurm_jobacct_gather.c:178 err = 0 #2 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2aaaae458700 (LWP 64827)): #0 0x00002aaaaba23113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344 n = 16777217 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x8209d0) at eio.c:304 retval = 0 pollfds = 0x2aaab00008d0 map = 0x2aaab0000900 maxnfds = 1 nfds = 2 n = 1 __func__ = "eio_handle_mainloop" #3 0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229 No locals. #4 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2aaaae357700 (LWP 64826)): #0 0x00002aaaab9f0a7d in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00002aaaaba25c24 in usleep () from /lib64/libc.so.6 No symbol table info available. #2 0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155 err = 0 i = 4 ---Type <return> to continue, or q <return> to quit--- now = 1463179590 diff = 1463179590 tv1 = {tv_sec = 1463179590, tv_usec = 275337} tv2 = {tv_sec = 1463179590, tv_usec = 275337} tv_str = "usec=0", '\000' <repeats 13 times> delta_t = 0 #3 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2aaaac908220 (LWP 64824)): #0 task_p_post_term (job=0x0, task=0x0) at task_affinity.c:484 base = "\363\335\377\377\377\177\000\000\000\000\000\000\000\000\000\000\060\236\204\000\000\000\000\000\363\335\377\377\377\177\000\000\066\064\070\064\065\000\000\000\370\335\377\377\377\177\000\000d", '\000' <repeats 15 times>, "*F\365\255\252*\000\000\001\001", '\000' <repeats 14 times>, "\001\001\001\000\000\000\000\000\000", '\001' <repeats 26 times>, "\000\000\000\000\000\000", '\001' <repeats 26 times>, '\000' <repeats 117 times>"\300, \341\377\377\377\177\000\000\260\341\377\377\377\177\000\000\027\000\000\000\000\000\000\000\060\343\377\377\377\177\000\000\215\027Z\000\000\000\000\000g\201\230\253\252*\000\000\215\027Z", '\000' <repeats 13 times>, "\003\000\000\000\000\000\000\000\005\000\000\000\000\000\000\000\225\000\000\000\000\000\000\000\227\000\000\000\000\000\000\000\260\352\377\377\377\177\000\000\002\000\000\000\000\000\000\000\240\340\377\377\377\177\000\000\001\020\340\220\071\000\000\000\205\027Z\000\000\000\000\000\360"... 
path = '\000' <repeats 192 times>, "p\321\377\377\377\177\000\000`\321\377\377\377\177\000\000\027\000\000\000\000\000\000\000\340\322\377\377\377\177\000\000\215\027Z\000\000\000\000\000g\201\230\253\252*\000\000\215\027Z", '\000' <repeats 13 times>, "\003", '\000' <repeats 47 times>, "P\320\377\377\377\177\000\000\000\000\000\000\000\000\000\000\360\321\377\377\377\177\000\000\340\321\377\377\377\177\000\000\027\000\000\000\000\000\000\000`\323\377\377\377\177\000\000\215\027Z\000\000\000\000\000g\201\230\253\252*\000\000(slurmstepd)\000\000\000\000\003", '\000' <repeats 47 times>"\320, \320\377\377\377\177\000\000@\314\377\377\377\177\000\000\205\027Z\000\000\000\000\000\260\320\377\377\377\177", '\000' <repeats 18 times>, "<\321\377\377\377\177\000\000X\321\377\377\377\177\000\000\000\000\000\000\000\000\000\000|\237\253\252\252*\000\000\000\000\000\000\000\000\000\000\374\234\225\253\252*\000\000\220\071R\253\252*\000\000\202\062\253\252\252*\000\000\000\000\000\000\000\000\000\000"... __func__ = "task_p_post_term" #1 0x0000000000448dfe in task_g_post_term (job=0x80dee0, task=0x81fea0) at task_plugin.c:397 err = 0 i = 0 rc = 0 #2 0x000000000042fbe3 in _wait_for_any_task (job=0x80dee0, waitflag=true) at mgr.c:2014 t = 0x81fea0 status = 15104 pid = 64834 completed = 1 jobacct = 0x831070 rusage = {ru_utime = {tv_sec = 990585, tv_usec = 35184}, ru_stime = {tv_sec = 34813, tv_usec = 718510}, ru_maxrss = 3920228, ru_ixrss = 0, ru_idrss = 0, ru_isrss = 0, ru_minflt = 108514, ru_majflt = 39, ru_nswap = 0, ru_inblock = 2072, ru_oublock = 48, ru_msgsnd = 0, ru_msgrcv = 0, ru_nsignals = 0, ru_nvcsw = 23593, ru_nivcsw = 3256753} tmp_env = 0x80e3d0 ---Type <return> to continue, or q <return> to quit--- #3 0x000000000042fca6 in _wait_for_all_tasks (job=0x80dee0) at mgr.c:2036 tasks_left = 16 i = 0 #4 0x000000000042e25d in job_manager (job=0x80dee0) at mgr.c:1244 rc = 0 io_initialized = true ckpt_type = 0x820b40 "checkpoint/none" #5 0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146 cli = 0x8030d0 self = 0x80cf80 msg = 0x811e00 job = 0x80dee0 ngids = 0 gids = 0x0 rc = 0 (gdb) Core was generated by `slurmstepd: [50337] '. Program terminated with signal 7, Bus error. #0 0x00002aaaaba776bf in __strlen_sse42 () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 5 (Thread 0x2aaaae458700 (LWP 24967)): #0 0x00002aaaaba23113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344 n = 16777217 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x81ad60) at eio.c:304 retval = 0 pollfds = 0x2aaab00008d0 map = 0x2aaab0000900 maxnfds = 1 nfds = 2 n = 1 __func__ = "eio_handle_mainloop" #3 0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229 No locals. #4 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. 
Thread 4 (Thread 0x2aaaac908220 (LWP 24964)): #0 0x00002aaaab9f070a in wait4 () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000042f7d0 in _wait_for_any_task (job=0x80deb0, waitflag=true) at mgr.c:1928 t = 0x0 status = 0 pid = 5692464 completed = 0 jobacct = 0x0 rusage = {ru_utime = {tv_sec = 140737488350416, tv_usec = 4649478}, ru_stime = {tv_sec = 5884248, tv_usec = 8472032}, ru_maxrss = 8494000, ru_ixrss = 0, ru_idrss = 140737488350000, ru_isrss = 4392465, ru_minflt = 4294961856, ru_majflt = 8445616, ru_nswap = 140737488350416, ru_inblock = 0, ru_oublock = 0, ru_msgsnd = 4624298, ru_msgrcv = 6019680, ru_nsignals = 416611822512, ru_nvcsw = 8494000, ru_nivcsw = 8315856} tmp_env = 0x7fffffffeae0 #2 0x000000000042fca6 in _wait_for_all_tasks (job=0x80deb0) at mgr.c:2036 tasks_left = 1 i = 0 #3 0x000000000042e25d in job_manager (job=0x80deb0) at mgr.c:1244 rc = 0 io_initialized = true ckpt_type = 0x81aaf0 "checkpoint/none" ---Type <return> to continue, or q <return> to quit--- #4 0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146 cli = 0x8030d0 self = 0x0 msg = 0x810a40 job = 0x80deb0 ngids = 0 gids = 0x0 rc = 0 Thread 3 (Thread 0x2aaaae357700 (LWP 24966)): #0 0x00002aaaab9f0a7d in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00002aaaaba25c24 in usleep () from /lib64/libc.so.6 No symbol table info available. #2 0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155 err = 0 i = 4 now = 1463179617 diff = 1463179617 tv1 = {tv_sec = 1463179617, tv_usec = 41727} tv2 = {tv_sec = 1463179617, tv_usec = 41727} tv_str = "usec=0", '\000' <repeats 13 times> delta_t = 0 #3 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2aaaae256700 (LWP 24965)): #0 0x00002aaaab73263c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00000000005248d4 in _poll_data (profile=false) at slurm_jobacct_gather.c:178 err = 0 #2 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2aaab8100700 (LWP 51855)): #0 0x00002aaaaba776bf in __strlen_sse42 () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000049b127 in setenvf (envp=0x12, name=0x5a23a1 "tall", fmt=0xf291aff6ba6ccf2 <Address 0xf291aff6ba6ccf2 out of bounds>) at env.c:300 ---Type <return> to continue, or q <return> to quit--- value = 0x3000000008 <Address 0x3000000008 out of bounds> ap = {{gp_offset = 3088055504, fp_offset = 10922, overflow_arg_area = 0x4773b4, reg_save_area = 0x2aaab80ffcd0}} size = 10922 rc = -1206911776 #2 0x000000000043d31c in _handle_signal_container (fd=9, job=0x80deb0, uid=0) at req.c:779 rc = 0 errnum = 0 sig = 18 msg_sent = 0 ptr = 0x0 target_node_id = 0 task = 0x5a23a0 i = 0 flag = 0 signal = 18 __func__ = "_handle_signal_container" #3 0x000000000043be1c in _handle_request (fd=9, job=0x80deb0, uid=0, gid=0) at req.c:493 rc = 0 req = 4 #4 0x000000000043baa7 in _handle_accept (arg=0x0) at req.c:446 remaining = 0 ptr = 0x2aaab80ffe94 "" rc = 4 fd = 9 job = 0x80deb0 req = 0 len = 16384 buffer = 0x2aaabc0008d0 auth_cred = 0x2aaabc004a20 rc = 7424 uid = 0 gid = 0 #5 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. 
#6 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. (gdb) Core was generated by `slurmstepd: [50337.0] '. Program terminated with signal 7, Bus error. #0 0x00002aaaaba776bf in __strlen_sse42 () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 7 (Thread 0x2aaab8322700 (LWP 25359)): #0 0x00002aaaaba23113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaabc0008d0, nfds=34, shutdown_time=0) at eio.c:344 n = 16777265 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x81cbc0) at eio.c:304 retval = 0 pollfds = 0x2aaabc0008d0 map = 0x2aaabc000a80 maxnfds = 49 nfds = 34 n = 49 __func__ = "eio_handle_mainloop" #3 0x0000000000437f2d in _io_thr (arg=0x80dee0) at io.c:1504 job = 0x80dee0 set = {__val = {4097, 0 <repeats 15 times>}} rc = 0 #4 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 6 (Thread 0x2aaaac908220 (LWP 25354)): #0 0x00002aaaab9f070a in wait4 () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000042f7d0 in _wait_for_any_task (job=0x80dee0, waitflag=true) at mgr.c:1928 t = 0x0 status = 0 pid = -131072 completed = 0 jobacct = 0x0 rusage = {ru_utime = {tv_sec = 5863754, tv_usec = 9891309677232}, ru_stime = {tv_sec = 5863755, tv_usec = 140737488349816}, ru_maxrss = 8612464, ru_ixrss = 8601776, ru_idrss = 140737488350000, ru_isrss = 4393060, ru_minflt = 4294961856, ru_majflt = 8445664, ru_nswap = 140733193388032, ru_inblock = 0, ru_oublock = 68719476736, ru_msgsnd = 0, ru_msgrcv = 0, ru_nsignals = 416611822512, ru_nvcsw = 8506656, ru_nivcsw = 68727792592} tmp_env = 0x0 #2 0x000000000042fca6 in _wait_for_all_tasks (job=0x80dee0) at mgr.c:2036 tasks_left = 16 i = 0 #3 0x000000000042e25d in job_manager (job=0x80dee0) at mgr.c:1244 rc = 0 ---Type <return> to continue, or q <return> to quit--- io_initialized = true ckpt_type = 0x820b40 "checkpoint/none" #4 0x0000000000429bfd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:146 cli = 0x8030d0 self = 0x80cf80 msg = 0x811e00 job = 0x80dee0 ngids = 0 gids = 0x0 rc = 0 Thread 5 (Thread 0x2aaaaf084700 (LWP 25358)): #0 0x00002aaaaba23113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044d97a in _poll_internal (pfds=0x2aaab4001470, nfds=18, shutdown_time=0) at eio.c:344 n = 16777233 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x2aaab40008d0) at eio.c:304 retval = 0 pollfds = 0x2aaab4001470 map = 0x2aaab4001520 maxnfds = 17 nfds = 18 n = 17 __func__ = "eio_handle_mainloop" #3 0x00002aaaaec70a9e in _agent (unused=0x0) at agent.c:327 tree_listen_obj = 0x2aaab4000910 task_obj = 0x2aaab4001310 i = 16 #4 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2aaaae458700 (LWP 25357)): #0 0x00002aaaaba23113 in poll () from /lib64/libc.so.6 No symbol table info available. 
#1 0x000000000044d97a in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:344 n = 16777217 timeout = -1 #2 0x000000000044d7f0 in eio_handle_mainloop (eio=0x8209d0) at eio.c:304 retval = 0 pollfds = 0x2aaab00008d0 ---Type <return> to continue, or q <return> to quit--- map = 0x2aaab0000900 maxnfds = 1 nfds = 2 n = 1 __func__ = "eio_handle_mainloop" #3 0x000000000043aeb7 in _domain_socket_destroy (fd=0) at req.c:229 No locals. #4 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2aaaae357700 (LWP 25356)): #0 0x00002aaaab9f0a7d in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00002aaaaba25c24 in usleep () from /lib64/libc.so.6 No symbol table info available. #2 0x0000000000528c89 in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:155 err = 0 i = 4 now = 1463179616 diff = 1463179616 tv1 = {tv_sec = 1463179616, tv_usec = 425283} tv2 = {tv_sec = 1463179616, tv_usec = 425284} tv_str = "usec=1", '\000' <repeats 13 times> delta_t = 1 #3 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2aaaae256700 (LWP 25355)): #0 0x00002aaaab73263c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x00000000005248d4 in _poll_data (profile=false) at slurm_jobacct_gather.c:178 err = 0 #2 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2aaab8624700 (LWP 51854)): #0 0x00002aaaaba776bf in __strlen_sse42 () from /lib64/libc.so.6 ---Type <return> to continue, or q <return> to quit--- No symbol table info available. #1 0x000000000049b127 in setenvf (envp=0x12, name=0x5a23a1 "tall", fmt=0x392ac9c5c6b61727 <Address 0x392ac9c5c6b61727 out of bounds>) at env.c:300 value = 0x3000000008 <Address 0x3000000008 out of bounds> ap = {{gp_offset = 3093445840, fp_offset = 10922, overflow_arg_area = 0x4773b4, reg_save_area = 0x2aaab8623cd0}} size = 10922 rc = -1201521440 #2 0x000000000043d31c in _handle_signal_container (fd=44, job=0x80dee0, uid=0) at req.c:779 rc = 0 errnum = 0 sig = 18 msg_sent = 0 ptr = 0x0 target_node_id = 0 task = 0x5a23a0 i = 0 flag = 0 signal = 18 __func__ = "_handle_signal_container" #3 0x000000000043be1c in _handle_request (fd=44, job=0x80dee0, uid=0, gid=0) at req.c:493 rc = 0 req = 4 #4 0x000000000043baa7 in _handle_accept (arg=0x0) at req.c:446 remaining = 0 ptr = 0x2aaab8623e94 "" rc = 4 fd = 44 job = 0x80dee0 req = 0 len = 16384 buffer = 0x2aaac40008d0 auth_cred = 0x2aaac4004a20 rc = 7424 uid = 0 gid = 0 #5 0x00002aaaab72ea51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #6 0x00002aaaaba2c93d in clone () from /lib64/libc.so.6 No symbol table info available. (gdb)
Three of these core files (comments 74, 75, and 83) all have a similar pattern. Here is the relevant logic:

    typedef enum {
        PROFILE_ENERGY,
        PROFILE_TASK,
        PROFILE_FILESYSTEM,
        PROFILE_NETWORK,
        PROFILE_CNT
    } acct_gather_profile_type_t;   // NOTE that the enum starts at zero, so PROFILE_CNT=4

    acct_gather_profile_timer_t acct_gather_profile_timer[PROFILE_CNT];

    for (i=0; i<PROFILE_CNT; i++) {
        ...
        slurm_mutex_lock(&acct_gather_profile_timer[i].notify_mutex);
        ...
    }

In each of those core files i=4, and the SIGBUS happens on the line with the lock. Since the array size is 4, with valid indices from 0 to 3 inclusive (the enum starts at zero), memory outside of the array is referenced, causing the SIGBUS. Even if the abort does not happen at this exact location, it could corrupt memory, causing a failure at some other point. My first thought was that your system has a #define PROFILE_CNT in some system header file, but that would have resulted in a larger array being defined and initialized, so it would not cause an invalid memory reference. The only other explanation I have for that behaviour is a compiler error. I'll send you a patch that checks for that and should prevent a failure at this specific location. I know that a compiler error is a long shot, but I've tracked down more unlikely failures in the past.
Created attachment 3091 [details]
range check array references
If an attempt is made to reference an invalid array element slurmstepd will generate a log message like this:
error: [1235.0] _timer_thread @ 145 i=4 PROFILE_CNT=4
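For reference, a minimal sketch of the shape of that check, reconstructed from the log format above (an illustration only, not the attachment itself):

    for (i = 0; i < PROFILE_CNT; i++) {
        if (i >= PROFILE_CNT) {    /* "impossible" unless i was corrupted or miscompiled */
            error("%s @ %d i=%d PROFILE_CNT=%d",
                  __func__, __LINE__, i, PROFILE_CNT);
            break;
        }
        slurm_mutex_lock(&acct_gather_profile_timer[i].notify_mutex);
        /* ... existing per-timer work, unchanged ... */
    }

The test can never fire with correct code, but if a compiler bug or memory corruption pushes i to PROFILE_CNT inside the loop body, slurmstepd logs the event and stops instead of faulting on an out-of-range mutex.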
We would also recommend that you apply this patch, which might possibly be related: https://github.com/SchedMD/slurm/commit/b1fbeb85d949166b904875cf883d8d9e788e4c3f
Hi Moe, I have applied both:
https://bugs.schedmd.com/attachment.cgi?id=3091
https://github.com/SchedMD/slurm/commit/b1fbeb85d949166b904875cf883d8d9e788e4c3f
Let's see...
Any update on this?
We haven't gotten any slurmstepd crash from jobs started after the upgrade. We have one more core from a slurmstepd built without
https://bugs.schedmd.com/attachment.cgi?id=3091
https://github.com/SchedMD/slurm/commit/b1fbeb85d949166b904875cf883d8d9e788e4c3f
Do you need it? Our cluster has not been loaded much for the last 10 days. I expect the load to increase in ~2 weeks.
(In reply to Sergey Meirovich from comment #91)
> We haven't gotten any slurmstepd crash from jobs started after the upgrade.
>
> We have one more core from a slurmstepd built without
> https://bugs.schedmd.com/attachment.cgi?id=3091
> https://github.com/SchedMD/slurm/commit/b1fbeb85d949166b904875cf883d8d9e788e4c3f
> Do you need it?
No thanks.
> Our cluster has not been loaded much for the last 10 days. I expect the load to increase in ~2 weeks.
Thank you for the update. If my patch (below) fixed the problem, then it was a compiler error.
> https://bugs.schedmd.com/attachment.cgi?id=3091
If the race condition patch (below) fixed the problem, your backtraces would be very difficult to explain, but that is more likely than a compiler error.
> https://github.com/SchedMD/slurm/commit/b1fbeb85d949166b904875cf883d8d9e788e4c3f
Two weeks since the last patching, and no crashes of the patched slurmstepd. Let's keep it open for another two weeks.
We got a crash/core of the patched slurmstepd.
Core was generated by `slurmstepd: [80341] '. Program terminated with signal 7, Bus error. #0 0x000000000042f8d7 in _wait_for_any_task (job=0x80ded0, waitflag=true) at mgr.c:1938 1938 pid = wait3(&status, waitflag ?
0 : WNOHANG, &rusage); Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt Thread 4 (Thread 0x2aaaae45a700 (LWP 39737)): #0 0x00002aaaaba25113 in poll () from /lib64/libc.so.6 #1 0x000000000044da8e in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:351 #2 0x000000000044d904 in eio_handle_mainloop (eio=0x818aa0) at eio.c:318 #3 0x000000000043afcb in _msg_thr_internal (job_arg=0x80ded0) at req.c:240 #4 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 #5 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 Thread 3 (Thread 0x2aaaae359700 (LWP 39736)): #0 0x00002aaaab9f2a7d in nanosleep () from /lib64/libc.so.6 #1 0x00002aaaaba27c24 in usleep () from /lib64/libc.so.6 #2 0x0000000000528f7e in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:172 #3 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 #4 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x2aaaae258700 (LWP 39735)): #0 0x00002aaaab73463c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000000000524ae2 in _watch_tasks (arg=0x0) at slurm_jobacct_gather.c:210 #2 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 #3 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 Thread 1 (Thread 0x2aaaac90a220 (LWP 39734)): #0 0x000000000042f8d7 in _wait_for_any_task (job=0x80ded0, waitflag=true) at mgr.c:1938 #1 0x000000000042fdad in _wait_for_all_tasks (job=0x80ded0) at mgr.c:2045 #2 0x000000000042e364 in job_manager (job=0x80ded0) at mgr.c:1256 #3 0x0000000000429cdd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:160 (gdb) Forgot full at the end of command. So one more time: Core was generated by `slurmstepd: [80341] '. Program terminated with signal 7, Bus error. #0 0x000000000042f8d7 in _wait_for_any_task (job=0x80ded0, waitflag=true) at mgr.c:1938 1938 pid = wait3(&status, waitflag ? 0 : WNOHANG, &rusage); Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 4 (Thread 0x2aaaae45a700 (LWP 39737)): #0 0x00002aaaaba25113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044da8e in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:351 n = 16777217 timeout = -1 #2 0x000000000044d904 in eio_handle_mainloop (eio=0x818aa0) at eio.c:318 retval = 0 pollfds = 0x2aaab00008d0 map = 0x2aaab0000900 maxnfds = 1 nfds = 2 n = 1 __func__ = "eio_handle_mainloop" #3 0x000000000043afcb in _msg_thr_internal (job_arg=0x80ded0) at req.c:240 job = 0x80ded0 #4 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 No symbol table info available. 
Thread 3 (Thread 0x2aaaae359700 (LWP 39736)): #0 0x00002aaaab9f2a7d in nanosleep () from /lib64/libc.so.6 No symbol table info available. #1 0x00002aaaaba27c24 in usleep () from /lib64/libc.so.6 No symbol table info available. #2 0x0000000000528f7e in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:172 i = 4 now = 1464882563 diff = 1464882563 tv1 = {tv_sec = 1464882563, tv_usec = 266411} tv2 = {tv_sec = 1464882563, tv_usec = 266411} tv_str = "usec=0", '\000' <repeats 13 times> delta_t = 0 __func__ = "_timer_thread" #3 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #4 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 2 (Thread 0x2aaaae258700 (LWP 39735)): #0 0x00002aaaab73463c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 ---Type <return> to continue, or q <return> to quit--- No symbol table info available. #1 0x0000000000524ae2 in _watch_tasks (arg=0x0) at slurm_jobacct_gather.c:210 type = 1 #2 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2aaaac90a220 (LWP 39734)): #0 0x000000000042f8d7 in _wait_for_any_task (job=0x80ded0, waitflag=true) at mgr.c:1938 t = 0x0 status = 34560 pid = 5693840 completed = 0 jobacct = 0x0 rusage = {ru_utime = {tv_sec = 0, tv_usec = 9998}, ru_stime = {tv_sec = 0, tv_usec = 6998}, ru_maxrss = 4804, ru_ixrss = 0, ru_idrss = 0, ru_isrss = 0, ru_minflt = 2204, ru_majflt = 6, ru_nswap = 0, ru_inblock = 3160, ru_oublock = 24, ru_msgsnd = 0, ru_msgrcv = 0, ru_nsignals = 0, ru_nvcsw = 175, ru_nivcsw = 431} tmp_env = 0x7fffffffeae0 #1 0x000000000042fdad in _wait_for_all_tasks (job=0x80ded0) at mgr.c:2045 rc = 0 tasks_left = 1 i = 0 #2 0x000000000042e364 in job_manager (job=0x80ded0) at mgr.c:1256 rc = 0 io_initialized = true ckpt_type = 0x819340 "checkpoint/none" #3 0x0000000000429cdd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:160 cli = 0x803120 self = 0x0 msg = 0x80f890 job = 0x80ded0 ngids = 0 gids = 0x0 rc = 0 (gdb) No bt for 80341.0 Core was generated by `slurmstepd: [80341.0] '. Program terminated with signal 7, Bus error. #0 acct_gather_profile_type_t_name (type=10922) at slurm_acct_gather_profile.c:348 348 { Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 6 (Thread 0x2aaaac90a220 (LWP 39748)): #0 0x00002aaaab9f270a in wait4 () from /lib64/libc.so.6 No symbol table info available. 
#1 0x000000000042f8d7 in _wait_for_any_task (job=0x80df00, waitflag=true) at mgr.c:1938 t = 0x0 status = 0 pid = -131072 completed = 0 jobacct = 0x0 rusage = {ru_utime = {tv_sec = 5865130, tv_usec = 9917079481008}, ru_stime = {tv_sec = 5865131, tv_usec = 140737488349816}, ru_maxrss = 8669408, ru_ixrss = 8671520, ru_idrss = 140737488350000, ru_isrss = 4393323, ru_minflt = 4294961856, ru_majflt = 8445696, ru_nswap = 140733193388032, ru_inblock = 0, ru_oublock = 25769803776, ru_msgsnd = 0, ru_msgrcv = 0, ru_nsignals = 416611822512, ru_nvcsw = 8486464, ru_nivcsw = 25778119664} tmp_env = 0x0 #2 0x000000000042fdad in _wait_for_all_tasks (job=0x80df00) at mgr.c:2045 rc = 0 tasks_left = 6 i = 0 #3 0x000000000042e364 in job_manager (job=0x80df00) at mgr.c:1256 rc = 0 io_initialized = true ckpt_type = 0x81b4c0 "checkpoint/none" #4 0x0000000000429cdd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:160 cli = 0x803120 self = 0x80cfa0 msg = 0x80fda0 job = 0x80df00 ngids = 0 gids = 0x0 rc = 0 Thread 5 (Thread 0x2aaaae258700 (LWP 39749)): #0 0x00002aaaab73463c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x0000000000524ae2 in _watch_tasks (arg=0x0) at slurm_jobacct_gather.c:210 type = 1 #2 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 4 (Thread 0x2aaab8324700 (LWP 39753)): ---Type <return> to continue, or q <return> to quit--- #0 0x00002aaaaba25113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044da8e in _poll_internal (pfds=0x2aaabc0008d0, nfds=14, shutdown_time=0) at eio.c:351 n = 16777235 timeout = -1 #2 0x000000000044d904 in eio_handle_mainloop (eio=0x817ce0) at eio.c:318 retval = 0 pollfds = 0x2aaabc0008d0 map = 0x2aaabc000990 maxnfds = 19 nfds = 14 n = 19 __func__ = "eio_handle_mainloop" #3 0x0000000000438041 in _io_thr (arg=0x80df00) at io.c:1509 job = 0x80df00 set = {__val = {4097, 0 <repeats 15 times>}} rc = 0 #4 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2aaaaf086700 (LWP 39752)): #0 0x00002aaaaba25113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044da8e in _poll_internal (pfds=0x2aaab4000dd0, nfds=8, shutdown_time=0) at eio.c:351 n = 16777223 timeout = -1 #2 0x000000000044d904 in eio_handle_mainloop (eio=0x2aaab40008d0) at eio.c:318 retval = 0 pollfds = 0x2aaab4000dd0 map = 0x2aaab4000e30 maxnfds = 7 nfds = 8 n = 7 __func__ = "eio_handle_mainloop" #3 0x00002aaaaec72a9e in _agent (unused=0x0) at agent.c:327 tree_listen_obj = 0x2aaab4000910 task_obj = 0x2aaab4000cd0 i = 6 #4 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 ---Type <return> to continue, or q <return> to quit--- No symbol table info available. Thread 2 (Thread 0x2aaaae45a700 (LWP 39751)): #0 0x00002aaaaba25113 in poll () from /lib64/libc.so.6 No symbol table info available. 
#1 0x000000000044da8e in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:351 n = 16777217 timeout = -1 #2 0x000000000044d904 in eio_handle_mainloop (eio=0x81b480) at eio.c:318 retval = 0 pollfds = 0x2aaab00008d0 map = 0x2aaab0000900 maxnfds = 1 nfds = 2 n = 1 __func__ = "eio_handle_mainloop" #3 0x000000000043afcb in _msg_thr_internal (job_arg=0x80df00) at req.c:240 job = 0x80df00 #4 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2aaaae359700 (LWP 39750)): #0 acct_gather_profile_type_t_name (type=10922) at slurm_acct_gather_profile.c:348 No locals. #1 0x0000000000528dcf in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:159 i = 1 now = 1464882563 diff = 30 tv1 = {tv_sec = 1464882563, tv_usec = 510484} tv2 = {tv_sec = 1464882562, tv_usec = 510423} tv_str = "usec=1", '\000' <repeats 13 times> delta_t = 1 __func__ = "_timer_thread" #2 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 No symbol table info available. (gdb) Yet another patched slurmstepd crash: Core was generated by `slurmstepd: [77945] '. Program terminated with signal 7, Bus error. #0 acct_gather_profile_type_t_name (type=10922) at slurm_acct_gather_profile.c:348 348 { Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 glibc-2.12-1.166.el6_7.3.x86_64 hwloc-1.5-3.el6_5.x86_64 libxml2-2.7.6-20.el6.x86_64 munge-0.5.11-99_cm7.1.x86_64 nss-pam-ldapd-0.7.5-20.el6_6.3.x86_64 nss-softokn-freebl-3.14.3-23.el6_7.x86_64 numactl-2.0.9-2.el6.x86_64 pam-1.1.1-20.el6_7.1.x86_64 pciutils-libs-3.1.10-4.el6.x86_64 zlib-1.2.3-29.el6.x86_64 (gdb) thread apply all bt full Thread 4 (Thread 0x2aaaae258700 (LWP 129014)): #0 0x00002aaaab73463c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 No symbol table info available. #1 0x0000000000524ae2 in _watch_tasks (arg=0x0) at slurm_jobacct_gather.c:210 type = 1 #2 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 3 (Thread 0x2aaaac90a220 (LWP 129013)): #0 0x00002aaaab9f270a in wait4 () from /lib64/libc.so.6 No symbol table info available. 
#1 0x000000000042f8d7 in _wait_for_any_task (job=0x80ded0, waitflag=true) at mgr.c:1938 t = 0x0 status = 0 pid = 5693840 completed = 0 jobacct = 0x0 rusage = {ru_utime = {tv_sec = 140737488350416, tv_usec = 4649738}, ru_stime = {tv_sec = 5885624, tv_usec = 8464144}, ru_maxrss = 8481552, ru_ixrss = 0, ru_idrss = 140737488350000, ru_isrss = 4392728, ru_minflt = 4294961856, ru_majflt = 8445648, ru_nswap = 140737488350416, ru_inblock = 0, ru_oublock = 0, ru_msgsnd = 4624558, ru_msgrcv = 6020992, ru_nsignals = 416611822512, ru_nvcsw = 8481552, ru_nivcsw = 8315888} tmp_env = 0x7fffffffeae0 #2 0x000000000042fdad in _wait_for_all_tasks (job=0x80ded0) at mgr.c:2045 rc = 0 tasks_left = 1 i = 0 #3 0x000000000042e364 in job_manager (job=0x80ded0) at mgr.c:1256 rc = 0 io_initialized = true ckpt_type = 0x8180a0 "checkpoint/none" #4 0x0000000000429cdd in main (argc=1, argv=0x7fffffffecd8) at slurmstepd.c:160 cli = 0x803120 self = 0x0 msg = 0x80f420 job = 0x80ded0 ngids = 0 gids = 0x0 rc = 0 Thread 2 (Thread 0x2aaaae45a700 (LWP 129016)): ---Type <return> to continue, or q <return> to quit--- #0 0x00002aaaaba25113 in poll () from /lib64/libc.so.6 No symbol table info available. #1 0x000000000044da8e in _poll_internal (pfds=0x2aaab00008d0, nfds=2, shutdown_time=0) at eio.c:351 n = 16777217 timeout = -1 #2 0x000000000044d904 in eio_handle_mainloop (eio=0x817f30) at eio.c:318 retval = 0 pollfds = 0x2aaab00008d0 map = 0x2aaab0000900 maxnfds = 1 nfds = 2 n = 1 __func__ = "eio_handle_mainloop" #3 0x000000000043afcb in _msg_thr_internal (job_arg=0x80ded0) at req.c:240 job = 0x80ded0 #4 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #5 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 No symbol table info available. Thread 1 (Thread 0x2aaaae359700 (LWP 129015)): #0 acct_gather_profile_type_t_name (type=10922) at slurm_acct_gather_profile.c:348 No locals. #1 0x0000000000528dcf in _timer_thread (args=0x0) at slurm_acct_gather_profile.c:159 i = 1 now = 1464882548 diff = 30 tv1 = {tv_sec = 1464882548, tv_usec = 855368} tv2 = {tv_sec = 1464882547, tv_usec = 855318} tv_str = "usec=0", '\000' <repeats 13 times> delta_t = 0 __func__ = "_timer_thread" #2 0x00002aaaab730a51 in start_thread () from /lib64/libpthread.so.0 No symbol table info available. #3 0x00002aaaaba2e93d in clone () from /lib64/libc.so.6 No symbol table info available. (gdb) Moe, I have to repeat my question from comment 72: Is there any way to auto-provision start of slurmstepd under valgrind? As was stated: "... Unless the application is doing lots of unbuffered I/O (srun --unbuffered option), slurmstepd is going to be sleeping 99.999% of the time." This is exactly our case (since we do not use `--unbuffered'). We could easily live with all slurmstepds under valgrind. Created attachment 3176 [details]
run slurmstepd under valgrind
The attached patch will run the slurmstepd process under valgrind's control. The valgrind output will appear in the /tmp directory with names that include the Slurm job ID and step ID (or "4294967294" for a batch job). For example:
$ ls /tmp/slurm*
slurmstepd_valgrind_4895.0
slurmstepd_valgrind_4895.1
slurmstepd_valgrind_4895.4294967294
I hope this helps.
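For reference, the wrapper just substitutes an exec of valgrind for the direct exec of slurmstepd. A minimal sketch of the idea; the valgrind path, the option list, and the job_id/step_id variable names are illustrative assumptions, not the exact attachment:

    /* Sketch: in slurmd, launch slurmstepd under valgrind.
     * conf->stepd_loc is the configured slurmstepd path;
     * job_id and step_id are hypothetical names here. */
    char log_opt[256];
    snprintf(log_opt, sizeof(log_opt),
             "--log-file=/tmp/slurmstepd_valgrind_%u.%u", job_id, step_id);
    char *const argv[] = { "/usr/bin/valgrind", "--tool=memcheck",
                           "--error-limit=no", "--child-silent-after-fork=yes",
                           log_opt, (char *)conf->stepd_loc, NULL };
    execv(argv[0], argv);

The drd experiment later in this thread (comment 115) uses exactly this argv-substitution approach with different valgrind options.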
Hi,

Upgraded Slurm to 15.08.12 + https://bugs.schedmd.com/attachment.cgi?id=3176

After a test run I noticed:

[root@dcalph050 tmp]# cat slurmstepd_valgrind_81752.0
==60617== Memcheck, a memory error detector
==60617== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==60617== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==60617== Command: /cm/shared/apps/slurm/15.08.12/sbin/slurmstepd
==60617== Parent PID: 1
==60617==
==60617== Conditional jump or move depends on uninitialised value(s)
==60617==    at 0x4C2B017: __GI___rawmemchr (mc_replace_strmem.c:1158)
==60617==    by 0x5B0270F: _IO_str_init_static_internal (in /lib64/libc-2.12.so)
==60617==    by 0x5AF1A94: __isoc99_vsscanf (in /lib64/libc-2.12.so)
==60617==    by 0x5AF1A27: __isoc99_sscanf (in /lib64/libc-2.12.so)
==60617==    by 0x8CA945E: _get_myname (kill_tree.c:141)
==60617==    by 0x8CA94F8: _build_hashtbl (kill_tree.c:163)
==60617==    by 0x8CA9D4C: proctrack_linuxproc_get_pids (kill_tree.c:366)
==60617==    by 0x8CA918C: proctrack_p_get_pids (proctrack_linuxproc.c:157)
==60617==    by 0x445AD4: proctrack_g_get_pids (proctrack.c:461)
==60617==    by 0x8495114: _get_precs (common_jag.c:532)
==60617==    by 0x8495DB8: jag_common_poll_data (common_jag.c:839)
==60617==    by 0x8493E02: jobacct_gather_p_poll_data (jobacct_gather_linux.c:201)
==60617==    by 0x524858: _poll_data (slurm_jobacct_gather.c:178)
==60617==    by 0x5252C1: jobacct_gather_add_task (slurm_jobacct_gather.c:400)
==60617==    by 0x42F1F3: _fork_all_tasks (mgr.c:1784)
==60617==    by 0x42E131: job_manager (mgr.c:1216)
==60617==    by 0x429D1C: main (slurmstepd.c:160)
==60617==  Uninitialised value was created by a stack allocation
==60617==    at 0x8CA9361: _get_myname (kill_tree.c:126)
==60617==
==60617==
==60617== HEAP SUMMARY:
==60617==     in use at exit: 155,103 bytes in 1,542 blocks
==60617==   total heap usage: 20,074 allocs, 18,532 frees, 3,300,510 bytes allocated
==60617==
==60617== LEAK SUMMARY:
==60617==    definitely lost: 3,523 bytes in 22 blocks
==60617==    indirectly lost: 6,644 bytes in 96 blocks
==60617==      possibly lost: 86,541 bytes in 964 blocks
==60617==    still reachable: 58,395 bytes in 460 blocks
==60617==         suppressed: 0 bytes in 0 blocks
==60617== Rerun with --leak-check=full to see details of leaked memory
==60617==
==60617== For counts of detected and suppressed errors, rerun with: -v
==60617== ERROR SUMMARY: 8262 errors from 1 contexts (suppressed: 6 from 6)
[root@dcalph050 tmp]#

Shall I ignore that?

Slurm's buffer for reading the "stat" files ("/proc/#/stat") of processes associated with a job is 1024 bytes long. If there is some really long command name, that 1024-byte buffer could be filled, and the scan function used to read process IDs could read beyond the end of the buffer, which could generate an invalid memory reference error. Do some of your jobs have particularly long process names? That might be something unique about your workload that is triggering these Slurm errors.

(In reply to Sergey Meirovich from comment #100)
> Upgraded Slurm to 15.08.12 + https://bugs.schedmd.com/attachment.cgi?id=3176
> ...
> Shall I ignore that?

Created attachment 3188 [details]
Prevent buffer overflow reading stat file
This allocates 4096 bytes on the heap for reading the stat file (rather than 1024 bytes on the stack) and checks for an overflow before parsing. My best guess is that this is behind the slurmstepd memory reference errors.
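The shape of the change is roughly as follows; this is a sketch of the description above, not the literal attachment (the buffer-size macro and return convention are assumptions):

    /* Sketch: read /proc/<pid>/stat into a 4 KB heap buffer and skip the
     * entry if the read filled the buffer (possible truncation). */
    #define STAT_BUF_SZ 4096
    char *buf = xmalloc(STAT_BUF_SZ);
    ssize_t n = read(fd, buf, STAT_BUF_SZ - 1);
    if ((n <= 0) || (n >= (STAT_BUF_SZ - 1))) {
        xfree(buf);        /* read error or possible overflow: skip it */
        return -1;
    }
    buf[n] = '\0';         /* now safe to sscanf() the buffer */
    ...
    xfree(buf);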
Hi Moe,

What is strange is that job 81752's command name was:
hostname

That was simply:
srun hostname

(In reply to Sergey Meirovich from comment #103)
> What is strange is that job 81752's command name was:
> hostname
>
> That was simply:
> srun hostname

That function is reading all records in /proc, so all it takes is one process with a large "stat" file. Since this isn't a real file, I can't get the size before the read operation. What I did change is to malloc the buffer and make it 4k in size rather than 1k, plus skip any file that reaches the buffer size. Will this fix the problem that you see? I have never seen the slurmstepd aborts or memory corruption that you have reported. However, I am able to reproduce the valgrind error that you reported, and it goes away with this change. On the other hand, Slurm is not reading more data than there is space in the buffer, and it is only scanning the first few fields, so it may or may not fix the bad memory references...

Hi,

We have got two new crashes (now under valgrind). The logs are identical. These are from before applying your last patch. There is not much in these logs. E.g.:

[root@dcalph018 tmp]# cat slurmstepd_valgrind_81999.0
==11012== Memcheck, a memory error detector
==11012== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==11012== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==11012== Command: /cm/shared/apps/slurm/15.08.12/sbin/slurmstepd
==11012== Parent PID: 1
==11012==
==11012== Conditional jump or move depends on uninitialised value(s)
==11012==    at 0x4C2B017: __GI___rawmemchr (mc_replace_strmem.c:1158)
==11012==    by 0x5B0270F: _IO_str_init_static_internal (in /lib64/libc-2.12.so)
==11012==    by 0x5AF1A94: __isoc99_vsscanf (in /lib64/libc-2.12.so)
==11012==    by 0x5AF1A27: __isoc99_sscanf (in /lib64/libc-2.12.so)
==11012==    by 0x8CA945E: _get_myname (kill_tree.c:141)
==11012==    by 0x8CA94F8: _build_hashtbl (kill_tree.c:163)
==11012==    by 0x8CA9D4C: proctrack_linuxproc_get_pids (kill_tree.c:366)
==11012==    by 0x8CA918C: proctrack_p_get_pids (proctrack_linuxproc.c:157)
==11012==    by 0x445AD4: proctrack_g_get_pids (proctrack.c:461)
==11012==    by 0x8495114: _get_precs (common_jag.c:532)
==11012==    by 0x8495DB8: jag_common_poll_data (common_jag.c:839)
==11012==    by 0x8493E02: jobacct_gather_p_poll_data (jobacct_gather_linux.c:201)
==11012==    by 0x524858: _poll_data (slurm_jobacct_gather.c:178)
==11012==    by 0x5252C1: jobacct_gather_add_task (slurm_jobacct_gather.c:400)
==11012==    by 0x42F1F3: _fork_all_tasks (mgr.c:1784)
==11012==    by 0x42E131: job_manager (mgr.c:1216)
==11012==    by 0x429D1C: main (slurmstepd.c:160)
==11012==  Uninitialised value was created by a stack allocation
==11012==    at 0x8CA9361: _get_myname (kill_tree.c:126)
==11012==
--11012-- VALGRIND INTERNAL ERROR: Valgrind received a signal 7 (SIGBUS) - exiting
--11012-- si_code=2;  Faulting address: 0x4354E7;  sp: 0x40c8373e0

valgrind: the 'impossible' happened:
   Killed by fatal signal
==11012==    at 0x38171CEE: disInstr_AMD64_WRK (guest_amd64_toIR.c:30568)
==11012==    by 0x3817336B: disInstr_AMD64 (guest_amd64_toIR.c:31154)
==11012==    by 0x380E834A: bb_to_IR (guest_generic_bb_to_IR.c:354)
==11012==    by 0x380D3203: LibVEX_Translate (main_main.c:599)
==11012==    by 0x3804AACE: vgPlain_translate (m_translate.c:1559)
==11012==    by 0x3807892A: handle_chain_me (scheduler.c:1019)
==11012==    by 0x38079CC0: vgPlain_scheduler (scheduler.c:1323)
==11012==    by 0x380A6409: run_a_thread_NORETURN (syswrap-linux.c:103)
==11012==    by 0x380A669A: vgModuleLocal_start_thread_NORETURN (syswrap-linux.c:304)
==11012==    by 0x380A94ED: ??? (in /usr/lib64/valgrind/memcheck-amd64-linux)
==11012==    by 0xDEADBEEFDEADBEEE: ???
==11012==    by 0xDEADBEEFDEADBEEE: ???
==11012==    by 0xDEADBEEFDEADBEEE: ???

sched status:
  running_tid=6

Thread 1: status = VgTs_WaitSys
==11012==    at 0x5B3970A: wait4 (in /lib64/libc-2.12.so)
==11012==    by 0x42F738: _wait_for_any_task (mgr.c:1934)
==11012==    by 0x42FC0E: _wait_for_all_tasks (mgr.c:2041)
==11012==    by 0x42E1E6: job_manager (mgr.c:1250)
==11012==    by 0x429D1C: main (slurmstepd.c:160)

Thread 2: status = VgTs_WaitSys
==11012==    at 0x587B63C: pthread_cond_wait@@GLIBC_2.3.2 (in /lib64/libpthread-2.12.so)
==11012==    by 0x5249C9: _watch_tasks (slurm_jobacct_gather.c:210)
==11012==    by 0x5877A50: start_thread (in /lib64/libpthread-2.12.so)
==11012==    by 0x41426FF: ???

Thread 3: status = VgTs_WaitSys
==11012==    at 0x5B39A7D: ??? (in /lib64/libc-2.12.so)
==11012==    by 0x5B6EC23: usleep (in /lib64/libc-2.12.so)
==11012==    by 0x528FE7: _timer_thread (slurm_acct_gather_profile.c:171)
==11012==    by 0x5877A50: start_thread (in /lib64/libpthread-2.12.so)
==11012==    by 0x87986FF: ???

Thread 4: status = VgTs_WaitSys
==11012==    at 0x5B6C113: poll (in /lib64/libc-2.12.so)
==11012==    by 0x44D98D: _poll_internal (eio.c:351)
==11012==    by 0x44D803: eio_handle_mainloop (eio.c:318)
==11012==    by 0x43AE3E: _msg_thr_internal (req.c:240)
==11012==    by 0x5877A50: start_thread (in /lib64/libpthread-2.12.so)
==11012==    by 0x88996FF: ???

Thread 5: status = VgTs_WaitSys
==11012==    at 0x5B6C113: poll (in /lib64/libc-2.12.so)
==11012==    by 0x44D98D: _poll_internal (eio.c:351)
==11012==    by 0x44D803: eio_handle_mainloop (eio.c:318)
==11012==    by 0x90B1A9D: _agent (agent.c:327)
==11012==    by 0x5877A50: start_thread (in /lib64/libpthread-2.12.so)
==11012==    by 0x94C56FF: ???

Thread 6: status = VgTs_Runnable
==11012==    at 0x4354E7: _client_read (io.c:335)
==11012==    by 0x44DE33: _poll_handle_event (eio.c:482)
==11012==    by 0x44DBF8: _poll_dispatch (eio.c:434)
==11012==    by 0x44D856: eio_handle_mainloop (eio.c:324)
==11012==    by 0x437EB4: _io_thr (io.c:1509)
==11012==    by 0x5877A50: start_thread (in /lib64/libpthread-2.12.so)
==11012==    by 0x97D16FF: ???

Thread 7: status = VgTs_Init
==11012==    at 0x5B75901: clone (in /lib64/libc-2.12.so)

Note: see also the FAQ in the source distribution. It contains workarounds to several common problems. In particular, if Valgrind aborted or crashed after identifying problems in your program, there's a good chance that fixing those problems will prevent Valgrind aborting or crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org
In the bug report, send all the above text, the valgrind version, and what OS and version you are using. Thanks.

[root@dcalph018 tmp]#

I also need to add that your last patch doesn't fix the warning for me. A simple "srun hostname" still leaves this in the valgrind log:

==10367== Memcheck, a memory error detector
==10367== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==10367== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==10367== Command: /cm/shared/apps/slurm/15.08.12/sbin/slurmstepd
==10367== Parent PID: 1
==10367==
==10367== Conditional jump or move depends on uninitialised value(s)
==10367==    at 0x4C2B017: __GI___rawmemchr (mc_replace_strmem.c:1158)
==10367==    by 0x5B0270F: _IO_str_init_static_internal (in /lib64/libc-2.12.so)
==10367==    by 0x5AF1A94: __isoc99_vsscanf (in /lib64/libc-2.12.so)
==10367==    by 0x5AF1A27: __isoc99_sscanf (in /lib64/libc-2.12.so)
==10367==    by 0x8CA945E: _get_myname (kill_tree.c:141)
==10367==    by 0x8CA94F8: _build_hashtbl (kill_tree.c:164)
==10367==    by 0x8CA9DB0: proctrack_linuxproc_get_pids (kill_tree.c:370)
==10367==    by 0x8CA918C: proctrack_p_get_pids (proctrack_linuxproc.c:157)
==10367==    by 0x445AD4: proctrack_g_get_pids (proctrack.c:461)
==10367==    by 0x8495114: _get_precs (common_jag.c:532)
==10367==    by 0x8495DB8: jag_common_poll_data (common_jag.c:839)
==10367==    by 0x8493E02: jobacct_gather_p_poll_data (jobacct_gather_linux.c:201)
==10367==    by 0x524858: _poll_data (slurm_jobacct_gather.c:178)
==10367==    by 0x5252C1: jobacct_gather_add_task (slurm_jobacct_gather.c:400)
==10367==    by 0x42F1F3: _fork_all_tasks (mgr.c:1784)
==10367==    by 0x42E131: job_manager (mgr.c:1216)
==10367==    by 0x429D1C: main (slurmstepd.c:160)
==10367==  Uninitialised value was created by a stack allocation
==10367==    at 0x8CA9361: _get_myname (kill_tree.c:126)
==10367==
==10367==
==10367== HEAP SUMMARY:
==10367==     in use at exit: 154,575 bytes in 1,542 blocks
==10367==   total heap usage: 18,319 allocs, 16,777 frees, 3,256,092 bytes allocated
==10367==
==10367== LEAK SUMMARY:
==10367==    definitely lost: 3,523 bytes in 22 blocks
==10367==    indirectly lost: 6,500 bytes in 96 blocks
==10367==      possibly lost: 86,157 bytes in 964 blocks
==10367==    still reachable: 58,395 bytes in 460 blocks
==10367==         suppressed: 0 bytes in 0 blocks
==10367== Rerun with --leak-check=full to see details of leaked memory
==10367==
==10367== For counts of detected and suppressed errors, rerun with: -v
==10367== ERROR SUMMARY: 10 errors from 1 contexts (suppressed: 6 from 6)

Hi Moe,

I am wondering whether valgrind is the proper tool to track possible stack corruption. It looks like it doesn't detect array overruns: http://valgrind.org/docs/manual/faq.html#faq.overruns

Also, contrary to what is said at that link, exp-sgcheck is not detecting stack overruns for me either:

-bash-4.1$ cat t.c
int stat[5];

int main(void)
{
	int stack[5];

	stat[12] = 0;
	stack[12] = 0;
	return 0;
}
-bash-4.1$ gcc -g -o t t.c
-bash-4.1$ /sw/valgrind/valgrind-3.11.0/bin/valgrind --tool=exp-sgcheck ./t
==3004== exp-sgcheck, a stack and global array overrun detector
==3004== NOTE: This is an Experimental-Class Valgrind Tool
==3004== Copyright (C) 2003-2015, and GNU GPL'd, by OpenWorks Ltd et al.
==3004== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==3004== Command: ./t
==3004==
==3004==
==3004== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
-bash-4.1$

Hi Moe,

I left a bug entry for valgrind: https://bugs.kde.org/show_bug.cgi?id=364058

Please also read my two previous comments.

The situation is worrying me. It's 5+ months since we opened this bug. There has indeed been progress, and you've identified several bugs in various parts of Slurm, but there still seems to be some bug in the slurmstepd area that affects us. It might also be a misapplication on our side, but the chance of a bug in Slurm seems drastically higher to me.

In our situation slurmstepd performance is not critical at all, and we could run it with almost any heavy machinery to track down the issue. Do you have any idea about any other way to track the issue?

The commit below will fix this particular valgrind error, but the chance of this resulting in any sort of memory corruption, a SEGV, or a SIGBUS is low:
https://github.com/SchedMD/slurm/commit/2bbc0add2893101b3b8b3ced164091faf7ff1add

(In reply to Sergey Meirovich from comment #106)
> I also need to add that your last patch doesn't fix the warning for me.

(In reply to Sergey Meirovich from comment #108)
> I left a bug entry for valgrind: https://bugs.kde.org/show_bug.cgi?id=364058

Thanks.

> In our situation slurmstepd performance is not critical at all, and we could
> run it with almost any heavy machinery to track down the issue.
>
> Do you have any idea about any other way to track the issue?

I am investigating another tool called Insure++ from Parasoft, but I'm not sure that it offers any real benefit over Valgrind. I haven't been able to reproduce the error here, and nobody else has reported this problem on the thousands of clusters running Slurm. Without a reproducer or a clear indication of where the problem is, I'm basically looking for a needle in a haystack.

How frequently are you seeing the slurmstepd abort?
How many jobs do you run each day?
Would you be willing to disable accounting for a while to see if that makes a difference?

(In reply to Moe Jette from comment #110)
> I am investigating another tool called Insure++ from Parasoft, but I'm not
> sure that it offers any real benefit over Valgrind. I haven't been able to
> reproduce the error here, and nobody else has reported this problem on the
> thousands of clusters running Slurm. Without a reproducer or a clear
> indication of where the problem is, I'm basically looking for a needle in a
> haystack.

That bug might be some race, right? In that case it would be rather hard to reproduce on demand. And a lot of bugs are unique to this or that user. I think we are using preemption more aggressively than anyone else, and that might be related.

> How frequently are you seeing the slurmstepd abort?

2-3 per week on average. Sometimes 10 per day. Sometimes no slurmstepd crashes for weeks.

> How many jobs do you run each day?

~500 per day.

> Would you be willing to disable accounting for a while to see if that makes
> a difference?

We need to discuss that with my team. But that could obviously be a challenge for us, since "sometimes no slurmstepd crashes for weeks."

(In reply to Sergey Meirovich from comment #111)
> That bug might be some race, right?

That seems likely given the difficulties in reproducing the failure.

> In that case it would be rather hard to reproduce on demand. And a lot of
> bugs are unique to this or that user.

If you discover any common factor in those user jobs, be sure to share that information.

> I think we are using preemption more aggressively than anyone else, and that
> might be related.

You are using preemption more aggressively than the vast majority of sites, so that might be a factor. However, some sites use gang scheduling, which suspends and resumes jobs every 30 seconds or so, far more frequently than you do.

https://github.com/SchedMD/slurm/commit/2bbc0add2893101b3b8b3ced164091faf7ff1add has indeed fixed this particular valgrind error.

By the way, why memcheck and not helgrind or drd? If this is a race, they could work better. E.g. I have just tried the following:

    char *const argv[10] = { "/sw/valgrind/valgrind-3.11.0/bin/valgrind",
                             "--tool=drd", "--error-limit=no",
                             "--max-stackframe=16777216", "--num-callers=20",
                             "--child-silent-after-fork=yes", log_file,
                             (char *)conf->stepd_loc, NULL };

I will attach the log shortly.

Created attachment 3197 [details]
valgrind: drd log
(In reply to Sergey Meirovich from comment #116)
> Created attachment 3197 [details]
> valgrind: drd log

These are all false positives. Virtually every error was generated by code looking like this, where thread 1 does work on a periodic basis:

    while (!shutdown) {
        /* do work */
    }

and thread 2 executes a shutdown function:

    stop_thread(void)
    {
        shutdown = true;
    }

The above will produce errors unless the variable "shutdown" is wrapped with a mutex. I have fixed most of them today, but it's adding overhead to the code for no benefit other than making helgrind and drd happy. In order to avoid possibly breaking something in the rather mature Slurm version 15.08, I have made the changes to version 16.05. Here is an example of one variable being wrapped with a mutex (a sketch of the pattern follows below):
https://github.com/SchedMD/slurm/commit/5a3a9a359f5a0254d7b5260d61eebb56fba79e71
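As an illustration of the pattern in that commit (a sketch only; the actual variable and function names in the Slurm source may differ):

    static bool shutdown = false;
    static pthread_mutex_t shutdown_mutex = PTHREAD_MUTEX_INITIALIZER;

    void stop_thread(void)
    {
        slurm_mutex_lock(&shutdown_mutex);
        shutdown = true;
        slurm_mutex_unlock(&shutdown_mutex);
    }

    static bool _shutting_down(void)
    {
        bool flag;
        slurm_mutex_lock(&shutdown_mutex);
        flag = shutdown;
        slurm_mutex_unlock(&shutdown_mutex);
        return flag;
    }

    /* worker thread */
    while (!_shutting_down()) {
        /* do work */
    }

This silences drd/helgrind because every access to the flag is now ordered by the same lock.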
I agree; in that case these are warnings rather than errors, and wrapping them in mutexes adds clear overhead ad hoc. By the way, just curious: why do you not use something like liburcu or libatomic for this? Though I am not sure how drd/helgrind would react to liburcu or libatomic...

What do you recommend for us now:
a) still run with memcheck;
b) run with drd exactly my way (comment #115);
c) any other tests/adjustments to a or b?

Also, I think next time I should only provide you the log(s) corresponding to a slurmstepd crash under valgrind?

(In reply to Sergey Meirovich from comment #118)
> What do you recommend for us now:
> a) still run with memcheck;
> b) run with drd exactly my way (comment #115);
> c) any other tests/adjustments to a or b?

I'm doing some analysis with Coverity tools right now, but it's a needle-in-a-haystack problem. Since there seem to be a relatively small number of users triggering this, have you been able to identify any common factor in their jobs?

I'd recommend continuing to run with memcheck. My best guess is that either a rarely used code path or a race condition is triggering the failure.

Moe,

I understand that this is now a search for a needle in a haystack for you. That is why I started to look into valgrind's drd and helgrind; I was under the impression that memcheck is useless for race detection. I also thought about compiling slurmstepd with ThreadSanitizer, but that requires a more recent gcc or clang, and I don't like switching to another compiler, as that changes important circumstances (it might introduce miscompilations, or just hide the existing problem due to changed timings, if this is a race).

I forgot to add: I have failed to find a common factor for the failed jobs so far...

Also, sorry for the confusion: I didn't mean that this is specific to only a small number of our users. Here at AMAT it simply correlates with the number of jobs a user runs: the more jobs a user runs, the more frequently it happens to him or her. The same holds per application here. By "a lot of bugs are unique to this or that user" I just meant that I am not so surprised that your other users (AMAT is a single user to you) have never seen this. Obviously we have something unique on the AMAT side, but it is rather hard for us to tell what it is exactly, as we are not aware of the details of your other users.

Do you have any valgrind logs for slurmstepd processes that aborted yet?

Hi Moe,

Surprisingly, no new crashes since my last comment...

(In reply to Sergey Meirovich from comment #124)
> Surprisingly, no new crashes since my last comment...

Any more crashes?

Not so far. Not sure why.

Hi Moe,

I've finally noticed a similarity across the vast majority of this issue's cases: task 0. E.g.:

[2016-06-03T06:38:17.433] [81585.0] task 0 (93718) started 2016-06-03T06:38:17

Task 0 appears to have disappeared (most probably it exited on its own rather than died, as there is no record in the RHEL abrtd database). There is also no record of its (task 0's) exit or termination in the slurmd logs. All the other tasks continued to run as leftovers.

Does it make any sense to you?

(In reply to Sergey Meirovich from comment #127)
> I've finally noticed a similarity across the vast majority of this issue's
> cases: task 0 ... appears to have disappeared ...
>
> Does it make any sense to you?

The tasks/ranks are started in a simple loop, all in a single thread. The pseudocode looks like this:

    for (i = 0; i < number_of_tasks; i++) {
        /* fork/exec task */
        log("task %d started ...", i);
    }

I don't see how the fork/exec of tasks could happen without a log entry being written for each task, unless some shortage of resources (e.g. memory, disk space, etc.) caused the logging to fail. Slurm does have checks for memory allocation failures where it allocates memory, but I'm not sure we would trap all of the possible places where memory allocations can fail (e.g. various library calls). You aren't seeing any signs of a problem in syslog at those times, are you?

Hi Moe,

The startups of all tasks are logged fine. I mean there is no exit in the logs, and nothing in syslog. Task 0 always appears to be gone for any crashed slurmstepd; all the others still run as stray leftovers.

(In reply to Sergey Meirovich from comment #129)
> The startups of all tasks are logged fine. I mean there is no exit in the
> logs, and nothing in syslog.

I misunderstood your earlier comment. That just means the tasks/ranks started, and the slurmstepd process running them somehow died before the user tasks exited. That is in agreement with my understanding of the situation. The real problem is to determine what happened to the slurmstepd process, and we still don't have an explanation for that. We'll likely need a trace showing the failure to identify the cause. I suspect that whatever is behind the failure of task 0 of the application is also killing the slurmstepd.

I think it might be related to the fact that slurmstepd is the only daemon that is expected to survive an upgrade. If it had been swapped out before the upgrade happened, that could easily lead to a SIGBUS at any time after the upgrade. Please correct me if I am mistaken. I also see machinery for mlockall in slurmd but nothing like it in slurmstepd, and "... Memory locks are not inherited by a child created via fork(2) and are automatically removed (unlocked) during an execve(2) or when the process terminates."

Created attachment 3370 [details]
Patch proposed
Here is a patch to address the possible issue of slurmstepd being partially swapped out before an upgrade. It is written for 17.02, yet it applies cleanly to 15.08 and, with some offset, to 16.05.
Created attachment 3371 [details]
Patch proposed
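The heart of the proposed patch is a single mlockall(2) call early in slurmstepd startup; roughly (a sketch, not the literal diff):

    #include <sys/mman.h>

    /* Lock current and future pages of slurmstepd into RAM, so that
     * replacing the binary or plugins on disk during an upgrade cannot
     * cause stale pages to be faulted back in and raise SIGBUS. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0)
        error("slurmstepd: failed to lock pages into memory: %m");

Locking with MCL_FUTURE also covers plugins dlopen()ed after startup, at the cost of making all of slurmstepd's memory unswappable.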
(In reply to Sergey Meirovich from comment #132)
> I think it might be related to the fact that slurmstepd is the only daemon
> that is expected to survive an upgrade. If it had been swapped out before
> the upgrade happened, that could easily lead to a SIGBUS at any time after
> the upgrade.
>
> Please correct me if I am mistaken.

I believe the mlockall is there for performance reasons, and doing the same for slurmstepd would be reasonable. In terms of swapping being responsible for the SIGBUS, are you thinking in terms of shared libraries (the plugins) changing under slurmstepd and causing it to SIGBUS? The slurmd man page describes why the slurmd memory locking was added:

-M      Lock slurmd pages into system memory using mlockall(2) to disable
        paging of the slurmd process. This may help in cases where nodes are
        marked DOWN during periods of heavy swap activity. If the mlockall(2)
        system call is not available, an error will be printed to the log
        and slurmd will continue as normal.

(In reply to Moe Jette from comment #136)
> In terms of swapping being responsible for the SIGBUS, are you thinking in
> terms of shared libraries (the plugins) changing under slurmstepd and
> causing it to SIGBUS?

Yes, I am afraid of exactly this scenario, having in mind the somewhat random backtraces we have observed in this case, plus the fact that we had been upgrading Slurm rather frequently: mostly every 15.08.x release, and sometimes even more frequently (to apply this or that patch)... All in all, it seems quite possible to me that we were bitten by this scenario. I understand that the intentions behind mlockall in slurmd were unrelated to the scenario I referred to; that alone doesn't make the scenario invalid anyway.

That seems like the best explanation I have seen for the SIGBUS. I've applied your patch, and it will appear in version 16.05.4 when released, probably in late August. Thank you for your contribution.

Here is the commit:
https://github.com/SchedMD/slurm/commit/03cf4a5d09f18ef3c3abf9767c6cb58b0f635e96

(In reply to Moe Jette from comment #139)
> That seems like the best explanation I have seen for the SIGBUS. I've
> applied your patch, and it will appear in version 16.05.4 when released,
> probably in late August. Thank you for your contribution.

Are you still seeing failures? This memory locking code change has been causing problems for some systems which rely upon paging, so I have reverted this commit in our release. For now, I would suggest that you maintain this as a local patch if it helps. I plan to restore this functionality in the next major release of Slurm, version 17.02, along with configuration parameters to control what gets locked in memory (future memory or just current memory, by setting the slurm.conf parameter LaunchParameters=slurmstepd_memlock or LaunchParameters=slurmstepd_memlock_all). Here is the commit with the new optional memory locking logic:
https://github.com/SchedMD/slurm/commit/bd3f87e947ccb7737f904a207ffcf7a3299f8a4e

Hi,

We haven't seen any new problems with our patch applied, though to trigger the issue (if it has been correctly identified) we need to check after the next upgrade. Anyway, I don't see any problem in either maintaining our patch locally or merging https://github.com/SchedMD/slurm/commit/bd3f87e947ccb7737f904a207ffcf7a3299f8a4e into our local tree.

I just checked in another change which might possibly be responsible for the failures. The slurmstepd process makes use of several Slurm plugins. Most of the plugins are loaded when the process starts, but a couple could possibly be loaded later. If a plugin changes after slurmstepd starts and the new version is loaded, there could be inconsistencies in the API or data structures. I don't believe this is responsible for these failures because:
1. API or data structure changes in the plugins would be very rare, typically only when a major Slurm update happens (i.e. version 15.08.x to 16.05.x).
2. Plugins not loaded immediately are typically loaded soon thereafter, so the window for failure would be small.

In any case, this change does harden the code and will be in version 16.05.5 when released. The commit with the change is here:
https://github.com/SchedMD/slurm/commit/962f0ccec19eb8a23a43e825d800243403f6de54

There haven't been any reported problems in quite a while, so I'll close this based upon the changes made so far.
Created attachment 2597 [details]
slurm.conf

One of the processes of the job crashed with SIGBUS:

/cm/local/apps/slurm/var/spool/job03996/slurm_script: line 30: 108086 Bus error               srun ./$exe.exe
ERROR!

The job itself looks terminated to Slurm, with state NODE_FAIL:
==================================================================================================================
-sh-4.1$ sacct -X -j 3996
       JobID    JobName  Partition    Account  AllocCPUS      State               Start        NodeList ExitCode
------------ ---------- ---------- ---------- ---------- ---------- ------------------- --------------- --------
3996         enabler_t+ etch_norm+ e108292_e+          8  NODE_FAIL        Jan 11 18:48       dcalph055      1:0
-sh-4.1$

But all the other processes are still sitting on the node:

[root@dcalph055 ~]# grep 3996 /var/log/slurmd | grep started
[2016-01-11T18:48:39.606] [3996] task 0 (108081) started 2016-01-11T18:48:39
[2016-01-11T18:48:40.000] [3996.0] task 0 (108101) started 2016-01-11T18:48:40
[2016-01-11T18:48:40.001] [3996.0] task 1 (108102) started 2016-01-11T18:48:40
[2016-01-11T18:48:40.001] [3996.0] task 2 (108103) started 2016-01-11T18:48:40
[2016-01-11T18:48:40.002] [3996.0] task 3 (108104) started 2016-01-11T18:48:40
[2016-01-11T18:48:40.002] [3996.0] task 4 (108105) started 2016-01-11T18:48:40
[2016-01-11T18:48:40.002] [3996.0] task 5 (108106) started 2016-01-11T18:48:40
[2016-01-11T18:48:40.003] [3996.0] task 6 (108107) started 2016-01-11T18:48:40
[2016-01-11T18:48:40.003] [3996.0] task 7 (108108) started 2016-01-11T18:48:40
[root@dcalph055 ~]# ps -eo pid,stat,comm | egrep '108081|108102|108103|108104|108105|108106|108107|108108'
108102 R    crtrs_mpi_dev_s
108103 R    crtrs_mpi_dev_s
108104 R    crtrs_mpi_dev_s
108105 R    crtrs_mpi_dev_s
108106 R    crtrs_mpi_dev_s
108107 R    crtrs_mpi_dev_s
108108 R    crtrs_mpi_dev_s
[root@dcalph055 ~]#
==================================================================================================================

That is also visible via sinfo:
==================================================================================================================
[root@DCALPH000 ~]# sinfo -n dcalph055
HOSTNAMES STATE AVAIL FREE_MEM MEMORY CPUS CPU_LOAD CPUS(A/I/O/T)
dcalph055 idle  up         591   128951   36    32.00     0/36/0/36
[root@DCALPH000 ~]#
==================================================================================================================

So the orphaned processes are not merely lingering: they are runnable (R state) and keeping the node busy, with a CPU load of 32 on a node Slurm considers idle.