| Summary: | slurmd crashes in close proximity to an (unrelated, surely?) lustre message | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Phil Schwan <phils> |
| Component: | slurmd | Assignee: | David Bigagli <david> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | da, stuartm |
| Version: | 14.03.2 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | DownUnder GeoSolutions | | |
Description
Phil Schwan
2014-05-27 23:21:38 MDT
Comment 1
David Bigagli

Hi, if you have a slurmd core dump we would like to see the stack(s), to inspect and compare them. Do you still have the core files?

"Batch job missing from node" can happen in the following scenario:

1) A batch job starts on the execution host; slurmctld knows there is 1 job running on the host.
2) slurmd dies and is not able to launch the job.
3) When slurmd comes back up it checks the jobs running on the host, but it does not find any, because there is no entry in the SlurmdSpoolDir, so it reports no jobs on the node.
4) slurmctld purges the zombie job.

The question is: did you have 343 such failures in one day? This does sound quite strange... Basically, any situation in which there is a job-count mismatch between what is reported by slurmd and what is in the controller's memory will lead to zombie jobs that get purged.

I agree with you that a failure in Lustre should not affect slurmd if the spool and the binary are not on Lustre. I assume that /d/sw is a local file system. One thing that comes to mind is that some file-access operations internally call getcwd(), which in turn traverses the full path, eventually doing stat() on the Lustre file system.

David

Comment 2
Phil Schwan

(In reply to David Bigagli from comment #1)
> if you have a slurmd core dump we would like to see the stack(s) to
> inspect them and compare them. Do you still have the core files?

Hmm, I don't think so. Its CWD is /var/spool/slurmd, in which I don't see any cores. I suspect its ulimit for cores is probably 0. I'll try to get that changed going forward.

> the question is did you have 343 such failures in one day?

I don't think we had 343 crashes; I only found evidence of a couple dozen crashes in those 3 days. The rest of the "missing from node" messages are, I think, for some other reason.

> I agree with you that failure in Lustre should not affect slurmd if the spool
> and the binary are not in Lustre. I assume that /d/sw is a local file system.

Unfortunately, it looks like I told a fib.
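The kind of mount-point confusion being untangled here can be checked directly by asking the kernel which filesystem a path belongs to. A minimal sketch, assuming GNU coreutils `stat` and a POSIX `df`; the paths are the ones from this report and may not exist on other hosts:

```shell
#!/bin/sh
# Sketch: report which mounted filesystem each path actually lives on.
# The paths below are the ones discussed in this report; adjust as needed.
for p in /var/spool/slurmd /d/sw; do
    if [ ! -e "$p" ]; then
        echo "$p: does not exist on this host"
        continue
    fi
    # Filesystem type (e.g. lustre, nfs, ext4) holding the path:
    printf '%s: type=%s\n' "$p" "$(stat -f -c %T "$p")"
    # Backing device/export and mount point:
    df -P "$p" | awk 'NR==2 {print "  " $1 " mounted on " $NF}'
done
```

If `type` comes back as `lustre` for the spool path, the close() errors and slurmd's state files really are on the same filesystem.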
Having re-examined this with fresh eyes, I realise that I was tricked by a confusing hostname. That close() error *is* on the filesystem where the slurm spool is stored. I apologise for the confusion. /d/sw/ is a separate NFS filesystem, on which I can see no sign of trouble.

> One thing that comes to mind is that some file access operation call
> internally getcwd() which in turn traverses the full path eventually
> doing stat() on the Lustre file system.

I can't find the inodes referenced in that close() error -- specifically, it's not the inode of any file that remains in /var/spool/slurmd, or in the path leading to it -- and there are lots of different inodes when that message recurs. So I'm not quite sure what to make of it. But this close() is definitely happening on the filesystem with /var/spool/slurmd; sorry.

Comment 3
David Bigagli

Hi, did slurmd crash again, and were you able to get the stack?

David

Comment 4
Phil Schwan

I haven't wanted to go messing too much with the slurm environment while Stu was away, but he's back today, so we should be able to get that changed.

Comment 5
Phil Schwan

We've got the environment changed now, so cores should begin to accumulate. In typical fashion, there haven't been any crashes since then.

Comment 6
David Bigagli

Closing for now as it cannot be reproduced. We can reopen it if needed.

David
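For future triage of messages like the recurring close() error, an inode number from a log line can be compared against real files with standard tools. A hedged sketch: the inode number below is a placeholder, not one from the actual logs, and the spool path is the one from this report:

```shell
#!/bin/sh
# Sketch: relate an inode number from a kernel log message to real files.
INUM=12345                     # placeholder; substitute the logged inode
SPOOL=/var/spool/slurmd        # SlurmdSpoolDir in this report

# Inode of each component on the way to the spool directory, for
# comparison with the logged number:
for p in / /var /var/spool "$SPOOL"; do
    [ -e "$p" ] && printf '%s\t%s\n' "$(stat -c %i "$p")" "$p"
done

# Search the spool filesystem for a file with that inode; no output means
# the inode no longer has a visible name there (deleted, or on another fs):
if [ -d "$SPOOL" ]; then
    find "$SPOOL" -xdev -inum "$INUM" 2>/dev/null
fi
```

An inode that matches nothing under the spool tree is consistent with what was observed here: the files had already been unlinked by the time the log was read.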