| Summary: | slurmctlds stuck after restart | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Cineca HPC Systems <hpc-sysmgt-info> |
| Component: | slurmctld | Assignee: | Nate Rini <nate> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 1 - System not usable | ||
| Priority: | --- | CC: | nate |
| Version: | 20.11.4 | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Cineca | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | |
| CLE Version: | | Version Fixed: | |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | ||
| Attachments: | slurmctld logs for primary controller; slurmctld logs for secondary controller; slurm configuration files | | |
Please attach the slurmctld log and slurm.conf.
Please call gcore against slurmctld and provide a backtrace:
> pgrep slurmctld | xargs gcore
> gdb -ex 't a a bt full' -c $PATH_TO_CORE $(which slurmctld)
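For reference, a minimal sketch of running the above end to end, assuming gdb and slurmctld debug symbols are installed (the loop and the output file names are illustrative, not part of the original instructions):
> # write a core.<pid> file for every running slurmctld process (in the current directory)
> pgrep slurmctld | xargs gcore
> # dump full backtraces of all threads from each core into a text file that can be attached here
> for c in core.*; do gdb -batch -ex 'thread apply all bt full' $(which slurmctld) "$c" > "$c.bt.txt"; done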
Created attachment 19195 [details]
slurmctld logs for primary controller
Created attachment 19196 [details]
slurmctld logs for secondary controller
Any luck with the backtrace? Please also attach slurm.conf (& friends).
Please open a new bug for this:
> slurmctld[46191]: error: High latency for 1000 calls to gettimeofday(): 707 microseconds
Please open a new bug for this too:
> slurmctld[47725]: error: Can't find parent id 21773 for assoc 21774, this should never happen.
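For context, the association ids in that message come from the accounting database; one hedged way to inspect the association hierarchy they refer to is via sacctmgr (assuming accounting is enabled; the exact format fields may vary between Slurm versions):
> # show the association tree with ids, to spot entries whose parent is missing
> sacctmgr show associations tree format=Cluster,Account,User,ID,ParentID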
(In reply to Nate Rini from comment #1)
> Please attach the slurmctld log and slurm.conf.
>
> Please call gcore against slurmctld and provide a backtrace:
> > pgrep slurmctld | xargs gcore
> > gdb -ex 't a a bt full' -c $PATH_TO_CORE $(which slurmctld)

Nate,

thanks a lot for the quick response. Luckily we seem to have solved the problem on our own.

Unfortunately we were not able to get the necessary debugging info with gcore because, after yet another restart, the slurmctld was able to complete startup.

Our best guess is that something was going wrong on the shared filesystem (glusterfs) which hosts the /slurmstate directory. Unfortunately the log files are not helpful in this case either, as no problem is reported (or was reported in the last months).

About the extremely long latency for gettimeofday() calls: it was caused by the clocksource being set to "hpet". Setting it to "tsc" reduced the latency by a factor of 10 and also seems, for some reason or by pure coincidence, to have 'unlocked' the filesystem problem.

I have uploaded the logs for analysis. Feel free to lower the priority of the issue or even close it, since the problem is solved for us. We are available if you want to perform further analysis.

Best Regards,
Marcello

(In reply to Cineca HPC Systems from comment #8)
> thanks a lot for the quick response. Luckily we seem to have solved the
> problem on our own.
>
> Unfortunately we were not able to get the necessary debugging info with
> gcore because, after yet another restart, the slurmctld was able to
> complete startup.

That is unfortunate.

> Our best guess is that something was going wrong on the shared filesystem
> (glusterfs) which hosts the /slurmstate directory. Unfortunately the log
> files are not helpful in this case either, as no problem is reported (or
> was reported in the last months).

In events like this, having slurmctld report to syslog instead of directly to a file may be beneficial.

> About the extremely long latency for gettimeofday() calls: it was caused by
> the clocksource being set to "hpet". Setting it to "tsc" reduced the latency
> by a factor of 10 and also seems, for some reason or by pure coincidence, to
> have 'unlocked' the filesystem problem.

That sounds like a fun bug for the gluster developers.

> I have uploaded the logs for analysis. Feel free to lower the priority of
> the issue or even close it, since the problem is solved for us. We are
> available if you want to perform further analysis.

slurmctld requires the filesystem hosting StateSaveLocation to be healthy and will generally fatal out if it is not. If this happens again, please update the ticket and we can continue debugging. The logs were essentially clean except for the other issues I pointed out, and those are unlikely to cause a crash.

I did also notice that the logs were at a high debug level and had multiple DebugFlags active. I would suggest returning to "verbose" and turning off those DebugFlags during normal production; a high level of logging will needlessly slow down Slurm.

Created attachment 19206 [details]
slurm configuration files
These are the slurm configuration files.
We are running a configless setup.
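As a side note on the hpet vs. tsc clocksource change described above, a minimal sketch for checking and switching the kernel clocksource at runtime (standard Linux sysfs paths; the echo is not persistent, and a kernel boot parameter such as clocksource=tsc would be needed to make it permanent):
> # show the clocksource currently in use and the ones the kernel offers
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> cat /sys/devices/system/clocksource/clocksource0/available_clocksource
> # switch to tsc at runtime (as root)
> echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource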
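On the logging suggestion above, a minimal illustrative slurm.conf excerpt for normal production; the parameter names are standard slurm.conf options, but the values are examples rather than this site's actual configuration:

# slurm.conf (excerpt)
SlurmctldDebug=verbose        # normal production verbosity
SlurmctldSyslogDebug=info     # also report to syslog, useful when the log filesystem misbehaves
# DebugFlags=...              # leave DebugFlags unset unless actively debugging a subsystem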
Dear Support,

we are experiencing a major problem after an ordinary configuration change (made to disable the slurmctld Prolog/Epilog) and the consequent slurmctld restart. What happens is the following:

- we start one of the slurmctld services
- the slurmctld starts up apparently correctly
- some of the s* commands respond properly: sdiag, scontrol ping, scontrol show conf
- most of the s* commands hang and contribute to reaching the maximum thread number: squeue, sinfo (which however seem to work in the very early stages... let's say the very first 30 seconds)

After a while the thread count reaches 256 and the slurmctld cannot be queried anymore. We see some strange messages such as:

slurmctld[43346]: error: High latency for 1000 calls to gettimeofday(): 737 microseconds

The sdiag output also seems to suggest there are no pending jobs (even though there should be hundreds if not thousands) and that the data are from the epoch:

[root@r000u17l01 ~]# sdiag
*******************************************************
sdiag output at Thu Apr 29 16:26:26 2021 (1619706386)
Data since      Thu Jan 01 01:00:00 1970 (0)
*******************************************************
Server thread count:  50
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 0
Jobs started:   0
Jobs completed: 0
Jobs canceled:  0
Jobs failed:    0

Job states ts:  Thu Jan 01 01:00:00 1970 (0)
Jobs pending:   0
Jobs running:   0

Main schedule statistics (microseconds):
        Last cycle:   0
        Max cycle:    0
        Total cycles: 0
        Cycles per minute: 0
        Last queue length: 0

Backfilling stats
        Total backfilled jobs (since last slurm start): 0
        Total backfilled jobs (since last stats cycle start): 0
        Total backfilled heterogeneous job components: 0
        Total cycles: 0
        Last cycle when: N/A
        Last cycle: 0
        Max cycle: 0
        Last depth cycle: 0
        Last depth cycle (try sched): 0
        Last queue length: 0
        Last table size: 0
        Latency for 1000 calls to gettimeofday(): 737 microseconds

Remote Procedure Call statistics by message type
        REQUEST_FED_INFO ( 2049) count:11 ave_time:352 total_time:3877

Remote Procedure Call statistics by user
        a07cmc00 ( 12522) count:7 ave_time:345 total_time:2419
        jalguaci ( 29828) count:2 ave_time:355 total_time:711
        akolsek0 ( 29428) count:1 ave_time:362 total_time:362
        dbaratel ( 31670) count:1 ave_time:385 total_time:385
        root ( 0) count:0 ave_time:0 total_time:0

Pending RPC statistics
        No pending RPCs

Can you help us debug and solve the issue?

Thanks in advance,
Marcello
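Not part of the original report, but a small sketch of how the symptoms above could be watched while they occur; the StateSaveLocation path (/slurmstate) is taken from the comments earlier in this ticket and the probe file name is illustrative:
> # watch the slurmctld server thread count reported by sdiag (the 256 limit mentioned above)
> watch -n 10 'sdiag | grep "Server thread count"'
> # confirm the state save directory, then do a crude synchronous write-latency probe on it
> scontrol show config | grep -i StateSaveLocation
> time dd if=/dev/zero of=/slurmstate/.latency_probe bs=4k count=1 oflag=sync
> rm -f /slurmstate/.latency_probe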