| Summary: | Fatal Error on BG/Q | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | 2.5.x | ||
| Hardware: | IBM BlueGene | ||
| OS: | Linux | ||
| Site: | LLNL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA SIte: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Don Lipari
2013-01-22 04:58:16 MST
The following was in the bluegene.conf file at the time of the problem:

DenyPassthrough=A,X

Having the state files would be very valuable. Perhaps the ".old" files still exist?

Any idea how RMP20Ja225511530 was generated?

From what you say here it sounds like it was generated outside of Slurm. Is that not the case?

Doing a startclean should remove all old blocks. Is this not what is happening? I wouldn't expect it to try to re-create them unless you have a non-dynamic setup.

Even having the logs from the bad state files would be helpful.

(In reply to comment #2)
> Having the state files would be very valuable. Perhaps the ".old" files
> still exist?
>
> Any idea how RMP20Ja225511530 was generated?
>
> From what you say here it sounds like it was generated outside of Slurm. Is
> that not the case?
>
> Doing a startclean should remove all old blocks. Is this not what is
> happening? I wouldn't expect it to try to re-create them unless you have a
> non-dynamic setup.

I'm afraid I did not take care to preserve the old state files for diagnosis. The system was in a dynamic setup. The slurmctld.log files are still there, as are some core files (in /tmp) from when the 2.4 state files were read in.

To summarize, there were two bugs: one when reading the 2.4 state files, the other when starting with no state files. I'm not too concerned with the first bug as this is a one-time only event. This ticket is for the second bug.

The block creation record is in the logs:

[2013-01-20T22:55:11] Record: BlockID:RMP20Ja225511530 Nodes:seq[4120x4220] Conn:T,T,T,T

You can log in and look at the slurmctld.logs or I can mail them to you if it is easier.

If you could email them it would be great. Any backtrace from the cores would be of interest as well. They probably point to the problem.

(In reply to comment #5)
> If you could email them it would be great. Any backtrace from the cores
> would be of interest as well. They probably point to the problem.

slurmctld.log and backtraces mailed.

The second bug reported in comment #4, an abort in the reservation logic when upgrading from slurm v2.4 to v2.5.1, will be fixed in v2.5.2. The patch is here:
https://github.com/SchedMD/slurm/commit/604b869ecdd512e0aa4cc244e4138842e7556b1f

This is fixed in patch 6af6f6156a4f8bc71405b6ef22ad0d56f3e14d96. Will be in the next 2.5 release.
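
For anyone trying to trace where a block such as RMP20Ja225511530 came from, the record line quoted in this ticket can be pulled out of a slurmctld.log with a short script. The following is only a minimal sketch: the function name `find_block_records` and the regular expression are hypothetical, and the pattern is modeled solely on the one record line quoted above, so other log layouts may not match.

```python
import re

# Assumption: the pattern mirrors the single "Record:" line quoted in this
# ticket. Other slurmctld.log record layouts are not covered.
RECORD_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+Record:\s+"
    r"BlockID:(?P<block_id>\S+)\s+"
    r"Nodes:(?P<nodes>\S+)\s+"
    r"Conn:(?P<conn>\S+)"
)


def find_block_records(lines, block_id=None):
    """Return parsed block records, optionally filtered by block ID.

    `lines` is any iterable of log lines (an open file works).
    This helper is hypothetical and not part of Slurm.
    """
    records = []
    for line in lines:
        match = RECORD_RE.match(line.strip())
        if match is None:
            continue  # not a block record line
        record = match.groupdict()
        # Split the connection types into one entry per dimension.
        record["conn"] = record["conn"].split(",")
        if block_id is None or record["block_id"] == block_id:
            records.append(record)
    return records


# The record line quoted in this ticket, used as sample input.
sample = [
    "[2013-01-20T22:55:11] Record: BlockID:RMP20Ja225511530 "
    "Nodes:seq[4120x4220] Conn:T,T,T,T",
]

for rec in find_block_records(sample, block_id="RMP20Ja225511530"):
    print(rec["timestamp"], rec["block_id"], rec["nodes"], rec["conn"])
```

To run it against a real log, pass an open file handle instead of `sample`, for example `find_block_records(open(path), block_id="RMP20Ja225511530")`, where `path` is wherever slurmctld.log lives on the system.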