I just updated Slurm on Sequoia to v2.5.1. There was a problem converting the state files and I did not have time to diagnose it, so I deleted all the state files. When I restarted, I saw this:

[2013-01-22T08:51:34-08:00] RMP20Ja225511530 not found in the state file, adding
[2013-01-22T08:51:34-08:00] debug: We don't allow X passthoughs
[2013-01-22T08:51:34-08:00] _fill_in_wires: we can't use this so return
[2013-01-22T08:51:34-08:00] fatal: I was unable to make the requested block.

Adam had seen this same problem over the weekend. Here is what he found: if hardware on the system is in an error state that would prevent the creation of a block, but the block was created before the hardware went into error, and you then do a startclean on Slurm, this happens. Slurm tries to re-create the block to sync up its internal state, and when it sees that its internal rules prevent that block creation, it exits.

The work-around is to delete any blocks that cause this error from the Blue Gene control system prior to doing the clean start of Slurm. The block name is the one in the log immediately before the error; in this case, RMP20Ja225511530.
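Since the work-around requires fishing the offending block name out of the log by hand, here is a small sketch that automates it. The log format and message strings are assumed from the excerpt above and may differ across Slurm versions; it keeps the last "not found in the state file" block name seen before the fatal message.

```shell
# Print the RMP* block name logged immediately before the
# "fatal: I was unable to make the requested block" message.
find_bad_block() {
    awk '/fatal: I was unable to make the requested block/ { print name; exit }
         /not found in the state file/ {
             for (i = 1; i <= NF; i++) if ($i ~ /^RMP/) name = $i
         }' "$1"
}
```

Run it against the controller log (path is an example), e.g. `find_bad_block /var/log/slurmctld.log`, and it prints the block to delete from the control system before the clean start.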
The following was in the bluegene.conf file at the time of the problem:

DenyPassthrough=A,X
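For context, DenyPassthrough is the bluegene.conf option that forbids dynamically created blocks from routing through ("passing through") midplanes in the listed torus dimensions. The debug line "We don't allow X passthoughs" suggests the saved block RMP20Ja225511530 relied on an X-dimension passthrough that this setting now denies, which is why Slurm refuses to re-create it. A minimal excerpt (comments are mine):

```
# bluegene.conf (excerpt)
# Forbid dynamically created blocks from using passthroughs in the
# A and X torus dimensions.  A block already recorded in the control
# system that relies on an X passthrough can no longer be re-created.
DenyPassthrough=A,X
```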
Having the state files would be very valuable. Perhaps the ".old" files still exist?

Any idea how RMP20Ja225511530 was generated? From what you say here, it sounds like it was generated outside of Slurm. Is that not the case?

Doing a startclean should remove all old blocks. Is this not what is happening? I wouldn't expect it to try to re-create them unless you have a non-dynamic setup.
Even having the logs from the bad state files would be helpful.
(In reply to comment #2)
> Having the state files would be very valuable. Perhaps the ".old" files
> still exist?
>
> Any idea how RMP20Ja225511530 was generated?
>
> From what you say here it sounds like it was generated outside of Slurm. Is
> that not the case?
>
> Doing a startclean should remove all old blocks. Is this not what is
> happening? I wouldn't expect it to try to re-create them unless you have a
> non-dynamic setup.

I'm afraid I did not take care to preserve the old state files for diagnosis. The system was in a dynamic setup. The slurmctld.log files are still there, as are some core files (in /tmp) from when the 2.4 state files were read in.

To summarize, there were two bugs: one when reading the 2.4 state files, the other when starting with no state files. I'm not too concerned with the first bug, as it is a one-time event. This ticket is for the second bug.

The block creation record is in the logs:

[2013-01-20T22:55:11] Record: BlockID:RMP20Ja225511530 Nodes:seq[4120x4220] Conn:T,T,T,T

You can log in and look at the slurmctld.log files, or I can mail them to you if that is easier.
If you could email them, that would be great. Any backtraces from the cores would be of interest as well; they probably point to the problem.
(In reply to comment #5)
> If you could email them it would be great. Any backtrace from the cores
> would be of interest as well. They probably point to the problem.

slurmctld.log and backtraces mailed.
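For anyone reproducing this, the backtraces can be pulled from the cores with gdb in batch mode. A minimal sketch, assuming gdb is installed and that the cores in /tmp were dumped by the slurmctld binary (both paths below are placeholders to adjust for your install):

```shell
# Print a full backtrace from every thread of each core file found in a
# directory.  Usage (paths are examples):
#   dump_backtraces /usr/sbin/slurmctld /tmp
dump_backtraces() {
    bin=$1
    dir=$2
    for core in "$dir"/core*; do
        [ -e "$core" ] || continue   # glob matched nothing: no cores found
        echo "=== backtrace for $core ==="
        # Batch mode: run the listed commands, then exit non-interactively.
        gdb -batch -ex "thread apply all bt full" "$bin" "$core"
    done
}
```

Running the unstripped binary (or one with matching debug symbols) against each core is what makes the resulting frames readable.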
The second bug reported in comment #4, an abort in the reservation logic when upgrading from Slurm v2.4 to v2.5.1, will be fixed in v2.5.2. The patch is here:

https://github.com/SchedMD/slurm/commit/604b869ecdd512e0aa4cc244e4138842e7556b1f
This is fixed by commit 6af6f6156a4f8bc71405b6ef22ad0d56f3e14d96 and will be in the next 2.5 release.