Since the addition of the 1K, we have occasionally seen Slurm create a block for the non-existent rzuseq0001 node (R00-M1). smap shows this right now:

    no part RMP08Fe104921195 Error(F - T,T,T,T 512 rzuseq0001

This happened after the following sequence of events. I removed all blocks related to R1. I ran a 1K job in pbatch, which is set up for rzuseq[0010x0011] (i.e. R1-M[0-1]). That worked. Then I submitted a 512 node job. Slurm created the bogus block above, then freed and deleted the 1K block (RMP08Fe104734636), and then attempted to boot the bad block (RMP08Fe104921195). That failed, which is why it is in error now. Now we have this:

    $ sinfo
    PARTITION AVAIL TIMELIMIT NODES STATE MIDPLANELIST
    pdebug*   up     8:00:00      1 err   rzuseq0000
    pdebug*   up     8:00:00    257 alloc rzuseq0000
    pdebug*   up     8:00:00    254 idle  rzuseq0000
    pbatch    up     4:00:00    512 err   rzuseq0010
    pbatch    up     4:00:00    512 idle  rzuseq0011

    $ sinfo -Rl
    Fri Feb 8 10:59:46 2013
    REASON               USER        TIMESTAMP            STATE MIDPLANELIST
    Block(s) in error st Unknown     Unknown              err   rzuseq0000
    Block(s) in error st Unknown     Unknown              alloc rzuseq0000
    Block(s) in error st Unknown     Unknown              idle  rzuseq0000
    status_check: Boot f slurm(101)  2013-02-08T10:50:21  err   rzuseq0010

So Slurm is marking rzuseq0010 in error in response to the bogus rzuseq0001 being in error. Also, I cannot run a 512 node job on the remaining midplane (rzuseq0011) even though it appears fine. My submitted job goes into pending on "Resources".

==Py
Does this system only have 1.5K nodes? I am not sure what this system looks like.

So what does your slurm.conf say for node declarations?

My guess is that it is something like rzuseq[0000x0011]; at least it should be something like this. Even if the nodes don't exist in reality, they exist in the DB2 database, so they must be represented in Slurm.

More information on the system would be nice.
(In reply to comment #1)
> Does this system only have 1.5k nodes? Not sure what this system looks like.
>
> So what does your slurm.conf say for node declarations?
>
> My guess is it is something like rzuseq[0000x0011], at least it should be
> something like this. Even if the nodes don't exist in reality, they exist
> in the db2 so they must be represented in Slurm.
>
> More information on the system would be nice. I am unaware of what it looks
> like.

    # COMPUTE NODES
    FrontendName=rzuseqlac1
    FrontendName=rzuseqlac2
    NodeName=DEFAULT Procs=8192 RealMemory=2097152 State=UNKNOWN
    NodeName=rzuseq[0000,0010x0011]

    Include /etc/slurm/slurm.conf.updates

    PartitionName=pbatch Nodes=rzuseq[0010x0011] Default=No State=UP Shared=FORCE DefaultTime=60 MaxTime=4:00:00 AllowGroups=bgldev,gupta7,jdelsign

    $ cat /etc/slurm/slurm.conf.updates
    PartitionName=pdebug Nodes=rzuseq0000 Default=YES State=UP Shared=FORCE DefaultTime=60 MaxTime=8:00:00 MaxNodes=64
This is most likely a bad configuration. I would put:

    NodeName=rzuseq[0000,0010x0011]
    NodeName=rzuseq0001 State=DOWN

for the reasons cited in the last comment. You most likely don't need the second line as a separate entry; the two can be combined. Slurm is creating that block on the missing node on purpose, in most cases to avoid running jobs there.

See http://schedmd.com/slurmdocs/bluegene.html under "Naming Conventions" and look at the second "IMPORTANT". Perhaps it isn't very clear in this situation, since your system does start on 0000, but the idea still applies. If you have a better wording to help prevent this in the future, please submit a patch.

If this doesn't fix your problem, please reopen the ticket.
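For reference, a sketch of what the resulting node section of slurm.conf could look like. This is only illustrative, not taken from this ticket: the Reason string is an assumption, and the comments map the midplane names to rack/midplane coordinates as I understand them.

    # Real midplanes: R0-M0, R1-M0, R1-M1
    NodeName=rzuseq[0000,0010x0011]
    # Non-existent midplane: declared so Slurm can account for it, but held DOWN
    NodeName=rzuseq0001 State=DOWN Reason="midplane not installed"

Declaring the missing midplane DOWN lets Slurm represent everything that exists in the DB2 database without ever trying to boot a block on hardware that is not there.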
(In reply to comment #3)
> This is most likely a bad configuration.
>
> I would put
>
> NodeName=rzuseq[0000,0010x0011]
> NodeName=rzuseq0001 State=DOWN
>
> For reasons sighted in the last comment. You most likely don't need the
> second line and they can be combined. Slurm is making that block on the
> missing node on purpose in most cases to avoid running jobs there.
>
> See http://schedmd.com/slurmdocs/bluegene.html under "Naming Conventions"
> look at the second "IMPORTANT". Perhaps it isn't very clear in this
> situation since your system does start on 0000 but the idea still applies,
> if you have a better wording to help prevent this in the future please
> submit a patch.
>
> If this doesn't fix your problem please reopen the ticket.

We added Danny's recommended line to slurm.conf and have not been able to recreate the problem. Jobs are running.