| Summary: | SLURM erroneously marking block in error | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Don Lipari <lipari1> |
| Component: | Bluegene select plugin | Assignee: | Danny Auble <da> |
| Status: | RESOLVED INVALID | QA Contact: | |
| Severity: | 1 - System not usable | | |
| Priority: | --- | | |
| Version: | 2.5.x | | |
| Hardware: | IBM BlueGene | | |
| OS: | Linux | | |
| Site: | LLNL | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
Description
Don Lipari
2013-02-08 05:25:33 MST
Does this system only have 1.5k nodes? Not sure what this system looks like.

So what does your slurm.conf say for node declarations?

My guess is it is something like rzuseq[0000x0011]; at least it should be something like this. Even if the nodes don't exist in reality, they exist in the DB2 database, so they must be represented in Slurm.

More information on the system would be nice. I am unaware of what it looks like.

(In reply to comment #1)
> Does this system only have 1.5k nodes? Not sure what this system looks like.
>
> So what does your slurm.conf say for node declarations?
>
> My guess is it is something like rzuseq[0000x0011], at least it should be
> something like this. Even if the nodes don't exist in reality, they exist
> in the db2 so they must be represented in Slurm.
>
> More information on the system would be nice. I am unaware of what it looks
> like.

# COMPUTE NODES
FrontendName=rzuseqlac1
FrontendName=rzuseqlac2
NodeName=DEFAULT Procs=8192 RealMemory=2097152 State=UNKNOWN
NodeName=rzuseq[0000,0010x0011]
Include /etc/slurm/slurm.conf.updates
PartitionName=pbatch Nodes=rzuseq[0010x0011] Default=No State=UP Shared=FORCE DefaultTime=60 MaxTime=4:00:00 AllowGroups=bgldev,gupta7,jdelsign

cat /etc/slurm/slurm.conf.updates
PartitionName=pdebug Nodes=rzuseq0000 Default=YES State=UP Shared=FORCE DefaultTime=60 MaxTime=8:00:00 MaxNodes=64

This is most likely a bad configuration. I would put

NodeName=rzuseq[0000,0010x0011]
NodeName=rzuseq0001 State=DOWN

for the reasons cited in the previous comment. You most likely don't need the second line, and the two can be combined. Slurm is creating that block on the missing node on purpose, in most cases, to avoid running jobs there.

See http://schedmd.com/slurmdocs/bluegene.html under "Naming Conventions" and look at the second "IMPORTANT" note. Perhaps it isn't very clear in this situation, since your system does start on 0000, but the idea still applies. If you have a better wording to help prevent this in the future, please submit a patch.
If this doesn't fix your problem, please reopen the ticket.

(In reply to comment #3)
> This is most likely a bad configuration.
>
> I would put
>
> NodeName=rzuseq[0000,0010x0011]
> NodeName=rzuseq0001 State=DOWN
>
> For reasons cited in the last comment. You most likely don't need the
> second line and they can be combined. Slurm is making that block on the
> missing node on purpose in most cases to avoid running jobs there.
>
> See http://schedmd.com/slurmdocs/bluegene.html under "Naming Conventions"
> look at the second "IMPORTANT". Perhaps it isn't very clear in this
> situation since your system does start on 0000 but the idea still applies,
> if you have a better wording to help prevent this in the future please
> submit a patch.
>
> If this doesn't fix your problem please reopen the ticket.

We added Danny's recommended line to slurm.conf and have not been able to recreate the problem. Jobs are running.
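For reference, a sketch of what the corrected node section of slurm.conf would look like after applying the recommendation above. This simply combines the config posted in the thread with Danny's suggested lines; treat it as illustrative, not an authoritative configuration.

```conf
# Sketch of the corrected slurm.conf node section (illustrative only),
# combining the posted config with the recommendation from comment #3.
FrontendName=rzuseqlac1
FrontendName=rzuseqlac2
NodeName=DEFAULT Procs=8192 RealMemory=2097152 State=UNKNOWN
NodeName=rzuseq[0000,0010x0011]
# Midplane 0001 exists in the DB2 database but not physically;
# declaring it DOWN keeps Slurm from scheduling jobs onto it.
NodeName=rzuseq0001 State=DOWN
Include /etc/slurm/slurm.conf.updates
```

The key point from the discussion is that every midplane known to DB2 must be represented in Slurm, and any midplane that does not physically exist should be explicitly declared with `State=DOWN`.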