| Summary: | All nodes which are allocated for this job are already filled | ||
|---|---|---|---|
| Product: | Slurm | Reporter: | Sanjaya Gajurel <sxg125> |
| Component: | Scheduling | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 3 - Medium Impact | ||
| Priority: | --- | ||
| Version: | - Unsupported Older Versions | ||
| Hardware: | Linux | ||
| OS: | Linux | ||
| Site: | Case | Alineos Sites: | --- |
Description
Sanjaya Gajurel
2016-12-01 11:58:43 MST
Comment 1
Tim Wickberg
Do you have some examples of the scripts, and the underlying calls it's trying to make to Slurm?

There's not much for me to work off based on that description. I have no idea what makeCluster() and stopCluster() would be doing, and it sounds like "All nodes which are allocated for this job are already filled." is an OpenMPI error message. All of that is outside the scope of our support model.

Comment 2
Sanjaya Gajurel
Hi Tim,
Here is the R script for your reference.
------------------------------
library("parallel")

if (.Platform$OS.type == "unix") {
  if (require("Rmpi")) {
    print("Rmpi is loaded correctly \n")
  } else {
    stop("Rmpi must be installed first \n")
  }
}

# =================#
# Setting working directory #
# =================#
setwd(dir=file.path(Sys.getenv("HOME"), "/CODES/R/ADMIN/Parallel/Slurm",
                    fsep=.Platform$file.sep))

# =================#
# Retrieving argument passed from the command line #
# =================#
args <- commandArgs(trailingOnly=TRUE)

# =================#
# Cluster configuration #
# =================#
if (.Platform$OS.type == "unix") {
  conf <- list("cpus"=args[1],
               "type"="MPI",
               "homo"=TRUE,
               "verbose"=TRUE,
               "outfile"=paste(getwd(), "/output.txt", sep=""))
}

# =================#
# Data, Procedures #
# =================#
n <- 1e7
tasks <- list(1:n, 1:n, 1:n, 1:n, 1:n, 1:n, 1:n, 1:n)

mymean <- function(x) {
  return(mean(cos(exp(sin(x)))))
}

foo <- function (conf, tasks, fun) {
  # Setting the cluster up
  if (conf$type == "SOCK") {
    clus.rep <- parallel::makeCluster(spec=conf$names,
                                      type=conf$type,
                                      homogeneous=conf$homo,
                                      outfile=conf$outfile,
                                      verbose=conf$verbose)
  } else {
    clus.rep <- parallel::makeCluster(spec=conf$cpus,
                                      type=conf$type,
                                      homogeneous=conf$homo,
                                      outfile=conf$outfile,
                                      verbose=conf$verbose)
  }
  # Running the tasks in parallel
  parallel::clusterApplyLB(cl=clus.rep, x=tasks, fun=fun)
  # Stopping the cluster
  parallel::stopCluster(cl=clus.rep)
}

# =================#
# Main #
# =================#
for (b in 1:3) {
  cat("Replicate:", b, "\n")
  foo(conf=conf, tasks=tasks, fun=mymean)
}
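
A side note on the script above: commandArgs() returns character strings, so conf$cpus reaches makeCluster() as text rather than as a number. The sketch below is not part of the original report. It shows one way the requested worker count could be parsed and printed next to the task count Slurm reports, which may help express the problem in terms of the Slurm allocation. parse_worker_arg is a hypothetical helper, and SLURM_NTASKS is assumed to be set only when the job requested a task count. Whether a mismatch between the two numbers explains the error is not established in this thread.

```r
# Hypothetical helper: turn the first command-line argument into a
# positive integer worker count, failing loudly on anything else.
parse_worker_arg <- function(args) {
  n <- suppressWarnings(as.integer(args[1]))
  if (length(args) < 1L || is.na(n) || n < 1L) {
    stop("first argument must be a positive integer worker count")
  }
  n
}

# In the real script this would be commandArgs(trailingOnly = TRUE);
# a fixed value is used here so the sketch runs on its own.
requested <- parse_worker_arg(c("4"))

# Task count reported by Slurm; NA when run outside a Slurm job or
# when the job did not request a task count.
ntasks <- suppressWarnings(
  as.integer(Sys.getenv("SLURM_NTASKS", unset = NA_character_))
)

cat("Workers requested on the command line:", requested, "\n")
cat("SLURM_NTASKS reported by Slurm:", ntasks, "\n")
```

Printing both values at the top of the job output gives a concrete pair of numbers to compare when the MPI spawn fails.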
Comment 3
Sanjaya Gajurel
Hi Tim,

I replicated his job on both the Slurm and PBS clusters. On the PBS cluster, it runs as expected. However, to make it run on the Slurm cluster, I had to switch from the makeCluster() function to makeForkCluster().

We would appreciate your help if you could explain this discrepancy between PBS and Slurm.

Thank you,
-Sanjaya

Comment 4
Tim Wickberg
(In reply to Sanjaya Gajurel from comment #3)
> I replicated his job in both SLURM and PBS cluster. In PBS Cluster, it is
> running as expected. However, to make it run in the SLURM cluster, I had to
> switch from makeCluster() function to makeForkCluster().
>
> We would appreciate your help if you could explain us about this
> discrepancies between PBS and SLURM.

I've quickly looked at the script, and I have no idea how it's interacting with the Slurm resource request.

Unfortunately, none of us has any experience with the R parallel package, and we can't help with this. (This does not fall under our L3 support model for Slurm; there's no apparent problem with Slurm here.)

If you're able to translate this into the ways it's interacting with Slurm itself, I can provide some assistance there. But it sounds like you may have found a solution already.

Comment 5
Tim Wickberg
Marking resolved/infogiven as I believe you'd found a workaround. Please reopen if you have further questions on this issue.

- Tim

Comment 6
Sanjaya Gajurel
Hi Tim,

Yes, you can close the ticket. We are still investigating it. The good thing is that makeForkCluster() is working on both the Slurm and PBS clusters.

I have not yet received a response for bug 3304 <https://bugs.schedmd.com/show_bug.cgi?id=3304> after I sent you the slurm.conf file. I would appreciate your response.

Thanks,
-Sanjaya
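
For readers who land on this ticket, the makeForkCluster() workaround mentioned above can be sketched roughly as follows. This is an illustration, not the reporter's actual fix: slurm_worker_count and run_forked are hypothetical helper names, the task list is shrunk so the example runs quickly, and SLURM_CPUS_PER_TASK is assumed to be set only when the job requested --cpus-per-task. Fork clusters start their workers on the same machine as the calling R process, so unlike the MPI cluster type this approach does not spread work across several nodes.

```r
library(parallel)

# Hypothetical helper: use the CPU count Slurm granted to this task,
# falling back to a default when the variable is absent or malformed.
slurm_worker_count <- function(default = 1L) {
  raw <- Sys.getenv("SLURM_CPUS_PER_TASK", unset = NA_character_)
  n <- suppressWarnings(as.integer(raw))
  if (is.na(n) || n < 1L) default else n
}

# Hypothetical wrapper: run the tasks on a fork cluster and always
# stop the cluster afterwards, even if a task raises an error.
run_forked <- function(tasks, fun, workers = slurm_worker_count()) {
  cl <- makeForkCluster(nnodes = workers)
  on.exit(stopCluster(cl), add = TRUE)
  clusterApplyLB(cl, tasks, fun)
}

# Same computation as the script in this ticket, on smaller inputs.
mymean <- function(x) mean(cos(exp(sin(x))))
tasks <- list(1:1000, 1:1000, 1:1000, 1:1000)

results <- run_forked(tasks, mymean)
```

Because makeForkCluster() relies on fork, it is available on Unix-like systems only, which matches the Linux clusters discussed here.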