Hi,

This is the issue our faculty is reporting. We would appreciate your help.

-------------

I'm running into a new problem under SLURM that did not exist under the previous PBS TORQUE setup. I know this because I ran the exact same code and requested the exact same node configuration under SLURM as before, yet got a different outcome. I can also tell because my CRAN package MVR no longer works on the cluster (I know the package itself works because it has an OK status on all platforms and was accepted by the CRAN reviewers: see here).

My script contains R functions, each of which configures a parallel backend, runs the parallel commands, and then closes the backend when it is done (as it should). FYI, I use the CRAN distribution package 'parallel' and its functions makeCluster() and stopCluster() to do that.

What happens is that SLURM no longer lets that job run: once the first function finishes and the script reaches the second function, I get the following message:

--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

The job is not killed, but it stalls (see for instance job #2459505, currently stalled).

This is potentially a serious problem because, as a result, some CRAN packages, like mine (MVR, PRIMsrc), will also stop working.

Is there a way to configure the cluster, or to add a SLURM script, to allow the re-configuration of a parallel backend cluster within the same job?

-------------------------------------------------------

Thanks
-Sanjaya
Do you have some examples of the scripts, and the underlying calls it's trying to make to Slurm?

There's not much for me to work with based on that description. I have no idea what makeCluster() and stopCluster() would be doing, and it sounds like "All nodes which are allocated for this job are already filled." is an OpenMPI error message. All of that is outside the scope of our support model.
Hi Tim,

Here is the R script for your reference.

------------------------------

library("parallel")

if (.Platform$OS.type == "unix") {
    if (require("Rmpi")) {
        print("Rmpi is loaded correctly \n")
    } else {
        stop("Rmpi must be installed first \n")
    }
}

# =================#
# Setting working directory #
# =================#
setwd(dir=file.path(Sys.getenv("HOME"), "/CODES/R/ADMIN/Parallel/Slurm",
                    fsep=.Platform$file.sep))

# =================#
# Retrieving argument passed from the command line #
# =================#
args <- commandArgs(trailingOnly=TRUE)

# =================#
# Cluster configuration #
# =================#
if (.Platform$OS.type == "unix") {
    conf <- list("cpus"=args[1],
                 "type"="MPI",
                 "homo"=TRUE,
                 "verbose"=TRUE,
                 "outfile"=paste(getwd(), "/output.txt", sep=""))
}

# =================#
# Data, Procedures #
# =================#
n <- 1e7
tasks <- list(1:n, 1:n, 1:n, 1:n, 1:n, 1:n, 1:n, 1:n)

mymean <- function(x) {
    return(mean(cos(exp(sin(x)))))
}

foo <- function (conf, tasks, fun) {
    # Setting the cluster up
    if (conf$type == "SOCK") {
        clus.rep <- parallel::makeCluster(spec=conf$names,
                                          type=conf$type,
                                          homogeneous=conf$homo,
                                          outfile=conf$outfile,
                                          verbose=conf$verbose)
    } else {
        clus.rep <- parallel::makeCluster(spec=conf$cpus,
                                          type=conf$type,
                                          homogeneous=conf$homo,
                                          outfile=conf$outfile,
                                          verbose=conf$verbose)
    }
    # Running the tasks in parallel
    parallel::clusterApplyLB(cl=clus.rep, x=tasks, fun=fun)
    # Stopping the cluster
    parallel::stopCluster(cl=clus.rep)
}

# =================#
# Main #
# =================#
for (b in 1:3) {
    cat("Replicate:", b, "\n")
    foo(conf=conf, tasks=tasks, fun=mymean)
}
Hi Tim,

I replicated his job on both the SLURM and PBS clusters. On the PBS cluster, it runs as expected. However, to make it run on the SLURM cluster, I had to switch from the makeCluster() function to makeForkCluster().

We would appreciate it if you could explain this discrepancy between PBS and SLURM.

Thank you,
-Sanjaya
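For reference, a minimal sketch of the makeCluster() to makeForkCluster() switch mentioned above. This is an illustration under assumptions, not the actual code that was run: foo_fork and ncpus are hypothetical names, and the task size is reduced so the example finishes quickly. Note that makeForkCluster() forks its workers on the local node only (Unix), so it does not go through MPI and does not span multiple nodes.

```r
library(parallel)

mymean <- function(x) {
    mean(cos(exp(sin(x))))
}

# Hypothetical variant of foo(): same setup/run/teardown cycle,
# but the backend is a fork cluster instead of an MPI cluster.
foo_fork <- function(ncpus, tasks, fun) {
    cl <- parallel::makeForkCluster(nnodes = ncpus)
    # Make sure the workers are shut down even if a task fails
    on.exit(parallel::stopCluster(cl), add = TRUE)
    parallel::clusterApplyLB(cl = cl, x = tasks, fun = fun)
}

n <- 1e5                      # smaller than the original 1e7, for a quick run
tasks <- list(1:n, 1:n, 1:n, 1:n)

# Create and tear down the backend repeatedly within one job
for (b in 1:3) {
    cat("Replicate:", b, "\n")
    res <- foo_fork(ncpus = 2L, tasks = tasks, fun = mymean)
}
```

Because the workers are forked from the running R process, this variant never asks the MPI runtime for new slots between replicates, which may be why it avoids the "already filled" message; that explanation is a guess and has not been confirmed in this thread.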
(In reply to Sanjaya Gajurel from comment #3)

I've quickly looked at the script, and I have no idea how it's interacting with the Slurm resource request. Unfortunately, none of us have any experience with the R parallel package, so we can't help with this. (This does not fall under our L3 support model for Slurm; there's no apparent problem with Slurm here.)

If you're able to translate this into the ways it's interacting with Slurm itself, I can provide some assistance there. But it sounds like you may have found a solution already.
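One possible starting point for that translation is to record what the R script actually sees of the Slurm allocation. The snippet below is only a sketch: it assumes the script is started inside the job, and it reads standard Slurm output environment variables, some of which are only set when the matching option (for example --ntasks or --cpus-per-task) was requested. The name slurm_env is illustrative.

```r
# Slurm environment variables that describe the allocation
slurm_vars <- c("SLURM_JOB_ID", "SLURM_JOB_NODELIST", "SLURM_NNODES",
                "SLURM_NTASKS", "SLURM_CPUS_PER_TASK")

# Variables that are not set come back as NA instead of ""
slurm_env <- Sys.getenv(slurm_vars, unset = NA)

print(slurm_env)
```

Comparing this output with the number of workers passed to makeCluster() would show whether the script requests more MPI slots than the job was allocated.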
Marking resolved/infogiven as I believe you'd found a workaround. Please reopen if you have further questions on this issue. - Tim
Hi Tim,

Yes, you can close the ticket. We are still investigating it. The good news is that makeForkCluster() works on both the Slurm and PBS clusters.

I have not yet received a response on bug 3304 <https://bugs.schedmd.com/show_bug.cgi?id=3304> since I sent you the slurm.conf file. I would appreciate your response.

Thanks,
-Sanjaya