Hello, we need to allow more than one job to use a node at the same time (e.g. multiple serial jobs). I understand that the default behavior in Slurm is one job per node. I tried changing the slurm.conf file and set shared=yes and oversubscribed=yes for each partition, but after doing this and restarting slurmctld, all jobs sit in the queue. Is there another parameter to set to allow multiple jobs per node? Thanks, Rob Yelle Univ of Oregon
Shared/Oversubscribed don't directly affect this, and I'd suggest returning those to the default settings. What you're looking for is what Slurm refers to as "Consumable Resources", which is documented at https://slurm.schedmd.com/cons_res.html . Briefly, the change you want is to start allocating the individual CPUs (and possibly memory) within each node, rather than the whole node. This does have some other ramifications, and I'd encourage you to test it before making this adjustment on your production system. If you can attach your current slurm.conf file, it'd help me know which settings to recommend as well. - Tim
Created attachment 4285 [details] slurm.conf

Hi Tim, Thank you for your response, and for the link to the cons_res page. I made the following changes to slurm.conf:

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory

Then I restarted slurmctld (and slurmd on the compute nodes). When a job is submitted, it does not run, and I get this error message in /var/log/slurmctld:

    error: cons_res: cr_job_test core_bitmap index error on node n050
    sched: _slurm_rpc_allocate_resources JobID-6893 NodeList=(null) usec=547

I am also getting “bad core count” messages from some of the nodes after this change. It seems that additional configuration is required to set this up properly when cons_res is used? Yes, I would be very interested in the settings that you recommend for this; see the attached slurm.conf file. Thanks! Rob
You'll need to add definitions to the NodeName line to inform Slurm of the memory and CPU layout of the nodes. As an example, here is a line from one of my configs:

    NodeName=scruffy NodeAddr=scruffy Port=30101 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64000

The RealMemory value (in MB) can be a bit less than the actual amount on the node; that effectively reserves a bit for the OS itself. It looks like you're using Bright Cluster Manager? I'm not sure how to get it to fill that section in automatically, so you may need to refer to their documentation on how best to update this. - Tim
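To make the NodeName advice above concrete, here is a minimal Python sketch that assembles such a line from a node's hardware layout and subtracts a small memory reserve for the OS. The helper name `node_line`, the 2048 MB default reserve, and the hardware values used for n050 are assumptions for illustration only; they are not taken from the attached slurm.conf, and real values should come from the nodes themselves.

```python
def node_line(name, sockets, cores_per_socket, threads_per_core,
              total_mem_mb, os_reserve_mb=2048):
    """Return a slurm.conf NodeName line describing CPU layout and memory.

    RealMemory is set slightly below the installed memory (both in MB)
    so that some memory is left over for the operating system.
    The 2048 MB default reserve is an arbitrary illustrative choice.
    """
    if not 0 <= os_reserve_mb < total_mem_mb:
        raise ValueError("os_reserve_mb must be >= 0 and below total_mem_mb")
    real_memory = total_mem_mb - os_reserve_mb
    return (
        f"NodeName={name} "
        f"Sockets={sockets} "
        f"CoresPerSocket={cores_per_socket} "
        f"ThreadsPerCore={threads_per_core} "
        f"RealMemory={real_memory}"
    )


if __name__ == "__main__":
    # Hypothetical layout: 2 sockets x 4 cores, no hyperthreading, 64 GiB RAM.
    print(node_line("n050", 2, 4, 1, 65536))
    # prints: NodeName=n050 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=63488
```

The generated line would replace the existing NodeName entry for that node in slurm.conf, followed by a restart of slurmctld and slurmd as described earlier in this thread.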
Hi Tim, Thanks! I figured it was something like that. I will make those changes and let you know what happens. Yes, we are using Bright CM 7.3. Per their recommendations I have frozen the slurm.conf file so that Bright won’t change it, so slurm.conf is totally in my control now. Cheers, Rob
I'm marking this resolved/infogiven; please re-open if there was anything further I could answer here. - Tim
Hi Tim, Sorry, I meant to get back to you sooner on this. Thank you for your assistance on this matter; your proposed solution solved our problem. Go ahead and close the ticket. Cheers, Rob