Ticket 9957 - Some cores on one partition, some on another
Status: RESOLVED INFOGIVEN
Product: Slurm
Classification: Unclassified
Component: Configuration
Version: 20.02.5
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Ben Roberts
Reported: 2020-10-08 01:43 MDT by Torkil Svensgaard
Modified: 2020-12-28 15:33 MST

See Also:
Site: DRCMR
Linux Distro: CentOS


Description Torkil Svensgaard 2020-10-08 01:43:06 MDT
Hi

We just had a session with ben@schedmd.com where we discussed how to configure compute hosts with GPUs. The best solution seemed to be to use one core for each GPU and then use the rest of the cores for our HPC partition.

It's not entirely clear to me how to do that though. How do we split the cores on a given host between two partitions, like 46 cores for HPC partition and 2 cores for GPU partition?

Thanks,

Torkil
Comment 2 Ben Roberts 2020-10-08 10:34:18 MDT
Hi Torkil,

The parameter you would want to use is MaxCPUsPerNode.  You would place the node(s) in both partitions and then in the partition for the GPU work you would limit the number of CPUs to be equal to the number of GPUs.  In the other partition you probably wouldn't define a similar limit because those wouldn't be the only nodes in the partition.  Here is the documentation on this parameter:
https://slurm.schedmd.com/slurm.conf.html#OPT_MaxCPUsPerNode
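
A minimal slurm.conf sketch of that layout, using the 46/2 split from the description (the node and partition names, and the second partition's node list, are illustrative, not from your actual config):

"
# Hypothetical 48-core node with 2 GPUs, listed in both partitions
NodeName=gpunode01 CPUs=48 RealMemory=257552 Gres=gpu:2

# GPU partition: cap usable CPUs at the number of GPUs per node
PartitionName=gpu Nodes=gpunode01 MaxCPUsPerNode=2 State=UP
# General partition: no limit here, since it also contains non-GPU nodes
PartitionName=hpc Nodes=gpunode01,node[01-10] Default=YES State=UP
"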

One more thing that we discussed on the phone is the ability to tie certain cores to a GPU.  You do that in the gres.conf file by specifying 'Cores' on the line with the GPU definition.  You can read more about this option here:
https://slurm.schedmd.com/gres.conf.html#OPT_Cores
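
As a sketch, a gres.conf along these lines would pin one core to each GPU (the device paths and core IDs here are illustrative):

"
# Hypothetical gres.conf: each GPU definition names the core(s) tied to it
Name=gpu File=/dev/nvidia0 Cores=0
Name=gpu File=/dev/nvidia1 Cores=1
"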

Please let me know if you have any problems with this or additional questions about it.

Thanks,
Ben
Comment 3 Ben Roberts 2020-10-30 08:18:31 MDT
Hi Torkil,

I wanted to follow up and see if the MaxCPUsPerNode parameter I recommended did address your need.  Please let me know if you still need help with this or if this ticket is ok to close.

Thanks,
Ben
Comment 4 Torkil Svensgaard 2020-11-02 03:05:28 MST
(In reply to Ben Roberts from comment #3)
> Hi Torkil,

Hi Ben

> I wanted to follow up and see if the MaxCPUsPerNode parameter I recommended
> did address your need.  Please let me know if you still need help with this
> or if this ticket is ok to close.

Not quite at that point yet with the new install. I'm currently a bit stuck on something else; I'll just put it here since it's easier for me with a single ticket and point of contact. Let me know if you need separate tickets.

Getting this on my submit host:

"
bash-4.4$ srun hostname
srun: error: Unable to create step for job 30: Error generating job credential
"

Log on the controller for this:

"
[2020-11-02T10:53:48.953] error: slurm_auth_get_host: Lookup failed for 172.21.15.30: Unknown host
[2020-11-02T10:53:48.954] sched: _slurm_rpc_allocate_resources JobId=30 NodeList=bigger9 usec=2055
[2020-11-02T10:53:48.954] prolog_running_decr: Configuration for JobId=30 is complete
[2020-11-02T10:53:48.959] error: slurm_cred_create: getpwuid failed for uid=1018
[2020-11-02T10:53:48.959] error: slurm_cred_create error
[2020-11-02T10:53:48.960] _job_complete: JobId=30 WTERMSIG 1
[2020-11-02T10:53:48.961] _job_complete: JobId=30 done
"

Slurmuser = slurm has UID 20000 on all hosts
Mungeuser = munge has UID 20001 on all hosts

172.21.15.30 is my submit host, which isn't part of any partition.

Suggestions?
Comment 5 Torkil Svensgaard 2020-11-02 05:31:34 MST
Sorted that one. 

Please leave the ticket open if that's ok, while we do the initial implementation. If you need a fresh ticket for each issue feel free to close it.

Mvh.

Torkil
Comment 6 Ben Roberts 2020-11-02 08:32:35 MST
I'm glad you were able to get to the bottom of your last issue.  I'm ok to leave this open for a while longer for some quick questions.  If something that comes up turns out to be a bigger issue than originally thought I might have you split it into a new ticket, but I'm happy to leave this open for now.

Thanks,
Ben
Comment 7 Torkil Svensgaard 2020-11-02 10:25:14 MST
Hi Ben

Cool =)

Currently a bit stuck on X11. I can do "srun --x11 someprogram" and it works fine, but the same someprogram doesn't seem to work when --x11 is put in an sbatch header. Suggestions?

Mvh.

Torkil
Comment 8 Ben Roberts 2020-11-02 10:59:42 MST
This is actually by design.  Support for x11 forwarding with sbatch was removed in 19.05 due to changes in the underlying mechanism used to accomplish this.  You would need to use either salloc or srun.  You can find some additional details in bug 3647 or in this commit:
https://github.com/SchedMD/slurm/commit/c97284691b6a0df57493a13132787a1a908a749f

Let me know if you have additional questions about this.  

Thanks,
Ben
Comment 9 Torkil Svensgaard 2020-11-02 14:45:15 MST
(In reply to Ben Roberts from comment #8)
> This is actually by design.  Support for x11 forwarding with sbatch was
> removed in 19.05 due to changes in the underlying mechanism used to
> accomplish this.  You would need to use either salloc or srun.  You can find
> some additional details in bug 3647 or in this commit:
> https://github.com/SchedMD/slurm/commit/
> c97284691b6a0df57493a13132787a1a908a749f
> 
> Let me know if you have additional questions about this.  


Ah, that explains it. That means users who run array jobs with X11 output will have to make some changes I guess. 

Thanks,

Torkil
Comment 10 Torkil Svensgaard 2020-11-03 02:57:46 MST
Hi

Currently struggling a bit with getting srun to work from a .desktop file. 

If I open a terminal locally and run "/usr/bin/srun --x11 /mnt/depot64/fsl/fsl.6.0.1/bin/fsl" it works just fine.

If I do it from a .desktop file with the following content it fails:

"
[Desktop Entry]
Name=FSL
Comment=Use the command line
TryExec=srun
Exec=/usr/bin/srun --x11 /mnt/depot64/fsl/fsl.6.0.1/bin/fsl
Icon=/usr/local/share/desktop-icons/fsl.jpg
Type=Application
Terminal=true 
Categories=X-DRCMR
DBusActivatable=true 
"

Error on slurmctld:

"
X11 connection rejected because of wrong authentication.
X11 connection rejected because of wrong authentication.
Nov 03 10:54:54 joe.drcmr at-spi-bus-launcher[2098]: dbus-daemon[2103]: Activating service name='org.a11y.atspi.Registry' requested by ':1.118' (uid=1018 pid=8918 comm="exo-open --launch TerminalEmulator /usr/bin/srun -" label="unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023")
Nov 03 10:54:55 joe.drcmr at-spi2-registr[8921]: Could not open X display
Nov 03 10:54:55 joe.drcmr at-spi-bus-launcher[2098]: dbus-daemon[2103]: Successfully activated service 'org.a11y.atspi.Registry'
Nov 03 10:54:55 joe.drcmr at-spi-bus-launcher[2098]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry
Nov 03 10:54:55 joe.drcmr at-spi2-registr[8921]: AT-SPI: Cannot open default display
X11 connection rejected because of wrong authentication.
X11 connection rejected because of wrong authentication.
Nov 03 10:54:57 joe.drcmr at-spi-bus-launcher[2098]: dbus-daemon[2103]: Activating service name='org.a11y.atspi.Registry' requested by ':1.61' (uid=1018 pid=6408 comm="xfce4-panel " label="unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023")
Nov 03 10:54:58 joe.drcmr at-spi2-registr[8932]: Could not open X display
Nov 03 10:54:58 joe.drcmr at-spi-bus-launcher[2098]: dbus-daemon[2103]: Successfully activated service 'org.a11y.atspi.Registry'
Nov 03 10:54:58 joe.drcmr at-spi-bus-launcher[2098]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry
Nov 03 10:54:58 joe.drcmr at-spi2-registr[8932]: AT-SPI: Cannot open default display
"

Any suggestions?

Also this error keeps cropping up:

"
error: slurm_auth_get_host: Lookup failed for 0.0.0.0: Unknown host
"

I had it yesterday too with a different IP. 

"
error: slurm_auth_get_host: Lookup failed for 172.21.15.30: Unknown host
"

I think it changed after I rebooted this morning.

Mvh.

Torkil
Comment 11 Ben Roberts 2020-11-03 12:45:12 MST
Hi Torkil,

This looks to me like an issue with the XAUTHORITY environment variable not being available to the srun command when run from within the desktop file.  You may be able to have it get access to the right variable(s) by having it exec 'sh' instead of 'srun' directly.  
Exec=sh -c "/usr/bin/srun --x11 /mnt/depot64/fsl/fsl.6.0.1/bin/fsl"
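
Applied to the .desktop file from comment 10, that change would look something like this (everything except the TryExec/Exec lines is copied from that comment):

"
[Desktop Entry]
Name=FSL
Comment=Use the command line
TryExec=sh
Exec=sh -c "/usr/bin/srun --x11 /mnt/depot64/fsl/fsl.6.0.1/bin/fsl"
Icon=/usr/local/share/desktop-icons/fsl.jpg
Type=Application
Terminal=true
Categories=X-DRCMR
DBusActivatable=true
"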

If that doesn't help you can try adding '-vvv' to the srun command to have it log some more information about what is going on on the srun side of things.  

Thanks,
Ben
Comment 12 Torkil Svensgaard 2020-11-03 13:08:57 MST
Hi Ben

I got that sorted and now apps launch. I'm still seeing this one though:

"
[2020-11-03T21:05:49.239] error: slurm_auth_get_host: Lookup failed for 0.0.0.0: Unknown host
"

How do I get rid of that, if it's only cosmetic? Is it some sort of reverse DNS lookup?

Mvh.

Torkil
Comment 13 Torkil Svensgaard 2020-11-03 13:49:39 MST
I can't seem to get srun to allocate less than an entire node.

In slurm.conf I have this:

"
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE Shared=NO LLN=YES State=UP
"

Command line:

"
srun --cpus-per-task=1 --ntasks=1 --export=ALL --x11 fsl
"

What am I missing?

Mvh.

Torkil
Comment 14 Torkil Svensgaard 2020-11-03 14:10:12 MST
Hmm, I thought the number of cores etc. was detected automatically if not specified, but that evidently isn't so out of the box:

"
NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES       
bigger9               1           1           (null)                     (null) 
"

Is there a way to make it detect CPU and memory loadout?

Mvh.

Torkil
Comment 15 Ben Roberts 2020-11-03 14:48:28 MST
For your issue with slurm_auth_get_host, it looks like it's trying to get a hostname but getting an IP address instead.  I'm not sure why that would be.  Is that only when you run it from a desktop file as well?  This call is probably coming from the Munge code, here:
https://github.com/SchedMD/slurm/blob/98e5e853a7ce52e284b3de14d5e98d70501a2d1f/src/plugins/auth/munge/auth_munge.c#L292-L336


Unfortunately there isn't a way to have Slurm populate the configuration of the nodes without specifying the information in the slurm.conf.  If your nodes are homogeneous (or mostly homogeneous) you can use 'NodeName=DEFAULT' to specify the configuration once and have the majority of the nodes pick it up from there.  If you have any that don't match the rest of the nodes you can define the correct settings for them on a separate line.  
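
As a sketch, the DEFAULT mechanism looks like this (the node names and sizes here are made up for illustration):

"
# Hypothetical defaults shared by most nodes
NodeName=DEFAULT CPUs=48 RealMemory=257552
# These nodes inherit the DEFAULT settings
NodeName=node[01-08]
# A node that differs lists its own settings explicitly
NodeName=gpunode01 CPUs=64 RealMemory=515104 Gres=gpu:2
"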

Thanks,
Ben
Comment 16 Torkil Svensgaard 2020-11-03 14:56:53 MST
Ah, thanks. Not a big hassle to make the configuration; it just seems convenient if it could be done automatically and overridden manually if need be.

The error is also showing with srun directly in a terminal:

"
[2020-11-03T22:51:32.323] error: slurm_auth_get_host: Lookup failed for 0.0.0.0: Unknown host
[2020-11-03T22:51:32.323] sched: _slurm_rpc_allocate_resources JobId=243 NodeList=bigger9 usec=1228
[2020-11-03T22:51:32.323] prolog_running_decr: Configuration for JobId=243 is complete
"

Mvh.

Torkil
Comment 17 Ben Roberts 2020-11-03 15:33:45 MST
I looked at where the last section of the code I sent previously came from and it looks like the decision was made to use the IP of the host in cases where the hostname lookup failed.  That doesn't explain why the hostname lookup is failing in your case, but it does line up with the expected behavior as described in bug 7430.  You can also see the relevant commit here:
https://github.com/SchedMD/slurm/commit/a67548bc958f451c227437b1197f5c71f57ce771

Can you get the hostname manually?

Thanks,
Ben
Comment 18 Torkil Svensgaard 2020-11-05 12:37:37 MST
(In reply to Ben Roberts from comment #17)
>
> Can you get the hostname manually?

Get it how exactly? We haven't configured reverse DNS for 172.21.15.0/24, which is used by Slurm; might that be it?

I tried disabling IPv6 as per the comments on the first link, but that didn't make any difference.

Mvh.

Torkil
Comment 19 Ben Roberts 2020-11-05 15:27:41 MST
I hadn't paid close attention to the version listed on this ticket previously.  It shows 21.08, but that isn't available yet.  Since you're talking about disabling IPv6 I assume you're using 20.11, is that right?  There were changes in 20.11, moving from gethostbyname() to getaddrinfo() to allow for IPv6 support.  It looks like an internal ticket was just opened for a bug that was introduced in the case where a host has only an IPv4 address.  It is being worked on currently and should be addressed before the official release of 20.11.  I'll let you know as there's progress.

Thanks,
Ben
Comment 20 Torkil Svensgaard 2020-11-05 23:11:31 MST
Hi Ben

Versions mismatch is probably my mistake. This is what we have, built with rpmbuild from source.

"
# slurm/root ~ 
# rpm -qa | grep slurm
slurm-perlapi-20.02.5-1.el8.x86_64
slurm-slurmrestd-20.02.5-1.el8.x86_64
slurm-20.02.5-1.el8.x86_64
slurm-slurmdbd-20.02.5-1.el8.x86_64
slurm-contribs-20.02.5-1.el8.x86_64
slurm-example-configs-20.02.5-1.el8.x86_64
slurm-slurmctld-20.02.5-1.el8.x86_64
slurm-devel-20.02.5-1.el8.x86_64
"

I changed the version in the ticket to 20.02.5.

We are not using IPv6 but it's usually enabled on the interfaces on new CentOS installs. I just noticed it being mentioned in the link below and disabled it across the board on the new setup.

https://github.com/SchedMD/slurm/blob/98e5e853a7ce52e284b3de14d5e98d70501a2d1f/src/plugins/auth/munge/auth_munge.c#L292-L336

Mvh.

Torkil
Comment 21 Torkil Svensgaard 2020-11-06 01:06:58 MST
(In reply to Ben Roberts from comment #2)

Back to the original question,

> The parameter you would want to use is MaxCPUsPerNode.  You would place the
> node(s) in both partitions and then in the partition for the GPU work you
> would limit the number of CPUs to be equal to the number of GPUs.  In the
> other partition you probably wouldn't define a similar limit because those
> wouldn't be the only nodes in the partition.  Here is the documentation on
> this parameter:
> https://slurm.schedmd.com/slurm.conf.html#OPT_MaxCPUsPerNode

I might be reading it wrong, or have phrased my question wrong, but this seems fairly useless unless all nodes have exactly the same number of CPUs and GPUs, so that MaxCPUsPerNode can be set to (number of GPUs) for the GPU partition and (number of cores - number of GPUs) for the HPC partition.

The important thing here is not so much to limit the number of cores used by GPU jobs (that's accomplished by GRES) but to make sure one CPU core is reserved per GPU, and that can't be done?
 
> One more thing that we discussed on the phone is the ability to tie certain
> cores to a GPU.  You do that in the gres.conf file by specifying 'Cores' on
> the line with the GPU definition.  You can read more about this option here:
> https://slurm.schedmd.com/gres.conf.html#OPT_Cores

I don't think I entirely understand this, but with this we could say core0 is the only core that can use GPU0, right? That will improve performance in cases where locality matters, like which socket a job is running on, but it can't be used to ensure a core is only used with a GPU?

Mvh.

Torkil
Comment 22 Ben Roberts 2020-11-09 10:20:34 MST
Hi Torkil,

Your understanding is correct regarding the MaxCPUsPerNode parameter.  The idea for this parameter is that your GPU nodes be configured to be in two partitions.  With the MaxCPUsPerNode parameter set for each of them you can make sure there are CPUs available for GPU jobs in your 'gpu' partition and that the rest of the CPUs are available for other work in other partitions.  This is designed to limit the number of CPUs available to GPU jobs more than making sure that there is one core reserved for each GPU that is inaccessible to jobs that don't use the GPU.  

For billing purposes you can use TRESBillingWeights and the MAX_TRES PriorityFlag so that jobs requesting configurations other than 1 core to 1 GPU are billed for the maximum value of the resources they requested.  For example, if the billing weights are defined such that CPUs and GPUs are equal, then jobs that request 2 GPUs and 2 CPUs, 1 GPU and 2 CPUs, or 2 GPUs and 1 CPU would all be billed for the same amount.  This doesn't prevent users from requesting the CPUs in the GPU partition, but it discourages them from doing so unless they are going to use the GPU too.  
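
As a sketch, such a setup might look like this in slurm.conf (the partition and node names are illustrative; with MAX_TRES a job is billed for its single most expensive TRES rather than the sum):

"
# Hypothetical example: weight CPUs and GPUs equally
PriorityFlags=MAX_TRES
PartitionName=gpu Nodes=gpunode01 TRESBillingWeights="CPU=1.0,GRES/gpu=1.0"
"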

One other possibility that might get closer to what you want would be to create a reservation for the number of CPUs you want and then have GPU jobs request that reservation (or you could configure a submit filter that adds the reservation to the request of GPU jobs).  This would be a way to avoid having to create a separate partition, but still doesn't prevent GPU jobs from requesting more CPUs than the number of GPUs they requested in that reservation.

I know these still aren't exactly what you're looking for, but hopefully there's something that will come close enough to work.  Please let me know if you'd like me to clarify anything.

Thanks,
Ben
Comment 23 Torkil Svensgaard 2020-11-10 00:29:16 MST
(In reply to Ben Roberts from comment #22)
> Hi Torkil,

Hi Ben

> One other possibility that might get closer to what you want would be to
> create a reservation for the number of CPUs you want and then have GPU jobs
> request that reservation (or you could configure a submit filter that adds
> the reservation to the request of GPU jobs).  This would be a way to avoid
> having to create a separate partition, but still doesn't prevent GPU jobs
> from requesting more CPUs than the number of GPUs they requested in that
> reservation.

This sounds like it would do exactly what I want. Can that filter be used such that we only have 1 partition for batch jobs and if gres=gpu is requested the filter kicks in?

Mvh.

Torkil
Comment 24 Ben Roberts 2020-11-10 08:47:10 MST
Yes, you can do that.  Here's an example script I have from other testing that looks for jobs requesting --gpus=x and sets a different partition on them when that is matched:

------------------------------
function slurm_job_submit(job_desc, part_list, submit_uid)

    slurm.log_user("TRES per job: %s, per task: %s", job_desc.tres_per_job, job_desc.tres_per_job)

    tres_string = string.find(job_desc.tres_per_job, "gpu")
    slurm.log_user("tres_string is %s", tres_string)
    if (string.find(job_desc.tres_per_job, "gpu")) then
        job_desc.partition = 'gpu'
        slurm.log_user("Set partition to: %s", job_desc.partition)
    end 

    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    slurm.log_user("In lua modify function")
    --slurm.log_user("partition available name:%s", part.name)
    return slurm.SUCCESS
end

return slurm.SUCCESS
------------------------------


Here's an example job that matches the filter and has the partition modified:

$ sbatch -N1 -pdebug --gpus=1 --wrap='srun sleep 30'
sbatch: TRES per job: gpu:1, per task: gpu:1
sbatch: tres_string is 1
sbatch: Set partition to: gpu
Submitted batch job 1423

$ scontrol show job 1423 | grep Partition
   Partition=gpu AllocNode:Sid=kitt:7786




This example shows the modification of the partition, but you can set a reservation on the job as well.  In the code you can see a list of all the attributes of a job you can modify.  For a reservation the attribute is just called 'reservation':
https://github.com/SchedMD/slurm/blob/a18562932cefbc668e5e2d7552251c5e60f314be/src/plugins/job_submit/lua/job_submit_lua.c#L711

Let me know if you have questions about this.

Thanks,
Ben
Comment 25 Torkil Svensgaard 2020-12-03 03:50:02 MST
Hi Ben

Looking at creating a reservation and it seems to require setting user and/or account so then looking at accounting, which we never really used before.

The docs I found mention manually creating accounts and users, and we are not particularly interested in more work. 

I think I've managed to create a default account. How do I go about getting jobs registered to that account, and UNIX user names automatically created in the accounting database if necessary?

Mvh.

Torkil
Comment 26 Torkil Svensgaard 2020-12-03 04:23:28 MST
Hi again

"
# scontrol create reservationname=core_pr_gpu start=now duration=infinite partition=HPC Nodes=bigger9 CoreCnt=1 Account=drcmr
Reservation created: core_pr_gpu
# slurm/root run 
# scontrol show res
ReservationName=core_pr_gpu StartTime=2020-12-03T12:19:24 EndTime=2021-12-03T12:19:24 Duration=365-00:00:00
   Nodes=bigger9 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=HPC Flags=SPEC_NODES
     NodeName=bigger9 CoreIDs=0
   TRES=cpu=2
   Users=(null) Groups=(null) Accounts=drcmr Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
   MaxStartDelay=(null)
"

Why/what is TRES=cpu=2? Just curious, since everything I set is 1.

Mvh.

Torkil
Comment 27 Torkil Svensgaard 2020-12-03 05:17:41 MST
(In reply to Torkil Svensgaard from comment #26)
> Hi again
> 
> "
> # scontrol create reservationname=core_pr_gpu start=now duration=infinite
> partition=HPC Nodes=bigger9 CoreCnt=1 Account=drcmr
> Reservation created: core_pr_gpu
> # slurm/root run 
> # scontrol show res
> ReservationName=core_pr_gpu StartTime=2020-12-03T12:19:24
> EndTime=2021-12-03T12:19:24 Duration=365-00:00:00
>    Nodes=bigger9 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=HPC
> Flags=SPEC_NODES
>      NodeName=bigger9 CoreIDs=0
>    TRES=cpu=2
>    Users=(null) Groups=(null) Accounts=drcmr Licenses=(null) State=ACTIVE
> BurstBuffer=(null) Watts=n/a
>    MaxStartDelay=(null)
> "
> 
> Why/what is TRES=cpu=2? Just curious since evething i set is 1.

I presume it has something to do with hyperthreading.

This host with an "AMD EPYC 7402P 24-Core Processor" is configured like this in slurm.conf:

"
NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1
"

That configuration was generated by "slurmd -C".

But when I try to start a large array job I'm only able to run 24 jobs at the same time, not 48. How come? 

Mvh.

Torkil
Comment 28 Torkil Svensgaard 2020-12-03 05:20:04 MST
(In reply to Torkil Svensgaard from comment #27)

> But when I try to start a large array job I'm only able to run 24 jobs at
> the same time, not 48. How come? 

Eh, 23, due to the reservation.

Mvh.

Torkil
Comment 29 Torkil Svensgaard 2020-12-03 05:55:20 MST
(In reply to Ben Roberts from comment #24)
> Yes, you can do that.  Here's an example script I have from other testing
> where I was looking for jobs that request --gpus=x and setting a different
> partition on them when that was matched:

I've built and installed slurmctld with Lua support and put my script next to slurm.conf in /etc/slurm, but it doesn't seem to be picked up.

Here's the script:

"
function slurm_job_submit(job_desc, part_list, submit_uid)

    slurm.log_user("TRES per job: %s, per task: %s", job_desc.tres_per_job, job_desc.tres_per_job)

    tres_string = string.find(job_desc.tres_per_job, "gpu")
    slurm.log_user("tres_string is %s", tres_string)
    if (string.find(job_desc.tres_per_job, "gpu")) then
        --job_desc.partition = 'application'
        job_desc.reservation = 'core_pr_cpu'
        --slurm.log_user("Set partition to: %s", job_desc.partition)
        slurm.log_user("Using reservation: %s", job_desc.reservation)
    end

    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    slurm.log_user("In lua modify function")
    --slurm.log_user("partition available name:%s", part.name)
    return slurm.SUCCESS
end

return slurm.SUCCESS
"

I've restarted and reconfigured, what did I miss?

Mvh.

Torkil
Comment 30 Ben Roberts 2020-12-03 10:52:00 MST
Hi Torkil,

Unfortunately the user and account system that is used for Slurm does not have a method of directly importing the information from LDAP or the OS.  Because we allow you to define an account hierarchy and allow users to be in multiple accounts a tool like this would get very complex and difficult to have it put users in the right place(s).  There are some users who have created scripts to help manage the Slurm users and accounts, and there was a good presentation on one of these scripts from the Slurm Users Group a couple years ago.  I'll link their slides in case it's helpful.
https://slurm.schedmd.com/SLUG18/ewan_roche_slug18.pdf

Regarding your reservation question, you're right: it shows 'TRES=cpu=2' when you only requested one core because that core is hyper-threaded, so it counts as two CPUs.  If you're only able to run 23 jobs on a 48-CPU machine then it sounds like you are using CR_Core (or CR_Core_Memory) for the SelectTypeParameters.  From the documentation for CR_Core:
-----------------
On nodes with hyper-threads, each thread is counted as a CPU to satisfy a job's resource requirement, but multiple jobs are not allocated threads on the same core. The count of CPUs allocated to a job is rounded up to account for every CPU on an allocated core.
-----------------

Here is an example of this with a single task job on a node with hyper-threading.
$ sbatch -n1 -wnode01 --wrap='srun sleep 30'
Submitted batch job 1517

$ scontrol show node node01 | grep AllocTRES
   AllocTRES=cpu=2,mem=15678M


If I request 2 tasks you can see that they both fit on a single core, so it shows the same number of CPUs being allocated.
$ sbatch -n2 -wnode01 --wrap='srun sleep 300'
Submitted batch job 1518

$ scontrol show node node01 | grep AllocTRES
   AllocTRES=cpu=2,mem=15678M


If you're submitting an array of single task jobs to a hyper-threaded node and using CR_CORE that would explain why you can only start 23 jobs rather than 47.  Let me know if this isn't the case for you.

For your last question, there is a parameter you have to set to enable the job submit plugin that you may not have seen.
JobSubmitPlugins=lua

Let me know if you've already got this set and it's still not picking up your script.

Thanks,
Ben
Comment 31 Torkil Svensgaard 2020-12-03 14:40:11 MST
(In reply to Ben Roberts from comment #30)
> Hi Torkil,

Hi Ben
 
> If you're submitting an array of single task jobs to a hyper-threaded node
> and using CR_CORE that would explain why you can only start 23 jobs rather
> than 47.  Let me know if this isn't the case for you.

Ah, of course. I changed it to CR_CPU and cut down the node definition to just "NodeName=bigger9 CPUs=48 RealMemory=257552 Gres=gpu:1" and now I get 47. 

Which configuration would be the better one though? It would depend on the workloads of course but perhaps you have some insights?
 
> For your last question, there is a parameter you have to set to enable the
> job submit plugin that you may not have seen.
> JobSubmitPlugins=lua

Thanks, missed that in the documentation. 

Mvh.

Torkil
Comment 32 Torkil Svensgaard 2020-12-03 15:01:00 MST
> Thanks, missed that in the documentation. 

That picks up the Lua script but Slurm doesn't like it:

"
[2020-12-03T22:52:02.656] error: job_submit/lua: /etc/slurm/job_submit.lua: [string "slurm.user_msg (string.format(table.unpack({...})))"]:1: bad argument #2 to 'format' (no value)
"

I've also tried with the code you linked, verbatim, same error. 

After the change to CR_CPU I'm also getting this when I restart slurmctld:

"
[2020-12-03T22:56:19.425] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (24 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (25 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (26 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (27 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (28 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (29 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (30 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (31 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (32 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (33 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (34 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (35 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (36 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (37 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (38 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (39 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (40 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (41 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (42 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (43 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (44 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (45 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (46 >= 23)
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (47 >= 23)
"

It seems to work though. 

Lastly, I'm still seeing this:

"
[2020-12-03T22:56:34.683] error: slurm_auth_get_host: Lookup failed for 0.0.0.0
"

I guess the fix for that didn't make it into 20.11?

Mvh.

Torkil
Comment 33 Torkil Svensgaard 2020-12-04 04:06:10 MST
(In reply to Torkil Svensgaard from comment #32)

> That picks up the LUA script but SLURM doesn't like it:
> 
> "
> [2020-12-03T22:52:02.656] error: job_submit/lua: /etc/slurm/job_submit.lua:
> [string "slurm.user_msg (string.format(table.unpack({...})))"]:1: bad
> argument #2 to 'format' (no value)
> "
> 
> I've also tried with the code you linked, verbatim, same error. 

I've spent several hours this morning on this, and I believe the job_desc object, or whatever it is, is empty or broken, depending on how exactly the metatable works.

I can set values on it, like job_desc.reservation, and read it back, but I am unable to read anything on the initial calling object. Setting the reservation also works just fine, it uses the reserved core.

Mvh.

Torkil
Comment 34 Ben Roberts 2020-12-07 12:08:02 MST
Hi Torkil,

My apologies for the delayed response, I was out of the office on Friday.  

There are a few parts to your question, so I started by looking into the lua error you are getting.  I didn't see a call to string.format in the example script you sent, and mine didn't have it either, but it looks like it is called by slurm.log_user.  I can cause this error to appear if I call slurm.log_user with a bad job descriptor attribute.  I can't reproduce this error with the script I sent initially, but it could be that the tres_per_job attribute isn't set by default in your environment.  It sounds like this might be what you found as well.

Your last message says that you can set and read back attributes like reservations, but you can't read anything on the initial calling object.  This sounds to me like you're trying to see the reservation value before you set it, is that right?  If so, I can reproduce that behavior.  It fails because there isn't a value for the reservation attribute of the job, so trying to print the value generates this error because there is no argument for it to format.  If this isn't what is happening in your case, let me know.

Regarding the "bad core offset" message, I haven't been able to reproduce that same error on my test system by switching to/from CR_Core and CR_CPU and changing the node definitions to include socket, core and thread information or just CPUs.  It's possible that running jobs might have an effect on this.  Were there active jobs when you made this change?  I'm also curious if you have Cores specified in your gres.conf to associate GPUs with certain cores.

For the issue with the failed lookup, that fix did make it into 20.11.  The fix there was to have it use the text version of the IP address if the reverse lookup fails.  But the fact that it's showing an IP instead of the hostname indicates that something is causing the lookup to fail.  We can manually simulate what's happening to see what is being decoded by munge.  If you can ssh to one of the compute nodes you can use munge to encode an empty message and then send the output to the slurm controller, telling it to decode and get the information about that message.  You would run this:
munge -n | ssh <hostname of controller> unmunge

Let me know what that shows.

Thanks,
Ben
Comment 35 Torkil Svensgaard 2020-12-07 12:16:32 MST
(In reply to Ben Roberts from comment #34)
> Hi Torkil,

Hi Ben

> My apologies for the delayed response, I was out of the office on Friday.  

Np.

> There are a few parts to your question, so I started by looking into the lua
> error you are getting.  I didn't see a call to string.format in the example
> script you sent and mine didn't have it either, but it looks like this is
> called by slurm.log_user.  I can cause this error to appear if I call
> slurm.log_user with a bad job descriptor attribute.  I can't reproduce this
> error with the script I sent initially, but it could be that the
> tres_per_job attribute isn't set by default in your environment.  It sounds
> like this might be what you found as well.  Your last message says that you
> can set and read back attributes like reservations, but you can't read
> anything on the initial calling object.  This sounds to me like you're
> trying to see the reservation value before you set it, is that right?  If
> so, I can reproduce that behavior.  It fails because there isn't a value for
> the reservation attribute of the job, so by trying to print the value it
> generates this error because there is no argument for it to format.  If this
> isn't what is happening in your case let me know.

I agree that is the case for the reservation, but I was unable to read any value from the calling object. The job was the following sbatch job, so I should think "tres_string" or "partition" should be available?

"
#!/bin/bash
#SBATCH --partition=HPC
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=1 
#SBATCH --ntasks=1

nvidia-smi
"

> Regarding the "bad core offset" message, I haven't been able to reproduce
> that same error on my test system by switching to/from CR_Core and CR_CPU
> and changing the node definitions to include socket, core and thread
> information or just CPUs.  It's possible that running jobs might have an
> effect on this.  Were there active jobs when you made this change?  I'm also
> curious if you have Cores specified in your gres.conf to associate GPUs with
> certain cores.

It's possible, will test when I get in tomorrow. I haven't associated the GPU with a specific core yet.

> For the issue with the failed lookup, that fix did make it into 20.11.  The
> fix there was to have it use the text version of the IP address if the
> reverse lookup fails.  But the fact that it's showing an IP instead of the
> hostname indicates that something is causing the lookup to fail.  We can
> manually simulate what's happening to see what is being decoded by munge. 
> If you can ssh to one of the compute nodes you can use munge to encode an
> empty message and then send the output to the slurm controller, telling it
> to decode and get the information about that message.  You would run this:
> munge -n | ssh <hostname of controller> unmunge
> 
> Let me know what that shows.

Thanks, will test that tomorrow as well.

Mvh.

Torkil
Comment 36 Torkil Svensgaard 2020-12-07 12:43:22 MST
(In reply to Torkil Svensgaard from comment #35)

> I agree that is the case for the reservation, but I was unable to read any
> value from the calling object. The job was the following sbatch job, so I
> should think 
> "tres_string" or "partition" should be available?
> 
> "
> #!/bin/bash
> #SBATCH --partition=HPC
> #SBATCH --gres=gpu:1
> #SBATCH --cpus-per-task=1 
> #SBATCH --ntasks=1
> 
> nvidia-smi
> "

That was actually not true, I can read job_desc.partition. So how do I check if gres=gpu is set? In your script you did "if (string.find(job_desc.tres_per_job, "gpu")) then" but tres_per_job doesn't seem to be set.

Also, is there a way to "pretty print" all variables in the job_desc object? 

Mvh.

Torkil
Comment 37 Torkil Svensgaard 2020-12-07 12:51:42 MST
(In reply to Torkil Svensgaard from comment #35)

> > munge -n | ssh <hostname of controller> unmunge
> > 
> > Let me know what that shows.

torkil@bigger9:~/slurm$ munge -n | ssh slurm unmunge
Enter passphrase for key '/mrhome/torkil/.ssh/id_ed25519': 
STATUS:          Success (0)
ENCODE_HOST:     ??? (172.21.15.102)
ENCODE_TIME:     2020-12-07 20:50:24 +0100 (1607370624)
DECODE_TIME:     2020-12-07 20:50:29 +0100 (1607370629)
TTL:             300
CIPHER:          aes128 (4)
MAC:             sha256 (5)
ZIP:             none (0)
UID:             torkil (1018)
GID:             torkil (1018)
LENGTH:          0
Comment 38 Ben Roberts 2020-12-07 15:28:57 MST
I'm afraid there isn't a way to have the submit filter print all the contents of a job_desc, but you can see the possible attributes of the job in _get_job_req_field or _set_job_req_field in src/plugins/job_submit/lua/job_submit_lua.c.  

I looked at the way these job attributes are set and I think I see the reason for the disconnect.  It has to do with where attributes are set with different ways of submitting with a GPU/gres.  If you submit with --gres=gpu:1 it sets tres_per_node.  If you submit with --gpus=1 it will set tres_per_job.  I realize this can be confusing and will probably require some extra logic to see how users submit (unless you can be sure they all submit the same way).  Let me know if making this change in your test does address the lua errors you're seeing.

For the munge error, the fact that you see "ENCODE_HOST:     ??? (172.21.15.102)" does confirm that there is an issue with the host lookup.  It should show the hostname rather than the series of question marks.  You mentioned earlier that you hadn't configured reverse DNS for 172.21.15.0/24.  If you get this configured and the munge command you ran shows the hostname of the machine I would expect these errors in Slurm to go away.  
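If adding PTR records for 172.21.15.0/24 to the DNS isn't immediately possible, glibc's resolver will also satisfy reverse lookups from /etc/hosts on each host running munge/Slurm. A minimal sketch (bigger9's address is taken from the unmunge output; the second entry is a placeholder, not from the ticket):

```
# /etc/hosts sketch; the second line is a placeholder example
172.21.15.102   bigger9.drcmr    bigger9
172.21.15.103   bigger10.drcmr   bigger10
```

The same entries would need to exist on every host that decodes munge credentials, so proper reverse DNS is the cleaner long-term fix.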

Thanks,
Ben
Comment 39 Torkil Svensgaard 2020-12-07 23:25:51 MST
(In reply to Ben Roberts from comment #38)
 
> I looked at the way these job attributes are set and I think I see the
> reason for the disconnect.  It has to do with where attributes are set with
> different ways of submitting with a GPU/gres.  If you submit with
> --gres=gpu:1 it sets tres_per_node.  If you submit with --gpus=1 it will set
> tres_per_job.  I realize this can be confusing and will probably require
> some extra logic to see how users submit (unless you can be sure they all
> submit the same way).  Let me know if making this change in your test does
> address the lua errors you're seeing.

They're users, they'll do weird things =) You were right, job_desc.tres_per_node is populated. We can work with that.
 
> For the munge error, the fact that you see "ENCODE_HOST:     ???
> (172.21.15.102)" does confirm that there is an issue with the host lookup. 
> It should show the hostname rather than the series of question marks.  You
> mentioned earlier that you hadn't configured reverse DNS for 172.21.15.0/24.
> If you get this configured and the munge command you ran shows the hostname
> of the machine I would expect these errors in Slurm to go away.  

Indeed it did.

Also, "error: _core_bitmap2str: bad core offset (47 >= 23)" went away after older jobs were stopped, so all good.
Comment 40 Torkil Svensgaard 2020-12-07 23:30:26 MST
(In reply to Torkil Svensgaard from comment #31)
 
> Ah, of course. I changed it to CR_CPU and cut down the node definition to
> just "NodeName=bigger9 CPUs=48 RealMemory=257552 Gres=gpu:1" and now I get
> 47. 
> 
> Which configuration would be the better one though? It would depend on the
> workloads of course but perhaps you have some insights?

Any comments on the above?

Also, regarding state, this looks odd to me when it's only 1 core out of 48 that's reserved:

"
# sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
application    up   infinite      1    mix bigger10
HPC*           up   infinite      1   resv bigger9
HPC*           up   infinite      1   idle gojira
"

Not an error, just a little odd. "idle/resv" would be more accurate or even "mix/resv".

Mvh.

Torkil
Comment 41 Torkil Svensgaard 2020-12-08 07:15:27 MST
Another thing, the way we do it is users log in through a thin client and start terminals like this by clicking a terminal shortcut:

"
srun --partition=application --cpus-per-task=1 --ntasks=1 --export=ALL --x11 "xfce4-terminal"
"

With the reservation/GPU configuration I thought we could do another srun from that terminal to get another terminal with gres=gpu:1 for interactive sessions, but that doesn't work as I can't seem to escape the initial srun allocation. It would also waste a core, as the first srun terminal wouldn't be used for much.

Can you suggest a good solution? We could of course have the users decide on the need for a GPU before they start the first terminal, but users get easily confused and might hog GPUs by accident if they get multiple terminals to choose from. We would much prefer them having to type a command to get a GPU, akin to having to explicitly request it in an sbatch header.

Mvh.

Torkil
Comment 42 Ben Roberts 2020-12-08 11:27:43 MST
> Which configuration would be the better one though? It would depend on the
> workloads of course but perhaps you have some insights?

Sorry I glossed over this question the first time.  You're right that it would depend on the types of jobs you typically have on your cluster.  If you have a lot of single processor jobs and you want to make sure the nodes run as many of these jobs as possible then it does make sense to use CR_CPU.  If you have larger jobs and have requirements that the core can't be shared then CR_Core would be the way to go.  I can't say that one is better than the other, they just have different use cases.  I will point out that you can set SelectTypeParameters on partitions as well, so if you want the cluster generally to work one way, but have just a specific set of nodes work another you can put those nodes in a separate partition and define a different parameter for them.
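As a slurm.conf sketch of that per-partition override (a hypothetical layout reusing the partition and node names from your sinfo output, not your actual configuration):

```
# slurm.conf sketch: CR_CPU cluster-wide, CR_Core for one partition
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU
PartitionName=HPC Nodes=bigger9,gojira Default=YES State=UP
PartitionName=application Nodes=bigger10 SelectTypeParameters=CR_Core State=UP
```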

There isn't currently logic to have reserved nodes show multiple states if the reservation is only for a part of the node.  I'm not sure right now how involved it would be to add something like that, but it would most likely require a sponsor for the development work.  If you're interested in sponsoring something like that let me know and we can look into it further.

It is possible to make the workflow you're describing work, but it would leave a core idle in the terminal they're no longer using, as you pointed out.  To be able to run a new GPU job from within an existing 'non-GPU' job you would remove the references to the current job id so it is treated as an unrelated allocation.  Here's an example of how to do this:

$ srun -n1 --pty /bin/bash

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1538     debug     bash      ben  R       0:02      1 node01

$ srun -n1 --gpus=1 --pty /bin/bash
srun: error: Unable to create step for job 1538: Invalid generic resource (gres) specification

$ unset SLURM_JOB_ID

$ unset SLURM_JOBID

$ srun -n1 --gpus=1 --pty /bin/bash

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1539     debug     bash      ben  R       0:02      1 node01
              1538     debug     bash      ben  R       0:30      1 node01

$ scontrol show job 1539 | grep TRES=
   TRES=cpu=1,node=1,billing=2,gres/gpu=1


This does mean the user needs to remember to exit out of two jobs as well or it could leave CPUs tied up for even longer.  So, it's technically possible, but whether that's something you want to implement in your environment is another question.  The alternative I see is that you make users exit the first job allocation they are in and re-submit a request with a GPU.  You could create a wrapper script that submits the srun command for them and write it to accept an argument like 'gpu' so they explicitly have to request it.
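One way to sketch such a wrapper (the function name is hypothetical and the srun options echo the terminal shortcut from comment 41; a sketch of the idea, not a tested implementation):

```shell
# Hypothetical wrapper: builds the srun command line for the terminal
# shortcut, adding a GPU only when the user explicitly passes "gpu".
build_srun_cmd() {
    args="--partition=application --cpus-per-task=1 --ntasks=1 --export=ALL --x11"
    if [ "$1" = "gpu" ]; then
        args="$args --gres=gpu:1"
    fi
    # Print instead of executing so the sketch can be inspected/dry-run;
    # a real wrapper would run: exec srun $args xfce4-terminal
    echo "srun $args xfce4-terminal"
}
```

Users then have to type something like `start_terminal gpu` deliberately, mirroring an explicit `--gres` line in an sbatch header.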

Hopefully this helps.

Thanks,
Ben
Comment 43 Torkil Svensgaard 2020-12-10 01:53:15 MST
(In reply to Ben Roberts from comment #42)
> > Which configuration would be the better one though? It would depend on the
> > workloads of course but perhaps you have some insights?
> 
> Sorry I glossed over this question the first time.  You're right that it
> would depend on the types of jobs you typically have on your cluster.  If
> you have a lot of single processor jobs and you want to make sure the nodes
> run as many of these jobs as possible then it does make sense to use CR_CPU.
> If you have larger jobs and have requirements that the core can't be shared
> then CR_Core would be the way to go.  I can't say that one is better than
> the other, they just have different use cases.  I will point out that you
> can set SelectTypeParameters on partitions as well, so if you want the
> cluster generally to work one way, but have just a specific set of nodes
> work another you can put those nodes in a separate partition and define a
> different parameter for them.

Looking at this and cgroups now. It looks like ConstrainCores works exactly as advertised and there is no ConstrainCPU? So with CR_CPU we will get some overcommit?

Mvh.

Torkil
Comment 44 Ben Roberts 2020-12-10 13:00:27 MST
That's correct, ConstrainCores limits the job to the Core it has been allocated, but there isn't an option to constrain a job to a CPU rather than a Core.  With ConstrainCores it's possible for jobs to spill over onto the other CPU on their allocated Core.  

For your reference, you can see the CPUs available to a job from within that job by looking at the CgroupMountpoint (which defaults to /sys/fs/cgroup) and then navigating to cpuset/slurm/uid_<Uid Of User>/job_<JobId>/cpuset.cpus.  An example would look like this:
/sys/fs/cgroup/cpuset/slurm_node01/uid_1000/job_10545/cpuset.cpus
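A small sketch of composing that path from within a job (assumes the default cgroup v1 mountpoint; note the slurm directory may carry a node-name suffix, as in the slurm_node01 example above):

```shell
# Sketch: build the path to this job's allowed-CPU list under the
# default CgroupMountpoint (/sys/fs/cgroup, cgroup v1 cpuset layout).
cpuset_file() {
    echo "/sys/fs/cgroup/cpuset/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/cpuset.cpus"
}
# Inside a job step you could then run: cat "$(cpuset_file)"
```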

Thanks,
Ben
Comment 45 Torkil Svensgaard 2020-12-13 23:28:14 MST
Hi Ben

Thanks for the explanation.

Still not quite done with the lookup thing it seems, this is from running scontrol reconfigure just now:

"
[2020-12-14T07:20:57.110] error: slurm_auth_get_host: Lookup failed for 0.0.0.0
[2020-12-14T07:20:57.163] sched: _slurm_rpc_allocate_resources JobId=1312 NodeList=bigger10 usec=53468
[2020-12-14T07:20:58.883] error: slurm_auth_get_host: Lookup failed for 0.0.0.0
[2020-12-14T07:20:58.883] sched: _slurm_rpc_allocate_resources JobId=1313 NodeList=bigger10 usec=517
[2020-12-14T07:22:17.091] Processing Reconfiguration Request
[2020-12-14T07:22:17.092] No memory enforcing mechanism configured.
[2020-12-14T07:22:17.097] restoring original state of nodes
[2020-12-14T07:22:17.097] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-14T07:22:17.098] read_slurm_conf: backup_controller not specified
[2020-12-14T07:22:17.098] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2020-12-14T07:22:17.098] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-14T07:22:17.098] No parameter for mcs plugin, default values set
[2020-12-14T07:22:17.098] mcs: MCSParameters = (null). ondemand set.
[2020-12-14T07:22:17.098] _slurm_rpc_reconfigure_controller: completed usec=7209
[2020-12-14T07:22:17.768] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-12-14T07:23:42.361] error: slurm_auth_get_host: Lookup failed for 0.0.0.0
[2020-12-14T07:23:42.362] sched: _slurm_rpc_allocate_resources JobId=1314 NodeList=bigger10 usec=650
[2020-12-14T07:23:59.352] error: get_name_info: getnameinfo() failed: Name or service not known
[2020-12-14T07:23:59.352] error: slurm_auth_get_host: Lookup failed for 127.0.0.2
[2020-12-14T07:23:59.352] _slurm_rpc_submit_batch_job: JobId=1315 InitPrio=4294901586 usec=434
[2020-12-14T07:23:59.997] sched: Allocate JobId=1315 NodeList=bigger9 #CPUs=1 Partition=HPC
[2020-12-14T07:24:00.711] _job_complete: JobId=1315 WEXITSTATUS 0
[2020-12-14T07:24:00.711] _job_complete: JobId=1315 done
"

Fails for 0.0.0.0 and 127.0.0.2 which isn't in our DNS.

Mvh.

Torkil
Comment 46 Ben Roberts 2020-12-14 09:28:46 MST
Hi Torkil,

It looks like there's something going on that's causing some of your nodes to encode messages with Munge with an address other than the primary IP of the node.  Do you have any idea which nodes might be associated with these errors?  At this point it seems like it's going to take a bit more digging and would probably be better in a separate ticket.  Would you mind opening a new ticket with these details?  

Thanks,
Ben
Comment 47 Torkil Svensgaard 2020-12-17 04:27:43 MST
(In reply to Ben Roberts from comment #46)
> Hi Torkil,

Hi Ben

> It looks like there's something going on that's causing some of your nodes
> to encode messages with Munge with an address other than the primary IP of
> the node.  Do you have any idea which nodes might be associated with these
> errors?  At this point it seems like it's going to take a bit more digging
> and would probably be better in a separate ticket.  Would you mind opening a
> new ticket with these details?  

Ok, I created a new ticket for that.

Back to this one, the mechanism with GPU as gres and no GPU queue has some problems for us. I don't know if you are familiar with FSL but it comes with a wrapper script (fsl_sub) which we've hacked to support SLURM. Using that fails miserably with the current setup, as the gres GPU doesn't seem to be inherited by the child SLURM jobs spawned by fsl_sub.

So we have:

Terminal running as an srun task, no GPU
Submit sbatch with gres=gpu
Sbatch job calls fsl_sub which in turn submits new sbatch jobs

Is there a way to have this/these last sbatch job(s) use the existing reservation made by the initial sbatch, or something like that?

Hope that makes sense :S

Mvh.

Torkil
Comment 48 Ben Roberts 2020-12-17 10:51:28 MST
Hi Torkil,

Thanks for moving the other issue to its own ticket.  For your workflow question, if the second and third steps you describe need a GPU then there isn't a way to have them run in the resources allocated to srun (in the first step you describe), since that job wasn't allocated a GPU and one can't be added to a running job.  However, it sounds like changing fsl_sub to do an srun instead of an sbatch should allow the third job to use the resources allocated to the sbatch job (with the GPU).  Using sbatch will always get you a unique job id and a new allocation of resources, but if you're already in a job and you use srun then it will (by default) create a job step within the existing allocation.

Here's an example of how a job that submits another job will create a unique job id when using sbatch for the sub-job.
$ cat 9957.sh 
#!/bin/bash

#SBATCH -N1
#SBATCH --exclusive
#SBATCH -p debug

date
sbatch /home/ben/slurm/test.job
sleep 30
date


$ sbatch 9957.sh 
Submitted batch job 1584

$ squeue -s
         STEPID     NAME PARTITION     USER      TIME NODELIST
     1584.batch    batch     debug      ben      0:03 node01
    1584.extern   extern     debug      ben      0:03 node01
     1585.batch    batch       gpu      ben      0:02 node02
    1585.extern   extern       gpu      ben      0:02 node02



You can see that my submission of the 9957.sh job script creates job id 1584.  That job then calls sbatch again and creates job 1585.

If I change the submission to use srun instead of sbatch then I get the other script to run in the already allocated resources.
$ cat 9957.sh 
#!/bin/bash

#SBATCH -N1
#SBATCH --exclusive
#SBATCH -p debug

date
srun /home/ben/slurm/test.job
sleep 30
date


$ sbatch 9957.sh 
Submitted batch job 1586

$ squeue -s
         STEPID     NAME PARTITION     USER      TIME NODELIST
         1586.0 test.job     debug      ben      0:02 node01
     1586.batch    batch     debug      ben      0:02 node01
    1586.extern   extern     debug      ben      0:02 node01



It sounds like you've already modified fsl_sub some.  Can you modify it further to use srun for cases like this?  I know this doesn't have it use the resources from the first job, but does this get closer to what you want to accomplish?  

Thanks,
Ben
Comment 49 Torkil Svensgaard 2020-12-17 13:28:04 MST
(In reply to Ben Roberts from comment #48)
> Hi Torkil,

Hi Ben

> Thanks for moving the other issue to its own ticket.  For your workflow
> question, if the second and third steps you describe need a GPU then there
> isn't a way to have them run in the resources allocated to srun (in the
> first step you describe) since it wasn't allocated a GPU and it can't be
> added to a running job.  However, it sounds like changing fsl_sub to doing
> an srun instead of an sbatch should allow the third job to use the resources
> allocated to the sbatch job (with the GPU).  Using sbatch will always get
> you a unique job id and new allocation of resources, but if you're already
> in a job and you use srun then it will (by default) create a job step within
> the existing allocation.

Ah, thanks for the explanation. Inheriting the resources from the first sbatch is just what I need, I don't need to inherit from the first srun. However,

 > It sounds like you've already modified fsl_sub some.  Can you modify it
> further to use srun for cases like this? 

Fsl_sub creates array jobs for some tasks, and those have to be sbatch? Modifying it to use slurm instead of sge wasn't so hard, since all the logic was pretty much the same, but going sbatch->srun for array jobs would probably be a different beast.

Mvh.

Torkil
Comment 50 Torkil Svensgaard 2020-12-17 13:44:13 MST
(In reply to Torkil Svensgaard from comment #49)

> Modifying it to use slurm instead of sge wasn't so hard, since all the logic
> was pretty much the same, but going sbatch->srun for array jobs would
> probably be a different beast.

The array jobs are done by outputting a bunch of commands to a text file and then creating an array job via a for loop with i = number of commands in the text file. This might not be hard to change to sruns instead.

It looks like srun also has the notion of dependencies, so perhaps not that difficult after all. I'll look at it tomorrow. Thanks!
Comment 51 Ben Roberts 2020-12-17 14:01:34 MST
You're right, neither srun nor salloc has the option to submit a job array like sbatch does.  I think there are a couple of approaches you could take for job arrays.  You could still use sbatch in this case, or you could have it submit a series of job steps with srun in the for loop you mention.  One thing to keep in mind is that if the existing job allocation doesn't have enough resources for all these steps to run at once, the job will take longer to run than if you had submitted a job array with sbatch.
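A minimal sketch of the srun-in-a-loop approach you describe (hypothetical helper; the LAUNCH override is only there so the loop can be exercised outside Slurm, and inside a job it defaults to srun):

```shell
# Run each line of a command file as a job step inside the current
# allocation, instead of submitting an sbatch array.
run_steps() {
    cmdfile=$1
    while IFS= read -r cmd; do
        [ -n "$cmd" ] || continue
        # Launch each command as a concurrent step; steps queue up if the
        # allocation lacks free CPUs for all of them at once.
        ${LAUNCH:-srun -n1} sh -c "$cmd" &
    done < "$cmdfile"
    wait   # block until every step finishes, like waiting on the array
}
# Usage inside a job script: run_steps commands.txt
```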

Thanks,
Ben
Comment 52 Torkil Svensgaard 2020-12-22 02:32:44 MST
Hi Ben 

I just upgraded the following packages on the nodes:

"
(1/4): slurm-perlapi-20.11.2-1.el8.x86_64.rpm
(2/4): slurm-slurmd-20.11.2-1.el8.x86_64.rpm
(3/4): slurm-20.11.2-1.el8.x86_64.rpm
(4/4): microcode_ctl-20200609-2.20201027.1.el8_3.x86_64.rpm

After reboot some nodes refuse to start/resume due to:

"
[2020-12-22T10:26:18.530] error: _slurm_rpc_node_registration node=smaug: Invalid argument
[2020-12-22T10:26:19.532] error: _slurm_rpc_node_registration node=smaug: Invalid argument
[2020-12-22T10:26:20.536] error: _slurm_rpc_node_registration node=smaug: Invalid argument
[2020-12-22T10:26:24.737] update_node: node smaug state set to IDLE
[2020-12-22T10:26:26.561] error: Setting node smaug state to DRAIN with reason:Low RealMemory
"

Smaug is configured like so, with the RealMemory taken by running "slurmd -C" on the host. 

"
NodeName=smaug CPUs=128 RealMemory=515572 MemSpecLimit=1024
"

After the upgrade "slurmd -C" on smaug reports this:

"
NodeName=smaug CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=515561
UpTime=0-00:06:11
"

Is there a way to avoid this? 

Mvh.

Torkil
Comment 53 Torkil Svensgaard 2020-12-22 02:36:05 MST
Also, slurmd fails to start after reboot. I can't really see any hints in the logs.

"
# journalctl -xef -uslurmd
-- Logs begin at Tue 2020-12-22 10:23:49 CET. --
Dec 22 10:23:53 bigger11.drcmr systemd[1]: Started Slurm node daemon.
-- Subject: Unit slurmd.service has finished start-up
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- Unit slurmd.service has finished starting up.
-- 
-- The start-up result is done.
Dec 22 10:23:53 bigger11.drcmr systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Dec 22 10:23:53 bigger11.drcmr systemd[1]: slurmd.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
-- 
-- The unit slurmd.service has entered the 'failed' state with result 'exit-code'.
^C
"

Suggestions on how to debug that?

Mvh.

Torkil
Comment 54 Felip Moll 2020-12-22 05:12:11 MST
(In reply to Torkil Svensgaard from comment #53)
> Also, slurmd fails to start after reboot. I can't really see any hints in
> the logs.
> 
> "
> # journalctl -xef -uslurmd
> -- Logs begin at Tue 2020-12-22 10:23:49 CET. --
> Dec 22 10:23:53 bigger11.drcmr systemd[1]: Started Slurm node daemon.
> -- Subject: Unit slurmd.service has finished start-up
> -- Defined-By: systemd
> -- Support: https://access.redhat.com/support
> -- 
> -- Unit slurmd.service has finished starting up.
> -- 
> -- The start-up result is done.
> Dec 22 10:23:53 bigger11.drcmr systemd[1]: slurmd.service: Main process
> exited, code=exited, status=1/FAILURE
> Dec 22 10:23:53 bigger11.drcmr systemd[1]: slurmd.service: Failed with
> result 'exit-code'.
> -- Subject: Unit failed
> -- Defined-By: systemd
> -- Support: https://access.redhat.com/support
> -- 
> -- The unit slurmd.service has entered the 'failed' state with result
> 'exit-code'.
> ^C
> "
> 
> Suggestions on how to debug that?
> 
> Mvh.
> 
> Torkil

Hi, sorry to jump in.

I've just realized that bug 10455 comes from here and I am interested in seeing this information too.

Torkil, as root run 'slurmd -Dvvv' and we'll see why it fails. Paste the entire output here please.

--------

As for your RealMemory question, the real memory on a Linux system can differ slightly any time you boot or upgrade the kernel. Check it with 'free -m' and you will see that it doesn't exactly correspond to your physical memory.

RealMemory=515572
RealMemory=515561

These differ by only 11 MB, but that's enough to cause the error you see. The kernel must have reserved some more RAM for its own purposes.
I recommend not setting RealMemory to the exact memory the node shows at a certain point, but rounding down to a round number instead, e.g. if your node has 515572 MB ≈ 503 GiB, just set 500 GiB => RealMemory=512000.
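A minimal slurm.conf sketch of that rounding for smaug (keeping the rest of the node line from the ticket):

```
# slurm.conf: keep RealMemory safely below what "slurmd -C" reports
NodeName=smaug CPUs=128 RealMemory=512000 MemSpecLimit=1024
```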
Comment 55 Torkil Svensgaard 2020-12-22 05:20:37 MST
(In reply to Felip Moll from comment #54)
 
> Hi, sorry to jump in.
> 
> I've just realized that bug 10455 comes from here and I am interested in
> seeing this information too.

By all means. Should I have tagged 10455 in some way to make it clear there was some history?
 
> Torkil, as root run 'slurmd -Dvvv' and we'll see why it fails. Paste the
> entire output here please.

"
Last login: Tue Dec 22 13:14:04 2020 from 172.21.140.12
# gojira/root ~ 
# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2020-12-22 13:16:18 CET; 41s ago
  Process: 2333 ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 2333 (code=exited, status=1/FAILURE)

Dec 22 13:16:18 gojira.drcmr systemd[1]: Started Slurm node daemon.
Dec 22 13:16:18 gojira.drcmr systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Dec 22 13:16:18 gojira.drcmr systemd[1]: slurmd.service: Failed with result 'exit-code'.
# gojira/root ~ 
# slurmd -Dvvv
slurmd: debug:  Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug:  CPUs:96 Boards:1 Sockets:2 CoresPerSocket:24 ThreadsPerCore:2
slurmd: error: Node configuration differs from hardware: CPUs=96:96(hw) Boards=1:1(hw) SocketsPerBoard=96:2(hw) CoresPerSocket=1:24(hw) ThreadsPerCore=1:2(hw)
slurmd: debug:  Reading cgroup.conf file /var/spool/slurm/d/conf-cache/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurm/d/hwloc_topo_whole.xml) found
slurmd: debug:  CPUs:96 Boards:1 Sockets:2 CoresPerSocket:24 ThreadsPerCore:2
slurmd: debug:  skipping GRES for NodeName=bigger9  AutoDetect=nvml

slurmd: debug:  gres/gpu: init: loaded
slurmd: debug:  gpu/generic: init: init: GPU Generic plugin loaded
slurmd: topology/none: init: topology NONE plugin loaded
slurmd: route/default: init: route default plugin loaded
slurmd: debug2: Gathering cpu frequency information for 96 cpus
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Reading cgroup.conf file /var/spool/slurm/d/conf-cache/cgroup.conf
slurmd: debug:  system cgroup: memory: total:515581M allowed:100%(enforced), swap:0%(permissive), max:100%(515581M) max+swap:100%(1031162M) min:30M kmem:100%(515581M permissive) min:30M
slurmd: debug:  system cgroup: system memory cgroup initialized
slurmd: Resource spec: system cgroup memory limit set to 1024 MB
slurmd: debug:  task/cgroup: init: core enforcement enabled
slurmd: debug:  task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:515581M allowed:100%(enforced), swap:0%(permissive), max:100%(515581M) max+swap:100%(1031162M) min:30M kmem:100%(515581M permissive) min:30M swappiness:0(unset)
slurmd: debug:  task/cgroup: init: memory enforcement enabled
slurmd: debug:  task/cgroup: task_cgroup_devices_init: unable to open /var/spool/slurm/d/conf-cache/cgroup_allowed_devices_file.conf: No such file or directory
slurmd: debug:  task/cgroup: init: device enforcement enabled
slurmd: debug:  task/cgroup: init: task/cgroup: loaded
slurmd: debug:  auth/munge: init: Munge authentication plugin loaded
slurmd: debug:  spank: opening plugin stack /var/spool/slurm/d/conf-cache/plugstack.conf
slurmd: cred/munge: init: Munge credential signature plugin loaded
slurmd: slurmd version 20.11.2 started
slurmd: debug:  jobacct_gather/linux: init: Job accounting gather LINUX plugin loaded
slurmd: debug:  job_container/none: init: job_container none plugin loaded
slurmd: debug:  switch/none: init: switch NONE plugin loaded
slurmd: slurmd started on Tue, 22 Dec 2020 13:17:12 +0100
slurmd: CPUs=96 Boards=1 Sockets=96 Cores=1 Threads=1 Memory=515581 TmpDisk=226773 Uptime=62 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
slurmd: debug:  acct_gather_profile/none: init: AcctGatherProfile NONE plugin loaded
slurmd: debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
slurmd: debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/var/spool/slurm/d/conf-cache/acct_gather.conf)
slurmd: debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.
"
 
> As for your RealMemory question, the real memory on a Linux system can differ
> slightly each time you boot or upgrade the kernel. Check it with 'free -m' and
> you will see that it doesn't exactly correspond to your physical memory.
> 
> RealMemory=515572
> RealMemory=515561
> 
> These differ by only a few MiB, but that's enough to cause the error you see.
> The kernel must have reserved some more RAM for its own purposes.
> I recommend not setting RealMemory to the exact value the node shows at a
> certain point, but rounding down instead, e.g. if your node has 515572 MiB
> (about 503 GiB), just set 500 GiB => RealMemory=512000.

Thanks, just what I was looking for.

Mvh.

Torkil
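The rounding advice quoted above can be sketched in a few lines. This is a hypothetical helper (not part of Slurm, and `conservative_realmemory` is a made-up name): round the memory reported by 'free -m' down to a conservative RealMemory value so small boot-to-boot changes don't trigger the low-memory error.

```python
# Hypothetical sketch of the rounding advice: round reported memory (MiB)
# down to the nearest multiple of step_gib GiB so a RealMemory setting
# survives small boot-to-boot variations.
def conservative_realmemory(reported_mib, step_gib=50):
    """Round reported memory (MiB) down to the nearest step_gib GiB."""
    step_mib = step_gib * 1024
    return (reported_mib // step_mib) * step_mib

# Both values seen on this node round to the same safe setting:
print(conservative_realmemory(515572))  # 512000 (500 GiB)
print(conservative_realmemory(515561))  # 512000
```

With the default 50 GiB step, both observed values map to RealMemory=512000, the value suggested above.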
Comment 56 Felip Moll 2020-12-22 05:43:17 MST
(In reply to Torkil Svensgaard from comment #55)
> (In reply to Felip Moll from comment #54)
>  
> > Hi, sorry to jump in.
> > 
> > I've just realized that bug 10455 comes from here and I am interested in
> > seeing this information too.
> 
> By all means. Should/could I have tagged in 10455 in some way to make it
> clear there was some history?

When writing the description of bug 10455, just saying "coming from bug 9957" is enough for us.
There's also the field See Also you can use if you want.

I could have realized that before, but we receive a lot of tickets and comments daily and I missed this one. Sorry for that.

> ...
> slurmd: debug:  _handle_node_reg_resp: slurmctld sent back 8 TRES.

So slurmd works and starts just fine from the command line.
The issue must be in how systemd starts it. What does the slurmd log show if you set 'SlurmdDebug=debug2' in slurm.conf and then start it with systemd? What does 'journalctl -xn 300' show immediately after starting it with 'systemctl start slurmd'?

Can we also see 'systemctl cat slurmd' ?

--

The only error I see is:

slurmd: error: Node configuration differs from hardware: CPUs=96:96(hw) Boards=1:1(hw) SocketsPerBoard=96:2(hw) CoresPerSocket=1:24(hw) ThreadsPerCore=1:2(hw)

This error won't prevent Slurm from running, but task affinity may not be ideal. Could you specify the architecture in slurm.conf to avoid this error?

NodeName=gojira CPUs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=512000 MemSpecLimit=1024
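As an illustration of where the "Node configuration differs from hardware" error comes from: slurmd compares each topology value in slurm.conf against what it detects at registration. This is a hypothetical sketch of that comparison (`config_vs_hardware` is a made-up helper, not Slurm source), using the values from the error message above.

```python
# Hypothetical illustration (not Slurm code) of the node registration
# check: report every field where the slurm.conf node definition
# disagrees with the detected hardware.
def config_vs_hardware(configured, hardware):
    """Return {field: (configured, detected)} for mismatched fields."""
    return {k: (configured[k], hardware[k])
            for k in configured if configured[k] != hardware[k]}

# Values taken from the error message above (config:hw):
configured = {"CPUs": 96, "Boards": 1, "SocketsPerBoard": 96,
              "CoresPerSocket": 1, "ThreadsPerCore": 1}
hardware = {"CPUs": 96, "Boards": 1, "SocketsPerBoard": 2,
            "CoresPerSocket": 24, "ThreadsPerCore": 2}
print(config_vs_hardware(configured, hardware))
```

The mismatched fields are exactly the three Slurm flagged: SocketsPerBoard, CoresPerSocket and ThreadsPerCore. Setting them to the detected values (as in the NodeName line above) makes the check pass, since 2*24*2 = 96 = CPUs.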
Comment 57 Torkil Svensgaard 2020-12-22 23:47:35 MST
(In reply to Felip Moll from comment #56)

> So slurmd works and is started just well from command line.
> The issue must be in how systemd starts it. What does the slurmd log show if
> you set slurm.conf 'SlurmdDebug=debug2' after starting it with systemd? What
> does 'journalctl -xn 300' show immediately after starting with 'systemctl
> start slurmd'?

I think it's more a problem of when systemd starts it, since it starts just fine when started manually through systemd. This is from journald right after rebooting:

"
# journalctl | grep slurm
Dec 23 07:37:39 gojira.drcmr slurmd[2345]: error: resolve_ctls_from_dns_srv: res_nsearch error: Host name lookup failure
Dec 23 07:37:39 gojira.drcmr slurmd[2345]: error: fetch_config: DNS SRV lookup failed
Dec 23 07:37:39 gojira.drcmr systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Dec 23 07:37:39 gojira.drcmr slurmd[2345]: error: _establish_configuration: failed to load configs
Dec 23 07:37:39 gojira.drcmr systemd[1]: slurmd.service: Failed with result 'exit-code'.
Dec 23 07:37:39 gojira.drcmr slurmd[2345]: error: slurmd initialization failed
"
 
> Can we also see 'systemctl cat slurmd' ?

"
# systemctl cat slurmd
# /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes


[Install]
WantedBy=multi-user.target
"

The unit waits for network.target, so I'm not sure why the DNS lookup fails. Retrying might also fix it.
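For what it's worth, network.target only guarantees that network management has been started, not that the network is routable or that DNS works; the usual way to wait for the latter is a drop-in pulling in network-online.target. A hedged sketch of such a drop-in (hypothetical path, and not necessarily what was ultimately applied in bug 10455):

```ini
# /etc/systemd/system/slurmd.service.d/wait-online.conf  (hypothetical)
[Unit]
Wants=network-online.target
After=network-online.target
```

After adding a drop-in, 'systemctl daemon-reload' makes systemd pick it up, and 'systemctl cat slurmd' then shows the merged unit.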

> --
> 
> The only error I see is:
> 
> slurmd: error: Node configuration differs from hardware: CPUs=96:96(hw)
> Boards=1:1(hw) SocketsPerBoard=96:2(hw) CoresPerSocket=1:24(hw)
> ThreadsPerCore=1:2(hw)
> 
> This error won't impede slurm to run, but affinity may not be ideal. Is it
> possible you specify the architecture in slurm.conf to avoid this error?
> 
> NodeName=gojira CPUs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2
> RealMemory=512000 MemSpecLimit=1024

Of course, my bad. I thought I had to specify either CPUs OR Boards*Sockets*CoresPerSocket*ThreadsPerCore to get the total I wanted.

Thanks,

Torkil
Comment 58 Ben Roberts 2020-12-28 11:03:21 MST
Hi Torkil,

I was out for the holidays last week, so my apologies that I wasn't responsive.  I'm glad that Felip was able to jump in and help.  It looks like you were able to get to the bottom of the host lookup issue by changing the start order with systemd in bug 10455.  It sounds like this issue was resolved as well when you updated the node definition to match what is reported by 'slurmd -C', is that right?  Let me know if you still need help with this issue.

Thanks,
Ben
Comment 59 Torkil Svensgaard 2020-12-28 14:57:23 MST
(In reply to Ben Roberts from comment #58)
> Hi Torkil,

Hi Ben

> I was out for the holidays last week, so my apologies that I wasn't
> responsive.  I'm glad that Felip was able to jump in and help.  It looks
> like you were able to get to the bottom of the host lookup issue by changing
> the start order with systemd in bug 10455.  It sounds like this issue was
> resolved as well when you updated the node definition to match what is
> reported by 'slurmd -C', is that right?  Let me know if you still need help
> with this issue.

No problem, I hope you had a Merry Christmas =)

Yes, I'm good for now. Just ordered some more GPUs and I'll probably have some questions regarding their use but no point in moving forward until they are added to the nodes.

Happy New Year,

Torkil
Comment 60 Ben Roberts 2020-12-28 15:33:23 MST
I'm glad to hear things are looking good.  Since that's the case I'll go ahead and close this ticket.  If anything comes up with your new GPUs don't hesitate to open a new ticket and we'll be glad to look into it with you.  I hope you have a Happy New Year too.

Thanks,
Ben