Hi We just had a session with ben@schedmd.com where we discussed how to configure compute hosts with GPUs. The best solution seemed to be to use 1 core for each GPU and then use the rest of the cores for our HPC partition. It's not entirely clear to me how to do that, though. How do we split the cores on a given host between two partitions, e.g. 46 cores for the HPC partition and 2 cores for the GPU partition? Thanks, Torkil
Hi Torkil, The parameter you would want to use is MaxCPUsPerNode. You would place the node(s) in both partitions and then in the partition for the GPU work you would limit the number of CPUs to be equal to the number of GPUs. In the other partition you probably wouldn't define a similar limit because those wouldn't be the only nodes in the partition. Here is the documentation on this parameter: https://slurm.schedmd.com/slurm.conf.html#OPT_MaxCPUsPerNode One more thing that we discussed on the phone is the ability to tie certain cores to a GPU. You do that in the gres.conf file by specifying 'Cores' on the line with the GPU definition. You can read more about this option here: https://slurm.schedmd.com/gres.conf.html#OPT_Cores Please let me know if you have any problems with this or additional questions about it. Thanks, Ben
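Putting the two suggestions above together, a minimal configuration sketch might look like the following. Node names, core counts, memory sizes, and device paths are illustrative placeholders, not taken from the actual site:

```
# slurm.conf -- the same node appears in both partitions;
# the GPU partition is capped at one CPU per GPU (2 GPUs here)
NodeName=gpunode1 CPUs=48 Gres=gpu:2 RealMemory=192000
PartitionName=gpu Nodes=gpunode1 MaxCPUsPerNode=2 State=UP
PartitionName=hpc Nodes=gpunode1,node[01-10] State=UP

# gres.conf on gpunode1 -- tie each GPU to specific cores
Name=gpu File=/dev/nvidia0 Cores=0
Name=gpu File=/dev/nvidia1 Cores=1
```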
Hi Torkil, I wanted to follow up and see if the MaxCPUsPerNode parameter I recommended did address your need. Please let me know if you still need help with this or if this ticket is ok to close. Thanks, Ben
(In reply to Ben Roberts from comment #3) > Hi Torkil, Hi Ben > I wanted to follow up and see if the MaxCPUsPerNode parameter I recommended > did address your need. Please let me know if you still need help with this > or if this ticket is ok to close. Not quite at that point yet with the new install. I'm currently a bit stuck on something else; I'll just put it here since it's easier for me with just a single ticket and point of contact. Let me know if you need separate tickets. Getting this on my submit host: " bash-4.4$ srun hostname srun: error: Unable to create step for job 30: Error generating job credential " Log on the controller for this: " [2020-11-02T10:53:48.953] error: slurm_auth_get_host: Lookup failed for 172.21.15.30: Unknown host [2020-11-02T10:53:48.954] sched: _slurm_rpc_allocate_resources JobId=30 NodeList=bigger9 usec=2055 [2020-11-02T10:53:48.954] prolog_running_decr: Configuration for JobId=30 is complete [2020-11-02T10:53:48.959] error: slurm_cred_create: getpwuid failed for uid=1018 [2020-11-02T10:53:48.959] error: slurm_cred_create error [2020-11-02T10:53:48.960] _job_complete: JobId=30 WTERMSIG 1 [2020-11-02T10:53:48.961] _job_complete: JobId=30 done " SlurmUser = slurm has UID 20000 on all hosts, MungeUser = munge has UID 20001 on all hosts. 172.21.15.30 is my submit host, which isn't part of any partition. Suggestions?
Sorted that one. Please leave the ticket open if that's ok, while we do the initial implementation. If you need a fresh ticket for each issue feel free to close it. Mvh. Torkil
I'm glad you were able to get to the bottom of your last issue. I'm ok to leave this open for a while longer for some quick questions. If something that comes up turns out to be a bigger issue than originally thought I might have you split it into a new ticket, but I'm happy to leave this open for now. Thanks, Ben
Hi Ben Cool =) Currently a bit stuck on X11. I can do "srun --x11 someprogram" and it works fine, but the same someprogram doesn't seem to work when --x11 is put in an sbatch header. Suggestions? Mvh. Torkil
This is actually by design. Support for x11 forwarding with sbatch was removed in 19.05 due to changes in the underlying mechanism used to accomplish this. You would need to use either salloc or srun. You can find some additional details in bug 3647 or in this commit: https://github.com/SchedMD/slurm/commit/c97284691b6a0df57493a13132787a1a908a749f Let me know if you have additional questions about this. Thanks, Ben
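In practice that means moving the X11 request out of the batch script and onto an interactive command line. A sketch of the two supported forms (the program name is just a placeholder):

```shell
# one-shot interactive step with X11 forwarding
srun --x11 ./someprogram

# or: get an allocation with forwarding set up, then launch steps inside it
salloc --ntasks=1 --x11
srun ./someprogram
```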
(In reply to Ben Roberts from comment #8) > This is actually by design. Support for x11 forwarding with sbatch was > removed in 19.05 due to changes in the underlying mechanism used to > accomplish this. You would need to use either salloc or srun. You can find > some additional details in bug 3647 or in this commit: > https://github.com/SchedMD/slurm/commit/ > c97284691b6a0df57493a13132787a1a908a749f > > Let me know if you have additional questions about this. Ah, that explains it. That means users who run array jobs with X11 output will have to make some changes I guess. Thanks, Torkil
Hi Currently struggling a bit with getting srun to work from a .desktop file. If I open a terminal locally and run "/usr/bin/srun --x11 /mnt/depot64/fsl/fsl.6.0.1/bin/fsl" it works just fine. If I do it from a .desktop file with the following content it fails: " [Desktop Entry] Name=FSL Comment=Use the command line TryExec=srun Exec=/usr/bin/srun --x11 /mnt/depot64/fsl/fsl.6.0.1/bin/fsl Icon=/usr/local/share/desktop-icons/fsl.jpg Type=Application Terminal=true Categories=X-DRCMR DBusActivatable=true " Error on slurmctld: " X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. Nov 03 10:54:54 joe.drcmr at-spi-bus-launcher[2098]: dbus-daemon[2103]: Activating service name='org.a11y.atspi.Registry' requested by ':1.118' (uid=1018 pid=8918 comm="exo-open --launch TerminalEmulator /usr/bin/srun -" label="unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023") Nov 03 10:54:55 joe.drcmr at-spi2-registr[8921]: Could not open X display Nov 03 10:54:55 joe.drcmr at-spi-bus-launcher[2098]: dbus-daemon[2103]: Successfully activated service 'org.a11y.atspi.Registry' Nov 03 10:54:55 joe.drcmr at-spi-bus-launcher[2098]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry Nov 03 10:54:55 joe.drcmr at-spi2-registr[8921]: AT-SPI: Cannot open default display X11 connection rejected because of wrong authentication. X11 connection rejected because of wrong authentication. 
Nov 03 10:54:57 joe.drcmr at-spi-bus-launcher[2098]: dbus-daemon[2103]: Activating service name='org.a11y.atspi.Registry' requested by ':1.61' (uid=1018 pid=6408 comm="xfce4-panel " label="unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023") Nov 03 10:54:58 joe.drcmr at-spi2-registr[8932]: Could not open X display Nov 03 10:54:58 joe.drcmr at-spi-bus-launcher[2098]: dbus-daemon[2103]: Successfully activated service 'org.a11y.atspi.Registry' Nov 03 10:54:58 joe.drcmr at-spi-bus-launcher[2098]: SpiRegistry daemon is running with well-known name - org.a11y.atspi.Registry Nov 03 10:54:58 joe.drcmr at-spi2-registr[8932]: AT-SPI: Cannot open default display " Any suggestions? Also this error keeps cropping up: " error: slurm_auth_get_host: Lookup failed for 0.0.0.0: Unknown host " I had it yesterday too with a different IP. " error: slurm_auth_get_host: Lookup failed for 172.21.15.30: Unknown host " I think it changed after I rebooted this morning. Mvh. Torkil
Hi Torkil, This looks to me like an issue with the XAUTHORITY environment variable not being available to the srun command when run from within the desktop file. You may be able to have it get access to the right variable(s) by having it exec 'sh' instead of 'srun' directly. Exec=sh -c "/usr/bin/srun --x11 /mnt/depot64/fsl/fsl.6.0.1/bin/fsl" If that doesn't help you can try adding '-vvv' to the srun command to have it log some more information about what is going on on the srun side of things. Thanks, Ben
Hi Ben I got that sorted and now apps launch. I'm still seeing this one though: " [2020-11-03T21:05:49.239] error: slurm_auth_get_host: Lookup failed for 0.0.0.0: Unknown host " How do I get rid of that, if it's only cosmetic? Is it some sort of reverse DNS lookup? Mvh. Torkil
I can't seem to get srun to allocate less than an entire node. In slurm.conf I have this: " SelectType=select/cons_tres SelectTypeParameters=CR_Core PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE Shared=NO LLN=YES State=UP " Command line: " srun --cpus-per-task=1 --ntasks=1 --export=ALL --x11 fsl " What am I missing? Mvh. Torkil
Hmm, I thought number of cores etc was detected automagically if not specified but that evidently isn't so out of the box: " NODELIST CPUS MEMORY AVAIL_FEATURES GRES bigger9 1 1 (null) (null) " Is there a way to make it detect CPU and memory loadout? Mvh. Torkil
For your issue with slurm_auth_get_host, it looks like it's trying to get a hostname, but getting an ip address instead. I'm not sure why that would be. Is that only when you run it from a desktop file as well? This call is probably coming from the Munge code, here: https://github.com/SchedMD/slurm/blob/98e5e853a7ce52e284b3de14d5e98d70501a2d1f/src/plugins/auth/munge/auth_munge.c#L292-L336 Unfortunately there isn't a way to have Slurm populate the configuration of the nodes without specifying the information in the slurm.conf. If your nodes are homogeneous (or mostly homogeneous) you can use 'NodeName=DEFAULT' to specify the configuration once and have the majority of the nodes pick it up from there. If you have any that don't match the rest of the nodes you can define the correct settings for them on a separate line. Thanks, Ben
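As a sketch, the NodeName=DEFAULT pattern mentioned above looks like this; the names and hardware numbers here are purely illustrative:

```
# slurm.conf: DEFAULT applies to the NodeName lines that follow it
NodeName=DEFAULT CPUs=48 RealMemory=257552 ThreadsPerCore=2
NodeName=node[01-08]
# a node that differs just restates the fields that change
NodeName=bignode1 CPUs=96 RealMemory=515000
```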
Ah, thanks. Not a big hassle to make the configuration; it would just be convenient if it could be done automatically and overridden manually if need be. The error also shows up with srun run directly in a terminal: " [2020-11-03T22:51:32.323] error: slurm_auth_get_host: Lookup failed for 0.0.0.0: Unknown host [2020-11-03T22:51:32.323] sched: _slurm_rpc_allocate_resources JobId=243 NodeList=bigger9 usec=1228 [2020-11-03T22:51:32.323] prolog_running_decr: Configuration for JobId=243 is complete " Mvh. Torkil
I looked at where the last section of the code I sent previously came from and it looks like the decision was made to use the IP of the host in cases where the hostname lookup failed. That doesn't explain why the hostname lookup is failing in your case, but it does line up with the expected behavior as described in bug 7430. You can also see the relevant commit here: https://github.com/SchedMD/slurm/commit/a67548bc958f451c227437b1197f5c71f57ce771 Can you get the hostname manually? Thanks, Ben
(In reply to Ben Roberts from comment #17) > > Can you get the hostname manually? Get it how exactly? We haven't configured reverse DNS for 172.21.15.0/24, which is used by Slurm; could that be it? I tried disabling IPv6 as per the comments on the first link but that didn't make any difference. Mvh. Torkil
I hadn't paid close attention to the version listed on this ticket previously. It shows 21.08, but that isn't available yet. Since you're talking about disabling IPv6 I assume you're using 20.11, is that right? There were changes in 20.11, moving from gethostbyname() to getaddrinfo() to allow for IPv6 support. It looks like an internal bug was just opened for a regression introduced at some point that affects hosts with only an IPv4 address. It is being worked on currently and should be addressed before the official release of 20.11. I'll let you know as there's progress. Thanks, Ben
Hi Ben The version mismatch is probably my mistake. This is what we have, built with rpmbuild from source. " # slurm/root ~ # rpm -qa | grep slurm slurm-perlapi-20.02.5-1.el8.x86_64 slurm-slurmrestd-20.02.5-1.el8.x86_64 slurm-20.02.5-1.el8.x86_64 slurm-slurmdbd-20.02.5-1.el8.x86_64 slurm-contribs-20.02.5-1.el8.x86_64 slurm-example-configs-20.02.5-1.el8.x86_64 slurm-slurmctld-20.02.5-1.el8.x86_64 slurm-devel-20.02.5-1.el8.x86_64 " I changed the version in the ticket to 20.02.5. We are not using IPv6, but it's usually enabled on the interfaces on new CentOS installs. I just noticed it being mentioned in the link below and disabled it across the board on the new setup. https://github.com/SchedMD/slurm/blob/98e5e853a7ce52e284b3de14d5e98d70501a2d1f/src/plugins/auth/munge/auth_munge.c#L292-L336 Mvh. Torkil
(In reply to Ben Roberts from comment #2) Back to the original question, > The parameter you would want to use is MaxCPUsPerNode. You would place the > node(s) in both partitions and then in the partition for the GPU work you > would limit the number of CPUs to be equal to the number of GPUs. In the > other partition you probably wouldn't define a similar limit because those > wouldn't be the only nodes in the partition. Here is the documentation on > this parameter: > https://slurm.schedmd.com/slurm.conf.html#OPT_MaxCPUsPerNode I might be reading it wrong, or have phrased my question wrong, but this seems fairly useless unless all nodes have the exact same number of CPUs and GPUs, so that MaxCPUsPerNode can be set = (number of GPUs) for the GPU partition and = (number of cores - number of GPUs) for the HPC partition. The important thing here is not so much to limit the number of cores used by GPU jobs (that's accomplished by GRES) but to make sure one CPU core is reserved per GPU, and that can't be done? > One more thing that we discussed on the phone is the ability to tie certain > cores to a GPU. You do that in the gres.conf file by specifying 'Cores' on > the line with the GPU definition. You can read more about this option here: > https://slurm.schedmd.com/gres.conf.html#OPT_Cores I don't think I entirely understand this, but with this we could say core0 is the only core that can use GPU0, right? That will improve performance in cases where locality matters, like which socket a job is running on, but it can't be used to ensure a core is only used with a GPU? Mvh. Torkil
Hi Torkil, Your understanding is correct regarding the MaxCPUsPerNode parameter. The idea for this parameter is that your GPU nodes be configured to be in two partitions. With the MaxCPUsPerNode parameter set for each of them you can make sure there are CPUs available for GPU jobs in your 'gpu' partition and that the rest of the CPUs are available for other work in other partitions. This is designed to limit the number of CPUs available to GPU jobs more than to make sure there is one core reserved for each GPU that is inaccessible to jobs that don't use the GPU. For billing purposes you can use TRESBillingWeights and the MAX_TRES PriorityFlag to have jobs that request configurations other than 1 core to 1 GPU be billed for the maximum value of the resources they requested. For example, if the billing weights are defined such that CPUs and GPUs are equal, then jobs that request 2 GPUs and 2 CPUs, 1 GPU and 2 CPUs, or 2 GPUs and 1 CPU would all be billed the same amount. This doesn't prevent users from requesting the CPUs in the GPU partition, but it discourages them from doing so unless they are going to use the GPU too. One other possibility that might get closer to what you want would be to create a reservation for the number of CPUs you want and then have GPU jobs request that reservation (or you could configure a submit filter that adds the reservation to the request of GPU jobs). This would be a way to avoid having to create a separate partition, but it still doesn't prevent GPU jobs from requesting more CPUs than the number of GPUs they requested in that reservation. I know these still aren't exactly what you're looking for, but hopefully there's something that will come close enough to work. Please let me know if you'd like me to clarify anything. Thanks, Ben
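The billing-weight approach described above might be configured roughly like this; the partition and node names, and the weights themselves, are illustrative placeholders:

```
# slurm.conf sketch: bill a GPU and a CPU equally, charge for the max
PriorityFlags=MAX_TRES
PartitionName=gpu Nodes=gpunode1 TRESBillingWeights="CPU=1.0,GRES/gpu=1.0"
```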
(In reply to Ben Roberts from comment #22) > Hi Torkil, Hi Ben > One other possibility that might get closer to what you want would be to > create a reservation for the number of CPUs you want and then have GPU jobs > request that reservation (or you could configure a submit filter that adds > the reservation to the request of GPU jobs). This would be a way to avoid > having to create a separate partition, but still doesn't prevent GPU jobs > from requesting more CPUs than the number of GPUs they requested in that > reservation. This sounds like it would do exactly what I want. Can that filter be used such that we only have 1 partition for batch jobs and if gres=gpu is requested the filter kicks in? Mvh. Torkil
Yes, you can do that. Here's an example script I have from other testing where I was looking for jobs that request --gpus=x and setting a different partition on them when that was matched: ------------------------------ function slurm_job_submit(job_desc, part_list, submit_uid) slurm.log_user("TRES per job: %s, per task: %s", job_desc.tres_per_job, job_desc.tres_per_job) tres_string = string.find(job_desc.tres_per_job, "gpu") slurm.log_user("tres_string is %s", tres_string) if (string.find(job_desc.tres_per_job, "gpu")) then job_desc.partition = 'gpu' slurm.log_user("Set partition to: %s", job_desc.partition) end return slurm.SUCCESS end function slurm_job_modify(job_desc, job_rec, part_list, modify_uid) slurm.log_user("In lua modify function") --slurm.log_user("partition available name:%s", part.name) return slurm.SUCCESS end return slurm.SUCCESS ------------------------------ Here's an example job that matches the filter and has the partition modified: $ sbatch -N1 -pdebug --gpus=1 --wrap='srun sleep 30' sbatch: TRES per job: gpu:1, per task: gpu:1 sbatch: tres_string is 1 sbatch: Set partition to: gpu Submitted batch job 1423 $ scontrol show job 1423 | grep Partition Partition=gpu AllocNode:Sid=kitt:7786 This example shows the modification of the partition, but you can set a reservation on the job as well. In the code you can see a list of all the attributes of a job you can modify. For a reservation the attribute is just called 'reservation': https://github.com/SchedMD/slurm/blob/a18562932cefbc668e5e2d7552251c5e60f314be/src/plugins/job_submit/lua/job_submit_lua.c#L711 Let me know if you have questions about this. Thanks, Ben
Hi Ben Looking at creating a reservation, it seems to require setting a user and/or account, so I'm now looking at accounting, which we never really used before. The docs I found mention manually creating accounts and users, and we are not particularly interested in more work. I think I've managed to create a default account; how do I go about getting jobs registered to that account, and UNIX user names automatically created in the accounting database, if necessary? Mvh. Torkil
Hi again " # scontrol create reservationname=core_pr_gpu start=now duration=infinite partition=HPC Nodes=bigger9 CoreCnt=1 Account=drcmr Reservation created: core_pr_gpu # slurm/root run # scontrol show res ReservationName=core_pr_gpu StartTime=2020-12-03T12:19:24 EndTime=2021-12-03T12:19:24 Duration=365-00:00:00 Nodes=bigger9 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=HPC Flags=SPEC_NODES NodeName=bigger9 CoreIDs=0 TRES=cpu=2 Users=(null) Groups=(null) Accounts=drcmr Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a MaxStartDelay=(null) " Why/what is TRES=cpu=2? Just curious, since everything I set is 1. Mvh. Torkil
(In reply to Torkil Svensgaard from comment #26) > Hi again > > " > # scontrol create reservationname=core_pr_gpu start=now duration=infinite > partition=HPC Nodes=bigger9 CoreCnt=1 Account=drcmr > Reservation created: core_pr_gpu > # slurm/root run > # scontrol show res > ReservationName=core_pr_gpu StartTime=2020-12-03T12:19:24 > EndTime=2021-12-03T12:19:24 Duration=365-00:00:00 > Nodes=bigger9 NodeCnt=1 CoreCnt=1 Features=(null) PartitionName=HPC > Flags=SPEC_NODES > NodeName=bigger9 CoreIDs=0 > TRES=cpu=2 > Users=(null) Groups=(null) Accounts=drcmr Licenses=(null) State=ACTIVE > BurstBuffer=(null) Watts=n/a > MaxStartDelay=(null) > " > > Why/what is TRES=cpu=2? Just curious, since everything I set is 1. I presume it has something to do with hyperthreading. This host with an "AMD EPYC 7402P 24-Core Processor" is configured like this in slurm.conf: " NodeName=bigger9 CPUs=48 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=257552 Gres=gpu:1 " That configuration was generated by "slurmd -C". But when I try to start a large array job I'm only able to run 24 jobs at the same time, not 48. How come? Mvh. Torkil
(In reply to Torkil Svensgaard from comment #27) > But when I try to start a large array job I'm only able to run 24 jobs at > the same time, not 48. How come? Eh, 23, due to the reservation. Mvh. Torkil
(In reply to Ben Roberts from comment #24) > Yes, you can do that. Here's an example script I have from other testing > where I was looking for jobs that request --gpus=x and setting a different > partition on them when that was matched: I've built and installed slurmctld --with lua and I've put my script next to slurm.conf in /etc/slurm, but it doesn't seem to be picked up. Here's the script: " function slurm_job_submit(job_desc, part_list, submit_uid) slurm.log_user("TRES per job: %s, per task: %s", job_desc.tres_per_job, job_desc.tres_per_job) tres_string = string.find(job_desc.tres_per_job, "gpu") slurm.log_user("tres_string is %s", tres_string) if (string.find(job_desc.tres_per_job, "gpu")) then --job_desc.partition = 'application' job_desc.reservation = 'core_pr_cpu' --slurm.log_user("Set partition to: %s", job_desc.partition) slurm.log_user("Using reservation: %s", job_desc.reservation) end return slurm.SUCCESS end function slurm_job_modify(job_desc, job_rec, part_list, modify_uid) slurm.log_user("In lua modify function") --slurm.log_user("partition available name:%s", part.name) return slurm.SUCCESS end return slurm.SUCCESS " I've restarted and reconfigured; what did I miss? Mvh. Torkil
Hi Torkil, Unfortunately the user and account system that is used for Slurm does not have a method of directly importing the information from LDAP or the OS. Because we allow you to define an account hierarchy and allow users to be in multiple accounts, a tool like this would get very complex, and it would be difficult to have it put users in the right place(s). There are some users who have created scripts to help manage the Slurm users and accounts, and there was a good presentation on one of these scripts from the Slurm Users Group a couple of years ago. I'll link their slides in case it's helpful. https://slurm.schedmd.com/SLUG18/ewan_roche_slug18.pdf Regarding your reservation question, you're right that the reason it shows 'TRES=cpu=2' when you just requested one core is due to the fact that it's requesting a core, which is hyper-threaded. If you're only able to run 23 jobs on a 48 CPU machine then it sounds like you are using CR_CORE (or CR_CORE_MEMORY) for the SelectTypeParameters. From the documentation for CR_CORE it says: ----------------- On nodes with hyper-threads, each thread is counted as a CPU to satisfy a job's resource requirement, but multiple jobs are not allocated threads on the same core. The count of CPUs allocated to a job is rounded up to account for every CPU on an allocated core. ----------------- Here is an example of this with a single task job on a node with hyper-threading. $ sbatch -n1 -wnode01 --wrap='srun sleep 30' Submitted batch job 1517 $ scontrol show node node01 | grep AllocTRES AllocTRES=cpu=2,mem=15678M If I request 2 tasks you can see that they both fit on a single core, so it shows the same number of CPUs being allocated. $ sbatch -n2 -wnode01 --wrap='srun sleep 300' Submitted batch job 1518 $ scontrol show node node01 | grep AllocTRES AllocTRES=cpu=2,mem=15678M If you're submitting an array of single task jobs to a hyper-threaded node and using CR_CORE that would explain why you can only start 23 jobs rather than 47. 
Let me know if this isn't the case for you. For your last question, there is a parameter you have to set to enable the job submit plugin that you may not have seen. JobSubmitPlugins=lua Let me know if you've already got this set and it's still not picking up your script. Thanks, Ben
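For the account and user bookkeeping discussed above, the manual sacctmgr steps are short. A minimal session might look like this; the account and user names here just mirror the examples in this ticket:

```shell
# create the account once, then add users to it
sacctmgr add account drcmr Description="default account"
sacctmgr add user torkil Account=drcmr
# verify the resulting associations
sacctmgr show associations format=Cluster,Account,User
```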
(In reply to Ben Roberts from comment #30) > Hi Torkil, Hi Ben > If you're submitting an array of single task jobs to a hyper-threaded node > and using CR_CORE that would explain why you can only start 23 jobs rather > than 47. Let me know if this isn't the case for you. Ah, of course. I changed it to CR_CPU and cut down the node definition to just "NodeName=bigger9 CPUs=48 RealMemory=257552 Gres=gpu:1" and now I get 47. Which configuration would be the better one though? It would depend on the workloads of course but perhaps you have some insights? > For your last question, there is a parameter you have to set to enable the > job submit plugin that you may not have seen. > JobSubmitPlugins=lua Thanks, missed that in the documentation. Mvh. Torkil
> Thanks, missed that in the documentation. That picks up the LUA script but SLURM doesn't like it: " [2020-12-03T22:52:02.656] error: job_submit/lua: /etc/slurm/job_submit.lua: [string "slurm.user_msg (string.format(table.unpack({...})))"]:1: bad argument #2 to 'format' (no value) " I've also tried with the code you linked, verbatim, same error. After the change to CR_CPU I'm also getting this when I restart slurmctld: " [2020-12-03T22:56:19.425] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2 [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (24 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (25 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (26 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (27 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (28 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (29 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (30 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (31 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (32 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (33 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (34 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (35 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (36 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (37 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (38 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (39 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (40 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (41 >= 23) 
[2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (42 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (43 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (44 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (45 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (46 >= 23) [2020-12-03T22:56:19.434] error: _core_bitmap2str: bad core offset (47 >= 23) " It seems to work though. Lastly, I'm still seeing this: " [2020-12-03T22:56:34.683] error: slurm_auth_get_host: Lookup failed for 0.0.0.0 " I guess the fix for that didn't make it in 20.11? Mvh. Torkil
(In reply to Torkil Svensgaard from comment #32) > That picks up the LUA script but SLURM doesn't like it: > > " > [2020-12-03T22:52:02.656] error: job_submit/lua: /etc/slurm/job_submit.lua: > [string "slurm.user_msg (string.format(table.unpack({...})))"]:1: bad > argument #2 to 'format' (no value) > " > > I've also tried with the code you linked, verbatim, same error. I've spent several hours this morning on this and I believe the job_desc object, or whatever it is, is empty or broken, depending on how exactly the metatable works. I can set values on it, like job_desc.reservation, and read it back, but I am unable to read anything on the initial calling object. Setting the reservation also works just fine, it uses the reserved core. Mvh. Torkil
Hi Torkil, My apologies for the delayed response, I was out of the office on Friday. There are a few parts to your question, so I started by looking into the lua error you are getting. I didn't see a call to string.format in the example script you sent and mine didn't have it either, but it looks like this is called by slurm.log_user. I can cause this error to appear if I call slurm.log_user with a bad job descriptor attribute. I can't reproduce this error with the script I sent initially, but it could be that the tres_per_job attribute isn't set by default in your environment. It sounds like this might be what you found as well. Your last message says that you can set and read back attributes like reservations, but you can't read anything on the initial calling object. This sounds to me like you're trying to see the reservation value before you set it, is that right? If so, I can reproduce that behavior. It fails because there isn't a value for the reservation attribute of the job, so by trying to print the value it generates this error because there is no argument for it to format. If this isn't what is happening in your case let me know. Regarding the "bad core offset" message, I haven't been able to reproduce that same error on my test system by switching to/from CR_Core and CR_CPU and changing the node definitions to include socket, core and thread information or just CPUs. It's possible that running jobs might have an effect on this. Were there active jobs when you made this change? I'm also curious if you have Cores specified in your gres.conf to associate GPUs with certain cores. For the issue with the failed lookup, that fix did make it into 20.11. The fix there was to have it use the text version of the IP address if the reverse lookup fails. But the fact that it's showing an IP instead of the hostname indicates that something is causing the lookup to fail. We can manually simulate what's happening to see what is being decoded by munge. 
If you can ssh to one of the compute nodes you can use munge to encode an empty message and then send the output to the slurm controller, telling it to decode and get the information about that message. You would run this: munge -n | ssh <hostname of controller> unmunge Let me know what that shows. Thanks, Ben
(In reply to Ben Roberts from comment #34) > Hi Torkil, Hi Ben > My apologies for the delayed response, I was out of the office on Friday. Np. > There are a few parts to your question, so I started by looking into the lua > error you are getting. I didn't see a call to string.format in the example > script you sent and mine didn't have it either, but it looks like this is > called by slurm.log_user. I can cause this error to appear if I call > slurm.log_user with a bad job descriptor attribute. I can't reproduce this > error with the script I sent initially, but it could be that the > tres_per_job attribute isn't set by default in your environment. It sounds > like this might be what you found as well. Your last message says that you > can set and read back attributes like reservations, but you can't read > anything on the initial calling object. This sounds to me like you're > trying to see the reservation value before you set it, is that right? If > so, I can reproduce that behavior. It fails because there isn't a value for > the reservation attribute of the job, so by trying to print the value it > generates this error because there is no argument for it to format. If this > isn't what is happening in your case let me know. I agree that is the case for the reservation, but I was unable to read any value from the calling object. The job was the following sbatch job, so I should think "tres_string" or "partition" should be available? " #!/bin/bash #SBATCH --partition=HPC #SBATCH --gres=gpu:1 #SBATCH --cpus-per-task=1 #SBATCH --ntasks=1 nvidia-smi " > Regarding the "bad core offset" message, I haven't been able to reproduce > that same error on my test system by switching to/from CR_Core and CR_CPU > and changing the node definitions to include socket, core and thread > information or just CPUs. It's possible that running jobs might have an > effect on this. Were there active jobs when you made this change? 
> I'm also curious if you have Cores specified in your gres.conf to associate GPUs with certain cores.

It's possible, will test when I get in tomorrow. I haven't associated the GPU with a specific core yet.

> For the issue with the failed lookup, that fix did make it into 20.11. The fix there was to have it use the text version of the IP address if the reverse lookup fails. But the fact that it's showing an IP instead of the hostname indicates that something is causing the lookup to fail. We can manually simulate what's happening to see what is being decoded by munge. If you can ssh to one of the compute nodes you can use munge to encode an empty message and then send the output to the slurm controller, telling it to decode and get the information about that message. You would run this:
>
> munge -n | ssh <hostname of controller> unmunge
>
> Let me know what that shows.

Thanks, will test that tomorrow as well.

Mvh.
Torkil
(In reply to Torkil Svensgaard from comment #35)
> I agree that is the case for the reservation, but I was unable to read any value from the calling object. The job was the following sbatch job, so I should think "tres_string" or "partition" should be available?
>
> "
> #!/bin/bash
> #SBATCH --partition=HPC
> #SBATCH --gres=gpu:1
> #SBATCH --cpus-per-task=1
> #SBATCH --ntasks=1
>
> nvidia-smi
> "

That was actually not true, I can read job_desc.partition. So how do I check if gres=gpu is set? In your script you did "if (string.find(job_desc.tres_per_job, "gpu")) then" but tres_per_job doesn't seem to be set.

Also, is there a way to "pretty print" all variables in the job_desc object?

Mvh.
Torkil
(In reply to Torkil Svensgaard from comment #35)
> > munge -n | ssh <hostname of controller> unmunge
> >
> > Let me know what that shows.

torkil@bigger9:~/slurm$ munge -n | ssh slurm unmunge
Enter passphrase for key '/mrhome/torkil/.ssh/id_ed25519':
STATUS:          Success (0)
ENCODE_HOST:     ??? (172.21.15.102)
ENCODE_TIME:     2020-12-07 20:50:24 +0100 (1607370624)
DECODE_TIME:     2020-12-07 20:50:29 +0100 (1607370629)
TTL:             300
CIPHER:          aes128 (4)
MAC:             sha256 (5)
ZIP:             none (0)
UID:             torkil (1018)
GID:             torkil (1018)
LENGTH:          0
I'm afraid there isn't a way to have the submit filter print all the contents of a job_desc, but you can see the possible attributes of the job in _get_job_req_field or _set_job_req_field in src/plugins/job_submit/lua/job_submit_lua.c.

I looked at the way these job attributes are set and I think I see the reason for the disconnect. It comes down to which attribute gets set depending on how the GPU/gres is requested. If you submit with --gres=gpu:1 it sets tres_per_node. If you submit with --gpus=1 it sets tres_per_job. I realize this can be confusing and will probably require some extra logic to cover both ways of submitting (unless you can be sure all your users submit the same way). Let me know if making this change in your test does address the lua errors you're seeing.

For the munge error, the fact that you see "ENCODE_HOST: ??? (172.21.15.102)" does confirm that there is an issue with the host lookup. It should show the hostname rather than the series of question marks. You mentioned earlier that you hadn't configured reverse DNS for 172.21.15.0/24. If you get this configured and the munge command you ran shows the hostname of the machine, I would expect these errors in Slurm to go away.

Thanks,
Ben
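To cover both submission styles, the submit filter can test each attribute before calling string.find (an unset attribute is nil in lua, which is exactly what triggered the string.format error discussed earlier). A minimal sketch of such a site job_submit.lua; the wants_gpu helper is hypothetical, and the attribute names come from job_submit_lua.c as noted above:

```lua
-- Sketch for a site job_submit.lua: treat a job as a GPU job whether the
-- user submitted with --gres=gpu:N (tres_per_node) or --gpus=N (tres_per_job).
-- The nil checks avoid the "string.format" lua error when an attribute is unset.
local function wants_gpu(job_desc)
    local fields = {"tres_per_job", "tres_per_node", "tres_per_socket", "tres_per_task"}
    for _, field in ipairs(fields) do
        local val = job_desc[field]
        if val ~= nil and string.find(val, "gpu") then
            return true
        end
    end
    return false
end

function slurm_job_submit(job_desc, part_list, submit_uid)
    if wants_gpu(job_desc) then
        -- site-specific handling of GPU jobs goes here, e.g.
        -- job_desc.reservation = "gpu_resv"  -- example only
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

This is a plugin/config fragment, not a standalone program; it only runs inside slurmctld's lua job_submit plugin.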
(In reply to Ben Roberts from comment #38)
> I looked at the way these job attributes are set and I think I see the reason for the disconnect. It has to do with where attributes are set with different ways of submitting with a GPU/gres. If you submit with --gres=gpu:1 it sets tres_per_node. If you submit with --gpus=1 it will set tres_per_job. I realize this can be confusing and will probably require some extra logic to see how users submit (unless you can be sure they all submit the same way). Let me know if making this change in your test does address the lua errors you're seeing.

They're users, they'll do weird things =) You were right, job_desc.tres_per_node is populated. We can work with that.

> For the munge error, the fact that you see "ENCODE_HOST: ??? (172.21.15.102)" does confirm that there is an issue with the host lookup. It should show the hostname rather than the series of question marks. You mentioned earlier that you hadn't configured reverse DNS for 172.21.15.0/24. If you get this configured and the munge command you ran shows the hostname of the machine I would expect these errors in Slurm to go away.

Indeed it did. Also, "error: _core_bitmap2str: bad core offset (47 >= 23)" went away after older jobs were stopped, so all good.
(In reply to Torkil Svensgaard from comment #31)
> Ah, of course. I changed it to CR_CPU and cut down the node definition to just "NodeName=bigger9 CPUs=48 RealMemory=257552 Gres=gpu:1" and now I get 47.
>
> Which configuration would be the better one though? It would depend on the workloads of course but perhaps you have some insights?

Any comments on the above?

Also, regarding state, this looks odd to me when it's only 1 core out of 48 that's reserved:

"
# sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE  NODELIST
application up     infinite   1      mix    bigger10
HPC*        up     infinite   1      resv   bigger9
HPC*        up     infinite   1      idle   gojira
"

Not an error, just a little odd. "idle/resv" would be more accurate or even "mix/resv".

Mvh.
Torkil
Another thing: the way we do it is users log in through a thin client and start terminals by clicking a terminal shortcut, which runs:

"
srun --partition=application --cpus-per-task=1 --ntasks=1 --export=ALL --x11 "xfce4-terminal"
"

With the reservation/GPU configuration I thought we could do another srun from that terminal to get another terminal with gres=gpu:1 for interactive sessions, but that doesn't work as I can't seem to escape from the initial srun allocation. It would also waste a core, as the first srun terminal wouldn't be used for much. Can you suggest a good solution?

We could of course have the users decide on the need for a GPU before they start the first terminal, but users get easily confused and might hog GPUs by accident if they get multiple terminals to choose from. We would much prefer them having to type a command to get a GPU, akin to having to explicitly type it in an sbatch header.

Mvh.
Torkil
> Which configuration would be the better one though? It would depend on the workloads of course but perhaps you have some insights?

Sorry I glossed over this question the first time. You're right that it would depend on the types of jobs you typically have on your cluster. If you have a lot of single processor jobs and you want to make sure the nodes run as many of these jobs as possible, then it makes sense to use CR_CPU. If you have larger jobs and have requirements that a core can't be shared, then CR_Core would be the way to go. I can't say that one is better than the other; they just have different use cases. I will point out that you can set SelectTypeParameters on partitions as well, so if you want the cluster generally to work one way, but have a specific set of nodes work another, you can put those nodes in a separate partition and define a different parameter for them.

There isn't currently logic to have reserved nodes show multiple states if the reservation is only for part of the node. I'm not sure right now how involved it would be to add something like that, but it would most likely require a sponsor for the development work. If you're interested in sponsoring something like that let me know and we can look into it further.

It is possible to make the workflow you're describing work, but it would leave a core idle in the terminal they're no longer using, as you pointed out. To be able to run a new GPU job from within an existing 'non-GPU' job you would remove the references to the current job id so Slurm treats it as an unrelated allocation.
Here's an example of how to do this:

$ srun -n1 --pty /bin/bash
$ squeue
  JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST(REASON)
   1538     debug  bash   ben  R  0:02      1 node01
$ srun -n1 --gpus=1 --pty /bin/bash
srun: error: Unable to create step for job 1538: Invalid generic resource (gres) specification
$ unset SLURM_JOB_ID
$ unset SLURM_JOBID
$ srun -n1 --gpus=1 --pty /bin/bash
$ squeue
  JOBID PARTITION  NAME  USER ST  TIME  NODES NODELIST(REASON)
   1539     debug  bash   ben  R  0:02      1 node01
   1538     debug  bash   ben  R  0:30      1 node01
$ scontrol show job 1539 | grep TRES=
   TRES=cpu=1,node=1,billing=2,gres/gpu=1

This does mean the user needs to remember to exit out of two jobs as well, or it could leave CPUs tied up for even longer. So it's technically possible, but whether that's something you want to implement in your environment is another question. The alternative I see is that you make users exit the first job allocation they are in and re-submit a request with a GPU. You could create a wrapper script that submits the srun command for them and write it to accept an argument like 'gpu' so they explicitly have to request it. Hopefully this helps.

Thanks,
Ben
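A sketch of such a wrapper, under stated assumptions: the function name, the SRUN override (there so the script can be dry-run without a cluster), and the commented-out invocation are all mine; the partition and terminal command mirror the srun line quoted earlier in the ticket.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for the thin-client terminal shortcut. Users only get
# a GPU when they explicitly pass 'gpu' as the first argument, akin to an
# explicit #SBATCH --gres=gpu:1 line.
launch_terminal() {
    local srun_cmd=${SRUN:-srun}   # set SRUN=echo to dry-run without a cluster
    local args=(--partition=application --cpus-per-task=1 --ntasks=1 --export=ALL --x11)
    if [ "${1:-}" = "gpu" ]; then
        # GPUs are handed out only on explicit request.
        args+=(--gres=gpu:1)
    fi
    "$srun_cmd" "${args[@]}" xfce4-terminal
}

# Invoke when run as a script (commented out so the sketch is safe to source):
# launch_terminal "$@"
```

Called as e.g. `terminal-wrapper gpu` it adds --gres=gpu:1; called with no argument it behaves like the plain terminal shortcut.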
(In reply to Ben Roberts from comment #42) > > Which configuration would be the better one though? It would depend on the > > workloads of course but perhaps you have some insights? > > Sorry I glossed over this question the first time. You're right that it > would depend on the types of jobs you typically have on your cluster. If > you have a lot of single processor jobs and you want to make sure the nodes > run as many of these jobs as possible then it does make sense to use CR_CPU. > If you have larger jobs and have requirements that the core can't be shared > then CR_Core would be the way to go. I can't say that one is better than > the other, they just have different use cases. I will point out that you > can set SelectTypeParameters on partitions as well, so if you want the > cluster generally to work one way, but have just a specific set of nodes > work another you can put those nodes in a separate partition and define a > different parameter for them. Looking at this and cgroups now. It looks like ConstrainCores works exactly as advertised and there is no ConstrainCPU? So with CR_CPU we will get some overcommit? Mvh. Torkil
That's correct: ConstrainCores limits the job to the cores it has been allocated, but there isn't an option to constrain a job to a single CPU (hardware thread) rather than a core. With ConstrainCores it's possible for jobs to spill over onto the other CPU of their allocated core.

For your reference, you can see the CPUs available to a job from within that job by looking at the CgroupMountpoint (which defaults to /sys/fs/cgroup) and then navigating to cpuset/slurm/uid_<uid of user>/job_<jobid>/cpuset.cpus. An example would look like this:

/sys/fs/cgroup/cpuset/slurm_node01/uid_1000/job_10545/cpuset.cpus

Thanks,
Ben
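From inside a job the path can be assembled from the environment. A small sketch, assuming the default mountpoint and a cgroup v1 cpuset hierarchy (the CGROUP_MOUNTPOINT override is mine, and on multiple-slurmd setups the directory is slurm_<nodename> rather than slurm, as in the example above):

```shell
#!/usr/bin/env bash
# Build the path to the cpuset assigned to the current Slurm job.
job_cpuset_path() {
    local mountpoint=${CGROUP_MOUNTPOINT:-/sys/fs/cgroup}
    local jobid=${SLURM_JOB_ID:?not inside a Slurm job}
    echo "${mountpoint}/cpuset/slurm/uid_$(id -u)/job_${jobid}/cpuset.cpus"
}

# Usage inside a job step:
#   cat "$(job_cpuset_path)"
```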
Hi Ben

Thanks for the explanation. Still not quite done with the lookup thing it seems, this is from running scontrol reconfigure just now:

"
[2020-12-14T07:20:57.110] error: slurm_auth_get_host: Lookup failed for 0.0.0.0
[2020-12-14T07:20:57.163] sched: _slurm_rpc_allocate_resources JobId=1312 NodeList=bigger10 usec=53468
[2020-12-14T07:20:58.883] error: slurm_auth_get_host: Lookup failed for 0.0.0.0
[2020-12-14T07:20:58.883] sched: _slurm_rpc_allocate_resources JobId=1313 NodeList=bigger10 usec=517
[2020-12-14T07:22:17.091] Processing Reconfiguration Request
[2020-12-14T07:22:17.092] No memory enforcing mechanism configured.
[2020-12-14T07:22:17.097] restoring original state of nodes
[2020-12-14T07:22:17.097] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-14T07:22:17.098] read_slurm_conf: backup_controller not specified
[2020-12-14T07:22:17.098] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2020-12-14T07:22:17.098] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 2 partitions
[2020-12-14T07:22:17.098] No parameter for mcs plugin, default values set
[2020-12-14T07:22:17.098] mcs: MCSParameters = (null). ondemand set.
[2020-12-14T07:22:17.098] _slurm_rpc_reconfigure_controller: completed usec=7209
[2020-12-14T07:22:17.768] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-12-14T07:23:42.361] error: slurm_auth_get_host: Lookup failed for 0.0.0.0
[2020-12-14T07:23:42.362] sched: _slurm_rpc_allocate_resources JobId=1314 NodeList=bigger10 usec=650
[2020-12-14T07:23:59.352] error: get_name_info: getnameinfo() failed: Name or service not known
[2020-12-14T07:23:59.352] error: slurm_auth_get_host: Lookup failed for 127.0.0.2
[2020-12-14T07:23:59.352] _slurm_rpc_submit_batch_job: JobId=1315 InitPrio=4294901586 usec=434
[2020-12-14T07:23:59.997] sched: Allocate JobId=1315 NodeList=bigger9 #CPUs=1 Partition=HPC
[2020-12-14T07:24:00.711] _job_complete: JobId=1315 WEXITSTATUS 0
[2020-12-14T07:24:00.711] _job_complete: JobId=1315 done
"

Fails for 0.0.0.0 and 127.0.0.2, which aren't in our DNS.

Mvh.
Torkil
Hi Torkil,

It looks like there's something going on that's causing some of your nodes to encode messages with Munge with an address other than the primary IP of the node. Do you have any idea which nodes might be associated with these errors? At this point it seems like it's going to take a bit more digging and would probably be better in a separate ticket. Would you mind opening a new ticket with these details?

Thanks,
Ben
(In reply to Ben Roberts from comment #46)
> Hi Torkil,

Hi Ben

> It looks like there's something going on that's causing some of your nodes to encode messages with Munge with an address other than the primary IP of the node. Do you have any idea which nodes might be associated with these errors? At this point it seems like it's going to take a bit more digging and would probably be better in a separate ticket. Would you mind opening a new ticket with these details?

Ok, I created a new ticket for that.

Back to this one: the mechanism with GPU as gres and no GPU queue has some problems for us. I don't know if you are familiar with FSL, but it comes with a wrapper script (fsl_sub) which we've hacked to support SLURM. Using that fails miserably with the current setup, as the gres GPU doesn't seem to be inherited by the child SLURM jobs spawned by fsl_sub. So we have:

Terminal running as an srun task, no GPU
Submit sbatch with gres=gpu
Sbatch job calls fsl_sub, which in turn submits a new sbatch

Is there a way to have this/these last sbatch job(s) use the existing reservation made by the initial sbatch, or something like that? Hope that makes sense :S

Mvh.
Torkil
Hi Torkil,

Thanks for moving the other issue to its own ticket. For your workflow question, if the second and third steps you describe need a GPU then there isn't a way to have them run in the resources allocated to srun (in the first step you describe), since it wasn't allocated a GPU and a GPU can't be added to a running job. However, it sounds like changing fsl_sub to do an srun instead of an sbatch should allow the third job to use the resources allocated to the sbatch job (with the GPU). Using sbatch will always get you a unique job id and new allocation of resources, but if you're already in a job and you use srun then it will (by default) create a job step within the existing allocation.

Here's an example of how a job that submits another job will create a unique job id when using sbatch for the sub-job:

$ cat 9957.sh
#!/bin/bash
#SBATCH -N1
#SBATCH --exclusive
#SBATCH -p debug
date
sbatch /home/ben/slurm/test.job
sleep 30
date

$ sbatch 9957.sh
Submitted batch job 1584
$ squeue -s
       STEPID    NAME PARTITION  USER  TIME NODELIST
   1584.batch   batch     debug   ben  0:03 node01
  1584.extern  extern     debug   ben  0:03 node01
   1585.batch   batch       gpu   ben  0:02 node02
  1585.extern  extern       gpu   ben  0:02 node02

You can see that my submission of the 9957.sh job script creates job id 1584. That job then calls sbatch again and creates job 1585. If I change the submission to use srun instead of sbatch then I get the other script to run in the already allocated resources:

$ cat 9957.sh
#!/bin/bash
#SBATCH -N1
#SBATCH --exclusive
#SBATCH -p debug
date
srun /home/ben/slurm/test.job
sleep 30
date

$ sbatch 9957.sh
Submitted batch job 1586
$ squeue -s
       STEPID      NAME PARTITION  USER  TIME NODELIST
       1586.0  test.job     debug   ben  0:02 node01
   1586.batch     batch     debug   ben  0:02 node01
  1586.extern    extern     debug   ben  0:02 node01

It sounds like you've already modified fsl_sub some. Can you modify it further to use srun for cases like this?
I know this doesn't have it use the resources from the first job, but does this get closer to what you want to accomplish? Thanks, Ben
(In reply to Ben Roberts from comment #48) > Hi Torkil, Hi Ben > Thanks for moving the other issue to its own ticket. For your workflow > question, if the second and third steps you describe need a GPU then there > isn't a way to have them run in the resources allocated to srun (in the > first step you describe) since it wasn't allocated a GPU and it can't be > added to a running job. However, it sounds like changing fsl_sub to doing > an srun instead of an sbatch should allow the third job to use the resources > allocated to the sbatch job (with the GPU). Using sbatch will always get > you a unique job id and new allocation of resources, but if you're already > in a job and you use srun then it will (by default) create a job step within > the existing allocation. Ah, thanks for the explanation. Inheriting the resources from the first sbatch is just what I need, I don't need to inherit from the first srun. However, > It sounds like you've already modified fsl_sub some. Can you modify it > further to use srun for cases like this? Fsl_sub creates array jobs for some tasks, and those have to be sbatch? Modifying it to use slurm instead of sge wasn't so hard, since all the logic was pretty much the same, but going sbatch->srun for array jobs would probably be a different beast. Mvh. Torkil
(In reply to Torkil Svensgaard from comment #49)
> Modifying it to use slurm instead of sge wasn't so hard, since all the logic was pretty much the same, but going sbatch->srun for array jobs would probably be a different beast.

The array jobs are done by outputting a bunch of commands to a text file and then creating an array job via a for loop with i = number of commands in the text file. This might not be hard to change to sruns instead. It looks like srun also has the notion of dependencies, so perhaps not that difficult after all. I'll look at it tomorrow. Thanks!
You're right, neither srun nor salloc has the option to submit a job array like sbatch does. I think there are a couple of approaches you could take for job arrays. You could have it still use sbatch in this case, or you could have it submit a series of job steps with srun in the for loop you mention. One thing to keep in mind is that if the existing job allocation doesn't have enough resources for all these steps to run at once, the job will take longer to run than if you had submitted a job array with sbatch.

Thanks,
Ben
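A sketch of the srun-steps approach inside an existing allocation. Everything here is an assumption rather than fsl_sub's actual code: the function name, the commands-file layout (one command per line), the one-task-per-step sizing, and the SRUN override (there so the loop can be dry-run tested):

```shell
#!/usr/bin/env bash
# Inside a batch job: run each line of a command file as its own job step,
# instead of submitting a separate sbatch array. Steps launched with srun
# inherit the resources (including the GPU) of the surrounding allocation.
run_steps_from_file() {
    local cmdfile=$1
    local srun_cmd=${SRUN:-srun}   # set SRUN="echo srun" to dry-run
    local cmd
    while IFS= read -r cmd; do
        [ -z "$cmd" ] && continue               # skip blank lines
        # Launch one single-task step per command; left unquoted on purpose
        # so a dry-run override like "echo srun" word-splits correctly.
        $srun_cmd -n1 -N1 bash -c "$cmd" &
    done < "$cmdfile"
    wait   # block until every step finishes, like waiting on the array
}
```

If the allocation can't fit all steps at once, the extra sruns queue within the job, which is where the longer-runtime caveat above comes from.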
Hi Ben

I just upgraded the following packages on the nodes:

"
(1/4): slurm-perlapi-20.11.2-1.el8.x86_64.rpm
(2/4): slurm-slurmd-20.11.2-1.el8.x86_64.rpm
(3/4): slurm-20.11.2-1.el8.x86_64.rpm
(4/4): microcode_ctl-20200609-2.20201027.1.el8_3.x86_64.rpm
"

After reboot some nodes refuse to start/resume due to:

"
[2020-12-22T10:26:18.530] error: _slurm_rpc_node_registration node=smaug: Invalid argument
[2020-12-22T10:26:19.532] error: _slurm_rpc_node_registration node=smaug: Invalid argument
[2020-12-22T10:26:20.536] error: _slurm_rpc_node_registration node=smaug: Invalid argument
[2020-12-22T10:26:24.737] update_node: node smaug state set to IDLE
[2020-12-22T10:26:26.561] error: Setting node smaug state to DRAIN with reason:Low RealMemory
"

Smaug is configured like so, with the RealMemory taken by running "slurmd -C" on the host:

"
NodeName=smaug CPUs=128 RealMemory=515572 MemSpecLimit=1024
"

After the upgrade "slurmd -C" on smaug reports this:

"
NodeName=smaug CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=515561 UpTime=0-00:06:11
"

Is there a way to avoid this?

Mvh.
Torkil
Also, slurmd fails to start after reboot. I can't really see any hints in the logs.

"
# journalctl -xef -uslurmd
-- Logs begin at Tue 2020-12-22 10:23:49 CET. --
Dec 22 10:23:53 bigger11.drcmr systemd[1]: Started Slurm node daemon.
-- Subject: Unit slurmd.service has finished start-up
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- Unit slurmd.service has finished starting up.
--
-- The start-up result is done.
Dec 22 10:23:53 bigger11.drcmr systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Dec 22 10:23:53 bigger11.drcmr systemd[1]: slurmd.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit slurmd.service has entered the 'failed' state with result 'exit-code'.
^C
"

Suggestions on how to debug that?

Mvh.
Torkil
(In reply to Torkil Svensgaard from comment #53)
> Also, slurmd fails to start after reboot. I can't really see any hints in the logs.
>
> "
> # journalctl -xef -uslurmd
> -- Logs begin at Tue 2020-12-22 10:23:49 CET. --
> Dec 22 10:23:53 bigger11.drcmr systemd[1]: Started Slurm node daemon.
> -- Subject: Unit slurmd.service has finished start-up
> -- Defined-By: systemd
> -- Support: https://access.redhat.com/support
> --
> -- Unit slurmd.service has finished starting up.
> --
> -- The start-up result is done.
> Dec 22 10:23:53 bigger11.drcmr systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
> Dec 22 10:23:53 bigger11.drcmr systemd[1]: slurmd.service: Failed with result 'exit-code'.
> -- Subject: Unit failed
> -- Defined-By: systemd
> -- Support: https://access.redhat.com/support
> --
> -- The unit slurmd.service has entered the 'failed' state with result 'exit-code'.
> ^C
> "
>
> Suggestions on how to debug that?

Hi, sorry to jump in. I've just realized that bug 10455 comes from here and I am interested in seeing this information too.

Torkil, as root run 'slurmd -Dvvv' and we'll see why it fails. Paste the entire output here please.

--------

As for your RealMemory question, the real memory on a Linux system can be slightly different any time you boot or upgrade the kernel. Check it with 'free -m' and you will see that it doesn't exactly correspond to your real physical memory.

RealMemory=515572
RealMemory=515561

That is an 11 MiB difference, but enough to cause the error you see. The kernel must have reserved some more RAM for its own purposes. I recommend not setting RealMemory to the exact memory the node shows at a certain point, but using round numbers instead, e.g. if your node has 515572 MiB = 503 GiB, just set 500 GiB => RealMemory=512000.
(In reply to Felip Moll from comment #54)
> Hi, sorry to jump in.
>
> I've just realized that bug 10455 comes from here and I am interested in seeing this information too.

By all means. Should/could I have tagged in 10455 in some way to make it clear there was some history?

> Torkil, as root run 'slurmd -Dvvv' and we'll see why it fails. Paste the entire output here please.

"
Last login: Tue Dec 22 13:14:04 2020 from 172.21.140.12
# gojira/root ~ # systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2020-12-22 13:16:18 CET; 41s ago
  Process: 2333 ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 2333 (code=exited, status=1/FAILURE)

Dec 22 13:16:18 gojira.drcmr systemd[1]: Started Slurm node daemon.
Dec 22 13:16:18 gojira.drcmr systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Dec 22 13:16:18 gojira.drcmr systemd[1]: slurmd.service: Failed with result 'exit-code'.
# gojira/root ~ # slurmd -Dvvv
slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug2: hwloc_topology_export_xml
slurmd: debug: CPUs:96 Boards:1 Sockets:2 CoresPerSocket:24 ThreadsPerCore:2
slurmd: error: Node configuration differs from hardware: CPUs=96:96(hw) Boards=1:1(hw) SocketsPerBoard=96:2(hw) CoresPerSocket=1:24(hw) ThreadsPerCore=1:2(hw)
slurmd: debug: Reading cgroup.conf file /var/spool/slurm/d/conf-cache/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurm/d/hwloc_topo_whole.xml) found
slurmd: debug: CPUs:96 Boards:1 Sockets:2 CoresPerSocket:24 ThreadsPerCore:2
slurmd: debug: skipping GRES for NodeName=bigger9 AutoDetect=nvml
slurmd: debug: gres/gpu: init: loaded
slurmd: debug: gpu/generic: init: init: GPU Generic plugin loaded
slurmd: topology/none: init: topology NONE plugin loaded
slurmd: route/default: init: route default plugin loaded
slurmd: debug2: Gathering cpu frequency information for 96 cpus
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Reading cgroup.conf file /var/spool/slurm/d/conf-cache/cgroup.conf
slurmd: debug: system cgroup: memory: total:515581M allowed:100%(enforced), swap:0%(permissive), max:100%(515581M) max+swap:100%(1031162M) min:30M kmem:100%(515581M permissive) min:30M
slurmd: debug: system cgroup: system memory cgroup initialized
slurmd: Resource spec: system cgroup memory limit set to 1024 MB
slurmd: debug: task/cgroup: init: core enforcement enabled
slurmd: debug: task/cgroup: task_cgroup_memory_init: task/cgroup/memory: total:515581M allowed:100%(enforced), swap:0%(permissive), max:100%(515581M) max+swap:100%(1031162M) min:30M kmem:100%(515581M permissive) min:30M swappiness:0(unset)
slurmd: debug: task/cgroup: init: memory enforcement enabled
slurmd: debug: task/cgroup: task_cgroup_devices_init: unable to open /var/spool/slurm/d/conf-cache/cgroup_allowed_devices_file.conf: No such file or directory
slurmd: debug: task/cgroup: init: device enforcement enabled
slurmd: debug: task/cgroup: init: task/cgroup: loaded
slurmd: debug: auth/munge: init: Munge authentication plugin loaded
slurmd: debug: spank: opening plugin stack /var/spool/slurm/d/conf-cache/plugstack.conf
slurmd: cred/munge: init: Munge credential signature plugin loaded
slurmd: slurmd version 20.11.2 started
slurmd: debug: jobacct_gather/linux: init: Job accounting gather LINUX plugin loaded
slurmd: debug: job_container/none: init: job_container none plugin loaded
slurmd: debug: switch/none: init: switch NONE plugin loaded
slurmd: slurmd started on Tue, 22 Dec 2020 13:17:12 +0100
slurmd: CPUs=96 Boards=1 Sockets=96 Cores=1 Threads=1 Memory=515581 TmpDisk=226773 Uptime=62 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: acct_gather_energy/none: init: AcctGatherEnergy NONE plugin loaded
slurmd: debug: acct_gather_Profile/none: init: AcctGatherProfile NONE plugin loaded
slurmd: debug: acct_gather_interconnect/none: init: AcctGatherInterconnect NONE plugin loaded
slurmd: debug: acct_gather_filesystem/none: init: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/var/spool/slurm/d/conf-cache/acct_gather.conf)
slurmd: debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.
"

> As for your RealMemory question, the real memory on a Linux can be slightly different any time you boot or upgrade kernel. Check it with 'free -m' and you will see how it doesn't exactly correspond to your real physical memory.
>
> RealMemory=515572
> RealMemory=515561
>
> These are 11 MiB difference, but enough to cause the error you see. The kernel must have reserved some more ram for its own purposes. I recommend you to not set the RealMemory to the exact memory the node shows at a certain point, but just round numbers, eg. if your node has 515572 MB = 503 GiB, just set 500 GiB => RealMemory=512000.

Thanks, just what I was looking for.

Mvh.
Torkil
(In reply to Torkil Svensgaard from comment #55)
> By all means. Should/could I have tagged in 10455 in some way to make it clear there was some history?

When writing the description of bug 10455, just saying "coming from bug 10455" is enough for us. There's also the See Also field you can use if you want. I could have realized that before, but we receive a bunch of bugs and comments daily and I missed this one. Sorry for that.

> ...
> slurmd: debug: _handle_node_reg_resp: slurmctld sent back 8 TRES.

So slurmd works and starts just fine from the command line. The issue must be in how systemd starts it. What does the slurmd log show if you set 'SlurmdDebug=debug2' in slurm.conf after starting it with systemd? What does 'journalctl -xn 300' show immediately after starting with 'systemctl start slurmd'? Can we also see 'systemctl cat slurmd'?

--

The only error I see is:

slurmd: error: Node configuration differs from hardware: CPUs=96:96(hw) Boards=1:1(hw) SocketsPerBoard=96:2(hw) CoresPerSocket=1:24(hw) ThreadsPerCore=1:2(hw)

This error won't prevent Slurm from running, but affinity may not be ideal. Is it possible for you to specify the architecture in slurm.conf to avoid this error?

NodeName=gojira CPUs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=512000 MemSpecLimit=1024
(In reply to Felip Moll from comment #56)
> So slurmd works and is started just well from command line. The issue must be in how systemd starts it. What does the slurmd log show if you set slurm.conf 'SlurmdDebug=debug2' after starting it with systemd? What does 'journalctl -xn 300' show immediately after starting with 'systemctl start slurmd'?

I think it's more a problem of when systemd starts it, as it starts just fine from systemd manually. This is from journald right after rebooting:

"
# journalctl | grep slurm
Dec 23 07:37:39 gojira.drcmr slurmd[2345]: error: resolve_ctls_from_dns_srv: res_nsearch error: Host name lookup failure
Dec 23 07:37:39 gojira.drcmr slurmd[2345]: error: fetch_config: DNS SRV lookup failed
Dec 23 07:37:39 gojira.drcmr systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Dec 23 07:37:39 gojira.drcmr slurmd[2345]: error: _establish_configuration: failed to load configs
Dec 23 07:37:39 gojira.drcmr systemd[1]: slurmd.service: Failed with result 'exit-code'.
Dec 23 07:37:39 gojira.drcmr slurmd[2345]: error: slurmd initialization failed
"

> Can we also see 'systemctl cat slurmd' ?

"
# systemctl cat slurmd
# /usr/lib/systemd/system/slurmd.service
[Unit]
Description=Slurm node daemon
After=munge.service network.target remote-fs.target
#ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmd
ExecStart=/usr/sbin/slurmd -D $SLURMD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Delegate=yes

[Install]
WantedBy=multi-user.target
"

It should wait for the network, so I'm unsure why the DNS lookup fails. Retrying might also fix it.

> The only error I see is:
>
> slurmd: error: Node configuration differs from hardware: CPUs=96:96(hw) Boards=1:1(hw) SocketsPerBoard=96:2(hw) CoresPerSocket=1:24(hw) ThreadsPerCore=1:2(hw)
>
> This error won't impede slurm to run, but affinity may not be ideal. Is it possible you specify the architecture in slurm.conf to avoid this error?
>
> NodeName=gojira CPUs=96 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=512000 MemSpecLimit=1024

Of course, my bad. I thought I had to do either CPUs OR Boards*Sockets*CoresPerSocket*ThreadsPerCore to get my total to be what I wanted.

Thanks,
Torkil
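For reference, the usual systemd remedy for this kind of boot-time race is an ordering drop-in. This is an assumption on my part, not something prescribed in this ticket: network.target only guarantees the network stack is configured, not that name resolution works, which would explain the DNS SRV lookup failing at boot but succeeding when slurmd is started manually later.

```ini
# /etc/systemd/system/slurmd.service.d/wait-online.conf
# Hypothetical drop-in: delay slurmd until the network is actually online,
# so the configless-mode DNS SRV lookup can succeed at boot.
[Unit]
After=network-online.target
Wants=network-online.target
```

This only helps if a wait-online service (e.g. NetworkManager-wait-online.service) is enabled on the node; run 'systemctl daemon-reload' after adding the drop-in.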
Hi Torkil, I was out for the holidays last week, so my apologies that I wasn't responsive. I'm glad that Felip was able to jump in and help. It looks like you were able to get to the bottom of the host lookup issue by changing the start order with systemd in bug 10455. It sounds like this issue was resolved as well when you updated the node definition to match what is reported by 'slurmd -C', is that right? Let me know if you still need help with this issue. Thanks, Ben
(In reply to Ben Roberts from comment #58)
> Hi Torkil,

Hi Ben

> I was out for the holidays last week, so my apologies that I wasn't responsive. I'm glad that Felip was able to jump in and help. It looks like you were able to get to the bottom of the host lookup issue by changing the start order with systemd in bug 10455. It sounds like this issue was resolved as well when you updated the node definition to match what is reported by 'slurmd -C', is that right? Let me know if you still need help with this issue.

No problem, I hope you had a Merry Christmas =)

Yes, I'm good for now. Just ordered some more GPUs and I'll probably have some questions regarding their use, but no point in moving forward until they are added to the nodes.

Happy New Year,
Torkil
I'm glad to hear things are looking good. Since that's the case I'll go ahead and close this ticket. If anything comes up with your new GPUs don't hesitate to open a new ticket and we'll be glad to look into it with you. I hope you have a Happy New Year too. Thanks, Ben