Ticket 12802 - Question regarding using the Weka storage system with Slurm
Summary: Question regarding using the Weka storage system with Slurm
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: Other
Version: 20.11.8
Hardware: Linux
Severity: 4 - Minor Issue
Assignee: Marshall Garey
Reported: 2021-11-02 09:32 MDT by Hjalti Sveinsson
Modified: 2021-12-03 08:10 MST

Site: deCODE


Description Hjalti Sveinsson 2021-11-02 09:32:52 MDT
Hello,

We are in the process of integrating a new storage system called WekaIO
Comment 1 Hjalti Sveinsson 2021-11-02 09:36:23 MDT
I accidentally pressed enter... 

so here it goes:

The Weka storage system uses a client running on the compute nodes in our cluster. We mount over UDP and tell Weka how many cores to use, like this:

mount -t wekafs -o net=udp -o num_cores=2 weka-1/wekafs2 /mnt/wekafs

How can we be sure that Slurm will not use the cores that have been assigned to the wekafs client?

We found this:

https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html

which says we can use the "isolcpus" kernel parameter. Will that work with Slurm?
Comment 6 Marshall Garey 2021-11-02 11:20:00 MDT
CpuSpecList is what you want.

I talked with my colleague Michael about his comment in bug 12674 that CpuSpecList doesn’t work with task/affinity. We found that he was mistaken - task/affinity works just fine with CpuSpecList. The only requirement is that you have task/cgroup, but you can also have task/affinity, like this:

TaskPlugin=task/cgroup,task/affinity


From the slurm.conf man page[1]:

“A comma-delimited list of Slurm abstract CPU IDs reserved for system use. The list will be expanded to include all other CPUs, if any, on the same cores. These cores will not be available for allocation to user jobs. Depending upon the TaskPluginParam option of SlurmdOffSpec[2], Slurm daemons (i.e. slurmd and slurmstepd) may either be confined to these resources (the default) or prevented from using these resources.”
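
For example, a minimal slurm.conf sketch for your case could look like the following (the node name, topology, and the choice of CPU IDs 0-1 are placeholders; adjust them to match your hardware and the cores your Weka client actually uses):

# slurm.conf (sketch; node name, topology, and reserved CPU IDs are placeholders)
TaskPlugin=task/cgroup,task/affinity
NodeName=compute[1-10] Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 CpuSpecList=0-1

With that in place, Slurm will not allocate CPUs 0-1 to user jobs, leaving them free for the wekafs client.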

There are some important caveats, though.

(1) The CPUs specified by CpuSpecList (or CoreSpecCount) are still shown as available in “scontrol show node”:

# slurm.conf
NodeName=DEFAULT RealMemory=8000 Sockets=1 CoresPerSocket=8 ThreadsPerCore=2
NodeName=n1-[1-10] NodeAddr=localhost Port=36001-36010 CpuSpecList=0-1

$ scontrol show nodes n1-1 |egrep -i 'nodename|cpu'
NodeName=n1-1 Arch=x86_64 CoresPerSocket=8  
  CPUAlloc=0 CPUTot=16 CPULoad=1.28
  CoreSpecCount=1 CPUSpecList=0-1  
  CfgTRES=cpu=16,mem=8000M,billing=16,gres/gpu=4

This can be confusing to users who think they can be allocated all the CPUs on a node, but they actually can’t:

$ srun -N1 -n16 whereami
srun: error: Unable to allocate resources: Requested node configuration is not available
$ srun -N2 -n16 whereami  
0000 n1-1 - Cpus_allowed:       0002    Cpus_allowed_list:      1
0001 n1-1 - Cpus_allowed:       0200    Cpus_allowed_list:      9
0006 n1-1 - Cpus_allowed:       0010    Cpus_allowed_list:      4
0007 n1-1 - Cpus_allowed:       1000    Cpus_allowed_list:      12
0003 n1-1 - Cpus_allowed:       0400    Cpus_allowed_list:      10
0004 n1-1 - Cpus_allowed:       0008    Cpus_allowed_list:      3
0005 n1-1 - Cpus_allowed:       0800    Cpus_allowed_list:      11
0002 n1-1 - Cpus_allowed:       0004    Cpus_allowed_list:      2
0014 n1-2 - Cpus_allowed:       0010    Cpus_allowed_list:      4
0012 n1-2 - Cpus_allowed:       0008    Cpus_allowed_list:      3
0010 n1-2 - Cpus_allowed:       0004    Cpus_allowed_list:      2
0011 n1-2 - Cpus_allowed:       0400    Cpus_allowed_list:      10
0008 n1-2 - Cpus_allowed:       0002    Cpus_allowed_list:      1
0009 n1-2 - Cpus_allowed:       0200    Cpus_allowed_list:      9
0013 n1-2 - Cpus_allowed:       0800    Cpus_allowed_list:      11
0015 n1-2 - Cpus_allowed:       1000    Cpus_allowed_list:      12


We would like to improve this by not showing the CPUs in CpuSpecList/CoreSpecCount in the TRES list, but this is more involved than simply changing the output of scontrol show node. Since CpuSpecList still works as designed, we consider this an enhancement. Unless this development is sponsored, we won't be able to commit to it. Just make sure your users are aware of this.


(2) There are two bugs with CpuSpecList in 21.08:

(i) On some systems, jobs won't run with CpuSpecList in 21.08. See bug 12393. I uploaded a fix in bug 12393 comment 54. However, since this patch changes the plugin ABI, we can't put it into 21.08, and we haven't found a fix for 21.08. If you upgrade to 21.08, you should first test this (before upgrading the production cluster) and, if you see this bug, apply the patch in bug 12393 comment 54.

(ii) In 21.08, CpuSpecList does not confine slurmd to those CPUs. SlurmdOffSpec is also broken. We are tracking this in an internal bug (12477). However, slurmd is lightweight, so you may not really notice a performance difference. 


In 21.08, with CpuSpecList configured, Slurm will still not allocate those CPUs to jobs. So that basic behavior still works.

Does this answer your question? Do you have any follow-up questions?



[1] https://slurm.schedmd.com/slurm.conf.html#OPT_CpuSpecList
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdOffSpec
Comment 7 Marshall Garey 2021-11-02 11:40:59 MDT
I have a typo - bug 12674 was supposed to be bug 10674. But I also now see that bug is private, so I guess it's irrelevant to you anyway.
Comment 8 Hjalti Sveinsson 2021-11-04 07:09:22 MDT
Hi, thank you for your answer. 

But do you know if the "isolcpus" kernel parameter can be used as well? 

We need to have the Weka client use the CPU IDs that Slurm has in CpuSpecList. I am unsure how to do this, but that is of course unrelated to Slurm :) I will try to get answers from Weka support about that.

I see that you mention 21.08 a couple of times in your answer, but we have not started using 21.08; we are using 20.11.8.

Do I only need to configure CpuSpecList and not SlurmdOffSpec?

If we have two sockets, how do we know what the CPU IDs are if there are e.g. 2x24-core CPUs in our system?
Comment 9 Marshall Garey 2021-11-04 10:05:45 MDT
(In reply to Hjalti Sveinsson from comment #8)
> Hi, thank you for your answer. 
> 
> But do you know if the "isolcpus" kernel parameter can be used as well? 

I have never heard of isolcpus. For Slurm, I recommend using the built-in tools (CpuSpecList and CoreSpecCount).
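
For example (a sketch; the node name is a placeholder, and these are alternatives, not used together): CoreSpecCount reserves a count of cores per node without naming them, while CpuSpecList names the reserved abstract CPU IDs explicitly:

NodeName=<name> ... CoreSpecCount=2   # reserve 2 cores per node
NodeName=<name> ... CpuSpecList=0-1   # reserve these specific abstract CPU IDs

Since you need to know exactly which CPU IDs to hand to the Weka client, CpuSpecList is probably the better fit for your case.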


> We need to have the Weka client use the CPU IDs that Slurm has in
> CpuSpecList. I am unsure how to do this, but that is of course unrelated to
> Slurm :) I will try to get answers from Weka support about that.

I have no idea. I agree with getting support from Weka, or see if you can find an answer with a Google search.

> I see that you mention 21.08 a couple of times in your answer, but we have
> not started using 21.08; we are using 20.11.8.

I know, I just wanted you to know about some issues that we found in 21.08 in case you decide to upgrade at some point.

> Do I only need to configure CpuSpecList and not SlurmdOffSpec?

CpuSpecList is the important part. SlurmdOffSpec is optional, and it depends on how you want it to work - do you want slurmd/slurmstepd running on the CPUs that are allocated to Slurm jobs, or on the CPUs that are running Weka? Set SlurmdOffSpec accordingly.
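
For example (a sketch): by default, slurmd and slurmstepd are confined to the CpuSpecList CPUs, i.e. they would share those cores with the Weka client. If you would rather keep the daemons off those cores and on the CPUs available to jobs, set:

# slurm.conf
TaskPluginParam=SlurmdOffSpec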


> If we have two sockets, how do we know what the CPU IDs are if there are
> e.g. 2x24-core CPUs in our system?

lstopo is a good tool.
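
For example (assuming the util-linux and hwloc packages are installed), these commands show how the logical CPU IDs map onto sockets and cores; just keep in mind that CpuSpecList uses Slurm's abstract CPU IDs, which do not necessarily match the OS numbering:

$ lscpu --extended=CPU,SOCKET,CORE   # one line per logical CPU with its socket and core
$ lstopo --no-io                     # hwloc's view of the machine topology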

But personally, I just use experimentation. I set CpuSpecList, start Slurm, then run a job using all the remaining CPUs on a node and see which ones are missing. For example, on my box with 8 cores and 2 threads per core:

Without CpuSpecList:

$ srun -n16 --cpu-bind=verbose sleep 1 |sort
cpu-bind=MASK - n1-1, task  0  0 [5954]: mask 0x1 set
cpu-bind=MASK - n1-1, task  1  1 [5955]: mask 0x100 set
cpu-bind=MASK - n1-1, task  2  2 [5956]: mask 0x2 set
cpu-bind=MASK - n1-1, task  3  3 [5957]: mask 0x200 set
cpu-bind=MASK - n1-1, task  4  4 [5958]: mask 0x4 set
cpu-bind=MASK - n1-1, task  5  5 [5959]: mask 0x400 set
cpu-bind=MASK - n1-1, task  6  6 [5960]: mask 0x8 set
cpu-bind=MASK - n1-1, task  7  7 [5961]: mask 0x800 set
cpu-bind=MASK - n1-1, task  8  8 [5962]: mask 0x10 set
cpu-bind=MASK - n1-1, task  9  9 [5963]: mask 0x1000 set
cpu-bind=MASK - n1-1, task 10 10 [5964]: mask 0x20 set
cpu-bind=MASK - n1-1, task 11 11 [5965]: mask 0x2000 set
cpu-bind=MASK - n1-1, task 12 12 [5966]: mask 0x40 set
cpu-bind=MASK - n1-1, task 13 13 [5967]: mask 0x4000 set
cpu-bind=MASK - n1-1, task 14 14 [5968]: mask 0x80 set
cpu-bind=MASK - n1-1, task 15 15 [5969]: mask 0x8000 set

With CpuSpecList:

# slurm.conf
NodeName=<name> ... CpuSpecList=0-1


$ srun -n14 --cpu-bind=verbose sleep 1
cpu-bind=MASK - n1-1, task  0  0 [5317]: mask 0x2 set
cpu-bind=MASK - n1-1, task  1  1 [5318]: mask 0x200 set
cpu-bind=MASK - n1-1, task  2  2 [5319]: mask 0x4 set
cpu-bind=MASK - n1-1, task  3  3 [5320]: mask 0x400 set
cpu-bind=MASK - n1-1, task  4  4 [5321]: mask 0x8 set
cpu-bind=MASK - n1-1, task  5  5 [5322]: mask 0x800 set
cpu-bind=MASK - n1-1, task  6  6 [5323]: mask 0x10 set
cpu-bind=MASK - n1-1, task  7  7 [5324]: mask 0x1000 set
cpu-bind=MASK - n1-1, task  8  8 [5325]: mask 0x20 set
cpu-bind=MASK - n1-1, task  9  9 [5326]: mask 0x2000 set
cpu-bind=MASK - n1-1, task 10 10 [5327]: mask 0x40 set
cpu-bind=MASK - n1-1, task 11 11 [5328]: mask 0x4000 set
cpu-bind=MASK - n1-1, task 12 12 [5329]: mask 0x80 set
cpu-bind=MASK - n1-1, task 13 13 [5330]: mask 0x8000 set
Comment 10 Marshall Garey 2021-11-08 14:26:52 MST
Is there anything else I can help with for this ticket?
Comment 11 Hjalti Sveinsson 2021-11-09 02:58:57 MST
Hi, you can close this issue. I think I have my answer; I already tested this and it works as I expected. 

Thank you!
Comment 12 Marshall Garey 2021-11-09 08:00:28 MST
Sounds good. Closing as infogiven.