| Summary: | Hyperthreading not working | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Meyer <dameyer> |
| Component: | Scheduling | Assignee: | Felip Moll <felip.moll> |
| Status: | RESOLVED DUPLICATE | QA Contact: | |
| Severity: | 4 - Minor Issue | | |
| Priority: | --- | CC: | felip.moll |
| Version: | 18.08.5 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | Raytheon Missile, Space and Airborne | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | RHEL | Machine Name: | slurm02 |
| CLE Version: | --- | Version Fixed: | 7.6 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurm.conf, sample script, test_results | | |
Description
Doug Meyer
2019-05-16 07:02:14 MDT
Hi Doug,
If you can attach your latest slurm.conf, I will check your configuration. I suspect you have the CR_ONE_TASK_PER_CORE setting, and I am interested in how exactly you defined the nodes.
> CR_ONE_TASK_PER_CORE
> Allocate one task per core by default. Without this option, by default one task will be allocated per thread on nodes with more than one ThreadsPerCore configured.
> NOTE: This option cannot be used with CR_CPU*.
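For reference, a minimal sketch of how that parameter would appear in slurm.conf. This is a hypothetical fragment, not Doug's actual configuration; per the man-page note above, CR_ONE_TASK_PER_CORE must be combined with a CR_Core* value, not CR_CPU*:

```
# Hypothetical slurm.conf fragment (not the reporter's config).
# Without CR_ONE_TASK_PER_CORE, cons_res will allocate one task
# per *thread* on nodes with ThreadsPerCore > 1.
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
```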
Created attachment 10247 [details]
slurm.conf
hpc3 is the new config. It is not allocating logical threads to single-thread jobs. Thank you for the fast response.
Hi Doug,

Can you be more specific on how you are testing it? I get all 56 slots with a similar configuration:

```
[slurm@moll0 18.08]$ srun bash -c 'slurmd -C'
NodeName=moll1 CPUs=56 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=984 UpTime=0-00:26:42

[slurm@moll0 18.08]$ srun --mem 10 -n 56 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n
0
1
2
3
4
5
...
55

[slurm@moll0 18.08]$ scontrol show config | grep "TaskPlugin\|Select"
SelectType              = select/cons_res
SelectTypeParameters    = CR_CPU_MEMORY
TaskPlugin              = task/affinity
TaskPluginParam         = (null type)
```

Can you try the same 'srun' test as me? Also, I recommend not setting Boards and instead defining:

```
NodeName=hpc[1089-1092] CPUs=56 Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000
```

Thanks

Created attachment 10276 [details]
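As an aside, the output of a binding test like the one above can be checked mechanically rather than by eye. Below is a small sketch of such a check; the helper name and the sample lines are invented for illustration, it simply parses the `pid N's current affinity list: ...` lines that `taskset -cp` prints and reports the distinct logical CPUs used:

```python
# Hypothetical helper: given taskset -cp output lines collected from
# each task of a job, return the sorted set of logical CPU ids used.
def cpus_used(taskset_lines):
    cpus = set()
    for line in taskset_lines:
        # Lines look like: "pid 1234's current affinity list: 0,28"
        # or use ranges: "pid 1234's current affinity list: 0-55"
        cpu_list = line.rsplit(":", 1)[1]
        for part in cpu_list.split(","):
            part = part.strip()
            if "-" in part:
                lo, hi = map(int, part.split("-"))
                cpus.update(range(lo, hi + 1))
            else:
                cpus.add(int(part))
    return sorted(cpus)

# Invented sample: two tasks, each bound to a core plus its HT sibling.
sample = [
    "pid 100's current affinity list: 0,28",
    "pid 101's current affinity list: 1,29",
]
print(cpus_used(sample))  # → [0, 1, 28, 29]
```

On a healthy 28-core/56-thread node with `-n 56`, a check like this should report all ids 0-55; only 0-27 appearing would match the symptom Doug describes.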
sample script
Changed the node description. No change.

Results of `scontrol show config | grep "TaskPlugin\|Select"`:

```
SelectType              = select/cons_res
SelectTypeParameters    = CR_CPU_MEMORY
TaskPlugin              = task/affinity
TaskPluginParam         = (null type)
```

The shared command failed for a missing variable declaration. I believe you wanted to see all the HT threads, so I ran `srun "mpstat -P ALL 1"` instead, which shows threads 0-55.

Sample submit script attached. When launched via sbatch against a 28-core/56-thread node, array tasks are only assigned to the physical cores. HT threads remain unused.

(In reply to Doug Meyer from comment #5)
> Changed the node description. No change.
>
> Results of
> scontrol show config|grep "TaskPlugin\|Select"
>
> SelectType = select/cons_res
> SelectTypeParameters = CR_CPU_MEMORY
> TaskPlugin = task/affinity
> TaskPluginParam = (null type)
>
> Command shared failed for missing variable declaration. Believe you wanted
> to see all the HT threads. Ran srun "mpstat -P ALL 1" instead and show
> threads 0 55.
>
> sample submit script attached. When launched via sbatch against a
> 28-core/56-thread node, array tasks are only assigned to the physical cores.
> HT threads remain unused.

Is it possible to enable the cgroup plugin in your environment? Some software can override the affinity setting if it is not enforced by cgroup. If you are able, please set:

```
TaskPlugin=task/cgroup,task/affinity
```

and create a cgroup.conf file in the same directory as slurm.conf with this content:

```
###
# Slurm cgroup support configuration file
#
# See man slurm.conf and man cgroup.conf for further
# information on cgroup configuration parameters
#######################
### General options ###
#######################
CgroupAutomount=yes

########################################
#### TaskPlugin=task/cgroup options ####
########################################
# Force cores limit, needs hwloc libraries
ConstrainCores=yes
# Bind each step task to a subset of allocated cores using
# sched_setaffinity. Needs hwloc libraries.
# (disabled since task/affinity is set)
TaskAffinity=no
```

You will need to restart the daemons.

Off-topic question: I see you are not enforcing memory, so you can get OOMs. Is this intended?

Hi,

```
SelectType              = select/cons_res
SelectTypeParameters    = CR_CPU_MEMORY
TaskPlugin              = task/cgroup,task/affinity
TaskPluginParam         = (null type)
```

Placed the cgroup.conf in use. No change though; still 28 tasks running at a time from the array.

Many of our jobs have spikes in memory use that forced Slurm to kill them. Turning off memory enforcement does expose us to the OOM killer occasionally (very rare).

(In reply to Doug Meyer from comment #7)
> Hi,
> SelectType = select/cons_res
> SelectTypeParameters = CR_CPU_MEMORY
> TaskPlugin = task/cgroup,task/affinity
> TaskPluginParam = (null type)
>
> placed the cgroup.conf in use.
>
> No change though. still 28 tasks running at a time from the array.

I would need your slurmctld log at the time of submitting such a job, but first enable debug:

```
scontrol setdebug debug2
scontrol setdebugflags +CPU_Bind
```

Grab the logs and reset debug to your previous values. Then I'd need the outputs of the commands requested in your other bug 7029, i.e.:

```
srun --mem 10 --ntasks-per-core=1 --ntasks=28 bash -c "taskset -cp \$\$" | cut -d":" -f 2 | sort -n
```

> Many of our jobs have spikes in memory use that forced slurm to kill them.
> Turning off memory enforcement does expose us to OOM killer occasionally
> (very rare)

So you are assuming you may have OOM. If this is the case, that is OK, but be aware that this can affect slurmd and other system components too.

Thank you

Created attachment 10292 [details]
test_results
Suspect I hosed the test. I undid cgroups last week as it was not a production change. This test is without cgroups enabled.
Could not run the command from the command line, but was able to put it into a script and srun that.
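The command-line failure reported earlier ("missing variable declaration") is consistent with the submitting shell expanding `$$` before srun ever runs, which is why the suggested command escapes it as `\$\$`. A minimal sketch of the pitfall, runnable without Slurm (purely illustrative, not from the ticket):

```shell
# Sketch of the quoting pitfall behind "taskset -cp \$\$".
# With an unescaped $$ inside double quotes, the *outer* shell
# substitutes its own PID before the inner bash runs; escaping
# as \$\$ (or using single quotes) defers expansion to the child.
outer=$(bash -c "echo $$")    # outer shell's PID, expanded here
inner=$(bash -c "echo \$\$")  # child bash's PID, expanded there
echo "outer=$outer inner=$inner"
```

Under srun the same difference decides whether every task reports the submitting shell's PID or its own, so the escaping matters for the binding test.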
Doug, is it OK to mark this bug as a duplicate of your other bug 7029? Thanks

That will be fine. For some reason I thought I was asked to open a separate ticket...

Marking as dup.

*** This ticket has been marked as a duplicate of ticket 7029 ***