Ticket 14590 - GpuFreqDef is not doing what it says on the tin
Summary: GpuFreqDef is not doing what it says on the tin
Status: RESOLVED FIXED
Alias: None
Product: Slurm
Classification: Unclassified
Component: Configuration (show other tickets)
Version: 21.08.8
Hardware: Linux Linux
: 4 - Minor Issue
Assignee: Oriol Vilarrubi
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2022-07-22 01:27 MDT by Chris Samuel (NERSC)
Modified: 2023-02-27 05:11 MST (History)
3 users (show)

See Also:
Site: NERSC
Slinky Site: ---
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
Google sites: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA SIte: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed: 23.02.0
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Note You need to log in before you can comment on or make changes to this ticket.
Description Chris Samuel (NERSC) 2022-07-22 01:27:35 MDT
Hi there,

After investigating some performance oddities for applications we've found that it appears that GpuFreqDef is not being applied to jobs that don't explicitly ask for "--gpu-freq=[..]" - viz:

> scontrol show config | fgrep GpuFreqDef
GpuFreqDef              = high,memory=high

> srun --gpus-per-task=1 -C gpu -n1 -N 1 nvidia-smi --query-gpu=clocks.applications.gr,clocks.applications.mem --format=csv
clocks.applications.graphics [MHz], clocks.applications.memory [MHz]
1095 MHz, 1215 MHz

But when I ask for frequencies explicitly they do get applied:

> srun --gpu-freq=high,memory=high --gpus-per-task=1 -C gpu -n1 -N 1 nvidia-smi --query-gpu=clocks.applications.gr,clocks.applications.mem --format=csv
clocks.applications.graphics [MHz], clocks.applications.memory [MHz]
1410 MHz, 1215 MHz

> srun --gpu-freq=low --gpus-per-task=1 -C gpu -n1 -N 1 nvidia-smi --query-gpu=clocks.applications.gr,clocks.applications.mem --format=csv
clocks.applications.graphics [MHz], clocks.applications.memory [MHz]
210 MHz, 1215 MHz

I did try explicitly adding that default to the config file and reconfiguring, but it didn't change the behaviour.

Looking at the NVML plugin I think this is happening because the defaults are applied when _parse_gpu_freq() calls slurm_get_gpu_freq_def(), and _parse_gpu_freq() is in turn called from _set_freq() but __set_freq() is only ever called from gpu_p_step_hardware_init() if it gets past this check:

        if (!tres_freq)
                return;         /* No TRES frequency spec */

        if (!(tmp = strstr(tres_freq, "gpu:")))
                return;         /* No GPU frequency spec */


where tres_freq is passed in as an argument to the function and I suspect that it is in turn only set if the user specifies --gpu-freq

If that's right then GpuFreqDef is never used.

Does that diagnosis sound right?

All the best,
Chris
Comment 1 Oriol Vilarrubi 2022-07-22 11:21:25 MDT
Hi Chris,

You are absolutely right,

In slurmstepd mgr.c the gres_g_step_hardware_init call is done or not depending whether job->tres_freq is set or no, which in case that the job does not set the parameter will not be set, so the init will not be done, and in consequence the frequency will not be set.

So yes your diagnostic was 99% right, thanks for that it helped reducing the diagnostic time a lot. I'll work on to see how to handle this situation and I'll get back to you.

Regards.
Comment 2 Chris Samuel (NERSC) 2022-07-22 11:37:14 MDT
Hi Oriol,

No worries, thanks so much!

My local plan for an interim workaround is to patch out the cosmetic error from #14593 which also relates to `--gpu-freq` and then we can use our cli_filter infrastructure to force `--gpu-freq=high` for all users who don't otherwise specify it, does that sound reasonable?

All the best,
Chris
Comment 3 Chris Samuel (NERSC) 2022-07-22 22:42:33 MDT
After I'd built new Slurm RPMs with a local patch for bug #14593 to hide that cosmetic error I found adding this to our cli_filter lua config appears to work nicely (an extract from our slurm_cli_pre_submit() function):

    -- If we are a GPU job apply necessary settings
    if args["constraint"] == "gpu"
    then
        -- Set GPU frequency to high if not requested
        if args["gpu-freq"] == nil or args["gpu-freq"] == ''
        then
            slurm.log_verbose("GPU frequency not found, forcing to high")
            args["gpu-freq"] = "high"
        end
    end
Comment 4 Chris Samuel (NERSC) 2022-07-23 23:56:51 MDT
I took a quick peek at the RSMI GPU plugin for AMD GPUs and it looks like it will have the same issue too as its gpu_p_step_hardware_init() function has the same pattern at the top:

        if (!usable_gpus)
                return;         /* Job allocated no GPUs */
        if (!tres_freq)
                return;         /* No TRES frequency spec */
Comment 5 Oriol Vilarrubi 2022-07-25 03:10:33 MDT
Hi Chris,

(In reply to Chris Samuel (NERSC) from comment #3)

> After I'd built new Slurm RPMs with a local patch for bug #14593 to hide
> that cosmetic error I found adding this to our cli_filter lua config appears
> to work nicely (an extract from our slurm_cli_pre_submit() function):

Yes, that looks good, I would also do the same, in any case, I will evaluate if this is the expected behavior of gpufreqdef (which I think is not, but I want to talk about it with my colleagues) and we will either change the documentation to reflect that this will not be set unless the job specifies the flag option, or we will change the code to always apply the gpufreqdef.

(In reply to Chris Samuel (NERSC) from comment #4)
> I took a quick peek at the RSMI GPU plugin for AMD GPUs and it looks like it
> will have the same issue too as its gpu_p_step_hardware_init() function has
> the same pattern at the top:

Yes, all the gpu plugins will be affected, but the reason is this another one:

The src/plugin/gpu plugin function gpu_p_step_hardware_init() get called by the src/common/gpu function gpu_g_step_hardware_init, which gets called by the gres_gpu gres plugin function gres_p_step_hardware_init, which gets called by common/gres function gres_g_step_hardware_init, which gets called by slurmstepd/mgr.c in _fork_all_tasks function, inside this if:

if (!job->batch && (job->step_id.step_id != SLURM_INTERACTIVE_STEP) && job->tres_freq) {

And if the job does not specify any gpu-freq, the job->tres_freq will be null, thus not entering the if and never calling the init. So all the current gpu plugins and future ones will not get the init called unless this job->tres_freq gets populated.

Regards.
Comment 6 Chris Samuel (NERSC) 2022-07-27 15:23:08 MDT
Hi Oriol,

Oh I missed that subtlety, thank you!

All the best,
Chris
Comment 7 Chris Samuel (NERSC) 2022-07-27 23:55:27 MDT
(In reply to Oriol Vilarrubi from comment #5)

> Yes, that looks good, I would also do the same, in any case, I will evaluate
> if this is the expected behavior of gpufreqdef (which I think is not, but I
> want to talk about it with my colleagues) and we will either change the
> documentation to reflect that this will not be set unless the job specifies
> the flag option, or we will change the code to always apply the gpufreqdef.

Yeah I agree with you that this doesn't sound like the expected behaviour, the documentation says it's only applied when a job does not request a frequency:

> Default GPU frequency to use when running a job step if it has not been explicitly set using the --gpu-freq option

I feel that if a user requests a frequency then this should never get applied as they will only want what they asked for, not the default frequency.

The issue with not setting it is that users will not get the expected performance from their GPUs (it was a user who complained about that which led us to this discovery).

All the best,
Chris
Comment 8 Oriol Vilarrubi 2022-07-28 03:46:59 MDT
Hi Chris,

(In reply to Chris Samuel (NERSC) from comment #7)
> I feel that if a user requests a frequency then this should never get
> applied as they will only want what they asked for, not the default
> frequency.

Yes absolutely, I wrote "or we will change the code to always apply the gpufreqdef", but what I really wanted to mean is "or we will change the code to apply the gpufreqdef when the user does not specify a frequency"

Regards.
Comment 9 Chris Samuel (NERSC) 2022-07-28 22:25:19 MDT
(In reply to Oriol Vilarrubi from comment #8)

> Yes absolutely, I wrote "or we will change the code to always apply the
> gpufreqdef", but what I really wanted to mean is "or we will change the code
> to apply the gpufreqdef when the user does not specify a frequency"

Ah no worries, thanks for explaining!  Much appreciated.

All the best,
Chris
Comment 11 Oriol Vilarrubi 2022-08-19 03:44:05 MDT
Hi Chris,

We are working on the solution of this bug, taking into account that you already have a valid workaround in place, I'm lowering the severity of it.

Regards.
Comment 13 Oriol Vilarrubi 2023-02-22 11:50:44 MST
Hi Chris,

This is an update to inform you that we have a working solution in place for this issue and we are in the phase of having it pass QA.

Regards.
Comment 17 Chris Samuel (NERSC) 2023-02-23 15:30:21 MST
Hi Oriol,

Thanks for the updates!  Sorry I didn't see them earlier.

All the best,
Chris
Comment 20 Oriol Vilarrubi 2023-02-27 05:11:50 MST
Dear Chris,

In commits feb5734f5e..90bfdbf5c5, we fixed this issue in the following way: GpuFreqDef will have no default value anymore if it is not set in slurm.conf, and in consequence will not set any frequency in the job start if this and --gpu-freq are not set. But if GpuFreqDef is set to something, now (after this 2 commits, which will be contained in 23.02.0) it will set tje frequency to that.

I am closing this bug as resolved, feel free to reopen if somehting is not clear.

Regards.