| Summary: | Slurm seems to ignore DefaultMemPerCPU | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | pbisbal |
| Component: | Configuration | Assignee: | Marcin Stolarek <cinek> |
| Status: | RESOLVED INFOGIVEN | QA Contact: | |
| Severity: | 4 - Minor Issue | Priority: | --- |
| Version: | 19.05.5 | Hardware: | Linux |
| OS: | Linux | | |
| Site: | Princeton Plasma Physics Laboratory (PPPL) | | |
In fact, there was a fix for an incorrect non-null value of min_memory_per_cpu passed to job_submit/lua (Bug 7276), released in 19.05.3. Previously it was not possible to check whether --mem-per-cpu was specified by comparing its value to nil. The commit fixing it is 2d0178752[1]. You can check the details in the referenced bug, but to make a long story short: previously, when neither --mem nor --mem-per-cpu was specified, a value based on the special value NO_VAL64 (all bits set except the last) was passed to the Lua script instead of nil.

The rule of thumb is that slurm.conf/slurmdbd defaults are applied to the job after it passes through the job_submit plugin interface. If you want to rely on Slurm setting the default memory requirement for a job, you can skip checking it in your job_submit plugin and simplify the logic to fail only for jobs requesting --mem=0.

The handling of mixed --mem-per-cpu and --mem options was also improved in 19.05; it is now rejected in much the same way your script does it:

```
# sbatch --mem-per-cpu=10 --mem=5 --wrap="hostname"
sbatch: fatal: --mem, --mem-per-cpu, and --mem-per-gpu are mutually exclusive.
# sbatch -V
slurm 19.05.6
```

Let me know whether this helps to understand the situation.

cheers,
Marcin

[1] https://github.com/SchedMD/slurm/commit/2d01787524f8bd9587330406c154aa961b1b6e71

I agree that this was the case for 18.08 (and 19.05 until 19.05.3). I believe that we're missing some other mechanism.

Thanks. That helps.
Has there been a change between 18.08 and 19.05 with regard to how memory is calculated, or when it is checked? In my job_submit.lua script, I check to make sure that memory is defined via --mem-per-cpu or --mem:

```lua
-- Check that memory requirements have been specified and are valid.
-- If either value is set to 0, that means 'unlimited', and should be rejected.
-- Use min_mem_per_node instead of pn_min_memory. From trial and error, I have
-- found that determining whether or not min_mem_per_node is defined is much
-- more reliable than for pn_min_memory.
if job_desc.min_mem_per_node == nil and job_desc.min_mem_per_cpu == nil then
    slurm.user_msg("Job rejected: No memory requirements specified")
    return 2044 -- signal ESLURM_INVALID_TASK_MEMORY
elseif job_desc.min_mem_per_node == 0 or job_desc.min_mem_per_cpu == 0 then
    slurm.user_msg("Unlimited memory requests not allowed.")
    return 2044 -- signal ESLURM_INVALID_TASK_MEMORY
elseif job_desc.min_mem_per_node ~= nil and job_desc.min_mem_per_cpu ~= nil then
    -- --mem and --mem-per-cpu are mutually exclusive, so we should never end up here.
    slurm.user_msg("Both --mem and --mem-per-cpu have been specified. This is not allowed, and should never happen.")
    return 2044 -- signal ESLURM_INVALID_TASK_MEMORY
end
```

At the same time, in slurm.conf I define DefaultMemPerCPU=2000, so if someone does not specify --mem-per-cpu, they should not get rejected by the code above. In case you're wondering, I check for MemPerCPU as a failsafe, to make sure something isn't wrong with my configuration.

Whether you agree with my logic or not, the above setup worked well for quite a while. If a user didn't specify --mem-per-cpu, that value defaulted to 2000, and their job made it through my filter. However, since upgrading to 19.05, the behavior has changed.
Users must now specify --mem-per-cpu or their job gets rejected with the following error:

```
sbatch: error: Job rejected: No memory requirements specified
sbatch: error: Batch job submission failed: Memory required by task is not available
```

So, has something changed between 18.08 and 19.05? Here's the relevant configuration from my slurm.conf:

```
$ scontrol show config | grep DefMem
DefMemPerCPU            = 2000
```