Summary: | Querying why slurmctld reports rate limiting REQUEST_JOB_STEP_CREATE when using enable_stepmgr
---|---
Product: | Slurm
Reporter: | Chris Samuel (NERSC) <csamuel>
Component: | slurmctld
Assignee: | Oscar Hernández <oscar.hernandez>
Status: | RESOLVED INFOGIVEN
Severity: | 4 - Minor Issue
CC: | oscar.hernandez
Version: | 24.11.3
Hardware: | Linux
OS: | Linux
Site: | NERSC
Description
Chris Samuel (NERSC)
2025-03-26 20:50:55 MDT
Hiya,

Poking through the source I see stepmgr support is controlled via the SLURM_STEPMGR environment variable, and when I run a test myself that looks correct:

    salloc: Nodes nid200026 are ready for job
    csamuel@nid200026:~> env | grep SLURM_STEPMGR
    SLURM_STEPMGR=nid200026

But when I look at the environment of one of their srun processes I don't see it:

    perlmutter:nid004612:~ # strings -a /proc/373170/environ | fgrep SLURM_STEPMGR
    perlmutter:nid004612:~ #

I'll check to confirm it's not something they're doing that's causing this.

All the best,
Chris

Hiya,

OK, I now believe this is something they are somehow doing to themselves. I can see that before they source a particular config script SLURM_STEPMGR is set, and after it is not. So I will pursue them about this directly.

All the best!
Chris

Reopening this as I realised I was running into hidden permission issues that prevented me from replicating the user's environment, and the errors generated led me to miss the second instance of the variable in my failed test. I am curious: given that this job was submitted before the upgrade to 24.11, i.e. under 23.11.10 and thus with no "enable_stepmgr", would that have meant it did not pick up stepmgr support when it ran?

Hi Chris,

So, running a quick test. If on 24.11 I do the following:

1 - Disable stepmgr (change config and reconf)
2 - Submit a job
3 - Enable stepmgr (change config and reconf)
4 - The job starts without stepmgr support.

Then, looking at the code, I can see this happens because Slurm marks a job to use stepmgr at job creation time [1] (not at job execution), setting a flag under the following conditions:

> if ((stepmgr_enabled || (job_desc->bitflags & STEPMGR_ENABLED)) &&
>     (job_desc->het_job_offset == NO_VAL) &&
>     (job_ptr->start_protocol_ver >= SLURM_24_05_PROTOCOL_VERSION)) {
>         job_ptr->bit_flags |= STEPMGR_ENABLED;  /* enable stepmgr */
> } else {
>         job_ptr->bit_flags &= ~STEPMGR_ENABLED;  /* disable stepmgr */
> }

So, a job will use stepmgr only if, at submission time, all of these conditions hold:

- The job is not a het job.
- The job requested --stepmgr, or slurm.conf has enable_stepmgr.
- The submitting client was 24.05 or newer.

Just sharing this for completeness. But, as you suspected, in the case discussed, given that this logic was missing in 23.11, I understand the job was never marked with the STEPMGR_ENABLED bitflag at submit time. So, during allocation, it was just treated as a normal job. It is also worth considering that older clients won't use stepmgr.

Hope that cleared your doubts.

Kind regards,
Oscar

[1] https://github.com/SchedMD/slurm/blob/9d9fb40491ceb4da8777d3c74c2b3faa95a5f077/src/slurmctld/job_mgr.c#L7514

Hey Oscar,

Perfect, thank you so much! That explains everything. So we just have to get through the 20K+ jobs that were queued up before the maintenance to reach job step creation perfection. That shouldn't take long, right? :-D

Very much obliged! I'll close this now.

All the best,
Chris
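
Editor's note: below is a minimal sketch of the reproduction Oscar describes above, assuming a 24.11 test cluster where slurm.conf can be edited. The --hold/release pairing is an addition here to make the ordering deterministic (the original report does not specify how the job was kept pending), and the --wrap command line is illustrative.

    # 1 - Disable stepmgr: remove enable_stepmgr from SlurmctldParameters
    #     in slurm.conf, then push the change out to the controller:
    scontrol reconfigure

    # 2 - Submit a held job that just reports whether the variable is set
    #     in its environment once it runs:
    sbatch --hold --wrap 'env | grep SLURM_STEPMGR'

    # 3 - Re-enable stepmgr (SlurmctldParameters=enable_stepmgr in
    #     slurm.conf) and reconfigure again:
    scontrol reconfigure

    # 4 - Release the job. Per the job_mgr.c logic quoted above, the
    #     STEPMGR_ENABLED bitflag was evaluated at submit time, so the
    #     job's output should NOT contain SLURM_STEPMGR even though
    #     stepmgr is now enabled cluster-wide:
    scontrol release <jobid>

Conversely, a job submitted after step 3 should show SLURM_STEPMGR=<nodename> in its environment, matching Chris's salloc test at the top of this report.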