Hi there,

We've just started running Slurm 24.11.3 on Perlmutter and have enabled "enable_stepmgr" to delegate step creation from slurmctld to the job (which has worked fantastically in testing on the test systems, thank you!). However, now that we're in production I'm noticing a stream of messages of the form:

[2025-03-27T02:39:08.812] RPC rate limit exceeded by uid xxxx with REQUEST_JOB_STEP_CREATE from 10.100.56.100:47452, telling to back off
[2025-03-27T02:39:08.812] RPC rate limit exceeded by uid xxxx with REQUEST_JOB_STEP_CREATE from 10.100.56.100:46832, telling to back off
[2025-03-27T02:39:08.814] RPC rate limit exceeded by uid xxxx with REQUEST_JOB_STEP_CREATE from 10.100.56.100:46836, telling to back off
[2025-03-27T02:39:09.390] RPC rate limit exceeded by uid xxxx with REQUEST_JOB_STEP_CREATE from 10.100.56.103:59928, telling to back off
[2025-03-27T02:39:10.486] RPC rate limit exceeded by uid xxxx with REQUEST_JOB_STEP_CREATE from 10.100.56.103:44522, telling to back off

Now, "xxxx" here is always the same user, and I'm wondering what they could be doing that (if I'm understanding this correctly) would cause their steps to avoid this delegation.

Looking on the compute node I see they are running a lot of:

srun -N 1 -n 16 --cpus-per-task=1 python /path/to/script.py $ARGUMENTS

Looking at the process's environment doesn't show anything odd (I've anonymised identifying info):

perlmutter:nid004612:~ # strings -a /proc/371278/environ | fgrep SLURM
SLURM_NODEID=0
SLURM_TASK_PID=341557
SLURM_PRIO_PROCESS=0
SLURM_SUBMIT_DIR=/global/u1/a/xxxx
SLURM_JOB_LICENSES=u1:1
SLURM_PROCID=0
SLURM_JOB_GID=xxxx
SLURMD_NODENAME=nid004612
SLURM_JOB_END_TIME=1743044086
SLURM_TASKS_PER_NODE=256(x8)
SLURM_NNODES=8
SLURM_JOB_START_TIME=1743042286
SLURM_JOB_NODELIST=nid[004612-004613,004617-004618,004697-004700]
SLURM_CLUSTER_NAME=perlmutter
SLURM_NODELIST=nid[004612-004613,004617-004618,004697-004700]
SLURM_JOB_CPUS_PER_NODE=256(x8)
SLURM_TOPOLOGY_ADDR=nid004612
SLURM_JOB_NAME=xxxx
SLURM_JOBID=xxxx
SLURM_JOB_QOS=debug
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_CPUS_ON_NODE=256
SLURM_JOB_NUM_NODES=8
SLURM_JOB_UID=xxxx
SLURM_JOB_PARTITION=regular_milan_ss11
SLURM_SCRIPT_CONTEXT=prolog_task
SLURM_JOB_USER=xxxx
SLURM_SUBMIT_HOST=login24
SLURM_JOB_ACCOUNT=xxxx
SLURM_GTIDS=0
SLURM_JOB_ID=xxxx
SLURM_OOM_KILL_STEP=0
SLURM_LOCALID=0

Any ideas on what could lead to this, please?

All the best,
Chris
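(For readers landing on this thread later: both features involved here are toggled through SlurmctldParameters in slurm.conf, i.e. "enable_stepmgr" delegates step creation to the job, while "rl_enable" turns on the RPC rate limiting that produces the "telling to back off" messages above. A minimal sketch of what such a configuration typically looks like, assuming a default config location and leaving the rl_* tuning options at their defaults; these are not the actual Perlmutter settings:)

# sketch only; the config path and exact parameter set are assumptions
grep -i SlurmctldParameters /etc/slurm/slurm.conf
#   SlurmctldParameters=enable_stepmgr,rl_enable
scontrol reconfigure    # apply the change without restarting slurmctld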
Hiya,

Poking through the source I see this is controlled via the SLURM_STEPMGR environment variable, and when I run a test myself that looks correct:

salloc: Nodes nid200026 are ready for job
encsamuel@nid200026:~> env | grep SLURM_STEPMGR
SLURM_STEPMGR=nid200026

But when I look at the environment of one of their sruns I don't see it:

perlmutter:nid004612:~ # strings -a /proc/373170/environ | fgrep SLURM_STEPMGR
perlmutter:nid004612:~ #

I'll check to confirm it's not something they're doing that's causing this.

All the best,
Chris
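(A small hedged sketch of the same check, runnable from inside a salloc shell or a batch script; the echo wording reflects my reading of the thread, namely that the variable holds the host running the job's step manager:)

# sketch: run inside the job's environment
if [ -n "${SLURM_STEPMGR:-}" ]; then
    echo "step creation is delegated to ${SLURM_STEPMGR}"
else
    echo "SLURM_STEPMGR unset: steps will be created via slurmctld"
fi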
Hiya,

OK, I now believe this is something they are somehow doing to themselves. I can see that before they source a particular config script SLURM_STEPMGR is set, and after it is not. So I will pursue this with them directly.

All the best!
Chris
Reopening this as I realised I was running into hidden permission issues that prevented me from replicating the user's environment, and the errors generated led me to miss the second instance of the variable in my failed test.

I am curious whether the fact that this job was submitted before the upgrade to 24.11, i.e. submitted under 23.11.10 with no "enable_stepmgr", would have meant it did not pick up that support when it ran?
Hi Chris,

So, running a quick test: if on 24.11 I do the following:

1 - Disable stepmgr (change config and reconfigure)
2 - Submit a job
3 - Enable stepmgr (change config and reconfigure)
4 - The job starts without stepmgr support.

Then, looking at the code, I can see this happens because Slurm marks a job to use stepmgr at job creation time [1] (not at job execution), setting a flag under the following conditions:

> if ((stepmgr_enabled || (job_desc->bitflags & STEPMGR_ENABLED)) &&
>     (job_desc->het_job_offset == NO_VAL) &&
>     (job_ptr->start_protocol_ver >= SLURM_24_05_PROTOCOL_VERSION)) {
>         job_ptr->bit_flags |= STEPMGR_ENABLED;    -> enable stepmgr
> } else {
>         job_ptr->bit_flags &= ~STEPMGR_ENABLED;   -> disable stepmgr
> }

So a job will use stepmgr only if, at submission time, all of these conditions hold:

- The job is not a het job
- The job requested --stepmgr, or slurm.conf has enable_stepmgr
- The submitting client was 24.05 or newer

I'm just sharing this for completeness. But, as you suspected, in the case discussed, given that this logic was missing in 23.11, I understand the job was never marked with the STEPMGR_ENABLED bitflag at submit time, so during allocation it was simply treated as a normal job. It is also worth keeping in mind that older clients won't use stepmgr.

Hope that clears up your doubts.

Kind regards,
Oscar

[1] https://github.com/SchedMD/slurm/blob/9d9fb40491ceb4da8777d3c74c2b3faa95a5f077/src/slurmctld/job_mgr.c#L7514
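(For completeness, a sketch of the submit-side routes the conditions above refer to, assuming a 24.05+ client talking to a 24.05+ slurmctld; the script name, node count and QOS are placeholders. Either form should result in the job being marked STEPMGR_ENABLED at submission:)

# sketch; job.sh, -N 8 and -q debug are placeholders
sbatch --stepmgr -N 8 -q debug job.sh   # per-job request for step management
sbatch -N 8 -q debug job.sh             # relies on SlurmctldParameters=enable_stepmgr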
Hey Oscar,

Perfect, thank you so much! That explains everything. So just have to get through the 20K+ jobs that were queued up before the maintenance to reach job step creation perfection. That shouldn't take long, right? :-D

Very much obliged! I'll close this now.

All the best,
Chris