Ticket 22445 - Querying why slurmctld reports rate limiting REQUEST_JOB_STEP_CREATE when using enable_stepmgr
Summary: Querying why slurmctld reports rate limiting REQUEST_JOB_STEP_CREATE when using enable_stepmgr
Status: RESOLVED INFOGIVEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmctld
Version: 24.11.3
Hardware: Linux
OS: Linux
Severity: 4 - Minor Issue
Assignee: Oscar Hernández
QA Contact:
URL:
Depends on:
Blocks:
 
Reported: 2025-03-26 20:50 MDT by Chris Samuel (NERSC)
Modified: 2025-03-27 09:02 MDT

See Also:
Site: NERSC
Alineos Sites: ---
Atos/Eviden Sites: ---
Confidential Site: ---
Coreweave sites: ---
Cray Sites: ---
DS9 clusters: ---
HPCnow Sites: ---
HPE Sites: ---
IBM Sites: ---
NOAA Site: ---
NoveTech Sites: ---
Nvidia HWinf-CS Sites: ---
OCF Sites: ---
Recursion Pharma Sites: ---
SFW Sites: ---
SNIC sites: ---
Tzag Elita Sites: ---
Linux Distro: ---
Machine Name:
CLE Version:
Version Fixed:
Target Release: ---
DevPrio: ---
Emory-Cloud Sites: ---


Attachments

Description Chris Samuel (NERSC) 2025-03-26 20:50:55 MDT
Hi there,

We've just started running Slurm 24.11.3 on Perlmutter and have enabled "enable_stepmgr" to delegate step creation from slurmctld to the job (which has worked fantastically in testing on the test systems, thank you!).
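
For reference, the knob we flipped is the one slurm.conf exposes for this; a rough way to double-check it is active on the controller (assuming it lives under SlurmctldParameters and the config sits in the standard /etc/slurm/slurm.conf location) is:

scontrol show config | grep -i SlurmctldParameters
grep -i SlurmctldParameters /etc/slurm/slurm.conf
# expecting enable_stepmgr to show up in the output of both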

However, now that we're in production I'm noticing a stream of messages of the form:

[2025-03-27T02:39:08.812] RPC rate limit exceeded by uid xxxx with REQUEST_JOB_STEP_CREATE from 10.100.56.100:47452, telling to back off
[2025-03-27T02:39:08.812] RPC rate limit exceeded by uid xxxx with REQUEST_JOB_STEP_CREATE from 10.100.56.100:46832, telling to back off
[2025-03-27T02:39:08.814] RPC rate limit exceeded by uid xxxx with REQUEST_JOB_STEP_CREATE from 10.100.56.100:46836, telling to back off
[2025-03-27T02:39:09.390] RPC rate limit exceeded by uid xxxx with REQUEST_JOB_STEP_CREATE from 10.100.56.103:59928, telling to back off
[2025-03-27T02:39:10.486] RPC rate limit exceeded by uid xxxx with REQUEST_JOB_STEP_CREATE from 10.100.56.103:44522, telling to back off

Now "xxxx" in this case is the same user, and I'm wondering what they could be doing that (if I am understanding this correctly) would cause them to avoid this delegation?

Looking on the compute node I see they are running a lot of:

srun -N 1 -n 16 --cpus-per-task=1 python /path/to/script.py $ARGUMENTS

Looking at the process's environment doesn't seem to show anything odd (I've anonymised identifying info).

perlmutter:nid004612:~ # strings -a /proc/371278/environ  | fgrep SLURM
SLURM_NODEID=0
SLURM_TASK_PID=341557
SLURM_PRIO_PROCESS=0
SLURM_SUBMIT_DIR=/global/u1/a/xxxx
SLURM_JOB_LICENSES=u1:1
SLURM_PROCID=0
SLURM_JOB_GID=xxxx
SLURMD_NODENAME=nid004612
SLURM_JOB_END_TIME=1743044086
SLURM_TASKS_PER_NODE=256(x8)
SLURM_NNODES=8
SLURM_JOB_START_TIME=1743042286
SLURM_JOB_NODELIST=nid[004612-004613,004617-004618,004697-004700]
SLURM_CLUSTER_NAME=perlmutter
SLURM_NODELIST=nid[004612-004613,004617-004618,004697-004700]
SLURM_JOB_CPUS_PER_NODE=256(x8)
SLURM_TOPOLOGY_ADDR=nid004612
SLURM_JOB_NAME=xxxx
SLURM_JOBID=xxxx
SLURM_JOB_QOS=debug
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_CPUS_ON_NODE=256
SLURM_JOB_NUM_NODES=8
SLURM_JOB_UID=xxxx
SLURM_JOB_PARTITION=regular_milan_ss11
SLURM_SCRIPT_CONTEXT=prolog_task
SLURM_JOB_USER=xxxx
SLURM_SUBMIT_HOST=login24
SLURM_JOB_ACCOUNT=xxxx
SLURM_GTIDS=0
SLURM_JOB_ID=xxxx
SLURM_OOM_KILL_STEP=0
SLURM_LOCALID=0

Any ideas on what could lead to this please?

All the best,
Chris
Comment 1 Chris Samuel (NERSC) 2025-03-26 20:59:51 MDT
Hiya,

Poking through the source I see this is controlled via the SLURM_STEPMGR environment variable, and when I run a test myself that looks correct:

salloc: Nodes nid200026 are ready for job
encsamuel@nid200026:~> env | grep SLURM_STEPMGR
SLURM_STEPMGR=nid200026

But when I look at the environment of one of their srun's I don't see it:

perlmutter:nid004612:~ # strings -a /proc/373170/environ | fgrep SLURM_STEPMGR
perlmutter:nid004612:~ #

I'll check to confirm it's not something they're doing that's causing this.

All the best,
Chris
Comment 2 Chris Samuel (NERSC) 2025-03-26 21:04:34 MDT
Hiya,

OK I now believe this is something they are somehow doing to themselves.

I can see that before they source a particular config script SLURM_STEPMGR is set, and after it is not.
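
For anyone following along, a rough way to see what that script clobbers (the script name below is just a placeholder for their actual file) is to diff the environment around the source:

env | sort > /tmp/env.before
source ./their_setup_script.sh   # placeholder for the user's config script
env | sort > /tmp/env.after
diff /tmp/env.before /tmp/env.after | grep SLURM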

So I will pursue them about this directly.

All the best!
Chris
Comment 3 Chris Samuel (NERSC) 2025-03-26 23:17:43 MDT
Reopening this, as I realised I was running into hidden permission issues that prevented me from replicating the user's environment, and the errors generated led me to miss the second instance of the variable in my failed test.

I am curious whether the fact that this job was submitted before the upgrade to 24.11 (so it was submitted with 23.11.10, and thus no "enable_stepmgr") would have meant it did not pick up that support when it ran?
Comment 4 Oscar Hernández 2025-03-27 06:07:57 MDT
Hi Chris,

So, running a quick test: if on 24.11 I do the following (a rough command-level sketch follows the list):

1 - Disable stepmgr (change config and reconf)
2 - Submit job
3 - Enable stepmgr (change config and reconf)
4 - Job starts without stepmgr support.
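
At the command level, that test looks roughly like this (assuming enable_stepmgr sits under SlurmctldParameters; config path, job script and job id are placeholders):

sed -i 's/^SlurmctldParameters=enable_stepmgr/#&/' /etc/slurm/slurm.conf        # 1 - disable stepmgr
scontrol reconfigure
sbatch --hold test.sh                                                           # 2 - submit the job (held so it cannot start yet)
sed -i 's/^#\(SlurmctldParameters=enable_stepmgr\)/\1/' /etc/slurm/slurm.conf   # 3 - re-enable stepmgr
scontrol reconfigure
scontrol release <jobid>                                                        # 4 - the job now runs, but without stepmgr support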

Then, looking at the code, I can see this happens because Slurm marks a job to use stepmgr at job creation time[1] (not at job execution), setting a flag under the following conditions:

>	if ((stepmgr_enabled || (job_desc->bitflags & STEPMGR_ENABLED)) &&
>	    (job_desc->het_job_offset == NO_VAL) &&
>	    (job_ptr->start_protocol_ver >= SLURM_24_05_PROTOCOL_VERSION)) {
>		job_ptr->bit_flags |= STEPMGR_ENABLED;  /* -> enable stepmgr */
>	} else {
>		job_ptr->bit_flags &= ~STEPMGR_ENABLED; /* -> disable stepmgr */
>	}
So, a job will use stepmgr if, at submission time, all of these conditions match:

- Job is not a hetjob
- Job requested --stepmgr or slurm.conf has enable_stepmgr
- Submitting client was 24.05 or newer.

Just sharing this for completeness. But, as you were suspecting, in the case discussed, given that this logic was missing in 23.11, I understand the job was never marked with the STEPMGR_ENABLED bitflag at submit time. So, during allocation, it was just treated as a normal job. It is also worth considering that older clients won't use stepmgr.
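
As a quick sanity check for jobs submitted after the upgrade, the SLURM_STEPMGR check you did in comment 1 should now pass from inside a fresh allocation, e.g. (node name will of course differ):

salloc -N 1
env | grep SLURM_STEPMGR   # expect SLURM_STEPMGR=<first node of the job> when the flag was set at submit time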

Hope that cleared your doubts.

Kind regards,
Oscar 

[1]https://github.com/SchedMD/slurm/blob/9d9fb40491ceb4da8777d3c74c2b3faa95a5f077/src/slurmctld/job_mgr.c#L7514
Comment 5 Chris Samuel (NERSC) 2025-03-27 09:02:39 MDT
Hey Oscar,

Perfect, thank you so much! That explains everything.

So just have to get through the 20K+ jobs that were queued up before the maintenance to reach job step creation perfection. That shouldn't take long, right? :-D

Very much obliged! I'll close this now.

All the best,
Chris