Ticket 22423 - slurmstepd: error: environment variable SLURM_CPU_BIND is too long
Summary: slurmstepd: error: environment variable SLURM_CPU_BIND is too long
Status: OPEN
Alias: None
Product: Slurm
Classification: Unclassified
Component: slurmstepd
Version: 24.05.3
Hardware: Linux Linux
Severity: 4 - Minor Issue
Assignee: Michael Steed
Reported: 2025-03-24 09:42 MDT by David Gloe
Modified: 2025-03-26 16:47 MDT
Site: CRAY
Cray Sites: Cray Internal

Description David Gloe 2025-03-24 09:42:51 MDT
On AMD Turin nodes with 768 CPUs, we're seeing the following error:

srun --mpi=cray_shasta --nodefile=/tmp/mymachlist.24775.run --ntasks-per-node=768 /usr/diags/mpi/cray/amd/olconft.cray.sles
slurmstepd: error: environment variable SLURM_CPU_BIND is too long
slurmstepd: error: Unable to set SLURM_CPU_BIND
slurmstepd: error: environment variable SLURM_CPU_BIND_LIST is too long
slurmstepd: error: Unable to set SLURM_CPU_BIND_LIST
slurmstepd: error: environment variable SLURM_CPU_BIND is too long
slurmstepd: error: Unable to set SLURM_CPU_BIND
slurmstepd: error: environment variable SLURM_CPU_BIND is too long
slurmstepd: error: Unable to set SLURM_CPU_BIND
slurmstepd: error: environment variable SLURM_CPU_BIND_LIST is too long
...
Comment 4 Michael Steed 2025-03-26 16:47:21 MDT
Hi David,

The errors about `SLURM_CPU_BIND` and `SLURM_CPU_BIND_LIST` being too long will not prevent job execution, although those variables will remain unset in the job environment. Can you confirm that your jobs still run as expected?

The failure to set these environment variables is due to a kernel limit on the size of a single environment string (128 KB per variable). We are looking at updating the logging around this, since the condition does not cause jobs to fail.
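
As a rough illustration of why a 768-CPU node with 768 tasks could cross that limit, here is a back-of-the-envelope sketch in Python. It assumes the bind list is a comma-separated list with one full-width hexadecimal CPU mask per task, and it treats the 128 KB limit as covering the whole `NAME=value` string plus its terminator. Both points are assumptions made for the estimate and are not confirmed in this ticket. The helper names (`hex_mask_len`, `bind_list_len`, `fits`) are hypothetical.

```python
# 128 KB per-variable limit, as cited in the comment above.
LIMIT_BYTES = 128 * 1024


def hex_mask_len(num_cpus: int) -> int:
    """Length of one full-width hex mask, e.g. '0x00ff' (assumed format)."""
    # "0x" prefix plus one hex digit per 4 CPUs, rounded up.
    return len("0x") + (num_cpus + 3) // 4


def bind_list_len(num_tasks: int, num_cpus: int) -> int:
    """Length of a comma-separated list with one mask per task (assumed format)."""
    # One mask per task, with a comma between consecutive masks.
    return num_tasks * hex_mask_len(num_cpus) + (num_tasks - 1)


def fits(name: str, value_len: int, limit: int = LIMIT_BYTES) -> bool:
    """True if 'NAME=value' plus a terminating NUL fits within the limit."""
    return len(name) + 1 + value_len + 1 <= limit


if __name__ == "__main__":
    size = bind_list_len(num_tasks=768, num_cpus=768)
    print(size, fits("SLURM_CPU_BIND_LIST", size))
```

Under these assumptions each mask is 194 characters long, so the list for 768 tasks comes to 149,759 characters, which is above the 131,072-byte limit. The actual format slurmstepd uses may differ, so treat the numbers as an order-of-magnitude check only.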

Related: ticket 644

Michael