| Summary: | Job resize not setting job environment variables correctly | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Moe Jette <jette> |
| Component: | User Commands | Assignee: | Moe Jette <jette> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | | |
| Version: | 15.08.13 | | |
| Hardware: | Linux | | |
| OS: | Linux | | |
| Site: | SchedMD | Alineos Sites: | --- |
| Atos/Eviden Sites: | --- | Confidential Site: | --- |
| Coreweave sites: | --- | Cray Sites: | --- |
| DS9 clusters: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Linux Distro: | --- |
| Machine Name: | | CLE Version: | |
| Version Fixed: | 15.08.14, 16.05.10 | Target Release: | --- |
| DevPrio: | --- | Emory-Cloud Sites: | --- |
The problem was introduced by logic added to handle updates of job arrays, which caused the job state information to be captured _before_ the size change. I've relocated that logic in the following commit (added to v15.08):

https://github.com/SchedMD/slurm/commit/f42f6943a6046469b6e8a28894bfad78ce30821e

I've also somewhat hardened the related test to catch this type of problem in the following commit (added to v17.02):

https://github.com/SchedMD/slurm/commit/eba3d27381ed48f5edd097463e94712d2cc2466c
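To make the ordering issue concrete, here is an illustrative shell stand-in (the real fix is in Slurm's C code; the variable and function names here are hypothetical): the resize script was generated from job state captured before the shrink was applied.

```shell
# Illustrative sketch of the ordering bug -- not actual Slurm internals.
nnodes=4                              # current job size

snapshot_then_shrink() {              # buggy order
    script="export SLURM_NNODES=$nnodes"   # captures the OLD count
    nnodes=$1                              # shrink applied afterwards
    echo "$script"
}

shrink_then_snapshot() {              # fixed order
    nnodes=$1                              # shrink applied first
    echo "export SLURM_NNODES=$nnodes"     # captures the NEW count
}

snapshot_then_shrink 3                # -> export SLURM_NNODES=4 (stale)
nnodes=4
shrink_then_snapshot 3                # -> export SLURM_NNODES=3 (correct)
```

The relocated logic in the commit above amounts to moving the snapshot from the first ordering to the second.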
A request to decrease a job's size does in fact properly decrease the job size, but the script generated to update the job's environment variables contains stale data (the environment prior to the update rather than after it). This bug exists from version 15.08.13 through the current master.

```
$ salloc --ntasks-per-node=4 -N4 bash
salloc: Granted job allocation 1448
salloc: Waiting for resource configuration
salloc: Nodes smd[1-4] are ready for job
$ srun hostname
smd1
smd1
smd1
smd1
smd2
smd3
smd2
smd2
smd2
smd3
smd3
smd3
smd4
smd4
smd4
smd4
$ env | grep SLURM
SLURM_NODELIST=smd[1-4]
SLURM_JOB_NAME=bash
SLURM_NTASKS_PER_NODE=4
SLURM_NODE_ALIASES=(null)
SLURM_MEM_PER_CPU=50
SLURM_NNODES=4
SLURM_JOBID=1448
SLURM_NTASKS=16
SLURM_TASKS_PER_NODE=4(x4)
SLURM_JOB_ID=1448
PWD=/home/jette/SLURM/install_smd/bin
SLURM_SUBMIT_DIR=/home/jette/SLURM/install_smd/bin
SLURM_NPROCS=16
SLURM_JOB_NODELIST=smd[1-4]
SLURM_CLUSTER_NAME=smd
SLURM_JOB_CPUS_PER_NODE=4(x4)
SLURM_SUBMIT_HOST=smd1
SLURM_JOB_PARTITION=debug
SLURM_JOB_NUM_NODES=4
$ scontrol update JobId=$SLURM_JOB_ID NumNodes=3
To reset SLURM environment variables, execute
  For bash or sh shells:  . ./slurm_job_1448_resize.sh
  For csh shells:         source ./slurm_job_1448_resize.csh
$ cat ./slurm_job_1448_resize.sh
export SLURM_NODELIST="smd[1-4]"
export SLURM_JOB_NODELIST="smd[1-4]"
export SLURM_NNODES=4
export SLURM_JOB_NUM_NODES=4
export SLURM_JOB_CPUS_PER_NODE="4(x4)"
unset SLURM_TASKS_PER_NODE
$ squeue
 JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
  1448     debug  bash jette  R  0:40     3 smd[1-3]
$ srun hostname
srun: error: SLURM_NNODES environment variable conflicts with allocated node count (4!=3).
srun: error: Unable to create job step: More processors requested than permitted
```

Note that the resize script still exports the pre-update values (4 nodes, smd[1-4]) even though squeue correctly shows the job shrunk to 3 nodes, so sourcing it leaves the environment in conflict with the actual allocation.
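For contrast, here is a sketch of what the generated resize script should contain once the fix is in place, given the shrink to smd[1-3] shown in the squeue output above. This is an assumption for illustration, not the verified output of the fixed code; in particular, the "4(x3)" CPU string is inferred from the original 4-CPUs-per-node allocation.

```shell
# Hypothetical post-fix contents of slurm_job_1448_resize.sh --
# node list taken from the squeue output above; CPU string assumed.
export SLURM_NODELIST="smd[1-3]"
export SLURM_JOB_NODELIST="smd[1-3]"
export SLURM_NNODES=3
export SLURM_JOB_NUM_NODES=3
export SLURM_JOB_CPUS_PER_NODE="4(x3)"   # assumed: 4 CPUs per remaining node
unset SLURM_TASKS_PER_NODE
```

With these post-update values exported, a subsequent `srun` would no longer see the 4-vs-3 node-count conflict reported above.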