Ticket 3498

Summary: Job resize not setting job environment variables correctly
Product: Slurm
Reporter: Moe Jette <jette>
Component: User Commands
Assignee: Moe Jette <jette>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact
Version: 15.08.13
Hardware: Linux
OS: Linux
Site: SchedMD
Version Fixed: 15.08.14, 16.05.10

Description Moe Jette 2017-02-23 09:24:19 MST
A request to decrease a job's size does properly shrink the allocation, but the script generated to update the job's environment variables contains stale data (the environment prior to the update rather than after it). This bug exists in versions 15.08.13 through the current master.

$ salloc --ntasks-per-node=4 -N4 bash
salloc: Granted job allocation 1448
salloc: Waiting for resource configuration
salloc: Nodes smd[1-4] are ready for job

$ srun hostname
smd1
smd1
smd1
smd1
smd2
smd3
smd2
smd2
smd2
smd3
smd3
smd3
smd4
smd4
smd4
smd4

$ env | grep SLURM
SLURM_NODELIST=smd[1-4]
SLURM_JOB_NAME=bash
SLURM_NTASKS_PER_NODE=4
SLURM_NODE_ALIASES=(null)
SLURM_MEM_PER_CPU=50
SLURM_NNODES=4
SLURM_JOBID=1448
SLURM_NTASKS=16
SLURM_TASKS_PER_NODE=4(x4)
SLURM_JOB_ID=1448
PWD=/home/jette/SLURM/install_smd/bin
SLURM_SUBMIT_DIR=/home/jette/SLURM/install_smd/bin
SLURM_NPROCS=16
SLURM_JOB_NODELIST=smd[1-4]
SLURM_CLUSTER_NAME=smd
SLURM_JOB_CPUS_PER_NODE=4(x4)
SLURM_SUBMIT_HOST=smd1
SLURM_JOB_PARTITION=debug
SLURM_JOB_NUM_NODES=4


$ scontrol update JobId=$SLURM_JOB_ID NumNodes=3
To reset SLURM environment variables, execute
  For bash or sh shells:  . ./slurm_job_1448_resize.sh
  For csh shells:         source ./slurm_job_1448_resize.csh

$ cat ./slurm_job_1448_resize.sh
export SLURM_NODELIST="smd[1-4]"
export SLURM_JOB_NODELIST="smd[1-4]"
export SLURM_NNODES=4
export SLURM_JOB_NUM_NODES=4
export SLURM_JOB_CPUS_PER_NODE="4(x4)"
unset SLURM_TASKS_PER_NODE

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1448     debug     bash    jette  R       0:40      3 smd[1-3]

$ srun hostname
srun: error: SLURM_NNODES environment variable conflicts with allocated node count (4!=3).
srun: error: Unable to create job step: More processors requested than permitted
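Until running a fixed version, one possible workaround (my assumption, not something prescribed in this ticket) is to export the post-resize values by hand instead of sourcing the generated script, mirroring what squeue reports for the shrunken allocation (smd[1-3], 3 nodes):

```shell
# Hypothetical manual fix-up: set the variables the resize script *should*
# have emitted, based on the post-resize allocation (smd[1-3], 4 CPUs/node).
export SLURM_NODELIST="smd[1-3]"
export SLURM_JOB_NODELIST="smd[1-3]"
export SLURM_NNODES=3
export SLURM_JOB_NUM_NODES=3
export SLURM_JOB_CPUS_PER_NODE="4(x3)"
unset SLURM_TASKS_PER_NODE
```

With SLURM_NNODES matching the actual allocation, the node-count conflict that srun reports above should no longer trigger.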
Comment 1 Moe Jette 2017-02-23 11:03:37 MST
The problem was introduced in logic added to handle updates on job arrays, which resulted in the job state information being read _before_ the size change was applied. I've relocated that logic in the following commit (added to v15.08):
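The ordering problem can be sketched in miniature (a hypothetical illustration, not Slurm source): rendering the reset script from a snapshot of the job record taken before the update bakes the old node count into the generated file.

```shell
# Hypothetical sketch of the ordering bug: the reset script's contents come
# from job state captured at the wrong point relative to the resize.
job_nnodes=4                                   # job record before the update

# Buggy order: snapshot the state first, then apply the resize.
snapshot=$job_nnodes
job_nnodes=3                                   # resize applied too late
buggy_line="export SLURM_NNODES=$snapshot"     # emits the stale value, 4

# Fixed order: apply the resize first, then snapshot the state.
job_nnodes=4                                   # reset for the second run
job_nnodes=3                                   # resize applied first
snapshot=$job_nnodes
fixed_line="export SLURM_NNODES=$snapshot"     # emits the correct value, 3

echo "$buggy_line"
echo "$fixed_line"
```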

https://github.com/SchedMD/slurm/commit/f42f6943a6046469b6e8a28894bfad78ce30821e

I've also hardened the related test to catch this type of problem in the following commit (added to v17.02):
https://github.com/SchedMD/slurm/commit/eba3d27381ed48f5edd097463e94712d2cc2466c