We created a reservation for a user. He submitted a couple of jobs under this reservation, but the jobs stay pending with Reason=Reservation even though the reservation is active and the nodes are all free. These are the outputs from squeue, scontrol show res, and scontrol show jobid:

[root@slurm-001-p ~]# squeue -R uoo00015
   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
30679726    medium fwhf2000 andrew.p PD  0:00     8 (Reservation)

[root@slurm-001-p ~]# scontrol show res uoo00015
ReservationName=uoo00015 StartTime=2016-05-05T19:00:00 EndTime=2016-05-22T19:00:00 Duration=17-00:00:00
   Nodes=compute-b1-[001-020],compute-c1-[001-020] NodeCnt=40 CoreCnt=640 Features=(null)
   PartitionName=(null) Flags=SPEC_NODES TRES=cpu=640
   Users=(null) Accounts=uoo00015 Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a

[root@slurm-001-p ~]# scontrol show jobid 30679726
JobId=30679726 JobName=fwhf2000_cam5
   UserId=andrew.pauling(5610) GroupId=nesi(5000)
   Priority=10000000 Nice=0 Account=uoo00015 QOS=normal
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=23:59:00 TimeMin=N/A
   SubmitTime=2016-05-04T15:51:42 EligibleTime=2016-05-05T19:00:00
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=medium AllocNode:Sid=build-sb:16337
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=8 NumCPUs=315 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=315,mem=1290240,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=315 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=sb Gres=(null) Reservation=uoo00015
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/fwhf2000_cam5.run
   WorkDir=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5
   StdErr=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stderr.txt
   StdIn=/dev/null
   StdOut=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stdout.txt
   Power= SICP=0
Hi Gene, do you still have, in your shell history, the command you used to create the reservation? Thanks.
If you also have the command used to submit the job, that would help. Attaching the slurmctld.log entries from the time frame around the reservation creation and the job submission might help as well. Enabling the DebugFlags DB_RESERVATION and Reservation would also be useful; a minimal sketch of how to turn them on is below. Thanks.
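For reference, a sketch of how those flags can be enabled without restarting the controller (assuming a Slurm version recent enough to support scontrol setdebugflags; otherwise set DebugFlags=DB_RESERVATION,Reservation in slurm.conf and run scontrol reconfigure):

   # turn the extra reservation logging on in slurmctld
   scontrol setdebugflags +Reservation

   # ...reproduce the problem, then turn it back off
   scontrol setdebugflags -Reservation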
For reference, we'd also be interested in taking a look at your site's slurm.conf.
Hi, Alejandro

The command line we used was this:

   scontrol create reservation=uoo00015 starttime=2016-05-05T19:00:00 Duration=17-00:00:00 accounts=uoo00015 nodes=compute-b1-[001-020],compute-c1-[001-020]

The reservation was created at ~5pm (so it was not yet active) and a couple of jobs were submitted against it. At 7pm the reservation went active, but the job's status changed to PD/Reservation and stayed that way for hours, until we cancelled and re-submitted it. On resubmission it went through.

Cheers,
Gene
Hi Gene. Even though the problem was solved by resubmitting the job, we'd appreciate it if you could attach the command used to submit the job, the logs, and your slurm.conf for future reference, if you don't mind. I'm curious about what happened here. Did you restart or reconfigure any Slurm daemon in the time frame around the reservation? Thank you.
Created attachment 3059 [details] Jobs description
Created attachment 3060 [details] Configuration files
Gene, do you have the slurmctld logs from the time just before creating the reservation until just after the job re-submit? Thanks.
Hi, Alejandro

There is nothing specific to the reservation or the job in the logs - just normal submit entries.

Gene
Alejandro, we have a further problem with the same reservation. Whatever the user now submits sits waiting for resources unless the job is very short (under 1 hour or so), even if we bump its priority above everything else.

Gene
The original description includes some very strange values in the job record. Of particular note:

   MinCPUsNode=315

I only see this if I submit a similar job with the option --ntasks-per-node=315. In addition:

   NumNodes=8

This implies the job was submitted with the option -N8. What was the exact command line used to submit this job?
The job script used was this:

#! /bin/tcsh -f
# submit with sbatch
#SBATCH --job-name fwhf2000_cam5    # sfw_ext
#SBATCH --constraint sb             # sb=Sandybridge,wm=Westmere
#SBATCH --time 23:59:00             #
#SBATCH --account uoo00015
#SBATCH --ntasks 315
#SBATCH --cpus-per-task 1           # 1
#SBATCH --hint compute_bound
#SBATCH --mem-per-cpu 4G            # you can take 4GB no problem
#SBATCH --workdir /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5
#SBATCH --output /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stdout.txt
#SBATCH --error /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stderr.txt
#SBATCH --exclusive
#SBATCH --reservation=uoo00015

sleep 10
cd /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5
setenv OMP_NUM_THREADS 1
if (1 > 1) then
  setenv MP_TASK_AFFINITY core:1
endif

# ----------------------------------------
# PE LAYOUT:
#   total number of tasks    = 315
#   maximum threads per task = 1
#   cpl  ntasks=300 nthreads=1 rootpe=  0 ninst=1
#   cam  ntasks=300 nthreads=1 rootpe=  0 ninst=1
#   clm  ntasks=105 nthreads=1 rootpe=  0 ninst=1
#   cice ntasks=195 nthreads=1 rootpe=105 ninst=1
#   pop2 ntasks= 15 nthreads=1 rootpe=300 ninst=1
#   sglc ntasks=300 nthreads=1 rootpe=  0 ninst=1
#   swav ntasks=300 nthreads=1 rootpe=  0 ninst=1
#   rtm  ntasks=105 nthreads=1 rootpe=  0 ninst=1
#
#   total number of hw pes = 315
#   cpl  hw pe range ~ from   0 to 299
#   cam  hw pe range ~ from   0 to 299
#   clm  hw pe range ~ from   0 to 104
#   cice hw pe range ~ from 105 to 299
#   pop2 hw pe range ~ from 300 to 314
#   sglc hw pe range ~ from   0 to 299
#   swav hw pe range ~ from   0 to 299
#   rtm  hw pe range ~ from   0 to 104
# ----------------------------------------

cd /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5
./Tools/ccsm_check_lockedfiles || exit -1
source ./Tools/ccsm_getenv || exit -2
if ($BUILD_COMPLETE != "TRUE") then
  echo "BUILD_COMPLETE is not TRUE"
  echo "Please rebuild the model interactively"
  exit -2
endif

# BATCHQUERY is in env_run.xml
setenv LBQUERY "TRUE"
if !($?BATCHQUERY) then
  setenv LBQUERY "FALSE"
  setenv BATCHQUERY "undefined"
else if ( "$BATCHQUERY" == 'UNSET' ) then
  setenv LBQUERY "FALSE"
  setenv BATCHQUERY "undefined"
endif

# BATCHSUBMIT is in env_run.xml
setenv LBSUBMIT "TRUE"
if !($?BATCHSUBMIT) then
  setenv LBSUBMIT "FALSE"
  setenv BATCHSUBMIT "undefined"
else if ( "$BATCHSUBMIT" == 'UNSET' ) then
  setenv LBSUBMIT "FALSE"
  setenv BATCHSUBMIT "undefined"
endif

# --- Create and cleanup the timing directories ---
if !(-d $RUNDIR) mkdir -p $RUNDIR || "cannot make $RUNDIR" && exit -1
if (-d $RUNDIR/timing) rm -r -f $RUNDIR/timing
mkdir $RUNDIR/timing
mkdir $RUNDIR/timing/checkpoints

# --- Determine time-stamp/file-ID string ---
setenv LID "`date +%y%m%d-%H%M%S`"
set sdate = `date +"%Y-%m-%d %H:%M:%S"`
echo "run started $sdate" >>& $CASEROOT/CaseStatus

echo "-------------------------------------------------------------------------"
echo " CESM BUILDNML SCRIPT STARTING"
echo " - To prestage restarts, untar a restart.tar file into $RUNDIR"
./preview_namelists
if ($status != 0) then
  echo "ERROR from preview namelist - EXITING"
  exit -1
endif
echo " CESM BUILDNML SCRIPT HAS FINISHED SUCCESSFULLY"
echo "-------------------------------------------------------------------------"

echo "-------------------------------------------------------------------------"
echo " CESM PRESTAGE SCRIPT STARTING"
echo " - Case input data directory, DIN_LOC_ROOT, is $DIN_LOC_ROOT"
echo " - Checking the existence of input datasets in DIN_LOC_ROOT"

# This script prestages as follows
# - DIN_LOC_ROOT is the local inputdata area, check it exists
# - check whether all the data is in DIN_LOC_ROOT
# - prestage the REFCASE data if needed

cd $CASEROOT
if !(-d $DIN_LOC_ROOT) then
  echo " "
  echo " ERROR DIN_LOC_ROOT $DIN_LOC_ROOT does not exist"
  echo " "
  exit -20
endif

if (`./check_input_data -inputdata $DIN_LOC_ROOT -check | grep "unknown" | wc -l` > 0) then
  echo " "
  echo "The following files were not found, this is informational only"
  ./check_input_data -inputdata $DIN_LOC_ROOT -check
  echo " "
endif

if (`./check_input_data -inputdata $DIN_LOC_ROOT -check | grep "missing" | wc -l` > 0) then
  echo "Attempting to download missing data:"
  ./check_input_data -inputdata $DIN_LOC_ROOT -export
endif

if (`./check_input_data -inputdata $DIN_LOC_ROOT -check | grep "missing" | wc -l` > 0) then
  echo " "
  echo "The following files were not found, they are required"
  ./check_input_data -inputdata $DIN_LOC_ROOT -check
  echo "Invoke the following command to obtain them"
  echo " ./check_input_data -inputdata $DIN_LOC_ROOT -export"
  echo " "
  exit -30
endif

if (($GET_REFCASE == 'TRUE') && ($RUN_TYPE != 'startup') && ($CONTINUE_RUN == 'FALSE')) then
  set refdir = "ccsm4_init/$RUN_REFCASE/$RUN_REFDATE"
  if !(-d $DIN_LOC_ROOT/$refdir) then
    echo "*****************************************************************"
    echo "ccsm_prestage ERROR: $DIN_LOC_ROOT/$refdir is not on local disk"
    echo "obtain this data from the svn input data repository:"
    echo " > mkdir -p $DIN_LOC_ROOT/$refdir"
    echo " > cd $DIN_LOC_ROOT/$refdir"
    echo " > cd .."
    echo " > svn export --force https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/$refdir"
    echo "or set GET_REFCASE to FALSE in env_run.xml, "
    echo " and prestage the restart data to $RUNDIR manually"
    echo "*****************************************************************"
    exit -1
  endif
  echo " - Prestaging REFCASE ($refdir) to $RUNDIR"
  if !(-d $RUNDIR) mkdir -p $RUNDIR || "cannot make $RUNDIR" && exit -1
  foreach file ($DIN_LOC_ROOT/$refdir/*${RUN_REFCASE}*)
    if !(-f $RUNDIR/$file:t) then
      ln -s $file $RUNDIR || "cannot prestage $DIN_LOC_ROOT/$refdir data to $RUNDIR" && exit -1
    endif
  end
  cp $DIN_LOC_ROOT/$refdir/*rpointer* $RUNDIR || "cannot prestage $DIN_LOC_ROOT/$refdir rpointers to $RUNDIR" && exit -1
  cd $RUNDIR
  set cam2_list = `sh -c 'ls *.cam2.* 2>/dev/null'`
  foreach cam2_file ($cam2_list)
    set cam_file = `echo $cam2_file | sed -e 's/cam2/cam/'`
    ln -fs $cam2_file $cam_file
  end
  chmod u+w $RUNDIR/* >& /dev/null
endif

echo " CESM PRESTAGE SCRIPT HAS FINISHED SUCCESSFULLY"
echo "-------------------------------------------------------------------------"

# -------------------------------------------------------------------------
# Run the model
# -------------------------------------------------------------------------

cd $RUNDIR
echo "`date` -- CSM EXECUTION BEGINS HERE"
setenv MP_LABELIO yes
sleep 25
setenv OMP_NUM_THREADS 1
setenv I_MPI_FABRICS shm:dapl
setenv I_MPI_DAPL_PROVIDER ofa-v2-mlx4_0-1
setenv I_MPI_WAIT_MODE 1
srun --propagate=STACK $RUNDIR/../cesm.exe >&! cesm.log.$LID
wait
echo "`date` -- CSM EXECUTION HAS FINISHED"

# -------------------------------------------------------------------------
# For Postprocessing
# -------------------------------------------------------------------------

# -------------------------------------------------------------------------
# Check for successful run
# -------------------------------------------------------------------------

set sdate = `date +"%Y-%m-%d %H:%M:%S"`
cd $RUNDIR
set CESMLogFile = `ls -1t cesm.log* | head -1`
if ($CESMLogFile == "") then
  echo "Model did not complete - no cesm.log file present - exiting"
  exit -1
endif
set CPLLogFile = `echo $CESMLogFile | sed -e 's/cesm/cpl/'`
if ($CPLLogFile == "") then
  echo "Model did not complete - no cpl.log file corresponding to most recent CESM log ($RUNDIR/$CESMLogFile)"
  exit -1
endif
grep 'SUCCESSFUL TERMINATION' $CPLLogFile || echo "Model did not complete - see $RUNDIR/$CESMLogFile" && echo "run FAILED $sdate" >>& $CASEROOT/CaseStatus && exit -1
echo "run SUCCESSFUL $sdate" >>& $CASEROOT/CaseStatus

# -------------------------------------------------------------------------
# Update env variables in case user changed them during run
# -------------------------------------------------------------------------

cd $CASEROOT
source ./Tools/ccsm_getenv

# -------------------------------------------------------------------------
# Save model output stdout and stderr
# -------------------------------------------------------------------------

cd $RUNDIR
gzip *.$LID
if ($LOGDIR != "") then
  if (! -d $LOGDIR/bld) mkdir -p $LOGDIR/bld || echo " problem in creating $LOGDIR/bld"
  cp -p *build.$LID.* $LOGDIR/bld
  cp -p *log.$LID.* $LOGDIR
endif

# -------------------------------------------------------------------------
# Perform short term archiving of output
# -------------------------------------------------------------------------

if ($DOUT_S == 'TRUE') then
  echo "Archiving ccsm output to $DOUT_S_ROOT"
  echo "Calling the short-term archiving script st_archive.sh"
  cd $RUNDIR; $CASETOOLS/st_archive.sh
endif

# -------------------------------------------------------------------------
# Submit longer term archiver if appropriate
# -------------------------------------------------------------------------

cd $CASEROOT
if ($DOUT_L_MS == 'TRUE' && $DOUT_S == 'TRUE') then
  echo "Long term archiving ccsm output using the script $CASE.l_archive"
  set num = 0
  if ($LBQUERY == "TRUE") then
    set num = `$BATCHQUERY | grep $CASE.l_archive | wc -l`
  endif
  if ($LBSUBMIT == "TRUE" && $num < 1) then
    cat > templar << EOF
$BATCHSUBMIT ./$CASE.l_archive
EOF
    source templar
    if ($status != 0) then
      echo "ccsm_postrun error: problem sourcing templar "
    endif
    rm templar
  endif
endif

# -------------------------------------------------------------------------
# Resubmit another run script
# -------------------------------------------------------------------------

cd $CASEROOT
if ($RESUBMIT > 0) then
  @ RESUBMIT = $RESUBMIT - 1
  echo RESUBMIT is now $RESUBMIT

  #tcraig: reset CONTINUE_RUN on RESUBMIT if NOT doing timing runs
  #use COMP_RUN_BARRIERS as surrogate for timing run logical
  if ($?COMP_RUN_BARRIERS) then
    if (${COMP_RUN_BARRIERS} == "FALSE") then
      ./xmlchange -file env_run.xml -id CONTINUE_RUN -val TRUE
    endif
  else
    ./xmlchange -file env_run.xml -id CONTINUE_RUN -val TRUE
  endif
  ./xmlchange -file env_run.xml -id RESUBMIT -val $RESUBMIT

  if ($LBSUBMIT == "TRUE") then
    cat > tempres << EOF
$BATCHSUBMIT ./$CASE.run
EOF
    source tempres
    if ($status != 0) then
      echo "ccsm_postrun error: problem sourcing tempres "
    endif
    rm tempres
  endif
endif

if ($CHECK_TIMING == 'TRUE') then
  cd $CASEROOT
  if !(-d timing) mkdir timing
  $CASETOOLS/getTiming.csh -lid $LID
  gzip timing/ccsm_timing_stats.$LID
endif

if ($SAVE_TIMING == 'TRUE') then
  cd $RUNDIR
  mv timing timing.$LID
  cd $CASEROOT
endif

As you can see, ntasks was the only thing used. Can exclusive and/or hint force those strange interpretations?

Gene
(In reply to Gene Soudlenkov from comment #18)
> As you can see, ntasks was the only thing used. Can exclusive and/or hint
> force those strange interpretations?

Can you confirm the job was submitted with NO command line options, like this:

   sbatch my.script

Several of us have been independently submitting the same job script with all of the same options and a configuration as similar to yours as possible, and we don't see the same behavior, so something must be different. What does "JobSubmitPlugins=filter" do? Is it adding some job options? What about SBATCH_ environment variables set by the user?
Yes, sbatch script.sl was used.

filter is a plugin that does partition routing and user/account vetting. It does not, however, change cpus_per_task or any other CPU-related fields in job_desc.

The same job description was used to submit other jobs and it worked OK.

Gene
(In reply to Gene Soudlenkov from comment #20)
> The same job description was used to submit other jobs and it worked OK

You mean the same script, correct?

> Yes, sbatch script.sl was used.

Are any SBATCH_ or SLURM_ environment variables set when the job gets submitted? That is equivalent to including the corresponding option on the command line.

> filter is a plugin that does partition routing and user/account vetting. It
> does not, however, change cpus_per_task or any other CPU-related fields in
> job_desc.

Does it change any fields at all? Does it get rebuilt when you install a new Slurm?

> The same job description was used to submit other jobs and it worked OK.

Do you mean the identical script?
The only variable we have set is:

   SBATCH_EXPORT=NONE

The plugin changes the partition in job_desc - that is the only field it changes. It gets rebuilt every time we upgrade.

Yes, the same script is used, with (sometimes) variations in ntasks.

Gene
Just for reference: this is yet another job from the same user, using a similar description but 320 cores. It runs OK in the reservation:

JobId=30867950 JobName=fwhf2000_cam5
   UserId=andrew.pauling(5610) GroupId=nesi(5000)
   Priority=50000 Nice=0 Account=uoo00015 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=02:27:18 TimeLimit=23:59:00 TimeMin=N/A
   SubmitTime=2016-05-10T12:38:04 EligibleTime=2016-05-10T12:38:04
   StartTime=2016-05-10T13:49:12 EndTime=2016-05-11T13:48:12
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=merit AllocNode:Sid=compute-c1-002-p:723
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-c1-[002-020,044]
   BatchHost=compute-c1-002
   NumNodes=20 NumCPUs=320 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=320,mem=1310720,node=20
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=compute-c1-[002-020,044] CPU_IDs=0-15 Mem=65536
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=sb Gres=(null) Reservation=uoo00015
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=./fwhf2000_cam5.run
   WorkDir=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5
   StdErr=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stderr.txt
   StdIn=/dev/null
   StdOut=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stdout.txt
   BatchScript=
   [batch script omitted here - it is a verbatim copy of the job script already posted in comment #18]
The job which ran has this in the "scontrol show job" output:

   MinCPUsNode=1

while the failing job has this:

   MinCPUsNode=315

Do all of the failing jobs have huge MinCPUsNode values?
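If it helps you check them all at once, something like this (a sketch, reusing the reservation name from this report) would print the value for every job still pending in the reservation:

   # list MinCPUsNode for each job queued against reservation uoo00015
   for j in `squeue -h -R uoo00015 -o %A`; do
       scontrol show job $j | grep -o 'MinCPUsNode=[0-9]*'
   done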
We need to re-create the reservation and try again - will advise when done.

Gene
Gene, if in the job script I set --ntasks 315, the job runs when the reservation becomes ACTIVE, and scontrol show jobid reports MinCPUsNode=1. However, if I comment out --ntasks and instead request --ntasks-per-node 315, the job remains PD (Reservation) when the reservation becomes ACTIVE, and scontrol show jobid reports MinCPUsNode=315, which is the same problem you had when you opened the bug. So we suspect that:

1. The pending reason is wrong; it should say (BadConstraints) instead of (Reservation).
2. You should double-check that you are not changing the job request options and setting --ntasks-per-node through the job submit plugin, environment variables, the script itself, or the command line.

If you can confirm point 2, the problem should be isolated to just changing the logic which sets the pending reason. The exact reproducer we used is sketched below. Thanks for your collaboration.
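For reference, the two submissions we compared boil down to this (a sketch; test.sl stands for any copy of your script with the ntasks-related #SBATCH lines removed):

   # starts as soon as the reservation becomes ACTIVE; scontrol shows MinCPUsNode=1
   sbatch --reservation=uoo00015 --ntasks 315 test.sl

   # stays PD (Reservation) even after the reservation becomes ACTIVE; MinCPUsNode=315
   sbatch --reservation=uoo00015 --ntasks-per-node 315 test.sl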
It would also be interesting if you could grep slurmctld.log for a message similar to this one:

   [2016-05-10T19:11:16.799] _build_node_list: No nodes satisfy job 20026 requirements in partition part1

for instance by executing:

   $ grep -i "_build_node_list: No nodes satisfy job" /path/to/your/slurmctld.log

or attach the log here and we'll take a look at it. If job 30679726 appears in such a message, that would strengthen our hypothesis in comment #26, point 2. Thanks again.
Hi, Alejandro

Yes, there are entries with the "No nodes satisfy" message in the period when the problem occurred. However, the user did not change his job description and had used it for months. For now everything works except for one thing, again related to reservations: the user can only place one job into the reservation, even though it is big enough for two jobs. When we tried submitting shorter jobs, we found that jobs under 1 hour of walltime went through the reservation OK, but the longer ones were stuck and did not want to use the reservation.

Gene
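P.S. One thing we still have to rule out ourselves - just a guess at this point: whether the longer jobs simply cannot finish before the reservation's EndTime, since by default a job will not be started inside a reservation if its time limit runs past the reservation's end. We plan to compare the two with something like:

   scontrol show res uoo00015 | grep -o 'EndTime=[^ ]*'
   squeue -R uoo00015 -o '%A %l'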
This is just a guess, but we've seen this type of thing happen before:

1. The user executes "salloc bash"
2. A shell starts with a bunch of environment variables set
3. The original job eventually ends, say it times out, but the shell remains
4. The user then submits more jobs from that shell with "sbatch ...", and the newly submitted jobs inherit environment variables from the original salloc command

So from the user's perspective nothing changed, but from Slurm's perspective the job has some additional options set via environment variables. A quick way to check is sketched below.
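For instance, before submitting, the user could dump the Slurm-related variables in that shell (a minimal check; anything beyond your expected SBATCH_EXPORT=NONE would be a candidate for changing the job request):

   env | grep -E '^(SBATCH|SLURM)_' | sort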
Hi, Alejandro

I just checked what the user did and restored previous versions of his script from backup - everything is OK. No stray environment variables, no improper resource requests - he'd been using this script for months. Again, as I said, the problem was reported as a reservation wait, not resource contention.

Cheers,
Gene
Let me clarify something.

(In reply to Gene Soudlenkov from comment #28)
> Yes, there are entries with the "No nodes satisfy" message in the period
> when the problem occurred. However, the user did not change his job
> description and had used it for months. For now everything works except for
> one thing, again related to reservations: the user can only place one job
> into the reservation, even though it is big enough for two jobs.

You mean that of two submissions of the first job script (the attached one), the first one properly runs when the reservation becomes ACTIVE and the second one remains in PD (Reservation)? Are the two jobs using the same job script?

> When we tried submitting shorter jobs, we found that jobs under 1 hour of
> walltime went through the reservation OK, but the longer ones were stuck
> and did not want to use the reservation.

And here you tried submitting more than one job using the same batch script, this time just changing the --time value to < 1h, and in that case all the jobs started when the reservation became ACTIVE, correct?

Besides that, and although it is very unlikely, could you please check whether all your Slurm components use the same version, just to be sure: sbatch/salloc/srun, slurmctld, slurmd, and slurmdbd (see the commands below).

We believe the wrong MinCPUsNode value is an important key, and we need to find out what is changing it from 1 to 315 and under which conditions. If that is repeatable, it would be nice to have the user dump and send his environment so we can check it. Also run "scontrol setdebug 7" and capture the incoming data in slurmctld from before the reservation becomes active, through job submission, until after it becomes active; then reset with "scontrol setdebug 3". Thanks.
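For reference, each component reports its version directly:

   sbatch --version ; salloc --version ; srun --version
   slurmctld -V     # on the controller host
   slurmd -V        # on a compute node
   slurmdbd -V      # on the database host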
Hi, Alejandro

Yes, we checked all the components after the install and ensured the versions match. We may be able to experiment further in a couple of days, since the cluster is awfully busy at the moment. I have yet to see these events reproduced, but if I find something related to the problem I will report it straight away.

Ah, yes - the script used was the same, according to the user. However, as usual with users' dealings, I will be monitoring job submissions closely to make sure we have a way to identify the script used.

Cheers,
Gene
OK, we remain waiting for a concrete sequence of submission options/environment/script/logs that leads to jobs not starting in the reservation, since we cannot reproduce this unless we specify --ntasks-per-node=315.
Hi Gene, since this is a sev-2 bug we're required to make daily progress on it. Did you have a chance to experiment further? If not, maybe we could downgrade the bug to sev-3. Thank you.
Switching bug to sev-3. Please let me know as soon as you have an update. Thanks.
Thanks, Alejandro, will do.

Cheers,
Gene
Hey Gene. Any update on this bug?
Hi, Alejandro

We haven't seen this behaviour so far - I would suggest closing the ticket and re-opening it if similar behaviour is observed (we expect some reservations later this week).

Cheers,
Gene
Marking as resolved/infogiven. Please reopen if trouble is found with future reservations.