Ticket 2700

Summary: Job with reservation pending
Product: Slurm
Reporter: Gene Soudlenkov <g.soudlenkov>
Component: Scheduling
Assignee: Alejandro Sanchez <alex>
Status: RESOLVED INFOGIVEN
Severity: 3 - Medium Impact
CC: alex, tim
Version: 15.08.11   
Hardware: Linux   
OS: Linux   
Site: University of Auckland
Attachments: Jobs description
Configuration files

Description Gene Soudlenkov 2016-05-04 19:15:44 MDT
We created a reservation for a user. He submitted a couple of jobs under this reservation but the jobs stay pending with reason=Reservation even though the reservation is active and the nodes are all free. These are the outputs from scontrol show res and scontrol show jobid:

[root@slurm-001-p ~]# squeue -R uoo00015
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          30679726    medium fwhf2000 andrew.p PD       0:00      8 (Reservation)

[root@slurm-001-p ~]# scontrol show res uoo00015
ReservationName=uoo00015 StartTime=2016-05-05T19:00:00 EndTime=2016-05-22T19:00:00 Duration=17-00:00:00
   Nodes=compute-b1-[001-020],compute-c1-[001-020] NodeCnt=40 CoreCnt=640 Features=(null) PartitionName=(null) Flags=SPEC_NODES
   TRES=cpu=640
   Users=(null) Accounts=uoo00015 Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a



[root@slurm-001-p ~]# scontrol show jobid 30679726
JobId=30679726 JobName=fwhf2000_cam5
   UserId=andrew.pauling(5610) GroupId=nesi(5000)
   Priority=10000000 Nice=0 Account=uoo00015 QOS=normal
   JobState=PENDING Reason=Reservation Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=23:59:00 TimeMin=N/A
   SubmitTime=2016-05-04T15:51:42 EligibleTime=2016-05-05T19:00:00
   StartTime=Unknown EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=medium AllocNode:Sid=build-sb:16337
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=8 NumCPUs=315 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=315,mem=1290240,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=315 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=sb Gres=(null) Reservation=uoo00015
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/fwhf2000_cam5.run
   WorkDir=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5
   StdErr=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stderr.txt
   StdIn=/dev/null
   StdOut=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stdout.txt
   Power= SICP=0
Comment 1 Alejandro Sanchez 2016-05-04 21:49:25 MDT
Hi Gene, do you have in your history the command you used for creating the reservation? Thanks.
Comment 2 Alejandro Sanchez 2016-05-04 22:04:12 MDT
If you also have the command used to submit the job, that would be helpful. Attaching the slurmctld.log for the time frame around the reservation creation and the job submission might help as well, as would enabling the DebugFlags DB_RESERVATION and Reservation. Thanks.
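For reference, these flags can be set like this (a sketch; Reservation is a slurmctld DebugFlags value in slurm.conf, DB_RESERVATION a slurmdbd value in slurmdbd.conf):

```
# slurm.conf (slurmctld), followed by `scontrol reconfigure`:
DebugFlags=Reservation

# slurmdbd.conf (slurmdbd), followed by a slurmdbd restart:
DebugFlags=DB_RESERVATION
```

At runtime, `scontrol setdebugflags +Reservation` should toggle the slurmctld flag without a reconfigure.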
Comment 5 Alejandro Sanchez 2016-05-05 02:05:26 MDT
For reference, we'd also be interested in taking a look at your site's slurm.conf.
Comment 7 Gene Soudlenkov 2016-05-05 14:19:21 MDT
Hi, Alejandro

The command line we used was this:

scontrol create reservation=uoo00015 starttime=2016-05-05T19:00:00 Duration=17-00:00:00 accounts=uoo00015 nodes=compute-b1-[001-020],compute-c1-[001-020]

The reservation was created at ~5pm (so it was not yet active) and a couple of jobs were submitted. At 7pm the reservation went active, but the job changed its status to PD/Reservation - and stayed there for hours, until we canceled and re-submitted it. After that it went through.

Cheers,
Gene
Comment 8 Alejandro Sanchez 2016-05-05 19:49:18 MDT
Hi Gene. Although the problem is solved by resubmitting the job, we'd appreciate it if you could add the command used for the job request, the logs, and your slurm.conf for future reference, if you don't mind. I'm curious about what happened here. Did you restart or reconfigure any Slurm daemon in the time frame around the reservation? Thank you.
Comment 9 Gene Soudlenkov 2016-05-05 20:41:24 MDT
Created attachment 3059 [details]
Jobs description
Comment 10 Gene Soudlenkov 2016-05-05 20:42:17 MDT
Created attachment 3060 [details]
Configuration files
Comment 11 Alejandro Sanchez 2016-05-09 01:50:51 MDT
Gene, do you have the slurmctld logs from the time just before creating the reservation until just after the job re-submit? Thanks.
Comment 15 Gene Soudlenkov 2016-05-09 11:43:07 MDT
Hi, Alejandro

There is nothing specific to the reservation or the job in the logs - just normal submit entries.

Gene
Comment 16 Gene Soudlenkov 2016-05-09 13:26:33 MDT
Alejandro, we have a further problem with the same reservation. Whatever the user now submits sits waiting for resources unless the job is very short (under about 1 hour), even if we bump its priority above everything else.

Gene
Comment 17 Moe Jette 2016-05-09 14:24:02 MDT
The original description includes some very strange values in the job record. Of particular note:
   MinCPUsNode=315
I only see this if I submit a similar job with the option --ntasks-per-node=315.

In addition:
   NumNodes=8
This implies the job was submitted with the option -N8.

What is the exact command line used to submit this job?
Comment 18 Gene Soudlenkov 2016-05-09 14:30:16 MDT
The job script used was this:


#! /bin/tcsh -f

# submit with sbatch 
#SBATCH --job-name	fwhf2000_cam5    	# sfw_ext
#SBATCH --constraint	sb		# sb=Sandybridge,wm=Westmere
#SBATCH --time		23:59:00	# 
#SBATCH --account 	uoo00015
#SBATCH --ntasks  	315
#SBATCH --cpus-per-task	1		# 1
#SBATCH --hint		compute_bound
#SBATCH --mem-per-cpu	4G              # you can take 4GB no problem 
#SBATCH --workdir       /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5
#SBATCH --output        /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stdout.txt
#SBATCH --error      	/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stderr.txt
#SBATCH --exclusive
#SBATCH --reservation=uoo00015

sleep 10

cd /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5

setenv OMP_NUM_THREADS 1
if(1>1) then
  setenv MP_TASK_AFFINITY core:1
endif

# ---------------------------------------- 
# PE LAYOUT: 
#   total number of tasks  = 315 
#   maximum threads per task = 1 
#   cpl ntasks=300  nthreads=1 rootpe=  0 ninst=1 
#   cam ntasks=300  nthreads=1 rootpe=  0 ninst=1 
#   clm ntasks=105  nthreads=1 rootpe=  0 ninst=1 
#   cice ntasks=195  nthreads=1 rootpe=105 ninst=1 
#   pop2 ntasks=15  nthreads=1 rootpe=300 ninst=1 
#   sglc ntasks=300  nthreads=1 rootpe=  0 ninst=1 
#   swav ntasks=300  nthreads=1 rootpe=  0 ninst=1 
#   rtm ntasks=105  nthreads=1 rootpe=  0 ninst=1 
#   
#   total number of hw pes = 315 
#     cpl hw pe range ~ from 0 to 299 
#     cam hw pe range ~ from 0 to 299 
#     clm hw pe range ~ from 0 to 104 
#     cice hw pe range ~ from 105 to 299 
#     pop2 hw pe range ~ from 300 to 314 
#     sglc hw pe range ~ from 0 to 299 
#     swav hw pe range ~ from 0 to 299 
#     rtm hw pe range ~ from 0 to 104 
# ---------------------------------------- 
cd /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5

./Tools/ccsm_check_lockedfiles || exit -1
source ./Tools/ccsm_getenv     || exit -2

if ($BUILD_COMPLETE != "TRUE") then
  echo "BUILD_COMPLETE is not TRUE"
  echo "Please rebuild the model interactively"
  exit -2
endif

# BATCHQUERY is in env_run.xml
setenv LBQUERY "TRUE"
if !($?BATCHQUERY) then
  setenv LBQUERY "FALSE"
  setenv BATCHQUERY "undefined"
else if ( "$BATCHQUERY" == 'UNSET' ) then
  setenv LBQUERY "FALSE"
  setenv BATCHQUERY "undefined"
endif

# BATCHSUBMIT is in env_run.xml
setenv LBSUBMIT "TRUE"
if !($?BATCHSUBMIT) then
  setenv LBSUBMIT "FALSE"
  setenv BATCHSUBMIT "undefined"
else if ( "$BATCHSUBMIT" == 'UNSET' ) then
  setenv LBSUBMIT "FALSE"
  setenv BATCHSUBMIT "undefined"
endif

# --- Create and cleanup the timing directories---

if !(-d $RUNDIR) mkdir -p $RUNDIR || "cannot make $RUNDIR" && exit -1
if (-d $RUNDIR/timing) rm -r -f $RUNDIR/timing
mkdir $RUNDIR/timing
mkdir $RUNDIR/timing/checkpoints

# --- Determine time-stamp/file-ID string ---
setenv LID "`date +%y%m%d-%H%M%S`"

set sdate = `date +"%Y-%m-%d %H:%M:%S"`
echo "run started $sdate" >>& $CASEROOT/CaseStatus

echo "-------------------------------------------------------------------------"
echo " CESM BUILDNML SCRIPT STARTING"
echo " - To prestage restarts, untar a restart.tar file into $RUNDIR"

./preview_namelists 
if ($status != 0) then
   echo "ERROR from preview namelist - EXITING"
   exit -1
endif

echo " CESM BUILDNML SCRIPT HAS FINISHED SUCCESSFULLY"
echo "-------------------------------------------------------------------------"

echo "-------------------------------------------------------------------------"
echo " CESM PRESTAGE SCRIPT STARTING"
echo " - Case input data directory, DIN_LOC_ROOT, is $DIN_LOC_ROOT"
echo " - Checking the existence of input datasets in DIN_LOC_ROOT"

# This script prestages as follows
# - DIN_LOC_ROOT is the local inputdata area, check it exists
# - check whether all the data is in DIN_LOC_ROOT
# - prestage the REFCASE data if needed

cd $CASEROOT

if !(-d $DIN_LOC_ROOT) then
  echo " "
  echo "  ERROR DIN_LOC_ROOT $DIN_LOC_ROOT does not exist"
  echo " "
  exit -20
endif

if (`./check_input_data -inputdata $DIN_LOC_ROOT -check | grep "unknown" | wc -l` > 0) then
   echo " "
   echo "The following files were not found, this is informational only"
   ./check_input_data -inputdata $DIN_LOC_ROOT -check
   echo " "
endif

if (`./check_input_data -inputdata $DIN_LOC_ROOT -check | grep "missing" | wc -l` > 0) then
   echo "Attempting to download missing data:"
   ./check_input_data -inputdata $DIN_LOC_ROOT -export
endif 

if (`./check_input_data -inputdata $DIN_LOC_ROOT -check | grep "missing" | wc -l` > 0) then
   echo " "
   echo "The following files were not found, they are required"
   ./check_input_data -inputdata $DIN_LOC_ROOT -check
   echo "Invoke the following command to obtain them"
   echo "   ./check_input_data -inputdata $DIN_LOC_ROOT -export"
   echo " "
   exit -30
endif

if (($GET_REFCASE == 'TRUE') && ($RUN_TYPE != 'startup') && ($CONTINUE_RUN == 'FALSE')) then
  set refdir = "ccsm4_init/$RUN_REFCASE/$RUN_REFDATE"

  if !(-d $DIN_LOC_ROOT/$refdir) then
    echo "*****************************************************************"
    echo "ccsm_prestage ERROR: $DIN_LOC_ROOT/$refdir is not on local disk"
    echo "obtain this data from the svn input data repository:"
    echo "  > mkdir -p $DIN_LOC_ROOT/$refdir"
    echo "  > cd $DIN_LOC_ROOT/$refdir"
    echo "  > cd .."
    echo "  > svn export --force https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/$refdir"
    echo "or set GET_REFCASE to FALSE in env_run.xml, "
    echo "   and prestage the restart data to $RUNDIR manually"
    echo "*****************************************************************"
    exit -1
  endif 

  echo " - Prestaging REFCASE ($refdir) to $RUNDIR"
  if !(-d $RUNDIR) mkdir -p $RUNDIR || "cannot make $RUNDIR" && exit -1
  foreach file ($DIN_LOC_ROOT/$refdir/*${RUN_REFCASE}*) 
     if !(-f $RUNDIR/$file:t) then
        ln -s $file $RUNDIR || "cannot prestage $DIN_LOC_ROOT/$refdir data to $RUNDIR" && exit -1
     endif
  end
  cp $DIN_LOC_ROOT/$refdir/*rpointer* $RUNDIR || "cannot prestage $DIN_LOC_ROOT/$refdir rpointers to $RUNDIR" && exit -1

  cd $RUNDIR
  set cam2_list = `sh -c 'ls *.cam2.* 2>/dev/null'`
  foreach cam2_file ($cam2_list)
    set cam_file = `echo $cam2_file | sed -e 's/cam2/cam/'`
    ln -fs $cam2_file $cam_file
  end

  chmod u+w $RUNDIR/* >& /dev/null
endif

echo " CESM PRESTAGE SCRIPT HAS FINISHED SUCCESSFULLY"
echo "-------------------------------------------------------------------------"

# -------------------------------------------------------------------------
# Run the model
# -------------------------------------------------------------------------

cd $RUNDIR
echo "`date` -- CSM EXECUTION BEGINS HERE" 


setenv MP_LABELIO yes
sleep 25
setenv OMP_NUM_THREADS 1
setenv I_MPI_FABRICS shm:dapl
setenv I_MPI_DAPL_PROVIDER ofa-v2-mlx4_0-1
setenv I_MPI_WAIT_MODE 1
srun --propagate=STACK $RUNDIR/../cesm.exe >&! cesm.log.$LID

wait
echo "`date` -- CSM EXECUTION HAS FINISHED" 

# -------------------------------------------------------------------------
# For Postprocessing
# -------------------------------------------------------------------------
# -------------------------------------------------------------------------
# Check for successful run
# -------------------------------------------------------------------------

set sdate = `date +"%Y-%m-%d %H:%M:%S"`

cd $RUNDIR
set CESMLogFile = `ls -1t cesm.log* | head -1` 
if ($CESMLogFile == "") then
  echo "Model did not complete - no cesm.log file present - exiting"
  exit -1
endif
set CPLLogFile = `echo $CESMLogFile | sed -e 's/cesm/cpl/'`
if ($CPLLogFile == "") then
  echo "Model did not complete - no cpl.log file corresponding to most recent CESM log ($RUNDIR/$CESMLogFile)"
  exit -1
endif
grep 'SUCCESSFUL TERMINATION' $CPLLogFile  || echo "Model did not complete - see $RUNDIR/$CESMLogFile" && echo "run FAILED $sdate" >>& $CASEROOT/CaseStatus && exit -1

echo "run SUCCESSFUL $sdate" >>& $CASEROOT/CaseStatus

# -------------------------------------------------------------------------
# Update env variables in case user changed them during run
# -------------------------------------------------------------------------

cd $CASEROOT
source ./Tools/ccsm_getenv

# -------------------------------------------------------------------------
# Save model output stdout and stderr 
# -------------------------------------------------------------------------

cd $RUNDIR
gzip *.$LID
if ($LOGDIR != "") then
  if (! -d $LOGDIR/bld) mkdir -p $LOGDIR/bld || echo " problem in creating $LOGDIR/bld"
  cp -p *build.$LID.* $LOGDIR/bld  
  cp -p *log.$LID.*   $LOGDIR      
endif

# -------------------------------------------------------------------------
# Perform short term archiving of output
# -------------------------------------------------------------------------

if ($DOUT_S == 'TRUE') then
  echo "Archiving ccsm output to $DOUT_S_ROOT"
  echo "Calling the short-term archiving script st_archive.sh"
  cd $RUNDIR; $CASETOOLS/st_archive.sh
endif

# -------------------------------------------------------------------------
# Submit longer term archiver if appropriate
# -------------------------------------------------------------------------

cd $CASEROOT
if ($DOUT_L_MS == 'TRUE' && $DOUT_S == 'TRUE') then
  echo "Long term archiving ccsm output using the script $CASE.l_archive"
  set num = 0
  if ($LBQUERY == "TRUE") then
     set num = `$BATCHQUERY | grep $CASE.l_archive | wc -l`
  endif
  if ($LBSUBMIT == "TRUE" && $num < 1) then
cat > templar <<EOF
    $BATCHSUBMIT ./$CASE.l_archive
EOF
    source templar
    if ($status != 0) then
      echo "ccsm_postrun error: problem sourcing templar " 
    endif
    rm templar
  endif 
endif

# -------------------------------------------------------------------------
# Resubmit another run script
# -------------------------------------------------------------------------

cd $CASEROOT
if ($RESUBMIT > 0) then
    @ RESUBMIT = $RESUBMIT - 1
    echo RESUBMIT is now $RESUBMIT

    #tcraig: reset CONTINUE_RUN on RESUBMIT if NOT doing timing runs
    #use COMP_RUN_BARRIERS as surrogate for timing run logical
    if ($?COMP_RUN_BARRIERS) then
      if (${COMP_RUN_BARRIERS} == "FALSE") then
         ./xmlchange -file env_run.xml -id CONTINUE_RUN -val TRUE
      endif
    else
      ./xmlchange -file env_run.xml -id CONTINUE_RUN -val TRUE
    endif
    ./xmlchange -file env_run.xml -id RESUBMIT     -val $RESUBMIT

    if ($LBSUBMIT == "TRUE") then
cat > tempres <<EOF
   $BATCHSUBMIT ./$CASE.run
EOF
     source tempres
     if ($status != 0) then
       echo "ccsm_postrun error: problem sourcing tempres " 
     endif
     rm tempres
   endif 
endif

if ($CHECK_TIMING == 'TRUE') then
  cd $CASEROOT
  if !(-d timing) mkdir timing
  $CASETOOLS/getTiming.csh -lid $LID 
  gzip timing/ccsm_timing_stats.$LID
endif

if ($SAVE_TIMING == 'TRUE') then
  cd $RUNDIR
  mv timing timing.$LID
  cd $CASEROOT
endif



As you can see, ntasks was the only thing used. Can exclusive and/or hint force those strange interpretations?

Gene
Comment 19 Moe Jette 2016-05-09 14:39:57 MDT
(In reply to Gene Soudlenkov from comment #18)
> As you can see, ntasks was the only thing used. Can exclusive and/or hint
> force those strange interpretations?

Can you confirm the job was submitted with NO command line options like this:
  sbatch my.script

Several of us have been independently submitting the same job script with all of the same options and a configuration as similar to yours as possible, and we don't see the same behavior, so something must be different.

What does "JobSubmitPlugins=filter" do?
Is that adding some job options?
What about user SBATCH_ environment variables?
Comment 20 Gene Soudlenkov 2016-05-09 14:43:27 MDT
Yes, sbatch script.sl was used.

filter is a plugin that does partition routing and user/account vetting. It does not, however, change cpus_per_task or any other cpu-related fields in job_desc

The same job description was used to submit other jobs and it worked OK

Gene
Comment 21 Moe Jette 2016-05-09 14:52:06 MDT
(In reply to Gene Soudlenkov from comment #20)
> The same job description was used to submit other jobs and it worked OK

You mean the same script, correct?
(In reply to Gene Soudlenkov from comment #20)
> Yes, sbatch script.sl was used.

Are any SBATCH_ or SLURM_ environment variables set when the job gets submitted?
That's equivalent to including an option on the command line.

> filter is a plugin that does partition routing and user/account vetting. It
> does not, however, change cpus_per_task or any other cpu-related fields in
> job_desc

Does it change any fields?
Does it get rebuilt when you install a new Slurm?

> The same job description was used to submit other jobs and it worked OK

Do you mean the identical script?
Comment 22 Gene Soudlenkov 2016-05-09 14:56:39 MDT
The only variable we have set is:
SBATCH_EXPORT=NONE

The plugin changes partition in the job_desc - this is the only field it changes. It gets rebuilt every time we upgrade.

Yes, the same script is used - with (sometimes) variations in ntasks

Gene
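A quick, generic way to audit this before submitting (a sketch, not from the ticket): list any SBATCH_/SLURM_ variables in the submitting shell, since sbatch treats them like command-line options.

```shell
# Any exported SBATCH_* / SLURM_* variable acts like an sbatch option.
export SBATCH_EXPORT=NONE          # the one variable the site reports setting
env | grep -E '^(SBATCH|SLURM)_' || echo "none set"
# prints at least: SBATCH_EXPORT=NONE
```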
Comment 23 Gene Soudlenkov 2016-05-09 15:18:33 MDT
Just for reference: this is yet another job from the same user, using a similar description but 320 cores. It runs OK in the reservation:


JobId=30867950 JobName=fwhf2000_cam5
   UserId=andrew.pauling(5610) GroupId=nesi(5000)
   Priority=50000 Nice=0 Account=uoo00015 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=02:27:18 TimeLimit=23:59:00 TimeMin=N/A
   SubmitTime=2016-05-10T12:38:04 EligibleTime=2016-05-10T12:38:04
   StartTime=2016-05-10T13:49:12 EndTime=2016-05-11T13:48:12
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=merit AllocNode:Sid=compute-c1-002-p:723
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-c1-[002-020,044]
   BatchHost=compute-c1-002
   NumNodes=20 NumCPUs=320 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=320,mem=1310720,node=20
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=compute-c1-[002-020,044] CPU_IDs=0-15 Mem=65536
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=sb Gres=(null) Reservation=uoo00015
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=./fwhf2000_cam5.run
   WorkDir=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5
   StdErr=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stderr.txt
   StdIn=/dev/null
   StdOut=/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stdout.txt
   BatchScript=
#! /bin/tcsh -f

# submit with sbatch 
#SBATCH --job-name	fwhf2000_cam5    	# sfw_ext
#SBATCH --constraint	sb		# sb=Sandybridge,wm=Westmere
#SBATCH --time		23:59:00	# 
#SBATCH --account 	uoo00015
#SBATCH --ntasks  	315
#SBATCH --cpus-per-task	1		# 1
#SBATCH --hint		compute_bound
#SBATCH --mem-per-cpu	4G              # you can take 4GB no problem 
#SBATCH --workdir       /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5
#SBATCH --output        /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stdout.txt
#SBATCH --error      	/gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5/stderr.txt
#SBATCH --exclusive
#SBATCH --reservation=uoo00015

sleep 10

cd /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5

setenv OMP_NUM_THREADS 1
if(1>1) then
  setenv MP_TASK_AFFINITY core:1
endif

# ---------------------------------------- 
# PE LAYOUT: 
#   total number of tasks  = 315 
#   maximum threads per task = 1 
#   cpl ntasks=300  nthreads=1 rootpe=  0 ninst=1 
#   cam ntasks=300  nthreads=1 rootpe=  0 ninst=1 
#   clm ntasks=105  nthreads=1 rootpe=  0 ninst=1 
#   cice ntasks=195  nthreads=1 rootpe=105 ninst=1 
#   pop2 ntasks=15  nthreads=1 rootpe=300 ninst=1 
#   sglc ntasks=300  nthreads=1 rootpe=  0 ninst=1 
#   swav ntasks=300  nthreads=1 rootpe=  0 ninst=1 
#   rtm ntasks=105  nthreads=1 rootpe=  0 ninst=1 
#   
#   total number of hw pes = 315 
#     cpl hw pe range ~ from 0 to 299 
#     cam hw pe range ~ from 0 to 299 
#     clm hw pe range ~ from 0 to 104 
#     cice hw pe range ~ from 105 to 299 
#     pop2 hw pe range ~ from 300 to 314 
#     sglc hw pe range ~ from 0 to 299 
#     swav hw pe range ~ from 0 to 299 
#     rtm hw pe range ~ from 0 to 104 
# ---------------------------------------- 
cd /gpfs1m/projects/uoo00015/andrew.pauling/fwhf2000_cam5

./Tools/ccsm_check_lockedfiles || exit -1
source ./Tools/ccsm_getenv     || exit -2

if ($BUILD_COMPLETE != "TRUE") then
  echo "BUILD_COMPLETE is not TRUE"
  echo "Please rebuild the model interactively"
  exit -2
endif

# BATCHQUERY is in env_run.xml
setenv LBQUERY "TRUE"
if !($?BATCHQUERY) then
  setenv LBQUERY "FALSE"
  setenv BATCHQUERY "undefined"
else if ( "$BATCHQUERY" == 'UNSET' ) then
  setenv LBQUERY "FALSE"
  setenv BATCHQUERY "undefined"
endif

# BATCHSUBMIT is in env_run.xml
setenv LBSUBMIT "TRUE"
if !($?BATCHSUBMIT) then
  setenv LBSUBMIT "FALSE"
  setenv BATCHSUBMIT "undefined"
else if ( "$BATCHSUBMIT" == 'UNSET' ) then
  setenv LBSUBMIT "FALSE"
  setenv BATCHSUBMIT "undefined"
endif

# --- Create and cleanup the timing directories---

if !(-d $RUNDIR) mkdir -p $RUNDIR || "cannot make $RUNDIR" && exit -1
if (-d $RUNDIR/timing) rm -r -f $RUNDIR/timing
mkdir $RUNDIR/timing
mkdir $RUNDIR/timing/checkpoints

# --- Determine time-stamp/file-ID string ---
setenv LID "`date +%y%m%d-%H%M%S`"

set sdate = `date +"%Y-%m-%d %H:%M:%S"`
echo "run started $sdate" >>& $CASEROOT/CaseStatus

echo "-------------------------------------------------------------------------"
echo " CESM BUILDNML SCRIPT STARTING"
echo " - To prestage restarts, untar a restart.tar file into $RUNDIR"

./preview_namelists 
if ($status != 0) then
   echo "ERROR from preview namelist - EXITING"
   exit -1
endif

echo " CESM BUILDNML SCRIPT HAS FINISHED SUCCESSFULLY"
echo "-------------------------------------------------------------------------"

echo "-------------------------------------------------------------------------"
echo " CESM PRESTAGE SCRIPT STARTING"
echo " - Case input data directory, DIN_LOC_ROOT, is $DIN_LOC_ROOT"
echo " - Checking the existence of input datasets in DIN_LOC_ROOT"

# This script prestages as follows
# - DIN_LOC_ROOT is the local inputdata area, check it exists
# - check whether all the data is in DIN_LOC_ROOT
# - prestage the REFCASE data if needed

cd $CASEROOT

if !(-d $DIN_LOC_ROOT) then
  echo " "
  echo "  ERROR DIN_LOC_ROOT $DIN_LOC_ROOT does not exist"
  echo " "
  exit -20
endif

if (`./check_input_data -inputdata $DIN_LOC_ROOT -check | grep "unknown" | wc -l` > 0) then
   echo " "
   echo "The following files were not found, this is informational only"
   ./check_input_data -inputdata $DIN_LOC_ROOT -check
   echo " "
endif

if (`./check_input_data -inputdata $DIN_LOC_ROOT -check | grep "missing" | wc -l` > 0) then
   echo "Attempting to download missing data:"
   ./check_input_data -inputdata $DIN_LOC_ROOT -export
endif 

if (`./check_input_data -inputdata $DIN_LOC_ROOT -check | grep "missing" | wc -l` > 0) then
   echo " "
   echo "The following files were not found, they are required"
   ./check_input_data -inputdata $DIN_LOC_ROOT -check
   echo "Invoke the following command to obtain them"
   echo "   ./check_input_data -inputdata $DIN_LOC_ROOT -export"
   echo " "
   exit -30
endif

if (($GET_REFCASE == 'TRUE') && ($RUN_TYPE != 'startup') && ($CONTINUE_RUN == 'FALSE')) then
  set refdir = "ccsm4_init/$RUN_REFCASE/$RUN_REFDATE"

  if !(-d $DIN_LOC_ROOT/$refdir) then
    echo "*****************************************************************"
    echo "ccsm_prestage ERROR: $DIN_LOC_ROOT/$refdir is not on local disk"
    echo "obtain this data from the svn input data repository:"
    echo "  > mkdir -p $DIN_LOC_ROOT/$refdir"
    echo "  > cd $DIN_LOC_ROOT/$refdir"
    echo "  > cd .."
    echo "  > svn export --force https://svn-ccsm-inputdata.cgd.ucar.edu/trunk/inputdata/$refdir"
    echo "or set GET_REFCASE to FALSE in env_run.xml, "
    echo "   and prestage the restart data to $RUNDIR manually"
    echo "*****************************************************************"
    exit -1
  endif 

  echo " - Prestaging REFCASE ($refdir) to $RUNDIR"
  if !(-d $RUNDIR) mkdir -p $RUNDIR || "cannot make $RUNDIR" && exit -1
  foreach file ($DIN_LOC_ROOT/$refdir/*${RUN_REFCASE}*) 
     if !(-f $RUNDIR/$file:t) then
        ln -s $file $RUNDIR || "cannot prestage $DIN_LOC_ROOT/$refdir data to $RUNDIR" && exit -1
     endif
  end
  cp $DIN_LOC_ROOT/$refdir/*rpointer* $RUNDIR || "cannot prestage $DIN_LOC_ROOT/$refdir rpointers to $RUNDIR" && exit -1

  cd $RUNDIR
  set cam2_list = `sh -c 'ls *.cam2.* 2>/dev/null'`
  foreach cam2_file ($cam2_list)
    set cam_file = `echo $cam2_file | sed -e 's/cam2/cam/'`
    ln -fs $cam2_file $cam_file
  end

  chmod u+w $RUNDIR/* >& /dev/null
endif

echo " CESM PRESTAGE SCRIPT HAS FINISHED SUCCESSFULLY"
echo "-------------------------------------------------------------------------"

# -------------------------------------------------------------------------
# Run the model
# -------------------------------------------------------------------------

cd $RUNDIR
echo "`date` -- CSM EXECUTION BEGINS HERE" 


setenv MP_LABELIO yes
sleep 25
setenv OMP_NUM_THREADS 1
setenv I_MPI_FABRICS shm:dapl
setenv I_MPI_DAPL_PROVIDER ofa-v2-mlx4_0-1
setenv I_MPI_WAIT_MODE 1
srun --propagate=STACK $RUNDIR/../cesm.exe >&! cesm.log.$LID

wait
echo "`date` -- CSM EXECUTION HAS FINISHED" 

# -------------------------------------------------------------------------
# For Postprocessing
# -------------------------------------------------------------------------
# -------------------------------------------------------------------------
# Check for successful run
# -------------------------------------------------------------------------

set sdate = `date +"%Y-%m-%d %H:%M:%S"`

cd $RUNDIR
set CESMLogFile = `ls -1t cesm.log* | head -1` 
if ($CESMLogFile == "") then
  echo "Model did not complete - no cesm.log file present - exiting"
  exit -1
endif
set CPLLogFile = `echo $CESMLogFile | sed -e 's/cesm/cpl/'`
if ($CPLLogFile == "") then
  echo "Model did not complete - no cpl.log file corresponding to most recent CESM log ($RUNDIR/$CESMLogFile)"
  exit -1
endif
grep 'SUCCESSFUL TERMINATION' $CPLLogFile  || echo "Model did not complete - see $RUNDIR/$CESMLogFile" && echo "run FAILED $sdate" >>& $CASEROOT/CaseStatus && exit -1

echo "run SUCCESSFUL $sdate" >>& $CASEROOT/CaseStatus

# -------------------------------------------------------------------------
# Update env variables in case user changed them during run
# -------------------------------------------------------------------------

cd $CASEROOT
source ./Tools/ccsm_getenv

# -------------------------------------------------------------------------
# Save model output stdout and stderr 
# -------------------------------------------------------------------------

cd $RUNDIR
gzip *.$LID
if ($LOGDIR != "") then
  if (! -d $LOGDIR/bld) mkdir -p $LOGDIR/bld || echo " problem in creating $LOGDIR/bld"
  cp -p *build.$LID.* $LOGDIR/bld  
  cp -p *log.$LID.*   $LOGDIR      
endif

# -------------------------------------------------------------------------
# Perform short term archiving of output
# -------------------------------------------------------------------------

if ($DOUT_S == 'TRUE') then
  echo "Archiving ccsm output to $DOUT_S_ROOT"
  echo "Calling the short-term archiving script st_archive.sh"
  cd $RUNDIR; $CASETOOLS/st_archive.sh
endif

# -------------------------------------------------------------------------
# Submit longer term archiver if appropriate
# -------------------------------------------------------------------------

cd $CASEROOT
if ($DOUT_L_MS == 'TRUE' && $DOUT_S == 'TRUE') then
  echo "Long term archiving ccsm output using the script $CASE.l_archive"
  set num = 0
  if ($LBQUERY == "TRUE") then
     set num = `$BATCHQUERY | grep $CASE.l_archive | wc -l`
  endif
  if ($LBSUBMIT == "TRUE" && $num < 1) then
cat > templar <<EOF
    $BATCHSUBMIT ./$CASE.l_archive
EOF
    source templar
    if ($status != 0) then
      echo "ccsm_postrun error: problem sourcing templar " 
    endif
    rm templar
  endif 
endif

# -------------------------------------------------------------------------
# Resubmit another run script
# -------------------------------------------------------------------------

cd $CASEROOT
if ($RESUBMIT > 0) then
    @ RESUBMIT = $RESUBMIT - 1
    echo RESUBMIT is now $RESUBMIT

    #tcraig: reset CONTINUE_RUN on RESUBMIT if NOT doing timing runs
    #use COMP_RUN_BARRIERS as surrogate for timing run logical
    if ($?COMP_RUN_BARRIERS) then
      if (${COMP_RUN_BARRIERS} == "FALSE") then
         ./xmlchange -file env_run.xml -id CONTINUE_RUN -val TRUE
      endif
    else
      ./xmlchange -file env_run.xml -id CONTINUE_RUN -val TRUE
    endif
    ./xmlchange -file env_run.xml -id RESUBMIT     -val $RESUBMIT

    if ($LBSUBMIT == "TRUE") then
cat > tempres <<EOF
   $BATCHSUBMIT ./$CASE.run
EOF
     source tempres
     if ($status != 0) then
       echo "ccsm_postrun error: problem sourcing tempres " 
     endif
     rm tempres
   endif 
endif

if ($CHECK_TIMING == 'TRUE') then
  cd $CASEROOT
  if !(-d timing) mkdir timing
  $CASETOOLS/getTiming.csh -lid $LID 
  gzip timing/ccsm_timing_stats.$LID
endif

if ($SAVE_TIMING == 'TRUE') then
  cd $RUNDIR
  mv timing timing.$LID
  cd $CASEROOT
endif
Comment 24 Moe Jette 2016-05-09 15:24:01 MDT
The job which ran has this in the "scontrol show job" output:
   MinCPUsNode=1

While the failing job has this:
   MinCPUsNode=315

Do all of the failing jobs have huge MinCPUsNode values?
Comment 25 Gene Soudlenkov 2016-05-09 15:25:30 MDT
We need to re-create the reservation and try again - will advise when done

Gene
Comment 26 Alejandro Sanchez 2016-05-10 03:32:31 MDT
Gene, if in the job script I set --ntasks=315, the job runs when the reservation becomes ACTIVE, and scontrol show jobid reports MinCPUsNode=1. However, if I comment out --ntasks and instead request --ntasks-per-node=315, the job remains PD (Reservation) after the reservation becomes ACTIVE, and scontrol show jobid reports MinCPUsNode=315, which is the same symptom you saw when you opened the bug. So we suspect that:

1. The pending reason is misleading: it should say (BadConstraints) instead of (Reservation).
2. We strongly advise double-checking that nothing is modifying the job request and setting --ntasks-per-node, whether through a job_submit plugin, environment variables, the script itself, or the command line.

If you can rule that out, the remaining work should be isolated to fixing the logic that sets the pending reason. Thanks for your collaboration.
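A back-of-the-envelope check of why --ntasks-per-node=315 is unsatisfiable (an editorial sketch using only the numbers from the scontrol output earlier in this ticket, not output from the affected cluster):

```shell
#!/bin/sh
# Numbers from "scontrol show res uoo00015" and "scontrol show jobid" above.
RES_NODES=40
RES_CORES=640
CORES_PER_NODE=$(( RES_CORES / RES_NODES ))   # 640 / 40 = 16 cores per node

# --ntasks=315 requests 315 cores in total across the reservation: fits in 640.
# --ntasks-per-node=315 sets MinCPUsNode=315, i.e. 315 cores on EVERY node used.
MIN_CPUS_NODE=315
if [ "$MIN_CPUS_NODE" -gt "$CORES_PER_NODE" ]; then
    echo "unsatisfiable: MinCPUsNode=$MIN_CPUS_NODE exceeds $CORES_PER_NODE cores per node"
fi
```

This is consistent with a "No nodes satisfy job ... requirements" message in slurmctld.log: no reserved node has 315 CPUs.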
Comment 27 Alejandro Sanchez 2016-05-10 04:20:08 MDT
It would also help if you could grep your slurmctld.log for a message similar to this one:

[2016-05-10T19:11:16.799] _build_node_list: No nodes satisfy job 20026 requirements in partition part1

For instance executing:

$ grep -i "_build_node_list: No nodes satisfy job" /path/to/your/slurmctld.log

Or attach the log here and we'll take a look at it. If job 30679726 appears in such a message, that would strengthen our hypothesis from comment #26, point 2. Thanks again.
Comment 28 Gene Soudlenkov 2016-05-10 08:06:31 MDT
Hi, Alejandro

Yes, there are entries with the "No nodes satisfy" message in the period when the problem occurred. However, the user did not change his job description and had used it for months. For now everything works except for one thing, again related to reservations: the user can only place one job into the reservation, which is big enough for two jobs. When we tried submitting shorter jobs, we figured out that jobs under 1 hour of walltime went through the reservation OK, but the longer ones were stuck and did not want to use the reservation.

Gene
Comment 29 Moe Jette 2016-05-10 08:18:01 MDT
This is just a guess, but we've seen this type of thing happen before:

1. The user executes "salloc bash"
2. A shell starts with a bunch of environment variables set
3. The original job eventually ends, say it times out, but the shell remains
4. The user then submits more jobs from the shell "sbatch ..." and the newly submitted jobs inherit environment variables from the original salloc command

So from the user's perspective nothing changed, but from Slurm's perspective the job has additional options set via environment variables.
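To illustrate that scenario (an editorial sketch; the variable prefixes are the usual Slurm ones, not taken from this ticket), a user can check for leftover Slurm variables before resubmitting. SBATCH_* variables are read by sbatch as if they were command-line options, and SLURM_* variables from an old allocation can similarly leak into srun inside the batch script:

```shell
#!/bin/sh
# Sketch: list Slurm-related variables inherited from an earlier salloc shell.
leftovers=$(env | grep -E '^(SLURM|SBATCH)_' || true)
if [ -n "$leftovers" ]; then
    echo "Inherited Slurm variables found:"
    echo "$leftovers"
else
    echo "Environment is clean"
fi
```

If anything shows up, submitting from a fresh login shell (or unsetting the variables first) removes the inherited options.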
Comment 30 Gene Soudlenkov 2016-05-10 11:43:02 MDT
Hi, Alejandro

I just checked what the user did and restored previous versions of his script from backup: everything is OK. No stray environment variables, no improper resource requests; he'd been using this script for months. Again, as I said, the problem was reported as a reservation wait, not resource contention.

Cheers,
Gene
Comment 31 Alejandro Sanchez 2016-05-11 02:15:36 MDT
Let me clarify something.

(In reply to Gene Soudlenkov from comment #28)
> Hi, Alejandro
> 
> Yes, there are entries with "No nodes satisfy" message in the period where
> the problem occcurred. However, the user did not change his job description
> and used it for months. For now everything works except for one thing,
> again, related to reservations - the user can only place one job into the
> reservation, which is big enough for two jobs.

You mean that with two submissions of the first job script (the attached one), the first runs properly when the reservation becomes ACTIVE while the second remains PD (Reservation)? Are the two jobs using the same job script?

> When we tried submitting shorter jobs, we figured out that jobs under 1 hour
> of walltime went through the reservation OK, but the longer ones were stuck
> and did not want to use the reservation.
> 
> Gene

And here you tried submitting more than one job using the same batch script, this time changing only the --time value to under 1 hour, and in this case all the jobs started when the reservation became ACTIVE, is that right?

Besides that, and although it is very unlikely, could you please check whether all your Slurm components are running the same version? Just to be sure: the sbatch/salloc/srun version, slurmctld version, slurmd version, and slurmdbd version.

We do believe the wrong MinCPUsNode value is an important clue, and we need to find out what is changing it from 1 to 315 and under what conditions. If the problem is repeatable, it would be helpful to have the user dump and send his environment so we can check it. Also run "scontrol setdebug 7" and capture the incoming data in slurmctld from before the reservation becomes active, through the job submissions, until after it becomes active, then reset with "scontrol setdebug 3". Thanks.
Comment 32 Gene Soudlenkov 2016-05-11 12:50:06 MDT
Hi, Alejandro

Yes, we checked all the components after install and ensured versions match. We may be able to experiment further in a couple of days since the cluster is awfully busy at the moment. I have yet to see the reproduction of these events but if I find something related to the problem, I will report it straight away.

Ah, yes - the script used was the same according to the user. However, as usual with users' dealings, I will be monitoring job submissions closely to make sure we have ways to identify the script used.

Cheers,
Gene
Comment 33 Alejandro Sanchez 2016-05-11 20:06:11 MDT
OK, we remain waiting for your concrete sequence of submission options/environment/script/logs that leads to jobs not starting in the reservation, since we cannot reproduce the problem unless we specify --ntasks-per-node=315.
Comment 34 Alejandro Sanchez 2016-05-13 01:53:42 MDT
Hi Gene, since this is a sev-2 bug, we're required to make daily progress on it. Did you have a chance to experiment further? If not, maybe we could downgrade the bug to sev-3. Thank you.
Comment 35 Alejandro Sanchez 2016-05-18 19:41:20 MDT
Switching bug to sev-3. Please let me know as soon as you have an update. Thanks.
Comment 36 Gene Soudlenkov 2016-05-18 20:54:42 MDT
Thanks, Alejandro, will do

Cheers,
Gene
Comment 37 Alejandro Sanchez 2016-06-13 23:18:22 MDT
Hey Gene. Any update with this bug?
Comment 38 Gene Soudlenkov 2016-06-14 08:55:49 MDT
Hi, Alejandro

We haven't seen this behaviour so far - I would suggest closing the ticket and re-opening if similar behaviour is observed (we expect some reservations later this week).

Cheers,
Gene
Comment 39 Alejandro Sanchez 2016-06-14 09:12:17 MDT
Marking as resolved/infogiven. Please, reopen if trouble is found with future reservations.