Ticket 2384

Summary: large job (99% of system) failed to start unattended, needed to down other partitions
Product: Slurm Reporter: Doug Jacobsen <dmjacobsen>
Component: slurmctld    Assignee: Tim Wickberg <tim>
Status: RESOLVED FIXED
Severity: 3 - Medium Impact    
Priority: --- CC: brian.gilmer, dpaul, jfbotts, jsrinivasan
Version: 15.08.8   
Hardware: Cray XC   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=2539
Site: NERSC
Machine Name: cori
Version Fixed: 15.08.9
Attachments: slurmctld log
cori slurmctld log for 03/08
cori job_submit.lua
print values for gres underflow bug

Description Doug Jacobsen 2016-01-26 14:59:27 MST
Hello,

Good news!  edison is running a patched 15.08.7 that includes the NHC race-condition patch.  It was able to successfully drain resources to start a 5462-node job (out of 5576 total nodes).

Unfortunately it failed to start the job on its own.  I didn't see anything particularly damning in the logs -- the job just wouldn't start.  The backfill scheduler would just keep delaying it even though all the nodes were "idle".

The extra-huge jobs run in the "regularx" partition, which has access to all nodes in the system.  I needed to mark the "regular" (a subset of nodes) and "debug" partitions as down.  Once I marked those partitions down, the job started immediately.

This was after it failed to start on its own for over 30 minutes.  I suspect there is some issue with overlapping partitions and these jobs that want to consume nearly everything.

Note that in our case, both debug and regularx have access to ALL nodes.
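For context, the overlapping layout described here can be sketched in slurm.conf roughly as follows; the node ranges and options are illustrative only, not the actual configuration from this system:

```
# Illustrative slurm.conf fragment -- not the real configuration.
# regularx and debug both span every node; regular covers a subset.
PartitionName=regularx Nodes=nid0[0001-5576] State=UP
PartitionName=debug    Nodes=nid0[0001-5576] State=UP
PartitionName=regular  Nodes=nid0[0001-5462] State=UP
```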

I'll attach the logs shortly.
-Doug
Comment 1 Doug Jacobsen 2016-01-26 15:04:33 MST
Created attachment 2651 [details]
slurmctld log
Comment 2 Doug Jacobsen 2016-01-26 15:06:53 MST
Please note the job in question is job 66226.

It should have been able to start as early as 19:55, but it waited until 20:42, when I downed the other partitions (regular and debug).  The job was allocated by the time I could type sinfo after doing that.  I don't see a log entry indicating the partitions were downed.

nid01605:~ # scontrol show config
Configuration data as of 2016-01-26T21:06:27
AccountingStorageBackupHost = edique02
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost   = edique01
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = node,cpu,mem,energy,node
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/cray
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 1
AuthInfo                = (null)
AuthType                = auth/munge
BackupAddr              = 128.55.143.33
BackupController        = edique01
BatchStartTimeout       = 60 sec
BOOT_TIME               = 2016-01-24T12:09:50
BurstBufferType         = (null)
CacheGroups             = 0
CheckpointType          = checkpoint/none
ChosLoc                 = (null)
ClusterName             = edison
CompleteWait            = 0 sec
ControlAddr             = nid01605
ControlMachine          = nid01605
CoreSpecPlugin          = core_spec/cray
CpuFreqDef              = OnDemand
CpuFreqGovernors        = OnDemand
CryptoType              = crypto/munge
DebugFlags              = Backfill
DefMemPerNode           = UNLIMITED
DisableRootJobs         = Yes
EioTimeout              = 60
EnforcePartLimits       = Yes
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FastSchedule            = 1
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = craynetwork
GroupUpdateForce        = 0
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 600 sec
JobAcctGatherFrequency  = 0
JobAcctGatherType       = jobacct_gather/cgroup
JobAcctGatherParams     = (null)
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/cncu
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend           = 0
JobRequeue              = 0
JobSubmitPlugins        = cray,lua
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 1
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 =
Licenses                = scratch1:1000000,scratch2:1000000,scratch3:1000000,dna:1000000,project:1000000,projecta:1000000,projectb:1000000
LicensesUsed            = projectb:0/1000000,projecta:0/1000000,project:0/1000000,dna:0/1000000,scratch3:0/1000000,scratch2:0/1000000,scratch1:0/1000000
MailProg                = /bin/mail
MaxArraySize            = 65000
MaxJobCount             = 200000
MaxJobId                = 2147418112
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 128
MemLimitEnforce         = Yes
MessageTimeout          = 60 sec
MinJobAge               = 300 sec
MpiDefault              = openmpi
MpiParams               = ports=63001-64000
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 72667
OverTimeLimit           = 0 min
PluginDir               = /opt/slurm/default/lib/slurm
PlugStackConfig         = /opt/slurm/etc/plugstack.conf
PowerParameters         = (null)
PowerPlugin             =
PreemptMode             = REQUEUE
PreemptType             = preempt/qos
PriorityParameters      = (null)
PriorityDecayHalfLife   = 8-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           =
PriorityMaxAge          = 21-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 30240
PriorityWeightFairShare = 1440
PriorityWeightJobSize   = 0
PriorityWeightPartition = 0
PriorityWeightQOS       = 38880
PriorityWeightTRES      = node=2880
PrivateData             = none
ProctrackType           = proctrack/cray
Prolog                  = /software/gpfs_dvsRemount_prolog.sh
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = (null)
SallocDefaultCommand    = srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --gres=craynetwork:0 --mpi=none $SHELL
SchedulerParameters     = no_backup_scheduling,bf_window=2880,bf_resolution=120,bf_max_job_array_resv=20,default_queue_depth=400,bf_max_job_test=6000,bf_continue,bf_interval=150,bf_max_job_part=1000,bf_max_job_user=12,nohold_on_prolog_fail,kill_invalid_depend
SchedulerPort           = 7321
SchedulerRootFilter     = 1
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cray
SelectTypeParameters    = CR_SOCKET_MEMORY,OTHER_CONS_RES
SlurmUser               = root(0)
SlurmctldDebug          = debug
SlurmctldLogFile        = /var/tmp/slurm/slurmctld.log
SlurmctldPort           = 6817
SlurmctldTimeout        = 120 sec
SlurmdDebug             = info
SlurmdLogFile           = /var/spool/slurmd/%h.log
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPlugstack         = (null)
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /opt/slurm/etc/slurm.conf
SLURM_VERSION           = 15.08.7
SrunEpilog              = (null)
SrunPortRange           = 60001-63000
SrunProlog              = (null)
StateSaveLocation       = /global/syscom/sc/nsg/var/edison-slurm-state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/cray
TaskEpilog              = (null)
TaskPlugin              = task/cgroup,task/cray
TaskPluginParam         = (null type)
TaskProlog              = (null)
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 18
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec

Slurmctld(primary/backup) at nid01605/edique01 are UP/UP
nid01605:~ #
Comment 3 Doug Jacobsen 2016-01-26 17:14:24 MST
There are three other things worth mentioning here, though they might really be separate issues, especially (3).

(1) At the same time (and, in fact, the whole time the system was reserving resources for these big jobs), almost all jobs in the system were marked as having a Reason of "(Resources)", i.e., almost none of the 1600 jobs in the queue had reason "(Priority)".  This is similar to bug #2285.
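A quick way to see this skew is to tally the Reason column for pending jobs.  The sketch below parses sample `squeue -h -t PD -o "%i|%r"` output; the job IDs and counts are invented for illustration, not taken from the logs:

```python
from collections import Counter

# Invented sample of `squeue -h -t PD -o "%i|%r"` output, illustrating the
# skew described above: nearly every pending job shows Resources.
sample = """\
70601|Resources
70602|Resources
70603|Resources
70604|Priority
"""

def tally_reasons(squeue_output):
    """Count pending-job Reason values from pipe-delimited squeue output."""
    reasons = Counter()
    for line in squeue_output.splitlines():
        if line.strip():
            _jobid, reason = line.split("|", 1)
            reasons[reason] += 1
    return reasons

print(tally_reasons(sample))  # Counter({'Resources': 3, 'Priority': 1})
```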

(2) There is output like:
[2016-01-26T20:48:28.788] debug:  Job 70406 still has 1 active steps

in slurmctld.log.  This job has many steps, all of which were marked as completed in the Slurm accounting database, and it was done just prior to noon.  However, there is still an active entry in slurmctld:

nid01605:~ # scontrol show job 70406
JobId=70406 JobName=run_pwc_edison.sh
   UserId=kissel(53203) GroupId=kissel(53203)
   Priority=32890 Nice=0 Account=xpress QOS=normal_regular_0
   JobState=NODE_FAIL Reason=NodeDown Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:1
   RunTime=00:17:58 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2016-01-26T09:58:09 EligibleTime=2016-01-26T09:58:09
   StartTime=2016-01-26T11:40:50 EndTime=2016-01-26T11:58:48
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=regular AllocNode:Sid=edison07:56529
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=nid0[0297-0323,0328-0383,0388-0451,0456-0511,0516-0522,1776-1919,1924-1987,1992-2047,2052-2201,2410-2431,2436-2687,2894-3051,3476-3523,3528-3775,4397-4607,4992-5262,5728-5759,5764-5827,5832-5949]
   BatchHost=nid00297
   NumNodes=2048 NumCPUs=98304 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=98304,mem=132143104,node=2048
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=64523M MinTmpDiskNode=0
   Features=(null) Gres=craynetwork:1 Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/global/homes/k/kissel/scripts/edison/run_pwc_edison.sh
   WorkDir=/global/u2/k/kissel/RESULTS/pwc_edison
   StdErr=/global/u2/k/kissel/RESULTS/pwc_edison/run_pwc_edison.%J.out
   StdIn=/dev/null
   StdOut=/global/u2/k/kissel/RESULTS/pwc_edison/run_pwc_edison.%J.out
   Power= SICP=0

nid01605:~ #


sacct has a record of the job and its many steps, but the batch step is missing.  Also, step 51 apparently took a *very* long time to complete, and the job failed owing to a node failure (nid00297):

nid01605:~ # sacct -j 70406 --format=job,start,end,state
       JobID               Start                 End      State
------------ ------------------- ------------------- ----------
70406        2016-01-26T11:40:50 2016-01-26T11:58:48  NODE_FAIL
70406.0      2016-01-26T11:40:55 2016-01-26T11:40:59  COMPLETED
70406.1      2016-01-26T11:41:02 2016-01-26T11:41:07  COMPLETED
70406.2      2016-01-26T11:41:09 2016-01-26T11:41:14  COMPLETED
70406.3      2016-01-26T11:41:17 2016-01-26T11:41:22  COMPLETED
70406.4      2016-01-26T11:41:24 2016-01-26T11:41:29  COMPLETED
70406.5      2016-01-26T11:41:31 2016-01-26T11:41:36  COMPLETED
70406.6      2016-01-26T11:41:39 2016-01-26T11:41:44  COMPLETED
70406.7      2016-01-26T11:41:46 2016-01-26T11:41:54  COMPLETED
70406.8      2016-01-26T11:41:56 2016-01-26T11:42:01  COMPLETED
70406.9      2016-01-26T11:42:03 2016-01-26T11:42:11  COMPLETED
70406.10     2016-01-26T11:42:13 2016-01-26T11:42:19  COMPLETED
70406.11     2016-01-26T11:42:21 2016-01-26T11:42:26  COMPLETED
70406.12     2016-01-26T11:42:28 2016-01-26T11:42:34  COMPLETED
70406.13     2016-01-26T11:42:36 2016-01-26T11:42:42  COMPLETED
70406.14     2016-01-26T11:42:44 2016-01-26T11:42:49  COMPLETED
70406.15     2016-01-26T11:42:51 2016-01-26T11:42:59  COMPLETED
70406.16     2016-01-26T11:43:02 2016-01-26T11:43:07  COMPLETED
70406.17     2016-01-26T11:43:09 2016-01-26T11:43:17  COMPLETED
70406.18     2016-01-26T11:43:20 2016-01-26T11:43:25  COMPLETED
70406.19     2016-01-26T11:43:27 2016-01-26T11:43:35  COMPLETED
70406.20     2016-01-26T11:43:38 2016-01-26T11:43:43  COMPLETED
70406.21     2016-01-26T11:43:45 2016-01-26T11:43:54  COMPLETED
70406.22     2016-01-26T11:43:54 2016-01-26T11:44:04  COMPLETED
70406.23     2016-01-26T11:44:04 2016-01-26T11:44:14  COMPLETED
70406.24     2016-01-26T11:44:16 2016-01-26T11:44:24  COMPLETED
70406.25     2016-01-26T11:44:24 2016-01-26T11:44:33  COMPLETED
70406.26     2016-01-26T11:44:35 2016-01-26T11:44:43  COMPLETED
70406.27     2016-01-26T11:44:45 2016-01-26T11:44:54  COMPLETED
70406.28     2016-01-26T11:44:56 2016-01-26T11:45:05  COMPLETED
70406.29     2016-01-26T11:45:07 2016-01-26T11:45:15  COMPLETED
70406.30     2016-01-26T11:45:18 2016-01-26T11:45:28  COMPLETED
70406.31     2016-01-26T11:45:30 2016-01-26T11:45:40  COMPLETED
70406.32     2016-01-26T11:45:43 2016-01-26T11:45:53  COMPLETED
70406.33     2016-01-26T11:45:55 2016-01-26T11:46:05  COMPLETED
70406.34     2016-01-26T11:46:07 2016-01-26T11:46:17  COMPLETED
70406.35     2016-01-26T11:46:19 2016-01-26T11:46:31  COMPLETED
70406.36     2016-01-26T11:46:34 2016-01-26T11:46:46  COMPLETED
70406.37     2016-01-26T11:46:48 2016-01-26T11:47:01  COMPLETED
70406.38     2016-01-26T11:47:03 2016-01-26T11:47:15  COMPLETED
70406.39     2016-01-26T11:47:18 2016-01-26T11:47:30  COMPLETED
70406.40     2016-01-26T11:47:32 2016-01-26T11:47:47  COMPLETED
70406.41     2016-01-26T11:47:49 2016-01-26T11:48:05  COMPLETED
70406.42     2016-01-26T11:48:07 2016-01-26T11:48:22  COMPLETED
70406.43     2016-01-26T11:48:24 2016-01-26T11:48:39  COMPLETED
70406.44     2016-01-26T11:48:42 2016-01-26T11:48:57  COMPLETED
70406.45     2016-01-26T11:48:59 2016-01-26T11:49:24  COMPLETED
70406.46     2016-01-26T11:49:26 2016-01-26T11:49:51  COMPLETED
70406.47     2016-01-26T11:49:54 2016-01-26T11:50:20  COMPLETED
70406.48     2016-01-26T11:50:23 2016-01-26T11:50:47  COMPLETED
70406.49     2016-01-26T11:50:49 2016-01-26T11:51:14  COMPLETED
70406.50     2016-01-26T11:51:16 2016-01-26T11:51:54 CANCELLED+
70406.51     2016-01-26T11:51:58 2016-01-26T13:21:01 CANCELLED+
nid01605:~ #

nid01605:~ # sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2016-01-26T11:58:48 nid00297
...

So I'm guessing the batch step isn't showing up because the lowest-numbered nid (i.e., where the batch step runs) is the one that failed.  But could the node failure plus the errant entry of a still-running step (potentially either .batch or .51) have resulted in the large job not starting, perhaps via some phantom idea that some of this job's nodes were somehow still engaged despite being marked idle?




(3) Once the job started running, smaller task requests worked, but the largest jobs consistently fail.

e.g., user script:
#SBATCH -N 5462

srun -n 32     ../src/hpsr.${PETSC_ARCH}.ex ${ARGS} >& outc_000032_nx032_sr3
srun -n 256    ../src/hpsr.${PETSC_ARCH}.ex ${ARGS} >& outc_000256_nx032_sr3
srun -n 2048   ../src/hpsr.${PETSC_ARCH}.ex ${ARGS} >& outc_002048_nx032_sr3
srun -n 16384  ../src/hpsr.${PETSC_ARCH}.ex ${ARGS} >& outc_016384_nx032_sr3
srun -n 131072 ../src/hpsr.${PETSC_ARCH}.ex ${ARGS} >& outc_131072_nx032_sr3


The 32- through 16384-task steps work fine.  The 131072-task step (the fully packed 5462-node run) gives errors.  Other users in similar situations have similar issues.

The job output shows PMI errors, but I'm guessing these errors could be caused if alpscomm, for example, is not fully aware of the job that is starting, e.g., if the srun is proceeding ahead of whatever aeld or other interactions are needed.

The user is getting output like: 

...Tue Jan 26 20:55:02 2016: [PE_60094]:_pmi_init:_pmi_mmap_init returned -1 
Tue Jan 26 20:55:02 2016: [PE_89738]:_pmi_init:_pmi_mmap_init returned -1 
Tue Jan 26 20:55:02 2016: [PE_18714]:_pmi_init:_pmi_mmap_init returned -1 
Tue Jan 26 20:55:02 2016: [PE_44791]:_pmi_init:_pmi_mmap_init returned -1 
Tue Jan 26 20:55:02 2016: [PE_89750]:_pmi_init:_pmi_mmap_init returned -1 
Tue Jan 26 20:55:02 2016: [PE_89753]:_pmi_init:_pmi_mmap_init returned -1 
Tue Jan 26 20:55:02 2016: [PE_22613]:_pmi_init:_pmi_mmap_init returned -1 
Tue Jan 26 20:55:02 2016: [PE_75435]:_pmi_init:_pmi_mmap_init returned -1 
Tue Jan 26 20:55:02 2016: [PE_17906]:_pmi_init:_pmi_mmap_init returned -1 
... 
... 
... 
[Tue Jan 26 20:55:17 2016] [c4-0c0s10n3] Fatal error in PMPI_Init_thread: Other MPI error, error stack: 
MPIR_Init_thread(525): 
MPID_Init(225).......: channel initialization failed 
MPID_Init(598).......: PMI2 init failed: 1 
[Tue Jan 26 20:55:17 2016] [c4-0c2s12n3] Fatal error in PMPI_Init_thread: Other MPI error, error stack: 
MPIR_Init_thread(525): 
MPID_Init(225).......: channel initialization failed 
MPID_Init(598).......: PMI2 init failed: 1 
[Tue Jan 26 20:55:17 2016] [c3-3c0s12n0] Fatal error in PMPI_Init_thread: Other MPI error, error stack: 
MPIR_Init_thread(525): 
MPID_Init(225).......: channel initialization failed 
MPID_Init(598).......: PMI2 init failed: 1 
[Tue Jan 26 20:55:17 2016] [c4-2c1s4n1] Fatal error in PMPI_Init_thread: Other MPI error, error stack: 
MPIR_Init_thread(525): 
MPID_Init(225).......: channel initialization failed 
MPID_Init(598).......: PMI2 init failed: 1 
[Tue Jan 26 20:55:17 2016] [c4-0c1s4n0] Fatal error in PMPI_Init_thread: Other MPI error, error stack: 
MPIR_Init_thread(525): 
MPID_Init(225).......: channel initialization failed 
MPID_Init(598).......: PMI2 init failed: 1 
[Tue Jan 26 20:55:17 2016] [c5-1c2s1n3] Fatal error in PMPI_Init_thread: Other MPI error, error stack: 
MPIR_Init_thread(525): 
MPID_Init(225).......: channel initialization failed 
MPID_Init(598).......: PMI2 init failed: 1 
[Tue Jan 26 20:55:17 2016] [c6-0c2s14n3] Fatal error in PMPI_Init_thread: Other MPI error, error stack: 
MPIR_Init_thread(525): 
... 
... 
... 
... 
... 
srun: error: nid02748: tasks 62784-62801,62803-62807: Killed 
srun: error: nid01098: tasks 24504-24525,24527: Killed 
srun: error: nid01453: tasks 32736-32740,32742-32759: Killed 
srun: error: nid05853: tasks 125787-125788,125791,125793,125795-125796,125799-125801,125803,125805: Killed 
srun: error: nid00136: tasks 2688-2695,2697-2711: Killed 
srun: error: nid04172: tasks 95736,95738,95740,95742,95744,95746,95748,95751,95754,95756-95757: Killed 
srun: error: nid00509: tasks 10944-10956,10958-10967: Killed 
srun: error: nid00868: tasks 19179-19180,19183,19187,19189,19191,19195,19197,19199: Killed 
srun: error: nid04061: tasks 93072-93082,93084-93095: Killed 
srun: error: nid00394: tasks 8280-8298,8300-8303: Killed 
srun: error: nid05978: tasks 128688-128700,128702-128711: Killed 
srun: error: nid01108: tasks 24744-24759,24761-24767: Killed 
srun: error: nid04320: tasks 98640-98644,98646-98663: Killed 
srun: error: nid02643: tasks 60360-60382: Killed 
srun: error: nid05390: tasks 115104-115122,115124-115127: Killed 
srun: error: nid04182: tasks 95977-95999: Killed 
srun: error: nid00523: tasks 11185-11186,11189,11191,11193,11195,11197,11199,11201: Killed 
srun: error: nid03719: tasks 85080-85090,85092-85103: Killed 
srun: error: nid04071: tasks 93312-93318,93320-93335: Killed 
.... 


In the slurmctld logs (already attached), I see that step 9 (one of the "big" steps) has this type of output upon allocation:

[2016-01-26T20:53:49.555] debug:  gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
[2016-01-26T20:53:49.555] debug:  gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
[2016-01-26T20:53:49.555] debug:  gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
[2016-01-26T20:53:49.555] debug:  gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
[2016-01-26T20:53:49.555] debug:  gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
[2016-01-26T20:53:49.555] debug:  gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
[2016-01-26T20:53:49.555] debug:  gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
...
[2016-01-26T20:53:49.711] debug:  gres/craynetwork: _step_alloc gres_bit_alloc for 66226.9 is NULL
[2016-01-26T20:53:49.711] debug:  gres/craynetwork: _step_alloc gres_bit_alloc for 66226.9 is NULL
[2016-01-26T20:53:49.711] debug:  gres/craynetwork: _step_alloc gres_bit_alloc for 66226.9 is NULL
[2016-01-26T20:54:00.227] debug:  aeld sync event
[2016-01-26T20:54:00.231] debug:  _start_session: Created aeld session fd 4
[2016-01-26T20:54:14.165] backfill: beginning
[2016-01-26T20:54:14.166] debug:  backfill: 2 jobs to backfill
...

Step 4 (the other "big" step) did not have the aeld sync immediately after job start, but did have something like:
[2016-01-26T20:46:50.055] debug:  gres/craynetwork: _step_alloc gres_bit_alloc for 66226.4 is NULL
[2016-01-26T20:46:50.055] debug:  gres/craynetwork: _step_alloc gres_bit_alloc for 66226.4 is NULL
[2016-01-26T20:47:28.474] debug:  Job 70406 still has 1 active steps
[2016-01-26T20:47:28.474] debug:  sched: Running job scheduler
[2016-01-26T20:47:28.569] debug:  sched: JobId=70601. State=PENDING. Reason=Priority, Priority=199000. Partition=regularx.
[2016-01-26T20:47:31.572] debug:  cray: select_p_state_save
[2016-01-26T20:47:31.683] debug:  backup controller responding

i.e., the select_p_state_save.

I'm just trying to understand whether there is some issue that may be preventing large steps from starting, particularly if there is some timing or communication issue with the Cray alpscomm stuff.  I've filed a similar issue with Cray regarding the PMI errors.

-Doug
Comment 4 Tim Wickberg 2016-01-27 02:20:18 MST
That failed job does seem like a possible culprit - the log indicates the scheduler tried to schedule 66226 but then continually pushed back the expected start time by three minutes.  I'd love to know what it was blocked on, but we don't appear to log that currently.

I'm wondering if the craynetwork GRES may be causing problems - I'm looking into whether that downed node may have prevented the GRES from being freed on the other nodes, or if there's some other interaction there that may have caused problems.

Can you provide further details on 66226? What was the exact node / CPU / GRES request initially, either from slurmctld.log or from sacct? Does it ask for only craynetwork:1, or does it want all two (or four?) slots on the node?

The "gres count underflow" warnings may also be related; I haven't seen a good reason for that count to be off.

Task launch is something I'll need to grab Danny to look at - how we interact with the Cray APIs is something I'm not familiar with, and there may be something obvious he can see here that I'm missing.
Comment 5 Doug Jacobsen 2016-01-27 02:30:39 MST
A future bug I will file has to do with all the craynetwork GRES-related messages -- they kinda fill the logs up and seem to just be a matter of course (both the underflow messages and the "gres_bit_alloc is NULL" messages on all job allocations).  However, unless the GRES are somehow tied to the partitions, I don't understand why downing regular would allow a job in regularx to start.  The failed job, however, was in regular.  I'll try to get the exact GRES later, but my memory is that the user was just asking for

#SBATCH -n 131088

so would get GRES=cpu=131088,node=5462
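That node count follows from simple arithmetic, assuming (this is an inference, not stated in the ticket) that the scheduler lays out 24 tasks per node here, since the requested task count divides evenly into the reported node count:

```python
import math

# Assumption (inferred, not stated in the ticket): 24 tasks are packed per
# node, since 131088 / 24 divides evenly into the 5462 nodes reported.
TASKS_PER_NODE = 24

ntasks = 131088  # the user's #SBATCH -n request
nodes = math.ceil(ntasks / TASKS_PER_NODE)
print(nodes)     # 5462
```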



----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


Comment 6 Danny Auble 2016-01-27 04:37:51 MST
(In reply to Doug Jacobsen from comment #3)

> (3) Once the job started running, smaller task requests worked, but the
> largest jobs consistently fail.
> [...]
Hey Doug, I am not sure what is happening here; I have never seen this before.  I added Brian from Cray to see if he has any insights on the matter.  If there is something Slurm needs to be doing, I am not sure what it is: before we launch any tasks, we set things up as we do for any other job.  Perhaps there is a delay at this scale setting things up on the Cray side?
Comment 7 Brian F Gilmer 2016-01-27 06:34:13 MST
For the problem in comment #6, try turning debugging on to capture the ALPS exchanges:

ALPSLLI_DEBUG=1 srun -N 675 --ntasks-per-node=32 --unbuffered a.out %s

A Cray bug was filed and forwarded to the MPI PMI engineering team to look at.  This has come up before on large jobs on large systems.  We also saw this on the XC40 when the compute nodes were not all running the same version of Slurm.

===
This shows whether different versions are running:
> scontrol show hostname `sinfo -h -p workq --state=idle -o %N` | awk '{printf $1 ", "}' > nid.list
> srun -w nid.list  srun -V | sort -u
slurm 14.11.8
slurm 14.11.7
> srun -w nid.list bash -c 'which srun' | sort -u
/opt/slurm/14.11.8/bin/srun
/opt/slurm/14.11.7/bin/srun

This command shows the nodes that have not reported to slurmctld:
scontrol -o show nodes | awk '/BootTime=None/ {print $1 "   may be running the wrong version of Slurm"}'
Comment 8 Doug Jacobsen 2016-01-27 06:52:31 MST
Hi Brian,

Thanks for responding regarding the ALPS debugging - I'll discuss with the group when we can drain the system to run some more big jobs and try this.

Regarding different versions of SLURM, we've done a lot of work to engineer
that into an almost impossible problem.  This is mostly done by mounting
/dsl/opt/slurm with different caching levels on the compute nodes (15s
timeout for attribute cache), and moving the config into /opt/slurm/etc to
prevent dsl caching issues persisting from symlink attribute caching.
Also, the system was just rebooted owing to a site power issue (so even
outside of that engineering it really isn't possible).  Anyway, I ran the
command and the only nodes with BootTime=None are the nodes that are
currently physically down.

-Doug

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center <http://www.nersc.gov>
dmjacobsen@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________


Comment 9 Tim Wickberg 2016-01-27 09:20:37 MST
(In reply to Doug Jacobsen from comment #5)
> A future bug I will file has to do with all the craynetwork gres related
> messages -- they kinda fill the logs up and seem to just be a matter of
> course (both the underflow as well as the gres bit is NULL messages on all
> job allocations).  

Danny's commit a2559f0 moves the gres_bit_alloc message to debug3 (300k of the 500k log lines); it will be in the next 15.08 point release if you don't want to patch it now.

Commit 3c25beb also removes another minor nuisance message.

We're still looking into the underflow message. (48k lines)

> However, unless the GRES are somehow tied to the
> partitions, I don't understand why downing regular would allow a job in
> regularx to start.  The failed job, however, was in regular.

I'm still working on reproducing this. It does seem like the failed job was responsible for stalling this. Exactly why marking the partition down finally resolved it is also an open question; the job should have terminated and released all allocated resources before you changed that.
Comment 10 Tim Wickberg 2016-01-27 09:24:04 MST
Can you try running with this patch (https://github.com/schedmd/slurm/commit/6f94bb7.patch) for a while? That will at least give us some better log messages to help narrow down the underflow problem.
Comment 11 Doug Jacobsen 2016-01-27 11:55:20 MST
Hi Tim,

I'll review the patch and will probably apply it this evening.  Thank you.


Regarding the srun crash, I got the following from Cray:

Bug 821264 has more details about PMI_MMAP_SYNC_WAIT_TIME.  By
default, PMI will wait up to 60 seconds for all the ranks on a given
node to check-in at launch.  If all ranks do not arrive within 60
seconds, PMI will abort, because it thinks something went wrong.  You
can increase the 60 second time limit by setting
PMI_MMAP_SYNC_WAIT_TIME to a higher value.  If SLURM is having a
problem, setting this PMI value higher might give SLURM enough time to
detect (and hopefully display) some error of its own.


They recommended setting PMI_MMAP_SYNC_WAIT_TIME to 300.  I've asked a
couple of targeted users to do so when their next big jobs start.
I've heard from a different source that this could also, in some way,
be related to the scalability of the PMI credential service and that
too many tasks are simultaneously making requests.
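A sketch of what we'll ask those users to add (the node counts mirror the earlier failing launch; the script name and binary path are placeholders):

```shell
# Hypothetical batch script raising the PMI rank check-in timeout
# from the default 60s to the recommended 300s
cat > bigjob.sbatch <<'EOF'
#!/bin/bash
#SBATCH -N 675
#SBATCH --ntasks-per-node=32
# Give PMI up to 300s for all ranks to check in at launch
export PMI_MMAP_SYNC_WAIT_TIME=300
srun --unbuffered ./a.out
EOF
grep -c 'PMI_MMAP_SYNC_WAIT_TIME=300' bigjob.sbatch
```

If the launch then fails with a Slurm-side error instead of a PMI abort, that would point at where the real delay is.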


Comment 12 Doug Jacobsen 2016-01-27 12:22:28 MST
hmm.. also I realized that the users most hit by this are using the
/global/homes filesystem primarily to host their executables.  This
filesystem may have less bandwidth than others to the system, so running
large executables without copying first could cause the issue (i.e., ranks
taking more than 60s to all start).  I'm asking these users to try the
--bcast option you recently added.

-Doug

Comment 13 Moe Jette 2016-02-02 02:30:48 MST
(In reply to Doug Jacobsen from comment #12)
> hmm.. also I realized that the users most hit by this are using the
> /global/homes filesystem primarily to host their executables.  This
> filesystem may have less bandwidth than others to the system, so running
> large executables without copying first could cause the issue (i.e., take
> ranks more than 60s to all start).  I'm asking these users to try the
> --bcast option you recently added.

FYI: There are also sbcast and sgather commands available, which may serve the needs of some users better than srun's --bcast option (e.g. to copy an executable that is used for multiple srun invocations [job steps] within a single job allocation).
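A minimal sketch of that pattern (paths and node count are placeholders): stage the binary once with sbcast, then launch every step from the node-local copy:

```shell
# Hypothetical batch script: one sbcast, multiple steps reuse the copy
cat > stage.sbatch <<'EOF'
#!/bin/bash
#SBATCH -N 5462
# Copy the binary from slow /global/homes to node-local /tmp once
sbcast /global/homes/u/user/a.out /tmp/a.out
# Each subsequent step launches from the fast local copy
srun /tmp/a.out
srun /tmp/a.out
EOF
grep -c '^srun /tmp/a.out' stage.sbatch
```

Compared to per-step `--bcast`, this pays the filesystem cost once per allocation rather than once per step.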
Comment 14 Doug Jacobsen 2016-02-02 02:41:10 MST
Hello,

This weekend a user was able to submit a full-scale job and it ran
successfully unattended (though I still checked on it).

It may be that the original issue filed -- that a large job did not start
unattended until we downed the other partitions -- is related to a job that
crashed earlier in the day.  I'd like to refocus on that issue.  I can
understand that when the head node of the earlier job crashed some resource
may not have been released, though why downing the partition would resolve
that and allow the target job to run, I'm unsure.

-Doug

Comment 16 Tim Wickberg 2016-02-02 07:24:03 MST
Glad to hear things are behaving okay at least when the hardware is stable.

I'll continue investigating that, although I'm not seeing an obvious culprit here, and haven't been able to reproduce it so far.

I'm assuming that there's no relevant slurmd.log from around the time that node crashed? I'd be curious if that had any more information that may narrow it down.
Comment 17 Doug Jacobsen 2016-02-02 07:39:36 MST
Sadly, I'm still logging to /var/tmp, so the frequent power outages have
been destroying my logs -- you have what I have from that time period.

Soon I'll have these logs persisted to disk via rsyslog...

Comment 18 Tim Wickberg 2016-03-04 01:16:16 MST
I haven't had any luck sorting this out; absent some logs pointing towards the issue, I'm not going to make any headway.

If this recurs please reopen - hopefully it won't, or at least the logs will provide some insight then.

- Tim
Comment 19 Doug Jacobsen 2016-03-08 22:38:58 MST
Hello,

I found a similar situation last night on cori.  It turned out that at least one node, nid00024, wasn't running jobs for the debug partition.  This is the lowest-numbered node and thus seems to be a favorite for the backfill scheduler.

Anyway, for some reason jobs stopped starting on it (and possibly other nodes) until I downed the debug partition and then upped it.

The effect was that "large" debug jobs (say, 64 nodes or so) kept piling up, gaining priority over newly submitted ones, and we were getting very little throughput through debug.  I noticed because there was a long debug queue and plenty of idle nodes.

From the following accounting you can see that it ran job 1283858, ending at 00:42:58, but didn't start anything more until 21:57:14 (job 1289982).  I had downed and upped the partition right before that job started.

nid00837:/var/tmp/slurm # sacct --node=nid00024 --start=2016-03-08 --format=job,user,start,end,state -X
       JobID      User               Start                 End      State
------------ --------- ------------------- ------------------- ----------
1283858       lonardon 2016-03-08T00:32:37 2016-03-08T00:42:58  COMPLETED
1284114            psh 2016-03-08T00:01:47 2016-03-08T00:31:57    TIMEOUT
1289982           bwoo 2016-03-08T21:57:14 2016-03-08T22:27:30    TIMEOUT
1290514         totoni 2016-03-08T21:52:25 2016-03-08T21:57:00     FAILED
1290629        lzlwell 2016-03-08T22:28:20 2016-03-08T22:56:36     FAILED
1290631        lzlwell 2016-03-08T22:57:28 2016-03-08T23:24:30     FAILED
1290639        lzlwell 2016-03-08T23:25:28 2016-03-08T23:52:33     FAILED
1290651        lzlwell 2016-03-08T23:53:34 2016-03-09T00:21:47     FAILED
1290667        lzlwell 2016-03-09T00:21:57 2016-03-09T00:24:33 CANCELLED+
1291184       lonardon 2016-03-09T00:28:40 2016-03-09T00:44:10  COMPLETED
1291185       lonardon 2016-03-09T00:44:42 2016-03-09T01:00:40  COMPLETED
1291187       lonardon 2016-03-09T01:00:44 2016-03-09T01:16:26  COMPLETED
1291201        xiaobao 2016-03-09T01:16:46 2016-03-09T01:33:41  COMPLETED
1291214        xiaobao 2016-03-09T01:36:33 2016-03-09T01:52:56  COMPLETED
1291276       tbechtel 2016-03-09T01:53:47 2016-03-09T02:15:27  COMPLETED
1291277       tbechtel 2016-03-09T02:15:56 2016-03-09T02:29:57  COMPLETED
1291278       tbechtel 2016-03-09T02:31:14 2016-03-09T02:40:22  COMPLETED
1291449        lzlwell 2016-03-09T03:04:30 2016-03-09T03:32:11     FAILED
1291451        lzlwell 2016-03-09T03:33:11 2016-03-09T04:01:02     FAILED
1291453        lzlwell 2016-03-09T04:01:41 2016-03-09T04:25:58     FAILED
1291520       dpoletti 2016-03-09T02:41:59 2016-03-09T02:58:05     FAILED
1291621       dpoletti 2016-03-09T04:26:33 2016-03-09T04:27:30 CANCELLED+
1291623       dpoletti 2016-03-09T04:28:36 2016-03-09T04:54:49  COMPLETED
1291643       lgerhard 2016-03-09T05:00:40 2016-03-09T05:07:30  COMPLETED
1291644       lgerhard 2016-03-09T05:07:40 2016-03-09T05:13:10  COMPLETED
nid00837:/var/tmp/slurm #

So, while this isn't the same as the full-scale job requiring a down/up cycle, it shows that a node or group of nodes needed one in order to start scheduling again.

I'll attach the (900MB-when-expanded) slurmctld log for cori for yesterday.

-Doug
Comment 20 Doug Jacobsen 2016-03-08 22:42:38 MST
Created attachment 2840 [details]
cori slurmcltd log for 03/08
Comment 23 Tim Wickberg 2016-03-09 01:09:02 MST
Thanks for the logs, I'm poring through them now trying to understand what's going on.

One curiosity:

[2016-03-08T23:55:15.193] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-08T23:55:15.203] error: slurm_receive_msg: Zero Bytes were transmitted or received

You have this pair of error messages every 10 minutes throughout the log, which is odd. I'm guessing it's some rogue cron job on a node with an out-of-sync clock, or an invalid munge key? It's unrelated to this issue, but the fewer spurious errors I have to comb through, the easier it is to pinpoint real issues.
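One way to confirm the cadence is to pull just the timestamps of those errors; this sketch uses a few hypothetical sample lines rather than the real 900MB log:

```shell
# Sample slurmctld.log lines (hypothetical) showing the paired errors
cat > ctld.log <<'EOF'
[2016-03-08T23:45:15.190] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-08T23:45:15.201] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-08T23:55:15.193] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-08T23:55:15.203] error: slurm_receive_msg: Zero Bytes were transmitted or received
EOF
# Strip each timestamp to the minute; a strict 10-minute spacing
# points at a timer-driven client (monitoring, cron) rather than noise
grep 'Zero Bytes' ctld.log | cut -c2-17 | sort -u
```

On the real log, a perfectly regular minute sequence would make a monitoring probe the likely source.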

I have Cori's slurm.conf, but apparently it doesn't match the current running config:

[2016-03-08T00:01:46.114] error: Partition system Nodes differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition debug State differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition regular State differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition regularx Nodes differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition regularx State differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition realtime Nodes differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition realtime State differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition shared State differs from slurm.conf

Can you send me 'scontrol show part' for reference? I'm guessing this is just to remove the broken nodes from bug 2516?
Comment 25 Doug Jacobsen 2016-03-09 01:42:24 MST
Hi Tim,

I had noticed the 10-minute periodicity of the communication error; I'm
unaware of the source.  However, all the compute nodes definitely are running
munge and almost certainly have the same clock (it's set by the Cray HSS
system), and they definitely do not run cron.  Maybe it's something coming from
our eslogin network; I'll check into that.  I know the nodes are running
munge because we don't have any unreachable nodes in cori that aren't
shutdown owing to hardware issues, and slurmd/slurmctld communication would
break down quickly if munge faltered.

Regarding the partition state difference, that message comes up every
morning when I rotate the logs (send SIGHUP to slurmctld).  I believe it is
because the partitions are up and the configuration has them down.  We have
the partitions down by default so we can do some testing prior to user jobs
running on the system following maintenances or (rarely) slurmctld crashes.

Bug 2516 was referring to the edison system, I haven't tried to remove
nodes from the cori system.

-Doug



Comment 26 Moe Jette 2016-03-09 02:16:26 MST
(In reply to Doug Jacobsen from comment #25)
> I had noticed the 10 minute periodicity of the communication error; I'm
> unaware of the source, however all the compute nodes definitely are running
> munge and almost certainly have the same clock (it's set by the Cray HSS

That message is indicative of receiving a message that we couldn't parse. It could originate from anything.
Comment 27 Doug Jacobsen 2016-03-09 02:21:19 MST
I think it's our nagios monitor looking to see if anybody is home
Comment 28 Tim Wickberg 2016-03-09 03:12:31 MST
(In reply to Doug Jacobsen from comment #27)
> I think it's our nagios monitor looking to see if anybody is home

Ah, that makes sense - if nagios is just opening the socket but not sending anything that'd explain it.

Moving on to real problems - do you mind sharing your lua job submit plugin? And I assume you're using the stock cray plugin?
Comment 29 Doug Jacobsen 2016-03-09 03:52:39 MST
Created attachment 2843 [details]
cori job_submit.lua

edison has a very similar job_submit.lua but with slightly different parameters; someday we'll merge them.
Comment 30 Doug Jacobsen 2016-03-09 03:53:35 MST
Yes, we use the stock cray plugin.
Comment 31 Moe Jette 2016-03-09 05:48:35 MST
I have two theories about what could have happened. I'll share them in case someone can direct me to concentrate on one of them.

One possibility is there is a scheduling anomaly. Setting a partition DOWN takes jobs from that partition out of consideration. Given there are overlapping partitions and a variety of limits, that could allow jobs to start that otherwise could not.

My second theory is that Cray's node health check hung for one or more jobs. The NHC is initiated at the time of job completion and the resources associated with the job are not made available to other jobs until NHC completes OR Slurm's select table records are rebuilt. The rebuild happens when the slurmctld daemon is reconfigured (SIGHUP, "scontrol reconfig", and some other configuration changes like changing a partition's state). If the NHC terminated after the tables were rebuilt, that could result in some underflow messages (without the counters going negative).

Could you look for some hung NHC processes (they would be named "xtcleanup_after" and would have been spawned by slurmctld)?
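A hedged sketch of that check (the process name is taken from the comment above; on a machine without Cray NHC this simply reports zero):

```shell
# List xtcleanup_after (Cray NHC) processes with their parent PIDs so
# hung ones (parented by slurmctld) stand out; then count them
ps -eo ppid=,comm= | awk '$2 == "xtcleanup_after" {print}'
count=$(ps -eo ppid=,comm= | awk '$2 == "xtcleanup_after"' | wc -l)
echo "xtcleanup_after processes: $count"
```

A long-lived entry here, hours after its job finished, would support the hung-NHC theory.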
Comment 37 Moe Jette 2016-03-09 07:34:06 MST
I believe the primary bug is fixed in this commit:
https://github.com/SchedMD/slurm/commit/fec5e03b900aebba21199a922cb75acf182a6abc

Specifically, each job has a "cleaning/cleaned" flag. Once a job
terminates, the cleaning flag is set, then after the job node health
check completes, the value gets set to cleaned. If the job is requeued,
on its second (or subsequent) termination, the select/cray plugin
is called to launch the NHC. The plugin sees the "cleaned" flag
already set, it then logs:
error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
and returns, never launching the NHC. Since the termination of the
job NHC triggers releasing job resources (CPUs, memory, and GRES),
those resources are never released for use by other jobs.

There were a total of 9 jobs with this problem reported in this log:
[2016-03-08T00:29:49.620] error: select_p_job_fini: Cleaned flag already set for job 1283977, this should never happen
[2016-03-08T00:32:27.509] error: select_p_job_fini: Cleaned flag already set for job 1283792, this should never happen
[2016-03-08T00:42:58.894] error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
[2016-03-08T02:15:24.429] error: select_p_job_fini: Cleaned flag already set for job 1277531, this should never happen
[2016-03-08T02:20:44.047] error: select_p_job_fini: Cleaned flag already set for job 1277599, this should never happen
[2016-03-08T04:57:42.142] error: select_p_job_fini: Cleaned flag already set for job 1276011, this should never happen
[2016-03-08T06:14:08.701] error: select_p_job_fini: Cleaned flag already set for job 1277054, this should never happen
[2016-03-08T17:46:46.781] error: select_p_job_fini: Cleaned flag already set for job 1277801, this should never happen
[2016-03-08T23:26:42.345] error: select_p_job_fini: Cleaned flag already set for job 1273184, this should never happen
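The flag lifecycle described above can be mimicked with a toy sketch (not Slurm code) to make the failure order concrete:

```shell
# Toy model of the requeue bug: "cleaned" survives a requeue, so the
# second termination never launches NHC and the job's resources
# stay allocated
state="cleaning"
finish_job() {
  if [ "$state" = "cleaned" ]; then
    echo "error: Cleaned flag already set, NHC skipped, resources leaked"
  else
    state="cleaned"
    echo "NHC launched, resources released on completion"
  fi
}
finish_job   # first termination
finish_job   # termination after requeue
```

The fix, per the commit, is to reset the flag when the job is requeued so the second termination takes the first branch again.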

I know that Tim is pursuing some other issues discovered in the course of our investigation of this bug.
Comment 38 Tim Wickberg 2016-03-09 07:36:25 MST
Created attachment 2845 [details]
print values for gres underflow bug

Doug -

Do you mind applying the attached patch? I'm trying to root out the source of the GRES underflow messages and knowing what the affected values are would help narrow down the problem considerably.

It's not necessarily related to the scheduling bug at heart here, but I've seen this on several other bugs and want to get this resolved. The error seems sporadic, and I haven't been able to trigger it on anything I have, but you seem to get a handful of jobs in such a state every day.

- Tim
Comment 39 Doug Jacobsen 2016-03-09 09:08:29 MST
OK, I've reviewed and applied the patch and we're running it on both cori
and edison.  I'll check the past logs to see if I can estimate how
frequently we were seeing the log message to guess at how long we need to
observe again before declaring victory =)

I'll send the overnight logs in the morning so you can analyze those.  Do
you prefer edison (simpler queue structure, bigger system) or cori (complex
queue structure, small system)?

-Doug

Comment 40 Moe Jette 2016-03-09 09:12:46 MST
Did you get this in too? It should fix the main scheduling bug (draining nodes that should be OK to use).

https://github.com/SchedMD/slurm/commit/fec5e03b900aebba21199a922cb75acf182a6abc.patch
Comment 41 Doug Jacobsen 2016-03-09 09:17:59 MST
Oops -- actually, I only got the patch you requested Moe, I missed the one
from Tim since it was an attachment rather than a github patch.  I'll take
a look at Tim's as well.

-Doug

Comment 42 Brian F Gilmer 2016-03-09 22:51:27 MST
Moe,

Would the scenario in comment #37 also impact DataWarp burst buffer (BB) cleanup?

Here is the speculation:
- Slurm never called 'dw_wlm_cli -f data_out' or 'dw_wlm_cli -f teardown'.  It stopped interacting with DataWarp for this batch job after 'dw_wlm_cli -f post_run' failed

This was a side investigation on a DW bug.  The situation was complex but it included nodes going down which would have resulted in jobs being terminated then re-queued.  If the patch covers the DW situation I will not file a separate bug.
Comment 43 Tim Wickberg 2016-03-10 00:13:54 MST
(In reply to Doug Jacobsen from comment #41)
> Oops -- actually, I only got the patch you requested Moe, I missed the one
> from Tim since it was an attachment rather than a github patch.  I'll take
> a look at Tim's as well.

Moe's is the important one by far; mine just prints some additional info so I can continue to hunt down the GRES underflow issue.

Logs from either are fine assuming they trigger the underflow message again.
Comment 44 Doug Jacobsen 2016-03-10 00:49:50 MST
Moe - your patch is on both edison and cori
Tim - I only had time yesterday for cori:

nid00837:~ # grep underflow /var/tmp/slurm/slurmctld.log
[2016-03-10T00:02:27.471] error: gres/craynetwork: _step_dealloc step 1303535.16 dealloc count underflow 0 < 1
[2016-03-10T00:02:27.471] error: gres/craynetwork: _step_dealloc step 1303535.16 dealloc count underflow 0 < 1
[2016-03-10T00:02:38.370] error: gres/craynetwork: _step_dealloc step 1304123.1 dealloc count underflow 0 < 1
[2016-03-10T00:02:39.881] error: gres/craynetwork: _step_dealloc step 1304124.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:53.627] error: gres/craynetwork: _step_dealloc step 1265260.460 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:03:01.136] error: gres/craynetwork: _step_dealloc step 1304107.0 dealloc count underflow 0 < 1
[2016-03-10T00:03:01.136] error: gres/craynetwork: _step_dealloc step 1304107.0 dealloc count underflow 0 < 1
[2016-03-10T00:03:13.727] error: gres/craynetwork: _step_dealloc step 1265229.465 dealloc count underflow 0 < 1
[2016-03-10T00:03:28.436] error: gres/craynetwork: _step_dealloc step 1265236.461 dealloc count underflow 0 < 1
[2016-03-10T00:03:38.838] error: gres/craynetwork: _step_dealloc step 1304126.0 dealloc count underflow 0 < 1
[2016-03-10T00:03:38.839] error: gres/craynetwork: _step_dealloc step 1304126.0 dealloc count underflow 0 < 1
....

There are many more messages much the same since a little after midnight:
nid00837:~ # grep underflow /var/tmp/slurm/slurmctld.log | wc -l
1047
nid00837:~ # grep underflow /var/tmp/slurm/slurmctld.log | grep "0 < 1" | wc -l
1047
nid00837:~ #


Comment 45 Moe Jette 2016-03-10 01:03:58 MST
(In reply to Brian F Gilmer from comment #42)
> Moe,
> 
> Would the scenario in comment #37 also impact BB cleanup?  
> 
> Here is the speculation:
> - Slurm never called 'dw_wlm_cli -f data_out' or 'dw_wlm_cli -f teardown'. 
> It stopped interacting with DataWarp for this batch job after 'dw_wlm_cli -f
> post_run' failed
> 
> This was a side investigation on a DW bug.  The situation was complex but it
> included nodes going down which would have resulted in jobs being terminated
> then re-queued.  If the patch covers the DW situation I will not file a
> separate bug.


Thanks a lot for that information Brian. This patch will definitely not fix the DW bug, but the underlying problem could be similar. Here is a bit of the relevant logic:

if (bb_job->state < BB_STATE_POST_RUN) {
  bb_job->state = BB_STATE_POST_RUN;
  _queue_stage_out(bb_job);
}

If the job had a data_out operation and the state value was not reset, subsequent data_out operations would not take place. I do see the "state" value being reset for requeued, but I will look for ways this specific bit of code might not be executed:
if (bb_job->state == BB_STATE_COMPLETE)
  bb_job->state = BB_STATE_PENDING;     /* job requeued */
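The guard quoted above can be modeled in a few lines. This is a toy sketch in Python, not the actual Slurm source; the state names and ordering mirror the C enum in the snippet:

```python
# Toy model of the burst-buffer stage-out guard -- illustrative only,
# not the actual Slurm code.  State ordering mirrors the C enum.
from enum import IntEnum

class BBState(IntEnum):
    PENDING = 0
    STAGED_IN = 1
    RUNNING = 2
    POST_RUN = 3
    STAGING_OUT = 4
    COMPLETE = 5

def would_queue_stage_out(state: BBState) -> bool:
    """Mirror of `if (bb_job->state < BB_STATE_POST_RUN)`."""
    return state < BBState.POST_RUN

# A normally completing job queues its stage-out...
assert would_queue_stage_out(BBState.RUNNING)
# ...but a requeued job whose state was never reset below POST_RUN
# silently skips data_out, matching the failure Brian described.
assert not would_queue_stage_out(BBState.POST_RUN)
```

This is why the state reset for requeued jobs matters: if the `BB_STATE_COMPLETE -> BB_STATE_PENDING` reset is skipped, the guard above never fires again for that job.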
Comment 46 Moe Jette 2016-03-10 02:08:41 MST
(In reply to Moe Jette from comment #45)
> (In reply to Brian F Gilmer from comment #42)
> > Moe,
> > 
> > Would the scenario in comment #37 also impact BB cleanup?  
> > 
> > Here is the speculation:
> > - Slurm never called 'dw_wlm_cli -f data_out' or 'dw_wlm_cli -f teardown'. 
> > It stopped interacting with DataWarp for this batch job after 'dw_wlm_cli -f
> > post_run' failed
> > 
> > This was a side investigation on a DW bug.  The situation was complex but it
> > included nodes going down which would have resulted in jobs being terminated
> > then re-queued.  If the patch covers the DW situation I will not file a
> > separate bug.
> 
> 
> Thanks a lot for that information Brian. This patch will definitely not fix
> the DW bug, but the underlying problem could be similar. Here is a bit of
> the relevant logic:
> 
> if (bb_job->state < BB_STATE_POST_RUN) {
>   bb_job->state = BB_STATE_POST_RUN;
>   _queue_stage_out(bb_job);
> }
> 
> If the job had a data_out operation and the state value was not reset,
> subsequent data_out operations would not take place. I do see the "state"
> value being reset for requeued, but I will look for ways this specific bit
> of code might not be executed:
> if (bb_job->state == BB_STATE_COMPLETE)
>   bb_job->state = BB_STATE_PENDING;     /* job requeued */

The first job can definitely mess with the second (i.e. requeued) job's burst buffer state information and cause a couple of failure modes.

A stage-in would not happen for a requeued job if the stage-out for the first execution was still in progress at the time of the second resource allocation. The logic was checking the job's buffer state (in Slurm's tables), finding the state was >= staged-in (e.g. staging-out), and treating that as if the stage-in were complete so the job could restart.

Then when the first job's stage-out finally completes, it changes the buffer state (in Slurm's tables) for that job to "teardown", which prevents the second stage-out from happening. 

The fix was relatively simple and prevents the job from starting a second time until the first stage-out has completed.

We'll get this fixed in version 15.08.9. The commit is here:
https://github.com/SchedMD/slurm/commit/3bb2e602bfa1161427384145486b47f506457414
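For illustration, the shape of the race and of the fix can be sketched as follows. The function names are hypothetical; the real change is in the commit linked above:

```python
# Sketch of the requeue/stage-out race described above -- hypothetical
# names, illustrative only.
from enum import IntEnum

class BBState(IntEnum):
    PENDING = 0
    STAGING_IN = 1
    STAGED_IN = 2
    RUNNING = 3
    STAGING_OUT = 4
    COMPLETE = 5

def can_start_before_fix(state: BBState) -> bool:
    # Old check: any state >= STAGED_IN (including STAGING_OUT) looked
    # like "stage-in already done", so the requeued job could restart
    # while the first execution's stage-out was still running.
    return state >= BBState.STAGED_IN

def can_start_after_fix(state: BBState) -> bool:
    # Fixed check: only an actually staged-in buffer allows a (re)start;
    # an in-progress stage-out from the first execution blocks it.
    return state == BBState.STAGED_IN

assert can_start_before_fix(BBState.STAGING_OUT)      # the bug
assert not can_start_after_fix(BBState.STAGING_OUT)   # the fix
assert can_start_after_fix(BBState.STAGED_IN)         # normal start still works
```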
Comment 47 Doug Jacobsen 2016-03-10 07:36:06 MST
cori has the patch (both), but just now I needed to down and up the partitions to get a job to run in a reservation.

here are the errors in the log:

nid00837:~ # grep error: /var/tmp/slurm/slurmctld.log | grep -v CONF_HASH | grep -v underflow | grep -v Zero | grep -v "failed to assign" | grep -v realtime
[2016-03-10T00:02:20.295] error: Partition system Nodes differs from slurm.conf
[2016-03-10T00:02:20.295] error: Partition debug State differs from slurm.conf
[2016-03-10T00:02:20.295] error: Partition regular State differs from slurm.conf
[2016-03-10T00:02:20.295] error: Partition regularx Nodes differs from slurm.conf
[2016-03-10T00:02:20.295] error: Partition regularx State differs from slurm.conf
[2016-03-10T00:02:20.295] error: Partition shared State differs from slurm.conf
[2016-03-10T00:59:24.227] error: select_p_job_fini: Cleaned flag already set for job 1276822, this should never happen
[2016-03-10T01:46:52.180] error: slurm_receive_msgs: Transport endpoint is not connected
[2016-03-10T07:13:04.219] error: Nodes nid00296 not responding
[2016-03-10T07:14:44.421] error: Nodes nid00296 not responding
[2016-03-10T07:16:24.021] error: Nodes nid00296 not responding
[2016-03-10T07:18:04.487] error: Nodes nid00296 not responding, setting DOWN
[2016-03-10T09:58:54.627] error: select_p_job_fini: Cleaned flag already set for job 1265676, this should never happen
[2016-03-10T10:45:55.861] error: Node ping apparently hung, many nodes may be DOWN or configured SlurmdTimeout should be increased
[2016-03-10T11:31:10.685] error: _bb_get_sessions: show_sessions status:256 response:dwget - failed client status code %s 404
[2016-03-10T11:31:10.685] error: _bb_get_sessions: json parser failed on dwget - failed client status code %s 404
[2016-03-10T11:31:43.432] error: _load_state: Unexpected burst buffer found: andreyBB
[2016-03-10T11:31:43.469] error: _load_state: Unexpected burst buffer found: MelTest2
[2016-03-10T11:31:43.547] error: _load_state: Unexpected burst buffer found: DJBmlrefine19
[2016-03-10T11:31:43.623] error: _load_state: Unexpected burst buffer found: XuanCami
[2016-03-10T11:31:43.703] error: _load_state: Unexpected burst buffer found: myBBname
[2016-03-10T11:31:43.859] error: _load_state: Unexpected burst buffer found: gatkbrachy_2
[2016-03-10T11:31:44.011] error: _load_state: Unexpected burst buffer found: mlrefine19
[2016-03-10T11:31:44.087] error: _load_state: Unexpected burst buffer found: 1306493
[2016-03-10T11:31:44.087] error: _load_state: Unexpected burst buffer found: 1306678
[2016-03-10T11:31:44.087] error: _load_state: Unexpected burst buffer found: 1306686
[2016-03-10T13:54:30.287] error: find_node_record: lookup failure for 09
[2016-03-10T13:55:07.278] error: find_node_record: lookup failure for 08
nid00837:~ #



nid00296 is in the reservation.  Also, I see the "cleaned flag already set" error is still present.

Is slurmctld the patched build?

nid00837:~ # ls -l /proc/$(pgrep slurmctld)/exe
lrwxrwxrwx 1 root root 0 Mar  9 16:27 /proc/4038/exe -> /opt/slurm/15.08.8_nhc_requeue_gresunderflow_debug_15.08.8_nhc_requeue_gresunderflow_debug_20160309162247/sbin/slurmctld
nid00837:~ #

yes


Will send more logs after this reservation testing period is complete.
Comment 48 Moe Jette 2016-03-10 08:41:09 MST
There are a couple of error messages that particularly concern me:
> [2016-03-10T09:58:54.627] error: select_p_job_fini: Cleaned flag already set
> for job 1265676, this should never happen

Can you grep for "1265676" in your slurmctld log files? I really want to determine how this happened.


> [2016-03-10T11:31:10.685] error: _bb_get_sessions: show_sessions status:256
> response:dwget - failed client status code %s 404
> [2016-03-10T11:31:10.685] error: _bb_get_sessions: json parser failed on
> dwget - failed client status code %s 404

Could you file a bug report with Cray that the execution of "dw_wlm_cli --function show_sessions" returned "dwget - failed client status code %s 404"? At best it's a badly formatted message. I'll study Slurm's error handling for this.


Finally, when this happens again look for hung "xtcleanup_after" processes as children of slurmctld. The resources allocated to a job are not completely released until the NHC completes or slurmctld gets restarted/reconfigured, which could prevent jobs from getting started.
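One way to automate that check is a small /proc scan. This is a sketch assuming a Linux controller node; `children_matching` is a hypothetical helper, not a Slurm or Cray tool:

```python
# Sketch of the check suggested above: find "xtcleanup_after" processes
# whose parent is slurmctld.  Assumes a Linux /proc filesystem.
import os

def children_matching(parent_name, needle):
    """Return (pid, cmdline) for children of any `parent_name` process
    whose command line contains `needle`."""
    if not os.path.isdir("/proc"):
        return []
    procs = {}
    for entry in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
            with open(f"/proc/{entry}/cmdline") as f:
                cmd = f.read().replace("\0", " ").strip()
        except OSError:
            continue  # process exited while we were scanning
        # comm is parenthesized in /proc/<pid>/stat and may contain spaces
        name = stat[stat.index("(") + 1 : stat.rindex(")")]
        ppid = int(stat[stat.rindex(")") + 1 :].split()[1])
        procs[int(entry)] = (name, ppid, cmd)
    parents = {pid for pid, (name, _, _) in procs.items() if name == parent_name}
    return [(pid, cmd) for pid, (_, ppid, cmd) in procs.items()
            if ppid in parents and needle in cmd]

# On the controller: children_matching("slurmctld", "xtcleanup_after")
```

A long-lived entry in that list would indicate a hung NHC run still holding the job's resources.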
Comment 49 David Paul 2016-03-10 08:53:18 MST
The job referenced did not use Datawarp.

nid00837:/var/tmp/slurm # grep 1265676 slurmctld.log
[2016-03-10T09:58:53.556] debug:  Application 1265676 not found in app list
[2016-03-10T09:58:53.848] error: gres/craynetwork: _step_dealloc step 1265676.0 dealloc count underflow 0 < 1
[2016-03-10T09:58:53.848] debug:  freed ports 63749 for step 1265676.0
[2016-03-10T09:58:54.627] job_complete: JobID=1265676 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2016-03-10T09:58:54.627] error: select_p_job_fini: Cleaned flag already set for job 1265676, this should never happen
[2016-03-10T09:58:54.627] job_complete: JobID=1265676 State=0x8003 NodeCnt=1 done

nid00837:/var/tmp/slurm # sacct -j 1265676 --format=JobID,User,Partition,Start,End,Elapsed,State,ExitCode,ReqMem,ReqTRES%40
       JobID      User  Partition               Start                 End    Elapsed      State ExitCode     ReqMem                                  ReqTRES 
------------ --------- ---------- ------------------- ------------------- ---------- ---------- -------- ---------- ---------------------------------------- 
1265676        jllyons    regular 2016-03-09T03:36:51 2016-03-10T09:58:54 1-06:22:03  COMPLETED      0:0      122Gn                             cpu=1,node=1 
1265676.bat+                      2016-03-09T03:36:51 2016-03-10T09:58:54 1-06:22:03  COMPLETED      0:0      122Gn                                          
1265676.0                         2016-03-09T03:36:55 2016-03-10T09:58:53 1-06:21:58  COMPLETED      0:0      122Gn                                          
nid00837:/var/tmp/slurm #
Comment 50 Doug Jacobsen 2016-03-10 08:57:56 MST
Please note that this is not a DW ticket; the issues found related to DW should probably move to a different ticket so as not to distract from the scheduling issue herein.

Comment 51 Doug Jacobsen 2016-03-14 05:09:15 MDT
Hello,

I'm unsure whether this is related to the recent patch adjusting when to call the NHC; however, since applying that patch on cori we've been getting about 20 instances per day of nodes going into the suspect state after failing the "reservation" test, then recovering a few minutes later.  It seems there is a race condition in some of the Cray cleanup software relative to when the NHC runs.

I sent the following information to Cray in case 133221, and I bring it up here as a possible race condition in the running of the NHC.

Message sent to Cray:


Hello,

We're getting ~20 nodes per day going into the suspect state for a few minutes and then recovering.  This is disruptive to SLURM scheduling since, during those moments, the node is in the drained state.  We are running SLURM 15.08.8 with patch https://github.com/SchedMD/slurm/commit/fec5e03b900aebba21199a922cb75acf182a6abc (which adjusts the timing of running the NHC in some scenarios).  This issue started after we applied the above patch, which, as stated, slightly changes the timing of running the NHC.

Can you please explain what this NHC check is doing?  What is responsible for cleaning up these files?  Why is there an apparent race condition between job cleanup and /proc/reservations cleanup?  Is there some alpscomm function that SLURM needs to be calling to ensure this is properly cleaned up?

excerpt of console log:
crayadm@corismw1:/var/opt/cray/log/p0-20160227t205932> grep c1-0c2s11n0 console-20160314
2016-03-14T09:58:47.930170-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1318312 (xtnhc) FAILURES: The following tests from the 'reservation' set have failed in normal mode:
2016-03-14T09:58:47.930223-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1318312 (Reservation_Test) WARNING: Directory /proc/reservations/1318312 still exists
2016-03-14T09:58:47.968703-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1318312 (Reservation_Test) WARNING: Reservation 1318312 has status: rid 1318312 flags ENDED jobs 1563368097136
2016-03-14T09:58:47.968721-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1318312 (Reservation_Test) WARNING: Reservation 1318312 still has the following object types: files
2016-03-14T09:58:48.004575-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1318312 (xtnhc) FAILURES: (Admindown) Reservation_Test
2016-03-14T09:58:48.004596-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1318312 (xtnhc) FAILURES: End of list of 1 failed test(s)
2016-03-14T09:58:49.214150-07:00 c4-0c1s1n1 <node_health:5.1> RESID:1318312 (xtcheckhealth) WARNING: Set node 364 (c1-0c2s11n0) to suspect because the node failed a health test.
2016-03-14T10:10:02.772452-07:00 c4-0c1s1n1 <node_health:5.1> RESID:1318312 (xtcheckhealth) WARNING: Set node 364 (c1-0c2s11n0) to up.
2016-03-14T10:27:35.471257-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1336628 (Free_Memory_Check) WARNING: Memory free at boot: 129429352 kB; Memory currently free: 128512692 kB; Difference 916660 kB > Limit 614400 kB
2016-03-14T10:27:35.496514-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1336628 (xtnhc) FAILURES: The following tests from the 'reservation' set have failed in normal mode:
2016-03-14T10:27:35.496728-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1336628 (xtnhc) FAILURES: (Log Only ) Free_Memory_Check
2016-03-14T10:27:35.496828-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1336628 (xtnhc) FAILURES: End of list of 1 failed test(s)
2016-03-14T10:31:36.483695-07:00 c1-0c2s11n0 Lustre: 683:0:(vvp_io.c:921:vvp_io_fault_start()) binary [0x2000608eb:0x1d245:0x0] changed while waiting for the page fault lock
crayadm@corismw1:/var/opt/cray/log/p0-20160227t205932>


Excerpt from messages log:
crayadm@corismw1:/var/opt/cray/log/p0-20160227t205932> grep c1-0c2s11n0 messages-20160314
<4>1 2016-03-14T09:58:49.174840-07:00 c4-0c1s1n1 kernel - p0-20160227t205932 - <node_health:5.1> RESID:1318312 (xtcheckhealth) WARNING: Set node 364 (c1-0c2s11n0) to suspect because the node failed a health test.
<166>1 2016-03-14T09:58:49.174848-07:00 c4-0c1s1n1 xtcheckhealth 1843 p0-20160227t205932 [nhc@34] <node_health:5.1> RESID:1318312 (xtcheckhealth) WARNING: Set node 364 (c1-0c2s11n0) to suspect because the node failed a health test.
<4>1 2016-03-14T10:10:02.747019-07:00 c4-0c1s1n1 kernel - p0-20160227t205932 - <node_health:5.1> RESID:1318312 (xtcheckhealth) WARNING: Set node 364 (c1-0c2s11n0) to up.
<166>1 2016-03-14T10:10:02.747038-07:00 c4-0c1s1n1 xtcheckhealth 7400 p0-20160227t205932 [nhc@34] <node_health:5.1> RESID:1318312 (xtcheckhealth) WARNING: Set node 364 (c1-0c2s11n0) to up.
crayadm@corismw1:/var/opt/cray/log/p0-20160227t205932>

Excerpt from slurmctld log (a few different ways):

nid00837:~ # grep 1318312 /var/tmp/slurm/slurmctld.log
...
[2016-03-14T09:05:44.494] backfill: Started JobId=1318312 in regular on nid0[0284-0295,0336-0343,0360-0367,0564-0575,1944-1947,1952-1959,1968-1975,2024-2027]
[2016-03-14T09:55:46.949] Time limit exhausted for JobId=1318312
[2016-03-14T09:55:47.909] job_complete: JobID=1318312 State=0x8006 NodeCnt=1 WTERMSIG 15
nid00837:~ #

nid00837:~ # grep nid00364 /var/tmp/slurm/slurmctld.log | grep -v HASH
...
[2016-03-14T09:58:48.098] update_node: node nid00364 reason set to: suspect by NHC
[2016-03-14T09:58:48.099] update_node: node nid00364 state set to DRAINED
[2016-03-14T10:10:01.674] update_node: node nid00364 state set to IDLE
...
nid00837:~ #



Thank you,
Doug Jacobsen
Comment 52 Moe Jette 2016-03-15 03:27:26 MDT
(In reply to Doug Jacobsen from comment #51)
> I'm unsure if this is related to the recent patch for adjusting when-to-call
> the NHC, however since applying that patch on cori we've been getting about
> 20 instances per day of nodes going into suspect failing the "reservation"
> test and then recovering a few minutes later.  It seems there is a race
> condition in some of the cray cleanup software relative to when the NHC runs.

Let me expand a bit on the commit message (see
https://github.com/SchedMD/slurm/commit/fec5e03b900aebba21199a922cb75acf182a6abc):
The previous Slurm logic would only run the reservation/job NHC once for any job. The new logic executes the reservation/job NHC at the end of each execution of a given job ID. For a requeued job, that can easily mean the NHC runs more than once for the same job ID on the same node(s), but each run represents an independent execution of the job. I would guess that Cray's software treats these repeated runs as a problem.
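The behavioral difference can be modeled in a few lines (illustrative only; `nhc_runs` is a hypothetical name, not a Slurm function):

```python
# Toy model of the change in NHC invocation: the old logic ran the
# reservation/job NHC once per job ID, the new logic runs it once per
# execution of that job ID (so a requeued job triggers it twice).
def nhc_runs(executions, once_per_job_id):
    """Count NHC invocations for a list of (job_id, execution_no) tuples."""
    seen = set()
    runs = 0
    for job_id, _exec_no in executions:
        if once_per_job_id and job_id in seen:
            continue  # old behavior: skip repeats of the same job ID
        seen.add(job_id)
        runs += 1
    return runs

# A job that is requeued once has two executions under the same job ID:
history = [(1318312, 1), (1318312, 2)]
assert nhc_runs(history, once_per_job_id=True) == 1   # old logic
assert nhc_runs(history, once_per_job_id=False) == 2  # new logic
```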
Comment 53 Moe Jette 2016-07-01 10:33:34 MDT
Have you upgraded to Slurm version 16.05?
Has this problem gone away?
If yes for both questions, we can probably close this ticket.
Comment 54 Tim Wickberg 2016-07-08 12:22:56 MDT
Doug - marking closed, as I assume this hasn't recurred since updating to 16.05.