| Summary: | large job (99% of system) failed to start unattended, needed to down other partitions | | |
|---|---|---|---|
| Product: | Slurm | Reporter: | Doug Jacobsen <dmjacobsen> |
| Component: | slurmctld | Assignee: | Tim Wickberg <tim> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | 3 - Medium Impact | | |
| Priority: | --- | CC: | brian.gilmer, dpaul, jfbotts, jsrinivasan |
| Version: | 15.08.8 | | |
| Hardware: | Cray XC | | |
| OS: | Linux | | |
| See Also: | https://bugs.schedmd.com/show_bug.cgi?id=2539 | | |
| Site: | NERSC | Slinky Site: | --- |
| Alineos Sites: | --- | Atos/Eviden Sites: | --- |
| Confidential Site: | --- | Coreweave sites: | --- |
| Cray Sites: | --- | DS9 clusters: | --- |
| Google sites: | --- | HPCnow Sites: | --- |
| HPE Sites: | --- | IBM Sites: | --- |
| NOAA Site: | --- | NoveTech Sites: | --- |
| Nvidia HWinf-CS Sites: | --- | OCF Sites: | --- |
| Recursion Pharma Sites: | --- | SFW Sites: | --- |
| SNIC sites: | --- | Tzag Elita Sites: | --- |
| Linux Distro: | --- | Machine Name: | cori |
| CLE Version: | | Version Fixed: | 15.08.9 |
| Target Release: | --- | DevPrio: | --- |
| Emory-Cloud Sites: | --- | | |
| Attachments: | slurmctld log; cori slurmctld log for 03/08; cori job_submit.lua; print values for gres underflow bug | | |
Description
Doug Jacobsen
2016-01-26 14:59:27 MST
Created attachment 2651 [details]
slurmctld log
Please note the job in question is job 66226. It should have been able to start as early as 19:55, but it waited until 20:42, when I downed the other partitions (regular and debug). The job was allocated by the time I could type sinfo after doing that. I don't see a log entry indicating the partitions were downed.

nid01605:~ # scontrol show config
Configuration data as of 2016-01-26T21:06:27
AccountingStorageBackupHost = edique02
AccountingStorageEnforce = associations,limits,qos,safe
AccountingStorageHost = edique01
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageTRES = node,cpu,mem,energy,node
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType = acct_gather_energy/cray
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInfinibandType = acct_gather_infiniband/none
AcctGatherNodeFreq = 0 sec
AcctGatherProfileType = acct_gather_profile/none
AllowSpecResourcesUsage = 1
AuthInfo = (null)
AuthType = auth/munge
BackupAddr = 128.55.143.33
BackupController = edique01
BatchStartTimeout = 60 sec
BOOT_TIME = 2016-01-24T12:09:50
BurstBufferType = (null)
CacheGroups = 0
CheckpointType = checkpoint/none
ChosLoc = (null)
ClusterName = edison
CompleteWait = 0 sec
ControlAddr = nid01605
ControlMachine = nid01605
CoreSpecPlugin = core_spec/cray
CpuFreqDef = OnDemand
CpuFreqGovernors = OnDemand
CryptoType = crypto/munge
DebugFlags = Backfill
DefMemPerNode = UNLIMITED
DisableRootJobs = Yes
EioTimeout = 60
EnforcePartLimits = Yes
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
ExtSensorsType = ext_sensors/none
ExtSensorsFreq = 0 sec
FairShareDampeningFactor = 1
FastSchedule = 1
FirstJobId = 1
GetEnvTimeout = 2 sec
GresTypes = craynetwork
GroupUpdateForce = 0
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckNodeState = ANY
HealthCheckProgram = (null)
InactiveLimit = 600 sec
JobAcctGatherFrequency = 0
JobAcctGatherType = jobacct_gather/cgroup
JobAcctGatherParams = (null)
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm_jobcomp.log
JobCompPort = 0
JobCompType = jobcomp/none
JobCompUser = root
JobContainerType = job_container/cncu
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 0
JobSubmitPlugins = cray,lua
KeepAliveTime = SYSTEM_DEFAULT
KillOnBadExit = 1
KillWait = 30 sec
LaunchParameters = (null)
LaunchType = launch/slurm
Layouts =
Licenses = scratch1:1000000,scratch2:1000000,scratch3:1000000,dna:1000000,project:1000000,projecta:1000000,projectb:1000000
LicensesUsed = projectb:0/1000000,projecta:0/1000000,project:0/1000000,dna:0/1000000,scratch3:0/1000000,scratch2:0/1000000,scratch1:0/1000000
MailProg = /bin/mail
MaxArraySize = 65000
MaxJobCount = 200000
MaxJobId = 2147418112
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 128
MemLimitEnforce = Yes
MessageTimeout = 60 sec
MinJobAge = 300 sec
MpiDefault = openmpi
MpiParams = ports=63001-64000
MsgAggregationParams = (null)
NEXT_JOB_ID = 72667
OverTimeLimit = 0 min
PluginDir = /opt/slurm/default/lib/slurm
PlugStackConfig = /opt/slurm/etc/plugstack.conf
PowerParameters = (null)
PowerPlugin =
PreemptMode = REQUEUE
PreemptType = preempt/qos
PriorityParameters = (null)
PriorityDecayHalfLife = 8-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 21-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 30240
PriorityWeightFairShare = 1440
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 38880
PriorityWeightTRES = node=2880
PrivateData = none
ProctrackType = proctrack/cray
Prolog = /software/gpfs_dvsRemount_prolog.sh
PrologEpilogTimeout = 65534
PrologSlurmctld = (null)
PrologFlags = Alloc
PropagatePrioProcess = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram = (null)
ReconfigFlags = (null)
RequeueExit = (null)
RequeueExitHold = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvEpilog = (null)
ResvOverRun = 0 min
ResvProlog = (null)
ReturnToService = 1
RoutePlugin = (null)
SallocDefaultCommand = srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --gres=craynetwork:0 --mpi=none $SHELL
SchedulerParameters = no_backup_scheduling,bf_window=2880,bf_resolution=120,bf_max_job_array_resv=20,default_queue_depth=400,bf_max_job_test=6000,bf_continue,bf_interval=150,bf_max_job_part=1000,bf_max_job_user=12,nohold_on_prolog_fail,kill_invalid_depend
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/backfill
SelectType = select/cray
SelectTypeParameters = CR_SOCKET_MEMORY,OTHER_CONS_RES
SlurmUser = root(0)
SlurmctldDebug = debug
SlurmctldLogFile = /var/tmp/slurm/slurmctld.log
SlurmctldPort = 6817
SlurmctldTimeout = 120 sec
SlurmdDebug = info
SlurmdLogFile = /var/spool/slurmd/%h.log
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPlugstack = (null)
SlurmdPort = 6818
SlurmdSpoolDir = /var/spool/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogFile = (null)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SlurmctldPlugstack = (null)
SLURM_CONF = /opt/slurm/etc/slurm.conf
SLURM_VERSION = 15.08.7
SrunEpilog = (null)
SrunPortRange = 60001-63000
SrunProlog = (null)
StateSaveLocation = /global/syscom/sc/nsg/var/edison-slurm-state
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/cray
TaskEpilog = (null)
TaskPlugin = task/cgroup,task/cray
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyParam = (null)
TopologyPlugin = topology/none
TrackWCKey = No
TreeWidth = 18
UsePam = 0
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec

Slurmctld(primary/backup) at nid01605/edique01 are UP/UP
nid01605:~ #

There are three other things worth mentioning here. These might really be other issues, though, especially (3).

(1) At the same time (and, in fact, the whole time the system was reserving resources for these big jobs), almost all jobs in the system were marked with a Reason of "(Resources)"; i.e., almost none of the 1600 jobs in the queue had reason "(Priority)". This is similar to bug #2285.

(2) There is output like this in slurmctld.log:

[2016-01-26T20:48:28.788] debug: Job 70406 still has 1 active steps

This job has many steps, but all of them were marked as completed in the Slurm accounting database, and the job was done just prior to noon.
However, there is still an active entry in slurmctld:

nid01605:~ # scontrol show job 70406
JobId=70406 JobName=run_pwc_edison.sh
   UserId=kissel(53203) GroupId=kissel(53203)
   Priority=32890 Nice=0 Account=xpress QOS=normal_regular_0
   JobState=NODE_FAIL Reason=NodeDown Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:1
   RunTime=00:17:58 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2016-01-26T09:58:09 EligibleTime=2016-01-26T09:58:09
   StartTime=2016-01-26T11:40:50 EndTime=2016-01-26T11:58:48
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=regular AllocNode:Sid=edison07:56529
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=nid0[0297-0323,0328-0383,0388-0451,0456-0511,0516-0522,1776-1919,1924-1987,1992-2047,2052-2201,2410-2431,2436-2687,2894-3051,3476-3523,3528-3775,4397-4607,4992-5262,5728-5759,5764-5827,5832-5949]
   BatchHost=nid00297
   NumNodes=2048 NumCPUs=98304 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=98304,mem=132143104,node=2048
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=64523M MinTmpDiskNode=0
   Features=(null) Gres=craynetwork:1 Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/global/homes/k/kissel/scripts/edison/run_pwc_edison.sh
   WorkDir=/global/u2/k/kissel/RESULTS/pwc_edison
   StdErr=/global/u2/k/kissel/RESULTS/pwc_edison/run_pwc_edison.%J.out
   StdIn=/dev/null
   StdOut=/global/u2/k/kissel/RESULTS/pwc_edison/run_pwc_edison.%J.out
   Power= SICP=0
nid01605:~ #

sacct has a record of the job and its many steps, but the batch step is missing. Also, step 51 apparently took a *very* long time to complete. And the job failed owing to a node failure (nid00297):

nid01605:~ # sacct -j 70406 --format=job,start,end,state
JobID        Start               End                 State
------------ ------------------- ------------------- ----------
70406        2016-01-26T11:40:50 2016-01-26T11:58:48 NODE_FAIL
70406.0      2016-01-26T11:40:55 2016-01-26T11:40:59 COMPLETED
70406.1      2016-01-26T11:41:02 2016-01-26T11:41:07 COMPLETED
70406.2      2016-01-26T11:41:09 2016-01-26T11:41:14 COMPLETED
70406.3      2016-01-26T11:41:17 2016-01-26T11:41:22 COMPLETED
70406.4      2016-01-26T11:41:24 2016-01-26T11:41:29 COMPLETED
70406.5      2016-01-26T11:41:31 2016-01-26T11:41:36 COMPLETED
70406.6      2016-01-26T11:41:39 2016-01-26T11:41:44 COMPLETED
70406.7      2016-01-26T11:41:46 2016-01-26T11:41:54 COMPLETED
70406.8      2016-01-26T11:41:56 2016-01-26T11:42:01 COMPLETED
70406.9      2016-01-26T11:42:03 2016-01-26T11:42:11 COMPLETED
70406.10     2016-01-26T11:42:13 2016-01-26T11:42:19 COMPLETED
70406.11     2016-01-26T11:42:21 2016-01-26T11:42:26 COMPLETED
70406.12     2016-01-26T11:42:28 2016-01-26T11:42:34 COMPLETED
70406.13     2016-01-26T11:42:36 2016-01-26T11:42:42 COMPLETED
70406.14     2016-01-26T11:42:44 2016-01-26T11:42:49 COMPLETED
70406.15     2016-01-26T11:42:51 2016-01-26T11:42:59 COMPLETED
70406.16     2016-01-26T11:43:02 2016-01-26T11:43:07 COMPLETED
70406.17     2016-01-26T11:43:09 2016-01-26T11:43:17 COMPLETED
70406.18     2016-01-26T11:43:20 2016-01-26T11:43:25 COMPLETED
70406.19     2016-01-26T11:43:27 2016-01-26T11:43:35 COMPLETED
70406.20     2016-01-26T11:43:38 2016-01-26T11:43:43 COMPLETED
70406.21     2016-01-26T11:43:45 2016-01-26T11:43:54 COMPLETED
70406.22     2016-01-26T11:43:54 2016-01-26T11:44:04 COMPLETED
70406.23     2016-01-26T11:44:04 2016-01-26T11:44:14 COMPLETED
70406.24     2016-01-26T11:44:16 2016-01-26T11:44:24 COMPLETED
70406.25     2016-01-26T11:44:24 2016-01-26T11:44:33 COMPLETED
70406.26     2016-01-26T11:44:35 2016-01-26T11:44:43 COMPLETED
70406.27     2016-01-26T11:44:45 2016-01-26T11:44:54 COMPLETED
70406.28     2016-01-26T11:44:56 2016-01-26T11:45:05 COMPLETED
70406.29     2016-01-26T11:45:07 2016-01-26T11:45:15 COMPLETED
70406.30     2016-01-26T11:45:18 2016-01-26T11:45:28 COMPLETED
70406.31     2016-01-26T11:45:30 2016-01-26T11:45:40 COMPLETED
70406.32     2016-01-26T11:45:43 2016-01-26T11:45:53 COMPLETED
70406.33     2016-01-26T11:45:55 2016-01-26T11:46:05 COMPLETED
70406.34     2016-01-26T11:46:07 2016-01-26T11:46:17 COMPLETED
70406.35     2016-01-26T11:46:19 2016-01-26T11:46:31 COMPLETED
70406.36     2016-01-26T11:46:34 2016-01-26T11:46:46 COMPLETED
70406.37     2016-01-26T11:46:48 2016-01-26T11:47:01 COMPLETED
70406.38     2016-01-26T11:47:03 2016-01-26T11:47:15 COMPLETED
70406.39     2016-01-26T11:47:18 2016-01-26T11:47:30 COMPLETED
70406.40     2016-01-26T11:47:32 2016-01-26T11:47:47 COMPLETED
70406.41     2016-01-26T11:47:49 2016-01-26T11:48:05 COMPLETED
70406.42     2016-01-26T11:48:07 2016-01-26T11:48:22 COMPLETED
70406.43     2016-01-26T11:48:24 2016-01-26T11:48:39 COMPLETED
70406.44     2016-01-26T11:48:42 2016-01-26T11:48:57 COMPLETED
70406.45     2016-01-26T11:48:59 2016-01-26T11:49:24 COMPLETED
70406.46     2016-01-26T11:49:26 2016-01-26T11:49:51 COMPLETED
70406.47     2016-01-26T11:49:54 2016-01-26T11:50:20 COMPLETED
70406.48     2016-01-26T11:50:23 2016-01-26T11:50:47 COMPLETED
70406.49     2016-01-26T11:50:49 2016-01-26T11:51:14 COMPLETED
70406.50     2016-01-26T11:51:16 2016-01-26T11:51:54 CANCELLED+
70406.51     2016-01-26T11:51:58 2016-01-26T13:21:01 CANCELLED+
nid01605:~ #

nid01605:~ # sinfo -R
REASON              USER     TIMESTAMP           NODELIST
Not responding      root     2016-01-26T11:58:48 nid00297
...

So I'm guessing the batch step isn't showing up because the lowest-numbered nid failed (i.e., the node where the batch step was running). But could the node failure, plus the errant entry of a still-running step (potentially either .batch or .51), have resulted in the large job not starting? Possibly owing to some phantom notion that some of the nodes from this job were somehow still engaged despite being marked idle?

(3) Once the job started running, smaller task requests worked, but the largest steps consistently fail. E.g., user script:

#SBATCH -N 5462

srun -n 32 ../src/hpsr.${PETSC_ARCH}.ex ${ARGS} >& outc_000032_nx032_sr3
srun -n 256 ../src/hpsr.${PETSC_ARCH}.ex ${ARGS} >& outc_000256_nx032_sr3
srun -n 2048 ../src/hpsr.${PETSC_ARCH}.ex ${ARGS} >& outc_002048_nx032_sr3
srun -n 16384 ../src/hpsr.${PETSC_ARCH}.ex ${ARGS} >& outc_016384_nx032_sr3
srun -n 131072 ../src/hpsr.${PETSC_ARCH}.ex ${ARGS} >& outc_131072_nx032_sr3

The 32- through 16384-task steps work fine. The 131072-task step (fully packing the 5462-node job) gives errors. Other users with similar situations have similar issues.

The job output shows PMI errors, but I'm guessing these errors could be caused if alpscomm, for example, is not fully aware of the job that is starting, e.g., if the srun is proceeding ahead of whatever aeld or other interactions are needed.

The user is getting output like:

...
Tue Jan 26 20:55:02 2016: [PE_60094]:_pmi_init:_pmi_mmap_init returned -1
Tue Jan 26 20:55:02 2016: [PE_89738]:_pmi_init:_pmi_mmap_init returned -1
Tue Jan 26 20:55:02 2016: [PE_18714]:_pmi_init:_pmi_mmap_init returned -1
Tue Jan 26 20:55:02 2016: [PE_44791]:_pmi_init:_pmi_mmap_init returned -1
Tue Jan 26 20:55:02 2016: [PE_89750]:_pmi_init:_pmi_mmap_init returned -1
Tue Jan 26 20:55:02 2016: [PE_89753]:_pmi_init:_pmi_mmap_init returned -1
Tue Jan 26 20:55:02 2016: [PE_22613]:_pmi_init:_pmi_mmap_init returned -1
Tue Jan 26 20:55:02 2016: [PE_75435]:_pmi_init:_pmi_mmap_init returned -1
Tue Jan 26 20:55:02 2016: [PE_17906]:_pmi_init:_pmi_mmap_init returned -1
...
[Tue Jan 26 20:55:17 2016] [c4-0c0s10n3] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(525):
MPID_Init(225).......: channel initialization failed
MPID_Init(598).......: PMI2 init failed: 1
[Tue Jan 26 20:55:17 2016] [c4-0c2s12n3] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(525):
MPID_Init(225).......: channel initialization failed
MPID_Init(598).......: PMI2 init failed: 1
[Tue Jan 26 20:55:17 2016] [c3-3c0s12n0] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(525):
MPID_Init(225).......: channel initialization failed
MPID_Init(598).......: PMI2 init failed: 1
[Tue Jan 26 20:55:17 2016] [c4-2c1s4n1] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(525):
MPID_Init(225).......: channel initialization failed
MPID_Init(598).......: PMI2 init failed: 1
[Tue Jan 26 20:55:17 2016] [c4-0c1s4n0] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(525):
MPID_Init(225).......: channel initialization failed
MPID_Init(598).......: PMI2 init failed: 1
[Tue Jan 26 20:55:17 2016] [c5-1c2s1n3] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(525):
MPID_Init(225).......: channel initialization failed
MPID_Init(598).......: PMI2 init failed: 1
[Tue Jan 26 20:55:17 2016] [c6-0c2s14n3] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(525):
...

srun: error: nid02748: tasks 62784-62801,62803-62807: Killed
srun: error: nid01098: tasks 24504-24525,24527: Killed
srun: error: nid01453: tasks 32736-32740,32742-32759: Killed
srun: error: nid05853: tasks 125787-125788,125791,125793,125795-125796,125799-125801,125803,125805: Killed
srun: error: nid00136: tasks 2688-2695,2697-2711: Killed
srun: error: nid04172: tasks 95736,95738,95740,95742,95744,95746,95748,95751,95754,95756-95757: Killed
srun: error: nid00509: tasks 10944-10956,10958-10967: Killed
srun: error: nid00868: tasks 19179-19180,19183,19187,19189,19191,19195,19197,19199: Killed
srun: error: nid04061: tasks 93072-93082,93084-93095: Killed
srun: error: nid00394: tasks 8280-8298,8300-8303: Killed
srun: error: nid05978: tasks 128688-128700,128702-128711: Killed
srun: error: nid01108: tasks 24744-24759,24761-24767: Killed
srun: error: nid04320: tasks 98640-98644,98646-98663: Killed
srun: error: nid02643: tasks 60360-60382: Killed
srun: error: nid05390: tasks 115104-115122,115124-115127: Killed
srun: error: nid04182: tasks 95977-95999: Killed
srun: error: nid00523: tasks 11185-11186,11189,11191,11193,11195,11197,11199,11201: Killed
srun: error: nid03719: tasks 85080-85090,85092-85103: Killed
srun: error: nid04071: tasks 93312-93318,93320-93335: Killed
....

In the slurmctld logs (already attached), I see that step 9 (one of the "big" steps) has this type of output upon allocation:

[2016-01-26T20:53:49.555] debug: gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
[2016-01-26T20:53:49.555] debug: gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
[2016-01-26T20:53:49.555] debug: gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
[2016-01-26T20:53:49.555] debug: gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
[2016-01-26T20:53:49.555] debug: gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
[2016-01-26T20:53:49.555] debug: gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
[2016-01-26T20:53:49.555] debug: gres/craynetwork: _step_test 66226.9 gres_bit_alloc is NULL
...
[2016-01-26T20:53:49.711] debug: gres/craynetwork: _step_alloc gres_bit_alloc for 66226.9 is NULL
[2016-01-26T20:53:49.711] debug: gres/craynetwork: _step_alloc gres_bit_alloc for 66226.9 is NULL
[2016-01-26T20:53:49.711] debug: gres/craynetwork: _step_alloc gres_bit_alloc for 66226.9 is NULL
[2016-01-26T20:54:00.227] debug: aeld sync event
[2016-01-26T20:54:00.231] debug: _start_session: Created aeld session fd 4
[2016-01-26T20:54:14.165] backfill: beginning
[2016-01-26T20:54:14.166] debug: backfill: 2 jobs to backfill
...

Step 4 (the other "big" step) did not have the aeld sync immediately after job start, but did have something like:

[2016-01-26T20:46:50.055] debug: gres/craynetwork: _step_alloc gres_bit_alloc for 66226.4 is NULL
[2016-01-26T20:46:50.055] debug: gres/craynetwork: _step_alloc gres_bit_alloc for 66226.4 is NULL
[2016-01-26T20:47:28.474] debug: Job 70406 still has 1 active steps
[2016-01-26T20:47:28.474] debug: sched: Running job scheduler
[2016-01-26T20:47:28.569] debug: sched: JobId=70601. State=PENDING. Reason=Priority, Priority=199000. Partition=regularx.
[2016-01-26T20:47:31.572] debug: cray: select_p_state_save
[2016-01-26T20:47:31.683] debug: backup controller responding

i.e., the select_p_state_save. I'm just trying to understand whether there is some issue preventing large steps from starting, particularly whether there is some timing or communication issue with the Cray alpscomm layer. I've filed a similar issue with Cray regarding the PMI errors.

-Doug

That failed job does seem like a possible culprit - the log indicates Slurm tried to schedule 66226 but continually pushed the expected start time back by three minutes. I'd love to know what it was blocked on, but we don't appear to log that currently.

I'm wondering if the craynetwork GRES may be causing problems - I'm looking into whether that downed node may have prevented the GRES from being freed on the other nodes, or whether there's some other interaction there that may have caused problems.

Can you provide further details on 66226? What was the exact node / CPU / GRES request initially, either from slurmctld.log or from sacct? Does it ask for only craynetwork:1, or does it want all two (or four?) slots on the node?

The "gres count underflow" warnings may also be related; I haven't seen a good reason for that count to be off.

Task launch is something I'll need to grab Danny to look at - how we interact with the Cray APIs is something I'm not familiar with, and there may be something obvious he can see here that I'm missing.

A future bug I will file has to do with all the craynetwork GRES-related messages -- they fill the logs up and seem to just be a matter of course (both the underflow messages and the "gres bit is NULL" messages on all job allocations). However, unless the GRES are somehow tied to the partitions, I don't understand why downing regular would allow a job in regularx to start. The failed job, however, was in regular.

I'll try to get the exact GRES later, but my memory is that the user was just asking for

#SBATCH -n 131088

so would get GRES=cpu=131088,node=5462.
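One way to pull that original request back out of accounting is a sacct query along these lines (a sketch; field availability varies by Slurm version, so treat ReqGRES in particular as an assumption for this release):

sacct -j 66226 -o JobID,NNodes,ReqCPUS,ReqGRES,State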
(In reply to Doug Jacobsen from comment #3)
> (3) Once the job started running, smaller task requests worked, but the largest jobs consistently fail.

Hey Doug,

I am not sure what is happening here; I have never seen this before. I've added Brian from Cray to this bug to see if he has any insights on the matter. If there is something Slurm needs to be doing, I am not sure what it is; before we launch any tasks, we set things up as we do for any other job. Perhaps there is a delay at this scale setting things up on the Cray side?

For the problem in comment #6, try turning the debugging on to get the ALPS exchanges:

ALPSLLI_DEBUG=1 srun -N 675 --ntasks-per-node=32 --unbuffered a.out %s

A Cray bug was filed and forwarded to the MPI PMI engineering team to look at. This has come up before on large jobs on large systems. We also saw this on the XC40 when the compute nodes were not all running the same version of Slurm.

===
This shows if there are different versions running:

> scontrol show hostname `sinfo -h -p workq --state=idle -o %N` | awk '{printf $1 ", "}' > nid.list
> srun -w nid.list srun -V | sort -u
slurm 14.11.8
slurm 14.11.7
> srun -w nid.list bash -c 'which srun' | sort -u
/opt/slurm/14.11.8/bin/srun
/opt/slurm/14.11.7/bin/srun

This command shows the nodes that have not reported to slurmctld:

scontrol -o show nodes | awk '/BootTime=None/ {print $1 " may be running the wrong version of Slurm"}'

Hi Brian,

Thanks for responding regarding the ALPS debugging - I'll discuss with the group when we can drain the system to run some more big jobs and try this.

Regarding different versions of Slurm, we've done a lot of work to engineer that into an almost impossible problem. This is mostly done by mounting /dsl/opt/slurm with different caching levels on the compute nodes (a 15s timeout for the attribute cache), and by moving the config into /opt/slurm/etc to prevent dsl caching issues persisting from symlink attribute caching. Also, the system was just rebooted owing to a site power issue (so even outside of that, it really isn't possible).
Anyway, I did run the command, and the only nodes which had BootTime=None are the nodes that are currently physically down.

-Doug

(In reply to Doug Jacobsen from comment #5)
> A future bug I will file has to do with all the craynetwork gres related messages

Danny's commit a2559f0 moves the gres_bit_alloc message to debug3 (300k lines out of 500k); it'll be in the next 15.08 point release if you don't want to patch it now. Commit 3c25beb also removes another minor nuisance message. We're still looking into the underflow message (48k lines).

> However, unless the GRES are somehow tied to the partitions, I don't understand why downing regular would allow a job in regularx to start. The failed job, however, was in regular.

I'm still working on reproducing this. It does seem like the failed job was responsible for stalling this. Exactly why marking the partition down finally resolved it is also an open question; the job should have terminated and released all allocated resources before you changed that.

Can you try running with this patch https://github.com/schedmd/slurm/commit/6f94bb7.patch for a while? That will at least give us some better log messages to help narrow down the underflow problem.

Hi Tim,

I'll review the patch and will probably apply it this evening. Thank you.

Regarding the srun crash, I got the following from Cray:

Bug 821264 has more details about PMI_MMAP_SYNC_WAIT_TIME. By default, PMI will wait up to 60 seconds for all the ranks on a given node to check in at launch. If all ranks do not arrive within 60 seconds, PMI will abort, because it thinks something went wrong.
You can increase the 60 second time limit by setting PMI_MMAP_SYNC_WAIT_TIME to a higher value. If SLURM is having a problem, setting this PMI value higher might give SLURM enough time to detect (and hopefully display) some error of its own.

They recommended setting PMI_MMAP_SYNC_WAIT_TIME to 300. I've asked a couple of targeted users to do so when their next big jobs start. I've also heard from a different source that this could, in some way, be related to the scalability of the PMI credential service, with too many tasks simultaneously making requests.

Hmm, I also realized that the users most affected by this are primarily using the /global/homes filesystem to host their executables. This filesystem may have less bandwidth to the system than the others, so running large executables without copying them first could cause the issue (i.e., it could take more than 60s for all ranks to start). I'm asking these users to try the --bcast option you recently added.

-Doug
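As a hedged illustration of combining the two workarounds above (the node and task counts are taken from the user script earlier in this bug; the stage-in path and executable name are illustrative):

#!/bin/bash
#SBATCH -N 5462

# Give PMI more time for all ranks to check in (the default is 60 seconds).
export PMI_MMAP_SYNC_WAIT_TIME=300

# Copy the executable to the compute nodes before launch instead of
# paging it in from /global/homes on every rank.
srun -n 131072 --bcast=/tmp/hpsr.ex ../src/hpsr.${PETSC_ARCH}.ex ${ARGS}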
(In reply to Doug Jacobsen from comment #12)
> hmm.. also I realized that the users most hit by this are using the /global/homes filesystem primarily to host their executables.

FYI: There are also sbcast and sgather commands available, which may serve the needs of some users better than srun's --bcast option (e.g. to copy an executable that is used for multiple srun invocations [job steps] within a single job allocation).

Hello,

This weekend a user was able to submit a full-scale job, and it ran successfully unattended (though I still checked on it). It may be that the original issue filed here -- that a large job did not start unattended until we downed the other partitions -- is related to a crashed job earlier in the day. I'd like to refocus on that issue. I can understand that perhaps some resource was not released when the head node of the earlier job crashed, though why downing the partition would resolve that and allow the target job to run, I'm unsure of.

-Doug
Glad to hear things are behaving okay, at least when the hardware is stable.

I'll continue investigating, although I'm not seeing an obvious culprit here, and I haven't been able to reproduce it so far.

I'm assuming that there's no relevant slurmd.log from around the time that node crashed? I'd be curious whether it has any more information that might narrow this down.

Sadly, I'm still logging to /var/tmp, so the frequent power outages have been destroying my logs -- you have what I have from that time period. Soon I'll have these logs persisted to disk via rsyslog.

I haven't had any luck sorting this out; absent some logs pointing toward the issue, I'm not going to make any headway. If this recurs, please reopen - hopefully it won't, or at least the logs will provide some insight then.

- Tim

Hello,
I found a similar situation last night on cori. It turned out that at least one node, nid00024, wasn't running jobs for the debug partition. This is the lowest-numbered node and thus seems to be a favorite of the backfill scheduler.
Anyway, for some reason jobs stopped starting on it (and possibly other nodes) until I downed the debug partition and then upped it.
The effect was that "large" debug jobs (say, 64 nodes or so) kept piling up, gaining priority over newly submitted ones, and we were getting very little throughput through debug. I noticed because there was a long debug queue and plenty of idle nodes.
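A quick way to see how the pending reasons break down for the debug partition (a sketch using standard squeue options; each output line is a count followed by a reason such as Resources or Priority):

squeue -p debug -t PD -h -o "%r" | sort | uniq -c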
From the following accounting you can see that nid00024 ran job 1283858, which ended at 00:42:58, but it didn't start anything more until 21:57:14 (job 1289982). I had downed and upped the partition right before that job started.
nid00837:/var/tmp/slurm # sacct --node=nid00024 --start=2016-03-08 --format=job,user,start,end,state -X
JobID User Start End State
------------ --------- ------------------- ------------------- ----------
1283858 lonardon 2016-03-08T00:32:37 2016-03-08T00:42:58 COMPLETED
1284114 psh 2016-03-08T00:01:47 2016-03-08T00:31:57 TIMEOUT
1289982 bwoo 2016-03-08T21:57:14 2016-03-08T22:27:30 TIMEOUT
1290514 totoni 2016-03-08T21:52:25 2016-03-08T21:57:00 FAILED
1290629 lzlwell 2016-03-08T22:28:20 2016-03-08T22:56:36 FAILED
1290631 lzlwell 2016-03-08T22:57:28 2016-03-08T23:24:30 FAILED
1290639 lzlwell 2016-03-08T23:25:28 2016-03-08T23:52:33 FAILED
1290651 lzlwell 2016-03-08T23:53:34 2016-03-09T00:21:47 FAILED
1290667 lzlwell 2016-03-09T00:21:57 2016-03-09T00:24:33 CANCELLED+
1291184 lonardon 2016-03-09T00:28:40 2016-03-09T00:44:10 COMPLETED
1291185 lonardon 2016-03-09T00:44:42 2016-03-09T01:00:40 COMPLETED
1291187 lonardon 2016-03-09T01:00:44 2016-03-09T01:16:26 COMPLETED
1291201 xiaobao 2016-03-09T01:16:46 2016-03-09T01:33:41 COMPLETED
1291214 xiaobao 2016-03-09T01:36:33 2016-03-09T01:52:56 COMPLETED
1291276 tbechtel 2016-03-09T01:53:47 2016-03-09T02:15:27 COMPLETED
1291277 tbechtel 2016-03-09T02:15:56 2016-03-09T02:29:57 COMPLETED
1291278 tbechtel 2016-03-09T02:31:14 2016-03-09T02:40:22 COMPLETED
1291449 lzlwell 2016-03-09T03:04:30 2016-03-09T03:32:11 FAILED
1291451 lzlwell 2016-03-09T03:33:11 2016-03-09T04:01:02 FAILED
1291453 lzlwell 2016-03-09T04:01:41 2016-03-09T04:25:58 FAILED
1291520 dpoletti 2016-03-09T02:41:59 2016-03-09T02:58:05 FAILED
1291621 dpoletti 2016-03-09T04:26:33 2016-03-09T04:27:30 CANCELLED+
1291623 dpoletti 2016-03-09T04:28:36 2016-03-09T04:54:49 COMPLETED
1291643 lgerhard 2016-03-09T05:00:40 2016-03-09T05:07:30 COMPLETED
1291644 lgerhard 2016-03-09T05:07:40 2016-03-09T05:13:10 COMPLETED
nid00837:/var/tmp/slurm #
So, while this isn't the same as the full-scale job requiring a down/up cycle, it does show that a node or group of nodes needed one in order to start scheduling again.
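For reference, the down/up cycle described above amounts to something like the following standard scontrol invocations (the exact commands used aren't recorded in this bug):

scontrol update PartitionName=debug State=DOWN
scontrol update PartitionName=debug State=UP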
I'll attach the (900 MB when expanded) slurmctld log for cori for yesterday.
-Doug
Created attachment 2840 [details]
cori slurmctld log for 03/08
Thanks for the logs; I'm poring through them now, trying to understand what's going on. One curiosity:

[2016-03-08T23:55:15.193] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2016-03-08T23:55:15.203] error: slurm_receive_msg: Zero Bytes were transmitted or received

You have this pair of error messages every 10 minutes throughout the log, which is odd. I'm guessing it's some rogue cron job on a node with an out-of-sync clock, or an invalid munge key? It's unrelated to this issue, but the fewer spurious errors I have to comb through, the easier it is to pinpoint real issues.

I have Cori's slurm.conf, but apparently that doesn't match the current running config:

[2016-03-08T00:01:46.114] error: Partition system Nodes differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition debug State differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition regular State differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition regularx Nodes differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition regularx State differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition realtime Nodes differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition realtime State differs from slurm.conf
[2016-03-08T00:01:46.114] error: Partition shared State differs from slurm.conf

Can you send me 'scontrol show part' for reference? I'm guessing this is just to remove the broken nodes from bug 2516?

Hi Tim,

I had noticed the 10-minute periodicity of the communication error; I'm unaware of the source. However, all the compute nodes are definitely running munge, almost certainly have the same clock (it's set by the Cray HSS system), and definitely do not run cron. Maybe it's something coming from our eslogin network; I'll check into that. I know the nodes are running munge because we don't have any unreachable nodes in cori that aren't shut down owing to hardware issues, and slurmd/slurmctld communication would break down quickly if munge faltered.

Regarding the partition state difference, that message comes up every morning when I rotate the logs (send SIGHUP to slurmctld). I believe it is because the partitions are up while the configuration has them down. We keep the partitions down by default so we can do some testing before user jobs run on the system following maintenances or (rarely) slurmctld crashes. Bug 2516 was referring to the edison system; I haven't tried to remove nodes from the cori system.

-Doug
(In reply to Doug Jacobsen from comment #25)
> I had noticed the 10 minute periodicity of the communication error; I'm unaware of the source

That message is indicative of receiving a message that we couldn't parse. It could originate from anything.

I think it's our nagios monitor looking to see if anybody is home.

(In reply to Doug Jacobsen from comment #27)
> I think it's our nagios monitor looking to see if anybody is home

Ah, that makes sense - if nagios is just opening the socket but not sending anything, that would explain it.

Moving on to real problems - do you mind sharing your lua job submit plugin? And I assume you're using the stock cray plugin?

Created attachment 2843 [details]
cori job_submit.lua
edison has a very similar job_submit.lua with slightly different parameters; someday we will merge them.
Yes, we use the stock cray plugin.

I have two theories about what could have happened. I'll share them in case someone can direct me to concentrate on one of them.

One possibility is a scheduling anomaly. Setting a partition DOWN takes jobs in that partition out of consideration. Given that there are overlapping partitions and a variety of limits, that could allow jobs to start that otherwise could not.

My second theory is that Cray's node health check (NHC) hung for one or more jobs. The NHC is initiated at job completion, and the resources associated with the job are not made available to other jobs until the NHC completes OR Slurm's select table records are rebuilt. The rebuild happens when the slurmctld daemon is reconfigured (SIGHUP, "scontrol reconfig", and some other configuration changes, like changing a partition's state). If the NHC terminated after the tables were rebuilt, that could result in some underflow messages (without the counters going negative). Could you look for some hung NHC processes? They would be named "xtcleanup_after" and would have been spawned by slurmctld (a quick check is sketched after this comment).

I believe the primary bug is fixed in this commit:
https://github.com/SchedMD/slurm/commit/fec5e03b900aebba21199a922cb75acf182a6abc

Specifically, each job has a "cleaning/cleaned" flag. Once a job terminates, the cleaning flag is set; after the job node health check completes, the value gets set to "cleaned". If the job is requeued, the select/cray plugin is called on its second (or subsequent) termination to launch the NHC. The plugin sees the "cleaned" flag already set, logs:

error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen

and returns, never launching the NHC. Since the termination of the job NHC triggers the release of job resources (CPUs, memory, and GRES), those resources are never released for use by other jobs. There were a total of 9 jobs with this problem reported in this log:

[2016-03-08T00:29:49.620] error: select_p_job_fini: Cleaned flag already set for job 1283977, this should never happen
[2016-03-08T00:32:27.509] error: select_p_job_fini: Cleaned flag already set for job 1283792, this should never happen
[2016-03-08T00:42:58.894] error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
[2016-03-08T02:15:24.429] error: select_p_job_fini: Cleaned flag already set for job 1277531, this should never happen
[2016-03-08T02:20:44.047] error: select_p_job_fini: Cleaned flag already set for job 1277599, this should never happen
[2016-03-08T04:57:42.142] error: select_p_job_fini: Cleaned flag already set for job 1276011, this should never happen
[2016-03-08T06:14:08.701] error: select_p_job_fini: Cleaned flag already set for job 1277054, this should never happen
[2016-03-08T17:46:46.781] error: select_p_job_fini: Cleaned flag already set for job 1277801, this should never happen
[2016-03-08T23:26:42.345] error: select_p_job_fini: Cleaned flag already set for job 1273184, this should never happen

I know that Tim is pursuing some other issues discovered in the course of our investigation of this bug.
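Regarding the hung-NHC check mentioned above, a quick look could be done with standard ps/pgrep (a sketch; the field selection is standard procps, and the PPID comparison against slurmctld is done by eye):

ps -eo pid,ppid,etime,comm | grep '[x]tcleanup_after'
pgrep -x slurmctld    # compare the PPIDs above against slurmctld's PID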
Created attachment 2845 [details]
print values for gres underflow bug
Doug -
Do you mind applying the attached patch? I'm trying to root out the source of the GRES underflow messages and knowing what the affected values are would help narrow down the problem considerably.
It's not necessarily related to the scheduling bug at heart here, but I've seen this on several other bugs and want to get this resolved. The error seems sporadic, and I haven't been able to trigger it on anything I have, but you seem to get a handful of jobs in such a state every day.
- Tim
OK, I've reviewed and applied the patch, and we're running it on both cori and edison. I'll check the past logs to see if I can estimate how frequently we were seeing the log message, to guess how long we need to observe before declaring victory =) I'll send the overnight logs in the morning so you can analyze them. Do you prefer edison (simpler queue structure, bigger system) or cori (complex queue structure, small system)?

-Doug

Did you get this in too? It should fix the main scheduling bug (draining nodes that should be OK to use).
https://github.com/SchedMD/slurm/commit/fec5e03b900aebba21199a922cb75acf182a6abc.patch

Oops -- actually, I only got the patch you requested, Moe; I missed the one from Tim since it was an attachment rather than a github patch. I'll take a look at Tim's as well.

-Doug

Moe,

Would the scenario in comment #37 also impact BB cleanup? Here is the speculation:

- Slurm never called 'dw_wlm_cli -f data_out' or 'dw_wlm_cli -f teardown'. It stopped interacting with DataWarp for this batch job after 'dw_wlm_cli -f post_run' failed.

This was a side investigation on a DW bug. The situation was complex, but it included nodes going down, which would have resulted in jobs being terminated and then requeued. If the patch covers the DW situation, I will not file a separate bug.

(In reply to Doug Jacobsen from comment #41)
> Oops -- actually, I only got the patch you requested Moe, I missed the one from Tim since it was an attachment rather than a github patch.

Moe's is the important one by far; mine just prints some additional info so I can continue to hunt down the GRES underflow issue. Logs from either system are fine, assuming they trigger the underflow message again.
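For anyone following along, applying one of these GitHub commit patches to a source tree looks roughly like this (a sketch; the directory name is illustrative):

curl -LO https://github.com/SchedMD/slurm/commit/fec5e03b900aebba21199a922cb75acf182a6abc.patch
cd slurm-15.08    # top of the Slurm source tree
patch -p1 < ../fec5e03b900aebba21199a922cb75acf182a6abc.patch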
Moe - your patch is on both edison and cori.
Tim - I only had time yesterday for cori:

nid00837:~ # grep underflow /var/tmp/slurm/slurmctld.log
[2016-03-10T00:02:27.471] error: gres/craynetwork: _step_dealloc step 1303535.16 dealloc count underflow 0 < 1
[2016-03-10T00:02:27.471] error: gres/craynetwork: _step_dealloc step 1303535.16 dealloc count underflow 0 < 1
[2016-03-10T00:02:38.370] error: gres/craynetwork: _step_dealloc step 1304123.1 dealloc count underflow 0 < 1
[2016-03-10T00:02:39.881] error: gres/craynetwork: _step_dealloc step 1304124.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:53.627] error: gres/craynetwork: _step_dealloc step 1265260.460 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:02:58.245] error: gres/craynetwork: _step_dealloc step 1283403.0 dealloc count underflow 0 < 1
[2016-03-10T00:03:01.136] error: gres/craynetwork: _step_dealloc step 1304107.0 dealloc count underflow 0 < 1
[2016-03-10T00:03:01.136] error: gres/craynetwork: _step_dealloc step 1304107.0 dealloc count underflow 0 < 1
[2016-03-10T00:03:13.727] error: gres/craynetwork: _step_dealloc step 1265229.465 dealloc count underflow 0 < 1
[2016-03-10T00:03:28.436] error: gres/craynetwork: _step_dealloc step 1265236.461 dealloc count underflow 0 < 1
[2016-03-10T00:03:38.838] error: gres/craynetwork: _step_dealloc step 1304126.0 dealloc count underflow 0 < 1
[2016-03-10T00:03:38.839] error: gres/craynetwork: _step_dealloc step 1304126.0 dealloc count underflow 0 < 1
....

There are many more messages much the same since a little after midnight:

nid00837:~ # grep underflow /var/tmp/slurm/slurmctld.log | wc -l
1047
nid00837:~ # grep underflow /var/tmp/slurm/slurmctld.log | grep "0 < 1" | wc -l
1047
nid00837:~ #
(In reply to Brian F Gilmer from comment #42)
> Would the scenario in comment #37 also impact BB cleanup?
>
> Here is the speculation:
> - Slurm never called 'dw_wlm_cli -f data_out' or 'dw_wlm_cli -f teardown'.
> It stopped interacting with DataWarp for this batch job after
> 'dw_wlm_cli -f post_run' failed.

Thanks a lot for that information, Brian. This patch will definitely not fix the DW bug, but the underlying problem could be similar. Here is a bit of the relevant logic:

    if (bb_job->state < BB_STATE_POST_RUN) {
        bb_job->state = BB_STATE_POST_RUN;
        _queue_stage_out(bb_job);
    }

If the job had a data_out operation and the state value was not reset, subsequent data_out operations would not take place. I do see the "state" value being reset for requeued jobs, but I will look for ways this specific bit of code might not be executed:

    if (bb_job->state == BB_STATE_COMPLETE)
        bb_job->state = BB_STATE_PENDING;    /* job requeued */

(In reply to Moe Jette from comment #45)

The first job can definitely mess with the second (i.e. requeued) job's burst buffer state information and cause a couple of failure modes. A stage-in would not happen for a requeued job if the stage-out from the first execution was still in progress at the time of the second resource allocation: the logic was checking the job's buffer state (in Slurm's tables), finding the state was >= staged-in (e.g. staging-out), and treating that as though the stage-in were complete so the job could restart. Then, when the first execution's stage-out finally completed, it changed the buffer state (in Slurm's tables) for that job to "teardown", which prevented the second stage-out from happening.

The fix was relatively simple and prevents the job from starting a second time until the first stage-out has completed. We'll get this fixed in version 15.08.9.
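In rough outline, the fix changes the start test for a requeued job from "state at least staged-in" to "staged-in but not yet draining". A hedged sketch of before/after -- state names modeled on the BB_STATE_* enum but simplified, functions invented, not the literal patch (see the commit linked below):

    #include <stdbool.h>

    /* Hypothetical ordering modeled on Slurm's BB_STATE_* values; the
     * real enum has more states, this subset is just for illustration. */
    enum bb_state {
        BB_STATE_PENDING,
        BB_STATE_STAGING_IN,
        BB_STATE_STAGED_IN,
        BB_STATE_RUNNING,
        BB_STATE_POST_RUN,
        BB_STATE_STAGING_OUT,
        BB_STATE_TEARDOWN,
        BB_STATE_COMPLETE,
    };

    /* Buggy behavior described above: any state at or past STAGED_IN
     * looked "ready", so a requeued job could start while its first
     * execution was still staging out -- and the eventual teardown then
     * clobbered the second execution's buffer state. */
    bool stage_in_complete_buggy(enum bb_state s)
    {
        return s >= BB_STATE_STAGED_IN;
    }

    /* Fixed behavior (sketch): a buffer still draining from a prior
     * execution is explicitly not ready, holding the requeued job until
     * the first stage-out completes. */
    bool stage_in_complete_fixed(enum bb_state s)
    {
        if (s >= BB_STATE_POST_RUN)   /* prior run still draining buffers */
            return false;
        return s == BB_STATE_STAGED_IN;
    }

Either way, the guard keys off the ordering of the state enum, which is why the <. and >= comparisons in the snippets above matter.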
The commit is here:
https://github.com/SchedMD/slurm/commit/3bb2e602bfa1161427384145486b47f506457414

cori has the patch (both), but I just now needed to down and up the partitions to get a job to run in a reservation. Here are the errors in the log:

nid00837:~ # grep error: /var/tmp/slurm/slurmctld.log | grep -v CONF_HASH | grep -v underflow | grep -v Zero | grep -v "failed to assign" | grep -v realtime
[2016-03-10T00:02:20.295] error: Partition system Nodes differs from slurm.conf
[2016-03-10T00:02:20.295] error: Partition debug State differs from slurm.conf
[2016-03-10T00:02:20.295] error: Partition regular State differs from slurm.conf
[2016-03-10T00:02:20.295] error: Partition regularx Nodes differs from slurm.conf
[2016-03-10T00:02:20.295] error: Partition regularx State differs from slurm.conf
[2016-03-10T00:02:20.295] error: Partition shared State differs from slurm.conf
[2016-03-10T00:59:24.227] error: select_p_job_fini: Cleaned flag already set for job 1276822, this should never happen
[2016-03-10T01:46:52.180] error: slurm_receive_msgs: Transport endpoint is not connected
[2016-03-10T07:13:04.219] error: Nodes nid00296 not responding
[2016-03-10T07:14:44.421] error: Nodes nid00296 not responding
[2016-03-10T07:16:24.021] error: Nodes nid00296 not responding
[2016-03-10T07:18:04.487] error: Nodes nid00296 not responding, setting DOWN
[2016-03-10T09:58:54.627] error: select_p_job_fini: Cleaned flag already set for job 1265676, this should never happen
[2016-03-10T10:45:55.861] error: Node ping apparently hung, many nodes may be DOWN or configured SlurmdTimeout should be increased
[2016-03-10T11:31:10.685] error: _bb_get_sessions: show_sessions status:256 response:dwget - failed client status code %s 404
[2016-03-10T11:31:10.685] error: _bb_get_sessions: json parser failed on dwget - failed client status code %s 404
[2016-03-10T11:31:43.432] error: _load_state: Unexpected burst buffer found: andreyBB
[2016-03-10T11:31:43.469] error: _load_state: Unexpected burst buffer found: MelTest2
[2016-03-10T11:31:43.547] error: _load_state: Unexpected burst buffer found: DJBmlrefine19
[2016-03-10T11:31:43.623] error: _load_state: Unexpected burst buffer found: XuanCami
[2016-03-10T11:31:43.703] error: _load_state: Unexpected burst buffer found: myBBname
[2016-03-10T11:31:43.859] error: _load_state: Unexpected burst buffer found: gatkbrachy_2
[2016-03-10T11:31:44.011] error: _load_state: Unexpected burst buffer found: mlrefine19
[2016-03-10T11:31:44.087] error: _load_state: Unexpected burst buffer found: 1306493
[2016-03-10T11:31:44.087] error: _load_state: Unexpected burst buffer found: 1306678
[2016-03-10T11:31:44.087] error: _load_state: Unexpected burst buffer found: 1306686
[2016-03-10T13:54:30.287] error: find_node_record: lookup failure for 09
[2016-03-10T13:55:07.278] error: find_node_record: lookup failure for 08
nid00837:~ #

nid00296 is in the reservation. Also, I see the "Cleaned flag already set" error is still present. Is the running slurmctld patched?

nid00837:~ # ls -l /proc/$(pgrep slurmctld)/exe
lrwxrwxrwx 1 root root 0 Mar 9 16:27 /proc/4038/exe -> /opt/slurm/15.08.8_nhc_requeue_gresunderflow_debug_15.08.8_nhc_requeue_gresunderflow_debug_20160309162247/sbin/slurmctld
nid00837:~ #

Yes. Will send more logs after this reservation testing period is complete.

There are a couple of error messages that particularly concern me:

> [2016-03-10T09:58:54.627] error: select_p_job_fini: Cleaned flag already set
> for job 1265676, this should never happen

Can you grep for "1265676" in your slurmctld log files? I really want to determine how this happened.
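For reference, that message comes from a once-only guard around the job's cleanup in the select/cray plugin -- roughly this shape (an illustrative sketch with invented field names, not the literal plugin source):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical per-job record, for this sketch only. */
    struct cray_jobinfo {
        uint32_t job_id;
        bool     cleaned;   /* set once node cleanup has been launched */
    };

    /* Sketch: job cleanup is meant to run exactly once per job, so
     * finding the flag already set on a later job_fini means cleanup
     * was triggered twice for the same job ID -- e.g. around a requeue. */
    void select_job_fini_sketch(struct cray_jobinfo *ji)
    {
        if (ji->cleaned) {
            fprintf(stderr, "error: select_p_job_fini: Cleaned flag already "
                    "set for job %u, this should never happen\n", ji->job_id);
            return;
        }
        ji->cleaned = true;
        /* ... launch NHC / release the ALPS reservation here ... */
    }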
> [2016-03-10T11:31:10.685] error: _bb_get_sessions: show_sessions status:256
> response:dwget - failed client status code %s 404
> [2016-03-10T11:31:10.685] error: _bb_get_sessions: json parser failed on
> dwget - failed client status code %s 404

Could you file a bug report with Cray that the execution of "dw_wlm_cli --function show_sessions" returned "dwget - failed client status code %s 404"? At best it's a badly formatted message. I'll study Slurm's error handling for this.

Finally, when this happens again, look for hung "xtcleanup_after" processes as children of slurmctld. The resources allocated to a job are not completely released until the NHC completes or slurmctld gets restarted/reconfigured, which could prevent jobs from getting started.

The job referenced did not use Datawarp.
nid00837:/var/tmp/slurm # grep 1265676 slurmctld.log
[2016-03-10T09:58:53.556] debug: Application 1265676 not found in app list
[2016-03-10T09:58:53.848] error: gres/craynetwork: _step_dealloc step 1265676.0 dealloc count underflow 0 < 1
[2016-03-10T09:58:53.848] debug: freed ports 63749 for step 1265676.0
[2016-03-10T09:58:54.627] job_complete: JobID=1265676 State=0x1 NodeCnt=1 WEXITSTATUS 0
[2016-03-10T09:58:54.627] error: select_p_job_fini: Cleaned flag already set for job 1265676, this should never happen
[2016-03-10T09:58:54.627] job_complete: JobID=1265676 State=0x8003 NodeCnt=1 done
nid00837:/var/tmp/slurm # sacct -j 1265676 --format=JobID,User,Partition,Start,End,Elapsed,State,ExitCode,ReqMem,ReqTRES%40
JobID User Partition Start End Elapsed State ExitCode ReqMem ReqTRES
------------ --------- ---------- ------------------- ------------------- ---------- ---------- -------- ---------- ----------------------------------------
1265676 jllyons regular 2016-03-09T03:36:51 2016-03-10T09:58:54 1-06:22:03 COMPLETED 0:0 122Gn cpu=1,node=1
1265676.bat+ 2016-03-09T03:36:51 2016-03-10T09:58:54 1-06:22:03 COMPLETED 0:0 122Gn
1265676.0 2016-03-09T03:36:55 2016-03-10T09:58:53 1-06:21:58 COMPLETED 0:0 122Gn
nid00837:/var/tmp/slurm #
Please note that this is not a DW ticket; the issues found related to DW should probably move to a different ticket so as not to distract from the scheduling issue herein.

Hello,

I'm unsure whether this is related to the recent patch adjusting when to call the NHC; however, since applying that patch on cori we've been getting about 20 instances per day of nodes going into suspect state, failing the "reservation" test, and then recovering a few minutes later. It seems there is a race condition in some of the Cray cleanup software relative to when the NHC runs. I sent the following information to Cray in case 133221, and I bring it up here as a possible race condition in running the NHC.

Message sent to Cray:

Hello,

We're getting ~20 nodes per day going into suspect state for a few minutes and then recovering. This is disruptive to SLURM scheduling, since during those moments the node is in the drained state. We are running SLURM 15.08.8 with patch
https://github.com/SchedMD/slurm/commit/fec5e03b900aebba21199a922cb75acf182a6abc
(adjusts the timing of running the NHC in some scenarios). This issue started after we applied the above patch, which, as stated, slightly changes the timing of running the NHC.

Can you please explain what this NHC check is doing? What is responsible for cleaning up these files? Why is there an apparent race condition between job cleanup and /proc/reservations cleanup? Is there some alpscomm function that SLURM needs to be calling to ensure this is properly cleaned up?
Excerpt of console log:

crayadm@corismw1:/var/opt/cray/log/p0-20160227t205932> grep c1-0c2s11n0 console-20160314
2016-03-14T09:58:47.930170-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1318312 (xtnhc) FAILURES: The following tests from the 'reservation' set have failed in normal mode:
2016-03-14T09:58:47.930223-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1318312 (Reservation_Test) WARNING: Directory /proc/reservations/1318312 still exists
2016-03-14T09:58:47.968703-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1318312 (Reservation_Test) WARNING: Reservation 1318312 has status: rid 1318312 flags ENDED jobs 1563368097136
2016-03-14T09:58:47.968721-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1318312 (Reservation_Test) WARNING: Reservation 1318312 still has the following object types: files
2016-03-14T09:58:48.004575-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1318312 (xtnhc) FAILURES: (Admindown) Reservation_Test
2016-03-14T09:58:48.004596-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1318312 (xtnhc) FAILURES: End of list of 1 failed test(s)
2016-03-14T09:58:49.214150-07:00 c4-0c1s1n1 <node_health:5.1> RESID:1318312 (xtcheckhealth) WARNING: Set node 364 (c1-0c2s11n0) to suspect because the node failed a health test.
2016-03-14T10:10:02.772452-07:00 c4-0c1s1n1 <node_health:5.1> RESID:1318312 (xtcheckhealth) WARNING: Set node 364 (c1-0c2s11n0) to up.
2016-03-14T10:27:35.471257-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1336628 (Free_Memory_Check) WARNING: Memory free at boot: 129429352 kB; Memory currently free: 128512692 kB; Difference 916660 kB > Limit 614400 kB
2016-03-14T10:27:35.496514-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1336628 (xtnhc) FAILURES: The following tests from the 'reservation' set have failed in normal mode:
2016-03-14T10:27:35.496728-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1336628 (xtnhc) FAILURES: (Log Only ) Free_Memory_Check
2016-03-14T10:27:35.496828-07:00 c1-0c2s11n0 <node_health:5.1> RESID:1336628 (xtnhc) FAILURES: End of list of 1 failed test(s)
2016-03-14T10:31:36.483695-07:00 c1-0c2s11n0 Lustre: 683:0:(vvp_io.c:921:vvp_io_fault_start()) binary [0x2000608eb:0x1d245:0x0] changed while waiting for the page fault lock
crayadm@corismw1:/var/opt/cray/log/p0-20160227t205932>

Excerpt from messages log:

crayadm@corismw1:/var/opt/cray/log/p0-20160227t205932> grep c1-0c2s11n0 messages-20160314
<4>1 2016-03-14T09:58:49.174840-07:00 c4-0c1s1n1 kernel - p0-20160227t205932 - <node_health:5.1> RESID:1318312 (xtcheckhealth) WARNING: Set node 364 (c1-0c2s11n0) to suspect because the node failed a health test.
<166>1 2016-03-14T09:58:49.174848-07:00 c4-0c1s1n1 xtcheckhealth 1843 p0-20160227t205932 [nhc@34] <node_health:5.1> RESID:1318312 (xtcheckhealth) WARNING: Set node 364 (c1-0c2s11n0) to suspect because the node failed a health test.
<4>1 2016-03-14T10:10:02.747019-07:00 c4-0c1s1n1 kernel - p0-20160227t205932 - <node_health:5.1> RESID:1318312 (xtcheckhealth) WARNING: Set node 364 (c1-0c2s11n0) to up.
<166>1 2016-03-14T10:10:02.747038-07:00 c4-0c1s1n1 xtcheckhealth 7400 p0-20160227t205932 [nhc@34] <node_health:5.1> RESID:1318312 (xtcheckhealth) WARNING: Set node 364 (c1-0c2s11n0) to up.
crayadm@corismw1:/var/opt/cray/log/p0-20160227t205932>

Excerpt from slurmctld log (a few different ways):

nid00837:~ # grep 1318312 /var/tmp/slurm/slurmctld.log
...
[2016-03-14T09:05:44.494] backfill: Started JobId=1318312 in regular on nid0[0284-0295,0336-0343,0360-0367,0564-0575,1944-1947,1952-1959,1968-1975,2024-2027]
[2016-03-14T09:55:46.949] Time limit exhausted for JobId=1318312
[2016-03-14T09:55:47.909] job_complete: JobID=1318312 State=0x8006 NodeCnt=1 WTERMSIG 15
nid00837:~ #
nid00837:~ # grep nid00364 /var/tmp/slurm/slurmctld.log | grep -v HASH
...
[2016-03-14T09:58:48.098] update_node: node nid00364 reason set to: suspect by NHC
[2016-03-14T09:58:48.099] update_node: node nid00364 state set to DRAINED
[2016-03-14T10:10:01.674] update_node: node nid00364 state set to IDLE
...
nid00837:~ #

Thank you,
Doug Jacobsen

(In reply to Doug Jacobsen from comment #51)
> since applying that patch on cori we've been getting about 20 instances per
> day of nodes going into suspect state, failing the "reservation" test, and
> then recovering a few minutes later.

Let me expand a bit on the commit message (see https://github.com/SchedMD/slurm/commit/fec5e03b900aebba21199a922cb75acf182a6abc).

The previous Slurm logic would run the reservation/job NHC only once for any job. The new Slurm logic executes the reservation/job NHC at the end of each execution of a given job ID. In the case of a requeued job, that could easily mean the NHC runs more than once for the same job ID on the same node(s), but each run represents an independent execution of the job. I would guess that Cray's software thinks this might be a problem.
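To make that behavior change concrete, here is a hedged before/after sketch -- names are invented, not the literal commit (linked above):

    #include <stdbool.h>
    #include <stdint.h>

    struct job_rec {
        uint32_t job_id;
        bool     nhc_ran;   /* hypothetical once-per-job flag */
    };

    /* Stand-in for launching Cray NHC (xtcheckhealth) on the job's nodes. */
    void run_nhc(struct job_rec *job)
    {
        (void) job;
    }

    /* Old logic (sketch): NHC gated by a once-per-job-ID flag, so the
     * second execution of a requeued job never got a cleanup check. */
    void execution_end_old(struct job_rec *job)
    {
        if (!job->nhc_ran) {
            run_nhc(job);
            job->nhc_ran = true;
        }
    }

    /* New logic (sketch): NHC runs at the end of every execution, so a
     * requeued job runs it again on the same nodes -- and can now observe
     * a /proc/reservations entry that the previous execution's cleanup has
     * not yet finished removing, which matches the "suspect, then recovers
     * a few minutes later" pattern in the console log above. */
    void execution_end_new(struct job_rec *job)
    {
        run_nhc(job);
    }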
Have you upgraded to Slurm version 16.05? Has this problem gone away? If yes to both questions, we can probably close this ticket.

Doug - marking closed, as I assume this hasn't recurred since updating to 16.05.