Summary:    Job stuck in completing state
Product:    Slurm
Component:  slurmstepd
Reporter:   Anthony DelSorbo <anthony.delsorbo>
Assignee:   Nate Rini <nate>
Status:     RESOLVED INFOGIVEN
Severity:   4 - Minor Issue
Version:    19.05.1
Hardware:   Linux
OS:         Linux
See Also:   https://bugs.schedmd.com/show_bug.cgi?id=7888
            https://bugs.schedmd.com/show_bug.cgi?id=7839
            https://bugs.schedmd.com/show_bug.cgi?id=7942
Site:       NOAA
NOAA Site:  NESCC
Description
Anthony DelSorbo  2019-10-04 13:02:55 MDT

Created attachment 11829 [details]: syslog from batch host
Interestingly, the syslog shows that the job epilog ran twice: once at the original end of the job:

    Oct 3 14:48:59 h9c51 EPILOG-ROOT: Job 174194:

and again nearly 14 hours later:

    Oct 4 02:41:20 h9c51 EPILOG-ROOT: Job 174194:
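(A quick way to confirm the duplicate epilog runs from that syslog — a sketch; the /var/log/messages path is an assumption for these RHEL-style nodes, and EPILOG-ROOT is presumably the tag written by our epilog script:)

    # show and count the epilog entries for this job on the batch host
    grep 'EPILOG-ROOT: Job 174194' /var/log/messages
    grep -c 'EPILOG-ROOT: Job 174194' /var/log/messages   # expect 1 run; here it returns 2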
Created attachment 11830 [details]: strace from a stuck node
Not sure if it's helpful, but it's not clear to me why the node isn't able to contact the controller - all networks are operational.
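(For what it's worth, a quick connectivity sanity check from a node in this state would look something like the following — a sketch; it assumes slurm.conf is readable on the node and that SlurmctldPort is the default 6817:)

    scontrol ping          # ask the primary and backup controllers to respond
    ping -c 3 bqs1         # basic reachability to the primary controller host
    nc -zv bqs1 6817       # TCP reachability to the slurmctld listen port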
Tony,

Please call the following on the stuck batch host:
> ps auxft
> systemctl status slurmd
> dmesg -T
> lsof -n -p 241860
> lsof -n -p 8777

Please call the following on the slurmctld host:
> scontrol ping
> scontrol show job 174194   # I want to see if anything changed from your last post
> scontrol show node h9c51

Thanks,
--Nate

(In reply to Nate Rini from comment #4)

> ps auxft
See attached file: ps_auxft.txt

> systemctl status slurmd
See attached file: systemctl_status_slurmd.txt

> dmesg -T
See attached file: dmesg_T

> lsof -n -p 241860
Process no longer exists

> lsof -n -p 8777
See attached file: lsof_np_8777.txt

> scontrol ping
[root@bqs1 ~]# scontrol ping
Slurmctld(primary) at bqs1 is UP
Slurmctld(backup) at bqs2 is UP

> scontrol show job 174194   # I want to see if anything changed from your last post
[root@bqs1 ~]# scontrol show job 174194
JobId=174194 JobName=RAPP_gsi_hyb_04
   UserId=Eric.James(5133) GroupId=wrfruc(10019) MCS_label=N/A
   Priority=205990816 Nice=0 Account=wrfruc QOS=batch
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:06:44 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-10-03T14:41:04 EligibleTime=2019-10-03T14:41:04
   AccrueTime=2019-10-03T14:41:04
   StartTime=2019-10-03T14:41:05 EndTime=2019-10-03T14:47:49 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-10-03T14:41:05
   Partition=hera AllocNode:Sid=hfe04:170448
   ReqNodeList=(null) ExcNodeList=(null) NodeList=h9c[52-54],h15c[23-25,27-30]
   BatchHost=h9c51
   NumNodes=10 NumCPUs=480 NumTasks=480 CPUs/Task=N/A ReqB:S:C:T=0:0:*:*
   TRES=cpu=480,mem=1104000M,node=12,billing=480
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2300M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/Eric.James
   Comment=3290806835d7aeb5350f45c7f6c17fb1
   StdErr=/scratch2/BMC/wrfruc/ejames/rapretro/RAPv5_feb2019_retro1/log/gsi_hyb_pcyc_201902070400.log
   StdIn=/dev/null
   StdOut=/scratch2/BMC/wrfruc/ejames/rapretro/RAPv5_feb2019_retro1/log/gsi_hyb_pcyc_201902070400.log
   Power=

> scontrol show node h9c51
>
> Thanks,
> --Nate

Created attachment 11845 [details]: dmesg -T output
Created attachment 11846 [details]: lsof -n -p 8777
Created attachment 11847 [details]: ps auxft
Created attachment 11848 [details]: systemctl status slurmd
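(For future incidents, the requested diagnostics could be bundled in one pass on the affected host — a rough sketch; the output directory name is illustrative:)

    out=diag-$(hostname -s)-$(date +%Y%m%dT%H%M%S)
    mkdir -p "$out"
    ps auxft                  > "$out/ps_auxft.txt"
    systemctl status slurmd   > "$out/systemctl_status_slurmd.txt" 2>&1
    dmesg -T                  > "$out/dmesg_T.txt"
    # lsof of any lingering slurmstepd processes (file is empty if there are none)
    for pid in $(pgrep -f 'slurmstepd:'); do lsof -n -p "$pid"; done > "$out/lsof_stepd.txt"
    tar czf "$out.tar.gz" "$out"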
Nate,

I gave you info from the wrong node. Please forgive me and stand by while I collect new data for you....

(In reply to Nate Rini from comment #4)

> ps auxft
Attached: h9c52_ps_auxft.txt

> systemctl status slurmd
[root@h9c52 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/apps/slurm/d/etc/slurmd.service; linked; vendor preset: disabled)
   Active: active (running) since Fri 2019-09-27 17:37:21 UTC; 1 weeks 2 days ago
  Process: 9407 ExecStart=/apps/slurm/d/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 9411 (slurmd)
    Tasks: 4
   Memory: 2.2M
   CGroup: /system.slice/slurmd.service
           ├─ 9411 /apps/slurm/d/sbin/slurmd
           └─25730 slurmstepd: [174194.extern]

Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug: _handle_terminate for step=174194.4294967295 uid=0
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug2: xcgroup_get_pids: unable to get pids of '(null)'
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug2: xcgroup_get_pids: unable to get pids of '(null)'
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug3: unable to get pids list for cont_id=25730
Oct 07 14:52:36 h9c52 slurmstepd[25730]: Sent SIGKILL signal to 174194.4294967295
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug3: _handle_request: leaving with rc: 0
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug3: _handle_request: entering
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug3: Leaving _handle_accept

> dmesg -T
Attached: h9c52_dmesg_T.txt

> lsof -n -p 241860
Process does not exist. Perhaps you meant 25730? If so, attached lsof_np_25730.txt.

> lsof -n -p 8777
Process 8777 does not exist on the stuck host, but does exist as slurmd on the batch host (h9c51), which is not stuck and is actually running other jobs today. However, I did provide that for you earlier.

Created attachment 11851 [details]: dmesg -T output from h9c52 stuck host
Created attachment 11852 [details]: ps auxft for h9c52
Created attachment 11853 [details]: h9c52 slurmstepd process
Created attachment 11854 [details]: h9c52 stepd process
(In reply to Anthony DelSorbo from comment #13)
> Created attachment 11852 [details]
> ps auxft for h9c52

Is the job still in COMPLETING state?

(In reply to Nate Rini from comment #17)
> Is the job still in COMPLETING state?

Yes.

[root@bqs1 ~]# squeue
 JOBID PARTITION   QOS       USER      STATE TIME_LIMIT TIME TIME_LEFT NODES REASON NAME
174194      hera batch Eric.James COMPLETING    1:00:00 6:44     53:16    10   None RAPP_gsi_hyb_04

and those nodes are still "hung" - not released for other jobs.

Nate,

Do you have any objections to me releasing 9 of the 10 nodes back to production? If not, do you have any preference as to which node to keep out for testing? Or, do you need additional information from the other nodes? Since the job should have exceeded its wallclock time, shouldn't slurmctld have taken the nodes down by now?

[root@bqs1 ~]# sinfo --format="%45N %.3D %9P %11T %.4c %14C %.8z %.8m %.4d %.8w %10f %90E" -p hera --nodes h9c[52-54],h15c[23-25,27-30]
NODELIST                      NOD PARTITION STATE      CPUS CPUS(A/I/O/T)  S:C:T MEMORY TMP_ WEIGHT AVAIL_FEAT REASON
h9c[52-54],h15c[23-25,27-30]   10 hera*     completing   40 0/400/0/400   2:20:2  95000    0      1 (null)     none

Thanks,

Tony.

(In reply to Anthony DelSorbo from comment #19)
> Do you have any objections to me releasing 9 of the 10 nodes back to
> production? If not, do you have any preference as to which node to keep out
> for testing?

You should be able to down and then resume all of the nodes to clear the state.
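(For reference, that recovery amounts to something like the following for this job's node list — a sketch; the reason string is arbitrary:)

    scontrol update nodename=h9c[52-54],h15c[23-25,27-30] state=down   reason="clear stuck job 174194"
    scontrol update nodename=h9c[52-54],h15c[23-25,27-30] state=resume reason="clear stuck job 174194"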
(In reply to Nate Rini from comment #20)
> You should be able to down and then resume all of the nodes to clear the
> state.

Tried that but now have nodes with a stuck slurmstepd process. Here's an example on two nodes:

---------------- h9c53 ----------------
● slurmd.service - Slurm node daemon
   Loaded: loaded (/apps/slurm/d/etc/slurmd.service; linked; vendor preset: disabled)
   Active: active (running) since Mon 2019-10-07 17:32:19 UTC; 1min 29s ago
  Process: 28886 ExecStart=/apps/slurm/d/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 28890 (slurmd)
    Tasks: 4
   Memory: 3.4M
   CGroup: /system.slice/slurmd.service
           ├─22157 slurmstepd: [174194.extern]
           └─28890 /apps/slurm/d/sbin/slurmd

Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug: _handle_terminate for step=174194.4294967295 uid=0
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: xcgroup_get_pids: unable to get pids of '(null)'
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: xcgroup_get_pids: unable to get pids of '(null)'
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug3: unable to get pids list for cont_id=22157
Oct 07 17:33:44 h9c53 slurmstepd[22157]: Sent SIGKILL signal to 174194.4294967295
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug3: _handle_request: leaving with rc: 0
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug3: _handle_request: entering
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug3: Leaving _handle_accept

---------------- h9c54 ----------------
● slurmd.service - Slurm node daemon
   Loaded: loaded (/apps/slurm/d/etc/slurmd.service; linked; vendor preset: disabled)
   Active: active (running) since Mon 2019-10-07 17:32:19 UTC; 1min 29s ago
  Process: 48495 ExecStart=/apps/slurm/d/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 48498 (slurmd)
    Tasks: 4
   Memory: 3.4M
   CGroup: /system.slice/slurmd.service
           ├─43008 slurmstepd: [174194.extern]
           └─48498 /apps/slurm/d/sbin/slurmd

Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug: _handle_terminate for step=174194.4294967295 uid=0
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug2: xcgroup_get_pids: unable to get pids of '(null)'
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug2: xcgroup_get_pids: unable to get pids of '(null)'
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug3: unable to get pids list for cont_id=43008
Oct 07 17:33:44 h9c54 slurmstepd[43008]: Sent SIGKILL signal to 174194.4294967295
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug3: _handle_request: leaving with rc: 0
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug3: _handle_request: entering
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug3: Leaving _handle_accept
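(A quick way to see which of the job's nodes still carry a leftover extern stepd — a sketch; it assumes root ssh from an admin host and uses this job's node list:)

    for n in $(scontrol show hostnames 'h9c[52-54],h15c[23-25,27-30]'); do
        echo "== $n =="
        ssh "$n" "pgrep -af 'slurmstepd: \[174194'"   # prints any lingering stepd PIDs
    done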
(In reply to Anthony DelSorbo from comment #21)
> Tried that but now have nodes with a stuck slurmstepd process.

Please call the following as root on your slurmctld host and attach the logs:
> scontrol setdebugflags +steps
> scontrol setdebugflags +tracejobs
> scontrol setdebugflags +Agent
> scontrol setdebug debug4
> scontrol show events
> scontrol show node h9c53
> scontrol show node h9c54
> scontrol show job 174194
> scontrol update nodename=h9c53,h9c54 state=down reason=bug7872
> sleep 10
> scontrol update nodename=h9c53,h9c54 state=resume reason=bug7872
> scontrol setdebug info
> scontrol setdebugflags -steps
> scontrol setdebugflags -tracejobs
> scontrol setdebugflags -Agent
> scontrol show node h9c53
> scontrol show node h9c54
> scontrol show job 174194
> scontrol show events

Please also attach the slurmctld log and the slurmd logs on h9c53 and h9c54.

(In reply to Nate Rini from comment #23)
> Please also attach the slurmctld log and the slurmd logs on h9c53 and h9c54.

Nate,

Looks like the job eventually disappeared. I'll send up the logs momentarily....

[root@bqs2 ~]# scontrol setdebugflags +steps
[root@bqs2 ~]# scontrol setdebugflags +tracejobs
[root@bqs2 ~]# scontrol setdebugflags +Agent
[root@bqs2 ~]# scontrol setdebug debug4
[root@bqs2 ~]# scontrol show events
invalid entity:events for keyword:show
[root@bqs2 ~]# scontrol show node h9c53
NodeName=h9c53 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=40 CPUTot=40 CPULoad=39.68
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=h9c53 NodeHostName=h9c53
   OS=Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
   RealMemory=95000 AllocMem=92000 FreeMem=76621 Sockets=2 Boards=1
   MemSpecLimit=2000
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=hera,admin
   BootTime=2019-09-27T16:44:49 SlurmdStartTime=2019-10-07T17:32:24
   CfgTRES=cpu=40,mem=95000M,billing=40
   AllocTRES=cpu=40,mem=92000M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[root@bqs2 ~]# scontrol show node h9c54
NodeName=h9c54 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=40 CPUTot=40 CPULoad=39.49
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=h9c54 NodeHostName=h9c54
   OS=Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
   RealMemory=95000 AllocMem=92000 FreeMem=76595 Sockets=2 Boards=1
   MemSpecLimit=2000
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=hera,admin
   BootTime=2019-09-27T14:26:09 SlurmdStartTime=2019-10-07T17:32:23
   CfgTRES=cpu=40,mem=95000M,billing=40
   AllocTRES=cpu=40,mem=92000M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[root@bqs2 ~]# scontrol show job 174194
slurm_load_jobs error: Invalid job id specified
[root@bqs2 ~]# scontrol update nodename=h9c53,h9c54 state=down reason=bug7872
[root@bqs2 ~]# sleep 10
[root@bqs2 ~]# scontrol update nodename=h9c53,h9c54 state=resume reason=bug7872
[root@bqs2 ~]# scontrol setdebug info
[root@bqs2 ~]# scontrol setdebugflags -steps
[root@bqs2 ~]# scontrol setdebugflags -tracejobs
[root@bqs2 ~]# scontrol setdebugflags -Agent
[root@bqs2 ~]# scontrol show node h9c53
NodeName=h9c53 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=0 CPUTot=40 CPULoad=39.68
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=h9c53 NodeHostName=h9c53
   OS=Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
   RealMemory=95000 AllocMem=0 FreeMem=76621 Sockets=2 Boards=1
   MemSpecLimit=2000
   State=IDLE* ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=hera,admin
   BootTime=2019-09-27T16:44:49 SlurmdStartTime=2019-10-07T17:32:24
   CfgTRES=cpu=40,mem=95000M,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[root@bqs2 ~]# scontrol show node h9c54
NodeName=h9c54 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=0 CPUTot=40 CPULoad=39.49
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=h9c54 NodeHostName=h9c54
   OS=Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
   RealMemory=95000 AllocMem=0 FreeMem=76595 Sockets=2 Boards=1
   MemSpecLimit=2000
   State=IDLE* ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=hera,admin
   BootTime=2019-09-27T14:26:09 SlurmdStartTime=2019-10-07T17:32:23
   CfgTRES=cpu=40,mem=95000M,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[root@bqs2 ~]# scontrol show job 174194
slurm_load_jobs error: Invalid job id specified
[root@bqs2 ~]# scontrol show events
invalid entity:events for keyword:show

(In reply to Anthony DelSorbo from comment #26)
> Looks like the job eventually disappeared. I'll send up the logs
> momentarily....

Can we lower this to SEV4 since this is now a research ticket?

(In reply to Anthony DelSorbo from comment #26)
> [root@bqs2 ~]# scontrol show events
> invalid entity:events for keyword:show

Please call this instead:
> sacctmgr show events Nodes=h9c54,h9c53

(In reply to Nate Rini from comment #28)
> Please call this instead:
> > sacctmgr show events Nodes=h9c54,h9c53

[root@bqs1 ~]# sacctmgr show events Nodes=h9c54,h9c53
   Cluster        NodeName           TimeStart             TimeEnd  State                         Reason       User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
      hera           h9c53 2019-10-07T17:20:55 2019-10-07T17:33:03   DOWN     Clearing job 174194 - Tony Anthony.D+
      hera           h9c53 2019-10-07T19:26:54 2019-10-07T19:27:04   DOWN                        bug7872 Anthony.D+
      hera           h9c54 2019-10-07T17:20:55 2019-10-07T17:33:03   DOWN     Clearing job 174194 - Tony Anthony.D+
      hera           h9c54 2019-10-07T19:26:54 2019-10-07T19:27:04   DOWN                        bug7872 Anthony.D+

(In reply to Nate Rini from comment #27)
> Can we lower this to SEV4 since this is now a research ticket?

Yes - done.

Created attachment 11860 [details]: h9c53 slurmd messages
Getting a timeout in trying to send up other files. Are you folks out of space on your servers?

... more info:

Request Timeout
Server timeout waiting for the HTTP request from the client.
Apache/2.4.38 (Debian) Server at bugs.schedmd.com Port 443

(In reply to Anthony DelSorbo from comment #33)
> Getting a timeout in trying to send up other files. Are you folks out of
> space on your servers?

How large is the file?

(In reply to Nate Rini from comment #35)
> How large is the file?

1.1 MB

(In reply to Anthony DelSorbo from comment #36)
> 1.1 MB

Can you please try a different browser? There appears to be an error in the POST:

[Mon Oct 07 21:22:07.854544 2019] [cgid:error] [pid XXXX:tid XXX] (70007)The timeout specified has expired: [client 140.XXXXXX:XXXXX] AH01270: Error reading request entity data, referer: https://bugs.schedmd.com/attachment.cgi?bugid=7872&action=enter
[Mon Oct 7 21:22:08 2019] attachment.cgi: CGI parsing error: 400 Bad request (malformed multipart POST) at Bugzilla/CGI.pm line 108.

Please also make sure the file is compressed.
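(A minimal example of compressing a log before attaching — the paths and output name are illustrative:)

    gzip -9 -c /var/log/slurm/slurmd.log > h9c53_slurmd_messages.gz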
Created attachment 11862 [details]: h9c54 slurmd messages
Created attachment 11863 [details]: bqs1 slurmctld (primary) logs
Created attachment 11864 [details]: bqs2 slurmctld (backup) logs
Sending you this log to augment the bqs1 logs since I wasn't aware the service crashed and failed over to the other server.
(In reply to Nate Rini from comment #37)
> Can you please try a different browser? There appears to be an error in the
> POST:

Sorry about that Nate. Turns out the issue was on my side. The files had a different ownership and hence were not readable by me. Is there another way to upload files to a case? You should now have all the files you asked for.

(In reply to Anthony DelSorbo from comment #42)
> Is there another way to upload files to a case?

If needed, I can also take the files via Google Drive (and possibly other ways too). We usually use gdrive when files are too big for bugzilla to handle.

(In reply to Anthony DelSorbo from comment #40)
> Created attachment 11863 [details]
> bqs1 slurmctld (primary) logs

On an unrelated topic:

> [2019-10-07T15:25:30.851] error: select/cons_res: node h13c24 memory is under-allocated (0-92000) for JobId=119939

This issue has been fixed in bug#6769 comment#41. Please consider upgrading to 19.05.3 to receive the fix.

(In reply to Nate Rini from comment #45)
> This issue has been fixed in bug#6769 comment#41. Please consider upgrading
> to 19.05.3 to receive the fix.

Thanks Nate. It's our plan to download that version this week and test it on our test system. We plan to install it on the production systems within the next several weeks. For me, the sooner the better - but we have to go through all the wickets.

Tony,

I'm going to close this ticket as info given. In 19.05, the best solution is to down the nodes when this event happens. Bug#7942 has been opened to gracefully handle this situation in 20.02.

If you have any questions, please respond to either ticket.

Thanks,
--Nate

(In reply to Nate Rini from comment #51)
> I'm going to close this ticket as info given. In 19.05, the best solution is
> to down the nodes when this event happens.

Thanks Nate. No issues with closing the ticket. As an FYI, we just got started with 19.05.3 and will be testing it for the rest of the week in preparation for going to production in November.

Tony.