Summary:    Job stuck in completing state
Product:    Slurm
Component:  slurmstepd
Reporter:   Anthony DelSorbo <anthony.delsorbo>
Assignee:   Nate Rini <nate>
Status:     RESOLVED INFOGIVEN
Severity:   4 - Minor Issue
Version:    19.05.1
Hardware:   Linux
OS:         Linux
See Also:   https://bugs.schedmd.com/show_bug.cgi?id=7888
            https://bugs.schedmd.com/show_bug.cgi?id=7839
            https://bugs.schedmd.com/show_bug.cgi?id=7942
Site:       NOAA
NOAA Site:  NESCC
Description
Anthony DelSorbo  2019-10-04 13:02:55 MDT

Created attachment 11829 [details]: syslog from batch host
Interestingly, the syslog shows that the job epilog ran twice: once at the original end of the job:

    Oct 3 14:48:59 h9c51 EPILOG-ROOT: Job 174194:

and again nearly 14 hours later:

    Oct 4 02:41:20 h9c51 EPILOG-ROOT: Job 174194:
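(A quick way to confirm the duplicate epilog runs from that syslog — a sketch; the /var/log/messages path is an assumption for these RHEL-style nodes, and EPILOG-ROOT is presumably the tag written by our epilog script:)

    # show and count the epilog entries for this job on the batch host
    grep 'EPILOG-ROOT: Job 174194' /var/log/messages
    grep -c 'EPILOG-ROOT: Job 174194' /var/log/messages   # expect 1 run; here it returns 2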
Created attachment 11830 [details]: strace from a stuck node
Not sure if it's helpful, but it's not clear to me why the node isn't able to contact the controller - all networks are operational.
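(For what it's worth, a quick connectivity sanity check from a node in this state would look something like the following — a sketch; it assumes slurm.conf is readable on the node and that SlurmctldPort is the default 6817:)

    scontrol ping          # ask the primary and backup controllers to respond
    ping -c 3 bqs1         # basic reachability to the primary controller host
    nc -zv bqs1 6817       # TCP reachability to the slurmctld listen port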
Tony,

Please call the following on the stuck batch host:
> ps auxft
> systemctl status slurmd
> dmesg -T
> lsof -n -p 241860
> lsof -n -p 8777

Please call the following on the slurmctld host:
> scontrol ping
> scontrol show job 174194   # I want to see if anything changed from your last post
> scontrol show node h9c51

Thanks,
--Nate

(In reply to Nate Rini from comment #4)

> ps auxft
See attached file: ps_auxft.txt

> systemctl status slurmd
See attached file: systemctl_status_slurmd.txt

> dmesg -T
See attached file: dmesg_T

> lsof -n -p 241860
Process no longer exists

> lsof -n -p 8777
See attached file: lsof_np_8777.txt

> scontrol ping
[root@bqs1 ~]# scontrol ping
Slurmctld(primary) at bqs1 is UP
Slurmctld(backup) at bqs2 is UP

> scontrol show job 174194   # I want to see if anything changed from your last post
[root@bqs1 ~]# scontrol show job 174194
JobId=174194 JobName=RAPP_gsi_hyb_04
   UserId=Eric.James(5133) GroupId=wrfruc(10019) MCS_label=N/A
   Priority=205990816 Nice=0 Account=wrfruc QOS=batch
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:06:44 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-10-03T14:41:04 EligibleTime=2019-10-03T14:41:04
   AccrueTime=2019-10-03T14:41:04
   StartTime=2019-10-03T14:41:05 EndTime=2019-10-03T14:47:49 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-10-03T14:41:05
   Partition=hera AllocNode:Sid=hfe04:170448
   ReqNodeList=(null) ExcNodeList=(null) NodeList=h9c[52-54],h15c[23-25,27-30]
   BatchHost=h9c51
   NumNodes=10 NumCPUs=480 NumTasks=480 CPUs/Task=N/A ReqB:S:C:T=0:0:*:*
   TRES=cpu=480,mem=1104000M,node=12,billing=480
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=2300M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/home/Eric.James
   Comment=3290806835d7aeb5350f45c7f6c17fb1
   StdErr=/scratch2/BMC/wrfruc/ejames/rapretro/RAPv5_feb2019_retro1/log/gsi_hyb_pcyc_201902070400.log
   StdIn=/dev/null
   StdOut=/scratch2/BMC/wrfruc/ejames/rapretro/RAPv5_feb2019_retro1/log/gsi_hyb_pcyc_201902070400.log
   Power=

> scontrol show node h9c51
>
> Thanks,
> --Nate

Created attachment 11845 [details]: dmesg -T output
Created attachment 11846 [details]: lsof -n -p 8777
Created attachment 11847 [details]: ps auxft
Created attachment 11848 [details]: systemctl status slurmd
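(For future incidents, the requested diagnostics could be bundled in one pass on the affected host — a rough sketch; the output directory name is illustrative:)

    out=diag-$(hostname -s)-$(date +%Y%m%dT%H%M%S)
    mkdir -p "$out"
    ps auxft                  > "$out/ps_auxft.txt"
    systemctl status slurmd   > "$out/systemctl_status_slurmd.txt" 2>&1
    dmesg -T                  > "$out/dmesg_T.txt"
    # lsof of any lingering slurmstepd processes (file is empty if there are none)
    for pid in $(pgrep -f 'slurmstepd:'); do lsof -n -p "$pid"; done > "$out/lsof_stepd.txt"
    tar czf "$out.tar.gz" "$out"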
Nate,

I gave you info from the wrong node. Please forgive me and stand by while I collect new data for you....

(In reply to Nate Rini from comment #4)

> ps auxft
Attached: h9c52_ps_auxft.txt

> systemctl status slurmd
[root@h9c52 ~]# systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/apps/slurm/d/etc/slurmd.service; linked; vendor preset: disabled)
   Active: active (running) since Fri 2019-09-27 17:37:21 UTC; 1 weeks 2 days ago
  Process: 9407 ExecStart=/apps/slurm/d/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 9411 (slurmd)
    Tasks: 4
   Memory: 2.2M
   CGroup: /system.slice/slurmd.service
           ├─ 9411 /apps/slurm/d/sbin/slurmd
           └─25730 slurmstepd: [174194.extern]

Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug: _handle_terminate for step=174194.4294967295 uid=0
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug2: xcgroup_get_pids: unable to get pids of '(null)'
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug2: xcgroup_get_pids: unable to get pids of '(null)'
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug3: unable to get pids list for cont_id=25730
Oct 07 14:52:36 h9c52 slurmstepd[25730]: Sent SIGKILL signal to 174194.4294967295
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug3: _handle_request: leaving with rc: 0
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug3: _handle_request: entering
Oct 07 14:52:36 h9c52 slurmstepd[25730]: debug3: Leaving _handle_accept

> dmesg -T
Attached: h9c52_dmesg_T.txt

> lsof -n -p 241860
Process does not exist. Perhaps you meant 25730? If so, attached lsof_np_25730.txt.

> lsof -n -p 8777
Process 8777 does not exist on the stuck host, but does exist as slurmd on the batch host (h9c51), which is not stuck and is actually running other jobs today. However, I did provide that for you earlier.

Created attachment 11851 [details]: dmesg -T output from h9c52 stuck host
Created attachment 11852 [details]: ps auxft for h9c52
Created attachment 11853 [details]: h9c52 slurmstepd process
Created attachment 11854 [details]: h9c52 stepd process
(In reply to Anthony DelSorbo from comment #13)
> Created attachment 11852 [details]
> ps auxft for h9c52

Is the job still in COMPLETING state?

(In reply to Nate Rini from comment #17)
> Is the job still in COMPLETING state?

Yes.

[root@bqs1 ~]# squeue
 JOBID PARTITION   QOS       USER      STATE TIME_LIMIT TIME TIME_LEFT NODES REASON NAME
174194      hera batch Eric.James COMPLETING    1:00:00 6:44     53:16    10   None RAPP_gsi_hyb_04

and those nodes are still "hung" - not released for other jobs.

Nate,

Do you have any objections to me releasing 9 of the 10 nodes back to production? If not, do you have any preference as to which node to keep out for testing? Or, do you need additional information from the other nodes? Since the job should have exceeded its wallclock time, shouldn't slurmctld have taken the nodes down by now?

[root@bqs1 ~]# sinfo --format="%45N %.3D %9P %11T %.4c %14C %.8z %.8m %.4d %.8w %10f %90E" -p hera --nodes h9c[52-54],h15c[23-25,27-30]
NODELIST                      NOD PARTITION STATE      CPUS CPUS(A/I/O/T)  S:C:T MEMORY TMP_ WEIGHT AVAIL_FEAT REASON
h9c[52-54],h15c[23-25,27-30]   10 hera*     completing   40 0/400/0/400   2:20:2  95000    0      1 (null)     none

Thanks,

Tony.

(In reply to Anthony DelSorbo from comment #19)
> Do you have any objections to me releasing 9 of the 10 nodes back to
> production? If not, do you have any preference as to which node to keep out
> for testing?

You should be able to down and then resume all of the nodes to clear the state.
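(For reference, that recovery amounts to something like the following for this job's node list — a sketch; the reason string is arbitrary:)

    scontrol update nodename=h9c[52-54],h15c[23-25,27-30] state=down   reason="clear stuck job 174194"
    scontrol update nodename=h9c[52-54],h15c[23-25,27-30] state=resume reason="clear stuck job 174194"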
(In reply to Nate Rini from comment #20)
> You should be able to down and then resume all of the nodes to clear the
> state.

Tried that but now have nodes with a stuck slurmstepd process. Here's an example on two nodes:

---------------- h9c53 ----------------
● slurmd.service - Slurm node daemon
   Loaded: loaded (/apps/slurm/d/etc/slurmd.service; linked; vendor preset: disabled)
   Active: active (running) since Mon 2019-10-07 17:32:19 UTC; 1min 29s ago
  Process: 28886 ExecStart=/apps/slurm/d/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 28890 (slurmd)
    Tasks: 4
   Memory: 3.4M
   CGroup: /system.slice/slurmd.service
           ├─22157 slurmstepd: [174194.extern]
           └─28890 /apps/slurm/d/sbin/slurmd

Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug: _handle_terminate for step=174194.4294967295 uid=0
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: xcgroup_get_pids: unable to get pids of '(null)'
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug2: xcgroup_get_pids: unable to get pids of '(null)'
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug3: unable to get pids list for cont_id=22157
Oct 07 17:33:44 h9c53 slurmstepd[22157]: Sent SIGKILL signal to 174194.4294967295
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug3: _handle_request: leaving with rc: 0
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug3: _handle_request: entering
Oct 07 17:33:44 h9c53 slurmstepd[22157]: debug3: Leaving _handle_accept

---------------- h9c54 ----------------
● slurmd.service - Slurm node daemon
   Loaded: loaded (/apps/slurm/d/etc/slurmd.service; linked; vendor preset: disabled)
   Active: active (running) since Mon 2019-10-07 17:32:19 UTC; 1min 29s ago
  Process: 48495 ExecStart=/apps/slurm/d/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 48498 (slurmd)
    Tasks: 4
   Memory: 3.4M
   CGroup: /system.slice/slurmd.service
           ├─43008 slurmstepd: [174194.extern]
           └─48498 /apps/slurm/d/sbin/slurmd

Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug: _handle_terminate for step=174194.4294967295 uid=0
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug2: xcgroup_get_pids: unable to get pids of '(null)'
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug2: xcgroup_get_pids: unable to get pids of '(null)'
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug3: unable to get pids list for cont_id=43008
Oct 07 17:33:44 h9c54 slurmstepd[43008]: Sent SIGKILL signal to 174194.4294967295
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug3: _handle_request: leaving with rc: 0
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug3: _handle_request: entering
Oct 07 17:33:44 h9c54 slurmstepd[43008]: debug3: Leaving _handle_accept
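(A quick way to see which of the job's nodes still carry a leftover extern stepd — a sketch; it assumes root ssh from an admin host and uses this job's node list:)

    for n in $(scontrol show hostnames 'h9c[52-54],h15c[23-25,27-30]'); do
        echo "== $n =="
        ssh "$n" "pgrep -af 'slurmstepd: \[174194'"   # prints any lingering stepd PIDs
    done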
(In reply to Anthony DelSorbo from comment #21)
> Tried that but now have nodes with a stuck slurmstepd process.

Please call the following as root on your slurmctld host and attach the logs:
> scontrol setdebugflags +steps
> scontrol setdebugflags +tracejobs
> scontrol setdebugflags +Agent
> scontrol setdebug debug4
> scontrol show events
> scontrol show node h9c53
> scontrol show node h9c54
> scontrol show job 174194
> scontrol update nodename=h9c53,h9c54 state=down reason=bug7872
> sleep 10
> scontrol update nodename=h9c53,h9c54 state=resume reason=bug7872
> scontrol setdebug info
> scontrol setdebugflags -steps
> scontrol setdebugflags -tracejobs
> scontrol setdebugflags -Agent
> scontrol show node h9c53
> scontrol show node h9c54
> scontrol show job 174194
> scontrol show events

Please also attach the slurmctld log and the slurmd logs on h9c53 and h9c54.

(In reply to Nate Rini from comment #23)
> Please also attach the slurmctld log and the slurmd logs on h9c53 and h9c54.

Nate,

Looks like the job eventually disappeared. I'll send up the logs momentarily....

[root@bqs2 ~]# scontrol setdebugflags +steps
[root@bqs2 ~]# scontrol setdebugflags +tracejobs
[root@bqs2 ~]# scontrol setdebugflags +Agent
[root@bqs2 ~]# scontrol setdebug debug4
[root@bqs2 ~]# scontrol show events
invalid entity:events for keyword:show
[root@bqs2 ~]# scontrol show node h9c53
NodeName=h9c53 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=40 CPUTot=40 CPULoad=39.68
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=h9c53 NodeHostName=h9c53
   OS=Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
   RealMemory=95000 AllocMem=92000 FreeMem=76621 Sockets=2 Boards=1
   MemSpecLimit=2000
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=hera,admin
   BootTime=2019-09-27T16:44:49 SlurmdStartTime=2019-10-07T17:32:24
   CfgTRES=cpu=40,mem=95000M,billing=40
   AllocTRES=cpu=40,mem=92000M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[root@bqs2 ~]# scontrol show node h9c54
NodeName=h9c54 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=40 CPUTot=40 CPULoad=39.49
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=h9c54 NodeHostName=h9c54
   OS=Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
   RealMemory=95000 AllocMem=92000 FreeMem=76595 Sockets=2 Boards=1
   MemSpecLimit=2000
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=hera,admin
   BootTime=2019-09-27T14:26:09 SlurmdStartTime=2019-10-07T17:32:23
   CfgTRES=cpu=40,mem=95000M,billing=40
   AllocTRES=cpu=40,mem=92000M
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[root@bqs2 ~]# scontrol show job 174194
slurm_load_jobs error: Invalid job id specified
[root@bqs2 ~]# scontrol update nodename=h9c53,h9c54 state=down reason=bug7872
[root@bqs2 ~]# sleep 10
[root@bqs2 ~]# scontrol update nodename=h9c53,h9c54 state=resume reason=bug7872
[root@bqs2 ~]# scontrol setdebug info
[root@bqs2 ~]# scontrol setdebugflags -steps
[root@bqs2 ~]# scontrol setdebugflags -tracejobs
[root@bqs2 ~]# scontrol setdebugflags -Agent
[root@bqs2 ~]# scontrol show node h9c53
NodeName=h9c53 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=0 CPUTot=40 CPULoad=39.68
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=h9c53 NodeHostName=h9c53
   OS=Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
   RealMemory=95000 AllocMem=0 FreeMem=76621 Sockets=2 Boards=1
   MemSpecLimit=2000
   State=IDLE* ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=hera,admin
   BootTime=2019-09-27T16:44:49 SlurmdStartTime=2019-10-07T17:32:24
   CfgTRES=cpu=40,mem=95000M,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[root@bqs2 ~]# scontrol show node h9c54
NodeName=h9c54 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=0 CPUTot=40 CPULoad=39.49
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=h9c54 NodeHostName=h9c54
   OS=Linux 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC 2019
   RealMemory=95000 AllocMem=0 FreeMem=76595 Sockets=2 Boards=1
   MemSpecLimit=2000
   State=IDLE* ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=hera,admin
   BootTime=2019-09-27T14:26:09 SlurmdStartTime=2019-10-07T17:32:23
   CfgTRES=cpu=40,mem=95000M,billing=40
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
[root@bqs2 ~]# scontrol show job 174194
slurm_load_jobs error: Invalid job id specified
[root@bqs2 ~]# scontrol show events
invalid entity:events for keyword:show

(In reply to Anthony DelSorbo from comment #26)
> Looks like the job eventually disappeared. I'll send up the logs
> momentarily....

Can we lower this to SEV4 since this is now a research ticket?

(In reply to Anthony DelSorbo from comment #26)
> [root@bqs2 ~]# scontrol show events
> invalid entity:events for keyword:show

Please call this instead:
> sacctmgr show events Nodes=h9c54,h9c53

(In reply to Nate Rini from comment #28)
> Please call this instead:
> > sacctmgr show events Nodes=h9c54,h9c53

[root@bqs1 ~]# sacctmgr show events Nodes=h9c54,h9c53
   Cluster        NodeName           TimeStart             TimeEnd  State                         Reason       User
---------- --------------- ------------------- ------------------- ------ ------------------------------ ----------
      hera           h9c53 2019-10-07T17:20:55 2019-10-07T17:33:03   DOWN     Clearing job 174194 - Tony Anthony.D+
      hera           h9c53 2019-10-07T19:26:54 2019-10-07T19:27:04   DOWN                        bug7872 Anthony.D+
      hera           h9c54 2019-10-07T17:20:55 2019-10-07T17:33:03   DOWN     Clearing job 174194 - Tony Anthony.D+
      hera           h9c54 2019-10-07T19:26:54 2019-10-07T19:27:04   DOWN                        bug7872 Anthony.D+

(In reply to Nate Rini from comment #27)
> Can we lower this to SEV4 since this is now a research ticket?

Yes - done.

Created attachment 11860 [details]: h9c53 slurmd messages
Getting a timeout in trying to send up other files. Are you folks out of space on your servers?

... more info:

Request Timeout
Server timeout waiting for the HTTP request from the client.
Apache/2.4.38 (Debian) Server at bugs.schedmd.com Port 443

(In reply to Anthony DelSorbo from comment #33)
> Getting a timeout in trying to send up other files. Are you folks out of
> space on your servers?

How large is the file?

(In reply to Nate Rini from comment #35)
> How large is the file?

1.1 MB

(In reply to Anthony DelSorbo from comment #36)
> 1.1 MB

Can you please try a different browser? There appears to be an error in the POST:

[Mon Oct 07 21:22:07.854544 2019] [cgid:error] [pid XXXX:tid XXX] (70007)The timeout specified has expired: [client 140.XXXXXX:XXXXX] AH01270: Error reading request entity data, referer: https://bugs.schedmd.com/attachment.cgi?bugid=7872&action=enter
[Mon Oct 7 21:22:08 2019] attachment.cgi: CGI parsing error: 400 Bad request (malformed multipart POST) at Bugzilla/CGI.pm line 108.

Please also make sure the file is compressed.
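(A minimal example of compressing a log before attaching — the paths and output name are illustrative:)

    gzip -9 -c /var/log/slurm/slurmd.log > h9c53_slurmd_messages.gz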
Created attachment 11862 [details]: h9c54 slurmd messages
Created attachment 11863 [details]: bqs1 slurmctld (primary) logs
Created attachment 11864 [details]: bqs2 slurmctld (backup) logs
Sending you this log to augment the bqs1 logs since I wasn't aware the service crashed and failed over to the other server.
(In reply to Nate Rini from comment #37)
> Can you please try a different browser? There appears to be an error in the
> POST:

Sorry about that Nate. Turns out the issue was on my side. The files had a different ownership and hence were not readable by me. Is there another way to upload files to a case? You should now have all the files you asked for.

(In reply to Anthony DelSorbo from comment #42)
> Is there another way to upload files to a case?

If needed, I can also take the files via Google Drive (and possibly other ways too). We usually use gdrive when files are too big for bugzilla to handle.

(In reply to Anthony DelSorbo from comment #40)
> Created attachment 11863 [details]
> bqs1 slurmctld (primary) logs

On an unrelated topic:

> [2019-10-07T15:25:30.851] error: select/cons_res: node h13c24 memory is under-allocated (0-92000) for JobId=119939

This issue has been fixed in bug#6769 comment#41. Please consider upgrading to 19.05.3 to receive the fix.

(In reply to Nate Rini from comment #45)
> This issue has been fixed in bug#6769 comment#41. Please consider upgrading
> to 19.05.3 to receive the fix.

Thanks Nate. It's our plan to download that version this week and test it on our test system. We plan to install it on the production systems within the next several weeks. For me, the sooner the better - but we have to go through all the wickets.

Tony,

I'm going to close this ticket as info given. In 19.05, the best solution is to down the nodes when this event happens. Bug#7942 has been opened to gracefully handle this situation in 20.02.

If you have any questions, please respond to either ticket.

Thanks,
--Nate

(In reply to Nate Rini from comment #51)
> I'm going to close this ticket as info given. In 19.05, the best solution is
> to down the nodes when this event happens.

Thanks Nate. No issues with closing the ticket. As an FYI, we just got started with 19.05.3 and will be testing it for the rest of the week in preparation for going to production in November.

Tony.