Created attachment 9102 [details]
netstat during hang
Created attachment 9103 [details]
lsof during hang
Created attachment 9104 [details]
strace during hang
Created attachment 9105 [details]
slurmctld log
Additionally, when the slurmctld is restarted from this situation to restore functionality, some partitions end up with the wrong configuration:

PartitionName=gll_usr_gpuprod
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=LINCO_INVENEW,LINCO_tph1,LINCO_galinew,fer_prod1,eli_prod1 Default=NO QoS=gll_usr_gpuprod
   DefaultTime=00:30:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node[153-156,499-507]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=208 TotalNodes=13 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=7000 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=0.25,Mem=0.035G,GRES/gpu=1

PartitionName=gll_usr_prod
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=LINCO_galinew,fer_prod1,eli_prod1 Default=NO QoS=gll_usr_prod
   DefaultTime=00:30:00 DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node[125-134,173-280,389-483]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=7668 TotalNodes=213 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=3000 MaxMemPerNode=UNLIMITED
   TRESBillingWeights=CPU=0.25,Mem=0.072G

The AllocNodes list is not made of nodes, but accounts...

Regards,
Marcello

Hi Marcello,

When did you finish the upgrade to 18.08.3, and does the problem only occur some of the time, or after a duration of time, e.g. fifteen minutes?

(In reply to Jason Booth from comment #6)
> Hi Marcello,
>
> When did you finish the upgrade to 18.08.3, and does the problem only occur
> some of the time, or after a duration of time, e.g. fifteen minutes?

The upgrade was made in November 2018; after that the problem appeared more and more frequently.
There doesn't seem to be a pattern; however, this is when it happened:

/var/log/messages-20181125.gz:Nov 22 17:23:26 io07 slurmctld[29712]: server_thread_count over limit (256), waiting
/var/log/messages-20181125.gz:Nov 22 17:23:31 io07 slurmctld[29712]: server_thread_count over limit (256), waiting
/var/log/messages-20181231.gz:Dec 26 11:41:47 io07 slurmctld[9354]: server_thread_count over limit (256), waiting
/var/log/messages-20181231.gz:Dec 27 10:49:28 io07 slurmctld[16862]: server_thread_count over limit (256), waiting
/var/log/messages-20181231.gz:Dec 29 15:36:18 io07 slurmctld[17366]: server_thread_count over limit (256), waiting
/var/log/messages-20190113.gz:Jan 8 15:25:08 io07 slurmctld[19868]: server_thread_count over limit (256), waiting
/var/log/messages-20190113.gz:Jan 9 10:51:59 io07 slurmctld[26980]: server_thread_count over limit (256), waiting
/var/log/messages-20190203.gz:Jan 29 12:02:00 io07 slurmctld[2484]: server_thread_count over limit (256), waiting
/var/log/messages-20190203.gz:Jan 29 17:29:27 io07 slurmctld[26128]: server_thread_count over limit (256), waiting
/var/log/messages:Feb 6 15:08:29 io08 slurmctld[18699]: server_thread_count over limit (256), waiting
/var/log/messages:Feb 7 16:47:39 io08 slurmctld[4123]: server_thread_count over limit (256), waiting

Regards,
Marcello

Marcello,

The service may be hitting a limit. Can you send the limits from the slurmctld?

cat /proc/`pgrep slurmctld`/limits

Please also send me the output of the service file if you have one in use, e.g.:

/usr/lib/systemd/system/slurmctld.service

Also, please attach your slurm.conf.

Can you also let us know if you have any other processes running on a consistent basis, such as sacctmgr running updates often?

Hi Marcello,

besides Jason's questions/requests, can you try temporarily disabling the Priority DebugFlags? I'm wondering if the amount of logging enabled by this flag is affecting ctld performance.
(In reply to Jason Booth from comment #12)
> Marcello,
>
> The service may be hitting a limit. Can you send the limits from the
> slurmctld?
>
> cat /proc/`pgrep slurmctld`/limits
>
> Please also send me the output of the service file if you have one in use,
> e.g.:
> /usr/lib/systemd/system/slurmctld.service
>
> Also, please attach your slurm.conf.
>
> Can you also let us know if you have any other processes running on a
> consistent basis, such as sacctmgr running updates often?

Hi Jason,

[root@io08 ~]# cat /proc/`pgrep slurmctld`/limits
Limit                     Soft Limit  Hard Limit  Units
Max cpu time              unlimited   unlimited   seconds
Max file size             unlimited   unlimited   bytes
Max data size             unlimited   unlimited   bytes
Max stack size            unlimited   unlimited   bytes
Max core file size        unlimited   unlimited   bytes
Max resident set          unlimited   unlimited   bytes
Max processes             514023      514023      processes
Max open files            65536       65536       files
Max locked memory         65536       65536       bytes
Max address space         unlimited   unlimited   bytes
Max file locks            unlimited   unlimited   locks
Max pending signals       514023      514023      signals
Max msgqueue size         819200      819200      bytes
Max nice priority         0           0
Max realtime priority     0           0
Max realtime timeout      unlimited   unlimited   us

[root@io08 ~]# cat /usr/lib/systemd/system/slurmctld.service
[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/slurmctld
ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
PIDFile=/var/run/slurmctld.pid

[Install]
WantedBy=multi-user.target

We have 2 scripts running on a daily basis:
- one script updating accounts' budget twice a day (8:30 a.m. and 2:30 p.m.)
- one script updating accounts' fairshare twice a day (9:00 a.m. and 3:00 p.m.)

These same scripts also run on the other 2 HPC clusters running Slurm that we have.

Best Regards,
Marcello

Created attachment 9112 [details]
slurm conf file
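As a side note on the limits discussion above: the reported limits look adequate here, but if a slurmctld limit ever does need raising, the usual route is a systemd drop-in rather than editing the packaged unit file. This is a generic, hypothetical sketch; the path and values are illustrative only, not a recommendation for this site:

```ini
# Hypothetical drop-in: /etc/systemd/system/slurmctld.service.d/limits.conf
# Overrides resource limits for slurmctld without modifying the packaged unit.
[Service]
LimitNOFILE=131072
LimitMEMLOCK=infinity
```

After creating the drop-in, apply it with `systemctl daemon-reload` followed by `systemctl restart slurmctld`, then re-check /proc/`pgrep slurmctld`/limits.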
(In reply to Alejandro Sanchez from comment #14)
> Hi Marcello,
>
> besides Jason's questions/requests, can you try temporarily disabling the
> Priority DebugFlags? I'm wondering if the amount of logging enabled by this
> flag is affecting ctld performance.

Priority DebugFlags removed.

Regards,
Marcello

Hi Marcello. Did the change have any positive effect? If not, will you show the output of 'sdiag'? Thanks.

(In reply to Alejandro Sanchez from comment #18)
> Hi Marcello. Did the change have any positive effect? If not, will you show
> the output of 'sdiag'? Thanks.

There have been no more disruptions so far, but I'm not fully sure this was the problem, since this is a relatively small cluster (~400 nodes) and we have the same debug flags active on a ~7000 node cluster with no issues. I guess we'll just have to wait and see.

Regards,
Marcello

(In reply to Cineca HPC Systems from comment #19)
> (In reply to Alejandro Sanchez from comment #18)
> > Hi Marcello. Did the change have any positive effect? If not, will you
> > show the output of 'sdiag'? Thanks.
>
> There have been no more disruptions so far, but I'm not fully sure this was
> the problem, since this is a relatively small cluster (~400 nodes) and we
> have the same debug flags active on a ~7000 node cluster with no issues.
> I guess we'll just have to wait and see.
>
> Regards,
> Marcello

As expected, the problem appeared again, and this time we had a little time to examine it. At first slurmctld becomes unresponsive to clients on other hosts (s* commands start to hang), BUT on the slurmctld host the commands still respond.
On the slurmd side, some nodes log this:

[2019-02-13T16:44:42.215] [780733.batch] debug: step_terminate_monitor_stop signaling condition
[2019-02-13T16:44:42.216] [780733.batch] job 780733 completed with slurm_rc = 0, job_rc = 0
[2019-02-13T16:44:42.216] [780733.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 0
[2019-02-13T16:45:18.956] [780733.extern] debug: jag_common_poll_data: Task 0 pid 33260 ave_freq = 1200042 mem size/max 16384/16384 vmem size/max 110481408/110481408, disk read size/max (2012/2012), disk write size/max (0/0), time 0.000000(0+0)
[2019-02-13T16:46:18.956] [780733.extern] debug: jag_common_poll_data: Task 0 pid 33260 ave_freq = 1199953 mem size/max 16384/16384 vmem size/max 110481408/110481408, disk read size/max (2012/2012), disk write size/max (0/0), time 0.000000(0+0)
[2019-02-13T16:46:22.167] [780733.0] debug: slurm_recv_timeout at 0 of 4, timeout
[2019-02-13T16:46:22.167] [780733.0] error: Rank 0 failed sending step completion message directly to slurmctld (0.0.0.0:0), retrying
[2019-02-13T16:46:22.320] [780733.batch] debug: slurm_recv_timeout at 0 of 4, timeout
[2019-02-13T16:46:22.320] [780733.batch] Retrying job complete RPC for 780733.4294967294
[2019-02-13T16:47:18.956] [780733.extern] debug: jag_common_poll_data: Task 0 pid 33260 ave_freq = 1199953 mem size/max 16384/16384 vmem size/max 110481408/110481408, disk read size/max (2012/2012), disk write size/max (0/0), time 0.000000(0+0)
[2019-02-13T16:48:17.424] [780733.batch] debug: slurm_recv_timeout at 0 of 4, timeout

16:45 being the time when the problem started.

As soon as the controller is restarted, the log of the above-mentioned slurmd node reports:

[2019-02-13T17:07:25.715] [780733.batch] debug: Failed to contact primary controller: Connection refused
[2019-02-13T17:07:25.716] [780733.batch] debug: Contacted backup controller 0
[2019-02-13T17:07:25.718] [780733.batch] debug: Primary not responding, backup not in control.
sleep and retry
[2019-02-13T17:08:08.396] [780733.0] debug: Handling REQUEST_STATE
[2019-02-13T17:08:08.396] debug: _fill_registration_msg: found apparently running step 780733.0
[2019-02-13T17:08:08.397] [780733.extern] debug: Handling REQUEST_STATE
[2019-02-13T17:08:08.397] debug: _fill_registration_msg: found apparently running step 780733.4294967295
[2019-02-13T17:08:08.398] [780733.batch] debug: Handling REQUEST_STATE
[2019-02-13T17:08:08.398] debug: _fill_registration_msg: found apparently running job 780733
[2019-02-13T17:08:08.445] debug: _handle_node_reg_resp: slurmctld sent back 9 TRES.
[2019-02-13T17:08:10.717] [780733.0] Rank 0 sent step completion message directly to slurmctld (0.0.0.0:0)
[2019-02-13T17:08:10.717] [780733.0] debug: Message thread exited
[2019-02-13T17:08:10.717] [780733.0] debug: mpi/pmi2: agent thread exit
[2019-02-13T17:08:10.718] [780733.0] done with job
[2019-02-13T17:08:18.956] [780733.extern] debug: jag_common_poll_data: Task 0 pid 33260 ave_freq = 1231578 mem size/max 16384/16384 vmem size/max 110481408/110481408, disk read size/max (2012/2012), disk write size/max (0/0), time 0.000000(0+0)
[2019-02-13T17:08:25.734] [780733.batch] debug: Message thread exited
[2019-02-13T17:08:25.734] debug: _rpc_terminate_job, uid = 6003
[2019-02-13T17:08:25.734] debug: task_p_slurmd_release_resources: affinity jobid 780733
[2019-02-13T17:08:25.734] [780733.batch] done with job
[2019-02-13T17:08:25.734] debug: credential for job 780733 revoked

Any clue? Controller nodes are reachable when the problem occurs, and we can successfully telnet to port 6817 of the controller's host from any node, so it doesn't seem to be a network problem.

Regards,
Marcello

Created attachment 9163 [details]
sdiag of the controller when not responding
The netstat attachment shows 255 tcp ctld connections in CLOSE_WAIT state and 1 in ESTABLISHED. The lsof attachment shows the CLOSE_WAIT states mostly happen on tcp connections with compute nodes. There are definitions of this state in both the RFC 793 specification [1] and the Linux implementation [2].

Slurm defines MAX_SERVER_THREADS as 256.

The ctld logs indicate this is galileo. The last cineca slurm.conf copies we have for marconi, power8 and galileo all have a TCPTimeout of 60. I'm curious as to what the differentiating factor is such that galileo with ~400 nodes is being hit by this problem while marconi with ~7000 isn't.

I guess the timing config values haven't changed and all clusters continue with TCPTimeout=60 and MessageTimeout=100?

I'd also be interested in whether there are differences in the kernel net tcp params like /proc/sys/net/ipv4/tcp_fin_timeout between clusters.

Which Slurm version is marconi running?

The gdb backtrace doesn't show information about all threads. I'd also be curious to see 'thread apply all bt full' output to discard any potential lock problems.

A fast I/O path to StateSaveLocation is also critical for ctld performance. Are there differences in this aspect between marconi and galileo?

sdiag shows a not negligible amount of RPCs by users root, edanovar and slurm000, and a big amount of REQUEST_PARTITION_INFO and REQUEST_NODE_INFO_SINGLE. Is any user/cron/script executing sinfo's, scontrol show's in a repetitive/heavy fashion?

For whatever reason these tcp connections aren't being closed, and the ctld doesn't accept more after hitting the limit. Then the compute nodes try to send more msgs like step completion and similar, but fail to do so.

In bug 6189 NERSC is experiencing something similar in 18.08.3. I attached a patch there to fix lockings around server_thread_count, but they haven't provided any feedback yet. We can consider applying it to galileo as well.
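The per-state connection counts described above can be summarized with a small one-liner. A minimal sketch, shown here against embedded sample lines rather than a live system; on a real controller the input would come from `netstat -tan` or `ss -tan`, and 6817 is assumed to be the (default) SlurmctldPort:

```shell
# Summarize TCP connection states for the slurmctld port (6817 by default).
# `sample` stands in for live `netstat -tan` output; column 4 is the local
# address, column 6 the connection state.
sample='tcp 0 0 10.23.16.1:6817 10.23.16.21:39334 CLOSE_WAIT
tcp 0 0 10.23.16.1:6817 10.23.16.26:45076 CLOSE_WAIT
tcp 0 0 10.23.16.1:6817 10.23.16.32:51172 ESTABLISHED'

# Count connections by state for local port 6817.
printf '%s\n' "$sample" \
  | awk '$4 ~ /:6817$/ {count[$6]++} END {for (s in count) print s, count[s]}' \
  | sort
```

On a controller approaching this hang, the CLOSE_WAIT count would be seen climbing toward the 256 server-thread limit.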
[1] https://tools.ietf.org/html/rfc793#page-21
[2] https://github.com/torvalds/linux/blob/master/net/ipv4/tcp.c#L237

(In reply to Alejandro Sanchez from comment #22)
> The netstat attachment shows 255 tcp ctld connections in CLOSE_WAIT state
> and 1 in ESTABLISHED. The lsof attachment shows the CLOSE_WAIT states mostly
> happen on tcp connections with compute nodes. There are definitions of this
> state in both the RFC 793 specification [1] and the Linux implementation [2].
>
> Slurm defines MAX_SERVER_THREADS as 256.
>
> The ctld logs indicate this is galileo. The last cineca slurm.conf copies we
> have for marconi, power8 and galileo all have a TCPTimeout of 60. I'm
> curious as to what the differentiating factor is such that galileo with ~400
> nodes is being hit by this problem while marconi with ~7000 isn't.
>
> I guess the timing config values haven't changed and all clusters continue
> with TCPTimeout=60 and MessageTimeout=100?
>
> I'd also be interested in whether there are differences in the kernel net
> tcp params like /proc/sys/net/ipv4/tcp_fin_timeout between clusters.
>
> Which Slurm version is marconi running?
>
> The gdb backtrace doesn't show information about all threads. I'd also be
> curious to see 'thread apply all bt full' output to discard any potential
> lock problems.
>
> A fast I/O path to StateSaveLocation is also critical for ctld performance.
> Are there differences in this aspect between marconi and galileo?
>
> sdiag shows a not negligible amount of RPCs by users root, edanovar and
> slurm000, and a big amount of REQUEST_PARTITION_INFO and
> REQUEST_NODE_INFO_SINGLE. Is any user/cron/script executing sinfo's,
> scontrol show's in a repetitive/heavy fashion?
>
> For whatever reason these tcp connections aren't being closed, and the ctld
> doesn't accept more after hitting the limit. Then the compute nodes try to
> send more msgs like step completion and similar, but fail to do so.
>
> In bug 6189 NERSC is experiencing something similar in 18.08.3.
> I attached a patch there to fix lockings around server_thread_count, but
> they haven't provided any feedback yet. We can consider applying it to
> galileo as well.
>
> [1] https://tools.ietf.org/html/rfc793#page-21
> [2] https://github.com/torvalds/linux/blob/master/net/ipv4/tcp.c#L237

Alejandro, our Slurm versions are as follows:
MARCONI: 18.08.4
GALILEO: 18.08.3
DAVIDE (Power8): 18.08.3

TCPTimeout and MessageTimeout are 60 and 100 on every cluster.
/proc/sys/net/ipv4/tcp_fin_timeout is 60s on every cluster.

OSes for the slurmctld hosts are:
MARCONI and GALILEO: CentOS 7.2
DAVIDE: CentOS 7.5

I'll try the gdb of all threads next time the problem happens, have a look at the high number of client requests, and let you know whether I find something or not.

Regards,
Marcello

With regards to comment 5, that's a separate issue. Will you please file a new sev-4 bug copying the problem there and assign it to me?

It looks like AllocNodes=LINCO_INVENEW,LINCO_tph1,LINCO_galinew,fer_prod1,eli_prod1 contains the values of the configured DenyAccounts=LINCO_INVENEW,LINCO_tph1,LINCO_galinew,fer_prod1,eli_prod1, and DenyAccounts isn't shown in the output of scontrol show part. Coincidentally, the two mentioned partitions are the only ones with DenyAccounts configured (at least in the last slurm.conf copy we have). Do you remember doing any 'scontrol update partition' to change fields dynamically?

I'm looking into the logic here to see how the output is built:
https://github.com/SchedMD/slurm/blob/slurm-18-08-3-1/src/api/partition_info.c#L169

Created attachment 9573 [details]
gdb trace for all threads
We have finally been able to get a gdb backtrace of all threads when the problem happened.

Best Regards,
Marcello

Hi Marcello,

Can you attach new output of sdiag and slurmctld.log?

1. Is the StateSaveLocation=/slurmstate/spool backed by a fast I/O filesystem? As a side note, overall ctld responsiveness under heavy load might be governed by latency reading/writing from StateSaveLocation.

2. Did you identify anything related to this question?

sdiag shows a not negligible amount of RPCs by users root, edanovar and slurm000, and a big amount of REQUEST_PARTITION_INFO and REQUEST_NODE_INFO_SINGLE. Is any user/cron/script executing sinfo's, scontrol show's in a repetitive/heavy fashion?

3. Can you try adding 'defer' to SchedulerParameters and see if that improves responsiveness?

Created attachment 9593 [details]
logs at 20190319
(In reply to Alejandro Sanchez from comment #27)
> Hi Marcello,
>
> Can you attach new output of sdiag and slurmctld.log?

I've attached the logs and netstat output; unfortunately sdiag is not reporting anything:

[root@master ~]# sdiag
slurm_get_statistics: Socket timed out on send/recv operation

> 1. Is the StateSaveLocation=/slurmstate/spool backed by a fast I/O
> filesystem? As a side note, overall ctld responsiveness under heavy load
> might be governed by latency reading/writing from StateSaveLocation.

The filesystem is a replicated glusterfs, which is showing no problems on MARCONI, which has a much higher job throughput and therefore load.

> 2. Did you identify anything related to this question?

We gave it some thought. The only difference between MARCONI and GALILEO, apart from the Slurm version (18.08.4 on MARCONI, 18.08.3 on GALILEO) and HW (GALILEO nodes are slightly older, but still powerful enough), is that the slurmctld hosts on GALILEO are behind a NAT (with appropriate firewall rules) and not directly connected to the slurmdbd. We have also noticed that sometimes GALILEO closes the connection to the slurmdbd and is not able to recover it, although the slurmctld keeps working:

Mar 17 19:02:51 r000u17l01 slurmdbd[53521]: debug: cluster galileo has disconnected

Moreover, if you take a look at the last netstat output, you can see that there are many packets waiting in the Recv-Q, but this may be due to the fact that the slurmctld is hanging and so are its connections...

> sdiag shows a not negligible amount of RPCs by users root, edanovar and
> slurm000, and a big amount of REQUEST_PARTITION_INFO and
> REQUEST_NODE_INFO_SINGLE. Is any user/cron/script executing sinfo's,
> scontrol show's in a repetitive/heavy fashion?

We haven't found anything unusual; MARCONI is managing hundreds of times more requests, to say the least, with no apparent problem.

> 3. Can you try adding 'defer' to SchedulerParameters and see if that
> improves responsiveness?
Added to the configuration.

Regards,

These two comments:

(In reply to Cineca HPC Systems from comment #20)
> As expected, the problem appeared again, and this time we had a little time
> to examine it. At first slurmctld becomes unresponsive to clients on other
> hosts (s* commands start to hang), BUT on the slurmctld host the commands
> still respond.

(In reply to Cineca HPC Systems from comment #29)
> We gave it some thought. The only difference between MARCONI and GALILEO,
> apart from the Slurm version (18.08.4 on MARCONI, 18.08.3 on GALILEO) and HW
> (GALILEO nodes are slightly older, but still powerful enough), is that the
> slurmctld hosts on GALILEO are behind a NAT (with appropriate firewall
> rules) and not directly connected to the slurmdbd.

make me wonder whether that is making the difference and impacting ctld performance. If you run 'sdiag' on the host where the ctld runs, do you get the output?

Good morning Marcello, did you have the chance to run s* client commands from SlurmctldHost?

Created attachment 9615 [details]
galileo sdiag output
Created attachment 9616 [details]
marconi sdiag output
(In reply to Alejandro Sanchez from comment #30)
> These two comments:
>
> (In reply to Cineca HPC Systems from comment #20)
> > As expected, the problem appeared again, and this time we had a little
> > time to examine it. At first slurmctld becomes unresponsive to clients on
> > other hosts (s* commands start to hang), BUT on the slurmctld host the
> > commands still respond.
>
> (In reply to Cineca HPC Systems from comment #29)
> > We gave it some thought. The only difference between MARCONI and GALILEO,
> > apart from the Slurm version (18.08.4 on MARCONI, 18.08.3 on GALILEO) and
> > HW (GALILEO nodes are slightly older, but still powerful enough), is that
> > the slurmctld hosts on GALILEO are behind a NAT (with appropriate
> > firewall rules) and not directly connected to the slurmdbd.
>
> make me wonder whether that is making the difference and impacting ctld
> performance. If you run 'sdiag' on the host where the ctld runs, do you get
> the output?

When it's stuck, all the s* commands time out, no matter the host. Otherwise it's working fine. Attached you can find the sdiag output. For comparison I've also attached MARCONI's sdiag output, which shows that activity is quite normal on GALILEO.

Regards,
Marcello

I have a couple more questions:

1. Do any of your Prolog, PrologSlurmctld or TaskProlog scripts execute scontrol, sinfo or other RPC-generating commands?

2. Are any of your users submitting jobs in a tight loop instead of using job arrays?

(In reply to Alejandro Sanchez from comment #35)
> I have a couple more questions:
>
> 1. Do any of your Prolog, PrologSlurmctld or TaskProlog scripts execute
> scontrol, sinfo or other RPC-generating commands?

We have just one PrologSlurmctld script running at submit time, which performs an "squeue" command in order to obtain the partition on which the job was submitted, plus 2 "scontrol" commands to hold the job if the budget is expired. We don't know if this is the right way to do it or if there is a more efficient way, but on MARCONI it has worked just fine until now.
> 2. Are any of your users submitting jobs in a tight loop instead of using
> job arrays?

No, this doesn't seem to be the case.

However, today the problem happened 3 times, and luckily we were able to observe it while it was going on. What happens is that at first only some commands (squeue and sinfo, for example) seem to stop working, while sdiag keeps working. This gave us the opportunity to dump the sdiag information as the thread count was growing (while no pending RPCs were present!!!) and also to trace all processes before hitting the 256-thread limit.

What we've seen is that during this time the slurmctld wasn't scheduling any jobs and kept accepting all requests coming from the clients. The problem seems to originate from the inability of the slurmctld threads to close their connections, which remain in a CLOSE_WAIT state indefinitely, therefore preventing the threads from exiting and increasing the thread count until everything gets stuck.

Attached you will find all the information we've gathered.

Regards,
Marcello

Created attachment 9655 [details]
sdiag output during hang
Created attachment 9656 [details]
gdb trace for all threads during hang
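The growing server thread count described in comment #36 can be sampled over time while the hang develops. A minimal sketch, run here against embedded sample text instead of a live `sdiag` call; the "Server thread count" field name matches the sdiag output format, but treat the surrounding sample lines as illustrative:

```shell
# Extract the server thread count from sdiag-style output so it can be
# logged periodically while a hang develops. `sdiag_sample` stands in for
# a live `sdiag` invocation on the controller host.
sdiag_sample='*******************************************************
sdiag output at Wed Mar 27 10:00:00 2019 (1553680800)
Server thread count: 243
Agent queue size:    0'

# Field separator ": *" splits "name: value"; the guard pattern picks the
# thread-count line only.
threads=$(printf '%s\n' "$sdiag_sample" \
  | awk -F': *' '/^Server thread count/ {print $2; exit}')
echo "server_threads=$threads"
```

On a live controller, running this in a cron or watch loop and alerting as the value approaches MAX_SERVER_THREADS (256) would give early warning of the hang.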
Ah, after inspecting your new backtraces I think I know what's going on.
Thread 256 (Thread 0x7f9e566cf700 (LWP 16165)):
#0 0x00007f9e9e5f1a39 in __strchr_sse42 () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007f9e9ed1b60e in _get_next_job_gres (in_val=0x0, cnt=cnt@entry=0x7f9e566ce498, gres_list=0x161d450, save_ptr=save_ptr@entry=0x7f9e566ce4a0, rc=rc@entry=0x7f9e566ce494) at gres.c:3226
prev_save_ptr = 0x7f9d540022f0 "gpu:kepler:N"
I think a malformed --gres request is triggering the ctld hang... which, coincidentally, was fixed in (guess which Slurm version) 18.08.4, and thus Marconi isn't affected because it runs with the fix:
https://github.com/SchedMD/slurm/commit/80e2cc41f86fc.patch
Will you try upgrading Galileo to the latest 18.08 release, Slurm 18.08.6-2, or at least apply this patch standalone, and see if that fixes the issue?
Can you also ask user sbonechi if he is requesting --gres=gpu:kepler:N ?
Thanks.
Hi. Any feedback on this? Could you apply the patch or upgrade? Thanks.

(In reply to Alejandro Sanchez from comment #40)
> Hi. Any feedback on this? Could you apply the patch or upgrade? Thanks.

We're discussing the opportunity to perform a rolling upgrade without interrupting production. We'll let you know.

Best Regards,
Marcello

(In reply to Cineca HPC Systems from comment #41)
> (In reply to Alejandro Sanchez from comment #40)
> > Hi. Any feedback on this? Could you apply the patch or upgrade? Thanks.
>
> We're discussing the opportunity to perform a rolling upgrade without
> interrupting production. We'll let you know.
>
> Best Regards,
> Marcello

All right. I'm lowering the severity until you can confirm this is no longer an issue once you run Slurm with that patch in place.

(In reply to Alejandro Sanchez from comment #42)
> (In reply to Cineca HPC Systems from comment #41)
> > (In reply to Alejandro Sanchez from comment #40)
> > > Hi. Any feedback on this? Could you apply the patch or upgrade? Thanks.
> >
> > We're discussing the opportunity to perform a rolling upgrade without
> > interrupting production. We'll let you know.
> >
> > Best Regards,
> > Marcello
>
> All right. I'm lowering the severity until you can confirm this is no longer
> an issue once you run Slurm with that patch in place.

We have upgraded to 18.08.6-2; now, when performing a job submission with the wrong --gres specification, the controller just rejects the job and the thread count doesn't increase:

[mmorgott@node165 ~]$ srun --nodes=1 --ntasks-per-node=4 --account=cin_staff --partition=gll_usr_gpuprod --gres=gpu:kepler:N --pty /bin/bash
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification

We'll keep monitoring, but I think we can consider this problem solved.

Regards,
Marcello

Hi Marcello. Can we close this? Thanks.

I'm going to close this as a duplicate of the referenced fix in bug 6077.
Please reopen if you need anything else from here. Thanks.

*** This ticket has been marked as a duplicate of ticket 6077 ***
Created attachment 9101 [details]
gdb trace

Hello,

since the upgrade to 18.08.3 we have been observing a strange behaviour of the slurmctld daemon. What happens, in fact, is that the slurmctld gets frozen: not only does it not schedule/dispatch any jobs, but it also doesn't accept any commands, resulting in all query commands (squeue, sinfo, etc.) hanging. In the slurmctld logs we can see:

[2019-02-07T16:47:21.897] _slurmctld_rpc_mgr: accept() connection from 10.23.16.26:45076
[2019-02-07T16:47:21.898] slurmctld_req: received opcode REQUEST_COMPLETE_BATCH_SCRIPT from 10.23.16.26:45076
[2019-02-07T16:47:22.447] _slurmctld_rpc_mgr: accept() connection from 10.23.16.32:51172
[2019-02-07T16:47:22.447] slurmctld_req: received opcode REQUEST_COMPLETE_BATCH_SCRIPT from 10.23.16.32:51172
[2019-02-07T16:47:36.579] _slurmctld_rpc_mgr: accept() connection from 10.23.16.5:44908
[2019-02-07T16:47:36.580] slurmctld_req: received opcode REQUEST_COMPLETE_BATCH_SCRIPT from 10.23.16.5:44908
[2019-02-07T16:47:39.027] _slurmctld_rpc_mgr: accept() connection from 10.23.16.21:39334
[2019-02-07T16:47:39.027] server_thread_count over limit (256), waiting
[2019-02-07T16:47:39.027] slurmctld_req: received opcode REQUEST_STEP_COMPLETE from 10.23.16.21:39334

as the last messages logged.

Strangely, this is the only cluster (we have 3) where this happens.

Best Regards,
Marcello