Ticket 8191

Summary: Scheduling issue using multiple partitions and node weights
Product: Slurm Reporter: Lee Reynolds <Lee.Reynolds>
Component: Scheduling Assignee: Felip Moll <felip.moll>
Status: RESOLVED INFOGIVEN QA Contact:
Severity: 2 - High Impact    
Priority: --- CC: nate
Version: 18.08.8   
Hardware: Linux   
OS: Linux   
See Also: https://bugs.schedmd.com/show_bug.cgi?id=8549
Site: ASU
Attachments: sdiag1.zip

Description Lee Reynolds 2019-12-05 18:08:32 MST
We're seeing an issue with jobs not being scheduled that does not make sense to us.

We have a serial partition that contains nodes that can only run serial jobs.

We also have a parallel partition that can run parallel jobs and is the overflow partition for serial jobs.

The weight on the serial-only nodes is 1.

The weight on the parallel nodes is 1024.

We're stuck in a situation where there are many nodes that seem to be open but are not having jobs scheduled on them.  

We believe this may be due to large jobs being submitted to the cluster, but we're not entirely sure.  

In the past users would avoid sending large (many core) jobs to the cluster because these were unlikely to run to completion due to preemption.  We have now moved to using fairshare scheduling which means that large jobs will not be preempted.  

I don't have stats to look at yet, but I believe more users are submitting larger jobs to the cluster than in the past, which means that this behavior is now manifesting.

When we removed a set of parallel-capable nodes from the serial partition, previously pending parallel jobs were immediately launched on them.
So it looks like pending serial jobs are somehow preventing parallel jobs from running.

Will pending serial jobs with a higher priority than a parallel job prevent that parallel job from launching even though there are resources available?

Please let us know what more information you need from us as I'm sure this explanation isn't sufficient.

We're looking at reconfiguring the cluster by adding OPA cards to our serial nodes so that they can be added to the parallel partition.  If we do this, can we expect an improvement in the problem we are seeing?
Comment 1 Lee Reynolds 2019-12-05 18:17:02 MST
The reason code for almost all of the pending jobs is (Priority)
Comment 2 Lee Reynolds 2019-12-05 18:18:23 MST
I'd also like to add that the nodes which are stuck idle are all parallel nodes.  The serial-only nodes are not idle.
Comment 3 Nate Rini 2019-12-05 19:31:03 MST
Lee,

Please attach your slurm.conf, slurmctld log, and the output of sdiag.

Thanks,
--Nate
Comment 6 Nate Rini 2019-12-06 09:32:40 MST
Lee,

Please make sure to change your slurm database password as it was included in the slurmdbd.conf.

Thanks,
--Nate
Comment 7 Felip Moll 2019-12-06 10:37:31 MST
Hi Lee,

I see your backfill is not getting to the end of the queue. Can you add the bf_continue parameter to SchedulerParameters to see if it helps?

I want to study your configuration, partitions, and sdiag a bit to give you some recommendations on the scheduler, but to be able to do so I'd also need:

squeue -o "%i %u %.11v %.10P %.10m %.10q %.10Q %.19V %.19S %.19e %.7T %.10E %.10n %r %N"
scontrol show nodes
scontrol show job

As you guessed, it is also possible that a larger job has reserved a big set of nodes, so lower-priority jobs may not be able to be scheduled on those nodes. The nodes appear as IDLE even when they are reserved by the scheduler. Multi-partition jobs can also contribute to this situation, where a job is evaluated in the first partition of the list but not the second one.

Provide me with the remaining information mentioned above and I will analyze your specific case.

Thanks
Comment 8 Lee Reynolds 2019-12-06 13:16:32 MST
Created attachment 12504 [details]
sdiag1.zip

Here is the info you requested.

I’ve also included the current version of the slurm.conf file and the logs for slurmctld and slurmdbd

I’ve added the bf_continue flag and also set the default log level to 4.




Lee Reynolds
Senior RC Architect
ASU Research Computing


T 480-965-9460 | E Lee.Reynolds@asu.edu<mailto:Lee.Reynolds@asu.edu>
researchcomputing.asu.edu<https://researchcomputing.asu.edu/> | research.asu.edu | rcstats.asu.edu<https://rcstats.asu.edu/>

How am I doing? Email my supervisor<mailto:Barnaby.Wasson@asu.edu> or send a Sun Award<https://cfo.asu.edu/hr-sunaward>.

Comment 9 Lee Reynolds 2019-12-07 13:30:57 MST
We’re now seeing an issue where the slurmctld daemon is getting stuck with the following message in its log:

server_thread_count over limit (256), waiting

Restarting the service “fixes” this, but I fear it will continue happening.

Are there any settings we should scale back?

Comment 10 Lee Reynolds 2019-12-07 15:34:24 MST
Is there a phone number we can call on Monday?

I’m being asked to escalate this issue.

Comment 11 Felip Moll 2019-12-09 04:39:45 MST
(In reply to Lee Reynolds from comment #10)
> Is there a phone number we can call on Monday?
> 
> I’m being asked to escalate this issue.
> 

Hi Lee,

Unfortunately our support model does not include phone/videoconference support; we prefer to keep track of everything in Bugzilla. This also helps us better understand the evolution of the bug.

In any case I am dedicating the day to this bug, so my intention is to be quite responsive.

I am already working on the issue, but I would need a few things from you:

1. Please stop using numeric debug levels if possible; the named levels are clearer and avoid confusion. The translation is:
       0: quiet
       1: fatal
       2: error
       3: info
       4: verbose
       5: debug
       6: debug2
       7: debug3
       8: debug4
       9: debug5

Please set your slurmctld debug level to 'debug2'. I may ask you *later* to enable DebugFlags=agent,backfill, but let's start at the beginning.
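For reference, a minimal sketch of setting the named level (the runtime change does not survive a slurmctld restart):

```
# slurm.conf
SlurmctldDebug=debug2
```

At runtime the same level can be applied with `scontrol setdebug debug2`.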

2. I need you to check your system tuning parameters. Can you go through this guide and see if your server parameters are properly sized?

https://slurm.schedmd.com/high_throughput.html

3. Can you check the ctld server and network load? Specifically, look at the number of open files.
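A quick sketch for counting those open files, assuming Linux procfs and a single slurmctld process (pidof and the /proc layout are standard on Linux, but verify on your distribution):

```shell
#!/bin/sh
# count_fds PID: count the file descriptors currently held by the given process.
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Example (assumes slurmctld is running):
#   count_fds "$(pidof slurmctld)"
```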

Send me the slurmctld.log so I can try to diagnose the RPC queue issues, plus a capture of sdiag every 10 seconds for 3 minutes.
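A minimal capture loop for that (a sketch; it assumes sdiag is in PATH, and the SDIAG_CMD override is only a testing convenience):

```shell
#!/bin/sh
# capture_sdiag N INTERVAL OUTDIR: run sdiag N times, INTERVAL seconds apart,
# saving each sample to its own timestamped file in OUTDIR.
SDIAG_CMD=${SDIAG_CMD:-sdiag}

capture_sdiag() {
    n=$1; interval=$2; outdir=$3
    mkdir -p "$outdir"
    i=0
    while [ "$i" -lt "$n" ]; do
        $SDIAG_CMD > "$outdir/sdiag.$i.$(date +%Y%m%dT%H%M%S).txt"
        i=$((i + 1))
        # Sleep between samples, but not after the last one.
        [ "$i" -lt "$n" ] && sleep "$interval"
    done
}

# Every 10 seconds for 3 minutes = 18 samples:
#   capture_sdiag 18 10 sdiag-capture
```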

The issue may be due to a massive job submission combined with some server parameters that are set too low.
Let me also know if the server is now stable or you're still having issues.
Comment 12 Felip Moll 2019-12-09 06:02:13 MST
Some off-topic items:

a) It would be nice if these two errors got fixed:

[2019-12-06T13:04:48.302] error: Ignoring invalid Allow/DenyQOS value: leereyno

[2019-12-06T13:04:48.441] error: read_slurm_conf: default partition not set.

b) CacheGroups option is deprecated since 16.05, you can remove the line in slurm.conf



I see also this warning:

[2019-12-06T13:04:48.445] TOPOLOGY: warning -- no switch can reach all nodes through its descendants. Do not use route/topology

That means you have disjoint sets of nodes in topology.conf, which implies that one job cannot use nodes from both sets at the same time. Can you attach your topology.conf? Is it possible that the issue you were seeing was related to this, where nodes in one set were idle but the job requested nodes in the other set?
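For illustration, a disjoint topology.conf looks like this (hypothetical switch and node names); with no common parent switch, a single job cannot span nodes from the two lines:

```
SwitchName=opa1 Nodes=serial[001-100]
SwitchName=opa2 Nodes=parallel[001-050]
```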


Besides that, can you provide dmesg -T from the ctld server? Do you use NFS?

Just a note: regarding comment 11, it would be best to capture the logs while the RPC issue is happening, or right after it has happened.


I am studying now the sdiag + backfill parameters too for your initial problem.
Comment 13 Felip Moll 2019-12-09 10:14:55 MST
Lee,

I suggest replacing your slurm.conf SchedulerParameters line with this one:

SchedulerParameters=max_rpc_cnt=150,bf_yield_interval=1000000,bf_max_time=150,bf_continue,bf_resolution=300,sched_interval=45,default_queue_depth=300,preempt_reorder_count=3,max_switch_wait=86400,bf_window=20160

I worked out a set of settings for your site based on your situation, sdiag, and output files. I think this can help fix your issues. The changed parameters are explained below.
Can you apply them and tell me how it goes?

Tuning these parameters is an empirical process, so after applying them we can analyze the sdiag output and make further adjustments.
I will be waiting for your feedback on this comment and the previous ones.

Let me know if you have any question.

----

Make backfill release locks when this RPC count is reached, to avoid starving RPCs:

max_rpc_cnt=150

Make backfill release locks more frequently so RPCs can be served more often:

bf_yield_interval=1000000
bf_max_time=150

Backfill will continue evaluating jobs in the list in the next iteration after breaking out of the current loop, rather than starting from the top again:

bf_continue

You have bf_resolution=30, i.e. 30 seconds. This can overload the scheduler; it can also make backfill slow and, as seen in sdiag, cause it to evaluate very few jobs, so I suggest moving to a greater value. Higher values give better responsiveness, but scheduling becomes a bit less precise. Read it as: look bf_window minutes into the future with bf_resolution seconds of resolution. Your bf_window is 20160, which means looking 14 days into the future while accounting for jobs finishing at a 30-second granularity to find available gaps. That seems excessive, so I'd suggest raising bf_resolution to at least 5 or 10 minutes:

bf_resolution=300

Your sched_interval=10 means the main scheduler (not backfill) runs every 10 seconds, which can also cause performance issues. I'd suggest increasing this value considerably. Backfill runs every 30 seconds by default and should cover most of the jobs. Let's set it to at least 45 seconds:

sched_interval=45

For default_queue_depth, a large value results in poor system responsiveness, since the scheduling logic will not release locks for other events to occur.
You can keep this value, but you may want to lower it if RPC issues are still seen.

default_queue_depth=300


Another parameter to consider:

To limit the minimum time between the end of one scheduling cycle and the beginning of the next, use sched_min_interval. Even with sched_interval set, an event like a job submission may trigger the main scheduler and make it run too often, causing lock starvation (and possibly RPC starvation).
The default is 2us, but if you observe the scheduler kicking in too often, consider increasing it.

sched_min_interval=2000000 (default)
Comment 14 Lee Reynolds 2019-12-09 14:22:57 MST
I’m going through your emails from this morning and will have more info later today.

Just wanted to let you know that your suggestion about the backfill scheduling setting seems to have done the trick as the cluster is now actively scheduling jobs.

So the initial issue is solved, but I’d still like to ensure that the thread exhaustion problem is also dealt with.

Comment 15 Felip Moll 2019-12-10 03:06:28 MST
(In reply to Lee Reynolds from comment #14)
> I’m going through your emails from this morning and will have more info
> later today.
> 
> Just wanted to let you know that your suggestion about the backfill
> scheduling setting seems to have done the trick as the cluster is now
> actively scheduling jobs.
> 
> So the initial issue is solved, but I’d still like to ensure that the thread
> exhaustion problem is also dealt with.

I'm glad it helped. 

I am pretty sure that my last comment suggestions with the backfill and slurm.conf will also help with RPCs issues.

Keep me posted.
Comment 16 Lee Reynolds 2019-12-11 12:24:59 MST
The scheduler is running well and our throughput has increased.

However, just to be sure that everything is good, I’m sending you everything you asked for in the three emails you sent out on Monday.

Here are the current settings:

/proc/sys/fs/file-max: 5000000

/proc/sys/net/ipv4/tcp_max_syn_backlog: 256

/proc/sys/net/ipv4/tcp_syncookies : 1

/proc/sys/net/ipv4/tcp_synack_retries : 5

/proc/sys/net/core/somaxconn : 128    **** RAISING TO 4096 ****

/proc/sys/net/ipv4/ip_local_port_range :   32768              60999
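A quick way to re-check those values after the somaxconn change (a sketch; same paths as above, printed with their current values):

```shell
#!/bin/sh
# print_tuning: print the current value of each kernel tuning parameter above.
print_tuning() {
    for p in \
        /proc/sys/fs/file-max \
        /proc/sys/net/ipv4/tcp_max_syn_backlog \
        /proc/sys/net/ipv4/tcp_syncookies \
        /proc/sys/net/ipv4/tcp_synack_retries \
        /proc/sys/net/core/somaxconn \
        /proc/sys/net/ipv4/ip_local_port_range
    do
        if [ -r "$p" ]; then
            printf '%s: %s\n' "$p" "$(cat "$p")"
        else
            printf '%s: (not readable)\n' "$p"
        fi
    done
}

print_tuning
```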

Munge has NOW been configured to use 10 threads

Here are the limits for slurmctld:

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            unlimited            unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             31155                31155                processes
Max open files            65536                65536                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       31155                31155                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us


The slurmctld process currently has 74 open files.


Here’s a link to the documents you requested:

https://www.dropbox.com/sh/mn33yssnvuhtmpc/AAAzC0-OM0PDb1vCZhGLc9UTa?dl=0

Comment 17 Lee Reynolds 2019-12-11 12:36:40 MST
I’ve updated the slurm.conf file to remove the leereyno qos, as it does not exist.

We do not use a default partition on the cluster, but instead use the job submission plugin to determine which partition a job should be sent to when no partition is defined by the user.  We’ve been using this since 2017 and I do not believe it to be a factor in the issues we were having.

The topology setting you’re seeing is what was recommended to us by SchedMD as a solution to a problem we were having with jobs attempting to run on nodes connected to different omnipath switches.  See bug # 7950 for more info.  We’ve been running the scheduler with the switches “disconnected” for over a month.  I don’t think this was related to our issue either.  I’ve included the topology.conf file in the dropbox link

I’ve included “dmesg -T” output as part of the dropbox share from my previous email.

We do use NFS.  We have home directories on NFS and our main software package repository is stored on NFS.




Lee Reynolds
Senior RC Architect
ASU Research Computing

T 480-965-9460 | E Lee.Reynolds@asu.edu
researchcomputing.asu.edu | research.asu.edu | rcstats.asu.edu

How am I doing? Email my supervisor (Barnaby.Wasson@asu.edu) or send a Sun Award (https://cfo.asu.edu/hr-sunaward).

From: bugs@schedmd.com <bugs@schedmd.com>
Sent: Monday, December 9, 2019 6:02 AM
To: Lee Reynolds <Lee.Reynolds@asu.edu>
Subject: [Bug 8191] Scheduling issue using multiple partitions and node weights

Comment # 12 on bug 8191 from Felip Moll <felip.moll@schedmd.com>

That's off topic:

a) but these two errors would be nice if they get fixed:

[2019-12-06T13:04:48.302] error: Ignoring invalid Allow/DenyQOS value: leereyno
[2019-12-06T13:04:48.441] error: read_slurm_conf: default partition not set.

b) The CacheGroups option is deprecated since 16.05; you can remove the line in slurm.conf.

I also see this warning:

[2019-12-06T13:04:48.445] TOPOLOGY: warning -- no switch can reach all nodes through its descendants. Do not use route/topology

That means you have disjoint sets of nodes in topology.conf, which implies that one job cannot use nodes from those two sets at the same time. Can you attach your topology.conf? Is it possible that the issue you were seeing was related to this, where nodes in one set were idle but the job requested nodes in the other set?

Besides that, can you provide dmesg -T from the ctld server? Do you use NFS?

Just a note: as in comment 11, it would be very good to catch the logs while the RPC issue is happening or right after it happened.

I am now also studying the sdiag + backfill parameters for your initial problem.

Comment 18 Lee Reynolds 2019-12-11 12:44:00 MST
We implemented the changes to SchedulerParameters as recommended and the cluster has been working great since then.

Does SchedMD have any documentation that would provide guidance on how to configure these sorts of settings?

There is of course this page: https://slurm.schedmd.com/sched_config.html, but it is pretty sparse on explanation.

SchedMD should really look into creating a training and certification process, preferably with multiple levels.  We’ve been studying the man pages and other documentation for years, but we would not have known to change the settings you recommended.





From: bugs@schedmd.com <bugs@schedmd.com>
Sent: Monday, December 9, 2019 10:15 AM
To: Lee Reynolds <Lee.Reynolds@asu.edu>
Subject: [Bug 8191] Scheduling issue using multiple partitions and node weights

Comment # 13 on bug 8191 from Felip Moll <felip.moll@schedmd.com>

Lee,

I suggest replacing your slurm.conf SchedulerParameters line with this one:

SchedulerParameters=max_rpc_cnt=150,bf_yield_interval=1000000,bf_max_time=150,bf_continue,bf_resolution=300,sched_interval=45,default_queue_depth=300,preempt_reorder_count=3,max_switch_wait=86400,bf_window=20160

I worked out a set of settings for your site based on your situation, sdiag, and output files. I think this can help fix your issues; I explain the changed parameters below. Can you apply them and tell me how it goes?

The adjustment of these parameters must be done empirically, so after applying them we can analyze sdiag outputs and make further adjustments. I will be waiting for your feedback on this comment and the previous ones.

Let me know if you have any questions.



----



Make backfill release its locks when this RPC count is reached, to avoid starving RPCs:

max_rpc_cnt=150



Make backfill release its locks more frequently, so that RPCs are served more often:

bf_yield_interval=1000000
bf_max_time=150



Backfill will continue evaluating jobs in the list in the next iteration after breaking out of the current loop, as opposed to starting from the top again:

bf_continue



You have bf_resolution=30, which means 30 seconds. This can overload the scheduler, so I suggest moving to a greater value. It can also cause backfill to be slow and, as seen in sdiag, evaluate very few jobs. A higher value gives better responsiveness, but scheduling can be a bit less precise. This can be read as: look bf_window minutes into the future with bf_resolution seconds of resolution. Your bf_window is 20160, which means looking 14 days into the future while considering jobs finishing every 30 seconds to see if there's a gap available. That is too much, so I'd suggest changing bf_resolution to at least 5 or 10 minutes:

bf_resolution=300
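A rough way to picture the reduction (an illustration only, not how backfill stores its data internally), using the bf_window and bf_resolution values from this ticket:

```python
# bf_window / bf_resolution trade-off, with this ticket's values.
bf_window_min = 20160          # 14 days, expressed in minutes
old_resolution_s = 30
new_resolution_s = 300         # the suggested 5 minutes

window_s = bf_window_min * 60
old_slots = window_s // old_resolution_s   # time slots to consider per window
new_slots = window_s // new_resolution_s

print(old_slots, new_slots)    # 40320 4032: a 10x smaller time map per backfill pass
```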



Your sched_interval=10 indicates that the main scheduler (not the backfill) runs every 10 seconds, which can also cause performance issues. I'd suggest increasing this value considerably. Backfill runs every 30 seconds by default and should cover most of the jobs. Let's set it to at least 45 seconds:

sched_interval=45



For default_queue_depth, specifying a large value will result in poor system responsiveness, since the scheduling logic will not release locks for other events to occur. You can keep this value, but you may want to lower it if issues with RPCs are still seen:

default_queue_depth=300





Another parameter to consider:

To limit the minimum time between the end of one scheduling cycle and the beginning of the next, use sched_min_interval. Even if you have sched_interval set, an event like a job submission may trigger the main scheduler and make it run too often, causing lock starvation (and possibly RPC starvation). The default is 2us, but if you observe the scheduler kicking in too often, consider increasing it:

sched_min_interval=2000000 (default)

Comment 19 Felip Moll 2019-12-12 04:52:49 MST
(In reply to Lee Reynolds from comment #16)
> The scheduler is running well and our throughput has increased.
> 
> However, just to be sure that everything is good, I’m sending you everything
> you asked for in the three emails you sent out on Monday.

I reviewed your parameters and I have two suggestions.


#1# You definitely want to increase somaxconn.

/proc/sys/net/core/somaxconn

Current value: 128
Recommended value: 4096
Explanation: This parameter limits the number of established connections and the backlog on a socket. 128 is way too low and may cause communication issues from clients or daemons, or even impede communication from the ctld. As explained in the docs, a burst of 1024 RPCs needs 1024 connections to succeed.


#2# You may want to increase tcp_max_syn_backlog value too.

/proc/sys/net/ipv4/tcp_max_syn_backlog

Current value: 256
Recommended value: 1024
Explanation: This parameter limits the number of possible half-open connections queued for a socket. If the slurmctld socket receives massive numbers of requests, for example job completion RPCs, it may limit throughput.

From linux/net/core/request_sock.c 

/*
 * Maximum number of SYN_RECV sockets in queue per LISTEN socket.
 * One SYN_RECV socket costs about 80bytes on a 32bit machine
 * ....
 * The minimum value of it is 128. Experiments with real servers show that
 * it is absolutely not enough even at 100conn/sec. 256 cures most
 * of problems.
 * This value is adjusted to 128 for low memory machines,
 * and it will increase in proportion to the memory of machine.
 * Note : Dont forget somaxconn that may limit backlog too.
 */
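A read-only sketch of that check (threshold values are the recommendations above; the paths are the standard Linux procfs locations, and raising them requires root, e.g. `sysctl -w net.core.somaxconn=4096` or an /etc/sysctl.d/ drop-in):

```shell
# Compare the current kernel limits against the values recommended in
# this ticket. This script only reads; it does not change anything.
want_somaxconn=4096
want_syn_backlog=1024

cur_somaxconn=$(cat /proc/sys/net/core/somaxconn 2>/dev/null || echo 0)
cur_syn_backlog=$(cat /proc/sys/net/ipv4/tcp_max_syn_backlog 2>/dev/null || echo 0)

if [ "$cur_somaxconn" -lt "$want_somaxconn" ]; then
    echo "net.core.somaxconn is $cur_somaxconn; raise to $want_somaxconn"
fi
if [ "$cur_syn_backlog" -lt "$want_syn_backlog" ]; then
    echo "net.ipv4.tcp_max_syn_backlog is $cur_syn_backlog; raise to $want_syn_backlog"
fi
```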


Another suggestion:
Check your log generation: slurmctld.log is huge for only a couple of days. I think you have modified job_submit_partition.c and job_submit_defaults.c, and both are generating tons of logs. Consider moving those messages to debug2 or debug3 so they are enabled only when necessary. Excessive logging can cause a real performance issue in slurmctld.

I guess the provided sdiag captures are from before applying all the backfill settings, since I see queue lengths of about 2400 jobs but the scheduler only trying to schedule ~250 jobs in each cycle. If they are from before the settings, you can now check how many jobs the main and backfill schedulers consider. For example, compare new sdiag data against the previously provided sdiag:

Main schedule statistics (microseconds):
        ...
	Mean depth cycle:  521
        ...
	Last queue length: 2342

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
        ...
	Last depth cycle (try sched): 248
	Depth Mean (try depth): 259
	Last queue length: 2342
	Queue length mean: 1388

The depth should track the queue length more closely.

As for the other files, I see everything fine and nothing unusual.
- About topology: OK, it is perfectly fine to have disconnected switches.
- About the default partition: It is OK, but if you want to get rid of the message just set one partition as the default; in the end you will be overriding it in the job submit plugin anyway, so it won't do any harm.
- About NFS: Just keep in mind that a stuck NFS mount can stall the ctld and cause timeouts or other issues. I am speaking in general.
- About documentation of backfill: The best approach, whenever you have an issue with the scheduler, is to analyze sdiag (man sdiag) captures over a period of time. Look at the performance of the main and backfill schedulers and see whether the numbers make sense: the number of considered jobs, but also the timings. Look also at your queue and see if there are bursts of jobs. Finally, check whether massive numbers of RPCs are being sent, which is also visible in sdiag. All of this will help you quantify your issue, and then, reading the description of all the scheduler parameters in 'man slurm.conf', you should be able to tune them.
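Comparing captures like this is easy to automate. A hypothetical helper (not part of Slurm; the field names and figures below are taken from the sdiag excerpt quoted in this comment) that extracts the relevant numbers from sdiag text output:

```python
import re

# Pull a few scheduler health numbers out of captured `sdiag` text so
# successive samples can be compared over time.
def parse_sdiag(text):
    fields = {
        "mean_depth": r"Mean depth cycle:\s+(\d+)",
        "bf_depth_mean": r"Depth Mean \(try depth\):\s+(\d+)",
        "last_queue_length": r"Last queue length:\s+(\d+)",
    }
    return {name: int(m.group(1))
            for name, pat in fields.items()
            if (m := re.search(pat, text)) is not None}

# Figures from the excerpt quoted above.
sample = (
    "Main schedule statistics (microseconds):\n"
    "\tMean depth cycle:  521\n"
    "\tLast queue length: 2342\n"
    "Backfilling stats\n"
    "\tDepth Mean (try depth): 259\n"
    "\tQueue length mean: 1388\n"
    "\tLast queue length: 2342\n"
)
stats = parse_sdiag(sample)

# Backfill depth covers only ~11% of the queue here, which is exactly the
# "depth far below queue length" symptom described in this comment.
coverage = stats["bf_depth_mean"] / stats["last_queue_length"]
```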

This case was quite obvious: the sdiag output showed that the schedulers never reached the end of the job queue, so jobs there were not scheduled even though nodes were idle. I therefore suggested bf_continue, which makes the next iteration resume with the job following the last one considered in the previous cycle, instead of starting from the beginning. After that, you had RPC issues, and we tuned some parameters to support more connections and RPCs.

There's no perfect guide for this, since every situation is very different. Components like NFS, network switches, technologies and architectures, job submission patterns (bursty HTC vs. big jobs vs. small jobs...), and so on lead to very different situations.

Now, with my suggestions applied, you need to see whether everything works as expected; if not, a second round may be needed.

> SchedMD should really look into creating a training and certification process, preferably with multiple levels.  

A few months ago SchedMD created a new department dedicated exclusively to training. It provides training modules for different Slurm areas. If you are interested, you can contact jess@schedmd.com.


I hope everything is clear; if not, please just ask.
Also let me know if everything is fixed for you and if we can close this issue.

Regards,
Felip
Comment 20 Lee Reynolds 2019-12-12 04:53:06 MST
I am out of the office Thursday December 12th.

I will be back on Friday the 13th.
If you need help with Research Computing resources such as Agave, Saguaro or Ocotillo, please submit a service request through our support portal:
https://rcstatus.asu.edu/servicerequest/
Comment 21 Felip Moll 2019-12-19 09:34:59 MST
Hi Lee,

Can you please confirm your issue has been solved and that we can close this bug?

Thanks
Comment 22 Lee Reynolds 2019-12-19 11:05:51 MST
Yes, I believe this issue has been resolved.

We’ll open a new ticket if we have any more problems.




Comment 23 Felip Moll 2019-12-19 14:00:41 MST
Thanks, closing it then!