aws: Increase the default master instance size to reduce etcd timeouts #1069
Conversation
/assign @wking
Exception granted, bug fix, release blocker. I don't think it has to be in 0.10.0 unless everything magically is fixed, in which case we'll rebuild 0.10.0.
Force-pushed c711595 to e8ef5f9
Retrying with 120 GB / 400 IOPS io1 drives
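For reference, a rough, hypothetical sketch of that volume shape as data; the RootVolume struct and its field names are illustrative stand-ins, not the installer's actual machine-pool types:

// Hypothetical sketch only: RootVolume mirrors the kind of per-pool
// volume settings being experimented with here (120 GB, 400 IOPS, io1).
package main

import "fmt"

type RootVolume struct {
	IOPS int    // provisioned IOPS; only meaningful for io1 volumes
	Size int    // volume size in GiB
	Type string // EBS volume type, e.g. "gp2" or "io1"
}

func main() {
	v := RootVolume{IOPS: 400, Size: 120, Type: "io1"}
	fmt.Printf("%+v\n", v)
}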
All nice and green :). Can you plot a pretty graph with the io1 results too, so we can ooh and ahh?
I love data
For the curious, here's the one-liner I used on the etcd logs:
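The one-liner itself did not survive the copy. As a rough, hypothetical stand-in, this Go sketch extracts the same signal: it scans etcd logs on stdin for the "took too long" slow-request warnings and prints the reported durations. The exact log format and regex are assumptions.

// Hypothetical stand-in for the lost log-scanning one-liner.
// etcd warns about slow requests with lines roughly like:
//   ... read-only range request ... took too long (123.456ms) to execute
// This reads a log on stdin and prints each reported duration.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`took too long \(([0-9.]+\S*s)\)`)
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		if m := re.FindStringSubmatch(scanner.Text()); m != nil {
			fmt.Println(m[1]) // e.g. "123.456ms"
		}
	}
}

Feeding the etcd pod logs through something like this (or a grep/awk equivalent) yields a column of durations that can be plotted as the graphs discussed below.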
Force-pushed e8ef5f9 to 8901a31
Trying with m4.xlarge
Force-pushed 8901a31 to 5ce166a
Force-pushed 5ce166a to e97c0a2
Force-pushed e97c0a2 to 8b0b205
pkg/asset/machines/master.go
Outdated
@@ -81,7 +82,7 @@ func (m *Master) Generate(dependencies asset.Parents) error {
 	if err != nil {
 		return errors.Wrap(err, "failed to create master machine objects")
 	}
-	aws.ConfigMasters(machines, ic.ObjectMeta.Name)
+	aws.ConfigMasters(machines, ic.ObjectMeta.Name, mpool.InstanceType)
why is this required?
@wking ^ this is what I get :)
why is this required?
Ah, sorry about that. Yeah, we can probably drop this, since aws.provider should be applying this for us.
@smarterclayton, can you unwind this change with:
diff --git a/pkg/asset/machines/aws/machines.go b/pkg/asset/machines/aws/machines.go
index 865d40a..7e04a99 100644
--- a/pkg/asset/machines/aws/machines.go
+++ b/pkg/asset/machines/aws/machines.go
@@ -120,12 +120,10 @@ func tagsFromUserTags(clusterID, clusterName string, usertags map[string]string)
return tags, nil
}
-// ConfigMasters sets the PublicIP flag, bumps the instance type, and
-// assigns a set of load balancers to the given machines
-func ConfigMasters(machines []clusterapi.Machine, clusterName string, instanceType string) {
+// ConfigMasters sets the PublicIP flag and assigns a set of load balancers to the given machines
+func ConfigMasters(machines []clusterapi.Machine, clusterName string) {
for _, machine := range machines {
providerSpec := machine.Spec.ProviderSpec.Value.Object.(*awsprovider.AWSMachineProviderConfig)
- providerSpec.InstanceType = instanceType
providerSpec.PublicIP = pointer.BoolPtr(true)
providerSpec.LoadBalancers = []awsprovider.LoadBalancerReference{
{
diff --git a/pkg/asset/machines/master.go b/pkg/asset/machines/master.go
index f4c3806..2c782ac 100644
--- a/pkg/asset/machines/master.go
+++ b/pkg/asset/machines/master.go
@@ -82,7 +82,7 @@ func (m *Master) Generate(dependencies asset.Parents) error {
if err != nil {
return errors.Wrap(err, "failed to create master machine objects")
}
- aws.ConfigMasters(machines, ic.ObjectMeta.Name, mpool.InstanceType)
+ aws.ConfigMasters(machines, ic.ObjectMeta.Name)
list := listFromMachines(machines)
raw, err := yaml.Marshal(list)
I'm testing that now to make sure it still works...
-	return awstypes.MachinePool{
-		InstanceType: "m4.large",
-	}
+	return awstypes.MachinePool{}
why is this emptied?
@wking ^
why is this emptied?
Because we no longer have any default AWS configuration shared by both masters and workers. They each have their own defaults for instance types (and, with #1079, for volumes) that get applied directly after the defaultAWSMachinePoolPlatform() calls. In fact, my preference would be to drop defaultAWSMachinePoolPlatform entirely in favor of an explicit awstypes.MachinePool{} for initializing AWS machine pools.
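To make that flow concrete, here is a minimal, hypothetical sketch of the pattern being described; MachinePool stands in for awstypes.MachinePool, and the per-role assignments are illustrative, not the installer's actual asset code:

// Hypothetical sketch of per-role defaulting; MachinePool stands in
// for awstypes.MachinePool.
package main

import "fmt"

type MachinePool struct {
	InstanceType string
}

// defaultAWSMachinePoolPlatform now returns an empty pool; there is
// no longer a default shared by masters and workers.
func defaultAWSMachinePoolPlatform() MachinePool {
	return MachinePool{}
}

func main() {
	// Each role applies its own default right after the call.
	master := defaultAWSMachinePoolPlatform()
	master.InstanceType = "m4.xlarge" // master default after this PR

	worker := defaultAWSMachinePoolPlatform()
	worker.InstanceType = "m4.large" // worker default is unchanged

	fmt.Println(master.InstanceType, worker.InstanceType)
}

The point of dropping the shared default is that each role's choice becomes visible at the call site instead of being hidden in a helper.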
Ah, maybe this was working around #1076.
Force-pushed 5768d95 to 620dd6a
/retest
I had a run with setting proper resource constraints on etcd, and it didn't move the needle enough (a 4x improvement, but not sufficient). We need this as well. Blue is baseline, red is moving etcd to a 300m CPU request (openshift/machine-config-operator#316), and yellow is doubling cores. So it's a good improvement in the tail. I updated with the comments from Trevor; let me know what else needs to be done.
/retest
nit: can we squash down to one commit?
pkg/asset/machines/aws/machines.go
Outdated
@@ -120,7 +120,8 @@ func tagsFromUserTags(clusterID, clusterName string, usertags map[string]string)
 	return tags, nil
 }
 
-// ConfigMasters sets the PublicIP flag and assigns a set of load balancers to the given machines
+// ConfigMasters sets the PublicIP flag, bumps the instance type, and
+// assigns a set of load balancers to the given machines
No need to bump this comment, since we're no longer bumping ConfigMasters.
We were seeing frequent long requests from etcd. After increasing CPU (2 -> 4 cores) those pauses dropped significantly. Increase the limit until the rebase lands and we can deliver the CPU perf improvements to the control plane. Make a set of changes to connect machine set size to the values passed as input. Update the docs.
Force-pushed 620dd6a to 82bd538
de-nitified
/lgtm
/lgtm
@smarterclayton, I'm fine if you want to pull the hold yourself, or if you want to wait for @abhinavdahiya and/or @crawford.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: crawford, smarterclayton, wking
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Approvers can indicate their approval by writing /approve in a comment.
/hold cancel
I see @crawford is already here ;).
IMO this change is not required for this PR: https://github.com/openshift/installer/pull/1069/files#diff-df3202277ffea159153362ec095a8762
I can nuke that
I think we want that change, but I'm fine going either way to get this PR landed and hashing it out later.
/retest
/retest
related: #1087
Any idea if we need something similar for libvirt-based installation? I'm keeping (since it doesn't seem to be accepted) this small diff:
But it seems the above kinda proves my point? (Note: I did not test for etcd timeouts, but I am frequently failing deployments on timeouts...)
No need for patching, since #785 set these up to allow environment-variable overrides. I can't speak for the other maintainers, but personally I'd rather punt further tuning until after the origin rebase lands, since that's rumored to have some memory-usage improvements.
After experimenting with disks, we determined that we were CPU-starved, not IO-starved (see later comments). This PR sets the default master size to m4.xlarge, which performs more consistently against 1.11 kube.
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/3155/ is representative: many timeouts.
The CPU use in 1.11 kube is excessive due to two bugs that will be fixed in the rebase, after which we can try stepping back down to m4.large.