aws: Increase the default master instance size to reduce etcd timeouts #1069

smarterclayton · 2019-01-15T18:25:46Z

After experimenting with disks we determined that we were CPU starved, not IO starved (later comments). This PR sets default master size to m4.xlarge which performs more consistently against 1.11 kube.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/3155/ is representative, many timeouts.

The CPU use in 1.11 kube is excessive due to two bugs that will be fixed in the rebase, after which we can try stepping back down to m4.large.

smarterclayton · 2019-01-15T18:26:30Z

/assign @wking

wking · 2019-01-15T18:40:35Z

Previous bump was #737. I'd floated 500 GiB in #844.

/lgtm
/hold

Do we need an exception for this, or is it a bugfix? Also, do we need to stuff this into 0.10.0, or can it land in follow-up work?

smarterclayton · 2019-01-15T19:01:53Z

Exception granted, bug fix, release blocker. I don't think it has to be in 0.10.0 unless everything magically is fixed, in which case we'll rebuild 0.10.0

smarterclayton · 2019-01-15T20:33:37Z

Ok, here's the latency distribution of etcd "took too long" error messages in the logs:

Red is another run, blue is this PR, and the x-axis is in milliseconds.

So we're still seeing substantially the same curve, but the outliers are a bit better.

smarterclayton · 2019-01-15T20:50:32Z

Retrying with 120gb/400iop/io1 drives

wking · 2019-01-15T22:21:59Z

All nice and green :). Can you plot a pretty graph with the io1 results too, so we can ooh and ahh?

smarterclayton · 2019-01-15T22:32:01Z

Guess what?!

io1 did nothing!!!!

smarterclayton · 2019-01-15T22:32:17Z

I love data

smarterclayton · 2019-01-15T22:35:47Z

For the curious, here's the one-liner I used on the etcd logs:

cat ~/Downloads/kube-system_etcd-member-ip-10-0-* | sed -E -e 's/.*took too long \(([^\)]+)\).*/\1/;t;d' | sed -E -e 's/^([0-9]+)\.([0-9]{3})([0-9]*)s/\1\2.\3ms/' | sed -e 's/ms//' | sort -n > /tmp/times3.txt
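For readers who would rather not decode the sed, here is a rough Go sketch of the same extraction. It assumes the value inside the parentheses is something `time.ParseDuration` accepts (for example `250ms` or `1.5s`). The `extractMillis` name and the sample lines in `main` are made up for illustration and are not taken from a real run.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
	"time"
)

// tookTooLong captures the duration in parentheses after "took too long".
var tookTooLong = regexp.MustCompile(`took too long \(([^)]+)\)`)

// extractMillis scans log text line by line and returns every matched
// duration in milliseconds, sorted ascending. Lines without a match, or
// with a duration that time.ParseDuration cannot read, are skipped.
func extractMillis(logs string) []float64 {
	var out []float64
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		m := tookTooLong.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		d, err := time.ParseDuration(m[1])
		if err != nil {
			continue
		}
		out = append(out, float64(d)/float64(time.Millisecond))
	}
	sort.Float64s(out)
	return out
}

func main() {
	// Hypothetical sample lines, for illustration only.
	sample := "request a took too long (1.5s) to execute\n" +
		"unrelated line\n" +
		"request b took too long (250ms) to execute\n"
	for _, ms := range extractMillis(sample) {
		fmt.Println(ms)
	}
}
```

Parsing with `time.ParseDuration` avoids the separate seconds-to-milliseconds rewrite, since the unit suffix is handled in one place.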

smarterclayton · 2019-01-15T22:43:43Z

Trying with m4.xlarge

.openshift_install.log

smarterclayton · 2019-01-16T02:45:05Z

m4.xlarge (4 cpu, up from 2) definitely improved it:

I think we should bump the instance size for now until the rebase lands, at which point we can look at reducing it. The two known CPU issues should be addressed in the rebase.

(this is log scale y-axis, so a 10x improvement in tail latency)
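To put a number on "tail latency" rather than reading it off a log-scale plot, one option is to compare high percentiles of the extracted times. Below is a minimal Go sketch using the nearest-rank method. The `percentile` helper is hypothetical, and the millisecond values in `main` are invented purely to show the shape of the comparison; they are not measurements from this PR.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100)
// of values. It returns 0 for an empty slice and does not modify the
// caller's slice.
func percentile(values []float64, p float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Invented values in milliseconds, for illustration only.
	baseline := []float64{110, 120, 150, 400, 9000}
	bumped := []float64{105, 115, 140, 300, 900}
	for _, p := range []float64{50, 99} {
		fmt.Printf("p%.0f: baseline=%.0fms bumped=%.0fms\n",
			p, percentile(baseline, p), percentile(bumped, p))
	}
}
```

A median that barely moves alongside a much smaller p99 is the pattern a "same curve, better outliers" description would correspond to.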

abhinavdahiya · 2019-01-16T17:25:00Z

pkg/asset/machines/master.go

@@ -81,7 +82,7 @@ func (m *Master) Generate(dependencies asset.Parents) error {
 		if err != nil {
 			return errors.Wrap(err, "failed to create master machine objects")
 		}
-		aws.ConfigMasters(machines, ic.ObjectMeta.Name)
+		aws.ConfigMasters(machines, ic.ObjectMeta.Name, mpool.InstanceType)


why is this required?

@wking ^ this is what I get :)

> why is this required?

Ah, sorry about that. Yeah, we can probably drop this, since aws.provider should be applying this for us.

@smarterclayton, can you unwind this change with:

diff --git a/pkg/asset/machines/aws/machines.go b/pkg/asset/machines/aws/machines.go
index 865d40a..7e04a99 100644
--- a/pkg/asset/machines/aws/machines.go
+++ b/pkg/asset/machines/aws/machines.go
@@ -120,12 +120,10 @@ func tagsFromUserTags(clusterID, clusterName string, usertags map[string]string)
 	return tags, nil
 }

-// ConfigMasters sets the PublicIP flag, bumps the instance type, and
-// assigns a set of load balancers to the given machines
-func ConfigMasters(machines []clusterapi.Machine, clusterName string, instanceType string) {
+// ConfigMasters sets the PublicIP flag and assigns a set of load balancers to the given machines
+func ConfigMasters(machines []clusterapi.Machine, clusterName string) {
 	for _, machine := range machines {
 		providerSpec := machine.Spec.ProviderSpec.Value.Object.(*awsprovider.AWSMachineProviderConfig)
-		providerSpec.InstanceType = instanceType
 		providerSpec.PublicIP = pointer.BoolPtr(true)
 		providerSpec.LoadBalancers = []awsprovider.LoadBalancerReference{
 			{
diff --git a/pkg/asset/machines/master.go b/pkg/asset/machines/master.go
index f4c3806..2c782ac 100644
--- a/pkg/asset/machines/master.go
+++ b/pkg/asset/machines/master.go
@@ -82,7 +82,7 @@ func (m *Master) Generate(dependencies asset.Parents) error {
 		if err != nil {
 			return errors.Wrap(err, "failed to create master machine objects")
 		}
-		aws.ConfigMasters(machines, ic.ObjectMeta.Name, mpool.InstanceType)
+		aws.ConfigMasters(machines, ic.ObjectMeta.Name)

 		list := listFromMachines(machines)
 		raw, err := yaml.Marshal(list)

I'm testing that now to make sure it still works...

abhinavdahiya · 2019-01-16T17:25:25Z

pkg/asset/machines/worker.go

-	return awstypes.MachinePool{
-		InstanceType: "m4.large",
-	}
+	return awstypes.MachinePool{}


why is this emptied ?

> why is this emptied?

Because we no longer have any default AWS configuration shared by both masters and workers. They each have their own defaults for instance types (and, with #1079, for volumes) that get applied directly after the defaultAWSMachinePoolPlatform() calls. In fact, my preference would be to drop defaultAWSMachinePoolPlatform entirely in favor of an explicit awstypes.MachinePool{} for initializing AWS machine pools.
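The layering described here can be sketched in a few lines of Go. This is an illustration under assumptions, not the installer's code: `MachinePool` is a one-field stand-in for awstypes.MachinePool, `Set` and `resolve` are hypothetical helpers, the worker default of `m4.large` is assumed from the shared default that was removed, and `m5.2xlarge` is an arbitrary example override.

```go
package main

import "fmt"

// MachinePool is a simplified stand-in for awstypes.MachinePool;
// the real type carries more fields.
type MachinePool struct {
	InstanceType string
}

// Set copies non-empty fields from override onto p. Hypothetical
// helper for this sketch.
func (p *MachinePool) Set(override *MachinePool) {
	if override == nil {
		return
	}
	if override.InstanceType != "" {
		p.InstanceType = override.InstanceType
	}
}

// resolve starts from an empty pool, applies the per-role default,
// then lets a user-supplied pool override it.
func resolve(roleDefault string, user *MachinePool) MachinePool {
	pool := MachinePool{} // no shared AWS default any more
	pool.InstanceType = roleDefault
	pool.Set(user)
	return pool
}

func main() {
	// Master default from this PR; worker default is an assumption.
	fmt.Println("master:", resolve("m4.xlarge", nil).InstanceType)
	fmt.Println("worker:", resolve("m4.large", nil).InstanceType)
	// An explicit user choice wins over the role default.
	user := &MachinePool{InstanceType: "m5.2xlarge"}
	fmt.Println("master (override):", resolve("m4.xlarge", user).InstanceType)
}
```

Starting from an explicit empty struct makes it obvious that every default comes from the role-specific line that follows it.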

Ah, maybe this was working around #1076.

smarterclayton · 2019-01-16T19:39:19Z

/retest

smarterclayton · 2019-01-16T21:18:14Z

I had a run with setting proper resource constraints on etcd and it didn't move the needle enough (4x improvement). We need this as well.

Blue is baseline, red is moving etcd to have a 300m CPU request (openshift/machine-config-operator#316), and yellow is doubling cores. So it's a good tail and

I updated the PR to address Trevor's comments; let me know what else needs to be done.

smarterclayton · 2019-01-16T21:20:14Z

/retest

wking · 2019-01-16T21:22:05Z

nit: can we squash down to one commit?

wking · 2019-01-16T21:23:04Z

pkg/asset/machines/aws/machines.go

@@ -120,7 +120,8 @@ func tagsFromUserTags(clusterID, clusterName string, usertags map[string]string)
 	return tags, nil
 }

-// ConfigMasters sets the PublicIP flag and assigns a set of load balancers to the given machines
+// ConfigMasters sets the PublicIP flag, bumps the instance type, and
+// assigns a set of load balancers to the given machines


No need to bump this comment, since we're no longer bumping ConfigMasters.

We were seeing frequent long requests from etcd. After increasing CPU (2 -> 4 cores) those pauses dropped significantly. Increase the limit until the rebase lands and we can deliver the CPU perf improvements to the control plane. Make a set of changes to connect machine set size to the values passed as input. Update the docs.

smarterclayton · 2019-01-16T21:27:02Z

de-nitified

crawford · 2019-01-16T21:32:23Z

/lgtm

wking · 2019-01-16T21:32:48Z

/lgtm
/hold

@smarterclayton, I'm fine if you want to pull the hold yourself, or if you want to wait for @abhinavdahiya and/or @crawford.

openshift-ci-robot · 2019-01-16T21:33:02Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [crawford,smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2019-01-16T21:33:19Z

/hold cancel

I see @crawford is already here ;).

abhinavdahiya · 2019-01-16T21:40:21Z

IMO this change is not required for this PR https://github.com/openshift/installer/pull/1069/files#diff-df3202277ffea159153362ec095a8762

smarterclayton · 2019-01-16T21:43:55Z

I can nuke that

wking · 2019-01-16T21:45:20Z

> IMO this change is not required for this PR https://github.com/openshift/installer/pull/1069/files#diff-df3202277ffea159153362ec095a8762

I think we want that change, but I'm fine going either way to get this PR landed and hashing it out later.

wking · 2019-01-16T22:37:30Z

e2e-aws:

fail [k8s.io/kubernetes/test/e2e/framework/util.go:2355]: Expected error:
    <*errors.errorString | 0xc4215d4d30>: {
        s: "failed to get logs from pod-configmaps-679ec92b-19da-11e9-9cbd-0a58ac1041b1 for configmap-volume-test: an error on the server (\"unknown\") has prevented the request from succeeding (get pods pod-configmaps-679ec92b-19da-11e9-9cbd-0a58ac1041b1)",
    }
    failed to get logs from pod-configmaps-679ec92b-19da-11e9-9cbd-0a58ac1041b1 for configmap-volume-test: an error on the server ("unknown") has prevented the request from succeeding (get pods pod-configmaps-679ec92b-19da-11e9-9cbd-0a58ac1041b1)
not to have occurred

Jan 16 22:02:27.651 W persistentvolume=volume-idempotent-delete-6svns Error deleting EBS volume "vol-037c7a92290037b82" since volume is in "deleting" state count(1)

failed: (37.9s) 2019-01-16T22:02:43 "[sig-storage] ConfigMap should be consumable from pods in volume as non-root with defaultMode and fsGroup set [NodeFeature:FSGroup] [Suite:openshift/conformance/parallel] [Suite:k8s]"

and more (generally involving Error from server: error dialing backend: remote error: tls: internal error).

/retest

wking · 2019-01-17T01:14:08Z

images:

2019/01/17 01:07:20 Copied 0.20Mi of artifacts from release-latest to /logs/artifacts/release-latest
2019/01/17 01:07:26 Ran for 9m46s
error: could not run steps: test "release-latest" failed: pod release-latest was already deleted

/retest

jeremyeder · 2019-01-17T14:36:37Z

related: #1087

mykaul · 2019-01-19T19:31:57Z

Any idea if we need something similar for libvirt-based installations? I'm keeping (since it doesn't seem to be accepted) this small diff:

[ykaul@ykaul installer]$ git diff
diff --git a/pkg/asset/machines/libvirt/machines.go b/pkg/asset/machines/libvirt/machines.go
index 909e08a51..defa4ab36 100644
--- a/pkg/asset/machines/libvirt/machines.go
+++ b/pkg/asset/machines/libvirt/machines.go
@@ -64,8 +64,8 @@ func provider(clusterName string, networkInterfaceAddress string, platform *libv
                        APIVersion: "libvirtproviderconfig.k8s.io/v1alpha1",
                        Kind:       "LibvirtMachineProviderConfig",
                },
-               DomainMemory: 2048,
-               DomainVcpu:   2,
+               DomainMemory: 4096,
+               DomainVcpu:   4,
                Ignition: &libvirtprovider.Ignition{
                        UserDataSecret: userDataSecret,
                },

But it seems the above kinda proves my point? (Note - did not test for etcd timeout, but I am frequently failing deployment on timeouts...)

wking · 2019-01-19T20:16:27Z

> I'm keeping (since it doesn't seem to be accepted) this small diff...

No need for patching, since #785 set these up to allow environment-variable overrides. I can't speak for the other maintainers, but personally I'd rather punt further tuning until after the origin rebase lands, since that's rumored to have some memory-usage improvements.

openshift-ci-robot added the size/XS and approved labels Jan 15, 2019

openshift-ci-robot requested review from hardys and russellb January 15, 2019 18:25

openshift-ci-robot assigned wking Jan 15, 2019

openshift-ci-robot added the do-not-merge/hold and lgtm labels Jan 15, 2019

smarterclayton force-pushed the default_disk branch from c711595 to e8ef5f9 January 15, 2019 20:50

openshift-ci-robot removed the lgtm label Jan 15, 2019

smarterclayton changed the title ~~aws: Increase the default master disk size to 160gb for more io~~ aws: Increase the default master disk size for more io Jan 15, 2019

smarterclayton force-pushed the default_disk branch from e8ef5f9 to 8901a31 January 15, 2019 22:39

smarterclayton force-pushed the default_disk branch from 8901a31 to 5ce166a January 15, 2019 22:44

openshift-ci-robot added the size/XL label and removed the size/XS label Jan 15, 2019

wking reviewed Jan 15, 2019 (.openshift_install.log, outdated)

smarterclayton force-pushed the default_disk branch from 5ce166a to e97c0a2 January 16, 2019 01:29

openshift-ci-robot added the size/XS label and removed the size/XL label Jan 16, 2019

smarterclayton force-pushed the default_disk branch from e97c0a2 to 8b0b205 January 16, 2019 02:46

openshift-ci-robot added the size/S label and removed the size/XS label Jan 16, 2019

abhinavdahiya reviewed Jan 16, 2019

smarterclayton force-pushed the default_disk branch from 5768d95 to 620dd6a January 16, 2019 19:23

cgwalters mentioned this pull request Jan 16, 2019: set cpu request and remove limits on pods openshift/machine-config-operator#316 (merged)

wking reviewed Jan 16, 2019

smarterclayton force-pushed the default_disk branch from 620dd6a to 82bd538 January 16, 2019 21:26

openshift-ci-robot assigned crawford Jan 16, 2019

openshift-ci-robot added the lgtm label Jan 16, 2019

openshift-ci-robot removed the do-not-merge/hold label Jan 16, 2019

openshift-merge-robot merged commit b63074b into openshift:master Jan 17, 2019
