
data/aws: Bump default instance type from t2.medium to t3.medium #724

Merged
3 commits merged on Nov 30, 2018

Conversation

wking
Member

@wking wking commented Nov 26, 2018

The T3 instances were launched on 2018-08-21. The documented stats are fairly similar to those of the T2 analogs. The only obvious difference there is that the T3 instances are 10% cheaper. However, from here:

If the instance runs at higher CPU utilization for a prolonged period, there will be an additional charge of $0.05 per vCPU-hour.

and from here:

T3 instances start in Unlimited mode by default... If the instance needs to run at higher CPU utilization for a prolonged period, it can do so for an additional charge of $0.05 per vCPU-hour. In Standard mode, a T3 instance can burst until it uses up all its earned credits.

The difference vs. T2 seems to be that with T2, Unlimited mode was opt-in while with T3 it is opt-out. I've set a credit specification in Terraform to opt us out, but there doesn't seem to be a setting in the cluster-API structures for this.
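For reference, the Terraform opt-out looks roughly like this (a sketch using the Terraform AWS provider's `credit_specification` block; the resource and variable names are placeholders, not the installer's actual ones):

```hcl
resource "aws_instance" "master" {
  ami           = var.ami_id   # placeholder variable, not the installer's real one
  instance_type = "t3.medium"

  # T3 instances default to Unlimited mode; "standard" opts out, so the
  # instance can only burst against its earned CPU credits.
  credit_specification {
    cpu_credits = "standard"
  }
}
```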

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 26, 2018
@abhinavdahiya
Contributor

I don't think we want to move towards more reliance on defaults in Terraform. Also, this causes the master machine definitions to drift from the machines that actually exist.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 26, 2018
@cuppett
Member

cuppett commented Nov 26, 2018

I don't think we want to move towards more reliance on defaults in Terraform. Also, this causes the master machine definitions to drift from the machines that actually exist.

Should the master machine definitions be changed as well? We must assume our default node types will change over time (even after the initial 4.0 release).

data/data/aws/bootstrap/main.tf (outdated review thread, resolved)
@wking
Member Author

wking commented Nov 26, 2018

Should the master machine definitions be changed as well?

This same default feeds the master definitions.

@smarterclayton
Contributor

let's set some requirements:

  1. the defaults for openshift should work out of the box for a reasonable period of time for most users (including smaller users than we traditionally encouraged)
  2. in the future we expect to have master rotation, at which point vertical resize is free, so we shouldn't overdesign here
  3. we want to test what we give to our users, so there is an argument for e2e being the high end of what we support for those users (we are confident that if e2e is not flaky our users will be able to use the ootb cluster)

3-15 nodes is a good small-medium cluster, 100 namespaces. e2e pushes harder than that on API traffic, but runs on 3 nodes densely.

@wking wking force-pushed the t3.medium-default branch from 7d236d0 to 3bdf83b Compare November 27, 2018 18:56
@wking
Member Author

wking commented Nov 27, 2018

I don't think we want to move towards more reliance on defaults in Terraform.

With 7d236d0 -> 3bdf83b, I've dropped some more Terraform defaults to make sure we aren't relying on them. I've also dropped the cpu_credits block. That should address your concerns on those points.

We'll see how 3bdf83b plays in CI, since we're not sure the initial CPU credits are sufficient to get us through the e2e tests without paying too much for the extended burst.
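As a back-of-the-envelope check on that concern, here is a simplified model of T3 credit accounting (a sketch, not AWS's exact algorithm; the constants are the t3.medium figures from the AWS docs quoted above):

```python
# Simplified T3 credit model (sketch, not AWS's exact accounting).
# t3.medium: 2 vCPUs, 20% baseline per vCPU, 24 CPU credits earned/hour.
# One CPU credit = one vCPU-minute at 100% utilization = 1/60 vCPU-hour.
VCPUS = 2
BASELINE = 0.20
SURCHARGE_PER_VCPU_HOUR = 0.05  # Unlimited-mode charge above baseline

def unlimited_surcharge(utilization, hours, initial_credits=0.0):
    """Approximate Unlimited-mode surcharge for a constant CPU utilization.

    Earned credits are assumed to be consumed by baseline usage, so only
    the initial credit balance offsets above-baseline consumption.
    """
    surplus = max(0.0, utilization - BASELINE) * VCPUS * hours  # vCPU-hours
    covered = initial_credits / 60.0  # credits expressed as vCPU-hours
    return max(0.0, surplus - covered) * SURCHARGE_PER_VCPU_HOUR

# An e2e run pegging both vCPUs for 2 hours with a 60-credit head start:
# surplus = 0.8 * 2 * 2 = 3.2 vCPU-hours, covered = 1.0, cost = $0.11.
```

Even a fully pegged t3.medium only accrues a nickel per vCPU-hour above baseline, so a single bounded e2e run is cheap; the real risk is sustained saturation across many nodes.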

@wking
Member Author

wking commented Nov 27, 2018

e2e-aws:

level=error msg="Error: Required variable not set: tectonic_aws_master_ec2_type\n\n\n\nError: Required variable not set: tectonic_aws_worker_ec2_type"

I'm patching to get that passed through...

@wking
Member Author

wking commented Nov 27, 2018

Actually, we probably don't need tectonic_aws_worker_ec2_type anymore...

@wking
Member Author

wking commented Nov 27, 2018

Actually, we probably don't need tectonic_aws_worker_ec2_type anymore...

I've filed #742 for this.

The last consumers for these were removed by 124ac35 (*: Use
machine-api-operator to deploy worker nodes, 2018-09-04, openshift#119).
This also drops the related, libvirt-specific, tectonic_libvirt_worker_ips.
The change simplifies the Terraform logic for AWS and OpenStack, and
consistently pushes most worker setup into the cluster-API providers,
which have been creating the workers since 124ac35.
@abhinavdahiya
Contributor

/hold cancel

/approve

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2018
@wking wking force-pushed the t3.medium-default branch from 3bdf83b to 5156348 Compare November 29, 2018 07:23
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 29, 2018
@wking wking changed the title data/data/aws: Bump default instance type from t2.medium to t3.medium WIP: data/data/aws: Bump default instance type from t2.medium to t3.medium Nov 29, 2018
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 29, 2018
@wking
Member Author

wking commented Nov 29, 2018

I've pushed 3bdf83b -> 5156348, rebasing onto #742 and adding a fairly rough commit to fix the asset graph feeding TFVars(). The previous implementation pulled from the install-config, but that missed downstream changes like AWS instance type defaults being applied in pkg/asset/machines (like we're doing here with ddd12118c). With 51563482a9, we pull that information from the cluster-API types, since they're the last touch point for that data. The remaining information still comes from the InstallConfig, but I've split it into explicit arguments to avoid confusion about where data is coming from when InstallConfig's machine pools, etc. overlap with clusterapi.Machine fields.

@wking wking force-pushed the t3.medium-default branch 2 times, most recently from 9c9a657 to 2b76d62 Compare November 29, 2018 18:56
@abhinavdahiya
Contributor

@wking, can we take 2b76d62 out of this PR? Let's just bump t2 -> t3 here (nothing worse than we have right now in terms of coordinating values between Terraform and the cluster API).

2b76d62 can go in a separate PR.

@wking wking force-pushed the t3.medium-default branch 2 times, most recently from 85e86cf to 12716e8 Compare November 30, 2018 00:43
@wking wking changed the title WIP: data/data/aws: Bump default instance type from t2.medium to t3.medium data/aws: Bump default instance type from t2.medium to t3.medium Nov 30, 2018
@wking wking force-pushed the t3.medium-default branch from 12716e8 to 30521b4 Compare November 30, 2018 00:46
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 30, 2018
@wking
Member Author

wking commented Nov 30, 2018

I've pushed 85e86cf -> 8ed4d9f, dropping the TFVars refactor at @abhinavdahiya's request. I've left the commits from #742 underneath, but without the TFVars refactor to motivate them, I could also cherry-pick 30521b4 onto master by itself if folks wanted it that way. I'll push 85e86cf to a separate PR once CI greens up so we can fix the dependency disconnect.

The T3 instances were launched on 2018-08-21 [1].  The documented
stats are fairly similar to those of the T2 analogs [2,3]:

  Name                                      t2.medium  t3.medium
  vCPUs                                     2          2
  Memory (GiB)                              4.0        4.0
  Baseline Performance/vCPU                            20%
  CPU Credits earned/hr                     24         24
  Network burst bandwidth (Gbps)                       5
  EBS burst bandwidth (Gbps)                           1.50
  On-Demand Price/hr*                       $0.0464    $0.0418
  1-yr Reserved Instance Effective Hourly*             $0.025
  3-yr Reserved Instance Effective Hourly*             $0.017

  * Prices shown are for Linux/Unix in US East (Northern Virginia) AWS
    Region. Prices for 1-year and 3-year reserved instances are for
    "Partial Upfront" payment options or "No upfront" for instances
    without the Partial Upfront option. For full pricing details, see
    the Amazon EC2 pricing page [4].

The only obvious difference there is that the T3 instances are 10%
cheaper.  However, from [1]:

  If the instance runs at higher CPU utilization for a prolonged
  period, there will be an additional charge of $0.05 per vCPU-hour.

and from [2]:

  T3 instances start in Unlimited mode by default... If the instance
  needs to run at higher CPU utilization for a prolonged period, it
  can do so for an additional charge of $0.05 per vCPU-hour.  In
  Standard mode, a T3 instance can burst until it uses up all its
  earned credits.

The difference vs. T2 seems to be that with T2, Unlimited mode was
opt-in while with T3 it is opt-out.  You can set a credit
specification in Terraform [5] to opt out, but there doesn't seem to
be a setting in the cluster-API structures for this.  And in any case,
we expect:

* Most users to be able to run clusters of up to ~15 workers without
  overextending an Unlimited t3.medium.

* To have cluster alerts for when nodes have exhausted their CPU
  quota, which will tell cluster admins that it's time to rotate in
  some larger master nodes.  We may already have this alerting, but
  I'm not sure where it lives.

* To have CI be able to get through its e2e tests before exhausting
  the initial CPU credits.  But we can monitor this with the
  above-mentioned alerting as well.

The FIXME is because pkg/tfvars is consuming the install-config, so it
doesn't know about configuration that landed further downstream in the
asset graph's pkg/asset/machines.  I have a partial fix for that, but
Abhinav wanted to punt that to future work [6].

[1]: https://aws.amazon.com/blogs/aws/new-t3-instances-burstable-cost-effective-performance/
[2]: https://aws.amazon.com/ec2/instance-types/t2/
[3]: https://aws.amazon.com/ec2/instance-types/t3/
[4]: https://aws.amazon.com/ec2/pricing/
[5]: https://www.terraform.io/docs/providers/aws/r/instance.html#credit-specification
[6]: openshift#724 (comment)
@wking wking force-pushed the t3.medium-default branch from 30521b4 to 8ed4d9f Compare November 30, 2018 00:49
@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 30, 2018

@wking: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt 8ed4d9f link /test e2e-libvirt


@wking
Member Author

wking commented Nov 30, 2018

e2e-aws:

level=error msg="Error: Error applying plan:\n\n1 error(s) occurred:\n\n* aws_route53_zone.tectonic_int: 1 error(s) occurred:\n\n* aws_route53_zone.tectonic_int: timeout while waiting for state to become 'INSYNC' (timeout: 15m0s)\n\nTerraform does not automatically rollback in the face of errors.\nInstead, your Terraform state file has been partially updated with\nany resources that successfully completed. Please address the error\nabove and apply again to incrementally change your infrastructure." 

/test e2e-aws

Review comment on the diff context:

    IAMRoleName:      m.Platform.AWS.IAMRoleName,
    WorkerRootVolume: aws.WorkerRootVolume{

Contributor

Is this why workers are now 16gb volumes?

Member Author

Is this why workers are now 16gb volumes?

This is just passing data to Terraform, and worker creation is via the cluster-API now. So I don't see a connection.

@smarterclayton
Contributor

/lgtm

To unblock all the queues

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 30, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton
Contributor

45 minutes of this run was teardown, and it looks like a big chunk of that was Route 53. We have a ton of garbage in the API; it probably needs to be cleaned up.

@openshift-merge-robot openshift-merge-robot merged commit b5202d8 into openshift:master Nov 30, 2018
@wking
Member Author

wking commented Nov 30, 2018

We have a ton of garbage in the API; it probably needs to be cleaned up.

I cleaned it ;)

@wking wking deleted the t3.medium-default branch November 30, 2018 05:24
@wking
Member Author

wking commented Nov 30, 2018

Hrm, this is going to be a problem with t3.medium:

$ oc --namespace=openshift-cluster-api logs clusterapi-manager-controllers-559fc8d878-d5p65  --container=machine-controller
E1130 05:38:29.574505       1 actuator.go:96] error creating machine: error launching instance: error creating EC2 instance: InstanceLimitExceeded: You have requested more instances (21) than your current instance limit of 20 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
        status code: 400, request id: 01da30c5-c310-48c0-9550-79cc78cccd7e

(from this cluster). Can we get some serialization so we can limp through until we get that limit raised? Or switch to some other instance type that gives us more power than t2.medium but higher limits than t3.medium?
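One way to "limp through" would be to retry creation with backoff when the limit error appears. A generic sketch (not the actuator's real code; the error string stands in for the EC2 `InstanceLimitExceeded` response):

```python
import time

def create_with_retry(create_fn, max_attempts=5, base_delay=1.0):
    """Retry a creation call that may fail with an instance-limit error.

    create_fn is a placeholder for the real EC2 call; it is assumed to
    raise RuntimeError with "InstanceLimitExceeded" in the message on
    limit errors. Retries use exponential backoff between attempts.
    """
    for attempt in range(max_attempts):
        try:
            return create_fn()
        except RuntimeError as err:
            if "InstanceLimitExceeded" not in str(err) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

This only helps once terminating instances free up quota; it cannot create capacity, so raising the limit (or switching instance types) remains the real fix.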

@wking
Member Author

wking commented Nov 30, 2018

#765 is in flight to get us on m4.large while we work on the t3.medium limits.
