
data/aws: Bump default instance type from t2.medium to t3.medium #724

Merged
3 commits merged on Nov 30, 2018

Conversation

wking
Member

@wking wking commented Nov 26, 2018

The T3 instances were launched on 2018-08-21. The documented stats are fairly similar to those of the T2 analogs. The only obvious difference there is that the T3 instances are 10% cheaper. However, from here:

If the instance runs at higher CPU utilization for a prolonged period, there will be an additional charge of $0.05 per vCPU-hour.

and from here:

T3 instances start in Unlimited mode by default... If the instance needs to run at higher CPU utilization for a prolonged period, it can do so for an additional charge of $0.05 per vCPU-hour. In Standard mode, a T3 instance can burst until it uses up all its earned credits.

The difference vs. T2 seems to be that with T2, Unlimited mode was opt-in while with T3 it is opt-out. I've set a credit specification in Terraform to opt us out, but there doesn't seem to be a setting in the cluster-API structures for this.
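For reference, the Terraform opt-out looks roughly like this (a sketch using the Terraform AWS provider's `credit_specification` block; the resource and variable names are placeholders, not the installer's actual ones):

```hcl
resource "aws_instance" "master" {
  ami           = var.ami_id   # placeholder variable, not the installer's real one
  instance_type = "t3.medium"

  # T3 instances default to Unlimited mode; "standard" opts out, so the
  # instance can only burst against its earned CPU credits.
  credit_specification {
    cpu_credits = "standard"
  }
}
```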

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 26, 2018
@abhinavdahiya
Contributor

I don't think we want to move towards more reliance on defaults in Terraform. Also, this causes the master machine definitions to drift from the machines that actually exist.

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 26, 2018
@cuppett
Member

cuppett commented Nov 26, 2018

I don't think we want to move towards more reliance on defaults in Terraform. Also, this causes the master machine definitions to drift from the machines that actually exist.

Should the master machine definitions be changed as well? We must assume our default node types will change over time (even after the initial 4.0 release).

data/data/aws/bootstrap/main.tf (outdated review thread, resolved)
@wking
Member Author

wking commented Nov 26, 2018

Should the master machine definitions be changed as well?

This same default feeds the master definitions.

@smarterclayton
Contributor

let's set some requirements:

  1. the defaults for openshift should work out of the box for a reasonable period of time for most users (including smaller users than we traditionally encouraged)
  2. in the future we expect to have master rotation, at which point vertical resize is free, so we shouldn't overdesign here
  3. we want to test what we give to our users, so there is an argument for e2e being the high end of what we support for those users (we are confident that if e2e is not flaky our users will be able to use the ootb cluster)

3-15 nodes is a good small-medium cluster, 100 namespaces. e2e pushes harder than that on API traffic, but runs on 3 nodes densely.

@wking wking force-pushed the t3.medium-default branch from 7d236d0 to 3bdf83b Compare November 27, 2018 18:56
@wking
Member Author

wking commented Nov 27, 2018

I don't think we want to move towards more reliance on defaults in Terraform.

With 7d236d0 -> 3bdf83b, I've dropped some more Terraform defaults to make sure we aren't relying on them. I've also dropped the cpu_credits block. That should address your concerns on those points.

We'll see how 3bdf83b plays in CI, since we're not sure the initial CPU credits are sufficient to get us through the e2e tests without paying too much for the extended burst.
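As a back-of-the-envelope check on that concern, here is a simplified model of T3 credit accounting (a sketch, not AWS's exact algorithm; the constants are the t3.medium figures from the AWS docs quoted above):

```python
# Simplified T3 credit model (sketch, not AWS's exact accounting).
# t3.medium: 2 vCPUs, 20% baseline per vCPU, 24 CPU credits earned/hour.
# One CPU credit = one vCPU-minute at 100% utilization = 1/60 vCPU-hour.
VCPUS = 2
BASELINE = 0.20
SURCHARGE_PER_VCPU_HOUR = 0.05  # Unlimited-mode charge above baseline

def unlimited_surcharge(utilization, hours, initial_credits=0.0):
    """Approximate Unlimited-mode surcharge for a constant CPU utilization.

    Earned credits are assumed to be consumed by baseline usage, so only
    the initial credit balance offsets above-baseline consumption.
    """
    surplus = max(0.0, utilization - BASELINE) * VCPUS * hours  # vCPU-hours
    covered = initial_credits / 60.0  # credits expressed as vCPU-hours
    return max(0.0, surplus - covered) * SURCHARGE_PER_VCPU_HOUR

# An e2e run pegging both vCPUs for 2 hours with a 60-credit head start:
# surplus = 0.8 * 2 * 2 = 3.2 vCPU-hours, covered = 1.0, cost = $0.11.
```

Even a fully pegged t3.medium only accrues a nickel per vCPU-hour above baseline, so a single bounded e2e run is cheap; the real risk is sustained saturation across many nodes.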

@wking
Member Author

wking commented Nov 27, 2018

e2e-aws:

level=error msg="Error: Required variable not set: tectonic_aws_master_ec2_type\n\n\n\nError: Required variable not set: tectonic_aws_worker_ec2_type"

I'm patching to get that passed through...

@wking
Member Author

wking commented Nov 27, 2018

Actually, we probably don't need tectonic_aws_worker_ec2_type anymore...

@wking
Member Author

wking commented Nov 27, 2018

Actually, we probably don't need tectonic_aws_worker_ec2_type anymore...

I've filed #742 for this.

The last consumers for these were removed by 124ac35 (*: Use
machine-api-operator to deploy worker nodes, 2018-09-04, openshift#119).
This also drops the related, libvirt-specific, tectonic_libvirt_worker_ips.
The change simplifies the Terraform logic for AWS and OpenStack, and
consistently pushes most worker setup into the cluster-API providers,
which have been creating the workers since 124ac35.
@abhinavdahiya
Contributor

/hold cancel

/approve

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2018
@wking wking force-pushed the t3.medium-default branch from 3bdf83b to 5156348 Compare November 29, 2018 07:23
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 29, 2018
@wking wking changed the title data/data/aws: Bump default instance type from t2.medium to t3.medium WIP: data/data/aws: Bump default instance type from t2.medium to t3.medium Nov 29, 2018
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 29, 2018
@wking
Member Author

wking commented Nov 29, 2018

I've pushed 3bdf83b -> 5156348, rebasing onto #742 and adding a fairly rough commit to fix the asset graph feeding TFVars(). The previous implementation pulled from the install-config, but that missed downstream changes like AWS instance type defaults being applied in pkg/asset/machines (like we're doing here with ddd12118c). With 51563482a9, we pull that information from the cluster-API types, since they're the last touch point for that data. The remaining information still comes from the InstallConfig, but I've split it into explicit arguments to avoid confusion about where data is coming from when InstallConfig's machine pools, etc. overlap with clusterapi.Machine fields.

@wking wking force-pushed the t3.medium-default branch 2 times, most recently from 9c9a657 to 2b76d62 Compare November 29, 2018 18:56
@abhinavdahiya
Contributor

@wking, can we take 2b76d62 out of this PR? Let's just bump t2 -> t3 here (nothing worse than we have right now in terms of coordinating values between Terraform and the cluster API).

2b76d62 can go in a separate PR.

@wking wking force-pushed the t3.medium-default branch 2 times, most recently from 85e86cf to 12716e8 Compare November 30, 2018 00:43
@wking wking changed the title WIP: data/data/aws: Bump default instance type from t2.medium to t3.medium data/aws: Bump default instance type from t2.medium to t3.medium Nov 30, 2018
@wking wking force-pushed the t3.medium-default branch from 12716e8 to 30521b4 Compare November 30, 2018 00:46
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 30, 2018
@wking
Member Author

wking commented Nov 30, 2018

I've pushed 85e86cf -> 8ed4d9f, dropping the TFVars refactor at @abhinavdahiya's request. I've left the commits from #742 underneath, but without the TFVars refactor to motivate them, I could also cherry-pick 30521b4 onto master by itself if folks wanted it that way. I'll push 85e86cf to a separate PR once CI greens up so we can fix the dependency disconnect.

The T3 instances were launched on 2018-08-21 [1].  The documented
stats are fairly similar to those of the T2 analogs [2,3]:

  Name                                      t2.medium  t3.medium
  vCPUs                                     2          2
  Memory (GiB)                              4.0        4.0
  Baseline Performance/vCPU                            20%
  CPU Credits earned/hr                     24         24
  Network burst bandwidth (Gbps)                       5
  EBS burst bandwidth (Gbps)                           1.50
  On-Demand Price/hr*                       $0.0464    $0.0418
  1-yr Reserved Instance Effective Hourly*             $0.025
  3-yr Reserved Instance Effective Hourly*             $0.017

  * Prices shown are for Linux/Unix in US East (Northern Virginia) AWS
    Region. Prices for 1-year and 3-year reserved instances are for
    "Partial Upfront" payment options or "No upfront" for instances
    without the Partial Upfront option. For full pricing details, see
    the Amazon EC2 pricing page [4].

The only obvious difference there is that the T3 instances are 10%
cheaper.  However, from [1]:

  If the instance runs at higher CPU utilization for a prolonged
  period, there will be an additional charge of $0.05 per vCPU-hour.

and from [2]:

  T3 instances start in Unlimited mode by default... If the instance
  needs to run at higher CPU utilization for a prolonged period, it
  can do so for an additional charge of $0.05 per vCPU-hour.  In
  Standard mode, a T3 instance can burst until it uses up all its
  earned credits.

The difference vs. T2 seems to be that with T2, Unlimited mode was
opt-in while with T3 it is opt-out.  You can set a credit
specification in Terraform [5] to opt out, but there doesn't seem to
be a setting in the cluster-API structures for this.  And in any case,
we expect:

* Most users to be able to run clusters of up to ~15 workers without
  overextending an Unlimited t3.medium.

* To have cluster alerts for when nodes have exhausted their CPU
  quota, which will tell cluster admins that it's time to rotate in
  some larger master nodes.  We may already have this alerting, but
  I'm not sure where it lives.

* To have CI be able to get through its e2e tests before exhausting
  the initial CPU credits.  But we can monitor this with the
  above-mentioned alerting as well.

The FIXME is because pkg/tfvars is consuming the install-config, so it
doesn't know about configuration that landed further downstream in the
asset graph's pkg/asset/machines.  I have a partial fix for that, but
Abhinav wanted to punt that to future work [6].

[1]: https://aws.amazon.com/blogs/aws/new-t3-instances-burstable-cost-effective-performance/
[2]: https://aws.amazon.com/ec2/instance-types/t2/
[3]: https://aws.amazon.com/ec2/instance-types/t3/
[4]: https://aws.amazon.com/ec2/pricing/
[5]: https://www.terraform.io/docs/providers/aws/r/instance.html#credit-specification
[6]: openshift#724 (comment)
@wking wking force-pushed the t3.medium-default branch from 30521b4 to 8ed4d9f Compare November 30, 2018 00:49
@openshift-ci-robot
Contributor

openshift-ci-robot commented Nov 30, 2018

@wking: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt 8ed4d9f link /test e2e-libvirt


@wking
Member Author

wking commented Nov 30, 2018

e2e-aws:

level=error msg="Error: Error applying plan:\n\n1 error(s) occurred:\n\n* aws_route53_zone.tectonic_int: 1 error(s) occurred:\n\n* aws_route53_zone.tectonic_int: timeout while waiting for state to become 'INSYNC' (timeout: 15m0s)\n\nTerraform does not automatically rollback in the face of errors.\nInstead, your Terraform state file has been partially updated with\nany resources that successfully completed. Please address the error\nabove and apply again to incrementally change your infrastructure." 

/test e2e-aws

Review comment on the diff context:

    IAMRoleName:      m.Platform.AWS.IAMRoleName,
    WorkerRootVolume: aws.WorkerRootVolume{

Contributor

Is this why workers are now 16gb volumes?

Member Author

Is this why workers are now 16gb volumes?

This is just passing data to Terraform, and worker creation is via the cluster-API now. So I don't see a connection.

@smarterclayton
Contributor

/lgtm

To unblock all the queues

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 30, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,smarterclayton,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton
Contributor

45 minutes of this run was teardown, and it looks like a big chunk of that was Route 53. We have a ton of garbage in the API; it probably needs to be cleaned up.

@openshift-merge-robot openshift-merge-robot merged commit b5202d8 into openshift:master Nov 30, 2018
@wking
Member Author

wking commented Nov 30, 2018

We have a ton of garbage in the API; it probably needs to be cleaned up.

I cleaned it ;)

@wking wking deleted the t3.medium-default branch November 30, 2018 05:24
@wking
Member Author

wking commented Nov 30, 2018

Hrm, this is going to be a problem with t3.medium:

$ oc --namespace=openshift-cluster-api logs clusterapi-manager-controllers-559fc8d878-d5p65  --container=machine-controller
E1130 05:38:29.574505       1 actuator.go:96] error creating machine: error launching instance: error creating EC2 instance: InstanceLimitExceeded: You have requested more instances (21) than your current instance limit of 20 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
        status code: 400, request id: 01da30c5-c310-48c0-9550-79cc78cccd7e

(from this cluster). Can we get some serialization so we can limp through until we get that limit raised? Or switch to some other instance type that gives us more power than t2.medium but higher limits than t3.medium?
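One way to "limp through" would be to retry creation with backoff when the limit error appears. A generic sketch (not the actuator's real code; the error string stands in for the EC2 `InstanceLimitExceeded` response):

```python
import time

def create_with_retry(create_fn, max_attempts=5, base_delay=1.0):
    """Retry a creation call that may fail with an instance-limit error.

    create_fn is a placeholder for the real EC2 call; it is assumed to
    raise RuntimeError with "InstanceLimitExceeded" in the message on
    limit errors. Retries use exponential backoff between attempts.
    """
    for attempt in range(max_attempts):
        try:
            return create_fn()
        except RuntimeError as err:
            if "InstanceLimitExceeded" not in str(err) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

This only helps once terminating instances free up quota; it cannot create capacity, so raising the limit (or switching instance types) remains the real fix.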

@wking
Member Author

wking commented Nov 30, 2018

#765 is in flight to get us on m4.large while we work on the t3.medium limits.
