Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS UPI draft #14241

Merged
merged 1 commit into from
May 7, 2019
Merged

AWS UPI draft #14241

merged 1 commit into from
May 7, 2019

Conversation

@kalexand-rh kalexand-rh added this to the Future Release milestone Mar 26, 2019
@kalexand-rh kalexand-rh self-assigned this Mar 26, 2019
@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 26, 2019
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 27, 2019
observedGeneration: 1
replicas: 1
----
<1> I think you change the RHCOS AMI here.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cuppett, do you know if there's a good RHCOS AMI that we should use?

Do you have more details about the values that a user would need to provide to update these MachineSets?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IAM instance profile name comes as an output from the 03_cluster_security.yaml file. The subnet is created as part of 01_vpc.yaml; however, it doesn't have a name tag to filter on. They can change the specification to ID and use the exact ID or add a name. We could also add a generated name tag to help minimize the edit needed. The same is true for the security group.

For the AMI, we build a different one each day. I usually just query the AWS console to grab a recent one. @wking may have a more prescriptive method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the AMI...

See here and here for scripted approaches to figuring out your AMI. And see here for possible future work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I'll watch the RFE because that will be useful.

@openshift-docs-preview-bot

The preview will be availble shortly at:

<2> Specify the name of the worker security group that you created.
<3> Specify the subnet for the machines.

.. Replace the `securityGroups` stanza with one that specifies the name of the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cuppett, do you have the format for the securityGroup and subnet stanzas that use names instead of tags?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is one (lifted from a BZ):

      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - worker-sg-jialiuuuu1
      subnet:
        filters:
        - name: tag:Name
          values:
          - jialiuuuu1-xbhm2-private-us-east-2b

There should also be a way to specify this using the always-there subnet or security group ID (instead of a tag of Name), but I'm not sure where these filters' formats are defined well. We should find them or a spot for them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wking, do you know who might know how these filters are defined?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subnet entry is defined here. For UPI, it's probably easier to just set the subnet ID or ARN.

@kalexand-rh kalexand-rh changed the title [WIP] AWS UPI draft AWS UPI draft Apr 4, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 4, 2019
@kalexand-rh
Copy link
Contributor Author

@wking, will you PTAL at this AWS UPI draft? (I know that we're scrubbing UPI from the docs, but I'm waiting for updates from the open JIRA to try to fix this terminology.)\

@jianlinliu, will you PTAL?

installing/installing_aws_upi/installing-aws-upi.adoc Outdated Show resolved Hide resolved
modules/installation-aws-upi-bootstrap.adoc Outdated Show resolved Hide resolved
modules/installation-aws-upi-bootstrap.adoc Outdated Show resolved Hide resolved
modules/installation-aws-upi-bootstrap.adoc Outdated Show resolved Hide resolved
modules/installation-aws-upi-bootstrap.adoc Outdated Show resolved Hide resolved
modules/installation-creating-aws-dns.adoc Outdated Show resolved Hide resolved
modules/installation-overview.adoc Outdated Show resolved Hide resolved
modules/installation-overview.adoc Outdated Show resolved Hide resolved
modules/installation-overview.adoc Outdated Show resolved Hide resolved
modules/installation-overview.adoc Outdated Show resolved Hide resolved
installing/installing_aws_upi/installing-aws-upi.adoc Outdated Show resolved Hide resolved
modules/installation-overview.adoc Outdated Show resolved Hide resolved
modules/installation-aws-upi-requirements.adoc Outdated Show resolved Hide resolved
modules/installation-creating-aws-control-plane.adoc Outdated Show resolved Hide resolved
modules/installation-creating-aws-worker.adoc Outdated Show resolved Hide resolved
modules/installation-aws-upi-bootstrap.adoc Show resolved Hide resolved
----
<1> `<name>` is the name of your bootstrap stack.

. View the list of machines in the `openshift-machine-api` namespace:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they're working on it, but I'll track the bug and follow up when it's resolved.


[WARNING]
====
It is better to use a MachineSet to allow the cluster to create and manage

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to have a link to that section how to use machineset to provision new worker node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to create a new module based on https://docs.openshift.com/container-platform/4.0/machine_management/creating-machineset.html to include or link to.

installing/installing_aws_upi/installing-aws-upi.adoc Outdated Show resolved Hide resolved
modules/installation-creating-aws-dns.adoc Outdated Show resolved Hide resolved
modules/installation-creating-aws-dns.adoc Outdated Show resolved Hide resolved
@kalexand-rh
Copy link
Contributor Author

I've pushed updates for all of the resolved comments. Thank you all! I'm going to follow up on the remaining comments and will be happy to make more changes.

@cuppett
Copy link
Member

cuppett commented Apr 18, 2019 via email

@mazzystr
Copy link

An option would be to scp it manually to the instance. UserData can have bash ! while -f ca.pem; do sleep 5; done loop to pause until the file is present.

@cuppett
Copy link
Member

cuppett commented Apr 18, 2019 via email

@mazzystr
Copy link

Folks we are officially bugzilla'd - BZ 1701408

@jianlinliu
Copy link

@kalexand-rh Should we have something like delete cluster section for upi-on-aws?

@kalexand-rh
Copy link
Contributor Author

@jianlinliu, eventually. It's a different user story from installation, though. I have a JIRA to figure out all of the processes that will be different between UPI and IPI.

MasterInstanceType:
Default: m4.xlarge
Type: String
AllowedValues:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest generation of EC2 instances aren't allowed (m5, c5, r5). It might be interesting to allow AMD based instances as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. There are still outstanding issues with nitro-based instance classes.

https://bugzilla.redhat.com/show_bug.cgi?id=1658237

coreos/bugs#2371

cc @gnufied

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The installer currently falls...forward to m5 for regions that lack m4. So if we don't support m5, we'll need to adjust that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, The most serious of NVMe issues are fixed in - https://bugzilla.redhat.com/show_bug.cgi?id=1632440 . Our SRE team has confirmed that after deploying fixed kernel they did not receive any "stuck in attach/detach" issue.

What @cuppett linked above though is a separate customer running their own Openshift version on AWS. That bug however is more weird. The steps to reproduce - https://bugzilla.redhat.com/show_bug.cgi?id=1658237#c46 . So basically to reproduce this bug, you have to reboot the node while detach is in progress. Our Kernel team says - this is a problem with how EBS sends ACPI events to kernel. In practice I think this is less of a problem. We also have some fixes in Openshift to ensure if a node does get stuck with volumes in "attaching/detaching" state, it is removed from scheduling and is tainted. It must be then drained, rebooted and taint manually removed.

I think we should make sure that - we are running fixed Kernel with proper tuning in RHCOS and start supporting M5/C5 instance types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is excellent news! We could enable the latest generation instances in the templates.

@abhinavdahiya @crawford Do we think m5 should be our default? The latest generation would be my recommendation in general and continues the trend of improving cost/performance. A downside is they support a total of 28 EBS and ENA attachments where the previous generations could support up to 40 EBS attachments 1.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gnufied It doesn't appear that https://bugzilla.redhat.com/show_bug.cgi?id=1632440 is publicly viewable; could you share some details or open up access to that bug?

modules/installation-creating-aws-control-plane.adoc Outdated Show resolved Hide resolved
RegisterNlbIpTargetsLambda:
Description: Lambda ARN useful to help register/deregister IP targets for these load balancers
Value: !GetAtt RegisterNlbIpTargets.Arn
ExternalApiTargetGroupArn:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove Arn from the resource name? It will help with searching when trying to find values for parameters for the next stack.

ExternalApiTargetGroupArn:
Description: ARN of External API target group
Value: !Ref ExternalApiTargetGroup
InternalApiTargetGroupArn:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove Arn from the resource name? It will help with searching when trying to find values for parameters for the next stack.

InternalApiTargetGroupArn:
Description: ARN of Internal API target group
Value: !Ref InternalApiTargetGroup
InternalServiceTargetGroupArn:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove Arn from the resource name? It will help with searching when trying to find values for parameters for the next stack.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 24, 2019
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 24, 2019
modules/installation-creating-aws-control-plane.adoc Outdated Show resolved Hide resolved
modules/installation-creating-aws-worker.adoc Outdated Show resolved Hide resolved
modules/installation-creating-aws-worker.adoc Outdated Show resolved Hide resolved
modules/installation-creating-aws-control-plane.adoc Outdated Show resolved Hide resolved
@kalexand-rh kalexand-rh force-pushed the osdocs323 branch 2 times, most recently from a9cbab0 to f602bd5 Compare April 26, 2019 15:50
- filters:
- name: tag:Name
values:
- test-tkh7l-worker-sg <2>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wking, do lines 95 and 117 look right to you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup, {infraName}-{role}-{sg} matches this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are your sure here? Per my test result, https://bugzilla.redhat.com/show_bug.cgi?id=1697968#c11, security group would be created with '{role}-{sg}-{infraName}' as 'GroupName' and empty 'Name'. While machine api is expecting '{infraName}-{role}-{sg}' as 'Name'. Do I miss something?

@@ -71,7 +71,7 @@ you can safely ignore this warning.
.. Remove the files that define the control plane machines:
+
----
$ rm -f openshift/99_openshift-cluster-api_master-machines-*.yaml
$ rm -f openshift/99_openshift-cluster-api_master-*.yaml

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my latest test result, https://bugzilla.redhat.com/show_bug.cgi?id=1698207, I think this change need to be reverted.

@kalexand-rh
Copy link
Contributor Author

Per @abhinavdahiya, I've removed the steps for creating a cluster that uses cluster-managed workers.

@kalexand-rh
Copy link
Contributor Author

@skordas, you said that you got this process to work yesterday, and I made a few updates based on your feedback. Are there any other changes that I need to add now, or does this look ok to you?

The bootstrap Ignition config file does contain secrets, like X.509 keys. If
uploading the Ignition config file to an S3 bucket does not provide adequate
security, you can enable an S3 bucket policy to allow only certain users,
such as the OpenShift IAM user, to access objects that the bucket contains.
Copy link
Member

@wking wking May 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"If uploading to an S3 bucket is not secure enough, tighten down your S3 bucket" doesn't feel quite right (there's an implicit "if the S3 bucket we're walking your through below is not secure enough"). How about:

The bootstrap Ignition config file does contain secrets, like X.509 keys. The steps below provide basic security for the S3 bucket. To provide additional security, you can enable an S3 bucket policy to allow only certain users, such as the OpenShift IAM user, to access objects that the bucket contains. And you can avoid S3 entirely and serve your bootstrap Ignition config from any address reachable from the bootstrap machine.

I'm also fine not enumerating possible S3 tightenings, and saying something like:

The bootstrap Ignition config file does contain secrets, like X.509 keys. The steps below provide basic security for the S3 bucket. If you have security concerns with these steps, you can configure a more-secure S3 bucket or avoid S3 entirely and serve your bootstrap Ignition config from any address reachable from the bootstrap machine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wking, please check again?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f674f0e20fd6 looks good to me.

@kalexand-rh
Copy link
Contributor Author

QE just approved https://bugzilla.redhat.com/show_bug.cgi?id=1697968, which is the last bug that they considered a blocker for this PR. Per @vikram-redhat, that's enough to merge. Please file bugs for any outstanding issues.

Merging.

@kalexand-rh kalexand-rh merged commit 39e3e04 into openshift:enterprise-4.1 May 7, 2019
@skordas
Copy link

skordas commented May 8, 2019

I was able to provide UPI infrastructure guided by this document
/lgtm

@kalexand-rh kalexand-rh deleted the osdocs323 branch May 8, 2019 15:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch/enterprise-4.1 peer-review-done Signifies that the peer review team has reviewed this PR size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.