-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues #295
Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues #295
Conversation
@enxebre: This pull request references Bugzilla bug 1796595, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh |
@enxebre: This pull request references Bugzilla bug 1796595, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@enxebre: This pull request references Bugzilla bug 1796595, which is valid. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
cc @mfojtik |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bison The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
I'm having a bit of a hard time following this one. The link BZ's upstream fix that was verified seems different (with a subset of commits). Can I get a quick rundown in this thread so I can better help approve? |
Hey @ashcrow thanks for having a look. #269 was included here https://bugzilla.redhat.com/show_bug.cgi?id=1772163 which was then mark as dup of https://bugzilla.redhat.com/show_bug.cgi?id=1761882 #267, #272, #268 and #270 are comments and updates that were introduced as part of the normal release feature development, so has no BZ but are relevant for the context and help fix the problem "Mitigate aws eventual consistency issues". #280 Fixed a regression introduced by these fixes so it belongs to this PR. d4912fb Is a transparent change necessary to reduce the conflicts hassle and keep code between releases aligned for easing farther back ports. |
https://bugzilla.redhat.com/show_bug.cgi?id=1796595 depends on a 4.4.0 BZ, it needs to depend on a 4.3.z BZ so we can confirm this was fixed in 4.3.z before merging the change into 4.2.z. (the bot should have told you this, i'm looking into why it did not) |
/bugzilla refresh |
@enxebre: This pull request references Bugzilla bug 1796595, which is valid. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This is to satisfy contract with the machine controller so it can react appropriately and signal a machine as failed with the given error message when an InvalidMachineConfiguration error is reported, main cases are: - The spec is invalid: Unable to be unmarshalled Missing expected labels by design Referencing non existing kubernetes resources - The input values provoke a client error 4xx response from the cloud call to create a machine.
This changes the behavior of the actuator's `getMachineInstances` method to first look for a machine's backing instance by the instance ID in the provider status. If the instance ID is not set, it falls back to the previous method of filtering by tags. This should help prevent issues where AWS creates the instance, but takes some time to propagate tag values, which can result in Exists() wrongly returning false.
This attemps to mitigate https://bugzilla.redhat.com/show_bug.cgi?id=1772163. Instead of telling the API request to filter by state, we fetch all and discriminate only terminated instances
…tances getInstances has taken an instanceStateFilter argument since it landed as GetInstances in 72b4823 (Migrate cluster operator implementation of the aws actuator, 2018-08-04, openshift#2).
…ancesOutput So we can create pending, terminated, etc. instances for testing. This should scale better than stub{State}InstanceDescribeInstancesOutput from openshift/cluster-api-provider-aws@c8e341f22f (Stop retieving instances by state on API call, 2019-11-14, openshift#269).
…ceByID openshift/cluster-api-provider-aws@c8e341f22f (Stop retieving instances by state on API call, 2019-11-14, openshift#269) did part of this. This commit: * Restores the instanceStateFilter argument to getInstanceByID, to avoid having two definitions (existingInstanceStates and a local != Terminated in getExistingInstanceByID) that define what it means to exist. This will also give us logged error messages like: instance i-123 state terminated is not in running, pending, stopped, stopping, shutting-down which will make it easier to find and fix omissions in existingInstanceStates. * Adds a -running suffix to has-status-search-by-id, so it's easier to distinguish from the terminated case. * Converts to the hyphenated has-status-search-by-id-terminated, instead of mixing hyphens and spaces in a case name. * Uses inline argument for the DescribeInstances mocks, so we don't have to use request and request2. These aren't so big that inlining them makes the mock setup unreadable. * Returns an empty set of instances in the instance-listing fallback call in has-status-search-by-id-terminated (which we now require to happen after the by-ID call). The list call is filtered by state, so it wouldn't return a terminated instance.
Like openshift/cluster-api-provider-aws@e4e94b1e08 (pkg/actuators/machine/utils: Client-side state filtering in getInstanceByID, 2019-11-13, openshift#268), but for getInstances. The cluster/name filtering is already enough to limit the list response to a reasonable size, so we can perform the state filtering locally in order to produce useful error messages. Also fix a bug in TestGetMachineInstances about the expected return type of getMachineInstances; it's a slice, not a single instance. Also, fix a few "Running" -> InstanceStateNameRunning bugs; the AWS string is "running", but we shouldn't care either way.
… comment The updateProviderID comment was probably a copy/paste error when setMachineCloudProviderSpecifics landed in 716083f (Set additional machine annotations/labels to get pretty machine output, 2019-07-23, openshift#242). I haven't read through the function carefully enough to be able to propose a comment that adds much beyond the function name, which is fairly clear on its own. But I know the updateProviderID comment is not a good fit ;).
Avoid [1]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-b24mfpd6-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-785b4687bf-wspnd/machine-controller/machine-controller/logs/current.log | grep -B5 'ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: found 0 existing instances for machine' 2019-11-18T01:08:14.686879882Z I1118 01:08:14.686811 1 actuator.go:552] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: Found instance by id: i-04e60273f38857eed 2019-11-18T01:08:14.686879882Z I1118 01:08:14.686839 1 actuator.go:502] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: Instance exists as "i-04e60273f38857eed" 2019-11-18T01:08:14.686879882Z I1118 01:08:14.686853 1 controller.go:284] Reconciling machine "ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696" triggers idempotent update 2019-11-18T01:08:14.686879882Z I1118 01:08:14.686862 1 actuator.go:406] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: updating machine 2019-11-18T01:08:14.68696548Z I1118 01:08:14.686938 1 actuator.go:414] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: obtaining EC2 client for region 2019-11-18T01:08:14.741237517Z I1118 01:08:14.741189 1 actuator.go:430] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: found 0 existing instances for machine by waiting a bit for AWS to resolve any eventual-consistency issues [2] before giving up on a machine and claiming it failed. With this commit, we'll try again until requeueAfterSeconds have passed since the last status update when we have a machine ID to look for (from creation). The efficiency of the retry request is weakened a bit by the AWS provider / machine controller pairing apparently ignoring requeue delays [3]. And if we respected retries like that, we probably wouldn't hit the guard this commit is adding. But a busy loop here is still better than abandoning the machine, and we don't have a better fix for the ignored requeue delay in the pipe. Personally, I'd rather tie this guard to the Provisioning and Provisioned machine phases, or the MachineCreationSucceeded condition, or some such to avoid excessive delays if folks were bumping the machine status and LastUpdated for other reasons. But Alberto points out that machine phases are supposed to be a higher-level abstraction that doesn't leak down into the provider [4] and he prefers switching on ProviderID [5], so that's what we have in this commit. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1772163#c11 [2]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency [3]: https://bugzilla.redhat.com/show_bug.cgi?id=1772192 [4]: openshift#271 (comment) [5]: openshift#271 (comment)
Same guard Update grew in openshift/cluster-api-provider-aws@09c1fe9cee9e68 (pkg/actuators/machine/actuator: Eventual-consistency wait for machines, 2019-11-18, openshift#271), but now for Exists and Describe. This should protect us from flows like [1]: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-nr00dj4c-stable-sha256-b6dffb34bfe5c198d7e9ad032322277c2a129e45a4e7f2084a83ab560530f683/namespaces/openshift-machine-api/pods/machine-api-controllers-9bd7457fc-tfdmg/machine-controller/machine-controller/logs/current.log | grep ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz ... 2019-11-22T17:39:34.642709291Z I1122 17:39:34.642651 1 actuator.go:194] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: ProviderID updated at machine spec: aws:///us-east-1a/i-08b09536e36303450 ... 2019-11-22T17:39:34.660245157Z I1122 17:39:34.660202 1 actuator.go:668] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance state still pending, returning an error to requeue 2019-11-22T17:39:34.660327796Z W1122 17:39:34.660285 1 controller.go:321] Failed to create machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz": requeue in: 20s 2019-11-22T17:39:34.660426954Z I1122 17:39:34.660380 1 controller.go:164] Reconciling Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" ... 2019-11-22T17:39:34.660426954Z I1122 17:39:34.660416 1 actuator.go:493] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Checking if machine exists 2019-11-22T17:39:34.732221813Z I1122 17:39:34.732177 1 actuator.go:501] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance does not exist 2019-11-22T17:39:34.732221813Z I1122 17:39:34.732202 1 controller.go:428] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" going into phase "Failed" ... Also, because the code was basically the same between Exists and Describe, I've refactored Exists to be a thin wrapper around Describe. It's unfortunate that Describe does not take a Context argument, but there's no regression because Exists was ignoring its Context argument before anyway ;). [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1761882#c15
If the Spec.ProviderID field is present but an empty string, we may encounter issues. This is particularly the case in DR scenarios when a user is attempting to copy an existing machine in attempt to create one.
805ffdf
to
4f7fbd0
Compare
Rebased after #294 made it in. Should be good now if tests say so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
@enxebre: All pull requests linked via external trackers have merged. Bugzilla bug 1796595 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This cherry-picks the commits from:
d4912fb
https://github.com/openshift/cluster-api-provider-aws/pull/266/commits
https://github.com/openshift/cluster-api-provider-aws/pull/269/commits
https://github.com/openshift/cluster-api-provider-aws/pull/267/commits
https://github.com/openshift/cluster-api-provider-aws/pull/268/commits
https://github.com/openshift/cluster-api-provider-aws/pull/271/commits
https://github.com/openshift/cluster-api-provider-aws/pull/272/commits
https://github.com/openshift/cluster-api-provider-aws/pull/273/commits
https://github.com/openshift/cluster-api-provider-aws/pull/270/commits
https://github.com/openshift/cluster-api-provider-aws/pull/280/commits
This is to mitigate eventual-consistency [aws eventual consistency issues](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency issues) by: