Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues #295

enxebre · 2020-02-07T11:19:46Z

This is to mitigate eventual-consistency [aws eventual consistency issues](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency issues) by:

Fetching all instances and filter client side.
Waiting a bit for AWS to resolve any eventual-consistency issues before giving up.
First look for a machine's backing instance by the instance ID in the provider status. If the instance ID is not set, it falls back to the previous method of filtering by tags.

openshift-ci-robot · 2020-02-07T11:19:49Z

@enxebre: This pull request references Bugzilla bug 1796595, which is invalid:

expected the bug to target the "4.2.z" release, but it targets "---" instead
expected Bugzilla bug 1796595 to depend on a bug in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but no dependents were found

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1796595: Mitigate aws eventual consistency issues

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

enxebre · 2020-02-07T11:48:16Z

/bugzilla refresh

openshift-ci-robot · 2020-02-07T11:48:27Z

@enxebre: This pull request references Bugzilla bug 1796595, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2020-02-07T12:10:46Z

@enxebre: This pull request references Bugzilla bug 1796595, which is valid.

In response to this:

Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

alexander-demicev · 2020-02-07T12:12:12Z

/lgtm

bison

/lgtm

enxebre · 2020-02-10T09:32:20Z

cc @mfojtik

bison

/approve

openshift-ci-robot · 2020-02-10T10:20:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bison

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [bison]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ashcrow · 2020-02-10T14:47:06Z

I'm having a bit of a hard time following this one. The link BZ's upstream fix that was verified seems different (with a subset of commits). Can I get a quick rundown in this thread so I can better help approve?

enxebre · 2020-02-11T10:19:42Z

Hey @ashcrow thanks for having a look.

#269 was included here https://bugzilla.redhat.com/show_bug.cgi?id=1772163 which was then mark as dup of https://bugzilla.redhat.com/show_bug.cgi?id=1761882

#267, #272, #268 and #270 are comments and updates that were introduced as part of the normal release feature development, so has no BZ but are relevant for the context and help fix the problem "Mitigate aws eventual consistency issues".

#280 Fixed a regression introduced by these fixes so it belongs to this PR.

d4912fb Is a transparent change necessary to reduce the conflicts hassle and keep code between releases aligned for easing farther back ports.

bparees · 2020-02-19T17:37:41Z

@enxebre

https://bugzilla.redhat.com/show_bug.cgi?id=1796595 depends on a 4.4.0 BZ, it needs to depend on a 4.3.z BZ so we can confirm this was fixed in 4.3.z before merging the change into 4.2.z.

(the bot should have told you this, i'm looking into why it did not)

enxebre · 2020-02-19T17:53:11Z

/bugzilla refresh
@bparees sorted the bz wrong referencing. PTAL

openshift-ci-robot · 2020-02-19T17:53:15Z

@enxebre: This pull request references Bugzilla bug 1796595, which is valid.

In response to this:

/bugzilla refresh
@bparees sorted the bz wrong referencing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

This is to satisfy contract with the machine controller so it can react appropriately and signal a machine as failed with the given error message when an InvalidMachineConfiguration error is reported, main cases are: - The spec is invalid: Unable to be unmarshalled Missing expected labels by design Referencing non existing kubernetes resources - The input values provoke a client error 4xx response from the cloud call to create a machine.

This changes the behavior of the actuator's `getMachineInstances` method to first look for a machine's backing instance by the instance ID in the provider status. If the instance ID is not set, it falls back to the previous method of filtering by tags. This should help prevent issues where AWS creates the instance, but takes some time to propagate tag values, which can result in Exists() wrongly returning false.

This attemps to mitigate https://bugzilla.redhat.com/show_bug.cgi?id=1772163. Instead of telling the API request to filter by state, we fetch all and discriminate only terminated instances

…tances getInstances has taken an instanceStateFilter argument since it landed as GetInstances in 72b4823 (Migrate cluster operator implementation of the aws actuator, 2018-08-04, openshift#2).

…ancesOutput So we can create pending, terminated, etc. instances for testing. This should scale better than stub{State}InstanceDescribeInstancesOutput from openshift/cluster-api-provider-aws@c8e341f22f (Stop retieving instances by state on API call, 2019-11-14, openshift#269).

…ceByID openshift/cluster-api-provider-aws@c8e341f22f (Stop retieving instances by state on API call, 2019-11-14, openshift#269) did part of this. This commit: * Restores the instanceStateFilter argument to getInstanceByID, to avoid having two definitions (existingInstanceStates and a local != Terminated in getExistingInstanceByID) that define what it means to exist. This will also give us logged error messages like: instance i-123 state terminated is not in running, pending, stopped, stopping, shutting-down which will make it easier to find and fix omissions in existingInstanceStates. * Adds a -running suffix to has-status-search-by-id, so it's easier to distinguish from the terminated case. * Converts to the hyphenated has-status-search-by-id-terminated, instead of mixing hyphens and spaces in a case name. * Uses inline argument for the DescribeInstances mocks, so we don't have to use request and request2. These aren't so big that inlining them makes the mock setup unreadable. * Returns an empty set of instances in the instance-listing fallback call in has-status-search-by-id-terminated (which we now require to happen after the by-ID call). The list call is filtered by state, so it wouldn't return a terminated instance.

Like openshift/cluster-api-provider-aws@e4e94b1e08 (pkg/actuators/machine/utils: Client-side state filtering in getInstanceByID, 2019-11-13, openshift#268), but for getInstances. The cluster/name filtering is already enough to limit the list response to a reasonable size, so we can perform the state filtering locally in order to produce useful error messages. Also fix a bug in TestGetMachineInstances about the expected return type of getMachineInstances; it's a slice, not a single instance. Also, fix a few "Running" -> InstanceStateNameRunning bugs; the AWS string is "running", but we shouldn't care either way.

… comment The updateProviderID comment was probably a copy/paste error when setMachineCloudProviderSpecifics landed in 716083f (Set additional machine annotations/labels to get pretty machine output, 2019-07-23, openshift#242). I haven't read through the function carefully enough to be able to propose a comment that adds much beyond the function name, which is fairly clear on its own. But I know the updateProviderID comment is not a good fit ;).

Avoid [1]: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-openshift-ansible-e2e-aws-scaleup-rhel7-4.3/255/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-b24mfpd6-stable-sha256-64c63eedf863406fbc6c7515026f909a7221472cf70283708fb7010dd5e6139e/namespaces/openshift-machine-api/pods/machine-api-controllers-785b4687bf-wspnd/machine-controller/machine-controller/logs/current.log | grep -B5 'ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: found 0 existing instances for machine' 2019-11-18T01:08:14.686879882Z I1118 01:08:14.686811 1 actuator.go:552] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: Found instance by id: i-04e60273f38857eed 2019-11-18T01:08:14.686879882Z I1118 01:08:14.686839 1 actuator.go:502] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: Instance exists as "i-04e60273f38857eed" 2019-11-18T01:08:14.686879882Z I1118 01:08:14.686853 1 controller.go:284] Reconciling machine "ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696" triggers idempotent update 2019-11-18T01:08:14.686879882Z I1118 01:08:14.686862 1 actuator.go:406] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: updating machine 2019-11-18T01:08:14.68696548Z I1118 01:08:14.686938 1 actuator.go:414] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: obtaining EC2 client for region 2019-11-18T01:08:14.741237517Z I1118 01:08:14.741189 1 actuator.go:430] ci-op-b24mfpd6-6df53-m62pz-worker-us-east-1a-centos-sl696: found 0 existing instances for machine by waiting a bit for AWS to resolve any eventual-consistency issues [2] before giving up on a machine and claiming it failed. With this commit, we'll try again until requeueAfterSeconds have passed since the last status update when we have a machine ID to look for (from creation). The efficiency of the retry request is weakened a bit by the AWS provider / machine controller pairing apparently ignoring requeue delays [3]. And if we respected retries like that, we probably wouldn't hit the guard this commit is adding. But a busy loop here is still better than abandoning the machine, and we don't have a better fix for the ignored requeue delay in the pipe. Personally, I'd rather tie this guard to the Provisioning and Provisioned machine phases, or the MachineCreationSucceeded condition, or some such to avoid excessive delays if folks were bumping the machine status and LastUpdated for other reasons. But Alberto points out that machine phases are supposed to be a higher-level abstraction that doesn't leak down into the provider [4] and he prefers switching on ProviderID [5], so that's what we have in this commit. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1772163#c11 [2]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/query-api-troubleshooting.html#eventual-consistency [3]: https://bugzilla.redhat.com/show_bug.cgi?id=1772192 [4]: openshift#271 (comment) [5]: openshift#271 (comment)

Same guard Update grew in openshift/cluster-api-provider-aws@09c1fe9cee9e68 (pkg/actuators/machine/actuator: Eventual-consistency wait for machines, 2019-11-18, openshift#271), but now for Exists and Describe. This should protect us from flows like [1]: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2712/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/3059/artifacts/e2e-aws-scaleup-rhel7/must-gather/registry-svc-ci-openshift-org-ci-op-nr00dj4c-stable-sha256-b6dffb34bfe5c198d7e9ad032322277c2a129e45a4e7f2084a83ab560530f683/namespaces/openshift-machine-api/pods/machine-api-controllers-9bd7457fc-tfdmg/machine-controller/machine-controller/logs/current.log | grep ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz ... 2019-11-22T17:39:34.642709291Z I1122 17:39:34.642651 1 actuator.go:194] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: ProviderID updated at machine spec: aws:///us-east-1a/i-08b09536e36303450 ... 2019-11-22T17:39:34.660245157Z I1122 17:39:34.660202 1 actuator.go:668] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance state still pending, returning an error to requeue 2019-11-22T17:39:34.660327796Z W1122 17:39:34.660285 1 controller.go:321] Failed to create machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz": requeue in: 20s 2019-11-22T17:39:34.660426954Z I1122 17:39:34.660380 1 controller.go:164] Reconciling Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" ... 2019-11-22T17:39:34.660426954Z I1122 17:39:34.660416 1 actuator.go:493] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Checking if machine exists 2019-11-22T17:39:34.732221813Z I1122 17:39:34.732177 1 actuator.go:501] ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz: Instance does not exist 2019-11-22T17:39:34.732221813Z I1122 17:39:34.732202 1 controller.go:428] Machine "ci-op-nr00dj4c-9de00-ls2x8-worker-us-east-1a-rhel-fdknz" going into phase "Failed" ... Also, because the code was basically the same between Exists and Describe, I've refactored Exists to be a thin wrapper around Describe. It's unfortunate that Describe does not take a Context argument, but there's no regression because Exists was ignoring its Context argument before anyway ;). [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1761882#c15

If the Spec.ProviderID field is present but an empty string, we may encounter issues. This is particularly the case in DR scenarios when a user is attempting to copy an existing machine in attempt to create one.

enxebre · 2020-02-20T09:04:32Z

Rebased after #294 made it in. Should be good now if tests say so.

alexander-demicev

/lgtm

openshift-ci-robot · 2020-02-20T10:27:53Z

@enxebre: All pull requests linked via external trackers have merged. Bugzilla bug 1796595 has been moved to the MODIFIED state.

In response to this:

Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Feb 7, 2020

openshift-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 7, 2020

enxebre changed the title ~~Bug 1796595: Mitigate aws eventual consistency issues~~ Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues Feb 7, 2020

openshift-ci-robot requested review from bison and frobware February 7, 2020 11:20

openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Feb 7, 2020

openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 7, 2020

openshift-ci-robot assigned alexander-demicev Feb 7, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 7, 2020

bison approved these changes Feb 7, 2020

View reviewed changes

openshift-ci-robot assigned bison Feb 7, 2020

bison approved these changes Feb 10, 2020

View reviewed changes

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2020

bparees added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. and removed bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Feb 19, 2020

openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Feb 19, 2020

openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Feb 19, 2020

bparees added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Feb 19, 2020

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 19, 2020

enxebre and others added 14 commits February 20, 2020 09:47

Fix typo in file name

d26fda8

Stop retieving instances by state on API call

744805b

This attemps to mitigate https://bugzilla.redhat.com/show_bug.cgi?id=1772163. Instead of telling the API request to filter by state, we fetch all and discriminate only terminated instances

pkg/actuators/machine/utils: Drop running/pending comment from getIns…

3038f30

…tances getInstances has taken an instanceStateFilter argument since it landed as GetInstances in 72b4823 (Migrate cluster operator implementation of the aws actuator, 2018-08-04, openshift#2).

Use getMachineInstances to find instances during update() call

1d28bb6

Ensure Spec.ProviderID is not empty string

1a3ca55

If the Spec.ProviderID field is present but an empty string, we may encounter issues. This is particularly the case in DR scenarios when a user is attempting to copy an existing machine in attempt to create one.

Revendor to bring k8s.io/utils/pointer

4f7fbd0

enxebre force-pushed the fix-1796595-4.2 branch from 805ffdf to 4f7fbd0 Compare February 20, 2020 09:00

openshift-ci-robot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 20, 2020

alexander-demicev approved these changes Feb 20, 2020

View reviewed changes

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 20, 2020

openshift-merge-robot merged commit 75ed797 into openshift:release-4.2 Feb 20, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues #295

Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues #295

enxebre commented Feb 7, 2020 •

edited

Loading

openshift-ci-robot commented Feb 7, 2020

enxebre commented Feb 7, 2020

openshift-ci-robot commented Feb 7, 2020

openshift-ci-robot commented Feb 7, 2020

alexander-demicev commented Feb 7, 2020

bison left a comment

enxebre commented Feb 10, 2020

bison left a comment

openshift-ci-robot commented Feb 10, 2020

ashcrow commented Feb 10, 2020

enxebre commented Feb 11, 2020 •

edited

Loading

bparees commented Feb 19, 2020

enxebre commented Feb 19, 2020 •

edited

Loading

openshift-ci-robot commented Feb 19, 2020

enxebre commented Feb 20, 2020

alexander-demicev left a comment

openshift-ci-robot commented Feb 20, 2020

Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues #295

Bug 1796595: [release-4.2] Mitigate aws eventual consistency issues #295

Conversation

enxebre commented Feb 7, 2020 • edited Loading

openshift-ci-robot commented Feb 7, 2020

enxebre commented Feb 7, 2020

openshift-ci-robot commented Feb 7, 2020

openshift-ci-robot commented Feb 7, 2020

alexander-demicev commented Feb 7, 2020

bison left a comment

Choose a reason for hiding this comment

enxebre commented Feb 10, 2020

bison left a comment

Choose a reason for hiding this comment

openshift-ci-robot commented Feb 10, 2020

ashcrow commented Feb 10, 2020

enxebre commented Feb 11, 2020 • edited Loading

bparees commented Feb 19, 2020

enxebre commented Feb 19, 2020 • edited Loading

openshift-ci-robot commented Feb 19, 2020

enxebre commented Feb 20, 2020

alexander-demicev left a comment

Choose a reason for hiding this comment

openshift-ci-robot commented Feb 20, 2020

enxebre commented Feb 7, 2020 •

edited

Loading

enxebre commented Feb 11, 2020 •

edited

Loading

enxebre commented Feb 19, 2020 •

edited

Loading