Cache the error from the last asynchronous reconciliation #391

ulucinar · 2024-04-24T20:30:58Z

Description of your changes

Related issue: crossplane-contrib/provider-upjet-aws#1164, crossplane-contrib/provider-upjet-azure#684

This PR proposes a set of changes to:

Cache the error from the last asynchronous reconciliation to return it in the next asynchronous reconciliation for the Terraform plugin SDK & framework based external clients.
Set the "Synced" status condition to "False" in the async CallbackFn to immediately update it when the async operation fails.
Set the "Synced" status condition to "True" when the async operation succeeds, or when the external client's Observe call reveals an up-to-date external resource which is not scheduled for deletion.

These changes result in the Synced status condition to being set to False when the asynchronously executing reconciliation operation encounters an error. Previously, we used to set a LastAsyncOperation to signal an error that the last asynchronous operation has encountered but we did not set the Synced status condition to False, which is a violation of the XRM contract.

The same issue exists also for the Terraform CLI-based asynchronous external client but we will address that client in a follow-up PR.

I have:

Read and followed Upjet's contribution process.
Run make reviewable to ensure this PR is ready for review.
Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Tested by updating a Policy.iam resource with an invalid policy document as described in crossplane-contrib/provider-upjet-aws#1164. Without the proposed changes here, here's a log of the registered Synced status condition values after each reconciliation when a bad policy document is specified. Initially, the policy document is a valid one:

True
True
True
True
True
True
True
True
True
True
True
True
True

Here, the first reconciliation is done on a valid document, and the remaining ones are done on an invalid document (the ones on invalid document are exponentially backed-off).

And with the proposed changes, here are the logs for the same scenario:

True
True
False
False
False
False
False
False
False
False
False
False
False
False

So, after the asynchronous reconciliation encounters an error, the Synced condition is set to False and it's stable, i.e., it does not oscillate between the True & False states as long as the desired policy document stays invalid (with a bad syntax). When it's set to a valid document, the Synced condition is stably set to True (stays at the state True).

And here's the full status.conditions in error state:

status:
...
  conditions:
  - lastTransitionTime: "2024-04-24T20:39:29Z"
    reason: Available
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-04-24T21:19:43Z"
    message: "update failed: async update failed: failed to update the resource: [{0
      updating IAM Policy (arn:aws:iam::...:policy/test-policy): MalformedPolicyDocument:
      Syntax errors in policy.\n\tstatus code: 400, request id: 761851df-687a-4c66-984f-d0cd7b212b96
      \ []}]"
    reason: ReconcileError
    status: "False"
    type: Synced
  - lastTransitionTime: "2024-04-24T21:19:43Z"
    message: "async update failed: failed to update the resource: [{0 updating IAM
      Policy (arn:aws:iam::...:policy/test-policy): MalformedPolicyDocument:
      Syntax errors in policy.\n\tstatus code: 400, request id: 761851df-687a-4c66-984f-d0cd7b212b96
      \ []}]"
    reason: AsyncUpdateFailure
    status: "False"
    type: LastAsyncOperation

A test similar to the one described above was conducted for SecurityGroupIngressRule resource, using the example manifest.

…t in the next asynchronous reconciliation for the Terraform plugin SDK & framework based external clients. - Set the "Synced" status condition to "False" in the async CallbackFn to immediately update it when the async operation fails. - Set the "Synced" status condition to "True" when the async operation succeeds, or when the external client's Observe call reveals an up-to-date external resource which is not scheduled for deletion. Signed-off-by: Alper Rifat Ulucinar <[email protected]>

ulucinar · 2024-04-24T21:28:13Z

pkg/controller/api.go

+	errXPReconcileCreate = "create failed"
+	errXPReconcileUpdate = "update failed"
+	errXPReconcileDelete = "delete failed"


I've copied the error messages from crossplane-runtime to prevent flip-flops in the error messages (when the managed reconciler sets the error and when the upjet runtime sets it).

sergenyalcin

Thanks @ulucinar LGTM!

ulucinar requested review from sergenyalcin, erhancagirici and mergenci as code owners April 24, 2024 20:30

ulucinar marked this pull request as draft April 24, 2024 20:33

ulucinar force-pushed the fix-sync-state branch from 0ae1a67 to 77cc776 Compare April 24, 2024 21:10

ulucinar marked this pull request as ready for review April 24, 2024 21:26

ulucinar commented Apr 24, 2024

View reviewed changes

sergenyalcin approved these changes Apr 25, 2024

View reviewed changes

ulucinar merged commit 577bfa7 into crossplane:main Apr 25, 2024
6 checks passed

ulucinar deleted the fix-sync-state branch April 25, 2024 11:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cache the error from the last asynchronous reconciliation #391

Cache the error from the last asynchronous reconciliation #391

ulucinar commented Apr 24, 2024 •

edited

Loading

ulucinar Apr 24, 2024

sergenyalcin left a comment

Cache the error from the last asynchronous reconciliation #391

Cache the error from the last asynchronous reconciliation #391

Conversation

ulucinar commented Apr 24, 2024 • edited Loading

Description of your changes

How has this code been tested

ulucinar Apr 24, 2024

Choose a reason for hiding this comment

sergenyalcin left a comment

Choose a reason for hiding this comment

ulucinar commented Apr 24, 2024 •

edited

Loading