
Requeue the reconcile request on async update error #231

Merged (7 commits merged into crossplane:main on Jul 27, 2023)

Conversation

@ulucinar (Collaborator) commented Jul 14, 2023

Description of your changes

Fixes #176
Fixes #180

This PR employs the new DesiredStatusChanged predicate from crossplane-runtime as an event filter to prevent overly frequent reconciles on error conditions or on updates to the external create pending/failed annotations. It also allows the async callback functions to requeue a reconcile request with exponential back-off when the async operation fails.
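
For readers unfamiliar with how such an event filter is wired up, below is a minimal sketch of an update-event predicate attached to a controller-runtime controller. It is not the code from this PR: `desiredStatusChanged` and `setupController` are illustrative names, and the generation check only approximates what a DesiredStatusChanged-style predicate does.

```go
package controller

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// desiredStatusChanged is a stand-in for the DesiredStatusChanged-style
// predicate referenced in this PR: it lets an update event through only when
// something the reconciler cares about has changed, so status-only or
// annotation-only churn does not immediately trigger another reconcile.
func desiredStatusChanged() predicate.Predicate {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			// Illustrative check only: spec changes bump the generation,
			// while pure status updates do not.
			return e.ObjectOld.GetGeneration() != e.ObjectNew.GetGeneration()
		},
	}
}

// setupController shows where such a predicate plugs in as an event filter.
func setupController(mgr ctrl.Manager, obj client.Object, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(obj).
		WithEventFilter(desiredStatusChanged()).
		Complete(r)
}
```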

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Tested the PR with crossplane-contrib/provider-upjet-aws#789.
When, for example, the external connector's Connect call fails, the requeue request is exponentially backed off with a cap of 1 minute. Here are sample timestamps from an experiment with a failing Connect call:

2023-07-26T22:40:35+03:00
2023-07-26T22:40:36+03:00
2023-07-26T22:40:37+03:00
2023-07-26T22:40:41+03:00
2023-07-26T22:40:49+03:00
2023-07-26T22:41:05+03:00
2023-07-26T22:41:37+03:00
2023-07-26T22:42:37+03:00
2023-07-26T22:43:37+03:00

To test whether we exponentially back off when an asynchronous callback errors, I provisioned an S3 bucket, put a file into it, and attempted to delete the bucket. Here are the timestamps of the (failing) reconciliation attempts while deleting the Bucket.s3 MR:

2023-07-26T23:03:35+03:00
2023-07-26T23:03:42+03:00
2023-07-26T23:03:48+03:00
2023-07-26T23:03:57+03:00
2023-07-26T23:04:06+03:00
2023-07-26T23:04:27+03:00
2023-07-26T23:05:04+03:00
2023-07-26T23:06:09+03:00
2023-07-26T23:07:14+03:00
2023-07-26T23:08:19+03:00
2023-07-26T23:09:24+03:00
2023-07-26T23:10:29+03:00
2023-07-26T23:11:34+03:00

These timestamps again show that the reconcile requests are being exponentially backed off.
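
As a self-contained illustration of the back-off shape seen in both timestamp series above, the sketch below uses client-go's capped exponential failure rate limiter. The 1-second base delay and 1-minute cap are assumptions inferred from the experiment, not values read from this PR's code.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Assumed parameters: 1s base delay, capped at 1 minute.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(time.Second, time.Minute)

	const item = "example-managed-resource"
	for attempt := 1; attempt <= 10; attempt++ {
		// Each failed reconcile asks the limiter when to retry; the delay
		// doubles on every failure until it reaches the cap.
		fmt.Printf("attempt %2d: requeue after %v\n", attempt, limiter.When(item))
	}
	// A successful reconcile would call limiter.Forget(item) to reset the back-off.
}
```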

We have also done some performance tests and those results can be found here.

@ulucinar ulucinar marked this pull request as draft July 14, 2023 12:42
@ulucinar ulucinar force-pushed the fix-180 branch 6 times, most recently from e54d702 to a5721c1, on July 14, 2023 15:51
@ulucinar ulucinar marked this pull request as ready for review July 20, 2023 21:10
@sergenyalcin (Member) left a comment

Thanks, @ulucinar. These are my comments. According to the results in the description, deletion times seem to be getting longer. How would you like to handle this?

pkg/controller/api.go (review thread, resolved)
pkg/controller/api.go (review thread, outdated, resolved)
@ulucinar ulucinar force-pushed the fix-180 branch 2 times, most recently from dfe3e46 to 945fff4, on July 26, 2023 15:32
@ulucinar (Collaborator, Author) commented Jul 26, 2023

I did a set of measurements with this PR. While this PR mainly decreases resource utilization by properly doing exponential back-off, especially when the external connector (or client) fails fast in the Connect call, I'm posting the measurements with index.docker.io/ulucinar/provider-aws:v0.38.0-rc.0.18.gca9398d9 (from crossplane-contrib/provider-upjet-aws#789) below for reference.

------------------------------------------------------------
INFO[0040] Experiment Duration: 40.048828 seconds
INFO[0100] Average Time to Readiness of Repository: 12.800000 seconds
INFO[0100] Peak Time to Readiness of Repository: 16.000000 seconds
INFO[0100] Pod: provider-aws-8d35452277a1-5c48fd696d-wklhf
INFO[0100] Average Memory: 661039894.811881 Bytes
INFO[0100] Peak Memory: 1248210944.000000 Bytes
INFO[0100] Pod: provider-aws-8d35452277a1-5c48fd696d-wklhf
INFO[0100] Average CPU: 12.935712 Rate
INFO[0100] Peak CPU: 16.035185 Rate
------------------------------------------------------------
INFO[0039] Experiment Duration: 39.268071 seconds
INFO[0099] Average Time to Readiness of Repository: 13.400000 seconds
INFO[0099] Peak Time to Readiness of Repository: 16.000000 seconds
INFO[0099] Pod: provider-aws-8d35452277a1-5c48fd696d-jkc4g
INFO[0099] Average Memory: 402984069.565217 Bytes
INFO[0099] Peak Memory: 545832960.000000 Bytes
INFO[0099] Pod: provider-aws-8d35452277a1-5c48fd696d-jkc4g
INFO[0099] Average CPU: 16.826778 Rate
INFO[0099] Peak CPU: 19.651389 Rate
------------------------------------------------------------
INFO[0040] Experiment Duration: 40.004987 seconds
INFO[0100] Average Time to Readiness of Repository: 12.600000 seconds
INFO[0100] Peak Time to Readiness of Repository: 16.000000 seconds
INFO[0100] Pod: provider-aws-8d35452277a1-5c48fd696d-9wkpm
INFO[0100] Average Memory: 536236977.230769 Bytes
INFO[0100] Peak Memory: 793706496.000000 Bytes
INFO[0100] Pod: provider-aws-8d35452277a1-5c48fd696d-9wkpm
INFO[0100] Average CPU: 17.463247 Rate
INFO[0100] Peak CPU: 20.957870 Rate

@ulucinar (Collaborator, Author) commented, quoting the review above:

> Thanks, @ulucinar. These are my comments. According to the results in the description, deletion times seem to be getting longer. How would you like to handle this?

I did some refactoring and repeated the tests here. I think the experiments with 5 MRs have high variance, but the total experiment times are still around the expected values.

@sergenyalcin (Member) left a comment

Thanks @ulucinar LGTM!

@ulucinar ulucinar merged commit e4ffeb8 into crossplane:main Jul 27, 2023
@ulucinar ulucinar deleted the fix-180 branch July 27, 2023 12:22
@sergenyalcin (Member) commented:

Some performance test results. There are improvements with this PR.

v0.37.0 AWS Monolithic Test Results

(1) INFO[0683] Results - v0.37.0 - AWS - 100 ECR Repository
INFO[0683] Experiment Duration: 683.665586 seconds
INFO[0744] Average Time to Readiness of Repository: 158.760000 seconds
INFO[0744] Peak Time to Readiness of Repository: 449.000000 seconds
INFO[0744] Average Memory Working Set: 1793215686.011050 Bytes
INFO[0744] Peak Memory Working Set: 2998583296.000000 Bytes
INFO[0744] Average CPU Utilisation: 74.995274 Rate
INFO[0744] Peak CPU Utilisation: 96.572685 Rate

(2) INFO[0776] Results - v0.37.0 - AWS - 100 ECR Repository
INFO[0776] Experiment Duration: 776.977605 seconds
INFO[0837] Average Time to Readiness of Repository: 171.840000 seconds
INFO[0837] Peak Time to Readiness of Repository: 512.000000 seconds
INFO[0837] Average Memory Working Set: 1681988067.894988 Bytes
INFO[0837] Peak Memory Working Set: 3320328192.000000 Bytes
INFO[0837] Average CPU Utilisation: 77.289236 Rate
INFO[0837] Peak CPU Utilisation: 97.487037 Rate

(3) INFO[0783] Results - v0.37.0 - AWS - 100 ECR Repository
INFO[0783] Experiment Duration: 783.135806 seconds
INFO[0844] Average Time to Readiness of Repository: 170.790000 seconds
INFO[0844] Peak Time to Readiness of Repository: 514.000000 seconds
INFO[0844] Average Memory Working Set: 1671129672.086643 Bytes
INFO[0844] Peak Memory Working Set: 3548098560.000000 Bytes
INFO[0844] Average CPU Utilisation: 87.594181 Rate
INFO[0844] Peak CPU Utilisation: 97.258333 Rate

AWS Monolithic Test Results from this branch

(1) INFO[0351] Results - index.docker.io/ulucinar/provider-aws:v0.38.0-rc.0.43.g24291c953 - AWS - 100 ECR Repository
INFO[0351] Experiment Duration: 351.023854 seconds
INFO[0411] Average Time to Readiness of Repository: 118.950000 seconds
INFO[0411] Peak Time to Readiness of Repository: 158.000000 seconds
INFO[0411] Average Memory Working Set: 1801325607.766990 Bytes
INFO[0411] Peak Memory Working Set: 4295589888.000000 Bytes
INFO[0411] Average CPU Utilisation: 58.053456 Rate
INFO[0411] Peak CPU Utilisation: 95.928241 Rate

(2) INFO[0361] Results - index.docker.io/ulucinar/provider-aws:v0.38.0-rc.0.43.g24291c953 - AWS - 100 ECR Repository
INFO[0361] Experiment Duration: 361.475772 seconds
INFO[0422] Average Time to Readiness of Repository: 124.750000 seconds
INFO[0422] Peak Time to Readiness of Repository: 164.000000 seconds
INFO[0422] Average Memory Working Set: 1712672535.052133 Bytes
INFO[0422] Peak Memory Working Set: 3577266176.000000 Bytes
INFO[0422] Average CPU Utilisation: 58.469990 Rate
INFO[0422] Peak CPU Utilisation: 94.883796 Rate

(3) INFO[0370] Results - index.docker.io/ulucinar/provider-aws:v0.38.0-rc.0.43.g24291c953 - AWS - 100 ECR Repository
INFO[0370] Experiment Duration: 370.100280 seconds
INFO[0430] Average Time to Readiness of Repository: 125.030000 seconds
INFO[0430] Peak Time to Readiness of Repository: 171.000000 seconds
INFO[0430] Average Memory Working Set: 1889391200.222738 Bytes
INFO[0430] Peak Memory Working Set: 3654717440.000000 Bytes
INFO[0430] Average CPU Utilisation: 58.346910 Rate
INFO[0430] Peak CPU Utilisation: 91.258796 Rate
