Add operation retry for exceeded quota group OperationReadGroup #4599

Merged on Mar 22, 2021 (2 commits)

Member

@c2thorn c2thorn commented Mar 16, 2021

Fixes hashicorp/terraform-provider-google#8655

The error we need to retry on:

{
  "error": {
    "code": 403,
    "message": "Quota exceeded for quota group 'OperationReadGroup' and limit 'Operation read requests per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:<>'.",
    "errors": [
      {
        "message": "Quota exceeded for quota group 'OperationReadGroup' and limit 'Operation read requests per 100 seconds' of service 'compute.googleapis.com' for consumer 'project_number:<>'.",
        "domain": "usageLimits",
        "reason": "rateLimitExceeded"
      }
    ],
    "status": "PERMISSION_DENIED"
  }
}

Currently only GCE has a quota group called OperationReadGroup, but the compute operations use CommonRefreshFunc. Rather than rewrite compute operations to use a non-common refresh function, I just added another retry predicate to the existing one.
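A minimal sketch of what such a retry predicate could look like, assuming the predicate receives the API error and returns whether to retry plus a reason. The `apiError` type and the function name are hypothetical stand-ins for illustration, not the provider's actual code; the match on status 403 and the quota group name is taken from the error body above.

```go
package main

import (
	"errors"
	"strings"
)

// apiError is a hypothetical stand-in for the client library's error type.
// It carries only the two fields this sketch inspects.
type apiError struct {
	Code    int
	Message string
}

func (e *apiError) Error() string { return e.Message }

// isOperationReadQuotaError reports whether err is a 403 whose message names
// the OperationReadGroup quota group. The second return value is a reason
// suitable for a log line.
func isOperationReadQuotaError(err error) (bool, string) {
	var apiErr *apiError
	if !errors.As(err, &apiErr) {
		// Not an API error (or nil): never retry on this predicate.
		return false, ""
	}
	if apiErr.Code == 403 &&
		strings.Contains(apiErr.Message, "Quota exceeded for quota group 'OperationReadGroup'") {
		return true, "waiting for operation read quota to refresh"
	}
	return false, ""
}
```

Matching on both the status code and the quota group name keeps ordinary 403 permission errors from being retried.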

Tested by spamming operation reads. Sorry, GCE SREs...

Release Note Template for Downstream PRs (will be copied)

compute: fixed an issue where exceeding the operation rate limit would fail without retrying

@google-cla google-cla bot added the cla: yes label Mar 16, 2021
@modular-magician
Collaborator

Hi! I'm the modular magician. Your PR generated some diffs in downstreams - here they are.

Diff report:

Terraform GA: Diff ( 2 files changed, 12 insertions(+), 1 deletion(-))
Terraform Beta: Diff ( 2 files changed, 12 insertions(+), 1 deletion(-))
TF Conversion: Diff ( 2 files changed, 12 insertions(+), 1 deletion(-))

@modular-magician
Collaborator

I have triggered VCR tests based on this PR's diffs. See the results here: "https://ci-oss.hashicorp.engineering/viewQueued.html?itemId=177549"

@c2thorn c2thorn requested review from a team and melinath and removed request for a team March 16, 2021 21:13
Member

@melinath melinath left a comment

LGTM.

It might be nice to have unit tests for the error retry predicate.

One other thing: defaultErrorRetryPredicates includes is409OperationInProgressError, which is operation-specific. Would it make sense to include this in the defaults as well? Or would it be better to take the other one out of the defaults? Something else?
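A unit test for the predicate could be table-driven. The sketch below shows the idea without the `testing` package so that it stands alone; in a real repository it would more likely live in a `_test.go` file. `examplePredicate`, `exampleCases`, and `failedCases` are hypothetical names, and the message-based predicate is only a stand-in for the one under review.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// predicateCase is one row of the table: an input error and the expected
// retry decision.
type predicateCase struct {
	name string
	err  error
	want bool
}

// examplePredicate is a hypothetical, message-based stand-in for the
// predicate under test.
func examplePredicate(err error) (bool, string) {
	if err != nil && strings.Contains(err.Error(), "quota group 'OperationReadGroup'") {
		return true, "operation read quota exceeded"
	}
	return false, ""
}

// exampleCases covers the positive case and two negatives.
var exampleCases = []predicateCase{
	{"operation read quota", errors.New("Quota exceeded for quota group 'OperationReadGroup'"), true},
	{"unrelated 403", errors.New("caller lacks permission"), false},
	{"nil error", nil, false},
}

// failedCases runs pred over every case and returns one message per mismatch.
func failedCases(pred func(error) (bool, string), cases []predicateCase) []string {
	var failures []string
	for _, c := range cases {
		if got, _ := pred(c.err); got != c.want {
			failures = append(failures, fmt.Sprintf("%s: got %t, want %t", c.name, got, c.want))
		}
	}
	return failures
}
```

Including negative cases matters here because the predicate must not turn every 403 into a retry.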

@c2thorn
Member Author

c2thorn commented Mar 16, 2021

@melinath I'll add the test!

> One other thing: defaultErrorRetryPredicates includes is409OperationInProgressError, which is operation-specific. Would it make sense to include this in the defaults as well? Or would it be better to take the other one out of the defaults? Something else?

Good point. I think it's minor, but this predicate is pretty specific to querying operations. Adding it to the defaults would place the check in non-operation requests, which seems unnecessary.

Looking back to when the 409 predicate was added to defaults: d9ffbaf#diff-e59399c5cffdeb55901ca52dbe08a43f38439acb5ca7eb3a5532541cd7766e94R20
Looks like we've always had it as a default. It doesn't seem worth the effort to make sure we are moving it to the right place that covers all of the scenarios that it covers now.
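The trade-off discussed here can be sketched as follows, assuming a retry check that always consults a default list and additionally accepts caller-supplied predicates. All names below (`retryPredicate`, `defaultPredicates`, `isRetryable`, and the sentinel errors) are hypothetical and only illustrate the shape of the design, not the provider's implementation.

```go
package main

import "errors"

// retryPredicate returns whether to retry and a human-readable reason.
type retryPredicate func(error) (bool, string)

// Sentinel errors standing in for real API responses in this sketch.
var (
	errOperationInProgress = errors.New("409: operation in progress")
	errOperationReadQuota  = errors.New("403: quota exceeded for quota group 'OperationReadGroup'")
)

func isOperationInProgress(err error) (bool, string) {
	if errors.Is(err, errOperationInProgress) {
		return true, "operation is still in progress"
	}
	return false, ""
}

func isOperationReadQuota(err error) (bool, string) {
	if errors.Is(err, errOperationReadQuota) {
		return true, "operation read quota exceeded"
	}
	return false, ""
}

// defaultPredicates are checked for every request, operation-related or not.
var defaultPredicates = []retryPredicate{isOperationInProgress}

// isRetryable checks the defaults first, then any predicates the caller adds.
func isRetryable(err error, extra ...retryPredicate) bool {
	if err == nil {
		return false
	}
	for _, pred := range defaultPredicates {
		if ok, _ := pred(err); ok {
			return true
		}
	}
	for _, pred := range extra {
		if ok, _ := pred(err); ok {
			return true
		}
	}
	return false
}
```

Under this layout, operation polling would call `isRetryable(err, isOperationReadQuota)` while every other request calls `isRetryable(err)`, so the quota check never runs on non-operation requests.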

@modular-magician
Collaborator

Tests failed during RECORDING mode: TestAccCloudAssetProjectFeed_cloudAssetProjectFeedExample|TestAccComputeInstanceFromTemplate_012_removableFields|TestAccComputeForwardingRule_forwardingRuleHttpLbExample|TestAccComposerEnvironment_update|TestAccComposerEnvironment_withSoftwareConfig|TestAccComposerEnvironment_withEncryptionConfig
Please fix these to complete your PR.

@c2thorn c2thorn force-pushed the retry-operation-quota branch from efaa4c2 to cbecf28 Compare March 22, 2021 21:02
@modular-magician
Collaborator

Hi! I'm the modular magician. Your PR generated some diffs in downstreams - here they are.

Diff report:

Terraform GA: Diff ( 3 files changed, 23 insertions(+), 1 deletion(-))
Terraform Beta: Diff ( 4 files changed, 24 insertions(+), 2 deletions(-))
TF Conversion: Diff ( 2 files changed, 12 insertions(+), 1 deletion(-))

@modular-magician
Collaborator

I have triggered VCR tests based on this PR's diffs. See the results here: "https://ci-oss.hashicorp.engineering/viewQueued.html?itemId=178413"

Successfully merging this pull request may close these issues.

Terraform state leaks when the GCE Operation API(s) rate limit is exceeded
3 participants