Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Use bigquery default retryable exceptions #1431

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

mikealfare
Copy link
Contributor

resolves #263

Problem

We're not retrying in all scenarios that BigQuery recommends. This is causing unexpected behavior for users. This likely happened as a result of us maintaining our own list of retryable errors, which fell out of sync with BigQuery's development over time.

Solution

Instead of maintaining our own list, we should rely on BigQuery's list of retryable errors and reasons. However, we have one scenario that is not covered by BigQuery's defaults, BadRequest. We need to also check this scenario after checking BigQuery's defaults to maintain backwards compatibility.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

…lts; keep BadRequest as retryable as that is not in BigQuery's defaults and could cause a breaking change by removing it
@mikealfare mikealfare added the bug Something isn't working label Dec 11, 2024
@mikealfare mikealfare self-assigned this Dec 11, 2024
@mikealfare mikealfare requested a review from a team as a code owner December 11, 2024 20:36
@cla-bot cla-bot bot added the cla:yes label Dec 11, 2024
dbt/adapters/bigquery/retry.py Show resolved Hide resolved
):
return True
elif isinstance(error, Forbidden) and any(
e["reason"] == "rateLimitExceeded" for e in error.errors
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is caught in the default retryable reasons (for more than just Forbidden).

def _is_retryable(error: Exception) -> bool:
"""Return true for errors that are unlikely to occur again if retried."""
if isinstance(
error, (BadGateway, BadRequest, ConnectionError, ConnectionResetError, ServerError)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only BadRequest is missing from BQ's defaults; the rest can be removed since they would be caught in _job_should_retry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We determined we can remove BadRequest here. The rationale is that we would expect a bad request to always be a bad request since the issue originates from the user side and not the BQ side.

tests/unit/test_bigquery_connection_manager.py Outdated Show resolved Hide resolved
dbt/adapters/bigquery/retry.py Outdated Show resolved Hide resolved
from google.api_core.future.polling import DEFAULT_POLLING
from google.api_core.retry import Retry
from google.cloud.bigquery.retry import DEFAULT_RETRY
from google.cloud.exceptions import BadGateway, BadRequest, ServerError
from google.cloud.bigquery.retry import DEFAULT_RETRY, _job_should_retry
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is a private function, there is no reasonable way to capture BQ's defaults here. We need to add an exception into list that goes into this function, which would then require vendoring the whole function. Even if we remove BadRequest, we would then need to pull a private attribute (_predicate) off of DEFAULT_JOB_RETRY so that we can implement the retry cap (job_retries).

@mikealfare mikealfare mentioned this pull request Dec 11, 2024
4 tasks
@@ -65,7 +65,9 @@ def test_is_retryable(self):
service_unavailable_error = exceptions.ServiceUnavailable("service is unavailable")

self.assertTrue(_is_retryable(internal_server_error))
self.assertTrue(_is_retryable(bad_request_error))
self.assertFalse(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth keeping this, but flipping it, to be clear that this was once supported, and no longer is supported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cla:yes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-1015] job_retries not obeyed when job_execution_timeout_seconds is exceeded
1 participant