feat(bigquery): add support for query cost estimate #18694
The hunk `@@ -185,6 +185,60 @@ class BigQueryEngineSpec(BaseEngineSpec):` adds the following methods (leading and trailing context lines trimmed):

```python
@classmethod
def get_allow_cost_estimate(cls, extra: Dict[str, Any]) -> bool:
    return True

@classmethod
def estimate_statement_cost(
    cls, statement: str, cursor: Any, engine: Engine
) -> Dict[str, Any]:
    try:
        # pylint: disable=import-outside-toplevel
        from google.cloud import bigquery
        from google.oauth2 import service_account
    except ImportError as ex:
        raise Exception(
            "Could not import libraries `google.cloud` or `google.oauth2`, "
            "which are required to be installed in your environment in order "
            "to estimate cost"
        ) from ex

    creds = engine.dialect.credentials_info
    credentials = service_account.Credentials.from_service_account_info(creds)
    client = bigquery.Client(credentials=credentials)
    dry_run_result = client.query(
        statement, bigquery.job.QueryJobConfig(dry_run=True)
    )

    return {
        "Total bytes processed": dry_run_result.total_bytes_processed,
    }

@classmethod
def query_cost_formatter(
    cls, raw_cost: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    def format_bytes_str(raw_bytes: int) -> str:
        if not isinstance(raw_bytes, int):
            return str(raw_bytes)
        units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
        index = 0
        bytes = float(raw_bytes)
        while bytes >= 1024 and index < len(units) - 1:
            bytes /= 1024
            index += 1

        return "{:.1f}".format(bytes) + f" {units[index]}"

    return [
        {
            k: format_bytes_str(v) if k == "Total bytes processed" else str(v)
            for k, v in row.items()
        }
        for row in raw_cost
    ]
```

Review thread on the `ImportError` handling:

Comment: Curious, wouldn't these be necessarily installed if the user has a BigQuery database connected?

Reply: That's right, we can simply use `import` here. I'll fix this!
Review thread on `query_cost_formatter`:

Comment: It seems this logic overlaps with the

Reply: Thanks for the review! It may be better to go DRY here, but there are a few things to consider. The intent of this implementation was to be consistent with the official BigQuery UI, both in its KiB notation and in rounding to one decimal place. In particular, the current Presto and Trino implementations divide by 1000 instead of 1024, which is a problem.

Reply: I think there are several possible patterns:

a. Based on the humanize implementation, add methods that take prefixes and to_next_prefixes as parameters. This allows a common implementation with identical results, but is somewhat complex to implement.

b. Provide two methods, humanize_number and humanize_bytes. The byte-count display in Trino and Presto would change slightly.

c. Keep a separate implementation (or share one only between Trino and Presto).

Which do you think is best?

Comment: Good point, being consistent with the BQ console definitely makes sense. I remember a discussion about this in the original PR where 1024 vs 1000 was debated: #8172 (comment). That said, it would feel funny to have different units for BQ vs Presto/Trino. @betodealmeida thoughts?

Reply: Does anyone have any suggestions?

Comment: I think we can change the (Note that it's also possible to overwrite the formatter function using
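Pattern (a) above could look roughly like the following sketch; the function and constant names are hypothetical, and the decimal variant mirrors the 1000-based division currently used by the Presto/Trino formatters, so the same byte count renders differently under each convention:

```python
from typing import List

def humanize_bytes(raw_bytes: int, prefixes: List[str], step: int) -> str:
    """Generic formatter: divide by `step` until the value fits the
    largest applicable prefix, then render to one decimal place."""
    value = float(raw_bytes)
    index = 0
    while value >= step and index < len(prefixes) - 1:
        value /= step
        index += 1
    return f"{value:.1f} {prefixes[index]}"

BINARY = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]   # BigQuery console style
DECIMAL = ["B", "kB", "MB", "GB", "TB", "PB"]       # 1000-based style

n = 10**9
print(humanize_bytes(n, BINARY, 1024))   # 953.7 MiB
print(humanize_bytes(n, DECIMAL, 1000))  # 1.0 GB
```

This makes the 1024-vs-1000 debate concrete: one gigabyte of scanned data shows as "953.7 MiB" in binary units but "1.0 GB" in decimal units.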
(The hunk's trailing context lines are the signature of the existing `convert_dttm` method.)
Comment (from the author, on the new `engine` parameter): The only way to estimate the cost of a query in advance in BigQuery is to run it with dry_run, and since this is not possible with only the cursor, I added `engine` as an argument. Another way to use `bigquery.Client` directly would be to configure SQLAlchemy to pass the dry-run parameter when creating the connection, but this seems more complicated: https://github.com/googleapis/python-bigquery-sqlalchemy#connection-string-parameters
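The dry-run approach described above can be sketched directly against the google-cloud-bigquery client, outside of Superset. This is an illustrative sketch, not the PR's code: it needs the library installed and valid application-default credentials to actually submit anything, so the live call is guarded:

```python
try:
    from google.cloud import bigquery
except ImportError:
    bigquery = None  # library not installed; the live sketch below is skipped

def dry_run_bytes(client: "bigquery.Client", sql: str) -> int:
    """Submit `sql` as a dry run: BigQuery validates the query and reports
    the bytes it would scan, without executing it or incurring cost."""
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed

if bigquery is not None:
    try:
        client = bigquery.Client()  # uses application-default credentials
        print(dry_run_bytes(client, "SELECT 1"))
    except Exception as exc:  # no credentials or network in this sketch
        print(f"skipping live dry run: {exc}")
```

Because a dry run needs a `Client` (there is no cursor-level hook for it), the PR reads the service-account info off `engine.dialect.credentials_info` to build one, which is why `estimate_statement_cost` takes `engine` in addition to `cursor`.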