feat(bigquery): add support for query cost estimate #18694
Conversation
@classmethod
def estimate_statement_cost(
    cls, statement: str, cursor: Any, engine: Engine
The only way to estimate the cost in advance in BigQuery is to run the query with dry_run, and since that isn't possible with only the cursor, I added engine as an argument.
Another option would be to handle bigquery.Client directly by configuring SQLAlchemy to pass the dry_run parameter when creating the connection, but that seems more complicated:
https://github.com/googleapis/python-bigquery-sqlalchemy#connection-string-parameters
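For context, a dry run in the Python BigQuery client looks roughly like this. This is a minimal sketch: the helper name and the duck-typed client parameter are illustrative, not the PR's actual code.

```python
from typing import Any


def estimate_bytes_processed(client: Any, sql: str, job_config: Any = None) -> int:
    """Submit `sql` as a dry run and return the bytes it would process.

    With the real library, pass:
        client = google.cloud.bigquery.Client()
        job_config = google.cloud.bigquery.QueryJobConfig(
            dry_run=True, use_query_cache=False
        )
    A dry-run job completes immediately without executing the query.
    """
    query_job = client.query(sql, job_config=job_config)
    return query_job.total_bytes_processed
```

The function only needs an object with a `query` method returning a job exposing `total_bytes_processed`, which is why the PR needs access to more than a DB-API cursor.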
@kzosabe this looks great! One minor comment about making the code a bit more DRY, other than that looks good to go.
superset/db_engine_specs/bigquery.py
Outdated
@classmethod
def query_cost_formatter(
    cls, raw_cost: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    def format_bytes_str(raw_bytes: int) -> str:
        if not isinstance(raw_bytes, int):
            return str(raw_bytes)
        units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
        index = 0
        bytes = float(raw_bytes)
        while bytes >= 1024 and index < len(units) - 1:
            bytes /= 1024
            index += 1
        return f"{bytes:.1f} {units[index]}"

    return [
        {
            k: format_bytes_str(v) if k == "Total bytes processed" else str(v)
            for k, v in row.items()
        }
        for row in raw_cost
    ]
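As a standalone sanity check of the helper's logic, the function below mirrors the nested format_bytes_str above so it can be run on its own:

```python
def format_bytes_str(raw_bytes: int) -> str:
    # Mirrors the nested helper above: binary (1024) units, one decimal place.
    if not isinstance(raw_bytes, int):
        return str(raw_bytes)
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    index = 0
    value = float(raw_bytes)
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    return f"{value:.1f} {units[index]}"


print(format_bytes_str(500))           # 500.0 B
print(format_bytes_str(1536))          # 1.5 KiB
print(format_bytes_str(10 * 1024**3))  # 10.0 GiB
```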
It seems this logic overlaps with the humanize functions in the query_cost_formatter methods in TrinoEngineSpec and PrestoEngineSpec. I wonder if we should move humanize to BaseEngineSpec so we could remove the duplication?
Thanks for the review!

> so we could remove the duplication?

Going DRY may well be better, but there are a few things to consider.
The intent of this implementation was to be consistent with the official BigQuery UI, both in using KiB notation and in rounding to one decimal place.
In particular, the current Presto and Trino implementations divide by 1000 instead of 1024, which is a problem here: it would cause a small difference between the predicted byte count shown by BigQuery and the one shown by Superset.
I would like to avoid using the current humanize implementation as is, because that would confuse users.
I think there are several possible patterns:
a. Based on the humanize implementation, add methods that take prefixes and to_next_prefixes as parameters. This allows a common implementation with identical results, but is somewhat complex to implement.
b. Provide two methods, humanize_number and humanize_bytes. The byte-count display in Trino and Presto would change slightly.
c. Keep separate implementations (or share only between Trino and Presto).
Which do you think is best?
Any of them is fine with me, and I'll implement whichever we choose.
However, it might be better to do this in a separate PR.
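Option (a) might look something like this: a single helper taking the unit ladder and base as parameters. This is a hypothetical sketch, not the actual Superset API.

```python
from typing import Any, Sequence


def humanize(value: Any, units: Sequence[str], base: int = 1000) -> str:
    # Shared helper: base=1000 reproduces the Presto/Trino behavior,
    # base=1024 the BigQuery-console-style binary units.
    value = float(value)
    index = 0
    while value >= base and index < len(units) - 1:
        value /= base
        index += 1
    return f"{value:.1f} {units[index]}"


# Decimal prefixes for row counts (Presto/Trino style):
print(humanize(904_000_000, ["rows", "K rows", "M rows", "G rows"]))  # 904.0 M rows

# Binary prefixes for byte counts (BigQuery style):
print(humanize(354_000_000_000, ["B", "KiB", "MiB", "GiB"], base=1024))  # 329.7 GiB
```

Both call sites share one loop, and the only per-engine decision is the unit ladder and base.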
Good point, being consistent with the BQ console definitely makes sense. I remember a discussion about this in the original PR where 1024 vs 1000 was debated: #8172 (comment). That said, even while being consistent with the BQ console, it would feel funny to have different units for BQ vs Presto/Trino. @betodealmeida thoughts?
Does anyone have any suggestions?
I think the ability to estimate bytes is the important part and the exact format is relatively unimportant, so I'm not strongly opinionated about how to handle it.
I think it would be better to refactor Trino and Presto to KiB notation, but if there are reasons not to, I'll switch the BigQuery implementation to KB notation.
I think we can change the humanize function in Presto (it's also duplicated in Trino) to return bytes in 1024 increments. The only reason I used 1000 is that it's also applied to row counts and other parameters, which keeps everything consistent. Ideally we'd have a single function used by Presto, Trino, BigQuery and other engine specs.
(Note that it's also possible to override the formatter function using QUERY_COST_FORMATTERS_BY_ENGINE in the config. We used that at Lyft to show the query cost in dollars, estimated run time, and carbon footprint.)
Codecov Report
@@ Coverage Diff @@
## master #18694 +/- ##
==========================================
+ Coverage 66.28% 66.39% +0.11%
==========================================
Files 1605 1619 +14
Lines 62863 63012 +149
Branches 6341 6341
==========================================
+ Hits 41666 41835 +169
+ Misses 19545 19525 -20
Partials 1652 1652
superset/db_engine_specs/bigquery.py
Outdated
"Could not import libraries `google.cloud` or `google.oauth2`, " | ||
"which are required to be installed in your environment in order " | ||
"to estimate cost" |
Curious: wouldn't these necessarily be installed if the user has a BigQuery database connected?
That's right, we can simply use import here. I'll fix this!
@@ -1425,6 +1430,33 @@ def cancel_query(  # pylint: disable=unused-argument
    def parse_sql(cls, sql: str) -> List[str]:
        return [str(s).strip(" ;") for s in sqlparse.parse(sql)]

    @classmethod
    def _humanize(cls, value: Any, suffix: str, category: Optional[str] = None) -> str:
If an input like (1000, "", "dollars") could yield an output like "$1,000", more categories could be supported.
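A category-aware helper along those lines might look like this. Purely illustrative: humanize_value and its "dollars" category are hypothetical names, not the PR's actual _humanize.

```python
from typing import Any, Optional


def humanize_value(value: Any, suffix: str = "", category: Optional[str] = None) -> str:
    # Illustrative only: dispatch on a category name before falling
    # back to plain formatting with a suffix.
    if category == "dollars":
        return f"${value:,.0f}"
    return f"{value}{suffix}".strip()


print(humanize_value(1000, category="dollars"))  # $1,000
print(humanize_value(42, suffix=" rows"))        # 42 rows
```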
@@ -524,7 +524,7 @@ def test_query_cost_formatter(self):
         expected = [
             {
                 "Output count": "904 M rows",
-                "Output size": "354 GB",
+                "Output size": "329 GiB",
These two values are identical except for the units. I took care not to change any of the other outputs.
@betodealmeida Hello, I've addressed the points you raised; could you please review?
@betodealmeida @villebro I've been waiting for a review for about 3 weeks since the last update. I'd love some feedback on whether something is wrong with my PR or the reviewers are just busy.
I'm assuming that since @kzosabe hasn't responded, we missed the boat on that
SUMMARY
The ability to know in advance how many bytes a query will process is important when using BigQuery.
This PR adds query cost estimation to the BigQuery integration, as is already supported for other databases such as Postgres and Presto.
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
before:
after:
TESTING INSTRUCTIONS
Enable the ESTIMATE_QUERY_COST feature flag, then check "Enable query cost estimation" under Advanced > SQL Lab in the edit dialog of the BigQuery database connection.
ADDITIONAL INFORMATION
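To try it locally, the feature flag can be enabled in configuration. A minimal sketch of superset_config.py, with all other settings assumed default:

```python
# superset_config.py: turn on the query cost estimation feature flag
FEATURE_FLAGS = {
    "ESTIMATE_QUERY_COST": True,
}
```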