
feat(bigquery): add support for query cost estimate #18694

Closed
wants to merge 8 commits into from

Conversation

@kzosabe commented Feb 13, 2022

SUMMARY

Knowing in advance how many bytes a query will process is important when using BigQuery.
This PR adds query cost estimation to the BigQuery integration, as is already available for other databases such as Postgres and Presto.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

before: [screenshot]

after: [screenshots]

TESTING INSTRUCTIONS

  • Enable feature flag ESTIMATE_QUERY_COST
  • Check ADVANCED > SQL Lab > Enable query cost estimation in the edit dialog in the BigQuery database connection
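For convenience, the flag from the first step can typically be enabled in superset_config.py (a minimal sketch using Superset's standard FEATURE_FLAGS mechanism; file location depends on your deployment):

```python
# superset_config.py (sketch): enable the cost-estimation feature flag
# referenced in the testing instructions above.
FEATURE_FLAGS = {
    "ESTIMATE_QUERY_COST": True,
}
```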

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API


@classmethod
def estimate_statement_cost(
    cls, statement: str, cursor: Any, engine: Engine
kzosabe (Author):

The only way to estimate cost in advance in BigQuery is to run the query with dry_run, and since this is not possible with only the cursor, I added engine as an argument.

Another option would be to handle bigquery.Client directly by configuring SQLAlchemy to pass the dry_run parameter when creating the connection, but this seems more complicated...

https://github.com/googleapis/python-bigquery-sqlalchemy#connection-string-parameters
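For context, the dry-run approach mentioned above looks roughly like this when using bigquery.Client directly (a sketch, not the PR's actual code; it assumes google-cloud-bigquery is installed and application default credentials are configured):

```python
def estimate_bytes_with_dry_run(sql: str) -> int:
    """Return the number of bytes a query would process, without running it."""
    # Imported inside the function so the sketch only requires the
    # BigQuery client library when actually called.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)  # dry run: no bytes billed
    return job.total_bytes_processed
```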

@villebro (Member) left a comment:

@kzosabe this looks great! One minor comment about making the code a bit more DRY, other than that looks good to go.

Comment on lines 218 to 240
@classmethod
def query_cost_formatter(
    cls, raw_cost: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    def format_bytes_str(raw_bytes: int) -> str:
        if not isinstance(raw_bytes, int):
            return str(raw_bytes)
        units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
        index = 0
        bytes = float(raw_bytes)
        while bytes >= 1024 and index < len(units) - 1:
            bytes /= 1024
            index += 1

        return "{:.1f}".format(bytes) + f" {units[index]}"

    return [
        {
            k: format_bytes_str(v) if k == "Total bytes processed" else str(v)
            for k, v in row.items()
        }
        for row in raw_cost
    ]
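For illustration, the inner helper behaves like this (restated standalone so the example is self-contained; outputs follow binary 1024 steps, rounded to one decimal):

```python
def format_bytes_str(raw_bytes: int) -> str:
    # Same logic as the formatter above: divide by 1024 until the value
    # fits the current unit, then print with one decimal place.
    if not isinstance(raw_bytes, int):
        return str(raw_bytes)
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    index = 0
    value = float(raw_bytes)
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    return f"{value:.1f} {units[index]}"

print(format_bytes_str(10))            # 10.0 B
print(format_bytes_str(1536))          # 1.5 KiB
print(format_bytes_str(10 * 1024**2))  # 10.0 MiB
```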
Member:

It seems this logic overlaps with the humanize functions in the query_cost_formatter methods in TrinoEngineSpec and PrestoEngineSpec. I wonder if we should move humanize to BaseEngineSpec so we could remove the duplication?

kzosabe (Author):

Thanks for the review!

so we could remove the duplication?

Going DRY may well be better, but there are a few things to consider.

The intent of this implementation was to be consistent with the official BigQuery UI, both in using KiB notation and in rounding to one decimal place.
[screenshot of the BigQuery console, 2022-02-14]

In particular, the current Presto and Trino implementations divide by 1000 instead of 1024, which is a problem: there would be a small difference between the predicted byte count shown by BigQuery and the one shown by Superset. I would like to avoid using the current humanize implementation as is, because this would confuse users.

kzosabe (Author):

I think there are several possible approaches:

a. Based on the humanize implementation, add methods that take prefixes and to_next_prefixes as parameters

This allows a common implementation with identical results, but is somewhat complex to implement.

b. Provide two methods, humanize_number and humanize_bytes

The byte-count display in Trino and Presto would change slightly.

c. Keep separate implementations (or share only between Trino and Presto)

Which do you think is best? Any of them is fine with me and I'll implement it.
However, it might be better to do this in a separate PR.
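Option (a) could be sketched roughly as follows (hypothetical; the parameter names prefixes and step are illustrative, not the existing Presto/Trino API). The same byte count diverges under decimal vs binary steps, which is the discrepancy discussed above:

```python
from typing import List

def humanize(value: float, suffix: str, prefixes: List[str], step: float) -> str:
    """Scale value down by `step` until it fits the next prefix."""
    index = 0
    while value >= step and index < len(prefixes) - 1:
        value /= step
        index += 1
    return f"{value:.1f} {prefixes[index]}{suffix}".strip()

raw = 354_000_000_000
print(humanize(raw, "B", ["", "K", "M", "G", "T"], 1000))      # 354.0 GB
print(humanize(raw, "B", ["", "Ki", "Mi", "Gi", "Ti"], 1024))  # 329.7 GiB
```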

Member:

Good point, being consistent with the BQ console definitely makes sense. I remember a discussion about this in the original PR where 1024 vs 1000 was debated: #8172 (comment). That said, it would feel odd to have different units for BQ vs Presto/Trino. @betodealmeida thoughts?

kzosabe (Author):

Does anyone have any suggestions?
I think the ability to estimate bytes is important and the exact format is relatively unimportant, so I'm not strongly opinionated about how to handle it.
I think it would be better to refactor Trino and Presto to KiB notation, but if there are reasons not to do so, I'll make the BigQuery implementation use KB notation.

Member:

I think we can change the humanize function in Presto (it's also duplicated in Trino) to return bytes in 1024 increments. The only reason I did 1000 is because it's also applied to row count and other parameters. This way it's consistent. Ideally we'd have a single function used by Presto, Trino, BigQuery and other engine specs.

(Note that it's also possible to overwrite the formatter function using QUERY_COST_FORMATTERS_BY_ENGINE in the config. We used that at Lyft to show the query cost in dollars, estimated run time, and carbon footprint.)
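A dollar-cost override along the lines described could look like this (a sketch: the $6.25/TiB rate is illustrative on-demand pricing, not authoritative, and the raw-cost key is assumed to match the BigQuery formatter in this PR):

```python
from typing import Any, Dict, List

USD_PER_TIB = 6.25  # illustrative on-demand rate; check current pricing

def bigquery_dollar_formatter(
    raw_cost: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    # Assumes each row carries the raw byte count under the same key
    # used by the formatter in this PR.
    result = []
    for row in raw_cost:
        raw_bytes = row.get("Total bytes processed", 0)
        dollars = raw_bytes / 1024**4 * USD_PER_TIB
        result.append({"Estimated cost": f"${dollars:.2f}"})
    return result

# In superset_config.py:
QUERY_COST_FORMATTERS_BY_ENGINE = {"bigquery": bigquery_dollar_formatter}

print(bigquery_dollar_formatter([{"Total bytes processed": 2 * 1024**4}]))
# [{'Estimated cost': '$12.50'}]
```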

codecov bot commented Feb 14, 2022

Codecov Report

Merging #18694 (9e36f0f) into master (225015f) will increase coverage by 0.11%.
The diff coverage is 91.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18694      +/-   ##
==========================================
+ Coverage   66.28%   66.39%   +0.11%     
==========================================
  Files        1605     1619      +14     
  Lines       62863    63012     +149     
  Branches     6341     6341              
==========================================
+ Hits        41666    41835     +169     
+ Misses      19545    19525      -20     
  Partials     1652     1652              
Flag Coverage Δ
hive 52.22% <24.56%> (+0.08%) ⬆️
mysql 81.48% <91.22%> (+0.15%) ⬆️
postgres 81.53% <91.22%> (+0.15%) ⬆️
presto 52.07% <24.56%> (+0.08%) ⬆️
python 81.96% <91.22%> (+0.15%) ⬆️
sqlite 81.22% <91.22%> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
superset/db_engine_specs/trino.py 79.41% <50.00%> (+8.61%) ⬆️
superset/db_engine_specs/base.py 88.98% <91.30%> (+0.23%) ⬆️
superset/db_engine_specs/bigquery.py 87.02% <96.00%> (+2.17%) ⬆️
superset/db_engine_specs/postgres.py 97.29% <100.00%> (+0.02%) ⬆️
superset/db_engine_specs/presto.py 89.27% <100.00%> (+0.14%) ⬆️
superset/key_value/commands/create.py 80.95% <0.00%> (-3.05%) ⬇️
superset/key_value/utils.py 100.00% <0.00%> (ø)
superset/dashboards/filter_state/api.py 100.00% <0.00%> (ø)
superset/explore/form_data/commands/parameters.py 100.00% <0.00%> (ø)
superset/utils/pandas_postprocessing.py
... and 30 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 225015f...9e36f0f. Read the comment docs.

Comment on lines 202 to 204
"Could not import libraries `google.cloud` or `google.oauth2`, "
"which are required to be installed in your environment in order "
"to estimate cost"
Member:

Curious, wouldn't these be necessarily installed if the user has a BigQuery database connected?

kzosabe (Author):

That's right, we can simply use import here. I'll fix this!


@@ -1425,6 +1430,33 @@ def cancel_query( # pylint: disable=unused-argument
def parse_sql(cls, sql: str) -> List[str]:
    return [str(s).strip(" ;") for s in sqlparse.parse(sql)]

@classmethod
def _humanize(cls, value: Any, suffix: str, category: Optional[str] = None) -> str:
kzosabe (Author):

If an input like (1000, "", "dollars") could produce an output like "$1,000", then more categories could be supported.
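A category-aware _humanize could be sketched like this (hypothetical; the category name and formatting are illustrative, not the PR's implementation):

```python
from typing import Optional

def humanize(value: float, suffix: str, category: Optional[str] = None) -> str:
    # Hypothetical "dollars" category: currency formatting instead of
    # metric prefixes, e.g. (1000, "", "dollars") -> "$1,000".
    if category == "dollars":
        return f"${value:,.0f}"
    prefixes = ["", "K", "M", "G", "T"]
    index = 0
    while value >= 1000 and index < len(prefixes) - 1:
        value /= 1000
        index += 1
    unit = f"{prefixes[index]} {suffix}".strip()
    return f"{value:.0f} {unit}"

print(humanize(1000, "", "dollars"))  # $1,000
print(humanize(904_000_000, "rows"))  # 904 M rows
```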

@@ -524,7 +524,7 @@ def test_query_cost_formatter(self):
     expected = [
         {
             "Output count": "904 M rows",
-            "Output size": "354 GB",
+            "Output size": "329 GiB",
kzosabe (Author):

These two values are identical except for the units. I took care not to change any of the other outputs.


kzosabe commented Mar 3, 2022

@betodealmeida Hello, I've addressed the points you raised. Could you please review?


kzosabe commented Mar 17, 2022

@betodealmeida @villebro I've been waiting for a review for about 3 weeks since the last update. I'd love some feedback on whether something is wrong with my PR, or whether the reviewers are simply busy.
If you're busy, that's no problem and I'll wait as long as needed. This is my first OSS contribution, so I'd appreciate any advice on anything I could improve.


villebro commented Jan 9, 2023

@kzosabe I'm really sorry for having dropped the ball on this PR 🙁 As you may have seen a later PR #21325 added this feature. I'd still love to get the humanize logic from this PR in, so if you're still open to working on this I promise to do my best to see this through without delay.

@rusackas

@kzosabe any interest in rebasing and following through on making this more about the humanize logic as @villebro suggested? If not, we should close this out. Thanks in either case, we appreciate the contribution!


rusackas commented Feb 6, 2024

I'm assuming that since @kzosabe hasn't responded, we missed the boat on that humanize logic. Thanks for the PR nonetheless, and we hope we see more of you around here. I'll close this, but feel free to re-open if you want to pick this back up!

@rusackas rusackas closed this Feb 6, 2024

4 participants