feat!: convert arrow data types to Python types in rows() #296

Open
tswast opened this issue Sep 3, 2021 · 2 comments
Labels
api: bigquerystorage Issues related to the googleapis/python-bigquery-storage API. next major: breaking change this is a change that we should wait to bundle into the next major version type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@tswast
Contributor

tswast commented Sep 3, 2021

I notice we just loop through all the rows in a record batch here:

yield dict(zip(self._column_names, row))

This might result in some odd types, such as Int64Scalar and TimestampScalar sneaking in.

I think we probably want to call to_pydict before looping over rows/columns in the BigQuery Storage API client.
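
A minimal sketch of that idea, not the client's actual code: `RecordBatch.to_pydict()` returns a mapping of column name to a list of plain Python values, so rows can be rebuilt by zipping the columns back together. The helper name `rows_from_columns` is hypothetical, and the sample dict below only stands in for the output of `record_batch.to_pydict()`.

```python
from typing import Any, Dict, Iterator, List


def rows_from_columns(columns: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per row from a column-oriented mapping.

    `columns` is assumed to look like the result of
    pyarrow.RecordBatch.to_pydict(): column name -> list of Python values.
    """
    names = list(columns)
    # zip(*...) walks all columns in lockstep, producing one tuple per row.
    for values in zip(*(columns[name] for name in names)):
        yield dict(zip(names, values))


# Stand-in for record_batch.to_pydict(); the values are already plain Python.
sample_columns = {"trip_seconds": [480, 600], "tips": [0.0, 1.5]}
sample_rows = list(rows_from_columns(sample_columns))
```

Because the conversion happens once per batch, no `Int64Scalar` or `TimestampScalar` objects would reach the row dicts.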

I'm filing this issue in python-bigquery for now, because I think it's worth first investigating whether this actually happens there.

@plamut
Contributor

plamut commented Sep 7, 2021

I did a quick test: I queried some public data and fetched the results as a dataframe, using the BQ Storage API under the hood.

The execution flow reached _ArrowStreamParser.to_arrow(), but did not hit the to_rows() method linked in the issue description. Seems like it's not an issue when using the BigQuery client.

The script I used for the test (stepping through it with a debugger):

from google.cloud import bigquery


PROJECT_ID = "bigquery-public-data"
DATASET_ID = "chicago_taxi_trips"
TABLE_ID = "taxi_trips"


table_name_full = f"{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}"

client = bigquery.Client()

sql = f"""
SELECT taxi_id, trip_start_timestamp, trip_seconds, tips
FROM `{table_name_full}`
LIMIT 5;
"""

query_job = client.query(sql)
result = query_job.result()

df = result.to_dataframe()

print(df)

FWIW, calling the to_rows() method by hand does indeed result in pyarrow values in the row dict, for example:

{
    'taxi_id': <pyarrow.StringScalar: 'bc709a696db40a46144faa530198a65442402f42513ee44c63cc9bd1e83cadb1ef6d04b8d09bdbef85cf4b72236deff5644913be7e313a694821ce08545564bb'>,
    'trip_start_timestamp': <pyarrow.TimestampScalar: datetime.datetime(2013, 5, 24, 14, 15, tzinfo=<UTC>)>,
    'trip_seconds': <pyarrow.Int64Scalar: 480>,
    'tips': <pyarrow.DoubleScalar: 0.0>
}

(and yes, calling record_batch.to_pydict() in to_rows() would help, if necessary)


I also checked DB API and there we do hit the to_rows() method, but we convert pyarrow values to Python values in a helper.

It would be interesting to benchmark this and see if arrow to Python conversion is more efficient if done with a to_pydict() call instead of individual as_py() calls. Something to consider if/when we change the BQ Storage client.

from google.cloud.bigquery import dbapi

...
print("Fetching data using DBAPI")
cursor = dbapi.connect(client).cursor()
cursor.execute(sql)

rows = cursor.fetchall()

print(rows)

P.S.: Changing the BQ Storage client will likely break the BigQuery DB API, so we need to coordinate releases and add conditional code based on the detected BQ Storage version, i.e. whether or not to call the as_py() method on values.
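
One way to soften that coordination, sketched here as an assumption rather than what either library does: instead of checking the BQ Storage version, the DB API side could duck-type each value and call `as_py()` only when it exists. `to_python_value` and `FakeScalar` are hypothetical names; `FakeScalar` only mimics the `as_py()` method of a pyarrow scalar so that the example runs without pyarrow installed.

```python
from typing import Any


def to_python_value(value: Any) -> Any:
    """Return a plain Python value whether or not `value` is an arrow scalar."""
    as_py = getattr(value, "as_py", None)
    if callable(as_py):
        # Arrow-style scalar: unwrap it.
        return as_py()
    # Already a plain Python value: pass it through unchanged.
    return value


class FakeScalar:
    """Hypothetical stand-in that mimics a pyarrow scalar's as_py() method."""

    def __init__(self, value: Any) -> None:
        self._value = value

    def as_py(self) -> Any:
        return self._value


# Row as yielded today (arrow scalars) vs. after the proposed change.
old_style_row = {"trip_seconds": FakeScalar(480), "tips": FakeScalar(0.0)}
new_style_row = {"trip_seconds": 480, "tips": 0.0}

converted_old = {k: to_python_value(v) for k, v in old_style_row.items()}
converted_new = {k: to_python_value(v) for k, v in new_style_row.items()}
```

Both rows come out identical, so the same helper would work before and after a BQ Storage release that converts values itself. The trade-off is an attribute lookup per value instead of a single version check.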

@tswast
Contributor Author

tswast commented Sep 7, 2021

Sweet. I had forgotten about this helper:

https://github.com/googleapis/python-bigquery/blob/c9068e4191dbe3632fe399a0b777e8bc54a183a6/google/cloud/bigquery/dbapi/_helpers.py#L468-L470

Thanks for investigating!

It does seem like we should change this in the BQ Storage client, but it's good to keep in mind that we'd need to coordinate.

I'll move this to that repo as a feature request.

@tswast tswast transferred this issue from googleapis/python-bigquery Sep 7, 2021
@tswast tswast added api: bigquerystorage Issues related to the googleapis/python-bigquery-storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Sep 7, 2021
@tswast tswast changed the title possible issue with inconsistent data types when using DB-API with BQ Storage API feat!: convert arrow data types to Python types in rows() Sep 7, 2021
@tswast tswast added the next major: breaking change this is a change that we should wait to bundle into the next major version label Sep 7, 2021