
feat: add write_engine parameter to read_FORMATNAME methods to control how data is written to BigQuery #371

Merged: 18 commits merged into main from b323176126-write_engine on Dec 26, 2024

Conversation

@tswast (Collaborator) commented Feb 6, 2024

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes internal issue 323176126
🦕
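
For readers skimming this thread, a minimal usage sketch of the new parameter (illustrative only; the GCS path is hypothetical, and the accepted values are taken from the test parametrization quoted later in this conversation):

```python
import bigframes.pandas as bpd

session = bpd.get_global_session()

# write_engine controls how the data ends up written to BigQuery.
# Values exercised by the system tests in this thread: "default",
# "bigquery_inline", "bigquery_load", "bigquery_streaming".
df = session.read_csv(
    "gs://my-bucket/data.csv",  # hypothetical path
    write_engine="bigquery_streaming",
)
```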

@product-auto-label bot added labels: size: m, api: bigquery (Issues related to the googleapis/python-bigquery-dataframes API), size: l, size: xl; and removed labels: size: m, size: l (Feb 6, 2024)
@tswast (Collaborator, Author) commented Feb 9, 2024

Blocked by googleapis/python-bigquery#1815

@tswast marked this pull request as ready for review on February 29, 2024
@tswast requested review from a team as code owners on February 29, 2024
@tswast requested a review from shobsi on February 29, 2024
@tswast (Collaborator, Author) commented Mar 1, 2024

Test failure is a real one:

E TypeError: Object of type bool_ is not JSON serializable
=================================== FAILURES ===================================
_____________ test_read_csv_gcs_default_engine[bigquery_streaming] _____________
[gw19] linux -- Python 3.11.6 /tmpfs/src/github/python-bigquery-dataframes/.nox/system-3-11/bin/python

session = <bigframes.session.Session object at 0x7f0d0f897cd0>
scalars_dfs = ( bool_col bytes_col
rowindex ...... 2038-01-19 03:14:17.999999+00:00
8 False ...

[9 rows x 13 columns])
gcs_folder = 'gs://bigframes-dev-testing/bigframes_tests_system_20240229220731_1845bf/'
write_engine = 'bigquery_streaming'

@skip_legacy_pandas
@pytest.mark.parametrize(
    ("write_engine",),
    (
        ("default",),
        ("bigquery_inline",),
        ("bigquery_load",),
        ("bigquery_streaming",),
    ),
)
def test_read_csv_gcs_default_engine(session, scalars_dfs, gcs_folder, write_engine):
    scalars_df, _ = scalars_dfs
    if scalars_df.index.name is not None:
        path = gcs_folder + "test_read_csv_gcs_default_engine_w_index*.csv"
    else:
        path = gcs_folder + "test_read_csv_gcs_default_engine_wo_index*.csv"
    read_path = path.replace("*", FIRST_FILE)
    scalars_df.to_csv(path, index=False)
    dtype = scalars_df.dtypes.to_dict()
    dtype.pop("geography_col")
>   df = session.read_csv(
        read_path,
        # Convert default pandas dtypes to match BigQuery DataFrames dtypes.
        dtype=dtype,
        write_engine=write_engine,
    )

tests/system/small/test_session.py:435:


bigframes/session/__init__.py:1162: in read_csv
return self._read_pandas(
bigframes/session/__init__.py:933: in _read_pandas
return self._read_pandas_bigquery_table(
bigframes/session/__init__.py:988: in _read_pandas_bigquery_table
table_expression = bigframes_io.pandas_to_bigquery_streaming(
bigframes/session/_io/bigquery.py:294: in pandas_to_bigquery_streaming
for errors in bqclient.insert_rows_from_dataframe(
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3662: in insert_rows_from_dataframe
result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3605: in insert_rows
return self.insert_rows_json(table, json_rows, **kwargs)
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3801: in insert_rows_json
response = self._call_api(
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:827: in _call_api
return call()
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:293: in retry_wrapped_func
return retry_target(
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:153: in retry_target
_retry_error_helper(
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_base.py:212: in _retry_error_helper
raise final_exc from source_exc
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:144: in retry_target
result = target()
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/_http/__init__.py:479: in api_request
data = json.dumps(data)
/usr/local/lib/python3.11/json/__init__.py:231: in dumps
return _default_encoder.encode(obj)
/usr/local/lib/python3.11/json/encoder.py:200: in encode
chunks = self.iterencode(o, _one_shot=True)
/usr/local/lib/python3.11/json/encoder.py:258: in iterencode
return _iterencode(o, 0)


self = <json.encoder.JSONEncoder object at 0x7f0d3d5e8b90>, o = True

def default(self, o):
    """Implement this method in a subclass such that it returns
    a serializable object for ``o``, or calls the base implementation
    (to raise a ``TypeError``).

    For example, to support arbitrary iterators, you could
    implement default like this::

        def default(self, o):
            try:
                iterable = iter(o)
            except TypeError:
                pass
            else:
                return list(iterable)
            # Let the base class default method raise the TypeError
            return JSONEncoder.default(self, o)

    """
>   raise TypeError(f'Object of type {o.__class__.__name__} '
                    f'is not JSON serializable')

E TypeError: Object of type bool_ is not JSON serializable

/usr/local/lib/python3.11/json/encoder.py:180: TypeError
----------------------------- Captured stdout call -----------------------------
Query job 2b56fdd1-5dcc-4afd-8521-69b517d5afae is DONE. 1.1 kB processed.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:2b56fdd1-5dcc-4afd-8521-69b517d5afae&page=queryresults
Query job 7d127271-a535-48d0-b092-df329771fb89 is RUNNING.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:7d127271-a535-48d0-b092-df329771fb89&page=queryresults
Query job 7d127271-a535-48d0-b092-df329771fb89 is DONE. 0 Bytes processed.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:7d127271-a535-48d0-b092-df329771fb89&page=queryresults
=============================== warnings summary ===============================

This will likely require a fix upstream in google-cloud-bigquery, but in the meantime I can make sure to use a vendored version of insert_rows_from_dataframe that can serialize a numpy bool_ value, similar to how there's a special case for NaN already.

Edit: I already fixed this in googleapis/python-bigquery#1816, waiting on version 3.18.0.
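
To illustrate the failure mode (a sketch of the kind of conversion needed, not the actual patch in googleapis/python-bigquery#1816): the standard-library JSON encoder rejects numpy scalar types, and converting them to built-in Python values first avoids the error.

```python
import json

import numpy as np

row = {"bool_col": np.bool_(True)}

# json.dumps(row) raises:
#   TypeError: Object of type bool_ is not JSON serializable

# Converting numpy scalars to built-in Python types first sidesteps the
# error; .item() covers np.bool_, np.int64, np.float64, and friends.
serializable = {
    key: value.item() if isinstance(value, np.generic) else value
    for key, value in row.items()
}
print(json.dumps(serializable))  # {"bool_col": true}
```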

@tswast added the do not merge label (indicates a pull request not ready for merge, due to either quality or timing) on Mar 6, 2024
@tswast (Collaborator, Author) commented Mar 6, 2024

Marking as do not merge for now. Thanks for your feedback so far. I will wait until we implement go/pandas-gbq-and-bigframes-redundancy before merging this (likely mid-April).

@tswast (Collaborator, Author) commented Sep 20, 2024

> Marking as do not merge for now. Thanks for your feedback so far. I will wait until we implement go/pandas-gbq-and-bigframes-redundancy before merging this (likely mid-April).

I've mailed googleapis/python-bigquery-pandas#814 as a first step of this project. I plan to follow up that PR with one that copies the bigquery_streaming write engine from this PR to pandas-gbq, so that the pandas -> BigQuery logic can be consolidated in the pandas-gbq package.

@tswast force-pushed the b323176126-write_engine branch from eebdd0e to 3fca092 on December 9, 2024
@product-auto-label bot added the size: l label and removed the size: xl label (Dec 9, 2024)
@tswast removed the do not merge label on Dec 9, 2024
@tswast requested a review from shobsi on December 9, 2024
@tswast (Collaborator, Author) commented Dec 9, 2024

@shobsi This is ready for another look. Not quite merge-ready yet, as I need to update pandas-gbq's schema detection to account for ArrowDtype(pa.timestamp(...)) (googleapis/python-bigquery-pandas#824).

Edit: googleapis/python-bigquery-pandas#832 mailed to fix DATETIME system test.
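
To make the dtype in question concrete (a sketch, not pandas-gbq's actual implementation; `bigquery_type` is a hypothetical helper): tz-aware Arrow timestamps should map to BigQuery TIMESTAMP, while tz-naive ones map to DATETIME, which is what the DATETIME system test exercises.

```python
import pandas as pd
import pyarrow as pa

# tz-aware Arrow timestamp: should map to BigQuery TIMESTAMP.
ts_aware = pd.Series(
    [pd.Timestamp("2024-12-09 22:25:00", tz="UTC")],
    dtype=pd.ArrowDtype(pa.timestamp("us", tz="UTC")),
)

# tz-naive Arrow timestamp: should map to BigQuery DATETIME.
ts_naive = pd.Series(
    [pd.Timestamp("2024-12-09 22:25:00")],
    dtype=pd.ArrowDtype(pa.timestamp("us")),
)

def bigquery_type(dtype: pd.ArrowDtype) -> str:
    """Hypothetical helper: map an Arrow timestamp dtype to a BigQuery type."""
    arrow_type = dtype.pyarrow_dtype
    if pa.types.is_timestamp(arrow_type):
        return "TIMESTAMP" if arrow_type.tz is not None else "DATETIME"
    raise NotImplementedError(arrow_type)

print(bigquery_type(ts_aware.dtype))  # TIMESTAMP
print(bigquery_type(ts_naive.dtype))  # DATETIME
```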

Review thread on bigframes/session/__init__.py (outdated, resolved)
@tswast added the owlbot:run label (triggers the Owlbot post processor) on Dec 11, 2024
@gcf-owl-bot removed the owlbot:run label on Dec 11, 2024
@tswast (Collaborator, Author) commented Dec 11, 2024

Looks like there are a few more tests for me to clean up. These do look related to my change:

FAILED tests/system/small/test_session.py::test_read_pandas_w_unsupported_mixed_dtype
FAILED tests/system/small/test_series.py::test_series_explode_null[empty_array]
FAILED tests/system/small/test_series.py::test_series_explode_null[null_and_empty_array]

Review thread on bigframes/session/__init__.py (resolved)
Review thread on tests/unit/session/test_session.py (resolved)
Review thread on third_party/bigframes_vendored/constants.py (outdated, resolved)
@tswast (Collaborator, Author) left a comment

Updated. Thanks for the review!

Looks like I still need a fix in pandas-gbq for empty arrays.

Review thread on tests/unit/session/test_session.py (resolved)
Review thread on third_party/bigframes_vendored/constants.py (outdated, resolved)
Review thread on bigframes/session/__init__.py (resolved)
@tswast (Collaborator, Author) commented Dec 12, 2024

googleapis/python-bigquery-pandas#838 to fix the remaining system test. Tested locally and it works.
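
For context, the shape of data behind that remaining failure (illustrative only; googleapis/python-bigquery-pandas#838 is the actual fix): a list-typed column whose rows include empty arrays and nulls, as the failing test_series_explode_null cases above exercise.

```python
import pandas as pd
import pyarrow as pa

# A list-typed column containing empty arrays and nulls: the shape of
# data that tripped pandas-gbq in the empty_array / null_and_empty_array
# test cases listed earlier in this thread.
s = pd.Series(
    [[], None, [1, 2]],
    dtype=pd.ArrowDtype(pa.list_(pa.int64())),
)
print(s)
```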

@shobsi (Contributor) left a comment

LGTM to merge after the pandas-gbq fix and update

Review thread on bigframes/session/__init__.py (resolved)
Review thread on bigframes/session/validation.py (outdated, resolved)
Comment on lines +410 to +413
# BigFrames doesn't distinguish between string and large_string because the
# largest string (2 GB) is already larger than the largest BigQuery row.
if pa.types.is_string(arrow_dtype) or pa.types.is_large_string(arrow_dtype):
return STRING_DTYPE
@tswast (Collaborator, Author) left a comment

@GarrettWu Would this break the JSON support you're adding? It was needed for some failing CSV tests.
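
For reference, a quick demonstration of the pyarrow type predicates used in the quoted snippet (illustrative only; not part of the PR):

```python
import pyarrow as pa

# Both string variants map to the same BigQuery STRING type in the
# snippet above, because the predicates are checked together.
print(pa.types.is_string(pa.string()))              # True
print(pa.types.is_large_string(pa.large_string()))  # True
print(pa.types.is_string(pa.large_string()))        # False: the predicates are distinct
```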

@tswast merged commit ed47ef1 into main on Dec 26, 2024
22 checks passed
@tswast deleted the b323176126-write_engine branch on December 26, 2024