
feat: add write_engine parameter to read_FORMATNAME methods to control how data is written to BigQuery #371

Merged: 18 commits merged into main from b323176126-write_engine on Dec 26, 2024

Conversation

@tswast (Collaborator) commented Feb 6, 2024

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes internal issue 323176126
🦕
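
For readers skimming this thread, a minimal usage sketch of the new parameter (illustrative only; the GCS path is hypothetical, and the accepted values are taken from the test parametrization quoted later in this conversation):

```python
import bigframes.pandas as bpd

session = bpd.get_global_session()

# write_engine controls how the data ends up written to BigQuery.
# Values exercised by the system tests in this thread: "default",
# "bigquery_inline", "bigquery_load", "bigquery_streaming".
df = session.read_csv(
    "gs://my-bucket/data.csv",  # hypothetical path
    write_engine="bigquery_streaming",
)
```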

@product-auto-label bot added labels: size: m, api: bigquery (Issues related to the googleapis/python-bigquery-dataframes API), size: l, size: xl; and removed labels: size: m, size: l (Feb 6, 2024)
@tswast (Collaborator, Author) commented Feb 9, 2024

Blocked by googleapis/python-bigquery#1815

@tswast marked this pull request as ready for review on February 29, 2024
@tswast requested review from a team as code owners on February 29, 2024
@tswast requested a review from shobsi on February 29, 2024
@tswast (Collaborator, Author) commented Mar 1, 2024

Test failure is a real one:

E TypeError: Object of type bool_ is not JSON serializable
=================================== FAILURES ===================================
_____________ test_read_csv_gcs_default_engine[bigquery_streaming] _____________
[gw19] linux -- Python 3.11.6 /tmpfs/src/github/python-bigquery-dataframes/.nox/system-3-11/bin/python

session = <bigframes.session.Session object at 0x7f0d0f897cd0>
scalars_dfs = ( bool_col bytes_col
rowindex ...... 2038-01-19 03:14:17.999999+00:00
8 False ...

[9 rows x 13 columns])
gcs_folder = 'gs://bigframes-dev-testing/bigframes_tests_system_20240229220731_1845bf/'
write_engine = 'bigquery_streaming'

@skip_legacy_pandas
@pytest.mark.parametrize(
    ("write_engine",),
    (
        ("default",),
        ("bigquery_inline",),
        ("bigquery_load",),
        ("bigquery_streaming",),
    ),
)
def test_read_csv_gcs_default_engine(session, scalars_dfs, gcs_folder, write_engine):
    scalars_df, _ = scalars_dfs
    if scalars_df.index.name is not None:
        path = gcs_folder + "test_read_csv_gcs_default_engine_w_index*.csv"
    else:
        path = gcs_folder + "test_read_csv_gcs_default_engine_wo_index*.csv"
    read_path = path.replace("*", FIRST_FILE)
    scalars_df.to_csv(path, index=False)
    dtype = scalars_df.dtypes.to_dict()
    dtype.pop("geography_col")
>   df = session.read_csv(
        read_path,
        # Convert default pandas dtypes to match BigQuery DataFrames dtypes.
        dtype=dtype,
        write_engine=write_engine,
    )

tests/system/small/test_session.py:435:


bigframes/session/__init__.py:1162: in read_csv
return self._read_pandas(
bigframes/session/__init__.py:933: in _read_pandas
return self._read_pandas_bigquery_table(
bigframes/session/__init__.py:988: in _read_pandas_bigquery_table
table_expression = bigframes_io.pandas_to_bigquery_streaming(
bigframes/session/_io/bigquery.py:294: in pandas_to_bigquery_streaming
for errors in bqclient.insert_rows_from_dataframe(
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3662: in insert_rows_from_dataframe
result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3605: in insert_rows
return self.insert_rows_json(table, json_rows, **kwargs)
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:3801: in insert_rows_json
response = self._call_api(
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/bigquery/client.py:827: in _call_api
return call()
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:293: in retry_wrapped_func
return retry_target(
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:153: in retry_target
_retry_error_helper(
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_base.py:212: in _retry_error_helper
raise final_exc from source_exc
.nox/system-3-11/lib/python3.11/site-packages/google/api_core/retry/retry_unary.py:144: in retry_target
result = target()
.nox/system-3-11/lib/python3.11/site-packages/google/cloud/_http/__init__.py:479: in api_request
data = json.dumps(data)
/usr/local/lib/python3.11/json/__init__.py:231: in dumps
return _default_encoder.encode(obj)
/usr/local/lib/python3.11/json/encoder.py:200: in encode
chunks = self.iterencode(o, _one_shot=True)
/usr/local/lib/python3.11/json/encoder.py:258: in iterencode
return _iterencode(o, 0)


self = <json.encoder.JSONEncoder object at 0x7f0d3d5e8b90>, o = True

def default(self, o):
    """Implement this method in a subclass such that it returns
    a serializable object for ``o``, or calls the base implementation
    (to raise a ``TypeError``).

    For example, to support arbitrary iterators, you could
    implement default like this::

        def default(self, o):
            try:
                iterable = iter(o)
            except TypeError:
                pass
            else:
                return list(iterable)
            # Let the base class default method raise the TypeError
            return JSONEncoder.default(self, o)

    """
>   raise TypeError(f'Object of type {o.__class__.__name__} '
                    f'is not JSON serializable')

E TypeError: Object of type bool_ is not JSON serializable

/usr/local/lib/python3.11/json/encoder.py:180: TypeError
----------------------------- Captured stdout call -----------------------------
Query job 2b56fdd1-5dcc-4afd-8521-69b517d5afae is DONE. 1.1 kB processed.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:2b56fdd1-5dcc-4afd-8521-69b517d5afae&page=queryresults
Query job 7d127271-a535-48d0-b092-df329771fb89 is RUNNING.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:7d127271-a535-48d0-b092-df329771fb89&page=queryresults
Query job 7d127271-a535-48d0-b092-df329771fb89 is DONE. 0 Bytes processed.
https://console.cloud.google.com/bigquery?project=bigframes-dev&j=bq:US:7d127271-a535-48d0-b092-df329771fb89&page=queryresults
=============================== warnings summary ===============================

This will likely require a fix upstream in google-cloud-bigquery, but in the meantime I can make sure to use a vendored version of insert_rows_from_dataframe that can serialize a numpy bool_ value, similar to how there's a special case for NaN already.

Edit: I already fixed this in googleapis/python-bigquery#1816, waiting on version 3.18.0.
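
To illustrate the failure mode (a sketch of the kind of conversion needed, not the actual patch in googleapis/python-bigquery#1816): the standard-library JSON encoder rejects numpy scalar types, and converting them to built-in Python values first avoids the error.

```python
import json

import numpy as np

row = {"bool_col": np.bool_(True)}

# json.dumps(row) raises:
#   TypeError: Object of type bool_ is not JSON serializable

# Converting numpy scalars to built-in Python types first sidesteps the
# error; .item() covers np.bool_, np.int64, np.float64, and friends.
serializable = {
    key: value.item() if isinstance(value, np.generic) else value
    for key, value in row.items()
}
print(json.dumps(serializable))  # {"bool_col": true}
```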

@tswast added the do not merge label (indicates a pull request not ready for merge, due to either quality or timing) on Mar 6, 2024
@tswast (Collaborator, Author) commented Mar 6, 2024

Marking as do not merge for now. Thanks for your feedback so far. I will wait until we implement go/pandas-gbq-and-bigframes-redundancy before merging this (likely mid-April).

@tswast (Collaborator, Author) commented Sep 20, 2024

> Marking as do not merge for now. Thanks for your feedback so far. I will wait until we implement go/pandas-gbq-and-bigframes-redundancy before merging this (likely mid-April).

I've mailed googleapis/python-bigquery-pandas#814 as a first step of this project. I plan to follow up that PR with one that copies the bigquery_streaming write engine from this PR to pandas-gbq, so that the pandas -> BigQuery logic can be consolidated in the pandas-gbq package.

@tswast force-pushed the b323176126-write_engine branch from eebdd0e to 3fca092 on December 9, 2024
@product-auto-label bot added the size: l label and removed the size: xl label (Dec 9, 2024)
@tswast removed the do not merge label on Dec 9, 2024
@tswast requested a review from shobsi on December 9, 2024
@tswast (Collaborator, Author) commented Dec 9, 2024

@shobsi This is ready for another look. Not quite merge-ready yet, as I need to update pandas-gbq's schema detection to account for ArrowDtype(pa.timestamp(...)) (googleapis/python-bigquery-pandas#824).

Edit: googleapis/python-bigquery-pandas#832 mailed to fix DATETIME system test.
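
To make the dtype in question concrete (a sketch, not pandas-gbq's actual implementation; `bigquery_type` is a hypothetical helper): tz-aware Arrow timestamps should map to BigQuery TIMESTAMP, while tz-naive ones map to DATETIME, which is what the DATETIME system test exercises.

```python
import pandas as pd
import pyarrow as pa

# tz-aware Arrow timestamp: should map to BigQuery TIMESTAMP.
ts_aware = pd.Series(
    [pd.Timestamp("2024-12-09 22:25:00", tz="UTC")],
    dtype=pd.ArrowDtype(pa.timestamp("us", tz="UTC")),
)

# tz-naive Arrow timestamp: should map to BigQuery DATETIME.
ts_naive = pd.Series(
    [pd.Timestamp("2024-12-09 22:25:00")],
    dtype=pd.ArrowDtype(pa.timestamp("us")),
)

def bigquery_type(dtype: pd.ArrowDtype) -> str:
    """Hypothetical helper: map an Arrow timestamp dtype to a BigQuery type."""
    arrow_type = dtype.pyarrow_dtype
    if pa.types.is_timestamp(arrow_type):
        return "TIMESTAMP" if arrow_type.tz is not None else "DATETIME"
    raise NotImplementedError(arrow_type)

print(bigquery_type(ts_aware.dtype))  # TIMESTAMP
print(bigquery_type(ts_naive.dtype))  # DATETIME
```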

Review thread on bigframes/session/__init__.py (outdated, resolved)
@tswast added the owlbot:run label (triggers the Owlbot post processor) on Dec 11, 2024
@gcf-owl-bot removed the owlbot:run label on Dec 11, 2024
@tswast (Collaborator, Author) commented Dec 11, 2024

Looks like there are a few more tests for me to clean up. These do look related to my change:

FAILED tests/system/small/test_session.py::test_read_pandas_w_unsupported_mixed_dtype
FAILED tests/system/small/test_series.py::test_series_explode_null[empty_array]
FAILED tests/system/small/test_series.py::test_series_explode_null[null_and_empty_array]

Review thread on bigframes/session/__init__.py (resolved)
Review thread on tests/unit/session/test_session.py (resolved)
Review thread on third_party/bigframes_vendored/constants.py (outdated, resolved)
@tswast (Collaborator, Author) left a comment

Updated. Thanks for the review!

Looks like I still need a fix in pandas-gbq for empty arrays.

Review thread on tests/unit/session/test_session.py (resolved)
Review thread on third_party/bigframes_vendored/constants.py (outdated, resolved)
Review thread on bigframes/session/__init__.py (resolved)
@tswast (Collaborator, Author) commented Dec 12, 2024

googleapis/python-bigquery-pandas#838 to fix the remaining system test. Tested locally and it works.
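
For context, the shape of data behind that remaining failure (illustrative only; googleapis/python-bigquery-pandas#838 is the actual fix): a list-typed column whose rows include empty arrays and nulls, as the failing test_series_explode_null cases above exercise.

```python
import pandas as pd
import pyarrow as pa

# A list-typed column containing empty arrays and nulls: the shape of
# data that tripped pandas-gbq in the empty_array / null_and_empty_array
# test cases listed earlier in this thread.
s = pd.Series(
    [[], None, [1, 2]],
    dtype=pd.ArrowDtype(pa.list_(pa.int64())),
)
print(s)
```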

@shobsi (Contributor) left a comment

LGTM to merge after the pandas-gbq fix and update

Review thread on bigframes/session/__init__.py (resolved)
Review thread on bigframes/session/validation.py (outdated, resolved)
Comment on lines +410 to +413
# BigFrames doesn't distinguish between string and large_string because the
# largest string (2 GB) is already larger than the largest BigQuery row.
if pa.types.is_string(arrow_dtype) or pa.types.is_large_string(arrow_dtype):
return STRING_DTYPE
@tswast (Collaborator, Author) left a comment

@GarrettWu Would this break the JSON support you're adding? It was needed for some failing CSV tests.
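
For reference, a quick demonstration of the pyarrow type predicates used in the quoted snippet (illustrative only; not part of the PR):

```python
import pyarrow as pa

# Both string variants map to the same BigQuery STRING type in the
# snippet above, because the predicates are checked together.
print(pa.types.is_string(pa.string()))              # True
print(pa.types.is_large_string(pa.large_string()))  # True
print(pa.types.is_string(pa.large_string()))        # False: the predicates are distinct
```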

@tswast merged commit ed47ef1 into main on Dec 26, 2024
22 checks passed
@tswast deleted the b323176126-write_engine branch on December 26, 2024