Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: json in struct destination type #1187

Merged
merged 16 commits into from
Dec 12, 2024
Merged

fix: json in struct destination type #1187

merged 16 commits into from
Dec 12, 2024

Conversation

GarrettWu
Copy link
Contributor

@GarrettWu GarrettWu commented Dec 4, 2024

Patches the json output as BQ JSON type instead of STRING. Which avoids JSON destination table error in b/381148539 and unblocks Multimodal integrations.

Our current implementation convert JSON to STRING completely when reading into BigFrames. Here it uses pd.ArrowDtype(pa.large_string()) as BigFrames Dtype and pa.large_string() as PA dtype, and pass the information though in BigFrames to determine the output type and parse STR back to JSON at the end.

It is a workaround of the workaround of JSON as STR. We shall rework the JSON support after pyarrow brings in json type.

@GarrettWu GarrettWu self-assigned this Dec 4, 2024
@GarrettWu GarrettWu requested review from a team as code owners December 4, 2024 01:43
@product-auto-label product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Dec 4, 2024
@GarrettWu GarrettWu marked this pull request as draft December 4, 2024 20:47
@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Dec 5, 2024
@GarrettWu GarrettWu marked this pull request as ready for review December 6, 2024 18:46
@tswast
Copy link
Collaborator

tswast commented Dec 10, 2024

Question: Does this change anything with regards to I/O? For example, I notice with read_csv with the pandas-y code path, the DataFrame from pandas puts string columns as the large_string() in my testing. Hopefully such columns don't end up as JSON in BigQuery now.

@GarrettWu
Copy link
Contributor Author

GarrettWu commented Dec 10, 2024

Question: Does this change anything with regards to I/O? For example, I notice with read_csv with the pandas-y code path, the DataFrame from pandas puts string columns as the large_string() in my testing. Hopefully such columns don't end up as JSON in BigQuery now.

How did you test? I tried pandas df with pa.large_string dtype is interpreted as normal string. It shouldn't affect input as large_string, since we convert it to normal string in our I/O.

screen/763yZzyywmEVWbR

destination table

screen/4u6scyqtczXSAK8

@tswast
Copy link
Collaborator

tswast commented Dec 11, 2024

How did you test?

It was part of my debugger step-through of #371 but it's possible I accidentally changed something in the I/O path there.

@GarrettWu GarrettWu enabled auto-merge (squash) December 12, 2024 22:24
@GarrettWu GarrettWu merged commit 200c9bb into main Dec 12, 2024
22 checks passed
@GarrettWu GarrettWu deleted the garrettwu-json branch December 12, 2024 22:25
@@ -107,7 +107,7 @@ def from_table(
raise ValueError("must set at most one of 'offests', 'primary_key'")
if any(i.field_type == "JSON" for i in table.schema if i.name in schema.names):
warnings.warn(
"Interpreting JSON column(s) as StringDtype. This behavior may change in future versions.",
"Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we just say "... as pyarrow.large_string"? StringDtype here is not relevant anymore, no?

@@ -2564,13 +2564,13 @@ def _get_rows_as_json_values(self) -> Block:
),
T1 AS (
SELECT *,
JSON_OBJECT(
TO_JSON_STRING(JSON_OBJECT(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need an explicit TO_JSON_STRING? Doesn't the change in ibis_types.py::ibis_dtype_to_bigframes_dtype take care of this post read?

    if isinstance(ibis_dtype, ibis_dtypes.JSON):
        ...
        return bigframes.dtypes.JSON_DTYPE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants