Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support bigframes.pandas.to_datetime for scalars, iterables and series. #372

Merged
merged 34 commits into from
Feb 12, 2024

Conversation

Genesis929
Copy link
Collaborator

@Genesis929 Genesis929 commented Feb 6, 2024

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

https://screenshot.googleplex.com/34gywaqpL9yYGSm
https://screenshot.googleplex.com/9QCcdhPgiuwbRk7
https://screenshot.googleplex.com/3sjTQ3xwj3MWp9j

For string/datetime dtypes, currently are limited to utc=True as we have limited timezone support.
Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Feb 6, 2024
@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: l Pull request size is large. labels Feb 6, 2024
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Feb 6, 2024
@Genesis929 Genesis929 changed the title feat: Support pd.to_datetime for scalars, iterables and series. feat: Support bigframes.pandas.to_datetime for scalars, iterables and series. Feb 6, 2024
@Genesis929 Genesis929 added the owlbot:run Add this label to trigger the Owlbot post processor. label Feb 8, 2024
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Feb 8, 2024
@Genesis929 Genesis929 marked this pull request as ready for review February 8, 2024 01:28
@Genesis929 Genesis929 requested review from a team as code owners February 8, 2024 01:28
@Genesis929 Genesis929 requested a review from shobsi February 8, 2024 01:28
x.re_search(TIMEZONE_POS_REGEX),
(
(
x.substr(0, x.length() - 6).to_timestamp(op.format)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 is a bit of a "magic" number here. Please make some constants and comments explaining the intention.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

)
.cast(ibis_dtypes.Timestamp(timezone="UTC"))
.cast(ibis_dtypes.int64)
- x.substr(x.length() - 5, 2).cast(ibis_dtypes.int64)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, this is more "magic" with regards to which substrings we're look at. Perhaps some helper functions would be useful too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise for the rest of this function. Please refactor so that it's easier to validate the correctness of each smaller part.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

elif x.type() == ibis_dtypes.Timestamp(timezone="UTC"):
return x
elif x.type() != ibis_dtypes.timestamp:
unit = op.unit if op.unit is not None else "ns"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's comment why we are making "ns" the default.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# Note: Due to an issue where casting directly to a non-UTC
# timezone does not work, we first cast to UTC. This seems
# to bypass a potential bug in Ibis's cast function, allowing
# for subsequent casting to a non-UTC timezone. Further
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would it mean to cast to non-UTC timezone type in BigQuery? It only supports UTC at the data-type level, even though other timezones are supported for parsing and formatting.

Please raise an error if someone tries to cast to a non-UTC timezone.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about the confusion, will update the comment, this means without timezone. Basically this is for utc=True vs utc=False. Because of some unknown issue related to data type, potentially because of ibis, it's impossible to cast to the proper type when utc=False, unless cast it to utc timezone first.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on tests it would appear although the result of int64 to_timestamp is in utc timezone, the cast function think the datatype is actually without timezone, and skip the cast, this is to fix the issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment updated.

f"to datetime is not implemented. {constants.FEEDBACK_LINK}"
)

if ~isinstance(arg, bigframes.series.Series):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use bitwise negation to negate a boolean.

Suggested change
if ~isinstance(arg, bigframes.series.Series):
if not isinstance(arg, bigframes.series.Series):

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, let's add a comment that this is intended to support pandas Series (and Index maybe?).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# TODO: Currently, data upload is performed using pandas DataFrames
# combined with the `read_pandas` method due to the BigFrames DataFrame
# constructor's limitations in handling various data types. Plan to update
# the upload process to utilize the BigPandas DataFrame constructor directly
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# the upload process to utilize the BigPandas DataFrame constructor directly
# the upload process to utilize the BigQuery DataFrame constructor directly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

bf_result = (
bpd.to_datetime(arg, utc=utc, unit=unit, format=format)
.to_pandas()
.astype("datetime64[ns, UTC]" if utc else "datetime64[ns]")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fascinating. So utc=False will use DATETIME type in BigQuery?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for utc=False, it will be later cast to DATETIME type.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the example sql: SELECT
CAST(t0.0 AS DATETIME) AS Cast_0_ timestamp
FROM ...

@Genesis929 Genesis929 requested a review from tswast February 9, 2024 18:54
Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@Genesis929 Genesis929 merged commit ffb0d15 into main Feb 12, 2024
15 checks passed
@Genesis929 Genesis929 deleted the huanc-to_datetime branch February 12, 2024 18:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants