fix: converting to dataframe with out of bounds timestamps #209

Merged (2 commits) on Aug 15, 2020

Conversation

plamut (Contributor) commented on Aug 1, 2020:

Fixes #168.

This PR fixes a failure when converting query results to pandas with pyarrow when the data contains timestamps that fall outside the bounds representable at nanosecond precision.

The fix requires pyarrow >= 1.0.0 and therefore only works on Python 3.
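
For context, here is a minimal sketch of the failure mode and of the pyarrow option the fix builds on (assuming pyarrow >= 1.0.0; the example data is illustrative, not the PR's actual code path):

```python
import datetime

import pyarrow as pa

# A TIMESTAMP value outside the nanosecond-precision range
# (pandas' datetime64[ns] covers roughly the years 1677-2262).
arr = pa.array([datetime.datetime(9999, 12, 31)], type=pa.timestamp("us"))
table = pa.Table.from_arrays([arr], names=["ts"])

# table.to_pandas() would raise pyarrow.lib.ArrowInvalid here, because the
# value cannot be cast to a nanosecond-precision pandas Timestamp.

# With pyarrow >= 1.0.0, timestamp_as_object=True keeps such values as plain
# datetime.datetime objects instead of casting them to datetime64[ns].
df = table.to_pandas(timestamp_as_object=True)
print(df["ts"].dtype)  # object
```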

PR checklist

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

plamut requested a review from tswast on Aug 1, 2020.
plamut (Contributor, Author) commented on Aug 1, 2020:

@tswast There is an inconsistency here: the existing date_as_object option is exposed to users, while the timestamp_as_object option is hidden. Let me know if you want to unify these two approaches to a similar problem.

tswast (Contributor) left a comment:

Thanks!

Regarding date_as_object, that case is a little different, because it doesn't throw an error for dates; they just come back as strings if the option isn't set.

If we do provide timestamp_as_object, I think it needs three states (see the sketch after this list):

  • (default) the behavior in this fix
  • (explicitly false) let the error happen, since they want to use native pandas Timestamp (maybe for performance reasons)
  • (explicitly true) always convert to datetime objects
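
To make the three states concrete, here is a hypothetical sketch (the wrapper and its fallback logic are illustrative, not the library's actual implementation):

```python
import pyarrow as pa


def to_dataframe(arrow_table: pa.Table, timestamp_as_object=None):
    """Illustrative three-state wrapper over pyarrow's Table.to_pandas()."""
    if timestamp_as_object is None:
        # Default (None): try the fast native-Timestamp path and fall back
        # to datetime objects only when a value is out of bounds.
        try:
            return arrow_table.to_pandas()
        except pa.lib.ArrowInvalid:
            return arrow_table.to_pandas(timestamp_as_object=True)
    # Explicitly False: let the error propagate, so callers keep native
    # pandas Timestamps (maybe for performance reasons). Explicitly True:
    # always convert timestamp columns to plain datetime.datetime objects.
    return arrow_table.to_pandas(timestamp_as_object=timestamp_as_object)
```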

Review thread on google/cloud/bigquery/table.py (outdated, resolved).
plamut marked this pull request as ready for review on Aug 6, 2020, and requested review from tswast and shollyman.
plamut (Contributor, Author) commented on Aug 6, 2020:

Let me know if I should also add an explicit timestamp_as_object parameter as envisioned by @tswast, or whether we should leave it out of this fix and (maybe) add it in a separate feature PR.

tswast (Contributor) left a comment:

Thanks!

I think the timestamp_as_object parameter feature can wait for a separate PR.

plamut added the automerge label (merge the pull request once unit tests and other checks pass) on Aug 15, 2020.
The gcf-merge-on-green bot merged commit 8209203 into googleapis:master on Aug 15, 2020.
Successfully merging this pull request may close: to_dataframe fails when fetching timestamp values outside nanosecond bounds (#168).