Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add efficient peek dataframe preview #318

Merged
merged 14 commits into from
Jan 25, 2024
Merged

feat: add efficient peek dataframe preview #318

merged 14 commits into from
Jan 25, 2024

Conversation

TrevorBergeron
Copy link
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Jan 12, 2024
Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can also implement this "peekable" logic (and _cached() if not) in the case of head()?

This would seem to align more with our customer's intuition, even if it causes some wasted work due to ordering.

Also, we should definitely document that they may want to use peek() instead of head() if they don't require a consistent ordering in as many places as we can.

maybe_result = self._block.try_peek(n)
if maybe_result is None:
raise NotImplementedError(
"Cannot peek efficiently when data has aggregates, joins or window functions applied."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not always going to be obvious when the data has aggregates, windows, or especially not joins (due to implicit joins on index).

Could we instead call _cached() in this case and then peek?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a force=True param to peek. This will cause the dataframe to execute and cache the block if it is not peekable. Are you sure this is the best default behavior? One of my goals with peek was for users to avoid fully computing the dataframe.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I'd say the pandas API skews towards the most usable default, not the most efficient one. That said, our main motivation here is to address the feedback of how expensive head() can be, so I see the argument for having force=False the default.

@TrevorBergeron TrevorBergeron requested a review from tswast January 17, 2024 20:42
@TrevorBergeron TrevorBergeron marked this pull request as ready for review January 17, 2024 20:42
@TrevorBergeron TrevorBergeron requested review from a team as code owners January 17, 2024 20:42
@@ -1066,6 +1066,36 @@ def head(self, n: int = 5) -> DataFrame:
def tail(self, n: int = 5) -> DataFrame:
return typing.cast(DataFrame, self.iloc[-n:])

def peek(self, n: int = 5, *, force: bool = True) -> pandas.DataFrame:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def peek(self, n: int = 5, *, force: bool = True) -> pandas.DataFrame:
def peek(self, n: int = 5, *, force: bool = False) -> pandas.DataFrame:

Let's do force=False as the default for now. If we get complaints, we can reconsider.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, and amended docstring.

@TrevorBergeron TrevorBergeron requested a review from tswast January 25, 2024 01:17
@tswast tswast added the automerge Merge the pull request once unit tests and other checks pass. label Jan 25, 2024
def peek(self, n: int = 5, *, force: bool = False) -> pandas.DataFrame:
"""
Preview n arbitrary rows from the dataframe. No guarantees about row selection or ordering.
DataFrame.peek(force=False) is much faster than DataFrame.peek, but will only succeed in the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update docstring to match new default value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated docstring

maybe_result = self._block.try_peek(n)
assert maybe_result is not None
else:
raise NotImplementedError(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValueError or a custom subclass of it may be more appropriate for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to ValueError

@TrevorBergeron TrevorBergeron requested a review from tswast January 25, 2024 20:12
@tswast tswast merged commit 9c34d83 into main Jan 25, 2024
15 checks passed
@tswast tswast deleted the peek branch January 25, 2024 22:08
@gcf-merge-on-green gcf-merge-on-green bot removed the automerge Merge the pull request once unit tests and other checks pass. label Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants