feat: add efficient peek dataframe preview #318
I wonder if we can also implement this "peekable" logic (and _cached() if not peekable) in the case of head()?
This would seem to align more with our customers' intuition, even if it causes some wasted work due to ordering.
Also, we should definitely document, in as many places as we can, that they may want to use peek() instead of head() if they don't require a consistent ordering.
bigframes/dataframe.py
Outdated
maybe_result = self._block.try_peek(n)
if maybe_result is None:
    raise NotImplementedError(
        "Cannot peek efficiently when data has aggregates, joins or window functions applied."
It's not always going to be obvious when the data has aggregates, windows, or especially joins (due to implicit joins on index).
Could we instead call _cached() in this case and then peek?
Added a force=True param to peek. This will cause the dataframe to execute and cache the block if it is not peekable. Are you sure this is the best default behavior? One of my goals with peek was for users to avoid fully computing the dataframe.
In general, I'd say the pandas API skews towards the most usable default, not the most efficient one. That said, our main motivation here is to address the feedback about how expensive head() can be, so I see the argument for making force=False the default.
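For reference, a minimal sketch of the fallback discussed in this thread. Only `try_peek`, `force`, and the error message come from the diff under review; `ToyBlock`, its `peekable` flag, its `cached()` method, and the `executions` counter are hypothetical stand-ins and not the bigframes internals. The sketch deliberately gives `force` no default, since the default is exactly what is being debated here, and it raises the same exception type as the diff at this point in the review.

```python
from __future__ import annotations

from typing import Optional


class ToyBlock:
    """Hypothetical stand-in for a block; not the bigframes implementation."""

    def __init__(self, rows: list[int], peekable: bool) -> None:
        self.rows = rows
        self.peekable = peekable
        self.executions = 0  # number of full (expensive) computations triggered

    def try_peek(self, n: int) -> Optional[list[int]]:
        # Return up to n rows cheaply, or None when no cheap preview is possible.
        return self.rows[:n] if self.peekable else None

    def cached(self) -> ToyBlock:
        # Assumption: caching runs the full computation once and
        # yields a block that can then be peeked cheaply.
        self.executions += 1
        return ToyBlock(self.rows, peekable=True)


def peek(block: ToyBlock, n: int = 5, *, force: bool) -> list[int]:
    maybe_result = block.try_peek(n)
    if maybe_result is None:
        if not force:
            # Same exception type and message as the diff under review.
            raise NotImplementedError(
                "Cannot peek efficiently when data has aggregates, "
                "joins or window functions applied."
            )
        # Forced path: pay for the full computation, then peek the cached block.
        maybe_result = block.cached().try_peek(n)
        assert maybe_result is not None
    return maybe_result
```

With `force=False` a non-peekable block fails fast and nothing is computed; with `force=True` the caller accepts one full execution in exchange for always getting a preview.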
bigframes/dataframe.py
Outdated
@@ -1066,6 +1066,36 @@ def head(self, n: int = 5) -> DataFrame:
    def tail(self, n: int = 5) -> DataFrame:
        return typing.cast(DataFrame, self.iloc[-n:])

    def peek(self, n: int = 5, *, force: bool = True) -> pandas.DataFrame:
- def peek(self, n: int = 5, *, force: bool = True) -> pandas.DataFrame:
+ def peek(self, n: int = 5, *, force: bool = False) -> pandas.DataFrame:
Let's do force=False as the default for now. If we get complaints, we can reconsider.
Done, and amended the docstring.
bigframes/dataframe.py
Outdated
def peek(self, n: int = 5, *, force: bool = False) -> pandas.DataFrame:
    """
    Preview n arbitrary rows from the dataframe. No guarantees about row selection or ordering.
    DataFrame.peek(force=False) is much faster than DataFrame.peek, but will only succeed in the
Please update docstring to match new default value.
updated docstring
bigframes/dataframe.py
Outdated
    maybe_result = self._block.try_peek(n)
    assert maybe_result is not None
else:
    raise NotImplementedError(
ValueError or a custom subclass of it may be more appropriate for this.
changed to ValueError
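For anyone weighing the "custom subclass" option mentioned above, here is a rough sketch of what it might look like. The names `NotPeekableError` and `require_peekable` are hypothetical and do not exist in bigframes, and the hint about `force=True` appended to the message is an assumption layered on top of the original error text.

```python
class NotPeekableError(ValueError):
    """Hypothetical error for data that cannot be previewed cheaply."""


def require_peekable(is_peekable: bool) -> None:
    # Hypothetical helper: raise when a cheap preview is not possible.
    if not is_peekable:
        raise NotPeekableError(
            "Cannot peek efficiently when data has aggregates, joins or window "
            "functions applied. Use force=True to fully compute the dataframe."
        )


def describe(is_peekable: bool) -> str:
    # Callers that only know about ValueError still catch the subclass.
    try:
        require_peekable(is_peekable)
    except ValueError as exc:
        return f"{type(exc).__name__}: preview unavailable"
    return "preview available"
```

Because the subclass derives from `ValueError`, existing `except ValueError` handlers keep working, while callers that care can catch the narrower type.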