Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deps: migrate to ibis-framework >= "7.1.0" #53

Merged
merged 24 commits into from
Dec 15, 2023
Merged

Conversation

tswast
Copy link
Collaborator

@tswast tswast commented Sep 22, 2023

This should unlock some bug fixes as well as potential UNNEST support in a future change.

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Towards internal issue 301459557 🦕

This should unlock some bug fixes as well as potential `UNNEST` support
in a future change.
@product-auto-label product-auto-label bot added size: s Pull request size is small. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Sep 22, 2023
@product-auto-label product-auto-label bot added size: m Pull request size is medium. and removed size: s Pull request size is small. labels Sep 22, 2023
@tswast tswast marked this pull request as ready for review November 13, 2023 22:32
@tswast tswast requested review from a team as code owners November 13, 2023 22:32
@tswast
Copy link
Collaborator Author

tswast commented Nov 14, 2023

We're getting a few failures, such as

____________________________ test_df_nlargest[all] _____________________________
[gw0] linux -- Python 3.11.1 /tmpfs/src/github/python-bigquery-dataframes/.nox/system_prerelease/bin/python

scalars_df_index =           bool_col                                          bytes_col  \
rowindex                                     ....999999+00:00  
8                         T             <NA>                              <NA>  

[9 rows x 13 columns]
scalars_pandas_df_index =           bool_col  ...                     timestamp_col
rowindex            ...                                  
0 ... ...  2038-01-19 03:14:17.999999+00:00
8            False  ...                              <NA>

[9 rows x 13 columns]
keep = 'all'

    @pytest.mark.parametrize(
        ("keep",),
        [
            ("first",),
            ("last",),
            ("all",),
        ],
    )
    def test_df_nlargest(scalars_df_index, scalars_pandas_df_index, keep):
>       bf_result = scalars_df_index.nlargest(
            3, ["bool_col", "int64_too"], keep=keep
        ).to_pandas()

[tests/system/small/test_dataframe.py:156](https://cs.corp.google.com/piper///depot/google3/tests/system/small/test_dataframe.py?l=156): 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[bigframes/dataframe.py:929](https://cs.corp.google.com/piper///depot/google3/bigframes/dataframe.py?l=929): in nlargest
    return DataFrame(block_ops.nlargest(self._block, n, column_ids, keep=keep))
[bigframes/core/block_transforms.py:548](https://cs.corp.google.com/piper///depot/google3/bigframes/core/block_transforms.py?l=548): in nlargest
    block, counter = block.apply_window_op(
[bigframes/core/blocks.py:768](https://cs.corp.google.com/piper///depot/google3/bigframes/core/blocks.py?l=768): in apply_window_op
    block = Block(
[bigframes/core/blocks.py:112](https://cs.corp.google.com/piper///depot/google3/bigframes/core/blocks.py?l=112): in __init__
    self._expr = self._normalize_expression(expr, self._index_columns)
[bigframes/core/blocks.py:1093](https://cs.corp.google.com/piper///depot/google3/bigframes/core/blocks.py?l=1093): in _normalize_expression
    col_id for col_id in expr.column_ids if col_id not in index_columns
[bigframes/core/__init__.py:82](https://cs.corp.google.com/piper///depot/google3/bigframes/core/__init__.py?l=82): in column_ids
    return self._compile_ordered().column_ids
[bigframes/core/__init__.py:95](https://cs.corp.google.com/piper///depot/google3/bigframes/core/__init__.py?l=95): in _compile_ordered
    return compiler.compile_ordered(self.node)
[bigframes/core/compile/compiler.py:33](https://cs.corp.google.com/piper///depot/google3/bigframes/core/compile/compiler.py?l=33): in compile_ordered
    return typing.cast(compiled.OrderedIR, compile_node(node, True))
[bigframes/core/compile/compiler.py:45](https://cs.corp.google.com/piper///depot/google3/bigframes/core/compile/compiler.py?l=45): in compile_node
    return _compile_node(node, ordered)
/[usr/local/lib/python3.11/functools.py:909](https://cs.corp.google.com/piper///depot/google3/usr/local/lib/python3.11/functools.py?l=909): in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
[bigframes/core/compile/compiler.py:192](https://cs.corp.google.com/piper///depot/google3/bigframes/core/compile/compiler.py?l=192): in compile_window
    result = compile_ordered(node.child).project_window_op(
[bigframes/core/compile/compiled.py:868](https://cs.corp.google.com/piper///depot/google3/bigframes/core/compile/compiled.py?l=868): in project_window_op
    result = self._set_or_replace_by_id(output_name or column_name, window_op)
[bigframes/core/compile/compiled.py:1185](https://cs.corp.google.com/piper///depot/google3/bigframes/core/compile/compiled.py?l=1185): in _set_or_replace_by_id
    return builder.build()
[bigframes/core/compile/compiled.py:1344](https://cs.corp.google.com/piper///depot/google3/bigframes/core/compile/compiled.py?l=1344): in build
    return OrderedIR(
[bigframes/core/compile/compiled.py:632](https://cs.corp.google.com/piper///depot/google3/bigframes/core/compile/compiled.py?l=632): in __init__
    super().__init__(table, columns, predicates)
[bigframes/core/compile/compiled.py:65](https://cs.corp.google.com/piper///depot/google3/bigframes/core/compile/compiled.py?l=65): in __init__
    self._column_names = {column.get_name(): column for column in self._columns}
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

.0 = <tuple_iterator object at 0x7f117affa7d0>

>   self._column_names = {column.get_name(): column for column in self._columns}
E   TypeError: unhashable type: 'Deferred'

[bigframes/core/compile/compiled.py:65](https://cs.corp.google.com/piper///depot/google3/bigframes/core/compile/compiled.py?l=65): TypeError

I asked some friends over on the ibis project, and they shared that Deferred objects aren't bound to a table expression. Also, op() is hashable but the expression might not be. The "more future-proof way is to resolve the deferred through the table’s __getitem__" that is to say "table[deferred].op()"

@tswast tswast marked this pull request as draft November 14, 2023 22:42
@tswast
Copy link
Collaborator Author

tswast commented Nov 14, 2023

Marking as Draft again until we have a greater need to upgrade.

@tswast tswast marked this pull request as ready for review November 27, 2023 20:19
@tswast tswast changed the title deps: migrate to ibis-framework >= "7.0.0" deps: migrate to ibis-framework >= "7.1.0" Nov 27, 2023
@tswast
Copy link
Collaborator Author

tswast commented Nov 28, 2023

qcut support currently blocked by what I believe to be a regression in ibis related to those Deferred objects: ibis-project/ibis#7631

Will investigate if there's a way to work around it. Seems we are ending up in this code path somehow: https://github.com/ibis-project/ibis/blob/d6a2f0922ea92b0e03a0431bf9011e7103565061/ibis/expr/types/generic.py#L760

@@ -62,7 +63,16 @@ def __init__(
self._columns = tuple(columns)
# To allow for more efficient lookup by column name, create a
# dictionary mapping names to column values.
self._column_names = {column.get_name(): column for column in self._columns}
self._column_names = {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently you are adding support for Deferred objects. Currently we have a couple of objects interpretable as a value expression besides the expressions themselves, e.g. column names or selector in addition to deferred objects. The issue that ibis doesn't provide a clear, concise and backward compatible API function to bind those objects to a table. We have various protected functions and methods used in a hardly followable fashion.

Instead of providing a stable API for deferred, wouldn't it be better to have a stable function which convert any sort of ibis objects to an expression (given a table) if possible, something like ibis.bind(obj, table)?

Asking because we have something similar under preparation, but could be a good idea to have a public variant of that: https://github.com/ibis-project/ibis/pull/7580/files#diff-4400549b2eaa4ca3a3107f13fa78210e37277659acfe2a1eb4129cfbfaa66531R105-R134

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of providing a stable API for deferred, wouldn't it be better to have a stable function which convert any sort of ibis objects to an expression (given a table) if possible, something like ibis.bind(obj, table)?

Yes, we'd definitely use something like that here.

Honestly, I'm not 100% how we are ending up with deferred objects in the first place. It seems to be with how we're building some of our windowed functions, but in all of the cases I know of there is an underlying table. I think we may just be hitting this bug: ibis-project/ibis#7631

@@ -516,17 +516,14 @@ def remote_function_node(
"""Creates an Ibis node representing a remote function call."""

fields = {
name: rlz.value(type_) if type_ else rlz.any
name: rlz.ValueOf(None if type_ == "ANY TYPE" else type_)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we are using almost the same code internally, possibly it could worth to expose it in a more developer friendly way after we understand the actual requirements. We can open an issue on our side to track this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I was hoping to upstream this sooner, but now that I see https://ibis-project.org/reference/scalar-udfs#ibis.expr.operations.udf.scalar.builtin maybe I can migrate to that? I'll try that in a separate PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started on this here: #277

I found a few issues already with UDF in the BQ backend. I'll get those documented soon.

from ibis.expr.operations.analytic import Analytic
import ibis.expr.datatypes as dt
import ibis.expr.operations.analytic as ibis_ops_analytic
import ibis.expr.operations.core as ibis_ops_core
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to just simply use import ibis.expr.operations as ops since all op types are exposed there.

"""Retrieve the first element."""

arg = rlz.column(rlz.any)
output_dtype = rlz.dtype_like("arg")
arg: ibis_ops_core.Column[dt.Any]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just simply use ops.Column without the parameter, it means the same thing as ops.Column[dt.Any]



class LastNonNullValue(Analytic):
class LastNonNullValue(ibis_ops_analytic.Analytic):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have been breaking several internal parts of the codebase (sorry for that), maintaining custom operations this way is probably inconvenient. I wanted to have a documented and stable way of creating custom nodes for a long time now, just didn't have enough usage examples which we could draft a desired API from. This might be another topic we should discuss in our issue tracker.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Filed ibis-project/ibis#7767

I don't even know that it needs to be a stable interface for custom ops. Maybe just an escape hatch of some sort for raw SQL that's just a Value, not a full table expression like the current Backend.sql() escape hatch. That said, such an escape hatch is going to look a lot like a custom op, since ideally we'd propagate type info somehow so that expressions can continue to be built on it.

@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Dec 15, 2023
@tswast tswast added automerge Merge the pull request once unit tests and other checks pass. owlbot:run Add this label to trigger the Owlbot post processor. labels Dec 15, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Dec 15, 2023
@tswast tswast merged commit 9798a2b into main Dec 15, 2023
15 checks passed
@tswast tswast deleted the b301459557-ibis-prerelease branch December 15, 2023 20:42
@gcf-merge-on-green gcf-merge-on-green bot removed the automerge Merge the pull request once unit tests and other checks pass. label Dec 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants