fix(ir): support converting limit(1) inputs to scalar subqueries #8223

Closed
wants to merge 161 commits into from

Conversation


@kszucs kszucs commented Feb 5, 2024

No description provided.
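The title describes the change: a relation wrapped in `limit(1)` provably yields at most one row, so it can be converted to a scalar subquery where a scalar value is expected. A minimal sketch of such a conversion rule, using hypothetical stand-in classes rather than the real ibis IR nodes:

```python
from dataclasses import dataclass

# Hypothetical stand-in node types; the actual ibis IR classes differ.
@dataclass(frozen=True)
class Limit:
    parent: object
    n: int

@dataclass(frozen=True)
class ScalarSubquery:
    rel: object

def as_scalar_subquery(node):
    """Convert a relation into a scalar subquery when it provably
    yields at most one row, i.e. it is wrapped in limit(1)."""
    if isinstance(node, Limit) and node.n == 1:
        return ScalarSubquery(node)
    raise TypeError("only limit(1) relations can be converted to scalar subqueries")
```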

kszucs and others added 30 commits February 2, 2024 12:32
Rationale and history
---------------------
In the last couple of years we have been constantly refactoring the
internals to make them easier to work with. Although we have made great
progress, the current codebase is still hard to maintain and extend.
One example of that complexity is the attempt to remove the `Projector`
class in ibis-project#7430. I came to realize that we are unable to
improve the internals in small incremental steps; we need to make a big
leap forward to make the codebase maintainable in the long run.

One of the hotspots of problems is the `analysis.py` module, which
tries to bridge the gap between the user-facing API and the internal
representation. Part of its complexity is caused by loose integrity
checks in the internal representation, allowing various ways to
represent the same operation. This makes it hard to inspect, reason
about, and optimize the relational operations. In addition, it makes it
much harder to implement the backends, since more branching is required
to cover all the variations.

We have always been aware of these problems, and we have made several
attempts to solve them in the same way this PR does. However, we never
managed to actually split the relational operations; we always hit
roadblocks while trying to maintain compatibility with the current test
suite. We were often unable even to understand those issues because of
the complexity of the codebase and the number of indirections between
the API, the analysis functions, and the internal representation.

But(!) we finally managed to prototype a new IR in ibis-project#7580,
along with implementations for the majority of the backends, including
various SQL backends and `pandas`. After successfully validating the
viability of the new IR, we split the PR into smaller pieces which can
be individually reviewed. This PR is the first step of that process: it
introduces the new IR and the new API. The next steps will be to
implement the remaining backends on top of the new IR.

Changes in this commit
----------------------
- Split the `ops.Selection` and `ops.Aggregation` nodes into proper
  relational algebra operations.
- Almost entirely remove `analysis.py`, along with the technical debt
  accumulated over the years.
- More flexible window frame binding: if an unbound analytical function
  is used with a window containing references to a relation then
  `.over()` is now able to bind the window frame to the relation.
- Introduce a new API-level technique to dereference columns to the
  target relation(s).
- Revamp the subquery handling to be more robust and to support more
  use cases with strict validation: we now have `ScalarSubquery`,
  `ExistsSubquery`, and `InSubquery` nodes, each usable only in the
  appropriate context.
- Use much stricter integrity checks for all the relational operations,
  most of the time enforcing that all the value inputs of a node
  originate from the parent relation the node depends on.
- Introduce a new `JoinChain` operation to represent multiple joins,
  followed by a projection, as a single operation attached to the same
  relation. This makes it possible to solve several outstanding issues
  with join handling (including the notorious chained-join issue).
- Use straightforward rewrite rules collected in `rewrites.py` to
  reinterpret user input so that the new operations can be constructed,
  even with the strict integrity checks.
- Provide a set of simplification rules to reorder and squash the
  relational operations into a more compact form.
- Use mappings to represent projections, eliminating the need to
  internally store `ops.Alias` nodes. In addition, table nodes are no
  longer allowed in projections; their columns are expanded into the
  same mapping, making the semantics clear.
- Uniform handling of the various kinds of inputs for all the API
  methods using a generic `bind()` function.
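The mapping-based projection representation can be illustrated with a simplified sketch (stand-in classes, not the actual ibis nodes): instead of storing `ops.Alias(expr, name)` nodes, the projection itself carries a name-to-expression mapping, and any whole-table input is expanded into its columns.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Table:
    name: str
    columns: tuple  # column names

@dataclass(frozen=True)
class Field:
    rel: Table
    name: str

def build_projection(rel, values):
    """Normalize projection inputs into a flat name -> expression
    mapping. Whole-table inputs are expanded column by column, so no
    Alias nodes are needed and the semantics stay explicit."""
    mapping = {}
    for name, value in values.items():
        if isinstance(value, Table):
            for col in value.columns:
                mapping[col] = Field(value, col)
        else:
            mapping[name] = value
    return mapping
```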

Advantages of the new IR
------------------------
- The operations are much simpler with clear semantics.
- The operations are easier to reason about and to optimize.
- The backends can easily lower the internal representation to a
  backend-specific form before compilation/execution, so the lowered
  form can be easily inspected, debugged, and optimized.
- The API is much closer to the users' mental model, thanks to the
  dereferencing technique.
- The backend implementations can be greatly simplified thanks to the
  simpler internal representation and strict integrity checks. As an
  example, the pandas backend can be slimmed down by 4k lines of code
  while becoming more robust and easier to maintain.
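The dereferencing technique mentioned above can be sketched generically: field references against an ancestor relation are recursively substituted with the equivalent fields of the target relation. The node classes below are hypothetical stand-ins, not the real ibis operations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    rel: str   # relation name, a plain string for brevity
    name: str

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def dereference(node, mapping):
    """Recursively rewrite field references to ancestor relations into
    the equivalent fields of the target relation, so user expressions
    written against a parent table still bind correctly."""
    if node in mapping:
        return mapping[node]
    if isinstance(node, Add):
        return Add(dereference(node.left, mapping),
                   dereference(node.right, mapping))
    return node
```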

Disadvantages of the new IR
---------------------------
- The backends must be rewritten to support the new internal
  representation.
… of using the globally unique `SelfReference`

This enables us to maintain join expression equality:
`a.join(b).equals(a.join(b))`

So far we have used `SelfReference` to make join tables unique, but it
was globally unique, which broke the equality check above. Therefore we
need to restrict the uniqueness to the scope of the join chain. The
simplest solution is to enumerate the tables in the join chain, so now
all join participants must be `ops.JoinTable(rel, index)` instances.

`ops.SelfReference` is still required to distinguish between two
identical tables at the API level, but it is now decoupled from the
join internal representation.
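A minimal sketch of why per-chain enumeration restores structural equality while a globally unique reference breaks it (stand-in classes, not the real ibis nodes):

```python
import itertools
from dataclasses import dataclass

_ids = itertools.count()

class SelfReference:
    """Globally unique: every instantiation gets a fresh id, so two
    structurally identical joins never compare equal."""
    def __init__(self, rel):
        self.rel = rel
        self.id = next(_ids)

    def __eq__(self, other):
        return (isinstance(other, SelfReference)
                and (self.rel, self.id) == (other.rel, other.id))

    def __hash__(self):
        return hash((self.rel, self.id))

@dataclass(frozen=True)
class JoinTable:
    """Unique only within its join chain: the index is the table's
    position in the chain, so equal chains compare equal."""
    rel: str
    index: int
```

Building `a.join(b)` twice yields distinct `SelfReference` wrappers for `b`, but identical `JoinTable("b", 1)` wrappers, which is what makes `a.join(b).equals(a.join(b))` hold.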
it's alive!

tests run (and fail)

chore(duckdb): naive port of clickhouse compiler

fix(duckdb): hacky fix for output shape

feat(duckdb): bitwise ops (most of them)

feat(duckdb): handle pandas dtype mapping in execute

feat(duckdb): handle decimal types

feat(duckdb): add euler's number

test(duckdb): remove duckdb from alchemycon

feat(duckdb): get _most_ of string ops working

still some failures in re_extract

feat(duckdb): add hash

feat(duckdb): add CAST

feat(duckdb): add cot and strright

chore(duckdb): mark all the targets that still need attention (at least)

feat(duckdb): combine binary bitwise ops

chore(datestuff): some datetime ops

feat(duckdb): add levenshtein, use op.dtype instead of output_dtype

feat(duckdb): add blank list_schemas, use old current_database for now

feat(duckdb): basic interval ops

feat(duckdb): timestamp and temporal ops

feat(duckdb): use pyarrow for fetching execute results

feat(duckdb): handle interval casts, broken for columns

feat(duckdb): shove literal handling up top

feat(duckdb): more timestamp ops

feat(duckdb): back to pandas output in execute

feat(duckdb): timezone handling in cast

feat(duckdb): ms and us epoch timestamp support

chore(duckdb): misc cleanup

feat(duckdb): initial create table

feat(duckdb): add _from_url

feat(duckdb): add read_parquet

feat(duckdb): add persistent cache

fix(duckdb): actually insert data if present in create_table

feat(duckdb): use duckdb API read_parquet

feat(duckdb): add read_csv

This, frustratingly, cannot use the Python API for `read_csv` since
that does not support a list of files, for some reason.

fix(duckdb): dont fully qualify the table names

chore(duckdb): cleanup

chore(duckdb): mark broken test broken

fix(duckdb): fix read_parquet so it works

feat(duckdb): add to_pyarrow, to_pyarrow_batches, sql()

feat(duckdb): null checking

feat(duckdb): translate uints

fix(duckdb): fix file outputs and torch output

fix(duckdb): add rest of integer types

fix(duckdb): ops.InValues

feat(duckdb): use sqlglot expressions (maybe a big mistake)

fix(duckdb): don't stringify strings

feat(duckdb): use sqlglot expr instead of strings for count

fix(duckdb): fix isin

fix(duckdb): fix some agg variance functions

fix(duckdb): for logical equals, use sqlglot not operator

fix(duckdb): struct not tuple for struct type
Alternative implementation of `map` to reduce memory usage. While `map`
keeps all the results in memory until the end of the traversal, the new
`map_clear()` method removes intermediate results as soon as they are
no longer needed.
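The idea behind `map_clear()` can be sketched generically: during a bottom-up traversal, track how many dependents still need each intermediate result and evict it once that count reaches zero. This is a simplified model of the technique, not the actual ibis implementation.

```python
def map_clear(nodes, children, fn):
    """Apply `fn` bottom-up over a DAG given as {node: [child, ...]},
    discarding each intermediate result once every dependent has
    consumed it, instead of keeping all results until the end.
    `nodes` must be in topological (children-first) order."""
    # count how many parents will still read each node's result
    remaining = {n: 0 for n in nodes}
    for n in nodes:
        for c in children[n]:
            remaining[c] += 1

    results, final = {}, {}
    for n in nodes:
        results[n] = fn(n, [results[c] for c in children[n]])
        if remaining[n] == 0:      # a root: nobody consumes it later
            final[n] = results[n]
        for c in children[n]:
            remaining[c] -= 1
            if remaining[c] == 0:  # no dependents left -> evict
                del results[c]
    return final
```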
Previously the `ASOF` join API was imprecise.

The backends supporting `asof` joins require exactly one nearest-match
(inequality) predicate along with an arbitrary number of ordinary join
predicates; see [ClickHouse ASOF](https://clickhouse.com/docs/en/sql-reference/statements/select/join#asof-join-usage),
[DuckDB ASOF](https://duckdb.org/docs/guides/sql_features/asof_join.html#asof-joins-with-the-using-keyword) and
[Pandas ASOF](https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html).

This change alters the API to
`table.asof_join(left, right, on, predicates, ...)`, where `on` is the
nearest-match predicate, defaulting to `left[on] <= right[on]` if no
expression is given. I kept the `by` argument for compatibility
reasons, but we should phase it out in favor of `predicates`.

Also ensure that all the join methods on `ir.Join` have exactly the
same docstrings as `ir.Table`.

BREAKING CHANGE: the `on` parameter of `table.asof_join()` now accepts
only a single predicate; use `predicates` to supply additional join
predicates.
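The required semantics, one nearest-match predicate plus ordinary equality predicates, can be modeled in plain Python. This is a sketch of the matching rule for a single left row, using the default `left[on] <= right[on]` predicate described above, not any backend's implementation:

```python
def asof_match(left_row, right_rows, on, by=()):
    """For one left row (a dict), pick the nearest right row satisfying
    the default predicate left[on] <= right[on], among rows that also
    satisfy the ordinary equality predicates named in `by`."""
    candidates = [
        r for r in right_rows
        if all(left_row[k] == r[k] for k in by)  # ordinary predicates
        and left_row[on] <= r[on]                # the one nearest-match predicate
    ]
    # "nearest" under <= means the smallest qualifying right value
    return min(candidates, key=lambda r: r[on], default=None)
```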
@kszucs kszucs force-pushed the to_array branch 3 times, most recently from 8e89a56 to 1e28de8 Compare February 5, 2024 01:58
@cpcloud cpcloud force-pushed the the-epic-split branch 2 times, most recently from 0a87137 to 7fc1638 Compare February 5, 2024 17:08
@kszucs kszucs force-pushed the the-epic-split branch 3 times, most recently from 497b3cb to abcc30a Compare February 6, 2024 14:13
@cpcloud cpcloud force-pushed the the-epic-split branch 4 times, most recently from c27f7e5 to e6e4a46 Compare February 12, 2024 19:32
@cpcloud

cpcloud commented Mar 20, 2024

Closing this out. Please submit a new PR if you're still interested in making these changes.

@cpcloud cpcloud closed this Mar 20, 2024