From 95dcd9884c8ac6f1a22bce8ad4dacad6f2de9da9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kriszti=C3=A1n=20Sz=C5=B1cs?= <szucs.krisztian@gmail.com>
Date: Wed, 8 Nov 2023 11:01:33 +0100
Subject: [PATCH] refactor(ir): split the relational operations

Rationale and history
---------------------
In the last couple of years we have been constantly refactoring the
internals to make them easier to work with. Although we have made great
progress, the current codebase is still hard to maintain and extend.
One example of that complexity is the attempt to remove the `Projector`
class in #7430. It made me realize that we are unable to improve the
internals in small incremental steps; we need to make a big leap
forward to make the codebase maintainable in the long run.

One of the hotspots of problems is the `analysis.py` module which tries
to bridge the gap between the user-facing API and the internal
representation. Part of its complexity is caused by loose integrity
checks in the internal representation, allowing various ways to
represent the same operation. This makes it hard to inspect, reason
about and optimize the relational operations. In addition to that, it
makes the backends much harder to implement, since more branching is
required to cover all the variations.

We have always been aware of these problems, and we made several
attempts to solve them the same way this PR does. However, we never
managed to actually split the relational operations, because we always
hit roadblocks while maintaining compatibility with the current test
suite. We were unable to even understand those issues because of the
complexity of the codebase and the number of indirections between the
API, the analysis functions and the internal representation.

Finally, we managed to prototype a new IR in #7580 along with
implementations for the majority of the backends, including various SQL
backends and `pandas`. After successfully validating the viability of
the new IR, we split the PR into smaller pieces which can be
individually reviewed. This PR is the first step of that process: it
introduces the new IR and the new API. The next steps will be to
implement the remaining backends on top of the new IR.

Changes in this commit
----------------------
- Split the `ops.Selection` and `ops.Aggregation` nodes into proper
  relational algebra operations.
- Almost entirely remove `analysis.py`, along with the technical debt
  it accumulated over the years.
- More flexible window frame binding: if an unbound analytical function
  is used with a window containing references to a relation then
  `.over()` is now able to bind the window frame to the relation.
- Introduce a new API-level technique to dereference columns to the
  target relation(s).
- Revamp the subquery handling to be more robust and to support more
  use cases with strict validation. There are now `ScalarSubquery`,
  `ExistsSubquery`, and `InSubquery` nodes, which can only be used in
  the appropriate context.
- Use much stricter integrity checks for all the relational operations,
  most of the time enforcing that all the value inputs of a node must
  originate from the parent relation the node depends on.
- Introduce a new `JoinChain` operation to represent multiple joins in
  a single operation followed by a projection attached to the same
  relation. This made it possible to solve several outstanding issues
  with the join handling (including the notorious chain join issue).
- Use straightforward rewrite rules collected in `rewrites.py` to
  reinterpret user input so that the new operations can be constructed,
  even with the strict integrity checks.
- Provide a set of simplification rules to reorder and squash the
  relational operations into a more compact form.
- Use mappings to represent projections, eliminating the need to store
  `ops.Alias` nodes internally. In addition, table nodes are no longer
  allowed in projections; their columns are expanded into the same
  mapping, making the semantics clear.
- Uniform handling of the various kinds of inputs for all the API
  methods using a generic `bind()` function.
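
To make the stricter integrity checks, the mapping-based projections
and the dereferencing idea more concrete, here is a minimal standalone
sketch. It is not the code in this PR: `Table`, `Field`, `Project` and
`dereference()` are hypothetical toy stand-ins written only for
illustration, and the real operations carry far more information
(types, schemas, patterns).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical leaf relation: a name plus its column names."""

    name: str
    columns: tuple[str, ...]


@dataclass
class Field:
    """Hypothetical column reference bound to exactly one relation."""

    rel: Table | Project
    name: str


@dataclass
class Project:
    """Hypothetical projection storing an `output name -> value` mapping.

    Because the output names are the mapping keys, no alias node is
    needed to rename a column.
    """

    parent: Table | Project
    values: dict[str, Field]

    def __post_init__(self) -> None:
        # Integrity check: every value must originate from the parent.
        for out_name, field in self.values.items():
            if field.rel != self.parent:
                raise ValueError(
                    f"{out_name!r} does not originate from the parent relation"
                )

    @property
    def columns(self) -> tuple[str, ...]:
        return tuple(self.values)


def dereference(field: Field, target: Project) -> Field:
    """Rebind a field of `target.parent` to the matching field of `target`."""
    if field.rel == target:
        return field
    for out_name, value in target.values.items():
        if value == field:
            return Field(target, out_name)
    raise ValueError(f"{field.name!r} is not exposed by the target relation")


t = Table("t", ("a", "b"))
other = Table("other", ("a",))

# A projection renaming `t.a` to `x`, expressed purely as a mapping.
proj = Project(t, {"x": Field(t, "a")})

# A user still holding `t.a` gets it resolved against the projection.
resolved = dereference(Field(t, "a"), proj)

# A value coming from an unrelated table is rejected at construction time.
try:
    Project(t, {"x": Field(other, "a")})
    rejected = False
except ValueError:
    rejected = True
```

In this toy model an invalid node can never be constructed, so code
consuming the tree (for example a backend) does not need to branch on
malformed inputs; this is the property the stricter checks aim for.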

Advantages of the new IR
------------------------
- The operations are much simpler with clear semantics.
- The operations are easier to reason about and to optimize.
- The backends can easily lower the internal representation to a
  backend-specific form before compilation/execution, so the lowered
  form can be easily inspected, debugged, and optimized.
- The API is much closer to the users' mental model, thanks to the
  dereferencing technique.
- The backend implementation can be greatly simplified due to the
  simpler internal representation and strict integrity checks. As an
  example, the pandas backend can be slimmed down by 4k lines of code
  while being more robust and easier to maintain.

Disadvantages of the new IR
---------------------------
- The backends must be rewritten to support the new internal
  representation.
---
 ibis/expr/operations/relations.py |  1 +
 ibis/expr/rewrites.py             | 25 ++++++++++++++++++-------
 ibis/expr/types/generic.py        |  1 +
 3 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/ibis/expr/operations/relations.py b/ibis/expr/operations/relations.py
index b7e62cf056b4..6681b1dc44bd 100644
--- a/ibis/expr/operations/relations.py
+++ b/ibis/expr/operations/relations.py
@@ -20,6 +20,7 @@
 from ibis.expr.operations.sortkeys import SortKey  # noqa: TCH001
 from ibis.expr.schema import Schema
 from ibis.formats import TableProxy  # noqa: TCH001
+from ibis.util import gen_name
 
 T = TypeVar("T")
 
diff --git a/ibis/expr/rewrites.py b/ibis/expr/rewrites.py
index f0352fefa8bd..6296f06b4682 100644
--- a/ibis/expr/rewrites.py
+++ b/ibis/expr/rewrites.py
@@ -1,8 +1,7 @@
 """Some common rewrite functions to be shared between backends."""
 from __future__ import annotations
 
-import functools
-from collections.abc import Mapping
+import toolz
 
 import toolz
 
@@ -44,11 +43,23 @@ def repl(_):
     return repl
 
 
-@replace(p.FillNa)
-def rewrite_fillna(_):
-    """Rewrite FillNa expressions to use more common operations."""
-    if isinstance(_.replacements, Mapping):
-        mapping = _.replacements
+y = var("y")
+name = var("name")
+
+
+@replace(ops.Analytic)
+def project_wrap_analytic(_, rel):
+    # Wrap analytic functions in a window function
+    return ops.WindowFunction(_, ops.RowsWindowFrame(rel))
+
+
+@replace(ops.Reduction)
+def project_wrap_reduction(_, rel):
+    # Query all the tables that the reduction depends on
+    if _.relations == {rel}:
+        # The reduction is fully originating from the `rel`, so turn
+        # it into a window function of `rel`
+        return ops.WindowFunction(_, ops.RowsWindowFrame(rel))
     else:
         mapping = {
             name: _.replacements
diff --git a/ibis/expr/types/generic.py b/ibis/expr/types/generic.py
index 24add1bff5e3..6d6064e91581 100644
--- a/ibis/expr/types/generic.py
+++ b/ibis/expr/types/generic.py
@@ -18,6 +18,7 @@
 from ibis.util import deprecated
 
 
+
 if TYPE_CHECKING:
     import pandas as pd
     import pyarrow as pa