refactor(ir): split the relational operations #7752
Conversation
ibis/expr/analysis.py
Outdated
Lots of great stuff here!
I think the PR description needs some prose :)
Here are some possible questions to answer in the description:
- How are you solving the chained join problem?
- What are the intended semantics of table references: i.e., what constitutes an integrity error?
- How does the subquery flattening and column pruning work?
- What still needs work and whether it can be done in a follow up
- Are there any expected or known breakages to user code
Ideally most of the core maintainers can read the description and use it as a guide for the PR.
ibis/expr/tests/snapshots/test_sql/test_parse_sql_table_alias/decompiled.py
table[["a", ["b"]]]
with pytest.raises(com.IbisTypeError, match=errmsg):
    table["a", ["b"]]
# FIXME(kszucs): currently bind() flattens the list of expressions, so arbitrary
Does this need to be fixed?
The API is less restrictive now; if we're okay with that then no, otherwise yes.
Seems fine to me!
Created a follow-up issue for this #7819
small things.
I think all of our future selves would benefit from writing up the join-chain-dereferencing ideas and putting them somewhere in the developer docs.
Maybe under a "Design" or "Internals" section?
Non-blocking, and a lot of what would go in that doc is already present in docstrings here, but I think it would be good to collect it all and then add some more detail.
Lookin' good!
Can you make sure to capture the collision implementation in the PR description? It could even go in a follow up IMO.
Rationale and history
---------------------

In the last couple of years we have been constantly refactoring the internals to make them easier to work with. Although we have made great progress, the current codebase is still hard to maintain and extend. One example of that complexity is the attempt to remove the `Projector` class in ibis-project#7430. I had to realize that we are unable to improve the internals in smaller incremental steps; we need to make a big leap forward to make the codebase maintainable in the long run.

One of the hotspots of problems is the `analysis.py` module, which tries to bridge the gap between the user-facing API and the internal representation. Part of its complexity is caused by loose integrity checks in the internal representation, allowing various ways to represent the same operation. This makes it hard to inspect, reason about, and optimize the relational operations. In addition, it makes the backends much harder to implement, since more branching is required to cover all the variations.

We have always been aware of these problems, and we actually had several attempts to solve them the same way this PR does. However, we never managed to actually split the relational operations; we always hit roadblocks maintaining compatibility with the current test suite. We were unable to even understand those issues because of the complexity of the codebase and the number of indirections between the API, the analysis functions, and the internal representation.

But(!) we finally managed to prototype a new IR in ibis-project#7580, along with implementations for the majority of the backends, including various SQL backends and pandas. After successfully validating the viability of the new IR, we split the PR into smaller pieces which can be individually reviewed. This PR is the first step of that process: it introduces the new IR and the new API. The next steps will be to implement the remaining backends on top of the new IR.
Changes in this commit
----------------------

- Split the `ops.Selection` and `ops.Aggregation` nodes into proper relational algebra operations.
- Almost entirely remove `analysis.py` and the technical debt accumulated over the years.
- More flexible window frame binding: if an unbound analytic function is used with a window containing references to a relation, then `.over()` is now able to bind the window frame to that relation.
- Introduce a new API-level technique to dereference columns to the target relation(s).
- Revamp the subquery handling to be more robust and to support more use cases with strict validation; we now have `ScalarSubquery`, `ExistsSubquery`, and `InSubquery` nodes which can only be used in the appropriate context.
- Use much stricter integrity checks for all the relational operations, most of the time enforcing that all the value inputs of a node originate from the parent relation the node depends on.
- Introduce a new `JoinChain` operation to represent multiple joins in a single operation, followed by a projection attached to the same relation. This enabled us to solve several outstanding issues with join handling (including the notorious chained-join issue).
- Use straightforward rewrite rules, collected in `rewrites.py`, to reinterpret user input so that the new operations can be constructed even with the strict integrity checks.
- Provide a set of simplification rules to reorder and squash the relational operations into a more compact form.
- Use mappings to represent projections, eliminating the need to internally store `ops.Alias` nodes. In addition, table nodes are no longer allowed in projections; their columns are expanded into the same mapping, making the semantics clear.
- Uniform handling of the various kinds of inputs for all the API methods using a generic `bind()` function.

Advantages of the new IR
------------------------

- The operations are much simpler, with clear semantics.
- The operations are easier to reason about and to optimize.
- The backends can easily lower the internal representation to a backend-specific form before compilation/execution, so the lowered form can be easily inspected, debugged, and optimized.
- The API is much closer to the users' mental model, thanks to the dereferencing technique.
- The backend implementations can be greatly simplified due to the simpler internal representation and strict integrity checks. As an example, the pandas backend can be slimmed down by 4k lines of code while being more robust and easier to maintain.

Disadvantages of the new IR
---------------------------

- The backends must be rewritten to support the new internal representation.
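The mapping-based projection and the parent-relation integrity checks described above can be illustrated with a small standalone sketch. The class and attribute names here (`Table`, `Field`, `Project`) are hypothetical stand-ins for illustration, not ibis's actual operation classes.

```python
# Hypothetical stand-ins for the real ibis operation classes, illustrating
# two ideas from the commit message: projections stored as name -> value
# mappings (no Alias nodes), and integrity checks that reject value inputs
# not originating from the parent relation.

class Table:
    def __init__(self, schema):
        self.schema = list(schema)

class Field:
    """A column reference; must originate from the given parent relation."""
    def __init__(self, rel, name):
        if name not in rel.schema:
            raise ValueError(f"integrity error: {name!r} not in parent schema")
        self.rel, self.name = rel, name

class Project:
    """A projection: a parent relation plus a name -> expression mapping.

    The mapping keys are the output column names, so no explicit alias
    nodes need to be stored inside the operation.
    """
    def __init__(self, parent, values):
        for value in values.values():
            if value.rel is not parent:
                raise ValueError("integrity error: value not from parent")
        self.parent = parent
        self.values = dict(values)
        self.schema = list(self.values)

t = Table(["a", "b"])
p = Project(t, {"a_renamed": Field(t, "a"), "b": Field(t, "b")})
print(p.schema)  # ['a_renamed', 'b']
```

Renaming a column is just a different mapping key; a reference to a column the parent does not provide fails construction instead of producing an inconsistent node.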
🎉 🎉 🚀 🤖
…perations

Old Implementation
------------------

Since we need to reimplement/port all of the backends for ibis-project#7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model, and each operation was executed using a multidispatched function. While it served us well for a long time, it had a few drawbacks:

- it was often hard to understand what was going on due to the complex preparation steps and various execution hooks
- the multidispatched functions were hard to debug; additionally, they supported a wide variety of inputs, making the implementation rather bulky
- due to the previous reason, several input combinations were not supported, e.g. value operations with multiple columnar inputs
- the `Scope` object was used to pass around the execution context, which was created for each operation separately, and the results were not reusable even when the same operation was executed multiple times

New Implementation
------------------

The new execution model has changed in several ways:

- there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model; this makes it much easier to implement the operations and also makes the input "plan" inspectable
- the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient, while also being aggressively cleaned up as soon as they are not needed anymore to reduce memory usage
- the execute function is now single-dispatched, making the implementation easier to locate and debug
- the inputs are now broadcast to columnar shape so that the same implementation can be used for multiple input shape combinations; this removes several special cases from the implementation in exchange for a negligible performance overhead
- there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, `columnwise`, `elementwise`, `serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes

The new backend implementation has higher feature coverage while being one third of the size of the previous one.

BREAKING CHANGE: the `timecontext` feature is not supported anymore
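The bottom-up, single-dispatched execution model described above can be sketched in a few lines. This is an illustrative toy, not ibis's actual engine: `Node`, `Literal`, and `Add` are hypothetical stand-ins, but the mechanics (topological sort, one `functools.singledispatch`-ed execute function, cached intermediate results for shared subexpressions) mirror the description.

```python
# A minimal sketch of bottom-up execution: nodes are topologically sorted,
# each is executed exactly once via a single-dispatched function, and the
# intermediate results are cached so shared subexpressions are reused.
from functools import singledispatch

class Node:
    def __init__(self, *inputs):
        self.inputs = inputs

class Literal(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value

class Add(Node):
    pass

def toposort(root):
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for dep in node.inputs:
            visit(dep)
        order.append(node)  # dependencies come before dependents
    visit(root)
    return order

@singledispatch
def execute_node(node, *args):
    raise NotImplementedError(type(node))

@execute_node.register
def _(node: Literal):
    return node.value

@execute_node.register
def _(node: Add, left, right):
    return left + right

def execute(root):
    results = {}  # intermediate results, reused across shared subexpressions
    for node in toposort(root):
        args = [results[id(dep)] for dep in node.inputs]
        results[id(node)] = execute_node(node, *args)
    return results[id(root)]

x = Literal(1)
expr = Add(Add(x, Literal(2)), x)  # x is shared and executed only once
print(execute(expr))  # 4
```

A real engine would additionally drop entries from `results` once the last dependent has consumed them, which is the aggressive cleanup the commit message mentions.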
…model (ibis-project#7797) Since we need to reimplement/port all of the backends for ibis-project#7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model and each operation was executing using a multidispatched function. While it served us well for a long time, it had a few drawbacks: - it was often hard to understand what was going on due to the complex preparation steps and various execution hooks - the multidispatched functions were hard to debug, additionally they supported a wide variety of inputs making the implementation rather bulky - due to the previous reaon, several inputs combinations were not supported, e.g. value operations with multiple columnar inputs - the `Scope` object was used to pass around the execution context which was created for each operation separately and the results were not reusable even though the same operation was executed multiple times The new execution model has changed in several ways: - there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model, this makes it much easier to implement the operations and also makes the input "plan" inspectable - the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient while also aggressively cleaned up as soon as they are not needed anymore to reduce the memory usage - the execute function is now single-dispatched making the implementation easier to locate and debug - the inputs now broadcasted to columnar shape so that the same implementation can be used for multiple input shape combinations, this removes several special cases from the implementation in exchange of a negligible performance overhead - there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, 
`columnwise`, `elementwise`, `serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes The new backend implementation has a higher feature coverage while the implementation is one third of the size of the previous one. BREAKING CHANGE: the `timecontext` feature is not supported anymore
…model (ibis-project#7797) Since we need to reimplement/port all of the backends for ibis-project#7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model and each operation was executing using a multidispatched function. While it served us well for a long time, it had a few drawbacks: - it was often hard to understand what was going on due to the complex preparation steps and various execution hooks - the multidispatched functions were hard to debug, additionally they supported a wide variety of inputs making the implementation rather bulky - due to the previous reaon, several inputs combinations were not supported, e.g. value operations with multiple columnar inputs - the `Scope` object was used to pass around the execution context which was created for each operation separately and the results were not reusable even though the same operation was executed multiple times The new execution model has changed in several ways: - there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model, this makes it much easier to implement the operations and also makes the input "plan" inspectable - the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient while also aggressively cleaned up as soon as they are not needed anymore to reduce the memory usage - the execute function is now single-dispatched making the implementation easier to locate and debug - the inputs now broadcasted to columnar shape so that the same implementation can be used for multiple input shape combinations, this removes several special cases from the implementation in exchange of a negligible performance overhead - there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, 
`columnwise`, `elementwise`, `serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes The new backend implementation has a higher feature coverage while the implementation is one third of the size of the previous one. BREAKING CHANGE: the `timecontext` feature is not supported anymore
…model (ibis-project#7797) Since we need to reimplement/port all of the backends for ibis-project#7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model and each operation was executing using a multidispatched function. While it served us well for a long time, it had a few drawbacks: - it was often hard to understand what was going on due to the complex preparation steps and various execution hooks - the multidispatched functions were hard to debug, additionally they supported a wide variety of inputs making the implementation rather bulky - due to the previous reaon, several inputs combinations were not supported, e.g. value operations with multiple columnar inputs - the `Scope` object was used to pass around the execution context which was created for each operation separately and the results were not reusable even though the same operation was executed multiple times The new execution model has changed in several ways: - there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model, this makes it much easier to implement the operations and also makes the input "plan" inspectable - the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient while also aggressively cleaned up as soon as they are not needed anymore to reduce the memory usage - the execute function is now single-dispatched making the implementation easier to locate and debug - the inputs now broadcasted to columnar shape so that the same implementation can be used for multiple input shape combinations, this removes several special cases from the implementation in exchange of a negligible performance overhead - there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, 
`columnwise`, `elementwise`, `serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes The new backend implementation has a higher feature coverage while the implementation is one third of the size of the previous one. BREAKING CHANGE: the `timecontext` feature is not supported anymore
…model (ibis-project#7797) Since we need to reimplement/port all of the backends for ibis-project#7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model and each operation was executing using a multidispatched function. While it served us well for a long time, it had a few drawbacks: - it was often hard to understand what was going on due to the complex preparation steps and various execution hooks - the multidispatched functions were hard to debug, additionally they supported a wide variety of inputs making the implementation rather bulky - due to the previous reaon, several inputs combinations were not supported, e.g. value operations with multiple columnar inputs - the `Scope` object was used to pass around the execution context which was created for each operation separately and the results were not reusable even though the same operation was executed multiple times The new execution model has changed in several ways: - there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model, this makes it much easier to implement the operations and also makes the input "plan" inspectable - the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient while also aggressively cleaned up as soon as they are not needed anymore to reduce the memory usage - the execute function is now single-dispatched making the implementation easier to locate and debug - the inputs now broadcasted to columnar shape so that the same implementation can be used for multiple input shape combinations, this removes several special cases from the implementation in exchange of a negligible performance overhead - there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, 
`columnwise`, `elementwise`, `serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes The new backend implementation has a higher feature coverage while the implementation is one third of the size of the previous one. BREAKING CHANGE: the `timecontext` feature is not supported anymore
…model (ibis-project#7797) Since we need to reimplement/port all of the backends for ibis-project#7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model and each operation was executing using a multidispatched function. While it served us well for a long time, it had a few drawbacks: - it was often hard to understand what was going on due to the complex preparation steps and various execution hooks - the multidispatched functions were hard to debug, additionally they supported a wide variety of inputs making the implementation rather bulky - due to the previous reaon, several inputs combinations were not supported, e.g. value operations with multiple columnar inputs - the `Scope` object was used to pass around the execution context which was created for each operation separately and the results were not reusable even though the same operation was executed multiple times The new execution model has changed in several ways: - there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model, this makes it much easier to implement the operations and also makes the input "plan" inspectable - the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient while also aggressively cleaned up as soon as they are not needed anymore to reduce the memory usage - the execute function is now single-dispatched making the implementation easier to locate and debug - the inputs now broadcasted to columnar shape so that the same implementation can be used for multiple input shape combinations, this removes several special cases from the implementation in exchange of a negligible performance overhead - there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, 
`columnwise`, `elementwise`, `serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes The new backend implementation has a higher feature coverage while the implementation is one third of the size of the previous one. BREAKING CHANGE: the `timecontext` feature is not supported anymore
…model (#7797) Since we need to reimplement/port all of the backends for #7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model and each operation was executing using a multidispatched function. While it served us well for a long time, it had a few drawbacks: - it was often hard to understand what was going on due to the complex preparation steps and various execution hooks - the multidispatched functions were hard to debug, additionally they supported a wide variety of inputs making the implementation rather bulky - due to the previous reaon, several inputs combinations were not supported, e.g. value operations with multiple columnar inputs - the `Scope` object was used to pass around the execution context which was created for each operation separately and the results were not reusable even though the same operation was executed multiple times The new execution model has changed in several ways: - there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model, this makes it much easier to implement the operations and also makes the input "plan" inspectable - the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient while also aggressively cleaned up as soon as they are not needed anymore to reduce the memory usage - the execute function is now single-dispatched making the implementation easier to locate and debug - the inputs now broadcasted to columnar shape so that the same implementation can be used for multiple input shape combinations, this removes several special cases from the implementation in exchange of a negligible performance overhead - there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, `columnwise`, `elementwise`, 
`serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes The new backend implementation has a higher feature coverage while the implementation is one third of the size of the previous one. BREAKING CHANGE: the `timecontext` feature is not supported anymore
…model (#7797) Since we need to reimplement/port all of the backends for #7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model and each operation was executing using a multidispatched function. While it served us well for a long time, it had a few drawbacks: - it was often hard to understand what was going on due to the complex preparation steps and various execution hooks - the multidispatched functions were hard to debug, additionally they supported a wide variety of inputs making the implementation rather bulky - due to the previous reaon, several inputs combinations were not supported, e.g. value operations with multiple columnar inputs - the `Scope` object was used to pass around the execution context which was created for each operation separately and the results were not reusable even though the same operation was executed multiple times The new execution model has changed in several ways: - there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model, this makes it much easier to implement the operations and also makes the input "plan" inspectable - the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient while also aggressively cleaned up as soon as they are not needed anymore to reduce the memory usage - the execute function is now single-dispatched making the implementation easier to locate and debug - the inputs now broadcasted to columnar shape so that the same implementation can be used for multiple input shape combinations, this removes several special cases from the implementation in exchange of a negligible performance overhead - there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, `columnwise`, `elementwise`, 
`serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes The new backend implementation has a higher feature coverage while the implementation is one third of the size of the previous one. BREAKING CHANGE: the `timecontext` feature is not supported anymore
…model (#7797) Since we need to reimplement/port all of the backends for #7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model and each operation was executing using a multidispatched function. While it served us well for a long time, it had a few drawbacks: - it was often hard to understand what was going on due to the complex preparation steps and various execution hooks - the multidispatched functions were hard to debug, additionally they supported a wide variety of inputs making the implementation rather bulky - due to the previous reaon, several inputs combinations were not supported, e.g. value operations with multiple columnar inputs - the `Scope` object was used to pass around the execution context which was created for each operation separately and the results were not reusable even though the same operation was executed multiple times The new execution model has changed in several ways: - there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model, this makes it much easier to implement the operations and also makes the input "plan" inspectable - the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient while also aggressively cleaned up as soon as they are not needed anymore to reduce the memory usage - the execute function is now single-dispatched making the implementation easier to locate and debug - the inputs now broadcasted to columnar shape so that the same implementation can be used for multiple input shape combinations, this removes several special cases from the implementation in exchange of a negligible performance overhead - there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, `columnwise`, `elementwise`, 
`serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes The new backend implementation has a higher feature coverage while the implementation is one third of the size of the previous one. BREAKING CHANGE: the `timecontext` feature is not supported anymore
…model (ibis-project#7797) Since we need to reimplement/port all of the backends for ibis-project#7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model and each operation was executing using a multidispatched function. While it served us well for a long time, it had a few drawbacks: - it was often hard to understand what was going on due to the complex preparation steps and various execution hooks - the multidispatched functions were hard to debug, additionally they supported a wide variety of inputs making the implementation rather bulky - due to the previous reaon, several inputs combinations were not supported, e.g. value operations with multiple columnar inputs - the `Scope` object was used to pass around the execution context which was created for each operation separately and the results were not reusable even though the same operation was executed multiple times The new execution model has changed in several ways: - there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model, this makes it much easier to implement the operations and also makes the input "plan" inspectable - the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient while also aggressively cleaned up as soon as they are not needed anymore to reduce the memory usage - the execute function is now single-dispatched making the implementation easier to locate and debug - the inputs now broadcasted to columnar shape so that the same implementation can be used for multiple input shape combinations, this removes several special cases from the implementation in exchange of a negligible performance overhead - there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, 
`columnwise`, `elementwise`, `serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes The new backend implementation has a higher feature coverage while the implementation is one third of the size of the previous one. BREAKING CHANGE: the `timecontext` feature is not supported anymore
…model (#7797) Since we need to reimplement/port all of the backends for #7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model and each operation was executing using a multidispatched function. While it served us well for a long time, it had a few drawbacks: - it was often hard to understand what was going on due to the complex preparation steps and various execution hooks - the multidispatched functions were hard to debug, additionally they supported a wide variety of inputs making the implementation rather bulky - due to the previous reaon, several inputs combinations were not supported, e.g. value operations with multiple columnar inputs - the `Scope` object was used to pass around the execution context which was created for each operation separately and the results were not reusable even though the same operation was executed multiple times The new execution model has changed in several ways: - there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model, this makes it much easier to implement the operations and also makes the input "plan" inspectable - the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient while also aggressively cleaned up as soon as they are not needed anymore to reduce the memory usage - the execute function is now single-dispatched making the implementation easier to locate and debug - the inputs now broadcasted to columnar shape so that the same implementation can be used for multiple input shape combinations, this removes several special cases from the implementation in exchange of a negligible performance overhead - there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, `columnwise`, `elementwise`, 
`serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes The new backend implementation has a higher feature coverage while the implementation is one third of the size of the previous one. BREAKING CHANGE: the `timecontext` feature is not supported anymore
…model (ibis-project#7797) Since we need to reimplement/port all of the backends for ibis-project#7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model and each operation was executing using a multidispatched function. While it served us well for a long time, it had a few drawbacks: - it was often hard to understand what was going on due to the complex preparation steps and various execution hooks - the multidispatched functions were hard to debug, additionally they supported a wide variety of inputs making the implementation rather bulky - due to the previous reaon, several inputs combinations were not supported, e.g. value operations with multiple columnar inputs - the `Scope` object was used to pass around the execution context which was created for each operation separately and the results were not reusable even though the same operation was executed multiple times The new execution model has changed in several ways: - there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model, this makes it much easier to implement the operations and also makes the input "plan" inspectable - the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient while also aggressively cleaned up as soon as they are not needed anymore to reduce the memory usage - the execute function is now single-dispatched making the implementation easier to locate and debug - the inputs now broadcasted to columnar shape so that the same implementation can be used for multiple input shape combinations, this removes several special cases from the implementation in exchange of a negligible performance overhead - there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, 
`columnwise`, `elementwise`, `serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes The new backend implementation has a higher feature coverage while the implementation is one third of the size of the previous one. BREAKING CHANGE: the `timecontext` feature is not supported anymore
…model (#7797) Since we need to reimplement/port all of the backends for #7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model and each operation was executing using a multidispatched function. While it served us well for a long time, it had a few drawbacks: - it was often hard to understand what was going on due to the complex preparation steps and various execution hooks - the multidispatched functions were hard to debug, additionally they supported a wide variety of inputs making the implementation rather bulky - due to the previous reaon, several inputs combinations were not supported, e.g. value operations with multiple columnar inputs - the `Scope` object was used to pass around the execution context which was created for each operation separately and the results were not reusable even though the same operation was executed multiple times The new execution model has changed in several ways: - there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model, this makes it much easier to implement the operations and also makes the input "plan" inspectable - the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient while also aggressively cleaned up as soon as they are not needed anymore to reduce the memory usage - the execute function is now single-dispatched making the implementation easier to locate and debug - the inputs now broadcasted to columnar shape so that the same implementation can be used for multiple input shape combinations, this removes several special cases from the implementation in exchange of a negligible performance overhead - there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, `columnwise`, `elementwise`, 
`serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes The new backend implementation has a higher feature coverage while the implementation is one third of the size of the previous one. BREAKING CHANGE: the `timecontext` feature is not supported anymore
…model (#7797)

Since we need to reimplement/port all of the backends for #7752, I took an attempt at reimplementing the pandas backend using a new execution engine. Previously the pandas backend was implemented using a top-down execution model, and each operation was executed using a multidispatched function. While it served us well for a long time, it had a few drawbacks:

- it was often hard to understand what was going on due to the complex preparation steps and various execution hooks
- the multidispatched functions were hard to debug; additionally, they supported a wide variety of inputs, making the implementation rather bulky
- due to the previous reason, several input combinations were not supported, e.g. value operations with multiple columnar inputs
- the `Scope` object was used to pass around the execution context; it was created for each operation separately, and the results were not reusable even though the same operation was executed multiple times

The new execution model has changed in several ways:

- there is a rewrite layer before execution which lowers the input expression to a form closer to the pandas execution model; this makes it much easier to implement the operations and also makes the input "plan" inspectable
- the execution is now topologically sorted and executed in a bottom-up manner; the intermediate results are reused, making the execution more efficient, and are aggressively cleaned up as soon as they are not needed anymore to reduce memory usage
- the execute function is now single-dispatched, making the implementation easier to locate and debug
- the inputs are now broadcast to columnar shape so that the same implementation can be used for multiple input shape combinations; this removes several special cases from the implementation in exchange for a negligible performance overhead
- there are helper utilities making it easier to implement compute kernels for the various value operations: `rowwise`, `columnwise`, `elementwise`, `serieswise`; if there are multiple implementations available for a given operation, the most efficient one is selected based on the input shapes

The new backend implementation has higher feature coverage while being one third of the size of the previous one.

BREAKING CHANGE: the `timecontext` feature is not supported anymore
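The last point — multiple kernel implementations with the most efficient one selected per operation — can be sketched in plain Python. This is a minimal illustration under assumed names: `elementwise`, `serieswise`, and `execute_value_op` are hypothetical, not ibis's actual helpers.

```python
# Hypothetical sketch of kernel selection; not ibis's real helper API.
def elementwise(func, column):
    # fallback kernel: apply a scalar function to each element
    return [func(x) for x in column]

def serieswise(func, column):
    # preferred kernel: operate on the whole column at once
    return func(column)

def execute_value_op(column, *, scalar_kernel=None, column_kernel=None):
    # when several implementations exist, pick the most efficient one:
    # a columnar kernel beats an element-by-element fallback
    if column_kernel is not None:
        return serieswise(column_kernel, column)
    return elementwise(scalar_kernel, column)

# only a scalar kernel is available -> elementwise fallback
doubled = execute_value_op([1, 2, 3], scalar_kernel=lambda x: x * 2)

# a columnar kernel is available -> used directly
totaled = execute_value_op(
    [1, 2, 3], column_kernel=lambda col: [x + sum(col) for x in col]
)

print(doubled)  # [2, 4, 6]
print(totaled)  # [7, 8, 9]
```

The same dispatch idea extends to `rowwise` and `columnwise` kernels; the point is that each operation registers whichever kernels it has, and the engine chooses based on input shapes.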
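The bottom-up execution model described above — topological sort, reused intermediates, and a single-dispatched execute function — can be illustrated with a small self-contained sketch. The node classes and the `execute`/`run` names here are invented for illustration, not the backend's real API.

```python
from functools import singledispatch

class Node:
    def __init__(self, *args):
        self.args = args

class Literal(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value

class Add(Node):
    pass

class Mul(Node):
    pass

def toposort(root):
    # depth-first walk producing dependencies before dependents
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for arg in node.args:
            visit(arg)
        order.append(node)
    visit(root)
    return order

@singledispatch
def execute(node, results):
    raise NotImplementedError(type(node))

@execute.register
def _(node: Literal, results):
    return node.value

@execute.register
def _(node: Add, results):
    left, right = node.args
    return results[id(left)] + results[id(right)]

@execute.register
def _(node: Mul, results):
    left, right = node.args
    return results[id(left)] * results[id(right)]

def run(root):
    # bottom-up: every node is computed exactly once; intermediates are
    # cached in `results` and shared by all of the node's consumers
    results = {}
    for node in toposort(root):
        results[id(node)] = execute(node, results)
    return results[id(root)]

shared = Add(Literal(1), Literal(2))        # computed once, used twice
expr = Mul(shared, Add(shared, Literal(7)))
print(run(expr))  # 30
```

Compared with a top-down multidispatched walk, each operation's handler is a single `@execute.register` function that only sees already-computed inputs, which is what makes the implementation easy to locate and debug.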
Rationale & History

In the last couple of years we have been constantly refactoring the internals to make them easier to work with. Although we have made great progress, the current codebase is still hard to maintain and extend. One example of that complexity is the attempt to remove the `Projector` class in #7430. I had to realize that we are unable to improve the internals in small incremental steps; we need to make a big leap forward to make the codebase maintainable in the long run.

One of the hotspots of problems is the `analysis.py` module, which tries to bridge the gap between the user-facing API and the internal representation. Part of its complexity is caused by loose integrity checks in the internal representation, allowing various ways to represent the same operation. This makes it hard to inspect, reason about, and optimize the relational operations. In addition, it makes it much harder to implement the backends, since more branching is required to cover all the variations.

We have always been aware of these problems, and we actually made several attempts to solve them the same way this PR does. However, we never managed to actually split the relational operations; we always hit roadblocks while trying to maintain compatibility with the existing test suite. We were unable to even understand those issues because of the complexity of the codebase and the number of indirections between the API, the analysis functions, and the internal representation.
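The difference between a loosely checked, do-everything node and split operations with strict per-node validation can be sketched like this. These are toy classes for illustration only; the operations actually introduced by this PR live in ibis's internals and differ in detail.

```python
from dataclasses import dataclass

# Before: one node bundling projection, filtering and sorting, so the same
# logical operation can be represented in many equivalent ways and every
# backend has to branch over all of them.
@dataclass(frozen=True)
class Selection:
    table: object
    selections: tuple = ()
    predicates: tuple = ()
    sort_keys: tuple = ()

# After: one small node per relational operation, each enforcing integrity
# checks that only make sense for that operation.
@dataclass(frozen=True)
class Project:
    parent: object
    values: dict  # column name -> value expression

    def __post_init__(self):
        if not self.values:
            raise ValueError("a projection must produce at least one column")

@dataclass(frozen=True)
class Filter:
    parent: object
    predicates: tuple

    def __post_init__(self):
        if not self.predicates:
            raise ValueError("a filter needs at least one predicate")

table = object()
rel = Filter(Project(table, {"a": "t.a"}), predicates=("t.a > 0",))
print(type(rel).__name__)  # Filter
```

With split nodes there is exactly one way to spell "project then filter", so inspection, optimization, and backend compilation no longer need to normalize equivalent representations first.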
But(!) we finally managed to prototype a new IR in #7580, along with implementations for the majority of the backends, including various SQL backends and `pandas`. After successfully validating the viability of the new IR, we split the PR into smaller pieces which can be individually reviewed. This PR is the first step of that process: it introduces the new IR and the new API. The next steps will be to implement the remaining backends on top of the new IR.

What does the PR do:
- Split the `ops.Selection` and `ops.Aggregation` nodes into proper relational algebra operations.
- Removed `analysis.py` along with the technical debt accumulated over the years.
- `.over()` is now able to bind the window frame to the relation.
- Added `ScalarSubquery`, `ExistsSubquery`, and `InSubquery` nodes, which can only be used in the appropriate context.
- Added `JoinChain` operations to represent multiple joins in a single operation, followed by a projection attached to the same relation. This made it possible to solve several outstanding issues with join handling (including the notorious chain join issue).
- Added `rewrites.py` to reinterpret user input so that the new operations can be constructed even with the strict integrity checks.
- Projections can no longer contain `ops.Alias` nodes. In addition to that, table nodes in projections are not allowed anymore; the columns are expanded to the same mapping, making the semantics clear.
- User input is now resolved through the `bind()` function.

Advantages:
Disadvantages:
Technicalities
Cherry-picked the changes related to the `IR` from #7580.

TODOs
These follow-ups must be addressed before doing a release: