-
Notifications
You must be signed in to change notification settings - Fork 603
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: add composable data ecosystem concept #7898
Merged
cpcloud
merged 6 commits into
ibis-project:main
from
lostmygithubaccount:composable-ecosystem
Jan 5, 2024
Merged
Changes from all commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
c8fdb76
docs: add composable data ecosystem concept
lostmygithubaccount 88a8a2a
updates
lostmygithubaccount 79ea272
spelling
lostmygithubaccount eaba0fb
drop lightweight from ADBC :)
lostmygithubaccount 279b7ea
remove comma
lostmygithubaccount b7d659d
fix typo
lostmygithubaccount File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
# Composable data ecosystem | ||
|
||
Ibis exists in a broader composable data ecosystem. [The Composable Codex by | ||
Voltron Data](https://voltrondata.com/codex) is the result of years of | ||
experience and hours of writing by experts in the field, providing an in-depth | ||
introduction the composable data systems. We'll take a look at: | ||
|
||
- [Apache Arrow](https://github.com/apache/arrow) | ||
- [Apache Arrow Database Connectivity (ADBC)](https://arrow.apache.org/adbc/current/index.html) | ||
- [Substrait](https://substrait.io/) | ||
|
||
and how they fit in with Ibis. | ||
|
||
## Overview | ||
|
||
Ibis is the portable Python dataframe API, supporting many backends. This is | ||
achieved by decoupling the API from the execution engine. While Ibis already | ||
relies on standards like Apache Arrow today, we expect the composable data | ||
ecosystem to mature and broader adoption of adjacent projects going forward. | ||
This will allow Ibis to simplify its implementation and improve performance for | ||
backends that support these standards. | ||
|
||
[The first chapter of The Composable Codex provides a great overview of where | ||
these projects fit | ||
in](https://voltrondata.com/codex/standards-over-silos#1-2-3-a-composable-ecosystem): | ||
|
||
![Standards](images/standards.png) | ||
|
||
And a table explaining the standards: | ||
|
||
| Label | Types of standards | Standards | | ||
| --- | --- | --- | | ||
| A | Intermediate representation | [Substrait](https://substrait.io) allows any user-interface that produces Substrait to pass the compute operations to a Substrait-consuming execution engine. You could swap any Substrait compatible user interfaces or execution engine. | | ||
| B | Connectivity | [Arrow Database Connectivity (ADBC)](https://arrow.apache.org/adbc/current/index.html) ensures that no matter where the computation is performed the data will be returned in the Arrow format. You can swap your execution engine and know that your downstream code will still work. | | ||
| C | Data memory layout | The [Apache Arrow in-memory data format](https://arrow.apache.org/docs/format/Columnar.html) ensures that the data can pass from the storage to the engine (and even across the systems in a distributed environment) and back to the user without slowing down to serialize and deserialize. | | ||
|
||
## History | ||
|
||
The composable data ecosystem has been envisioned for some time. [Wes | ||
McKinney](https://wesmckinney.com) has been instrumental in the development of | ||
the composable data ecosystem, co-founding Voltron Data, Apache Arrow, and | ||
initially creating Ibis. [Wes looked back on 15 years on the road to composable | ||
data systems](https://wesmckinney.com/blog/looking-back-15-years/) and gave some | ||
motivation for Ibis in his infamous ["Apache Arrow and the '10 Things I Hate | ||
About pandas'"](https://wesmckinney.com/blog/apache-arrow-pandas-internals/). | ||
|
||
Ibis started as a pandas-like API for Apache Impala, but has since expanded to | ||
support many backends. It currently leverages open-source projects like | ||
[SQLAlchemy](https://github.com/sqlalchemy/sqlalchemy) and | ||
[SQLGlot](https://github.com/tobymao/sqlglot) to work with many backends. While | ||
these projects are great, they rely on backend-specific SQL that does not | ||
constitute a standard. Going forward, we expect ADBC and Substrait to be the | ||
standards for connectivity and intermediate representation, respectively. | ||
|
||
## Apache Arrow | ||
|
||
Ibis uses [Apache Arrow](https://arrow.apache.org/) to provide a common data | ||
format for data interchange between Ibis and backends. Many backends also use | ||
Apache Arrows as their in-memory data format. | ||
|
||
### Dataframe interchange protocol | ||
|
||
Ibis supports [the dataframe interchange | ||
protocol](https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html) | ||
for data interchange between with other Python dataframe libraries and | ||
visualization libraries. This is largely efficient because most libraries | ||
support Apache Arrow as their in-memory data format, making interchange between | ||
them cheap and efficient. | ||
|
||
## Apache Arrow Database Connectivity (ADBC) | ||
|
||
[Apache Arrow Database Connectivity | ||
(ADBC)](https://arrow.apache.org/docs/format/ADBC.html) is a relatively new | ||
standard for database connectivity. It is a low-level protocol for exchanging | ||
data between a client and a database. It is a successor to ODBC and JDBC, and is | ||
designed to be a more modern and performant alternative to these standards. | ||
|
||
While Ibis does not currently use ADBC for its backends, as the project matures | ||
we expect an increase in performance and a decrease in complexity for backends | ||
that support ADBC. | ||
|
||
## Substrait | ||
|
||
[Substrait](https://substrait.io/) is a relatively new standard for | ||
cross-language serialization of relational algebra. It is intended as an | ||
intermediary representation between the user interface and other points in the | ||
data system. Ibis can already compile expressions to Substrait which can then be | ||
executed by Substrait-consuming backends. Support today is limited but, like | ||
ADBC, we expect the project to mature and for Ibis to leverage it more in the | ||
future. | ||
|
||
### Why not SQL? | ||
|
||
Structured Query Language (SQL) is not a standard. There is a commonly | ||
referenced [ANSI Standard for | ||
SQL](https://blog.ansi.org/sql-standard-iso-iec-9075-2023-ansi-x3-135) that you | ||
can pay a lot of money to access and most execution engines claim to support. | ||
However, most execution engines extend or subtly deviate from the standard and | ||
in practice it is not possible to simply reuse SQL from one execution engine on | ||
another. This leads to notoriously difficult database migrations and vendor | ||
lock-in. | ||
|
||
### Why Substrait? | ||
|
||
Substrait, unlike SQL, is not intended as a user interface. Instead, a user | ||
interface like Ibis in Python or dplyr in R would compile to Substrait and pass | ||
it to a Substrait-consuming execution engine. This allows the user interface to | ||
be decoupled from the execution engine, allowing for more flexibility and | ||
portability. | ||
|
||
## Going forward | ||
|
||
Going forward, Ibis intends to leverage other standards in the broader | ||
composable data ecosystem to simplify its implementation and improve | ||
performance. |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah whoops, I didn't get to this in time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make a PR to update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR here: #7941