Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add composable data ecosystem concept #7898

Merged
merged 6 commits into from
Jan 5, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
115 changes: 115 additions & 0 deletions docs/concepts/composable-ecosystem.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
# Composable data ecosystem

Ibis exists in a broader composable data ecosystem. [The Composable Codex by
Voltron Data](https://voltrondata.com/codex) is the result of years of
experience and hours of writing by experts in the field, providing an in-depth
introduction the composable data systems. We'll take a look at:

- [Apache Arrow](https://github.com/apache/arrow)
- [Apache Arrow Database Connectivity (ADBC)](https://arrow.apache.org/adbc/current/index.html)
- [Substrait](https://substrait.io/)

and how they fit in with Ibis.

## Overview

Ibis is the portable Python dataframe API, supporting many backends. This is
achieved by decoupling the API from the execution engine. While Ibis already
relies on standards like Apache Arrow today, we expect the composable data
ecosystem to mature and broader adoption of adjacent projects going forward.
This will allow Ibis to simplify its implementation and improve performance for
backends that support these standards.

[The first chapter of The Composable Codex provides a great overview of where
these projects fit
in](https://voltrondata.com/codex/standards-over-silos#1-2-3-a-composable-ecosystem):

![Standards](images/standards.png)

And a table explaining the standards:

| Label | Types of standards | Standards |
| --- | --- | --- |
| A | Intermediate representation | [Substrait](https://substrait.io) allows any user-interface that produces Substrait to pass the compute operations to a Substrait-consuming execution engine. You could swap any Substrait compatible user interfaces or execution engine. |
| B | Connectivity | [Arrow Database Connectivity (ADBC)](https://arrow.apache.org/adbc/current/index.html) ensures that no matter where the computation is performed the data will be returned in the Arrow format. You can swap your execution engine and know that your downstream code will still work. |
| C | Data memory layout | The [Apache Arrow in-memory data format](https://arrow.apache.org/docs/format/Columnar.html) ensures that the data can pass from the storage to the engine (and even across the systems in a distributed environment) and back to the user without slowing down to serialize and deserialize. |

## History

The composable data ecosystem has been envisioned for some time. [Wes
McKinney](https://wesmckinney.com) has been instrumental in the development of
the composable data ecosystem, co-founding Voltron Data, Apache Arrow, and
initially creating Ibis. [Wes looked back on 15 years on the road to composable
data systems](https://wesmckinney.com/blog/looking-back-15-years/) and gave some
motivation for Ibis in his infamous ["Apache Arrow and the '10 Things I Hate
About pandas'"](https://wesmckinney.com/blog/apache-arrow-pandas-internals/).

Ibis started as a pandas-like API for Apache Impala, but has since expanded to
support many backends. It currently leverages open-source projects like
[SQLAlchemy](https://github.com/sqlalchemy/sqlalchemy) and
[SQLGlot](https://github.com/tobymao/sqlglot) to work with many backends. While
these projects are great, they rely on backend-specific SQL that does not
constitute a standard. Going forward, we expect ADBC and Substrait to be the
standards for connectivity and intermediate representation, respectively.

## Apache Arrow

Ibis uses [Apache Arrow](https://arrow.apache.org/) to provide a common data
format for data interchange between Ibis and backends. Many backends also use
Apache Arrows as their in-memory data format.

### Dataframe interchange protocol

Ibis supports [the dataframe interchange
protocol](https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html)
for data interchange between with other Python dataframe libraries and
visualization libraries. This is largely efficient because most libraries
support Apache Arrow as their in-memory data format, making interchange between
them cheap and efficient.

## Apache Arrow Database Connectivity (ADBC)

[Apache Arrow Database Connectivity
(ADBC)](https://arrow.apache.org/docs/format/ADBC.html) is a relatively new
standard for database connectivity. It is a low-level protocol for exchanging
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
standard for database connectivity. It is a low-level protocol for exchanging
standard for database connectivity. It is an API for exchanging

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah whoops, I didn't get to this in time

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a PR to update

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR here: #7941

data between a client and a database. It is a successor to ODBC and JDBC, and is
designed to be a more modern and performant alternative to these standards.

While Ibis does not currently use ADBC for its backends, as the project matures
we expect an increase in performance and a decrease in complexity for backends
that support ADBC.

## Substrait

[Substrait](https://substrait.io/) is a relatively new standard for
cross-language serialization of relational algebra. It is intended as an
intermediary representation between the user interface and other points in the
data system. Ibis can already compile expressions to Substrait which can then be
executed by Substrait-consuming backends. Support today is limited but, like
ADBC, we expect the project to mature and for Ibis to leverage it more in the
future.

### Why not SQL?

Structured Query Language (SQL) is not a standard. There is a commonly
referenced [ANSI Standard for
SQL](https://blog.ansi.org/sql-standard-iso-iec-9075-2023-ansi-x3-135) that you
can pay a lot of money to access and most execution engines claim to support.
However, most execution engines extend or subtly deviate from the standard and
in practice it is not possible to simply reuse SQL from one execution engine on
another. This leads to notoriously difficult database migrations and vendor
lock-in.

### Why Substrait?

Substrait, unlike SQL, is not intended as a user interface. Instead, a user
interface like Ibis in Python or dplyr in R would compile to Substrait and pass
it to a Substrait-consuming execution engine. This allows the user interface to
be decoupled from the execution engine, allowing for more flexibility and
portability.

## Going forward

Going forward, Ibis intends to leverage other standards in the broader
composable data ecosystem to simplify its implementation and improve
performance.
Binary file added docs/concepts/images/standards.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading