Skip to content

Commit

Permalink
docs: add composable data ecosystem concept (#7898)
Browse files Browse the repository at this point in the history
## Description of changes

add conceptual article for the composable data ecosystem, explaining the
relationship between Ibis and other key projects (Apache Arrow, ADBC,
Substrait)

very WIP. some TODOs:

- [x] finish prose
- [x] formatting
- [x] share w/ other projects for review/feedback/contribution

I plan to finish my first pass on this ~tomorrow and share w/ other
projects. some open questions:

- is it worth mentioning Flight and FlightSQL? I kinda feel like it's
confusing and focusing on ADBC as the connectivity layer is better
- are there other projects we should mention? perhaps SQLGlot fits in
here?
- Substrait: working through the finer points of why SQL isn't
sufficient and Substrait is needed

## Issues closed

closes #6618
  • Loading branch information
lostmygithubaccount authored Jan 5, 2024
1 parent e58be2e commit d78a887
Show file tree
Hide file tree
Showing 2 changed files with 115 additions and 0 deletions.
115 changes: 115 additions & 0 deletions docs/concepts/composable-ecosystem.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
# Composable data ecosystem

Ibis exists in a broader composable data ecosystem. [The Composable Codex by
Voltron Data](https://voltrondata.com/codex) is the result of years of
experience and hours of writing by experts in the field, providing an in-depth
introduction the composable data systems. We'll take a look at:

- [Apache Arrow](https://github.com/apache/arrow)
- [Apache Arrow Database Connectivity (ADBC)](https://arrow.apache.org/adbc/current/index.html)
- [Substrait](https://substrait.io/)

and how they fit in with Ibis.

## Overview

Ibis is the portable Python dataframe API, supporting many backends. This is
achieved by decoupling the API from the execution engine. While Ibis already
relies on standards like Apache Arrow today, we expect the composable data
ecosystem to mature and broader adoption of adjacent projects going forward.
This will allow Ibis to simplify its implementation and improve performance for
backends that support these standards.

[The first chapter of The Composable Codex provides a great overview of where
these projects fit
in](https://voltrondata.com/codex/standards-over-silos#1-2-3-a-composable-ecosystem):

![Standards](images/standards.png)

And a table explaining the standards:

| Label | Types of standards | Standards |
| --- | --- | --- |
| A | Intermediate representation | [Substrait](https://substrait.io) allows any user-interface that produces Substrait to pass the compute operations to a Substrait-consuming execution engine. You could swap any Substrait compatible user interfaces or execution engine. |
| B | Connectivity | [Arrow Database Connectivity (ADBC)](https://arrow.apache.org/adbc/current/index.html) ensures that no matter where the computation is performed the data will be returned in the Arrow format. You can swap your execution engine and know that your downstream code will still work. |
| C | Data memory layout | The [Apache Arrow in-memory data format](https://arrow.apache.org/docs/format/Columnar.html) ensures that the data can pass from the storage to the engine (and even across the systems in a distributed environment) and back to the user without slowing down to serialize and deserialize. |

## History

The composable data ecosystem has been envisioned for some time. [Wes
McKinney](https://wesmckinney.com) has been instrumental in the development of
the composable data ecosystem, co-founding Voltron Data, Apache Arrow, and
initially creating Ibis. [Wes looked back on 15 years on the road to composable
data systems](https://wesmckinney.com/blog/looking-back-15-years/) and gave some
motivation for Ibis in his infamous ["Apache Arrow and the '10 Things I Hate
About pandas'"](https://wesmckinney.com/blog/apache-arrow-pandas-internals/).

Ibis started as a pandas-like API for Apache Impala, but has since expanded to
support many backends. It currently leverages open-source projects like
[SQLAlchemy](https://github.com/sqlalchemy/sqlalchemy) and
[SQLGlot](https://github.com/tobymao/sqlglot) to work with many backends. While
these projects are great, they rely on backend-specific SQL that does not
constitute a standard. Going forward, we expect ADBC and Substrait to be the
standards for connectivity and intermediate representation, respectively.

## Apache Arrow

Ibis uses [Apache Arrow](https://arrow.apache.org/) to provide a common data
format for data interchange between Ibis and backends. Many backends also use
Apache Arrows as their in-memory data format.

### Dataframe interchange protocol

Ibis supports [the dataframe interchange
protocol](https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html)
for data interchange between with other Python dataframe libraries and
visualization libraries. This is largely efficient because most libraries
support Apache Arrow as their in-memory data format, making interchange between
them cheap and efficient.

## Apache Arrow Database Connectivity (ADBC)

[Apache Arrow Database Connectivity
(ADBC)](https://arrow.apache.org/docs/format/ADBC.html) is a relatively new
standard for database connectivity. It is a low-level protocol for exchanging
data between a client and a database. It is a successor to ODBC and JDBC, and is
designed to be a more modern and performant alternative to these standards.

While Ibis does not currently use ADBC for its backends, as the project matures
we expect an increase in performance and a decrease in complexity for backends
that support ADBC.

## Substrait

[Substrait](https://substrait.io/) is a relatively new standard for
cross-language serialization of relational algebra. It is intended as an
intermediary representation between the user interface and other points in the
data system. Ibis can already compile expressions to Substrait which can then be
executed by Substrait-consuming backends. Support today is limited but, like
ADBC, we expect the project to mature and for Ibis to leverage it more in the
future.

### Why not SQL?

Structured Query Language (SQL) is not a standard. There is a commonly
referenced [ANSI Standard for
SQL](https://blog.ansi.org/sql-standard-iso-iec-9075-2023-ansi-x3-135) that you
can pay a lot of money to access and most execution engines claim to support.
However, most execution engines extend or subtly deviate from the standard and
in practice it is not possible to simply reuse SQL from one execution engine on
another. This leads to notoriously difficult database migrations and vendor
lock-in.

### Why Substrait?

Substrait, unlike SQL, is not intended as a user interface. Instead, a user
interface like Ibis in Python or dplyr in R would compile to Substrait and pass
it to a Substrait-consuming execution engine. This allows the user interface to
be decoupled from the execution engine, allowing for more flexibility and
portability.

## Going forward

Going forward, Ibis intends to leverage other standards in the broader
composable data ecosystem to simplify its implementation and improve
performance.
Binary file added docs/concepts/images/standards.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit d78a887

Please sign in to comment.