diff --git a/.codespellrc b/.codespellrc index 2d2ec2582825..c8b6651c483a 100644 --- a/.codespellrc +++ b/.codespellrc @@ -3,3 +3,4 @@ skip = *.lock,.direnv,.git,./docs/_freeze,./docs/_output/**,./docs/_inv/**,docs/_freeze/**,*.svg,*.css,*.html,*.js ignore-regex = \b(i[if]f|I[IF]F|AFE)\b builtin = clear,rare,names +ignore-words-list = tim diff --git a/docs/posts/roadmap-2024-H1/commits.png b/docs/posts/roadmap-2024-H1/commits.png new file mode 100644 index 000000000000..f84cb6d3c50e Binary files /dev/null and b/docs/posts/roadmap-2024-H1/commits.png differ diff --git a/docs/posts/roadmap-2024-H1/index.qmd b/docs/posts/roadmap-2024-H1/index.qmd new file mode 100644 index 000000000000..576031eac221 --- /dev/null +++ b/docs/posts/roadmap-2024-H1/index.qmd @@ -0,0 +1,231 @@ +--- +title: "Ibis project 2024 roadmap" +author: "Cody Peterson" +date: "2024-02-15" +image: commits.png +draft: true +categories: + - blog + - roadmap + - community +--- + +## Overview + +Welcome to the first public roadmap for the Ibis project! If you aren't familiar +with the background of Ibis or who supports it nowadays, we recommend reading +[why Voltron Data supports Ibis](../why-voda-supports-ibis/index.qmd) before the +roadmap below. + +## 2024 roadmap + +We have a [public roadmap as a GitHub +project!](https://github.com/orgs/ibis-project/projects/5) + +![Ibis roadmap](roadmap.png) + +We are early in our use of this GitHub project, so please pardon any +disorganization as we get it up and running efficiently. 
In general, we have: + +- **Roadmap view**: consisting of meta-issues in their respective repositories + for high-level objectives of the Ibis project +- **Triage view**: consisting of new issues across Ibis project repositories + that need to be triaged +- **Backlog view**: consisting of issues that have been triaged (assigned a + priority) and are on the backlog +- **TODO view**: consisting of issues that are in progress or ready to be worked + on soon +- **Label-specific views**: consisting of issues for specific labels, like + documentation or a large refactor + +Right now, [the team at Voltron Data](../why-voda-supports-ibis/index.qmd) sets +the roadmap and priorities. Over time as more contributors and organizations +join the project, we expect this process to diversify and become more +community-driven. We'd love to have you involved in the process! [Join us on +Zulip](https://ibis-project.zulipchat.com) or interact with us on +[GitHub](https://github.com/ibis-project/ibis) to get involved and join the +decision making process for Ibis. + +### Overall themes + +Our top five themes for 2024 include: + +1. **Ibis backends**: Ibis is a Python frontend for many backends. To continue +scaling to more backends, we need to complete a major rework of library +internals and stabilize the API for backend authors. Related work in this area +will make it easier than ever to create new Ibis backends and maintain them. +This work will also include improving backend interfaces for operations like +table creation, insertion, and upsertion. This theme allows Ibis to deliver on +the promise of a single Python dataframe API that can be written once and run on +any execution engine. + +2. **Ibis for ML**: Increasingly, data projects are ML projects. Ibis can +uniquely help with feature engineering and other ML tasks connecting your data +where it lives to ML models. We will continue to improve Ibis for ML use cases. 
+This theme allows Ibis to cover more of the data and MLOps lifecycle, with +efficient feature engineering and handoff to ML training frameworks. + +3. **Ibis for streaming data**: Ibis supported only batch data until very +recently. With the addition of the first streaming backends, we will continue to +improve Ibis for streaming data use cases and bridge the gap between batch and +streaming data. This theme allows Ibis to expand its promise of a single Python +dataframe to stream processing, too. + +4. **Ibis for geospatial**: Ibis has a rich set of geospatial expressions, but +most backends do not implement them. We will continue to improve Ibis for +geospatial use cases and bridge the gap between geospatial data and other data +types. This theme allows Ibis to cover more of the data lifecycle for geospatial +data. + +5. **Ibis community**: Ibis is an open source project and we want to make it as +easy as possible for new contributors to get involved. We will continue to +improve the Ibis community and make it easier than ever to contribute to Ibis. +This theme is critical for Ibis to continue to grow and thrive as an open source +project. We aim to delight our community and make it easy to get involved. + +We believe these themes will help establish Ibis as a standard Python interface for many +backends and real-world data use cases. + +### The big refactor + +The biggest item in Q1 2024 and primary focus of the core Ibis team right now is +the big refactor -- dubbed "the epic split" -- continuing the great work +completed by Krisztián in [his PR splitting the relational +operations](https://github.com/ibis-project/ibis/pull/7752). You can read more +details in that PR, but the gist is that a new intermediary representation for +Ibis expressions has been created that drastically simplifies the +codebase. + +With that refactor in place, each backend Ibis supports needs to be moved to the +new relational model.
As a consequence, we are also swapping out SQLAlchemy for +[SQLGlot](https://github.com/tobymao/sqlglot). We are losing out on some of the +things SQLAlchemy did for us automatically, but overall this gives us a lot more +control over the SQL that is generated, reduces dependency overhead, and +simplifies the codebase further. + +::: {.callout-note} +We are targeting a release in Ibis 9.0. Look out for a blog post dedicated to the +refactor soon! +::: + +### Ibis for ML preprocessing + +Data projects are increasingly ML projects. pandas and scikit-learn are the +default for Python users, but tend to lack scalability. Many projects look to +address this and Ibis does not intend to duplicate that effort. Instead, we +want to apply what sets Ibis apart -- the ability to have a single Python API +that scales across many backends -- to feature engineering and other ML +preprocessing tasks ahead of model training. + +Jim took this on over the last few months, building up the +[IbisML](https://github.com/ibis-project/ibisml) package to a usable (but still +toy) state. We will further invest in IbisML this year to get it to a +production-ready state, bringing the power of Ibis to ML feature engineering. + +We're [excited to welcome the (former) Claypot AI team to Voltron +Data](https://voltrondata.com/resources/voltron-data-acquires-claypot-ai) to +help drive this work forward! Expect a release announcement for IbisML soon +covering the majority of feature engineering operations and handoff to popular +ML training frameworks. + +::: {.callout-note collapse="true" title="LLMs: the Ibis Birdbrain project"} +I've been working on a new LLM integration for Ibis called `ibis-birdbrain`. +**It's highly experimental and still a work in progress**, but keep an eye out +for more details soon! +::: + +### Streaming data backends + +With the release of Ibis 8.0, we added support for Apache Flink in collaboration +with Claypot AI, the first dedicated streaming data backend for Ibis.
+ +::: {.callout-note} +Since writing this roadmap, [Voltron Data has acquired Claypot +AI](https://voltrondata.com/resources/voltron-data-acquires-claypot-ai)! We are +excited to welcome the Claypot team and continue to build the composable data +ecosystem with their streaming and ML expertise. +::: + +We've also collaborated with [RisingWave](https://risingwave.com/) on the second +streaming backend, which was merged recently. This backend is still early and +fairly experimental, but demonstrates the ability for Ibis to quickly add new +backends. We can now add batch and streaming backends with ease! + +### Geospatial improvements + +Ibis supports [50+ geospatial +expressions](https://ibis-project.org/reference/expression-geospatial) in the +API, but most backends do not implement them. + +::: {.callout-note} +This is a great opportunity for new contributors to get involved with Ibis! Let +us know if you're interested in adding geospatial support to your favorite +backend. +::: + +### Community engagement + +Hello! Expect to see an increased presence from the Ibis project in the form of +blogs, conference talks, video content, and more in 2024. [Join us on +Zulip](https://ibis-project.zulipchat.com) to discuss ideas and get involved! + +We would love to onboard new contributors to the project. + +### New backends + +Adding new backends is not a priority for the Ibis team at Voltron Data in Q1. +Instead, we are focusing on [the big refactor](#the-big-refactor) and other +internal library improvements to get Ibis to the point where adding new backends +is much easier and more maintainable. That will take the form of stabilizing the new +intermediary representation, separating out **connection** from **compilation** +steps, and solidifying the API for backend authors. We will also introduce new +documentation and possibly testing frameworks to ease the burden of adding new +backends. + +We are still happy to support new backends!
Some have already been mentioned, +but the backends being added in Q1 include: + +- Apache Flink +- Exasol +- RisingWave + +Adding a new backend is a great way to get involved with Ibis! If you're +interested, [join us on Zulip](https://ibis-project.zulipchat.com) and let us +know or [open an issue on +GitHub](https://github.com/ibis-project/ibis/issues/new/choose). + +### Logo and website design + +We will likely engage an external design firm to help us redesign the logo +(initially created by Tim Swast, thanks Tim! It has served us well!) and website +theme. We aim to keep the website simple and focused on documentation that helps +users, but want to deviate from the default themes in Quarto to make Ibis stand +out. + +### Documentation + +> "When you're ~~selling~~ distributing free and open source software, the +> documentation is the product." - old tech adage, origin unknown + +A few months ago, we moved our documentation to [Quarto](https://quarto.org) and +revamped most of the website along the way. We will continue improving the +documentation with backend-specific getting started tutorials, how-to guides for +common tasks, improved API references, better website search +functionality, and more! + +Improving the documentation is a great way to get involved with Ibis! + +## Beyond Q1 2024 + +This writeup of our roadmap is heavily biased toward Q1 of 2024. Looking further out, +our priorities remain much the same. After the big refactor is done, we will +continue improving our library internals and backend interface and ensuring the +longevity of Ibis. We'll continue improving ML, streaming, and geospatial +support. + +Expect an updated roadmap blog in the second half of the year for more details! + +## Next steps + +It's never been a better time to get involved with Ibis.
[Join us on Zulip and +introduce yourself!](https://ibis-project.zulipchat.com/) diff --git a/docs/posts/roadmap-2024-H1/roadmap.png b/docs/posts/roadmap-2024-H1/roadmap.png new file mode 100644 index 000000000000..51e787e0432f Binary files /dev/null and b/docs/posts/roadmap-2024-H1/roadmap.png differ diff --git a/docs/posts/why-voda-supports-ibis/commits.png b/docs/posts/why-voda-supports-ibis/commits.png new file mode 100644 index 000000000000..f84cb6d3c50e Binary files /dev/null and b/docs/posts/why-voda-supports-ibis/commits.png differ diff --git a/docs/posts/why-voda-supports-ibis/index.qmd b/docs/posts/why-voda-supports-ibis/index.qmd new file mode 100644 index 000000000000..0529e0397256 --- /dev/null +++ b/docs/posts/why-voda-supports-ibis/index.qmd @@ -0,0 +1,237 @@ +--- +title: "Why Voltron Data supports Ibis" +author: "Cody Peterson + Ian Cook" +date: "2024-02-10" +image: standards.png +categories: + - blog +--- + +## Overview + +The Ibis project is an [independently +governed](https://github.com/ibis-project/governance) open source community +project to build and maintain **the portable Python dataframe library**. Ibis +has [contributors](https://github.com/ibis-project/ibis/graphs/contributors) +across a range of data companies and institutions. Today the core Ibis +maintainers are employed by [Voltron Data](https://voltrondata.com). Voltron +Data’s support of Ibis is a part of its strategy to enable modular and +composable systems for data analytics. + +## Background + +The Ibis project was started in 2015 by [Wes McKinney](https://wesmckinney.com), +the creator of pandas and a cofounder of Voltron Data, as a pandas-like +interface to Apache Impala. It received improvements and support over the years, +but really took off under the stewardship of [Phillip +Cloud](https://github.com/cpcloud) and the [current Ibis team at Voltron +Data](#who-are-the-core-contributors). It now supports 20+ backends and is +improving rapidly. 
It's never been a better time to get involved with Ibis. + +You can see the inflection point in the number of commits to the repository in +early 2022: + +![Ibis commits over time](commits.png) + +### Who are we? + +#### Cody + +My name is Cody and I'm employed by Voltron Data to work on Ibis full-time as a +Technical Product Manager. I am an Ibis contributor and have created the Delta +Lake table input/output methods, helped move the documentation over to +[Quarto](https://quarto.org), and created the [Zulip +chat](https://ibis-project.zulipchat.com) for the community. + +My job is to help the Ibis community grow and thrive. I have a background in ML +(especially MLOps) and data products. Ibis solves many challenges I've seen in +the data space and I'm excited to help increase its adoption as a standard +Python frontend for dozens of data backends to reduce friction in the data +ecosystem. + +#### Ian + +I'm Ian, Director of Product Management at Voltron Data. I'm an Apache Arrow +contributor and I have a decade of experience working with SQL, dataframe APIs, +open standards, and distributed systems. + +My job is to align our open source engineering work at Voltron Data with the +needs and priorities of our stakeholders, including all the projects and +organizations that depend on Arrow and Ibis. I also launched [Voltron Data's +enterprise support product](https://voltrondata.com/enterprise-support) to help +make projects like Arrow and Ibis into safe, smart choices for companies to +build into business-critical applications. + +### Why does Voltron Data (VoDa) support Ibis? + +Why does Voltron Data employ a Technical Product Manager to work on Ibis +full-time? Why does Voltron Data employ five software engineers to work on Ibis +full-time? Great questions!
+ +::: {.callout-note title="The Composable Codex"} +To understand Voltron Data -- or if you're generally interested in learning +about the composable data ecosystem -- check out [The Composable +Codex by Voltron Data](https://voltrondata.com/codex). +::: + +Voltron Data is a startup company founded in 2021 with the goal of making it +possible for organizations to build modular, composable, high-performance +systems for data analytics. Voltron Data advances open standards (like [Apache +Arrow](https://arrow.apache.org) and [Substrait](https://substrait.io)) and +builds software components that embrace these standards for maximum +interoperability and performance. This includes free and open source software +like Ibis. + +![Standards](standards.png) + +This also includes a commercially licensed product: +[Theseus](https://voltrondata.com/theseus), Voltron Data’s accelerator-native +data processing engine. Theseus is a separate project built by a different team +at Voltron Data. Most Ibis contributors do not need to know about Theseus. Most +Ibis users will probably never use Theseus. But the two projects are related +parts of Voltron Data’s strategy. + +The strategy goes like this: + +- Big changes are afoot in the world of computing hardware. [Moore’s + Law](https://en.wikipedia.org/wiki/Moore%27s_law) is coming to an end. + Accelerated hardware such as NVIDIA GPUs is becoming more important for + performance and efficiency, not just for AI and ML, but for data analytics too. +- Big changes in hardware require big changes in software. But software has + lagged behind in some areas. Distributed big data analytics is one area where + the performance of existing software has lagged way behind hardware. +- Organizations choose data platforms based on many factors. No one chooses a + data platform _only_ because it can run very fast on accelerated hardware.
+- Wouldn’t it be nice if there were a composable data processing engine that + could be embedded into _any_ data platform to enable that platform to run jobs + on accelerated hardware? Then the platform builders wouldn’t need to duplicate + efforts. And organizations wouldn’t need to migrate to different platforms to + accelerate their workloads. +- Voltron Data built Theseus to be that composable engine. +- Voltron Data is _not_ building a platform (PaaS or SaaS) around Theseus. + Instead, it is partnering with other companies and organizations to embed + Theseus in their platforms. +- Theseus is great for very large-scale ETL workloads with very high throughput + needs. For other types of workloads, other engines are better. Voltron Data + wants to make it easy to choose the best engine for your workload. Your choice + of engine or platform should not limit your choice of other tools. +- For example (this is where Ibis comes in!), your choice of engine or platform + should not limit your choice of Python dataframe API. So Ibis works with 20+ + engines, including Theseus. + +Ultimately, Voltron Data will be more successful if Ibis is successful. And the +same is true of any other company with an engine or platform that is the best +choice for some type of workload. Ibis makes it easy to switch which engine +you’re using. So the only reason to fear Ibis is if you have an uncompetitive +engine and your strategy for retaining customers is to lock them into using a +proprietary API that prevents them from switching. + +::: {.callout-note collapse="true" title="Why not the pandas API?"} +This is a great, and natural, question -- if Voltron Data wants a standard +Python dataframe API, why not just use pandas? The reason is relatively simple: +the pandas API inherently does not scale. This is largely due to the expectation +of ordered results and the index. 
pandas is implemented for single-threaded +execution and has a lot of baggage when it comes to distributed execution. While +projects like Modin and pandas on Spark (formerly Koalas) attempt to scale the +pandas API, any project that attempts the feat is doomed to a dubious support +matrix of operations. + +Instead, Wes McKinney envisioned Ibis as a portable Python dataframe where the +API is decoupled from the execution engine. Ibis code scales to the backend it +is connected to. Any other Python dataframe library locks you into its execution +engine. While they may claim to be easy to migrate to, this is rarely the case. +The founders of Voltron Data experienced these pains with the pandas API +themselves in previous efforts, including cuDF. For Theseus and as an +open source standard, we believe Ibis is the right approach. + +Instead of using Snowpark Python for Snowflake, you can use Ibis on Snowflake. +Instead of using PySpark or pandas on Spark, you can use Ibis on Spark. Instead +of using the pandas API on BigQuery (built on top of Ibis), you can use Ibis on +BigQuery. Instead of using PyStarburst on Starburst Galaxy, you can use Ibis on +Starburst Galaxy. Instead of using the Polars Python API on the Polars execution +engine, you can use Ibis on Polars. Instead of using DataFusion Python on +the DataFusion execution engine, you can use Ibis on DataFusion. Instead of +executing SQL strings on DuckDB through the Python client, you can use Ibis on +DuckDB. And so on... + +Ibis brings a Python dataframe interface to data platforms that only have SQL, +and brings a standard Python dataframe interface to data platforms that have +their own Python dataframe interface. It is the only portable Python dataframe +that can serve as a standard across the data ecosystem. +::: + +Voltron Data supports Ibis because it can serve as a universal Python dataframe +API for **any** backend engine.
Ibis works great whether you need to query a +CSV file on your laptop with DuckDB, run a big ETL job in the cloud with +Snowflake or Starburst Galaxy, or process hundreds of terabytes in minutes on +a platform running Theseus on NVIDIA GPUs. With Ibis, you have the choice of +20+ backends, and the code you write is the same regardless of which backend +you choose to use. + +::: {.callout-important title="Ibis is independently governed"} +[Ibis is independently governed](https://github.com/ibis-project/governance) and +not owned by Voltron Data. Currently, four out of five members of the steering +committee are employed by Voltron Data (the fifth being at Alphabet working on +Google BigQuery). We are working toward a more diverse representation of +companies and organizations as the project continues to grow. + +Voltron Data also welcomes this dilution of power and influence! A healthy +open source project is one that is not controlled by a single entity. This is +true of [Apache Arrow](https://arrow.apache.org) and other open source projects +that Voltron Data employees have been instrumental in building. +::: + +### Who are the core contributors? + +The core contributors working full-time on Ibis are employed at Voltron Data, +with deep experience on successful open source projects including pandas, +Apache Arrow, Dask, and more. Everything in the Ibis project is made possible by +their hard work! 
They are: + +- [**Gil Forsyth**](https://github.com/gforsyth): long-time Ibis contributor and + primary maintainer of the `ibis-substrait` package +- [**Jim Crist-Harif**](https://github.com/jcrist): the engineering manager for + the Ibis team at Voltron Data +- [**Krisztián Szűcs**](https://github.com/kszucs): long-time Ibis contributor + and primary author of the precursor to [the big refactor](#the-big-refactor) +- [**Naty Clementi**](https://github.com/ncclementi): newest member of the Ibis + team at Voltron Data recently focusing on [geospatial support in + DuckDB](#geospatial-improvements) +- [**Phillip Cloud**](https://github.com/cpcloud): the tech lead for the Ibis + team at Voltron Data + +If you're interacting with us on GitHub or Zulip, you'll definitely run into at +least one of them! They make the Ibis project the delightful software it is +today and are always happy to help. + +### Who else supports Ibis? + +Anybody who contributes to Ibis is a supporter of Ibis! You can contribute by +[opening an issue](https://github.com/ibis-project/ibis/issues), [submitting a +pull request](https://github.com/ibis-project/ibis/pulls), [using Ibis in your +project](https://github.com/ibis-project/ibis/network/dependents), or [joining +the Zulip chat](https://ibis-project.zulipchat.com) to discuss problems or +ideas. 
+ +Notable organizations that support Ibis include: + +- [**Claypot AI**](https://www.claypot.ai/): contributing the Apache Flink + backend +- [**Exasol**](https://www.exasol.com/): contributing the Exasol backend +- [**Google's BigQuery + DataFrames**](https://github.com/googleapis/python-bigquery-dataframes): a + pandas API for BigQuery built on top of Ibis +- [**RisingWave**](https://risingwave.com/): contributing the RisingWave backend +- [**SingleStore**](https://github.com/singlestore-labs/ibis-singlestoredb): + creating a SingleStore backend +- [**Starburst + Galaxy**](https://www.starburst.io/blog/introducing-python-dataframes/): + supporting Ibis alongside their native PyStarburst dataframes +- [**SuperDuperDB**](https://github.com/SuperDuperDB/superduperdb): bringing AI + to any database Ibis supports + +## Next steps + +If you're interested in partnering with the Ibis project and Voltron Data, get +in touch! It's never been a better time to get involved with Ibis. [Join us on +Zulip and introduce yourself!](https://ibis-project.zulipchat.com/) diff --git a/docs/posts/why-voda-supports-ibis/roadmap.png b/docs/posts/why-voda-supports-ibis/roadmap.png new file mode 100644 index 000000000000..51e787e0432f Binary files /dev/null and b/docs/posts/why-voda-supports-ibis/roadmap.png differ diff --git a/docs/posts/why-voda-supports-ibis/standards.png b/docs/posts/why-voda-supports-ibis/standards.png new file mode 100644 index 000000000000..93a8a08b7e89 Binary files /dev/null and b/docs/posts/why-voda-supports-ibis/standards.png differ