diff --git a/docs/presentations/datafusion-meetup-nyc-2024/images/competing_standards.png b/docs/presentations/datafusion-meetup-nyc-2024/images/competing_standards.png new file mode 100644 index 000000000000..5d38303773dd Binary files /dev/null and b/docs/presentations/datafusion-meetup-nyc-2024/images/competing_standards.png differ diff --git a/docs/presentations/datafusion-meetup-nyc-2024/images/datafusion-meetup-slides.png b/docs/presentations/datafusion-meetup-nyc-2024/images/datafusion-meetup-slides.png new file mode 100644 index 000000000000..89fde82b4df1 Binary files /dev/null and b/docs/presentations/datafusion-meetup-nyc-2024/images/datafusion-meetup-slides.png differ diff --git a/docs/presentations/datafusion-meetup-nyc-2024/talk.qmd b/docs/presentations/datafusion-meetup-nyc-2024/talk.qmd index a7cfd32eb6d7..95c2c21d2787 100644 --- a/docs/presentations/datafusion-meetup-nyc-2024/talk.qmd +++ b/docs/presentations/datafusion-meetup-nyc-2024/talk.qmd @@ -5,7 +5,7 @@ title-slide-attributes: data-background-size: 50% data-background-opacity: "0.25" author: Gil Forsyth -date: "2024-09-14" +date: "2024-09-17" execute: echo: true format: @@ -37,6 +37,10 @@ format: :::: +## Link to slides + +![](./images/datafusion-meetup-slides.png){fig-align="center"} + # Show of hands ## Who here is a... @@ -48,19 +52,25 @@ format: - ML something-something? ::: +::: {.notes} +ML something-something is used as a catchall because the job titles are varied +and tend to mean wildly different things, but it is not a disparagement of ML +jobs. +::: + ## Who here uses... ::: {.incremental} -- Rust? -- Python? -- SQL? -- R? -- KDB+ Q? +- 🦀Rust? +- 🐍Python? +- 🤖SQL? +- 🇷R? +- 🧨KDB+ Q? ::: # So you want to design a Python Dataframe API? -## Python/pandas terminology or SQL terminology? +## Python🐍/pandas🐼 terminology or SQL🤖 terminology? ::: {.incremental} - `order_by` or `orderby` or `sort` or `sort_by` or `sortby`? 
@@ -69,11 +79,17 @@ format: ::: {.fragment} ::: {.r-fit-text} -_please_ only choose one +🙏_please_ only choose one🙏 +::: +::: + +::: {.fragment} +::: {.r-fit-text} +when in doubt, copy `dplyr` ::: ::: -## Python/pandas semantics or SQL semantics? +## Python🐍/pandas🐼 semantics or SQL🤖 semantics? ::: {.incremental} @@ -98,7 +114,7 @@ _please_ only choose one ## SQL ain't standard -#### Which is (a small part of) why asking "How many Star Wars characters have 'Darth' in their name" looks like this: +#### Which is why, when you ask: `How many Star Wars characters have 'Darth' in their name?` ::: {.fragment} ::: {.r-fit-text} ```sql @@ -125,6 +141,13 @@ SELECT SUM(CAST(STRPOS(LOWER("t0"."name"), 'darth') > 0 AS INT)) FROM "starwars" ::: ::: + +::: {.notes} +DataFusion, BigQuery, MSSQL, Postgres + +Datatype names, function names, quoting behavior, whether bools exist +::: + ## SQL ain't standard
@@ -222,19 +245,52 @@ t.name.lower().contains("darth").sum() ``` ::: +## And yes... + +![](./images/competing_standards.png) + +::: {.notes} +First, I refuse to submit to the nihilism that things can never get better. + +Second, I don't think there are actually very many proposed _standards_ for DataFrame APIs. + +There is some work (https://data-apis.org/dataframe-api/draft/) but largely +each engine makes its own API and says "USE THIS". +::: ## Ibis is _only_ an interface * Not an engine * We don't compute anything +* We work with a _lot_ of engines # Demo Time +## Why use DataFusion? + +* It's _fast_ +* It's _flexible_ +* Interface agnostic (SQL, Substrait, DataFrame API) + + +You should choose the _engine_ that suits your problem. + ## Why use Ibis? -Gives you flexibility +* It's flexible +* It's a pretty good API (no really!) +* Engine agnostic + +You should choose the _interface_ that suits your problem.^[If your problem involves a bunch of complex DDL, for instance, don't use Ibis] + +## The interface is not the engine is not the interface + + +::: {.incremental} +- Don't let the _engine_ dictate the _interface_ +- Don't let the _interface_ dictate the _engine_ +::: -It's a pretty good API (no really!) ## Try it out @@ -270,6 +326,45 @@ See: Apache Arrow and the “10 Things I Hate About pandas” ::: +## What other backends does Ibis support? + + +:::: {.columns} + +::: {.column width="33%"} + +- BigQuery +- ClickHouse +- DataFusion +- Druid +- DuckDB +- Exasol +::: + +::: {.column width="33%"} +- Flink +- Impala +- MSSQL +- MySQL +- Oracle +- Polars +::: + + +::: {.column width="33%"} +- Postgres +- Spark +- RisingWave +- Snowflake +- SQLite +- Trino +::: +:::: + +## Should I use Ibis _instead_ of `X`? + +Nope. You should use Ibis _with_ `X`. 
+ ## Demo code (for reference) ::: {.panel-tabset} @@ -311,10 +406,9 @@ def main(): .reset_index() .sort_values(["month", "project_count"], ascending=False) ) - ``` -### Ibis+Datafusion PyPI +### Ibis+DataFusion PyPI ```python import glob @@ -354,7 +448,7 @@ expr = ( ) ``` -### Ibis+Datafusion PyPI (full) +### Ibis+DataFusion PyPI (full) ```python import ibis @@ -387,7 +481,6 @@ expr = ( .drop_null("ext") .order_by([_.month.desc(), _.project_count.desc()]) ) - ``` :::
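Review note on the "SQL ain't standard" slides: the point that each dialect spells the same question differently can be checked locally without any of the engines named in the talk. The sketch below is not part of the patch. It assumes SQLite (via Python's standard-library `sqlite3`) as the dialect, and a tiny hand-written `starwars` table standing in for the real dataset. Here the "Darth" count uses `INSTR` and relies on comparisons evaluating to the integers 0 or 1, so there is no `STRPOS` and no explicit `CAST`.

```python
import sqlite3

# Hand-written sample rows (an assumption for illustration, not the
# starwars dataset used in the talk).
SAMPLE_NAMES = [
    ("Darth Vader",),
    ("Luke Skywalker",),
    ("Darth Maul",),
    ("Leia Organa",),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE starwars (name TEXT)")
conn.executemany("INSERT INTO starwars (name) VALUES (?)", SAMPLE_NAMES)

# SQLite spelling of the question: INSTR returns a 1-based position
# (0 when the substring is absent), and the comparison yields 0 or 1,
# which SUM can add up directly.
query = "SELECT SUM(INSTR(LOWER(name), 'darth') > 0) FROM starwars"
(darth_count,) = conn.execute(query).fetchone()

print(darth_count)
conn.close()
```

With Ibis the same question stays `t.name.lower().contains("darth").sum()` regardless of which backend runs it, which is the argument the slides make.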