docs: create examples tab and populate with ibis-examples repo content
cpcloud committed Feb 5, 2024
1 parent a34af25 commit fe9de4d
Showing 5 changed files with 525 additions and 1 deletion.
2 changes: 1 addition & 1 deletion .devcontainer/updateContent.sh
@@ -2,4 +2,4 @@

 # install ibis
 python3 -m pip install ipython
-python3 -m pip install -e '.[duckdb,examples]'
+python3 -m pip install -e '.[duckdb,clickhouse,examples]'
10 changes: 10 additions & 0 deletions docs/_quarto.yml
@@ -125,6 +125,16 @@ website:
- auto: "how-to/analytics"
- auto: "how-to/visualization"
- auto: "how-to/extending"

- id: examples
title: "Examples"
style: "docked"
collapse-level: 2
contents:
- auto: "examples/parquet-and-duckdb.qmd"
- auto: "examples/clickhouse-hackernews.qmd"
- auto: "examples/imdb.qmd"

- id: contribute
title: "Contribute"
style: "docked"
142 changes: 142 additions & 0 deletions docs/examples/clickhouse-hackernews.qmd
@@ -0,0 +1,142 @@
---
title: Using Ibis with ClickHouse
---

[Ibis](https://ibis-project.org) supports reading and querying data using
[ClickHouse](https://clickhouse.com/) as a backend.

In this example we'll demonstrate using Ibis to connect to a ClickHouse server
and execute a few queries.

```{python}
from ibis.interactive import *
```

## Creating a Connection

First we need to connect Ibis to a running ClickHouse server.

In this example we'll run queries against the publicly available [ClickHouse
playground](https://clickhouse.com/docs/en/getting-started/playground) server.

To run against your own ClickHouse server, you'd only need to change the
connection details, as sketched below.

```{python}
con = ibis.connect("clickhouse://[email protected]:443")
```
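
For your own server, the URL would carry your own connection details. Here's a
minimal sketch; the user, password, host, port, and database below are all
placeholders, not real credentials:

```python
# All values in this URL are placeholders, not real credentials.
con = ibis.connect("clickhouse://user:[email protected]:8123/default")
```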

## Listing available tables

The ClickHouse playground server hosts a number of interesting datasets. We
can see them by examining the connection's `.tables` attribute, which lists
every available table:

```{python}
con.tables
```
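
If you prefer the table names as a plain list of strings, `con.list_tables()`
returns the same information:

```{python}
con.list_tables()
```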

## Inspecting a Table

Let's take a look at the `hackernews` table. This table contains all posts
and comments on [Hacker News](https://news.ycombinator.com/).

We can access the table by attribute as `con.tables.hackernews`.

```{python}
t = con.tables.hackernews
```

We can then take a peek at the first few rows using the `.head()` method.

```{python}
t.head()
```
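
To see the column names and types rather than the rows, we can also inspect
the table's schema:

```{python}
t.schema()
```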

## Finding the highest scoring posts

Here we find the top 5 posts by score.

Posts have a title (comments don't), so we:

- `filter` out rows that lack a title
- `select` only the columns we're interested in
- `order` them by score, descending
- `limit` to the top 5 rows

```{python}
top_posts_by_score = (
    t.filter(_.title != "")
    .select("title", "score")
    .order_by(ibis.desc("score"))
    .limit(5)
)
top_posts_by_score
```

## Finding the most prolific commenters

Here we find the top 5 commenters by number of comments made.

To do this we:

- `filter` out rows with no author
- `group_by` author
- `count` all the rows in each group
- `order_by` the counts, descending
- `limit` to the top 5 rows

```{python}
top_commenters = (
    t.filter(_.by != "")
    .group_by("by")
    .agg(count=_.count())
    .order_by(ibis.desc("count"))
    .limit(5)
)
top_commenters
```

This query could also be expressed using the `.topk` method, which is
a shorthand for the above:

```{python}
# This is a shorthand for the above
top_commenters = t.filter(_.by != "").by.topk(5)
top_commenters
```

## Finding top commenters by score

Here we find the top 5 commenters with the highest cumulative scores. In this
case the `.topk` shorthand won't work and we'll need to write out the full
`group_by` -> `agg` -> `order_by` -> `limit` pipeline.

```{python}
top_commenters_by_score = (
    t.filter(_.by != "")
    .group_by("by")
    .agg(total_score=_.score.sum())
    .order_by(ibis.desc("total_score"))
    .limit(5)
)
top_commenters_by_score
```

## Next Steps

There are lots of other interesting queries one might ask of this dataset.

A few examples:

- What posts had the most comments? (see the sketch after this list)
- How do post scores fluctuate over time?
- What day of the week has the highest average post score? What day has the lowest?
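
As a starting point, here's a sketch of the first question. It assumes the
table has a `descendants` column holding each story's comment count, as in the
canonical Hacker News schema; verify that against this table before relying on
it:

```python
most_commented_posts = (
    t.filter(_.title != "")              # posts have titles; comments don't
    .select("title", "descendants")      # assumed: total comment count
    .order_by(ibis.desc("descendants"))
    .limit(5)
)
most_commented_posts
```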

To learn more about how to use Ibis with ClickHouse, see [the
documentation](https://ibis-project.org/backends/ClickHouse/).
96 changes: 96 additions & 0 deletions docs/examples/duckdb-parquet.qmd
@@ -0,0 +1,96 @@
---
title: Reading Parquet Files using DuckDB
---

In this example, we will use Ibis's DuckDB backend to analyze data from
a remote Parquet source using `ibis.read_parquet`. `ibis.read_parquet` can also
read local Parquet files, and there are other `ibis.read_*` functions that
conveniently return a table expression from a file. One such function is
`ibis.read_csv`, which reads both local and remote CSV files.
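
For instance, a minimal sketch of `ibis.read_csv` (the path here is a
placeholder, not a file shipped with this example):

```python
import ibis

# "data.csv" is a placeholder path; a URL such as
# "https://example.com/data.csv" would work the same way.
t = ibis.read_csv("data.csv")
```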

We will be reading from the [**Global Biodiversity Information Facility (GBIF)
Species Occurrences**](https://registry.opendata.aws/gbif/) dataset. It is
hosted on S3 at `s3://gbif-open-data-us-east-1/occurrence/`.

## Reading One Partition

We can read a single partition by specifying its path.

We do this by calling
[`read_parquet`](https://ibis-project.org/api/expressions/top_level/#ibis.read_parquet)
on the partition we care about.

So to read the first partition in this dataset, we'll call `read_parquet` on
`000000` in that path:

```{python}
import ibis

t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/000000"
)
t
```

Note that we're calling `read_parquet` and receiving a table expression without
establishing a connection first. Ibis spins up a DuckDB connection (or
whichever default backend you have) when you call `ibis.read_parquet` (or even
`ibis.read_csv`).
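
You can check which backend will handle these calls; by default it's DuckDB:

```{python}
# The backend that handles ibis.read_* calls when no connection is given;
# ibis.set_backend("duckdb") would pin it explicitly.
ibis.get_backend()
```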

Since our result, `t`, is a table expression, we can now run queries against
the file using Ibis expressions. For example, we can select columns, filter the
file, and then view the first five rows of the result:

```{python}
cols = [
    "gbifid",
    "datasetkey",
    "occurrenceid",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "species",
    "day",
    "month",
    "year",
]
t.select(cols).filter(t["family"].isin(["Corvidae"])).limit(5).to_pandas()
```

We can count the rows in the table (partition):

```{python}
t.count().to_pandas()
```

## Reading All Partitions: Filter, Aggregate, Export

We can use `read_parquet` to read the entire dataset by globbing all of its
partitions:

```{python}
t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/*"
)
```

Since the function returns a table expression, we can perform valid selections,
filters, aggregations, and exports just as we could with any other table
expression:

```{python}
df = (
    t.select(["gbifid", "family", "species"])
    .filter(t["family"].isin(["Corvidae"]))
    # limit to 10,000 rows to fetch a quick batch of results
    .limit(10000)
    .group_by("species")
    .count()
    .to_pandas()
)
df
```
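
Rather than materializing to pandas, the same expression could be written
straight back to a Parquet file. A sketch, with a placeholder output path:

```python
expr = (
    t.select(["gbifid", "family", "species"])
    .filter(t["family"].isin(["Corvidae"]))
    .limit(10000)
    .group_by("species")
    .count()
)
# Write the result to a local Parquet file; the path is a placeholder.
expr.to_parquet("corvidae_counts.parquet")
```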
