docs: create examples tab and populate with ibis-examples repo content
cpcloud committed Feb 5, 2024
1 parent a34af25 commit fe9de4d
Showing 5 changed files with 525 additions and 1 deletion.
2 changes: 1 addition & 1 deletion .devcontainer/updateContent.sh
@@ -2,4 +2,4 @@

 # install ibis
 python3 -m pip install ipython
-python3 -m pip install -e '.[duckdb,examples]'
+python3 -m pip install -e '.[duckdb,clickhouse,examples]'
10 changes: 10 additions & 0 deletions docs/_quarto.yml
@@ -125,6 +125,16 @@ website:
- auto: "how-to/analytics"
- auto: "how-to/visualization"
- auto: "how-to/extending"

- id: examples
title: "Examples"
style: "docked"
collapse-level: 2
contents:
- auto: "examples/parquet-and-duckdb.qmd"
- auto: "examples/clickhouse-hackernews.qmd"
- auto: "examples/imdb.qmd"

- id: contribute
title: "Contribute"
style: "docked"
142 changes: 142 additions & 0 deletions docs/examples/clickhouse-hackernews.qmd
@@ -0,0 +1,142 @@
---
title: Using Ibis with ClickHouse
---

[Ibis](https://ibis-project.org) supports reading and querying data using
[ClickHouse](https://clickhouse.com/) as a backend.

In this example we'll demonstrate using Ibis to connect to a ClickHouse server
and execute a few queries.

```{python}
from ibis.interactive import *
```

## Creating a Connection

First we need to connect Ibis to a running ClickHouse server.

In this example we'll run queries against the publicly available [ClickHouse
playground](https://clickhouse.com/docs/en/getting-started/playground) server.

To run against your own ClickHouse server, you'd only need to change the
connection details, as sketched below.

```{python}
con = ibis.connect("clickhouse://[email protected]:443")
```
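
For your own server, the URL would carry your own connection details. Here's a
minimal sketch; the user, password, host, port, and database below are all
placeholders, not real credentials:

```python
# All values in this URL are placeholders, not real credentials.
con = ibis.connect("clickhouse://user:[email protected]:8123/default")
```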

## Listing available tables

The ClickHouse playground server hosts a number of interesting datasets. We
can see them by examining the connection's `.tables` attribute, which lists
every available table:

```{python}
con.tables
```
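
If you prefer the table names as a plain list of strings, `con.list_tables()`
returns the same information:

```{python}
con.list_tables()
```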

## Inspecting a Table

Let's take a look at the `hackernews` table. This table contains all posts
and comments on [Hacker News](https://news.ycombinator.com/).

We can access the table by attribute as `con.tables.hackernews`.

```{python}
t = con.tables.hackernews
```

We can then take a peek at the first few rows using the `.head()` method.

```{python}
t.head()
```
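
To see the column names and types rather than the rows, we can also inspect
the table's schema:

```{python}
t.schema()
```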

## Finding the highest scoring posts

Here we find the top 5 posts by score.

Posts have a title (comments don't), so we:

- `filter` out rows that lack a title
- `select` only the columns we're interested in
- `order` them by score, descending
- `limit` to the top 5 rows

```{python}
top_posts_by_score = (
    t.filter(_.title != "")
    .select("title", "score")
    .order_by(ibis.desc("score"))
    .limit(5)
)
top_posts_by_score
```

## Finding the most prolific commenters

Here we find the top 5 commenters by number of comments made.

To do this we:

- `filter` out rows with no author
- `group_by` author
- `count` all the rows in each group
- `order_by` the counts, descending
- `limit` to the top 5 rows

```{python}
top_commenters = (
    t.filter(_.by != "")
    .group_by("by")
    .agg(count=_.count())
    .order_by(ibis.desc("count"))
    .limit(5)
)
top_commenters
```

This query could also be expressed using the `.topk` method, which is
a shorthand for the above:

```{python}
# This is a shorthand for the above
top_commenters = t.filter(_.by != "").by.topk(5)
top_commenters
```

## Finding top commenters by score

Here we find the top 5 commenters with the highest cumulative scores. In this
case the `.topk` shorthand won't work and we'll need to write out the full
`group_by` -> `agg` -> `order_by` -> `limit` pipeline.

```{python}
top_commenters_by_score = (
    t.filter(_.by != "")
    .group_by("by")
    .agg(total_score=_.score.sum())
    .order_by(ibis.desc("total_score"))
    .limit(5)
)
top_commenters_by_score
```

## Next Steps

There are lots of other interesting queries one might ask of this dataset.

A few examples:

- What posts had the most comments? (see the sketch after this list)
- How do post scores fluctuate over time?
- What day of the week has the highest average post score? What day has the lowest?
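
As a starting point, here's a sketch of the first question. It assumes the
table has a `descendants` column holding each story's comment count, as in the
canonical Hacker News schema; verify that against this table before relying on
it:

```python
most_commented_posts = (
    t.filter(_.title != "")              # posts have titles; comments don't
    .select("title", "descendants")      # assumed: total comment count
    .order_by(ibis.desc("descendants"))
    .limit(5)
)
most_commented_posts
```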

To learn more about how to use Ibis with ClickHouse, see [the
documentation](https://ibis-project.org/backends/ClickHouse/).
96 changes: 96 additions & 0 deletions docs/examples/duckdb-parquet.qmd
@@ -0,0 +1,96 @@
---
title: Reading Parquet Files using DuckDB
---

In this example, we will use Ibis's DuckDB backend to analyze data from
a remote Parquet source using `ibis.read_parquet`. `ibis.read_parquet` can also
read local Parquet files, and there are other `ibis.read_*` functions that
conveniently return a table expression from a file. One such function is
`ibis.read_csv`, which reads both local and remote CSV files.
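
For instance, a minimal sketch of `ibis.read_csv` (the path here is a
placeholder, not a file shipped with this example):

```python
import ibis

# "data.csv" is a placeholder path; a URL such as
# "https://example.com/data.csv" would work the same way.
t = ibis.read_csv("data.csv")
```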

We will be reading from the [**Global Biodiversity Information Facility (GBIF)
Species Occurrences**](https://registry.opendata.aws/gbif/) dataset. It is
hosted on S3 at `s3://gbif-open-data-us-east-1/occurrence/`.

## Reading One Partition

We can read a single partition by specifying its path.

We do this by calling
[`read_parquet`](https://ibis-project.org/api/expressions/top_level/#ibis.read_parquet)
on the partition we care about.

So to read the first partition in this dataset, we'll call `read_parquet` on
`000000` in that path:

```{python}
import ibis

t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/000000"
)
t
```

Note that we're calling `read_parquet` and receiving a table expression without
establishing a connection first. Ibis spins up a DuckDB connection (or
whichever default backend you have) when you call `ibis.read_parquet` (or even
`ibis.read_csv`).
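
You can check which backend will handle these calls; by default it's DuckDB:

```{python}
# The backend that handles ibis.read_* calls when no connection is given;
# ibis.set_backend("duckdb") would pin it explicitly.
ibis.get_backend()
```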

Since our result, `t`, is a table expression, we can now run queries against
the file using Ibis expressions. For example, we can select columns, filter the
file, and then view the first five rows of the result:

```{python}
cols = [
    "gbifid",
    "datasetkey",
    "occurrenceid",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "species",
    "day",
    "month",
    "year",
]
t.select(cols).filter(t["family"].isin(["Corvidae"])).limit(5).to_pandas()
```

We can count the rows in the table (partition):

```{python}
t.count().to_pandas()
```

## Reading All Partitions: Filter, Aggregate, Export

We can use `read_parquet` to read the entire dataset by globbing all of its
partitions:

```{python}
t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/*"
)
```

Since the function returns a table expression, we can perform valid selections,
filters, aggregations, and exports just as we could with any other table
expression:

```{python}
df = (
    t.select(["gbifid", "family", "species"])
    .filter(t["family"].isin(["Corvidae"]))
    # limit to 10,000 rows to fetch a quick batch of results
    .limit(10000)
    .group_by("species")
    .count()
    .to_pandas()
)
df
```
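
Rather than materializing to pandas, the same expression could be written
straight back to a Parquet file. A sketch, with a placeholder output path:

```python
expr = (
    t.select(["gbifid", "family", "species"])
    .filter(t["family"].isin(["Corvidae"]))
    .limit(10000)
    .group_by("species")
    .count()
)
# Write the result to a local Parquet file; the path is a placeholder.
expr.to_parquet("corvidae_counts.parquet")
```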
