docs: create examples tab and populate with ibis-examples repo content

---
title: Using Ibis with ClickHouse
---

[Ibis](https://ibis-project.com) supports reading and querying data using
[ClickHouse](https://clickhouse.com/) as a backend.

In this example we'll demonstrate using Ibis to connect to a ClickHouse server
and execute a few queries.

```{python}
from ibis.interactive import *
```

## Creating a Connection

First we need to connect Ibis to a running ClickHouse server.

In this example we'll run queries against the publicly available [ClickHouse
playground](https://clickhouse.com/docs/en/getting-started/playground) server.

To run against your own ClickHouse server, you'd only need to change the
connection details.

```{python}
con = ibis.connect("clickhouse://play@play.clickhouse.com:443")
```

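The connection string passed to `ibis.connect` follows the standard
`scheme://user@host:port` URL layout. As a small sketch (the user and hostname
below are placeholders, not a real server), the pieces can be inspected with
Python's standard library:

```python
from urllib.parse import urlsplit

# Placeholder URL in the same shape ibis.connect() accepts;
# "user" and "host.example.com" are not a real server.
url = "clickhouse://user@host.example.com:443"
parts = urlsplit(url)
print(parts.scheme)    # clickhouse
print(parts.username)  # user
print(parts.hostname)  # host.example.com
print(parts.port)      # 443
```
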
## Listing available tables

The ClickHouse playground server has a number of interesting datasets
available. To see them, we can examine the tables via the `.tables` attribute.

This shows a list of all tables available:

```{python}
con.tables
```

## Inspecting a Table

Let's take a look at the `hackernews` table. This table contains all posts and
comments on [Hacker News](https://news.ycombinator.com/).

We can access the table by attribute as `con.tables.hackernews`.

```{python}
t = con.tables.hackernews
```

We can then take a peek at the first few rows using the `.head()` method.

```{python}
t.head()
```

## Finding the highest scoring posts

Here we find the top 5 posts by score.

Only posts have a title, so we:

- `filter` out rows that lack a title
- `select` only the columns we're interested in
- `order_by` score, descending
- `limit` to the top 5 rows

```{python}
top_posts_by_score = (
    t.filter(_.title != "")
    .select("title", "score")
    .order_by(ibis.desc("score"))
    .limit(5)
)
top_posts_by_score
```

## Finding the most prolific commenters

Here we find the top 5 commenters by number of comments made.

To do this we:

- `filter` out rows with no author
- `group_by` author
- `count` all the rows in each group
- `order_by` the counts, descending
- `limit` to the top 5 rows

```{python}
top_commenters = (
    t.filter(_.by != "")
    .group_by("by")
    .agg(count=_.count())
    .order_by(ibis.desc("count"))
    .limit(5)
)
top_commenters
```

This query could also be expressed using the `.topk` method, which is
a shorthand for the above:

```{python}
# This is a shorthand for the above
top_commenters = t.filter(_.by != "").by.topk(5)
top_commenters
```

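Conceptually, `topk` is just "count each value and keep the most frequent". As
a plain-Python analogy using only the standard library (the author names here
are invented, standing in for the `by` column):

```python
from collections import Counter

# Invented author column, standing in for t.by
authors = ["alice", "bob", "alice", "", "carol", "alice", "bob"]

# Filter out empty authors, count occurrences, keep the top 2
top = Counter(a for a in authors if a != "").most_common(2)
print(top)  # [('alice', 3), ('bob', 2)]
```
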
## Finding top commenters by score

Here we find the top 5 commenters with the highest cumulative scores. In this
case the `.topk` shorthand won't work and we'll need to write out the full
`group_by` -> `agg` -> `order_by` -> `limit` pipeline.

```{python}
top_commenters_by_score = (
    t.filter(_.by != "")
    .group_by("by")
    .agg(total_score=_.score.sum())
    .order_by(ibis.desc("total_score"))
    .limit(5)
)
top_commenters_by_score
```

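The `group_by` -> `agg` -> `order_by` -> `limit` shape maps onto an ordinary
dictionary accumulation. A stdlib sketch with invented `(author, score)` pairs:

```python
from collections import defaultdict

# Invented (author, score) pairs, standing in for t.by and t.score
rows = [("alice", 10), ("bob", 5), ("alice", 7), ("", 99), ("carol", 3)]

totals = defaultdict(int)
for author, score in rows:
    if author != "":             # filter out rows with no author
        totals[author] += score  # group_by author, sum the scores

# order_by total descending, limit to the top 2
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)  # [('alice', 17), ('bob', 5)]
```
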
## Next Steps

There are lots of other interesting queries one might ask of this dataset.

A few examples:

- What posts had the most comments?
- How do post scores fluctuate over time?
- What day of the week has the highest average post score? What day has the lowest?

To learn more about how to use Ibis with ClickHouse, see [the
documentation](https://ibis-project.org/backends/ClickHouse/).

---
title: Reading Parquet Files using DuckDB
---

In this example, we will use Ibis's DuckDB backend to analyze data from
a remote parquet source using `ibis.read_parquet`. `ibis.read_parquet` can also
read local parquet files, and there are other `ibis.read_*` functions that
conveniently return a table expression from a file. One such function is
`ibis.read_csv`, which reads from local and remote CSV files.

We will be reading from the [**Global Biodiversity Information Facility (GBIF)
Species Occurrences**](https://registry.opendata.aws/gbif/) dataset. It is
hosted on S3 at `s3://gbif-open-data-us-east-1/occurrence/`.

## Reading One Partition

We can read a single partition by specifying its path.

We do this by calling
[`read_parquet`](https://ibis-project.org/api/expressions/top_level/#ibis.read_parquet)
on the partition we care about.

So to read the first partition in this dataset, we'll call `read_parquet` on
`000000` in that path:

```{python}
import ibis

t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/000000"
)
t
```

Note that we're calling `read_parquet` and receiving a table expression without
establishing a connection first. Ibis spins up a DuckDB connection (or
whichever default backend you have) when you call `ibis.read_parquet` (or even
`ibis.read_csv`).

Since our result, `t`, is a table expression, we can now run queries against
the file using Ibis expressions. For example, we can select columns, filter the
file, and then view the first five rows of the result:

```{python}
cols = [
    "gbifid",
    "datasetkey",
    "occurrenceid",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "species",
    "day",
    "month",
    "year",
]
t.select(cols).filter(t["family"].isin(["Corvidae"])).limit(5).to_pandas()
```

We can count the rows in the table (partition):

```{python}
t.count().to_pandas()
```

## Reading all partitions: filter, aggregate, export

We can use `read_parquet` to read an entire parquet dataset by globbing all
partitions:

```{python}
t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/*"
)
```

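The trailing `*` is an ordinary glob pattern: every file under
`occurrence.parquet/` matches and is read as one table. A tiny illustration
with the standard library (the file names here are invented):

```python
from fnmatch import fnmatch

# Invented file names: three partitions plus one unrelated file
paths = [
    "occurrence.parquet/000000",
    "occurrence.parquet/000001",
    "occurrence.parquet/999999",
    "metadata.json",
]
matched = [p for p in paths if fnmatch(p, "occurrence.parquet/*")]
print(matched)  # the three partition files; metadata.json is excluded
```
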
Since the function returns a table expression, we can perform valid selections,
filters, aggregations, and exports just as we could with any other table
expression:

```{python}
df = (
    t.select(["gbifid", "family", "species"])
    .filter(t["family"].isin(["Corvidae"]))
    # Here we limit to 10,000 rows to fetch a quick batch of results
    .limit(10000)
    .group_by("species")
    .count()
    .to_pandas()
)
df
```
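The same select/filter/limit/group-by-count pipeline can be sketched over a
handful of invented rows using only the standard library (the rows below are
made up, standing in for the GBIF table):

```python
from collections import Counter

# Invented rows, standing in for the GBIF occurrence table
rows = [
    {"gbifid": 1, "family": "Corvidae", "species": "Corvus corax"},
    {"gbifid": 2, "family": "Paridae", "species": "Parus major"},
    {"gbifid": 3, "family": "Corvidae", "species": "Pica pica"},
    {"gbifid": 4, "family": "Corvidae", "species": "Corvus corax"},
]

corvids = [r for r in rows if r["family"] in ("Corvidae",)]  # filter
corvids = corvids[:10000]                                    # limit
counts = Counter(r["species"] for r in corvids)              # group_by + count
print(dict(counts))  # {'Corvus corax': 2, 'Pica pica': 1}
```
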