docs: create examples tab and populate with ibis-examples repo content

---
title: Analyzing IMDB data with Ibis and DuckDB
---

Let's use the Ibis examples module and the DuckDB backend to find some movies
to watch.

Adapted from [Phillip in the Cloud's livestream using the same
data](https://www.youtube.com/watch?v=J7sEn9VklKY).

## Imports

For this example, we'll just use Ibis.

```{python}
from ibis.interactive import *  # <1>
```

1. This imports `ibis.examples` as `ex`.
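For reference, the wildcard import is roughly equivalent to the explicit
imports below (a sketch; the exact contents of `ibis.interactive` can vary
between Ibis versions):

```{python}
import ibis
import ibis.examples as ex  # example datasets, used as `ex` below
from ibis import _  # the deferred expression builder

ibis.options.interactive = True  # render expressions as result previews
```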
## Fetch the example data

We can use the `ibis.examples` module to fetch the IMDB data. Ibis
automatically caches the data on disk so subsequent runs don't require fetching
from cloud storage on each call to `fetch`.

```{python}
name_basics = ex.imdb_name_basics.fetch()
name_basics
```

To ensure column names are Pythonic, we can relabel as `snake_case`.

```{python}
name_basics.relabel("snake_case")
```

Let's grab all of the relevant IMDB tables and relabel columns.

```{python}
name_basics = ex.imdb_name_basics.fetch().relabel("snake_case")
title_akas = ex.imdb_title_akas.fetch().relabel("snake_case")
title_basics = ex.imdb_title_basics.fetch().relabel("snake_case")
title_crew = ex.imdb_title_crew.fetch().relabel("snake_case")
title_episode = ex.imdb_title_episode.fetch().relabel("snake_case")
title_principals = ex.imdb_title_principals.fetch().relabel("snake_case")
title_ratings = ex.imdb_title_ratings.fetch().relabel("snake_case")
```
## Preview the data

We'll print out the first few rows of each table to get an idea of what is
contained in each.

```{python}
name_basics.head()
```

```{python}
title_akas.head()
```

```{python}
title_basics.head()
```

```{python}
title_crew.head()
```

```{python}
title_episode.head()
```

```{python}
title_principals.head()
```

```{python}
title_ratings.head()
```
## Basic data exploration

Let's check how many records are in each table. It's just Python, so we can
construct a dictionary and iterate through it in a for loop.

```{python}
tables = {
    "name_basics": name_basics,
    "title_akas": title_akas,
    "title_basics": title_basics,
    "title_crew": title_crew,
    "title_episode": title_episode,
    "title_principals": title_principals,
    "title_ratings": title_ratings,
}
max_name_len = max(map(len, tables.keys())) + 1
```

```{python}
print("Length of tables:")
for t in tables:
    print(f"\t{t.ljust(max_name_len)}: {tables[t].count().to_pandas():,}")
```
## Clean data

Looking at the data, the `nconst` and `tconst` columns seem to be unique
identifiers. Let's confirm and adjust them accordingly.

```{python}
name_basics.head()
```

Check the number of unique `nconst` values.

```{python}
name_basics.nconst.nunique()
```

Confirm it's equal to the number of rows.

```{python}
name_basics.nconst.nunique() == name_basics.count()
```
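We can run the same check for `tconst`; for example, a quick sketch against
`title_basics`:

```{python}
title_basics.tconst.nunique() == title_basics.count()
```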
Mutate the table to convert `nconst` to an integer.

```{python}
t = name_basics.mutate(nconst=_.nconst.replace("nm", "").cast("int"))
t.head()
```

Let's also turn `primary_profession` into an array of strings instead of
a single comma-separated string.

```{python}
t = t.mutate(primary_profession=_.primary_profession.split(","))
t
```

And, combining the two concepts, convert `known_for_titles` into an array of
integers corresponding to `tconst` identifiers.

```{python}
t = t.mutate(
    known_for_titles=_.known_for_titles.split(",").map(
        lambda tconst: tconst.replace("tt", "").cast("int")
    )
)
t
```

## DRY-ing up the code

We can define functions to convert `nconst` and `tconst` to integers.

```{python}
def nconst_to_int(nconst):
    # Strip the "nm" prefix and cast the remaining digits to an integer
    return nconst.replace("nm", "").cast("int")


def tconst_to_int(tconst):
    # Strip the "tt" prefix and cast the remaining digits to an integer
    return tconst.replace("tt", "").cast("int")
```
Then we can combine the previous data cleaning steps into a single `mutate`
call.

```{python}
name_basics = name_basics.mutate(
    nconst=nconst_to_int(_.nconst),
    primary_profession=_.primary_profession.split(","),
    known_for_titles=_.known_for_titles.split(",").map(tconst_to_int),
)
name_basics
```

We can use `ibis.to_sql` to see the SQL this generates.

```{python}
ibis.to_sql(name_basics)
```
Clean the rest of the tables. We'll convert `nconst` and `tconst` columns
consistently to allow for easy joining.

```{python}
title_akas = title_akas.mutate(title_id=tconst_to_int(_.title_id)).relabel(
    {"title_id": "tconst"}
)
title_basics = title_basics.mutate(tconst=tconst_to_int(_.tconst))
title_crew = title_crew.mutate(
    tconst=tconst_to_int(_.tconst),
    directors=_.directors.split(",").map(nconst_to_int),
    writers=_.writers.split(",").map(nconst_to_int),
)
title_episode = title_episode.mutate(
    tconst=tconst_to_int(_.tconst), parent_tconst=tconst_to_int(_.parent_tconst)
)
title_principals = title_principals.mutate(
    tconst=tconst_to_int(_.tconst), nconst=nconst_to_int(_.nconst)
)
title_ratings = title_ratings.mutate(tconst=tconst_to_int(_.tconst))
```
## Finding good (and bad) movies to watch

Join the IMDB rankings with information about the movies.

```{python}
joined = title_basics.join(title_ratings, "tconst")
joined
```

```{python}
joined.title_type.value_counts().order_by(_.title_type_count.desc())
```

Filter down to movies.

```{python}
joined = joined.filter(_.title_type == "movie")
joined
```

Reorder the columns and drop some.

```{python}
joined = joined.select(
    "tconst",
    "primary_title",
    "average_rating",
    "num_votes",
    "genres",
    "runtime_minutes",
)
joined
```
Sort by the average rating.

```{python}
joined = joined.order_by([_.average_rating.desc(), _.num_votes.desc()])
joined
```

A lot of 10/10 movies I haven't heard of … let's filter to movies with at least
`N` votes.

```{python}
N = 50000
joined = joined.filter(_.num_votes > N)
joined
```

What if you're in the mood for a bad movie?

```{python}
joined = joined.order_by([_.average_rating.asc(), _.num_votes.desc()])
joined
```

And specifically a bad comedy?

```{python}
joined = joined.filter(_.genres.contains("Comedy"))
joined
```

Perfect!
## Next Steps

We only used two of the IMDB tables. What else can we do with the rest of the
data? Play around and let us know!
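For example, here's a minimal sketch (assuming the cleaned tables above and the
standard IMDB column names `category` and `primary_name`) that looks up who is
credited on the movies we ended up with, by joining through `title_principals`
and `name_basics`:

```{python}
top_credits = (
    joined.join(title_principals, "tconst")
    .join(name_basics, "nconst")
    .select("primary_title", "average_rating", "category", "primary_name")
)
top_credits
```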
---
title: Reading Parquet Files with Ibis + DuckDB
---

In this example, we will use Ibis's DuckDB backend to analyze data from
a remote parquet source using `ibis.read_parquet`. `ibis.read_parquet` can also
read local parquet files, and there are other `ibis.read_*` functions that
conveniently return a table expression from a file. One such function is
`ibis.read_csv`, which reads local and remote CSV files.
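As a quick aside, here is a minimal, self-contained sketch of that CSV variant
(the file name and rows below are made up purely for illustration):

```{python}
import csv

import ibis

# Write a tiny CSV so the example is fully self-contained, then read it back
# with ibis.read_csv to get a table expression.
with open("example_birds.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["species", "family", "count"])
    writer.writerow(["Corvus corax", "Corvidae", 3])
    writer.writerow(["Pica pica", "Corvidae", 7])

birds = ibis.read_csv("example_birds.csv")
birds
```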
We will be reading from the [**Global Biodiversity Information Facility (GBIF)
Species Occurrences**](https://registry.opendata.aws/gbif/) dataset. It is
hosted on S3 at `s3://gbif-open-data-us-east-1/occurrence/`.
## Reading One Partition

We can read a single partition by specifying its path.

We do this by calling
[`read_parquet`](https://ibis-project.org/api/expressions/top_level/#ibis.read_parquet)
on the partition we care about.

So to read the first partition in this dataset, we'll call `read_parquet` on
`000000` in that path:

```{python}
import ibis

t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/000000"
)
t
```
Note that we're calling `read_parquet` and receiving a table expression without
establishing a connection first. Ibis spins up a DuckDB connection (or
whichever default backend you have) when you call `ibis.read_parquet` (or even
`ibis.read_csv`).
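If you prefer to manage the connection yourself, a rough sketch of the explicit
equivalent (assuming the DuckDB backend) looks like this:

```{python}
con = ibis.duckdb.connect()  # an in-memory DuckDB connection
t_explicit = con.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/000000"
)
```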
Since our result, `t`, is a table expression, we can now run queries against
the file using Ibis expressions. For example, we can select columns, filter the
file, and then view the first five rows of the result:

```{python}
cols = [
    "gbifid",
    "datasetkey",
    "occurrenceid",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "species",
    "day",
    "month",
    "year",
]
t.select(cols).filter(t["family"].isin(["Corvidae"])).limit(5).to_pandas()
```

We can count the rows in the table (partition):

```{python}
t.count().to_pandas()
```
## Reading all partitions: filter, aggregate, export

We can use `read_parquet` to read the entire dataset by globbing all of its
partitions:

```{python}
t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/*"
)
```
Since the function returns a table expression, we can perform valid selections,
filters, aggregations, and exports just as we could with any other table
expression:

```{python}
df = (
    t.select(["gbifid", "family", "species"])
    .filter(t["family"].isin(["Corvidae"]))
    # Here we limit to 10,000 rows to fetch a quick batch of results
    .limit(10000)
    .group_by("species")
    .count()
    .to_pandas()
)
df
```
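The `to_pandas` call above is one way to export; as a sketch (assuming a recent
Ibis version that provides `to_parquet` on table expressions), the same
filtered selection could also be written back out to a local Parquet file:

```{python}
(
    t.select(["gbifid", "family", "species"])
    .filter(t["family"].isin(["Corvidae"]))
    .limit(10000)
    # The output path is illustrative; any local path works
    .to_parquet("corvidae_sample.parquet")
)
```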