docs: create examples tab and populate with ibis-examples repo content

---
title: Analyzing IMDB data with Ibis and DuckDB
---

Let's use the Ibis examples module and the DuckDB backend to find some movies
to watch.

Adapted from [Phillip in the Cloud's livestream using the same
data](https://www.youtube.com/watch?v=J7sEn9VklKY).

## Imports

For this example, we'll just use Ibis.

```{python}
from ibis.interactive import *  # <1>
```

1. This imports `ibis.examples` as `ex`.
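For reference, the wildcard import is roughly equivalent to the explicit
imports below (a sketch; the exact contents of `ibis.interactive` can vary
between Ibis versions):

```{python}
import ibis
import ibis.examples as ex  # example datasets, used as `ex` below
from ibis import _  # the deferred expression builder

ibis.options.interactive = True  # render expressions as result previews
```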
## Fetch the example data

We can use the `ibis.examples` module to fetch the IMDB data. Ibis
automatically caches the data on disk so subsequent runs don't require fetching
from cloud storage on each call to `fetch`.

```{python}
name_basics = ex.imdb_name_basics.fetch()
name_basics
```

To ensure column names are Pythonic, we can relabel as `snake_case`.

```{python}
name_basics.relabel("snake_case")
```

Let's grab all of the relevant IMDB tables and relabel columns.

```{python}
name_basics = ex.imdb_name_basics.fetch().relabel("snake_case")
title_akas = ex.imdb_title_akas.fetch().relabel("snake_case")
title_basics = ex.imdb_title_basics.fetch().relabel("snake_case")
title_crew = ex.imdb_title_crew.fetch().relabel("snake_case")
title_episode = ex.imdb_title_episode.fetch().relabel("snake_case")
title_principals = ex.imdb_title_principals.fetch().relabel("snake_case")
title_ratings = ex.imdb_title_ratings.fetch().relabel("snake_case")
```
## Preview the data

We'll print out the first few rows of each table to get an idea of what is
contained in each.

```{python}
name_basics.head()
```

```{python}
title_akas.head()
```

```{python}
title_basics.head()
```

```{python}
title_crew.head()
```

```{python}
title_episode.head()
```

```{python}
title_principals.head()
```

```{python}
title_ratings.head()
```
## Basic data exploration

Let's check how many records are in each table. It's just Python, so we can
construct a dictionary and iterate through it in a for loop.

```{python}
tables = {
    "name_basics": name_basics,
    "title_akas": title_akas,
    "title_basics": title_basics,
    "title_crew": title_crew,
    "title_episode": title_episode,
    "title_principals": title_principals,
    "title_ratings": title_ratings,
}
max_name_len = max(map(len, tables.keys())) + 1
```

```{python}
print("Length of tables:")
for t in tables:
    print(f"\t{t.ljust(max_name_len)}: {tables[t].count().to_pandas():,}")
```
## Clean data

Looking at the data, the `nconst` and `tconst` columns seem to be unique
identifiers. Let's confirm and adjust them accordingly.

```{python}
name_basics.head()
```

Check the number of unique `nconst` values.

```{python}
name_basics.nconst.nunique()
```

Confirm it's equal to the number of rows.

```{python}
name_basics.nconst.nunique() == name_basics.count()
```
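We can run the same check for `tconst`; for example, a quick sketch against
`title_basics`:

```{python}
title_basics.tconst.nunique() == title_basics.count()
```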
Mutate the table to convert `nconst` to an integer.

```{python}
t = name_basics.mutate(nconst=_.nconst.replace("nm", "").cast("int"))
t.head()
```

Let's also turn `primary_profession` into an array of strings instead of
a single comma-separated string.

```{python}
t = t.mutate(primary_profession=_.primary_profession.split(","))
t
```

And, combining the two concepts, convert `known_for_titles` into an array of
integers corresponding to `tconst` identifiers.

```{python}
t = t.mutate(
    known_for_titles=_.known_for_titles.split(",").map(
        lambda tconst: tconst.replace("tt", "").cast("int")
    )
)
t
```

## DRY-ing up the code

We can define functions to convert `nconst` and `tconst` to integers.

```{python}
def nconst_to_int(nconst):
    # Strip the "nm" prefix and cast the remaining digits to an integer
    return nconst.replace("nm", "").cast("int")


def tconst_to_int(tconst):
    # Strip the "tt" prefix and cast the remaining digits to an integer
    return tconst.replace("tt", "").cast("int")
```
Then we can combine the previous data cleaning steps into a single `mutate`
call.

```{python}
name_basics = name_basics.mutate(
    nconst=nconst_to_int(_.nconst),
    primary_profession=_.primary_profession.split(","),
    known_for_titles=_.known_for_titles.split(",").map(tconst_to_int),
)
name_basics
```

We can use `ibis.to_sql` to see the SQL this generates.

```{python}
ibis.to_sql(name_basics)
```
Clean the rest of the tables. We'll convert `nconst` and `tconst` columns
consistently to allow for easy joining.

```{python}
title_akas = title_akas.mutate(title_id=tconst_to_int(_.title_id)).relabel(
    {"title_id": "tconst"}
)
title_basics = title_basics.mutate(tconst=tconst_to_int(_.tconst))
title_crew = title_crew.mutate(
    tconst=tconst_to_int(_.tconst),
    directors=_.directors.split(",").map(nconst_to_int),
    writers=_.writers.split(",").map(nconst_to_int),
)
title_episode = title_episode.mutate(
    tconst=tconst_to_int(_.tconst), parent_tconst=tconst_to_int(_.parent_tconst)
)
title_principals = title_principals.mutate(
    tconst=tconst_to_int(_.tconst), nconst=nconst_to_int(_.nconst)
)
title_ratings = title_ratings.mutate(tconst=tconst_to_int(_.tconst))
```
## Finding good (and bad) movies to watch

Join the IMDB rankings with information about the movies.

```{python}
joined = title_basics.join(title_ratings, "tconst")
joined
```

```{python}
joined.title_type.value_counts().order_by(_.title_type_count.desc())
```

Filter down to movies.

```{python}
joined = joined.filter(_.title_type == "movie")
joined
```

Reorder the columns and drop some.

```{python}
joined = joined.select(
    "tconst",
    "primary_title",
    "average_rating",
    "num_votes",
    "genres",
    "runtime_minutes",
)
joined
```
Sort by the average rating.

```{python}
joined = joined.order_by([_.average_rating.desc(), _.num_votes.desc()])
joined
```

A lot of 10/10 movies I haven't heard of … let's filter to movies with at least
`N` votes.

```{python}
N = 50000
joined = joined.filter(_.num_votes > N)
joined
```

What if you're in the mood for a bad movie?

```{python}
joined = joined.order_by([_.average_rating.asc(), _.num_votes.desc()])
joined
```

And specifically a bad comedy?

```{python}
joined = joined.filter(_.genres.contains("Comedy"))
joined
```

Perfect!
## Next Steps

We only used two of the IMDB tables. What else can we do with the rest of the
data? Play around and let us know!
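For example, here's a minimal sketch (assuming the cleaned tables above and the
standard IMDB column names `category` and `primary_name`) that looks up who is
credited on the movies we ended up with, by joining through `title_principals`
and `name_basics`:

```{python}
top_credits = (
    joined.join(title_principals, "tconst")
    .join(name_basics, "nconst")
    .select("primary_title", "average_rating", "category", "primary_name")
)
top_credits
```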
---
title: Reading Parquet Files with Ibis + DuckDB
---

In this example, we will use Ibis's DuckDB backend to analyze data from
a remote parquet source using `ibis.read_parquet`. `ibis.read_parquet` can also
read local parquet files, and there are other `ibis.read_*` functions that
conveniently return a table expression from a file. One such function is
`ibis.read_csv`, which reads local and remote CSV files.
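As a quick aside, here is a minimal, self-contained sketch of that CSV variant
(the file name and rows below are made up purely for illustration):

```{python}
import csv

import ibis

# Write a tiny CSV so the example is fully self-contained, then read it back
# with ibis.read_csv to get a table expression.
with open("example_birds.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["species", "family", "count"])
    writer.writerow(["Corvus corax", "Corvidae", 3])
    writer.writerow(["Pica pica", "Corvidae", 7])

birds = ibis.read_csv("example_birds.csv")
birds
```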
We will be reading from the [**Global Biodiversity Information Facility (GBIF)
Species Occurrences**](https://registry.opendata.aws/gbif/) dataset. It is
hosted on S3 at `s3://gbif-open-data-us-east-1/occurrence/`.
## Reading One Partition

We can read a single partition by specifying its path.

We do this by calling
[`read_parquet`](https://ibis-project.org/api/expressions/top_level/#ibis.read_parquet)
on the partition we care about.

So to read the first partition in this dataset, we'll call `read_parquet` on
`000000` in that path:

```{python}
import ibis

t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/000000"
)
t
```
Note that we're calling `read_parquet` and receiving a table expression without
establishing a connection first. Ibis spins up a DuckDB connection (or
whichever default backend you have) when you call `ibis.read_parquet` (or even
`ibis.read_csv`).
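If you prefer to manage the connection yourself, a rough sketch of the explicit
equivalent (assuming the DuckDB backend) looks like this:

```{python}
con = ibis.duckdb.connect()  # an in-memory DuckDB connection
t_explicit = con.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/000000"
)
```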
Since our result, `t`, is a table expression, we can now run queries against
the file using Ibis expressions. For example, we can select columns, filter the
file, and then view the first five rows of the result:

```{python}
cols = [
    "gbifid",
    "datasetkey",
    "occurrenceid",
    "kingdom",
    "phylum",
    "class",
    "order",
    "family",
    "genus",
    "species",
    "day",
    "month",
    "year",
]
t.select(cols).filter(t["family"].isin(["Corvidae"])).limit(5).to_pandas()
```

We can count the rows in the table (partition):

```{python}
t.count().to_pandas()
```
## Reading all partitions: filter, aggregate, export

We can use `read_parquet` to read the entire dataset by globbing all of its
partitions:

```{python}
t = ibis.read_parquet(
    "s3://gbif-open-data-us-east-1/occurrence/2023-04-01/occurrence.parquet/*"
)
```
Since the function returns a table expression, we can perform valid selections,
filters, aggregations, and exports just as we could with any other table
expression:

```{python}
df = (
    t.select(["gbifid", "family", "species"])
    .filter(t["family"].isin(["Corvidae"]))
    # Here we limit to 10,000 rows to fetch a quick batch of results
    .limit(10000)
    .group_by("species")
    .count()
    .to_pandas()
)
df
```
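The `to_pandas` call above is one way to export; as a sketch (assuming a recent
Ibis version that provides `to_parquet` on table expressions), the same
filtered selection could also be written back out to a local Parquet file:

```{python}
(
    t.select(["gbifid", "family", "species"])
    .filter(t["family"].isin(["Corvidae"]))
    .limit(10000)
    # The output path is illustrative; any local path works
    .to_parquet("corvidae_sample.parquet")
)
```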