From 8f2a40ff89ef113a327841d72159d497258e314f Mon Sep 17 00:00:00 2001
From: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
Date: Tue, 12 Sep 2023 07:41:41 -0400
Subject: [PATCH] docs(blog): add bigquery arrays 7.0.0 blog post

---
 .../index/execute-results/html.json | 15 ++
 docs/posts/bigquery-arrays/index.qmd | 247 ++++++++++++++++++
 2 files changed, 262 insertions(+)
 create mode 100644 docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json
 create mode 100644 docs/posts/bigquery-arrays/index.qmd

diff --git a/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json b/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json
new file mode 100644
index 000000000000..288341b461fa
--- /dev/null
+++ b/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json
@@ -0,0 +1,15 @@
+{
+  "hash": "47440a350432e93f85e6ed6553cc40f0",
+  "result": {
+    "markdown": "---\ntitle: Working with arrays in Google BigQuery\nauthor: \"Phillip Cloud\"\ndate: \"2023-09-12\"\ncategories:\n - blog\n - bigquery\n - arrays\n - cloud\n---\n\n## Introduction\n\nIbis and BigQuery have [worked well together for years](https://cloud.google.com/blog/products/data-analytics/ibis-and-bigquery-scalable-analytics-comfort-python).\n\nIn Ibis 7.0.0, they work even better together with the addition of array\nfunctionality for BigQuery.\n\nLet's look at some examples using BigQuery's [IMDB sample\ndata](https://developer.imdb.com/non-commercial-datasets/).\n\n## Basics\n\nFirst we'll connect to BigQuery and pluck out a table to work with.\n\nWe'll start with `from ibis.interactive import *` for maximum convenience.\n\n::: {#75a9d26f .cell execution_count=1}\n``` {.python .cell-code}\nfrom ibis.interactive import *\n\ncon = ibis.connect(\"bigquery://ibis-gbq\") # <1>\ncon.set_database(\"bigquery-public-data.imdb\") # <2>\n```\n:::\n\n\n1. Connect to the **billing** project. Compute (but not storage) is billed to\n this project.\n2. 
Set the database to the project and dataset that we will use for analysis.\n\nLet's look at the tables in this dataset:\n\n::: {#203b6b28 .cell execution_count=2}\n``` {.python .cell-code}\ncon.list_tables()\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```\n['name_basics',\n 'reviews',\n 'title_akas',\n 'title_basics',\n 'title_crew',\n 'title_episode',\n 'title_principals',\n 'title_ratings']\n```\n:::\n:::\n\n\nLet's pull out the `name_basics` table, which contains names and metadata about\npeople listed on IMDB. We'll call this `ents` (short for `entities`), and remove some\ncolumns we won't need:\n\n::: {#6229c913 .cell execution_count=3}\n``` {.python .cell-code}\nents = con.tables.name_basics.drop(\"birth_year\", \"death_year\")\nents\n```\n\n::: {.cell-output .cell-output-display execution_count=3}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name          primary_profession  known_for_titles    ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringstring              │\n├───────────┼──────────────────────┼────────────────────┼─────────────────────┤\n│ nm7200466Sam Townsend        NULLNULL                │\n│ nm7222639Marc Goula          NULLtt3185588           │\n│ nm7236451Charlie Furusho     NULLtt4548374           │\n│ nm7245943Cynthia Llanes      NULLNULL                │\n│ nm7252258Lance Hamner        NULLtt0247882           │\n│ nm7254706Paloma White        NULLNULL                │\n│ nm7256968Bart den Hartigh    NULLtt3947934           │\n│ nm7268314Don Cummings        NULLtt4613692,tt0042078 │\n│ nm7286675Svitlana BanschukovaNULLtt4636896           │\n│ nm7287050Glenn McCready      NULLtt4637318           │\n│                    │\n└───────────┴──────────────────────┴────────────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n### Splitting strings into arrays\n\nWe can see that `known_for_titles` looks sort of like an array, so let's call\nthe\n[`split`](../../reference/expression-strings.qmd#ibis.expr.types.strings.StringValue.split)\nmethod on that column and replace the existing column:\n\n::: {#1763a10e .cell execution_count=4}\n``` {.python .cell-code}\nents = ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nents\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name          primary_profession  known_for_titles           ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringarray<string>              │\n├───────────┼──────────────────────┼────────────────────┼────────────────────────────┤\n│ nm7200466Sam Townsend        NULL[]                         │\n│ nm7222639Marc Goula          NULL['tt3185588']              │\n│ nm7236451Charlie Furusho     NULL['tt4548374']              │\n│ nm7245943Cynthia Llanes      NULL[]                         │\n│ nm7252258Lance Hamner        NULL['tt0247882']              │\n│ nm7254706Paloma White        NULL[]                         │\n│ nm7256968Bart den Hartigh    NULL['tt3947934']              │\n│ nm7268314Don Cummings        NULL['tt4613692', 'tt0042078'] │\n│ nm7286675Svitlana BanschukovaNULL['tt4636896']              │\n│ nm7287050Glenn McCready      NULL['tt4637318']              │\n│                           │\n└───────────┴──────────────────────┴────────────────────┴────────────────────────────┘\n
\n```\n:::\n:::\n\n\nSimilarly for `primary_profession`, since people involved in show business often\nhave more than one responsibility on a project:\n\n::: {#398f73c9 .cell execution_count=5}\n``` {.python .cell-code}\nents = ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n### Array length\n\nLet's see how many titles each entity is known for, and then show the five\npeople with the largest number of titles they're known for:\n\nThis is computed using the\n[`length`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length)\nAPI on array expressions:\n\n::: {#315e2b5c .cell execution_count=6}\n``` {.python .cell-code}\n(\n ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name      num_titles ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ stringint64      │\n├──────────────────┼────────────┤\n│ Marc Mayer      5 │\n│ Alex Koenigsmark5 │\n│ Sally Sun       5 │\n│ Carrie Schnelker5 │\n│ Henry Townsend  5 │\n└──────────────────┴────────────┘\n
\n```\n:::\n:::\n\n\nIt seems like the length of the `known_for_titles` might be capped at five!\n\n### Index\n\nWe can see the position of `\"actor\"` in `primary_profession`s:\n\n::: {#8f915d17 .cell execution_count=7}\n``` {.python .cell-code}\nents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64                                      │\n├────────────────────────────────────────────┤\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                           │\n└────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\nA return value of `-1` indicates that `\"actor\"` is not present in the array.\n\nLet's look for entities that are not primarily actors.\n\nWe can do this using the\n[`index`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.index)\nmethod by checking whether the position of the string `\"actor\"` is greater than\nzero, meaning `\"actor\"` is listed but is not the first profession:\n\n::: {#4335351c .cell execution_count=8}\n``` {.python .cell-code}\nactor_index = ents.primary_profession.index(\"actor\")\nnot_primarily_actors = actor_index > 0\nnot_primarily_actors.mean() # <1>\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=8}\n\n::: {.ansi-escaped-output}\n```{=html}\n
0.01947497010287973
\n```\n:::\n\n:::\n:::\n\n\n1. The average of a `bool` column gives the fraction of `True` values\n\nWho are they?\n\n::: {#4bf604e5 .cell execution_count=9}\n``` {.python .cell-code}\nents[not_primarily_actors]\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst      primary_name             primary_profession   known_for_titles                    ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                       │\n├────────────┼─────────────────────────┼─────────────────────┼─────────────────────────────────────┤\n│ nm14670573Jamie Young            ['legal', 'actor']['tt27216887']                      │\n│ nm2320563 Miguel Eraso           ['editor', 'actor']['tt1820556', 'tt0823256']          │\n│ nm6050288 Kyle Springford        ['editor', 'actor']['tt3260540', 'tt4353988', ... +1]  │\n│ nm8606771 Edward Wu              ['editor', 'actor']['tt0259354', 'tt4219258']          │\n│ nm8159690 Arash Maleki           ['editor', 'actor']['tt14888266', 'tt5783616', ... +1] │\n│ nm3700713 Wendell Holland        ['editor', 'actor']['tt11546754', 'tt1554553', ... +1] │\n│ nm6531583 Tomás Díez-Kith Atienza['editor', 'actor']['tt3171042', 'tt3749248']          │\n│ nm2456342 Ed Cheesman            ['editor', 'actor']['tt13918214', 'tt9598592', ... +1] │\n│ nm0396397 Thomas Houg            ['editor', 'actor']['tt0093176', 'tt13339954', ... +1] │\n│ nm2171019 Larry Pena             ['editor', 'actor']['tt0831320', 'tt0800017', ... +2]  │\n│                                    │\n└────────────┴─────────────────────────┴─────────────────────┴─────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\nIt's not 100% clear whether the order of elements in `primary_profession` matters here.\n\n### Containment\n\nWe can get people who are **not** actors using `contains`:\n\n::: {#510dc366 .cell execution_count=10}\n``` {.python .cell-code}\nnon_actors = ents[~ents.primary_profession.contains(\"actor\")]\nnon_actors\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst      primary_name       primary_profession  known_for_titles ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>    │\n├────────────┼───────────────────┼────────────────────┼──────────────────┤\n│ nm13331027Robert Allen     ['legal'][]               │\n│ nm11516366Barney Given     ['legal'][]               │\n│ nm7841847 Natalia Utrera   ['legal'][]               │\n│ nm14658368Amber Payne      ['legal'][]               │\n│ nm15199944Melanie Tomanov  ['legal'][]               │\n│ nm11529563David Lazarus    ['legal'][]               │\n│ nm12224896Andrew Winston   ['legal'][]               │\n│ nm7591008 Miles Metcoff    ['legal'][]               │\n│ nm11355058Sameer Oberoi    ['legal'][]               │\n│ nm15069831Skyler R. Peacock['legal'][]               │\n│                 │\n└────────────┴───────────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n### Element removal\n\nWe can remove elements from arrays too.\n\n::: {.callout-note}\n## `remove()` does not mutate the underlying data\n:::\n\nLet's see who only has \"actor\" in the list of their primary professions:\n\n::: {#261a5744 .cell execution_count=11}\n``` {.python .cell-code}\nents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n ]\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name           primary_profession  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867Danny Brown          ['actor']['tt4532788']                      │\n│ nm7199329James Wyatt Fairbanks['actor']['tt4494580']                      │\n│ nm7201687Tony Jelen           ['actor']['tt2043887']                      │\n│ nm7203397Christian Petrucci   ['actor']['tt4537722']                      │\n│ nm7205107Pablo Schollaert     ['actor']['tt4539222']                      │\n│ nm7207724Shigeru Jerry Endo   ['actor']['tt0043590']                      │\n│ nm7209610Paul Tugwell         ['actor']['tt6185666', 'tt4544182']         │\n│ nm7213017Pancho               ['actor']['tt2333598']                      │\n│ nm7228531Phillip Shinn        ['actor']['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342José María Martínez  ['actor']['tt2244891']                      │\n│                                   │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n### Slicing with square-bracket syntax\n\nLet's remove everyone's first profession from the list, but only if they have\nmore than one profession listed:\n\n::: {#c1137788 .cell execution_count=12}\n``` {.python .cell-code}\nents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=12}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst      primary_name     primary_profession  known_for_titles                     ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                        │\n├────────────┼─────────────────┼────────────────────┼──────────────────────────────────────┤\n│ nm0520425 Doug Lord      ['legal']['tt0086461']                        │\n│ nm7198767 Martin Harry   ['legal']['tt5554916']                        │\n│ nm2232471 Lee Thomas     ['legal']['tt0236124']                        │\n│ nm5500775 Stewart Hayes  ['legal']['tt2671192']                        │\n│ nm2653478 Aaron Rosenberg['actor']['tt4218260']                        │\n│ nm0701436 Dominic Pye    ['editor']['tt27329996', 'tt0195619']          │\n│ nm12705514Okpata Henry   ['editor']['tt28450328', 'tt15170142', ... +1] │\n│ nm8313644 Jeff Landers   ['editor']['tt0488302']                        │\n│ nm0438282 Joshua Kaplan  ['editor']['tt0110687', 'tt0329600', ... +2]   │\n│ nm2803821 Glen Ring      ['editor']['tt1579300', 'tt1126489', ... +2]   │\n│                                     │\n└────────────┴─────────────────┴────────────────────┴──────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## Set operations and sorting\n\nTreating arrays as sets is possible with the\n[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union)\nand\n[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect)\nAPIs.\n\n### Union\n\n### Intersection\n\nLet's see if we can use array intersection to figure out which actors share\nknown-for titles and sort the result:\n\n::: {#b4e2a96d .cell execution_count=13}\n``` {.python .cell-code}\nleft = ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name               together_with                                     ┃\n┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringarray<string>                                     │\n├───────────────────┼───────────────────────────────────────────────────┤\n│ Sandra Murphy    ['Andrew Brisk', 'Angus McLaren', ... +41]        │\n│ Espera           ['Avena Campbell', 'Brenda James', ... +11]       │\n│ Mamta Jajoo      ['Barbara Buls', 'Bill Smoler', ... +72]          │\n│ Charles Ellis    ['Cherri Moore', 'Dennis Montano', ... +11]       │\n│ Chris Nicholus   ['Catherine Harrell', 'George Pounders', ... +11] │\n│ Paul Dembling    ['Barbara Buls', 'Bill Smoler', ... +72]          │\n│ Dnyaneshwar Mulay['Avena Campbell', 'Brenda James', ... +11]       │\n│ Daisy Boria      ['Barbara Buls', 'Bill Smoler', ... +72]          │\n│ Brandon Staley   ['Beacon Light', 'Ben Emanuel', ... +48]          │\n│ Dwayne Carter Jr.['Bill Collis', 'Charlie Jones', ... +21]         │\n│                                                  │\n└───────────────────┴───────────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## Advanced operations\n\n### `unnest`\n\nAs of version 7.0.0, Ibis does not support its native `unnest` API for BigQuery,\nbut we plan to add it in the future.\n\nFor now, you can use `con.sql` to construct an Ibis expression from a BigQuery\nSQL string that contains `UNNEST` calls.\n\nDespite the lack of native `UNNEST` support, many use cases for `UNNEST` are met by\nthe\n[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)\nand\n[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)\noperations on array expressions.\n\n### Filtering array elements\n\nShow all people who are actors or editors, keeping only those two professions:\n\n::: {#331a9699 .cell execution_count=14}\n``` {.python .cell-code}\nents.mutate(\n primary_profession=_.primary_profession.filter(\n lambda pp: pp.isin((\"actor\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0)\n```\n\n::: {.cell-output .cell-output-display execution_count=14}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name           primary_profession  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867Danny Brown          ['actor']['tt4532788']                      │\n│ nm7199329James Wyatt Fairbanks['actor']['tt4494580']                      │\n│ nm7201687Tony Jelen           ['actor']['tt2043887']                      │\n│ nm7203397Christian Petrucci   ['actor']['tt4537722']                      │\n│ nm7205107Pablo Schollaert     ['actor']['tt4539222']                      │\n│ nm7207724Shigeru Jerry Endo   ['actor']['tt0043590']                      │\n│ nm7209610Paul Tugwell         ['actor']['tt6185666', 'tt4544182']         │\n│ nm7213017Pancho               ['actor']['tt2333598']                      │\n│ nm7228531Phillip Shinn        ['actor']['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342José María Martínez  ['actor']['tt2244891']                      │\n│                                   │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n### Applying a function to array elements\n\nLet's normalize the case of primary_profession to upper case:\n\n::: {#1dd5c0b8 .cell execution_count=15}\n``` {.python .cell-code}\nents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0)\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name           primary_profession  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867Danny Brown          ['ACTOR']['tt4532788']                      │\n│ nm7199329James Wyatt Fairbanks['ACTOR']['tt4494580']                      │\n│ nm7201687Tony Jelen           ['ACTOR']['tt2043887']                      │\n│ nm7203397Christian Petrucci   ['ACTOR']['tt4537722']                      │\n│ nm7205107Pablo Schollaert     ['ACTOR']['tt4539222']                      │\n│ nm7207724Shigeru Jerry Endo   ['ACTOR']['tt0043590']                      │\n│ nm7209610Paul Tugwell         ['ACTOR']['tt6185666', 'tt4544182']         │\n│ nm7213017Pancho               ['ACTOR']['tt2333598']                      │\n│ nm7228531Phillip Shinn        ['ACTOR']['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342José María Martínez  ['ACTOR']['tt2244891']                      │\n│                                   │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## Conclusion\n\nIbis has a sizable collection of array APIs that work with many different\nbackends and as of version 7.0.0, Ibis supports a much larger set of those APIs\nfor BigQuery!\n\nCheck out [the API\ndocumentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue)\nfor the full set of available methods.\n\nTry it out, and let us know what you think.\n\n",
+    "supporting": [
+      "index_files"
+    ],
+    "filters": [],
+    "includes": {
+      "include-in-header": [
+        "\n\n\n"
+      ]
+    }
+  }
+}
\ No newline at end of file
diff --git a/docs/posts/bigquery-arrays/index.qmd b/docs/posts/bigquery-arrays/index.qmd
new file mode 100644
index 000000000000..919989cc27aa
--- /dev/null
+++ b/docs/posts/bigquery-arrays/index.qmd
@@ -0,0 +1,247 @@
+---
+title: Working with arrays in Google BigQuery
+author: "Phillip Cloud"
+date: "2023-09-12"
+categories:
+  - blog
+  - bigquery
+  - arrays
+  - cloud
+---
+
+## Introduction
+
+Ibis and BigQuery have [worked well together for years](https://cloud.google.com/blog/products/data-analytics/ibis-and-bigquery-scalable-analytics-comfort-python).
+
+In Ibis 7.0.0, they work even better together with the addition of array
+functionality for BigQuery.
+
+Let's look at some examples using BigQuery's [IMDB sample
+data](https://developer.imdb.com/non-commercial-datasets/).
+
+## Basics
+
+First we'll connect to BigQuery and pluck out a table to work with.
+
+We'll start with `from ibis.interactive import *` for maximum convenience.
+
+```{python}
+from ibis.interactive import *
+
+con = ibis.connect("bigquery://ibis-gbq") # <1>
+con.set_database("bigquery-public-data.imdb") # <2>
+```
+
+1. Connect to the **billing** project. Compute (but not storage) is billed to
+   this project.
+2. Set the database to the project and dataset that we will use for analysis.
+
+Let's look at the tables in this dataset:
+
+```{python}
+con.list_tables()
+```
+
+Let's pull out the `name_basics` table, which contains names and metadata about
+people listed on IMDB. We'll call this `ents` (short for `entities`), and remove some
+columns we won't need:
+
+```{python}
+ents = con.tables.name_basics.drop("birth_year", "death_year")
+ents
+```
+
+### Splitting strings into arrays
+
+We can see that `known_for_titles` looks sort of like an array, so let's call
+the
+[`split`](../../reference/expression-strings.qmd#ibis.expr.types.strings.StringValue.split)
+method on that column and replace the existing column:
+
+```{python}
+ents = ents.mutate(known_for_titles=_.known_for_titles.split(","))
+ents
+```
+
+Similarly for `primary_profession`, since people involved in show business often
+have more than one responsibility on a project:
+
+```{python}
+ents = ents.mutate(primary_profession=_.primary_profession.split(","))
+```
+
+### Array length
+
+Let's see how many titles each entity is known for, and then show the five
+people with the largest number of titles they're known for:
+
+This is computed using the
+[`length`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length)
+API on array expressions:
+
+```{python}
+(
+    ents.select("primary_name", num_titles=_.known_for_titles.length())
+    .order_by(_.num_titles.desc())
+    .limit(5)
+)
+```
+
+It seems like the length of the `known_for_titles` might be capped at five!
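To make the split-then-count step concrete without a BigQuery connection, here is a minimal plain-Python sketch of the same idea. It is only an analogy for what `split(",")` followed by `length()` computes for each row, not Ibis code. The helper name `count_titles` is hypothetical, and treating a missing value as zero titles is an assumption made for this sketch rather than a statement about how BigQuery handles `NULL`. The first sample string comes from the table output shown earlier in the post.

```python
from typing import Optional


def count_titles(known_for_titles: Optional[str]) -> int:
    """Hypothetical helper: split a comma-separated string and count the parts."""
    if known_for_titles is None:
        # Assumption for this sketch only: a missing value counts as zero titles.
        return 0
    return len(known_for_titles.split(","))


# One row with two titles, one row with a single title, one row with no value.
print(count_titles("tt4613692,tt0042078"))
print(count_titles("tt3185588"))
print(count_titles(None))
```

In the Ibis version the same two steps run inside BigQuery, so no rows are pulled into Python to do the counting.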
+
+### Index
+
+We can see the position of `"actor"` in each `primary_profession` array:
+
+```{python}
+ents.primary_profession.index("actor")
+```
+
+A return value of `-1` indicates that `"actor"` is not present in the array.
+
+Let's look for entities that are not primarily actors.
+
+We can do this using the
+[`index`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.index)
+method by checking whether the position of the string `"actor"` is greater than
+zero, meaning `"actor"` is listed but is not the first profession:
+
+```{python}
+actor_index = ents.primary_profession.index("actor")
+not_primarily_actors = actor_index > 0
+not_primarily_actors.mean() # <1>
+```
+
+1. The average of a `bool` column gives the fraction of `True` values
+
+Who are they?
+
+```{python}
+ents[not_primarily_actors]
+```
+
+It's not 100% clear whether the order of elements in `primary_profession` matters here.
+
+### Containment
+
+We can get people who are **not** actors using `contains`:
+
+```{python}
+non_actors = ents[~ents.primary_profession.contains("actor")]
+non_actors
+```
+
+### Element removal
+
+We can remove elements from arrays too.
+
+::: {.callout-note}
+## `remove()` does not mutate the underlying data
+:::
+
+Let's see who only has "actor" in the list of their primary professions:
+
+```{python}
+ents.filter(
+    [
+        _.primary_profession.length() > 0,
+        _.primary_profession.remove("actor").length() == 0,
+    ]
+)
+```
+
+### Slicing with square-bracket syntax
+
+Let's remove everyone's first profession from the list, but only if they have
+more than one profession listed:
+
+```{python}
+ents[_.primary_profession.length() > 1].mutate(
+    primary_profession=_.primary_profession[1:],
+)
+```
+
+## Set operations and sorting
+
+Treating arrays as sets is possible with the
+[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union)
+and
+[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect)
+APIs.
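As a rough mental model for these set-style operations, the plain-Python sketch below treats two arrays as sets. It is not Ibis code: the helper names `array_union` and `array_intersect` and the title IDs are hypothetical, and sorting the results is a choice made here so the output is deterministic, not a claim about the element order BigQuery returns.

```python
from typing import List


def array_union(left: List[str], right: List[str]) -> List[str]:
    """Hypothetical model: distinct elements found in either list, sorted."""
    return sorted(set(left) | set(right))


def array_intersect(left: List[str], right: List[str]) -> List[str]:
    """Hypothetical model: distinct elements found in both lists, sorted."""
    return sorted(set(left) & set(right))


# Made-up title IDs for two people who share exactly one title.
a = ["tt0000003", "tt0000001"]
b = ["tt0000001", "tt0000002"]

print(array_union(a, b))
print(array_intersect(a, b))
```

A non-empty intersection is the condition the join in the intersection example relies on: two people are paired only when their `known_for_titles` arrays share at least one element.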
+
+### Union
+
+### Intersection
+
+Let's see if we can use array intersection to figure out which actors share
+known-for titles and sort the result:
+
+```{python}
+left = ents.filter(_.known_for_titles.length() > 0).limit(10_000)
+right = left.view()
+shared_titles = (
+    left
+    .join(right, left.nconst != right.nconst)
+    .select(
+        s.startswith("known_for_titles"),
+        left_name="primary_name",
+        right_name="primary_name_right",
+    )
+    .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)
+    .group_by(name="left_name")
+    .agg(together_with=_.right_name.collect())
+    .mutate(together_with=_.together_with.unique().sort())
+)
+shared_titles
+```
+
+## Advanced operations
+
+### `unnest`
+
+As of version 7.0.0, Ibis does not support its native `unnest` API for BigQuery,
+but we plan to add it in the future.
+
+For now, you can use `con.sql` to construct an Ibis expression from a BigQuery
+SQL string that contains `UNNEST` calls.
+
+Despite the lack of native `UNNEST` support, many use cases for `UNNEST` are met by
+the
+[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)
+and
+[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)
+operations on array expressions.
+
+### Filtering array elements
+
+Show all people who are actors or editors, keeping only those two professions:
+
+```{python}
+ents.mutate(
+    primary_profession=_.primary_profession.filter(
+        lambda pp: pp.isin(("actor", "editor"))
+    )
+).filter(_.primary_profession.length() > 0)
+```
+
+### Applying a function to array elements
+
+Let's normalize the case of primary_profession to upper case:
+
+```{python}
+ents.mutate(
+    primary_profession=_.primary_profession.map(lambda pp: pp.upper())
+).filter(_.primary_profession.length() > 0)
+```
+
+## Conclusion
+
+Ibis has a sizable collection of array APIs that work with many different
+backends and as of version 7.0.0, Ibis supports a much larger set of those APIs
+for BigQuery!
+
+Check out [the API
+documentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue)
+for the full set of available methods.
+
+Try it out, and let us know what you think.