From a8fe085ba1c53e5a44e6203b7d5403669fc9cfef Mon Sep 17 00:00:00 2001 From: Phillip Cloud <417981+cpcloud@users.noreply.github.com> Date: Tue, 12 Sep 2023 07:41:41 -0400 Subject: [PATCH] docs(blog): add bigquery arrays 7.0.0 blog post --- .../index/execute-results/html.json | 15 ++ docs/posts/bigquery-arrays/index.qmd | 240 ++++++++++++++++++ 2 files changed, 255 insertions(+) create mode 100644 docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json create mode 100644 docs/posts/bigquery-arrays/index.qmd diff --git a/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json b/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json new file mode 100644 index 0000000000000..3c62da021b7e7 --- /dev/null +++ b/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json @@ -0,0 +1,15 @@ +{ + "hash": "248bf3712403b5ea5cd4ad18a0d787da", + "result": { + "markdown": "---\ntitle: Working with arrays in Google BigQuery\nauthor: \"Phillip Cloud\"\ndate: \"2023-09-12\"\ncategories:\n - release\n - blog\n - bigquery\n - arrays\n - cloud\n---\n\n## Introduction\n\nIbis and BigQuery have [worked well together for years](https://cloud.google.com/blog/products/data-analytics/ibis-and-bigquery-scalable-analytics-comfort-python).\n\nIn Ibis 7.0.0, they work even better together with the addition of array\nfunctionality for BigQuery.\n\nLet's look at some examples using BigQuery's [IMDB data](https://developer.imdb.com/non-commercial-datasets/).\n\n## Basics\n\nFirst we'll connect to BigQuery and pluck out a table to work with.\n\nWe'll start with `from ibis.interactive import *` for maximum convenience.\n\n::: {#e92e4d14 .cell execution_count=1}\n``` {.python .cell-code}\nfrom ibis.interactive import *\n\ncon = ibis.connect(\"bigquery://ibis-gbq\") # <1>\ncon.set_database(\"bigquery-public-data.imdb\") # <2>\n```\n:::\n\n\n1. Connect to the **billing** project. Compute (but not storage) is billed to\n this project.\n2. 
Set the database to the project and dataset that we will use for analysis.\n\nLet's look at the tables in this dataset:\n\n::: {#9c3be9b5 .cell execution_count=2}\n``` {.python .cell-code}\ncon.list_tables()\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```\n['name_basics',\n 'reviews',\n 'title_akas',\n 'title_basics',\n 'title_crew',\n 'title_episode',\n 'title_principals',\n 'title_ratings']\n```\n:::\n:::\n\n\nLet's pull out the `name_basics` table, which contains names and metadata about\npeople listed on IMDB. We'll call this `ents` (short for `entities`), and remove some\ncolumns we won't need:\n\n::: {#7f119d44 .cell execution_count=3}\n``` {.python .cell-code}\nents = con.tables.name_basics.drop(\"birth_year\", \"death_year\")\nents\n```\n\n::: {.cell-output .cell-output-display execution_count=3}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name         ┃ primary_profession ┃ known_for_titles    ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string               │ string             │ string              │\n├───────────┼──────────────────────┼────────────────────┼─────────────────────┤\n│ nm7200466 │ Sam Townsend         │ NULL               │ NULL                │\n│ nm7222639 │ Marc Goula           │ NULL               │ tt3185588           │\n│ nm7236451 │ Charlie Furusho      │ NULL               │ tt4548374           │\n│ nm7245943 │ Cynthia Llanes       │ NULL               │ NULL                │\n│ nm7252258 │ Lance Hamner         │ NULL               │ tt0247882           │\n│ nm7254706 │ Paloma White         │ NULL               │ NULL                │\n│ nm7256968 │ Bart den Hartigh     │ NULL               │ tt3947934           │\n│ nm7268314 │ Don Cummings         │ NULL               │ tt4613692,tt0042078 │\n│ nm7286675 │ Svitlana Banschukova │ NULL               │ tt4636896           │\n│ nm7287050 │ Glenn McCready       │ NULL               │ tt4637318           │\n│ …         │ …                    │ …                  │ …                   │\n└───────────┴──────────────────────┴────────────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n### Splitting strings into arrays\n\nWe can see that `known_for_titles` looks sort of like an array, so let's call the `split()`\nmethod on that column and replace the existing column:\n\n::: {#5cdf3cc9 .cell execution_count=4}\n``` {.python .cell-code}\nents = ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nents\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name         ┃ primary_profession ┃ known_for_titles           ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string               │ string             │ array<string>              │\n├───────────┼──────────────────────┼────────────────────┼────────────────────────────┤\n│ nm7200466 │ Sam Townsend         │ NULL               │ []                         │\n│ nm7222639 │ Marc Goula           │ NULL               │ ['tt3185588']              │\n│ nm7236451 │ Charlie Furusho      │ NULL               │ ['tt4548374']              │\n│ nm7245943 │ Cynthia Llanes       │ NULL               │ []                         │\n│ nm7252258 │ Lance Hamner         │ NULL               │ ['tt0247882']              │\n│ nm7254706 │ Paloma White         │ NULL               │ []                         │\n│ nm7256968 │ Bart den Hartigh     │ NULL               │ ['tt3947934']              │\n│ nm7268314 │ Don Cummings         │ NULL               │ ['tt4613692', 'tt0042078'] │\n│ nm7286675 │ Svitlana Banschukova │ NULL               │ ['tt4636896']              │\n│ nm7287050 │ Glenn McCready       │ NULL               │ ['tt4637318']              │\n│ …         │ …                    │ …                  │ …                          │\n└───────────┴──────────────────────┴────────────────────┴────────────────────────────┘\n
\n```\n:::\n:::\n\n\nWe'll do the same for `primary_profession`, since people involved in show business often\nhave more than one responsibility on a project:\n\n::: {#aa489438 .cell execution_count=5}\n``` {.python .cell-code}\nents = ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n### Array length\n\nLet's count how many titles each entity is known for, and then show the five\npeople with the most known-for titles.\n\nThis is computed using the\n[`length()`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length)\nAPI on array expressions:\n\n::: {#362ba8dd .cell execution_count=6}\n``` {.python .cell-code}\n(\n ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name     ┃ num_titles ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ string           │ int64      │\n├──────────────────┼────────────┤\n│ Marc Mayer       │          5 │\n│ Alex Koenigsmark │          5 │\n│ Sally Sun        │          5 │\n│ Carrie Schnelker │          5 │\n│ Henry Townsend   │          5 │\n└──────────────────┴────────────┘\n
\n```\n:::\n:::\n\n\nIt seems like the length of `known_for_titles` might be capped at five!\n\n### Index\n\nWe can find the position of `\"actor\"` in each `primary_profession` array:\n\n::: {#e1f29f1e .cell execution_count=7}\n``` {.python .cell-code}\nents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64                                      │\n├────────────────────────────────────────────┤\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                           │\n└────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\nA return value of `-1` indicates that `\"actor\"` is not present in the array.\n\nLet's check how often `\"actor\"` shows up in a position other than the first. Positions are zero-based, so that means an index greater than `0`:\n\n::: {#49c86e42 .cell execution_count=8}\n``` {.python .cell-code}\nactor_index = ents.primary_profession.index(\"actor\")\nnot_primarily_actors = actor_index > 0\nnot_primarily_actors.mean() # <1>\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=8}\n\n::: {.ansi-escaped-output}\n```{=html}\n
0.019474970102879737
\n```\n:::\n\n:::\n:::\n\n\n1. The average of a `bool` column gives the fraction of `True` values\n\nWho are they?\n\n::: {#432dd817 .cell execution_count=9}\n``` {.python .cell-code}\nents[not_primarily_actors]\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     ┃ primary_name         ┃ primary_profession  ┃ known_for_titles                    ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string     │ string               │ array<string>       │ array<string>                       │\n├────────────┼──────────────────────┼─────────────────────┼─────────────────────────────────────┤\n│ nm2653478  │ Aaron Rosenberg      │ ['legal', 'actor']  │ ['tt4218260']                       │\n│ nm8437758  │ Diego Potás          │ ['editor', 'actor'] │ ['tt6062002']                       │\n│ nm6491774  │ Caleb Ralston        │ ['editor', 'actor'] │ ['tt10404122', 'tt3711124', ... +2] │\n│ nm8777228  │ Vaheed Sadeghi Sefat │ ['editor', 'actor'] │ ['tt5463172', 'tt27817538', ... +1] │\n│ nm2714786  │ Massimo Croce        │ ['editor', 'actor'] │ ['tt1068997', 'tt5431646', ... +1]  │\n│ nm10157349 │ Keith Edmund         │ ['editor', 'actor'] │ ['tt4142108', 'tt8888656']          │\n│ nm3277549  │ Brad Oberholtzer     │ ['editor', 'actor'] │ ['tt2417466', 'tt1358212', ... +1]  │\n│ nm3265541  │ Ramiro Suárez        │ ['editor', 'actor'] │ ['tt1358277', 'tt25964036', ... +2] │\n│ nm4357241  │ Julian Wierzbicki    │ ['editor', 'actor'] │ ['tt3010336', 'tt1854253', ... +2]  │\n│ nm7548563  │ Sujith Nayak         │ ['editor', 'actor'] │ ['tt21653928', 'tt7479692', ... +2] │\n│ …          │ …                    │ …                   │ …                                   │\n└────────────┴──────────────────────┴─────────────────────┴─────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\nIt's not 100% clear whether the order of elements in `primary_profession` matters here.\n\n### Containment\n\nWe can get people who are **not** actors using `contains`:\n\n::: {#fcc45ade .cell execution_count=10}\n``` {.python .cell-code}\nnon_actors = ents[~_.primary_profession.contains(\"actor\")]\nnon_actors\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst     ┃ primary_name         ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string     │ string               │ array<string>      │ array<string>    │\n├────────────┼──────────────────────┼────────────────────┼──────────────────┤\n│ nm4820183  │ Albert J. Soler      │ ['legal']          │ []               │\n│ nm11789175 │ Rossa Gassó Raventos │ ['legal']          │ []               │\n│ nm5951720  │ Christine Padlan     │ ['legal']          │ []               │\n│ nm10791435 │ Dan Quintero         │ ['legal']          │ []               │\n│ nm13404890 │ Benjamin M. Reznik   │ ['legal']          │ []               │\n│ nm4814590  │ Anne-Charlotte Gros  │ ['legal']          │ []               │\n│ nm2683326  │ Cliff Lovette        │ ['legal']          │ []               │\n│ nm14365778 │ Laura Lindenhovius   │ ['legal']          │ []               │\n│ nm14701100 │ Karin Roach          │ ['legal']          │ []               │\n│ nm3955680  │ Brett J. Rodda       │ ['legal']          │ []               │\n│ …          │ …                    │ …                  │ …                │\n└────────────┴──────────────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n### Element removal\n\nWe can remove elements from arrays too.\n\n::: {.callout-note}\n## `remove()` does not mutate the underlying data\n:::\n\nLet's see who only has \"actor\" in the list of their primary professions:\n\n::: {#7a585564 .cell execution_count=11}\n``` {.python .cell-code}\nents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n ]\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name          ┃ primary_profession ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string                │ array<string>      │ array<string>                      │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867 │ Danny Brown           │ ['actor']          │ ['tt4532788']                      │\n│ nm7199329 │ James Wyatt Fairbanks │ ['actor']          │ ['tt4494580']                      │\n│ nm7201687 │ Tony Jelen            │ ['actor']          │ ['tt2043887']                      │\n│ nm7203397 │ Christian Petrucci    │ ['actor']          │ ['tt4537722']                      │\n│ nm7205107 │ Pablo Schollaert      │ ['actor']          │ ['tt4539222']                      │\n│ nm7207724 │ Shigeru Jerry Endo    │ ['actor']          │ ['tt0043590']                      │\n│ nm7209610 │ Paul Tugwell          │ ['actor']          │ ['tt6185666', 'tt4544182']         │\n│ nm7213017 │ Pancho                │ ['actor']          │ ['tt2333598']                      │\n│ nm7228531 │ Phillip Shinn         │ ['actor']          │ ['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342 │ José María Martínez   │ ['actor']          │ ['tt2244891']                      │\n│ …         │ …                     │ …                  │ …                                  │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n### Slicing with square-bracket syntax\n\nLet's remove everyone's first profession from the list, but only if they have\nmore than one profession listed:\n\n::: {#2f32ae40 .cell execution_count=12}\n``` {.python .cell-code}\nents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=12}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     ┃ primary_name    ┃ primary_profession ┃ known_for_titles                     ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string     │ string          │ array<string>      │ array<string>                        │\n├────────────┼─────────────────┼────────────────────┼──────────────────────────────────────┤\n│ nm0520425  │ Doug Lord       │ ['legal']          │ ['tt0086461']                        │\n│ nm7198767  │ Martin Harry    │ ['legal']          │ ['tt5554916']                        │\n│ nm2232471  │ Lee Thomas      │ ['legal']          │ ['tt0236124']                        │\n│ nm5500775  │ Stewart Hayes   │ ['legal']          │ ['tt2671192']                        │\n│ nm2653478  │ Aaron Rosenberg │ ['actor']          │ ['tt4218260']                        │\n│ nm0701436  │ Dominic Pye     │ ['editor']         │ ['tt27329996', 'tt0195619']          │\n│ nm12705514 │ Okpata Henry    │ ['editor']         │ ['tt28450328', 'tt15170142', ... +1] │\n│ nm8313644  │ Jeff Landers    │ ['editor']         │ ['tt0488302']                        │\n│ nm0438282  │ Joshua Kaplan   │ ['editor']         │ ['tt0110687', 'tt0329600', ... +2]   │\n│ nm2803821  │ Glen Ring       │ ['editor']         │ ['tt1579300', 'tt1126489', ... +2]   │\n│ …          │ …               │ …                  │ …                                    │\n└────────────┴─────────────────┴────────────────────┴──────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## Set operations and sorting\n\nTreating arrays as sets is possible with the\n[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union)\nand\n[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect)\nAPIs.\n\n### Union\n\n### Intersection\n\nLet's see if we can use array intersection to figure out which people share\nknown-for titles, and sort the result:\n\n::: {#8f743137 .cell execution_count=13}\n``` {.python .cell-code}\nleft = ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name                 ┃ together_with                                    ┃\n┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string               │ array<string>                                    │\n├──────────────────────┼──────────────────────────────────────────────────┤\n│ Kotchakorn Voraakhom │ ['Pat Gelsinger']                                │\n│ Cesar Diaz           │ ['Berlin Abreu', 'Melva Santos']                 │\n│ Maikel Cleto         │ ['Alejandro de Aza', 'Brad Brink', ... +14]      │\n│ Rai Benjamin         │ ['Adeline Grattard', 'Byron Gomez', ... +13]     │\n│ Andreas Norlén       │ ['Andreas Klinger', 'Birgit Dietze', ... +16]    │\n│ Valery Kostikov      │ ['Benjamin Andrews', 'Carol Chervenak', ... +17] │\n│ Rhea Sinha           │ ['Aaron Ellis', 'Andrea Schuelke', ... +23]      │\n│ Anne O'Rorke         │ ['Aida Muslic', 'Aslam Malik', ... +12]          │\n│ Guillaume Sourrieu   │ ['Adeline Grattard', 'Aline Perraudin', ... +8]  │\n│ Ed Goad              │ ['Betty Brady', 'Ed Parker', ... +16]            │\n│ …                    │ …                                                │\n└──────────────────────┴──────────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## Advanced operations\n\n### `unnest`\n\nAs of version 7.0.0, Ibis does not support its native `unnest` API for BigQuery,\nbut we plan to add it in the future.\n\nFor now, you can use `con.sql` to construct an Ibis expression from a BigQuery\nSQL string that contains `UNNEST` calls.\n\nDespite the lack of native `UNNEST` support, many use cases for `UNNEST` are met by\nthe\n[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)\nand\n[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)\noperations on array expressions.\n\n### Filtering array elements\n\nShow all people who are actors or editors, keeping only those two professions in each array:\n\n::: {#ba4a521c .cell execution_count=14}\n``` {.python .cell-code}\nents.mutate(\n primary_profession=_.primary_profession.filter(\n lambda pp: pp.isin((\"actor\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0)\n```\n\n::: {.cell-output .cell-output-display execution_count=14}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name          ┃ primary_profession ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string                │ array<string>      │ array<string>                      │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867 │ Danny Brown           │ ['actor']          │ ['tt4532788']                      │\n│ nm7199329 │ James Wyatt Fairbanks │ ['actor']          │ ['tt4494580']                      │\n│ nm7201687 │ Tony Jelen            │ ['actor']          │ ['tt2043887']                      │\n│ nm7203397 │ Christian Petrucci    │ ['actor']          │ ['tt4537722']                      │\n│ nm7205107 │ Pablo Schollaert      │ ['actor']          │ ['tt4539222']                      │\n│ nm7207724 │ Shigeru Jerry Endo    │ ['actor']          │ ['tt0043590']                      │\n│ nm7209610 │ Paul Tugwell          │ ['actor']          │ ['tt6185666', 'tt4544182']         │\n│ nm7213017 │ Pancho                │ ['actor']          │ ['tt2333598']                      │\n│ nm7228531 │ Phillip Shinn         │ ['actor']          │ ['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342 │ José María Martínez   │ ['actor']          │ ['tt2244891']                      │\n│ …         │ …                     │ …                  │ …                                  │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n### Applying a function to array elements\n\nLet's normalize the case of `primary_profession` to upper case:\n\n::: {#da71263b .cell execution_count=15}\n``` {.python .cell-code}\nents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0)\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name          ┃ primary_profession ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string                │ array<string>      │ array<string>                      │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867 │ Danny Brown           │ ['ACTOR']          │ ['tt4532788']                      │\n│ nm7199329 │ James Wyatt Fairbanks │ ['ACTOR']          │ ['tt4494580']                      │\n│ nm7201687 │ Tony Jelen            │ ['ACTOR']          │ ['tt2043887']                      │\n│ nm7203397 │ Christian Petrucci    │ ['ACTOR']          │ ['tt4537722']                      │\n│ nm7205107 │ Pablo Schollaert      │ ['ACTOR']          │ ['tt4539222']                      │\n│ nm7207724 │ Shigeru Jerry Endo    │ ['ACTOR']          │ ['tt0043590']                      │\n│ nm7209610 │ Paul Tugwell          │ ['ACTOR']          │ ['tt6185666', 'tt4544182']         │\n│ nm7213017 │ Pancho                │ ['ACTOR']          │ ['tt2333598']                      │\n│ nm7228531 │ Phillip Shinn         │ ['ACTOR']          │ ['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342 │ José María Martínez   │ ['ACTOR']          │ ['tt2244891']                      │\n│ …         │ …                     │ …                  │ …                                  │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## Conclusion\n\nIbis has a sizable collection of array APIs that work with many different\nbackends, and as of version 7.0.0 Ibis supports a much larger set of those APIs\nfor BigQuery!\n\nCheck out [the API\ndocumentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue)\nfor the full set of available methods.\n\nTry it out, and let us know what you think.\n\n", + "supporting": [ + "index_files" + ], + "filters": [], + "includes": { + "include-in-header": [ + "\n\n\n" + ] + } + } +} \ No newline at end of file diff --git a/docs/posts/bigquery-arrays/index.qmd b/docs/posts/bigquery-arrays/index.qmd new file mode 100644 index 0000000000000..20bfdb31ae000 --- /dev/null +++ b/docs/posts/bigquery-arrays/index.qmd @@ -0,0 +1,240 @@ +--- +title: Working with arrays in Google BigQuery +author: "Phillip Cloud" +date: "2023-09-12" +categories: + - release + - blog + - bigquery + - arrays + - cloud +--- + +## Introduction + +Ibis and BigQuery have [worked well together for years](https://cloud.google.com/blog/products/data-analytics/ibis-and-bigquery-scalable-analytics-comfort-python). + +In Ibis 7.0.0, they work even better together with the addition of array +functionality for BigQuery. + +Let's look at some examples using BigQuery's [IMDB data](https://developer.imdb.com/non-commercial-datasets/). + +## Basics + +First we'll connect to BigQuery and pluck out a table to work with. + +We'll start with `from ibis.interactive import *` for maximum convenience. + +```{python} +from ibis.interactive import * + +con = ibis.connect("bigquery://ibis-gbq") # <1> +con.set_database("bigquery-public-data.imdb") # <2> +``` + +1. Connect to the **billing** project. Compute (but not storage) is billed to + this project. +2. Set the database to the project and dataset that we will use for analysis. 
+ +Let's look at the tables in this dataset: + +```{python} +con.list_tables() +``` + +Let's pull out the `name_basics` table, which contains names and metadata about +people listed on IMDB. We'll call this `ents` (short for `entities`), and remove some +columns we won't need: + +```{python} +ents = con.tables.name_basics.drop("birth_year", "death_year") +ents +``` + +### Splitting strings into arrays + +We can see that `known_for_titles` looks sort of like an array, so let's call the `split()` +method on that column and replace the existing column: + +```{python} +ents = ents.mutate(known_for_titles=_.known_for_titles.split(",")) +ents +``` + +We'll do the same for `primary_profession`, since people involved in show business often +have more than one responsibility on a project: + +```{python} +ents = ents.mutate(primary_profession=_.primary_profession.split(",")) +``` + +### Array length + +Let's count how many titles each entity is known for, and then show the five +people with the most known-for titles. + +This is computed using the +[`length()`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length) +API on array expressions: + +```{python} +( + ents.select("primary_name", num_titles=_.known_for_titles.length()) + .order_by(_.num_titles.desc()) + .limit(5) +) +``` + +It seems like the length of `known_for_titles` might be capped at five! + +### Index + +We can find the position of `"actor"` in each `primary_profession` array: + +```{python} +ents.primary_profession.index("actor") +``` + +A return value of `-1` indicates that `"actor"` is not present in the array. + +Let's check how often `"actor"` shows up in a position other than the first. Positions are zero-based, so that means an index greater than `0`: + +```{python} +actor_index = ents.primary_profession.index("actor") +not_primarily_actors = actor_index > 0 +not_primarily_actors.mean() # <1> +``` + +1. The average of a `bool` column gives the fraction of `True` values + +Who are they? 
+ +```{python} +ents[not_primarily_actors] +``` + +It's not 100% clear whether the order of elements in `primary_profession` matters here. + +### Containment + +We can get people who are **not** actors using `contains`: + +```{python} +non_actors = ents[~_.primary_profession.contains("actor")] +non_actors +``` + +### Element removal + +We can remove elements from arrays too. + +::: {.callout-note} +## `remove()` does not mutate the underlying data +::: + +Let's see who only has "actor" in the list of their primary professions: + +```{python} +ents.filter( + [ + _.primary_profession.length() > 0, + _.primary_profession.remove("actor").length() == 0, + ] +) +``` + +### Slicing with square-bracket syntax + +Let's remove everyone's first profession from the list, but only if they have +more than one profession listed: + +```{python} +ents[_.primary_profession.length() > 1].mutate( + primary_profession=_.primary_profession[1:], +) +``` + +## Set operations and sorting + +Treating arrays as sets is possible with the +[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union) +and +[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect) +APIs. 
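To make the set semantics concrete, here is a minimal plain-Python sketch of what `union` and `intersect` compute for one pair of array values. It only illustrates the idea and is not how Ibis or BigQuery implement these operations. The helper names `array_union` and `array_intersect` are hypothetical, the sample title IDs are illustrative, and sorting the result is an assumption made so the output is deterministic (element order in a real query result may differ).

```python
def array_union(left, right):
    """Distinct elements found in either list (hypothetical helper)."""
    return sorted(set(left) | set(right))


def array_intersect(left, right):
    """Distinct elements found in both lists (hypothetical helper)."""
    return sorted(set(left) & set(right))


# Illustrative known-for title IDs for two people
a = ["tt4613692", "tt0042078"]
b = ["tt0042078", "tt3185588"]

print(array_union(a, b))      # every title either person is known for
print(array_intersect(a, b))  # titles both people are known for
```

A non-empty intersection is the condition the intersection query in this section filters on when it looks for people who share known-for titles.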
+ +### Union + +### Intersection + +Let's see if we can use array intersection to figure out which people share +known-for titles, and sort the result: + +```{python} +left = ents.filter(_.known_for_titles.length() > 0).limit(10_000) +right = left.view() +shared_titles = ( + left + .join(right, left.nconst != right.nconst) + .select( + s.startswith("known_for_titles"), + left_name="primary_name", + right_name="primary_name_right", + ) + .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0) + .group_by(name="left_name") + .agg(together_with=_.right_name.collect()) + .mutate(together_with=_.together_with.unique().sort()) +) +shared_titles +``` + +## Advanced operations + +### `unnest` + +As of version 7.0.0, Ibis does not support its native `unnest` API for BigQuery, +but we plan to add it in the future. + +For now, you can use `con.sql` to construct an Ibis expression from a BigQuery +SQL string that contains `UNNEST` calls. + +Despite the lack of native `UNNEST` support, many use cases for `UNNEST` are met by +the +[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter) +and +[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map) +operations on array expressions. + +### Filtering array elements + +Show all people who are actors or editors, keeping only those two professions in each array: + +```{python} +ents.mutate( + primary_profession=_.primary_profession.filter( + lambda pp: pp.isin(("actor", "editor")) + ) +).filter(_.primary_profession.length() > 0) +``` + +### Applying a function to array elements + +Let's normalize the case of `primary_profession` to upper case: + +```{python} +ents.mutate( + primary_profession=_.primary_profession.map(lambda pp: pp.upper()) +).filter(_.primary_profession.length() > 0) +``` + +## Conclusion + +Ibis has a sizable collection of array APIs that work with many different +backends, and as of version 7.0.0 Ibis supports a much larger set of those APIs +for BigQuery! 
+ +Check out [the API +documentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue) +for the full set of available methods. + +Try it out, and let us know what you think.