From a8fe085ba1c53e5a44e6203b7d5403669fc9cfef Mon Sep 17 00:00:00 2001 From: Phillip Cloud <417981+cpcloud@users.noreply.github.com> Date: Tue, 12 Sep 2023 07:41:41 -0400 Subject: [PATCH] docs(blog): add bigquery arrays 7.0.0 blog post --- .../index/execute-results/html.json | 15 ++ docs/posts/bigquery-arrays/index.qmd | 240 ++++++++++++++++++ 2 files changed, 255 insertions(+) create mode 100644 docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json create mode 100644 docs/posts/bigquery-arrays/index.qmd diff --git a/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json b/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json new file mode 100644 index 0000000000000..3c62da021b7e7 --- /dev/null +++ b/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json @@ -0,0 +1,15 @@ +{ + "hash": "248bf3712403b5ea5cd4ad18a0d787da", + "result": { + "markdown": "---\ntitle: Working with arrays in Google BigQuery\nauthor: \"Phillip Cloud\"\ndate: \"2023-09-12\"\ncategories:\n - release\n - blog\n - bigquery\n - arrays\n - cloud\n---\n\n## Introduction\n\nIbis and BigQuery have [worked well together for years](https://cloud.google.com/blog/products/data-analytics/ibis-and-bigquery-scalable-analytics-comfort-python).\n\nIn Ibis 7.0.0, they work even better together with the addition of array\nfunctionality for BigQuery.\n\nLet's look at some examples using BigQuery's [IMDB data](https://developer.imdb.com/non-commercial-datasets/).\n\n## Basics\n\nFirst we'll connect to BigQuery and pluck out a table to work with.\n\nWe'll start with `from ibis.interactive import *` for maximum convenience.\n\n::: {#e92e4d14 .cell execution_count=1}\n``` {.python .cell-code}\nfrom ibis.interactive import *\n\ncon = ibis.connect(\"bigquery://ibis-gbq\") # <1>\ncon.set_database(\"bigquery-public-data.imdb\") # <2>\n```\n:::\n\n\n1. Connect to the **billing** project. Compute (but not storage) is billed to\n this project.\n2. 
Set the database to the project and dataset that we will use for analysis.\n\nLet's look at the tables in this dataset:\n\n::: {#9c3be9b5 .cell execution_count=2}\n``` {.python .cell-code}\ncon.list_tables()\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```\n['name_basics',\n 'reviews',\n 'title_akas',\n 'title_basics',\n 'title_crew',\n 'title_episode',\n 'title_principals',\n 'title_ratings']\n```\n:::\n:::\n\n\nLet's pull out the `name_basics` table, which contains names and metadata about\npeople listed on IMDB. We'll call this `ents` (short for `entities`), and remove some\ncolumns we won't need:\n\n::: {#7f119d44 .cell execution_count=3}\n``` {.python .cell-code}\nents = con.tables.name_basics.drop(\"birth_year\", \"death_year\")\nents\n```\n\n::: {.cell-output .cell-output-display execution_count=3}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ string │\n├───────────┼──────────────────────┼────────────────────┼─────────────────────┤\n│ nm7200466 │ Sam Townsend │ NULL │ NULL │\n│ nm7222639 │ Marc Goula │ NULL │ tt3185588 │\n│ nm7236451 │ Charlie Furusho │ NULL │ tt4548374 │\n│ nm7245943 │ Cynthia Llanes │ NULL │ NULL │\n│ nm7252258 │ Lance Hamner │ NULL │ tt0247882 │\n│ nm7254706 │ Paloma White │ NULL │ NULL │\n│ nm7256968 │ Bart den Hartigh │ NULL │ tt3947934 │\n│ nm7268314 │ Don Cummings │ NULL │ tt4613692,tt0042078 │\n│ nm7286675 │ Svitlana Banschukova │ NULL │ tt4636896 │\n│ nm7287050 │ Glenn McCready │ NULL │ tt4637318 │\n│ … │ … │ … │ … │\n└───────────┴──────────────────────┴────────────────────┴─────────────────────┘\n\n```\n:::\n:::\n\n\n### Splitting strings into arrays\n\nWe can see that `known_for_titles` looks sort like an array, so let's call the [`split()`]()\nmethod on that column and replace the existing column:\n\n::: {#5cdf3cc9 .cell execution_count=4}\n``` {.python .cell-code}\nents = ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nents\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ array<string> │\n├───────────┼──────────────────────┼────────────────────┼────────────────────────────┤\n│ nm7200466 │ Sam Townsend │ NULL │ [] │\n│ nm7222639 │ Marc Goula │ NULL │ ['tt3185588'] │\n│ nm7236451 │ Charlie Furusho │ NULL │ ['tt4548374'] │\n│ nm7245943 │ Cynthia Llanes │ NULL │ [] │\n│ nm7252258 │ Lance Hamner │ NULL │ ['tt0247882'] │\n│ nm7254706 │ Paloma White │ NULL │ [] │\n│ nm7256968 │ Bart den Hartigh │ NULL │ ['tt3947934'] │\n│ nm7268314 │ Don Cummings │ NULL │ ['tt4613692', 'tt0042078'] │\n│ nm7286675 │ Svitlana Banschukova │ NULL │ ['tt4636896'] │\n│ nm7287050 │ Glenn McCready │ NULL │ ['tt4637318'] │\n│ … │ … │ … │ … │\n└───────────┴──────────────────────┴────────────────────┴────────────────────────────┘\n\n```\n:::\n:::\n\n\nSimilarly for `primary_profession`, since people involved in show business often\nhave more than one responsibility on a project:\n\n::: {#aa489438 .cell execution_count=5}\n``` {.python .cell-code}\nents = ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n### Array length\n\nLet's see how many titles each entity is known, and then show the five\npeople with the largest number of titles they're known for:\n\nThis is computed using the\n[`length()`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length)\nAPI on array expressions:\n\n::: {#362ba8dd .cell execution_count=6}\n``` {.python .cell-code}\n(\n ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name ┃ num_titles ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ string │ int64 │\n├──────────────────┼────────────┤\n│ Marc Mayer │ 5 │\n│ Alex Koenigsmark │ 5 │\n│ Sally Sun │ 5 │\n│ Carrie Schnelker │ 5 │\n│ Henry Townsend │ 5 │\n└──────────────────┴────────────┘\n\n```\n:::\n:::\n\n\nIt seems like the length of the `known_for_titles` might be capped at five!\n\n### Index\n\nWe can see the position of `\"actor\"` in `primary_profession`s:\n\n::: {#e1f29f1e .cell execution_count=7}\n``` {.python .cell-code}\nents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │\n├────────────────────────────────────────────┤\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ … │\n└────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\nA return value of `-1` indicates that `\"actor\"` is not present in the value:\n\nLet's check whether `\"actor\"` shows up in a different position in the `primary_profession` column:\n\n::: {#49c86e42 .cell execution_count=8}\n``` {.python .cell-code}\nactor_index = ents.primary_profession.index(\"actor\")\nnot_primarily_actors = actor_index > 0\nnot_primarily_actors.mean() # <1>\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=8}\n\n::: {.ansi-escaped-output}\n```{=html}\n
0.019474970102879737
\n```\n:::\n\n:::\n:::\n\n\n1. The average of a `bool` column gives the percentage of `True` values\n\nWho are they?\n\n::: {#432dd817 .cell execution_count=9}\n``` {.python .cell-code}\nents[not_primarily_actors]\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n```{=html}\n┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├────────────┼──────────────────────┼─────────────────────┼─────────────────────────────────────┤\n│ nm2653478 │ Aaron Rosenberg │ ['legal', 'actor'] │ ['tt4218260'] │\n│ nm8437758 │ Diego Potás │ ['editor', 'actor'] │ ['tt6062002'] │\n│ nm6491774 │ Caleb Ralston │ ['editor', 'actor'] │ ['tt10404122', 'tt3711124', ... +2] │\n│ nm8777228 │ Vaheed Sadeghi Sefat │ ['editor', 'actor'] │ ['tt5463172', 'tt27817538', ... +1] │\n│ nm2714786 │ Massimo Croce │ ['editor', 'actor'] │ ['tt1068997', 'tt5431646', ... +1] │\n│ nm10157349 │ Keith Edmund │ ['editor', 'actor'] │ ['tt4142108', 'tt8888656'] │\n│ nm3277549 │ Brad Oberholtzer │ ['editor', 'actor'] │ ['tt2417466', 'tt1358212', ... +1] │\n│ nm3265541 │ Ramiro Suárez │ ['editor', 'actor'] │ ['tt1358277', 'tt25964036', ... +2] │\n│ nm4357241 │ Julian Wierzbicki │ ['editor', 'actor'] │ ['tt3010336', 'tt1854253', ... +2] │\n│ nm7548563 │ Sujith Nayak │ ['editor', 'actor'] │ ['tt21653928', 'tt7479692', ... 
+2] │\n│ … │ … │ … │ … │\n└────────────┴──────────────────────┴─────────────────────┴─────────────────────────────────────┘\n\n```\n:::\n:::\n\n\nIt's not 100% clear whether the order of elements in `primary_profession` matters here.\n\n### Containment\n\nWe can get people who are **not** actors using `contains`:\n\n::: {#fcc45ade .cell execution_count=10}\n``` {.python .cell-code}\nnon_actors = ents[~_.primary_profession.contains(\"actor\")]\nnon_actors\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├────────────┼──────────────────────┼────────────────────┼──────────────────┤\n│ nm4820183 │ Albert J. Soler │ ['legal'] │ [] │\n│ nm11789175 │ Rossa Gassó Raventos │ ['legal'] │ [] │\n│ nm5951720 │ Christine Padlan │ ['legal'] │ [] │\n│ nm10791435 │ Dan Quintero │ ['legal'] │ [] │\n│ nm13404890 │ Benjamin M. Reznik │ ['legal'] │ [] │\n│ nm4814590 │ Anne-Charlotte Gros │ ['legal'] │ [] │\n│ nm2683326 │ Cliff Lovette │ ['legal'] │ [] │\n│ nm14365778 │ Laura Lindenhovius │ ['legal'] │ [] │\n│ nm14701100 │ Karin Roach │ ['legal'] │ [] │\n│ nm3955680 │ Brett J. Rodda │ ['legal'] │ [] │\n│ … │ … │ … │ … │\n└────────────┴──────────────────────┴────────────────────┴──────────────────┘\n\n```\n:::\n:::\n\n\n### Element removal\n\nWe can remove elements from arrays too.\n\n::: {.callout-note}\n## `remove()` does not mutate the underlying data\n:::\n\nLet's see who only has \"actor\" in the list of their primary professions:\n\n::: {#7a585564 .cell execution_count=11}\n``` {.python .cell-code}\nents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n ]\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867 │ Danny Brown │ ['actor'] │ ['tt4532788'] │\n│ nm7199329 │ James Wyatt Fairbanks │ ['actor'] │ ['tt4494580'] │\n│ nm7201687 │ Tony Jelen │ ['actor'] │ ['tt2043887'] │\n│ nm7203397 │ Christian Petrucci │ ['actor'] │ ['tt4537722'] │\n│ nm7205107 │ Pablo Schollaert │ ['actor'] │ ['tt4539222'] │\n│ nm7207724 │ Shigeru Jerry Endo │ ['actor'] │ ['tt0043590'] │\n│ nm7209610 │ Paul Tugwell │ ['actor'] │ ['tt6185666', 'tt4544182'] │\n│ nm7213017 │ Pancho │ ['actor'] │ ['tt2333598'] │\n│ nm7228531 │ Phillip Shinn │ ['actor'] │ ['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342 │ José María Martínez │ ['actor'] │ ['tt2244891'] │\n│ … │ … │ … │ … │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n### Slicing with square-bracket syntax\n\nLet's remove everyone's first profession from the list, but only if they have\nmore than one profession listed:\n\n::: {#2f32ae40 .cell execution_count=12}\n``` {.python .cell-code}\nents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=12}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├────────────┼─────────────────┼────────────────────┼──────────────────────────────────────┤\n│ nm0520425 │ Doug Lord │ ['legal'] │ ['tt0086461'] │\n│ nm7198767 │ Martin Harry │ ['legal'] │ ['tt5554916'] │\n│ nm2232471 │ Lee Thomas │ ['legal'] │ ['tt0236124'] │\n│ nm5500775 │ Stewart Hayes │ ['legal'] │ ['tt2671192'] │\n│ nm2653478 │ Aaron Rosenberg │ ['actor'] │ ['tt4218260'] │\n│ nm0701436 │ Dominic Pye │ ['editor'] │ ['tt27329996', 'tt0195619'] │\n│ nm12705514 │ Okpata Henry │ ['editor'] │ ['tt28450328', 'tt15170142', ... +1] │\n│ nm8313644 │ Jeff Landers │ ['editor'] │ ['tt0488302'] │\n│ nm0438282 │ Joshua Kaplan │ ['editor'] │ ['tt0110687', 'tt0329600', ... +2] │\n│ nm2803821 │ Glen Ring │ ['editor'] │ ['tt1579300', 'tt1126489', ... 
+2] │\n│ … │ … │ … │ … │\n└────────────┴─────────────────┴────────────────────┴──────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## Set operations and sorting\n\nTreating arrays as sets is possible with the\n[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union)\nand\n[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect)\nAPIs.\n\n### Union\n\n### Intersection\n\nLet's see if we can use array intersection to figure which actors share\nknown-for titles and sort the result:\n\n::: {#8f743137 .cell execution_count=13}\n``` {.python .cell-code}\nleft = ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name ┃ together_with ┃\n┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ array<string> │\n├──────────────────────┼──────────────────────────────────────────────────┤\n│ Kotchakorn Voraakhom │ ['Pat Gelsinger'] │\n│ Cesar Diaz │ ['Berlin Abreu', 'Melva Santos'] │\n│ Maikel Cleto │ ['Alejandro de Aza', 'Brad Brink', ... +14] │\n│ Rai Benjamin │ ['Adeline Grattard', 'Byron Gomez', ... +13] │\n│ Andreas Norlén │ ['Andreas Klinger', 'Birgit Dietze', ... +16] │\n│ Valery Kostikov │ ['Benjamin Andrews', 'Carol Chervenak', ... +17] │\n│ Rhea Sinha │ ['Aaron Ellis', 'Andrea Schuelke', ... +23] │\n│ Anne O'Rorke │ ['Aida Muslic', 'Aslam Malik', ... +12] │\n│ Guillaume Sourrieu │ ['Adeline Grattard', 'Aline Perraudin', ... +8] │\n│ Ed Goad │ ['Betty Brady', 'Ed Parker', ... +16] │\n│ … │ … │\n└──────────────────────┴──────────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## Advanced operations\n\n### `unnest`\n\nAs of version 7.0.0 Ibis does not support its native `unnest` API for BigQuery,\nbut we plan to add it in the future.\n\nFor now, you can use `con.sql` to construct an ibis expression from a BigQuery\nSQL string that contains `UNNEST` calls:\n\nDespite lack of native `UNNEST` support, many use cases for `UNNEST` are met by\nthe\n[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)\nand\n[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)\noperations on array expressions.\n\n### Filtering array elements\n\nShow all people who are neither editors nor actors:\n\n::: {#ba4a521c .cell execution_count=14}\n``` {.python .cell-code}\nents.mutate(\n primary_profession=_.primary_profession.filter(\n lambda pp: pp.isin((\"actor\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0)\n```\n\n::: {.cell-output .cell-output-display 
execution_count=14}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867 │ Danny Brown │ ['actor'] │ ['tt4532788'] │\n│ nm7199329 │ James Wyatt Fairbanks │ ['actor'] │ ['tt4494580'] │\n│ nm7201687 │ Tony Jelen │ ['actor'] │ ['tt2043887'] │\n│ nm7203397 │ Christian Petrucci │ ['actor'] │ ['tt4537722'] │\n│ nm7205107 │ Pablo Schollaert │ ['actor'] │ ['tt4539222'] │\n│ nm7207724 │ Shigeru Jerry Endo │ ['actor'] │ ['tt0043590'] │\n│ nm7209610 │ Paul Tugwell │ ['actor'] │ ['tt6185666', 'tt4544182'] │\n│ nm7213017 │ Pancho │ ['actor'] │ ['tt2333598'] │\n│ nm7228531 │ Phillip Shinn │ ['actor'] │ ['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342 │ José María Martínez │ ['actor'] │ ['tt2244891'] │\n│ … │ … │ … │ … │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n### Applying a function to array elements\n\nLet's normalize the case of primary_profession to upper case:\n\n::: {#da71263b .cell execution_count=15}\n``` {.python .cell-code}\nents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0)\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867 │ Danny Brown │ ['ACTOR'] │ ['tt4532788'] │\n│ nm7199329 │ James Wyatt Fairbanks │ ['ACTOR'] │ ['tt4494580'] │\n│ nm7201687 │ Tony Jelen │ ['ACTOR'] │ ['tt2043887'] │\n│ nm7203397 │ Christian Petrucci │ ['ACTOR'] │ ['tt4537722'] │\n│ nm7205107 │ Pablo Schollaert │ ['ACTOR'] │ ['tt4539222'] │\n│ nm7207724 │ Shigeru Jerry Endo │ ['ACTOR'] │ ['tt0043590'] │\n│ nm7209610 │ Paul Tugwell │ ['ACTOR'] │ ['tt6185666', 'tt4544182'] │\n│ nm7213017 │ Pancho │ ['ACTOR'] │ ['tt2333598'] │\n│ nm7228531 │ Phillip Shinn │ ['ACTOR'] │ ['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342 │ José María Martínez │ ['ACTOR'] │ ['tt2244891'] │\n│ … │ … │ … │ … │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## Conclusion\n\nIbis has a sizable collection of array APIs that work with many different\nbackends and as of version 7.0.0, Ibis supports a much larger set of those APIs\nfor BigQuery!\n\nCheck out [the API\ndocumentation](http://localhost:8000/reference/expression-collections.html#ibis.expr.types.arrays.ArrayValue)\nfor the full set of available methods.\n\nTry it out, and let us know what you think.\n\n", + "supporting": [ + "index_files" + ], + "filters": [], + "includes": { + "include-in-header": [ + "\n\n\n" + ] + } + } +} \ No newline at end of file diff --git a/docs/posts/bigquery-arrays/index.qmd b/docs/posts/bigquery-arrays/index.qmd new file mode 100644 index 0000000000000..20bfdb31ae000 --- /dev/null +++ b/docs/posts/bigquery-arrays/index.qmd @@ -0,0 +1,240 @@ +--- +title: Working with 
arrays in Google BigQuery +author: "Phillip Cloud" +date: "2023-09-12" +categories: + - release + - blog + - bigquery + - arrays + - cloud +--- + +## Introduction + +Ibis and BigQuery have [worked well together for years](https://cloud.google.com/blog/products/data-analytics/ibis-and-bigquery-scalable-analytics-comfort-python). + +In Ibis 7.0.0, they work even better together with the addition of array +functionality for BigQuery. + +Let's look at some examples using BigQuery's [IMDB data](https://developer.imdb.com/non-commercial-datasets/). + +## Basics + +First we'll connect to BigQuery and pluck out a table to work with. + +We'll start with `from ibis.interactive import *` for maximum convenience. + +```{python} +from ibis.interactive import * + +con = ibis.connect("bigquery://ibis-gbq") # <1> +con.set_database("bigquery-public-data.imdb") # <2> +``` + +1. Connect to the **billing** project. Compute (but not storage) is billed to + this project. +2. Set the database to the project and dataset that we will use for analysis. + +Let's look at the tables in this dataset: + +```{python} +con.list_tables() +``` + +Let's pull out the `name_basics` table, which contains names and metadata about +people listed on IMDB. 
We'll call this `ents` (short for `entities`), and remove some +columns we won't need: + +```{python} +ents = con.tables.name_basics.drop("birth_year", "death_year") +ents +``` + +### Splitting strings into arrays + +We can see that `known_for_titles` looks sort of like an array, so let's call the `split()` +method on that column and replace the existing column: + +```{python} +ents = ents.mutate(known_for_titles=_.known_for_titles.split(",")) +ents +``` + +Similarly for `primary_profession`, since people involved in show business often +have more than one responsibility on a project: + +```{python} +ents = ents.mutate(primary_profession=_.primary_profession.split(",")) +``` + +### Array length + +Let's see how many titles each entity is known for, and then show the five +people with the largest number of titles they're known for. + +This is computed using the +[`length()`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length) +API on array expressions: + +```{python} +( + ents.select("primary_name", num_titles=_.known_for_titles.length()) + .order_by(_.num_titles.desc()) + .limit(5) +) +``` + +It seems like the length of `known_for_titles` might be capped at five! + +### Index + +We can see the position of `"actor"` in each `primary_profession` array: + +```{python} +ents.primary_profession.index("actor") +``` + +A return value of `-1` indicates that `"actor"` is not present in the array. + +Let's check how often `"actor"` shows up in a position other than the first in the `primary_profession` column: + +```{python} +actor_index = ents.primary_profession.index("actor") +not_primarily_actors = actor_index > 0 +not_primarily_actors.mean() # <1> +``` + +1. The average of a `bool` column gives the fraction of `True` values + +Who are they? + +```{python} +ents[not_primarily_actors] +``` + +It's not 100% clear whether the order of elements in `primary_profession` matters here. 
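
The `index()` semantics used above can be mimicked in plain Python. This is only a sketch: it does not call Ibis or BigQuery, the helper `array_position` is a hypothetical name, and it assumes what the outputs above suggest, namely a zero-based position with `-1` when the element is absent. The sample arrays are a small hand-picked list, not query results.

```python
def array_position(values, target):
    """Hypothetical stand-in for the array `index()` call used above.

    Assumes a zero-based position, or -1 when `target` is absent.
    """
    try:
        return values.index(target)
    except ValueError:
        return -1


# Hand-picked sample arrays (not real query output).
professions = [
    ["legal", "actor"],   # "actor" listed second
    ["editor", "actor"],  # "actor" listed second
    ["actor"],            # "actor" listed first
    ["legal"],            # "actor" absent
    [],                   # no professions at all
]

positions = [array_position(p, "actor") for p in professions]

# Same predicate as `actor_index > 0`: present, but not in first position.
not_primarily_actors = [pos > 0 for pos in positions]

# The mean of booleans is the fraction of True values.
fraction = sum(not_primarily_actors) / len(not_primarily_actors)

print(positions)
print(fraction)
```

Note that `> 0` excludes both the absent case (`-1`) and the first-position case (`0`), which is why people whose only listed profession is `"actor"` do not appear in the filtered result.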
+ +### Containment + +We can get people who are **not** actors using `contains`: + +```{python} +non_actors = ents[~_.primary_profession.contains("actor")] +non_actors +``` + +### Element removal + +We can remove elements from arrays too. + +::: {.callout-note} +## `remove()` does not mutate the underlying data +::: + +Let's see who only has "actor" in the list of their primary professions: + +```{python} +ents.filter( + [ + _.primary_profession.length() > 0, + _.primary_profession.remove("actor").length() == 0, + ] +) +``` + +### Slicing with square-bracket syntax + +Let's remove everyone's first profession from the list, but only if they have +more than one profession listed: + +```{python} +ents[_.primary_profession.length() > 1].mutate( + primary_profession=_.primary_profession[1:], +) +``` + +## Set operations and sorting + +Treating arrays as sets is possible with the +[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union) +and +[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect) +APIs. + +### Union + +### Intersection + +Let's see if we can use array intersection to figure out which actors share +known-for titles, and sort the result: + +```{python} +left = ents.filter(_.known_for_titles.length() > 0).limit(10_000) +right = left.view() +shared_titles = ( + left + .join(right, left.nconst != right.nconst) + .select( + s.startswith("known_for_titles"), + left_name="primary_name", + right_name="primary_name_right", + ) + .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0) + .group_by(name="left_name") + .agg(together_with=_.right_name.collect()) + .mutate(together_with=_.together_with.unique().sort()) +) +shared_titles +``` + +## Advanced operations + +### `unnest` + +As of version 7.0.0, Ibis does not support its native `unnest` API for BigQuery, +but we plan to add it in the future. 
+ +For now, you can use `con.sql` to construct an Ibis expression from a BigQuery +SQL string that contains `UNNEST` calls. + +Despite the lack of native `UNNEST` support, many use cases for `UNNEST` are met by +the +[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter) +and +[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map) +operations on array expressions. + +### Filtering array elements + +Keep only the `"actor"` and `"editor"` entries in each array, then show the people who have at least one of them: + +```{python} +ents.mutate( + primary_profession=_.primary_profession.filter( + lambda pp: pp.isin(("actor", "editor")) + ) +).filter(_.primary_profession.length() > 0) +``` + +### Applying a function to array elements + +Let's normalize the case of `primary_profession` to upper case: + +```{python} +ents.mutate( + primary_profession=_.primary_profession.map(lambda pp: pp.upper()) +).filter(_.primary_profession.length() > 0) +``` + +## Conclusion + +Ibis has a sizable collection of array APIs that work with many different +backends, and as of version 7.0.0 Ibis supports a much larger set of those APIs +for BigQuery! + +Check out [the API +documentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue) +for the full set of available methods. + +Try it out, and let us know what you think.