diff --git a/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json b/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json new file mode 100644 index 000000000000..288341b461fa --- /dev/null +++ b/docs/_freeze/posts/bigquery-arrays/index/execute-results/html.json @@ -0,0 +1,15 @@ +{ + "hash": "47440a350432e93f85e6ed6553cc40f0", + "result": { + "markdown": "---\ntitle: Working with arrays in Google BigQuery\nauthor: \"Phillip Cloud\"\ndate: \"2023-09-12\"\ncategories:\n - blog\n - bigquery\n - arrays\n - cloud\n---\n\n## Introduction\n\nIbis and BigQuery have [worked well together for years](https://cloud.google.com/blog/products/data-analytics/ibis-and-bigquery-scalable-analytics-comfort-python).\n\nIn Ibis 7.0.0, they work even better together with the addition of array\nfunctionality for BigQuery.\n\nLet's look at some examples using BigQuery's [IMDB sample\ndata](https://developer.imdb.com/non-commercial-datasets/).\n\n## Basics\n\nFirst we'll connect to BigQuery and pluck out a table to work with.\n\nWe'll start with `from ibis.interactive import *` for maximum convenience.\n\n::: {#75a9d26f .cell execution_count=1}\n``` {.python .cell-code}\nfrom ibis.interactive import *\n\ncon = ibis.connect(\"bigquery://ibis-gbq\") # <1>\ncon.set_database(\"bigquery-public-data.imdb\") # <2>\n```\n:::\n\n\n1. Connect to the **billing** project. Compute (but not storage) is billed to\n this project.\n2. Set the database to the project and dataset that we will use for analysis.\n\nLet's look at the tables in this dataset:\n\n::: {#203b6b28 .cell execution_count=2}\n``` {.python .cell-code}\ncon.list_tables()\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```\n['name_basics',\n 'reviews',\n 'title_akas',\n 'title_basics',\n 'title_crew',\n 'title_episode',\n 'title_principals',\n 'title_ratings']\n```\n:::\n:::\n\n\nLet's pull out the `name_basics` table, which contains names and metadata about\npeople listed on IMDB. 
We'll call this `ents` (short for `entities`), and remove some\ncolumns we won't need:\n\n::: {#6229c913 .cell execution_count=3}\n``` {.python .cell-code}\nents = con.tables.name_basics.drop(\"birth_year\", \"death_year\")\nents\n```\n\n::: {.cell-output .cell-output-display execution_count=3}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ string │\n├───────────┼──────────────────────┼────────────────────┼─────────────────────┤\n│ nm7200466 │ Sam Townsend │ NULL │ NULL │\n│ nm7222639 │ Marc Goula │ NULL │ tt3185588 │\n│ nm7236451 │ Charlie Furusho │ NULL │ tt4548374 │\n│ nm7245943 │ Cynthia Llanes │ NULL │ NULL │\n│ nm7252258 │ Lance Hamner │ NULL │ tt0247882 │\n│ nm7254706 │ Paloma White │ NULL │ NULL │\n│ nm7256968 │ Bart den Hartigh │ NULL │ tt3947934 │\n│ nm7268314 │ Don Cummings │ NULL │ tt4613692,tt0042078 │\n│ nm7286675 │ Svitlana Banschukova │ NULL │ tt4636896 │\n│ nm7287050 │ Glenn McCready │ NULL │ tt4637318 │\n│ … │ … │ … │ … │\n└───────────┴──────────────────────┴────────────────────┴─────────────────────┘\n\n```\n:::\n:::\n\n\n### Splitting strings into arrays\n\nWe can see that `known_for_titles` looks sort of like an array, so let's call\nthe\n[`split`](../../reference/expression-strings.qmd#ibis.expr.types.strings.StringValue.split)\nmethod on that column and replace the existing column:\n\n::: {#1763a10e .cell execution_count=4}\n``` {.python .cell-code}\nents = ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nents\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ array<string> │\n├───────────┼──────────────────────┼────────────────────┼────────────────────────────┤\n│ nm7200466 │ Sam Townsend │ NULL │ [] │\n│ nm7222639 │ Marc Goula │ NULL │ ['tt3185588'] │\n│ nm7236451 │ Charlie Furusho │ NULL │ ['tt4548374'] │\n│ nm7245943 │ Cynthia Llanes │ NULL │ [] │\n│ nm7252258 │ Lance Hamner │ NULL │ ['tt0247882'] │\n│ nm7254706 │ Paloma White │ NULL │ [] │\n│ nm7256968 │ Bart den Hartigh │ NULL │ ['tt3947934'] │\n│ nm7268314 │ Don Cummings │ NULL │ ['tt4613692', 'tt0042078'] │\n│ nm7286675 │ Svitlana Banschukova │ NULL │ ['tt4636896'] │\n│ nm7287050 │ Glenn McCready │ NULL │ ['tt4637318'] │\n│ … │ … │ … │ … │\n└───────────┴──────────────────────┴────────────────────┴────────────────────────────┘\n\n```\n:::\n:::\n\n\nSimilarly for `primary_profession`, since people involved in show business often\nhave more than one responsibility on a project:\n\n::: {#398f73c9 .cell execution_count=5}\n``` {.python .cell-code}\nents = ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n### Array length\n\nLet's see how many titles each entity is known for, and then show the five\npeople with the largest number of titles they're known for:\n\nThis is computed using the\n[`length`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length)\nAPI on array expressions:\n\n::: {#315e2b5c .cell execution_count=6}\n``` {.python .cell-code}\n(\n ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name ┃ num_titles ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ string │ int64 │\n├──────────────────┼────────────┤\n│ Marc Mayer │ 5 │\n│ Alex Koenigsmark │ 5 │\n│ Sally Sun │ 5 │\n│ Carrie Schnelker │ 5 │\n│ Henry Townsend │ 5 │\n└──────────────────┴────────────┘\n\n```\n:::\n:::\n\n\nIt seems like the length of the `known_for_titles` might be capped at five!\n\n### Index\n\nWe can see the position of `\"actor\"` in `primary_profession`s:\n\n::: {#8f915d17 .cell execution_count=7}\n``` {.python .cell-code}\nents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │\n├────────────────────────────────────────────┤\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ … │\n└────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\nA return value of `-1` indicates that `\"actor\"` is not present in the value:\n\nLet's look for entities that are not primarily actors:\n\nWe can do this using the\n[`index`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.index)\nmethod by checking whether the position of the string `\"actor\"` is greater than\nzero:\n\n::: {#4335351c .cell execution_count=8}\n``` {.python .cell-code}\nactor_index = ents.primary_profession.index(\"actor\")\nnot_primarily_actors = actor_index > 0\nnot_primarily_actors.mean() # <1>\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=8}\n\n::: {.ansi-escaped-output}\n```{=html}\n
0.01947497010287973
\n```\n:::\n\n:::\n:::\n\n\n1. The average of a `bool` column gives the percentage of `True` values\n\nWho are they?\n\n::: {#4bf604e5 .cell execution_count=9}\n``` {.python .cell-code}\nents[not_primarily_actors]\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n```{=html}\n┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├────────────┼─────────────────────────┼─────────────────────┼─────────────────────────────────────┤\n│ nm14670573 │ Jamie Young │ ['legal', 'actor'] │ ['tt27216887'] │\n│ nm2320563 │ Miguel Eraso │ ['editor', 'actor'] │ ['tt1820556', 'tt0823256'] │\n│ nm6050288 │ Kyle Springford │ ['editor', 'actor'] │ ['tt3260540', 'tt4353988', ... +1] │\n│ nm8606771 │ Edward Wu │ ['editor', 'actor'] │ ['tt0259354', 'tt4219258'] │\n│ nm8159690 │ Arash Maleki │ ['editor', 'actor'] │ ['tt14888266', 'tt5783616', ... +1] │\n│ nm3700713 │ Wendell Holland │ ['editor', 'actor'] │ ['tt11546754', 'tt1554553', ... +1] │\n│ nm6531583 │ Tomás Díez-Kith Atienza │ ['editor', 'actor'] │ ['tt3171042', 'tt3749248'] │\n│ nm2456342 │ Ed Cheesman │ ['editor', 'actor'] │ ['tt13918214', 'tt9598592', ... +1] │\n│ nm0396397 │ Thomas Houg │ ['editor', 'actor'] │ ['tt0093176', 'tt13339954', ... +1] │\n│ nm2171019 │ Larry Pena │ ['editor', 'actor'] │ ['tt0831320', 'tt0800017', ... 
+2] │\n│ … │ … │ … │ … │\n└────────────┴─────────────────────────┴─────────────────────┴─────────────────────────────────────┘\n\n```\n:::\n:::\n\n\nIt's not 100% clear whether the order of elements in `primary_profession` matters here.\n\n### Containment\n\nWe can get people who are **not** actors using `contains`:\n\n::: {#510dc366 .cell execution_count=10}\n``` {.python .cell-code}\nnon_actors = ents[~ents.primary_profession.contains(\"actor\")]\nnon_actors\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├────────────┼───────────────────┼────────────────────┼──────────────────┤\n│ nm13331027 │ Robert Allen │ ['legal'] │ [] │\n│ nm11516366 │ Barney Given │ ['legal'] │ [] │\n│ nm7841847 │ Natalia Utrera │ ['legal'] │ [] │\n│ nm14658368 │ Amber Payne │ ['legal'] │ [] │\n│ nm15199944 │ Melanie Tomanov │ ['legal'] │ [] │\n│ nm11529563 │ David Lazarus │ ['legal'] │ [] │\n│ nm12224896 │ Andrew Winston │ ['legal'] │ [] │\n│ nm7591008 │ Miles Metcoff │ ['legal'] │ [] │\n│ nm11355058 │ Sameer Oberoi │ ['legal'] │ [] │\n│ nm15069831 │ Skyler R. Peacock │ ['legal'] │ [] │\n│ … │ … │ … │ … │\n└────────────┴───────────────────┴────────────────────┴──────────────────┘\n\n```\n:::\n:::\n\n\n### Element removal\n\nWe can remove elements from arrays too.\n\n::: {.callout-note}\n## `remove()` does not mutate the underlying data\n:::\n\nLet's see who only has \"actor\" in the list of their primary professions:\n\n::: {#261a5744 .cell execution_count=11}\n``` {.python .cell-code}\nents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n ]\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867 │ Danny Brown │ ['actor'] │ ['tt4532788'] │\n│ nm7199329 │ James Wyatt Fairbanks │ ['actor'] │ ['tt4494580'] │\n│ nm7201687 │ Tony Jelen │ ['actor'] │ ['tt2043887'] │\n│ nm7203397 │ Christian Petrucci │ ['actor'] │ ['tt4537722'] │\n│ nm7205107 │ Pablo Schollaert │ ['actor'] │ ['tt4539222'] │\n│ nm7207724 │ Shigeru Jerry Endo │ ['actor'] │ ['tt0043590'] │\n│ nm7209610 │ Paul Tugwell │ ['actor'] │ ['tt6185666', 'tt4544182'] │\n│ nm7213017 │ Pancho │ ['actor'] │ ['tt2333598'] │\n│ nm7228531 │ Phillip Shinn │ ['actor'] │ ['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342 │ José María Martínez │ ['actor'] │ ['tt2244891'] │\n│ … │ … │ … │ … │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n### Slicing with square-bracket syntax\n\nLet's remove everyone's first profession from the list, but only if they have\nmore than one profession listed:\n\n::: {#c1137788 .cell execution_count=12}\n``` {.python .cell-code}\nents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=12}\n```{=html}\n
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├────────────┼─────────────────┼────────────────────┼──────────────────────────────────────┤\n│ nm0520425 │ Doug Lord │ ['legal'] │ ['tt0086461'] │\n│ nm7198767 │ Martin Harry │ ['legal'] │ ['tt5554916'] │\n│ nm2232471 │ Lee Thomas │ ['legal'] │ ['tt0236124'] │\n│ nm5500775 │ Stewart Hayes │ ['legal'] │ ['tt2671192'] │\n│ nm2653478 │ Aaron Rosenberg │ ['actor'] │ ['tt4218260'] │\n│ nm0701436 │ Dominic Pye │ ['editor'] │ ['tt27329996', 'tt0195619'] │\n│ nm12705514 │ Okpata Henry │ ['editor'] │ ['tt28450328', 'tt15170142', ... +1] │\n│ nm8313644 │ Jeff Landers │ ['editor'] │ ['tt0488302'] │\n│ nm0438282 │ Joshua Kaplan │ ['editor'] │ ['tt0110687', 'tt0329600', ... +2] │\n│ nm2803821 │ Glen Ring │ ['editor'] │ ['tt1579300', 'tt1126489', ... 
+2] │\n│ … │ … │ … │ … │\n└────────────┴─────────────────┴────────────────────┴──────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## Set operations and sorting\n\nTreating arrays as sets is possible with the\n[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union)\nand\n[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect)\nAPIs.\n\n### Union\n\n### Intersection\n\nLet's see if we can use array intersection to figure which actors share\nknown-for titles and sort the result:\n\n::: {#b4e2a96d .cell execution_count=13}\n``` {.python .cell-code}\nleft = ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name ┃ together_with ┃\n┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ array<string> │\n├───────────────────┼───────────────────────────────────────────────────┤\n│ Sandra Murphy │ ['Andrew Brisk', 'Angus McLaren', ... +41] │\n│ Espera │ ['Avena Campbell', 'Brenda James', ... +11] │\n│ Mamta Jajoo │ ['Barbara Buls', 'Bill Smoler', ... +72] │\n│ Charles Ellis │ ['Cherri Moore', 'Dennis Montano', ... +11] │\n│ Chris Nicholus │ ['Catherine Harrell', 'George Pounders', ... +11] │\n│ Paul Dembling │ ['Barbara Buls', 'Bill Smoler', ... +72] │\n│ Dnyaneshwar Mulay │ ['Avena Campbell', 'Brenda James', ... +11] │\n│ Daisy Boria │ ['Barbara Buls', 'Bill Smoler', ... +72] │\n│ Brandon Staley │ ['Beacon Light', 'Ben Emanuel', ... +48] │\n│ Dwayne Carter Jr. │ ['Bill Collis', 'Charlie Jones', ... +21] │\n│ … │ … │\n└───────────────────┴───────────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## Advanced operations\n\n### `unnest`\n\nAs of version 7.0.0 Ibis does not support its native `unnest` API for BigQuery,\nbut we plan to add it in the future.\n\nFor now, you can use `con.sql` to construct an Ibis expression from a BigQuery\nSQL string that contains `UNNEST` calls:\n\nDespite lack of native `UNNEST` support, many use cases for `UNNEST` are met by\nthe\n[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)\nand\n[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)\noperations on array expressions.\n\n### Filtering array elements\n\nShow all people who are neither editors nor actors:\n\n::: {#331a9699 .cell execution_count=14}\n``` {.python .cell-code}\nents.mutate(\n primary_profession=_.primary_profession.filter(\n lambda pp: pp.isin((\"actor\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0)\n```\n\n::: {.cell-output .cell-output-display 
execution_count=14}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867 │ Danny Brown │ ['actor'] │ ['tt4532788'] │\n│ nm7199329 │ James Wyatt Fairbanks │ ['actor'] │ ['tt4494580'] │\n│ nm7201687 │ Tony Jelen │ ['actor'] │ ['tt2043887'] │\n│ nm7203397 │ Christian Petrucci │ ['actor'] │ ['tt4537722'] │\n│ nm7205107 │ Pablo Schollaert │ ['actor'] │ ['tt4539222'] │\n│ nm7207724 │ Shigeru Jerry Endo │ ['actor'] │ ['tt0043590'] │\n│ nm7209610 │ Paul Tugwell │ ['actor'] │ ['tt6185666', 'tt4544182'] │\n│ nm7213017 │ Pancho │ ['actor'] │ ['tt2333598'] │\n│ nm7228531 │ Phillip Shinn │ ['actor'] │ ['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342 │ José María Martínez │ ['actor'] │ ['tt2244891'] │\n│ … │ … │ … │ … │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n### Applying a function to array elements\n\nLet's normalize the case of primary_profession to upper case:\n\n::: {#1dd5c0b8 .cell execution_count=15}\n``` {.python .cell-code}\nents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0)\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼───────────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm7198867 │ Danny Brown │ ['ACTOR'] │ ['tt4532788'] │\n│ nm7199329 │ James Wyatt Fairbanks │ ['ACTOR'] │ ['tt4494580'] │\n│ nm7201687 │ Tony Jelen │ ['ACTOR'] │ ['tt2043887'] │\n│ nm7203397 │ Christian Petrucci │ ['ACTOR'] │ ['tt4537722'] │\n│ nm7205107 │ Pablo Schollaert │ ['ACTOR'] │ ['tt4539222'] │\n│ nm7207724 │ Shigeru Jerry Endo │ ['ACTOR'] │ ['tt0043590'] │\n│ nm7209610 │ Paul Tugwell │ ['ACTOR'] │ ['tt6185666', 'tt4544182'] │\n│ nm7213017 │ Pancho │ ['ACTOR'] │ ['tt2333598'] │\n│ nm7228531 │ Phillip Shinn │ ['ACTOR'] │ ['tt1442462', 'tt2741602', ... +2] │\n│ nm7236342 │ José María Martínez │ ['ACTOR'] │ ['tt2244891'] │\n│ … │ … │ … │ … │\n└───────────┴───────────────────────┴────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## Conclusion\n\nIbis has a sizable collection of array APIs that work with many different\nbackends and as of version 7.0.0, Ibis supports a much larger set of those APIs\nfor BigQuery!\n\nCheck out [the API\ndocumentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue)\nfor the full set of available methods.\n\nTry it out, and let us know what you think.\n\n", + "supporting": [ + "index_files" + ], + "filters": [], + "includes": { + "include-in-header": [ + "\n\n\n" + ] + } + } +} \ No newline at end of file diff --git a/docs/posts/bigquery-arrays/index.qmd b/docs/posts/bigquery-arrays/index.qmd new file mode 100644 index 000000000000..919989cc27aa --- /dev/null +++ b/docs/posts/bigquery-arrays/index.qmd @@ -0,0 +1,247 @@ +--- +title: Working with arrays in Google 
BigQuery +author: "Phillip Cloud" +date: "2023-09-12" +categories: + - blog + - bigquery + - arrays + - cloud +--- + +## Introduction + +Ibis and BigQuery have [worked well together for years](https://cloud.google.com/blog/products/data-analytics/ibis-and-bigquery-scalable-analytics-comfort-python). + +In Ibis 7.0.0, they work even better together with the addition of array +functionality for BigQuery. + +Let's look at some examples using BigQuery's [IMDB sample +data](https://developer.imdb.com/non-commercial-datasets/). + +## Basics + +First we'll connect to BigQuery and pluck out a table to work with. + +We'll start with `from ibis.interactive import *` for maximum convenience. + +```{python} +from ibis.interactive import * + +con = ibis.connect("bigquery://ibis-gbq") # <1> +con.set_database("bigquery-public-data.imdb") # <2> +``` + +1. Connect to the **billing** project. Compute (but not storage) is billed to + this project. +2. Set the database to the project and dataset that we will use for analysis. + +Let's look at the tables in this dataset: + +```{python} +con.list_tables() +``` + +Let's pull out the `name_basics` table, which contains names and metadata about +people listed on IMDB. 
We'll call this `ents` (short for `entities`), and remove some +columns we won't need: + +```{python} +ents = con.tables.name_basics.drop("birth_year", "death_year") +ents +``` + +### Splitting strings into arrays + +We can see that `known_for_titles` looks sort of like an array, so let's call +the +[`split`](../../reference/expression-strings.qmd#ibis.expr.types.strings.StringValue.split) +method on that column and replace the existing column: + +```{python} +ents = ents.mutate(known_for_titles=_.known_for_titles.split(",")) +ents +``` + +Similarly for `primary_profession`, since people involved in show business often +have more than one responsibility on a project: + +```{python} +ents = ents.mutate(primary_profession=_.primary_profession.split(",")) +``` + +### Array length + +Let's see how many titles each entity is known for, and then show the five +people with the largest number of titles they're known for. + +This is computed using the +[`length`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length) +API on array expressions: + +```{python} +( + ents.select("primary_name", num_titles=_.known_for_titles.length()) + .order_by(_.num_titles.desc()) + .limit(5) +) +``` + +It seems like the length of `known_for_titles` might be capped at five! + +### Index + +We can see the position of `"actor"` in each `primary_profession` array: + +```{python} +ents.primary_profession.index("actor") +``` + +A return value of `-1` indicates that `"actor"` is not present in the array. + +Let's look for entities that list `"actor"` as a profession, but not as their first one. + +We can do this using the +[`index`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.index) +method by checking whether the position of the string `"actor"` is greater than +zero: + +```{python} +actor_index = ents.primary_profession.index("actor") +not_primarily_actors = actor_index > 0 +not_primarily_actors.mean() # <1> +``` + +1. 
The average of a `bool` column gives the fraction of `True` values. + +Who are they? + +```{python} +ents[not_primarily_actors] +``` + +It's not clear whether the order of elements in `primary_profession` is meaningful in this dataset. + +### Containment + +We can get people who are **not** actors using `contains`: + +```{python} +non_actors = ents[~ents.primary_profession.contains("actor")] +non_actors +``` + +### Element removal + +We can remove elements from arrays too. + +::: {.callout-note} +## `remove()` does not mutate the underlying data +::: + +Let's see who has only `"actor"` in the list of their primary professions: + +```{python} +ents.filter( + [ + _.primary_profession.length() > 0, + _.primary_profession.remove("actor").length() == 0, + ] +) +``` + +### Slicing with square-bracket syntax + +Let's remove everyone's first profession from the list, but only if they have +more than one profession listed: + +```{python} +ents[_.primary_profession.length() > 1].mutate( + primary_profession=_.primary_profession[1:], +) +``` + +## Set operations and sorting + +Treating arrays as sets is possible with the +[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union) +and +[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect) +APIs. 
+ +### Union + +### Intersection + +Let's see if we can use array intersection to figure out which people share +known-for titles, and sort the result: + +```{python} +left = ents.filter(_.known_for_titles.length() > 0).limit(10_000) +right = left.view() +shared_titles = ( + left + .join(right, left.nconst != right.nconst) + .select( + s.startswith("known_for_titles"), + left_name="primary_name", + right_name="primary_name_right", + ) + .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0) + .group_by(name="left_name") + .agg(together_with=_.right_name.collect()) + .mutate(together_with=_.together_with.unique().sort()) +) +shared_titles +``` + +## Advanced operations + +### `unnest` + +As of version 7.0.0, Ibis does not support its native `unnest` API for BigQuery, +but we plan to add it in the future. + +For now, you can use `con.sql` to construct an Ibis expression from a BigQuery +SQL string that contains `UNNEST` calls. + +Despite the lack of native `UNNEST` support, many use cases for `UNNEST` are met by +the +[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter) +and +[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map) +operations on array expressions. + +### Filtering array elements + +Keep only the `actor` and `editor` entries in each `primary_profession` array, then show the people who still have at least one entry: + +```{python} +ents.mutate( + primary_profession=_.primary_profession.filter( + lambda pp: pp.isin(("actor", "editor")) + ) +).filter(_.primary_profession.length() > 0) +``` + +### Applying a function to array elements + +Let's normalize the case of `primary_profession` to upper case: + +```{python} +ents.mutate( + primary_profession=_.primary_profession.map(lambda pp: pp.upper()) +).filter(_.primary_profession.length() > 0) +``` + +## Conclusion + +Ibis has a sizable collection of array APIs that work with many different +backends, and as of version 7.0.0 Ibis supports a much larger set of those APIs +for BigQuery! 
+ +Check out [the API +documentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue) +for the full set of available methods. + +Try it out, and let us know what you think.
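As a recap, here is a minimal plain-Python sketch of what the array operations in this post compute for a single row. The helper names (`index`, `remove`, `union`, `intersect`) are hypothetical stand-ins written for illustration; they are not Ibis functions and do not touch BigQuery. The sketch assumes non-null arrays and sorts the set results only to keep them deterministic, an ordering a real backend may not follow.

```python
# Plain-Python model of the per-row array semantics described in the post.
# All helpers are illustrative assumptions, not part of the Ibis API.


def index(arr, value):
    """Return the 0-based position of the first match, or -1 when absent."""
    return arr.index(value) if value in arr else -1


def remove(arr, value):
    """Return a new list without any occurrence of value; arr is untouched."""
    return [item for item in arr if item != value]


def union(left, right):
    """Distinct elements found in either list (sorted for determinism)."""
    return sorted(set(left) | set(right))


def intersect(left, right):
    """Distinct elements found in both lists (sorted for determinism)."""
    return sorted(set(left) & set(right))


professions = ["editor", "actor"]

# index: a position greater than zero means "present, but not listed first".
actor_position = index(professions, "actor")
not_primarily_actor = actor_position > 0

# contains / remove / slicing
is_actor = "actor" in professions
only_actor = len(remove(["actor"], "actor")) == 0
without_first = professions[1:]

# filter keeps the elements that satisfy the predicate; map transforms each one.
kept = [p for p in ["legal", "actor"] if p in ("actor", "editor")]
upper = [p.upper() for p in professions]

# Set-style operations on two people's known-for titles.
shared = intersect(["tt1", "tt2"], ["tt2", "tt3"])
combined = union(["tt1", "tt2"], ["tt2", "tt3"])
```

Note how `remove` returns a new list rather than changing its input, which mirrors the callout above: the Ibis expression describes a result and leaves the underlying table alone.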