diff --git a/docs/_freeze/posts/backend-agnostic-arrays/index/execute-results/html.json b/docs/_freeze/posts/backend-agnostic-arrays/index/execute-results/html.json index 4323a6909884..5051270c7db9 100644 --- a/docs/_freeze/posts/backend-agnostic-arrays/index/execute-results/html.json +++ b/docs/_freeze/posts/backend-agnostic-arrays/index/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "34305e46cb1163d2232533f8bd56e2b6", + "hash": "8e61910a8bd05e2e50367d1206f17b3d", "result": { - "markdown": "---\ntitle: Backend agnostic arrays\nauthor: \"Phillip Cloud\"\ndate: last-modified\ncategories:\n - arrays\n - bigquery\n - blog\n - cloud\n - duckdb\n - portability\n---\n\n## Introduction\n\nThis is a redux of a [previous post](../bigquery-arrays/index.qmd) showing\nIbis's portability in action.\n\nIbis is portable across complex operations and backends of very different\nscales and deployment models!\n\n::: {.callout-note}\n\n## Results differ slightly between BigQuery and DuckDB\n\nThe datasets used in each backend are slightly different.\n\nI opted to avoid ETL for the BigQuery backend by reusing the Google-provided\nIMDB dataset.\n\nThe tradeoff is the slight discrepancy in results.\n:::\n\n## Basics\n\nWe'll start with `from ibis.interactive import *` for maximum convenience.\n\n::: {#9a52f567 .cell execution_count=1}\n``` {.python .cell-code}\nfrom ibis.interactive import * # <1>\n```\n:::\n\n\n1. 
`from ibis.interactive import *` imports Ibis APIs into the global namespace\n and enables [interactive mode](../../how-to/configure/basics.qmd#interactive-mode).\n\n### Connect to your database\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#266ed79d .cell execution_count=2}\n``` {.python .cell-code}\nddb = ibis.connect(\"duckdb://\")\nddb.create_table( # <1>\n \"name_basics\", ex.imdb_name_basics.fetch(backend=ddb).rename(\"snake_case\")\n) # <1>\nddb.create_table( # <2>\n \"title_basics\", ex.imdb_title_basics.fetch(backend=ddb).rename(\"snake_case\")\n) # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ tconst ┃ title_type ┃ primary_title ┃ original_title ┃ is_adult ┃ start_year ┃ end_year ┃ runtime_minutes ┃ genres ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ string │ int64 │ int64 │ string │ int64 │ string │\n├───────────┼────────────┼─────────────────────────────────────────────┼─────────────────────────────────────────────┼──────────┼────────────┼──────────┼─────────────────┼──────────────────────────┤\n│ tt0000001 │ short │ Carmencita │ Carmencita │ 0 │ 1894 │ NULL │ 1 │ Documentary,Short │\n│ tt0000002 │ short │ Le clown et ses chiens │ Le clown et ses chiens │ 0 │ 1892 │ NULL │ 5 │ Animation,Short │\n│ tt0000003 │ short │ Pauvre Pierrot │ Pauvre Pierrot │ 0 │ 1892 │ NULL │ 4 │ Animation,Comedy,Romance │\n│ tt0000004 │ short │ Un bon bock │ Un bon bock │ 0 │ 1892 │ NULL │ 12 │ Animation,Short │\n│ tt0000005 │ short │ Blacksmith Scene │ Blacksmith Scene │ 0 │ 1893 │ NULL │ 1 │ Comedy,Short │\n│ tt0000006 │ short │ Chinese Opium Den │ Chinese Opium Den │ 0 │ 1894 │ NULL │ 1 │ Short │\n│ tt0000007 │ short │ Corbett and Courtney Before the Kinetograph │ Corbett and Courtney Before the Kinetograph │ 0 │ 1894 │ NULL │ 1 │ Short,Sport │\n│ tt0000008 │ short │ Edison Kinetoscopic Record of a Sneeze │ Edison Kinetoscopic Record of a Sneeze │ 0 │ 1894 │ NULL │ 1 │ Documentary,Short │\n│ tt0000009 │ movie │ Miss Jerry │ Miss Jerry │ 0 │ 1894 │ NULL │ 45 │ Romance │\n│ tt0000010 │ short │ Leaving the Factory │ La sortie de l'usine Lumière à Lyon │ 0 │ 1895 │ NULL │ 1 │ Documentary,Short │\n│ … │ … │ … │ … │ … │ … │ … │ … │ … 
│\n└───────────┴────────────┴─────────────────────────────────────────────┴─────────────────────────────────────────────┴──────────┴────────────┴──────────┴─────────────────┴──────────────────────────┘\n\n```\n:::\n:::\n\n\n1. Create a table called `name_basics` in our DuckDB database using `ibis.examples` data\n2. Create a table called `title_basics` in our DuckDB database using `ibis.examples` data\n\n## BigQuery\n\n::: {#50929109 .cell execution_count=3}\n``` {.python .cell-code}\nbq = ibis.connect(\"bigquery://ibis-gbq\")\nbq.set_database(\"bigquery-public-data.imdb\") # <1>\n```\n:::\n\n\n1. Google provides a public BigQuery dataset for IMDB data.\n\n:::\n\nLet's pull out the `name_basics` table, which contains names and metadata about\npeople listed on IMDB. We'll call this `ents` (short for `entities`), and remove some\ncolumns we won't need:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#d8893bc4 .cell execution_count=4}\n``` {.python .cell-code}\nddb_ents = ddb.tables.name_basics.drop(\"birth_year\", \"death_year\")\nddb_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ string │\n├───────────┼─────────────────┼─────────────────────────────────────┼─────────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ soundtrack,actor,miscellaneous │ tt0053137,tt0072308,tt0045537,tt0050419 │\n│ nm0000002 │ Lauren Bacall │ actress,soundtrack │ tt0037382,tt0117057,tt0075213,tt0038355 │\n│ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ tt0057345,tt0054452,tt0049189,tt0056404 │\n│ nm0000004 │ John Belushi │ actor,soundtrack,writer │ tt0072562,tt0078723,tt0077975,tt0080455 │\n│ nm0000005 │ Ingmar Bergman │ writer,director,actor │ tt0083922,tt0069467,tt0050976,tt0050986 │\n│ nm0000006 │ Ingrid Bergman │ actress,soundtrack,producer │ tt0038109,tt0036855,tt0034583,tt0038787 │\n│ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer │ tt0037382,tt0034583,tt0042593,tt0043265 │\n│ nm0000008 │ Marlon Brando │ actor,soundtrack,director │ tt0068646,tt0070849,tt0078788,tt0047296 │\n│ nm0000009 │ Richard Burton │ actor,soundtrack,producer │ tt0057877,tt0059749,tt0061184,tt0087803 │\n│ nm0000010 │ James Cagney │ actor,soundtrack,director │ tt0042041,tt0035575,tt0029870,tt0031867 │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴─────────────────────────────────────┴─────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#5c8d59db .cell execution_count=5}\n``` {.python .cell-code}\nbq_ents = bq.tables.name_basics.drop(\"birth_year\", \"death_year\")\nbq_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=5}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ string │\n├───────────┼─────────────────┼─────────────────────────────────────┼─────────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ soundtrack,actor,miscellaneous │ tt0031983,tt0072308,tt0053137,tt0050419 │\n│ nm0000002 │ Lauren Bacall │ actress,soundtrack │ tt0075213,tt0038355,tt0037382,tt0117057 │\n│ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ tt0049189,tt0054452,tt0056404,tt0057345 │\n│ nm0000004 │ John Belushi │ actor,soundtrack,writer │ tt0072562,tt0077975,tt0078723,tt0080455 │\n│ nm0000005 │ Ingmar Bergman │ writer,director,actor │ tt0050986,tt0050976,tt0069467,tt0083922 │\n│ nm0000006 │ Ingrid Bergman │ actress,soundtrack,producer │ tt0038109,tt0034583,tt0036855,tt0038787 │\n│ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer │ tt0037382,tt0034583,tt0043265,tt0042593 │\n│ nm0000008 │ Marlon Brando │ actor,soundtrack,director │ tt0070849,tt0047296,tt0068646,tt0078788 │\n│ nm0000009 │ Richard Burton │ actor,soundtrack,producer │ tt0087803,tt0059749,tt0061184,tt0057877 │\n│ nm0000010 │ James Cagney │ actor,soundtrack,director │ tt0035575,tt0042041,tt0031867,tt0029870 │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴─────────────────────────────────────┴─────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n### Splitting strings into arrays\n\nWe can see that `known_for_titles` looks sort of like an array, so let's call\nthe\n[`split`](../../reference/expression-strings.qmd#ibis.expr.types.strings.StringValue.split)\nmethod on that column and replace the existing column:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#806c1396 .cell execution_count=6}\n``` {.python 
.cell-code}\nddb_ents = ddb_ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nddb_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ array<string> │\n├───────────┼─────────────────┼─────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ soundtrack,actor,miscellaneous │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ actress,soundtrack │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ actor,soundtrack,writer │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ writer,director,actor │ ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ actress,soundtrack,producer │ ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando │ actor,soundtrack,director │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton │ actor,soundtrack,producer │ ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney │ actor,soundtrack,director │ ['tt0042041', 'tt0035575', ... +2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴─────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#70089eda .cell execution_count=7}\n``` {.python .cell-code}\nbq_ents = bq_ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nbq_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ array<string> │\n├───────────┼─────────────────┼─────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ soundtrack,actor,miscellaneous │ ['tt0031983', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ actress,soundtrack │ ['tt0075213', 'tt0038355', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ actor,soundtrack,writer │ ['tt0072562', 'tt0077975', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ writer,director,actor │ ['tt0050986', 'tt0050976', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ actress,soundtrack,producer │ ['tt0038109', 'tt0034583', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando │ actor,soundtrack,director │ ['tt0070849', 'tt0047296', ... +2] │\n│ nm0000009 │ Richard Burton │ actor,soundtrack,producer │ ['tt0087803', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney │ actor,soundtrack,director │ ['tt0035575', 'tt0042041', ... 
+2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴─────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nSimilarly for `primary_profession`, since people involved in show business often\nhave more than one responsibility on a project:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#646e2c20 .cell execution_count=8}\n``` {.python .cell-code}\nddb_ents = ddb_ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n## BigQuery\n\n::: {#0fac0f4a .cell execution_count=9}\n``` {.python .cell-code}\nbq_ents = bq_ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n:::\n\n### Array length\n\nLet's see how many titles each entity is known for, and then show the five\npeople with the largest number of titles they're known for.\n\nThis is computed using the\n[`length`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length)\nAPI on array expressions:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#dde377da .cell execution_count=10}\n``` {.python .cell-code}\n(\n ddb_ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name ┃ num_titles ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ string │ int64 │\n├──────────────────┼────────────┤\n│ Alex Koenigsmark │ 5 │\n│ Carrie Schnelker │ 5 │\n│ Sally Sun │ 5 │\n│ Henry Townsend │ 5 │\n│ Matthew Kavuma │ 5 │\n└──────────────────┴────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#126820cf .cell execution_count=11}\n``` {.python .cell-code}\n(\n bq_ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name ┃ num_titles ┃\n┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ string │ int64 │\n├───────────────────┼────────────┤\n│ Paul Winter │ 6 │\n│ Chris Estrada │ 6 │\n│ Nicolas Bernier │ 6 │\n│ Tsuyotake Matsuda │ 5 │\n│ Jonathon Saunders │ 5 │\n└───────────────────┴────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nIt seems like the length of the `known_for_titles` might be capped at some small number!\n\n### Index\n\nWe can see the position of `\"actor\"` or `\"actress\"` in `primary_profession`s:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#5a84302f .cell execution_count=12}\n``` {.python .cell-code}\nddb_ents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=12}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │\n├────────────────────────────────────────────┤\n│ 1 │\n│ -1 │\n│ -1 │\n│ 0 │\n│ 2 │\n│ -1 │\n│ 0 │\n│ 0 │\n│ 0 │\n│ 0 │\n│ … │\n└────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n::: {#24b4909f .cell execution_count=13}\n``` {.python .cell-code}\nddb_ents.primary_profession.index(\"actress\")\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actress') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │\n├──────────────────────────────────────────────┤\n│ -1 │\n│ 0 │\n│ 0 │\n│ -1 │\n│ -1 │\n│ 0 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ … │\n└──────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#2955a839 .cell execution_count=14}\n``` {.python .cell-code}\nbq_ents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=14}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │\n├────────────────────────────────────────────┤\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ … │\n└────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n::: {#5cbb952f .cell execution_count=15}\n``` {.python .cell-code}\nbq_ents.primary_profession.index(\"actress\")\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actress') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │\n├──────────────────────────────────────────────┤\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ … │\n└──────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nA return value of `-1` indicates that the element (`\"actor\"` or `\"actress\"`) is not present in the array.\n\nLet's look for entities that are not primarily actors.\n\nWe can do this using the\n[`index`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.index)\nmethod by checking whether the positions of the strings `\"actor\"` and\n`\"actress\"` are both greater than 0:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#28eb1e61 .cell execution_count=16}\n``` {.python .cell-code}\nactor_index = ddb_ents.primary_profession.index(\"actor\")\nactress_index = ddb_ents.primary_profession.index(\"actress\")\n\nddb_not_primarily_acting = (actor_index > 0) & (actress_index > 0)\nddb_not_primarily_acting.mean()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=16}\n\n::: {.ansi-escaped-output}\n```{=html}\n
0.0
\n```\n:::\n\n:::\n:::\n\n\n## BigQuery\n\n::: {#a7b0a283 .cell execution_count=17}\n``` {.python .cell-code}\nactor_index = bq_ents.primary_profession.index(\"actor\")\nactress_index = bq_ents.primary_profession.index(\"actress\")\n\nbq_not_primarily_acting = (actor_index > 0) & (actress_index > 0)\nbq_not_primarily_acting.mean()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=17}\n\n::: {.ansi-escaped-output}\n```{=html}\n0.0
\n```\n:::\n\n:::\n:::\n\n\n:::\n\nWho are they?\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#d68e44dd .cell execution_count=18}\n``` {.python .cell-code}\nddb_ents[ddb_not_primarily_acting].order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=18}\n```{=html}\n┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#db8558c1 .cell execution_count=19}\n``` {.python .cell-code}\nbq_ents[bq_not_primarily_acting].order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=19}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nIt's not 100% clear whether the order of elements in `primary_profession` matters here.\n\n### Containment\n\nWe can get people who are not listed as actors or actresses by negating `contains`:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#f9e1558c .cell execution_count=20}\n``` {.python .cell-code}\nddb_non_actors = bq_ents[\n ~_.primary_profession.contains(\"actor\") & ~_.primary_profession.contains(\"actress\")\n]\nddb_non_actors.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=20}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼──────────────────┼────────────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000016 │ Georges Delerue │ ['composer', 'soundtrack', ... +1] │ ['tt0091763', 'tt0096320', ... +2] │\n│ nm0000025 │ Jerry Goldsmith │ ['music_department', 'soundtrack', ... +1] │ ['tt0119488', 'tt0117731', ... +2] │\n│ nm0000033 │ Alfred Hitchcock │ ['director', 'producer', ... +1] │ ['tt0053125', 'tt0052357', ... +2] │\n│ nm0000035 │ James Horner │ ['music_department', 'soundtrack', ... +1] │ ['tt0120338', 'tt0499549', ... +2] │\n│ nm0000040 │ Stanley Kubrick │ ['director', 'writer', ... +1] │ ['tt0062622', 'tt0120663', ... +2] │\n│ nm0000041 │ Akira Kurosawa │ ['writer', 'director', ... +1] │ ['tt0051808', 'tt0089881', ... +2] │\n│ nm0000049 │ Henry Mancini │ ['music_department', 'soundtrack', ... +1] │ ['tt0057413', 'tt0054698', ... +2] │\n│ nm0000055 │ Alfred Newman │ ['music_department', 'composer', ... +1] │ ['tt0065377', 'tt0049408', ... +2] │\n│ nm0000065 │ Nino Rota │ ['composer', 'soundtrack', ... +1] │ ['tt0063518', 'tt0068646', ... +2] │\n│ nm0000067 │ Miklós Rózsa │ ['music_department', 'composer', ... +1] │ ['tt0038109', 'tt0054847', ... 
+2] │\n│ … │ … │ … │ … │\n└───────────┴──────────────────┴────────────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#b51e75a5 .cell execution_count=21}\n``` {.python .cell-code}\nbq_non_actors = bq_ents[\n ~_.primary_profession.contains(\"actor\") & ~_.primary_profession.contains(\"actress\")\n]\nbq_non_actors.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=21}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼──────────────────┼────────────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000016 │ Georges Delerue │ ['composer', 'soundtrack', ... +1] │ ['tt0091763', 'tt0096320', ... +2] │\n│ nm0000025 │ Jerry Goldsmith │ ['music_department', 'soundtrack', ... +1] │ ['tt0119488', 'tt0117731', ... +2] │\n│ nm0000033 │ Alfred Hitchcock │ ['director', 'producer', ... +1] │ ['tt0053125', 'tt0052357', ... +2] │\n│ nm0000035 │ James Horner │ ['music_department', 'soundtrack', ... +1] │ ['tt0120338', 'tt0499549', ... +2] │\n│ nm0000040 │ Stanley Kubrick │ ['director', 'writer', ... +1] │ ['tt0062622', 'tt0120663', ... +2] │\n│ nm0000041 │ Akira Kurosawa │ ['writer', 'director', ... +1] │ ['tt0051808', 'tt0089881', ... +2] │\n│ nm0000049 │ Henry Mancini │ ['music_department', 'soundtrack', ... +1] │ ['tt0057413', 'tt0054698', ... +2] │\n│ nm0000055 │ Alfred Newman │ ['music_department', 'composer', ... +1] │ ['tt0065377', 'tt0049408', ... +2] │\n│ nm0000065 │ Nino Rota │ ['composer', 'soundtrack', ... +1] │ ['tt0063518', 'tt0068646', ... +2] │\n│ nm0000067 │ Miklós Rózsa │ ['music_department', 'composer', ... +1] │ ['tt0038109', 'tt0054847', ... 
+2] │\n│ … │ … │ … │ … │\n└───────────┴──────────────────┴────────────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n### Element removal\n\nWe can remove elements from arrays too.\n\n::: {.callout-note}\n## [`remove()`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.remove) does not mutate the underlying data\n:::\n\nLet's see who only has \"actor\" in the list of their primary professions:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#e930038b .cell execution_count=22}\n``` {.python .cell-code}\nddb_ents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n _.primary_profession.remove(\"actress\").length() == 0,\n ]\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=22}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#4c7a3db0 .cell execution_count=23}\n``` {.python .cell-code}\nbq_ents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n _.primary_profession.remove(\"actress\").length() == 0,\n ]\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=23}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n### Slicing with square-bracket syntax\n\nLet's remove everyone's first profession from the list, but only if they have\nmore than one profession listed:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#727c9fc4 .cell execution_count=24}\n``` {.python .cell-code}\nddb_ents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=24}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ ['actor', 'miscellaneous'] │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ ['soundtrack'] │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ ['soundtrack', 'writer'] │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ ['director', 'actor'] │ ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ ['soundtrack', 'producer'] │ ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer'] │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando │ ['soundtrack', 'director'] │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton │ ['soundtrack', 'producer'] │ ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney │ ['soundtrack', 'director'] │ ['tt0042041', 'tt0035575', ... +2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#86e58551 .cell execution_count=25}\n``` {.python .cell-code}\nbq_ents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=25}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ ['actor', 'miscellaneous'] │ ['tt0031983', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ ['soundtrack'] │ ['tt0075213', 'tt0038355', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ ['soundtrack', 'writer'] │ ['tt0072562', 'tt0077975', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ ['director', 'actor'] │ ['tt0050986', 'tt0050976', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ ['soundtrack', 'producer'] │ ['tt0038109', 'tt0034583', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer'] │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando │ ['soundtrack', 'director'] │ ['tt0070849', 'tt0047296', ... +2] │\n│ nm0000009 │ Richard Burton │ ['soundtrack', 'producer'] │ ['tt0087803', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney │ ['soundtrack', 'director'] │ ['tt0035575', 'tt0042041', ... 
+2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n## Set operations and sorting\n\nTreating arrays as sets is possible with the\n[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union)\nand\n[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect)\nAPIs.\n\nLet's take a look at `intersect`.\n\n### Intersection\n\nLet's see if we can use array intersection to figure out which actors share\nknown-for titles and sort the result:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#1f777844 .cell execution_count=26}\n``` {.python .cell-code}\nleft = ddb_ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=26}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name ┃ together_with ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ array<string> │\n├─────────────────────┼─────────────────────────────────────────────────┤\n│ Richard Chamberlain │ ['Fred Astaire', 'Fred J. Koenekamp', ... +13] │\n│ John Wayne │ ['Cyril J. Mockridge', 'Glen Campbell', ... +8] │\n│ Fritz Lang │ ['Alfred Abel', 'Brigitte Bardot', ... +13] │\n│ John Candy │ ['Adam Bernardi', 'Amy Madigan', ... +21] │\n│ Peter Lorre │ ['Byron Haskin', 'Claude Rains', ... +16] │\n│ Miklós Rózsa │ ['Barbara Stanwyck', 'Charlton Heston', ... +8] │\n│ George Segal │ ['Alex North', 'Amanda Peet', ... +18] │\n│ Lon Chaney Jr. │ ['Bud Abbott', 'Charles Previn', ... +5] │\n│ Vivien Leigh │ ['Alex North', 'Clark Gable', ... +17] │\n│ Jim Backus │ ['Alan Hale Jr.', 'Bob Denver', ... +16] │\n│ … │ … │\n└─────────────────────┴─────────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#91f39681 .cell execution_count=27}\n``` {.python .cell-code}\nleft = bq_ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display execution_count=27}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name ┃ together_with ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ array<string> │\n├──────────────────┼──────────────────────────────────────────────────────────┤\n│ Antonieta Careri │ ['Ayame Loren Tribune', 'Carey Giesner-Garcia', ... +13] │\n│ Mike Sutcliffe │ ['Christine Slattery', 'Linda Collister', ... +2] │\n│ Yolanda Paul │ ['Antonieta Careri', 'Ayame Loren Tribune', ... +13] │\n│ Anthony Micari │ ['Adam Cole', 'Andrew Del Vecchio', ... +6] │\n│ Enoch Showunmi │ ['Andy Leese', 'Ben Adelsbury', ... +18] │\n│ Ana Akauola │ ['A.B. Olevic', 'Candy Hurtado', ... +12] │\n│ Awad Al Yami │ ['Abdo Bardawill', 'Alex Saratsis', ... +66] │\n│ Charly Freitag │ ['Christoph Homberger', 'Jana Leu', ... +13] │\n│ Rachel Mader │ ['Charly Freitag', 'Christoph Homberger', ... +13] │\n│ Robert Bartley │ ['Adam Cole', 'Andrew Del Vecchio', ... +6] │\n│ … │ … │\n└──────────────────┴──────────────────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n## Advanced operations\n\n### Flatten arrays into rows\n\nThanks to the [tireless\nefforts](https://github.com/tobymao/sqlglot/commit/06e0869e7aa5714d77e6ec763da38d6a422965fa)\nof the [folks](https://github.com/tobymao/sqlglot/graphs/contributors) working\non [`sqlglot`](https://github.com/tobymao/sqlglot), as of version 7.0.0 Ibis\nsupports\n[`unnest`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.unnest)\nfor BigQuery!\n\nYou can use it standalone on a column expression:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#22f716af .cell execution_count=28}\n``` {.python .cell-code}\nddb_ents.primary_profession.unnest()\n```\n\n::: {.cell-output .cell-output-display execution_count=28}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_profession ┃\n┡━━━━━━━━━━━━━━━━━━━━┩\n│ string │\n├────────────────────┤\n│ soundtrack │\n│ actor │\n│ miscellaneous │\n│ actress │\n│ soundtrack │\n│ actress │\n│ soundtrack │\n│ music_department │\n│ actor │\n│ soundtrack │\n│ … │\n└────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#46944f60 .cell execution_count=29}\n``` {.python .cell-code}\nbq_ents.primary_profession.unnest()\n```\n\n::: {.cell-output .cell-output-display execution_count=29}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_profession ┃\n┡━━━━━━━━━━━━━━━━━━━━┩\n│ string │\n├────────────────────┤\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ … │\n└────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nYou can also use it in `select`/`mutate` calls to expand the table accordingly:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#cb45455b .cell execution_count=30}\n``` {.python .cell-code}\nddb_ents.mutate(primary_profession=_.primary_profession.unnest()).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=30}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ array<string> │\n├───────────┼─────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ soundtrack │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000001 │ Fred Astaire │ actor │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000001 │ Fred Astaire │ miscellaneous │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ actress │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000002 │ Lauren Bacall │ soundtrack │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ actress │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ soundtrack │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ music_department │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ actor │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000004 │ John Belushi │ soundtrack │ ['tt0072562', 'tt0078723', ... +2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#74595b89 .cell execution_count=31}\n``` {.python .cell-code}\nbq_ents.mutate(primary_profession=_.primary_profession.unnest()).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=31}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ array<string> │\n├───────────┼─────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ soundtrack │ ['tt0031983', 'tt0072308', ... +2] │\n│ nm0000001 │ Fred Astaire │ actor │ ['tt0031983', 'tt0072308', ... +2] │\n│ nm0000001 │ Fred Astaire │ miscellaneous │ ['tt0031983', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ actress │ ['tt0075213', 'tt0038355', ... +2] │\n│ nm0000002 │ Lauren Bacall │ soundtrack │ ['tt0075213', 'tt0038355', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ soundtrack │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ actress │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ music_department │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ writer │ ['tt0072562', 'tt0077975', ... +2] │\n│ nm0000004 │ John Belushi │ actor │ ['tt0072562', 'tt0077975', ... 
+2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nUnnesting can be useful when joining nested data.\n\nHere we use unnest to find people known for any of the godfather movies:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#8c91eda2 .cell execution_count=32}\n``` {.python .cell-code}\nbasics = ddb.tables.title_basics.filter( # <1>\n [\n _.title_type == \"movie\",\n _.original_title.lower().startswith(\"the godfather\"),\n _.genres.lower().contains(\"crime\"),\n ]\n) # <1>\n\nddb_known_for_the_godfather = (\n ddb_ents.mutate(tconst=_.known_for_titles.unnest()) # <2>\n .join(basics, \"tconst\") # <3>\n .select(\"primary_title\", \"primary_name\") # <4>\n .distinct()\n .order_by([\"primary_title\", \"primary_name\"]) # <4>\n)\nddb_known_for_the_godfather\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=32}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title ┃ primary_name ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │\n├───────────────┼─────────────────────┤\n│ The Godfather │ A. Emmett Adams │\n│ The Godfather │ Abe Vigoda │\n│ The Godfather │ Al Lettieri │\n│ The Godfather │ Al Martino │\n│ The Godfather │ Al Pacino │\n│ The Godfather │ Albert S. Ruddy │\n│ The Godfather │ Alex Rocco │\n│ The Godfather │ Andrea Eastman │\n│ The Godfather │ Angelo Infanti │\n│ The Godfather │ Anna Hill Johnstone │\n│ … │ … │\n└───────────────┴─────────────────────┘\n\n```\n:::\n:::\n\n\n1. Filter the `title_basics` data set to only the Godfather movies\n2. Unnest the `known_for_titles` array column\n3. Join with `basics` to get movie titles\n4. Ensure that each entity is only listed once and sort the results\n\n## BigQuery\n\n::: {#6b92a489 .cell execution_count=33}\n``` {.python .cell-code}\nbasics = bq.tables.title_basics.filter( # <1>\n [\n _.title_type == \"movie\",\n _.original_title.lower().startswith(\"the godfather\"),\n _.genres.lower().contains(\"crime\"),\n ]\n) # <1>\n\nbq_known_for_the_godfather = (\n bq_ents.mutate(tconst=_.known_for_titles.unnest()) # <2>\n .join(basics, \"tconst\") # <3>\n .select(\"primary_title\", \"primary_name\") # <4>\n .distinct()\n .order_by([\"primary_title\", \"primary_name\"]) # <4>\n)\nbq_known_for_the_godfather\n```\n\n::: {.cell-output .cell-output-display execution_count=33}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title ┃ primary_name ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │\n├───────────────┼─────────────────────┤\n│ The Godfather │ A. Emmett Adams │\n│ The Godfather │ Abe Vigoda │\n│ The Godfather │ Al Lettieri │\n│ The Godfather │ Al Martino │\n│ The Godfather │ Al Pacino │\n│ The Godfather │ Albert S. Ruddy │\n│ The Godfather │ Alex Rocco │\n│ The Godfather │ Andrea Eastman │\n│ The Godfather │ Angelo Infanti │\n│ The Godfather │ Anna Hill Johnstone │\n│ … │ … │\n└───────────────┴─────────────────────┘\n\n```\n:::\n:::\n\n\n1. Filter the `title_basics` data set to only the Godfather movies\n2. Unnest the `known_for_titles` array column\n3. Join with `basics` to get movie titles\n4. Ensure that each entity is only listed once and sort the results\n\n:::\n\nLet's summarize by showing how many people are known for each Godfather movie:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#c6caecb6 .cell execution_count=34}\n``` {.python .cell-code}\nddb_known_for_the_godfather.primary_title.value_counts()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=34}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title ┃ primary_title_count ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ int64 │\n├────────────────────────┼─────────────────────┤\n│ The Godfather Part III │ 196 │\n│ The Godfather │ 93 │\n│ The Godfather Part II │ 117 │\n└────────────────────────┴─────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#72d2859f .cell execution_count=35}\n``` {.python .cell-code}\nbq_known_for_the_godfather.primary_title.value_counts()\n```\n\n::: {.cell-output .cell-output-display execution_count=35}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title ┃ primary_title_count ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ int64 │\n├────────────────────────┼─────────────────────┤\n│ The Godfather Part III │ 194 │\n│ The Godfather Part II │ 114 │\n│ The Godfather │ 97 │\n└────────────────────────┴─────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n### Filtering array elements\n\nFiltering array elements can be done with the\n[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)\nmethod, which applies a predicate to each array element and returns an array of\nelements for which the predicate returns `True`.\n\nThis method is similar to Python's\n[`filter`](https://docs.python.org/3.7/library/functions.html#filter) function.\n\nLet's show all people who are neither editors nor actors:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#b5e4701c .cell execution_count=36}\n``` {.python .cell-code}\nddb_ents.mutate(\n primary_profession=_.primary_profession.filter( # <1>\n lambda pp: ~pp.isin((\"actor\", \"actress\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\") # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=36}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ ['soundtrack', 'miscellaneous'] │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ ['soundtrack'] │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ ['soundtrack', 'writer'] │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ ['writer', 'director'] │ ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ ['soundtrack', 'producer'] │ ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer'] │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando │ ['soundtrack', 'director'] │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton │ ['soundtrack', 'producer'] │ ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney │ ['soundtrack', 'director'] │ ['tt0042041', 'tt0035575', ... +2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n1. This `filter` call is applied to each array element\n2. 
This `filter` call is applied to the table\n\n## BigQuery\n\n::: {#089fba87 .cell execution_count=37}\n``` {.python .cell-code}\nbq_ents.mutate(\n primary_profession=_.primary_profession.filter( # <1>\n lambda pp: ~pp.isin((\"actor\", \"actress\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\") # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=37}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ ['soundtrack', 'miscellaneous'] │ ['tt0031983', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ ['soundtrack'] │ ['tt0075213', 'tt0038355', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ ['soundtrack', 'writer'] │ ['tt0072562', 'tt0077975', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ ['writer', 'director'] │ ['tt0050986', 'tt0050976', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ ['soundtrack', 'producer'] │ ['tt0038109', 'tt0034583', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer'] │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando │ ['soundtrack', 'director'] │ ['tt0070849', 'tt0047296', ... +2] │\n│ nm0000009 │ Richard Burton │ ['soundtrack', 'producer'] │ ['tt0087803', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney │ ['soundtrack', 'director'] │ ['tt0035575', 'tt0042041', ... +2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n1. This `filter` call is applied to each array element\n2. 
This `filter` call is applied to the table\n\n:::\n\n### Applying a function to array elements\n\nYou can apply a function to run an ibis expression on each element of an array\nusing the\n[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)\nmethod.\n\nLet's normalize the case of primary_profession to upper case:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#bd7ce7ac .cell execution_count=38}\n``` {.python .cell-code}\nddb_ents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=38}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼─────────────────┼───────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ ['SOUNDTRACK', 'ACTOR', ... +1] │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ ['ACTRESS', 'SOUNDTRACK'] │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ ['WRITER', 'DIRECTOR', ... +1] │ ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0042041', 'tt0035575', ... +2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴───────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#17587fba .cell execution_count=39}\n``` {.python .cell-code}\nbq_ents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=39}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼─────────────────┼───────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ ['SOUNDTRACK', 'ACTOR', ... +1] │ ['tt0031983', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ ['ACTRESS', 'SOUNDTRACK'] │ ['tt0075213', 'tt0038355', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0072562', 'tt0077975', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ ['WRITER', 'DIRECTOR', ... +1] │ ['tt0050986', 'tt0050976', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0038109', 'tt0034583', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0070849', 'tt0047296', ... +2] │\n│ nm0000009 │ Richard Burton │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0087803', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0035575', 'tt0042041', ... 
+2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴───────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n## Conclusion\n\nIbis has a sizable collection of array APIs that work with many different\nbackends and as of version 7.0.0, Ibis supports a much larger set of those APIs\nfor BigQuery!\n\nCheck out [the API\ndocumentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue)\nfor the full set of available methods.\n\nTry it out, and let us know what you think.\n\n", + "markdown": "---\ntitle: Backend agnostic arrays\nauthor: \"Phillip Cloud\"\ndate: 2024-01-19\ncategories:\n - arrays\n - bigquery\n - blog\n - cloud\n - duckdb\n - portability\n---\n\n## Introduction\n\nThis is a redux of a [previous post](../bigquery-arrays/index.qmd) showing\nIbis's portability in action.\n\nIbis is portable across complex operations and backends of very different\nscales and deployment models!\n\n::: {.callout-note}\n\n## Results differ slightly between BigQuery and DuckDB\n\nThe datasets used in each backend are slightly different.\n\nI opted to avoid ETL for the BigQuery backend by reusing the Google-provided\nIMDB dataset.\n\nThe tradeoff is the slight discrepancy in results.\n:::\n\n## Basics\n\nWe'll start with `from ibis.interactive import *` for maximum convenience.\n\n::: {#0b066bd8 .cell execution_count=1}\n``` {.python .cell-code}\nfrom ibis.interactive import * # <1>\n```\n:::\n\n\n1. 
`from ibis.interactive import *` imports Ibis APIs into the global namespace\n and enables [interactive mode](../../how-to/configure/basics.qmd#interactive-mode).\n\n### Connect to your database\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#7caf0298 .cell execution_count=2}\n``` {.python .cell-code}\nddb = ibis.connect(\"duckdb://\")\nddb.create_table( # <1>\n \"name_basics\", ex.imdb_name_basics.fetch(backend=ddb).rename(\"snake_case\")\n) # <1>\nddb.create_table( # <2>\n \"title_basics\", ex.imdb_title_basics.fetch(backend=ddb).rename(\"snake_case\")\n) # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ tconst ┃ title_type ┃ primary_title ┃ original_title ┃ is_adult ┃ start_year ┃ end_year ┃ runtime_minutes ┃ genres ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ string │ int64 │ int64 │ string │ int64 │ string │\n├───────────┼────────────┼─────────────────────────────────────────────┼─────────────────────────────────────────────┼──────────┼────────────┼──────────┼─────────────────┼──────────────────────────┤\n│ tt0000001 │ short │ Carmencita │ Carmencita │ 0 │ 1894 │ NULL │ 1 │ Documentary,Short │\n│ tt0000002 │ short │ Le clown et ses chiens │ Le clown et ses chiens │ 0 │ 1892 │ NULL │ 5 │ Animation,Short │\n│ tt0000003 │ short │ Pauvre Pierrot │ Pauvre Pierrot │ 0 │ 1892 │ NULL │ 4 │ Animation,Comedy,Romance │\n│ tt0000004 │ short │ Un bon bock │ Un bon bock │ 0 │ 1892 │ NULL │ 12 │ Animation,Short │\n│ tt0000005 │ short │ Blacksmith Scene │ Blacksmith Scene │ 0 │ 1893 │ NULL │ 1 │ Comedy,Short │\n│ tt0000006 │ short │ Chinese Opium Den │ Chinese Opium Den │ 0 │ 1894 │ NULL │ 1 │ Short │\n│ tt0000007 │ short │ Corbett and Courtney Before the Kinetograph │ Corbett and Courtney Before the Kinetograph │ 0 │ 1894 │ NULL │ 1 │ Short,Sport │\n│ tt0000008 │ short │ Edison Kinetoscopic Record of a Sneeze │ Edison Kinetoscopic Record of a Sneeze │ 0 │ 1894 │ NULL │ 1 │ Documentary,Short │\n│ tt0000009 │ movie │ Miss Jerry │ Miss Jerry │ 0 │ 1894 │ NULL │ 45 │ Romance │\n│ tt0000010 │ short │ Leaving the Factory │ La sortie de l'usine Lumière à Lyon │ 0 │ 1895 │ NULL │ 1 │ Documentary,Short │\n│ … │ … │ … │ … │ … │ … │ … │ … │ … 
│\n└───────────┴────────────┴─────────────────────────────────────────────┴─────────────────────────────────────────────┴──────────┴────────────┴──────────┴─────────────────┴──────────────────────────┘\n\n```\n:::\n:::\n\n\n1. Create a table called `name_basics` in our DuckDB database using `ibis.examples` data\n2. Create a table called `title_basics` in our DuckDB database using `ibis.examples` data\n\n## BigQuery\n\n::: {#399dd705 .cell execution_count=3}\n``` {.python .cell-code}\nbq = ibis.connect(\"bigquery://ibis-gbq\")\nbq.set_database(\"bigquery-public-data.imdb\") # <1>\n```\n:::\n\n\n1. Google provides a public BigQuery dataset for IMDB data.\n\n:::\n\nLet's pull out the `name_basics` table, which contains names and metadata about\npeople listed on IMDB. We'll call this `ents` (short for `entities`), and remove some\ncolumns we won't need:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#108c077d .cell execution_count=4}\n``` {.python .cell-code}\nddb_ents = ddb.tables.name_basics.drop(\"birth_year\", \"death_year\")\nddb_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ string │\n├───────────┼─────────────────┼─────────────────────────────────────┼─────────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ soundtrack,actor,miscellaneous │ tt0053137,tt0072308,tt0045537,tt0050419 │\n│ nm0000002 │ Lauren Bacall │ actress,soundtrack │ tt0037382,tt0117057,tt0075213,tt0038355 │\n│ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ tt0057345,tt0054452,tt0049189,tt0056404 │\n│ nm0000004 │ John Belushi │ actor,soundtrack,writer │ tt0072562,tt0078723,tt0077975,tt0080455 │\n│ nm0000005 │ Ingmar Bergman │ writer,director,actor │ tt0083922,tt0069467,tt0050976,tt0050986 │\n│ nm0000006 │ Ingrid Bergman │ actress,soundtrack,producer │ tt0038109,tt0036855,tt0034583,tt0038787 │\n│ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer │ tt0037382,tt0034583,tt0042593,tt0043265 │\n│ nm0000008 │ Marlon Brando │ actor,soundtrack,director │ tt0068646,tt0070849,tt0078788,tt0047296 │\n│ nm0000009 │ Richard Burton │ actor,soundtrack,producer │ tt0057877,tt0059749,tt0061184,tt0087803 │\n│ nm0000010 │ James Cagney │ actor,soundtrack,director │ tt0042041,tt0035575,tt0029870,tt0031867 │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴─────────────────────────────────────┴─────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#9a444397 .cell execution_count=5}\n``` {.python .cell-code}\nbq_ents = bq.tables.name_basics.drop(\"birth_year\", \"death_year\")\nbq_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=5}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ string │\n├───────────┼─────────────────┼─────────────────────────────────────┼─────────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ soundtrack,actor,miscellaneous │ tt0072308,tt0053137,tt0031983,tt0050419 │\n│ nm0000002 │ Lauren Bacall │ actress,soundtrack │ tt0038355,tt0075213,tt0117057,tt0037382 │\n│ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ tt0049189,tt0054452,tt0056404,tt0057345 │\n│ nm0000004 │ John Belushi │ actor,soundtrack,writer │ tt0072562,tt0078723,tt0080455,tt0077975 │\n│ nm0000005 │ Ingmar Bergman │ writer,director,actor │ tt0050976,tt0083922,tt0069467,tt0050986 │\n│ nm0000006 │ Ingrid Bergman │ actress,soundtrack,producer │ tt0034583,tt0038787,tt0038109,tt0036855 │\n│ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer │ tt0037382,tt0043265,tt0034583,tt0042593 │\n│ nm0000008 │ Marlon Brando │ actor,soundtrack,director │ tt0068646,tt0070849,tt0047296,tt0078788 │\n│ nm0000009 │ Richard Burton │ actor,soundtrack,producer │ tt0061184,tt0087803,tt0057877,tt0059749 │\n│ nm0000010 │ James Cagney │ actor,soundtrack,director │ tt0042041,tt0029870,tt0031867,tt0035575 │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴─────────────────────────────────────┴─────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n### Splitting strings into arrays\n\nWe can see that `known_for_titles` looks sort of like an array, so let's call\nthe\n[`split`](../../reference/expression-strings.qmd#ibis.expr.types.strings.StringValue.split)\nmethod on that column and replace the existing column:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#337dbe6a .cell execution_count=6}\n``` {.python 
.cell-code}\nddb_ents = ddb_ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nddb_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ array<string> │\n├───────────┼─────────────────┼─────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ soundtrack,actor,miscellaneous │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ actress,soundtrack │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ actor,soundtrack,writer │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ writer,director,actor │ ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ actress,soundtrack,producer │ ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando │ actor,soundtrack,director │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton │ actor,soundtrack,producer │ ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney │ actor,soundtrack,director │ ['tt0042041', 'tt0035575', ... +2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴─────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#bb1ea1fe .cell execution_count=7}\n``` {.python .cell-code}\nbq_ents = bq_ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nbq_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ array<string> │\n├───────────┼─────────────────┼─────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ soundtrack,actor,miscellaneous │ ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000002 │ Lauren Bacall │ actress,soundtrack │ ['tt0038355', 'tt0075213', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ actor,soundtrack,writer │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ writer,director,actor │ ['tt0050976', 'tt0083922', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ actress,soundtrack,producer │ ['tt0034583', 'tt0038787', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer │ ['tt0037382', 'tt0043265', ... +2] │\n│ nm0000008 │ Marlon Brando │ actor,soundtrack,director │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton │ actor,soundtrack,producer │ ['tt0061184', 'tt0087803', ... +2] │\n│ nm0000010 │ James Cagney │ actor,soundtrack,director │ ['tt0042041', 'tt0029870', ... 
+2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴─────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nSimilarly for `primary_profession`, since people involved in show business often\nhave more than one responsibility on a project:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#bf33447a .cell execution_count=8}\n``` {.python .cell-code}\nddb_ents = ddb_ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n## BigQuery\n\n::: {#eea20c44 .cell execution_count=9}\n``` {.python .cell-code}\nbq_ents = bq_ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n:::\n\n### Array length\n\nLet's see how many titles each entity is known for, and then show the five\npeople with the largest number of titles they're known for.\n\nThis is computed using the\n[`length`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length)\nAPI on array expressions:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#5f0b413d .cell execution_count=10}\n``` {.python .cell-code}\n(\n ddb_ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name ┃ num_titles ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ string │ int64 │\n├──────────────────┼────────────┤\n│ Alex Koenigsmark │ 5 │\n│ Carrie Schnelker │ 5 │\n│ Henry Townsend │ 5 │\n│ Sally Sun │ 5 │\n│ Matthew Kavuma │ 5 │\n└──────────────────┴────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#056cde90 .cell execution_count=11}\n``` {.python .cell-code}\n(\n bq_ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name ┃ num_titles ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ string │ int64 │\n├─────────────────────┼────────────┤\n│ José Jaime Espinosa │ 6 │\n│ Paul Winter │ 6 │\n│ Nicolas Bernier │ 6 │\n│ Chris Estrada │ 6 │\n│ Tsuyotake Matsuda │ 5 │\n└─────────────────────┴────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nIt seems like the length of the `known_for_titles` might be capped at some small number!\n\n### Index\n\nWe can see the position of `\"actor\"` or `\"actress\"` in `primary_profession`s:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#cbc557e3 .cell execution_count=12}\n``` {.python .cell-code}\nddb_ents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=12}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │\n├────────────────────────────────────────────┤\n│ 1 │\n│ -1 │\n│ -1 │\n│ 0 │\n│ 2 │\n│ -1 │\n│ 0 │\n│ 0 │\n│ 0 │\n│ 0 │\n│ … │\n└────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n::: {#3eb36d0b .cell execution_count=13}\n``` {.python .cell-code}\nddb_ents.primary_profession.index(\"actress\")\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actress') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │\n├──────────────────────────────────────────────┤\n│ -1 │\n│ 0 │\n│ 0 │\n│ -1 │\n│ -1 │\n│ 0 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ … │\n└──────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#7efe2a42 .cell execution_count=14}\n``` {.python .cell-code}\nbq_ents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=14}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │\n├────────────────────────────────────────────┤\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ … │\n└────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n::: {#909e212d .cell execution_count=15}\n``` {.python .cell-code}\nbq_ents.primary_profession.index(\"actress\")\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actress') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64 │\n├──────────────────────────────────────────────┤\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ -1 │\n│ … │\n└──────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nA return value of `-1` indicates that the searched-for string is not present in the array.\n\nLet's look for entities that are not primarily actors.\n\nWe can do this using the\n[`index`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.index)\nmethod by checking whether the positions of the strings `\"actor\"` and\n`\"actress\"` are both greater than 0:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#96ef0c2b .cell execution_count=16}\n``` {.python .cell-code}\nactor_index = ddb_ents.primary_profession.index(\"actor\")\nactress_index = ddb_ents.primary_profession.index(\"actress\")\n\nddb_not_primarily_acting = (actor_index > 0) & (actress_index > 0)\nddb_not_primarily_acting.mean()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=16}\n\n::: {.ansi-escaped-output}\n```{=html}\n
0.0
\n```\n:::\n\n:::\n:::\n\n\n## BigQuery\n\n::: {#e10553cc .cell execution_count=17}\n``` {.python .cell-code}\nactor_index = bq_ents.primary_profession.index(\"actor\")\nactress_index = bq_ents.primary_profession.index(\"actress\")\n\nbq_not_primarily_acting = (actor_index > 0) & (actress_index > 0)\nbq_not_primarily_acting.mean()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=17}\n\n::: {.ansi-escaped-output}\n```{=html}\n0.0
\n```\n:::\n\n:::\n:::\n\n\n:::\n\nWho are they?\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#a4243c7c .cell execution_count=18}\n``` {.python .cell-code}\nddb_ents[ddb_not_primarily_acting].order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=18}\n```{=html}\n┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#40b5ad2f .cell execution_count=19}\n``` {.python .cell-code}\nbq_ents[bq_not_primarily_acting].order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=19}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nIt's not 100% clear whether the order of elements in `primary_profession` matters here.\n\n### Containment\n\nWe can get people who are not listed as actors or actresses by negating `contains`:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#db2f9479 .cell execution_count=20}\n``` {.python .cell-code}\nddb_non_actors = bq_ents[\n ~_.primary_profession.contains(\"actor\") & ~_.primary_profession.contains(\"actress\")\n]\nddb_non_actors.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=20}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼──────────────────┼────────────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000016 │ Georges Delerue │ ['composer', 'soundtrack', ... +1] │ ['tt8847712', 'tt0091763', ... +2] │\n│ nm0000025 │ Jerry Goldsmith │ ['music_department', 'soundtrack', ... +1] │ ['tt0077269', 'tt0117731', ... +2] │\n│ nm0000033 │ Alfred Hitchcock │ ['director', 'producer', ... +1] │ ['tt0054215', 'tt0052357', ... +2] │\n│ nm0000035 │ James Horner │ ['music_department', 'soundtrack', ... +1] │ ['tt0177971', 'tt0120338', ... +2] │\n│ nm0000040 │ Stanley Kubrick │ ['director', 'writer', ... +1] │ ['tt0120663', 'tt0066921', ... +2] │\n│ nm0000041 │ Akira Kurosawa │ ['writer', 'director', ... +1] │ ['tt0080979', 'tt0089881', ... +2] │\n│ nm0000049 │ Henry Mancini │ ['music_department', 'soundtrack', ... +1] │ ['tt0383216', 'tt0054698', ... +2] │\n│ nm0000055 │ Alfred Newman │ ['music_department', 'composer', ... +1] │ ['tt0049408', 'tt0434409', ... +2] │\n│ nm0000065 │ Nino Rota │ ['composer', 'soundtrack', ... +1] │ ['tt0071562', 'tt0056801', ... +2] │\n│ nm0000067 │ Miklós Rózsa │ ['music_department', 'composer', ... +1] │ ['tt0052618', 'tt0038109', ... 
+2] │\n│ … │ … │ … │ … │\n└───────────┴──────────────────┴────────────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#9fb71b70 .cell execution_count=21}\n``` {.python .cell-code}\nbq_non_actors = bq_ents[\n ~_.primary_profession.contains(\"actor\") & ~_.primary_profession.contains(\"actress\")\n]\nbq_non_actors.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=21}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼──────────────────┼────────────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000016 │ Georges Delerue │ ['composer', 'soundtrack', ... +1] │ ['tt8847712', 'tt0091763', ... +2] │\n│ nm0000025 │ Jerry Goldsmith │ ['music_department', 'soundtrack', ... +1] │ ['tt0077269', 'tt0117731', ... +2] │\n│ nm0000033 │ Alfred Hitchcock │ ['director', 'producer', ... +1] │ ['tt0054215', 'tt0052357', ... +2] │\n│ nm0000035 │ James Horner │ ['music_department', 'soundtrack', ... +1] │ ['tt0177971', 'tt0120338', ... +2] │\n│ nm0000040 │ Stanley Kubrick │ ['director', 'writer', ... +1] │ ['tt0120663', 'tt0066921', ... +2] │\n│ nm0000041 │ Akira Kurosawa │ ['writer', 'director', ... +1] │ ['tt0080979', 'tt0089881', ... +2] │\n│ nm0000049 │ Henry Mancini │ ['music_department', 'soundtrack', ... +1] │ ['tt0383216', 'tt0054698', ... +2] │\n│ nm0000055 │ Alfred Newman │ ['music_department', 'composer', ... +1] │ ['tt0049408', 'tt0434409', ... +2] │\n│ nm0000065 │ Nino Rota │ ['composer', 'soundtrack', ... +1] │ ['tt0071562', 'tt0056801', ... +2] │\n│ nm0000067 │ Miklós Rózsa │ ['music_department', 'composer', ... +1] │ ['tt0052618', 'tt0038109', ... 
+2] │\n│ … │ … │ … │ … │\n└───────────┴──────────────────┴────────────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n### Element removal\n\nWe can remove elements from arrays too.\n\n::: {.callout-note}\n## [`remove()`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.remove) does not mutate the underlying data\n:::\n\nLet's see who only has \"actor\" in the list of their primary professions:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#551005a1 .cell execution_count=22}\n``` {.python .cell-code}\nddb_ents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n _.primary_profession.remove(\"actress\").length() == 0,\n ]\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=22}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#6b4f941c .cell execution_count=23}\n``` {.python .cell-code}\nbq_ents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n _.primary_profession.remove(\"actress\").length() == 0,\n ]\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=23}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n### Slicing with square-bracket syntax\n\nLet's remove everyone's first profession from the list, but only if they have\nmore than one profession listed:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#7751efa4 .cell execution_count=24}\n``` {.python .cell-code}\nddb_ents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=24}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ ['actor', 'miscellaneous'] │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ ['soundtrack'] │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ ['soundtrack', 'writer'] │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ ['director', 'actor'] │ ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ ['soundtrack', 'producer'] │ ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer'] │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando │ ['soundtrack', 'director'] │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton │ ['soundtrack', 'producer'] │ ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney │ ['soundtrack', 'director'] │ ['tt0042041', 'tt0035575', ... +2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#295ffc5a .cell execution_count=25}\n``` {.python .cell-code}\nbq_ents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=25}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ ['actor', 'miscellaneous'] │ ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000002 │ Lauren Bacall │ ['soundtrack'] │ ['tt0038355', 'tt0075213', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ ['soundtrack', 'writer'] │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ ['director', 'actor'] │ ['tt0050976', 'tt0083922', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ ['soundtrack', 'producer'] │ ['tt0034583', 'tt0038787', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer'] │ ['tt0037382', 'tt0043265', ... +2] │\n│ nm0000008 │ Marlon Brando │ ['soundtrack', 'director'] │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton │ ['soundtrack', 'producer'] │ ['tt0061184', 'tt0087803', ... +2] │\n│ nm0000010 │ James Cagney │ ['soundtrack', 'director'] │ ['tt0042041', 'tt0029870', ... 
+2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n## Set operations and sorting\n\nTreating arrays as sets is possible with the\n[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union)\nand\n[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect)\nAPIs.\n\nLet's take a look at `intersect`.\n\n### Intersection\n\nLet's see if we can use array intersection to figure out which people share\nknown-for titles and sort the result:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#1d68e19a .cell execution_count=26}\n``` {.python .cell-code}\nleft = ddb_ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=26}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name ┃ together_with ┃\n┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ array<string> │\n├──────────────────────┼───────────────────────────────────────────────┤\n│ Ava Gardner │ ['Ernest Gold', 'Fred Astaire'] │\n│ Cyd Charisse │ ['Fred Astaire'] │\n│ John Landis │ ['Dan Aykroyd', 'Dick Ziker', ... +14] │\n│ Michael Curtiz │ ['Alan Hale', 'Ann Blyth', ... +19] │\n│ Francis Ford Coppola │ ['Abe Vigoda', 'Al Pacino', ... +19] │\n│ Bernardo Bertolucci │ ['Armand Abplanalp', 'James Acheson', ... +3] │\n│ Karl Malden │ ['Abraxas Aaran', 'Alex North', ... +14] │\n│ Richard Conte │ ['Abe Vigoda', 'Al Pacino', ... +9] │\n│ George Orwell │ ['John Hurt', 'Richard Burton'] │\n│ Joseph L. Mankiewicz │ ['Alfred Newman', 'Anne Baxter', ... +13] │\n│ … │ … │\n└──────────────────────┴───────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#ecaee965 .cell execution_count=27}\n``` {.python .cell-code}\nleft = bq_ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display execution_count=27}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name ┃ together_with ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ array<string> │\n├─────────────────────────┼─────────────────────────────────────────────────────┤\n│ Pavel Vrba │ ['F.C. Lokomotiv Moscow', 'Jeffrey Bruma', ... +4] │\n│ Greg Carrolan │ ['Al Cambronne', 'Alana Tornello', ... +20] │\n│ Aleksander Parzychowski │ ['Adam Korszun', 'Grzegorz Wawrzenczyk', ... +5] │\n│ James Walt │ ['Anton Testino', 'Ben Walanka', ... +10] │\n│ Ellen Dallaglio │ ['Antonia Giovanazzi', 'Fra McCann', ... +10] │\n│ Catarina Martins │ ['Miguel Oliveira', 'Ricardo Gordon', ... +1] │\n│ Stanislav Sesták │ ['Martin Glenn', 'Miso Brecko', ... +6] │\n│ Allison Cabot │ ['Brenda Beard', 'Brian Fenmore', ... +16] │\n│ Vasilis Bouzianas │ ['Aggelos Kasolas', 'Christos Patriarheas', ... +3] │\n│ Marie Muldoon │ ['Alan Oxley', 'Andrew Raeber', ... +39] │\n│ … │ … │\n└─────────────────────────┴─────────────────────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n## Advanced operations\n\n### Flatten arrays into rows\n\nThanks to the [tireless\nefforts](https://github.com/tobymao/sqlglot/commit/06e0869e7aa5714d77e6ec763da38d6a422965fa)\nof the [folks](https://github.com/tobymao/sqlglot/graphs/contributors) working\non [`sqlglot`](https://github.com/tobymao/sqlglot), as of version 7.0.0 Ibis\nsupports\n[`unnest`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.unnest)\nfor BigQuery!\n\nYou can use it standalone on a column expression:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#e5f62712 .cell execution_count=28}\n``` {.python .cell-code}\nddb_ents.primary_profession.unnest()\n```\n\n::: {.cell-output .cell-output-display execution_count=28}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_profession ┃\n┡━━━━━━━━━━━━━━━━━━━━┩\n│ string │\n├────────────────────┤\n│ soundtrack │\n│ actor │\n│ miscellaneous │\n│ actress │\n│ soundtrack │\n│ actress │\n│ soundtrack │\n│ music_department │\n│ actor │\n│ soundtrack │\n│ … │\n└────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#39a19645 .cell execution_count=29}\n``` {.python .cell-code}\nbq_ents.primary_profession.unnest()\n```\n\n::: {.cell-output .cell-output-display execution_count=29}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_profession ┃\n┡━━━━━━━━━━━━━━━━━━━━┩\n│ string │\n├────────────────────┤\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ actor │\n│ … │\n└────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nYou can also use it in `select`/`mutate` calls to expand the table accordingly:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#bd167fb0 .cell execution_count=30}\n``` {.python .cell-code}\nddb_ents.mutate(primary_profession=_.primary_profession.unnest()).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=30}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ array<string> │\n├───────────┼─────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ soundtrack │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000001 │ Fred Astaire │ miscellaneous │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000001 │ Fred Astaire │ actor │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ soundtrack │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000002 │ Lauren Bacall │ actress │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ music_department │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ soundtrack │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ actress │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ soundtrack │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000004 │ John Belushi │ writer │ ['tt0072562', 'tt0078723', ... +2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#419d71fd .cell execution_count=31}\n``` {.python .cell-code}\nbq_ents.mutate(primary_profession=_.primary_profession.unnest()).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=31}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ string │ array<string> │\n├───────────┼─────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ miscellaneous │ ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000001 │ Fred Astaire │ actor │ ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000001 │ Fred Astaire │ soundtrack │ ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000002 │ Lauren Bacall │ actress │ ['tt0038355', 'tt0075213', ... +2] │\n│ nm0000002 │ Lauren Bacall │ soundtrack │ ['tt0038355', 'tt0075213', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ music_department │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ soundtrack │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ actress │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ soundtrack │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000004 │ John Belushi │ actor │ ['tt0072562', 'tt0078723', ... 
+2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\nUnnesting can be useful when joining nested data.\n\nHere we use `unnest` to find people known for any of the Godfather movies:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#3bc2513c .cell execution_count=32}\n``` {.python .cell-code}\nbasics = ddb.tables.title_basics.filter( # <1>\n [\n _.title_type == \"movie\",\n _.original_title.lower().startswith(\"the godfather\"),\n _.genres.lower().contains(\"crime\"),\n ]\n) # <1>\n\nddb_known_for_the_godfather = (\n ddb_ents.mutate(tconst=_.known_for_titles.unnest()) # <2>\n .join(basics, \"tconst\") # <3>\n .select(\"primary_title\", \"primary_name\") # <4>\n .distinct()\n .order_by([\"primary_title\", \"primary_name\"]) # <4>\n)\nddb_known_for_the_godfather\n```\n\n::: {.cell-output .cell-output-display execution_count=32}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title ┃ primary_name ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │\n├───────────────┼─────────────────────┤\n│ The Godfather │ A. Emmett Adams │\n│ The Godfather │ Abe Vigoda │\n│ The Godfather │ Al Lettieri │\n│ The Godfather │ Al Martino │\n│ The Godfather │ Al Pacino │\n│ The Godfather │ Albert S. Ruddy │\n│ The Godfather │ Alex Rocco │\n│ The Godfather │ Andrea Eastman │\n│ The Godfather │ Angelo Infanti │\n│ The Godfather │ Anna Hill Johnstone │\n│ … │ … │\n└───────────────┴─────────────────────┘\n\n```\n:::\n:::\n\n\n1. Filter the `title_basics` data set to only the Godfather movies\n2. Unnest the `known_for_titles` array column\n3. Join with `basics` to get movie titles\n4. Ensure that each entity is only listed once and sort the results\n\n## BigQuery\n\n::: {#3f9231e0 .cell execution_count=33}\n``` {.python .cell-code}\nbasics = bq.tables.title_basics.filter( # <1>\n [\n _.title_type == \"movie\",\n _.original_title.lower().startswith(\"the godfather\"),\n _.genres.lower().contains(\"crime\"),\n ]\n) # <1>\n\nbq_known_for_the_godfather = (\n bq_ents.mutate(tconst=_.known_for_titles.unnest()) # <2>\n .join(basics, \"tconst\") # <3>\n .select(\"primary_title\", \"primary_name\") # <4>\n .distinct()\n .order_by([\"primary_title\", \"primary_name\"]) # <4>\n)\nbq_known_for_the_godfather\n```\n\n::: {.cell-output .cell-output-display execution_count=33}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title ┃ primary_name ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │\n├───────────────┼─────────────────────┤\n│ The Godfather │ A. Emmett Adams │\n│ The Godfather │ Abe Vigoda │\n│ The Godfather │ Al Lettieri │\n│ The Godfather │ Al Martino │\n│ The Godfather │ Al Pacino │\n│ The Godfather │ Albert S. Ruddy │\n│ The Godfather │ Alex Rocco │\n│ The Godfather │ Andrea Eastman │\n│ The Godfather │ Angelo Infanti │\n│ The Godfather │ Anna Hill Johnstone │\n│ … │ … │\n└───────────────┴─────────────────────┘\n\n```\n:::\n:::\n\n\n1. Filter the `title_basics` data set to only the Godfather movies\n2. Unnest the `known_for_titles` array column\n3. Join with `basics` to get movie titles\n4. Ensure that each entity is only listed once and sort the results\n\n:::\n\nLet's summarize by showing how many people are known for each Godfather movie:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#eb030865 .cell execution_count=34}\n``` {.python .cell-code}\nddb_known_for_the_godfather.primary_title.value_counts()\n```\n\n::: {.cell-output .cell-output-display execution_count=34}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title ┃ primary_title_count ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ int64 │\n├────────────────────────┼─────────────────────┤\n│ The Godfather Part II │ 117 │\n│ The Godfather │ 93 │\n│ The Godfather Part III │ 196 │\n└────────────────────────┴─────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#4786c6f4 .cell execution_count=35}\n``` {.python .cell-code}\nbq_known_for_the_godfather.primary_title.value_counts()\n```\n\n::: {.cell-output .cell-output-display execution_count=35}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title ┃ primary_title_count ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ int64 │\n├────────────────────────┼─────────────────────┤\n│ The Godfather Part II │ 114 │\n│ The Godfather Part III │ 202 │\n│ The Godfather │ 97 │\n└────────────────────────┴─────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n### Filtering array elements\n\nFiltering array elements can be done with the\n[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)\nmethod, which applies a predicate to each array element and returns an array of\nelements for which the predicate returns `True`.\n\nThis method is similar to Python's\n[`filter`](https://docs.python.org/3.7/library/functions.html#filter) function.\n\nLet's show all people who are neither editors nor actors:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#d8bdd118 .cell execution_count=36}\n``` {.python .cell-code}\nddb_ents.mutate(\n primary_profession=_.primary_profession.filter( # <1>\n lambda pp: ~pp.isin((\"actor\", \"actress\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\") # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=36}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ ['soundtrack', 'miscellaneous'] │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ ['soundtrack'] │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ ['soundtrack', 'writer'] │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ ['writer', 'director'] │ ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ ['soundtrack', 'producer'] │ ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer'] │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando │ ['soundtrack', 'director'] │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton │ ['soundtrack', 'producer'] │ ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney │ ['soundtrack', 'director'] │ ['tt0042041', 'tt0035575', ... +2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n1. This `filter` call is applied to each array element\n2. 
This `filter` call is applied to the table\n\n## BigQuery\n\n::: {#1ad9065c .cell execution_count=37}\n``` {.python .cell-code}\nbq_ents.mutate(\n primary_profession=_.primary_profession.filter( # <1>\n lambda pp: ~pp.isin((\"actor\", \"actress\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\") # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=37}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ ['soundtrack', 'miscellaneous'] │ ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000002 │ Lauren Bacall │ ['soundtrack'] │ ['tt0038355', 'tt0075213', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ ['soundtrack', 'writer'] │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ ['writer', 'director'] │ ['tt0050976', 'tt0083922', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ ['soundtrack', 'producer'] │ ['tt0034583', 'tt0038787', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer'] │ ['tt0037382', 'tt0043265', ... +2] │\n│ nm0000008 │ Marlon Brando │ ['soundtrack', 'director'] │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton │ ['soundtrack', 'producer'] │ ['tt0061184', 'tt0087803', ... +2] │\n│ nm0000010 │ James Cagney │ ['soundtrack', 'director'] │ ['tt0042041', 'tt0029870', ... +2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n1. This `filter` call is applied to each array element\n2. 
This `filter` call is applied to the table\n\n:::\n\n### Applying a function to array elements\n\nYou can apply a function to run an Ibis expression on each element of an array\nusing the\n[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)\nmethod.\n\nLet's normalize the case of `primary_profession` to upper case:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#2935069b .cell execution_count=38}\n``` {.python .cell-code}\nddb_ents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=38}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼─────────────────┼───────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ ['SOUNDTRACK', 'ACTOR', ... +1] │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall │ ['ACTRESS', 'SOUNDTRACK'] │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ ['WRITER', 'DIRECTOR', ... +1] │ ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0042041', 'tt0035575', ... +2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴───────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#33931d68 .cell execution_count=39}\n``` {.python .cell-code}\nbq_ents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=39}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string │ string │ array<string> │ array<string> │\n├───────────┼─────────────────┼───────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire │ ['SOUNDTRACK', 'ACTOR', ... +1] │ ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000002 │ Lauren Bacall │ ['ACTRESS', 'SOUNDTRACK'] │ ['tt0038355', 'tt0075213', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman │ ['WRITER', 'DIRECTOR', ... +1] │ ['tt0050976', 'tt0083922', ... +2] │\n│ nm0000006 │ Ingrid Bergman │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0034583', 'tt0038787', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0037382', 'tt0043265', ... +2] │\n│ nm0000008 │ Marlon Brando │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0061184', 'tt0087803', ... +2] │\n│ nm0000010 │ James Cagney │ ['ACTOR', 'SOUNDTRACK', ... +1] │ ['tt0042041', 'tt0029870', ... 
+2] │\n│ … │ … │ … │ … │\n└───────────┴─────────────────┴───────────────────────────────────┴────────────────────────────────────┘\n\n```\n:::\n:::\n\n\n:::\n\n## Conclusion\n\nIbis has a sizable collection of array APIs that work with many different\nbackends and as of version 7.0.0, Ibis supports a much larger set of those APIs\nfor BigQuery!\n\nCheck out [the API\ndocumentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue)\nfor the full set of available methods.\n\nTry it out, and let us know what you think.\n\n", "supporting": [ "index_files" ], @@ -11,7 +11,7 @@ "\n\n\n\n" ], "include-after-body": [ - "\n" + "\n" ] } } diff --git a/docs/posts/backend-agnostic-arrays/index.qmd b/docs/posts/backend-agnostic-arrays/index.qmd index 7a453701ff77..8aa8a404dcdd 100644 --- a/docs/posts/backend-agnostic-arrays/index.qmd +++ b/docs/posts/backend-agnostic-arrays/index.qmd @@ -1,7 +1,7 @@ --- title: Backend agnostic arrays author: "Phillip Cloud" -date: last-modified +date: 2024-01-19 categories: - arrays - bigquery