From cb2138a9a77f0e90f445806e22680c104f2d7ba0 Mon Sep 17 00:00:00 2001
From: Phillip Cloud <417981+cpcloud@users.noreply.github.com>
Date: Mon, 22 Jan 2024 09:44:46 -0500
Subject: [PATCH] docs: fix rolling date on bigquery/duckdb array blog

---
 .../backend-agnostic-arrays/index/execute-results/html.json | 6 +++---
 docs/posts/backend-agnostic-arrays/index.qmd                | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/_freeze/posts/backend-agnostic-arrays/index/execute-results/html.json b/docs/_freeze/posts/backend-agnostic-arrays/index/execute-results/html.json
index 4323a6909884..5051270c7db9 100644
--- a/docs/_freeze/posts/backend-agnostic-arrays/index/execute-results/html.json
+++ b/docs/_freeze/posts/backend-agnostic-arrays/index/execute-results/html.json
@@ -1,7 +1,7 @@
 {
-  "hash": "34305e46cb1163d2232533f8bd56e2b6",
+  "hash": "8e61910a8bd05e2e50367d1206f17b3d",
   "result": {
-    "markdown": "---\ntitle: Backend agnostic arrays\nauthor: \"Phillip Cloud\"\ndate: last-modified\ncategories:\n - arrays\n - bigquery\n - blog\n - cloud\n - duckdb\n - portability\n---\n\n## Introduction\n\nThis is a redux of a [previous post](../bigquery-arrays/index.qmd) showing\nIbis's portability in action.\n\nIbis is portable across complex operations and backends of very different\nscales and deployment models!\n\n::: {.callout-note}\n\n## Results differ slightly between BigQuery and DuckDB\n\nThe datasets used in each backend are slightly different.\n\nI opted to avoid ETL for the BigQuery backend by reusing the Google-provided\nIMDB dataset.\n\nThe tradeoff is the slight discrepancy in results.\n:::\n\n## Basics\n\nWe'll start with `from ibis.interactive import *` for maximum convenience.\n\n::: {#9a52f567 .cell execution_count=1}\n``` {.python .cell-code}\nfrom ibis.interactive import * # <1>\n```\n:::\n\n\n1. 
`from ibis.interactive import *` imports Ibis APIs into the global namespace\n and enables [interactive mode](../../how-to/configure/basics.qmd#interactive-mode).\n\n### Connect to your database\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#266ed79d .cell execution_count=2}\n``` {.python .cell-code}\nddb = ibis.connect(\"duckdb://\")\nddb.create_table( # <1>\n \"name_basics\", ex.imdb_name_basics.fetch(backend=ddb).rename(\"snake_case\")\n) # <1>\nddb.create_table( # <2>\n \"title_basics\", ex.imdb_title_basics.fetch(backend=ddb).rename(\"snake_case\")\n) # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ tconst    ┃ title_type ┃ primary_title                               ┃ original_title                              ┃ is_adult ┃ start_year ┃ end_year ┃ runtime_minutes ┃ genres                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string     │ string                                      │ string                                      │ int64    │ int64      │ string   │ int64           │ string                   │\n├───────────┼────────────┼─────────────────────────────────────────────┼─────────────────────────────────────────────┼──────────┼────────────┼──────────┼─────────────────┼──────────────────────────┤\n│ tt0000001 │ short      │ Carmencita                                  │ Carmencita                                  │        0 │       1894 │ NULL     │               1 │ Documentary,Short        │\n│ tt0000002 │ short      │ Le clown et ses chiens                      │ Le clown et ses chiens                      │        0 │       1892 │ NULL     │               5 │ Animation,Short          │\n│ tt0000003 │ short      │ Pauvre Pierrot                              │ Pauvre Pierrot                              │        0 │       1892 │ NULL     │               4 │ Animation,Comedy,Romance │\n│ tt0000004 │ short      │ Un bon bock                                 │ Un bon bock                                 │        0 │       1892 │ NULL     │              12 │ Animation,Short          │\n│ tt0000005 │ short      │ Blacksmith Scene                            │ Blacksmith Scene                            │        0 │       1893 │ NULL     │               1 │ Comedy,Short             │\n│ tt0000006 │ short      │ Chinese Opium Den                           │ Chinese Opium Den                           │        0 │       1894 │ NULL     │               1 │ Short                    │\n│ tt0000007 │ short      │ Corbett and Courtney Before the Kinetograph │ Corbett and Courtney Before the Kinetograph │        0 │       1894 │ NULL     │               1 │ Short,Sport              │\n│ tt0000008 │ short      │ Edison Kinetoscopic Record of a Sneeze      │ Edison Kinetoscopic Record of a Sneeze      │        0 │       1894 │ NULL     │               1 │ Documentary,Short        │\n│ tt0000009 │ movie      │ Miss Jerry                                  │ Miss Jerry                                  │        0 │       1894 │ NULL     │              45 │ Romance                  │\n│ tt0000010 │ short      │ Leaving the Factory                         │ La sortie de l'usine Lumière à Lyon         │        0 │       1895 │ NULL     │               1 │ Documentary,Short        │\n│ …         │ …          │ …                                           │ …                                           │        … │          … │ …        │               … │ …                        │\n└───────────┴────────────┴─────────────────────────────────────────────┴─────────────────────────────────────────────┴──────────┴────────────┴──────────┴─────────────────┴──────────────────────────┘\n
\n```\n:::\n:::\n\n\n1. Create a table called `name_basics` in our DuckDB database using `ibis.examples` data\n2. Create a table called `title_basics` in our DuckDB database using `ibis.examples` data\n\n## BigQuery\n\n::: {#50929109 .cell execution_count=3}\n``` {.python .cell-code}\nbq = ibis.connect(\"bigquery://ibis-gbq\")\nbq.set_database(\"bigquery-public-data.imdb\") # <1>\n```\n:::\n\n\n1. Google provides a public BigQuery dataset for IMDB data.\n\n:::\n\nLet's pull out the `name_basics` table, which contains names and metadata about\npeople listed on IMDB. We'll call this `ents` (short for `entities`), and remove some\ncolumns we won't need:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#d8893bc4 .cell execution_count=4}\n``` {.python .cell-code}\nddb_ents = ddb.tables.name_basics.drop(\"birth_year\", \"death_year\")\nddb_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name    ┃ primary_profession                  ┃ known_for_titles                        ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string          │ string                              │ string                                  │\n├───────────┼─────────────────┼─────────────────────────────────────┼─────────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire    │ soundtrack,actor,miscellaneous      │ tt0053137,tt0072308,tt0045537,tt0050419 │\n│ nm0000002 │ Lauren Bacall   │ actress,soundtrack                  │ tt0037382,tt0117057,tt0075213,tt0038355 │\n│ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ tt0057345,tt0054452,tt0049189,tt0056404 │\n│ nm0000004 │ John Belushi    │ actor,soundtrack,writer             │ tt0072562,tt0078723,tt0077975,tt0080455 │\n│ nm0000005 │ Ingmar Bergman  │ writer,director,actor               │ tt0083922,tt0069467,tt0050976,tt0050986 │\n│ nm0000006 │ Ingrid Bergman  │ actress,soundtrack,producer         │ tt0038109,tt0036855,tt0034583,tt0038787 │\n│ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer           │ tt0037382,tt0034583,tt0042593,tt0043265 │\n│ nm0000008 │ Marlon Brando   │ actor,soundtrack,director           │ tt0068646,tt0070849,tt0078788,tt0047296 │\n│ nm0000009 │ Richard Burton  │ actor,soundtrack,producer           │ tt0057877,tt0059749,tt0061184,tt0087803 │\n│ nm0000010 │ James Cagney    │ actor,soundtrack,director           │ tt0042041,tt0035575,tt0029870,tt0031867 │\n│ …         │ …               │ …                                   │ …                                       │\n└───────────┴─────────────────┴─────────────────────────────────────┴─────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#5c8d59db .cell execution_count=5}\n``` {.python .cell-code}\nbq_ents = bq.tables.name_basics.drop(\"birth_year\", \"death_year\")\nbq_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=5}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name    ┃ primary_profession                  ┃ known_for_titles                        ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string          │ string                              │ string                                  │\n├───────────┼─────────────────┼─────────────────────────────────────┼─────────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire    │ soundtrack,actor,miscellaneous      │ tt0031983,tt0072308,tt0053137,tt0050419 │\n│ nm0000002 │ Lauren Bacall   │ actress,soundtrack                  │ tt0075213,tt0038355,tt0037382,tt0117057 │\n│ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ tt0049189,tt0054452,tt0056404,tt0057345 │\n│ nm0000004 │ John Belushi    │ actor,soundtrack,writer             │ tt0072562,tt0077975,tt0078723,tt0080455 │\n│ nm0000005 │ Ingmar Bergman  │ writer,director,actor               │ tt0050986,tt0050976,tt0069467,tt0083922 │\n│ nm0000006 │ Ingrid Bergman  │ actress,soundtrack,producer         │ tt0038109,tt0034583,tt0036855,tt0038787 │\n│ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer           │ tt0037382,tt0034583,tt0043265,tt0042593 │\n│ nm0000008 │ Marlon Brando   │ actor,soundtrack,director           │ tt0070849,tt0047296,tt0068646,tt0078788 │\n│ nm0000009 │ Richard Burton  │ actor,soundtrack,producer           │ tt0087803,tt0059749,tt0061184,tt0057877 │\n│ nm0000010 │ James Cagney    │ actor,soundtrack,director           │ tt0035575,tt0042041,tt0031867,tt0029870 │\n│ …         │ …               │ …                                   │ …                                       │\n└───────────┴─────────────────┴─────────────────────────────────────┴─────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n### Splitting strings into arrays\n\nWe can see that `known_for_titles` looks sort of like an array, so let's call\nthe\n[`split`](../../reference/expression-strings.qmd#ibis.expr.types.strings.StringValue.split)\nmethod on that column and replace the existing column:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#806c1396 .cell execution_count=6}\n``` {.python .cell-code}\nddb_ents = ddb_ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nddb_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name    ┃ primary_profession                  ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string          │ string                              │ array<string>                      │\n├───────────┼─────────────────┼─────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire    │ soundtrack,actor,miscellaneous      │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall   │ actress,soundtrack                  │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi    │ actor,soundtrack,writer             │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman  │ writer,director,actor               │ ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006 │ Ingrid Bergman  │ actress,soundtrack,producer         │ ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer           │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando   │ actor,soundtrack,director           │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton  │ actor,soundtrack,producer           │ ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney    │ actor,soundtrack,director           │ ['tt0042041', 'tt0035575', ... +2] │\n│ …         │ …               │ …                                   │ …                                  │\n└───────────┴─────────────────┴─────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#70089eda .cell execution_count=7}\n``` {.python .cell-code}\nbq_ents = bq_ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nbq_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name    ┃ primary_profession                  ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string          │ string                              │ array<string>                      │\n├───────────┼─────────────────┼─────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire    │ soundtrack,actor,miscellaneous      │ ['tt0031983', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall   │ actress,soundtrack                  │ ['tt0075213', 'tt0038355', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ actress,soundtrack,music_department │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi    │ actor,soundtrack,writer             │ ['tt0072562', 'tt0077975', ... +2] │\n│ nm0000005 │ Ingmar Bergman  │ writer,director,actor               │ ['tt0050986', 'tt0050976', ... +2] │\n│ nm0000006 │ Ingrid Bergman  │ actress,soundtrack,producer         │ ['tt0038109', 'tt0034583', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ actor,soundtrack,producer           │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando   │ actor,soundtrack,director           │ ['tt0070849', 'tt0047296', ... +2] │\n│ nm0000009 │ Richard Burton  │ actor,soundtrack,producer           │ ['tt0087803', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney    │ actor,soundtrack,director           │ ['tt0035575', 'tt0042041', ... +2] │\n│ …         │ …               │ …                                   │ …                                  │\n└───────────┴─────────────────┴─────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nSimilarly for `primary_profession`, since people involved in show business often\nhave more than one responsibility on a project:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#646e2c20 .cell execution_count=8}\n``` {.python .cell-code}\nddb_ents = ddb_ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n## BigQuery\n\n::: {#0fac0f4a .cell execution_count=9}\n``` {.python .cell-code}\nbq_ents = bq_ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n:::\n\n### Array length\n\nLet's see how many titles each entity is known for, and then show the five\npeople with the largest number of titles they're known for.\n\nThis is computed using the\n[`length`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length)\nAPI on array expressions:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#dde377da .cell execution_count=10}\n``` {.python .cell-code}\n(\n ddb_ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name     ┃ num_titles ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ string           │ int64      │\n├──────────────────┼────────────┤\n│ Alex Koenigsmark │          5 │\n│ Carrie Schnelker │          5 │\n│ Sally Sun        │          5 │\n│ Henry Townsend   │          5 │\n│ Matthew Kavuma   │          5 │\n└──────────────────┴────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#126820cf .cell execution_count=11}\n``` {.python .cell-code}\n(\n bq_ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name      ┃ num_titles ┃\n┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ string            │ int64      │\n├───────────────────┼────────────┤\n│ Paul Winter       │          6 │\n│ Chris Estrada     │          6 │\n│ Nicolas Bernier   │          6 │\n│ Tsuyotake Matsuda │          5 │\n│ Jonathon Saunders │          5 │\n└───────────────────┴────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nIt seems like the length of the `known_for_titles` might be capped at some small number!\n\n### Index\n\nWe can see the position of `\"actor\"` or `\"actress\"` in `primary_profession`s:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#5a84302f .cell execution_count=12}\n``` {.python .cell-code}\nddb_ents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=12}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64                                      │\n├────────────────────────────────────────────┤\n│                                          1 │\n│                                         -1 │\n│                                         -1 │\n│                                          0 │\n│                                          2 │\n│                                         -1 │\n│                                          0 │\n│                                          0 │\n│                                          0 │\n│                                          0 │\n│                                          … │\n└────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n::: {#24b4909f .cell execution_count=13}\n``` {.python .cell-code}\nddb_ents.primary_profession.index(\"actress\")\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actress') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64                                        │\n├──────────────────────────────────────────────┤\n│                                           -1 │\n│                                            0 │\n│                                            0 │\n│                                           -1 │\n│                                           -1 │\n│                                            0 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                            … │\n└──────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#2955a839 .cell execution_count=14}\n``` {.python .cell-code}\nbq_ents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=14}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64                                      │\n├────────────────────────────────────────────┤\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                          … │\n└────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n::: {#5cbb952f .cell execution_count=15}\n``` {.python .cell-code}\nbq_ents.primary_profession.index(\"actress\")\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actress') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64                                        │\n├──────────────────────────────────────────────┤\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                            … │\n└──────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nA return value of `-1` indicates that `\"actor\"` is not present in the value.\n\nLet's look for entities that are not primarily actors.\n\nWe can do this using the\n[`index`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.index)\nmethod by checking whether the positions of the strings `\"actor\"` or\n`\"actress\"` are both greater than 0:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#28eb1e61 .cell execution_count=16}\n``` {.python .cell-code}\nactor_index = ddb_ents.primary_profession.index(\"actor\")\nactress_index = ddb_ents.primary_profession.index(\"actress\")\n\nddb_not_primarily_acting = (actor_index > 0) & (actress_index > 0)\nddb_not_primarily_acting.mean()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=16}\n\n::: {.ansi-escaped-output}\n```{=html}\n
0.0
\n```\n:::\n\n:::\n:::\n\n\n## BigQuery\n\n::: {#a7b0a283 .cell execution_count=17}\n``` {.python .cell-code}\nactor_index = bq_ents.primary_profession.index(\"actor\")\nactress_index = bq_ents.primary_profession.index(\"actress\")\n\nbq_not_primarily_acting = (actor_index > 0) & (actress_index > 0)\nbq_not_primarily_acting.mean()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=17}\n\n::: {.ansi-escaped-output}\n```{=html}\n
0.0
\n```\n:::\n\n:::\n:::\n\n\n:::\n\nWho are they?\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#d68e44dd .cell execution_count=18}\n``` {.python .cell-code}\nddb_ents[ddb_not_primarily_acting].order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=18}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string       │ array<string>      │ array<string>    │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#db8558c1 .cell execution_count=19}\n``` {.python .cell-code}\nbq_ents[bq_not_primarily_acting].order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=19}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string       │ array<string>      │ array<string>    │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nIt's not 100% clear whether the order of elements in `primary_profession` matters here.\n\n### Containment\n\nWe can get people who are listed as actors or actresses using `contains`:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#f9e1558c .cell execution_count=20}\n``` {.python .cell-code}\nddb_non_actors = bq_ents[\n ~_.primary_profession.contains(\"actor\") & ~_.primary_profession.contains(\"actress\")\n]\nddb_non_actors.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=20}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name     ┃ primary_profession                         ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string           │ array<string>                              │ array<string>                      │\n├───────────┼──────────────────┼────────────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000016 │ Georges Delerue  │ ['composer', 'soundtrack', ... +1]         │ ['tt0091763', 'tt0096320', ... +2] │\n│ nm0000025 │ Jerry Goldsmith  │ ['music_department', 'soundtrack', ... +1] │ ['tt0119488', 'tt0117731', ... +2] │\n│ nm0000033 │ Alfred Hitchcock │ ['director', 'producer', ... +1]           │ ['tt0053125', 'tt0052357', ... +2] │\n│ nm0000035 │ James Horner     │ ['music_department', 'soundtrack', ... +1] │ ['tt0120338', 'tt0499549', ... +2] │\n│ nm0000040 │ Stanley Kubrick  │ ['director', 'writer', ... +1]             │ ['tt0062622', 'tt0120663', ... +2] │\n│ nm0000041 │ Akira Kurosawa   │ ['writer', 'director', ... +1]             │ ['tt0051808', 'tt0089881', ... +2] │\n│ nm0000049 │ Henry Mancini    │ ['music_department', 'soundtrack', ... +1] │ ['tt0057413', 'tt0054698', ... +2] │\n│ nm0000055 │ Alfred Newman    │ ['music_department', 'composer', ... +1]   │ ['tt0065377', 'tt0049408', ... +2] │\n│ nm0000065 │ Nino Rota        │ ['composer', 'soundtrack', ... +1]         │ ['tt0063518', 'tt0068646', ... +2] │\n│ nm0000067 │ Miklós Rózsa     │ ['music_department', 'composer', ... +1]   │ ['tt0038109', 'tt0054847', ... +2] │\n│ …         │ …                │ …                                          │ …                                  │\n└───────────┴──────────────────┴────────────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#b51e75a5 .cell execution_count=21}\n``` {.python .cell-code}\nbq_non_actors = bq_ents[\n ~_.primary_profession.contains(\"actor\") & ~_.primary_profession.contains(\"actress\")\n]\nbq_non_actors.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=21}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name     ┃ primary_profession                         ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string           │ array<string>                              │ array<string>                      │\n├───────────┼──────────────────┼────────────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000016 │ Georges Delerue  │ ['composer', 'soundtrack', ... +1]         │ ['tt0091763', 'tt0096320', ... +2] │\n│ nm0000025 │ Jerry Goldsmith  │ ['music_department', 'soundtrack', ... +1] │ ['tt0119488', 'tt0117731', ... +2] │\n│ nm0000033 │ Alfred Hitchcock │ ['director', 'producer', ... +1]           │ ['tt0053125', 'tt0052357', ... +2] │\n│ nm0000035 │ James Horner     │ ['music_department', 'soundtrack', ... +1] │ ['tt0120338', 'tt0499549', ... +2] │\n│ nm0000040 │ Stanley Kubrick  │ ['director', 'writer', ... +1]             │ ['tt0062622', 'tt0120663', ... +2] │\n│ nm0000041 │ Akira Kurosawa   │ ['writer', 'director', ... +1]             │ ['tt0051808', 'tt0089881', ... +2] │\n│ nm0000049 │ Henry Mancini    │ ['music_department', 'soundtrack', ... +1] │ ['tt0057413', 'tt0054698', ... +2] │\n│ nm0000055 │ Alfred Newman    │ ['music_department', 'composer', ... +1]   │ ['tt0065377', 'tt0049408', ... +2] │\n│ nm0000065 │ Nino Rota        │ ['composer', 'soundtrack', ... +1]         │ ['tt0063518', 'tt0068646', ... +2] │\n│ nm0000067 │ Miklós Rózsa     │ ['music_department', 'composer', ... +1]   │ ['tt0038109', 'tt0054847', ... +2] │\n│ …         │ …                │ …                                          │ …                                  │\n└───────────┴──────────────────┴────────────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n### Element removal\n\nWe can remove elements from arrays too.\n\n::: {.callout-note}\n## [`remove()`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.remove) does not mutate the underlying data\n:::\n\nLet's see who only has \"actor\" in the list of their primary professions:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#e930038b .cell execution_count=22}\n``` {.python .cell-code}\nddb_ents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n _.primary_profession.remove(\"actress\").length() == 0,\n ]\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=22}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string       │ array<string>      │ array<string>    │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#4c7a3db0 .cell execution_count=23}\n``` {.python .cell-code}\nbq_ents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n _.primary_profession.remove(\"actress\").length() == 0,\n ]\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=23}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst ┃ primary_name ┃ primary_profession ┃ known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ string │ string       │ array<string>      │ array<string>    │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n### Slicing with square-bracket syntax\n\nLet's remove everyone's first profession from the list, but only if they have\nmore than one profession listed:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#727c9fc4 .cell execution_count=24}\n``` {.python .cell-code}\nddb_ents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=24}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name    ┃ primary_profession                 ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string          │ array<string>                      │ array<string>                      │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire    │ ['actor', 'miscellaneous']         │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall   │ ['soundtrack']                     │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi    │ ['soundtrack', 'writer']           │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman  │ ['director', 'actor']              │ ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006 │ Ingrid Bergman  │ ['soundtrack', 'producer']         │ ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer']         │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando   │ ['soundtrack', 'director']         │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton  │ ['soundtrack', 'producer']         │ ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney    │ ['soundtrack', 'director']         │ ['tt0042041', 'tt0035575', ... +2] │\n│ …         │ …               │ …                                  │ …                                  │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#86e58551 .cell execution_count=25}\n``` {.python .cell-code}\nbq_ents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=25}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name    ┃ primary_profession                 ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string          │ array<string>                      │ array<string>                      │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire    │ ['actor', 'miscellaneous']         │ ['tt0031983', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall   │ ['soundtrack']                     │ ['tt0075213', 'tt0038355', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi    │ ['soundtrack', 'writer']           │ ['tt0072562', 'tt0077975', ... +2] │\n│ nm0000005 │ Ingmar Bergman  │ ['director', 'actor']              │ ['tt0050986', 'tt0050976', ... +2] │\n│ nm0000006 │ Ingrid Bergman  │ ['soundtrack', 'producer']         │ ['tt0038109', 'tt0034583', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer']         │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando   │ ['soundtrack', 'director']         │ ['tt0070849', 'tt0047296', ... +2] │\n│ nm0000009 │ Richard Burton  │ ['soundtrack', 'producer']         │ ['tt0087803', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney    │ ['soundtrack', 'director']         │ ['tt0035575', 'tt0042041', ... +2] │\n│ …         │ …               │ …                                  │ …                                  │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n## Set operations and sorting\n\nTreating arrays as sets is possible with the\n[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union)\nand\n[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect)\nAPIs.\n\nLet's take a look at `intersect`.\n\n### Intersection\n\nLet's see if we can use array intersection to figure which actors share\nknown-for titles and sort the result:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#1f777844 .cell execution_count=26}\n``` {.python .cell-code}\nleft = ddb_ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=26}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name                 together_with                                   ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringarray<string>                                   │\n├─────────────────────┼─────────────────────────────────────────────────┤\n│ Richard Chamberlain['Fred Astaire', 'Fred J. Koenekamp', ... +13]  │\n│ John Wayne         ['Cyril J. Mockridge', 'Glen Campbell', ... +8] │\n│ Fritz Lang         ['Alfred Abel', 'Brigitte Bardot', ... +13]     │\n│ John Candy         ['Adam Bernardi', 'Amy Madigan', ... +21]       │\n│ Peter Lorre        ['Byron Haskin', 'Claude Rains', ... +16]       │\n│ Miklós Rózsa       ['Barbara Stanwyck', 'Charlton Heston', ... +8] │\n│ George Segal       ['Alex North', 'Amanda Peet', ... +18]          │\n│ Lon Chaney Jr.     ['Bud Abbott', 'Charles Previn', ... +5]        │\n│ Vivien Leigh       ['Alex North', 'Clark Gable', ... +17]          │\n│ Jim Backus         ['Alan Hale Jr.', 'Bob Denver', ... +16]        │\n│                                                │\n└─────────────────────┴─────────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#91f39681 .cell execution_count=27}\n``` {.python .cell-code}\nleft = bq_ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display execution_count=27}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name              together_with                                            ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringarray<string>                                            │\n├──────────────────┼──────────────────────────────────────────────────────────┤\n│ Antonieta Careri['Ayame Loren Tribune', 'Carey Giesner-Garcia', ... +13] │\n│ Mike Sutcliffe  ['Christine Slattery', 'Linda Collister', ... +2]        │\n│ Yolanda Paul    ['Antonieta Careri', 'Ayame Loren Tribune', ... +13]     │\n│ Anthony Micari  ['Adam Cole', 'Andrew Del Vecchio', ... +6]              │\n│ Enoch Showunmi  ['Andy Leese', 'Ben Adelsbury', ... +18]                 │\n│ Ana Akauola     ['A.B. Olevic', 'Candy Hurtado', ... +12]                │\n│ Awad Al Yami    ['Abdo Bardawill', 'Alex Saratsis', ... +66]             │\n│ Charly Freitag  ['Christoph Homberger', 'Jana Leu', ... +13]             │\n│ Rachel Mader    ['Charly Freitag', 'Christoph Homberger', ... +13]       │\n│ Robert Bartley  ['Adam Cole', 'Andrew Del Vecchio', ... +6]              │\n│                                                         │\n└──────────────────┴──────────────────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n## Advanced operations\n\n### Flatten arrays into rows\n\nThanks to the [tireless\nefforts](https://github.com/tobymao/sqlglot/commit/06e0869e7aa5714d77e6ec763da38d6a422965fa)\nof the [folks](https://github.com/tobymao/sqlglot/graphs/contributors) working\non [`sqlglot`](https://github.com/tobymao/sqlglot), as of version 7.0.0 Ibis\nsupports\n[`unnest`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.unnest)\nfor BigQuery!\n\nYou can use it standalone on a column expression:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#22f716af .cell execution_count=28}\n``` {.python .cell-code}\nddb_ents.primary_profession.unnest()\n```\n\n::: {.cell-output .cell-output-display execution_count=28}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_profession ┃\n┡━━━━━━━━━━━━━━━━━━━━┩\n│ string             │\n├────────────────────┤\n│ soundtrack         │\n│ actor              │\n│ miscellaneous      │\n│ actress            │\n│ soundtrack         │\n│ actress            │\n│ soundtrack         │\n│ music_department   │\n│ actor              │\n│ soundtrack         │\n│                   │\n└────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#46944f60 .cell execution_count=29}\n``` {.python .cell-code}\nbq_ents.primary_profession.unnest()\n```\n\n::: {.cell-output .cell-output-display execution_count=29}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_profession ┃\n┡━━━━━━━━━━━━━━━━━━━━┩\n│ string             │\n├────────────────────┤\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│                   │\n└────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nYou can also use it in `select`/`mutate` calls to expand the table accordingly:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#cb45455b .cell execution_count=30}\n``` {.python .cell-code}\nddb_ents.mutate(primary_profession=_.primary_profession.unnest()).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=30}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringarray<string>                      │\n├───────────┼─────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm0000001Fred Astaire   soundtrack        ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000001Fred Astaire   actor             ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000001Fred Astaire   miscellaneous     ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002Lauren Bacall  actress           ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000002Lauren Bacall  soundtrack        ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003Brigitte Bardotactress           ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000003Brigitte Bardotsoundtrack        ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000003Brigitte Bardotmusic_department  ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004John Belushi   actor             ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000004John Belushi   soundtrack        ['tt0072562', 'tt0078723', ... +2] │\n│                                   │\n└───────────┴─────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#74595b89 .cell execution_count=31}\n``` {.python .cell-code}\nbq_ents.mutate(primary_profession=_.primary_profession.unnest()).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=31}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringarray<string>                      │\n├───────────┼─────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm0000001Fred Astaire   soundtrack        ['tt0031983', 'tt0072308', ... +2] │\n│ nm0000001Fred Astaire   actor             ['tt0031983', 'tt0072308', ... +2] │\n│ nm0000001Fred Astaire   miscellaneous     ['tt0031983', 'tt0072308', ... +2] │\n│ nm0000002Lauren Bacall  actress           ['tt0075213', 'tt0038355', ... +2] │\n│ nm0000002Lauren Bacall  soundtrack        ['tt0075213', 'tt0038355', ... +2] │\n│ nm0000003Brigitte Bardotsoundtrack        ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000003Brigitte Bardotactress           ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000003Brigitte Bardotmusic_department  ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004John Belushi   writer            ['tt0072562', 'tt0077975', ... +2] │\n│ nm0000004John Belushi   actor             ['tt0072562', 'tt0077975', ... +2] │\n│                                   │\n└───────────┴─────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nUnnesting can be useful when joining nested data.\n\nHere we use unnest to find people known for any of the godfather movies:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#8c91eda2 .cell execution_count=32}\n``` {.python .cell-code}\nbasics = ddb.tables.title_basics.filter( # <1>\n [\n _.title_type == \"movie\",\n _.original_title.lower().startswith(\"the godfather\"),\n _.genres.lower().contains(\"crime\"),\n ]\n) # <1>\n\nddb_known_for_the_godfather = (\n ddb_ents.mutate(tconst=_.known_for_titles.unnest()) # <2>\n .join(basics, \"tconst\") # <3>\n .select(\"primary_title\", \"primary_name\") # <4>\n .distinct()\n .order_by([\"primary_title\", \"primary_name\"]) # <4>\n)\nddb_known_for_the_godfather\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=32}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title  primary_name        ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstring              │\n├───────────────┼─────────────────────┤\n│ The GodfatherA. Emmett Adams     │\n│ The GodfatherAbe Vigoda          │\n│ The GodfatherAl Lettieri         │\n│ The GodfatherAl Martino          │\n│ The GodfatherAl Pacino           │\n│ The GodfatherAlbert S. Ruddy     │\n│ The GodfatherAlex Rocco          │\n│ The GodfatherAndrea Eastman      │\n│ The GodfatherAngelo Infanti      │\n│ The GodfatherAnna Hill Johnstone │\n│                    │\n└───────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n1. Filter the `title_basics` data set to only the Godfather movies\n2. Unnest the `known_for_titles` array column\n3. Join with `basics` to get movie titles\n4. Ensure that each entity is only listed once and sort the results\n\n## BigQuery\n\n::: {#6b92a489 .cell execution_count=33}\n``` {.python .cell-code}\nbasics = bq.tables.title_basics.filter( # <1>\n [\n _.title_type == \"movie\",\n _.original_title.lower().startswith(\"the godfather\"),\n _.genres.lower().contains(\"crime\"),\n ]\n) # <1>\n\nbq_known_for_the_godfather = (\n bq_ents.mutate(tconst=_.known_for_titles.unnest()) # <2>\n .join(basics, \"tconst\") # <3>\n .select(\"primary_title\", \"primary_name\") # <4>\n .distinct()\n .order_by([\"primary_title\", \"primary_name\"]) # <4>\n)\nbq_known_for_the_godfather\n```\n\n::: {.cell-output .cell-output-display execution_count=33}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title  primary_name        ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstring              │\n├───────────────┼─────────────────────┤\n│ The GodfatherA. Emmett Adams     │\n│ The GodfatherAbe Vigoda          │\n│ The GodfatherAl Lettieri         │\n│ The GodfatherAl Martino          │\n│ The GodfatherAl Pacino           │\n│ The GodfatherAlbert S. Ruddy     │\n│ The GodfatherAlex Rocco          │\n│ The GodfatherAndrea Eastman      │\n│ The GodfatherAngelo Infanti      │\n│ The GodfatherAnna Hill Johnstone │\n│                    │\n└───────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n1. Filter the `title_basics` data set to only the Godfather movies\n2. Unnest the `known_for_titles` array column\n3. Join with `basics` to get movie titles\n4. Ensure that each entity is only listed once and sort the results\n\n:::\n\nLet's summarize by showing how many people are known for each Godfather movie:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#c6caecb6 .cell execution_count=34}\n``` {.python .cell-code}\nddb_known_for_the_godfather.primary_title.value_counts()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=34}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title           primary_title_count ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ stringint64               │\n├────────────────────────┼─────────────────────┤\n│ The Godfather Part III196 │\n│ The Godfather         93 │\n│ The Godfather Part II 117 │\n└────────────────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#72d2859f .cell execution_count=35}\n``` {.python .cell-code}\nbq_known_for_the_godfather.primary_title.value_counts()\n```\n\n::: {.cell-output .cell-output-display execution_count=35}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title           primary_title_count ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ stringint64               │\n├────────────────────────┼─────────────────────┤\n│ The Godfather Part III194 │\n│ The Godfather Part II 114 │\n│ The Godfather         97 │\n└────────────────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n### Filtering array elements\n\nFiltering array elements can be done with the\n[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)\nmethod, which applies a predicate to each array element and returns an array of\nelements for which the predicate returns `True`.\n\nThis method is similar to Python's\n[`filter`](https://docs.python.org/3.7/library/functions.html#filter) function.\n\nLet's show all people who are neither editors nor actors:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#b5e4701c .cell execution_count=36}\n``` {.python .cell-code}\nddb_ents.mutate(\n primary_profession=_.primary_profession.filter( # <1>\n lambda pp: ~pp.isin((\"actor\", \"actress\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\") # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=36}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession                  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001Fred Astaire   ['soundtrack', 'miscellaneous']['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002Lauren Bacall  ['soundtrack']['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003Brigitte Bardot['soundtrack', 'music_department']['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004John Belushi   ['soundtrack', 'writer']['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005Ingmar Bergman ['writer', 'director']['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006Ingrid Bergman ['soundtrack', 'producer']['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007Humphrey Bogart['soundtrack', 'producer']['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008Marlon Brando  ['soundtrack', 'director']['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009Richard Burton ['soundtrack', 'producer']['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010James Cagney   ['soundtrack', 'director']['tt0042041', 'tt0035575', ... +2] │\n│                                   │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n1. This `filter` call is applied to each array element\n2. This `filter` call is applied to the table\n\n## BigQuery\n\n::: {#089fba87 .cell execution_count=37}\n``` {.python .cell-code}\nbq_ents.mutate(\n primary_profession=_.primary_profession.filter( # <1>\n lambda pp: ~pp.isin((\"actor\", \"actress\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\") # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=37}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession                  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001Fred Astaire   ['soundtrack', 'miscellaneous']['tt0031983', 'tt0072308', ... +2] │\n│ nm0000002Lauren Bacall  ['soundtrack']['tt0075213', 'tt0038355', ... +2] │\n│ nm0000003Brigitte Bardot['soundtrack', 'music_department']['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004John Belushi   ['soundtrack', 'writer']['tt0072562', 'tt0077975', ... +2] │\n│ nm0000005Ingmar Bergman ['writer', 'director']['tt0050986', 'tt0050976', ... +2] │\n│ nm0000006Ingrid Bergman ['soundtrack', 'producer']['tt0038109', 'tt0034583', ... +2] │\n│ nm0000007Humphrey Bogart['soundtrack', 'producer']['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008Marlon Brando  ['soundtrack', 'director']['tt0070849', 'tt0047296', ... +2] │\n│ nm0000009Richard Burton ['soundtrack', 'producer']['tt0087803', 'tt0059749', ... +2] │\n│ nm0000010James Cagney   ['soundtrack', 'director']['tt0035575', 'tt0042041', ... +2] │\n│                                   │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n1. This `filter` call is applied to each array element\n2. This `filter` call is applied to the table\n\n:::\n\n### Applying a function to array elements\n\nYou can apply a function to run an ibis expression on each element of an array\nusing the\n[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)\nmethod.\n\nLet's normalize the case of primary_profession to upper case:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#bd7ce7ac .cell execution_count=38}\n``` {.python .cell-code}\nddb_ents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=38}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession                 known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼─────────────────┼───────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001Fred Astaire   ['SOUNDTRACK', 'ACTOR', ... +1]['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002Lauren Bacall  ['ACTRESS', 'SOUNDTRACK']['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003Brigitte Bardot['ACTRESS', 'SOUNDTRACK', ... +1]['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004John Belushi   ['ACTOR', 'SOUNDTRACK', ... +1]['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005Ingmar Bergman ['WRITER', 'DIRECTOR', ... +1]['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006Ingrid Bergman ['ACTRESS', 'SOUNDTRACK', ... +1]['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007Humphrey Bogart['ACTOR', 'SOUNDTRACK', ... +1]['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008Marlon Brando  ['ACTOR', 'SOUNDTRACK', ... +1]['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009Richard Burton ['ACTOR', 'SOUNDTRACK', ... +1]['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010James Cagney   ['ACTOR', 'SOUNDTRACK', ... +1]['tt0042041', 'tt0035575', ... +2] │\n│                                   │\n└───────────┴─────────────────┴───────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#17587fba .cell execution_count=39}\n``` {.python .cell-code}\nbq_ents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=39}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession                 known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼─────────────────┼───────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001Fred Astaire   ['SOUNDTRACK', 'ACTOR', ... +1]['tt0031983', 'tt0072308', ... +2] │\n│ nm0000002Lauren Bacall  ['ACTRESS', 'SOUNDTRACK']['tt0075213', 'tt0038355', ... +2] │\n│ nm0000003Brigitte Bardot['ACTRESS', 'SOUNDTRACK', ... +1]['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004John Belushi   ['ACTOR', 'SOUNDTRACK', ... +1]['tt0072562', 'tt0077975', ... +2] │\n│ nm0000005Ingmar Bergman ['WRITER', 'DIRECTOR', ... +1]['tt0050986', 'tt0050976', ... +2] │\n│ nm0000006Ingrid Bergman ['ACTRESS', 'SOUNDTRACK', ... +1]['tt0038109', 'tt0034583', ... +2] │\n│ nm0000007Humphrey Bogart['ACTOR', 'SOUNDTRACK', ... +1]['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008Marlon Brando  ['ACTOR', 'SOUNDTRACK', ... +1]['tt0070849', 'tt0047296', ... +2] │\n│ nm0000009Richard Burton ['ACTOR', 'SOUNDTRACK', ... +1]['tt0087803', 'tt0059749', ... +2] │\n│ nm0000010James Cagney   ['ACTOR', 'SOUNDTRACK', ... +1]['tt0035575', 'tt0042041', ... +2] │\n│                                   │\n└───────────┴─────────────────┴───────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n## Conclusion\n\nIbis has a sizable collection of array APIs that work with many different\nbackends and as of version 7.0.0, Ibis supports a much larger set of those APIs\nfor BigQuery!\n\nCheck out [the API\ndocumentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue)\nfor the full set of available methods.\n\nTry it out, and let us know what you think.\n\n", + "markdown": "---\ntitle: Backend agnostic arrays\nauthor: \"Phillip Cloud\"\ndate: 2024-01-19\ncategories:\n - arrays\n - bigquery\n - blog\n - cloud\n - duckdb\n - portability\n---\n\n## Introduction\n\nThis is a redux of a [previous post](../bigquery-arrays/index.qmd) showing\nIbis's portability in action.\n\nIbis is portable across complex operations and backends of very different\nscales and deployment models!\n\n::: {.callout-note}\n\n## Results differ slightly between BigQuery and DuckDB\n\nThe datasets used in each backend are slightly different.\n\nI opted to avoid ETL for the BigQuery backend by reusing the Google-provided\nIMDB dataset.\n\nThe tradeoff is the slight discrepancy in results.\n:::\n\n## Basics\n\nWe'll start with `from ibis.interactive import *` for maximum convenience.\n\n::: {#0b066bd8 .cell execution_count=1}\n``` {.python .cell-code}\nfrom ibis.interactive import * # <1>\n```\n:::\n\n\n1. 
`from ibis.interactive import *` imports Ibis APIs into the global namespace\n and enables [interactive mode](../../how-to/configure/basics.qmd#interactive-mode).\n\n### Connect to your database\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#7caf0298 .cell execution_count=2}\n``` {.python .cell-code}\nddb = ibis.connect(\"duckdb://\")\nddb.create_table( # <1>\n \"name_basics\", ex.imdb_name_basics.fetch(backend=ddb).rename(\"snake_case\")\n) # <1>\nddb.create_table( # <2>\n \"title_basics\", ex.imdb_title_basics.fetch(backend=ddb).rename(\"snake_case\")\n) # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ tconst     title_type  primary_title                                original_title                               is_adult  start_year  end_year  runtime_minutes  genres                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringstringint64int64stringint64string                   │\n├───────────┼────────────┼─────────────────────────────────────────────┼─────────────────────────────────────────────┼──────────┼────────────┼──────────┼─────────────────┼──────────────────────────┤\n│ tt0000001short     Carmencita                                 Carmencita                                 01894NULL1Documentary,Short        │\n│ tt0000002short     Le clown et ses chiens                     Le clown et ses chiens                     01892NULL5Animation,Short          │\n│ tt0000003short     Pauvre Pierrot                             Pauvre Pierrot                             01892NULL4Animation,Comedy,Romance │\n│ tt0000004short     Un bon bock                                Un bon bock                                01892NULL12Animation,Short          │\n│ tt0000005short     Blacksmith Scene                           Blacksmith Scene                           01893NULL1Comedy,Short             │\n│ tt0000006short     Chinese Opium Den                          Chinese Opium Den                          01894NULL1Short                    │\n│ tt0000007short     Corbett and Courtney Before the KinetographCorbett and Courtney Before the Kinetograph01894NULL1Short,Sport              │\n│ tt0000008short     Edison Kinetoscopic Record of a Sneeze     Edison Kinetoscopic Record of a Sneeze     
01894NULL1Documentary,Short        │\n│ tt0000009movie     Miss Jerry                                 Miss Jerry                                 01894NULL45Romance                  │\n│ tt0000010short     Leaving the Factory                        La sortie de l'usine Lumière à Lyon        01895NULL1Documentary,Short        │\n│                         │\n└───────────┴────────────┴─────────────────────────────────────────────┴─────────────────────────────────────────────┴──────────┴────────────┴──────────┴─────────────────┴──────────────────────────┘\n
\n```\n:::\n:::\n\n\n1. Create a table called `name_basics` in our DuckDB database using `ibis.examples` data\n2. Create a table called `title_basics` in our DuckDB database using `ibis.examples` data\n\n## BigQuery\n\n::: {#399dd705 .cell execution_count=3}\n``` {.python .cell-code}\nbq = ibis.connect(\"bigquery://ibis-gbq\")\nbq.set_database(\"bigquery-public-data.imdb\") # <1>\n```\n:::\n\n\n1. Google provides a public BigQuery dataset for IMDB data.\n\n:::\n\nLet's pull out the `name_basics` table, which contains names and metadata about\npeople listed on IMDB. We'll call this `ents` (short for `entities`), and remove some\ncolumns we won't need:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#108c077d .cell execution_count=4}\n``` {.python .cell-code}\nddb_ents = ddb.tables.name_basics.drop(\"birth_year\", \"death_year\")\nddb_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession                   known_for_titles                        ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringstring                                  │\n├───────────┼─────────────────┼─────────────────────────────────────┼─────────────────────────────────────────┤\n│ nm0000001Fred Astaire   soundtrack,actor,miscellaneous     tt0053137,tt0072308,tt0045537,tt0050419 │\n│ nm0000002Lauren Bacall  actress,soundtrack                 tt0037382,tt0117057,tt0075213,tt0038355 │\n│ nm0000003Brigitte Bardotactress,soundtrack,music_departmenttt0057345,tt0054452,tt0049189,tt0056404 │\n│ nm0000004John Belushi   actor,soundtrack,writer            tt0072562,tt0078723,tt0077975,tt0080455 │\n│ nm0000005Ingmar Bergman writer,director,actor              tt0083922,tt0069467,tt0050976,tt0050986 │\n│ nm0000006Ingrid Bergman actress,soundtrack,producer        tt0038109,tt0036855,tt0034583,tt0038787 │\n│ nm0000007Humphrey Bogartactor,soundtrack,producer          tt0037382,tt0034583,tt0042593,tt0043265 │\n│ nm0000008Marlon Brando  actor,soundtrack,director          tt0068646,tt0070849,tt0078788,tt0047296 │\n│ nm0000009Richard Burton actor,soundtrack,producer          tt0057877,tt0059749,tt0061184,tt0087803 │\n│ nm0000010James Cagney   actor,soundtrack,director          tt0042041,tt0035575,tt0029870,tt0031867 │\n│                                        │\n└───────────┴─────────────────┴─────────────────────────────────────┴─────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#9a444397 .cell execution_count=5}\n``` {.python .cell-code}\nbq_ents = bq.tables.name_basics.drop(\"birth_year\", \"death_year\")\nbq_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=5}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession                   known_for_titles                        ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringstring                                  │\n├───────────┼─────────────────┼─────────────────────────────────────┼─────────────────────────────────────────┤\n│ nm0000001Fred Astaire   soundtrack,actor,miscellaneous     tt0072308,tt0053137,tt0031983,tt0050419 │\n│ nm0000002Lauren Bacall  actress,soundtrack                 tt0038355,tt0075213,tt0117057,tt0037382 │\n│ nm0000003Brigitte Bardotactress,soundtrack,music_departmenttt0049189,tt0054452,tt0056404,tt0057345 │\n│ nm0000004John Belushi   actor,soundtrack,writer            tt0072562,tt0078723,tt0080455,tt0077975 │\n│ nm0000005Ingmar Bergman writer,director,actor              tt0050976,tt0083922,tt0069467,tt0050986 │\n│ nm0000006Ingrid Bergman actress,soundtrack,producer        tt0034583,tt0038787,tt0038109,tt0036855 │\n│ nm0000007Humphrey Bogartactor,soundtrack,producer          tt0037382,tt0043265,tt0034583,tt0042593 │\n│ nm0000008Marlon Brando  actor,soundtrack,director          tt0068646,tt0070849,tt0047296,tt0078788 │\n│ nm0000009Richard Burton actor,soundtrack,producer          tt0061184,tt0087803,tt0057877,tt0059749 │\n│ nm0000010James Cagney   actor,soundtrack,director          tt0042041,tt0029870,tt0031867,tt0035575 │\n│                                        │\n└───────────┴─────────────────┴─────────────────────────────────────┴─────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n### Splitting strings into arrays\n\nWe can see that `known_for_titles` looks sort of like an array, so let's call\nthe\n[`split`](../../reference/expression-strings.qmd#ibis.expr.types.strings.StringValue.split)\nmethod on that column and replace the existing column:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#337dbe6a .cell execution_count=6}\n``` {.python .cell-code}\nddb_ents = ddb_ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nddb_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession                   known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringarray<string>                      │\n├───────────┼─────────────────┼─────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001Fred Astaire   soundtrack,actor,miscellaneous     ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002Lauren Bacall  actress,soundtrack                 ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003Brigitte Bardotactress,soundtrack,music_department['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004John Belushi   actor,soundtrack,writer            ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005Ingmar Bergman writer,director,actor              ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006Ingrid Bergman actress,soundtrack,producer        ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007Humphrey Bogartactor,soundtrack,producer          ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008Marlon Brando  actor,soundtrack,director          ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009Richard Burton actor,soundtrack,producer          ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010James Cagney   actor,soundtrack,director          ['tt0042041', 'tt0035575', ... +2] │\n│                                   │\n└───────────┴─────────────────┴─────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#bb1ea1fe .cell execution_count=7}\n``` {.python .cell-code}\nbq_ents = bq_ents.mutate(known_for_titles=_.known_for_titles.split(\",\"))\nbq_ents.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession                   known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringarray<string>                      │\n├───────────┼─────────────────┼─────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001Fred Astaire   soundtrack,actor,miscellaneous     ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000002Lauren Bacall  actress,soundtrack                 ['tt0038355', 'tt0075213', ... +2] │\n│ nm0000003Brigitte Bardotactress,soundtrack,music_department['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004John Belushi   actor,soundtrack,writer            ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005Ingmar Bergman writer,director,actor              ['tt0050976', 'tt0083922', ... +2] │\n│ nm0000006Ingrid Bergman actress,soundtrack,producer        ['tt0034583', 'tt0038787', ... +2] │\n│ nm0000007Humphrey Bogartactor,soundtrack,producer          ['tt0037382', 'tt0043265', ... +2] │\n│ nm0000008Marlon Brando  actor,soundtrack,director          ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009Richard Burton actor,soundtrack,producer          ['tt0061184', 'tt0087803', ... +2] │\n│ nm0000010James Cagney   actor,soundtrack,director          ['tt0042041', 'tt0029870', ... +2] │\n│                                   │\n└───────────┴─────────────────┴─────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nSimilarly for `primary_profession`, since people involved in show business often\nhave more than one responsibility on a project:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#bf33447a .cell execution_count=8}\n``` {.python .cell-code}\nddb_ents = ddb_ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n## BigQuery\n\n::: {#eea20c44 .cell execution_count=9}\n``` {.python .cell-code}\nbq_ents = bq_ents.mutate(primary_profession=_.primary_profession.split(\",\"))\n```\n:::\n\n\n:::\n\n### Array length\n\nLet's see how many titles each entity is known for, and then show the five\npeople with the largest number of titles they're known for.\n\nThis is computed using the\n[`length`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.length)\nAPI on array expressions:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#5f0b413d .cell execution_count=10}\n``` {.python .cell-code}\n(\n ddb_ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name      num_titles ┃\n┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ stringint64      │\n├──────────────────┼────────────┤\n│ Alex Koenigsmark5 │\n│ Carrie Schnelker5 │\n│ Henry Townsend  5 │\n│ Sally Sun       5 │\n│ Matthew Kavuma  5 │\n└──────────────────┴────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#056cde90 .cell execution_count=11}\n``` {.python .cell-code}\n(\n bq_ents.select(\"primary_name\", num_titles=_.known_for_titles.length())\n .order_by(_.num_titles.desc())\n .limit(5)\n)\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ primary_name         num_titles ┃\n┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ stringint64      │\n├─────────────────────┼────────────┤\n│ José Jaime Espinosa6 │\n│ Paul Winter        6 │\n│ Nicolas Bernier    6 │\n│ Chris Estrada      6 │\n│ Tsuyotake Matsuda  5 │\n└─────────────────────┴────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nIt seems like the length of the `known_for_titles` might be capped at some small number!\n\n### Index\n\nWe can see the position of `\"actor\"` or `\"actress\"` in `primary_profession`s:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#cbc557e3 .cell execution_count=12}\n``` {.python .cell-code}\nddb_ents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=12}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64                                      │\n├────────────────────────────────────────────┤\n│                                          1 │\n│                                         -1 │\n│                                         -1 │\n│                                          0 │\n│                                          2 │\n│                                         -1 │\n│                                          0 │\n│                                          0 │\n│                                          0 │\n│                                          0 │\n│                                           │\n└────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n::: {#3eb36d0b .cell execution_count=13}\n``` {.python .cell-code}\nddb_ents.primary_profession.index(\"actress\")\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actress') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64                                        │\n├──────────────────────────────────────────────┤\n│                                           -1 │\n│                                            0 │\n│                                            0 │\n│                                           -1 │\n│                                           -1 │\n│                                            0 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                             │\n└──────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#7efe2a42 .cell execution_count=14}\n``` {.python .cell-code}\nbq_ents.primary_profession.index(\"actor\")\n```\n\n::: {.cell-output .cell-output-display execution_count=14}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actor') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64                                      │\n├────────────────────────────────────────────┤\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                         -1 │\n│                                           │\n└────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n::: {#909e212d .cell execution_count=15}\n``` {.python .cell-code}\nbq_ents.primary_profession.index(\"actress\")\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ ArrayPosition(primary_profession, 'actress') ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ int64                                        │\n├──────────────────────────────────────────────┤\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                           -1 │\n│                                             │\n└──────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nA return value of `-1` indicates that the element being searched for (here `\"actor\"` or `\"actress\"`) is not present in the array.\n\nLet's look for entities that are not primarily actors.\n\nWe can do this using the\n[`index`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.index)\nmethod by checking whether the positions of the strings `\"actor\"` and\n`\"actress\"` are both greater than 0:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#96ef0c2b .cell execution_count=16}\n``` {.python .cell-code}\nactor_index = ddb_ents.primary_profession.index(\"actor\")\nactress_index = ddb_ents.primary_profession.index(\"actress\")\n\nddb_not_primarily_acting = (actor_index > 0) & (actress_index > 0)\nddb_not_primarily_acting.mean()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=16}\n\n::: {.ansi-escaped-output}\n```{=html}\n
0.0
\n```\n:::\n\n:::\n:::\n\n\n## BigQuery\n\n::: {#e10553cc .cell execution_count=17}\n``` {.python .cell-code}\nactor_index = bq_ents.primary_profession.index(\"actor\")\nactress_index = bq_ents.primary_profession.index(\"actress\")\n\nbq_not_primarily_acting = (actor_index > 0) & (actress_index > 0)\nbq_not_primarily_acting.mean()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=17}\n\n::: {.ansi-escaped-output}\n```{=html}\n
0.0
\n```\n:::\n\n:::\n:::\n\n\n:::\n\nWho are they?\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#a4243c7c .cell execution_count=18}\n``` {.python .cell-code}\nddb_ents[ddb_not_primarily_acting].order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=18}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst  primary_name  primary_profession  known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>    │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#40b5ad2f .cell execution_count=19}\n``` {.python .cell-code}\nbq_ents[bq_not_primarily_acting].order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=19}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst  primary_name  primary_profession  known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>    │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nIt's not 100% clear whether the order of elements in `primary_profession` matters here.\n\n### Containment\n\nWe can get people who are not listed as actors or actresses using `contains`:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#db2f9479 .cell execution_count=20}\n``` {.python .cell-code}\nddb_non_actors = bq_ents[\n ~_.primary_profession.contains(\"actor\") & ~_.primary_profession.contains(\"actress\")\n]\nddb_non_actors.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=20}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name      primary_profession                          known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼──────────────────┼────────────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000016Georges Delerue ['composer', 'soundtrack', ... +1]['tt8847712', 'tt0091763', ... +2] │\n│ nm0000025Jerry Goldsmith ['music_department', 'soundtrack', ... +1]['tt0077269', 'tt0117731', ... +2] │\n│ nm0000033Alfred Hitchcock['director', 'producer', ... +1]['tt0054215', 'tt0052357', ... +2] │\n│ nm0000035James Horner    ['music_department', 'soundtrack', ... +1]['tt0177971', 'tt0120338', ... +2] │\n│ nm0000040Stanley Kubrick ['director', 'writer', ... +1]['tt0120663', 'tt0066921', ... +2] │\n│ nm0000041Akira Kurosawa  ['writer', 'director', ... +1]['tt0080979', 'tt0089881', ... +2] │\n│ nm0000049Henry Mancini   ['music_department', 'soundtrack', ... +1]['tt0383216', 'tt0054698', ... +2] │\n│ nm0000055Alfred Newman   ['music_department', 'composer', ... +1]['tt0049408', 'tt0434409', ... +2] │\n│ nm0000065Nino Rota       ['composer', 'soundtrack', ... +1]['tt0071562', 'tt0056801', ... +2] │\n│ nm0000067Miklós Rózsa    ['music_department', 'composer', ... +1]['tt0052618', 'tt0038109', ... +2] │\n│                                   │\n└───────────┴──────────────────┴────────────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#9fb71b70 .cell execution_count=21}\n``` {.python .cell-code}\nbq_non_actors = bq_ents[\n ~_.primary_profession.contains(\"actor\") & ~_.primary_profession.contains(\"actress\")\n]\nbq_non_actors.order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=21}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name      primary_profession                          known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼──────────────────┼────────────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000016Georges Delerue ['composer', 'soundtrack', ... +1]['tt8847712', 'tt0091763', ... +2] │\n│ nm0000025Jerry Goldsmith ['music_department', 'soundtrack', ... +1]['tt0077269', 'tt0117731', ... +2] │\n│ nm0000033Alfred Hitchcock['director', 'producer', ... +1]['tt0054215', 'tt0052357', ... +2] │\n│ nm0000035James Horner    ['music_department', 'soundtrack', ... +1]['tt0177971', 'tt0120338', ... +2] │\n│ nm0000040Stanley Kubrick ['director', 'writer', ... +1]['tt0120663', 'tt0066921', ... +2] │\n│ nm0000041Akira Kurosawa  ['writer', 'director', ... +1]['tt0080979', 'tt0089881', ... +2] │\n│ nm0000049Henry Mancini   ['music_department', 'soundtrack', ... +1]['tt0383216', 'tt0054698', ... +2] │\n│ nm0000055Alfred Newman   ['music_department', 'composer', ... +1]['tt0049408', 'tt0434409', ... +2] │\n│ nm0000065Nino Rota       ['composer', 'soundtrack', ... +1]['tt0071562', 'tt0056801', ... +2] │\n│ nm0000067Miklós Rózsa    ['music_department', 'composer', ... +1]['tt0052618', 'tt0038109', ... +2] │\n│                                   │\n└───────────┴──────────────────┴────────────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n### Element removal\n\nWe can remove elements from arrays too.\n\n::: {.callout-note}\n## [`remove()`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.remove) does not mutate the underlying data\n:::\n\nLet's see who only has \"actor\" in the list of their primary professions:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#551005a1 .cell execution_count=22}\n``` {.python .cell-code}\nddb_ents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n _.primary_profession.remove(\"actress\").length() == 0,\n ]\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=22}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst  primary_name  primary_profession  known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>    │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#6b4f941c .cell execution_count=23}\n``` {.python .cell-code}\nbq_ents.filter(\n [\n _.primary_profession.length() > 0,\n _.primary_profession.remove(\"actor\").length() == 0,\n _.primary_profession.remove(\"actress\").length() == 0,\n ]\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=23}\n```{=html}\n
┏━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓\n┃ nconst  primary_name  primary_profession  known_for_titles ┃\n┡━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>    │\n└────────┴──────────────┴────────────────────┴──────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n### Slicing with square-bracket syntax\n\nLet's remove everyone's first profession from the list, but only if they have\nmore than one profession listed:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#7751efa4 .cell execution_count=24}\n``` {.python .cell-code}\nddb_ents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=24}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession                  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001Fred Astaire   ['actor', 'miscellaneous']['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002Lauren Bacall  ['soundtrack']['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003Brigitte Bardot['soundtrack', 'music_department']['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004John Belushi   ['soundtrack', 'writer']['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005Ingmar Bergman ['director', 'actor']['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006Ingrid Bergman ['soundtrack', 'producer']['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007Humphrey Bogart['soundtrack', 'producer']['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008Marlon Brando  ['soundtrack', 'director']['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009Richard Burton ['soundtrack', 'producer']['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010James Cagney   ['soundtrack', 'director']['tt0042041', 'tt0035575', ... +2] │\n│                                   │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#295ffc5a .cell execution_count=25}\n``` {.python .cell-code}\nbq_ents[_.primary_profession.length() > 1].mutate(\n primary_profession=_.primary_profession[1:],\n).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=25}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession                  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringarray<string>array<string>                      │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001Fred Astaire   ['actor', 'miscellaneous']['tt0072308', 'tt0053137', ... +2] │\n│ nm0000002Lauren Bacall  ['soundtrack']['tt0038355', 'tt0075213', ... +2] │\n│ nm0000003Brigitte Bardot['soundtrack', 'music_department']['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004John Belushi   ['soundtrack', 'writer']['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005Ingmar Bergman ['director', 'actor']['tt0050976', 'tt0083922', ... +2] │\n│ nm0000006Ingrid Bergman ['soundtrack', 'producer']['tt0034583', 'tt0038787', ... +2] │\n│ nm0000007Humphrey Bogart['soundtrack', 'producer']['tt0037382', 'tt0043265', ... +2] │\n│ nm0000008Marlon Brando  ['soundtrack', 'director']['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009Richard Burton ['soundtrack', 'producer']['tt0061184', 'tt0087803', ... +2] │\n│ nm0000010James Cagney   ['soundtrack', 'director']['tt0042041', 'tt0029870', ... +2] │\n│                                   │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n## Set operations and sorting\n\nTreating arrays as sets is possible with the\n[`union`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.union)\nand\n[`intersect`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.intersect)\nAPIs.\n\nLet's take a look at `intersect`.\n\n### Intersection\n\nLet's see if we can use array intersection to figure out which actors share\nknown-for titles and sort the result:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#1d68e19a .cell execution_count=26}\n``` {.python .cell-code}\nleft = ddb_ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=26}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name                  together_with                                 ┃\n┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringarray<string>                                 │\n├──────────────────────┼───────────────────────────────────────────────┤\n│ Ava Gardner         ['Ernest Gold', 'Fred Astaire']               │\n│ Cyd Charisse        ['Fred Astaire']                              │\n│ John Landis         ['Dan Aykroyd', 'Dick Ziker', ... +14]        │\n│ Michael Curtiz      ['Alan Hale', 'Ann Blyth', ... +19]           │\n│ Francis Ford Coppola['Abe Vigoda', 'Al Pacino', ... +19]          │\n│ Bernardo Bertolucci ['Armand Abplanalp', 'James Acheson', ... +3] │\n│ Karl Malden         ['Abraxas Aaran', 'Alex North', ... +14]      │\n│ Richard Conte       ['Abe Vigoda', 'Al Pacino', ... +9]           │\n│ George Orwell       ['John Hurt', 'Richard Burton']               │\n│ Joseph L. Mankiewicz['Alfred Newman', 'Anne Baxter', ... +13]     │\n│                                              │\n└──────────────────────┴───────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#ecaee965 .cell execution_count=27}\n``` {.python .cell-code}\nleft = bq_ents.filter(_.known_for_titles.length() > 0).limit(10_000)\nright = left.view()\nshared_titles = (\n left\n .join(right, left.nconst != right.nconst)\n .select(\n s.startswith(\"known_for_titles\"),\n left_name=\"primary_name\",\n right_name=\"primary_name_right\",\n )\n .filter(_.known_for_titles.intersect(_.known_for_titles_right).length() > 0)\n .group_by(name=\"left_name\")\n .agg(together_with=_.right_name.collect())\n .mutate(together_with=_.together_with.unique().sort())\n)\nshared_titles\n```\n\n::: {.cell-output .cell-output-display execution_count=27}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ name                     together_with                                       ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringarray<string>                                       │\n├─────────────────────────┼─────────────────────────────────────────────────────┤\n│ Pavel Vrba             ['F.C. Lokomotiv Moscow', 'Jeffrey Bruma', ... +4]  │\n│ Greg Carrolan          ['Al Cambronne', 'Alana Tornello', ... +20]         │\n│ Aleksander Parzychowski['Adam Korszun', 'Grzegorz Wawrzenczyk', ... +5]    │\n│ James Walt             ['Anton Testino', 'Ben Walanka', ... +10]           │\n│ Ellen Dallaglio        ['Antonia Giovanazzi', 'Fra McCann', ... +10]       │\n│ Catarina Martins       ['Miguel Oliveira', 'Ricardo Gordon', ... +1]       │\n│ Stanislav Sesták       ['Martin Glenn', 'Miso Brecko', ... +6]             │\n│ Allison Cabot          ['Brenda Beard', 'Brian Fenmore', ... +16]          │\n│ Vasilis Bouzianas      ['Aggelos Kasolas', 'Christos Patriarheas', ... +3] │\n│ Marie Muldoon          ['Alan Oxley', 'Andrew Raeber', ... +39]            │\n│                                                    │\n└─────────────────────────┴─────────────────────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n## Advanced operations\n\n### Flatten arrays into rows\n\nThanks to the [tireless\nefforts](https://github.com/tobymao/sqlglot/commit/06e0869e7aa5714d77e6ec763da38d6a422965fa)\nof the [folks](https://github.com/tobymao/sqlglot/graphs/contributors) working\non [`sqlglot`](https://github.com/tobymao/sqlglot), as of version 7.0.0 Ibis\nsupports\n[`unnest`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.unnest)\nfor BigQuery!\n\nYou can use it standalone on a column expression:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#e5f62712 .cell execution_count=28}\n``` {.python .cell-code}\nddb_ents.primary_profession.unnest()\n```\n\n::: {.cell-output .cell-output-display execution_count=28}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_profession ┃\n┡━━━━━━━━━━━━━━━━━━━━┩\n│ string             │\n├────────────────────┤\n│ soundtrack         │\n│ actor              │\n│ miscellaneous      │\n│ actress            │\n│ soundtrack         │\n│ actress            │\n│ soundtrack         │\n│ music_department   │\n│ actor              │\n│ soundtrack         │\n│                   │\n└────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#39a19645 .cell execution_count=29}\n``` {.python .cell-code}\nbq_ents.primary_profession.unnest()\n```\n\n::: {.cell-output .cell-output-display execution_count=29}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_profession ┃\n┡━━━━━━━━━━━━━━━━━━━━┩\n│ string             │\n├────────────────────┤\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│ actor              │\n│                   │\n└────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nYou can also use it in `select`/`mutate` calls to expand the table accordingly:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#bd167fb0 .cell execution_count=30}\n``` {.python .cell-code}\nddb_ents.mutate(primary_profession=_.primary_profession.unnest()).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=30}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringarray<string>                      │\n├───────────┼─────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm0000001Fred Astaire   soundtrack        ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000001Fred Astaire   miscellaneous     ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000001Fred Astaire   actor             ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002Lauren Bacall  soundtrack        ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000002Lauren Bacall  actress           ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003Brigitte Bardotmusic_department  ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000003Brigitte Bardotsoundtrack        ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000003Brigitte Bardotactress           ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004John Belushi   soundtrack        ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000004John Belushi   writer            ['tt0072562', 'tt0078723', ... +2] │\n│                                   │\n└───────────┴─────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#419d71fd .cell execution_count=31}\n``` {.python .cell-code}\nbq_ents.mutate(primary_profession=_.primary_profession.unnest()).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=31}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst     primary_name     primary_profession  known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringarray<string>                      │\n├───────────┼─────────────────┼────────────────────┼────────────────────────────────────┤\n│ nm0000001Fred Astaire   miscellaneous     ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000001Fred Astaire   actor             ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000001Fred Astaire   soundtrack        ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000002Lauren Bacall  actress           ['tt0038355', 'tt0075213', ... +2] │\n│ nm0000002Lauren Bacall  soundtrack        ['tt0038355', 'tt0075213', ... +2] │\n│ nm0000003Brigitte Bardotmusic_department  ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000003Brigitte Bardotsoundtrack        ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000003Brigitte Bardotactress           ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004John Belushi   soundtrack        ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000004John Belushi   actor             ['tt0072562', 'tt0078723', ... +2] │\n│                                   │\n└───────────┴─────────────────┴────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\nUnnesting can be useful when joining nested data.\n\nHere we use unnest to find people known for any of the Godfather movies:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#3bc2513c .cell execution_count=32}\n``` {.python .cell-code}\nbasics = ddb.tables.title_basics.filter( # <1>\n [\n _.title_type == \"movie\",\n _.original_title.lower().startswith(\"the godfather\"),\n _.genres.lower().contains(\"crime\"),\n ]\n) # <1>\n\nddb_known_for_the_godfather = (\n ddb_ents.mutate(tconst=_.known_for_titles.unnest()) # <2>\n .join(basics, \"tconst\") # <3>\n .select(\"primary_title\", \"primary_name\") # <4>\n .distinct()\n .order_by([\"primary_title\", \"primary_name\"]) # <4>\n)\nddb_known_for_the_godfather\n```\n\n::: {.cell-output .cell-output-display execution_count=32}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title ┃ primary_name        ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string        │ string              │\n├───────────────┼─────────────────────┤\n│ The Godfather │ A. Emmett Adams     │\n│ The Godfather │ Abe Vigoda          │\n│ The Godfather │ Al Lettieri         │\n│ The Godfather │ Al Martino          │\n│ The Godfather │ Al Pacino           │\n│ The Godfather │ Albert S. Ruddy     │\n│ The Godfather │ Alex Rocco          │\n│ The Godfather │ Andrea Eastman      │\n│ The Godfather │ Angelo Infanti      │\n│ The Godfather │ Anna Hill Johnstone │\n│ …             │ …                   │\n└───────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n1. Filter the `title_basics` data set to only the Godfather movies\n2. Unnest the `known_for_titles` array column\n3. Join with `basics` to get movie titles\n4. Ensure that each entity is only listed once and sort the results\n\n## BigQuery\n\n::: {#3f9231e0 .cell execution_count=33}\n``` {.python .cell-code}\nbasics = bq.tables.title_basics.filter( # <1>\n [\n _.title_type == \"movie\",\n _.original_title.lower().startswith(\"the godfather\"),\n _.genres.lower().contains(\"crime\"),\n ]\n) # <1>\n\nbq_known_for_the_godfather = (\n bq_ents.mutate(tconst=_.known_for_titles.unnest()) # <2>\n .join(basics, \"tconst\") # <3>\n .select(\"primary_title\", \"primary_name\") # <4>\n .distinct()\n .order_by([\"primary_title\", \"primary_name\"]) # <4>\n)\nbq_known_for_the_godfather\n```\n\n::: {.cell-output .cell-output-display execution_count=33}\n```{=html}\n
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title ┃ primary_name        ┃\n┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string        │ string              │\n├───────────────┼─────────────────────┤\n│ The Godfather │ A. Emmett Adams     │\n│ The Godfather │ Abe Vigoda          │\n│ The Godfather │ Al Lettieri         │\n│ The Godfather │ Al Martino          │\n│ The Godfather │ Al Pacino           │\n│ The Godfather │ Albert S. Ruddy     │\n│ The Godfather │ Alex Rocco          │\n│ The Godfather │ Andrea Eastman      │\n│ The Godfather │ Angelo Infanti      │\n│ The Godfather │ Anna Hill Johnstone │\n│ …             │ …                   │\n└───────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n1. Filter the `title_basics` data set to only the Godfather movies\n2. Unnest the `known_for_titles` array column\n3. Join with `basics` to get movie titles\n4. Ensure that each entity is only listed once and sort the results\n\n:::\n\nLet's summarize by showing how many people are known for each Godfather movie:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#eb030865 .cell execution_count=34}\n``` {.python .cell-code}\nddb_known_for_the_godfather.primary_title.value_counts()\n```\n\n::: {.cell-output .cell-output-display execution_count=34}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title          ┃ primary_title_count ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string                 │ int64               │\n├────────────────────────┼─────────────────────┤\n│ The Godfather Part II  │                 117 │\n│ The Godfather          │                  93 │\n│ The Godfather Part III │                 196 │\n└────────────────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#4786c6f4 .cell execution_count=35}\n``` {.python .cell-code}\nbq_known_for_the_godfather.primary_title.value_counts()\n```\n\n::: {.cell-output .cell-output-display execution_count=35}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓\n┃ primary_title          ┃ primary_title_count ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩\n│ string                 │ int64               │\n├────────────────────────┼─────────────────────┤\n│ The Godfather Part II  │                 114 │\n│ The Godfather Part III │                 202 │\n│ The Godfather          │                  97 │\n└────────────────────────┴─────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n### Filtering array elements\n\nFiltering array elements can be done with the\n[`filter`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.filter)\nmethod, which applies a predicate to each array element and returns an array of\nelements for which the predicate returns `True`.\n\nThis method is similar to Python's\n[`filter`](https://docs.python.org/3.7/library/functions.html#filter) function.\n\nLet's remove the actor, actress, and editor professions from each person and\nkeep only the people who have at least one profession left:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#d8bdd118 .cell execution_count=36}\n``` {.python .cell-code}\nddb_ents.mutate(\n primary_profession=_.primary_profession.filter( # <1>\n lambda pp: ~pp.isin((\"actor\", \"actress\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\") # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=36}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name    ┃ primary_profession                 ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string          │ array<string>                      │ array<string>                      │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire    │ ['soundtrack', 'miscellaneous']    │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall   │ ['soundtrack']                     │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi    │ ['soundtrack', 'writer']           │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman  │ ['writer', 'director']             │ ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006 │ Ingrid Bergman  │ ['soundtrack', 'producer']         │ ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer']         │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando   │ ['soundtrack', 'director']         │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton  │ ['soundtrack', 'producer']         │ ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney    │ ['soundtrack', 'director']         │ ['tt0042041', 'tt0035575', ... +2] │\n│ …         │ …               │ …                                  │ …                                  │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n1. This `filter` call is applied to each array element\n2. This `filter` call is applied to the table\n\n## BigQuery\n\n::: {#1ad9065c .cell execution_count=37}\n``` {.python .cell-code}\nbq_ents.mutate(\n primary_profession=_.primary_profession.filter( # <1>\n lambda pp: ~pp.isin((\"actor\", \"actress\", \"editor\"))\n )\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\") # <2>\n```\n\n::: {.cell-output .cell-output-display execution_count=37}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name    ┃ primary_profession                 ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string          │ array<string>                      │ array<string>                      │\n├───────────┼─────────────────┼────────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire    │ ['soundtrack', 'miscellaneous']    │ ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000002 │ Lauren Bacall   │ ['soundtrack']                     │ ['tt0038355', 'tt0075213', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['soundtrack', 'music_department'] │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi    │ ['soundtrack', 'writer']           │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman  │ ['writer', 'director']             │ ['tt0050976', 'tt0083922', ... +2] │\n│ nm0000006 │ Ingrid Bergman  │ ['soundtrack', 'producer']         │ ['tt0034583', 'tt0038787', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['soundtrack', 'producer']         │ ['tt0037382', 'tt0043265', ... +2] │\n│ nm0000008 │ Marlon Brando   │ ['soundtrack', 'director']         │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton  │ ['soundtrack', 'producer']         │ ['tt0061184', 'tt0087803', ... +2] │\n│ nm0000010 │ James Cagney    │ ['soundtrack', 'director']         │ ['tt0042041', 'tt0029870', ... +2] │\n│ …         │ …               │ …                                  │ …                                  │\n└───────────┴─────────────────┴────────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n1. This `filter` call is applied to each array element\n2. This `filter` call is applied to the table\n\n:::\n\n### Applying a function to array elements\n\nYou can apply a function that runs an Ibis expression on each element of an\narray using the\n[`map`](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.map)\nmethod.\n\nLet's normalize the case of `primary_profession` to upper case:\n\n::: {.panel-tabset}\n\n## DuckDB\n\n::: {#2935069b .cell execution_count=38}\n``` {.python .cell-code}\nddb_ents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=38}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name    ┃ primary_profession                ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string          │ array<string>                     │ array<string>                      │\n├───────────┼─────────────────┼───────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire    │ ['SOUNDTRACK', 'ACTOR', ... +1]   │ ['tt0053137', 'tt0072308', ... +2] │\n│ nm0000002 │ Lauren Bacall   │ ['ACTRESS', 'SOUNDTRACK']         │ ['tt0037382', 'tt0117057', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0057345', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi    │ ['ACTOR', 'SOUNDTRACK', ... +1]   │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman  │ ['WRITER', 'DIRECTOR', ... +1]    │ ['tt0083922', 'tt0069467', ... +2] │\n│ nm0000006 │ Ingrid Bergman  │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0038109', 'tt0036855', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['ACTOR', 'SOUNDTRACK', ... +1]   │ ['tt0037382', 'tt0034583', ... +2] │\n│ nm0000008 │ Marlon Brando   │ ['ACTOR', 'SOUNDTRACK', ... +1]   │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton  │ ['ACTOR', 'SOUNDTRACK', ... +1]   │ ['tt0057877', 'tt0059749', ... +2] │\n│ nm0000010 │ James Cagney    │ ['ACTOR', 'SOUNDTRACK', ... +1]   │ ['tt0042041', 'tt0035575', ... +2] │\n│ …         │ …               │ …                                 │ …                                  │\n└───────────┴─────────────────┴───────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n## BigQuery\n\n::: {#33931d68 .cell execution_count=39}\n``` {.python .cell-code}\nbq_ents.mutate(\n primary_profession=_.primary_profession.map(lambda pp: pp.upper())\n).filter(_.primary_profession.length() > 0).order_by(\"nconst\")\n```\n\n::: {.cell-output .cell-output-display execution_count=39}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ nconst    ┃ primary_name    ┃ primary_profession                ┃ known_for_titles                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ string    │ string          │ array<string>                     │ array<string>                      │\n├───────────┼─────────────────┼───────────────────────────────────┼────────────────────────────────────┤\n│ nm0000001 │ Fred Astaire    │ ['SOUNDTRACK', 'ACTOR', ... +1]   │ ['tt0072308', 'tt0053137', ... +2] │\n│ nm0000002 │ Lauren Bacall   │ ['ACTRESS', 'SOUNDTRACK']         │ ['tt0038355', 'tt0075213', ... +2] │\n│ nm0000003 │ Brigitte Bardot │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0049189', 'tt0054452', ... +2] │\n│ nm0000004 │ John Belushi    │ ['ACTOR', 'SOUNDTRACK', ... +1]   │ ['tt0072562', 'tt0078723', ... +2] │\n│ nm0000005 │ Ingmar Bergman  │ ['WRITER', 'DIRECTOR', ... +1]    │ ['tt0050976', 'tt0083922', ... +2] │\n│ nm0000006 │ Ingrid Bergman  │ ['ACTRESS', 'SOUNDTRACK', ... +1] │ ['tt0034583', 'tt0038787', ... +2] │\n│ nm0000007 │ Humphrey Bogart │ ['ACTOR', 'SOUNDTRACK', ... +1]   │ ['tt0037382', 'tt0043265', ... +2] │\n│ nm0000008 │ Marlon Brando   │ ['ACTOR', 'SOUNDTRACK', ... +1]   │ ['tt0068646', 'tt0070849', ... +2] │\n│ nm0000009 │ Richard Burton  │ ['ACTOR', 'SOUNDTRACK', ... +1]   │ ['tt0061184', 'tt0087803', ... +2] │\n│ nm0000010 │ James Cagney    │ ['ACTOR', 'SOUNDTRACK', ... +1]   │ ['tt0042041', 'tt0029870', ... +2] │\n│ …         │ …               │ …                                 │ …                                  │\n└───────────┴─────────────────┴───────────────────────────────────┴────────────────────────────────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n## Conclusion\n\nIbis has a sizable collection of array APIs that work with many different\nbackends and as of version 7.0.0, Ibis supports a much larger set of those APIs\nfor BigQuery!\n\nCheck out [the API\ndocumentation](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue)\nfor the full set of available methods.\n\nTry it out, and let us know what you think.\n\n", "supporting": [ "index_files" ], @@ -11,7 +11,7 @@ "\n\n\n\n" ], "include-after-body": [ - "\n" + "\n" ] } } diff --git a/docs/posts/backend-agnostic-arrays/index.qmd b/docs/posts/backend-agnostic-arrays/index.qmd index 7a453701ff77..8aa8a404dcdd 100644 --- a/docs/posts/backend-agnostic-arrays/index.qmd +++ b/docs/posts/backend-agnostic-arrays/index.qmd @@ -1,7 +1,7 @@ --- title: Backend agnostic arrays author: "Phillip Cloud" -date: last-modified +date: 2024-01-19 categories: - arrays - bigquery