diff --git a/docs/_freeze/posts/lms-for-data/index/execute-results/html.json b/docs/_freeze/posts/lms-for-data/index/execute-results/html.json index f2d807c659de..105f0d405ef2 100644 --- a/docs/_freeze/posts/lms-for-data/index/execute-results/html.json +++ b/docs/_freeze/posts/lms-for-data/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "6590bc2b609c6b61f84fe2770b1d9476", + "hash": "689ee5e01c93ba2943e29d758a03c03e", "result": { "engine": "jupyter", - "markdown": "---\ntitle: \"Language models for data\"\nauthor: \"Cody Peterson\"\ndate: \"2024-02-15\"\ncategories:\n - blog\n - llms\n - duckdb\n---\n\n## Overview\n\nThis post will give an overview of how (large) language models (LMs) fit into\ndata engineering, analyst, and science workflows.\n\n## Use cases for LMs in data\n\nThere are three main use cases for language models for data practitioners:\n\n1. Synthetic data generation\n2. Natural language processing\n3. Writing code\n\nWe'll describe each in this section and see them in action in the following\nsections.\n\n### Synthetic data generation\n\nLanguage models can be used to generate synthetic data. This is useful for\ntesting, training, and other purposes. For example, you can use a language model\nto generate synthetic data for a machine learning model.\n\n:::{.callout-tip}\nThis post was re-inspired by the [1 billion row challenge we recently solved\nwith Ibis on three local backends](../1brc/index.qmd), in which synthetic data\ngenerated from a seed file was used to generate a billion rows.\n\nWith language models, we can reproduce this synthetic data and customize the\ndata produced with natural language! 
We'll demonstrate this in a section below.\n:::\n\n### Natural language processing\n\nThis includes tasks like:\n\n- sentiment analysis\n- named entity recognition\n- part of speech tagging\n- summarization\n- translation\n- question answering\n\nEach of these tasks can be, to some extent, solved by traditional natural\nlanguage processing (NLP) techniques. However, modern-day LMs can solve these\ntasks with a single model, and often with state-of-the-art performance. This\ndrastically simplifies what a single engineer, who doesn't need a deep\nunderstanding of NLP or ML in general, can accomplish.\n\n### Writing code\n\nFinally, language models can be used to write code. This is more useful with\nsystems around them to execute code, feed back error messages, make adjustments,\nand so on. There are numerous pitfalls with language models writing code, but\nthey're fairly good at SQL.\n\n## Demonstrating LMs with data\n\nWe'll use [Marvin](https://askmarvin.ai), the AI engineering toolkit, alongside\n[Ibis](https://ibis-project.org), the data engineering toolkit, to demonstrate\nthe capabilities of language models for data using the default DuckDB backend.\n\n:::{.callout-tip}\nWe'll use a cloud service provider (OpenAI) to demonstrate these capabilities.\nIn a follow-up post, we'll explore using local \"open source\" language models to\nachieve the same results.\n:::\n\nWith Marvin and Ibis, you can replicate the workflows below using other AI service providers, local language models, and over 20+ data backends!\n\nLet's start by importing and setting up our code:\n\n::: {#a30dc424 .cell execution_count=1}\n``` {.python .cell-code}\nimport ibis # <1>\nimport marvin # <2>\n\nfrom pydantic import BaseModel, Field # <3>\n\nibis.options.interactive = True # <4>\n```\n:::\n\n\n1. Import Ibis, the data engineering toolkit\n2. Import Marvin, the AI engineering toolkit\n3. Import Pydantic, used to define data models for Marvin\n4. 
Set Ibis to interactive mode to display the results of our queries\n\n## Synthetic data generation\n\nWe'll start by replicating the data in the one billion row challenge, then move\nover to our favorite penguins demo dataset to augment existing data with\nsynthetic data.\n\n### Weather stations\n\nWe can generate synthetic weather stations:\n\n::: {#c5cc0372 .cell execution_count=2}\n``` {.python .cell-code}\nclass WeatherStation(BaseModel):\n station: str = Field(\n ..., description=\"The weather station name\", example=\"Sandy Silicon\"\n )\n temperature: float = Field(\n ..., description=\"The average temperature in Fahrenheit\", example=72.5\n )\n\nstations = marvin.generate(\n target=WeatherStation,\n instructions=\"Generate fictitious but plausible-sounding weather stations with names that excite data nerds\",\n n=3,\n)\nstations\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```\n[WeatherStation(station='Quantum Clouds Observatory', temperature=78.4),\n WeatherStation(station='Cyber Synoptic Lab', temperature=66.2),\n WeatherStation(station='Neural Nexus Watchpoint', temperature=59.6)]\n```\n:::\n:::\n\n\nAnd then load that data into an Ibis table:\n\n:::{.callout-tip}\nYou could also use a user-defined function (UDF) to directly generate this data\nin a table. We'll demonstrate UDFs throughout this post.\n:::\n\n::: {#c5563ac9 .cell execution_count=3}\n``` {.python .cell-code}\ns = ibis.memtable([station.model_dump() for station in stations])\ns\n```\n\n::: {.cell-output .cell-output-display execution_count=3}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ station                     temperature ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ stringfloat64     │\n├────────────────────────────┼─────────────┤\n│ Quantum Clouds Observatory78.4 │\n│ Cyber Synoptic Lab        66.2 │\n│ Neural Nexus Watchpoint   59.6 │\n└────────────────────────────┴─────────────┘\n
\n```\n:::\n:::\n\n\n### Penguin poems\n\nWe can augment existing data with synthetic data. First, let's load the penguins dataset:\n\n::: {#d17a01cb .cell execution_count=4}\n``` {.python .cell-code}\npenguins = ibis.examples.penguins.fetch()\npenguins\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringfloat64float64int64int64stringint64 │\n├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie Torgersen39.118.71813750male  2007 │\n│ Adelie Torgersen39.517.41863800female2007 │\n│ Adelie Torgersen40.318.01953250female2007 │\n│ Adelie TorgersenNULLNULLNULLNULLNULL2007 │\n│ Adelie Torgersen36.719.31933450female2007 │\n│ Adelie Torgersen39.320.61903650male  2007 │\n│ Adelie Torgersen38.917.81813625female2007 │\n│ Adelie Torgersen39.219.61954675male  2007 │\n│ Adelie Torgersen34.118.11933475NULL2007 │\n│ Adelie Torgersen42.020.21904250NULL2007 │\n│  │\n└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\nAnd take a sample of five rows to reduce our AI service costs:\n\n::: {#47c8aaa8 .cell execution_count=5}\n``` {.python .cell-code}\nt = penguins.sample(fraction=0.03).limit(5)\nt\n```\n\n::: {.cell-output .cell-output-display execution_count=5}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringfloat64float64int64int64stringint64 │\n├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie Torgersen34.621.11984400male  2007 │\n│ Adelie Torgersen42.520.71974500male  2007 │\n│ Adelie Dream    36.417.01953325female2007 │\n│ Adelie Torgersen42.119.11954000male  2008 │\n│ Adelie Biscoe   41.020.02034725male  2009 │\n└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\nNow we define a UDF to generate a poem to describe each penguin:\n\n::: {#52b331b9 .cell execution_count=6}\n``` {.python .cell-code}\n@ibis.udf.scalar.python\ndef penguin_poem(\n species: str,\n island: str,\n bill_length_mm: float,\n bill_depth_mm: float,\n flipper_length_mm: float,\n body_mass_g: float,\n) -> str:\n instructions = f\"\"\"Provide a whimsical poem that rhymes for a penguin.\n\n You have the following information about the penguins:\n species {species}\n island of {island}\n bill length of {bill_length_mm} mm\n bill depth of {bill_depth_mm} mm\n flipper length of {flipper_length_mm} mm\n body mass of {body_mass_g} g.\n\n You must reference the penguin's size in addition to its species and island.\n \"\"\"\n\n poem = marvin.generate(\n n=1,\n instructions=instructions,\n )\n\n return poem[0]\n```\n:::\n\n\nAnd apply that UDF to our penguins table:\n\n::: {#83f23731 .cell execution_count=7}\n``` {.python .cell-code}\nt = (\n t.mutate(\n poem=penguin_poem(\n t.species,\n t.island,\n t.bill_length_mm,\n t.bill_depth_mm,\n t.flipper_length_mm,\n t.body_mass_g,\n )\n )\n .relocate(\"species\", \"island\", \"poem\")\n .cache()\n)\nt\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     poem                                                                                bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringstringfloat64float64int64int64stringint64 │\n├─────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie Biscoe   In Biscoe's realm, where icebergs gleam,\\nA dapper Adelie is living the dream.\\nW…38.218.11853950male  2007 │\n│ Adelie TorgersenIn Torgersen's land of icy scenes,\\nLives a penguin, slick and keen.\\nAdelie by n…35.119.41934200male  2008 │\n│ Adelie Dream    In the land where dreams weave through the chill,\\nA penguin dwells with a froli… 36.918.61893500female2008 │\n│ Adelie Dream    In the heart of the island of Dream so chill,\\nAn Adelie penguin stands very sti… 32.115.51883050female2009 │\n│ Adelie Dream    In the land of snow and ice,\\nWithin Dream Island’s frosty thrice,\\nLives an Adel…37.316.81923000female2009 │\n└─────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\nNice! While not particularly useful in this case, the same process can be used\nfor generating product descriptions or other practical applications.\n\n## Natural language processing\n\n### Sentiment analysis\n\nWe can use a language model to perform sentiment analysis on the penguin poems:\n\n::: {#2080c60a .cell execution_count=8}\n``` {.python .cell-code}\n@marvin.fn\ndef _sentiment_analysis(text: str) -> float:\n \"\"\"Returns a sentiment score for `text`\n between -1 (negative) and 1 (positive).\"\"\"\n\n\n@ibis.udf.scalar.python\ndef sentiment_analysis(text: str) -> float:\n return _sentiment_analysis(text)\n```\n:::\n\n\nAnd apply that UDF to our penguins table:\n\n::: {#7d998278 .cell execution_count=9}\n``` {.python .cell-code}\nt = (\n t.mutate(sentiment=sentiment_analysis(t.poem))\n .relocate(t.columns[:3], \"sentiment\")\n .cache()\n)\nt\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=9}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     poem                                                                                sentiment  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringstringfloat64float64float64int64int64stringint64 │\n├─────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie Biscoe   In Biscoe's realm, where icebergs gleam,\\nA dapper Adelie is living the dream.\\nW…0.7538.218.11853950male  2007 │\n│ Adelie TorgersenIn Torgersen's land of icy scenes,\\nLives a penguin, slick and keen.\\nAdelie by n…0.9035.119.41934200male  2008 │\n│ Adelie Dream    In the land where dreams weave through the chill,\\nA penguin dwells with a froli… 0.8036.918.61893500female2008 │\n│ Adelie Dream    In the heart of the island of Dream so chill,\\nAn Adelie penguin stands very sti… 0.8032.115.51883050female2009 │\n│ Adelie Dream    In the land of snow and ice,\\nWithin Dream Island’s frosty thrice,\\nLives an Adel…0.8037.316.81923000female2009 │\n└─────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\n### Entity extraction\n\nWhile not exactly named entity recognition, we can extract arbitrary entities from text. In this case, we'll extract a list of words that rhyme from the poem:\n\n::: {#8e9338c0 .cell execution_count=10}\n``` {.python .cell-code}\n@ibis.udf.scalar.python\ndef extract_rhyming_words(text: str) -> list[str]:\n words = marvin.extract(\n text,\n instructions=\"Extract the primary rhyming words from the text\",\n )\n\n return words\n```\n:::\n\n\nAnd apply that UDF to our penguins table:\n\n::: {#53601c39 .cell execution_count=11}\n``` {.python .cell-code}\nt = (\n t.mutate(rhyming_words=extract_rhyming_words(t.poem))\n .relocate(t.columns[:4], \"rhyming_words\")\n .cache()\n)\nt\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     poem                                                                                sentiment  rhyming_words                bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringstringfloat64array<string>float64float64int64int64stringint64 │\n├─────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────┼───────────┼─────────────────────────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie Biscoe   In Biscoe's realm, where icebergs gleam,\\nA dapper Adelie is living the dream.\\nW…0.75['realm', 'gleam', ... +10]38.218.11853950male  2007 │\n│ Adelie TorgersenIn Torgersen's land of icy scenes,\\nLives a penguin, slick and keen.\\nAdelie by n…0.90['scenes', 'keen', ... +9]35.119.41934200male  2008 │\n│ Adelie Dream    In the land where dreams weave through the chill,\\nA penguin dwells with a froli… 0.80['chill', 'will', ... +10]36.918.61893500female2008 │\n│ Adelie Dream    In the heart of the island of Dream so chill,\\nAn Adelie penguin stands very sti… 0.80['chill', 'still', ... +10]32.115.51883050female2009 │\n│ Adelie Dream    In the land of snow and ice,\\nWithin Dream Island’s frosty thrice,\\nLives an Adel…0.80['ice', 'thrice', ... 
+10]37.316.81923000female2009 │\n└─────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────┴───────────┴─────────────────────────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\n### Translation\n\nWe can translate the penguin poems:\n\n::: {#4939ddce .cell execution_count=12}\n``` {.python .cell-code}\n@marvin.fn\ndef _translate_text(text: str, target_language: str = \"spanish\") -> str:\n \"\"\"Translate `text` to `target_language`.\"\"\"\n\n\n@ibis.udf.scalar.python\ndef translate_text(text: str, target_language: str = \"spanish\") -> str:\n return _translate_text(text, target_language)\n```\n:::\n\n\nAnd apply that UDF to our penguins table:\n\n::: {#2a1b6115 .cell execution_count=13}\n``` {.python .cell-code}\nt = (\n t.mutate(translated_poem=translate_text(t.poem))\n .relocate(t.columns[:5], \"translated_poem\")\n .cache()\n)\nt\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     poem                                                                                sentiment  rhyming_words                translated_poem                                                                    bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringstringfloat64array<string>stringfloat64float64int64int64stringint64 │\n├─────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────┼───────────┼─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie Biscoe   In Biscoe's realm, where icebergs gleam,\\nA dapper Adelie is living the dream.\\nW…0.75['realm', 'gleam', ... +10]En el reino de Biscoe, donde brillan los icebergs,\\nun elegante Adelie está viv…38.218.11853950male  2007 │\n│ Adelie TorgersenIn Torgersen's land of icy scenes,\\nLives a penguin, slick and keen.\\nAdelie by n…0.90['scenes', 'keen', ... +9]En la tierra de escenarios helados de Torgersen,\\nvive un pingüino, elegante y a…35.119.41934200male  2008 │\n│ Adelie Dream    In the land where dreams weave through the chill,\\nA penguin dwells with a froli… 0.80['chill', 'will', ... 
+10]En la tierra donde los sue\\u00f1os se entretejen a trav\\u00e9s del fr\\u00edo,\\nU…36.918.61893500female2008 │\n│ Adelie Dream    In the heart of the island of Dream so chill,\\nAn Adelie penguin stands very sti… 0.80['chill', 'still', ... +10]En el corazón de la isla de Sueño tan fría,\\nUn pingüino Adelia permanece muy qu…32.115.51883050female2009 │\n│ Adelie Dream    In the land of snow and ice,\\nWithin Dream Island’s frosty thrice,\\nLives an Adel…0.80['ice', 'thrice', ... +10]En la tierra de nieve y hielo,\\nDentro del trino helado de la Isla de los Sue\\u0…37.316.81923000female2009 │\n└─────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────┴───────────┴─────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\n## Writing code\n\nFinally, we can use a language model to write code. Let's define a function that\noutputs SQL:\n\n::: {#3c552d8f .cell execution_count=14}\n``` {.python .cell-code}\n@marvin.fn\ndef _text_to_sql(\n text: str,\n table_names: list[str],\n table_schemas: list[str],\n table_previews: list[str],\n) -> str:\n \"\"\"Writes a SQL SELECT statement for the `text` given the provided `table_names`, `table_schemas`, and `table_previews`.\"\"\"\n\n\ndef text_to_sql(\n text: str,\n table_names: list[str],\n table_schemas: list[str],\n table_previews: list[str],\n) -> str:\n return _text_to_sql(text, table_names, table_schemas, table_previews).strip(\";\")\n```\n:::\n\n\nWe can try that out on our penguins table:\n\n::: {#5227d886 .cell execution_count=15}\n``` {.python .cell-code}\ntext = \"the count of penguins by species, from highest to lowest, per each island\"\n\ntable_names = [\"penguins\"]\ntable_schemas = [str(penguins.schema())]\ntable_previews = [str(penguins.limit(5))]\n\nsql = text_to_sql(text, table_names, table_schemas, table_previews)\nprint(sql)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\nSELECT species, island, COUNT(*) as count FROM penguins GROUP BY species, island ORDER BY count DESC, species, island\n```\n:::\n:::\n\n\nAnd execute the SQL:\n\n::: {#2a5a3b96 .cell execution_count=16}\n``` {.python .cell-code}\nr = penguins.sql(sql)\nr\n```\n\n::: {.cell-output .cell-output-display execution_count=16}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓\n┃ species    island     count ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩\n│ stringstringint64 │\n├───────────┼───────────┼───────┤\n│ Gentoo   Biscoe   124 │\n│ ChinstrapDream    68 │\n│ Adelie   Dream    56 │\n│ Adelie   Torgersen52 │\n│ Adelie   Biscoe   44 │\n└───────────┴───────────┴───────┘\n
\n```\n:::\n:::\n\n\n### A more complex example\n\nLet's see how this works on a query that requires joining two tables. We'll load in some IMDB data:\n\n::: {#d6d41ec1 .cell execution_count=17}\n``` {.python .cell-code}\nimdb_title_basics = ibis.examples.imdb_title_basics.fetch()\nimdb_title_basics\n```\n\n::: {.cell-output .cell-output-display execution_count=17}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ tconst     titleType  primaryTitle                                 originalTitle                                isAdult  startYear  endYear  runtimeMinutes  genres                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringstringint64int64stringint64string                   │\n├───────────┼───────────┼─────────────────────────────────────────────┼─────────────────────────────────────────────┼─────────┼───────────┼─────────┼────────────────┼──────────────────────────┤\n│ tt0000001short    Carmencita                                 Carmencita                                 01894NULL1Documentary,Short        │\n│ tt0000002short    Le clown et ses chiens                     Le clown et ses chiens                     01892NULL5Animation,Short          │\n│ tt0000003short    Pauvre Pierrot                             Pauvre Pierrot                             01892NULL4Animation,Comedy,Romance │\n│ tt0000004short    Un bon bock                                Un bon bock                                01892NULL12Animation,Short          │\n│ tt0000005short    Blacksmith Scene                           Blacksmith Scene                           01893NULL1Comedy,Short             │\n│ tt0000006short    Chinese Opium Den                          Chinese Opium Den                          01894NULL1Short                    │\n│ tt0000007short    Corbett and Courtney Before the KinetographCorbett and Courtney Before the Kinetograph01894NULL1Short,Sport              │\n│ tt0000008short    Edison Kinetoscopic Record of a Sneeze     Edison Kinetoscopic Record of a Sneeze     01894NULL1Documentary,Short        
│\n│ tt0000009movie    Miss Jerry                                 Miss Jerry                                 01894NULL45Romance                  │\n│ tt0000010short    Leaving the Factory                        La sortie de l'usine Lumière à Lyon        01895NULL1Documentary,Short        │\n│                         │\n└───────────┴───────────┴─────────────────────────────────────────────┴─────────────────────────────────────────────┴─────────┴───────────┴─────────┴────────────────┴──────────────────────────┘\n
\n```\n:::\n:::\n\n\n::: {#c0be0470 .cell execution_count=18}\n``` {.python .cell-code}\nimdb_title_ratings = ibis.examples.imdb_title_ratings.fetch()\nimdb_title_ratings\n```\n\n::: {.cell-output .cell-output-display execution_count=18}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ tconst     averageRating  numVotes ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━┩\n│ stringfloat64int64    │\n├───────────┼───────────────┼──────────┤\n│ tt00000015.71990 │\n│ tt00000025.8265 │\n│ tt00000036.51869 │\n│ tt00000045.5177 │\n│ tt00000056.22655 │\n│ tt00000065.0182 │\n│ tt00000075.4831 │\n│ tt00000085.42132 │\n│ tt00000095.3206 │\n│ tt00000106.97268 │\n│  │\n└───────────┴───────────────┴──────────┘\n
\n```\n:::\n:::\n\n\n::: {#f7806dd0 .cell execution_count=19}\n``` {.python .cell-code}\ntext = \"the highest rated movies w/ over 100k ratings -- movies only\"\n\ntable_names = [\"imdb_title_basics\", \"imdb_title_ratings\"]\ntable_schemas = [str(imdb_title_basics.schema()), str(imdb_title_ratings.schema())]\ntable_previews = [str(imdb_title_basics.limit(5)), str(imdb_title_ratings.limit(5))]\n\nsql = text_to_sql(text, table_names, table_schemas, table_previews)\nprint(sql)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\nSELECT\n    ibt.primaryTitle AS movie_title,\n    itr.averageRating AS rating,\n    itr.numVotes AS votes\nFROM\n    imdb_title_basics AS ibt\nJOIN\n    imdb_title_ratings AS itr\n    ON ibt.tconst = itr.tconst\nWHERE\n    ibt.titleType = 'movie'\n    AND itr.numVotes > 100000\nORDER BY\n    itr.averageRating DESC\n```\n:::\n:::\n\n\n::: {#a91fef5c .cell execution_count=20}\n``` {.python .cell-code}\nimdb_title_basics.sql(sql)\n```\n\n::: {.cell-output .cell-output-display execution_count=20}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓\n┃ movie_title                                    rating   votes   ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩\n│ stringfloat64int64   │\n├───────────────────────────────────────────────┼─────────┼─────────┤\n│ The Shawshank Redemption                     9.32793065 │\n│ The Godfather                                9.21945537 │\n│ The Lord of the Rings: The Return of the King9.01912955 │\n│ The Dark Knight                              9.02773504 │\n│ 12 Angry Men                                 9.0829776 │\n│ The Godfather Part II                        9.01321642 │\n│ Schindler's List                             9.01404536 │\n│ Pulp Fiction                                 8.92142607 │\n│ Fight Club                                   8.82227179 │\n│ The Lord of the Rings: The Two Towers        8.81726199 │\n│  │\n└───────────────────────────────────────────────┴─────────┴─────────┘\n
\n```\n:::\n:::\n\n\n## Issues with LMs today\n\n## Looking forward\n\n## Next steps\n\nIn a future post...\n\n", + "markdown": "---\ntitle: \"Language models for data\"\nauthor: \"Cody Peterson\"\ndate: \"2024-02-05\"\ncategories:\n - blog\n - llms\n - duckdb\n---\n\n## Overview\n\nThis post will give an overview of how (large) language models (LMs) fit into\ndata engineering, analyst, and science workflows.\n\n## Use cases\n\nThere are three main use cases for language models for data practitioners:\n\n1. Synthetic data generation\n2. Natural language processing\n3. Writing code\n\nWe'll describe each and then demonstrate them with code.\n\n## Setup\n\nWe'll use [Marvin](https://askmarvin.ai), the AI engineering toolkit, alongside\n[Ibis](https://ibis-project.org), the data engineering toolkit, to demonstrate\nthe capabilities of language models for data using the default DuckDB backend.\n\n:::{.callout-tip}\nWe'll use a cloud service provider (OpenAI) to demonstrate these capabilities.\nIn a follow-up post, we'll explore using local \"open source\" language models to\nachieve the same results.\n:::\n\nWith Marvin and Ibis, you can replicate the workflows below using other AI\nservice providers, local language models, and over 20+ data backends!\n\nYou'll need to install Marvin and Ibis to follow along:\n\n```{.bash}\npip install 'ibis-framework[duckdb,examples]' marvin\n```\n\nThen import them and turn on Ibis interactive mode:\n\n::: {#6776d992 .cell execution_count=1}\n``` {.python .cell-code}\nimport ibis # <1>\nimport marvin # <2>\n\nfrom pydantic import BaseModel, Field # <3>\n\nibis.options.interactive = True # <4>\n```\n:::\n\n\n1. Import Ibis, the data engineering toolkit\n2. Import Marvin, the AI engineering toolkit\n3. Import Pydantic, used to define data models for Marvin\n4. Set Ibis to interactive mode to display the results of our queries\n\n## Synthetic data generation\n\nLanguage models can be used to generate synthetic data. 
This is useful for\ntesting, training, and other purposes. For example, you can use a language model\nto generate synthetic data for training a machine learning model (including a\nlanguage model).\n\n:::{.callout-tip}\nThis post was re-inspired by the [1 billion row challenge we recently solved\nwith Ibis on three local backends](../1brc/index.qmd) in which synthetic data\ngenerated from a seed file was used to generate a billion rows.\n\nWith language models, we can reproduce this synthetic data and customize the\ndata produced with natural language!\n:::\n\nWe'll start by replicating the data in the one billion row challenge, then move\nover to our favorite penguins demo dataset to augment existing data with\nsynthetic data.\n\n### Weather stations\n\nWe can generate synthetic weather stations in a few lines of code:\n\n::: {#d7c5ecaa .cell execution_count=2}\n``` {.python .cell-code}\nclass WeatherStation(BaseModel): # <1>\n station: str = Field( # <1>\n ..., description=\"The weather station name\", example=\"Sandy Silicon\" # <1>\n ) # <1>\n temperature: float = Field( # <1>\n ..., description=\"The average temperature in Fahrenheit\", example=72.5 # <1>\n ) # <1>\n\nstations = marvin.generate( # <2>\n target=WeatherStation, # <2>\n instructions=\"Generate fictitious but plausible-sounding weather stations with names that excite data nerds\", # <2>\n n=3, # <2>\n) # <2>\nstations\n```\n\n::: {.cell-output .cell-output-display execution_count=2}\n```\n[WeatherStation(station='Quantum Precip Analytics', temperature=63.4),\n WeatherStation(station='Data Breeze Hub', temperature=78.1),\n WeatherStation(station='Neural Nimbus Center', temperature=54.3)]\n```\n:::\n:::\n\n\n1. Define a data model for the weather stations\n2. Use Marvin to generate three weather stations\n\nAnd then load that data into an Ibis table:\n\n:::{.callout-tip}\nYou could also use a user-defined function (UDF) to directly generate this data\nin a table. 
We'll demonstrate UDFs throughout this post.\n:::\n\n::: {#dc3a3ce1 .cell execution_count=3}\n``` {.python .cell-code}\ns = ibis.memtable([station.model_dump() for station in stations]) # <1>\ns\n```\n\n::: {.cell-output .cell-output-display execution_count=3}\n```{=html}\n
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ station                   temperature ┃\n┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ stringfloat64     │\n├──────────────────────────┼─────────────┤\n│ Quantum Precip Analytics63.4 │\n│ Data Breeze Hub         78.1 │\n│ Neural Nimbus Center    54.3 │\n└──────────────────────────┴─────────────┘\n
\n```\n:::\n:::\n\n\n1. Convert the generated data to an Ibis table\n\nWhile we've only generated three weather stations, you can repeat this process\nuntil you get as many as you'd like! You can then use Ibis to generate a billion\nrows of weather data for these stations as in the one billion row challenge.\n\n:::{.callout-warning}\nRunning this with GPT-4-turbo to generate 1000 weather stations costs about\n$4 USD. This is rather expensive for synthetic data! You can mitigate this by\nusing a cheaper model (e.g. GPT-3.5-turbo) but may get worse results.\n\nAlternatively, you can generate the data for free on your laptop with a small\nopen source language model! Look out for a future post exploring this option.\n:::\n\n### Penguin poems\n\nWe can augment existing data with synthetic data. First, let's load the penguins\ndataset:\n\n::: {#6cd0b217 .cell execution_count=4}\n``` {.python .cell-code}\npenguins = ibis.examples.penguins.fetch() # <1>\npenguins\n```\n\n::: {.cell-output .cell-output-display execution_count=4}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringfloat64float64int64int64stringint64 │\n├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie Torgersen39.118.71813750male  2007 │\n│ Adelie Torgersen39.517.41863800female2007 │\n│ Adelie Torgersen40.318.01953250female2007 │\n│ Adelie TorgersenNULLNULLNULLNULLNULL2007 │\n│ Adelie Torgersen36.719.31933450female2007 │\n│ Adelie Torgersen39.320.61903650male  2007 │\n│ Adelie Torgersen38.917.81813625female2007 │\n│ Adelie Torgersen39.219.61954675male  2007 │\n│ Adelie Torgersen34.118.11933475NULL2007 │\n│ Adelie Torgersen42.020.21904250NULL2007 │\n│  │\n└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\n1. Load the penguins dataset from Ibis examples\n\nAnd take a sample of five rows to reduce our AI service costs:\n\n::: {#8502ce26 .cell execution_count=5}\n``` {.python .cell-code}\nt = penguins.sample(fraction=0.025).limit(5) # <1>\nt\n```\n\n::: {.cell-output .cell-output-display execution_count=5}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringfloat64float64int64int64stringint64 │\n├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie Dream    39.618.11864450male  2008 │\n│ Adelie Dream    33.116.11782900female2008 │\n│ Adelie Torgersen35.717.01893350female2009 │\n│ Gentoo Biscoe   40.913.72144650female2007 │\n│ Gentoo Biscoe   44.514.32164100NULL2007 │\n└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\n1. Sample a fraction of the data, then take the first five rows\n\nNow we define a UDF to generate a poem to describe each penguin:\n\n::: {#4f1a91e1 .cell execution_count=6}\n``` {.python .cell-code}\n@ibis.udf.scalar.python # <1>\ndef penguin_poem( # <1>\n species: str, # <1>\n island: str, # <1>\n bill_length_mm: float, # <1>\n bill_depth_mm: float, # <1>\n flipper_length_mm: float, # <1>\n body_mass_g: float, # <1>\n) -> str: # <1>\n # <2>\n instructions = f\"\"\"Provide a whimsical poem that rhymes for a penguin.\n\n You have the following information about the penguins:\n species {species}\n island of {island}\n bill length of {bill_length_mm} mm\n bill depth of {bill_depth_mm} mm\n flipper length of {flipper_length_mm} mm\n body mass of {body_mass_g} g.\n\n You must reference the penguin's size in addition to its species and island.\n \"\"\" # <2>\n\n poem = marvin.generate( # <3>\n n=1, # <3>\n instructions=instructions, # <3>\n ) # <3>\n\n return poem[0] # <4>\n```\n:::\n\n\n1. Define a scalar Python UDF to generate a poem from penguin data\n2. Augment the LM's prompt with the penguin data\n3. Use Marvin to generate a poem for the penguin data\n4. Return the generated poem\n\nAnd apply that UDF to our penguins table:\n\n::: {#c8e00b5d .cell execution_count=7}\n``` {.python .cell-code}\nt = (\n t.mutate( # <1>\n poem=penguin_poem( # <1>\n t.species, # <1>\n t.island, # <1>\n t.bill_length_mm, # <1>\n t.bill_depth_mm, # <1>\n t.flipper_length_mm, # <1>\n t.body_mass_g, # <1>\n ) # <1>\n )\n .relocate(\"species\", \"island\", \"poem\") # <2>\n .cache() # <3>\n)\nt\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     poem                                                                                bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringstringfloat64float64int64int64stringint64 │\n├─────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie TorgersenIn Torgersen's isle, so lively and free,\\nAn Adelie penguin, happy as can be.\\nWi…38.917.81813625female2007 │\n│ Adelie TorgersenIn Torgersen's land of icy blue,\\nAn Adelie penguin, quite the view.\\nWith a bill…39.718.41903900male  2008 │\n│ Gentoo Biscoe   In Biscoe's isle where icebergs gleam,                                            43.313.42094400female2007 │\n│ Gentoo Biscoe   In the winds where the cold airs flow,\\nOn Biscoe's isle, where soft mosses grow… 46.214.52094800female2007 │\n│ Gentoo Biscoe   A penguin of species Gentoo,\\nOn Biscoe Island, he dances in lieu,\\nWith a bill 4…44.514.32164100NULL2007 │\n└─────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\n1. Apply the UDF by mutating the table with the function, using other columns as\n input\n2. Rearrange the columns we care about to the front\n3. Cache the table to avoid re-running the UDF\n\n:::{.callout-note title=\"Is this RAG?\" collapse=\"true\"}\nRetrieval augmented generation (RAG) is a technique for retrieving information,\naugmenting the prompt input to a model with that information, and using it to\ngenerate a response that incorporates the retrieved information for improved\naccuracy.\n\nIn the example above, we've retrieved information from a table (that could be in\na remote database) and used it to augment the prompt input to a language model.\nThe model then generated a response that incorporates the retrieved information.\n\n> I would like to abolish the term RAG and instead just agree that we should\n> always try to provide models with the appropriate context to provide high\n> quality answers. - [Hamel\n> Husain](https://twitter.com/HamelHusain/status/1709740984643596768)\n\nI would consider this RAG, but I also consider RAG to be a silly term.\n:::\n\nNice! While not particularly useful in this case, the same process can be used\nfor generating product descriptions or other practical applications.\n\n## Natural language processing\n\nThis includes tasks like:\n\n- sentiment analysis\n- named entity recognition\n- part of speech tagging\n- summarization\n- translation\n- question answering\n\nEach of these tasks can be, to some extent, solved by traditional natural\nlanguage processing (NLP) techniques. However, modern-day LMs can solve these\ntasks with a single model, and often with state-of-the-art performance. 
This\ndrastically simplifies what a single engineer, who doesn't need a deep\nunderstanding of NLP or ML in general, can accomplish.\n\n### Sentiment analysis\n\nWe can use a language model to perform sentiment analysis on the penguin poems:\n\n::: {#7a6e87c7 .cell execution_count=8}\n``` {.python .cell-code}\n@marvin.fn # <1>\ndef _sentiment_analysis(text: str) -> float: # <1>\n \"\"\"Returns a sentiment score for `text`\n between -1 (negative) and 1 (positive).\"\"\"\n# <1>\n\n@ibis.udf.scalar.python # <2>\ndef sentiment_analysis(text: str) -> float: # <2>\n return _sentiment_analysis(text) # <3>\n```\n:::\n\n\n1. Define a Marvin function to perform sentiment analysis\n2. Define a scalar Python UDF to apply the Marvin function to a column\n3. Apply the Marvin function within the UDF\n\nAnd apply that UDF to our penguins table:\n\n::: {#d48ad4a4 .cell execution_count=9}\n``` {.python .cell-code}\nt = (\n t.mutate(sentiment=sentiment_analysis(t.poem)) # <1>\n .relocate(t.columns[:3], \"sentiment\") # <2>\n .cache() # <3>\n)\nt\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=9}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     poem                                                                                sentiment  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringstringfloat64float64float64int64int64stringint64 │\n├─────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie TorgersenIn Torgersen's isle, so lively and free,\\nAn Adelie penguin, happy as can be.\\nWi…0.838.917.81813625female2007 │\n│ Adelie TorgersenIn Torgersen's land of icy blue,\\nAn Adelie penguin, quite the view.\\nWith a bill…0.239.718.41903900male  2008 │\n│ Gentoo Biscoe   In Biscoe's isle where icebergs gleam,                                            0.043.313.42094400female2007 │\n│ Gentoo Biscoe   In the winds where the cold airs flow,\\nOn Biscoe's isle, where soft mosses grow… 0.046.214.52094800female2007 │\n│ Gentoo Biscoe   A penguin of species Gentoo,\\nOn Biscoe Island, he dances in lieu,\\nWith a bill 4…0.044.514.32164100NULL2007 │\n└─────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\n1. Apply the UDF by mutating the table with the function\n2. Rearrange the columns we care about to the front\n3. Cache the table to avoid re-running the UDF\n\n### Entity extraction\n\nWhile not exactly named entity recognition, we can extract arbitrary entities\nfrom text. In this case, we'll extract a list of words that rhyme from the poem:\n\n::: {#cea88851 .cell execution_count=10}\n``` {.python .cell-code}\n@ibis.udf.scalar.python # <1>\ndef extract_rhyming_words(text: str) -> list[str]: # <1>\n words = marvin.extract( # <2>\n text, # <2>\n instructions=\"Extract the primary rhyming words from the text\", # <2>\n ) # <2>\n\n return words # <3>\n```\n:::\n\n\n1. Define a scalar Python UDF to extract rhyming words from a poem\n2. Use Marvin to extract the rhyming words from the poem\n3. Return the list of extracted words\n\nAnd apply that UDF to our penguins table:\n\n::: {#da78598c .cell execution_count=11}\n``` {.python .cell-code}\nt = (\n t.mutate(rhyming_words=extract_rhyming_words(t.poem)) # <1>\n .relocate(t.columns[:4], \"rhyming_words\") # <2>\n .cache() # <3>\n)\nt\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     poem                                                                                sentiment  rhyming_words             bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringstringfloat64array<string>float64float64int64int64stringint64 │\n├─────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────┼───────────┼──────────────────────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie TorgersenIn Torgersen's isle, so lively and free,\\nAn Adelie penguin, happy as can be.\\nWi…0.8['free', 'be', ... +8]38.917.81813625female2007 │\n│ Adelie TorgersenIn Torgersen's land of icy blue,\\nAn Adelie penguin, quite the view.\\nWith a bill…0.2['blue', 'view', ... +6]39.718.41903900male  2008 │\n│ Gentoo Biscoe   In Biscoe's isle where icebergs gleam,                                            0.0['isle', 'gleam']43.313.42094400female2007 │\n│ Gentoo Biscoe   In the winds where the cold airs flow,\\nOn Biscoe's isle, where soft mosses grow… 0.0['flow', 'grow', ... +8]46.214.52094800female2007 │\n│ Gentoo Biscoe   A penguin of species Gentoo,\\nOn Biscoe Island, he dances in lieu,\\nWith a bill 4…0.0['lieu', 'neat', ... 
+3]44.514.32164100NULL2007 │\n└─────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────┴───────────┴──────────────────────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\n1. Apply the UDF by mutating the table with the function\n2. Rearrange the columns we care about to the front\n3. Cache the table to avoid re-running the UDF\n\n### Translation\n\nWe can translate the penguin poems into Spanish or any language the language\nmodel sufficiently knows:\n\n::: {#7a79c00f .cell execution_count=12}\n``` {.python .cell-code}\n@marvin.fn # <1>\ndef _translate_text(text: str, target_language: str = \"spanish\") -> str: # <1>\n \"\"\"Translate `text` to `target_language`.\"\"\"\n# <1>\n\n@ibis.udf.scalar.python # <2>\ndef translate_text(text: str, target_language: str = \"spanish\") -> str: # <2>\n return _translate_text(text, target_language) # <3>\n```\n:::\n\n\n1. Define a Marvin function to translate text\n2. Define a scalar Python UDF to apply the Marvin function to a column\n3. Apply the Marvin function within the UDF\n\nAnd apply that UDF to our penguins table:\n\n::: {#98bd199d .cell execution_count=13}\n``` {.python .cell-code}\nt = (\n t.mutate(translated_poem=translate_text(t.poem)) # <1>\n .relocate(t.columns[:5], \"translated_poem\") # <2>\n .cache() # <3>\n)\nt\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓\n┃ species  island     poem                                                                                sentiment  rhyming_words             translated_poem                                                                     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year  ┃\n┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩\n│ stringstringstringfloat64array<string>stringfloat64float64int64int64stringint64 │\n├─────────┼───────────┼────────────────────────────────────────────────────────────────────────────────────┼───────────┼──────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤\n│ Adelie TorgersenIn Torgersen's isle, so lively and free,\\nAn Adelie penguin, happy as can be.\\nWi…0.8['free', 'be', ... +8]En la isla Torgersen, tan animada y libre,\\nUn ping\"ino Adelia, feliz como puede… 38.917.81813625female2007 │\n│ Adelie TorgersenIn Torgersen's land of icy blue,\\nAn Adelie penguin, quite the view.\\nWith a bill…0.2['blue', 'view', ... 
+6]En la tierra helada azul de Torgersen,\\nun pingüino Adelia, todo un espectáculo.… 39.718.41903900male  2008 │\n│ Gentoo Biscoe   In Biscoe's isle where icebergs gleam,                                            0.0['isle', 'gleam']En la isla de Biscoe donde brillan los icebergs,                                  43.313.42094400female2007 │\n│ Gentoo Biscoe   In the winds where the cold airs flow,\\nOn Biscoe's isle, where soft mosses grow… 0.0['flow', 'grow', ... +8]En los vientos donde soplan los aires fríos,\\nEn la isla de Biscoe, donde crecen… 46.214.52094800female2007 │\n│ Gentoo Biscoe   A penguin of species Gentoo,\\nOn Biscoe Island, he dances in lieu,\\nWith a bill 4…0.0['lieu', 'neat', ... +3]Un pingüino de la especie Gentoo,\\nen la Isla Biscoe, baila en su lugar,\\ncon un …44.514.32164100NULL2007 │\n└─────────┴───────────┴────────────────────────────────────────────────────────────────────────────────────┴───────────┴──────────────────────────┴────────────────────────────────────────────────────────────────────────────────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘\n
\n```\n:::\n:::\n\n\n1. Apply the UDF by mutating the table with the function\n2. Rearrange the columns we care about to the front\n3. Cache the table to avoid re-running the UDF\n\n## Writing code\n\nFinally, language models can be used to write code. This is more useful with\nsystems around them to execute code, feed back error messages, make adjustments,\nand so on. There are numerous pitfalls with language models writing code, but\nthey're fairly good at SQL.\n\nLet's define a function that outputs SQL:\n\n::: {#6ad388a4 .cell execution_count=14}\n``` {.python .cell-code}\n@marvin.fn # <1>\ndef _text_to_sql(\n text: str,\n table_names: list[str],\n table_schemas: list[str],\n table_previews: list[str],\n) -> str:\n \"\"\"Writes a SQL SELECT statement for the `text` given the provided `table_names`, `table_schemas`, and `table_previews`.\"\"\"\n# <1>\n\ndef text_to_sql( # <2>\n text: str, # <2>\n table_names: list[str], # <2>\n table_schemas: list[str], # <2>\n table_previews: list[str], # <2>\n) -> str: # <2>\n sql = _text_to_sql(text, table_names, table_schemas, table_previews) # <3>\n return sql.strip(\";\") # <4>\n```\n:::\n\n\n1. Define a Marvin function to write SQL from text\n2. Define a Python function to apply the Marvin function to a string\n3. Generate the SQL string\n4. Strip the trailing semicolon that is sometimes generated, as it causes issues\n with Ibis\n\nWe can try that out on our penguins table:\n\n::: {#1409922f .cell execution_count=15}\n``` {.python .cell-code}\ntext = \"the count of penguins by species, from highest to lowest, per each island\" # <1>\n\ntable_names = [\"penguins\"] # <2>\ntable_schemas = [str(penguins.schema())] # <2>\ntable_previews = [str(penguins.limit(5))] # <2>\n\nsql = text_to_sql(text, table_names, table_schemas, table_previews) # <3>\nprint(sql)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\nSELECT species, island, COUNT(*) as count FROM penguins GROUP BY island, species ORDER BY count DESC, island ASC, species ASC\n```\n:::\n:::\n\n\n1. Create a natural language query\n2. Provide the table names, schemas, and previews\n3. Generate the SQL string\n\nAnd execute the SQL:\n\n::: {#fd5eaf5e .cell execution_count=16}\n``` {.python .cell-code}\nr = penguins.sql(sql) # <1>\nr\n```\n\n::: {.cell-output .cell-output-display execution_count=16}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓\n┃ species    island     count ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩\n│ stringstringint64 │\n├───────────┼───────────┼───────┤\n│ Gentoo   Biscoe   124 │\n│ ChinstrapDream    68 │\n│ Adelie   Dream    56 │\n│ Adelie   Torgersen52 │\n│ Adelie   Biscoe   44 │\n└───────────┴───────────┴───────┘\n
\n```\n:::\n:::\n\n\n1. Execute the SQL string on the table\n\n:::{.callout-note title=\"This is definitely RAG\"}\nIn this case, we've retrieved a table's name, schema, and a preview of its data\nto generate a high-quality SQL query. We're definitely doing RAG now.\n\nWhile most RAG implementations use a similarity metric between text converted to\nnumbers (tokens/word embeddings), I find simple text retrieval in a\nwell-organized knowledge base to be more effective. We'll follow up on this in\nfuture posts.\n:::\n\n### A more complex example\n\nLet's see how this works on a query that requires joining two tables. We'll load\nin some IMDB data:\n\n::: {#29a44ee6 .cell execution_count=17}\n``` {.python .cell-code}\nimdb_title_basics = ibis.examples.imdb_title_basics.fetch() # <1>\nimdb_title_basics\n```\n\n::: {.cell-output .cell-output-display execution_count=17}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n┃ tconst     titleType  primaryTitle                                 originalTitle                                isAdult  startYear  endYear  runtimeMinutes  genres                   ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n│ stringstringstringstringint64int64stringint64string                   │\n├───────────┼───────────┼─────────────────────────────────────────────┼─────────────────────────────────────────────┼─────────┼───────────┼─────────┼────────────────┼──────────────────────────┤\n│ tt0000001short    Carmencita                                 Carmencita                                 01894NULL1Documentary,Short        │\n│ tt0000002short    Le clown et ses chiens                     Le clown et ses chiens                     01892NULL5Animation,Short          │\n│ tt0000003short    Pauvre Pierrot                             Pauvre Pierrot                             01892NULL4Animation,Comedy,Romance │\n│ tt0000004short    Un bon bock                                Un bon bock                                01892NULL12Animation,Short          │\n│ tt0000005short    Blacksmith Scene                           Blacksmith Scene                           01893NULL1Comedy,Short             │\n│ tt0000006short    Chinese Opium Den                          Chinese Opium Den                          01894NULL1Short                    │\n│ tt0000007short    Corbett and Courtney Before the KinetographCorbett and Courtney Before the Kinetograph01894NULL1Short,Sport              │\n│ tt0000008short    Edison Kinetoscopic Record of a Sneeze     Edison Kinetoscopic Record of a Sneeze     01894NULL1Documentary,Short        
│\n│ tt0000009movie    Miss Jerry                                 Miss Jerry                                 01894NULL45Romance                  │\n│ tt0000010short    Leaving the Factory                        La sortie de l'usine Lumière à Lyon        01895NULL1Documentary,Short        │\n│                         │\n└───────────┴───────────┴─────────────────────────────────────────────┴─────────────────────────────────────────────┴─────────┴───────────┴─────────┴────────────────┴──────────────────────────┘\n
\n```\n:::\n:::\n\n\n1. Load the IMDB title basics dataset from Ibis examples\n\n::: {#75c00a3f .cell execution_count=18}\n``` {.python .cell-code}\nimdb_title_ratings = ibis.examples.imdb_title_ratings.fetch() # <1>\nimdb_title_ratings\n```\n\n::: {.cell-output .cell-output-display execution_count=18}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ tconst     averageRating  numVotes ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━┩\n│ stringfloat64int64    │\n├───────────┼───────────────┼──────────┤\n│ tt00000015.71990 │\n│ tt00000025.8265 │\n│ tt00000036.51869 │\n│ tt00000045.5177 │\n│ tt00000056.22655 │\n│ tt00000065.0182 │\n│ tt00000075.4831 │\n│ tt00000085.42132 │\n│ tt00000095.3206 │\n│ tt00000106.97268 │\n│  │\n└───────────┴───────────────┴──────────┘\n
\n```\n:::\n:::\n\n\n1. Load the IMDB title ratings dataset from Ibis examples\n\n::: {#f73ca290 .cell execution_count=19}\n``` {.python .cell-code}\ntext = \"the highest rated movies w/ over 100k ratings -- movies only\" # <1>\n\ntable_names = [\"imdb_title_basics\", \"imdb_title_ratings\"] # <2>\ntable_schemas = [str(imdb_title_basics.schema()), str(imdb_title_ratings.schema())] # <2>\ntable_previews = [str(imdb_title_basics.limit(5)), str(imdb_title_ratings.limit(5))] # <2>\n\nsql = text_to_sql(text, table_names, table_schemas, table_previews) # <3>\nprint(sql)\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-display}\n```{=html}\n
\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\nSELECT \n  tb.titleType, \n  tb.primaryTitle, \n  tr.averageRating, \n  tr.numVotes\nFROM imdb_title_basics AS tb\nJOIN imdb_title_ratings AS tr ON tb.tconst = tr.tconst\nWHERE \n  tb.titleType = 'movie' AND\n  tr.numVotes > 100000\nORDER BY tr.averageRating DESC\n```\n:::\n:::\n\n\n1. Create a natural language query\n2. Provide the table names, schemas, and previews\n3. Generate the SQL string\n\n::: {#a7acbd26 .cell execution_count=20}\n``` {.python .cell-code}\nr = imdb_title_basics.sql(sql) # <1>\nr\n```\n\n::: {.cell-output .cell-output-display execution_count=20}\n```{=html}\n
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ titleType  primaryTitle                                   averageRating  numVotes ┃\n┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━┩\n│ stringstringfloat64int64    │\n├───────────┼───────────────────────────────────────────────┼───────────────┼──────────┤\n│ movie    The Shawshank Redemption                     9.32793065 │\n│ movie    The Godfather                                9.21945537 │\n│ movie    The Lord of the Rings: The Return of the King9.01912955 │\n│ movie    The Dark Knight                              9.02773504 │\n│ movie    12 Angry Men                                 9.0829776 │\n│ movie    The Godfather Part II                        9.01321642 │\n│ movie    Schindler's List                             9.01404536 │\n│ movie    Pulp Fiction                                 8.92142607 │\n│ movie    Fight Club                                   8.82227179 │\n│ movie    The Lord of the Rings: The Two Towers        8.81726199 │\n│  │\n└───────────┴───────────────────────────────────────────────┴───────────────┴──────────┘\n
\n```\n:::\n:::\n\n\n1. Execute the SQL string on the table\n\n## Issues with language models today\n\nThe biggest issue with language models today is the cost. Similarly, the time\nfor inferencing these models is the second biggest concern. Generating millions\nof synthetic data points with state-of-the-art LLMs is prohibitively expensive\nand slow.\n\nSmaller open source language models can reduce this cost to practically zero,\nbut at the expense of quality. And generating synthetic data is likely even\nslower.\n\n## Looking forward\n\nIn a future post, we'll explore using open source language models on a laptop to\nachieve similar results. In general, I expect language models with data to\nbecome increasingly commonplace. Some predictions:\n\n- small, specialized language models will become common\n- large language models that require expensive GPU clusters will still be the\n best for general-purpose tasks\n- data assistants will be gimmicky for a while, but will eventually become\n useful\n\nKeep an eye out for our own gimmicky data assistant soon...\n\n## Next steps\n\nTry this out yourself! All you need is an OpenAI account and the code above.\n\nThere's never been a better time to get involved with Ibis. [Join us on\nZulip and introduce yourself!](https://ibis-project.zulipchat.com/)\n\n", "supporting": [ "index_files/figure-html" ], @@ -12,7 +12,7 @@ "\n\n\n\n" ], "include-after-body": [ - "\n" + "\n" ] } } diff --git a/docs/posts/lms-for-data/index.qmd b/docs/posts/lms-for-data/index.qmd index eb8e6d734de2..0fa71fb763f6 100644 --- a/docs/posts/lms-for-data/index.qmd +++ b/docs/posts/lms-for-data/index.qmd @@ -1,7 +1,7 @@ --- title: "Language models for data" author: "Cody Peterson" -date: "2024-02-15" +date: "2024-02-05" categories: - blog - llms @@ -13,7 +13,7 @@ categories: This post will give an overview of how (large) language models (LMs) fit into data engineering, analyst, and science workflows.
-## Use cases for LMs in data +## Use cases There are three main use cases for language models for data practitioners: @@ -21,49 +21,9 @@ There are three main use cases for language models for data practitioners: 2. Natural language processing 3. Writing code -We'll describe each in this section and see them in action in the following -sections. +We'll describe each and then demonstrate them with code. -### Synthetic data generation - -Language models can be used to generate synthetic data. This is useful for -testing, training, and other purposes. For example, you can use a language model -to generate synthetic data for a machine learning model. - -:::{.callout-tip} -This post was re-inspired by the [1 billion row challenge we recently solved -with Ibis on three local backends](../1brc/index.qmd), in which synthetic data -generated from a seed file was used to generate a billion rows. - -With language models, we can reproduce this synthetic data and customize the -data produced with natural language! We'll demonstrate this in a section below. -::: - -### Natural language processing - -This includes tasks like: - -- sentiment analysis -- named entity recognition -- part of speech tagging -- summarization -- translation -- question answering - -Each of these tasks can be, to some extent, solved by traditional natural -language processing (NLP) techniques. However, modern-day LMs can solve these -tasks with a single model, and often with state-of-the-art performance. This -drastically simplifies what a single engineer, who doesn't need a deep -understanding of NLP or ML in general, can accomplish. - -### Writing code - -Finally, language models can be used to write code. This is more useful with -systems around them to execute code, feed back error messages, make adjustments, -and so on. There are numerous pitfalls with language models writing code, but -they're fairly good at SQL. 
- -## Demonstrating LMs with data +## Setup We'll use [Marvin](https://askmarvin.ai), the AI engineering toolkit, alongside [Ibis](https://ibis-project.org), the data engineering toolkit, to demonstrate @@ -75,9 +35,16 @@ In a follow-up post, we'll explore using local "open source" language models to achieve the same results. ::: -With Marvin and Ibis, you can replicate the workflows below using other AI service providers, local language models, and over 20+ data backends! +With Marvin and Ibis, you can replicate the workflows below using other AI +service providers, local language models, and 20+ data backends! + +You'll need to install Marvin and Ibis to follow along: + +```{.bash} +pip install 'ibis-framework[duckdb,examples]' marvin +``` -Let's start by importing and setting up our code: +Then import them and turn on Ibis interactive mode: ```{python} import ibis # <1> @@ -95,31 +62,48 @@ ibis.options.interactive = True # <4> ## Synthetic data generation +Language models can be used to generate synthetic data. This is useful for +testing, training, and other purposes. For example, you can use a language model +to generate synthetic data for training a machine learning model (including a +language model). + +:::{.callout-tip} +This post was re-inspired by the [1 billion row challenge we recently solved +with Ibis on three local backends](../1brc/index.qmd) in which synthetic data +generated from a seed file was used to generate a billion rows. + +With language models, we can reproduce this synthetic data and customize the +data produced with natural language! +::: + We'll start by replicating the data in the one billion row challenge, then move over to our favorite penguins demo dataset to augment existing data with synthetic data.
### Weather stations -We can generate synthetic weather stations: +We can generate synthetic weather stations in a few lines of code: ```{python} -class WeatherStation(BaseModel): - station: str = Field( - ..., description="The weather station name", example="Sandy Silicon" - ) - temperature: float = Field( - ..., description="The average temperature in Fahrenheit", example=72.5 - ) - -stations = marvin.generate( - target=WeatherStation, - instructions="Generate fictitious but plausible-sounding weather stations with names that excite data nerds", - n=3, -) +class WeatherStation(BaseModel): # <1> + station: str = Field( # <1> + ..., description="The weather station name", example="Sandy Silicon" # <1> + ) # <1> + temperature: float = Field( # <1> + ..., description="The average temperature in Fahrenheit", example=72.5 # <1> + ) # <1> + +stations = marvin.generate( # <2> + target=WeatherStation, # <2> + instructions="Generate fictitious but plausible-sounding weather stations with names that excite data nerds", # <2> + n=3, # <2> +) # <2> stations ``` +1. Define a data model for the weather stations +2. Use Marvin to generate three weather stations + And then load that data into an Ibis table: :::{.callout-tip} @@ -128,38 +112,59 @@ in a table. We'll demonstrate UDFs throughout this post. ::: ```{python} -s = ibis.memtable([station.model_dump() for station in stations]) +s = ibis.memtable([station.model_dump() for station in stations]) # <1> s ``` +1. Convert the generated data to an Ibis table + +While we've only generated three weather stations, you can repeat this process +until you get as many as you'd like! You can then use Ibis to generate a billion +rows of weather data for these stations as in the one billion row challenge. + +:::{.callout-warning} +Running this with GPT-4-turbo to generate 1000 weather stations costs about +$4 USD. This is rather expensive for synthetic data! You can mitigate this by +using a cheaper model (e.g. 
GPT-3.5-turbo) but may get worse results. + +Alternatively, you can generate the data for free on your laptop with a small +open source language model! Look out for a future post exploring this option. +::: + ### Penguin poems -We can augment existing data with synthetic data. First, let's load the penguins dataset: +We can augment existing data with synthetic data. First, let's load the penguins +dataset: ```{python} -penguins = ibis.examples.penguins.fetch() +penguins = ibis.examples.penguins.fetch() # <1> penguins ``` +1. Load the penguins dataset from Ibis examples + And take a sample of five rows to reduce our AI service costs: ```{python} -t = penguins.sample(fraction=0.03).limit(5) +t = penguins.sample(fraction=0.025).limit(5) # <1> t ``` +1. Sample a fraction of the data, then take the first five rows + Now we define a UDF to generate a poem to describe each penguin: ```{python} -@ibis.udf.scalar.python -def penguin_poem( - species: str, - island: str, - bill_length_mm: float, - bill_depth_mm: float, - flipper_length_mm: float, - body_mass_g: float, -) -> str: +@ibis.udf.scalar.python # <1> +def penguin_poem( # <1> + species: str, # <1> + island: str, # <1> + bill_length_mm: float, # <1> + bill_depth_mm: float, # <1> + flipper_length_mm: float, # <1> + body_mass_g: float, # <1> +) -> str: # <1> + # <2> instructions = f"""Provide a whimsical poem that rhymes for a penguin. You have the following information about the penguins: @@ -171,127 +176,200 @@ def penguin_poem( body mass of {body_mass_g} g. You must reference the penguin's size in addition to its species and island. - """ + """ # <2> - poem = marvin.generate( - n=1, - instructions=instructions, - ) + poem = marvin.generate( # <3> + n=1, # <3> + instructions=instructions, # <3> + ) # <3> - return poem[0] + return poem[0] # <4> ``` +1. Define a scalar Python UDF to generate a poem from penguin data +2. Augment the LM's prompt with the penguin data +3. 
Use Marvin to generate a poem for the penguin data +4. Return the generated poem + And apply that UDF to our penguins table: ```{python} t = ( - t.mutate( - poem=penguin_poem( - t.species, - t.island, - t.bill_length_mm, - t.bill_depth_mm, - t.flipper_length_mm, - t.body_mass_g, - ) + t.mutate( # <1> + poem=penguin_poem( # <1> + t.species, # <1> + t.island, # <1> + t.bill_length_mm, # <1> + t.bill_depth_mm, # <1> + t.flipper_length_mm, # <1> + t.body_mass_g, # <1> + ) # <1> ) - .relocate("species", "island", "poem") - .cache() + .relocate("species", "island", "poem") # <2> + .cache() # <3> ) t ``` +1. Apply the UDF by mutating the table with the function, using other columns as + input +2. Rearrange the columns we care about to the front +3. Cache the table to avoid re-running the UDF + +:::{.callout-note title="Is this RAG?" collapse="true"} +Retrieval augmented generation (RAG) is a technique for retrieving information, +augmenting the prompt input to a model with that information, and using it to +generate a response that incorporates the retrieved information for improved +accuracy. + +In the example above, we've retrieved information from a table (that could be in +a remote database) and used it to augment the prompt input to a language model. +The model then generated a response that incorporates the retrieved information. + +> I would like to abolish the term RAG and instead just agree that we should +> always try to provide models with the appropriate context to provide high +> quality answers. - [Hamel +> Husain](https://twitter.com/HamelHusain/status/1709740984643596768) + +I would consider this RAG, but I also consider RAG to be a silly term. +::: + Nice! While not particularly useful in this case, the same process can be used for generating product descriptions or other practical applications.
## Natural language processing +This includes tasks like: + +- sentiment analysis +- named entity recognition +- part of speech tagging +- summarization +- translation +- question answering + +Each of these tasks can be, to some extent, solved by traditional natural +language processing (NLP) techniques. However, modern-day LMs can solve these +tasks with a single model, and often with state-of-the-art performance. This +drastically simplifies what a single engineer, who doesn't need a deep +understanding of NLP or ML in general, can accomplish. + ### Sentiment analysis We can use a language model to perform sentiment analysis on the penguin poems: ```{python} -@marvin.fn -def _sentiment_analysis(text: str) -> float: +@marvin.fn # <1> +def _sentiment_analysis(text: str) -> float: # <1> """Returns a sentiment score for `text` between -1 (negative) and 1 (positive).""" +# <1> - -@ibis.udf.scalar.python -def sentiment_analysis(text: str) -> float: - return _sentiment_analysis(text) +@ibis.udf.scalar.python # <2> +def sentiment_analysis(text: str) -> float: # <2> + return _sentiment_analysis(text) # <3> ``` +1. Define a Marvin function to perform sentiment analysis +2. Define a scalar Python UDF to apply the Marvin function to a column +3. Apply the Marvin function within the UDF + And apply that UDF to our penguins table: ```{python} t = ( - t.mutate(sentiment=sentiment_analysis(t.poem)) - .relocate(t.columns[:3], "sentiment") - .cache() + t.mutate(sentiment=sentiment_analysis(t.poem)) # <1> + .relocate(t.columns[:3], "sentiment") # <2> + .cache() # <3> ) t ``` +1. Apply the UDF by mutating the table with the function +2. Rearrange the columns we care about to the front +3. Cache the table to avoid re-running the UDF + ### Entity extraction -While not exactly named entity recognition, we can extract arbitrary entities from text. 
In this case, we'll extract a list of words that rhyme from the poem: +While not exactly named entity recognition, we can extract arbitrary entities +from text. In this case, we'll extract a list of words that rhyme from the poem: ```{python} -@ibis.udf.scalar.python -def extract_rhyming_words(text: str) -> list[str]: - words = marvin.extract( - text, - instructions="Extract the primary rhyming words from the text", - ) - - return words +@ibis.udf.scalar.python # <1> +def extract_rhyming_words(text: str) -> list[str]: # <1> + words = marvin.extract( # <2> + text, # <2> + instructions="Extract the primary rhyming words from the text", # <2> + ) # <2> + + return words # <3> ``` +1. Define a scalar Python UDF to extract rhyming words from a poem +2. Use Marvin to extract the rhyming words from the poem +3. Return the list of extracted words + And apply that UDF to our penguins table: ```{python} t = ( - t.mutate(rhyming_words=extract_rhyming_words(t.poem)) - .relocate(t.columns[:4], "rhyming_words") - .cache() + t.mutate(rhyming_words=extract_rhyming_words(t.poem)) # <1> + .relocate(t.columns[:4], "rhyming_words") # <2> + .cache() # <3> ) t ``` +1. Apply the UDF by mutating the table with the function +2. Rearrange the columns we care about to the front +3. 
Cache the table to avoid re-running the UDF + ### Translation -We can translate the penguin poems: +We can translate the penguin poems into Spanish or any language the language +model sufficiently knows: ```{python} -@marvin.fn -def _translate_text(text: str, target_language: str = "spanish") -> str: +@marvin.fn # <1> +def _translate_text(text: str, target_language: str = "spanish") -> str: # <1> """Translate `text` to `target_language`.""" +# <1> - -@ibis.udf.scalar.python -def translate_text(text: str, target_language: str = "spanish") -> str: - return _translate_text(text, target_language) +@ibis.udf.scalar.python # <2> +def translate_text(text: str, target_language: str = "spanish") -> str: # <2> + return _translate_text(text, target_language) # <3> ``` +1. Define a Marvin function to translate text +2. Define a scalar Python UDF to apply the Marvin function to a column +3. Apply the Marvin function within the UDF + And apply that UDF to our penguins table: ```{python} t = ( - t.mutate(translated_poem=translate_text(t.poem)) - .relocate(t.columns[:5], "translated_poem") - .cache() + t.mutate(translated_poem=translate_text(t.poem)) # <1> + .relocate(t.columns[:5], "translated_poem") # <2> + .cache() # <3> ) t ``` +1. Apply the UDF by mutating the table with the function +2. Rearrange the columns we care about to the front +3. Cache the table to avoid re-running the UDF + ## Writing code -Finally, we can use a language model to write code. Let's define a function that -outputs SQL: +Finally, language models can be used to write code. This is more useful with +systems around them to execute code, feed back error messages, make adjustments, +and so on. There are numerous pitfalls with language models writing code, but +they're fairly good at SQL. 
+ +Let's define a function that outputs SQL: ```{python} -@marvin.fn +@marvin.fn # <1> def _text_to_sql( text: str, table_names: list[str], @@ -299,70 +377,129 @@ def _text_to_sql( table_previews: list[str], ) -> str: """Writes a SQL SELECT statement for the `text` given the provided `table_names`, `table_schemas`, and `table_previews`.""" - - -def text_to_sql( - text: str, - table_names: list[str], - table_schemas: list[str], - table_previews: list[str], -) -> str: - return _text_to_sql(text, table_names, table_schemas, table_previews).strip(";") +# <1> + +def text_to_sql( # <2> + text: str, # <2> + table_names: list[str], # <2> + table_schemas: list[str], # <2> + table_previews: list[str], # <2> +) -> str: # <2> + sql = _text_to_sql(text, table_names, table_schemas, table_previews) # <3> + return sql.strip(";") # <4> ``` +1. Define a Marvin function to write SQL from text +2. Define a Python function to apply the Marvin function to a string +3. Generate the SQL string +4. Strip the trailing semicolon that is sometimes generated, as it causes issues + with Ibis + We can try that out on our penguins table: ```{python} -text = "the count of penguins by species, from highest to lowest, per each island" +text = "the count of penguins by species, from highest to lowest, per each island" # <1> -table_names = ["penguins"] -table_schemas = [str(penguins.schema())] -table_previews = [str(penguins.limit(5))] +table_names = ["penguins"] # <2> +table_schemas = [str(penguins.schema())] # <2> +table_previews = [str(penguins.limit(5))] # <2> -sql = text_to_sql(text, table_names, table_schemas, table_previews) +sql = text_to_sql(text, table_names, table_schemas, table_previews) # <3> print(sql) ``` +1. Create a natural language query +2. Provide the table names, schemas, and previews +3. Generate the SQL string + And execute the SQL: ```{python} -r = penguins.sql(sql) +r = penguins.sql(sql) # <1> r ``` +1. 
Execute the SQL string on the table + +:::{.callout-note title="This is definitely RAG"} +In this case, we've retrieved a table's name, schema, and a preview of its data +to generate a high-quality SQL query. We're definitely doing RAG now. + +While most RAG implementations use a similarity metric between text converted to +numbers (tokens/word embeddings), I find simple text retrieval in a +well-organized knowledge base to be more effective. We'll follow up on this in +future posts. +::: + ### A more complex example -Let's see how this works on a query that requires joining two tables. We'll load in some IMDB data: +Let's see how this works on a query that requires joining two tables. We'll load +in some IMDB data: ```{python} -imdb_title_basics = ibis.examples.imdb_title_basics.fetch() +imdb_title_basics = ibis.examples.imdb_title_basics.fetch() # <1> imdb_title_basics ``` +1. Load the IMDB title basics dataset from Ibis examples + ```{python} -imdb_title_ratings = ibis.examples.imdb_title_ratings.fetch() +imdb_title_ratings = ibis.examples.imdb_title_ratings.fetch() # <1> imdb_title_ratings ``` +1. Load the IMDB title ratings dataset from Ibis examples + ```{python} -text = "the highest rated movies w/ over 100k ratings -- movies only" +text = "the highest rated movies w/ over 100k ratings -- movies only" # <1> -table_names = ["imdb_title_basics", "imdb_title_ratings"] -table_schemas = [str(imdb_title_basics.schema()), str(imdb_title_ratings.schema())] -table_previews = [str(imdb_title_basics.limit(5)), str(imdb_title_ratings.limit(5))] +table_names = ["imdb_title_basics", "imdb_title_ratings"] # <2> +table_schemas = [str(imdb_title_basics.schema()), str(imdb_title_ratings.schema())] # <2> +table_previews = [str(imdb_title_basics.limit(5)), str(imdb_title_ratings.limit(5))] # <2> -sql = text_to_sql(text, table_names, table_schemas, table_previews) +sql = text_to_sql(text, table_names, table_schemas, table_previews) # <3> print(sql) ``` +1.
Create a natural language query +2. Provide the table names, schemas, and previews +3. Generate the SQL string + ```{python} -imdb_title_basics.sql(sql) +r = imdb_title_basics.sql(sql) # <1> +r ``` -## Issues with LMs today +1. Execute the SQL string on the table + +## Issues with language models today + +The biggest issue with language models today is the cost. Similarly, the time +for inferencing these models is the second biggest concern. Generating millions +of synthetic data points with state-of-the-art LLMs is prohibitively expensive +and slow. + +Smaller open source language models can reduce this cost to practically zero, +but at the expense of quality. And generating synthetic data is likely even +slower. ## Looking forward +In a future post, we'll explore using open source language models on a laptop to +achieve similar results. In general, I expect language models with data to +become increasingly commonplace. Some predictions: + +- small, specialized language models will become common +- large language models that require expensive GPU clusters will still be the + best for general-purpose tasks +- data assistants will be gimmicky for a while, but will eventually become + useful + +Keep an eye out for our own gimmicky data assistant soon... + ## Next steps -In a future post... +Try this out yourself! All you need is an OpenAI account and the code above. + +There's never been a better time to get involved with Ibis. [Join us on +Zulip and introduce yourself!](https://ibis-project.zulipchat.com/)