From 9cb4c8d17f9d7960c216b324a0153b9c785ea09f Mon Sep 17 00:00:00 2001 From: Cody Peterson <54814569+lostmygithubaccount@users.noreply.github.com> Date: Tue, 2 Apr 2024 09:38:06 -0400 Subject: [PATCH] docs(blog): update date on hamilton blog (#8851) --- .../posts/hamilton-ibis/index/execute-results/html.json | 6 +++--- docs/posts/hamilton-ibis/index.qmd | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/_freeze/posts/hamilton-ibis/index/execute-results/html.json b/docs/_freeze/posts/hamilton-ibis/index/execute-results/html.json index f2b7e16dda90..902d10c1d285 100644 --- a/docs/_freeze/posts/hamilton-ibis/index/execute-results/html.json +++ b/docs/_freeze/posts/hamilton-ibis/index/execute-results/html.json @@ -1,10 +1,10 @@ { - "hash": "b80fae077405d68ff6efe0f3bf2f9408", + "hash": "a6bf9acae4a092818c0b455f97284793", "result": { "engine": "jupyter", - "markdown": "---\ntitle: Portable dataflows with Ibis and Hamilton\nauthor: \"Thierry Jean\"\ndate: \"2024-03-28\"\nimage: \"thumbnail.png\"\ncategories:\n - blog\n - hamilton\n - data engineering\n - feature engineering\n---\n\n## Introduction\nThis post showcases how Ibis and [Hamilton](https://hamilton.dagworks.io/en/latest/)\nenable dataflows that span execution over SQL and Python. Ibis is a portable dataframe\nlibrary to write procedural data transformations in Python and be able to execute them\ndirectly on various SQL backends (DuckDB, Snowflake, Postgres, Flink, see\n[full list](https://ibis-project.org/support_matrix)). Hamilton provides a declarative\nway to define testable, modular, self-documenting dataflows, that encode lineage and\nmetadata.\n\nLet’s introduce Ibis before exploring how it pairs with Hamilton.\n\n## Standalone Ibis\nHere’s an Ibis code snippet to load data from a parquet file, compute features,\nselect columns, and filter rows, illustrating typical feature engineering operations.\n\nReading the code, you’ll notice that:\n\n- We use \"expression chaining\", meaning there’s a series of `.method()` attached one\nafter another.\n- The variable `ibis._` is a special character referring to the current expression\ne.g., `ibis._.pet` accesses the column \"pet\" of the current table.\n- The table method `.mutate(col1=, col2=, ...)` assigns new columns or overwrites\nexisting ones.\n\n::: {#ba383db9 .cell execution_count=1}\n``` {.python .cell-code}\nimport ibis\n\nurl = \"https://storage.googleapis.com/ibis-blog-data-public/hamilton-ibis/absenteeism.parquet\"\nfeature_set = (\n ibis.read_parquet(sources=url, table_name=\"absenteeism\")\n .rename(\"snake_case\")\n .mutate( # allows us to define new columns\n has_children=ibis.ifelse(ibis._.son > 0, True, False),\n has_pet=ibis.ifelse(ibis._.pet > 0, True, False),\n is_summer_brazil=ibis._.month_of_absence.isin([1, 2, 12]),\n ).select(\n \"id\", \"has_children\", \"has_pet\", \"is_summer_brazil\",\n \"service_time\", \"seasons\", \"disciplinary_failure\",\n \"absenteeism_time_in_hours\"\n )\n)\n```\n:::\n\n\n### Challenge 1 – Maintain and test complex data transformations\n\nIbis has an SQL-like syntax and supports chaining operations, allowing for\npowerful queries in a few lines of code. Conversely, there’s a risk of\nsprawling complexity as expressions are appended, making them harder to test\nand debug. Preventing this issue requires a lot of upfront discipline and\nrefactoring.\n\n### Challenge 2 – Orchestrate Ibis code in production\n\nIbis alleviates a major pain point by enabling data transformations to work across\nbackends. 
However, moving from dev to prod still requires some code changes such as\nchanging backend connectors, swapping unsupported operators, adding some orchestration\nand logging execution, wanting to reuse prior code, etc. This is outside the scope of\nthe Ibis project and is expected to be enabled by other means, which usually means\nbespoke constructs that turn into technical debt.\n\n## What is Hamilton?\n\nHamilton is a general-purpose framework to write dataflows using regular Python\nfunctions. At the core, each function defines a transformation and its parameters\nindicates its dependencies. Hamilton automatically connects individual functions into\na [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) (DAG)\nthat can be executed, visualized, optimized, and reported on.\n\n![The ABC of Hamilton](hamilton_abc.png)\n\n## How Hamilton complements Ibis\n\nHamilton was initially developed to [structure pandas code for a large catalog of\nfeatures](https://blog.dagworks.io/p/tidy-production-pandas-with-hamilton-3b759a2bf562),\nand has since been adopted by multiple organizations since, and expanded to cover any\npython object type (Polars, PySpark, ML Models, Numpy, you custom type, etc). Its syntax\nencourages users to chunk code into meaningful and reusable components, which facilitates\ndocumentation, unit testing, code reviews, and improves iteration speed, and the dev to\nproduction process. These benefits directly translate to organizing Ibis code.\n\n### Solution 1 – Structure your Ibis code with Hamilton\n\nNow, we’ll refactor the above Ibis code to use Hamilton. Users have the flexibility to\nchunk code (i.e., what the contents of a function is), at the table or the column-level\ndepending on the needed granularity. This modularity is particularly beneficial to Ibis\nbecause:\n\n- Well-scoped functions with type annotations and docstring are easier to understand for\nnew Ibis users and facilitate onboarding.\n\n- Unit testing and data validation becomes easier with smaller expressions. These checks\nbecome more important when working across backends since the\n[operation coverage varies](https://ibis-project.org/support_matrix) and bugs may arise.\n\n#### Table-level dataflow\n\nTable-level operations might feel most familiar to SQL and Spark users. Also, Ibis +\nHamilton is reminiscent of [dbt](https://www.getdbt.com/) for the Python ecosystem.\n\nWorking with tables is very efficient when your number of columns/features is limited,\nand you don’t need full column level lineage. As you want to reuse components, you can\nprogressively breakdown \"table-level code\" in to \"column-level code\".\n\nThe initial Ibis code is now 3 functions with type annotations and docstrings. 
We have\na clear sense of the expected external outputs and we could implement schema checks\nbetween functions.\n\n::: {#00192648 .cell execution_count=2}\n`````` {.python .cell-code}\nfrom typing import Optional\nimport ibis\nimport ibis.expr.types as ir\n\ndef raw_table(raw_data_path: str) -> ir.Table:\n \"\"\"Load parquet from `raw_data_path` into a Table expression\n and format column names to snakecase\n \"\"\"\n return (\n ibis.read_parquet(sources=raw_data_path, table_name=\"absenteism\")\n .rename(\"snake_case\")\n )\n\ndef feature_table(raw_table: ir.Table) -> ir.Table:\n \"\"\"Add to `raw_table` the feature columns `has_children`\n `has_pet`, and `is_summer_brazil`\n \"\"\"\n return raw_table.mutate(\n has_children=(ibis.ifelse(ibis._.son > 0, True, False)),\n has_pet=ibis.ifelse(ibis._.pet > 0, True, False),\n is_summer_brazil=ibis._.month_of_absence.isin([1, 2, 12]),\n )\n\ndef feature_set(\n feature_table: ir.Table,\n feature_selection: list[str],\n condition: Optional[ibis.common.deferred.Deferred] = None,\n) -> ir.Table:\n \"\"\"Select feature columns and filter rows\"\"\"\n return feature_table[feature_selection].filter(condition)\n``````\n:::\n\n\n![Table-level lineage with Hamilton](table_lineage.png)\n\n#### Column-level dataflow\nHamilton was initially built to expose and manage column-level operations, which is most\ncommon in dataframe libraries (pandas, Dask, polars).\n\nColumn-level code leads to fully-reusable feature definitions and a highly granular level\nof lineage. Notably, this allows one to [trace sensitive data and evaluate downstream\nimpacts of code changes](https://hamilton.dagworks.io/en/latest/how-tos/use-hamilton-for-lineage/).\nHowever, it is more verbose to get started with, but remember that code is read more\noften than written.\n\nNow, the raw_table is loaded and the columns `son`, `pet`, and `month_of_absence` are\nextracted to engineer new features. 
After transformations, features are joined with\n`raw_table` to create `feature_table`.\n\n::: {#13e8c515 .cell execution_count=3}\n``` {.python .cell-code}\nimport ibis\nimport ibis.expr.types as ir\nfrom hamilton.function_modifiers import extract_columns\nfrom hamilton.plugins import ibis_extensions\n\n# extract specific columns from the table\n@extract_columns(\"son\", \"pet\", \"month_of_absence\")\ndef raw_table(raw_data_path: str) -> ir.Table:\n \"\"\"Load the parquet found at `raw_data_path` into a Table expression\n and format columns to snakecase\n \"\"\"\n return (\n ibis.read_parquet(sources=raw_data_path, table_name=\"absenteism\")\n .rename(\"snake_case\")\n )\n\n# accesses a single column from `raw_table`\ndef has_children(son: ir.Column) -> ir.BooleanColumn:\n \"\"\"True if someone has any children\"\"\"\n return ibis.ifelse(son > 0, True, False)\n\n# narrows the return type from `ir.Column` to `ir.BooleanColumn`\ndef has_pet(pet: ir.Column) -> ir.BooleanColumn:\n \"\"\"True if someone has any pets\"\"\"\n return ibis.ifelse(pet > 0, True, False).cast(bool)\n\n# typing and docstring provides business context to features\ndef is_summer_brazil(month_of_absence: ir.Column) -> ir.BooleanColumn:\n \"\"\"True if it is summer in Brazil during this month\n\n People in the northern hemisphere are likely to take vacations\n to warm places when it's cold locally\n \"\"\"\n return month_of_absence.isin([1, 2, 12])\n\ndef feature_table(\n raw_table: ir.Table,\n has_children: ir.BooleanColumn,\n has_pet: ir.BooleanColumn,\n is_summer_brazil: ir.BooleanColumn,\n) -> ir.Table:\n \"\"\"Join computed features to the `raw_data` table\"\"\"\n return raw_table.mutate(\n has_children=has_children,\n has_pet=has_pet,\n is_summer_brazil=is_summer_brazil,\n )\n\ndef feature_set(\n feature_table: ir.Table,\n feature_selection: list[str],\n condition: Optional[ibis.common.deferred.Deferred] = None,\n) -> ir.Table:\n \"\"\"Select feature columns and filter rows\"\"\"\n return feature_table[feature_selection].filter(condition)\n```\n:::\n\n\n![Column-level lineage with Hamilton](column_lineage.png)\n\n### Solution 2 – Orchestrate Ibis anywhere\n\nHamilton is an ideal way to orchestrate Ibis code because it has a very small dependency\nfootprint and will run anywhere Python does (script, notebook,\n[FastAPI](https://hamilton.dagworks.io/en/latest/integrations/fastapi/),\n[Streamlit](https://hamilton.dagworks.io/en/latest/integrations/streamlit/), pyodide,\netc.) In fact, the Hamilton library only has four dependencies. You don’t need\n\"framework code\" to get started, just plain Python functions. When moving to production,\nHamilton has all the necessary features to complement Ibis such as swapping components,\nconfigurations, and lifecycle hooks for logging, alerting, and telemetry.\n\nA simple usage pattern of Hamilton + Ibis is to use the `@config.when`\n[function modifier](https://hamilton.dagworks.io/en/latest/concepts/function-modifiers/#select-functions-to-include).\nIn the following example, we have alternative implementations for the backend connection,\nwhich will be used for computing and storing results. When running your code, specify in\nyour config `backend=\"duckdb\"` or `backend=\"bigquery\"` to swap between the two.\n\n::: {#cb036ded .cell execution_count=4}\n``` {.python .cell-code}\n# ibis_dataflow.py\nimport ibis\nimport ibis.expr.types as ir\nfrom hamilton.function_modifiers import config\n\n# ... 
entire dataflow definition\n\n@config.when(backend=\"duckdb\")\ndef backend_connection__duckdb(\n connection_string: str\n) -> ibis.backends.BaseBackend:\n \"\"\"Connect to DuckDB backend\"\"\"\n return ibis.duckdb.connect(connection_string)\n\n@config.when(backend=\"bigquery\")\ndef backend_connection__bigquery(\n project_id: str,\n dataset_id: str,\n) -> ibis.backends.BaseBackend:\n \"\"\"Connect to BigQuery backend\n Install dependencies via `pip install ibis-framework[bigquery]`\n \"\"\"\n return ibis.bigquery.connect(\n project_id=project_id,\n dataset_id=dataset_id,\n )\n\ndef insert_results(\n backend_connection: ibis.backends.BaseBackend,\n result_table: ir.Table,\n table_name: str\n) -> None:\n \"\"\"Execute expression and insert results\"\"\"\n backend_connection.insert(\n table_name=table_name,\n obj=result_table\n )\n```\n:::\n\n\n# How Ibis complements Hamilton\n\n## Performance boost\n\nLeveraging DuckDB as the default backend, Hamilton users migrating to Ibis should\nimmediately find performance improvements both for local dev and production. In addition,\nthe portability of Ibis has the potential to greatly reduce development time.\n\n## Atomic data transformation documentation\n\nHamilton can directly produce a dataflow visualization from code, helping with project\ndocumentation. Ibis pushes this one step further by providing a detailed view of the\nquery plan and schemas. See this Ibis visualization for the column-level Hamilton\ndataflow defined above. It includes all renaming, type casting, and transformations\nsteps (Please open the image in a new tab and zoom in 🔎).\n\n![](ibis_lineage.png)\n\n## Working across rows with user-defined functions (UDFs)\n\nHamilton and most backends are designed to work primarily on tables and columns, but\nsometimes you’d like to operate over a row (think of `pd.DataFrame.apply()`). However,\npivoting tables is costly and manually iterating over rows to collect values and create\na new column is quickly inconvenient. By using scalar user-defined functions (UDFs), Ibis\nmakes it possible to execute arbitrary Python code on rows directly on the backend.\n\n:::{.callout-note}\nUsing `@ibis.udf.scalar.python` creates a non-vectorized function that iterates\nrow-by-row. 
See [the docs](https://ibis-project.org/reference/scalar-udfs) to use\nbackend-specific UDFs with `@ibis.udf.scalar.builtin` and create vectorized scalar UDFs.\n:::\n\nFor instance, you could [embed rows of a text column\nusing an LLM API](https://ibis-project.org/posts/duckdb-for-rag/) using your existing\ndata warehouse infrastructure.\n\n::: {#bae33003 .cell execution_count=5}\n``` {.python .cell-code}\nimport ibis\nimport ibis.expr.types as ir\n\ndef documents(path: str) -> ir.Table:\n \"\"\"load text documents from file\"\"\"\n return ibis.read_parquet(sources=path, table_name=\"documents\")\n\n# function name starts with `_` to prevent being added as a node\n@ibis.udf.scalar.python\ndef _generate_summary(author: str, text: str, prompt_template: str) -> str:\n \"\"\"UDF Function to call the OpenAI API line by line\"\"\"\n prompt = prompt_template.format(author=author, text=text)\n client = openai.OpenAI(...)\n try:\n response = client.chat.completions.create(...)\n return_value = response.choices[0].message.content\n except Exception:\n return_value = \"\"\n return return_value\n\n\ndef prompt_template() -> str:\n return \"\"\"summarize the following text from {author} and add\n contextual notes based on it biography and other written work\n\n TEXT\n {text}\n \"\"\"\n\ndef summaries(documents: ir.Table, prompt_template: str) -> ir.Table:\n \"\"\"Compute the UDF against the family\"\"\"\n return documents.mutate(\n summary=_generated_summary(\n _.author,\n _.text,\n prompt_template=prompt_template\n )\n )\n```\n:::\n\n\n![](udf.png)\n\n# Ibis + Hamilton – a natural pairing\n* **What works in dev works in prod**: Ibis and Hamilton allows you to write and\nstructure code data transformations that are portable across backends for small and big\ndata alike. The two being lightweight libraries, installing dependencies on remote\nworkers is fast and you’re unlikely to ever encounter dependency conflicts.\n\n* **Maintainable and testable code**: Modular functions facilitates writing high quality\ncode and promotes reusability, compounding your engineering efforts. It becomes easier\nfor new users to contribute to a dataflow and pull requests are merged faster.\n\n* **Greater visibility**: With Hamilton and Ibis, you have incredible visualizations\ndirectly derived from your code. This is a superpower for documentation, allowing users\nto make sense of a dataflow, and also reason about changes.\n\n", + "markdown": "---\ntitle: Portable dataflows with Ibis and Hamilton\nauthor: \"Thierry Jean\"\ndate: \"2024-04-02\"\nimage: \"thumbnail.png\"\ncategories:\n - blog\n - hamilton\n - data engineering\n - feature engineering\n---\n\n\n\n\n## Introduction\nThis post showcases how Ibis and [Hamilton](https://hamilton.dagworks.io/en/latest/)\nenable dataflows that span execution over SQL and Python. Ibis is a portable dataframe\nlibrary for writing procedural data transformations in Python and executing them\ndirectly on various SQL backends (DuckDB, Snowflake, Postgres, Flink, see\n[full list](https://ibis-project.org/support_matrix)). 
Hamilton provides a declarative\nway to define testable, modular, self-documenting dataflows that encode lineage and\nmetadata.\n\nLet’s introduce Ibis before exploring how it pairs with Hamilton.\n\n## Standalone Ibis\nHere’s an Ibis code snippet to load data from a parquet file, compute features,\nselect columns, and filter rows, illustrating typical feature engineering operations.\n\nReading the code, you’ll notice that:\n\n- We use \"expression chaining\", meaning there’s a series of `.method()` calls attached one\nafter another.\n- The variable `ibis._` is a special variable referring to the current expression,\ne.g., `ibis._.pet` accesses the column \"pet\" of the current table.\n- The table method `.mutate(col1=, col2=, ...)` assigns new columns or overwrites\nexisting ones.\n\n::: {#525da212 .cell execution_count=1}\n``` {.python .cell-code}\nimport ibis\n\nurl = \"https://storage.googleapis.com/ibis-blog-data-public/hamilton-ibis/absenteeism.parquet\"\nfeature_set = (\n ibis.read_parquet(sources=url, table_name=\"absenteeism\")\n .rename(\"snake_case\")\n .mutate( # allows us to define new columns\n has_children=ibis.ifelse(ibis._.son > 0, True, False),\n has_pet=ibis.ifelse(ibis._.pet > 0, True, False),\n is_summer_brazil=ibis._.month_of_absence.isin([1, 2, 12]),\n ).select(\n \"id\", \"has_children\", \"has_pet\", \"is_summer_brazil\",\n \"service_time\", \"seasons\", \"disciplinary_failure\",\n \"absenteeism_time_in_hours\"\n )\n)\n```\n:::\n\n\n### Challenge 1 – Maintain and test complex data transformations\n\nIbis has an SQL-like syntax and supports chaining operations, allowing for\npowerful queries in a few lines of code. Conversely, there’s a risk of\nsprawling complexity as expressions are appended, making them harder to test\nand debug. Preventing this issue requires a lot of upfront discipline and\nrefactoring.\n\n### Challenge 2 – Orchestrate Ibis code in production\n\nIbis alleviates a major pain point by enabling data transformations to work across\nbackends. However, moving from dev to prod still requires some code changes, such as\nchanging backend connectors, swapping unsupported operators, adding orchestration\nand execution logging, reusing prior code, etc. This is outside the scope of\nthe Ibis project and is expected to be enabled by other means, which usually means\nbespoke constructs that turn into technical debt.\n\n## What is Hamilton?\n\nHamilton is a general-purpose framework to write dataflows using regular Python\nfunctions. At the core, each function defines a transformation and its parameters\nindicate its dependencies. Hamilton automatically connects individual functions into\na [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) (DAG)\nthat can be executed, visualized, optimized, and reported on.\n\n![The ABC of Hamilton](hamilton_abc.png)\n\n## How Hamilton complements Ibis\n\nHamilton was initially developed to [structure pandas code for a large catalog of\nfeatures](https://blog.dagworks.io/p/tidy-production-pandas-with-hamilton-3b759a2bf562),\nhas since been adopted by multiple organizations, and has expanded to cover any\nPython object type (Polars, PySpark, ML models, NumPy, your custom type, etc.). Its syntax\nencourages users to chunk code into meaningful and reusable components, which facilitates\ndocumentation, unit testing, and code reviews, and improves iteration speed and the\ndev-to-production process.
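\n\nTo make this concrete, here is a minimal sketch of how Hamilton turns plain functions into an executable DAG (the module, function, and input names are illustrative, not from the original post):\n\n```python\n# my_dataflow.py: each function is a node; parameter names declare its dependencies\ndef amounts_total(amounts: list[float]) -> float:\n \"\"\"Sum of all amounts\"\"\"\n return sum(amounts)\n\ndef amounts_mean(amounts_total: float, amounts: list[float]) -> float:\n \"\"\"Mean amount; depends on `amounts_total` through its parameter name\"\"\"\n return amounts_total / len(amounts)\n\n# run.py: build the DAG from the module and execute the requested outputs\nfrom hamilton import driver\nimport my_dataflow\n\ndr = driver.Builder().with_modules(my_dataflow).build()\nprint(dr.execute([\"amounts_mean\"], inputs={\"amounts\": [1.0, 2.0, 3.0]}))\n```\n\n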
These benefits directly translate to organizing Ibis code.\n\n### Solution 1 – Structure your Ibis code with Hamilton\n\nNow, we’ll refactor the above Ibis code to use Hamilton. Users have the flexibility to\nchunk code (i.e., decide what the contents of a function are) at the table or the column\nlevel, depending on the needed granularity. This modularity is particularly beneficial to Ibis\nbecause:\n\n- Well-scoped functions with type annotations and docstrings are easier to understand for\nnew Ibis users and facilitate onboarding.\n\n- Unit testing and data validation become easier with smaller expressions. These checks\nbecome more important when working across backends since the\n[operation coverage varies](https://ibis-project.org/support_matrix) and bugs may arise.\n\n#### Table-level dataflow\n\nTable-level operations might feel most familiar to SQL and Spark users. Also, Ibis +\nHamilton is reminiscent of [dbt](https://www.getdbt.com/) for the Python ecosystem.\n\nWorking with tables is very efficient when your number of columns/features is limited\nand you don’t need full column-level lineage. As you need to reuse components, you can\nprogressively break down \"table-level code\" into \"column-level code\".\n\nThe initial Ibis code is now three functions with type annotations and docstrings. We have\na clear sense of the expected external outputs and we could implement schema checks\nbetween functions.\n\n::: {#9185b4b0 .cell execution_count=2}\n`````` {.python .cell-code}\nfrom typing import Optional\nimport ibis\nimport ibis.expr.types as ir\n\ndef raw_table(raw_data_path: str) -> ir.Table:\n \"\"\"Load parquet from `raw_data_path` into a Table expression\n and format column names to snakecase\n \"\"\"\n return (\n ibis.read_parquet(sources=raw_data_path, table_name=\"absenteeism\")\n .rename(\"snake_case\")\n )\n\ndef feature_table(raw_table: ir.Table) -> ir.Table:\n \"\"\"Add to `raw_table` the feature columns `has_children`,\n `has_pet`, and `is_summer_brazil`\n \"\"\"\n return raw_table.mutate(\n has_children=ibis.ifelse(ibis._.son > 0, True, False),\n has_pet=ibis.ifelse(ibis._.pet > 0, True, False),\n is_summer_brazil=ibis._.month_of_absence.isin([1, 2, 12]),\n )\n\ndef feature_set(\n feature_table: ir.Table,\n feature_selection: list[str],\n condition: Optional[ibis.common.deferred.Deferred] = None,\n) -> ir.Table:\n \"\"\"Select feature columns and optionally filter rows\"\"\"\n selected = feature_table[feature_selection]\n return selected.filter(condition) if condition is not None else selected\n``````\n:::\n\n\n![Table-level lineage with Hamilton](table_lineage.png)\n\n#### Column-level dataflow\nHamilton was initially built to expose and manage column-level operations, which are most\ncommon in dataframe libraries (pandas, Dask, Polars).\n\nColumn-level code leads to fully-reusable feature definitions and a highly granular level\nof lineage. Notably, this allows one to [trace sensitive data and evaluate downstream\nimpacts of code changes](https://hamilton.dagworks.io/en/latest/how-tos/use-hamilton-for-lineage/).\nIt is more verbose to get started with, but remember that code is read more\noften than written.
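\n\nBecause each function returns a plain Ibis expression, it can also be unit tested in isolation. Here is a minimal sketch (illustrative, not from the original post) that checks the table-level `feature_table` defined above against an in-memory table, assuming DuckDB, Ibis’s default local backend, is installed:\n\n```python\nimport ibis\n\n# a tiny in-memory table exercising the feature logic\nt = ibis.memtable({\"son\": [0, 2], \"pet\": [1, 0], \"month_of_absence\": [7, 12]})\nresult = feature_table(t).execute() # executes on the default DuckDB backend\nassert result[\"has_children\"].tolist() == [False, True]\nassert result[\"is_summer_brazil\"].tolist() == [False, True]\n```\n\n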
Now, `raw_table` is loaded and the columns `son`, `pet`, and `month_of_absence` are\nextracted to engineer new features. After transformations, features are joined with\n`raw_table` to create `feature_table`.\n\n::: {#1e8b429c .cell execution_count=3}\n``` {.python .cell-code}\nfrom typing import Optional\nimport ibis\nimport ibis.expr.types as ir\nfrom hamilton.function_modifiers import extract_columns\nfrom hamilton.plugins import ibis_extensions\n\n# extract specific columns from the table\n@extract_columns(\"son\", \"pet\", \"month_of_absence\")\ndef raw_table(raw_data_path: str) -> ir.Table:\n \"\"\"Load the parquet found at `raw_data_path` into a Table expression\n and format columns to snakecase\n \"\"\"\n return (\n ibis.read_parquet(sources=raw_data_path, table_name=\"absenteeism\")\n .rename(\"snake_case\")\n )\n\n# accesses a single column from `raw_table`\ndef has_children(son: ir.Column) -> ir.BooleanColumn:\n \"\"\"True if someone has any children\"\"\"\n return ibis.ifelse(son > 0, True, False)\n\n# narrows the return type from `ir.Column` to `ir.BooleanColumn`\ndef has_pet(pet: ir.Column) -> ir.BooleanColumn:\n \"\"\"True if someone has any pets\"\"\"\n return ibis.ifelse(pet > 0, True, False).cast(bool)\n\n# typing and docstrings provide business context to features\ndef is_summer_brazil(month_of_absence: ir.Column) -> ir.BooleanColumn:\n \"\"\"True if it is summer in Brazil during this month\n\n People in the northern hemisphere are likely to take vacations\n to warm places when it's cold locally\n \"\"\"\n return month_of_absence.isin([1, 2, 12])\n\ndef feature_table(\n raw_table: ir.Table,\n has_children: ir.BooleanColumn,\n has_pet: ir.BooleanColumn,\n is_summer_brazil: ir.BooleanColumn,\n) -> ir.Table:\n \"\"\"Join computed features to the `raw_table` table\"\"\"\n return raw_table.mutate(\n has_children=has_children,\n has_pet=has_pet,\n is_summer_brazil=is_summer_brazil,\n )\n\ndef feature_set(\n feature_table: ir.Table,\n feature_selection: list[str],\n condition: Optional[ibis.common.deferred.Deferred] = None,\n) -> ir.Table:\n \"\"\"Select feature columns and optionally filter rows\"\"\"\n selected = feature_table[feature_selection]\n return selected.filter(condition) if condition is not None else selected\n```\n:::\n\n\n![Column-level lineage with Hamilton](column_lineage.png)\n\n### Solution 2 – Orchestrate Ibis anywhere\n\nHamilton is an ideal way to orchestrate Ibis code because it has a very small dependency\nfootprint and will run anywhere Python does (script, notebook,\n[FastAPI](https://hamilton.dagworks.io/en/latest/integrations/fastapi/),\n[Streamlit](https://hamilton.dagworks.io/en/latest/integrations/streamlit/), Pyodide,\netc.). In fact, the Hamilton library only has four dependencies. You don’t need\n\"framework code\" to get started, just plain Python functions. When moving to production,\nHamilton has all the necessary features to complement Ibis, such as swappable components,\nconfiguration, and lifecycle hooks for logging, alerting, and telemetry.\n\nA simple usage pattern of Hamilton + Ibis is to use the `@config.when`\n[function modifier](https://hamilton.dagworks.io/en/latest/concepts/function-modifiers/#select-functions-to-include).\nIn the following example, we have alternative implementations for the backend connection,\nwhich will be used for computing and storing results. When running your code, specify in\nyour config `backend=\"duckdb\"` or `backend=\"bigquery\"` to swap between the two.\n\n::: {#b6a17d31 .cell execution_count=4}\n``` {.python .cell-code}\n# ibis_dataflow.py\nimport ibis\nimport ibis.expr.types as ir\nfrom hamilton.function_modifiers import config\n\n# ... 
entire dataflow definition\n\n@config.when(backend=\"duckdb\")\ndef backend_connection__duckdb(\n connection_string: str\n) -> ibis.backends.BaseBackend:\n \"\"\"Connect to DuckDB backend\"\"\"\n return ibis.duckdb.connect(connection_string)\n\n@config.when(backend=\"bigquery\")\ndef backend_connection__bigquery(\n project_id: str,\n dataset_id: str,\n) -> ibis.backends.BaseBackend:\n \"\"\"Connect to BigQuery backend\n Install dependencies via `pip install ibis-framework[bigquery]`\n \"\"\"\n return ibis.bigquery.connect(\n project_id=project_id,\n dataset_id=dataset_id,\n )\n\ndef insert_results(\n backend_connection: ibis.backends.BaseBackend,\n result_table: ir.Table,\n table_name: str\n) -> None:\n \"\"\"Execute expression and insert results\"\"\"\n backend_connection.insert(\n table_name=table_name,\n obj=result_table\n )\n```\n:::\n
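\nFor example, a driver could assemble and run this dataflow with the desired backend (an illustrative sketch: it assumes the module above is saved as `ibis_dataflow.py` and that the elided definitions produce `result_table`; the exact inputs depend on that elided code):\n\n```python\n# run.py\nfrom hamilton import driver\nimport ibis_dataflow\n\ndr = (\n driver.Builder()\n .with_modules(ibis_dataflow)\n .with_config({\"backend\": \"duckdb\"}) # or \"bigquery\" to swap connectors\n .build()\n)\n# illustrative inputs; the full set depends on the elided dataflow definition\ndr.execute([\"insert_results\"], inputs={\"connection_string\": \"features.ddb\", \"table_name\": \"feature_set\"})\n```\n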
\n# How Ibis complements Hamilton\n\n## Performance boost\n\nWith DuckDB as the default backend, Hamilton users migrating to Ibis should\nimmediately see performance improvements, both in local dev and in production. In addition,\nthe portability of Ibis has the potential to greatly reduce development time.\n\n## Atomic data transformation documentation\n\nHamilton can directly produce a dataflow visualization from code, helping with project\ndocumentation. Ibis pushes this one step further by providing a detailed view of the\nquery plan and schemas. See this Ibis visualization for the column-level Hamilton\ndataflow defined above. It includes all renaming, type casting, and transformation\nsteps (please open the image in a new tab and zoom in 🔎).\n\n![](ibis_lineage.png)\n\n## Working across rows with user-defined functions (UDFs)\n\nHamilton and most backends are designed to work primarily on tables and columns, but\nsometimes you’d like to operate over a row (think of `pd.DataFrame.apply()`). However,\npivoting tables is costly, and manually iterating over rows to collect values and create\na new column quickly becomes inconvenient. By using scalar user-defined functions (UDFs), Ibis\nmakes it possible to execute arbitrary Python code on rows directly on the backend.\n\n:::{.callout-note}\nUsing `@ibis.udf.scalar.python` creates a non-vectorized function that iterates\nrow-by-row. See [the docs](https://ibis-project.org/reference/scalar-udfs) to use\nbackend-specific UDFs with `@ibis.udf.scalar.builtin` and create vectorized scalar UDFs.\n:::\n\nFor instance, you could [embed rows of a text column\nusing an LLM API](https://ibis-project.org/posts/duckdb-for-rag/) using your existing\ndata warehouse infrastructure.\n\n::: {#261a7d55 .cell execution_count=5}\n``` {.python .cell-code}\nimport openai\n\nimport ibis\nimport ibis.expr.types as ir\nfrom ibis import _\n\ndef documents(path: str) -> ir.Table:\n \"\"\"Load text documents from file\"\"\"\n return ibis.read_parquet(sources=path, table_name=\"documents\")\n\n# function name starts with `_` to prevent being added as a node\n@ibis.udf.scalar.python\ndef _generate_summary(author: str, text: str, prompt_template: str) -> str:\n \"\"\"UDF function to call the OpenAI API line by line\"\"\"\n prompt = prompt_template.format(author=author, text=text)\n client = openai.OpenAI(...)\n try:\n response = client.chat.completions.create(...)\n return_value = response.choices[0].message.content\n except Exception:\n return_value = \"\"\n return return_value\n\n\ndef prompt_template() -> str:\n return \"\"\"summarize the following text from {author} and add\n contextual notes based on its biography and other written work\n\n TEXT\n {text}\n \"\"\"\n\ndef summaries(documents: ir.Table, prompt_template: str) -> ir.Table:\n \"\"\"Apply the summary UDF to each row of `documents`\"\"\"\n return documents.mutate(\n summary=_generate_summary(\n _.author,\n _.text,\n prompt_template=prompt_template\n )\n )\n```\n:::\n\n\n![](udf.png)\n\n# Ibis + Hamilton – a natural pairing\n* **What works in dev works in prod**: Ibis and Hamilton allow you to write and\nstructure data transformations that are portable across backends for small and big\ndata alike. Since both are lightweight libraries, installing dependencies on remote\nworkers is fast and you’re unlikely to ever encounter dependency conflicts.\n\n* **Maintainable and testable code**: Modular functions facilitate writing high-quality\ncode and promote reusability, compounding your engineering efforts. It becomes easier\nfor new users to contribute to a dataflow, and pull requests are merged faster.\n\n* **Greater visibility**: With Hamilton and Ibis, you have incredible visualizations\ndirectly derived from your code. This is a superpower for documentation, allowing users\nto make sense of a dataflow, and also reason about changes.\n\n", "supporting": [ - "index_files" + "index_files/figure-html" ], "filters": [], "includes": {} diff --git a/docs/posts/hamilton-ibis/index.qmd b/docs/posts/hamilton-ibis/index.qmd index 12faa9fe6655..c6dd7f86ec1f 100644 --- a/docs/posts/hamilton-ibis/index.qmd +++ b/docs/posts/hamilton-ibis/index.qmd @@ -1,7 +1,7 @@ --- title: Portable dataflows with Ibis and Hamilton author: "Thierry Jean" -date: "2024-03-28" +date: "2024-04-02" image: "thumbnail.png" categories: - blog