From 141edea25f334f48dc3bdaf04a4460b41ce9e7aa Mon Sep 17 00:00:00 2001
From: Cody Peterson <54814569+lostmygithubaccount@users.noreply.github.com>
Date: Mon, 22 Jan 2024 10:18:02 -0500
Subject: [PATCH] docs: blog for the 1 billion row challenge (#8004)

---
 docs/posts/1brc/.gitignore |   1 +
 docs/posts/1brc/index.qmd  | 344 ++++++++++++++++++
 2 files changed, 345 insertions(+)
 create mode 100644 docs/posts/1brc/.gitignore
 create mode 100644 docs/posts/1brc/index.qmd
diff --git a/docs/posts/1brc/.gitignore b/docs/posts/1brc/.gitignore
new file mode 100644
index 000000000000..581b9361d26a
--- /dev/null
+++ b/docs/posts/1brc/.gitignore
@@ -0,0 +1 @@
+1brc
diff --git a/docs/posts/1brc/index.qmd b/docs/posts/1brc/index.qmd
new file mode 100644
index 000000000000..5144449e71b9
--- /dev/null
+++ b/docs/posts/1brc/index.qmd
@@ -0,0 +1,344 @@
+---
+title: "Using one Python dataframe API to take the billion row challenge with DuckDB, Polars, and DataFusion"
+author: "Cody"
+date: "2024-01-22"
+categories:
+  - blog
+  - duckdb
+  - polars
+  - datafusion
+  - portability
+---
+
+## Overview
+
+This is an implementation of [The One Billion Row
+Challenge](https://www.morling.dev/blog/one-billion-row-challenge/):
+
+> Let’s kick off 2024 true coder style—I’m excited to announce the One Billion
+> Row Challenge (1BRC), running from Jan 1 until Jan 31.
+
+> Your mission, should you decide to accept it, is deceptively simple: write a
+> Java program for retrieving temperature measurement values from a text file and
+> calculating the min, mean, and max temperature per weather station. There’s just
+> one caveat: the file has 1,000,000,000 rows!
+
+I haven't written Java since dropping a computer science course in my second
+year of college that forced us to do functional programming exclusively in
+Java. However, I'll gladly take the challenge in Python using Ibis! In fact, I
+once did something similar (generating a billion rows with 26 columns of
+random numbers and doing basic aggregations) to test out DuckDB and Polars.
+
+In this post, we'll demonstrate how Ibis provides a single Python dataframe API
+to take the billion row challenge with DuckDB, Polars, and DataFusion.
+
+## Setup
+
+We need to generate the data for the challenge. First, clone the
+[repo](https://github.com/gunnarmorling/1brc):
+
+```{.bash}
+gh repo clone gunnarmorling/1brc
+```
+
+Then change into the Python directory and run the generation script with the
+number of rows you want to generate:
+
+```{.bash}
+cd 1brc/src/main/python
+python create_measurements.py 1_000_000_000
+```
+
+This will generate a file called `measurements.txt` in the `data` directory at
+the root of the repo. It is 15GB on disk:
+
+```{.bash}
+(venv) cody@voda 1brc % du 1brc/data/*
+ 15G	1brc/data/measurements.txt
+808K	1brc/data/weather_stations.csv
+```
+
+The file consists of one billion rows with two columns separated by a
+semicolon:
+
+```{.bash}
+(venv) cody@voda 1brc % head 1brc/data/measurements.txt
+Kusugal;-67.2
+Ipil;-88.6
+Sohna;-31.2
+Lubuagan;-2.3
+Szentes;29.2
+Sylvan Lake;-70.7
+Ambato;-35.2
+Berkine;97.0
+Wernau;73.4
+Kennewick;-19.9
+```
+
+You'll also need to install Ibis with the three backends we'll use (the quotes
+keep shells like zsh from globbing the brackets):
+
+```{.bash}
+pip install 'ibis-framework[duckdb,polars,datafusion]'
+```
+
+## Understanding Ibis
+
+Ibis provides a standard dataframe API decoupled from the execution engine. It
+compiles Ibis expressions into an intermediate representation (often SQL) that
+can be executed by different backends.
+
+This allows us to write a single Ibis expression to complete the challenge with
+many different execution engine backends.
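+
+To make this concrete, you can ask Ibis to show the SQL it generates for an
+expression. Here's a minimal sketch (the tiny in-memory table is made up for
+illustration; `ibis.to_sql` compiles an expression to SQL):
+
+```{.python}
+import ibis
+
+# a throwaway in-memory table, just to have something to compile
+t = ibis.memtable({"station": ["a", "b"], "temperature": [1.0, 2.0]})
+
+expr = t.group_by("station").agg(max_temp=t.temperature.max())
+
+# print the SQL Ibis generates for the default (DuckDB) backend
+print(ibis.to_sql(expr))
+```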
+
+:::{.callout-warning}
+While Ibis does its best to abstract away the differences between backends,
+this cannot be done in some areas like data input and output. For example, the
+`read_csv` function across the various backends (in its SQL and Python forms)
+has different parameters. We'll handle that with a different `kwargs`
+dictionary for each backend in this post.
+
+In general, besides creating a connection and data input/output, the Ibis API
+is the same across backends.
+:::
+
+## Completing the challenge thrice
+
+We'll use three great options for local backends -- DuckDB, Polars, and
+DataFusion -- to complete the challenge.
+
+### Setup
+
+Before we get started, we'll make some imports, turn on interactive mode, and
+define a `kwargs` dictionary for each backend, matching the parameters of its
+`read_csv` function:
+
+```{python}
+import ibis
+import polars as pl
+import pyarrow as pa
+
+ibis.options.interactive = True
+
+duckdb_kwargs = {
+    "delim": ";",
+    "header": False,
+    "columns": {"station": "VARCHAR", "temperature": "DOUBLE"},
+}
+
+polars_kwargs = {
+    "separator": ";",
+    "has_header": False,
+    "new_columns": ["station", "temperature"],
+    "schema": {"station": pl.Utf8, "temperature": pl.Float64},
+}
+
+datafusion_kwargs = {
+    "delimiter": ";",
+    "has_header": False,
+    "schema": pa.schema(
+        [
+            ("station", pa.string()),
+            ("temperature", pa.float64()),
+        ]
+    ),
+    "file_extension": ".txt",
+}
+```
+
+Let's define a function that expresses the challenge once, so we can run the
+same code on each backend:
+
+```{python}
+def run_challenge(t):
+    # ibis._ is a placeholder for the table being operated on (deferred expressions)
+    res = (
+        t.group_by(ibis._.station)
+        .agg(
+            min_temp=ibis._.temperature.min(),
+            mean_temp=ibis._.temperature.mean(),
+            max_temp=ibis._.temperature.max(),
+        )
+        .order_by(ibis._.station.desc())
+    )
+    return res
+```
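+
+Note that `run_challenge` only builds an expression -- nothing executes until
+the result is displayed (in interactive mode) or explicitly materialized. A
+quick sketch, assuming a table `t` is already loaded:
+
+```{.python}
+res = run_challenge(t)  # instant: only constructs the expression
+
+df = res.to_pandas()  # this is where the backend actually does the work
+```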
+
+### Completing the challenge
+
+Let's complete the challenge with each backend.
+
+:::{.callout-note}
+The results are the same across backends but look suspicious. The repository
+notes that the Python generation code is "unofficial", so it may have some
+problems. Given this is a contrived example on generated data, I'm not going
+to worry about it.
+
+The point is that we can easily complete the challenge with the same code
+across many backends, letting them worry about the details of execution. For
+this reason, I'm also not providing execution times. Try it out yourself!
+:::
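+
+If you do want rough numbers, timing a full materialization is one way to get
+them. A sketch, assuming the `t` and `run_challenge` defined above (results
+vary by machine and this is not an apples-to-apples benchmark):
+
+```{.python}
+import time
+
+start = time.perf_counter()
+run_challenge(t).to_pandas()  # force full execution
+print(f"{time.perf_counter() - start:.1f}s")
+```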
+
+::: {.panel-tabset}
+
+## DuckDB
+
+First let's set the backend to DuckDB (redundantly, since it's the default) and
+choose its `kwargs` dictionary:
+
+```{python}
+ibis.set_backend("duckdb")  # <1>
+kwargs = duckdb_kwargs
+```
+
+```{python}
+# | code-fold: true
+# | echo: false
+_ = ibis.get_backend().raw_sql("set enable_progress_bar = false")
+```
+
+1. Redundant given DuckDB is the default
+
+Next, we'll read in the data and take a look at the table:
+
+```{python}
+t = ibis.read_csv("1brc/data/measurements.txt", **kwargs)
+t.limit(3)
+```
+
+Then let's confirm it's **a billion** rows:
+
+```{python}
+f"{t.count().to_pandas():,}"
+```
+
+Finally, we'll compute the min, mean, and max temperature per weather station:
+
+```{python}
+res = run_challenge(t)
+res
+```
+
+## Polars
+
+First let's set the backend to Polars and choose its `kwargs` dictionary:
+
+```{python}
+ibis.set_backend("polars")  # <1>
+kwargs = polars_kwargs
+```
+
+1. Set Polars as the default backend used
+
+Next, we'll read in the data and take a look at the table:
+
+```{python}
+t = ibis.read_csv("1brc/data/measurements.txt", **kwargs)
+t.limit(3)
+```
+
+Then let's confirm it's **a billion** rows:
+
+```{python}
+f"{t.count().to_pandas():,}"
+```
+
+Finally, we'll compute the min, mean, and max temperature per weather station:
+
+```{python}
+res = run_challenge(t)
+res
+```
+
+## DataFusion
+
+First let's set the backend to DataFusion and choose its `kwargs` dictionary:
+
+```{python}
+ibis.set_backend("datafusion")  # <1>
+kwargs = datafusion_kwargs
+```
+
+1. Set DataFusion as the default backend used
+
+Next, we'll read in the data and take a look at the table:
+
+```{python}
+t = ibis.read_csv("1brc/data/measurements.txt", **kwargs)
+t.limit(3)
+```
+
+Then let's confirm it's **a billion** rows:
+
+```{python}
+f"{t.count().to_pandas():,}"
+```
+
+Finally, we'll compute the min, mean, and max temperature per weather station:
+
+```{python}
+res = run_challenge(t)
+res
+```
+
+:::
+
+## Conclusion
+
+While the one billion row challenge isn't a great benchmark, it's a fun way to
+demonstrate how Ibis provides a single Python dataframe API that works across
+DuckDB, Polars, and DataFusion. Feel free to try it out with other backends!
+
+Happy coding!
+
+## Bonus: more billion row data generation
+
+While we're here, I'll share the code I've used in the past to generate a
+billion rows of random data:
+
+```{.python}
+import ibis
+
+con = ibis.connect("duckdb://data.ddb")
+
+ROWS = 1_000_000_000
+
+# programmatically build a SELECT with 26 random() columns, a through z
+sql_str = ""
+sql_str += "select\n"
+for c in list(map(chr, range(ord("a"), ord("z") + 1))):
+    sql_str += f"    random() as {c},\n"
+sql_str += f"from generate_series(1, {ROWS})"  # DuckDB allows the trailing comma
+
+t = con.sql(sql_str)
+con.create_table("billion", t, overwrite=True)
+```
+
+Nowadays I'd convert that to an Ibis expression:
+
+:::{.callout-note}
+This gives a slightly different result with a monotonic index column, but I
+prefer it anyway. You could drop that column or adjust the expression.
+:::
+
+```{.python}
+import ibis
+
+con = ibis.connect("duckdb://data.ddb")
+
+ROWS = 1_000_000_000
+
+t = (
+    ibis.range(ROWS)
+    .unnest()
+    .name("index")
+    .as_table()
+    .mutate(**{c: ibis.random() for c in list(map(chr, range(ord("a"), ord("z") + 1)))})
+)
+con.create_table("billion", t, overwrite=True)
+```
+
+But if you do need to construct a programmatic SQL string, it's cool that you
+can!
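+
+As a quick sanity check after either version (a sketch, assuming the `data.ddb`
+database created above), you can reconnect and aggregate a couple of the
+generated columns:
+
+```{.python}
+import ibis
+
+con = ibis.connect("duckdb://data.ddb")
+
+t = con.table("billion")
+t.agg(rows=t.count(), a_mean=t.a.mean(), z_max=t.z.max())
+```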