From 9a98a5071bc1a7539fa328544537c726ddb3e4b7 Mon Sep 17 00:00:00 2001 From: Cody Date: Fri, 19 Jan 2024 12:31:06 -0500 Subject: [PATCH] forgot the freeze --- docs/_freeze/posts/1brc/index/execute-results/html.json | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/_freeze/posts/1brc/index/execute-results/html.json b/docs/_freeze/posts/1brc/index/execute-results/html.json index fa605313d516..8cb7d020f958 100644 --- a/docs/_freeze/posts/1brc/index/execute-results/html.json +++ b/docs/_freeze/posts/1brc/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "708d488ca91f22708b5c5e94fd03e5da", + "hash": "82db55d1ca02427a0ed68841420637fd", "result": { "engine": "jupyter", - "markdown": "---\ntitle: \"Using one Python dataframe API to take the billion row challenge with DuckDB, Polars, and DataFusion\"\nauthor: \"Cody\"\ndate: \"2024-01-22\"\ncategories:\n - blog\n - duckdb\n - polars\n - datafusion\n - portability\n---\n\n## Overview\n\nThis is an implementation of the [The One Billion Row\nChallenge](https://www.morling.dev/blog/one-billion-row-challenge/):\n\n> Let’s kick off 2024 true coder style—​I’m excited to announce the One Billion\n> Row Challenge (1BRC), running from Jan 1 until Jan 31.\n\n> Your mission, should you decide to accept it, is deceptively simple: write a\n> Java program for retrieving temperature measurement values from a text file and\n> calculating the min, mean, and max temperature per weather station. There’s just\n> one caveat: the file has 1,000,000,000 rows!\n\nI haven't written Java since dropping a computer science course my second year\nof college that forced us to do functional programming exclusively in Java.\nHowever, I'll gladly take the challenge in Python using Ibis! 
In fact, I did\nsomething like this (generating a billion rows with 26 columns of random numbers\nand doing basic aggregations) to test out DuckDB and Polars.\n\nIn this blog, we'll demonstrate how Ibis provides a single Python dataframe API\nto take the billion row challenge with DuckDB, Polars, and DataFusion.\n\n## Setup\n\nWe need to generate the data from the challenge. First, clone the\n[repo](https://github.com/gunnarmorling/1brc):\n\n```{.bash}\ngh repo clone gunnarmorling/1brc\n```\n\nThen change into the Python directory and run the generation script with the\nnumber of rows you want to generate:\n\n```{.bash}\ncd 1brc/src/main/python\npython create_measurements.py 1_000_000_000\n```\n\nThis will generate a file called `measurements.txt` in the `data` directory at\nthe root of the repo. It is 15GB on disk:\n\n```{.bash}\n(venv) cody@voda 1brc % du 1brc/data/*\n 15G 1brc/data/measurements.txt\n808K 1brc/data/weather_stations.csv\n```\n\nAnd consists of one billion rows with two columns separated by a semicolon:\n\n```{.bash}\n(venv) cody@voda 1brc % head 1brc/data/measurements.txt\nKusugal;-67.2\nIpil;-88.6\nSohna;-31.2\nLubuagan;-2.3\nSzentes;29.2\nSylvan Lake;-70.7\nAmbato;-35.2\nBerkine;97.0\nWernau;73.4\nKennewick;-19.9\n```\n\nAlso, you'll need to install Ibis with the three backends we'll use:\n\n```{.bash}\npip install ibis-framework[duckdb,polars,datafusion]\n```\n\n## Understanding Ibis\n\nIbis provides a standard dataframe API decoupled from the execution engine. It\ncompiles Ibis expressions to a form of intermediary representation (often SQL)\nthat can be executed by different backends.\n\nThis allows us to write a single Ibis expression to complete the challenge with\nmany different execution engine backends.\n\n:::{.callout-warning}\nWhile Ibis does its best to abstract away the differences between backends, this\ncannot be done in some areas like data input and output. 
For example, the\n`read_csv` function across various backends (in their SQL and Python forms) have\ndifferent parameters. We'll handle that with different `kwargs` dictionaries for\nthese backends in this post.\n\nIn general, besides creating a connection and data input/output, the Ibis API is\nthe same across backends.\n:::\n\n## Completing the challenge thrice\n\nWe'll use three great options for local backends -- DuckDB, Polars, and\nDataFusion -- to complete the challenge.\n\n### Setup\n\nBefore we get started, we'll make some imports, turn on interactive mode, and\ndefine the `kwargs` dictionary for the backends corresponding to their\n`read_csv` function:\n\n::: {#346b193e .cell execution_count=1}\n``` {.python .cell-code}\nimport ibis\nimport polars as pl\nimport pyarrow as pa\n\nibis.options.interactive = True\n\nduckdb_kwargs = {\n \"delim\": \";\",\n \"header\": False,\n \"columns\": {\"station\": \"VARCHAR\", \"temperature\": \"DOUBLE\"},\n}\n\npolars_kwargs = {\n \"separator\": \";\",\n \"has_header\": False,\n \"new_columns\": [\"station\", \"temperature\"],\n \"schema\": {\"station\": pl.Utf8, \"temperature\": pl.Float64},\n}\n\ndatafusion_kwargs = {\n \"delimiter\": \";\",\n \"has_header\": False,\n \"schema\": pa.schema(\n [\n (\n \"station\",\n pa.string(),\n ),\n (\n \"temperature\",\n pa.float64(),\n ),\n ]\n ),\n \"file_extension\": \".txt\",\n}\n```\n:::\n\n\nLet's define a function to run the same code with each backend to complete the challenge:\n\n::: {#f8ddfb05 .cell execution_count=2}\n``` {.python .cell-code}\ndef run_challenge(t):\n res = (\n t.group_by(ibis._.station)\n .agg(\n min_temp=ibis._.temperature.min(),\n mean_temp=ibis._.temperature.mean(),\n max_temp=ibis._.temperature.max(),\n )\n .order_by(ibis._.station.desc())\n )\n return res\n```\n:::\n\n\n### Completing the challenge\n\nLet's complete the challenge with each backend.\n\n:::{.callout-note}\nThe results are the same across backends but look suspicious. 
It is noted in the\nrepository that the Python generation code is \"unofficial\", so may have some\nproblems. Given this is a contrived example of generated data, I'm not going to\nworry about it.\n\nThe point is that we can easily complete the challenge with the same code across\nmany backends, letting them worry about the details of execution. For this\nreason, I'm also not providing execution times. Try it out yourself!\n:::\n\n::: {.panel-tabset}\n\n## DuckDB\n\nFirst let's set the backend to DuckDB (redundantly since it's the default) and\nthe `kwargs` dictionary:\n\n::: {#1a358abe .cell execution_count=3}\n``` {.python .cell-code}\nibis.set_backend(\"duckdb\") # <1>\nkwargs = duckdb_kwargs\n```\n:::\n\n\n\n\n1. Redundant given DuckDB is the default\n\nNext, we'll read in the data and take a look at the table:\n\n::: {#ead5cf27 .cell execution_count=5}\n``` {.python .cell-code}\nt = ibis.read_csv(\"1brc/data/measurements.txt\", **kwargs)\nt.limit(3)\n```\n\n::: {.cell-output .cell-output-display execution_count=5}\n```{=html}\n
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ station      temperature ┃\n┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ stringfloat64     │\n├─────────────┼─────────────┤\n│ Lívingston -21.0 │\n│ Annūr      -33.4 │\n│ Beni Douala16.5 │\n└─────────────┴─────────────┘\n
\n```\n:::\n:::\n\n\nThen let's confirm it's **a billion** rows:\n\n::: {#9b1bfb39 .cell execution_count=6}\n``` {.python .cell-code}\nf\"{t.count().to_pandas():,}\"\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```\n'1,000,000,000'\n```\n:::\n:::\n\n\nFinally, we'll compute the min, mean, and max temperature per weather station:\n\n::: {#69c639dc .cell execution_count=7}\n``` {.python .cell-code}\nres = run_challenge(t)\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ station         min_temp  mean_temp  max_temp ┃\n┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩\n│ stringfloat64float64float64  │\n├────────────────┼──────────┼───────────┼──────────┤\n│ ’s-Gravendeel -99.90.11218899.9 │\n│ ’Aïn el Hammam-99.9-0.22528999.9 │\n│ ’Aïn Roua     -99.9-0.19824199.9 │\n│ ‘Ibrī         -99.90.00949999.9 │\n│ ‘Ayn al ‘Arab -99.90.12473099.9 │\n│ ‘Akko         -99.9-0.08718499.9 │\n│ ‘Afrīn        -99.9-0.01332299.9 │\n│ Ấp Tân Ngãi   -99.90.34408999.9 │\n│ Ẕefat         -99.90.01776799.9 │\n│ Ḩīsh          -99.90.01880499.9 │\n│  │\n└────────────────┴──────────┴───────────┴──────────┘\n
\n```\n:::\n:::\n\n\n## Polars\n\nFirst let's set the backend to Polars and the `kwargs` dictionary:\n\n::: {#faa766a5 .cell execution_count=8}\n``` {.python .cell-code}\nibis.set_backend(\"polars\") # <1>\nkwargs = polars_kwargs\n```\n:::\n\n\n1. Set Polars as the default backend used\n\nNext, we'll read in the data and take a look at the table:\n\n::: {#8c6bfb71 .cell execution_count=9}\n``` {.python .cell-code}\nt = ibis.read_csv(\"1brc/data/measurements.txt\", **kwargs)\nt.limit(3)\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n```{=html}\n
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ station      temperature ┃\n┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ stringfloat64     │\n├─────────────┼─────────────┤\n│ Lívingston -21.0 │\n│ Annūr      -33.4 │\n│ Beni Douala16.5 │\n└─────────────┴─────────────┘\n
\n```\n:::\n:::\n\n\nThen let's confirm it's **a billion** rows:\n\n::: {#623dfb90 .cell execution_count=10}\n``` {.python .cell-code}\nf\"{t.count().to_pandas():,}\"\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```\n'1,000,000,000'\n```\n:::\n:::\n\n\nFinally, we'll compute the min, mean, and max temperature per weather station:\n\n::: {#e9a74f6e .cell execution_count=11}\n``` {.python .cell-code}\nres = run_challenge(t)\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ station         min_temp  mean_temp  max_temp ┃\n┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩\n│ stringfloat64float64float64  │\n├────────────────┼──────────┼───────────┼──────────┤\n│ ’s-Gravendeel -99.90.11218899.9 │\n│ ’Aïn el Hammam-99.9-0.22528999.9 │\n│ ’Aïn Roua     -99.9-0.19824199.9 │\n│ ‘Ibrī         -99.90.00949999.9 │\n│ ‘Ayn al ‘Arab -99.90.12473099.9 │\n│ ‘Akko         -99.9-0.08718499.9 │\n│ ‘Afrīn        -99.9-0.01332299.9 │\n│ Ấp Tân Ngãi   -99.90.34408999.9 │\n│ Ẕefat         -99.90.01776799.9 │\n│ Ḩīsh          -99.90.01880499.9 │\n│  │\n└────────────────┴──────────┴───────────┴──────────┘\n
\n```\n:::\n:::\n\n\n## DataFusion\n\nFirst let's set the backend to DataFusion and the `kwargs` dictionary:\n\n::: {#bd4f7f44 .cell execution_count=12}\n``` {.python .cell-code}\nibis.set_backend(\"datafusion\") # <1>\nkwargs = datafusion_kwargs\n```\n:::\n\n\n1. Set DataFusion as the default backend used\n\nNext, we'll read in the data and take a look at the table:\n\n::: {#64e7dc60 .cell execution_count=13}\n``` {.python .cell-code}\nt = ibis.read_csv(\"1brc/data/measurements.txt\", **kwargs)\nt.limit(3)\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ station      temperature ┃\n┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ stringfloat64     │\n├─────────────┼─────────────┤\n│ Lívingston -21.0 │\n│ Annūr      -33.4 │\n│ Beni Douala16.5 │\n└─────────────┴─────────────┘\n
\n```\n:::\n:::\n\n\nThen let's confirm it's **a billion** rows:\n\n::: {#5299d343 .cell execution_count=14}\n``` {.python .cell-code}\nf\"{t.count().to_pandas():,}\"\n```\n\n::: {.cell-output .cell-output-display execution_count=14}\n```\n'1,000,000,000'\n```\n:::\n:::\n\n\nFinally, we'll compute the min, mean, and max temperature per weather station:\n\n::: {#b7fd88b7 .cell execution_count=15}\n``` {.python .cell-code}\nres = run_challenge(t)\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ station         min_temp  mean_temp  max_temp ┃\n┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩\n│ stringfloat64float64float64  │\n├────────────────┼──────────┼───────────┼──────────┤\n│ ’s-Gravendeel -99.90.11218899.9 │\n│ ’Aïn el Hammam-99.9-0.22528999.9 │\n│ ’Aïn Roua     -99.9-0.19824199.9 │\n│ ‘Ibrī         -99.90.00949999.9 │\n│ ‘Ayn al ‘Arab -99.90.12473099.9 │\n│ ‘Akko         -99.9-0.08718499.9 │\n│ ‘Afrīn        -99.9-0.01332299.9 │\n│ Ấp Tân Ngãi   -99.90.34408999.9 │\n│ Ẕefat         -99.90.01776799.9 │\n│ Ḩīsh          -99.90.01880499.9 │\n│  │\n└────────────────┴──────────┴───────────┴──────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n## Bonus: more billion row data generation\n\nWhile we're here, I'll share the code I've used in the past to generate a\nbillion rows of random data:\n\n```{.python}\nimport ibis\n\ncon = ibis.connect(\"duckdb://data.ddb\")\n\nROWS = 1_000_000_000\n\nsql_str = \"\"\nsql_str += \"select\\n\"\nfor c in list(map(chr, range(ord(\"a\"), ord(\"z\") + 1))):\n sql_str += f\" random() as {c},\\n\"\nsql_str += f\"from generate_series(1, {ROWS})\"\n\nt = con.sql(sql_str)\ncon.create_table(\"billion\", t, overwrite=True)\n```\n\nNowadays I'd convert that to an Ibis expression:\n\n:::{.callout-note}\nThis is a slightly different result with a monotonic index column, but I prefer\nit anyway. You could drop that column or adjust the expression.\n:::\n\n```{.python}\nimport ibis\n\ncon = ibis.connect(\"duckdb://data.ddb\")\n\nROWS = 1_000_000_000\n\nt = (\n ibis.range(ROWS)\n .unnest()\n .name(\"index\")\n .as_table()\n .mutate(**{c: ibis.random() for c in list(map(chr, range(ord(\"a\"), ord(\"z\") + 1)))})\n)\ncon.create_table(\"billion\", t, overwrite=True)\n```\n\nBut if you do need to construct a programmatic SQL string, it's cool that you\ncan!\n\n## Conclusion\n\nWhile the one billion row challenge isn't a great benchmark, it's a fun way to\ndemonstrate how Ibis provides a single Python dataframe API to take the billion\nrow challenge with DuckDB, Polars, and DataFusion. 
Feel free to try it out with\nother backends!\n\nHappy coding!\n\n", + "markdown": "---\ntitle: \"Using one Python dataframe API to take the billion row challenge with DuckDB, Polars, and DataFusion\"\nauthor: \"Cody\"\ndate: \"2024-01-22\"\ncategories:\n - blog\n - duckdb\n - polars\n - datafusion\n - portability\n---\n\n## Overview\n\nThis is an implementation of [The One Billion Row\nChallenge](https://www.morling.dev/blog/one-billion-row-challenge/):\n\n> Let’s kick off 2024 true coder style—​I’m excited to announce the One Billion\n> Row Challenge (1BRC), running from Jan 1 until Jan 31.\n\n> Your mission, should you decide to accept it, is deceptively simple: write a\n> Java program for retrieving temperature measurement values from a text file and\n> calculating the min, mean, and max temperature per weather station. There’s just\n> one caveat: the file has 1,000,000,000 rows!\n\nI haven't written Java since dropping a computer science course in my second year\nof college that forced us to do functional programming exclusively in Java.\nHowever, I'll gladly take the challenge in Python using Ibis! In fact, I did\nsomething like this (generating a billion rows with 26 columns of random numbers\nand doing basic aggregations) to test out DuckDB and Polars.\n\nIn this post, we'll demonstrate how Ibis provides a single Python dataframe API\nto take the billion row challenge with DuckDB, Polars, and DataFusion.\n\n## Setup\n\nWe need to generate the data for the challenge. First, clone the\n[repo](https://github.com/gunnarmorling/1brc):\n\n```{.bash}\ngh repo clone gunnarmorling/1brc\n```\n\nThen change into the Python directory and run the generation script with the\nnumber of rows you want to generate:\n\n```{.bash}\ncd 1brc/src/main/python\npython create_measurements.py 1_000_000_000\n```\n\nThis will generate a file called `measurements.txt` in the `data` directory at\nthe root of the repo. 
It is 15GB on disk:\n\n```{.bash}\n(venv) cody@voda 1brc % du 1brc/data/*\n 15G 1brc/data/measurements.txt\n808K 1brc/data/weather_stations.csv\n```\n\nIt consists of one billion rows with two columns separated by a semicolon:\n\n```{.bash}\n(venv) cody@voda 1brc % head 1brc/data/measurements.txt\nKusugal;-67.2\nIpil;-88.6\nSohna;-31.2\nLubuagan;-2.3\nSzentes;29.2\nSylvan Lake;-70.7\nAmbato;-35.2\nBerkine;97.0\nWernau;73.4\nKennewick;-19.9\n```\n\nAlso, you'll need to install Ibis with the three backends we'll use (the quotes\nkeep shells like zsh from expanding the brackets):\n\n```{.bash}\npip install 'ibis-framework[duckdb,polars,datafusion]'\n```\n\n## Understanding Ibis\n\nIbis provides a standard dataframe API decoupled from the execution engine. It\ncompiles Ibis expressions to an intermediate representation (often SQL)\nthat can be executed by different backends.\n\nThis allows us to write a single Ibis expression to complete the challenge with\nmany different execution engine backends.\n\n:::{.callout-warning}\nWhile Ibis does its best to abstract away the differences between backends, this\ncannot be done in some areas like data input and output. For example, the\n`read_csv` functions across various backends (in their SQL and Python forms) have\ndifferent parameters. 
We'll handle that with different `kwargs` dictionaries for\nthese backends in this post.\n\nIn general, besides creating a connection and data input/output, the Ibis API is\nthe same across backends.\n:::\n\n## Completing the challenge thrice\n\nWe'll use three great options for local backends -- DuckDB, Polars, and\nDataFusion -- to complete the challenge.\n\n### Setup\n\nBefore we get started, we'll import the libraries we need, turn on interactive\nmode, and define a `kwargs` dictionary for each backend's `read_csv` function:\n\n::: {#b66caea0 .cell execution_count=1}\n``` {.python .cell-code}\nimport ibis\nimport polars as pl\nimport pyarrow as pa\n\nibis.options.interactive = True\n\nduckdb_kwargs = {\n    \"delim\": \";\",\n    \"header\": False,\n    \"columns\": {\"station\": \"VARCHAR\", \"temperature\": \"DOUBLE\"},\n}\n\npolars_kwargs = {\n    \"separator\": \";\",\n    \"has_header\": False,\n    \"new_columns\": [\"station\", \"temperature\"],\n    \"schema\": {\"station\": pl.Utf8, \"temperature\": pl.Float64},\n}\n\ndatafusion_kwargs = {\n    \"delimiter\": \";\",\n    \"has_header\": False,\n    \"schema\": pa.schema(\n        [\n            (\n                \"station\",\n                pa.string(),\n            ),\n            (\n                \"temperature\",\n                pa.float64(),\n            ),\n        ]\n    ),\n    \"file_extension\": \".txt\",\n}\n```\n:::\n\n\nLet's define a function to run the same code with each backend to complete the challenge:\n\n::: {#8d7a8f3b .cell execution_count=2}\n``` {.python .cell-code}\ndef run_challenge(t):\n    res = (\n        t.group_by(ibis._.station)\n        .agg(\n            min_temp=ibis._.temperature.min(),\n            mean_temp=ibis._.temperature.mean(),\n            max_temp=ibis._.temperature.max(),\n        )\n        .order_by(ibis._.station.desc())\n    )\n    return res\n```\n:::\n\n\n### Completing the challenge\n\nLet's complete the challenge with each backend.\n\n:::{.callout-note}\nThe results are the same across backends but look suspicious. It is noted in the\nrepository that the Python generation code is \"unofficial\", so it may have some\nproblems. 
Given this is a contrived example of generated data, I'm not going to\nworry about it.\n\nThe point is that we can easily complete the challenge with the same code across\nmany backends, letting them worry about the details of execution. For this\nreason, I'm also not providing execution times. Try it out yourself!\n:::\n\n::: {.panel-tabset}\n\n## DuckDB\n\nFirst let's set the backend to DuckDB (redundantly since it's the default) and\nthe `kwargs` dictionary:\n\n::: {#a63ac8cc .cell execution_count=3}\n``` {.python .cell-code}\nibis.set_backend(\"duckdb\") # <1>\nkwargs = duckdb_kwargs\n```\n:::\n\n\n\n\n1. Redundant given DuckDB is the default\n\nNext, we'll read in the data and take a look at the table:\n\n::: {#73800ddb .cell execution_count=5}\n``` {.python .cell-code}\nt = ibis.read_csv(\"1brc/data/measurements.txt\", **kwargs)\nt.limit(3)\n```\n\n::: {.cell-output .cell-output-display execution_count=5}\n```{=html}\n
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ station      temperature ┃\n┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ stringfloat64     │\n├─────────────┼─────────────┤\n│ Lívingston -21.0 │\n│ Annūr      -33.4 │\n│ Beni Douala16.5 │\n└─────────────┴─────────────┘\n
\n```\n:::\n:::\n\n\nThen let's confirm it's **a billion** rows:\n\n::: {#cfe6bf62 .cell execution_count=6}\n``` {.python .cell-code}\nf\"{t.count().to_pandas():,}\"\n```\n\n::: {.cell-output .cell-output-display execution_count=6}\n```\n'1,000,000,000'\n```\n:::\n:::\n\n\nFinally, we'll compute the min, mean, and max temperature per weather station:\n\n::: {#b01a5986 .cell execution_count=7}\n``` {.python .cell-code}\nres = run_challenge(t)\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=7}\n```{=html}\n
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ station         min_temp  mean_temp  max_temp ┃\n┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩\n│ stringfloat64float64float64  │\n├────────────────┼──────────┼───────────┼──────────┤\n│ ’s-Gravendeel -99.90.11218899.9 │\n│ ’Aïn el Hammam-99.9-0.22528999.9 │\n│ ’Aïn Roua     -99.9-0.19824199.9 │\n│ ‘Ibrī         -99.90.00949999.9 │\n│ ‘Ayn al ‘Arab -99.90.12473099.9 │\n│ ‘Akko         -99.9-0.08718499.9 │\n│ ‘Afrīn        -99.9-0.01332299.9 │\n│ Ấp Tân Ngãi   -99.90.34408999.9 │\n│ Ẕefat         -99.90.01776799.9 │\n│ Ḩīsh          -99.90.01880499.9 │\n│  │\n└────────────────┴──────────┴───────────┴──────────┘\n
\n```\n:::\n:::\n\n\n## Polars\n\nFirst let's set the backend to Polars and the `kwargs` dictionary:\n\n::: {#bf41374d .cell execution_count=8}\n``` {.python .cell-code}\nibis.set_backend(\"polars\") # <1>\nkwargs = polars_kwargs\n```\n:::\n\n\n1. Set Polars as the default backend used\n\nNext, we'll read in the data and take a look at the table:\n\n::: {#cac9f1fe .cell execution_count=9}\n``` {.python .cell-code}\nt = ibis.read_csv(\"1brc/data/measurements.txt\", **kwargs)\nt.limit(3)\n```\n\n::: {.cell-output .cell-output-display execution_count=9}\n```{=html}\n
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ station      temperature ┃\n┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ stringfloat64     │\n├─────────────┼─────────────┤\n│ Lívingston -21.0 │\n│ Annūr      -33.4 │\n│ Beni Douala16.5 │\n└─────────────┴─────────────┘\n
\n```\n:::\n:::\n\n\nThen let's confirm it's **a billion** rows:\n\n::: {#87cd1a0a .cell execution_count=10}\n``` {.python .cell-code}\nf\"{t.count().to_pandas():,}\"\n```\n\n::: {.cell-output .cell-output-display execution_count=10}\n```\n'1,000,000,000'\n```\n:::\n:::\n\n\nFinally, we'll compute the min, mean, and max temperature per weather station:\n\n::: {#778990e7 .cell execution_count=11}\n``` {.python .cell-code}\nres = run_challenge(t)\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ station         min_temp  mean_temp  max_temp ┃\n┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩\n│ stringfloat64float64float64  │\n├────────────────┼──────────┼───────────┼──────────┤\n│ ’s-Gravendeel -99.90.11218899.9 │\n│ ’Aïn el Hammam-99.9-0.22528999.9 │\n│ ’Aïn Roua     -99.9-0.19824199.9 │\n│ ‘Ibrī         -99.90.00949999.9 │\n│ ‘Ayn al ‘Arab -99.90.12473099.9 │\n│ ‘Akko         -99.9-0.08718499.9 │\n│ ‘Afrīn        -99.9-0.01332299.9 │\n│ Ấp Tân Ngãi   -99.90.34408999.9 │\n│ Ẕefat         -99.90.01776799.9 │\n│ Ḩīsh          -99.90.01880499.9 │\n│  │\n└────────────────┴──────────┴───────────┴──────────┘\n
\n```\n:::\n:::\n\n\n## DataFusion\n\nFirst let's set the backend to DataFusion and the `kwargs` dictionary:\n\n::: {#1a714b65 .cell execution_count=12}\n``` {.python .cell-code}\nibis.set_backend(\"datafusion\") # <1>\nkwargs = datafusion_kwargs\n```\n:::\n\n\n1. Set DataFusion as the default backend used\n\nNext, we'll read in the data and take a look at the table:\n\n::: {#232867fd .cell execution_count=13}\n``` {.python .cell-code}\nt = ibis.read_csv(\"1brc/data/measurements.txt\", **kwargs)\nt.limit(3)\n```\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓\n┃ station      temperature ┃\n┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩\n│ stringfloat64     │\n├─────────────┼─────────────┤\n│ Lívingston -21.0 │\n│ Annūr      -33.4 │\n│ Beni Douala16.5 │\n└─────────────┴─────────────┘\n
\n```\n:::\n:::\n\n\nThen let's confirm it's **a billion** rows:\n\n::: {#bb715c5f .cell execution_count=14}\n``` {.python .cell-code}\nf\"{t.count().to_pandas():,}\"\n```\n\n::: {.cell-output .cell-output-display execution_count=14}\n```\n'1,000,000,000'\n```\n:::\n:::\n\n\nFinally, we'll compute the min, mean, and max temperature per weather station:\n\n::: {#fedee251 .cell execution_count=15}\n``` {.python .cell-code}\nres = run_challenge(t)\nres\n```\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ station         min_temp  mean_temp  max_temp ┃\n┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩\n│ stringfloat64float64float64  │\n├────────────────┼──────────┼───────────┼──────────┤\n│ ’s-Gravendeel -99.90.11218899.9 │\n│ ’Aïn el Hammam-99.9-0.22528999.9 │\n│ ’Aïn Roua     -99.9-0.19824199.9 │\n│ ‘Ibrī         -99.90.00949999.9 │\n│ ‘Ayn al ‘Arab -99.90.12473099.9 │\n│ ‘Akko         -99.9-0.08718499.9 │\n│ ‘Afrīn        -99.9-0.01332299.9 │\n│ Ấp Tân Ngãi   -99.90.34408999.9 │\n│ Ẕefat         -99.90.01776799.9 │\n│ Ḩīsh          -99.90.01880499.9 │\n│  │\n└────────────────┴──────────┴───────────┴──────────┘\n
\n```\n:::\n:::\n\n\n:::\n\n## Conclusion\n\nWhile the one billion row challenge isn't a great benchmark, it's a fun way to\ndemonstrate how Ibis provides a single Python dataframe API to take the billion\nrow challenge with DuckDB, Polars, and DataFusion. Feel free to try it out with\nother backends!\n\nHappy coding!\n\n## Bonus: more billion row data generation\n\nWhile we're here, I'll share the code I've used in the past to generate a\nbillion rows of random data:\n\n```{.python}\nimport ibis\n\ncon = ibis.connect(\"duckdb://data.ddb\")\n\nROWS = 1_000_000_000\n\nsql_str = \"\"\nsql_str += \"select\\n\"\nfor c in list(map(chr, range(ord(\"a\"), ord(\"z\") + 1))):\n    sql_str += f\" random() as {c},\\n\"\nsql_str += f\"from generate_series(1, {ROWS})\"\n\nt = con.sql(sql_str)\ncon.create_table(\"billion\", t, overwrite=True)\n```\n\nNowadays I'd convert that to an Ibis expression:\n\n:::{.callout-note}\nThis is a slightly different result with a monotonic index column, but I prefer\nit anyway. You could drop that column or adjust the expression.\n:::\n\n```{.python}\nimport ibis\n\ncon = ibis.connect(\"duckdb://data.ddb\")\n\nROWS = 1_000_000_000\n\nt = (\n    ibis.range(ROWS)\n    .unnest()\n    .name(\"index\")\n    .as_table()\n    .mutate(**{c: ibis.random() for c in list(map(chr, range(ord(\"a\"), ord(\"z\") + 1)))})\n)\ncon.create_table(\"billion\", t, overwrite=True)\n```\n\nBut if you do need to construct a programmatic SQL string, it's cool that you\ncan!\n\n", "supporting": [ "index_files/figure-html" ],