From 297be44abd176d77aaa5d03a73d2b4d82a6e0abb Mon Sep 17 00:00:00 2001 From: Phillip Cloud <417981+cpcloud@users.noreply.github.com> Date: Sat, 16 Dec 2023 05:04:32 -0500 Subject: [PATCH] docs(perf): use an unordered list instead of an ordered one --- .../index/execute-results/html.json | 9 ++++++--- docs/posts/pydata-performance/index.qmd | 14 +++++++------- 2 files changed, 13 insertions(+), 10 deletions(-) diff --git a/docs/_freeze/posts/pydata-performance/index/execute-results/html.json b/docs/_freeze/posts/pydata-performance/index/execute-results/html.json index c52e29e8647f..15583ddffe70 100644 --- a/docs/_freeze/posts/pydata-performance/index/execute-results/html.json +++ b/docs/_freeze/posts/pydata-performance/index/execute-results/html.json @@ -1,14 +1,17 @@ { - "hash": "1802ba32fcaa631f14871584c07b482c", + "hash": "4587be9fc1ccd88f6227b0e8543b651a", "result": { - "markdown": "---\ntitle: \"Ibis versus X: Performance across the ecosystem part 1\"\nauthor: \"Phillip Cloud\"\ndate: 2023-12-06\ncategories:\n - blog\n - case study\n - ecosystem\n - performance\n---\n\n**TL; DR**: Ibis has a lot of great backends. They're all\ngood at different things. For working with local data, it's hard to beat DuckDB\non feature set and performance.\n\nBuckle up, it's going to be a long one.\n\n## Motivation\n\nIbis maintainer [Gil Forsyth](https://github.com/gforsyth) recently wrote\na [post on our\nblog](https://ibis-project.org/posts/querying-pypi-metadata-compiled-languages/)\nreplicating [**another** blog\npost](https://sethmlarson.dev/security-developer-in-residence-weekly-report-18)\nbut using Ibis instead of raw SQL.\n\nI thought it would be interesting to see how other tools compare to this setup,\nso I decided I'd try to do the same workflow on the same machine using\na few tools from across the ecosystem.\n\nI chose two incumbents--[pandas](https://pandas.pydata.org/) and\n[dask](https://www.dask.org/)--to see how they compare to Ibis + DuckDB on this\nworkload. In part 2 of this series I will compare two newer engines--Polars and\nDataFusion--to Ibis + DuckDB.\n\nI've worked on both pandas and Dask in the past but it's been such a long time\nsince I've used these tools for data analysis that I consider myself rather\nnaive about how to best use them today.\n\nInitially I was interested in API comparisons since usability is really where\nIbis shines, but as I started to explore things, I was unable to complete my\nanalysis in some cases due to running out of memory.\n\n::: {.callout-note}\n# This is not a forum to trash the work of others.\n\nI'm not interested in tearing down other tools.\n\nIbis has backends for each of these tools and it's in everyone's best interest\nthat all of the tools discussed here work to their full potential.\n:::\n\nI show each tool using its native API, in an attempt to compare ease-of-use\nout of the box and maximize each library's ability to complete the workload.\n\nLet's dig in.\n\n\n\n## Setup\n\nI ran all of the code in this blog post on a machine with these specs.\n\nAll OS caches were cleared before running this document with\n\n```bash\n$ sudo sysctl -w vm.drop_caches=3\n```\n\n::: {.callout-warning}\n# Clearing operating system caches **does not represent a realistic usage scenario**\n\nIt is a method for putting the tools here on more equal footing. 
When you're in\nthe thick of an analysis you're not going to artificially limit any OS\noptimizations.\n:::\n\n| Component | Specification |\n| --------- | ------------- |\n| CPU | AMD EPYC 7B12 (64 threads) |\n| RAM | 94 GiB |\n| Disk | 1.5 TiB SSD |\n| OS | NixOS (Linux 6.1.64) |\n\n\n### Soft constraints\n\nI'll introduce some soft UX constraints on the problem, that I think help\nconvey the perspective of someone who wants to get started quickly with\na data set:\n\n1. **I don't want to get another computer** to run this workload.\n2. **I want to use the data as is**, that is, without altering the files\n I already have.\n3. **I'd like to run this computation with the default configuration**.\n Ideally configuration isn't required to complete this workload out of the\n box.\n\n### Library versions\n\nHere are the versions I used to run this experiment at the time of writing.\n\n| Dependency | Version |\n|:-------------|:-------------------------------------------------------------------|\n| Python | 3.10.13 (main, Aug 24 2023, 12:59:26) [GCC 12.3.0] |\n| dask | 2023.12.0 |\n| distributed | 2023.12.0 |\n| duckdb | 0.9.2 |\n| ibis | [`ed47c7404`](https://github.com/ibis-project/ibis/tree/ed47c7404) |\n| pandas | 2.1.3 |\n| pyarrow | 14.0.1 |\n\n\n### Data\n\nI used the files [here](https://raw.githubusercontent.com/pypi-data/data/20135ed214be9d6bb9c316121e5ccdaf29c6b9b1/links/dataset.txt) in this link to run my experiment.\n\nHere's a summary of the data set's file sizes:\n\n```bash\n$ du -h /data/pypi-parquet/*.parquet\n```\n```\n1.8G\t/data/pypi-parquet/index-12.parquet\n1.7G\t/data/pypi-parquet/index-10.parquet\n1.9G\t/data/pypi-parquet/index-2.parquet\n1.9G\t/data/pypi-parquet/index-0.parquet\n1.8G\t/data/pypi-parquet/index-5.parquet\n1.7G\t/data/pypi-parquet/index-13.parquet\n1.7G\t/data/pypi-parquet/index-9.parquet\n1.8G\t/data/pypi-parquet/index-6.parquet\n1.7G\t/data/pypi-parquet/index-7.parquet\n1.7G\t/data/pypi-parquet/index-8.parquet\n800M\t/data/pypi-parquet/index-14.parquet\n1.8G\t/data/pypi-parquet/index-4.parquet\n1.8G\t/data/pypi-parquet/index-11.parquet\n1.9G\t/data/pypi-parquet/index-3.parquet\n1.9G\t/data/pypi-parquet/index-1.parquet\n\n```\n\n\n## Recapping the original Ibis post\n\nCheck out [the original blog\npost](https://ibis-project.org/posts/querying-pypi-metadata-compiled-languages/)\nif you haven't already!\n\nHere's the Ibis + DuckDB code, along with a timed execution of the query:\n\n```python\nfrom __future__ import annotations\n\nimport ibis\nfrom ibis import _, udf\n\n\n@udf.scalar.builtin\ndef flatten(x: list[list[str]]) -> list[str]: # <1>\n ...\n\n\nexpr = (\n ibis.read_parquet(\"/data/pypi-parquet/*.parquet\")\n .filter(\n [\n _.path.re_search(\n r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n ),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1),\n )\n .aggregate(projects=_.project_name.collect().unique())\n .order_by(_.month.desc())\n .mutate(\n ext=_.ext.re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n .dropna(\"ext\")\n .order_by([_.month.desc(), _.project_count.desc()]) # <2>\n)\n\n```\n\n\n1. 
We've since implemented [a `flatten` method](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.flatten)\n on array expressions so it's no longer necessary to define a UDF here. I'll\n leave this code unchanged for this post. **This has no effect on the\n performance of the query**. In both cases the generated code contains\n a DuckDB-native call to [its `flatten`\n function](https://duckdb.org/docs/sql/functions/nested.html).\n2. This is a small change from the original query that adds a final sort key to\n make the results deterministic.\n\n::: {#0b9d0f6e .cell execution_count=6}\n``` {.python .cell-code}\n%time df = expr.to_pandas()\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCPU times: user 26min 24s, sys: 1min 6s, total: 27min 30s\nWall time: 33.5 s\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
         month       ext  project_count
0   2023-11-01     C/C++            836
1   2023-11-01      Rust            190
2   2023-11-01   Fortran             48
3   2023-11-01        Go             33
4   2023-11-01  Assembly            10
..         ...       ...           ...
794 2005-08-01     C/C++             7
795 2005-07-01     C/C++             4
796 2005-05-01     C/C++             1
797 2005-04-01     C/C++             1
798 2005-03-01     C/C++             1

799 rows × 3 columns
\n```\n:::\n:::\n\n\nLet's show peak memory usage in GB as reported by the [](`resource`) module:\n\n::: {#91f153bc .cell execution_count=7}\n``` {.python .cell-code}\nimport resource\n\nrss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss\nrss_mb = rss_kb / 1e3\nrss_gb = rss_mb / 1e3\n\nprint(round(rss_gb, 1), \"GB\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.7 GB\n```\n:::\n:::\n\n\n## Pandas\n\nLet's try to replicate this workflow using pandas.\n\nI started with this code:\n\n::: {#2694821e .cell execution_count=8}\n``` {.python .cell-code}\nimport pandas as pd\n\ndf = pd.read_parquet(\"/data/pypi-parquet/*.parquet\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nFileNotFoundError: [Errno 2] No such file or directory: '/data/pypi-parquet/*.parquet'\n```\n:::\n:::\n\n\nLooks like pandas doesn't support globs. That's fine, we can use the builtin\n`glob` module.\n\n```python\nimport glob\n\ndf = pd.read_parquet(glob.glob(\"/data/pypi-parquet/*.parquet\"))\n```\n\nThis eventually triggers the [Linux OOM\nkiller](https://lwn.net/Kernel/Index/#Memory_management-Out-of-memory_handling)\nafter some minutes, so I can't run the code.\n\nLet's try again with just a single file. I'll pick the smallest file, to avoid any\npotential issues with memory and give pandas the best possible shot.\n\n::: {#0e51484b .cell execution_count=9}\n``` {.python .cell-code}\nimport os\n\nsmallest_file = min(glob.glob(\"/data/pypi-parquet/*.parquet\"), key=os.path.getsize)\n```\n:::\n\n\nThe [smallest file](#data) is 799 MiB on disk.\n\n\n::: {#ac09ae1d .cell execution_count=11}\n``` {.python .cell-code}\n%time df = pd.read_parquet(smallest_file)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCPU times: user 26 s, sys: 12.2 s, total: 38.2 s\nWall time: 26.6 s\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
(preview of the raw table; columns: project_name, project_version,
project_release, uploaded_on, path, archive_path, size, hash, skip_reason,
lines, repository; the head rows are files from the zyte-spider-templates
0.1.0 wheel and the tail rows are files from the 1AH22CS174-1.81.11 sdist)

35468833 rows × 11 columns
\n```\n:::\n:::\n\n\n\n\nLoading the smallest file from the dataset is already pretty close\nto the time it took Ibis and DuckDB to execute the *entire query*.\n\nLet's give pandas a leg up and tell it what columns to use to avoid reading in\na bunch of data we're not going to use.\n\nWe can determine what these columns are by inspecting the Ibis code above.\n\n::: {#3a60d448 .cell execution_count=13}\n``` {.python .cell-code}\ncolumns = [\"path\", \"uploaded_on\", \"project_name\"]\n\n%time df = pd.read_parquet(smallest_file, columns=columns)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCPU times: user 13.8 s, sys: 6.09 s, total: 19.9 s\nWall time: 16.4 s\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
                                                        path              uploaded_on           project_name
0         packages/zyte-spider-templates/zyte_spider_tem...  2023-10-26 07:29:49.894  zyte-spider-templates
1         packages/zyte-spider-templates/zyte_spider_tem...  2023-10-26 07:29:49.894  zyte-spider-templates
2         packages/zyte-spider-templates/zyte_spider_tem...  2023-10-26 07:29:49.894  zyte-spider-templates
3         packages/zyte-spider-templates/zyte_spider_tem...  2023-10-26 07:29:49.894  zyte-spider-templates
4         packages/zyte-spider-templates/zyte_spider_tem...  2023-10-26 07:29:49.894  zyte-spider-templates
...                                                      ...                      ...                    ...
35468828  packages/1AH22CS174/1AH22CS174-1.81.11.tar.gz/...  2023-11-19 13:30:00.113             1AH22CS174
35468829  packages/1AH22CS174/1AH22CS174-1.81.11.tar.gz/...  2023-11-19 13:30:00.113             1AH22CS174
35468830  packages/1AH22CS174/1AH22CS174-1.81.11.tar.gz/...  2023-11-19 13:30:00.113             1AH22CS174
35468831  packages/1AH22CS174/1AH22CS174-1.81.11.tar.gz/...  2023-11-19 13:30:00.113             1AH22CS174
35468832  packages/1AH22CS174/1AH22CS174-1.81.11.tar.gz/...  2023-11-19 13:30:00.113             1AH22CS174

35468833 rows × 3 columns
\n```\n:::\n:::\n\n\nSweet, read times improved!\n\nLet's peek at the memory usage of the DataFrame.\n\n::: {#6ddb6243 .cell execution_count=14}\n``` {.python .cell-code}\nprint(round(df.memory_usage(deep=True).sum() / (1 << 30), 1), \"GiB\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.7 GiB\n```\n:::\n:::\n\n\nI still have plenty of space to do my analysis, nice!\n\nFirst, filter the data:\n\n::: {#7beba8c3 .cell execution_count=15}\n``` {.python .cell-code}\n%%time\ndf = df[\n (\n df.path.str.contains(r\"\\.(?:asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\")\n & ~df.path.str.contains(r\"(?:^|/)test(?:|s|ing)|/site-packages/\") # <1>\n )\n]\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCPU times: user 2min 25s, sys: 281 ms, total: 2min 25s\nWall time: 2min 25s\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
                                                        path              uploaded_on     project_name
1462      packages/zipline-tej/zipline_tej-0.0.50-cp38-c...  2023-10-27 02:23:07.153      zipline-tej
1470      packages/zipline-tej/zipline_tej-0.0.50-cp38-c...  2023-10-27 02:23:07.153      zipline-tej
1477      packages/zipline-tej/zipline_tej-0.0.50-cp38-c...  2023-10-27 02:23:07.153      zipline-tej
1481      packages/zipline-tej/zipline_tej-0.0.50-cp38-c...  2023-10-27 02:23:07.153      zipline-tej
1485      packages/zipline-tej/zipline_tej-0.0.50-cp38-c...  2023-10-27 02:23:07.153      zipline-tej
...                                                      ...                      ...              ...
35460320  packages/atomicshop/atomicshop-2.5.12-py3-none...  2023-11-19 14:29:22.109       atomicshop
35460515  packages/atomicshop/atomicshop-2.5.11-py3-none...  2023-11-19 11:58:09.589       atomicshop
35460710  packages/atomicshop/atomicshop-2.5.10-py3-none...  2023-11-19 11:48:16.980       atomicshop
35463761  packages/ai-flow-nightly/ai_flow_nightly-2023....  2023-11-19 16:06:36.819  ai-flow-nightly
35464036  packages/ai-flow-nightly/ai_flow_nightly-2023....  2023-11-19 16:06:33.327  ai-flow-nightly

7166291 rows × 3 columns
\n```\n:::\n:::\n\n\n1. I altered the original query here to avoid creating an unnecessary\n intermediate `Series` object.\n\nWe've blown **way** past our Ibis + DuckDB latency budget.\n\nLet's keep going!\n\nNext, group by and aggregate:\n\n::: {#82636a00 .cell execution_count=16}\n``` {.python .cell-code}\n%%time\ndf = (\n df.groupby(\n [\n df.uploaded_on.dt.floor(\"M\").rename(\"month\"),\n df.path.str.extract(r\"\\.([a-z0-9]+)$\", 0, expand=False).rename(\"ext\"),\n ]\n )\n .agg({\"project_name\": lambda s: list(set(s))})\n .sort_index(level=\"month\", ascending=False)\n)\ndf\n```\n\n::: {.cell-output .cell-output-error}\n```\nValueError: is a non-fixed frequency\n```\n:::\n:::\n\n\nHere we hit the first API issue going back to an [old pandas\nissue](https://github.com/pandas-dev/pandas/issues/15303): we can't truncate\na timestamp column to month frequency.\n\nLet's try the solution recommended in that issue.\n\n::: {#403e0cfe .cell execution_count=17}\n``` {.python .cell-code}\n%%time\ndf = (\n df.groupby(\n [\n df.uploaded_on.dt.to_period(\"M\").dt.to_timestamp().rename(\"month\"),\n df.path.str.extract(r\"\\.([a-z0-9]+)$\", 0, expand=False).rename(\"ext\"),\n ]\n )\n .agg({\"project_name\": lambda s: list(set(s))})\n .rename(columns={\"project_name\": \"projects\"})\n .sort_index(level=\"month\", ascending=False)\n)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCPU times: user 8.69 s, sys: 204 ms, total: 8.89 s\nWall time: 8.89 s\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=17}\n```{=html}\n
                                                          projects
month      ext
2023-11-01 rs   [polars-u64-idx, py-rattler, maturin, kognic-i...
           hpp  [ddtrace, mqt.ddsim, zmesh, pyntcore, apache-t...
           h    [tinygwas, unidist, solarwinds-apm, PyICU, ima...
           go   [pygobuildinfo, apache-tvm, protobom-py, marie...
           for  [iricore]
           f95  [scikit-digital-health, PyGeopack, easychem, d...
           f90  [dropkick, pixell, mkl-include, mlatom, flexfl...
           f03  [mkl-include]
           f    [PyAstronomy, gnssrefl, mqt.ddsim, adani, mkl-...
           cxx  [mqt.ddsim, slipcover, solarwinds-apm, pygamer...
           cpp  [tinygwas, unidist, sagemath-tdlib, solarwinds...
           cc   [mqt.ddsim, kognic-io, qwen-cpp, adani, apache...
           c    [kognic-io, igem, imagequant, sagemath-objects...
           asm  [awscrt, mqt.ddsim, grpcio, cmeel-assimp, couc...
2023-10-01 rs   [polars-u64-idx, maturin, omikuji, kognic-io, ...
           hpp  [ddtrace, geodesk, apache-tvm, pyntcore, sagem...
           h    [solarwinds-apm, marie-ai, fdtdz, sagemath-obj...
           go   [algobattle-base, jsonatago, python-rtmidi, py...
           f90  [c4p, pypestutils, petsc, badlands, molalignli...
           f    [gnssrefl, simsopt, c4p, mazalib, avni, odoo13...
           cxx  [simsopt, mazalib, slipcover, petsc, solarwind...
           cpp  [qlat-cps, ddtrace, cloudi, algobattle-base, p...
           cc   [correctionlib, libregf-python, kognic-io, pyt...
           c    [algobattle-base, ddtrace, speculos, kognic-io...
           asm  [maud-metabolic-models, awscrt, fibers-ddtest,...
\n```\n:::\n:::\n\n\nSort the values, add a new column and do the final aggregation:\n\n::: {#dbe62c7e .cell execution_count=18}\n``` {.python .cell-code}\n%%time\ndf = (\n df.reset_index()\n .assign(\n ext=lambda t: t.ext.str.replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\", regex=True)\n .str.replace(\"^f.*$\", \"Fortran\", regex=True)\n .str.replace(\"rs\", \"Rust\")\n .str.replace(\"go\", \"Go\")\n .str.replace(\"asm\", \"Assembly\")\n .replace(\"\", None)\n )\n .groupby([\"month\", \"ext\"])\n .agg({\"projects\": lambda s: len(set(sum(s, [])))})\n)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCPU times: user 5.04 ms, sys: 1 ms, total: 6.04 ms\nWall time: 5.91 ms\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=18}\n```{=html}\n
                     projects
month      ext
2023-10-01 Assembly        14
           C/C++          484
           Fortran         23
           Go              25
           Rust            99
2023-11-01 Assembly        10
           C/C++          836
           Fortran         48
           Go              33
           Rust           190
\n```\n:::\n:::\n\n\n\n\nRemember, all of the previous code is executing on **a single file** and still\ntakes minutes to run.\n\n#### Conclusion\n\nIf I only have pandas at my disposal, I'm unsure of how I can avoid getting\na bigger computer to run this query over the entire data set.\n\n### Rewriting the query to be fair\n\nAt this point I wondered whether this was a fair query to run with pandas.\n\nAfter all, the downsides of pandas' use of object arrays to hold nested data\nstructures like lists are well-known.\n\nThe original query uses a lot of nested array types, which are very performant\nin DuckDB, but in this case **we're throwing away all of our arrays** and we\ndon't need to use them.\n\nAdditionally, I'm using lambda functions instead of taking advantage of pandas'\nfast built-in methods like `count`, `nunique` and others.\n\nLet's see if we can alter the original query to give pandas a leg up.\n\n#### A story of two `GROUP BY`s\n\nHere's the first Ibis expression:\n\n```python\nfrom __future__ import annotations\n\nimport ibis\nfrom ibis import _, udf\n\n\n@udf.scalar.builtin\ndef flatten(x: list[list[str]]) -> list[str]: # <1>\n ...\n\n\nexpr = (\n ibis.read_parquet(\"/data/pypi-parquet/*.parquet\")\n .filter(\n [\n _.path.re_search(\n r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n ),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1),\n )\n .aggregate(projects=_.project_name.collect().unique())\n .order_by(_.month.desc())\n .mutate(\n ext=_.ext.re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n .dropna(\"ext\")\n .order_by([_.month.desc(), _.project_count.desc()]) # <2>\n)\n\n```\n\n\nIt looks like we can remove the double `group_by` by moving the second `mutate`\nexpression directly into the first `group_by` call.\n\nApplying these changes:\n\n```diff\n--- step0.py\t2023-12-06 09:25:51.097853839 -0500\n+++ step1.py\t2023-12-06 09:25:51.097853839 -0500\n@@ -5,7 +5,7 @@\n \n \n @udf.scalar.builtin\n-def flatten(x: list[list[str]]) -> list[str]: # <1>\n+def flatten(x: list[list[str]]) -> list[str]:\n ...\n \n \n@@ -22,20 +22,16 @@\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n- ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1),\n- )\n- .aggregate(projects=_.project_name.collect().unique())\n- .order_by(_.month.desc())\n- .mutate(\n- ext=_.ext.re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n+ ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1)\n+ .re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n+ .aggregate(projects=_.project_name.collect().unique())\n+ .order_by(_.month.desc())\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n- .dropna(\"ext\")\n- .order_by([_.month.desc(), _.project_count.desc()]) # <2>\n )\n```\n\n\nWe get:\n\n```python\nfrom __future__ import annotations\n\nimport ibis\nfrom ibis import _, udf\n\n\n@udf.scalar.builtin\ndef flatten(x: list[list[str]]) -> list[str]:\n ...\n\n\nexpr = (\n 
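    # The extension normalization that used to live in a separate `mutate`
    # call is folded into the `ext` grouping key here (see the diff above),
    # so the grouping keys are built in a single step.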
ibis.read_parquet(\"/data/pypi-parquet/*.parquet\")\n .filter(\n [\n _.path.re_search(\n r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n ),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1)\n .re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n .aggregate(projects=_.project_name.collect().unique())\n .order_by(_.month.desc())\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n)\n\n```\n\n\n#### Don't sort unnecessarily\n\nNotice this `order_by` call just before a `group_by` call. Ordering before\ngrouping is somewhat useless here; we should probably sort after we've reduced\nour data. Let's stick the ordering at the end of the query.\n\nApplying these changes:\n\n```diff\n--- step1.py\t2023-12-06 09:25:51.097853839 -0500\n+++ step2.py\t2023-12-06 09:25:51.097853839 -0500\n@@ -31,7 +31,7 @@\n .nullif(\"\"),\n )\n .aggregate(projects=_.project_name.collect().unique())\n- .order_by(_.month.desc())\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n+ .order_by(_.month.desc())\n )\n```\n\n\nWe get:\n\n```python\nfrom __future__ import annotations\n\nimport ibis\nfrom ibis import _, udf\n\n\n@udf.scalar.builtin\ndef flatten(x: list[list[str]]) -> list[str]:\n ...\n\n\nexpr = (\n ibis.read_parquet(\"/data/pypi-parquet/*.parquet\")\n .filter(\n [\n _.path.re_search(\n r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n ),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1)\n .re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n .aggregate(projects=_.project_name.collect().unique())\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n .order_by(_.month.desc())\n)\n\n```\n\n\n#### Don't repeat yourself\n\nNotice that we are now\n\n1. grouping\n2. aggregating\n3. grouping again by the **same keys**\n4. aggregating\n\nThis is less optimal than it could be. 
Notice that we are also flattening an\narray, computing its distinct values and then computing its length.\n\nWe are computing the grouped number of distinct values, and we likely don't\nneed to collect values into an array to do that.\n\nLet's try using a `COUNT(DISTINCT ...)` query instead, to avoid wasting cycles\ncollecting arrays.\n\nWe'll remove the second group by and then call `nunique()` to get the final\nquery.\n\nApplying these changes:\n\n```diff\n--- step2.py\t2023-12-06 09:25:51.097853839 -0500\n+++ step3.py\t2023-12-06 09:25:51.097853839 -0500\n@@ -1,13 +1,7 @@\n from __future__ import annotations\n \n import ibis\n-from ibis import _, udf\n-\n-\n-@udf.scalar.builtin\n-def flatten(x: list[list[str]]) -> list[str]:\n- ...\n-\n+from ibis import _\n \n expr = (\n ibis.read_parquet(\"/data/pypi-parquet/*.parquet\")\n@@ -30,8 +24,7 @@\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n- .aggregate(projects=_.project_name.collect().unique())\n- .group_by([\"month\", \"ext\"])\n- .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n- .order_by(_.month.desc())\n+ .aggregate(project_count=_.project_name.nunique())\n+ .dropna(\"ext\")\n+ .order_by([_.month.desc(), _.project_count.desc()]) # <1>\n )\n```\n\n\nWe get:\n\n\n\n```python\nfrom __future__ import annotations\n\nimport ibis\nfrom ibis import _\n\nexpr = (\n ibis.read_parquet(\"/data/pypi-parquet/*.parquet\")\n .filter(\n [\n _.path.re_search(\n r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n ),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1)\n .re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n .aggregate(project_count=_.project_name.nunique())\n .dropna(\"ext\")\n .order_by([_.month.desc(), _.project_count.desc()]) # <1>\n)\n\n```\n1. I added a second sort key (`project_count`) for deterministic output.\n\n\nLet's run it to make sure the results are as expected:\n\n::: {#b75a180a .cell execution_count=28}\n``` {.python .cell-code}\nduckdb_results = %timeit -n1 -r1 -o expr.to_pandas()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n43.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n```\n:::\n:::\n\n\nIt looks like the new query might be a bit slower even though we're ostensibly\ndoing less computation. 
Since we're still pretty close to the original\nduration, let's keep going.\n\n### Final pandas run with the new query\n\nRewriting the pandas code we get:\n\n```python\nfrom __future__ import annotations\n\nimport glob\nimport os\n\nimport pandas as pd\n\ndf = pd.read_parquet(\n min(glob.glob(\"/data/pypi-parquet/*.parquet\"), key=os.path.getsize),\n columns=[\"path\", \"uploaded_on\", \"project_name\"],\n)\ndf = df[\n df.path.str.contains(r\"\\.(?:asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\")\n & ~df.path.str.contains(r\"(?:(?:^|/)test(?:|s|ing)|/site-packages/)\")\n]\nprint(\n df.assign(\n month=df.uploaded_on.dt.to_period(\"M\").dt.to_timestamp(),\n ext=df.path.str.extract(r\"\\.([a-z0-9]+)$\", 0)\n .iloc[:, 0]\n .str.replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\", regex=True)\n .str.replace(\"^f.*$\", \"Fortran\", regex=True)\n .str.replace(\"rs\", \"Rust\")\n .str.replace(\"go\", \"Go\")\n .str.replace(\"asm\", \"Assembly\"),\n )\n .groupby([\"month\", \"ext\"])\n .project_name.nunique()\n .rename(\"project_count\")\n .reset_index()\n .sort_values([\"month\", \"project_count\"], ascending=False)\n)\n\n```\n\n\nRunning it we get:\n\n::: {#1781a7a1 .cell execution_count=30}\n``` {.python .cell-code}\npandas_results = %timeit -n1 -r1 -o %run pandas_impl.py\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n month ext project_count\n6 2023-11-01 C/C++ 836\n9 2023-11-01 Rust 190\n7 2023-11-01 Fortran 48\n8 2023-11-01 Go 33\n5 2023-11-01 Assembly 10\n1 2023-10-01 C/C++ 484\n4 2023-10-01 Rust 99\n3 2023-10-01 Go 25\n2 2023-10-01 Fortran 23\n0 2023-10-01 Assembly 14\n3min 16s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n```\n:::\n:::\n\n\n\n\n::: {.callout-note}\n# Remember, this is the time it took pandas to run the query for a **single** file.\nDuckDB runs the query over the **entire** dataset about 4x faster than that!\n:::\n\nLet's try a tool that nominally scales to our problem: [Dask](https://dask.org).\n\n## Dask\n\nOne really nice component of Dask is\n[`dask.dataframe`](https://docs.dask.org/en/stable/dataframe.html).\n\nDask DataFrame implements a [good chunk of the pandas\nAPI](https://docs.dask.org/en/stable/dataframe.html#scope) and can be a drop-in\nreplacement for pandas.\n\nI am happy that this turned out to be the case here.\n\nMy first attempt was somewhat naive and was effectively a one line change\nfrom `import pandas as pd` to `import dask.dataframe as pd`.\n\nThis worked and the workload completed. However, after talking to Dask\nexpert and Ibis contributor [Naty Clementi](https://github.com/ncclementi) she\nsuggested I try a few things:\n\n* Use [the distributed scheduler](https://distributed.dask.org/en/stable/).\n* Ensure that [`pyarrow` string arrays are\n used](https://docs.dask.org/en/latest/configuration.html#dask) instead of\n NumPy object arrays. This required **no changes** to my Dask code because\n PyArrow strings have been the default since version 2023.7.1, hooray!\n* Explore some of the options to `read_parquet`. 
It turned out that without setting\n `split_row_groups=True` I ran out of memory.\n\nLet's look at the Dask implementation:\n\n```python\nfrom __future__ import annotations\n\nimport logging\n\nimport dask.dataframe as dd\nfrom dask.distributed import Client\n\nif __name__ == \"__main__\":\n client = Client(silence_logs=logging.ERROR)\n df = dd.read_parquet(\n \"/data/pypi-parquet/*.parquet\",\n columns=[\"path\", \"uploaded_on\", \"project_name\"],\n split_row_groups=True,\n )\n df = df[\n df.path.str.contains(\n r\"\\.(?:asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n )\n & ~df.path.str.contains(r\"(?:^|/)test(?:|s|ing)\")\n & ~df.path.str.contains(\"/site-packages/\")\n ]\n print(\n df.assign(\n month=df.uploaded_on.dt.to_period(\"M\").dt.to_timestamp(),\n ext=df.path.str.extract(r\"\\.([a-z0-9]+)$\", 0, expand=False)\n .str.replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\", regex=True)\n .str.replace(\"^f.*$\", \"Fortran\", regex=True)\n .str.replace(\"rs\", \"Rust\")\n .str.replace(\"go\", \"Go\")\n .str.replace(\"asm\", \"Assembly\"),\n )\n .groupby([\"month\", \"ext\"])\n .project_name.nunique()\n .rename(\"project_count\")\n .compute()\n .reset_index()\n .sort_values([\"month\", \"project_count\"], ascending=False)\n )\n client.shutdown()\n\n```\n\n\nLet's run the code:\n\n::: {#024c9b04 .cell execution_count=33}\n``` {.python .cell-code}\ndask_results = %timeit -n1 -r1 -o %run dask_impl.py\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n month ext project_count\n794 2023-11-01 C/C++ 836\n796 2023-11-01 Rust 190\n797 2023-11-01 Fortran 48\n795 2023-11-01 Go 33\n798 2023-11-01 Assembly 10\n.. ... ... ...\n2 2005-08-01 C/C++ 7\n1 2005-07-01 C/C++ 4\n83 2005-05-01 C/C++ 1\n82 2005-04-01 C/C++ 1\n0 2005-03-01 C/C++ 1\n\n[799 rows x 3 columns]\n53 s ± 0 ns per loop (mean ± std. dev. 
of 1 run, 1 loop each)\n```\n:::\n:::\n\n\n\n\nThat's a great improvement over pandas: we finished the workload and our\nrunning time is pretty close to DuckDB.\n\n## Takeaways\n\n**Ibis + DuckDB is the only system tested that handles this workload well out of the box**\n\n* Pandas couldn't handle the workload due to memory constraints.\n* Dask required its [recommended distributed\n scheduler](https://docs.dask.org/en/stable/deploying.html#deploy-dask-clusters)\n to achieve maximum performance and still used a lot of memory.\n\nLet's recap the results with some numbers:\n\n### Numbers\n\n| Toolset | Data size | Duration | Throughput |\n| ------------------ | --------: | -----------: | ---------: |\n| Ibis + DuckDB | 25,825 MiB | 44 seconds | 589 MiB/s |\n| Dask + Distributed | 25,825 MiB | 53 seconds | 488 MiB/s |\n\n\nWith Ibis + DuckDB, I was able to write the query the way I wanted to without\nrunning out of memory, using the default configuration provided by Ibis.\n\n\nI was able run this computation around **145x faster** than you can expect\nwith pandas using this hardware setup.\n\n\n\nIn contrast, pandas ran out of memory **on a single file** without some hand\nholding and while Dask didn't cause my program to run out of memory it still\nused quite a bit more than DuckDB.\n\n### Pandas is untenable for this workload\n\nPandas requires me to load everything into memory, and my machine doesn't have\nenough memory to do that.\n\nGiven that Ibis + DuckDB runs this workload on my machine it doesn't seem worth\nthe effort to write any additional code to make pandas scale to the whole\ndataset.\n\n### Dask finishes in a similar amount of time as Ibis + DuckDB (within 2x)\n\nOut of the box I had quite a bit of difficulty figuring out how to maximize\nperformance and not run out of memory.\n\nPlease get in touch if you think my Dask code can be improved!\n\nI know the Dask community is hard at work building\n[`dask-expr`](https://github.com/dask-contrib/dask-expr) which might improve\nthe performance of this workload when it lands.\n\n## Next steps\n\n### Please get in touch!\n\nIf you have ideas about how to speed up my use of the tools I've discussed here\nplease get in touch by opening a [GitHub\ndiscussion](https://github.com/ibis-project/ibis/discussions)!\n\nWe would love it if more backends handled this workload!\n\n### Look out for part 2\n\nIn part 2 of this series we'll explore how Polars and DataFusion perform on\nthis query. Stay tuned!\n\n", + "markdown": "---\ntitle: \"Ibis versus X: Performance across the ecosystem part 1\"\nauthor: \"Phillip Cloud\"\ndate: 2023-12-06\ncategories:\n - blog\n - case study\n - ecosystem\n - performance\n---\n\n**TL; DR**: Ibis has a lot of great backends. They're all\ngood at different things. 
For working with local data, it's hard to beat DuckDB\non feature set and performance.\n\nBuckle up, it's going to be a long one.\n\n## Motivation\n\nIbis maintainer [Gil Forsyth](https://github.com/gforsyth) recently wrote\na [post on our\nblog](https://ibis-project.org/posts/querying-pypi-metadata-compiled-languages/)\nreplicating [**another** blog\npost](https://sethmlarson.dev/security-developer-in-residence-weekly-report-18)\nbut using Ibis instead of raw SQL.\n\nI thought it would be interesting to see how other tools compare to this setup,\nso I decided I'd try to do the same workflow on the same machine using\na few tools from across the ecosystem.\n\nI chose two incumbents--[pandas](https://pandas.pydata.org/) and\n[dask](https://www.dask.org/)--to see how they compare to Ibis + DuckDB on this\nworkload. In part 2 of this series I will compare two newer engines--Polars and\nDataFusion--to Ibis + DuckDB.\n\nI've worked on both pandas and Dask in the past but it's been such a long time\nsince I've used these tools for data analysis that I consider myself rather\nnaive about how to best use them today.\n\nInitially I was interested in API comparisons since usability is really where\nIbis shines, but as I started to explore things, I was unable to complete my\nanalysis in some cases due to running out of memory.\n\n::: {.callout-note}\n# This is not a forum to trash the work of others.\n\nI'm not interested in tearing down other tools.\n\nIbis has backends for each of these tools and it's in everyone's best interest\nthat all of the tools discussed here work to their full potential.\n:::\n\nI show each tool using its native API, in an attempt to compare ease-of-use\nout of the box and maximize each library's ability to complete the workload.\n\nLet's dig in.\n\n\n\n## Setup\n\nI ran all of the code in this blog post on a machine with these specs.\n\nAll OS caches were cleared before running this document with\n\n```bash\n$ sudo sysctl -w vm.drop_caches=3\n```\n\n::: {.callout-warning}\n# Clearing operating system caches **does not represent a realistic usage scenario**\n\nIt is a method for putting the tools here on more equal footing. When you're in\nthe thick of an analysis you're not going to artificially limit any OS\noptimizations.\n:::\n\n| Component | Specification |\n| --------- | ------------- |\n| CPU | AMD EPYC 7B12 (64 threads) |\n| RAM | 94 GiB |\n| Disk | 1.5 TiB SSD |\n| OS | NixOS (Linux 6.1.68) |\n\n\n### Soft constraints\n\nI'll introduce some soft UX constraints on the problem, that I think help\nconvey the perspective of someone who wants to get started quickly with\na data set:\n\n1. **I don't want to get another computer** to run this workload.\n2. **I want to use the data as is**, that is, without altering the files\n I already have.\n3. 
**I'd like to run this computation with the default configuration**.\n Ideally configuration isn't required to complete this workload out of the\n box.\n\n### Library versions\n\nHere are the versions I used to run this experiment at the time of writing.\n\n| Dependency | Version |\n|:-------------|:-------------------------------------------------------------------|\n| Python | 3.10.13 (main, Aug 24 2023, 12:59:26) [GCC 12.3.0] |\n| dask | 2023.12.0 |\n| distributed | 2023.12.0 |\n| duckdb | 0.9.2 |\n| ibis | [`2d37ae816`](https://github.com/ibis-project/ibis/tree/2d37ae816) |\n| pandas | 2.1.4 |\n| pyarrow | 14.0.1 |\n\n\n### Data\n\nI used the files [here](https://raw.githubusercontent.com/pypi-data/data/20135ed214be9d6bb9c316121e5ccdaf29c6b9b1/links/dataset.txt) in this link to run my experiment.\n\nHere's a summary of the data set's file sizes:\n\n```bash\n$ du -h /data/pypi-parquet/*.parquet\n```\n```\n1.8G\t/data/pypi-parquet/index-12.parquet\n1.7G\t/data/pypi-parquet/index-10.parquet\n1.9G\t/data/pypi-parquet/index-2.parquet\n1.9G\t/data/pypi-parquet/index-0.parquet\n1.8G\t/data/pypi-parquet/index-5.parquet\n1.7G\t/data/pypi-parquet/index-13.parquet\n1.7G\t/data/pypi-parquet/index-9.parquet\n1.8G\t/data/pypi-parquet/index-6.parquet\n1.7G\t/data/pypi-parquet/index-7.parquet\n1.7G\t/data/pypi-parquet/index-8.parquet\n800M\t/data/pypi-parquet/index-14.parquet\n1.8G\t/data/pypi-parquet/index-4.parquet\n1.8G\t/data/pypi-parquet/index-11.parquet\n1.9G\t/data/pypi-parquet/index-3.parquet\n1.9G\t/data/pypi-parquet/index-1.parquet\n\n```\n\n\n## Recapping the original Ibis post\n\nCheck out [the original blog\npost](https://ibis-project.org/posts/querying-pypi-metadata-compiled-languages/)\nif you haven't already!\n\nHere's the Ibis + DuckDB code, along with a timed execution of the query:\n\n```python\nfrom __future__ import annotations\n\nimport ibis\nfrom ibis import _, udf\n\n\n@udf.scalar.builtin\ndef flatten(x: list[list[str]]) -> list[str]: # <1>\n ...\n\n\nexpr = (\n ibis.read_parquet(\"/data/pypi-parquet/*.parquet\")\n .filter(\n [\n _.path.re_search(\n r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n ),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1),\n )\n .aggregate(projects=_.project_name.collect().unique())\n .order_by(_.month.desc())\n .mutate(\n ext=_.ext.re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n .dropna(\"ext\")\n .order_by([_.month.desc(), _.project_count.desc()]) # <2>\n)\n\n```\n\n\n1. We've since implemented [a `flatten` method](../../reference/expression-collections.qmd#ibis.expr.types.arrays.ArrayValue.flatten)\n on array expressions so it's no longer necessary to define a UDF here. I'll\n leave this code unchanged for this post. **This has no effect on the\n performance of the query**. In both cases the generated code contains\n a DuckDB-native call to [its `flatten`\n function](https://duckdb.org/docs/sql/functions/nested.html).\n2. 
This is a small change from the original query that adds a final sort key to\n make the results deterministic.\n\n::: {#23f9d8fa .cell execution_count=6}\n``` {.python .cell-code}\n%time df = expr.to_pandas()\ndf\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\nCPU times: user 20min 46s, sys: 1min 10s, total: 21min 56s\nWall time: 28.5 s\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=6}\n```{=html}\n
         month       ext  project_count
0   2023-11-01     C/C++            836
1   2023-11-01      Rust            190
2   2023-11-01   Fortran             48
3   2023-11-01        Go             33
4   2023-11-01  Assembly            10
..         ...       ...           ...
794 2005-08-01     C/C++             7
795 2005-07-01     C/C++             4
796 2005-05-01     C/C++             1
797 2005-04-01     C/C++             1
798 2005-03-01     C/C++             1

799 rows × 3 columns
\n```\n:::\n:::\n\n\nLet's show peak memory usage in GB as reported by the [](`resource`) module:\n\n::: {#4d8c28d7 .cell execution_count=7}\n``` {.python .cell-code}\nimport resource\n\nrss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss\nrss_mb = rss_kb / 1e3\nrss_gb = rss_mb / 1e3\n\nprint(round(rss_gb, 1), \"GB\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.6 GB\n```\n:::\n:::\n\n\n## Pandas\n\nLet's try to replicate this workflow using pandas.\n\nI started with this code:\n\n::: {#4f9244e8 .cell execution_count=8}\n``` {.python .cell-code}\nimport pandas as pd\n\ndf = pd.read_parquet(\"/data/pypi-parquet/*.parquet\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nFileNotFoundError: [Errno 2] No such file or directory: '/data/pypi-parquet/*.parquet'\n```\n:::\n:::\n\n\nLooks like pandas doesn't support globs. That's fine, we can use the builtin\n`glob` module.\n\n```python\nimport glob\n\ndf = pd.read_parquet(glob.glob(\"/data/pypi-parquet/*.parquet\"))\n```\n\nThis eventually triggers the [Linux OOM\nkiller](https://lwn.net/Kernel/Index/#Memory_management-Out-of-memory_handling)\nafter some minutes, so I can't run the code.\n\nLet's try again with just a single file. I'll pick the smallest file, to avoid any\npotential issues with memory and give pandas the best possible shot.\n\n::: {#cfb2b3ac .cell execution_count=9}\n``` {.python .cell-code}\nimport os\n\nsmallest_file = min(glob.glob(\"/data/pypi-parquet/*.parquet\"), key=os.path.getsize)\n```\n:::\n\n\nThe [smallest file](#data) is 799 MiB on disk.\n\n\n::: {#84466e2e .cell execution_count=11}\n``` {.python .cell-code}\n%time df = pd.read_parquet(smallest_file)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCPU times: user 26 s, sys: 13.3 s, total: 39.3 s\nWall time: 27 s\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=11}\n```{=html}\n
(preview of the raw table; columns: project_name, project_version,
project_release, uploaded_on, path, archive_path, size, hash, skip_reason,
lines, repository; the head rows are files from the zyte-spider-templates
0.1.0 wheel and the tail rows are files from the 1AH22CS174-1.81.11 sdist)

35468833 rows × 11 columns
\n```\n:::\n:::\n\n\n\n\nLoading the smallest file from the dataset is already pretty close\nto the time it took Ibis and DuckDB to execute the *entire query*.\n\nLet's give pandas a leg up and tell it what columns to use to avoid reading in\na bunch of data we're not going to use.\n\nWe can determine what these columns are by inspecting the Ibis code above.\n\n::: {#55ad752c .cell execution_count=13}\n``` {.python .cell-code}\ncolumns = [\"path\", \"uploaded_on\", \"project_name\"]\n\n%time df = pd.read_parquet(smallest_file, columns=columns)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCPU times: user 13.9 s, sys: 7.09 s, total: 21 s\nWall time: 16.9 s\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=13}\n```{=html}\n
                                                        path              uploaded_on           project_name
0         packages/zyte-spider-templates/zyte_spider_tem...  2023-10-26 07:29:49.894  zyte-spider-templates
1         packages/zyte-spider-templates/zyte_spider_tem...  2023-10-26 07:29:49.894  zyte-spider-templates
2         packages/zyte-spider-templates/zyte_spider_tem...  2023-10-26 07:29:49.894  zyte-spider-templates
3         packages/zyte-spider-templates/zyte_spider_tem...  2023-10-26 07:29:49.894  zyte-spider-templates
4         packages/zyte-spider-templates/zyte_spider_tem...  2023-10-26 07:29:49.894  zyte-spider-templates
...                                                      ...                      ...                    ...
35468828  packages/1AH22CS174/1AH22CS174-1.81.11.tar.gz/...  2023-11-19 13:30:00.113             1AH22CS174
35468829  packages/1AH22CS174/1AH22CS174-1.81.11.tar.gz/...  2023-11-19 13:30:00.113             1AH22CS174
35468830  packages/1AH22CS174/1AH22CS174-1.81.11.tar.gz/...  2023-11-19 13:30:00.113             1AH22CS174
35468831  packages/1AH22CS174/1AH22CS174-1.81.11.tar.gz/...  2023-11-19 13:30:00.113             1AH22CS174
35468832  packages/1AH22CS174/1AH22CS174-1.81.11.tar.gz/...  2023-11-19 13:30:00.113             1AH22CS174

35468833 rows × 3 columns
\n```\n:::\n:::\n\n\nSweet, read times improved!\n\nLet's peek at the memory usage of the DataFrame.\n\n::: {#88a65893 .cell execution_count=14}\n``` {.python .cell-code}\nprint(round(df.memory_usage(deep=True).sum() / (1 << 30), 1), \"GiB\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n8.7 GiB\n```\n:::\n:::\n\n\nI still have plenty of space to do my analysis, nice!\n\nFirst, filter the data:\n\n::: {#943f1a94 .cell execution_count=15}\n``` {.python .cell-code}\n%%time\ndf = df[\n (\n df.path.str.contains(r\"\\.(?:asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\")\n & ~df.path.str.contains(r\"(?:^|/)test(?:|s|ing)|/site-packages/\") # <1>\n )\n]\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCPU times: user 2min 21s, sys: 297 ms, total: 2min 22s\nWall time: 2min 21s\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=15}\n```{=html}\n
                                                        path              uploaded_on     project_name
1462      packages/zipline-tej/zipline_tej-0.0.50-cp38-c...  2023-10-27 02:23:07.153      zipline-tej
1470      packages/zipline-tej/zipline_tej-0.0.50-cp38-c...  2023-10-27 02:23:07.153      zipline-tej
1477      packages/zipline-tej/zipline_tej-0.0.50-cp38-c...  2023-10-27 02:23:07.153      zipline-tej
1481      packages/zipline-tej/zipline_tej-0.0.50-cp38-c...  2023-10-27 02:23:07.153      zipline-tej
1485      packages/zipline-tej/zipline_tej-0.0.50-cp38-c...  2023-10-27 02:23:07.153      zipline-tej
...                                                      ...                      ...              ...
35460320  packages/atomicshop/atomicshop-2.5.12-py3-none...  2023-11-19 14:29:22.109       atomicshop
35460515  packages/atomicshop/atomicshop-2.5.11-py3-none...  2023-11-19 11:58:09.589       atomicshop
35460710  packages/atomicshop/atomicshop-2.5.10-py3-none...  2023-11-19 11:48:16.980       atomicshop
35463761  packages/ai-flow-nightly/ai_flow_nightly-2023....  2023-11-19 16:06:36.819  ai-flow-nightly
35464036  packages/ai-flow-nightly/ai_flow_nightly-2023....  2023-11-19 16:06:33.327  ai-flow-nightly

7166291 rows × 3 columns
\n```\n:::\n:::\n\n\n1. I altered the original query here to avoid creating an unnecessary\n intermediate `Series` object.\n\nWe've blown **way** past our Ibis + DuckDB latency budget.\n\nLet's keep going!\n\nNext, group by and aggregate:\n\n::: {#562d718d .cell execution_count=16}\n``` {.python .cell-code}\n%%time\ndf = (\n df.groupby(\n [\n df.uploaded_on.dt.floor(\"M\").rename(\"month\"),\n df.path.str.extract(r\"\\.([a-z0-9]+)$\", 0, expand=False).rename(\"ext\"),\n ]\n )\n .agg({\"project_name\": lambda s: list(set(s))})\n .sort_index(level=\"month\", ascending=False)\n)\ndf\n```\n\n::: {.cell-output .cell-output-error}\n```\nValueError: is a non-fixed frequency\n```\n:::\n:::\n\n\nHere we hit the first API issue going back to an [old pandas\nissue](https://github.com/pandas-dev/pandas/issues/15303): we can't truncate\na timestamp column to month frequency.\n\nLet's try the solution recommended in that issue.\n\n::: {#93d4bc0b .cell execution_count=17}\n``` {.python .cell-code}\n%%time\ndf = (\n df.groupby(\n [\n df.uploaded_on.dt.to_period(\"M\").dt.to_timestamp().rename(\"month\"),\n df.path.str.extract(r\"\\.([a-z0-9]+)$\", 0, expand=False).rename(\"ext\"),\n ]\n )\n .agg({\"project_name\": lambda s: list(set(s))})\n .rename(columns={\"project_name\": \"projects\"})\n .sort_index(level=\"month\", ascending=False)\n)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCPU times: user 8.14 s, sys: 189 ms, total: 8.32 s\nWall time: 8.33 s\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=17}\n```{=html}\n
month       ext  projects
2023-11-01  rs   [qoqo-for-braket-devices, rouge-rs, h3ronpy, g...
            hpp  [isotree, boutpp-nightly, fdasrsf, PlotPy, mod...
            h    [numina, pyppmd, pantab, liftover, jupyter-cpp...
            go   [cppyy-cling, ascend-deployer, c2cciutils, aws...
            for  [iricore]
            f95  [PyGeopack, easychem, dioptas, scikit-digital-...
            f90  [iricore, pdfo, cosmosis, mkl-include, c4p, ml...
            f03  [mkl-include]
            f    [fastjet, pdfo, pyspharm-syl, PyAstronomy, PyG...
            cxx  [cppyy-cling, teca, boutpp-nightly, aplr, CPyC...
            cpp  [numina, liftover, jupyter-cpp-kernel, cuda-qu...
            cc   [numina, boutpp-nightly, tiledb, pyxai, arcae,...
            c    [cytimes, assemblyline, pyppmd, pantab, liftov...
            asm  [couchbase, hrm-interpreter, cmeel-assimp, aws...
2023-10-01  rs   [ruff, xvc, qarray-rust-core, del-msh, uniffi-...
            hpp  [pycaracal, pylibrb, cripser, icupy, cylp, mai...
            h    [pycaracal, icupy, memprocfs, mindspore-dev, w...
            go   [cryptography, c2cciutils, awscrt, odoo14-addo...
            f90  [pypestutils, petitRADTRANS, alpaqa, molalignl...
            f    [alpaqa, pestifer, gnssrefl, LightSim2Grid, od...
            cxx  [cars, wxPython-zombie, AnalysisG, petsc, fift...
            cpp  [pycaracal, roboflex.util.png, reynir, ParmEd,...
            cc   [tf-nightly-cpu-aws, cornflakes, trajgenpy, tf...
            c    [pycaracal, cytimes, fibers-ddtest, assemblyli...
            asm  [fibers-ddtest, maud-metabolic-models, chipsec...
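As an aside, here's a minimal, self-contained sketch of the workaround from that pandas issue, using a tiny synthetic `uploaded_on` series as a stand-in for the real column: round-tripping through a monthly `Period` truncates each timestamp to the start of its month, which is what `dt.floor("M")` can't do.

```python
# Minimal sketch of the month-truncation workaround recommended in the pandas
# issue above; `uploaded_on` is a tiny synthetic stand-in for the real column.
import pandas as pd

uploaded_on = pd.Series(
    pd.to_datetime(["2023-10-27 02:23:07.153", "2023-11-19 14:29:22.109"])
)

# uploaded_on.dt.floor("M") raises, because a month is not a fixed-width frequency.
month = uploaded_on.dt.to_period("M").dt.to_timestamp()
print(month)
# 0   2023-10-01
# 1   2023-11-01
# dtype: datetime64[ns]
```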
\n```\n:::\n:::\n\n\nSort the values, add a new column and do the final aggregation:\n\n::: {#94cd3564 .cell execution_count=18}\n``` {.python .cell-code}\n%%time\ndf = (\n df.reset_index()\n .assign(\n ext=lambda t: t.ext.str.replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\", regex=True)\n .str.replace(\"^f.*$\", \"Fortran\", regex=True)\n .str.replace(\"rs\", \"Rust\")\n .str.replace(\"go\", \"Go\")\n .str.replace(\"asm\", \"Assembly\")\n .replace(\"\", None)\n )\n .groupby([\"month\", \"ext\"])\n .agg({\"projects\": lambda s: len(set(sum(s, [])))})\n)\ndf\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCPU times: user 4.96 ms, sys: 0 ns, total: 4.96 ms\nWall time: 4.81 ms\n```\n:::\n\n::: {.cell-output .cell-output-display execution_count=18}\n```{=html}\n
month       ext       projects
2023-10-01  Assembly  14
            C/C++     484
            Fortran   23
            Go        25
            Rust      99
2023-11-01  Assembly  10
            C/C++     836
            Fortran   48
            Go        33
            Rust      190
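To spell out what that last aggregation is doing, here's its core in isolation on toy data (the project names below are invented): each group holds lists of project names, so we flatten the lists and then deduplicate before counting.

```python
# Toy version of the flatten-then-dedupe aggregation used above: each element
# stands in for the list of project names collected for one (month, ext) group.
groups = [["ruff", "xvc"], ["ruff", "polars"]]

flattened = sum(groups, [])  # ["ruff", "xvc", "ruff", "polars"]
distinct = set(flattened)    # {"ruff", "xvc", "polars"}
print(len(distinct))         # 3
```

This flatten-then-dedupe step is exactly what the rewrite later in the post avoids by counting distinct project names directly with `nunique()`.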
\n```\n:::\n:::\n\n\n\n\nRemember, all of the previous code is executing on **a single file** and still\ntakes minutes to run.\n\n#### Conclusion\n\nIf I only have pandas at my disposal, I'm unsure of how I can avoid getting\na bigger computer to run this query over the entire data set.\n\n### Rewriting the query to be fair\n\nAt this point I wondered whether this was a fair query to run with pandas.\n\nAfter all, the downsides of pandas' use of object arrays to hold nested data\nstructures like lists are well-known.\n\nThe original query uses a lot of nested array types, which are very performant\nin DuckDB, but in this case **we're throwing away all of our arrays** and we\ndon't need to use them.\n\nAdditionally, I'm using lambda functions instead of taking advantage of pandas'\nfast built-in methods like `count`, `nunique` and others.\n\nLet's see if we can alter the original query to give pandas a leg up.\n\n#### A story of two `GROUP BY`s\n\nHere's the first Ibis expression:\n\n```python\nfrom __future__ import annotations\n\nimport ibis\nfrom ibis import _, udf\n\n\n@udf.scalar.builtin\ndef flatten(x: list[list[str]]) -> list[str]: # <1>\n ...\n\n\nexpr = (\n ibis.read_parquet(\"/data/pypi-parquet/*.parquet\")\n .filter(\n [\n _.path.re_search(\n r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n ),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1),\n )\n .aggregate(projects=_.project_name.collect().unique())\n .order_by(_.month.desc())\n .mutate(\n ext=_.ext.re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n .dropna(\"ext\")\n .order_by([_.month.desc(), _.project_count.desc()]) # <2>\n)\n\n```\n\n\nIt looks like we can remove the double `group_by` by moving the second `mutate`\nexpression directly into the first `group_by` call.\n\nApplying these changes:\n\n```diff\n--- step0.py\t2023-12-12 05:20:01.712513949 -0500\n+++ step1.py\t2023-12-12 05:20:01.712513949 -0500\n@@ -5,7 +5,7 @@\n \n \n @udf.scalar.builtin\n-def flatten(x: list[list[str]]) -> list[str]: # <1>\n+def flatten(x: list[list[str]]) -> list[str]:\n ...\n \n \n@@ -22,20 +22,16 @@\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n- ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1),\n- )\n- .aggregate(projects=_.project_name.collect().unique())\n- .order_by(_.month.desc())\n- .mutate(\n- ext=_.ext.re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n+ ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1)\n+ .re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n+ .aggregate(projects=_.project_name.collect().unique())\n+ .order_by(_.month.desc())\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n- .dropna(\"ext\")\n- .order_by([_.month.desc(), _.project_count.desc()]) # <2>\n )\n```\n\n\nWe get:\n\n```python\nfrom __future__ import annotations\n\nimport ibis\nfrom ibis import _, udf\n\n\n@udf.scalar.builtin\ndef flatten(x: list[list[str]]) -> list[str]:\n ...\n\n\nexpr = (\n 
ibis.read_parquet(\"/data/pypi-parquet/*.parquet\")\n .filter(\n [\n _.path.re_search(\n r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n ),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1)\n .re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n .aggregate(projects=_.project_name.collect().unique())\n .order_by(_.month.desc())\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n)\n\n```\n\n\n#### Don't sort unnecessarily\n\nNotice this `order_by` call just before a `group_by` call. Ordering before\ngrouping is somewhat useless here; we should probably sort after we've reduced\nour data. Let's stick the ordering at the end of the query.\n\nApplying these changes:\n\n```diff\n--- step1.py\t2023-12-12 05:20:01.712513949 -0500\n+++ step2.py\t2023-12-12 05:20:01.712513949 -0500\n@@ -31,7 +31,7 @@\n .nullif(\"\"),\n )\n .aggregate(projects=_.project_name.collect().unique())\n- .order_by(_.month.desc())\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n+ .order_by(_.month.desc())\n )\n```\n\n\nWe get:\n\n```python\nfrom __future__ import annotations\n\nimport ibis\nfrom ibis import _, udf\n\n\n@udf.scalar.builtin\ndef flatten(x: list[list[str]]) -> list[str]:\n ...\n\n\nexpr = (\n ibis.read_parquet(\"/data/pypi-parquet/*.parquet\")\n .filter(\n [\n _.path.re_search(\n r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n ),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1)\n .re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n .aggregate(projects=_.project_name.collect().unique())\n .group_by([\"month\", \"ext\"])\n .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n .order_by(_.month.desc())\n)\n\n```\n\n\n#### Don't repeat yourself\n\nNotice that we are now:\n\n* grouping\n* aggregating\n* grouping again by the **same keys**\n* aggregating\n\nThis is less optimal than it could be. 
We are also flattening an array,\ncomputing its distinct values and then computing its length.\n\nWe are computing the grouped number of distinct values, and we likely don't\nneed to collect values into an array to do that.\n\nLet's try using a `COUNT(DISTINCT ...)` query instead, to avoid wasting cycles\ncollecting arrays.\n\nWe'll remove the second group by and then call `nunique()` to get the final\nquery.\n\nApplying these changes:\n\n```diff\n--- step2.py\t2023-12-12 05:20:01.712513949 -0500\n+++ step3.py\t2023-12-12 05:20:01.712513949 -0500\n@@ -1,13 +1,7 @@\n from __future__ import annotations\n \n import ibis\n-from ibis import _, udf\n-\n-\n-@udf.scalar.builtin\n-def flatten(x: list[list[str]]) -> list[str]:\n- ...\n-\n+from ibis import _\n \n expr = (\n ibis.read_parquet(\"/data/pypi-parquet/*.parquet\")\n@@ -30,8 +24,7 @@\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n- .aggregate(projects=_.project_name.collect().unique())\n- .group_by([\"month\", \"ext\"])\n- .aggregate(project_count=flatten(_.projects.collect()).unique().length())\n- .order_by(_.month.desc())\n+ .aggregate(project_count=_.project_name.nunique())\n+ .dropna(\"ext\")\n+ .order_by([_.month.desc(), _.project_count.desc()]) # <1>\n )\n```\n\n\nWe get:\n\n\n\n```python\nfrom __future__ import annotations\n\nimport ibis\nfrom ibis import _\n\nexpr = (\n ibis.read_parquet(\"/data/pypi-parquet/*.parquet\")\n .filter(\n [\n _.path.re_search(\n r\"\\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n ),\n ~_.path.re_search(r\"(^|/)test(|s|ing)\"),\n ~_.path.contains(\"/site-packages/\"),\n ]\n )\n .group_by(\n month=_.uploaded_on.truncate(\"M\"),\n ext=_.path.re_extract(r\"\\.([a-z0-9]+)$\", 1)\n .re_replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\")\n .re_replace(\"^f.*$\", \"Fortran\")\n .replace(\"rs\", \"Rust\")\n .replace(\"go\", \"Go\")\n .replace(\"asm\", \"Assembly\")\n .nullif(\"\"),\n )\n .aggregate(project_count=_.project_name.nunique())\n .dropna(\"ext\")\n .order_by([_.month.desc(), _.project_count.desc()]) # <1>\n)\n\n```\n1. I added a second sort key (`project_count`) for deterministic output.\n\n\nLet's run it to make sure the results are as expected:\n\n::: {#c5ebcade .cell execution_count=28}\n``` {.python .cell-code}\nduckdb_results = %timeit -n1 -r1 -o expr.to_pandas()\n```\n\n::: {.cell-output .cell-output-display}\n```{=html}\n\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n31.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n```\n:::\n:::\n\n\nIt looks like the new query might be a bit slower even though we're ostensibly\ndoing less computation. 
Since we're still pretty close to the original\nduration, let's keep going.\n\n### Final pandas run with the new query\n\nRewriting the pandas code we get:\n\n```python\nfrom __future__ import annotations\n\nimport glob\nimport os\n\nimport pandas as pd\n\ndf = pd.read_parquet(\n min(glob.glob(\"/data/pypi-parquet/*.parquet\"), key=os.path.getsize),\n columns=[\"path\", \"uploaded_on\", \"project_name\"],\n)\ndf = df[\n df.path.str.contains(r\"\\.(?:asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\")\n & ~df.path.str.contains(r\"(?:(?:^|/)test(?:|s|ing)|/site-packages/)\")\n]\nprint(\n df.assign(\n month=df.uploaded_on.dt.to_period(\"M\").dt.to_timestamp(),\n ext=df.path.str.extract(r\"\\.([a-z0-9]+)$\", 0)\n .iloc[:, 0]\n .str.replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\", regex=True)\n .str.replace(\"^f.*$\", \"Fortran\", regex=True)\n .str.replace(\"rs\", \"Rust\")\n .str.replace(\"go\", \"Go\")\n .str.replace(\"asm\", \"Assembly\"),\n )\n .groupby([\"month\", \"ext\"])\n .project_name.nunique()\n .rename(\"project_count\")\n .reset_index()\n .sort_values([\"month\", \"project_count\"], ascending=False)\n)\n\n```\n\n\nRunning it we get:\n\n::: {#6187b075 .cell execution_count=30}\n``` {.python .cell-code}\npandas_results = %timeit -n1 -r1 -o %run pandas_impl.py\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n month ext project_count\n6 2023-11-01 C/C++ 836\n9 2023-11-01 Rust 190\n7 2023-11-01 Fortran 48\n8 2023-11-01 Go 33\n5 2023-11-01 Assembly 10\n1 2023-10-01 C/C++ 484\n4 2023-10-01 Rust 99\n3 2023-10-01 Go 25\n2 2023-10-01 Fortran 23\n0 2023-10-01 Assembly 14\n3min 2s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)\n```\n:::\n:::\n\n\n\n\n::: {.callout-note}\n# Remember, this is the time it took pandas to run the query for a **single** file.\nDuckDB runs the query over the **entire** dataset about 4x faster than that!\n:::\n\nLet's try a tool that nominally scales to our problem: [Dask](https://dask.org).\n\n## Dask\n\nOne really nice component of Dask is\n[`dask.dataframe`](https://docs.dask.org/en/stable/dataframe.html).\n\nDask DataFrame implements a [good chunk of the pandas\nAPI](https://docs.dask.org/en/stable/dataframe.html#scope) and can be a drop-in\nreplacement for pandas.\n\nI am happy that this turned out to be the case here.\n\nMy first attempt was somewhat naive and was effectively a one line change\nfrom `import pandas as pd` to `import dask.dataframe as pd`.\n\nThis worked and the workload completed. However, after talking to Dask\nexpert and Ibis contributor [Naty Clementi](https://github.com/ncclementi) she\nsuggested I try a few things:\n\n* Use [the distributed scheduler](https://distributed.dask.org/en/stable/).\n* Ensure that [`pyarrow` string arrays are\n used](https://docs.dask.org/en/latest/configuration.html#dask) instead of\n NumPy object arrays. This required **no changes** to my Dask code because\n PyArrow strings have been the default since version 2023.7.1, hooray!\n* Explore some of the options to `read_parquet`. 
It turned out that without setting\n `split_row_groups=True` I ran out of memory.\n\nLet's look at the Dask implementation:\n\n```python\nfrom __future__ import annotations\n\nimport logging\n\nimport dask.dataframe as dd\nfrom dask.distributed import Client\n\nif __name__ == \"__main__\":\n client = Client(silence_logs=logging.ERROR)\n df = dd.read_parquet(\n \"/data/pypi-parquet/*.parquet\",\n columns=[\"path\", \"uploaded_on\", \"project_name\"],\n split_row_groups=True,\n )\n df = df[\n df.path.str.contains(\n r\"\\.(?:asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$\"\n )\n & ~df.path.str.contains(r\"(?:^|/)test(?:|s|ing)\")\n & ~df.path.str.contains(\"/site-packages/\")\n ]\n print(\n df.assign(\n month=df.uploaded_on.dt.to_period(\"M\").dt.to_timestamp(),\n ext=df.path.str.extract(r\"\\.([a-z0-9]+)$\", 0, expand=False)\n .str.replace(r\"cxx|cpp|cc|c|hpp|h\", \"C/C++\", regex=True)\n .str.replace(\"^f.*$\", \"Fortran\", regex=True)\n .str.replace(\"rs\", \"Rust\")\n .str.replace(\"go\", \"Go\")\n .str.replace(\"asm\", \"Assembly\"),\n )\n .groupby([\"month\", \"ext\"])\n .project_name.nunique()\n .rename(\"project_count\")\n .compute()\n .reset_index()\n .sort_values([\"month\", \"project_count\"], ascending=False)\n )\n client.shutdown()\n\n```\n\n\nLet's run the code:\n\n::: {#a606e7b3 .cell execution_count=33}\n``` {.python .cell-code}\ndask_results = %timeit -n1 -r1 -o %run dask_impl.py\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n month ext project_count\n794 2023-11-01 C/C++ 836\n796 2023-11-01 Rust 190\n797 2023-11-01 Fortran 48\n795 2023-11-01 Go 33\n798 2023-11-01 Assembly 10\n.. ... ... ...\n2 2005-08-01 C/C++ 7\n1 2005-07-01 C/C++ 4\n83 2005-05-01 C/C++ 1\n82 2005-04-01 C/C++ 1\n0 2005-03-01 C/C++ 1\n\n[799 rows x 3 columns]\n52.6 s ± 0 ns per loop (mean ± std. dev. 
of 1 run, 1 loop each)\n```\n:::\n:::\n\n\n\n\nThat's a great improvement over pandas: we finished the workload and our\nrunning time is pretty close to DuckDB.\n\n## Takeaways\n\n**Ibis + DuckDB is the only system tested that handles this workload well out of the box**\n\n* Pandas couldn't handle the workload due to memory constraints.\n* Dask required its [recommended distributed\n scheduler](https://docs.dask.org/en/stable/deploying.html#deploy-dask-clusters)\n to achieve maximum performance and still used a lot of memory.\n\nLet's recap the results with some numbers:\n\n### Numbers\n\n| Toolset | Data size | Duration | Throughput |\n| ------------------ | --------: | -----------: | ---------: |\n| Ibis + DuckDB | 25,825 MiB | 31 seconds | 823 MiB/s |\n| Dask + Distributed | 25,825 MiB | 53 seconds | 491 MiB/s |\n\n\nWith Ibis + DuckDB, I was able to write the query the way I wanted to without\nrunning out of memory, using the default configuration provided by Ibis.\n\n\nI was able run this computation around **188x faster** than you can expect\nwith pandas using this hardware setup.\n\n\n\nIn contrast, pandas ran out of memory **on a single file** without some hand\nholding and while Dask didn't cause my program to run out of memory it still\nused quite a bit more than DuckDB.\n\n### Pandas is untenable for this workload\n\nPandas requires me to load everything into memory, and my machine doesn't have\nenough memory to do that.\n\nGiven that Ibis + DuckDB runs this workload on my machine it doesn't seem worth\nthe effort to write any additional code to make pandas scale to the whole\ndataset.\n\n### Dask finishes in a similar amount of time as Ibis + DuckDB (within 2x)\n\nOut of the box I had quite a bit of difficulty figuring out how to maximize\nperformance and not run out of memory.\n\nPlease get in touch if you think my Dask code can be improved!\n\nI know the Dask community is hard at work building\n[`dask-expr`](https://github.com/dask-contrib/dask-expr) which might improve\nthe performance of this workload when it lands.\n\n## Next steps\n\n### Please get in touch!\n\nIf you have ideas about how to speed up my use of the tools I've discussed here\nplease get in touch by opening a [GitHub\ndiscussion](https://github.com/ibis-project/ibis/discussions)!\n\nWe would love it if more backends handled this workload!\n\n### Look out for part 2\n\nIn part 2 of this series we'll explore how Polars and DataFusion perform on\nthis query. Stay tuned!\n\n", "supporting": [ "index_files" ], "filters": [], "includes": { "include-in-header": [ - "\n\n\n" + "\n\n\n\n" + ], + "include-after-body": [ + "\n" ] } } diff --git a/docs/posts/pydata-performance/index.qmd b/docs/posts/pydata-performance/index.qmd index 5e1b492ad108..1739b1a87819 100644 --- a/docs/posts/pydata-performance/index.qmd +++ b/docs/posts/pydata-performance/index.qmd @@ -484,15 +484,15 @@ show_file("./step2.py") #### Don't repeat yourself -Notice that we are now +Notice that we are now: -1. grouping -2. aggregating -3. grouping again by the **same keys** -4. aggregating +* grouping +* aggregating +* grouping again by the **same keys** +* aggregating -This is less optimal than it could be. Notice that we are also flattening an -array, computing its distinct values and then computing its length. +This is less optimal than it could be. We are also flattening an array, +computing its distinct values and then computing its length. 
We are computing the grouped number of distinct values, and we likely don't need to collect values into an array to do that.