feat: track compressed size & compare to parquet(zstd)? & canonical (#…

…882)

We now track these six values:

1. Compression time (s).
2. Compression throughput (bytes/s).
3. Compressed size (bytes).
4. Compressed size as fraction of a Vortex Canonical array.
5. Compressed Layout size as fraction of Parquet without block compression.
6. Compressed Layout size as fraction of Parquet with Zstd.

It's a bit janky: I just unconditionally compute these values for several datasets. I couldn't figure out how to ask Criterion which benchmark regex is currently in use, so, for example, `cargo bench taxi` will still run all the size benchmarks for every other dataset. I also had to do some janky jq parsing to convert Criterion's JSON output to the format expected by the benchmark-action GitHub action that we use.

Nevertheless, for each commit to `develop`, we should now get all six numbers for the Taxi, Airline Sentiment, Arade, Bimbo, CMSprovider, Euro2016, Food, HashTags, and TPC-H l_comment datasets. They'll be displayed under [Vortex Compression](https://spiraldb.github.io/vortex/dev/bench/#Vortex_Compression) at the benchmarks site.

I might need to delete some old data from the gh-pages-bench branch since I changed some benchmark names, but after a few commits, those plots should become useful measures of our compression performance in space and time.
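As a rough illustration of how the six values relate to one another, here is a minimal Python sketch. The class, field, and function names are hypothetical and are not taken from the Vortex code. It also assumes that throughput is measured over the uncompressed input bytes, which the message above does not state.

```python
from dataclasses import dataclass


@dataclass
class CompressionRun:
    """Raw measurements for one dataset (all field names are hypothetical)."""
    input_bytes: int              # assumption: bytes fed to the compressor
    seconds: float                # wall-clock compression time
    compressed_bytes: int         # size of the compressed array
    canonical_bytes: int          # size of the canonical (uncompressed) array
    layout_bytes: int             # size of the compressed layout
    parquet_plain_bytes: int      # Parquet without block compression
    parquet_zstd_bytes: int       # Parquet with Zstd


def six_values(run: CompressionRun) -> dict:
    """Derive the six tracked numbers from the raw measurements."""
    return {
        "compression time (s)": run.seconds,
        "compression throughput (bytes/s)": run.input_bytes / run.seconds,
        "compressed size (bytes)": run.compressed_bytes,
        "compressed / canonical": run.compressed_bytes / run.canonical_bytes,
        "layout / parquet (uncompressed)": run.layout_bytes / run.parquet_plain_bytes,
        "layout / parquet (zstd)": run.layout_bytes / run.parquet_zstd_bytes,
    }
```

Since the workflow reports results with `customSmallerIsBetter`, ratios below 1.0 mean the compressed output is smaller than the baseline it is compared against.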
spiraldb · Sep 20, 2024 · a87c720
1 parent 3194009
commit a87c720
Showing 11 changed files with 364 additions and 109 deletions.
diff --git a/.github/workflows/bench-pr.yml b/.github/workflows/bench-pr.yml
@@ -49,17 +49,36 @@ jobs:
 
       - name: Run benchmark
         shell: bash
-        run: cargo bench --bench ${{ matrix.benchmark.id }} -- --output-format bencher | tee ${{ matrix.benchmark.id }}.txt
+        run: |
+          cargo install cargo-criterion
 
+          cargo criterion --bench ${{ matrix.benchmark.id }} --message-format=json 2>&1 | tee out.json
+
+          cat out.json
+
+          sudo apt-get update && sudo apt-get install -y jq
+
+          jq --raw-input --compact-output '
+                 fromjson?
+                 | [ (if .mean != null then {name: .id, value: .mean.estimate, unit: .unit, range: ((.mean.upper_bound - .mean.lower_bound) / 2) } else {} end),
+                     (if .throughput != null then {name: (.id + " throughput"), value: .throughput[].per_iteration, unit: .throughput[].unit, range: 0} else {} end),
+                     {name, value, unit, range} ]
+                 | .[]
+                 | select(.value != null)
+              ' \
+              out.json \
+              | jq --slurp --compact-output '.' >${{ matrix.benchmark.id }}.json
+
+          cat ${{ matrix.benchmark.id }}.json
       - name: Store benchmark result
         if: '!cancelled()'
         uses: benchmark-action/github-action-benchmark@v1
         with:
           name: ${{ matrix.benchmark.name }}
-          tool: 'cargo'
+          tool: 'customSmallerIsBetter'
           gh-pages-branch: gh-pages-bench
           github-token: ${{ secrets.GITHUB_TOKEN }}
-          output-file-path: ${{ matrix.benchmark.id }}.txt
+          output-file-path: ${{ matrix.benchmark.id }}.json
           summary-always: true
           comment-always: true
           auto-push: false
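
For readers less familiar with jq, the filter in the hunk above can be approximated by the following Python sketch. It is an illustration, not part of the commit. The function name is hypothetical, and the assumption that the benchmark also prints ready-made `{name, value, unit, range}` records is inferred only from the third element of the jq array.

```python
import json


def criterion_to_benchmark_action(lines):
    """Approximate the jq filter: keep JSON lines, emit one entry per measurement."""
    entries = []
    for line in lines:
        try:
            msg = json.loads(line)   # like `fromjson?`: skip lines that are not JSON
        except json.JSONDecodeError:
            continue
        if not isinstance(msg, dict):
            continue

        candidates = []
        mean = msg.get("mean")
        if mean is not None:
            # Timing result: report the mean and half the confidence interval width.
            candidates.append({
                "name": msg["id"],
                "value": mean["estimate"],
                "unit": msg.get("unit"),
                "range": (mean["upper_bound"] - mean["lower_bound"]) / 2,
            })
        for tp in msg.get("throughput") or []:
            candidates.append({
                "name": msg["id"] + " throughput",
                "value": tp["per_iteration"],
                "unit": tp["unit"],
                "range": 0,
            })
        # Pass-through for records that already carry name/value/unit/range.
        candidates.append({k: msg.get(k) for k in ("name", "value", "unit", "range")})

        # Like `select(.value != null)`: drop empty placeholders.
        entries.extend(c for c in candidates if c.get("value") is not None)
    return entries
```

One difference from the jq version: jq iterates `.throughput[].per_iteration` and `.throughput[].unit` independently, so a message with more than one throughput entry would yield every combination, whereas this sketch pairs each value with its own unit. The final `jq --slurp` step corresponds to serializing the returned list with `json.dumps`.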

diff --git a/.github/workflows/bench.yml b/.github/workflows/bench.yml
@@ -41,17 +41,36 @@ jobs:
 
       - name: Run benchmark
         shell: bash
-        run: cargo bench --bench ${{ matrix.version.id }} -- --output-format bencher | tee ${{ matrix.version.id }}.txt
+        run: |
+          cargo install cargo-criterion
 
+          cargo criterion --bench ${{ matrix.benchmark.id }} --message-format=json 2>&1 | tee out.json
+
+          cat out.json
+
+          sudo apt-get update && sudo apt-get install -y jq
+
+          jq --raw-input --compact-output '
+                 fromjson?
+                 | [ (if .mean != null then {name: .id, value: .mean.estimate, unit: .unit, range: ((.mean.upper_bound - .mean.lower_bound) / 2) } else {} end),
+                     (if .throughput != null then {name: (.id + " throughput"), value: .throughput[].per_iteration, unit: .throughput[].unit, range: 0} else {} end),
+                     {name, value, unit, range} ]
+                 | .[]
+                 | select(.value != null)
+              ' \
+              out.json \
+              | jq --slurp --compact-output '.' >${{ matrix.benchmark.id }}.json
+
+          cat ${{ matrix.benchmark.id }}.json
       - name: Store benchmark result
         if: '!cancelled()'
         uses: benchmark-action/github-action-benchmark@v1
         with:
-          name: ${{ matrix.version.name }}
-          tool: 'cargo'
+          name: ${{ matrix.benchmark.name }}
+          tool: 'customSmallerIsBetter'
           gh-pages-branch: gh-pages-bench
           github-token: ${{ secrets.GITHUB_TOKEN }}
-          output-file-path: ${{ matrix.version.id }}.txt
+          output-file-path: ${{ matrix.benchmark.id }}.json
           summary-always: true
           auto-push: true
           fail-on-alert: false

diff --git a/Cargo.lock b/Cargo.lock
diff --git a/bench-vortex/.gitignore b/bench-vortex/.gitignore
@@ -1 +1 @@
-data
+data
diff --git a/bench-vortex/Cargo.toml b/bench-vortex/Cargo.toml
@@ -47,6 +47,7 @@ rand = { workspace = true }
 rayon = { workspace = true }
 reqwest = { workspace = true }
 serde = { workspace = true }
+serde_json = { workspace = true }
 simplelog = { workspace = true }
 tar = { workspace = true }
 tokio = { workspace = true, features = ["full"] }