halo2_proofs: Avoid caching rotated polynomials in `poly::Evaluator` #644

str4d · 2022-08-25T00:56:37Z

Previously we used the existing Polynomial::rotate and EvaluationDomain::rotate_extended APIs to rotate the polynomials we query at non-zero rotations, and then stored references to the chunks of each (rotated and unrotated) polynomial as slices. When we reached an AstLeaf, we would then clone the corresponding polynomial chunk.

We now avoid all of this with combined "rotate-and-chunk" APIs, making use of the observation that a rotation is simply a renumbering of the indices of the polynomial coefficients. Given that when we reach an AstLeaf we already needed to allocate, these new APIs have the same number of allocations during AST evaluation, but enable us to completely avoid caching any information about the rotations or rotated polynomials ahead of time.

Previously we used the existing `Polynomial::rotate` and `EvaluationDomain::rotate_extended` APIs to rotate the polynomials we query at non-zero rotations, and then stored references to the chunks of each (rotated and unrotated) polynomial as slices. When we reached an `AstLeaf`, we would then clone the corresponding polynomial chunk. We now avoid all of this with combined "rotate-and-chunk" APIs, making use of the observation that a rotation is simply a renumbering of the indices of the polynomial coefficients. Given that when we reach an `AstLeaf` we already needed to allocate, these new APIs have the same number of allocations during AST evaluation, but enable us to completely avoid caching any information about the rotations or rotated polynomials ahead of time.

str4d · 2022-08-25T01:05:40Z

Running heaptrack on the Orchard test circuit::tests::round_trip (which runs MockProver and then creates and verifies a single proof) with a Ryzen 9 5950X, I get the following memory usage measurements before and after this change:

❯ heaptrack --analyze heaptrack.circuit.before.zst | tail -n 7
total runtime: 5.20s.
bytes allocated in total (ignoring deallocations): 2.95GB (567.03MB/s)
calls to allocation functions: 1125346 (216412/s)
temporary memory allocations: 266765 (51300/s)
peak heap memory consumption: 96.05MB
peak RSS (including heaptrack overhead): 680.80MB
total memory leaked: 684.67KB

❯ heaptrack --analyze heaptrack.circuit.after.zst | tail -n 7
total runtime: 5.17s.
bytes allocated in total (ignoring deallocations): 2.93GB (567.36MB/s)
calls to allocation functions: 1114236 (215561/s)
temporary memory allocations: 268533 (51950/s)
peak heap memory consumption: 81.70MB
peak RSS (including heaptrack overhead): 628.31MB
total memory leaked: 547.88KB

So for the Orchard circuit, this is a 14.9% reduction in peak heap memory consumption, and a 7.7% reduction in peak RSS. It also gives a minor performance boost to proving times (Ryzen 9 5950X):

proving/bundle/1        time:   [983.50 ms 990.66 ms 997.86 ms]                           
                        change: [-2.2962% -1.3784% -0.3654%] (p = 0.02 < 0.05)
                        Change within noise threshold.
proving/bundle/2        time:   [984.91 ms 991.11 ms 997.55 ms]                           
                        change: [-1.8530% -1.0189% -0.1863%] (p = 0.04 < 0.05)
                        Change within noise threshold.
proving/bundle/3        time:   [1.3261 s 1.3329 s 1.3399 s]                              
                        change: [-1.5906% -0.7636% +0.0227%] (p = 0.11 > 0.05)
                        No change in performance detected.
proving/bundle/4        time:   [1.6655 s 1.6743 s 1.6830 s]                              
                        change: [-2.5920% -1.8598% -1.0688%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild

halo2_proofs/src/poly.rs

jasonmorton · 2022-08-25T01:52:23Z

On our compute-intensive example this reduces peak memory usage from 107GB to 42.8GB and is about 10% faster (i9-12900K). k=17, 147 columns, lookups with 2^14 rows. The reduction is across both this PR and #642 combined (#642 was about a 20% reduction).

daira

utACK

str4d added 2 commits August 25, 2022 01:07

Silence clippy lint

5b7b4dd

halo2_proofs: Add tests for "rotate-and-chunk" APIs

d24f0fd

daira reviewed Aug 25, 2022

View reviewed changes

halo2_proofs/src/poly.rs Show resolved Hide resolved

str4d mentioned this pull request Aug 25, 2022

Implement performance and memory optimizations for halo2_proofs #548

Open

24 tasks

daira approved these changes Sep 9, 2022

View reviewed changes

str4d merged commit 553584a into main Sep 10, 2022

str4d deleted the remove-rotated-poly-cache branch September 10, 2022 11:02

str4d added this to the 0.3.0 milestone Mar 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

halo2_proofs: Avoid caching rotated polynomials in `poly::Evaluator` #644

halo2_proofs: Avoid caching rotated polynomials in `poly::Evaluator` #644

str4d commented Aug 25, 2022

str4d commented Aug 25, 2022

jasonmorton commented Aug 25, 2022 •

edited

Loading

daira left a comment

halo2_proofs: Avoid caching rotated polynomials in poly::Evaluator #644

halo2_proofs: Avoid caching rotated polynomials in poly::Evaluator #644

Conversation

str4d commented Aug 25, 2022

str4d commented Aug 25, 2022

jasonmorton commented Aug 25, 2022 • edited Loading

daira left a comment

Choose a reason for hiding this comment

halo2_proofs: Avoid caching rotated polynomials in `poly::Evaluator` #644

halo2_proofs: Avoid caching rotated polynomials in `poly::Evaluator` #644

jasonmorton commented Aug 25, 2022 •

edited

Loading