
WIP optimizations #1

Draft
wants to merge 27 commits into base: main

Conversation

Brechtpd
Owner

@Brechtpd commented Jan 7, 2022

Contains:

  • Faster FFT
  • Slightly different multi exp with prefetching
  • Ability to export all expressions together to Rust code and use that native code to do the h() evaluation much more efficiently.
  • Smarter memory use

Roughly 4x faster while using 8x less memory for the current zkEVM circuit.

In general, only high-level optimizations are done here.

Next steps:

  • Big memory savings are possible by being smart with product_coset/permuted_input_cosets/permuted_table_cosets. The calculations aren't ideal, so I'm not sure if it's possible to avoid having a table/lookup expression at all. EDIT: Done, but in an unsatisfying way. I believe some more savings are possible, but they may be a bit messier.
  • For the h() evaluation it's very important to be able to reuse intermediate results. Unfortunately the Rust compiler doesn't seem to do this well for us (I assume because it's much harder when the calculations are field operations). The current algorithm is pretty naive, so I expect better results are possible with something smarter. Ideally we could let something like LLVM do this optimization.
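To illustrate the intermediate-result reuse idea, here is a toy sketch (all names, the expression type, and the modulus are made up for illustration; the real code operates on a prime-field type over the actual constraint expressions): identical sub-expressions of an h()-style expression tree are evaluated once and cached, rather than recomputed each time.

```rust
use std::collections::HashMap;

// Toy 64-bit prime modulus standing in for the real scalar field.
const P: u64 = 0xffff_ffff_0000_0001;

fn fadd(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % (P as u128)) as u64
}

fn fmul(a: u64, b: u64) -> u64 {
    (((a as u128) * (b as u128)) % (P as u128)) as u64
}

/// Tiny expression tree standing in for the h() constraint expressions.
#[derive(Clone, Hash, PartialEq, Eq)]
enum Expr {
    Const(u64),
    Var(usize),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Evaluate with memoization: structurally identical subtrees are computed
/// once and reused, which is the kind of sharing the compiler won't do for
/// non-trivial field operations on its own.
fn eval(e: &Expr, vars: &[u64], cache: &mut HashMap<Expr, u64>) -> u64 {
    if let Some(&v) = cache.get(e) {
        return v;
    }
    let v = match e {
        Expr::Const(c) => *c,
        Expr::Var(i) => vars[*i],
        Expr::Add(a, b) => fadd(eval(a, vars, cache), eval(b, vars, cache)),
        Expr::Mul(a, b) => fmul(eval(a, vars, cache), eval(b, vars, cache)),
    };
    cache.insert(e.clone(), v);
    v
}
```

A smarter code generator would do this sharing statically (emitting one `let` per distinct subtree) instead of hashing at runtime, but the caching structure is the same.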

Also more things described at privacy-scaling-explorations#15 (comment)

@ashWhiteHat

Hi @Brechtpd
Thank you for your great work.
I am working on the prover optimization as well.
I would like to know the progress, and if some things are not finished, I would like to help.
How is the FFT going?
I have read your article and researched the algorithm, so I can implement it if it's not finished.

@Brechtpd
Owner Author

Hey @noctrlz, how's your assembly implementation in privacy-scaling-explorations/pairing#2 going? If it's ready to be used, I would very much like to have it in this branch as soon as possible because I think it will probably make a big difference! :) Because the current field operations are pretty slow, any field operation saved makes a big difference, but that may no longer be the case once they are well optimized, so this may impact what is worth doing and what is not. For example, with the FFT, radix-4 is currently never faster, while in my libsnark implementation it was a bit faster to use it when possible.

I think the FFT implementation could use some improvements, but I have some misc ideas for that already and wanted to wait for the assembly field optimizations to be ready before looking into it further.

One potentially interesting optimization I have not looked at at all is removing zero knowledge from the prover. Because of my very limited knowledge of PLONK, I don't know whether this will save a lot of operations or not. I know there are "blinding factors" and some extra calculations in the lookup tables, but I have no idea what other things could be removed. Do you have an idea, or can you help find this out?

Another important one is still reusing intermediate results for the h() polynomial. You can see in src/plonk/prover/generated.rs what kind of code is currently generated by my naive algorithm, but I'm still not sure what the best way would be to get the most out of this. If you have any ideas, let me know!

I may be forgetting some things right now so I'll update if I think of others.

@ashWhiteHat

how's your assembly implementation in privacy-scaling-explorations/pairing#2 going?

All the assembly arithmetic is complete, and the original author of appliedzkp/pairing is considering how to introduce the assembly here: privacy-scaling-explorations/pairing#4.
I am going to proceed after his feedback.

One potentially interesting optimization one I have not looked at at all is removing zero knowledge from the prover. Because of my very limited knowledge about PLONK I don't know if this will save a lot of operations or not

This is an interesting idea 👍
I think zero knowledge in the halo2 prover gives us the following benefits.
halo2 uses plookup, which allows us to use a lookup table when creating a proof instead of doing inefficient arithmetic operations.
And halo2 also uses recursive proofs, which allow us to reduce the proof size.
If we remove zero knowledge from the prover, we would lose these benefits.

I'm still not sure what the best way would be to get the most out of this. If you have any ideas let me know!

Okay!
I am going to check.

@Brechtpd
Owner Author

Brechtpd commented Jan 14, 2022

All assembly arithmetic was completed and the original author of appliedzkp/pairing is thinking how we introduce assembly in here appliedzkp/pairing#4. I am going to proceed after his feedback.

Ah nice! It seems like the review has been pending for quite some time now; do you think it's worth just getting it in as-is on this branch for some testing?

This is interesting idea +1 I think zero knowledge in halo2 prover gives us following benefits. The halo2 is using plookup which allows us to use lookup table when creating proof instead of doing inefficient arithmetic operation. And halo2 is also using recursive proof which allows us to reduce the proof size. If we remove the zero knowledge from prover, we wouldn't use these benefit.

Hmmm, not sure I understand. Looking at the halo2 docs for lookup, it seems like having zero knowledge is only a small adjustment to the main lookup algorithm. Why would, e.g., not doing this adjustment make it impossible to use lookups?

EDIT: I think if we remove the zk stuff from the lookup calculations we can lower the degree by one, which could be pretty important because it would allow us to get a circuit with an extended domain of only 2x the normal domain (we currently have an extended domain of 16x). With the zk calculations, the lowest we would be able to get is 4x.
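To make the degree/domain relationship concrete, here is a simplified model (this is my reconstruction of the reasoning, not halo2's exact computation): for maximum constraint degree d over a base domain of size n, the quotient polynomial has degree roughly (d - 1) * n, so the extended domain must be the next power of two at least (d - 1) times the base domain.

```rust
/// Simplified extended-domain blowup factor for a circuit with the given
/// maximum constraint degree: the quotient polynomial has degree roughly
/// (d - 1) * n, so the extended domain is the next power of two >= (d - 1) * n.
/// A rough model only, not halo2's exact code.
fn extended_domain_factor(max_degree: usize) -> usize {
    (max_degree - 1).next_power_of_two()
}
```

Under this model, a degree-3 argument needs only a 2x extended domain, while degrees 4 and 5 need 4x, consistent with the 2x/4x figures above (the concrete degrees here are illustrative).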

@ashWhiteHat

Ah nice! Seems like the review is pending for quite some time now, do you think it's worth it to just get it in as is on this branch for some testing?

Exactly.
Okay, I am going to work on that from tomorrow.

I think if we remove the zk stuff from the lookup calculations we can lower the degree by one, which could be pretty important because this would allow us to get a circuit with an extended domain of only 2x the normal domain

Yeah, it seems so.
A really creative idea 👐🏼

@ashWhiteHat

Hi @Brechtpd
I introduced assembly to pairing!
privacy-scaling-explorations/pairing#5

I think the FFT implementation could use some improvements, but I have some misc ideas for that already and wanted to wait for the assembly field optimizations to be ready before looking into it further.

I am going to integrate it to FFT.
I would like to know what kind of idea you have.

@Brechtpd
Owner Author

Hi @Brechtpd I introduced assembly to pairing! appliedzkp/pairing#5

Awesome! Eager to see how it behaves. :)

I think the FFT implementation could use some improvements, but I have some misc ideas for that already and wanted to wait for the assembly field optimizations to be ready before looking into it further.

I am going to integrate it to FFT. I would like to know what kind of idea you have.

  • Currently the multi-threading only works very well when running with a power-of-2 number of cores. The lower parts of the FFT are done per core, which means they should map as well as possible to the number of cores, while for the upper parts the work can be split across cores in any way necessary. So I think a different parallelization method is needed so that in all cases the number of lower parts maps as well as possible to the number of cores.
  • Some possibilities from the code here, I think:

    halo2/src/poly/domain.rs

    Lines 240 to 254 in a970e57

    pub fn coeff_to_extended(
        &self,
        mut a: Polynomial<G, Coeff>,
    ) -> Polynomial<G, ExtendedLagrangeCoeff> {
        assert_eq!(a.values.len(), 1 << self.k);
        self.distribute_powers_zeta(&mut a.values, true);
        a.values.resize(self.extended_len(), G::group_zero());
        best_fft(&mut a.values, self.extended_omega, self.extended_k);
        Polynomial {
            values: a.values,
            _marker: PhantomData,
        }
    }
    • distribute_powers_zeta could be done within the FFT, I think, by pre-calculating its multiplication into the twiddles somehow.
    • A large part of the input will be zero, although I think it's unlikely this can be exploited for a significant performance gain.
  • At times we know we have to do multiple FFTs; it may be worth doing them in a single FFT call by allowing multiple inputs (e.g. this could be a bit faster because of shared overhead for function calls and loading the twiddles). It makes things a bit more complicated though, so probably not worth it for the expected minor performance gains.
  • Reuse the scratch buffer between FFTs when possible (so the memory doesn't need to be allocated/deallocated all the time, which is pretty expensive). Ideally we end up with a prover state object that could be reused even between prover invocations and also contains the pre-calculated twiddles etc.

I guess the only really important one is the parallelization, so that the FFT code works well on all CPUs; the others are only minor possible optimizations.
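The scratch-buffer reuse point could look roughly like the sketch below (all names and types hypothetical, with u64 as a stand-in for field elements): a state object that owns the precomputed twiddles and a buffer that is cleared and refilled instead of reallocated on every FFT call.

```rust
/// Hypothetical reusable FFT state: precomputed twiddles plus a scratch buffer
/// that survives across FFT calls (and ideally across prover invocations).
struct FftScratch {
    twiddles: Vec<u64>, // stand-in for precomputed roots of unity
    scratch: Vec<u64>,
}

impl FftScratch {
    fn new(n: usize) -> Self {
        FftScratch {
            twiddles: vec![0; n / 2],
            scratch: Vec::with_capacity(n),
        }
    }

    /// Hand out a zeroed scratch buffer of length n. When the existing
    /// capacity is already large enough, this performs no allocation.
    fn scratch_for(&mut self, n: usize) -> &mut [u64] {
        self.scratch.clear();
        self.scratch.resize(n, 0);
        &mut self.scratch
    }
}
```

An FFT routine would then borrow the buffer from this object instead of allocating its own, which amortizes the allocation cost across all FFTs in a proof.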

@Brechtpd
Owner Author

Hi @Brechtpd I introduced assembly to pairing! appliedzkp/pairing#5

Some basic testing shows the heaviest arithmetic steps are around ~30% faster; overall prover time decreased 20-25% without any other changes that might make better use of the faster field ops. :)

One thing to think about is that the assembly code uses ADCX and MULX (and perhaps others) that are not that old (https://en.wikipedia.org/wiki/Intel_ADX, especially on the AMD side), so the current code cannot run on older CPUs (my main dev machine is old and I had to run the code on a different machine for testing). I guess that's fine because we're not really interested in supporting old CPUs for actually running the prover in useful scenarios, but for this library it's perhaps still a good idea to leave the old, slower code in, just to have a path that can still run pretty well on all CPUs?
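One way to keep a portable path alongside the ADX/MULX code is runtime dispatch; here is a minimal sketch (the accelerated branch is only stubbed, since the real MULX/ADCX routines live in the pairing crate, and the function names are hypothetical):

```rust
/// Portable full 64x64 -> 128-bit multiply, split into (lo, hi) limbs.
/// This is the result a MULX-based path would compute faster on newer x86-64.
fn mul_wide_portable(a: u64, b: u64) -> (u64, u64) {
    let wide = (a as u128) * (b as u128);
    (wide as u64, (wide >> 64) as u64)
}

fn mul_wide(a: u64, b: u64) -> (u64, u64) {
    #[cfg(target_arch = "x86_64")]
    {
        // MULX is gated behind the BMI2 feature flag; an optimized assembly
        // implementation could be dispatched here. It must return exactly the
        // same (lo, hi) pair as the portable path below.
        if is_x86_feature_detected!("bmi2") {
            // return mul_wide_mulx(a, b);  // hypothetical accelerated routine
        }
    }
    // Fallback that runs correctly on any CPU.
    mul_wide_portable(a, b)
}
```

The check happens at runtime, so a single binary can serve both old and new CPUs; the detection cost can be hoisted out of hot loops by caching the result.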

@Brechtpd
Owner Author

It also looks like the FFT will be much less important after some circuit changes; I currently think the multi-exps will be the most important part to optimize. So I would probably hold off on the smaller FFT optimizations until it's clear they would actually make a decent difference.

@ashWhiteHat

Some basic testing shows the most heavy arithmetic steps around ~30% faster, overall prover time decreased 20-25% without doing any other changes that may make better use of the faster field ops now. :)

I am benchmarking as well on privacy-scaling-explorations/zkevm-circuits#302.
I would like to know how you benchmarked it.

One thing to think about is that the assembly code uses ADCX and MULX (and perhaps others) that are not that old...

Thank you for the review!
I am going to modify it accordingly.

It also looks like the FFT will be much less important after doing some circuit changes, I currently think the multi exps will be the most important part to optimize. So I would probably hold on off on doing the smaller FFT optimizations until it's clear they actually would make a decent difference.

Okay.
I am going to work on FFT instead.

@ashWhiteHat

And it seems we should rebase onto upstream halo2 with its breaking changes.
privacy-scaling-explorations#15 (comment)
I am going to work on that from tomorrow.

@Brechtpd
Owner Author

I am benching as well on appliedzkp/zkevm-circuits#302. I would like to know how you benched it.

Not much different from the standard bench code; I just modified it a little so the test circuit actually executes opcodes instead of everything being empty, but I don't think that really changes things currently (I did it more as a precaution).

@ashWhiteHat

Hi @Brechtpd
I pulled the latest zcash branch and created an optimization branch.
privacy-scaling-explorations#23

I am going to bench the prove function as well.
Sorry for the inconvenience.
