
WIP optimizations #1

Draft
wants to merge 27 commits into base: main

Conversation

Brechtpd
Owner

@Brechtpd commented Jan 7, 2022

Contains:

  • Faster FFT
  • Slightly different multi exp with prefetching
  • Ability to export all expressions together to Rust code and use that native code to do the h() evaluation much more efficiently.
  • Smarter memory use

Roughly 4x faster while using 8x less memory for the current zkEVM circuit.

In general, only high-level optimizations are done here.

Next steps:

  • Big memory savings are possible by being smart with product_coset/permuted_input_cosets/permuted_table_cosets. The calculations aren't ideal, so I'm not sure if it's possible to avoid having a table/lookup expression at all. EDIT: Done, but in an unsatisfying way. I believe some more savings are possible, but they may be a bit messier.
  • For the h() evaluation it's very important to be able to reuse intermediate results. Unfortunately the Rust compiler doesn't seem to do this well for us (I assume because it's much harder when the calculations are field operations). The current algorithm is pretty naive, so I expect better results are possible with something smarter. Ideally we could let something like LLVM do this optimization.
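To illustrate the intermediate-result reuse idea, here is a toy sketch (all names, the expression type, and the modulus are made up for illustration; the real code operates on a prime-field type over the actual constraint expressions): identical sub-expressions of an h()-style expression tree are evaluated once and cached, rather than recomputed each time.

```rust
use std::collections::HashMap;

// Toy 64-bit prime modulus standing in for the real scalar field.
const P: u64 = 0xffff_ffff_0000_0001;

fn fadd(a: u64, b: u64) -> u64 {
    ((a as u128 + b as u128) % (P as u128)) as u64
}

fn fmul(a: u64, b: u64) -> u64 {
    (((a as u128) * (b as u128)) % (P as u128)) as u64
}

/// Tiny expression tree standing in for the h() constraint expressions.
#[derive(Clone, Hash, PartialEq, Eq)]
enum Expr {
    Const(u64),
    Var(usize),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Evaluate with memoization: structurally identical subtrees are computed
/// once and reused, which is the kind of sharing the compiler won't do for
/// non-trivial field operations on its own.
fn eval(e: &Expr, vars: &[u64], cache: &mut HashMap<Expr, u64>) -> u64 {
    if let Some(&v) = cache.get(e) {
        return v;
    }
    let v = match e {
        Expr::Const(c) => *c,
        Expr::Var(i) => vars[*i],
        Expr::Add(a, b) => fadd(eval(a, vars, cache), eval(b, vars, cache)),
        Expr::Mul(a, b) => fmul(eval(a, vars, cache), eval(b, vars, cache)),
    };
    cache.insert(e.clone(), v);
    v
}
```

A smarter code generator would do this sharing statically (emitting one `let` per distinct subtree) instead of hashing at runtime, but the caching structure is the same.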

Also more things described at privacy-scaling-explorations#15 (comment)

@ashWhiteHat

Hi @Brechtpd
Thank you for your great work.
I am working on the prover optimization as well.
I would like to know the progress, and if some things are not finished, I would like to help.
How is the FFT going?
I have read your article and researched the algorithm, so I can implement it if it's not finished.

@Brechtpd
Owner Author

Hey @noctrlz, how's your assembly implementation in privacy-scaling-explorations/pairing#2 going? If it's ready to be used, I would very much like to have it in this branch as soon as possible because I think it will probably make a big difference! :) Because the current field operations are pretty slow, any field operation saved makes a big difference, but that may no longer be the case once they are well optimized, so this may impact what is worth doing and what is not. For example, with the FFT, radix-4 is currently never faster, while in my libsnark implementation it was a bit faster to use it when possible.

I think the FFT implementation could use some improvements, but I have some misc ideas for that already and wanted to wait for the assembly field optimizations to be ready before looking into it further.

One potentially interesting optimization I have not looked at at all is removing zero knowledge from the prover. Because of my very limited knowledge of PLONK, I don't know whether this will save a lot of operations or not. I know there are "blinding factors" and some extra calculations in the lookup tables, but I have no idea what other things could be removed. Do you have an idea, or can you help find this out?

Another important one is still reusing intermediate results for the h() polynomial. You can see in src/plonk/prover/generated.rs what kind of code is currently generated by my naive algorithm, but I'm still not sure what the best way would be to get the most out of this. If you have any ideas, let me know!

I may be forgetting some things right now so I'll update if I think of others.

@ashWhiteHat

how's your assembly implementation in privacy-scaling-explorations/pairing#2 going?

All the assembly arithmetic is complete, and the original author of appliedzkp/pairing is considering how to introduce the assembly here: privacy-scaling-explorations/pairing#4.
I am going to proceed after his feedback.

One potentially interesting optimization one I have not looked at at all is removing zero knowledge from the prover. Because of my very limited knowledge about PLONK I don't know if this will save a lot of operations or not

This is an interesting idea 👍
I think zero knowledge in the halo2 prover gives us the following benefits.
halo2 uses plookup, which allows us to use a lookup table when creating a proof instead of doing inefficient arithmetic operations.
And halo2 also uses recursive proofs, which allow us to reduce the proof size.
If we remove zero knowledge from the prover, we would lose these benefits.

I'm still not sure what the best way would be to get the most out of this. If you have any ideas let me know!

Okay!
I am going to check.

@Brechtpd
Owner Author

Brechtpd commented Jan 14, 2022

All assembly arithmetic was completed and the original author of appliedzkp/pairing is thinking how we introduce assembly in here appliedzkp/pairing#4. I am going to proceed after his feedback.

Ah nice! It seems like the review has been pending for quite some time now; do you think it's worth just getting it in as-is on this branch for some testing?

This is interesting idea +1 I think zero knowledge in halo2 prover gives us following benefits. The halo2 is using plookup which allows us to use lookup table when creating proof instead of doing inefficient arithmetic operation. And halo2 is also using recursive proof which allows us to reduce the proof size. If we remove the zero knowledge from prover, we wouldn't use these benefit.

Hmmm, not sure I understand. Looking at the halo2 docs for lookup, it seems like having zero knowledge is only a small adjustment to the main lookup algorithm. Why would, e.g., not doing this adjustment make it impossible to use lookups?

EDIT: I think if we remove the zk stuff from the lookup calculations we can lower the degree by one, which could be pretty important because it would allow us to get a circuit with an extended domain of only 2x the normal domain (we currently have an extended domain of 16x). With the zk calculations, the lowest we would be able to get is 4x.
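To make the degree/domain relationship concrete, here is a simplified model (this is my reconstruction of the reasoning, not halo2's exact computation): for maximum constraint degree d over a base domain of size n, the quotient polynomial has degree roughly (d - 1) * n, so the extended domain must be the next power of two at least (d - 1) times the base domain.

```rust
/// Simplified extended-domain blowup factor for a circuit with the given
/// maximum constraint degree: the quotient polynomial has degree roughly
/// (d - 1) * n, so the extended domain is the next power of two >= (d - 1) * n.
/// A rough model only, not halo2's exact code.
fn extended_domain_factor(max_degree: usize) -> usize {
    (max_degree - 1).next_power_of_two()
}
```

Under this model, a degree-3 argument needs only a 2x extended domain, while degrees 4 and 5 need 4x, consistent with the 2x/4x figures above (the concrete degrees here are illustrative).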

@ashWhiteHat

Ah nice! Seems like the review is pending for quite some time now, do you think it's worth it to just get it in as is on this branch for some testing?

Exactly.
Okay, I am going to work on that from tomorrow.

I think if we remove the zk stuff from the lookup calculations we can lower the degree by one, which could be pretty important because this would allow us to get a circuit with an extended domain of only 2x the normal domain

Yeah, it seems so.
A really creative idea 👐🏼

@ashWhiteHat

Hi @Brechtpd
I introduced assembly to pairing!
privacy-scaling-explorations/pairing#5

I think the FFT implementation could use some improvements, but I have some misc ideas for that already and wanted to wait for the assembly field optimizations to be ready before looking into it further.

I am going to integrate it to FFT.
I would like to know what kind of idea you have.

@Brechtpd
Owner Author

Hi @Brechtpd I introduced assembly to pairing! appliedzkp/pairing#5

Awesome! Eager to see how it behaves. :)

I think the FFT implementation could use some improvements, but I have some misc ideas for that already and wanted to wait for the assembly field optimizations to be ready before looking into it further.

I am going to integrate it to FFT. I would like to know what kind of idea you have.

  • Currently the multi-threading only works very well when running with a power-of-2 number of cores. The lower parts of the FFT are done per core, which means they should map as well as possible to the number of cores, while for the upper parts the work can be split across cores in any way necessary. So I think a different parallelization method is needed so that in all cases the number of lower parts maps as well as possible to the number of cores.
  • Some possibilities from the code here, I think:

    halo2/src/poly/domain.rs

    Lines 240 to 254 in a970e57

    pub fn coeff_to_extended(
        &self,
        mut a: Polynomial<G, Coeff>,
    ) -> Polynomial<G, ExtendedLagrangeCoeff> {
        assert_eq!(a.values.len(), 1 << self.k);
        self.distribute_powers_zeta(&mut a.values, true);
        a.values.resize(self.extended_len(), G::group_zero());
        best_fft(&mut a.values, self.extended_omega, self.extended_k);
        Polynomial {
            values: a.values,
            _marker: PhantomData,
        }
    }
    • distribute_powers_zeta could be done within the FFT, I think, by pre-calculating its multiplication into the twiddles somehow.
    • A large part of the input will be zero, although I think it's unlikely this can be exploited for a significant performance gain.
  • At times we know we have to do multiple FFTs; it may be worth doing them in a single FFT call by allowing multiple inputs (e.g. this could be a bit faster because of shared overhead for function calls and loading the twiddles). It makes things a bit more complicated though, so probably not worth it for the expected minor performance gains.
  • Reuse the scratch buffer between FFTs when possible (so the memory doesn't need to be allocated/deallocated all the time, which is pretty expensive). Ideally we end up with a prover state object that could be reused even between prover invocations and also contains the pre-calculated twiddles etc.

I guess the only really important one is the parallelization, so that the FFT code works well on all CPUs; the others are only minor possible optimizations.
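The scratch-buffer reuse point could look roughly like the sketch below (all names and types hypothetical, with u64 as a stand-in for field elements): a state object that owns the precomputed twiddles and a buffer that is cleared and refilled instead of reallocated on every FFT call.

```rust
/// Hypothetical reusable FFT state: precomputed twiddles plus a scratch buffer
/// that survives across FFT calls (and ideally across prover invocations).
struct FftScratch {
    twiddles: Vec<u64>, // stand-in for precomputed roots of unity
    scratch: Vec<u64>,
}

impl FftScratch {
    fn new(n: usize) -> Self {
        FftScratch {
            twiddles: vec![0; n / 2],
            scratch: Vec::with_capacity(n),
        }
    }

    /// Hand out a zeroed scratch buffer of length n. When the existing
    /// capacity is already large enough, this performs no allocation.
    fn scratch_for(&mut self, n: usize) -> &mut [u64] {
        self.scratch.clear();
        self.scratch.resize(n, 0);
        &mut self.scratch
    }
}
```

An FFT routine would then borrow the buffer from this object instead of allocating its own, which amortizes the allocation cost across all FFTs in a proof.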

@Brechtpd
Owner Author

Hi @Brechtpd I introduced assembly to pairing! appliedzkp/pairing#5

Some basic testing shows the heaviest arithmetic steps are around ~30% faster; overall prover time decreased 20-25% without any other changes that might make better use of the faster field ops. :)

One thing to think about is that the assembly code uses ADCX and MULX (and perhaps others) that are not that old (https://en.wikipedia.org/wiki/Intel_ADX, especially on the AMD side), so the current code cannot run on older CPUs (my main dev machine is old and I had to run the code on a different machine for testing). I guess that's fine because we're not really interested in supporting old CPUs for actually running the prover in useful scenarios, but for this library it's perhaps still a good idea to leave the old, slower code in, just to have a path that can still run pretty well on all CPUs?
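One way to keep a portable path alongside the ADX/MULX code is runtime dispatch; here is a minimal sketch (the accelerated branch is only stubbed, since the real MULX/ADCX routines live in the pairing crate, and the function names are hypothetical):

```rust
/// Portable full 64x64 -> 128-bit multiply, split into (lo, hi) limbs.
/// This is the result a MULX-based path would compute faster on newer x86-64.
fn mul_wide_portable(a: u64, b: u64) -> (u64, u64) {
    let wide = (a as u128) * (b as u128);
    (wide as u64, (wide >> 64) as u64)
}

fn mul_wide(a: u64, b: u64) -> (u64, u64) {
    #[cfg(target_arch = "x86_64")]
    {
        // MULX is gated behind the BMI2 feature flag; an optimized assembly
        // implementation could be dispatched here. It must return exactly the
        // same (lo, hi) pair as the portable path below.
        if is_x86_feature_detected!("bmi2") {
            // return mul_wide_mulx(a, b);  // hypothetical accelerated routine
        }
    }
    // Fallback that runs correctly on any CPU.
    mul_wide_portable(a, b)
}
```

The check happens at runtime, so a single binary can serve both old and new CPUs; the detection cost can be hoisted out of hot loops by caching the result.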

@Brechtpd
Owner Author

It also looks like the FFT will be much less important after some circuit changes; I currently think the multi-exps will be the most important part to optimize. So I would probably hold off on the smaller FFT optimizations until it's clear they would actually make a decent difference.

@ashWhiteHat

Some basic testing shows the most heavy arithmetic steps around ~30% faster, overall prover time decreased 20-25% without doing any other changes that may make better use of the faster field ops now. :)

I am benchmarking as well on privacy-scaling-explorations/zkevm-circuits#302.
I would like to know how you benchmarked it.

One thing to think about is that the assembly code uses ADCX and MULX (and perhaps others) that are not that old...

Thank you for the review!
I am going to modify it accordingly.

It also looks like the FFT will be much less important after doing some circuit changes, I currently think the multi exps will be the most important part to optimize. So I would probably hold on off on doing the smaller FFT optimizations until it's clear they actually would make a decent difference.

Okay.
I am going to work on FFT instead.

@ashWhiteHat

And it seems we should rebase onto upstream halo2 with its breaking changes.
privacy-scaling-explorations#15 (comment)
I am going to work on that from tomorrow.

@Brechtpd
Owner Author

I am benching as well on appliedzkp/zkevm-circuits#302. I would like to know how you benched it.

Not much different from the standard bench code; I just modified it a little so the test circuit actually executes opcodes instead of everything being empty, but I don't think that really changes things currently (I did it more as a precaution).

@ashWhiteHat

Hi @Brechtpd
I pulled the latest zcash branch and created an optimization branch.
privacy-scaling-explorations#23

I am going to bench the prove function as well.
Sorry for the inconvenience.
