
Implement performance and memory optimizations for halo2_proofs #548

Open
9 of 24 tasks
r3ld3v opened this issue Apr 14, 2022 · 19 comments
Comments

@r3ld3v
Contributor

r3ld3v commented Apr 14, 2022

@r3ld3v r3ld3v added this to the Release 5.0.0 milestone Apr 14, 2022
@nuttycom nuttycom changed the title Find performance and memory optimizations for halo2_proofs 0.1.0 Implement performance and memory optimizations for halo2_proofs Apr 25, 2022
@nuttycom nuttycom removed this from the Release 5.0.0 milestone Apr 25, 2022
@ashWhiteHat
Contributor

In terms of assembly optimizations for pasta field arithmetic, we need to update the Rust version: the asm feature has been stable since 1.59.0.
If updating is acceptable, I would implement asm as in the following, for an almost 20% speedup.
privacy-scaling-explorations/pairing#5

@ashWhiteHat
Contributor

ashWhiteHat commented Apr 26, 2022

I analysed the efficiency of the functions inside create_proof.
privacy-scaling-explorations#36 (comment)
I am going to work on the h(X) evaluation.

@ashWhiteHat
Contributor

ashWhiteHat commented Apr 26, 2022

Fft Optimization

implement 2-layer butterfly arithmetic

This would let us go bottom-up in n / 2 steps, but it's hard to implement recursively: the &[G] array has to be divided into 4 parts, and join can only take two arguments. I tried but failed.

switch to par_iter when n is large enough

We need to experiment on the relation between the parallelism overhead and the degree.

parallelize the last butterfly round

It's easy, but it makes the code more complicated.

ifft divisor

Perform the multiplication in the last round of butterfly arithmetic to avoid an extra iteration.
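The ifft-divisor item can be sketched on a toy radix-2 butterfly. Everything here is illustrative, not halo2's FFT code: the prime, helper names, and twiddle values are made up for the example. Pre-scaling the twiddle chain by n⁻¹ lets the last round absorb the divisor, saving the extra full-array pass.

```rust
// Toy sketch (not halo2 code) of the "ifft divisor" idea: fold the n^{-1}
// multiplication into the last radix-2 butterfly round instead of running
// a separate scaling pass over the whole array.
const P: u64 = 17; // tiny toy prime so results are easy to check by hand

fn mul(a: u64, b: u64) -> u64 {
    (a as u128 * b as u128 % P as u128) as u64
}
fn add(a: u64, b: u64) -> u64 {
    (a + b) % P
}
fn sub(a: u64, b: u64) -> u64 {
    (a + P - b) % P
}

/// Last butterfly round followed by the extra n^{-1} pass we want to avoid.
fn last_round_then_divide(coeffs: &mut [u64], twiddle: u64, n_inv: u64) {
    let half = coeffs.len() / 2;
    let mut w = 1;
    for i in 0..half {
        let (lo, hi) = (coeffs[i], mul(coeffs[i + half], w));
        coeffs[i] = add(lo, hi);
        coeffs[i + half] = sub(lo, hi);
        w = mul(w, twiddle);
    }
    // the extra iteration over every coefficient
    for c in coeffs.iter_mut() {
        *c = mul(*c, n_inv);
    }
}

/// Fused version: start the twiddle chain at n^{-1} and scale `lo` inline,
/// so no separate divisor pass is needed.
fn last_round_fused(coeffs: &mut [u64], twiddle: u64, n_inv: u64) {
    let half = coeffs.len() / 2;
    let mut w = n_inv;
    for i in 0..half {
        let lo = mul(coeffs[i], n_inv);
        let hi = mul(coeffs[i + half], w);
        coeffs[i] = add(lo, hi);
        coeffs[i + half] = sub(lo, hi);
        w = mul(w, twiddle);
    }
}

fn main() {
    let mut a = vec![1, 2, 3, 4];
    let mut b = a.clone();
    // 4 has order 4 mod 17, and 13 = 4^{-1} mod 17
    last_round_then_divide(&mut a, 4, 13);
    last_round_fused(&mut b, 4, 13);
    assert_eq!(a, b);
}
```

The fused round does two multiplications per pair instead of three, which is where the saved iteration comes from.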

@ashWhiteHat
Contributor

ashWhiteHat commented Apr 26, 2022

I think the following is interesting.
zkcrypto/bls12_381#84
We could save a Montgomery reduction in each field-array arithmetic operation.
That would help a lot.

Do you have an idea how to replace these?

@str4d
Contributor

str4d commented Apr 26, 2022

I think the following is interesting. zkcrypto/bls12_381#84 We could save a Montgomery reduction in each field-array arithmetic operation. That would help a lot.

* [eval_poly](https://github.com/zcash/halo2/blob/main/halo2_proofs/src/arithmetic.rs#L269)

* [compute_inner_product](https://github.com/zcash/halo2/blob/main/halo2_proofs/src/arithmetic.rs#L279)

* [ifft_divise](https://github.com/zcash/halo2/blob/main/halo2_proofs/src/poly/domain.rs#L355)

Do you have an idea how to replace these?

Yep, I was thinking about this myself yesterday. The inner product function we use in Halo 2 is precisely the "sum of products" I implemented there, and so I'm currently working to expose this via the ff crate for generic usage (zkcrypto/ff#79). It can have a slow provided implementation in Field, that concrete fields override with efficient limb-aware implementations.
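The "slow provided implementation" could look like a generic multiply-accumulate that concrete fields override with a limb-aware version. A minimal sketch, assuming simplified trait bounds with `Default` standing in for zero; this is not the actual ff crate signature:

```rust
use core::ops::{Add, Mul};

// Hedged sketch of a generic "sum of products" fallback. A concrete field
// backend would override this with a limb-aware implementation that
// accumulates double-width products and reduces once; this version
// (implicitly) reduces after every step.
fn sum_of_products<F>(a: &[F], b: &[F]) -> F
where
    F: Copy + Default + Mul<Output = F> + Add<Output = F>,
{
    a.iter().zip(b).fold(F::default(), |acc, (&x, &y)| acc + x * y)
}

fn main() {
    // u64 satisfies the toy bounds, standing in for a field element
    assert_eq!(sum_of_products(&[1u64, 2, 3], &[4, 5, 6]), 32);
}
```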

@str4d
Contributor

str4d commented Apr 26, 2022

In terms of assembly optimizations for pasta field arithmetic, we need to update the Rust version: the asm feature has been stable since 1.59.0. If updating is acceptable, I would implement asm as in the following, for an almost 20% speedup. appliedzkp/pairing#5

I do not want to use raw assembly in any of our field implementations if I can avoid it. Instead, we should look at that assembly and figure out why Rust / LLVM is not generating it, and rework the Rust code to get closer. I also want to implement some proper field arithmetic backends in the pasta_curves crate so we can use vector intrinsics (the x86 / x86_64 intrinsics have been stable for a while, and aarch64 intrinsics were stabilised in 1.59). Once we have backends available, then if it were impossible to get the same performance as with raw assembly, we could conceivably have a feature-flagged asm backend for those who really want it.

@ashWhiteHat
Contributor

Hi @str4d
Thank you for the answer.

Yep, I was thinking about this myself yesterday. The inner product function we use in Halo 2 is precisely the "sum of products" I implemented there, and so I'm currently working to expose this via the ff crate for generic usage (zkcrypto/ff#79). It can have a slow provided implementation in Field, that concrete fields override with efficient limb-aware implementations

This is amazing.
I think this logic can be applied to [F] ⚪︎ [F] or F ⚪︎ [F] operations for any arithmetic ⚪︎, as long as it preserves the equivalence relation.
It saves n - 1 calls to the reduce method for addition, and to reduce plus Montgomery reduction for multiplication.
There are many functions this arithmetic could be applied to.
Do you have a plan to modularize it in ff and apply it to halo2?

I do not want to use raw assembly in any of our field implementations if I can avoid it

Copy that.

@str4d
Contributor

str4d commented Apr 26, 2022

I think this logic can be applied to [F] ⚪︎ [F] or F ⚪︎ [F] operations for any arithmetic ⚪︎, as long as it preserves the equivalence relation. It saves n - 1 calls to the reduce method for addition, and to reduce plus Montgomery reduction for multiplication. There are many functions this arithmetic could be applied to. Do you have a plan to modularize it in ff and apply it to halo2?

The interleaved reductions likely won't have any benefit for additions, because the main speedup is due to reduced register pressure by not maintaining double-width field elements, which can be produced by multiplication but not addition. Also, additions don't use Montgomery reduction.

The sharing of reductions between accumulated ⚪︎ operations is interesting, but I'm not sure there's a good general API for that. In the additive case specifically, [F] + [F] is just sum([F].concat([F])); there is no Montgomery reduction, so the only thing we could batch is the conditional subtraction of the modulus, which would require storing an additional limb. For now, I'm inclined to just add Field::sum_of_products and then see what that gets us. If we find other operations that are common enough to justify, then we can add APIs for those (and then look at how we could perhaps generalise once we have the concrete APIs in place).
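The "store an additional limb, batch the conditional subtraction" point can be sketched with a u64 value mod a toy prime standing in for a real field element (none of this is halo2 or ff code):

```rust
// Sketch: deferring the additive reductions in a sum of field elements.
const P: u64 = 0xffff_ffff_0000_0001; // toy modulus, not a halo2 field

/// Reduce after every addition (one conditional subtraction per element).
fn sum_eager(xs: &[u64]) -> u64 {
    xs.iter().fold(0u64, |acc, &x| {
        let s = acc as u128 + x as u128;
        if s >= P as u128 { (s - P as u128) as u64 } else { s as u64 }
    })
}

/// Accumulate in double width (the "additional limb") and reduce once.
fn sum_deferred(xs: &[u64]) -> u64 {
    let acc: u128 = xs.iter().map(|&x| x as u128).sum();
    (acc % P as u128) as u64
}

fn main() {
    let xs: Vec<u64> = (0..1000).map(|i| P - 1 - i).collect();
    assert_eq!(sum_eager(&xs), sum_deferred(&xs));
    // (P - 1) + 2 wraps to 1
    assert_eq!(sum_deferred(&[P - 1, 2]), 1);
}
```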

@ashWhiteHat
Contributor

ashWhiteHat commented Apr 26, 2022

the main speedup is due to reduced register pressure by not maintaining double-width field elements

Nice.

If we find other operations that are common enough to justify, then we can add APIs for those (and then look at how we could perhaps generalise once we have the concrete APIs in place).

Thank you, I understand.
I think it's reasonable to implement a poly crate, for example zkcrypto/poly.
[F] + [F] is closer to polynomial arithmetic than field arithmetic; it should use parallelize, and it's easier to optimize there.
Polynomial arithmetic is performed frequently.
I am going to check how sharing reductions improves performance.
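Treating [F] + [F] as a polynomial operation and parallelizing it could look like the following sketch, using std threads as a stand-in for halo2's `parallelize` helper and plain wrapping u64 adds as a stand-in for field additions (all names here are illustrative):

```rust
// Sketch: chunked parallel element-wise addition of two "polynomials".
fn poly_add_parallel(a: &mut [u64], b: &[u64], threads: usize) {
    assert_eq!(a.len(), b.len());
    if a.is_empty() {
        return;
    }
    // split the arrays into one contiguous chunk per thread
    let chunk = (a.len() + threads.max(1) - 1) / threads.max(1);
    std::thread::scope(|s| {
        for (ac, bc) in a.chunks_mut(chunk).zip(b.chunks(chunk)) {
            s.spawn(move || {
                for (x, &y) in ac.iter_mut().zip(bc) {
                    *x = x.wrapping_add(y);
                }
            });
        }
    });
}

fn main() {
    let mut a = vec![1u64, 2, 3, 4, 5];
    poly_add_parallel(&mut a, &[10, 20, 30, 40, 50], 2);
    assert_eq!(a, vec![11, 22, 33, 44, 55]);
}
```

`std::thread::scope` (stable since Rust 1.63) lets the spawned threads borrow the slices directly, so no `Arc` or cloning is needed.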

@ashWhiteHat
Contributor

memo: par_iter vs normal
rayon-rs/rayon#648 (comment)

@ashWhiteHat
Contributor

ashWhiteHat commented May 7, 2022

Parallelize

Find the proper task size and method for each core count, thread count, and memory size.
Create a condition branch for each environment using thread_num and task_size, as follows.

/// Rough cost model: multiplication = 3 points, addition = 1 point
enum TaskSize {
    // 3 points or fewer
    Small,
    // 4 to 8 points
    Middle,
    // 9 points or more
    Large,
}

// log2 of the available thread count, clamped to at least 1 so the
// division below is well defined (this helper was implied but not shown).
fn log_threads() -> u32 {
    let threads = std::thread::available_parallelism()
        .map(|n| n.get() as u32)
        .unwrap_or(1);
    (u32::BITS - 1 - threads.leading_zeros()).max(1)
}

fn task_chunk(task_size: TaskSize) -> u32 {
    let chunk = match task_size {
        TaskSize::Small => 8,
        TaskSize::Middle => 4,
        TaskSize::Large => 2,
    };
    chunk / log_threads()
}

Task

Find the most efficient task size, thread count, and rayon method, in order to implement a condition branch that decides whether to split the work in parallel or not.

  • par_iter > parallel

Bench

normal vs par_iter vs parallel

one field-arithmetic operation per iteration
Add

k normal (low / est / high) par_iter (low / est / high) parallel (low / est / high)
3 36.614 ns 36.616 ns 36.617 ns 8.5748 us 8.7767 us 8.9484 us 8.2590 us 8.3402 us 8.4133 us
4 69.030 ns 69.110 ns 69.231 ns 9.6449 us 9.7224 us 9.8196 us 7.5027 us 7.7373 us 7.9515 us
5 146.63 ns 146.79 ns 146.97 ns 10.100 us 10.189 us 10.305 us 7.2004 us 7.4858 us 7.8044 us
6 274.81 ns 274.83 ns 274.87 ns 11.191 us 11.256 us 11.351 us 7.9772 us 8.3716 us 8.7483 us
7 558.26 ns 558.31 ns 558.40 ns 13.600 us 13.662 us 13.730 us 9.2109 us 9.3365 us 9.4365 us
8 1.1079 us 1.1079 us 1.1080 us 15.481 us 15.524 us 15.569 us 10.029 us 10.088 us 10.139 us
9 2.2102 us 2.2104 us 2.2106 us 17.009 us 17.036 us 17.062 us 11.506 us 11.539 us 11.574 us
10 4.4262 us 4.4263 us 4.4266 us 18.809 us 18.837 us 18.861 us 12.331 us 12.356 us 12.382 us
11 8.8253 us 8.8256 us 8.8259 us 21.242 us 21.279 us 21.326 us 14.405 us 14.444 us 14.494 us
12 17.644 us 17.645 us 17.647 us 21.693 us 21.821 us 21.971 us 18.548 us 18.569 us 18.592 us
13 35.285 us 35.294 us 35.305 us 32.041 us 32.153 us 32.291 us 26.574 us 26.614 us 26.659 us
14 70.567 us 70.581 us 70.605 us 53.141 us 53.271 us 53.445 us 43.057 us 44.095 us 45.461 us
15 141.93 us 141.93 us 141.94 us 95.057 us 95.192 us 95.343 us 75.803 us 76.210 us 76.766 us
16 282.98 us 283.03 us 283.08 us 181.69 us 182.11 us 182.64 us 144.24 us 144.79 us 145.59 us
17 569.23 us 569.49 us 569.83 us 351.35 us 352.24 us 353.44 us 276.42 us 279.55 us 284.34 us
18 1.1501 ms 1.1513 ms 1.1529 ms 687.74 us 689.02 us 690.59 us 542.22 us 543.47 us 544.96 us

Mul

k normal (low / est / high) par_iter (low / est / high) parallel (low / est / high)
3 227.36 ns 227.38 ns 227.40 ns 9.0132 us 9.1790 us 9.3149 us 8.6094 us 8.8608 us 9.0674 us
4 454.44 ns 454.63 ns 454.89 ns 10.743 us 10.816 us 10.893 us 9.3778 us 9.4564 us 9.5292 us
5 918.76 ns 918.85 ns 918.97 ns 12.749 us 12.851 us 13.017 us 10.056 us 10.117 us 10.184 us
6 1.8231 us 1.8233 us 1.8234 us 14.813 us 14.849 us 14.886 us 11.366 us 11.428 us 11.510 us
7 3.6549 us 3.6554 us 3.6558 us 16.249 us 16.289 us 16.326 us 12.558 us 12.579 us 12.599 us
8 7.3003 us 7.3008 us 7.3016 us 18.171 us 18.199 us 18.233 us 14.454 us 14.474 us 14.499 us
9 14.637 us 14.638 us 14.640 us 21.575 us 21.644 us 21.719 us 18.658 us 18.687 us 18.723 us
10 29.251 us 29.253 us 29.256 us 26.149 us 26.234 us 26.347 us 26.632 us 26.663 us 26.699 us
11 58.909 us 58.914 us 58.919 us 40.919 us 40.985 us 41.073 us 43.039 us 43.164 us 43.326 us
12 117.64 us 117.66 us 117.68 us 71.681 us 71.912 us 72.278 us 75.645 us 75.873 us 76.141 us
13 235.37 us 235.40 us 235.44 us 132.30 us 132.60 us 132.96 us 140.63 us 141.02 us 141.55 us
14 471.41 us 471.47 us 471.55 us 253.99 us 254.66 us 255.57 us 272.04 us 273.20 us 274.69 us
15 943.76 us 943.87 us 943.99 us 493.60 us 494.64 us 496.05 us 530.97 us 532.40 us 534.24 us
16 1.8840 ms 1.8842 ms 1.8845 ms 977.83 us 980.45 us 983.85 us 1.0663 ms 1.0683 ms 1.0706 ms
17 3.7703 ms 3.7711 ms 3.7718 ms 1.9412 ms 1.9449 ms 1.9495 ms 2.1186 ms 2.1261 ms 2.1351 ms
18 7.5475 ms 7.5495 ms 7.5517 ms 3.9131 ms 3.9256 ms 3.9398 ms 4.2483 ms 4.2646 ms 4.2856 ms

Three Mul

k normal (low / est / high) par_iter (low / est / high) parallel (low / est / high)
3 740.03 ns 740.06 ns 740.12 ns 10.245 us 10.278 us 10.313 us 9.8069 us 9.8670 us 9.9196 us
4 1.4876 us 1.4877 us 1.4880 us 13.047 us 13.062 us 13.079 us 11.137 us 11.218 us 11.334 us
5 2.9867 us 2.9870 us 2.9874 us 14.695 us 14.729 us 14.768 us 12.249 us 12.342 us 12.473 us
6 6.0087 us 6.0097 us 6.0114 us 16.122 us 16.150 us 16.192 us 14.111 us 14.152 us 14.203 us
7 11.918 us 11.921 us 11.925 us 18.890 us 18.977 us 19.079 us 17.658 us 17.685 us 17.717 us
8 23.729 us 23.730 us 23.732 us 23.109 us 23.144 us 23.197 us 24.855 us 24.884 us 24.917 us
9 47.429 us 47.434 us 47.441 us 35.143 us 35.267 us 35.459 us 39.047 us 39.091 us 39.146 us
10 95.058 us 95.065 us 95.076 us 59.381 us 59.530 us 59.748 us 67.460 us 67.677 us 67.957 us
11 192.36 us 192.36 us 192.37 us 107.73 us 107.95 us 108.29 us 124.63 us 125.47 us 126.66 us
12 384.87 us 384.89 us 384.92 us 206.87 us 208.82 us 211.66 us 240.87 us 241.48 us 242.20 us
13 769.35 us 769.39 us 769.44 us 401.49 us 402.64 us 404.07 us 470.14 us 471.27 us 472.80 us
14 1.5384 ms 1.5386 ms 1.5388 ms 788.91 us 790.20 us 791.79 us 927.12 us 928.16 us 929.33 us
15 3.1031 ms 3.1034 ms 3.1039 ms 1.5759 ms 1.5776 ms 1.5796 ms 1.8491 ms 1.8517 ms 1.8549 ms
16 6.2126 ms 6.2131 ms 6.2136 ms 3.1819 ms 3.1865 ms 3.1911 ms 3.7340 ms 3.7428 ms 3.7531 ms
17 12.453 ms 12.456 ms 12.459 ms 6.3751 ms 6.3899 ms 6.4062 ms 7.4952 ms 7.5719 ms 7.6787 ms
18 24.969 ms 24.975 ms 24.981 ms 12.789 ms 12.827 ms 12.870 ms 14.962 ms 14.987 ms 15.014 ms

Three Add

k normal (low / est / high) par_iter (low / est / high) parallel (low / est / high)
3 111.25 ns 111.26 ns 111.26 ns 8.9307 us 9.1100 us 9.2478 us 6.8630 us 6.9690 us 7.0764 us
4 221.65 ns 221.67 ns 221.70 ns 10.531 us 10.599 us 10.680 us 8.3465 us 8.6197 us 8.8761 us
5 443.69 ns 443.72 ns 443.75 ns 11.817 us 11.896 us 11.985 us 9.1735 us 9.2792 us 9.3672 us
6 885.67 ns 885.76 ns 885.90 ns 13.417 us 13.505 us 13.598 us 9.8654 us 9.9676 us 10.067 us
7 1.7744 us 1.7746 us 1.7748 us 15.675 us 15.703 us 15.728 us 11.240 us 11.273 us 11.312 us
8 3.5424 us 3.5434 us 3.5451 us 16.768 us 16.794 us 16.819 us 12.511 us 12.561 us 12.635 us
9 6.9358 us 6.9374 us 6.9399 us 18.731 us 18.754 us 18.776 us 14.208 us 14.283 us 14.409 us
10 14.198 us 14.200 us 14.201 us 20.764 us 21.045 us 21.290 us 17.782 us 17.821 us 17.865 us
11 28.315 us 28.319 us 28.324 us 28.049 us 28.104 us 28.176 us 25.271 us 25.326 us 25.409 us
12 56.602 us 56.605 us 56.609 us 45.301 us 45.395 us 45.524 us 40.283 us 40.412 us 40.582 us
13 113.18 us 113.19 us 113.21 us 79.458 us 79.737 us 80.174 us 69.654 us 69.769 us 69.878 us
14 226.36 us 226.38 us 226.40 us 147.58 us 147.84 us 148.15 us 129.21 us 129.45 us 129.74 us
15 452.97 us 453.01 us 453.07 us 284.25 us 285.09 us 286.38 us 248.64 us 249.59 us 250.81 us
16 906.45 us 906.52 us 906.60 us 559.63 us 561.53 us 564.16 us 485.47 us 487.52 us 490.65 us
17 1.8145 ms 1.8147 ms 1.8150 ms 1.1069 ms 1.1092 ms 1.1120 ms 957.72 us 959.49 us 961.72 us
18 3.6520 ms 3.6529 ms 3.6540 ms 2.2157 ms 2.2272 ms 2.2444 ms 1.9031 ms 1.9068 ms 1.9114 ms

Task Chunk

Mac

Add

k \ chunk size 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 (each cell: low / est / high)
3 8.4865 us 8.7383 us 8.9984 us 7.6687 us 8.1307 us 8.6264 us 49.894 ns 50.056 ns 50.264 ns
4 10.804 us 10.841 us 10.878 us 8.5535 us 8.6013 us 8.6544 us 8.1329 us 8.5694 us 9.0234 us 79.834 ns 80.323 ns 80.943 ns
5 17.365 us 18.353 us 19.542 us 11.666 us 12.372 us 13.345 us 9.0639 us 9.3125 us 9.6756 us 7.7428 us 8.1249 us 8.5304 us 135.33 ns 135.54 ns 135.79 ns
6 24.028 us 24.127 us 24.234 us 17.746 us 18.126 us 18.658 us 12.443 us 12.695 us 13.119 us 9.6657 us 9.7400 us 9.8262 us 8.3629 us 8.6792 us 9.0522 us 273.53 ns 277.22 ns 281.66 ns
7 35.130 us 35.498 us 35.926 us 26.501 us 26.679 us 26.875 us 18.959 us 19.016 us 19.075 us 14.307 us 14.394 us 14.484 us 11.646 us 11.726 us 11.815 us 9.1532 us 9.4182 us 9.7456 us 481.41 ns 481.88 ns 482.42 ns
8 40.231 us 40.861 us 41.553 us 35.265 us 36.861 us 38.881 us 27.771 us 27.876 us 27.987 us 22.508 us 22.734 us 23.001 us 18.404 us 18.509 us 18.644 us 15.065 us 15.193 us 15.333 us 11.892 us 11.978 us 12.065 us 1.0040 us 1.0237 us 1.0510 us
9 41.950 us 42.430 us 42.989 us 37.774 us 38.013 us 38.288 us 34.655 us 34.976 us 35.353 us 28.869 us 28.979 us 29.100 us 26.027 us 26.172 us 26.334 us 22.043 us 22.285 us 22.533 us 19.511 us 19.763 us 20.026 us 14.208 us 14.354 us 14.491 us 1.8647 us 1.8722 us 1.8802 us
10 55.450 us 56.708 us 58.059 us 44.576 us 48.146 us 52.999 us 42.094 us 43.535 us 45.845 us 41.633 us 43.171 us 45.192 us 37.792 us 40.577 us 43.652 us 30.572 us 31.031 us 31.491 us 24.981 us 25.107 us 25.243 us 20.775 us 20.922 us 21.082 us 16.728 us 16.861 us 16.986 us 3.6868 us 3.6906 us 3.6948 us
11 57.630 us 58.260 us 58.976 us 53.438 us 54.222 us 54.981 us 51.131 us 52.458 us 54.095 us 43.672 us 44.314 us 45.022 us 38.095 us 38.238 us 38.398 us 34.721 us 34.935 us 35.188 us 29.845 us 30.088 us 30.390 us 26.737 us 26.882 us 27.063 us 24.126 us 24.421 us 24.772 us 19.853 us 20.027 us 20.225 us 7.3577 us 7.3707 us 7.3843 us
12 74.006 us 82.503 us 96.651 us 60.490 us 61.703 us 63.196 us 53.367 us 54.647 us 56.422 us 48.560 us 49.146 us 49.862 us 47.054 us 48.111 us 49.286 us 41.027 us 41.742 us 42.582 us 40.344 us 41.569 us 43.145 us 33.403 us 33.867 us 34.372 us 32.113 us 32.943 us 33.895 us 33.725 us 36.196 us 39.246 us 26.128 us 26.535 us 26.968 us 14.860 us 14.924 us 15.007 us
13 76.729 us 78.650 us 81.140 us 80.344 us 85.142 us 90.723 us 74.581 us 76.099 us 77.818 us 70.706 us 74.664 us 79.865 us 59.432 us 61.664 us 64.488 us 51.299 us 52.632 us 54.048 us 43.822 us 44.417 us 45.068 us 45.089 us 46.367 us 47.964 us 37.537 us 38.859 us 40.761 us 36.239 us 36.815 us 37.442 us 34.318 us 35.158 us 36.153 us 36.640 us 38.085 us 40.029 us 31.645 us 31.850 us 32.090 us
14 96.109 us 99.058 us 103.02 us 82.915 us 85.327 us 88.338 us 74.106 us 75.766 us 77.721 us 65.367 us 66.130 us 67.086 us 67.780 us 69.117 us 70.520 us 65.359 us 66.814 us 68.360 us 65.653 us 68.949 us 72.797 us 64.046 us 68.780 us 74.290 us 52.229 us 59.241 us 68.364 us 43.430 us 45.169 us 47.105 us 38.416 us 39.108 us 39.886 us 46.860 us 49.930 us 53.646 us 51.370 us 57.463 us 67.220 us 60.318 us 60.631 us 61.072 us
15 112.70 us 115.39 us 118.70 us 108.91 us 112.34 us 116.15 us 111.99 us 118.11 us 125.32 us 89.571 us 93.145 us 97.821 us 99.211 us 105.54 us 113.43 us 87.888 us 91.208 us 95.005 us 72.750 us 74.216 us 75.886 us 67.757 us 69.687 us 71.764 us 71.626 us 78.614 us 90.932 us 63.923 us 66.347 us 68.898 us 69.236 us 74.135 us 80.049 us 67.671 us 72.314 us 78.030 us 55.197 us 57.540 us 60.155 us 82.056 us 85.622 us 90.324 us

Github Actions
Add

k \ chunk size 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 (each cell: low / est / high)
3 8.6546 us 8.8228 us 8.9699 us 8.3995 us 8.6312 us 8.8324 us 53.093 ns 53.124 ns 53.153 ns
4 8.8763 us 9.0172 us 9.0935 us 8.7933 us 8.9154 us 9.0024 us 8.6157 us 8.7592 us 8.8718 us 88.424 ns 88.434 ns 88.447 ns
5 9.7176 us 9.7890 us 9.8535 us 9.0014 us 9.0847 us 9.1433 us 9.0339 us 9.0610 us 9.0822 us 8.7167 us 8.8259 us 8.8992 us 157.66 ns 157.68 ns 157.70 ns
6 9.9716 us 10.003 us 10.034 us 9.8813 us 9.9691 us 10.070 us 9.1202 us 9.2548 us 9.3663 us 9.0368 us 9.0608 us 9.0907 us 8.6307 us 8.8124 us 8.9441 us 297.08 ns 297.10 ns 297.12 ns
7 11.683 us 11.753 us 11.825 us 10.635 us 10.665 us 10.697 us 10.445 us 10.482 us 10.519 us 9.6826 us 9.7893 us 9.9166 us 9.2953 us 9.4191 us 9.5374 us 8.6287 us 8.8341 us 9.0342 us 583.53 ns 583.60 ns 583.67 ns
8 15.075 us 15.093 us 15.110 us 13.777 us 13.812 us 13.845 us 12.781 us 12.816 us 12.851 us 12.175 us 12.210 us 12.262 us 10.641 us 10.668 us 10.701 us 10.057 us 10.084 us 10.113 us 9.8548 us 9.9329 us 10.009 us 1.1400 us 1.1402 us 1.1404 us
9 16.341 us 16.368 us 16.398 us 15.089 us 15.147 us 15.212 us 14.532 us 14.584 us 14.642 us 13.957 us 14.031 us 14.159 us 13.308 us 13.364 us 13.432 us 12.472 us 12.520 us 12.564 us 11.970 us 12.009 us 12.052 us 11.264 us 11.291 us 11.317 us 2.2537 us 2.2539 us 2.2543 us
10 18.331 us 18.403 us 18.504 us 17.327 us 17.363 us 17.398 us 16.829 us 16.862 us 16.903 us 16.357 us 16.404 us 16.455 us 15.387 us 15.421 us 15.452 us 15.018 us 15.048 us 15.082 us 14.510 us 14.552 us 14.599 us 13.661 us 13.689 us 13.715 us 13.258 us 13.306 us 13.354 us 4.4917 us 4.4922 us 4.4929 us
11 21.282 us 21.324 us 21.371 us 20.085 us 20.131 us 20.176 us 19.365 us 19.445 us 19.536 us 18.984 us 19.023 us 19.069 us 18.039 us 18.070 us 18.098 us 17.562 us 17.605 us 17.657 us 17.500 us 17.955 us 18.440 us 16.143 us 16.244 us 16.402 us 15.453 us 15.490 us 15.530 us 15.397 us 15.437 us 15.478 us 8.9466 us 8.9474 us 8.9483 us
12 24.029 us 24.103 us 24.180 us 24.317 us 24.483 us 24.686 us 23.475 us 23.547 us 23.622 us 23.031 us 23.082 us 23.127 us 22.150 us 22.211 us 22.275 us 22.000 us 22.047 us 22.095 us 21.733 us 21.795 us 21.877 us 21.030 us 21.096 us 21.179 us 20.199 us 20.332 us 20.578 us 20.032 us 20.094 us 20.171 us 19.908 us 19.949 us 20.002 us 17.864 us 17.866 us 17.867 us
13 35.482 us 35.561 us 35.669 us 32.658 us 32.715 us 32.792 us 31.259 us 31.329 us 31.417 us 30.324 us 30.387 us 30.470 us 30.093 us 30.197 us 30.368 us 29.818 us 29.857 us 29.896 us 30.086 us 30.138 us 30.202 us 29.723 us 29.827 us 29.983 us 29.879 us 29.943 us 30.018 us 29.552 us 29.583 us 29.616 us 29.390 us 29.543 us 29.760 us 29.072 us 29.109 us 29.147 us 35.701 us 35.703 us 35.705 us
14 59.379 us 59.476 us 59.589 us 53.713 us 53.901 us 54.181 us 50.974 us 51.068 us 51.173 us 49.523 us 49.660 us 49.843 us 48.536 us 48.676 us 48.859 us 48.324 us 48.509 us 48.743 us 48.409 us 48.493 us 48.588 us 48.309 us 48.394 us 48.493 us 48.247 us 48.439 us 48.739 us 48.175 us 48.270 us 48.396 us 47.640 us 47.850 us 48.151 us 47.383 us 47.524 us 47.700 us 47.289 us 47.374 us 47.474 us 71.345 us 71.348 us 71.352 us
15 107.67 us 107.84 us 108.03 us 95.682 us 96.364 us 97.592 us 90.079 us 90.224 us 90.388 us 87.090 us 87.324 us 87.616 us 85.794 us 86.112 us 86.499 us 85.090 us 85.281 us 85.510 us 85.549 us 85.682 us 85.829 us 84.796 us 85.649 us 87.173 us 84.602 us 84.786 us 85.009 us 84.375 us 84.505 us 84.649 us 84.195 us 84.374 us 84.600 us 84.204 us 84.365 us 84.550 us 83.856 us 84.032 us 84.259 us 83.751 us 83.971 us 84.246 us

Summary

Words

Word: Explanation
Turning Degree: the minimum degree k at which the parallel version starts to benefit.
Task Size: the amount of work each thread processes.
Task Point: a concrete measure of task size, counting an addition as 1 and a multiplication as 5~6.
Base Turning Degree: the turning degree when performing 1 task point, which is 13.
Chunk Size: the number of task points each thread processes.

Fact

  • The thread count has almost no effect on the turning degree.
    Benched with 2 and 8 threads.
  • The bigger the task size, the lower the turning degree.
  • The bigger the task size, the more the parallel version speeds up.
  • The bigger the task size, degree, and thread count, the smaller the efficient chunk size.
  • In task-point terms, 5 additions < 1 multiplication < 6 additions.
  • The order of influence is Task Point > Degree > Thread Number.

Variables

The turning degree can be approximated as (Base Turning Degree) - log(Task Point).
The chunk size can be approximated as n / 2^{[8 * log(Task Point) + 2 * (k - Base Turning Degree) + log(thread_number)] / 8}.
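The chunk-size formula in code form, taking "log" as log2 (an assumption) and keeping the fitted constants 8 and 2 from the comment; this is an empirical model, not library code:

```rust
// Empirical chunk-size model from the benchmark analysis (floating point).
fn chunk_size(n: f64, task_point: f64, k: f64, base_turning_degree: f64, thread_number: f64) -> f64 {
    let exponent =
        (8.0 * task_point.log2() + 2.0 * (k - base_turning_degree) + thread_number.log2()) / 8.0;
    n / 2f64.powf(exponent)
}

fn main() {
    // 1 task point at the base turning degree on 1 thread: the exponent is 0,
    // so the chunk is the whole array
    assert_eq!(chunk_size(8192.0, 1.0, 13.0, 13.0, 1.0), 8192.0);
}
```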

Uncertainty

I suspect the thread count does affect the turning degree after all, and I would like to test on a machine with more cores.

Environment

Resource GitHub Actions Mac
Thread Number 2 8
CPU 2 core 4 core
Memory 7 GB 8 GB
Storage 14 GB 512 GB


Reference

rayon

@ashWhiteHat
Contributor

Hi there.
I experimented with the parallelization optimization and got a 20% speedup.

Bench
ashWhiteHat#4 (comment)
Experiment Result
#548 (comment)

@ashWhiteHat
Contributor

ashWhiteHat commented May 31, 2022

The improvement list and status.
The bench was done on GitHub Actions.

Target Branch Status Speed Up
Fft ashWhiteHat#5 Done fft and ifft 20%
pippenger ashWhiteHat#6 Draft multiexp 20%
Parallel ashWhiteHat#4 Done prover 20%
h() None None
wnaf None None

@daira
Contributor

daira commented Nov 1, 2022

Consensys/gnark-crypto#249 is a new implementation of batch affine MSM that claims a 10-20% speedup for gnark-crypto's MSM (which was already a good implementation).

@ashWhiteHat
Contributor

I think we can optimize curve scalar multiplication with NAF at almost zero cost.

We currently convert the field element to a 256-bit array.
https://github.com/zcash/pasta_curves/blob/main/src/curves.rs#L483
If we convert it to a signed bit array instead, we can save about $n/6$ elliptic curve additions, where n is the field bit length.
We can also skip the zero bits and reduce the doubling.

This is the to_nafs implementation.
https://github.com/zero-network/zero/pull/60/files#diff-15198a125f029cb977758faf8db9469e6e710d1616b5ceb08eaced67993c715dR48

And the scalar arithmetic implementation.
https://github.com/zero-network/zero/pull/60/files#diff-bb1d24a6ddcc81d90c6392212f713b15cac6a371be79c044302f0ea11695b02dR89

The extra costs are the to_naf operation itself and a curve negation whenever a digit is negative.
to_naf runs over an $n$-bit value and is not heavy, and curve negation is also cheap compared with curve addition and doubling.
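The recoding idea can be sketched as follows (illustrative code, not the linked implementation): NAF uses signed digits in {-1, 0, 1} with no two adjacent nonzeros, so on average about n/3 digits are nonzero versus n/2 set bits in plain binary, which is where the roughly n/6 saved curve additions come from.

```rust
// Sketch of non-adjacent form (NAF) recoding.
fn to_naf(k: u64) -> Vec<i8> {
    let mut k = k as i128; // widen so k + 1 cannot overflow
    let mut naf = Vec::new();
    while k > 0 {
        if (k & 1) == 1 {
            // pick +1 or -1 so the remaining value is divisible by 4
            let d: i8 = if (k & 3) == 1 { 1 } else { -1 };
            naf.push(d);
            k -= d as i128;
        } else {
            naf.push(0);
        }
        k >>= 1;
    }
    naf
}

/// Reconstruct the scalar from its NAF digits (low digit first).
fn from_naf(naf: &[i8]) -> i128 {
    naf.iter().rev().fold(0i128, |acc, &d| acc * 2 + d as i128)
}

fn main() {
    // 7 = 8 - 1: digits (low to high) [-1, 0, 0, 1], 2 nonzeros vs 3 set bits
    let naf = to_naf(7);
    assert_eq!(naf, vec![-1, 0, 0, 1]);
    assert_eq!(from_naf(&naf), 7);
    // the defining property: no two adjacent nonzero digits
    assert!(naf.windows(2).all(|w| w[0] == 0 || w[1] == 0));
}
```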

@ashWhiteHat
Contributor

I think Karatsuba can be used for field arithmetic.
We currently use the convolution (schoolbook) product for limb arithmetic.
The following is an example for limbs of length 4.

  • Naive(M: 16, A: 9)

$x^6: a_3b_3$
$x^5: a_2b_3 + a_3b_2$
$x^4: a_1b_3 + a_2b_2 + a_3b_1$
$x^3: a_0b_3 + a_1b_2 + a_2b_1 + a_3b_0$
$x^2: a_0b_2 + a_1b_1 + a_2b_0$
$x^1: a_0b_1 + a_1b_0$
$x^0: a_0b_0$

  • Karatsuba(M: 10, A: 6, S: 18)

$u_0v_1 + u_1v_0 = u_1v_1 + u_0v_0 - (u_1 - u_0)(v_1 - v_0)$

$x^6: a_3b_3$
$x^5: a_2b_2 + a_3b_3 - (a_2 - a_3)(b_2 - b_3)$
$x^4: a_1b_1 + a_2b_2 + a_3b_3 - (a_1 - a_3)(b_1 - b_3)$
$x^3: a_0b_0 + a_1b_1 + a_2b_2 + a_3b_3 - (a_0 - a_3)(b_0 - b_3) - (a_1 - a_2)(b_1 - b_2)$
$x^2: a_0b_0 + a_1b_1 + a_2b_2 - (a_0 - a_2)(b_0 - b_2)$
$x^1: a_0b_0 + a_1b_1 - (a_0 - a_1)(b_0 - b_1)$
$x^0: a_0b_0$

If 6M + 3A > 18S, i.e. the saved multiplications and additions outweigh the extra subtractions, we can speed up,
but $(a_2 - a_3)(b_2 - b_3)$ can overflow.
Its type would have to be i128, and when $(a_2 - a_3)$ and $(b_2 - b_3)$ both have magnitude u64::MAX, the product doesn't fit in i128's positive range (127 bits), so that's a problem.
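The cross-term identity behind the Karatsuba rows above can be checked directly; a sketch in signed i128 arithmetic (with full u64 limbs the differences need 65 bits, which is exactly the overflow concern):

```rust
// One x^5-style coefficient both ways:
// a2*b3 + a3*b2 == a2*b2 + a3*b3 - (a2 - a3)*(b2 - b3).
fn cross_naive(a2: i128, a3: i128, b2: i128, b3: i128) -> i128 {
    a2 * b3 + a3 * b2 // two multiplications
}

fn cross_karatsuba(a2: i128, a3: i128, b2: i128, b3: i128) -> i128 {
    // one fresh multiplication, reusing a2*b2 and a3*b3 which the
    // schoolbook product computes anyway for other coefficients
    a2 * b2 + a3 * b3 - (a2 - a3) * (b2 - b3)
}

fn main() {
    assert_eq!(cross_naive(3, 5, 7, 11), 68);
    assert_eq!(cross_karatsuba(3, 5, 7, 11), 68);
    // larger limbs (kept below 2^63 so the i128 products cannot overflow)
    let (x, y) = ((1i128 << 62) - 1, (1i128 << 62) - 5);
    assert_eq!(cross_naive(x, y, y, x), cross_karatsuba(x, y, y, x));
}
```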

@ashWhiteHat
Contributor

We could improve wNAF if we implement quadrupling for the curve and multiplication by 4 for the field.
The multiplication by 4 can be replaced with a 2-bit shift, skipping one Montgomery reduction.
These are my ideas for performance.
Please let me know if any of them sounds interesting.
Thank you!

@ashWhiteHat
Contributor

ashWhiteHat commented Jan 28, 2023

Precomputing inv * mod may also save 4 multiplications in Montgomery reduction.

@ashWhiteHat
Contributor

One more idea I'm curious about is one-time modular reduction.
The bitcoin implementation doesn't perform a modular reduction after every operation; instead it bounds the accumulated arithmetic with a magnitude field.
https://github.com/RustCrypto/elliptic-curves/blob/master/k256/src/arithmetic/field/field_impl.rs#L20
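The magnitude idea can be sketched like this (a toy u64 version with a 31-bit modulus; the type, field names, and headroom constant are all illustrative, not the linked k256 code): each value carries a bound on how many unreduced elements it may contain, and a reduction only happens when that bound would exceed the headroom.

```rust
// Sketch: lazy reduction guarded by a magnitude bound.
const P: u64 = 0x7fff_ffff; // toy 31-bit modulus, leaving lots of u64 headroom

#[derive(Clone, Copy)]
struct Lazy {
    value: u64,     // possibly unreduced; bounded by roughly magnitude * P
    magnitude: u32, // bound on accumulated unreduced additions
}

impl Lazy {
    // u64 could hold ~2^33 copies of P; stay far below that
    const MAX_MAGNITUDE: u32 = 8;

    fn new(v: u64) -> Self {
        Lazy { value: v % P, magnitude: 1 }
    }

    /// Addition without an immediate reduction; reduce only when the
    /// combined magnitude would exceed the allowed headroom.
    fn add(self, rhs: Lazy) -> Lazy {
        let mut value = self.value + rhs.value;
        let mut magnitude = self.magnitude + rhs.magnitude;
        if magnitude > Self::MAX_MAGNITUDE {
            value %= P;
            magnitude = 1;
        }
        Lazy { value, magnitude }
    }

    fn normalize(self) -> u64 {
        self.value % P
    }
}

fn main() {
    // six copies of P - 1 summed with no intermediate reduction
    let mut acc = Lazy::new(P - 1);
    for _ in 0..5 {
        acc = acc.add(Lazy::new(P - 1));
    }
    assert_eq!(acc.normalize(), P - 6); // 6 * (P - 1) mod P
    assert_eq!(acc.magnitude, 6); // the reduction path was never taken
}
```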
