-
Notifications
You must be signed in to change notification settings - Fork 316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache values for faster Pedersen #697
Comments
Is the idea to cache that amount per hash or for all hashes? |
for all hashes |
Based on discussion with @arielgabizon, this is cache-size information:
As you can see, the size of the cache quickly becomes unmanageable if we assume we will hold it all in memory at once ( In the table above, This continues to suggest that dedicated machines just for merkle-tree generation may be desirable -- so hardware can be configured to optimize for this usage pattern. Ideally, we would implement this in a way that makes size of the cache/window-bits entirely configurable. The first pass of the implementation probably should not include the multiple-cache solution, which adds complexity, but the possibility of needing that later should be considered when designing. We should explore that option through calculations, though. |
codename: "Spedersen" |
I looked into this more, and I think what @arielgabizon and I discussed above (which involves some differences from the original hope of this issue already) needs to be modified further. This is because the second loop in the pdersen hashes is not using a 3-bit (size 8) window. Rather, it's an 8-bit (size 256) window. That changes the table to look like this:
This brings the cache size up even further (with all the same logic applying), and therefore further reduces the expected overall benefit. I did validate that the best-case of making the changes is likely to do what we expect — by doubling the window size (from 8 to 16) and observing a 2x speedup in the pedersen hash benchmark in This also adds complexity, though because even processing two windows at a time will require more bits than are produced by the 'first loop'. That means we'll either need a lot of complexity to retain the same algorithm, or the actual result of the pedersen hashes will change because of a new strategy for how many bits are assigned to each generator. We would need to discuss whether such an approach could even be viable. Based on the above, I think we should consider this now to be 2-3x at best — and with a lot of accompanying effort to rework the implementation. |
I've had time to think about this a bit more and also discussed some planning issues with @sanchopansa. My tentative conclusions are:
If someone has more information or a different analysis, we can revisit. Otherwise, I'm prepared to consider the initial investigation complete and tentatively adopt the plan outlined above. Please note that the original issue is somewhat out of date because of what we have learned from closer reading of the code. |
@sanchopansa @jake @arielgabizon @nicola One more update: it seems that just adjusting So… I now think the answer is that we should add a mechanism to make this configurable and/or choose the default value we want. We could still consider the 'many passes with smaller in-memory cache' strategy, but the incremental gains might well not warrant it. 67Mb for 2x seems like it may be a good tradeoff in our setting. 11GiB for 3x not so much. 500MiB for (maybe? best case if overhead proves cheap) 3x plus a lot of complexity in implementation … maybe. It seems we can also set this to any bit size, so the optimal tradeoff may be at some speedup which is not an integer multiple. This might be worth benchmarking to see real performance numbers in both speed of creating large merkle trees and actual memory consumed. It seems plausible that the best answer will be empirical discovery of a sweet spot which gives us best bang for the buck. Whatever we learn there will help us model this for projections. We will need to be able to do so in order to choose values for the merkle-hybrid thresholds by layer. This might be something @jake can work on after all. @dignifiedquire would you help think about how best to conduct and report that investigation? |
Although it seems clear in retrospect (that we cannot), I thought maybe we could get some benefit from caching in the first loop (which calculates the scalar). Unfortunately, the bookkeeping required to manage indexing into the cache is effectively like just performing the calculations. Plus the edge cases involved in dealing with partial 'chunks' of windows add a lot of complexity. My conclusion is that we can't gain any benefit this way (unfortunately), and I don't think we should put any more energy into trying. I've pushed a branch implementing this, for the sake of completeness and in case someone wants to have a look, though. (https://github.com/filecoin-project/sapling-crypto/tree/feat/pedersen-cache) To be clear, we can still manually set the window size for the second loop and get a speedup at the cost of larger cache. @arielgabizon has also noted that we can reduce the required cache by calculating rather than caching the negation, since this is relatively cheap. That's a strategy we could explore if we find we need to push the limit and are confident (after calculating) that the space savings warrant the complexity of a new implementation. |
A small credit: not caching negation values was pointed to me by @ValarDragon |
One more thought: since we will most likely end up porting this pedersen hashing algorithm to ZEXE, it may make most sense to include the 'no-cache negation' optimization described above in the intial implementation. @sanchopansa, @stefandeml. |
@nicola @arielgabizon @DrPeterVanNostrand @dignifiedquire Here are benchmarks of pedersen hashing based on changing Results Summary
Bear in mind that the memory increase factor is for total memory consumption, which at smaller cache sizes is not dominated by the spedersen change. As you can see, we can get a 1.73x speedup for only 50KiB — an obvious benefit we should probably make the default. To move that to 1.88 requires 8GiB, and getting above 2x to 2.19 requires 64GiB! The proposed optimization will shift those numbers somewhat, and we can calculate it. However, the general pattern will remain similar. We should expect to see a modest increase toward (but probably less than) 2x at reasonable memory costs. If you compare with the projected table above, you can see that realized performance is a little worse than predicted, but the general pattern holds. I stick by what we discussed earlier (sync), which is that we should think of this as being about a 2x max speedup, and that the exact amount will depend on memory tradeoffs which have to be evaluated against other time-space tradeoffs also being made. Bench output
|
Circuit tests are timing out with window size 16: filecoin-project/sapling-crypto#5 I'm not yet sure if this is a real problem, but we should at least add some circuit benchmarks. We may need to independently set the value when running in and out of circuits. |
Updated benchmarks, probing window sizes of 17, 18, 19. 17 looks like another sweet spot — actually the best performance before it begins decreasing, but going up to 600+MiB overhead. We can probably settle on either 16 or 17 and call that good for now.
Raw results: for completeness note that the timing with window size 16 during this run was a little slower than previously.
|
We have merged configurable window sizes as well as set the new default to 16 |
For batches of Pedersen hashes dramatic savings can be made by storing intermediate values.
Details:
The Sapling Pedersen Hash by @ebfull and @daira splits the input into windows of 3 bits.
Associates the j'th window with the group element gj = [16j-1] g, where g is the generator of the segment (the input is split into segments of length 3*63 bits, each segment is divided into 63 windows or chunks see section 5.4.1.7 here https://github.com/zcash/zips/blob/master/protocol/protocol.pdf)
The 3 bit input of each window is interpreted as a number in cj in {-4,-3,-2,-1,1,2,3,4}
and [cj] gj is added to the output.
In current implementation
https://github.com/zcash-hackworks/sapling-crypto/blob/b70d6e66fc1de66f8609e22a2b13eb11bdeb69b3/src/pedersen_hash.rs#L37-L76
each window requires at worst case 8 group operations and at best case 4 group operations,
to compute [cj] gj, together with the next base value gj+1.
Precomputing all the values [c] gj for c in {-4,-3,-2,-1,1,2,3,4} and j in {1..63} compresses these 4-8 operations to one addition at the cost of storing 8 * 63 = 492 group elements each of 32 bytes.
Naturally you could go further and precompute all possible output contributions of each block of d windows for some parameter d<63.
This would compress 8d group operations into one, at the cost of precomputing and storing, denoting by T the ceiling of 63/d,
8d * T group elements or 8d * 32T bytes.
Recall that when hashing 64 bytes into 32 bytes we have 3 such segments so in fact need to cache
8d * 3T group elements or 8d * 96T bytes.
For example, taking d=4, we should be able to reduce ped hash computation time by a 32 factor,
by caching
4096 * 16 * 96 = 6,291,456 bytes.
The text was updated successfully, but these errors were encountered: