ProgPoW: A programmatic (dynamic) Proof-of-Work algorithm tuned for commodity hardware #15
Comments
This looks like a solid pre-proposal to me. I'm not familiar with the original Dagger Hashimoto and thus can't fully evaluate it yet; nevertheless, the proposed changes all make sense to me from the perspective of using GPUs fully and optimally.

My (limited) understanding is that Dagger Hashimoto is an asymmetric PoW just like Equihash is - that is, validating a proof is lightweight and nearly instant, whereas computing a proof is heavy and slow (but not too slow as to make the PoW non-progress-free - the hashrates on the order of 10 to 20 per second given in the proposal are OK'ish in that respect, meaning there are lots of PoW computations per block interval). However, can't we achieve a similar effect with a simpler PoW? I think an instantiation of yescrypt/pwxform with low-level parameters to fit and maximally use a GPU can have similar properties, except only for not generating "a random sequence of math" and technically being symmetric. The latter can be offset by setting yescrypt's tunable costs to be low and using a large ROM. A major drawback is that it means only moderately fast validation (not nearly instant) - like taking a millisecond per CPU core - and that the machine doing the validation needs to have enough RAM to accommodate yescrypt's ROM.

As to "a random sequence of math", how would it be generated so that it utilizes most of a GPU's computing resources yet lacks optimization potential (or has obvious optimization potential and no more, so that the GPU miners can utilize it easily)? I think it'd be tough to make use of more than a handful of different instructions in there while also meeting our other desirable properties. So target the different execution units (perhaps integer MADs and local/shared memory lookups/writes, like yescrypt's pwxform does? anything else we can easily use?) while keeping the instruction set use rather limited? Would this be very different from a non-random sequence then, in terms of relative resource utilization on GPUs vs. more specialized ASICs? That's unclear to me. A "random number generator" may be unnecessary as a separate component given that per this proposal we'd effectively add something similar to yescrypt pwxform's S-boxes. A very simple transformation like a pwxform round and subsequent S-box writes results in changing S-box contents that also pass randomness tests.

The stated expected limitations of specialized ASICs' "up to 1.2x gains in efficiency, compared to the 2x gains possible in Dagger Hashimoto" look overly optimistic to me, although not necessarily impossible. When we're talking very small gains like 1.2x, it becomes especially important to consider what's included in GPU prices available to miners - and it's by far not just the production costs. The inefficiencies in gaming GPUs' (components) markets when those GPUs are purchased by/for miners are probably in excess of 20% compared to a market for a company's specialized mining ASICs (let alone that company's own use of the ASICs). It's also important to consider whether our PoW algorithm would be somewhat tolerant to hardware errors (which is another way to improve usable yield and lower production costs) or not at all. Equihash is somewhat tolerant. It sounds to me like this new PoW won't be as tolerant, which is a good thing.

Please read the above as a friendly review and challenging of the proposal to help make it better.
I am not familiar with FNV1a, but FWIW per the table at https://github.com/leo-yuriev/t1ha variations of t1ha (including those optimized for 32-bit) are the fastest non-cryptographic hashes that pass SMHasher tests, whereas FNV1a_YoshimitsuTRIAD is slightly faster yet but fails tests.
I just found that the code and more documentation are already available at https://github.com/ifdefelse/ProgPOW

This answers my question on what kind of random math is being used, but doesn't address the concerns associated with it: optimization potential that miners will compete in exploring, only a few instructions being used, and this not necessarily being better than optimally-pre-weighted use of MADs - in fact, introducing relatively lightweight ops to the mix might leave a GPU's MAD units under-utilized and thus let an ASIC get away with implementing fewer MAD units. Quoting from the author's documentation linked above:

void merge(uint32_t &a, uint32_t b, uint32_t r)
{
switch (r % 4)
{
case 0: a = (a * 33) + b; break;
case 1: a = (a ^ b) * 33; break;
case 2: a = ROTL32(a, ((r >> 16) % 32)) ^ b; break;
case 3: a = ROTR32(a, ((r >> 16) % 32)) ^ b; break;
}
}

uint32_t math(uint32_t a, uint32_t b, uint32_t r)
{
switch (r % 11)
{
case 0: return a + b;
case 1: return a * b;
case 2: return mul_hi(a, b);
case 3: return min(a, b);
case 4: return ROTL32(a, b);
case 5: return ROTR32(a, b);
case 6: return a & b;
case 7: return a | b;
case 8: return a ^ b;
case 9: return clz(a) + clz(b);
case 10: return popcount(a) + popcount(b);
}
}

(In the implementation, this choice is compile-time (kernel source code is produced to be compiled for the target GPU), so the maybe-inefficiency of modulo division is not an issue.) Out of these many operations, only the two MULs are heavy (and they miss on the opportunity to be MADs, to more fully utilize that hardware).

BTW, the focus on staying 32-bit for GPU friendliness may be excessive, especially if we also have a goal to maintain some register pressure so that the large register files are used nearly-fully. Modern GPUs do have some "64-bit" instructions operating on 32-bit elements in pairs of registers for each operand, and in a sense it's wasteful not to use those. For example, at least at NVIDIA PTX asm level (and probably also at ISA level and in hardware, but I didn't confirm) there's 64-bit addition (that is, with carry between the two 32-bit halves) as part of the high 32-bit MAD instruction. From a Japanese yescrypt-0.5'ish coin miner:

__device__ __forceinline__ uint2 mad64(const uint32_t a, const uint32_t b, uint2 c)
{
#if 0
return vectorize((uint64_t)a * (uint64_t)b) + c;
#else
asm("{\n\t"
"mad.lo.cc.u32 %0, %2, %3, %0; \n\t"
"madc.hi.u32 %1, %2, %3, %1; \n\t"
"}\n\t" : "+r"(c.x), "+r"(c.y) : "r"(a), "r"(b)
);
return c;
#endif
}

It feels wasteful not to use powerful instructions like this, which probably utilize the MAD units' circuitry to the max achievable with integer math. Do we really achieve anything good by sometimes no-op'ing those heavy MADs in favor of something trivial like a bitwise-OR, which might also be redundant? I doubt it.
Oh, also the funnel shifts probably consume a non-trivial amount of die area. But then it's better to use them as such, not reduce them to the simpler bit rotates. So perhaps just use a fixed mix of 64-bit MADs, funnel shifts (of 64-bit values by variable counts - but then we could run into differences in how shift counts in excess of bit counts are processed by different GPUs/CPUs - some will require explicit masking to be compatible with others, which would be wasteful), and shared/local memory lookups and writes. (A small sketch of the masking issue follows right after this comment.)

The GPUs' instruction processing probably introduces little overhead (in die area and energy consumption) since it's amortized due to wide SIMD. What's avoided in a pre-defined circuit is the need to store intermediate results in the large global register file with indexed access, rather than in temporary registers placed right in between the units. So what we might want is non-pre-defined routing of data between a few relatively large units (like those I mentioned, in comparison to e.g. a mere bitwise-OR). It isn't important to mix different instructions in arbitrary order (in fact, we might end up under-utilizing our execution units if we do), but rather to re-route the outputs/inputs in between fixed groups of instructions (e.g., in one code revision a MAD could take one of its inputs from an immediately preceding S-box lookup, and in another it'd take that input from an S-box lookup done 5 instruction groups earlier, etc.)

Unlike CPUs, modern GPUs also tend to allow indexed access to general-purpose registers (but for tiny indices, and we're limited to 255/N registers anyway). We could also use that, and then it's even runtime variability in the routing, which an ASIC would also have to duplicate.
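To make the shift-count concern concrete, here is a minimal C sketch (my own illustration, not code from the proposal) of a 32-bit rotate and a 32-bit funnel shift written portably; the explicit masking of the count is exactly the extra work that some targets need and others get for free:

#include <stdint.h>

/* Rotate left by a variable count; the "& 31" masks keep the shift amounts
 * in range on every target and avoid undefined behavior when n % 32 == 0. */
static inline uint32_t rotl32(uint32_t x, uint32_t n)
{
    return (x << (n & 31)) | (x >> ((32 - n) & 31));
}

/* Funnel shift right: extract 32 bits from the 64-bit value hi:lo, starting
 * at bit position count. Again the count has to be masked explicitly. */
static inline uint32_t funnel_shift_r(uint32_t hi, uint32_t lo, uint32_t count)
{
    count &= 31;
    if (count == 0)
        return lo;
    return (lo >> count) | (hi << (32 - count));
}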
Skimming over the code, I now see that FNV1a and KISS99 are only used in host code to generate a randomized kernel for GPU, and are not used on GPU. This wasn't clear to me from the proposal (perhaps in part because I am not familiar with the original Dagger Hashimoto). Then my earlier suggestion that we don't need an RNG because we already have enough randomness from S-box lookups doesn't apply here. OTOH, this also means that neither the hash nor the RNG need to be fast.
@solardiz Thanks for the extensive feedback! This is wonderful. We are digesting. Please reach out to ifdefelse at protonmail if you would like to chat with the team directly.
Thank you for the detailed review and comments. A bunch of answers below.
Those should have been labeled as mega-hash/sec. Sorry for the confusion.
No. Any fixed hashing sequence can be implemented on an ASIC (or FPGA) with an order of magnitude better efficiency than a CPU-based implementation. The random sequence of math is what forces this algorithm to run on a GPU instead of on a specialized, fixed-sequence ASIC.
I wasn't familiar with t1ha, but it looks very interesting. KISS99/FNV1a are used both in generating the random math sequence for the inner loop and outside the loop. In the sequence generation FNV1a is used to seed KISS99, and then KISS99 is used for producing the instruction/register sequence. If you look at the non-loop kernel code here: FNV1a is again used to seed KISS99. KISS99 is used to initialize the values of mix[] (the per-lane registers). FNV1a is also used for reducing mix[] to a single per-lane value, and then reducing the per-lane values to a single 128-bit result. I think FNV1a is fine for seeding KISS99. The reduction could replace FNV1a with T1HA0_32le, though it probably won't make much of a difference.
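For readers who haven't seen these primitives, here is a compact plain-C sketch of the flow just described - my own illustration, not the project's code; fill_mix, NUM_REGS, and the exact seeding order are assumptions:

#include <stdint.h>

#define FNV_PRIME  0x01000193u
#define FNV_OFFSET 0x811c9dc3u

/* 32-bit FNV1a step: XOR the data in, then multiply by the prime. */
static uint32_t fnv1a(uint32_t h, uint32_t d)
{
    return (h ^ d) * FNV_PRIME;
}

/* Marsaglia's KISS99: a multiply-with-carry pair, a linear congruential
 * generator, and a 3-shift xorshift, combined. Very few instructions. */
typedef struct { uint32_t z, w, jsr, jcong; } kiss99_t;

static uint32_t kiss99(kiss99_t *st)
{
    st->z = 36969 * (st->z & 65535) + (st->z >> 16);
    st->w = 18000 * (st->w & 65535) + (st->w >> 16);
    uint32_t mwc = (st->z << 16) + st->w;
    st->jsr ^= st->jsr << 17;
    st->jsr ^= st->jsr >> 13;
    st->jsr ^= st->jsr << 5;
    st->jcong = 69069 * st->jcong + 1234567;
    return (mwc ^ st->jcong) + st->jsr;
}

/* Hypothetical per-lane mix[] initialization in the spirit described above:
 * FNV1a folds the seed and lane id into the four KISS99 state words, then
 * KISS99 output fills the per-lane registers. */
#define NUM_REGS 16

static void fill_mix(uint64_t seed, uint32_t lane_id, uint32_t mix[NUM_REGS])
{
    kiss99_t st;
    uint32_t h = FNV_OFFSET;
    h = fnv1a(h, (uint32_t)seed);         st.z = h;
    h = fnv1a(h, (uint32_t)(seed >> 32)); st.w = h;
    h = fnv1a(h, lane_id);                st.jsr = h;
    h = fnv1a(h, lane_id);                st.jcong = h;
    for (int i = 0; i < NUM_REGS; i++)
        mix[i] = kiss99(&st);
}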
All GPU 64-bit ops are implemented using 32-bit ops. So doing a single 64-bit op or two 32-bit ops makes effectively no difference. For someone implementing a CPU ALU, having carry-out to support a 64-bit op via two 32-bit ops is essentially free, so this doesn't differentiate in any interesting way.
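As a tiny illustration of that point (my sketch, not code from the project): a 64-bit add built from two 32-bit adds only needs the low half's carry-out fed into the high half, which is nearly free in hardware.

#include <stdint.h>

/* 64-bit addition decomposed into two 32-bit additions with a carry
 * between the halves - the "essentially free" carry-out mentioned above. */
static void add64_via_32(uint32_t a_lo, uint32_t a_hi,
                         uint32_t b_lo, uint32_t b_hi,
                         uint32_t *r_lo, uint32_t *r_hi)
{
    uint32_t lo = a_lo + b_lo;
    uint32_t carry = (lo < a_lo);   /* carry-out of the low 32 bits */
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;    /* carry-in to the high 32 bits */
}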
The ProgPoW_REGS define directly modifies the register pressure and is a better way of controlling it than occasional 64-bit ops.
I had originally supported an explicit 3-input MAD, and also some 3-input logical ops, but the added complexity didn't seem worth it. The random sequence will naturally produce some MADs and 3-input LOPs anyway, which the compiler can make use of.
The hardware implementation of either is effectively the same, again not a significant differentiator.
AMD supports an indexed RF, though I think it's only accessible via assembly. Nvidia does not. I intentionally picked ops that are common between both architectures. Focusing on just CUDA, I would have also made use of shfl, prmt, and lop3.
If there was a fixed instruction order where only the input/output locations changed, an ASIC could directly build that pipeline. Only by changing both the instruction order and the input/output locations does it guarantee a real instruction processing system, not a fixed (and much more efficient) pipeline.
Thank you for your answers. Regarding the speeds, what are the validation speeds on CPU? And proof size?

I agree with many of your points: there isn't a lot of difference between using "64-bit" ops vs. using 32-bit ones and doubling the register pressure via your algorithm's parameters (the maybe-improvement from making use of the carry-out/carry-in is very slight), yes the hardware implementation of 32-bit rotates is probably of almost the same complexity as funnel shift's(*), and you seem to be correct that NVIDIA doesn't allow runtime indexing into the register file (unlike AMD GCN).

(*) But I still think it'd take extra resources to route two separate 32-bit inputs to the circuit, instead of one such input. AMD has funnel shift, too: it's

Regarding bitwise ops (even 3-input ones), I continue to think they're too simple to be worth using (except where we might have to e.g. for non-linearity with other ops). On every such op, we'd waste more time and energy on register file access than on the op itself. I also continue to see the unclear optimization potential of those instruction sequences as a concern, and to that I add that your

Regarding pipelining in an ASIC, I continue to think that it may become impractical with varying locations of inputs to each algorithm step (would-be pipeline stage) and/or too much state to carry along the pipeline, even if each step's operations are fixed and known at ASIC design time. For example, consider implementing bcrypt in an ASIC: each Blowfish round has a fixed set of ADDs and XORs, but the input locations keep changing. Indeed, our and others' implementations of bcrypt on FPGA don't pipeline it - instead, they include multiple iterated bcrypt instances per core, which is similar to what's done in software when cracking bcrypt hashes on a CPU core.

For another example, a person on our team is now working on sha512crypt on FPGA. SHA-512 seems like a perfect fit for pipelining, and almost all of the higher-level logic of sha512crypt is pre-determined (aside from variance by password and salt length, which affect how many times SHA-512's compression function is invoked, but we can group candidate passwords by same length and crack one salt at a time to overcome this runtime variance). So he started with exploring fully-pipelined designs for SHA-512 (thus, with ~80 or ~160 in-flight computations), but after quite some trial and error realized that the size of SHA-512 inputs to carry along the pipeline and the complexity of the high-level algorithm make this prohibitively wasteful. For the current design, he switched to modules of four iterated SHA-512 cores (each computing two SHA-512 hashes at a time, and with slight overlap between previous and new sets of two-hash computations, so at times with up to four in-flight computations) controlled with specialized soft CPU cores (with 16-way SMT, thus one soft CPU core controls the four SHA-512 cores times four in-flights from its different hardware threads). The CPU cores are small in comparison to their groups of four SHA-512 cores. This may sound crazy, but it proved more efficient than a fully-pipelined design so far. Indeed, this design still uses a little bit of pipelining, but to an extent similar to what we see in CPUs and GPUs. Sorry for this "off-topic" detail, but I think it illustrates that data dependencies can dictate optimal hardware design. I recognize that these examples I gave are for FPGAs rather than ASICs, but I expect similar considerations in ASIC design.
I don't readily have an example to illustrate my other sub-point, namely that variability in instruction stream alone (without complex data dependencies on intermediate computation results from much earlier than the previous clock cycle) isn't as good at reducing potential ASIC advantage. Is use of both approaches (dependencies on older intermediate results and variable instruction stream) even better, and how much better? My current understanding is that it's potentially only slightly better, but it carries additional risks and pitfalls to avoid (redundancy, small intermediate results, other optimization potential, and lower utilization of the largest and naturally highest-latency execution units that we have).
Thanks again for all the comments.
Proof size is just a 64-bit nonce + 128-bit intermediate state (that can be used for DoS protection). This is a small change from ethash's 64-bit nonce + 256-bit intermediate. It's fairly easy to change.
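For illustration only, the proof described here could be pictured as the following struct (a hypothetical layout, not the actual serialization format):

#include <stdint.h>

struct progpow_proof {
    uint64_t nonce;             /* 64-bit nonce */
    uint32_t intermediate[4];   /* 128-bit intermediate state, usable for cheap DoS filtering */
};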
The results of the math get passed as the B argument to merge(), which either adds or XORs B into a modified version of A. Making a specialized 6-bit adder or XOR is probably more overhead than simply using a normal 32-bit ALU for everything. Merge is also set up so that even if the B input is 0 the result will be a version of A with high entropy.
That's a very true statement: the actual execution of a 32-bit logical op consumes drastically less power than reading/writing the register file. This is where specialized ASICs get their efficiency gains compared to a CPU/GPU. I like including as many ops as possible to force the execution hardware to be as general purpose as possible, but keeping or removing LOPs won't make too much of a difference either way.
Your bcrypt and sha512crypt examples are about an outer loop. Having a pipeline of 64 bcrypt cores in series or a single bcrypt core that's iterated on 64 times is going to have very similar efficiency. The gains mostly come from pipelining the inner loop. A small CPU sequencing the outer loop makes a lot of sense.

I could build an ASIC pipeline to handle a fixed sequence of math where only the inputs changed. The fixed math sequence would have a bit of control logic on the side programmed with which registers to read/write at each step. The ASIC would still need a large register file to store all the state like a GPU, but it could be much higher performance. Say the pipeline is 32 logical stages; you could have 32 different hash calculations in flight at once. If the register file is 32-way banked, they'd never conflict with each other. This pipeline would be 32x higher throughput than a GPU that has to process a single instruction at a time.
You seem to be right about the (non-)tiny results - I missed the fact that you

ret << math("data32", mix_src(), mix_src(), rnd());
ret << merge(mix_dst(), "data32", rnd());

However, the operation based on

What you say about bcrypt doesn't make sense to me. (Maybe you're unfamiliar with it? That's OK; I'm also unfamiliar with a few things you referenced previously. But these non-overlaps make it harder for us to get on the same page.) This is about the inner loop. Assuming you'd try to build a pipeline of 64 Blowfish rounds (and then build 64 bcrypt cores by iterating around that pipeline), you'd have only the 64-bit Blowfish blocks propagating along the pipeline efficiently, yet you'd have to manage 64 x 4 KiB = 256 KiB worth of S-boxes near that pipeline, with routing from each 4 KiB set to each pipeline stage, which will waste die area (on the complex addressing and routing) and introduce latency at best on par with a CPU's L2 cache's - and I don't see a way to hide that latency since each round's lookup indices are only known after computation of the previous round (previous pipeline stage) and each round's results are similarly needed by the next round (next pipeline stage); a one-round sketch follows at the end of this comment. Thus, you'd have a larger circuit that runs slower (per hash computed) than a fully-iterated circuit would run (with only the control logic shared and thus amortized across several such iterated bcrypt instances). [What does make sense, though, is interleaving two bcrypt instances' uses of the same S-box RAMs (of double capacity) to partially hide the access latencies - this is similar to what's done when cracking bcrypt on CPUs and it should also fit e.g. BlockRAM in Xilinx's 7-series FPGAs.] I think you actually try to achieve a defense similar to bcrypt's with your "cache accesses", which you currently make as numerous as your random math steps.

For sha512crypt, it's about both inner and outer loops. Neither uses variable indexing like we have for bcrypt, and the inner loop (SHA-512 compression function) is a reasonably good fit for pipelining - but the amount of state+inputs to carry along a fully-pipelined implementation is prohibitively large. As I understand, Bitcoin mining FPGA designs and ASICs do use fully-pipelined double-SHA-256 and thus do carry that state+inputs, but there must be optimizations due to similarity of inputs (only the nonce changes) and for SHA-256 the inputs are twice smaller. So we may have a borderline case with SHA-512, and with the added complexity of needing to interface a non-trivial higher-level algorithm to SHA-512 (would need to manage 80+ in-flight sha512crypt's in the specialized CPU) we may have just passed the point where fully-pipelined SHA-512 made sense.

As I understood your example, the 32-stage pipeline + 32-way register file would consume at least 32x the area of a fully-iterated design (excluding control logic, which could be amortized across several such fully-iterated instances - somewhat like we have with wide SIMD in GPUs), and will likely incur higher delays (thus, would have to run at a lower clock rate) to access registers in the larger register file. Yes, you will have 32x higher throughput per clock cycle but at >32x the area and at a lower clock rate, defeating the purpose of this design. Pipelining is great, but it has its limitations. For example, for descrypt a 400-stage pipeline (25 iterations x 16 DES rounds) can work great (and it's been done). DES and descrypt fit pipelining perfectly. Even then, going for a mixed e.g. 25 iterations of 16 stages design is only a slight slowdown (which has also been done and made sense as part of a more complex design).
As soon as there's any extra complication for pipelining, fewer stages or even a non-pipelined design could work better. MUL instructions are good in that they're already optimally pipelined on a GPU (across 4+ clock cycles). Any instruction that runs on a CPU (yes, CPU) in one clock cycle (of latency) is not a good fit since this indicates that its latency on a GPU is mostly part of the GPU design rather than inherent to the computation.
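For readers unfamiliar with bcrypt, here is a minimal C sketch of the Blowfish round structure referred to above (textbook Blowfish, not project code). The dependency is visible directly: the four S-box indices come from the value computed in the immediately preceding round, so a would-be pipeline stage cannot start its lookups until the previous stage finishes.

#include <stdint.h>

/* Blowfish F function: four S-box lookups indexed by the bytes of x. */
static uint32_t blowfish_f(const uint32_t S[4][256], uint32_t x)
{
    return ((S[0][x >> 24] + S[1][(x >> 16) & 0xff])
            ^ S[2][(x >> 8) & 0xff]) + S[3][x & 0xff];
}

/* One Blowfish round as used in bcrypt; the caller swaps the halves between
 * rounds. The lookups in blowfish_f() depend on the just-computed *L. */
static void blowfish_round(const uint32_t S[4][256], const uint32_t P[18],
                           uint32_t *L, uint32_t *R, int i)
{
    *L ^= P[i];
    *R ^= blowfish_f(S, *L);
}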
I vote For implementation, For changing algo to ProgPOW. |
I vote For implementation, For changing algo to ProgPOW. |
I vote for Implementation for Changing algo to ProgPOW |
I vote for Implementation for Changing algo to ProgPOW |
I vote for Implementation for Changing algo to ProgPOW |
I vote for Implementation for Changing algo to ProgPOW |
Gentlemen, I appreciate the "Votes" of confidence in this proposal. Keep in mind this is just the first step in a process. The funding of this and other proposals will be reviewed by the Foundation's Grant Review Committee as outlined here: https://github.com/ZcashFoundation/GrantProposals-2018Q2/blob/master/README.md and will be subject to ratification by the Zcash Foundation Board if approved for funding, after which the implementation of a change of PoW could be considered as outlined by Josh Cincinnati @acityinohio in this post:
Thanks for letting us know, but is it any problem that we continue voting here? |
@mineZcash Thanks for the clarification Shawn. And don't forget, according to zooko's interview with CoinDesk... if the Foundation tries to go through with this there will more than likely be a chain split and he will continue with the ASIC chain... at least until Sapling... or until he gets hung for treason. Shawn can clarify on that 👍
I don't see a problem with voting here, but I will leave that call to @amiller or @acityinohio . The only issue that could occur is if the conversation about specific details and solutions regarding the actual PoW is overwhelmed by users voting. Perhaps the Ballot that Andrew is working on would be able to provide more accurate accounting of Votes cast: ZcashFoundation/Elections#1
@mineZcash when will the ballot go live? will we all be able to vote? |
@mineZcash The purpose of Zcash was ASIC resistance; what do you guys think we miners want?
@cryptomined That's a good question for @amiller |
@mineZcash ok I will follow you to keep me informed about what is going on. |
@mineZcash I had to research what the trends of computing would be in general, and I came to the conclusion that ASICs at this moment are the fastest but not for long. I think that within 3 to 4 years GPUs and CPUs will compete with ASICs.
I vote For implementation too and thanks for that |
We appreciate the votes of confidence, but more peer review is needed; if you, or anyone you know, has time to review and audit the code, please feel free to reach out to the team via email: [email protected], or through Twitter, with @OhGodAGirl, or via Gitter. |
@ifdefelse The proposal doesn't specify if you are seeking funding to implement code, or to cover the expense of expert analysis (although @solardiz seems to be doing that already!). Perhaps funding isn't required and you are submitting the proposal for the foundation to review and adopt as a priority (at which point the foundation would implement). I think clarifying this would help the review committee. |
@Miner356 @mineZcash I'm happy to see informal voting/polling/protesting occurring somewhere in this repo, even though I think there are better places for it.
@ifdefelse I was also gonna ask what bitcartel said... I'm excited this proposal has generated a technical discussion! What do you have in mind as the role of a grant? Maybe during the discussion phase we can get a list of suggested review or validation tasks and see if you or someone else would want to carry them out. Is there an open source GPU implementation of ProgPow yet?
Is there a CPU-only (preferably, plain C) implementation of ProgPoW? I understand the OpenCL kernel can be run on a CPU if it doesn't use vendor-specific extensions, but I'd prefer to also have something in plain C or at worst C++, thus not needing an OpenCL runtime. This would be useful for review, experiments, as well as to serve as a reference implementation for correctness testing of the OpenCL and CUDA implementations against this reference. I understand that C(++) isn't a very good fit for dynamically adjusted code, but the same applies to CUDA, and this isn't literally a show-stopper - it only means that development tools need to be installed. If this exists, it'd help my potential contributions. If it doesn't exist yet, then maybe I should include creation of such an implementation in my proposal. (I continue to think that dynamically generated code might be overkill, and carefully tuned dynamically generated data dependencies could be sufficient to achieve similar "ASIC resistance". But I realize that current ProgPoW does use dynamically generated code.)
Posting the full Grant Proposal: |
Thank you for posting the proposal, @ifdefelse. This is a very aggressive schedule you propose, which is great given Zcash's immediate need for a PoW scheme change, and I think you'll deliver, but it raises concerns about the extent of review and testing and tweaking that may be done on ProgPoW in this limited time. Part of the problem we have with Equihash now is that its choice of parameters was also a rushed decision, with no time for adjustment even when much more efficient implementations appeared and the parameters were already seen as non-optimal shortly before launch. I'm afraid it'll be similar in that respect.

Do you already have specific candidates in mind for the "2 more people"? Are they really available to start work on the project full-time(?) on such short notice?

(I might also post a proposal today. I think the research, review, tweaking, testing needed here is quite separate from client and miner development, so I wouldn't expect the same people to do both kinds of work.)
FWIW, I contacted @eXtremal-ik7 about his possible help with miner development, and he's potentially interested, but obviously needs specifics to start any work. eXtremal previously contributed to Zcash miners: mbevand/silentarmy@b879b79 https://bitcointalk.org/index.php?topic=1660023.0 |
Here are my observations after going through the code provided in the proposal:
@T3N2aWsK, @PrometheusXDE, @solardiz, @donamelo We thank you for your feedback. We encourage folks to continue to review this proposal from a technical perspective. We hope reviewers can start with a deeper technical understanding of what is being proposed, especially from the hardware side of things.
@PrometheusXDE @T3N2aWsK's account is just 2 days old (edit: not 5) and clearly he/she misunderstood the algo (or maybe even wanted to spread FUD, as there are big economic consequences for some people (people that bought ASICs + manufacturers)). About the author (from my perspective):
So in my eyes there's no reason to distrust them. But ProgPOW is open source anyway so...

What I (and presumably every cryptocurrency dev) care about is decentralisation, and blockchain promises it. It is up to the software devs to keep that promise, and ProgPOW is the best way to achieve that for now. It makes GPUs king in town, and luckily AMD and NVIDIA have a public face to lose in case of shady business while Bitmain et al don't (or they already did? Re: Antbleed). There's never perfect decentralisation, but for now it's better to go the way of most possible alignment with decentralisation (AMD/Nvidia/Intel/Samsung) than to let ASIC manufacturers control 80% of every single algo. I promise you: they do pre-mine. They won't sell 50-day break-even miners to consumers; their investors wouldn't let them. The consumer will always be the one chasing the carrot.

So, I'm all for ProgPOW! Implement it in a testnet and see how it goes. Then mainnet.

One critique:
I don't agree, because the hacker who hacked my account is a dickhead.
@MoneroCrusher @PrometheusXDE Folks, please don't treat reviews pointing out (potential) issues as attacks on the algorithm or on its authors, and please don't treat the existence of issues (even if confirmed) as indicative of incompetence or whatever. It is perfectly normal for a reviewer (especially a volunteer) to be confused at first and find non-issues. If e.g. 1 in 10 issues is for real, that's worth it. And some of the rest might point to a need for better documentation. It is also perfectly normal for there to be issues in an algorithm as initially proposed. That's why independent review is needed. BTW, thank you for linking to that Reddit thread - I wasn't aware of it - it's very good news that someone intends to try implementing this on FPGA, and again it's no problem they misunderstood things at first.

@ifdefelse No, the evaluation order of function arguments in C is in fact unspecified, and may/will vary between compilers, etc. I haven't checked whether/where in the code you have that issue, though - I'm just commenting in general.
@ifdefelse Thanks for your feedback. The intention of my comments is neither to shoot down nor endorse this project, but to help with the analysis of its merits and making it better. The developers are obviously free to use my comments to improve their code and algorithm.
@T3N2aWsK on point 3, we stand corrected. Thanks for pointing this out. We will update the code. |
This was a useful read, as well as the comments: I am wondering if it is possible to do a ProgPoW-style algo tailored to CPUs. That seems a better thing than tuning for GPUs.
@alexcryptan |
@MoneroCrusher |
@ifdefelse |
@ifdefelse, we are finalizing decisions on the grants. Are there updates about the ProgPoW design, progress, plans or funding requirements that we should be aware of? |
Ethereum Core developers get answers to questions on ProgPow from IfDef |
Hey Eran! There are no main changes to the ProgPoW design, and progress is going well - we're working closely with the ETH team for integration. Nothing has changed on the funding, but I will be updating the EIP to simply add more information/transparency and more clarity to a lot of areas, and that will translate to the ZIP, naturally. So there's more on the education front coming. |
Hello team,

We are unable to update our official ZGP proposal at this time, but I would like to make some brief statements in order to clarify some concerns of the public.
The official ZGP proposal will be updated to match these posts accordingly. Oh, and on our identities and our backgrounds, we'll be tying our skillset more into the Medium post that is an extension of what was already posted. But please keep in mind that an algorithm should be reviewed based on its own merits - not on the identities of the people who created it. That is why Satoshi Nakamoto chose to remain anonymous. The technology needs to speak for itself.
Thanks for the update @ifdefelse! It would also be helpful to have some pointers on a few things you referenced. I searched for Bitcoin Interest and found that it's a cryptocurrency that recently hard-forked to use ProgPoW (or something similar, as you say). But I can't similarly search for AMD's and NVIDIA's reviews of ProgPoW, for lack of sufficiently focused keywords to search for. Are their reviews public?

"Equihash variant of ProgPoW" aka "ProgPoW for Equihash" sounds exciting at first, but starts to fall apart the more I think of it. I assume ProgPoW would be used for the memory filling phase (in place of the sequential writes with BLAKE2b, which Zcash's Equihash currently uses) and would be followed by the usual Equihash processing? If so, the Equihash phase would be as (moderately) ASIC-friendly as Zcash's Equihash currently is, and we'd need to ensure that either the memory filling phase contributes to a larger portion of the total processing cost or the frequent and large data transfer from a GPU-like to a more specialized Equihash ASIC would be prohibitively expensive. With this added requirement, having Equihash at the end of the processing feels superfluous - we're simply designing a new sequential memory-hard scheme (with the usual memory usage vs. verification time trade-off), and could use its computed hash value directly. Having Equihash's collision search step at the end makes sense if the collision-producing hashes can be recomputed quickly, thereby providing fast verification, but this weakens the memory filling (is inconsistent with it being sequential memory-hard). Is this correct, or do you have something different in mind?

A potentially better "Equihash variant of ProgPoW" aka "ProgPoW for Equihash" would be to enhance Equihash's collision search phase, applying it to a ProgPoW'ish function of the memory contents rather than to the actual memory contents directly. This would need to be done in a way requiring that the computation be done dynamically on every ex-memory lookup rather than in advance (which would essentially replace the memory contents, to be followed by classic Equihash, thereby defeating our attempt at requiring GPU-like hardware), and that's tricky (perhaps achievable through inclusion of frequent memory writes, so that the collision search would be over changing data, and/or through doing the collision search over a much larger amount of expanded data, where each lookup of the virtual, say, 1 TB of data would be based on a function of multiple lookups from the real, say, 3 GB of data). Do you possibly have specific ideas on it?

Overall, I think this is a research project requiring multiple iterations of (re-)design and attacks and tweaks. It's not just an engineering effort, and thus I see no way how it'd reasonably fit in the moderate budget you suggest, unless the research phase has somehow already been done?
I'm thrilled to inform you that the Grant Review Committee and the Zcash Foundation Board of Directors have approved your proposal, pending a final compliance review. Congratulations, and thank you for the excellent submission!

Next steps: Please email [email protected] from an email address that will be a suitable point of contact going forward. We plan to proceed with disbursements following a final confirmation that your grant is within the strictures of our 501(c)(3) status, and that our payment to you will comply with the relevant United States regulations. We also wish to remind you of the requirement for monthly progress updates to the Foundation’s general mailing list, as noted in the call for proposals.

Before the end of this week, the Zcash Foundation plans to publish a blog post announcing grant winners to the public at large, including a lightly edited version of the Grant Review Committee’s comments on your project. The verbatim original text of the comments can be found below. Congratulations again!

Grant Review Committee comments:
|
@ifdefelse Hi! I'd love to help provide any engineering support you might need from the engineers at Zcash Company. Would you be interested in getting a more regular line of communication set up between you and Zcash Company engineers so you can ask questions or get help? If so, would you be okay using a Rocketchat channel or do you have another preferred method of communication? |
@ifdefelse please check your email, we are awaiting a response from you |
@ifdefelse Due to several months of radio silence from the grant winners, we are closing this issue and not funding the grant. We have updated the 2018-Q2 blog entry to reflect this decision. We sincerely hope the ProgPoW team continues its research. |
Abstract
We propose an alternate proof-of-work algorithm - “ProgPoW” - tuned for commodity hardware in order to close the efficiency gap available to specialized ASICs.
The security of proof-of-work is built on a fair, randomized lottery where miners with similar resources have a similar chance of generating the next block.
For Zcash - a community based on widely distributed commodity hardware - specialized ASICs enable certain participants to gain a much greater chance of generating the next block, undermining the distributed security.
ASIC-resistance is a misunderstood problem. FPGAs, GPUs and CPUs can themselves be considered ASICs. Any algorithm that executes on a commodity ASIC can have a specialized ASIC made for it; most existing algorithms provide opportunities that reduce power usage and cost. Thus, the proper question to ask when solving ASIC-resistance is “how much more efficient will a specialized ASIC be, in comparison with commodity hardware?”
This proposal presents an algorithm that is tuned for commodity GPUs where there is minimal opportunity for ASIC specialization. This prevents specialized ASICs without resorting to a game of whack-a-mole where the network changes algorithms every few months.
Motivation
Since proof-of-work will continue to underpin the security of the Zcash network, it is important that this consensus isn’t dominated by a single party in control of a large portion of the compute power.
The existence of the Z9 ASIC miner proves that the current Equihash algorithm allows significant efficiency gains from specialized hardware. The ~150 MB of state in Equihash is large but possible on an ASIC. Furthermore, the binning, sorting, and comparing of bit strings could be implemented on an ASIC at extremely high speed.
Thus, a new hash algorithm is needed to replace the current Equihash. This algorithm must be permanently resistant to domination by specialized hardware and avoid further PoW code forks.
Design Specification
ProgPoW preserves the 256-bit nonce size for Zcash and thus does not require any changes to light clients. However, it entirely replaces the rest of the Equihash PoW algorithm.
ProgPoW is based on Dagger Hashimoto and follows the same general structure. Dagger Hashimoto was selected as a base due to its simple algorithm and memory-hard nature. Dagger Hashimoto has also withstood a significant amount of real-world “battle testing.” Finally, starting from a simple algorithm makes the changes easy to understand and thus easy to assess for strengths and weaknesses. Complexity tends to imply a larger attack surface.
The name of the algorithm comes from the fact that the inner loop between global memory accesses is a randomly generated program based on the block number. The random program is designed to both run efficiently on commodity GPUs and also cover most of the GPU's functionality. The random program sequence prevents the creation of a fixed pipeline implementation as seen in a specialized ASIC. The access size has also been tweaked to match contemporary GPUs.
ProgPoW utilizes almost all parts of a commodity GPU, excluding:
- the graphics pipeline
- floating-point math
Making use of either of these would have significant portability issues between commodity hardware vendors, and across programming languages.
Since the GPU is almost fully utilized, there’s little opportunity for specialized ASICs to gain efficiency. Removing both the graphics pipeline and floating point math could provide up to 1.2x gains in efficiency, compared to the 2x gains possible in Dagger Hashimoto, 50x gains possible for CryptoNight, and ~100x gains possible in Equihash.
The algorithm has five main changes from Dagger Hashimoto, each tuned for commodity GPUs while minimizing the possible advantage of a specialized ASIC. In contrast to Dagger Hashimoto, the changes detailed below make ProgPoW dependent on core compute capabilities in addition to memory bandwidth and size.
Changes keccak to blake2s.
Zcash already uses BLAKE2 in various locations, so it makes sense to continue using BLAKE2. GPUs are natively 32-bit architectures so blake2s is used. Both blake2b and blake2s provide the same security but are tuned for 64-bit and 32-bit architectures respectively. Blake2s runs roughly twice as fast on a 32-bit architecture as blake2b.
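For illustration, this is the BLAKE2s core mixing function G as specified in RFC 7693: every operation is a 32-bit addition, XOR, or rotation (by 16, 12, 8, and 7 bits), which is why blake2s maps naturally onto a 32-bit GPU, whereas blake2b performs the same mixing on 64-bit words.

#include <stdint.h>

#define ROTR32(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

/* BLAKE2s mixing function G: purely 32-bit adds, XORs, and rotates. */
static void blake2s_g(uint32_t v[16], int a, int b, int c, int d,
                      uint32_t x, uint32_t y)
{
    v[a] = v[a] + v[b] + x;  v[d] = ROTR32(v[d] ^ v[a], 16);
    v[c] = v[c] + v[d];      v[b] = ROTR32(v[b] ^ v[c], 12);
    v[a] = v[a] + v[b] + y;  v[d] = ROTR32(v[d] ^ v[a], 8);
    v[c] = v[c] + v[d];      v[b] = ROTR32(v[b] ^ v[c], 7);
}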
Increases mix state.
A significant part of a GPU’s area, power, and complexity is the large register file. A large mix state ensures that a specialized ASIC would need to implement similar state storage, limiting any advantage.
Adds a random sequence of math in the main loop.
The random math changes every 50 blocks to amortize compilation overhead. Having a random sequence of math that reads and writes random locations within the state ensures that the ASIC executing the algorithm is fully programmable. There is no possibility to create an ASIC with a fixed pipeline that is much faster or lower power.
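As a rough illustration of this idea (a sketch under assumptions, not the actual generator - PERIOD, NUM_REGS, NUM_OPS, and the stand-in xorshift PRNG are placeholders, and the real implementation uses KISS99 as described below): the program seed is derived from the block number once per period, so every miner and verifier regenerates the same random kernel source for those 50 blocks.

#include <stdint.h>
#include <stdio.h>

#define PERIOD   50   /* blocks per random program */
#define NUM_REGS 16   /* placeholder for the mix state size */
#define NUM_OPS  11   /* matches the math() cases quoted earlier in this thread */

/* Simple xorshift PRNG as a stand-in for KISS99. */
static uint32_t xs32(uint32_t *s)
{
    *s ^= *s << 13; *s ^= *s >> 17; *s ^= *s << 5;
    return *s;
}

/* Derive the per-period seed and emit a deterministic random sequence of
 * register-to-register operations; a real generator would emit kernel source
 * for the target GPU instead of printing. */
static void emit_program(uint64_t block_number, int steps)
{
    uint32_t prog_seed = (uint32_t)(block_number / PERIOD);
    uint32_t rng = prog_seed ? prog_seed : 1;   /* avoid the all-zero state */

    for (int i = 0; i < steps; i++) {
        uint32_t src1 = xs32(&rng) % NUM_REGS;
        uint32_t src2 = xs32(&rng) % NUM_REGS;
        uint32_t dst  = xs32(&rng) % NUM_REGS;
        uint32_t op   = xs32(&rng) % NUM_OPS;
        printf("mix[%u] = op%u(mix[%u], mix[%u]);\n",
               (unsigned)dst, (unsigned)op, (unsigned)src1, (unsigned)src2);
    }
}

int main(void)
{
    emit_program(7000000, 8);   /* any block in the same 50-block period yields the same program */
    return 0;
}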
Adds reads from a small, low-latency cache that supports random addresses.
Another significant part of a GPU’s area, power, and complexity is the memory hierarchy. Adding cached reads makes use of this hierarchy and ensures that a specialized ASIC also implements a similar hierarchy, preventing power or area savings.
Increases the DRAM read from 128 bytes to 256 bytes.
The DRAM read from the DAG is the same as Dagger Hashimoto’s, but with the size increased to 256 bytes. This better matches the workloads seen on commodity GPUs, preventing a specialized ASIC from being able to gain performance by optimizing the memory controller for abnormally small accesses.
The DAG file is generated according to traditional Dagger Hashimoto specifications, with an additional ProgPoW_SIZE_CACHE bytes generated that will be cached in the L1.
ProgPoW can be tuned using the following parameters. The proposed settings have been tuned for a range of existing, commodity GPUs:
ProgPoW uses FNV1a for merging data. The existing Dagger Hashimoto uses FNV1 for merging, but FNV1a provides better distribution properties.
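To make the distinction concrete (using the standard 32-bit FNV prime), FNV1 multiplies and then XORs, while FNV1a XORs and then multiplies, so every incoming word is immediately diffused by the multiply that follows it:

#include <stdint.h>

#define FNV_PRIME 0x01000193u   /* standard 32-bit FNV prime */

static uint32_t fnv1(uint32_t h, uint32_t d)  { return (h * FNV_PRIME) ^ d; }  /* multiply, then XOR */
static uint32_t fnv1a(uint32_t h, uint32_t d) { return (h ^ d) * FNV_PRIME; }  /* XOR, then multiply */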
ProgPoW uses KISS99 for random number generation. This is the simplest (fewest instructions) random generator that passes the TestU01 statistical test suite. A more complex random number generator like Mersenne Twister can be efficiently implemented on a specialized ASIC, providing an opportunity for efficiency gains.
Backwards Compatibility
This algorithm is not backward compatible with the existing Equihash implementation and will require a fork for adoption. Furthermore, the network hash rate will change dramatically as the units of compute are re-normalized to this new algorithm.
Testing
This PoW algorithm was implemented and tested against six different models from two different manufacturers. Selected models span two different chips and memory types from each manufacturer (Polaris20-GDDR5 and Vega10-HBM2 for AMD; GP104-GDDR5 and GP102-GDDR5X for NVIDIA). The average hashrate results are listed below. Additional tests are ongoing.
As the algorithm nearly fully utilizes GPU functions in a natural way, the results reflect relative GPU performance that is similar to other gaming and graphics applications.
Sample Code
Sample code is available here.
Schedule and Funding Request
This proposal is submitted to the foundation to review and adopt as a priority. We ask that any funding go toward outside development, review, testing, and implementation efforts as the foundation deems appropriate. We humbly suggest that this proposal be integrated into the soonest planned fork (as feasible) to minimize disruption to the network.