Add parallel reads to GetFullProof #239
Conversation
Wow cool -- this is great!
This is potentially a nice upgrade but it needs to be kept in perspective. It would only assist when you have more than one likely winning proof at the exact same signage point. Even on large farms - that is rare.
@marcoabreu Great work. I actually implemented the exact same thing and was looking to submit it. Looks like you beat me to it. However, in my patch, I did the work to switch the reads to pread(), which I think is better than opening multiple handles. I also perform the entire read in ReadLinePoint in a single shot (you can determine the size of the buffer needed at the beginning of the call). @hoffmang9 --- I think this is actually quite critical on network media with high tail latencies or frequent multi-second latency spikes. In my interesting NAS setup, where latencies average 30-50ms with 2s spikes, my changes seem to make the tests complete in 1/3 of the time - testing on a k18 plot:
versus
(I added the timing). For reference, it takes 0.84s on my MacBook NVMe SSD, and adding async actually makes things slightly slower (1.1s). For me at least, it seemed to eliminate the spikes that were resulting in harvester warning messages.
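For illustration only, here is a minimal POSIX sketch of the single-handle pread() approach described above; `ReadAt` and its error handling are my own stand-ins, not code from the fastRead branch:

```cpp
// Minimal POSIX-only sketch of the single-handle idea: one shared file
// descriptor, with each read issued by pread() at an absolute offset, so
// no seek position is shared between threads. ReadAt is a hypothetical
// helper, not the actual chiapos code.
#include <fcntl.h>
#include <unistd.h>

#include <cstdint>
#include <stdexcept>
#include <vector>

std::vector<uint8_t> ReadAt(int fd, uint64_t offset, size_t size) {
    std::vector<uint8_t> buf(size);
    size_t done = 0;
    while (done < size) {  // pread may return short reads; loop until full
        ssize_t n = pread(fd, buf.data() + done, size - done,
                          static_cast<off_t>(offset + done));
        if (n <= 0) throw std::runtime_error("pread failed");
        done += static_cast<size_t>(n);
    }
    return buf;
}
```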
Hi @no2chem, that sounds even better! I have to admit that I'm quite new to C++, hence I tried to just use some naive methods :) In my tests, the time to execute a single full proof (qualities + 64 full proof values) went from 60 seconds on average down to 3 seconds - with my storage backend. I'll have a look to see if I can implement your suggestion, thanks a lot! @hoffmang9 this is not only useful in the scenario where you have multiple proofs (I agree, that's very unlikely). Rather, it is already useful for a single proof, since you reduce the waiting time to effectively 7xReadLatency as opposed to 64xReadLatency. That's already quite useful for environments where the average read latency is not that good, but it excels when you have inconsistent read latencies like I have in my setup.
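For the record, a rough latency model consistent with those 7x/64x figures (my own sketch; L denotes the average read latency):

```latex
% Sequential: all 64 leaf reads sit on the critical path.
T_{\mathrm{seq}} \approx 64 \cdot L
% Parallel: the proof recursion is a binary tree with 2^6 = 64 leaves;
% fetching both subtrees concurrently leaves only the tree depth
% (6 levels, plus the initial line-point lookup) on the critical path.
T_{\mathrm{par}} \approx (\log_2 64 + 1) \cdot L = 7 \cdot L
```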
@marcoabreu I've made a branch here https://github.com/no2chem/chiapos/tree/fastRead - feel free to take a look.
Gene: Is it that simple? In the case of high-latency/high-bandwidth storage solutions like NAS, archive-optimized (aka wide-stripe) RAID, ZFS, etc., or in my case, what turned out to be a flaky SAS expander that I was unaware of, there seem to be plenty of pitfalls for farms to fall into unbeknownst to them. With my intermittent hardware, for example, every IO latency went up dramatically, but transporting plots via netcat/tar (high queue depth) seemed entirely unaffected. Only when human eyes casually inspected the INFO-level logs with a keen eye (prior to the >5s warning) was the fully win-blocking issue observed, triggering my investigation.

There are countless everyday issues that can occur with storage and increase latency, so any improvement that can eliminate extraneous software/hardware issues should probably be implemented. A farmer should win if they 1) have put the effort into creating the plots, 2) have the best proof online at the time, and 3) return it to the timelord before the SP expiration. The "enterprise levelness" of their storage should not dictate their chances of winning. If there's any opportunity to level the playing field for all farmers, we should probably try to implement it. It just seems like the right thing to do 🤷♂️

Aside: my average eligible proofs is 3.4, and according to Bram in your recent AMA, that is a "small farmer" 😉
This is exactly what happened to me for over a week and a half, with estimated time-to-win at 18hrs, while nothing stood out as a problem... only after not seeing any wins for so long did I find out a SAS expander was going flaky. I suspect this improvement would go very far toward helping mitigate these types of real-world latency-spiking issues. Thanks for your work on this for the benefit of everyone!
All - don't worry - we're very likely to take it once we have had a chance to consider any cross-platform or other issues. I'm a touch shocked that such little reading over the network is this bad, and wish to point out that this is partially why we point at - and will be expanding support for - remote harvesters. Do seeks on the backplane and network on the network...
I appreciate the simplicity of this patch, and agree that keeping it simple for now is a good approach.
I agree that future improvements would be using fewer file handles (technically, a single handle would be sufficient, but that would require pread() on POSIX and overlapped I/O on Windows).

My understanding is that GCC and clang implement std::async by spawning a new thread for each invocation, whereas Windows uses its thread pool. I wouldn't expect spawning a thread to be expensive compared to the disk I/O (but it could be on very fast disks).
Which operating system and compiler/standard library did you test this with?
Do you have before and after timings?
```diff
-std::vector<Bits> left = GetInputs(disk_file, xy.second, depth - 1); // y
-std::vector<Bits> right = GetInputs(disk_file, xy.first, depth - 1); // x
+auto left_fut = std::async(std::launch::async, &DiskProver::GetInputs, this, (uint64_t)xy.second, (uint8_t)(depth - 1));
+auto right_fut = std::async(std::launch::async, &DiskProver::GetInputs, this, (uint64_t)xy.first, (uint8_t)(depth - 1));
```
Is there a good reason to cast the `depth` argument to a `uint8_t`?
The compiler is very, very picky about the data types during templating. If you don't cast, it will throw an error that it cannot find a matching template, as it considers the calculation an int - it does not do any implicit conversions, and I didn't want to change the argument types.
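To make the pattern concrete, here is a minimal, self-contained illustration; `DiskProver` and the `GetInputs` signature are simplified stand-ins for the real chiapos code, and the casts simply pin the argument types to the declared parameter types:

```cpp
#include <cstdint>
#include <future>
#include <utility>
#include <vector>

// Simplified stand-in for the real class; the actual chiapos signature
// may differ.
struct DiskProver {
    std::vector<uint64_t> GetInputs(uint64_t index, uint8_t depth) {
        return {index, depth};  // placeholder body
    }
};

void Example(DiskProver* prover, std::pair<uint64_t, uint64_t> xy,
             uint8_t depth) {
    // depth - 1 is promoted to int; the explicit casts make the argument
    // types match GetInputs' declared parameters exactly.
    auto fut = std::async(std::launch::async, &DiskProver::GetInputs, prover,
                          (uint64_t)xy.second, (uint8_t)(depth - 1));
    std::vector<uint64_t> result = fut.get();
    (void)result;
}
```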
Thanks everyone for your feedback! I have tested this under Ubuntu 20 as well as macOS. I will provide the exact versions and timings when I'm back at my computer.
I measured the impact as follows.

Before: ranged from 60 up to 120 seconds. Since these are 5 full proofs, that is 12 to 24 seconds per full proof.

After: ranged from 12 up to 30 seconds. Since these are 5 full proofs, that is 2.4 to 6 seconds per full proof.

I also added a benchmark to the live system with the following results:

Additionally, I tested on my Mac with:
At the risk of adding a distraction, the "right" way to do this is to use a cross-platform async library like libuv, which has the proper cross-platform and async abstractions.
We are biased against adding dependencies, as we then inherit their security risks - but that doesn't completely rule them out.
@hoffmang9 is there anything you'd like me to change or improve in the meantime, or are we fine as long as the tests don't bring up anything?
I think this patch looks good to land.
It highlights three minor issues in the code today:

1. `GetInputs()` is a member function, even though the only member it uses is `k`. It might as well take `k` instead of `this`.
2. `GetInputs()` is not `const` qualified (suggesting it's not thread safe with regard to `this`), but it does not mutate the object state, so it should be `const`.
3. `SafeRead()` and `SafeSeek()` are both static members, despite not needing access to any private members of the class. These should be made free functions.
I'll address these issues later (unless someone beats me to it)
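For reference, a hedged sketch of what those cleanups might look like; the signatures below are illustrative, not the actual chiapos declarations:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// 1) + 2) GetInputs could take k explicitly and stop being a member, or
//         stay a member but const-qualified, e.g.:
//   std::vector<Bits> DiskProver::GetInputs(uint64_t pos, uint8_t depth) const;

// 3) SafeRead()/SafeSeek() touch no private state, so they can become
//    free functions (illustrative signatures):
void SafeSeek(std::FILE* file, uint64_t offset);
void SafeRead(std::FILE* file, uint8_t* buffer, uint64_t size);
```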
Sorry for the stupid question. Can we change the code in the config, or should we wait until an official update happens? Because I have the 30 seconds problem.
Is there an easy way to test this out with chia-blockchain? I'm able to compile it but am lacking the knowledge/documentation to figure out how to install it.
@NPutting in your python venv, run:
I have wanted to add this for a long time, to help NAS users and remote network users. It's awesome that you got it working with such small changes!
How can we simple mortals use it?
If you install the latest 1.1.6 version of chia-network then you can also clone this branch and do
Sorry to continue asking: if I'm using Windows, can I use it?
@hoffmang9 how do we want to move forward here?
This will likely get into 1.1.7.
aok
Does anyone know if it's safe to run chiapos==1.0.3 with chia-blockchain==1.1.7?
Devs are doing it on testnet according to their commits, but I personally couldn't get chiapos 1.0.3 to build on my system, so good luck!
When the harvester searches for a proof, it opens/closes the plot file many times for each eligibility case... for a cloud storage service this is very bad. Can we have a config option to open the file just once and close it at the end of the search?
This PR adds parallel reads to retrieve the 64 values required in `GetFullProof`.

Previously, 64 sequential reads were executed. In the best case (5400 rpm), this would take roughly 64 × 6 ms ≈ 384 ms just for the seeking alone. But as we know, you can expect a way higher access latency over 64 queue_depth=1 calls. Especially in cases where a NAS or similar non-local storage is used and the access latency is therefore higher, you could see read times over a minute.
This now ensures that in the rare case somebody is eligible for a reward, they are actually able to claim it, instead of the request silently timing out due to latency issues.
I'm aware that this approach might spawn quite a few threads and create a bunch of file handles - but they are quite short-lived. In the future, this could be optimized with an explicit thread and file-handle pool. For now, this should already improve things a lot.
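As a self-contained illustration of the fan-out/join shape this PR relies on (the patch itself parallelizes the recursion pairwise via std::async, as in the diff above; `ReadLeaf` here is a hypothetical stand-in for a plot lookup):

```cpp
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical stand-in for one leaf read; in chiapos this would be a
// DiskProver lookup hitting the plot file.
std::vector<uint8_t> ReadLeaf(uint64_t position);

std::vector<std::vector<uint8_t>> ReadAllLeaves(
    const std::vector<uint64_t>& positions) {
    // Fan out: launch one async task per leaf read...
    std::vector<std::future<std::vector<uint8_t>>> futures;
    futures.reserve(positions.size());
    for (uint64_t pos : positions) {
        futures.push_back(std::async(std::launch::async, ReadLeaf, pos));
    }
    // ...then join; total wall time is bounded by the slowest single read,
    // not the sum of all 64 latencies.
    std::vector<std::vector<uint8_t>> results;
    results.reserve(futures.size());
    for (auto& f : futures) results.push_back(f.get());
    return results;
}
```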
Supported by: Chia-Network/chia-blockchain#5170