Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Mining] IncrementExtraNonce without cs_main #921

Merged
merged 1 commit into from
Apr 18, 2021

Conversation

Zannick
Copy link
Collaborator

@Zannick Zannick commented Apr 4, 2021

Problem

While mining SHA256D, the wallet periodically gets stuck, failing to stay caught up with the chain. (At the same time, the mining significantly slows down.)

Root Cause

Lock contention; SHA256D is very fast, completing its loop in subsecond time on a modern cpu, meaning every thread goes through lock-heavy sections of the block template creation code multiple times per second. In comparison, RandomX on the same machine takes minutes to complete its loop, usually being interrupted by the creation of a new block instead of reaching the maximal number of attempts.

Solution (partial)

Reduce lock contention on cs_main. The nonce setup prior to running the hashes includes grabbing cs_main to perform IncrementExtraNonce. It takes this lock to safely get the chain tip and then to safely compare the current thread's prevHeaderHash against a static value, resetting the nonce to 0 and updating the static value if it doesn't match.

However, it only uses the chain tip for its height, which we can get without a lock.

Second, the nonce being reset to 0 is thread-local; other threads are not affected by the first thread to reach this part in a new block. Also, the nonce base is constantly increasing in a separate section, so there does not really appear to be a requirement that nonces start from 0 on a new block; the lock on nNonce_base already assures us that we get a unique number for each thread. As such, I believe we can safely remove the nonce reset here. (I think we can also remove the extra increment, but I've left it in for now.)

Cleanup: Also removes one extraneous copy of the nonce into a thread local variable
(there were previously two, and I suspect the extra was already being optimized away).

Bounty PR

#862

Bounty Payment Address

sv1qqpswvjy7s9yrpcmrt3fu0kd8rutrdlq675ntyjxjzn09f965z9dutqpqgg85esvg8mhmyka5kq5vae0qnuw4428vs9d2gu4nz643jv5a72wkqqq73mnxr

Unit Testing Results

test_veil.exe appears to crash on my system, so manually ran all the tests via a script and compared to master. No differences.

Removes the `cs_main` lock usage from the nonce setup
in the miners.

The chain tip is only used for getting the Height of the tip,
so instead we can just retrieve the height directly. This requires
a change to the signature of `IncrementExtraNonce`.

The other item cs_main was protecting is the static `prevHeaderHash`;
this is only used to reset `nExtraNonce`. However, `nExtraNonce` is
thread-local, so the thread nonces will increment indefinitely, with the
occasional 0 thrown in. This is therefore unnecessary--and rarely may
result in a race:
- Thread 1: creates block template on block N-1
- Thread 2: accepts new block N
- Thread 3: creates block template on block N, resets local nonce
- Thread 1: resets local nonce
- Thread 4: creates block template on block N, resets local nonce
Thread 1 is doing useless work on the old block, this happens.
Threads 3 and 4 are doing the same work since they have the same nonce.

(If I understand right, and the nonces are used here mainly to give
the threads distinct works sets, I believe this means actually
incrementing `nExtraNonce` is unnecessary, since we take `cs_nonce` to
increment `nNonce_base` in order to give each thread a unique nonce.
We would need to rename `IncrementExtraNonce` to describe
what it's actually doing then.)

Removes one extraneous copy of the nonce into a thread local variable
(there were previously two, and I suspect the extra was already being
optimized away).
@CaveSpectre11 CaveSpectre11 added Component: Core App Related to the application itself. Component: Miner Both PoW and PoS block creation Tag: Waiting For Code Review Waiting for code review from a core developer labels Apr 5, 2021
Copy link
Collaborator

@CaveSpectre11 CaveSpectre11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the explanation it does make sense, and I don't see anything structurally wrong with it. I'm going to utACK a3168e7

But I definitely want @codeofalltrades to give it a run for it's money, and it also needs some QA time as well.

@CaveSpectre11 CaveSpectre11 added the Tag: Waiting For QA A pull review is waiting for QA to test the pull request label Apr 5, 2021
@seanPhill
Copy link
Collaborator

I am running #921 on mainnet and just checked it (after a couple of days of running) and found it lagging by several minutes. All the connected peers were showing sync further behind. I had been running 5 out of 6 cores on SHA, while using the computer for Twitter and Discord chat. I stopped mining, and immediately it caught up several blocks and then the peers tab showed the peers were caught up. Evidently the peers' sync data shown was old.

@seanPhill
Copy link
Collaborator

seanPhill commented Apr 13, 2021

Checking the testnet wallet running 2 out of 4 cores mining and staking at the same time (on an i7 iMac from late-2013) I found that all four peers show one less block synced than headers, which are also one less than my latest block, and I have six out of the last staked blocks orphaned unconfirmed. 28 and 29 confirmations since my last mined [testnet] block. Currently synced. ... Since I looked a couple of minutes ago it looks like it has reorged and my orphans are now with confirmations.
It also had taken a minute before the peers tab had managed to show synced headers and blocks.

I might try turning off staking and continuing. (Maybe #915 would be working better? Or this one is just not enough on its own?)

@Zannick
Copy link
Collaborator Author

Zannick commented Apr 13, 2021

That (the wallet falling behind while mining SHA) sounds like a symptom of #862, which this doesn't fully address. Potentially "several minutes behind" is an improvement over my experience of being full-on stuck. Interesting to note that it might be the peers' sync data that is getting stuck on the other end, I may look into that separately.

So I think that's not caused by this PR, but if the same thing happens with RandomX mining, please do let me know.

@seanPhill
Copy link
Collaborator

seanPhill commented Apr 13, 2021

Okay. I'll switch to RandomX on mainnet and testnet. I've never done very well on these computers with RandomX, but with SHA difficulty so much higher than before (with so many new in-wallet miners) I didn't get any mainnet SHA in these two days anyway. (My iMac and Mac Pro are both from late-2013. i7 and XEON E5 respectively.)

Fan really blowing on the 4-core iMac with RandomX. Fortunately I've upgraded it to 16GB RAM and an SSD (as it was way slow before). Ambient temperature is pretty high here (in the tropics).

@seanPhill
Copy link
Collaborator

seanPhill commented Apr 13, 2021

I just accidentally quit the mainnet wallet while it was mining RandomX and (on two restarts) I got (this both times)

$ ./veil-qt
Assertion failed: (!setBlockIndexCandidates.empty()), function PruneBlockIndexCandidates, file validation.cpp, line 3467.
Abort trap: 6

It's been a while since I've had that. I thought something else had the foreground when I quit. (I'll check where that problem belongs. Thought it had been fixed.) (Note: snapshot, and reindex-chainstate did not fix this. Wallet.dat was corrupted.)
The repeated error after the chainstate reindex attempt was

$ ./veil-qt -reindex-chainstate
Assertion failed: (false), function forced_return, file /home/runner/work/veil/veil/source/depends/x86_64-apple-darwin14/share/../include/boost/variant/detail/forced_return.hpp, line 47.

The related past issues were #405 and #371. 405 had been closed, presumed fixed in release v1.0.3.0, and those instances had been fixed with -reindex-chainstate. The 371 version of this problem had happened on Linux, had failed to be fixed with a snapshot, and -reindex-chainstate had also failed. The only solution had been to abandon the wallet.dat.

@Zannick
Copy link
Collaborator Author

Zannick commented Apr 13, 2021

Interesting. I don't see a way this might be related to the changes here, but I'll take a look at the issue regardless.

Copy link
Collaborator

@codeofalltrades codeofalltrades left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK a3168e7

@codeofalltrades
Copy link
Collaborator

Running for 4 days
Machine 1: 1 core mining on a 2 core machine
Machine 2: 2 cores mining on a 6 core machine; Change today to 4 of 6 cores mining.

@codeofalltrades codeofalltrades added Code Review: Passed and removed Tag: Waiting For Code Review Waiting for code review from a core developer labels Apr 14, 2021
@codeofalltrades
Copy link
Collaborator

Running for 7 days
Machine 1: 1 core mining on a 2 core machine
Machine 2: 4 of 6 cores mining to much, got stuck. 3 of 6 was fine.
QA Complete for me.

@Zannick
Copy link
Collaborator Author

Zannick commented Apr 17, 2021

4 would probably work with RandomX (which is affected by this change even though the main intent behind it is an improvement to SHA).

@seanPhill
Copy link
Collaborator

Complete and fine for me. My testnet RandomX ran fine with 4 threads on a four core (old) i7.
SHA was fine on my usual 4 of 6 cores, but I didn't run it as long.
Had a power outage this morning. Busily mining RandomX on testnet was fine. Mainnet wallet was fine after two restarts, but had not been mining at the time of the power outage, as it was just resyncing from being off for a while.

@codeofalltrades codeofalltrades added QA: Passed This has passed QA testing and can be merged to master and removed Tag: Waiting For QA A pull review is waiting for QA to test the pull request labels Apr 18, 2021
@codeofalltrades codeofalltrades merged commit 1e2b060 into Veil-Project:master Apr 18, 2021
@Zannick Zannick deleted the nonce branch May 18, 2021 03:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Code Review: Passed Component: Core App Related to the application itself. Component: Miner Both PoW and PoS block creation QA: Passed This has passed QA testing and can be merged to master
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants