[Mining] IncrementExtraNonce without cs_main #921

Zannick · 2021-04-04T21:50:22Z

Problem

While mining SHA256D, the wallet periodically gets stuck, failing to stay caught up with the chain. (At the same time, the mining significantly slows down.)

Root Cause

Lock contention; SHA256D is very fast, completing its loop in subsecond time on a modern cpu, meaning every thread goes through lock-heavy sections of the block template creation code multiple times per second. In comparison, RandomX on the same machine takes minutes to complete its loop, usually being interrupted by the creation of a new block instead of reaching the maximal number of attempts.

Solution (partial)

Reduce lock contention on cs_main. The nonce setup prior to running the hashes includes grabbing cs_main to perform IncrementExtraNonce. It takes this lock to safely get the chain tip and then to safely compare the current thread's prevHeaderHash against a static value, resetting the nonce to 0 and updating the static value if it doesn't match.

However, it only uses the chain tip for its height, which we can get without a lock.

Second, the nonce being reset to 0 is thread-local; other threads are not affected by the first thread to reach this part in a new block. Also, the nonce base is constantly increasing in a separate section, so there does not really appear to be a requirement that nonces start from 0 on a new block; the lock on nNonce_base already assures us that we get a unique number for each thread. As such, I believe we can safely remove the nonce reset here. (I think we can also remove the extra increment, but I've left it in for now.)

Cleanup: Also removes one extraneous copy of the nonce into a thread local variable
(there were previously two, and I suspect the extra was already being optimized away).

Bounty PR

#862

Bounty Payment Address

sv1qqpswvjy7s9yrpcmrt3fu0kd8rutrdlq675ntyjxjzn09f965z9dutqpqgg85esvg8mhmyka5kq5vae0qnuw4428vs9d2gu4nz643jv5a72wkqqq73mnxr

Unit Testing Results

test_veil.exe appears to crash on my system, so manually ran all the tests via a script and compared to master. No differences.

Removes the `cs_main` lock usage from the nonce setup in the miners. The chain tip is only used for getting the Height of the tip, so instead we can just retrieve the height directly. This requires a change to the signature of `IncrementExtraNonce`. The other item cs_main was protecting is the static `prevHeaderHash`; this is only used to reset `nExtraNonce`. However, `nExtraNonce` is thread-local, so the thread nonces will increment indefinitely, with the occasional 0 thrown in. This is therefore unnecessary--and rarely may result in a race: - Thread 1: creates block template on block N-1 - Thread 2: accepts new block N - Thread 3: creates block template on block N, resets local nonce - Thread 1: resets local nonce - Thread 4: creates block template on block N, resets local nonce Thread 1 is doing useless work on the old block, this happens. Threads 3 and 4 are doing the same work since they have the same nonce. (If I understand right, and the nonces are used here mainly to give the threads distinct works sets, I believe this means actually incrementing `nExtraNonce` is unnecessary, since we take `cs_nonce` to increment `nNonce_base` in order to give each thread a unique nonce. We would need to rename `IncrementExtraNonce` to describe what it's actually doing then.) Removes one extraneous copy of the nonce into a thread local variable (there were previously two, and I suspect the extra was already being optimized away).

CaveSpectre11

With the explanation it does make sense, and I don't see anything structurally wrong with it. I'm going to utACK a3168e7

But I definitely want @codeofalltrades to give it a run for it's money, and it also needs some QA time as well.

seanPhill · 2021-04-13T14:22:50Z

I am running #921 on mainnet and just checked it (after a couple of days of running) and found it lagging by several minutes. All the connected peers were showing sync further behind. I had been running 5 out of 6 cores on SHA, while using the computer for Twitter and Discord chat. I stopped mining, and immediately it caught up several blocks and then the peers tab showed the peers were caught up. Evidently the peers' sync data shown was old.

seanPhill · 2021-04-13T14:30:33Z

Checking the testnet wallet running 2 out of 4 cores mining and staking at the same time (on an i7 iMac from late-2013) I found that all four peers show one less block synced than headers, which are also one less than my latest block, and I have six out of the last staked blocks ~~orphaned~~ unconfirmed. 28 and 29 confirmations since my last mined [testnet] block. Currently synced. ... Since I looked a couple of minutes ago it looks like it has reorged and my orphans are now with confirmations.
It also had taken a minute before the peers tab had managed to show synced headers and blocks.

I might try turning off staking and continuing. (Maybe #915 would be working better? Or this one is just not enough on its own?)

Zannick · 2021-04-13T14:34:00Z

That (the wallet falling behind while mining SHA) sounds like a symptom of #862, which this doesn't fully address. Potentially "several minutes behind" is an improvement over my experience of being full-on stuck. Interesting to note that it might be the peers' sync data that is getting stuck on the other end, I may look into that separately.

So I think that's not caused by this PR, but if the same thing happens with RandomX mining, please do let me know.

seanPhill · 2021-04-13T14:40:25Z

Okay. I'll switch to RandomX on mainnet and testnet. I've never done very well on these computers with RandomX, but with SHA difficulty so much higher than before (with so many new in-wallet miners) I didn't get any mainnet SHA in these two days anyway. (My iMac and Mac Pro are both from late-2013. i7 and XEON E5 respectively.)

Fan really blowing on the 4-core iMac with RandomX. Fortunately I've upgraded it to 16GB RAM and an SSD (as it was way slow before). Ambient temperature is pretty high here (in the tropics).

seanPhill · 2021-04-13T19:02:03Z

I just accidentally quit the mainnet wallet while it was mining RandomX and (on two restarts) I got (this both times)

$ ./veil-qt
Assertion failed: (!setBlockIndexCandidates.empty()), function PruneBlockIndexCandidates, file validation.cpp, line 3467.
Abort trap: 6

It's been a while since I've had that. I thought something else had the foreground when I quit. (I'll check where that problem belongs. Thought it had been fixed.) (Note: snapshot, and reindex-chainstate did not fix this. Wallet.dat was corrupted.)
The repeated error after the chainstate reindex attempt was

$ ./veil-qt -reindex-chainstate
Assertion failed: (false), function forced_return, file /home/runner/work/veil/veil/source/depends/x86_64-apple-darwin14/share/../include/boost/variant/detail/forced_return.hpp, line 47.

The related past issues were #405 and #371. 405 had been closed, presumed fixed in release v1.0.3.0, and those instances had been fixed with -reindex-chainstate. The 371 version of this problem had happened on Linux, had failed to be fixed with a snapshot, and -reindex-chainstate had also failed. The only solution had been to abandon the wallet.dat.

Zannick · 2021-04-13T23:23:02Z

Interesting. I don't see a way this might be related to the changes here, but I'll take a look at the issue regardless.

codeofalltrades

ACK a3168e7

codeofalltrades · 2021-04-14T01:49:30Z

Running for 4 days
Machine 1: 1 core mining on a 2 core machine
Machine 2: 2 cores mining on a 6 core machine; Change today to 4 of 6 cores mining.

codeofalltrades · 2021-04-17T04:04:22Z

Running for 7 days
Machine 1: 1 core mining on a 2 core machine
Machine 2: 4 of 6 cores mining to much, got stuck. 3 of 6 was fine.
QA Complete for me.

Zannick · 2021-04-17T06:22:07Z

4 would probably work with RandomX (which is affected by this change even though the main intent behind it is an improvement to SHA).

seanPhill · 2021-04-17T16:17:40Z

Complete and fine for me. My testnet RandomX ran fine with 4 threads on a four core (old) i7.
SHA was fine on my usual 4 of 6 cores, but I didn't run it as long.
Had a power outage this morning. Busily mining RandomX on testnet was fine. Mainnet wallet was fine after two restarts, but had not been mining at the time of the power outage, as it was just resyncing from being off for a while.

Zannick mentioned this pull request Apr 5, 2021

[Mining][SHA256D] Reduce lock contention with SHA256D #915

Merged

CaveSpectre11 requested review from CaveSpectre11, codeofalltrades and WetOne April 5, 2021 22:38

CaveSpectre11 assigned Zannick Apr 5, 2021

CaveSpectre11 added Component: Core App Related to the application itself. Component: Miner Both PoW and PoS block creation Tag: Waiting For Code Review Waiting for code review from a core developer labels Apr 5, 2021

CaveSpectre11 approved these changes Apr 5, 2021

View reviewed changes

CaveSpectre11 added the Tag: Waiting For QA A pull review is waiting for QA to test the pull request label Apr 5, 2021

CaveSpectre11 requested a review from seanPhill April 11, 2021 19:40

seanPhill mentioned this pull request Apr 13, 2021

Assertion failed. Wallet.dat corrupted. Quit mainnet macOS wallet while busy RandomX. #930

Closed

codeofalltrades approved these changes Apr 14, 2021

View reviewed changes

codeofalltrades added Code Review: Passed and removed Tag: Waiting For Code Review Waiting for code review from a core developer labels Apr 14, 2021

codeofalltrades added QA: Passed This has passed QA testing and can be merged to master and removed Tag: Waiting For QA A pull review is waiting for QA to test the pull request labels Apr 18, 2021

codeofalltrades merged commit 1e2b060 into Veil-Project:master Apr 18, 2021

Zannick deleted the nonce branch May 18, 2021 03:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Mining] IncrementExtraNonce without cs_main #921

[Mining] IncrementExtraNonce without cs_main #921

Zannick commented Apr 4, 2021

CaveSpectre11 left a comment

seanPhill commented Apr 13, 2021

seanPhill commented Apr 13, 2021 •

edited

Loading

Zannick commented Apr 13, 2021

seanPhill commented Apr 13, 2021 •

edited

Loading

seanPhill commented Apr 13, 2021 •

edited

Loading

Zannick commented Apr 13, 2021

codeofalltrades left a comment

codeofalltrades commented Apr 14, 2021

codeofalltrades commented Apr 17, 2021

Zannick commented Apr 17, 2021

seanPhill commented Apr 17, 2021

[Mining] IncrementExtraNonce without cs_main #921

[Mining] IncrementExtraNonce without cs_main #921

Conversation

Zannick commented Apr 4, 2021

Problem

Root Cause

Solution (partial)

Bounty PR

Bounty Payment Address

Unit Testing Results

CaveSpectre11 left a comment

Choose a reason for hiding this comment

seanPhill commented Apr 13, 2021

seanPhill commented Apr 13, 2021 • edited Loading

Zannick commented Apr 13, 2021

seanPhill commented Apr 13, 2021 • edited Loading

seanPhill commented Apr 13, 2021 • edited Loading

Zannick commented Apr 13, 2021

codeofalltrades left a comment

Choose a reason for hiding this comment

codeofalltrades commented Apr 14, 2021

codeofalltrades commented Apr 17, 2021

Zannick commented Apr 17, 2021

seanPhill commented Apr 17, 2021

seanPhill commented Apr 13, 2021 •

edited

Loading

seanPhill commented Apr 13, 2021 •

edited

Loading

seanPhill commented Apr 13, 2021 •

edited

Loading