Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run Blockchain actor on its own thread using a PinnedDispatcher. #681

Closed
wants to merge 3 commits into from

Conversation

jsolman
Copy link
Contributor

@jsolman jsolman commented Apr 4, 2019

In testing the improved ImportBlocks plugin neo-project/neo-modules#66 I encountered similar issues as I had previously described here neo-project/neo-vm#53. When I had encountered these issues in the past, it was also when using code that imported blocks using multiple Import messages. I've finally root caused the issue to be related to the Blockchain actor running on different OS threads, and I will describe the issue in more detail below.

Specifically, consider this line that gets a snapshot of the database:

using (Snapshot snapshot = GetSnapshot())

This line works reliably to get the latest snapshot, when the same OS thread previously persisted the block through the following code:

snapshot.Commit();

However, if the actor switched threads between invocations, LevelDB does not ensure that the snapshot subsequently retrieved will be the latest snapshot (it doesn't guarantee atomic access to the latest snapshot last written across threads). By forcing the Blockchain to run on the same thread using a PinnedDispatcher the consistency is preserved and the updated ImportBlocks plugin from neo-project/neo-modules#66 works without any issues. Also, this as it is now in regular operation could have a bug if two blocks are persisted in fast succession.

@jsolman jsolman added the Bug Used to tag confirmed bugs label Apr 4, 2019
@jsolman jsolman requested a review from erikzhang April 4, 2019 07:07
@erikzhang
Copy link
Member

The Snapshot doesn't need to commit in the same thread.

@jsolman
Copy link
Contributor Author

jsolman commented Apr 4, 2019

The Snapshot doesn't need to commit in the same thread.

Maybe you don’t understand me. The snapshot is not guaranteed to get the most recent version of the database if it isn’t in the same thread.

@jsolman
Copy link
Contributor Author

jsolman commented Apr 4, 2019

To be clear:

Creating a snapshot in another thread is not guaranteed to get the latest version of the database if it has recently written to the database from another thread.

@erikzhang
Copy link
Member

this.snapshot = db.GetSnapshot();

The internal implementation of DbSnapshot will create a leveldb snapshot, which is thread safe and will get the same version of the database.

@jsolman
Copy link
Contributor Author

jsolman commented Apr 4, 2019

It is thread safe but it will not be guaranteed to get the same version of the database.

@erikzhang
Copy link
Member

It will be guaranteed to get the same version of the database. That's why they design snapshot for leveldb.

@jsolman
Copy link
Contributor Author

jsolman commented Apr 4, 2019

I have run multiple tests and proved without any doubt that it absolutely does not always get the latest version of the database if the database was very recently changed on another thread. It doesn’t matter that the creation of the snapshot is after the write it is a matter of memory visibility; LevelDB is thread safe in that it won’t cause corruption in this case but it is not thread safe to guarantee that it will get the latest version of the database in this case.

@jsolman
Copy link
Contributor Author

jsolman commented Apr 4, 2019

That's why they design snapshot for leveldb.

I’ve tried two different versions of LevelDB, and in practice, both of them do not work as you are saying.

@jsolman
Copy link
Contributor Author

jsolman commented Apr 4, 2019

Snapshot in LevelDB is meant to give you a read-only version of the database. It isn’t guaranteed to give you the latest version if asking on another thread. It will give you a consistent version from the perspective of batch writes, but not necessarily the latest.

@erikzhang
Copy link
Member

Can you put the test code on the conversation?

@jsolman
Copy link
Contributor Author

jsolman commented Apr 4, 2019

You can easily test it by running code that imports 1 block at a time from a chain.acc file. You will see that it will fail after a while due to the snapshot getting a previous view of the database.

@erikzhang
Copy link
Member

Do you mean the Blockchain actor doesn't process only one message once?

@jsolman
Copy link
Contributor Author

jsolman commented Apr 4, 2019

Do you mean the Blockchain actor doesn't process only one message once?

No I will give an example expressing the problem:

Persist block 5

  • Get snapshot of block 4 (normal)
  • apply changes to balances
  • commit

(Thread switch)

Persist block 6

  • erroneously get snapshot of block 4 (should get 5)
  • apply changes to balances and commit (now potentially broken)

@erikzhang
Copy link
Member

erroneously get snapshot of block 4 (should get 5)

Getting snapshot is always happened after last commit, why it is possible to get block 4 snapshot? Unless commit is not complete but start getting next snapshot.

@jsolman
Copy link
Contributor Author

jsolman commented Apr 4, 2019

Getting snapshot is always happened after last commit, why it is possible to get block 4 snapshot? Unless commit is not complete but start getting next snapshot.

My suspicion is related to memory visibility across threads and what is happening internally in LevelDB. It also may still be necessary for synchronous writes to ensure the snapshot is complete in addition to this change, to fully prevent the problem. That was the PR I previously closed #680 . However, to be safe, I think we may need both.

@jsolman
Copy link
Contributor Author

jsolman commented Apr 4, 2019

In conclusion it needs synchronous writes from #680 , but that alone isn't enough it also needs the Blockchain actor pinned to the same thread. With both now incorporated in this PR it should work. I'll leave the MainNet syncing again and report back later.

@jsolman
Copy link
Contributor Author

jsolman commented Apr 5, 2019

LevelDB should be using memory barriers and CAS primitives to be able to ensure the latest data will be read from other threads, so that these changes shouldn’t be necessary. We probably need to deep dive into LevelDB to see what is happening. Also maybe these issues are specific to the OS on which I’m testing.

@jsolman
Copy link
Contributor Author

jsolman commented Apr 5, 2019

Even with these changes I'm still having syncing problems. I'm going to close this until I fully root cause the issue and am able to sync the chain without issues.

@jsolman jsolman closed this Apr 5, 2019
@shargon shargon deleted the BlockchainUsingPinnedDispatcher branch April 5, 2019 07:46
@jsolman
Copy link
Contributor Author

jsolman commented Apr 7, 2019

So it seems my issues were actually caused by a bug in LevelDB. I no longer seem to encounter the issues when using LevelDB with the following changes applied:
google/leveldb#339

Thacryba pushed a commit to simplitech/neo that referenced this pull request Feb 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Used to tag confirmed bugs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants