bug: concurrent commit and RPC Simulate call causes a data race in IAVL #1356

Closed
p0mvn opened this issue Apr 26, 2022 · 26 comments
Labels
T:bug 🐛 Something isn't working

Comments

@p0mvn
Member

p0mvn commented Apr 26, 2022

Background

A fatal error was observed on v7.1.0. Initial investigation suggests it is caused by a commit running concurrently with a call to the Simulate RPC.

This happened at least 3 times over the past 3 days

Logs: concurrent_rw_panic.log

Acceptance Criteria

  • investigate, synchronize the conflicting access, and fix the problem
  • create an e2e test that exercises the fix by running a chain and calling the Simulate RPC several times
@p0mvn p0mvn added the T:bug 🐛 Something isn't working label Apr 26, 2022
@p0mvn
Member Author

p0mvn commented Apr 26, 2022

SDK tag used on v7.1.0: v0.45.1-0.20220327212402-9f71a993afdc
IAVL tag: v0.17.3-osmo-v7

@p0mvn
Member Author

p0mvn commented Apr 26, 2022

PR #137 might be related. With that change present, IAVL must be safe for concurrent use

@p0mvn
Member Author

p0mvn commented Apr 26, 2022

More logs from a different occurrence, stemming from the same source:

The original report is from April 26

@p0mvn
Member Author

p0mvn commented Apr 26, 2022

This node is used for wallets

@vorotech

hi @p0mvn
Do you think the following is related?

If I'm "lucky" enough and hit the endpoint https://lcd-osmosis.blockapsis.com/cosmos/tx/v1beta1/txs?events=tx.height%3D3815464&pagination.offset=0 several times, I'm getting:

{
  "code": 8,
  "message": "grpc: received message larger than max (4589341 vs. 4194304)",
  "details": [
  ]
}

@vorotech

vorotech commented Apr 26, 2022

Tbh, I think it might be an unrelated issue; the gRPC client should have the grpcClient.MaxCallSendMsgSize() method called here (it was also missed in the original implementation)

@p0mvn
Member Author

p0mvn commented Apr 26, 2022

Tbh, I think it might be an unrelated issue; the gRPC client should have the grpcClient.MaxCallSendMsgSize() method called here (it was also missed in the original implementation)

@vorotech You're right. We have already updated this on the latest branch of the SDK fork. However, this has not been backported to v7.x branch yet.

We are planning a minor (v7.2.1) release soon and should backport osmosis-labs/cosmos-sdk#170 to it

CC: @czarcas7ic

@catShaark
Contributor

catShaark commented Apr 27, 2022

So I looked at the log; this is where the error showed up, right? Apr 26 18:43:17 ip-172-31-49-142 osmosisd_osmosis-1[333746]: fatal error: concurrent map iteration and map write

@p0mvn
Member Author

p0mvn commented Apr 27, 2022

So I looked at the log; this is where the error showed up, right? Apr 26 18:43:17 ip-172-31-49-142 osmosisd_osmosis-1[333746]: fatal error: concurrent map iteration and map write

That's correct. It's happening because 2 goroutines are trying to iterate over this map and write to it concurrently.

One goroutine accesses that IAVL map through Commit() in baseapp, and the other through this RPC.
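
For illustration, here is a minimal standalone Go program (not the actual IAVL code) that reproduces the same class of failure: one goroutine iterating over a map while another writes to it, which the Go runtime reports as "fatal error: concurrent map iteration and map write".

package main

import "time"

func main() {
    // Stand-in for the IAVL map that both flows touch.
    nodes := map[string]bool{"a": true}

    // "Commit" flow: repeatedly iterates over the map.
    go func() {
        for {
            for k := range nodes {
                _ = k
            }
        }
    }()

    // "Simulate" flow: concurrently writes to the same map.
    go func() {
        for i := 0; ; i++ {
            nodes[string(rune('a'+i%26))] = true
        }
    }()

    // Typically crashes within a second with the fatal error above.
    time.Sleep(time.Second)
}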

@p0mvn p0mvn self-assigned this Apr 27, 2022
@alexanderbez
Contributor

alexanderbez commented Apr 27, 2022

Doesn't Tendermint place a lock on that ABCI flow though? Or is this lower level, @p0mvn? What can we do in the SDK to synchronize these calls?

@p0mvn
Member Author

p0mvn commented Apr 27, 2022

@alexanderbez Essentially, osmosis-labs/cosmos-sdk#137 removed the synchronization with the ABCI flow. The asynchrony is needed by integrators for handling load; otherwise, their queries are blocked during epoch / long commits

I know exactly where the problem is (IAVL - mutable_tree.go) and am working on a fix as well as an e2e test to expose this in our suite. Should have an update in the next few days

@p0mvn
Member Author

p0mvn commented Apr 27, 2022

SDK seems to be reading from a height in IAVL that hasn't been committed yet. We don't have any mutexes there because the assumption was that this should never happen. So the solution is to avoid reading from uncommitted state and instead read from the latest committed height.
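
As a rough sketch of that direction (assuming the upstream IAVL v0.17 API; method names and signatures may differ in the osmosis fork), the query path could take an immutable snapshot of the last committed version instead of touching the mutable working tree:

package query

import "github.com/cosmos/iavl"

// queryLatestCommitted reads a key from the last committed version of the
// tree rather than from the uncommitted working state.
func queryLatestCommitted(tree *iavl.MutableTree, key []byte) ([]byte, error) {
    version := tree.Version() // last saved (committed) version

    // GetImmutable returns a read-only snapshot of that version, which can be
    // queried without touching the working state being committed next.
    snapshot, err := tree.GetImmutable(version)
    if err != nil {
        return nil, err
    }
    _, value := snapshot.Get(key)
    return value, nil
}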

@ValarDragon
Member

Is the situation that queries don't have a height field included, and then the query responds with the currently processing height, whereas it really should be responding using the last committed height?

@catShaark
Contributor

catShaark commented Apr 28, 2022

Is the situation that queries don't have a height field included, and then the query responds with the currently processing height, whereas it really should be responding using the last committed height?

Yeah, the query uses the sdk.Context returned from getContextForTx(runTxModeSimulate, txBytes), which is the sdk.Context of the current state, not the latest committed height. Maybe we should change the function so that it uses createQueryContext to return the sdk.Context of the latest height in the runTxModeSimulate case, like this?
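
Something along these lines (just a sketch; getContextForSimulate is a made-up helper name and the real baseapp internals differ):

// Sketch only: for simulate, derive the context from the latest committed
// height via createQueryContext instead of the in-flight state.
func (app *BaseApp) getContextForSimulate(txBytes []byte) (sdk.Context, error) {
    // Passing height 0 lets createQueryContext fall back to the latest
    // committed height.
    ctx, err := app.createQueryContext(0, false)
    if err != nil {
        return sdk.Context{}, err
    }
    return ctx.WithTxBytes(txBytes), nil
}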

@alexanderbez
Contributor

I guess simulate should use the latest committed state.

@p0mvn
Member Author

p0mvn commented Apr 28, 2022

Is the situation that queries don't have a height field included, and then the query responds with the currently processing height, whereas it really should be responding using the last committed height?

Yeah, the query uses the sdk.Context returned from getContextForTx(runTxModeSimulate, txBytes), which is the sdk.Context of the current state, not the latest committed height. Maybe we should change the function so that it uses createQueryContext to return the sdk.Context of the latest height in the runTxModeSimulate case, like this?

Thanks for the suggestion @catShaark. While I think this may help mitigate the problem, I don't think it would remove the root cause.

To illustrate, I would like to point out a change that I did in osmosis-labs/cosmos-sdk#100 when upgrading to fast storage:
[screenshot of the diff omitted]

By being able to iterate over the mutable_tree, the SDK can access uncommitted state when a query/simulate tx happens to arrive before we commit the latest height. When working on the PR above, I wasn't aware of all of the access patterns and concurrency flows, so I might have caused this bug.

I think the following should be done:

  1. Revert the Iterator change (and the similar ReverseIterator change) that allows iterating over the mutable tree with uncommitted state.

  2. In addition, we might want to synchronize the following 2 resources with read/write locks:
    https://github.com/osmosis-labs/iavl/blob/a324be8dc8bd24756adf2acb72e09e9d0c5ec5da/mutable_tree.go#L34-L35

    • Given that we do 1), this might not be as important because there is no other flow I know of that may access these resources from multiple goroutines
    • If we end up doing that, the change should be heavily benchmarked.

So, to sum up, I think we can go ahead with the change suggested in 1) and hold off on 2) until later. Please let me know if there is anything I'm missing or if anyone has more ideas. (A rough sketch of what 2) could look like is below.)
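
For reference, here is a rough standalone sketch of what 2) could look like; the type and field names below are illustrative stand-ins for the two maps linked above, not the real IAVL identifiers:

package iavlsketch

import "sync"

type fastNode struct{} // stand-in for the real fast node type

type mutableTree struct {
    mtx sync.RWMutex

    // Stand-ins for the two unsaved fast-node maps on the mutable tree.
    unsavedAdditions map[string]*fastNode
    unsavedRemovals  map[string]bool
}

// The commit / write path takes the write lock before mutating the maps.
func (t *mutableTree) addUnsaved(key string, n *fastNode) {
    t.mtx.Lock()
    defer t.mtx.Unlock()
    t.unsavedAdditions[key] = n
    delete(t.unsavedRemovals, key)
}

// The query / iterator path holds the read lock for the whole iteration.
func (t *mutableTree) rangeUnsaved(fn func(key string, n *fastNode) bool) {
    t.mtx.RLock()
    defer t.mtx.RUnlock()
    for k, n := range t.unsavedAdditions {
        if !fn(k, n) {
            return
        }
    }
}

As noted above, this would need to be benchmarked heavily before adopting it, since the locks would sit on a hot path.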

@p0mvn
Member Author

p0mvn commented Apr 28, 2022

Is the situation that queries don't have a height field included, and then the query responds with the currently processing height, whereas it really should be responding using the last committed height?

Yes, that's the high-level description of this problem 👍

@p0mvn
Member Author

p0mvn commented Apr 28, 2022

Also, I'm still working on e2e tests to expose this. If anyone is interested in trying to PR a potential solution, I'm happy to discuss and guide

@alexanderbez
Contributor

Yes, @p0mvn your summary sounds like the correct approach. I should've caught that in review :(

@ValarDragon
Member

(1) makes sense as an immediate fix, though this does generally feel like the SDK is asking for something unsafe, and that's where the bug is. (What is the API usage it's doing that's leading to an iterator over a mutable tree?)

I think (2) would only become important once the SDK starts writing to that tree in parallel. But that is not how the store architecture currently works; instead, all writes to the IAVL store are done in sequence. (And all intermediate writes are buffered within the CacheKVStore.)

@catShaark
Contributor

catShaark commented Apr 29, 2022

Also, I'm still working on e2e tests to expose this. If anyone is interested in trying to PR a potential solution, I'm happy to discuss and guide

I want to help but I will very much need your guidance

@p0mvn
Member Author

p0mvn commented May 1, 2022

Hi everyone, thanks for all the help and input on this so far.

this does generally feel like the SDK is asking for something unsafe, and that's where the bug is

I kept thinking about what @ValarDragon said, and I think that was right. The SDK was incorrectly requesting data. It ended up being more complex than just doing the change I suggested earlier in 1).

I arrived at this conclusion by first struggling with reproducing this bug via e2e tests. Then, @alexanderbez pointed out offline that this issue would be easier to reproduce by approaching it from the SDK.

I proceeded by doing so and discovered that #137 introduced several issues. Due to how some integration tests were set up and the lack of others, these issues were never caught. Once I reconfigured the setup to function as expected, I was able to observe many data races and other problems related to requesting specific heights in queries.

I made this PR with the fix and more unit and integration tests. You can find a more detailed summary in the description.

Please let me know if that change is acceptable, and I can get a tag out for testing

@ValarDragon
Member

ValarDragon commented May 3, 2022

Ensuring I understand that PR correctly:

  • The bug before was that the ABCI parallelization allowed someone to query a future height, which could cause a crash (on top of returning an incorrect result)
  • The behavior is changed so that a query to a future height will return an error. The intended behavior for anyone who runs into that is to retry sending (see the sketch below)
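
Conceptually something like this, if I understand correctly (illustrative only, not the exact SDK code; checkQueryHeight is a made-up name):

package main

import "fmt"

// checkQueryHeight rejects queries for heights that have not been committed
// yet instead of letting them read uncommitted state.
func checkQueryHeight(requested, latestCommitted int64) error {
    if requested > latestCommitted {
        return fmt.Errorf(
            "cannot query height %d: latest committed height is %d; retry once the height is committed",
            requested, latestCommitted)
    }
    return nil
}

func main() {
    // A query for height 101 while only height 100 is committed should error.
    if err := checkQueryHeight(101, 100); err != nil {
        fmt.Println(err)
    }
}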

If so, that seems good to roll into a new minor release. Thanks for working through that! I was suspicious of the gRPC desynchronization handling these things correctly when the PR was first made; I wish I had investigated it more at the time.

@p0mvn
Member Author

p0mvn commented May 3, 2022

The behavior is changed so that a query to a future height will return an error. The intended behavior for anyone who runs into that is to retry sending

I would also add that some queries, like simulation and Tendermint queries, are changed to always be directed to the ABCI flow to avoid going through gRPC.

In the case of Tendermint queries, those have issues with proto marshaling/unmarshaling when done via gRPC.

In the case of simulation, @alexanderbez and I discussed it and thought that these are risky to run concurrently with the commit flow because they mimic regular transaction logic. They have the potential for more unexpected concurrency problems. If throughput is reported to be problematic, we can experiment and change that in the future.

I have a PR with the fix to v7.x in #1390. Planning to do a v7.2.2 release tomorrow

@p0mvn p0mvn closed this as completed May 4, 2022
@p0mvn
Member Author

p0mvn commented May 16, 2022

This issue was still reported on v7.3.0

@p0mvn p0mvn reopened this May 16, 2022
@p0mvn
Member Author

p0mvn commented Jul 19, 2022

Haven't heard of this being a problem recently. Closing for now

@p0mvn p0mvn closed this as completed Jul 19, 2022