-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(baseapp): fix race condition in state #11102
Conversation
as state is not threadsafe when rpc and grpc are both active race condition occurs Signed-off-by: Artur Troian <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this LTGM, did you try concurrent grpc queries along with the other PR?
Is this issue present on master? I think its present in the other pr because we added grpc routing to the client. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't replicate this data race on 0.44.5, 0.45 or master.
Here's the simple script I used (run a node in another terminal):
#!/usr/bin/env bash
set -e
# Set variables
CFG_DIR=~/.simapp
BUILD_CMD=./build/simd
ALICE=alice
CHAIN_ID=my-chain
ALICE_ADDRESS=$($BUILD_CMD keys show $ALICE -a --keyring-backend test)
for i in {1..10000}
do
echo $i
grpcurl -plaintext -d "{\"address\":\"$ALICE_ADDRESS\"}" localhost:9090 cosmos.bank.v1beta1.Query/AllBalances &
done
And with some additional logging in the app, I can see that BeginBlock and grpc queries are handled concurrently.
Let's investigate a bit more before making this R4R. @troian Are you seeing data races on v0.44.5, or only after your PR?
@AmauryM yes. we see it on 0.44.5. this PR with #10997 fixes the race issue for us. I'm not sure this runs in parallels, just sequentially 10k requests. isn't it? for i in {1..10000}
do
echo $i
grpcurl -plaintext -d "{\"address\":\"$ALICE_ADDRESS\"}" localhost:9090 cosmos.bank.v1beta1.Query/AllBalances &
done |
OK. Is there a way you can make a repro? e.g. a script or some code to make the data race happen
It's in parallel, because of the |
also, I'm not sure this test will reproduce the issue. take a look at my comment with stack trace in the #10997. the RPC and gRPC are trying to access non-thread-safe object |
OK. Could you explain/paste a script on how you reproduced the data race? What I'm trying to understand is how we didn't find this data race in previous 0.44.x testing phases. |
Note: It seems that there's maybe even another data race in v0.44.5: #11114 :( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After #11117 I got convinced there are data race issues in v0.44+, so overall I am okay to merge this PR 👍, seems Query
is indeed potentially accessing checkState the same time when BeginBlock
is modifying it.
I never personally reproduced the data race though. @troian Do you think you can provide a small test like https://github.com/cosmos/cosmos-sdk/pull/11117/files#diff-9ab8b6b1ae348e450f51d4a110e504d9aee67d848a997128a39f914a0acfa7f7R807?
// WithContext update context of the state | ||
func (st *state) WithContext(ctx sdk.Context) { | ||
defer st.lock.Unlock() | ||
st.lock.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also add a changelog entry
defer st.lock.RUnlock() | ||
st.lock.RLock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defer st.lock.RUnlock() | |
st.lock.RLock() | |
st.lock.RLock() | |
defer st.lock.RUnlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to lock the mutex before you can unlock it :)
@@ -17,5 +20,15 @@ func (st *state) CacheMultiStore() sdk.CacheMultiStore { | |||
|
|||
// Context returns the Context of the state. | |||
func (st *state) Context() sdk.Context { | |||
defer st.lock.RUnlock() | |||
st.lock.RLock() | |||
|
|||
return st.ctx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not actually safe, as the Context type has fields which have reference semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ouch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@peterbourgon it seems the Context
type needs some sort of deep copy.
thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tho more I dig into it - the more I am convinced this is a dirty hack and eventually it will blow
BeginBlock
for example
if there are two servers listening (socket
and grpc
) they may (and will) call BeginBlock
simultaneously.
correct if I'm getting it wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! I think you're right.
As far as I can see, BaseApp's exported methods — including but not limited to BeginBlock — can absolutely be called by concurrent goroutines. This means that they must ensure that anything they read or write is synchronized. But that's not happening. BaseApp —and many, many, many other components in the SDK — permit unsynchronized reads and writes on their encapsulated values, and consequently violate Go's memory model. Many of these soundness errors remain undetected or overlooked because the current, specific execution paths happen to not trigger them most of the time.
There are a lot of pathological issues in the Cosmos SDK, as well as the sdk.Context type specifically, which make this kind of bug hard to fix.
In this case, at a high level: contexts are supposed to be request-scoped, but here — and in many other places, too — the context value is long-lived. I might be missing something, but that seems to be a clear design error. Neither a state nor a BaseApp nor anything else with a lifetime beyond an individual request should maintain a context value.
More concretely: as with most types in the SDK, methods on the sdk.Context are — probably incorrectly — defined on a value receiver. That means every method call creates and operates on a (shallow) copy of the original value. This thrashes the GC, but more importantly it makes synchronization of any field with reference semantics more or less impossible. Even if you have a mutex in the type — which the context doesn't — those mutexes would get copied, and so wouldn't actually provide mutual exclusion on the values they'd be meant to protect.
And then there's all of the basically unsolvable problems created by the SDK's endemic misuse of panic
as an error handling mechanism. But that's an entirely different discussion.
I don't see how to fix the data race without addressing these problems. Probably at least a few more, too.
--
I certainly haven't done a deep-dive on this code, and so I'm not speaking from an informed place. But based on what I do understand, it seems that the right approach to fixing this problem is to eliminate the state
type altogether, including the deliverState and checkState fields in the BaseApp, in order to eliminate the long-lived context value. Then, look at whatever was writing-to and reading-from those state values to figure out what stuff they actually needed from the contexts. Capture that information specifically, in a separate and synchronized type, in the BaseApp struct.
Or, I dunno. Maybe I got it all wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @peterbourgon for looking over and confirming
I'll give it some thinking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the SDK exists today, it's not meant to or designed to be executed concurrently -- we assume Tendermint places the relevant locks in it's reactors prior to executing ABCI calls, which it does. The issue arrises here, at least as far as I can tell, from direct client gRPC queries being executed while the state machine is executing ABCI call(s) that can contain various writes, which is really outside the scope or domain of Tendermint.
So while I see the idea proposed here with Context()
, I don't think it's the correct approach, although I do appreciate the efforts @troian!
We need to take a step back and think of a different approach to allowing direct gRPC queries while the state machine is executing ABCI calls. For simplicity's sake, forget Tendermint even exists at this point. I think there are two ways we can protect reads and writes:
- Either by taking a revised approach to the use of the state context as you attempted (maybe it just needs a bit more thought), OR
- Using a RW mutex on BaseApp itself, where most ABCI calls use a write lock and we only obtain a read lock upon
Query
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Construct a standalone app instance for quering which only shares the low-level db handler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That also might be an option, but im not sure it'll be an app. Rather We might have to refactor Baseapp#Query
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexanderbez is this issue still a thing on later cosmos sdk 0.47+? or someone had opportunity to address it?
sure, i'll try to make one, cannot promise it very soon tho, quite a busy schedule :( |
Is this PR still in draft? We want to make a release today (if possible) and we need this PR merged |
Co-authored-by: Peter Bourgon <[email protected]>
@robert-zaremba, as @peterbourgon mentioned the mutex in without digging in to much, deep copy seems to be an option, tho there might be logic relying on references |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
return st.ctx | ||
} | ||
|
||
// WithContext update context of the state | ||
func (st *state) WithContext(ctx sdk.Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WithX
methods typically leave the receiver unmodified and return a copy with the requested changes. Should this be e.g. SetContext
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
@troian any hope you can continue working on this PR? |
@robert-zaremba i don't this PR will be merged as there are some serious things about thread safety revealed. |
@AmauryM could you turn this PR into an issue then we close it for the time being? |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
as the state is not threadsafe when rpc and grpc are both active
race condition occurs
Signed-off-by: Artur Troian [email protected]