Data race in callback manager / ClusterInfoImpl destruction #13209
Comments
@snowp for any more comments.
I haven't looked at this code that closely, but could we post the removals to the main thread instead of requiring locking? I think that would align better with Envoy's threading model.
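For concreteness, here is a minimal sketch of what "post the removals to the main thread" could look like; `MainThreadRemovalHandle` and `PostToMain` are illustrative names, not Envoy's actual callback API:

```cpp
#include <functional>
#include <utility>

// Hypothetical sketch (not Envoy code): instead of erasing its entry inline,
// which may run on a worker thread, the handle posts the removal to the main
// dispatcher so that all mutations of the callback list stay on one thread.
class MainThreadRemovalHandle {
public:
  using PostToMain = std::function<void(std::function<void()>)>;

  MainThreadRemovalHandle(PostToMain post_to_main, std::function<void()> do_remove)
      : post_to_main_(std::move(post_to_main)), do_remove_(std::move(do_remove)) {}

  ~MainThreadRemovalHandle() {
    // The destructor may run on any thread; defer the actual list mutation.
    post_to_main_(do_remove_);
  }

private:
  PostToMain post_to_main_;
  std::function<void()> do_remove_;
};
```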
True by convention. However,
I'm not particularly concerned about lock contention here (should I be?), but rather about introducing thread awareness to more data structures. Maybe @mattklein123 has thoughts here? As for whether async removal would work, that would likely require some code inspection to better understand.
Sorry, I'm not fully grokking the problem. Can you add more detail to this issue on what the problem is, including maybe a crashing stack trace?
The initially reported crash thread and the investigation are here: istio/istio#25909 (comment)
I read that comment and it's still not clear to me. In the interest of time, can we just reproduce here very clearly what the race is, with a call stack if possible? Thanks.
Reproducing requires a setup where an Envoy instance (an ingressgateway in the original case) has a large number of dynamic clusters, and those clusters use TLS sockets. Those TLS sockets should have the same SdsConfig as the key source. Then we constantly send traffic to this Envoy instance. Meanwhile, we constantly add & remove the dynamic clusters using some script automation. Keeping the setup above running for a couple of hours, we can reliably get a segfault with the stacktrace below. [1] stacktrace
I'm sorry to keep saying this, but I don't have the bandwidth to re-debug the above issue and then understand the follow-up conversation above. Can someone briefly summarize the actual race condition (with links to code if possible) and then explain what structural improvement you are discussing? Thank you!
My understanding is that the claim is that the TLS cluster (or actually the ClusterInfoImpl, based on the stack posted?) creates/deletes a callback via the CallbackManager for SdsApi, which is owned by the main thread, and since there is no locking this causes a data race leading to the crash. My push back was around adding locking to new data structures vs. trying to make the operations align with Envoy's general threading architecture. Looking at the trace, I'm a bit surprised that the ClusterInfoImpl would get destructed on a worker thread, and it sounds different to me than the explanation put forward in the OP:
Can we make sure that we understand whether this is triggered via TLS clusters being destructed or the primary cluster? If the ClusterInfoImpl is destructed on a worker thread, that sounds like a bigger issue.
Something like this should be possible to recreate with an integration test under tsan.
OK thanks this makes more sense.
This can definitely happen. We lock/store the ClusterInfo as a shared_ptr in a large number of contexts, and it can definitely get deleted on a worker thread. Historically, everything inside ClusterInfo is stateless and should be safe to delete in this fashion. Given ^, if we are violating this somewhere within ClusterInfoImpl we should either fix it with locking or figure out whether the thing in question actually needs to be in the ClusterInfo interface at all. It seems strange to me that any type of callback would be stored via this interface so something seems broken at a more fundamental level.
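As a generic C++ illustration (not Envoy code) of why that can happen: a shared_ptr runs its deleter on whichever thread drops the last reference, so an object "snapped" on a worker can end up being destroyed there depending on timing.

```cpp
#include <cstdio>
#include <functional>
#include <memory>
#include <thread>

struct Info {
  ~Info() {
    std::printf("destroyed on thread %zu\n",
                std::hash<std::thread::id>{}(std::this_thread::get_id()));
  }
};

int main() {
  auto info = std::make_shared<Info>();
  std::thread worker([snapped = info]() mutable {
    // The worker "snaps" the shared_ptr; if the main thread has already
    // dropped its copy, resetting here runs ~Info() on the worker thread.
    snapped.reset();
  });
  info.reset();   // main thread drops its reference; ordering is nondeterministic
  worker.join();  // whichever reset() ran last destroyed Info on that thread
}
```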
I have updated the description of the issue; it should now include all the details in one place.
That part I'm not sure about; I'm unfamiliar with the primary cluster & Envoy internals. But the ASAN output does print different thread IDs, as you can see in the description. I assume the T0 thread is the main thread?
I think this is correct.
OK, thanks for the extra diligence here. I understand the issue now. So, yes, this is broken. In general I think we need to assume that ClusterInfo can get destroyed on any thread. This is by design and is required for access logging and a bunch of other things. It's snapped in a large number of worker contexts. The fact that we have a giant chain in which the destruction of ClusterInfo can result in a context destruction, which then affects the callback manager, etc. is an unfortunate side effect. We will need to figure out how to fix this. I'm not sure that adding locking is actually the right fix here vs. something more fundamental in terms of looking at ClusterInfoImpl, looking at what we are storing in there, and whether it should be stored in some other way (or not at all, weak_ptr, etc.). I marked this issue bug/help wanted. Thank you!
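To make the weak_ptr alternative concrete, here is a rough, hypothetical sketch (`ClusterInfoSketch` and `StatefulPiece` are made-up names): the cluster, destroyed on the main thread, owns the stateful component, and ClusterInfo only keeps a weak reference that must be locked before use.

```cpp
#include <memory>

// Hypothetical sketch of the weak_ptr idea: the stateful component is owned by
// the cluster (destroyed on the main thread); ClusterInfo keeps only a weak
// reference, so a late ClusterInfo destruction on a worker never deletes it.
struct StatefulPiece {
  // e.g. something that registers callbacks with an SdsApi-like object
};

struct ClusterInfoSketch {
  std::weak_ptr<StatefulPiece> piece;

  void use() {
    if (std::shared_ptr<StatefulPiece> locked = piece.lock()) {
      // Only usable while the owning cluster is still alive.
    }
  }
};
```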
The ClusterInfo is referenced by quite a lot of filters and other, less clear places. My concern is that ClusterInfo seems to be snapped not only by the upstream
The incomplete list in
And the crash frames I have are all on top of
But the callback is not the only violation. See this one.
I'm wondering, could all of ^ actually be moved to the cluster and removed from ClusterInfo, and then accessed via some other API? I'm pretty sure that all of the things ^ are completely broken if they are accessed after a cluster is removed. Or could we start with just the thing that we know is causing issues?
Are you saying that access to ClusterInfo on worker threads should not be regular reads/writes, but just the unfortunate destruction?
What I'm saying is that I think there are things in ClusterInfo that probably should not be in there, and I'm wondering if we can move them. I would start with just the thing that is causing the crash, which I think is the socket_matcher_. I think that code is already thread safe, so yes, I'm wondering if it can be owned by the cluster and maybe accessed via ThreadLocalCluster vs. ClusterInfo?
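Roughly, the shape of that proposal might look like the sketch below; this is not Envoy's real interface, just an illustration of moving the accessor off ClusterInfo:

```cpp
// Hypothetical sketch only: the matcher is owned by the cluster and surfaced
// through the thread-local cluster, so its lifetime ends with the cluster on
// the main thread rather than with the last ClusterInfo reference dropped on
// some worker thread.
class TransportSocketMatcherSketch {};

class ThreadLocalClusterSketch {
public:
  virtual ~ThreadLocalClusterSketch() = default;
  // Callers that today go through clusterInfo() would use this instead.
  virtual TransportSocketMatcherSketch& transportSocketMatcher() = 0;
};
```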
Makes perfect sense to me. I was stuck on the assumption that these things need to be destroyed later.
Let me add a more detailed description. Conclusion: there is no destruction-order error. Both stacks are data races [1][2]. @mattklein123 There is no synchronization between ClusterImplBase and ThreadLocalCluster (ClusterEntry)/ClusterInfo. The ClusterImplBase is always destroyed on the master thread first because ClusterInfo holds the necessary shared_ptr to ClusterImplBase. @mattklein123 The solutions
Anyway, we need to enforce the new order
The owner of [1] callback handle: the removal is not guarded by a lock, since the assumption is that the context is destroyed on the master thread.
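For reference, a generic sketch (not Envoy's CallbackManager) of the "add locking" option discussed above, where every add/remove on the callback list takes the same mutex so a handle destroyed on a worker cannot race a registration on the main thread:

```cpp
#include <functional>
#include <list>
#include <mutex>

class LockedCallbackManager {
public:
  using Callback = std::function<void()>;
  using Handle = std::list<Callback>::iterator;

  Handle add(Callback cb) {
    std::lock_guard<std::mutex> lock(mutex_);
    return callbacks_.insert(callbacks_.end(), std::move(cb));
  }

  void remove(Handle handle) {
    std::lock_guard<std::mutex> lock(mutex_);
    callbacks_.erase(handle);
  }

  void runCallbacks() {
    std::lock_guard<std::mutex> lock(mutex_);
    for (const auto& cb : callbacks_) {
      cb();
    }
  }

private:
  std::mutex mutex_;
  std::list<Callback> callbacks_;
};
```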
I had thought that SocketMatcher would only be used on connection creation, but even there I agree there is a race. I think technically it's possible for a cluster to be removed on the main thread while a connection is being created on a worker (before thread local cluster removal). Given ^, it might be pretty difficult to do this sanely without either destroying cluster info on the main thread as you have tried to do, or changing the ownership model of the individual components that are actually causing the issue. So I guess the question is: do you want to focus just on SocketMatcher? Or do you want to just make ClusterInfo destroyed on the main thread? I guess I'm OK with the latter if you think that is the best approach.
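If the choice is to destroy ClusterInfo on the main thread, one generic way to express that (illustrative only; `post_to_main` stands in for posting work to the main dispatcher) is a shared_ptr with a custom deleter:

```cpp
#include <functional>
#include <memory>

// Illustrative sketch: wrap the object in a shared_ptr whose deleter posts the
// actual delete back to the main thread, so the last reference can be dropped
// on any worker without running the destructor there.
template <class T>
std::shared_ptr<T>
makeMainThreadDeleted(T* object, std::function<void(std::function<void()>)> post_to_main) {
  return std::shared_ptr<T>(object, [post_to_main](T* p) {
    post_to_main([p]() { delete p; });
  });
}
```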
I am back on #14089, which addresses the latter. So the whole picture is
I think this is a user-friendly threading model. WDYT?
Yes, I agree this is OK. Let's review in the other PR. Can you clean that up and we can go from there?
Thanks! I updated that PR. |
#14954 should have fixed this. |
Title: Data race in callback managers' usage
Description:
While investigating an Envoy segfault issue, I think the root cause is a data race from managing dynamic clusters on different threads without protection.
Basic Reproduction Steps
Reproducing requires a setup where an Envoy instance (an ingressgateway in the original case) has a large number of dynamic clusters, and those clusters use TLS sockets. Those TLS sockets should have the same SdsConfig as the key source.
Then we constantly send traffic to this Envoy instance. Meanwhile, we constantly add & remove the dynamic clusters using some script automation.
Keeping the setup above running for a couple of hours, we can reliably get a segfault with the stacktrace below.
Analysis
In one of the ASAN runs, when the segfault occurred, we saw the error report below:
The thread IDs are different. From the stacktrace [1], we see the segfault in CallbackManager::remove (link). And at the bottom of the trace, it's a ClusterEntry destructor, which would arguably be owned by a ThreadLocalClusterManagerImpl. At this point, I think it's obvious we have a race across multiple thread-local cluster managers.
This also explains why, when an ASAN build is used, the triggering rate is much lower, taking 2-3x as long to occur. This is arguably because ASAN is slower, and therefore the likelihood of triggering the race is lower.
Code Path
I spent a bit of time understanding the code path.
ClusterEntry owns a TransportSocketMatchesImpl, which owns some ClientSslSocketFactory instances, which own some ClientContextConfigImpl instances (a subclass of ContextConfigImpl). When a ContextConfigImpl is created, it adds some callbacks to the corresponding SdsApi object. The callback manager is owned by the SdsApi, which is keyed/indexed on some xDS config registries. Therefore, when different clusters use the same SDS config snippet, they add and remove callbacks against the same SdsApi's callback manager. That's why we have access across different threads, I think.
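A minimal, non-Envoy sketch of that chain (all names here are made up): many context-config objects, created and destroyed from different threads, register into the single unguarded callback list owned by one shared SdsApi-like object.

```cpp
#include <functional>
#include <list>

// Sketch of the shared object: one callback list per SDS config, with no lock,
// mirroring the reported race.
struct SdsApiSketch {
  using Callback = std::function<void()>;
  std::list<Callback> update_callbacks;

  std::list<Callback>::iterator add(Callback cb) {
    return update_callbacks.insert(update_callbacks.end(), std::move(cb));
  }
  void remove(std::list<Callback>::iterator it) { update_callbacks.erase(it); }
};

// Sketch of the per-cluster object: registers on construction, unregisters on
// destruction.
struct ContextConfigSketch {
  explicit ContextConfigSketch(SdsApiSketch& sds) : sds_(sds) {
    handle_ = sds_.add([] { /* e.g. rebuild the TLS context on a secret update */ });
  }

  // The destructor runs wherever the owning ClusterEntry/ClusterInfo chain is
  // destroyed, which is how two threads can end up erasing from the same list.
  ~ContextConfigSketch() { sds_.remove(handle_); }

  SdsApiSketch& sds_;
  std::list<SdsApiSketch::Callback>::iterator handle_;
};
```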
Notably, a recent optimization solves/mitigates the issue by removing exactly the iterator owned by the CallbackHolder: https://github.com/envoyproxy/envoy/pull/11751/files.
But it seems this can still leave some room for error if the callback manager is not a thread-safe library and is used concurrently. @lambdai thinks even the current changes are not bulletproof, because the erase from different threads is still not safe. Is such a pattern expected in Envoy in general?
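As a toy illustration of that last point (plain C++, not Envoy code): even when two threads erase different elements, both erases rewrite neighbouring node links and the list's size field, so the program below contains a data race and undefined behavior; a single run may or may not crash or be flagged by tsan.

```cpp
#include <iterator>
#include <list>
#include <thread>

int main() {
  std::list<int> callbacks{1, 2, 3, 4};
  auto first = callbacks.begin();
  auto last = std::prev(callbacks.end());

  // Both erases mutate shared bookkeeping (size, neighbouring node links), so
  // running them concurrently is a data race even though the elements differ.
  std::thread t1([&] { callbacks.erase(first); });
  std::thread t2([&] { callbacks.erase(last); });
  t1.join();
  t2.join();
}
```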
Just a discussion thread, feel free to close.
stacktrace [1]