Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data race in callback manager / ClusterInfoImpl destruction #13209

Closed
incfly opened this issue Sep 21, 2020 · 27 comments
Closed

Data race in callback manager / ClusterInfoImpl destruction #13209

incfly opened this issue Sep 21, 2020 · 27 comments

Comments

@incfly
Copy link
Contributor

incfly commented Sep 21, 2020

Title: Data race in callback managers' usage

Description:

When investigating an Envoy segfault issue, I think, the root cause is a data race for managing dynamic clusters from different threads without protection.

Basic Reproducing Steps

Reproducing requires to have a setup, where an envoy instance(ingressgateway in original case) has large number of dynamic clusters, and those clusters are using TLS socket. Those TLS socket should have the same SdsConfig as key source.

Then we constantly send the traffic to this envoy instance. Meanwhile, we constantly add & remove the dynamic clusters using some script automation.

Keep running setup as above for couple of hours, we can reliably get segfault with stacktrace as below.

Analysis

In one of the ASAN's run, when segfault, we see error report as below:

ERROR: AddressSanitizer: heap-use-after-free on address 0x6060033b07e0 at pc 0x5615ffe5b711 bp 0x7fbbb09c6480 sp 0x7fbbb09c6478
READ of size 8 at 0x6060033b07e0 thread T10
    #0 0x5615ffe5b710  (/usr/local/bin/envoy+0x54993710)
    #1 0x5615ffe5a8ae  (/usr/local/bin/envoy+0x549928ae)

0x6060033b07e0 is located 0 bytes inside of 64-byte region [0x6060033b07e0,0x6060033b0820)
freed by thread T9 here:
    #0 0x5615f4bbfccd  (/usr/local/bin/envoy+0x496f7ccd)
    #1 0x5615ffe55b44  (/usr/local/bin/envoy+0x5498db44)

previously allocated by thread T0 here:
    #0 0x5615f4bbff4d  (/usr/local/bin/envoy+0x496f7f4d)
    #1 0x7fbbba7a5297  (/usr/lib/x86_64-linux-gnu/libstdc++.so.6+0x93297)
    #2 0x5615ffe58c65  (/usr/local/bin/envoy+0x54990c65)

Thread T10 created by T0 here:
    #0 0x5615f4baa6da  (/usr/local/bin/envoy+0x496e26da)
    #1 0x5616031b6462  (/usr/local/bin/envoy+0x57cee462)

The thread ID is different. From the stacktrace [1], we see the segfault in CallbackManager::remove, link. And at the bottom of the trace, it's a ClusterEntry destructor, which would argubably owned by a ThreadLocalClusterManagerImpl.

At this point, I think it's obivous we have a race across multiple thread local cluster manager

  1. One cluster manager's call chains is trying to remove the call backs, doing iteration using std::list::remove_if, and erasing.
  2. Another cluster manager is also doing same call chains, but removing another element in the linked list.
  3. When one thread is removing element X, and another one is happen to read that one's iterator, and trying to do a dereference, in a remove_if's predicate. Segfault happens.

This also explains why when ASAN build is used, the triggering rate is much lower. taking 2 - 3X times to occur. This argubaly is because ASAN is slower therefore the the likelihood of triggering race is lower.

Code Path

I spent a bit time understanding the code path.

ClusterEntry owns a TransportSocketMatchesImpl, which owns some ClientSslSocketFactory, which owns owns some ClientContextConfigImpl, subclassed of ContextConfigImpl.

The ContextConfigImpl when created, is adding some callbacks to the corresponding SdsApi object. And the Callback_manager is owned by SdsApi, which is key/indexed on some xDS config registries. Therefore, when different clusters are using the same Sds config snippet, they would add and remove callbacks against the same SdsApi's callback manager. That's why we have access across different threads, I think.

Misc

Notably, some recent optimization solve/mitigates the issue, by removing the iterator precisely owned by callbackHolder, https://github.com/envoyproxy/envoy/pull/11751/files.

But I know seems this can still leave some potential error space if callback manager is not a thread safe lib, and used concurrently. @lambdai think even current changes is not bullet proof, because the erase from different threads is still not safe.

Is such pattern expected or not in Envoy in general?

Just a discussion thread, feel free to close.

stacktrace[1]

2020-09-03T21:51:33.093882Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x9b
2020-09-03T21:51:33.093924Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
2020-09-03T21:51:33.093936Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 80ad06b26b3f97606143871e16268eb036ca7dcd/1.14.3-dev/Clean/DEBUG/BoringSSL
2020-09-03T21:51:33.165802Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #0: Envoy::SignalAction::sigHandler() [0x56508bf1199c]
2020-09-03T21:51:33.166219Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #1: __restore_rt [0x7fac99fca890]
2020-09-03T21:51:33.240003Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #2: std::__1::list<>::remove_if<>() [0x56508b378c0f]
2020-09-03T21:51:33.320640Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #3: Envoy::Common::CallbackManager<>::remove() [0x56508b378893]
2020-09-03T21:51:33.390044Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #4: Envoy::Common::CallbackManager<>::CallbackHolder::remove() [0x56508b3787ac]
2020-09-03T21:51:33.467216Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #5: Envoy::Extensions::TransportSockets::Tls::ContextConfigImpl::~ContextConfigImpl() [0x56508892a2da]
2020-09-03T21:51:33.552001Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #6: Envoy::Extensions::TransportSockets::Tls::ClientContextConfigImpl::~ClientContextConfigImpl() [0x5650889405ce]
2020-09-03T21:51:33.625260Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #7: Envoy::Extensions::TransportSockets::Tls::ClientContextConfigImpl::~ClientContextConfigImpl() [0x565088937283]
2020-09-03T21:51:33.709967Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #8: Envoy::Extensions::TransportSockets::Tls::ClientContextConfigImpl::~ClientContextConfigImpl() [0x5650889372bc]
2020-09-03T21:51:33.773332Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #9: std::__1::default_delete<>::operator()() [0x5650889024ef]
2020-09-03T21:51:33.838373Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #10: std::__1::unique_ptr<>::reset() [0x56508890246f]
2020-09-03T21:51:33.902990Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #11: std::__1::unique_ptr<>::~unique_ptr() [0x5650889021d9]
2020-09-03T21:51:33.973127Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #12: Envoy::Extensions::TransportSockets::Tls::ClientSslSocketFactory::~ClientSslSocketFactory() [0x56508891d427]
2020-09-03T21:51:34.035886Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #13: Envoy::Extensions::TransportSockets::Tls::ClientSslSocketFactory::~ClientSslSocketFactory() [0x56508891d46c]
2020-09-03T21:51:34.098260Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #14: std::__1::default_delete<>::operator()() [0x56508afb5a6f]
2020-09-03T21:51:34.161593Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #15: std::__1::unique_ptr<>::reset() [0x56508afb59ef]
2020-09-03T21:51:34.233574Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #16: std::__1::unique_ptr<>::~unique_ptr() [0x56508afb2689]
2020-09-03T21:51:34.302754Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #17: Envoy::Upstream::TransportSocketMatcherImpl::FactoryMatch::~FactoryMatch() [0x56508b7c1788]
2020-09-03T21:51:34.367747Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #18: std::__1::allocator<>::destroy() [0x56508b7c2029]
2020-09-03T21:51:34.431605Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #19: std::__1::allocator_traits<>::__destroy<>() [0x56508b7c1ffd]
2020-09-03T21:51:34.495537Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #20: std::__1::allocator_traits<>::destroy<>() [0x56508b7c1fcd]
2020-09-03T21:51:34.570006Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #21: std::__1::__vector_base<>::__destruct_at_end() [0x56508b7c1f7b]
2020-09-03T21:51:34.633817Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #22: std::__1::__vector_base<>::clear() [0x56508b7c1eab]
2020-09-03T21:51:34.695359Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #23: std::__1::__vector_base<>::~__vector_base() [0x56508b7c1d27]
2020-09-03T21:51:34.758774Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #24: std::__1::vector<>::~vector() [0x56508b7c17ed]
2020-09-03T21:51:34.823329Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #25: Envoy::Upstream::TransportSocketMatcherImpl::~TransportSocketMatcherImpl() [0x56508b7c1965]
2020-09-03T21:51:34.885652Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #26: Envoy::Upstream::TransportSocketMatcherImpl::~TransportSocketMatcherImpl() [0x56508b7c19ac]
2020-09-03T21:51:34.947873Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #27: std::__1::default_delete<>::operator()() [0x56508b72d55f]
2020-09-03T21:51:35.013069Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #28: std::__1::unique_ptr<>::reset() [0x56508b72d51f]
2020-09-03T21:51:35.080359Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #29: std::__1::unique_ptr<>::~unique_ptr() [0x56508b70fcf9]
2020-09-03T21:51:35.142802Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #30: Envoy::Upstream::ClusterInfoImpl::~ClusterInfoImpl() [0x56508b71270f]
2020-09-03T21:51:35.206567Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #31: Envoy::Upstream::ClusterInfoImpl::~ClusterInfoImpl() [0x56508b71275c]
2020-09-03T21:51:35.272197Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #32: std::__1::default_delete<>::operator()() [0x56508b73055f]
2020-09-03T21:51:35.337501Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #33: std::__1::__shared_ptr_pointer<>::__on_zero_shared() [0x56508b730999]
2020-09-03T21:51:35.399892Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #34: std::__1::__shared_weak_count::__release_shared() [0x56508ce228de]
2020-09-03T21:51:35.465251Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #35: Envoy::Upstream::HostDescriptionImpl::~HostDescriptionImpl() [0x565088ab163a]
2020-09-03T21:51:35.531791Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #36: Envoy::Upstream::HostImpl::~HostImpl() [0x565088ab06ec]
2020-09-03T21:51:35.593671Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #37: Envoy::Upstream::HostImpl::~HostImpl() [0x565088acc6b3]
2020-09-03T21:51:35.655156Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #38: Envoy::Upstream::HostImpl::~HostImpl() [0x565088acc6ec]
2020-09-03T21:51:35.733583Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #39: std::__1::default_delete<>::operator()() [0x565089a0325f]
2020-09-03T21:51:35.822359Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #40: std::__1::__shared_ptr_pointer<>::__on_zero_shared() [0x565089a03029]
2020-09-03T21:51:35.822412Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #41: std::__1::__shared_weak_count::__release_shared() [0x56508ce228de]
2020-09-03T21:51:35.910752Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #42: std::__1::allocator<>::destroy() [0x565088a921c9]
2020-09-03T21:51:36.000379Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #43: std::__1::allocator_traits<>::__destroy<>() [0x565088a9219d]
2020-09-03T21:51:36.090423Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #44: std::__1::allocator_traits<>::destroy<>() [0x565088a9216d]
2020-09-03T21:51:36.170932Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #45: std::__1::__vector_base<>::__destruct_at_end() [0x565088a9211b]
2020-09-03T21:51:36.235720Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #46: std::__1::__vector_base<>::clear() [0x565088a9204b]
2020-09-03T21:51:36.302353Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #47: std::__1::__vector_base<>::~__vector_base() [0x565088a91ec7]
2020-09-03T21:51:36.368921Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #48: std::__1::vector<>::~vector() [0x565088a8afad]
2020-09-03T21:51:36.435696Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #49: std::__1::default_delete<>::operator()() [0x565088aaafab]
2020-09-03T21:51:36.500174Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #50: std::__1::__shared_ptr_pointer<>::__on_zero_shared() [0x565088ab40b9]
2020-09-03T21:51:36.500223Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #51: std::__1::__shared_weak_count::__release_shared() [0x56508ce228de]
2020-09-03T21:51:36.562835Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #52: Envoy::Upstream::HostSetImpl::~HostSetImpl() [0x565089a0f8df]
2020-09-03T21:51:36.625790Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #53: Envoy::Upstream::HostSetImpl::~HostSetImpl() [0x56508b7124ac]
2020-09-03T21:51:36.689788Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #54: std::__1::default_delete<>::operator()() [0x565088a8f07f]
2020-09-03T21:51:36.752537Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #55: std::__1::unique_ptr<>::reset() [0x565088a8efff]
2020-09-03T21:51:36.815776Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #56: std::__1::unique_ptr<>::~unique_ptr() [0x565088a8ef99]
2020-09-03T21:51:36.878107Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #57: std::__1::allocator<>::destroy() [0x565088a8ef79]
2020-09-03T21:51:36.940770Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #58: std::__1::allocator_traits<>::__destroy<>() [0x565088a8ef4d]
2020-09-03T21:51:37.004230Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #59: std::__1::allocator_traits<>::destroy<>() [0x565088a8ef1d]
2020-09-03T21:51:37.067947Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #60: std::__1::__vector_base<>::__destruct_at_end() [0x565088a8eecb]
2020-09-03T21:51:37.132594Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #61: std::__1::__vector_base<>::clear() [0x565088a8ee2b]
2020-09-03T21:51:37.196798Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #62: std::__1::__vector_base<>::~__vector_base() [0x565088a8eca7]
2020-09-03T21:51:37.259910Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #63: std::__1::vector<>::~vector() [0x565088a8eb5d]
2020-09-03T21:51:37.349748Z	info	sds	resource:ROOTCA connection is terminated: rpc error: code = Canceled desc = context canceled
2020-09-03T21:51:37.349792Z	error	sds	Remote side closed connection
@incfly incfly added the triage Issue requires triage label Sep 21, 2020
@yanavlasov
Copy link
Contributor

@snowp for any more comments.

@yanavlasov yanavlasov removed the triage Issue requires triage label Sep 22, 2020
@snowp
Copy link
Contributor

snowp commented Sep 22, 2020

I haven't looked at this code that closely, but could we post the removals to the main thread instead of requiring locking? I think that would align better with Envoy's threading model

@lambdai
Copy link
Contributor

lambdai commented Sep 22, 2020

I haven't looked at this code that closely, but could we post the removals to the main thread instead of requiring locking? I think that would align better with Envoy's threading model

True from convention. However, post implicitly locks. Unless there is a long racing period which may block the worker threads acquiring mutex of the list, the per-list-mutex should more efficient.
Another question of migrate to post is: is the list-removal code designed for async execution?


void DispatcherImpl::post(std::function<void()> callback) {
  bool do_post;
  {
    Thread::LockGuard lock(post_lock_);
    do_post = post_callbacks_.empty();
    post_callbacks_.push_back(callback);
  }

@snowp
Copy link
Contributor

snowp commented Sep 22, 2020

I'm not particularly concerned about lock contention here (should I be?), but rather about introducing thread awareness to more data structures. Maybe @mattklein123 has thoughts here?

As for whether async removal would work that would likely require some code inspection to better understand.

@mattklein123
Copy link
Member

Sorry I'm not fully groking the problem. Can you add more detail to this issue on what the issue is including maybe a crashing stack trace?

@lambdai
Copy link
Contributor

lambdai commented Sep 22, 2020

The initial reported crash thread and the investigation here: istio/istio#25909 (comment)

@mattklein123
Copy link
Member

I read that comment and it's still not clear to me. In the interest of time can we just reproduce here very clearly what the race is with call stack if possible? Thanks.

@incfly
Copy link
Contributor Author

incfly commented Sep 22, 2020

Reproducing requires to have a setup, where an envoy instance(ingressgateway in original case) has large number of dynamic clusters, and those clusters are using TLS socket. Those TLS socket should have the same SdsConfig as key source.

Then we constantly send the traffic to this envoy instance. Meanwhile, we constantly add & remove the dynamic clusters using some script automation.

Keep running setup as above for couple of hours, we can reliably get segfault with stacktrace as below.
Applying the fix I mentioned in the original post (callback manager optimization) will reliably remove the segfault.

[1] stacktrace

2020-09-03T21:51:33.093882Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x9b
2020-09-03T21:51:33.093924Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
2020-09-03T21:51:33.093936Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 80ad06b26b3f97606143871e16268eb036ca7dcd/1.14.3-dev/Clean/DEBUG/BoringSSL
2020-09-03T21:51:33.165802Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #0: Envoy::SignalAction::sigHandler() [0x56508bf1199c]
2020-09-03T21:51:33.166219Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #1: __restore_rt [0x7fac99fca890]
2020-09-03T21:51:33.240003Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #2: std::__1::list<>::remove_if<>() [0x56508b378c0f]
2020-09-03T21:51:33.320640Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #3: Envoy::Common::CallbackManager<>::remove() [0x56508b378893]
2020-09-03T21:51:33.390044Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #4: Envoy::Common::CallbackManager<>::CallbackHolder::remove() [0x56508b3787ac]
2020-09-03T21:51:33.467216Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #5: Envoy::Extensions::TransportSockets::Tls::ContextConfigImpl::~ContextConfigImpl() [0x56508892a2da]
2020-09-03T21:51:33.552001Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #6: Envoy::Extensions::TransportSockets::Tls::ClientContextConfigImpl::~ClientContextConfigImpl() [0x5650889405ce]
2020-09-03T21:51:33.625260Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #7: Envoy::Extensions::TransportSockets::Tls::ClientContextConfigImpl::~ClientContextConfigImpl() [0x565088937283]
2020-09-03T21:51:33.709967Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #8: Envoy::Extensions::TransportSockets::Tls::ClientContextConfigImpl::~ClientContextConfigImpl() [0x5650889372bc]
2020-09-03T21:51:33.773332Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #9: std::__1::default_delete<>::operator()() [0x5650889024ef]
2020-09-03T21:51:33.838373Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #10: std::__1::unique_ptr<>::reset() [0x56508890246f]
2020-09-03T21:51:33.902990Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #11: std::__1::unique_ptr<>::~unique_ptr() [0x5650889021d9]
2020-09-03T21:51:33.973127Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #12: Envoy::Extensions::TransportSockets::Tls::ClientSslSocketFactory::~ClientSslSocketFactory() [0x56508891d427]
2020-09-03T21:51:34.035886Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #13: Envoy::Extensions::TransportSockets::Tls::ClientSslSocketFactory::~ClientSslSocketFactory() [0x56508891d46c]
2020-09-03T21:51:34.098260Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #14: std::__1::default_delete<>::operator()() [0x56508afb5a6f]
2020-09-03T21:51:34.161593Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #15: std::__1::unique_ptr<>::reset() [0x56508afb59ef]
2020-09-03T21:51:34.233574Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #16: std::__1::unique_ptr<>::~unique_ptr() [0x56508afb2689]
2020-09-03T21:51:34.302754Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #17: Envoy::Upstream::TransportSocketMatcherImpl::FactoryMatch::~FactoryMatch() [0x56508b7c1788]
2020-09-03T21:51:34.367747Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #18: std::__1::allocator<>::destroy() [0x56508b7c2029]
2020-09-03T21:51:34.431605Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #19: std::__1::allocator_traits<>::__destroy<>() [0x56508b7c1ffd]
2020-09-03T21:51:34.495537Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #20: std::__1::allocator_traits<>::destroy<>() [0x56508b7c1fcd]
2020-09-03T21:51:34.570006Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #21: std::__1::__vector_base<>::__destruct_at_end() [0x56508b7c1f7b]
2020-09-03T21:51:34.633817Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #22: std::__1::__vector_base<>::clear() [0x56508b7c1eab]
2020-09-03T21:51:34.695359Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #23: std::__1::__vector_base<>::~__vector_base() [0x56508b7c1d27]
2020-09-03T21:51:34.758774Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #24: std::__1::vector<>::~vector() [0x56508b7c17ed]
2020-09-03T21:51:34.823329Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #25: Envoy::Upstream::TransportSocketMatcherImpl::~TransportSocketMatcherImpl() [0x56508b7c1965]
2020-09-03T21:51:34.885652Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #26: Envoy::Upstream::TransportSocketMatcherImpl::~TransportSocketMatcherImpl() [0x56508b7c19ac]
2020-09-03T21:51:34.947873Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #27: std::__1::default_delete<>::operator()() [0x56508b72d55f]
2020-09-03T21:51:35.013069Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #28: std::__1::unique_ptr<>::reset() [0x56508b72d51f]
2020-09-03T21:51:35.080359Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #29: std::__1::unique_ptr<>::~unique_ptr() [0x56508b70fcf9]
2020-09-03T21:51:35.142802Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #30: Envoy::Upstream::ClusterInfoImpl::~ClusterInfoImpl() [0x56508b71270f]
2020-09-03T21:51:35.206567Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #31: Envoy::Upstream::ClusterInfoImpl::~ClusterInfoImpl() [0x56508b71275c]
2020-09-03T21:51:35.272197Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #32: std::__1::default_delete<>::operator()() [0x56508b73055f]
2020-09-03T21:51:35.337501Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #33: std::__1::__shared_ptr_pointer<>::__on_zero_shared() [0x56508b730999]
2020-09-03T21:51:35.399892Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #34: std::__1::__shared_weak_count::__release_shared() [0x56508ce228de]
2020-09-03T21:51:35.465251Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #35: Envoy::Upstream::HostDescriptionImpl::~HostDescriptionImpl() [0x565088ab163a]
2020-09-03T21:51:35.531791Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #36: Envoy::Upstream::HostImpl::~HostImpl() [0x565088ab06ec]
2020-09-03T21:51:35.593671Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #37: Envoy::Upstream::HostImpl::~HostImpl() [0x565088acc6b3]
2020-09-03T21:51:35.655156Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #38: Envoy::Upstream::HostImpl::~HostImpl() [0x565088acc6ec]
2020-09-03T21:51:35.733583Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #39: std::__1::default_delete<>::operator()() [0x565089a0325f]
2020-09-03T21:51:35.822359Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #40: std::__1::__shared_ptr_pointer<>::__on_zero_shared() [0x565089a03029]
2020-09-03T21:51:35.822412Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #41: std::__1::__shared_weak_count::__release_shared() [0x56508ce228de]
2020-09-03T21:51:35.910752Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #42: std::__1::allocator<>::destroy() [0x565088a921c9]
2020-09-03T21:51:36.000379Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #43: std::__1::allocator_traits<>::__destroy<>() [0x565088a9219d]
2020-09-03T21:51:36.090423Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #44: std::__1::allocator_traits<>::destroy<>() [0x565088a9216d]
2020-09-03T21:51:36.170932Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #45: std::__1::__vector_base<>::__destruct_at_end() [0x565088a9211b]
2020-09-03T21:51:36.235720Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #46: std::__1::__vector_base<>::clear() [0x565088a9204b]
2020-09-03T21:51:36.302353Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #47: std::__1::__vector_base<>::~__vector_base() [0x565088a91ec7]
2020-09-03T21:51:36.368921Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #48: std::__1::vector<>::~vector() [0x565088a8afad]
2020-09-03T21:51:36.435696Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #49: std::__1::default_delete<>::operator()() [0x565088aaafab]
2020-09-03T21:51:36.500174Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #50: std::__1::__shared_ptr_pointer<>::__on_zero_shared() [0x565088ab40b9]
2020-09-03T21:51:36.500223Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #51: std::__1::__shared_weak_count::__release_shared() [0x56508ce228de]
2020-09-03T21:51:36.562835Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #52: Envoy::Upstream::HostSetImpl::~HostSetImpl() [0x565089a0f8df]
2020-09-03T21:51:36.625790Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #53: Envoy::Upstream::HostSetImpl::~HostSetImpl() [0x56508b7124ac]
2020-09-03T21:51:36.689788Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #54: std::__1::default_delete<>::operator()() [0x565088a8f07f]
2020-09-03T21:51:36.752537Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #55: std::__1::unique_ptr<>::reset() [0x565088a8efff]
2020-09-03T21:51:36.815776Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #56: std::__1::unique_ptr<>::~unique_ptr() [0x565088a8ef99]
2020-09-03T21:51:36.878107Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #57: std::__1::allocator<>::destroy() [0x565088a8ef79]
2020-09-03T21:51:36.940770Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #58: std::__1::allocator_traits<>::__destroy<>() [0x565088a8ef4d]
2020-09-03T21:51:37.004230Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #59: std::__1::allocator_traits<>::destroy<>() [0x565088a8ef1d]
2020-09-03T21:51:37.067947Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #60: std::__1::__vector_base<>::__destruct_at_end() [0x565088a8eecb]
2020-09-03T21:51:37.132594Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #61: std::__1::__vector_base<>::clear() [0x565088a8ee2b]
2020-09-03T21:51:37.196798Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #62: std::__1::__vector_base<>::~__vector_base() [0x565088a8eca7]
2020-09-03T21:51:37.259910Z	critical	envoy backtrace	[bazel-out/k8-fastbuild/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #63: std::__1::vector<>::~vector() [0x565088a8eb5d]
2020-09-03T21:51:37.349748Z	info	sds	resource:ROOTCA connection is terminated: rpc error: code = Canceled desc = context canceled
2020-09-03T21:51:37.349792Z	error	sds	Remote side closed connection

@mattklein123
Copy link
Member

I'm sorry to keep saying this, but I don't have the bandwidth to re-debug the above issue, and then understand the follow up conversation above. Can someone briefly summarize the actual race condition (with links to code if possible) and then explain what structural improvement your are discussing? Thank you!

@snowp
Copy link
Contributor

snowp commented Sep 22, 2020

My understanding is that the claim is that is that the TLS cluster (or actually the ClusterInfoImpl based on the stack posted?) creates/deletes a callback via the CallbackManager for SdsApi which is owned by the main thread, and since there is no locking this causes a data race leading to the crash.

My push back was around adding locking to new data structures vs trying to make the operations align with Envoy's general threading architecture.

Looking at the trace I'm a bit surprised that the ClusterInfoImpl would get destructed on a worker thread, and it sounds different to me than the explanation put forward in the OP:

In the example case above, ThreadLocalClusterManagerImpl calling the destructor of the ClusterEntry -> transportsocketmatches-> context_config_impl's destructor, and at that time, the callbacks are removed.

Can we make sure that we understand whether this is triggered via TLS clusters being destructed or the the primary cluster? If the ClusterInfoImpl is destructed on a worker thread that sounds like a bigger issue

@yanavlasov
Copy link
Contributor

Something like this should be possible to recreate with an integration test under tsan.

@mattklein123
Copy link
Member

OK thanks this makes more sense.

Can we make sure that we understand whether this is triggered via TLS clusters being destructed or the the primary cluster? If the ClusterInfoImpl is destructed on a worker thread that sounds like a bigger issue

This can definitely happen. We lock/store the ClusterInfo as a shared_ptr in a large number of contexts, and it can definitely get deleted on a worker thread. Historically, everything inside ClusterInfo is stateless and should be safe to delete in this fashion.

Given ^, if we are violating this somewhere within ClusterInfoImpl we should either fix it with locking or figure out whether the thing in question actually needs to be in the ClusterInfo interface at all. It seems strange to me that any type of callback would be stored via this interface so something seems broken at a more fundamental level.

@incfly
Copy link
Contributor Author

incfly commented Sep 23, 2020

I have updated the description of the issue, this should now include all the details in one place.

Can we make sure that we understand whether this is triggered via TLS clusters being destructed or the the primary cluster? If the ClusterInfoImpl is destructed on a worker thread that sounds like a bigger issue

That part I'm not sure, unfamliar with primary cluster & envoy internals, but the ASAN part does print different thread id, as you can see in the description. I assume maybe T0 thread is the main thread?

My understanding is that the claim is that is that the TLS cluster (or actually the ClusterInfoImpl based on the stack posted?) creates/deletes a callback via the CallbackManager for SdsApi which is owned by the main thread, and since there is no locking this causes a data race leading to the crash.

I think this is correct.

@mattklein123 mattklein123 changed the title Discussion: Data race in callback managers' usage Data race in callback manager / ClusterInfoImpl destruction Sep 24, 2020
@mattklein123
Copy link
Member

OK thanks for the extra diligence here. I understand the issue now.

So, yes, this is broken. In general I think we need to assume that ClusterInfo can get destroyed on any thread. This is by design and is required for access logging and a bunch of other things. It's snapped in a large number of worker contexts.

The fact that we have a giant chain in which the destruction of ClusterInfo can result in a context destruction, and then effect the callback manager, etc. is an unfortunate side effect.

We will need to figure out how to fix this. I'm not sure that adding locking is actually the right fix here vs. something more fundamental in terms of looking at ClusterInfoImpl and looking at what we are storing in there and whether it should be stored in some other way (other not at all, weak_ptr, etc.).

I marked this issue bug/help wanted. Thank you!

@lambdai
Copy link
Contributor

lambdai commented Nov 19, 2020

The clusterInfo is referenced by quit a lot of filter and unclear stuff.

My concerning is that clusterInfo seems snapped by not only upstream Host but also downstream such as ConnectionManagerImpl::ActiveStream::cached_cluster_info_
This means close connections prior to destroy ClusterInfo may not work.

@lambdai
Copy link
Contributor

lambdai commented Nov 19, 2020

The incomplete list in ClusterInfoImpl which might want ClusterInfo destroyed in master threads are

  TransportSocketMatcherPtr socket_matcher_;
  const std::unique_ptr<Server::Configuration::CommonFactoryContext> factory_context_;
  std::vector<Network::FilterFactoryCb> filter_factories_;
  const std::map<std::string, ProtocolOptionsConfigConstSharedPtr> extension_protocol_options_;

And the crashes frames on my hands are all on top of TransportSocketMatcherImpl

@lambdai
Copy link
Contributor

lambdai commented Nov 19, 2020

But callback is not the only violation. See this one.

2020-11-12T01:14:05.365981Z     critical        envoy backtrace #0: __restore_rt [0x7f8afe823980]
2020-11-12T01:14:05.442870Z     critical        envoy backtrace #1: Envoy::Buffer::OwnedImpl::move() [0x5571c5a85acf]
2020-11-12T01:14:05.455167Z     critical        envoy backtrace #2: Envoy::Buffer::WatermarkBuffer::move() [0x5571c59eba0e]
2020-11-12T01:14:05.485904Z     critical        envoy backtrace #3: Envoy::Network::ConnectionImpl::write() [0x5571c5636b3d]
2020-11-12T01:14:05.499803Z     critical        envoy backtrace #4: Envoy::Http::Http2::ConnectionImpl::onSend() [0x5571c593a520]
2020-11-12T01:14:05.507516Z     critical        envoy backtrace #5: nghttp2_session_send [0x5571c5a96caa]
2020-11-12T01:14:05.515197Z     critical        envoy backtrace #6: Envoy::Http::Http2::ConnectionImpl::sendPendingFrames() [0x5571c59379cb]
2020-11-12T01:14:05.523609Z     critical        envoy backtrace #7: Envoy::Http::Http2::ConnectionImpl::StreamImpl::resetStream() [0x5571c5935b72]
2020-11-12T01:14:05.532874Z     critical        envoy backtrace #8: Envoy::Router::UpstreamRequest::resetStream() [0x5571c59b47a6]
2020-11-12T01:14:05.541955Z     critical        envoy backtrace #9: Envoy::Router::Filter::resetAll() [0x5571c59a6c47]
2020-11-12T01:14:05.548593Z     critical        envoy backtrace #10: Envoy::Router::Filter::onDestroy() [0x5571c59a6b5e]
2020-11-12T01:14:05.555132Z     critical        envoy backtrace #11: Envoy::Http::AsyncStreamImpl::reset() [0x5571c5980672]
2020-11-12T01:14:05.562154Z     critical        envoy backtrace #12: Envoy::Grpc::AsyncStreamImpl::cleanup() [0x5571c597a72a]
2020-11-12T01:14:05.568861Z     critical        envoy backtrace #13: Envoy::Grpc::AsyncClientImpl::~AsyncClientImpl() [0x5571c59797a5]
2020-11-12T01:14:05.576130Z     critical        envoy backtrace #14: Envoy::Grpc::AsyncClientImpl::~AsyncClientImpl() [0x5571c597995e]
2020-11-12T01:14:05.583409Z     critical        envoy backtrace #15: Envoy::Config::GrpcMuxImpl::~GrpcMuxImpl() [0x5571c567bd7a]
2020-11-12T01:14:05.591189Z     critical        envoy backtrace #16: Envoy::Config::GrpcSubscriptionImpl::~GrpcSubscriptionImpl() [0x5571c596462b]
2020-11-12T01:14:05.598235Z     critical        envoy backtrace #17: Envoy::Secret::SdsApi::~SdsApi() [0x5571c5648bf4]
2020-11-12T01:14:05.604575Z     critical        envoy backtrace #18: Envoy::Extensions::TransportSockets::Tls::ContextConfigImpl::~ContextConfigImpl() [0x5571c5554d63]
2020-11-12T01:14:05.611920Z     critical        envoy backtrace #19: Envoy::Extensions::TransportSockets::Tls::ClientContextConfigImpl::~ClientContextConfigImpl() [0x5571c55569f2]
2020-11-12T01:14:05.618784Z     critical        envoy backtrace #20: Envoy::Extensions::TransportSockets::Tls::ClientSslSocketFactory::~ClientSslSocketFactory() [0x5571c3e362f7]
2020-11-12T01:14:05.625697Z     critical        envoy backtrace #21: std::__1::allocator<>::destroy() [0x5571c57f1b7a]
2020-11-12T01:14:05.631537Z     critical        envoy backtrace #22: Envoy::Upstream::TransportSocketMatcherImpl::~TransportSocketMatcherImpl() [0x5571c57f1a5f]
2020-11-12T01:14:05.636746Z     critical        envoy backtrace #23: Envoy::Upstream::ClusterInfoImpl::~ClusterInfoImpl() [0x5571c57d102b]
2020-11-12T01:14:05.641519Z     critical        envoy backtrace #24: Envoy::Upstream::ClusterInfoImpl::~ClusterInfoImpl() [0x5571c57d107e]
2020-11-12T01:14:05.646321Z     critical        envoy backtrace #25: Envoy::Upstream::HostDescriptionImpl::~HostDescriptionImpl() [0x5571c3ea92a0]
2020-11-12T01:14:05.651609Z     critical        envoy backtrace #26: Envoy::Upstream::HostImpl::~HostImpl() [0x5571c3eb1fad]
2020-11-12T01:14:05.656670Z     critical        envoy backtrace #27: std::__1::__shared_ptr_pointer<>::__on_zero_shared() [0x5571c3ea9bee]
2020-11-12T01:14:05.661719Z     critical        envoy backtrace #28: Envoy::Upstream::HostSetImpl::~HostSetImpl() [0x5571c45dba4f]
2020-11-12T01:14:05.666881Z     critical        envoy backtrace #29: Envoy::Upstream::HostSetImpl::~HostSetImpl() [0x5571c57d0b1e]
2020-11-12T01:14:05.671938Z     critical        envoy backtrace #30: Envoy::Upstream::PrioritySetImpl::~PrioritySetImpl() [0x5571c3e9e330]
2020-11-12T01:14:05.677369Z     critical        envoy backtrace #31: Envoy::Upstream::ClusterManagerImpl::ThreadLocalClusterManagerImpl::ClusterEntry::~ClusterEntry() [0x5571c567905e]
2020-11-12T01:14:05.682725Z     critical        envoy backtrace #32: absl::container_internal::raw_hash_set<>::erase() [0x5571c5680aca]
2020-11-12T01:14:05.688187Z     critical        envoy backtrace #33: std::__1::__function::__func<>::operator()() [0x5571c56809b6]
2020-11-12T01:14:05.693737Z     critical        envoy backtrace #34: std::__1::__function::__func<>::operator()() [0x5571c567fbb3]
2020-11-12T01:14:05.699546Z     critical        envoy backtrace #35: std::__1::__function::__func<>::operator()() [0x5571c56005ac]
2020-11-12T01:14:05.704898Z     critical        envoy backtrace #36: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x5571c562b806]
2020-11-12T01:14:05.710580Z     critical        envoy backtrace #37: event_process_active_single_queue [0x5571c5a8aff7]
2020-11-12T01:14:05.716157Z     critical        envoy backtrace #38: event_base_loop [0x5571c5a89c4e]
2020-11-12T01:14:05.721565Z     critical        envoy backtrace #39: Envoy::Server::WorkerImpl::threadRoutine() [0x5571c561d6e3]
2020-11-12T01:14:05.726968Z     critical        envoy backtrace #40: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::{lambda()#1}::__invoke() [0x5571c5c54793]
2020-11-12T01:14:05.727013Z     critical        envoy backtrace #41: start_thread [0x7f8afe8186db]

@mattklein123
Copy link
Member

I'm wondering, could all of ^ actually be moved to the cluster and removed from ClusterInfo and we could access via some other API? I'm pretty sure that all of the things ^ are completely broken if they are accessed after a cluster is removed?

Or could we start with just the thing that we know is causing issues?

@lambdai
Copy link
Contributor

lambdai commented Nov 19, 2020

I'm pretty sure that all of the things ^ are completely broken if they are accessed after a cluster is removed?

Are you saying the access to ClusterInfo in work thread should not be common read/write, but just unfortunate destroy?
That means we can let Cluster owns these, and ClusterInfo contains weak reference to them?

@mattklein123
Copy link
Member

What I'm saying is that I think there are things in ClusterInfo that probably should not be in there, and I'm wondering if we can move them. I would start with just the thing that is causing the crash which I think is the socket_matcher_. I think that code is already thread safe, so yes, I'm wondering if it can be owned by the cluster and maybe accessed via ThreadLocalCluster vs. ClustetInfo?

@lambdai
Copy link
Contributor

lambdai commented Nov 19, 2020

What I'm saying is that I think there are things in ClusterInfo that probably should not be in there, and I'm wondering if we can move them. I would start with just the thing that is causing the crash which I think is the socket_matcher_. I think that code is already thread safe, so yes, I'm wondering if it can be owned by the cluster and maybe accessed via ThreadLocalCluster vs. ClustetInfo?

Make perfect sense to me. I was trapped in the zone that these things need to be destroyed later.
I didn't realize they could be destroyed earlier with Cluster/threadLocalCluster(Sorry I don't have a clean concept of these clusters)

@lambdai
Copy link
Contributor

lambdai commented Nov 20, 2020

Let me add more detailed description. Conclusion: there is no destroy order error. Both the stack are data race [1][2].

@mattklein123 There is no sync between ClusterImplBase and ThreadLocalCluster(ClusterEntry)/ClusterInfo. The ClusterImplBase is always destroyed on master thread first because clusterInfo hold the necessary shared ptr to ClusterImplBase.
The typical destroy order is below. Not always but cannot be excluded.
~ClusterImplBase (master)
~Connections (worker)
~SocketMatcher (worker)
~ThreadLocalCluster (worker)

@mattklein123 The solutions

  1. Move the ownership of TransportSocketMatchesImpl to ClusterBase
    Connection destroyed after SocketMatcher. Crash.
  2. Move the ownership of TransportSocketMatchesImpl to ThreadLocalCluster.
    No crash. Same data race because SocketMatcher are destroyed in worker threads

Anyway we need to enforce the new order

  1. ~Connection on worker thread. This is trivial.
  2. ~SocketMatcher on master according to [1][2]

The owner of SocketMatcher is not that important to resolve this issue, but may impact the future design.

[1] callback handle removal is not guarded by lock since the assumption is context is destroyed on master thread.
[2] sds and grpcmux subscription. No lock there because "subscription" is xDS scope and was assumed to be mutated on master thread.

@mattklein123
Copy link
Member

I had thought that SocketMatcher would only be used on connection creation, but even there I agree there is a race. I think technically it's possible for a cluster to be removed on the main thread while a connection is being created on a worker (before thread local cluster removal).

Given ^ it might be pretty difficult to do this sanely without either destroying cluster info on the main thread as you have tried to do or changing the ownership model of the individual components that are actually causing the issue.

So I guess the question is do you want to focus just on SocketMatcher? Or do you want to just make ClusterInfo destroyed on main thread? I guess I'm OK with the latter if you think that is the best approach.

@lambdai
Copy link
Contributor

lambdai commented Nov 20, 2020

I am back on #14089 which addresses the latter.

So the whole picture is

  1. Cluster can be removed anytime. What a client connection need are supposed to put in ClusterInfo.
  2. ClusterInfo outlives the connections in this cluster. Actually it is achieved by host/connectionpool as the container of connections.
  3. ClusterInfo will be destroyed on master.

I think this is a user friendly threading model. WDYT?

@mattklein123
Copy link
Member

Yes I agree this is OK. Let's review in the other PR. Can you clean that up and we can go from there?

@lambdai
Copy link
Contributor

lambdai commented Nov 20, 2020

Thanks! I updated that PR.

@lambdai
Copy link
Contributor

lambdai commented Mar 25, 2021

#14954 should have fixed this.

@lambdai lambdai closed this as completed Mar 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants