Crash when updating UDP clusters through CDS #14866
Hmm, this stack looks legit. It's probably a data race. Also, I realize the cluster destroy on the worker thread is not fully resolved by recent changes, so I am going to pick up my leftover #14089. Once it is done, let's see if that PR helps. |
I have spent more time investigating this and it turns out that my initial assumptions were wrong. It is not the CDS size that triggers the crash, but the existence of UDP proxy/clusters. I got to the point where updating a single UDP listener/cluster is enough to crash envoy. I'm preparing a test environment for you to investigate. |
I created a simple management server capable of crashing envoy on request. You will need to build it, run it and connect an envoy instance to it. Having done that, a few "Trigger" button clicks should crash envoy - at least it crashes my instance :) https://github.com/bartebor/crash-management-server |
@bartebor thanks, is there any chance you could wire this up using docker compose so we don't have to figure out how to build things, etc.? Then I think someone can look at this. |
Oh, I thought that it was easy enough to use - the build process is dockerized, there are no other requirements than docker itself to run this. I also thought that developers would use it against their own built envoy, so I did not see the reason to use docker-compose. The README has all commands to just copy&paste:
# clone
git clone https://github.com/bartebor/crash-management-server.git
cd crash-management-server
# build
docker build -t c-m-s:latest .
# run
docker run --rm -p8080:8080 -p12345:12345 c-m-s:latest
# in other terminal, start your custom-built envoy using sample config file
envoy -c envoy/envoy-dynamic-v3.yaml --concurrency 1
# ... or use the official docker image
docker run --rm --net=host -v $(pwd)/envoy/envoy-dynamic-v3.yaml:/etc/envoy.yaml:ro envoyproxy/envoy-debug:v1.17.0 -c /etc/envoy.yaml
# trigger change - open in browser http://127.0.0.1:8080 or use curl:
curl -s -XPOST 127.0.0.1:8080/triggerChange |
Ah ok, perfect, thanks. I will take a look. |
I believe that I am having the same issue. I have a UDP proxy and updating the cluster definition most times causes a crash. I see this in the logs:
The output does vary, sometimes I get more backtrace. I see that this is already being looked at. If it's useful I can try and pull together the minimum required to trigger the issue. |
I just tried to repro this on current main branch and I can't repro. @bartebor can you check current main branch with your repro instructions to check me and see if I'm maybe doing something wrong? Thank you. |
I just tried the repro instructions on 1.17.0 and I also can't repro so I'm probably doing something wrong? |
Hmm, I just tried above instructions (with just copying into terminal) and envoy (docker image, envoy-debug:1.17.0) crashed on the first try:
@mattklein123 we need to make sure that:
and envoy reloads cluster
You sometimes need to call the endpoint a few times in a row without restarting anything to trigger a crash. I'm using an i7-5600U laptop with security fixes for numerous CPU vulnerabilities. -- EDIT |
OK thanks I thought it was an instant crash with single repro. I can repro if I do the update over and over again. I will take a look. |
@mattklein123 I am not sure if #14954 fixes it... I remember the fix was initially aiming to address the callback data race in #13209 |
I don't know if it's a red herring or not. It still repros on current main branch. I'm debugging now and will report back. |
The problem is
Is no longer valid due to the order of the update callbacks when a cluster is updated. It's possible that redis proxy also has this problem but I'm not sure yet. I will work on fixing this. I think the best thing to do would be to move the member_update_cb_handle_ to some type of RAII wrapper that uses weak_ptr internally (@htuch I think this came up in one of your recent PRs). Unless @htuch is working on this actively I will sort it out. |
Maybe the ~ClusterInfo wrapping this line should be executed on the worker thread as well |
I take it back. |
I think we have the same issue with TCP redis. The crash occurs when we try to change the cds.yaml file with new endpoints.

node:
  id: id_1
  cluster: test
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 6379
    filter_chains:
    - filters:
      - name: envoy.filters.network.redis_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.redis_proxy.v3.RedisProxy
          stat_prefix: redis_proxy
          settings:
            op_timeout: 1s
            enable_hashtagging: false
          prefix_routes:
            catch_all_route:
              cluster: redis
dynamic_resources:
  cds_config:
    path: /var/lib/envoy/cds.yaml
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address:
      protocol: TCP
      address: 0.0.0.0
      port_value: 9901

resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: redis
  connect_timeout: 1s
  type: STRICT_DNS
  lb_policy: RING_HASH
  dns_lookup_family: V4_ONLY
  health_checks:
    timeout: 1s
    interval: 1s
    unhealthy_threshold: 3
    healthy_threshold: 1
    custom_health_check:
      name: envoy.health_checkers.redis
      typed_config: {}
  load_assignment:
    cluster_name: redis
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: 10.42.0.37
              port_value: 6379
      - endpoint:
          address:
            socket_address:
              address: 10.42.0.80
              port_value: 6379

envoy-7d56b8b77f-8796f envoy [2021-02-25 18:21:00.194][1][info][main] [source/server/server.cc:731] starting main dispatch loop
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:21:00.194][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:191] cm init: all clusters initialized
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:21:00.194][1][info][main] [source/server/server.cc:712] all clusters initialized. initializing init manager
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:21:00.194][1][info][config] [source/server/listener_manager_impl.cc:888] all dependencies initialized. starting workers
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.851][1][info][upstream] [source/common/upstream/cds_api_impl.cc:71] cds: add 1 cluster(s), remove 0 cluster(s)
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.854][1][info][upstream] [source/common/upstream/cds_api_impl.cc:86] cds: add/update cluster 'redis'
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.859][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x18
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.859][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.859][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 5c801b25cae04f06bf48248c90e87d623d7a6283/1.17.0/Clean/RELEASE/BoringSSL
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.859][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #0: __restore_rt [0x7fb7c5224980]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.859][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #1: [0x5568bac2b7f4]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.859][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #2: [0x5568bac35c43]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.859][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #3: [0x5568bac332a3]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.859][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #4: [0x5568bac52b59]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #5: [0x5568bac2b9dc]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #6: [0x5568bac35c43]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #7: [0x5568bac332a3]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #8: [0x5568babb4a4c]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #9: [0x5568babb3843]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #10: [0x5568babb370b]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #11: [0x5568bac26545]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #12: [0x5568bac2837e]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #13: [0x5568bac314bb]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #14: [0x5568bac52bc4]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #15: [0x5568bac53824]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #16: [0x5568badbe624]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #17: [0x5568badc58b4]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #18: [0x5568bae0354d]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #19: [0x5568b94cdb26]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #20: [0x5568b9bdb9f8]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #21: [0x5568b9bdfb4b]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #22: [0x5568b9bdea4b]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #23: [0x5568b9bdb4d7]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #24: [0x5568b9bdc6ed]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #25: [0x5568babf388f]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #26: [0x5568babee00d]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #27: [0x5568babebcb9]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #28: [0x5568babe0fd1]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #29: [0x5568babe1dbc]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #30: [0x5568bb050138]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.860][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #31: [0x5568bb04eb0e]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.861][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #32: [0x5568babc1bff]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.861][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #33: [0x5568b93d7e28]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.861][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #34: [0x5568b93d8627]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.861][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:98] #35: [0x5568b93d69dc]
envoy-7d56b8b77f-8796f envoy [2021-02-25 18:22:55.861][1][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #36: __libc_start_main [0x7fb7c4e42bf7]
envoy-7d56b8b77f-8796f envoy ConnectionImpl 0x47643f340000, connecting_: 0, bind_error_: 0, state(): Open, read_buffer_limit_: 1048576
envoy-7d56b8b77f-8796f envoy socket_:
envoy-7d56b8b77f-8796f envoy ListenSocketImpl 0x47643f55d170, transport_protocol_: , server_name_:
envoy-7d56b8b77f-8796f envoy address_provider_:
envoy-7d56b8b77f-8796f envoy SocketAddressSetterImpl 0x47643f54a1f8, remote_address_: 10.42.0.37:6379, direct_remote_address_: 10.42.0.37:6379, local_address_: 10.42.0.82:59920 |
This changes the callback code to use RAII with weak pointers. This allows both the callee and the callback manager to be safely destructed in different orders which does happen during normal operation, for example with cluster and listener changes. Fixes #14866 Signed-off-by: Matt Klein <[email protected]>
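To make the idea in the commit message concrete, here is a minimal sketch of an RAII callback handle that holds a `weak_ptr` to the manager's state. It is not Envoy's actual callback code: `CallbackManager`, `Handle`, `State`, `add`, `run` and the two demo functions are hypothetical names chosen for illustration, and the sketch assumes single-threaded use. It only shows why either side can be destroyed first without touching freed memory.

```cpp
#include <functional>
#include <iterator>
#include <list>
#include <memory>
#include <utility>

// Hypothetical, simplified callback manager (illustration only).
class CallbackManager {
  // The callback list lives in shared state so handles can observe its lifetime.
  struct State {
    std::list<std::function<void(int)>> callbacks;
  };

public:
  using Callback = std::function<void(int)>;

  // RAII handle: unregisters the callback on destruction, but only if the
  // manager's state is still alive.
  class Handle {
  public:
    Handle(std::weak_ptr<State> state, std::list<Callback>::iterator it)
        : state_(std::move(state)), it_(it) {}
    Handle(const Handle&) = delete;
    Handle& operator=(const Handle&) = delete;
    ~Handle() {
      if (auto state = state_.lock()) {
        state->callbacks.erase(it_);
      }
      // If lock() fails the manager is already gone and there is nothing to erase.
    }

  private:
    std::weak_ptr<State> state_;
    std::list<Callback>::iterator it_;
  };

  // Registers a callback; the returned handle owns the registration.
  std::unique_ptr<Handle> add(Callback cb) {
    state_->callbacks.push_back(std::move(cb));
    return std::make_unique<Handle>(state_, std::prev(state_->callbacks.end()));
  }

  // Invokes every registered callback. The iterator is advanced before the
  // call so a callback may drop its own handle while running.
  void run(int value) {
    for (auto it = state_->callbacks.begin(); it != state_->callbacks.end();) {
      auto current = it++;
      (*current)(value);
    }
  }

private:
  std::shared_ptr<State> state_ = std::make_shared<State>();
};

// Order 1: the subscriber's handle is destroyed before the manager.
int handleDestroyedFirst() {
  CallbackManager manager;
  int calls = 0;
  {
    auto handle = manager.add([&calls](int) { ++calls; });
    manager.run(1); // callback is registered and fires
  }
  manager.run(2); // handle is gone, so the callback was removed
  return calls;
}

// Order 2: the manager is destroyed before the subscriber's handle.
bool managerDestroyedFirst() {
  auto manager = std::make_unique<CallbackManager>();
  auto handle = manager->add([](int) {});
  manager.reset(); // shared state is freed, the weak_ptr expires
  handle.reset();  // lock() fails, so no erase on a freed list
  return true;
}
```

With a raw handle that erases itself from the manager unconditionally, the second ordering would dereference a destroyed list, which is the kind of failure the thread describes during cluster updates.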
I have a crash when using CDS with UDP. It is completely reproducible on Kubernetes CI.
Crashdump on CDS
Config applied
CDS
|
Title: Envoy crashes after CDS update of multiple clusters
Description:
I have an envoy server working with our custom control plane based on current (0.1.27) java-control-plane via ADS (LDS, RDS, CDS, EDS). When there is a sudden change in all of our clusters (500+), envoy crashes. I have checked v1.15.3, v1.16.2 and v1.17.0 - all of them crash.
Repro steps:
I have no isolated steps to reproduce this, but it is easy to reproduce in our environment:
connect_timeout
and trigger CDS update for all clusters
There is no crash when a small number of clusters is updated.
I don't know what information could be useful here, so if you find something missing please let me know.
Call Stack:
Data for envoy v1.17.0-debug follow:
GDB output: