dns: fix exception unsafe behavior in c-ares callbacks. #4307
Previously, we were taking an exception due to the late validation of update_merge_window duration in ClusterManagerImpl::scheduleUpdate, which happened under a c-ares strict DNS host resolution callback. There are several related issues here:

1. c-ares is exception unsafe, see c-ares/c-ares#219.
2. We should be validating Durations with PGV, see bufbuild/protoc-gen-validate#97.
3. We should defer the c-ares resolution callbacks to be outside the c-ares callback context for exception safety.

This PR addresses (3) by moving callbacks, even when they are "immediate", to a dispatcher post, so that we never take an exception under a c-ares callback.

A workaround for (2) is provided, in lieu of bufbuild/protoc-gen-validate#97, which is blocked on our ability to bump PGV version in Envoy, see lyft/protoc-gen-star#28.

Fixes oss-fuzz issue https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=9868.

Risk level: Medium (DNS clusters will have some timing changes).

Testing: Updated DNS implementation unit tests, server fuzz corpus entry added.

Signed-off-by: Harvey Tuch <[email protected]>
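To illustrate point (3), here is a minimal sketch of deferring a resolver callback to a dispatcher post. It is not Envoy's code: `SketchDispatcher`, `ResolveCb`, and `onResolveFromCLibrary` are hypothetical names invented for this example, and the real dispatcher and DNS resolver interfaces may look quite different. The idea is only that the code running under the C library captures the result and queues it, so the user callback runs later with no C frames below it on the stack.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an event-loop dispatcher (not Envoy's interface).
class SketchDispatcher {
public:
  // Queue work to run later, outside the caller's stack frame.
  void post(std::function<void()> cb) { pending_.push(std::move(cb)); }

  // Drain all queued work; returns the number of callbacks that ran.
  int run() {
    int ran = 0;
    while (!pending_.empty()) {
      auto cb = std::move(pending_.front());
      pending_.pop();
      cb();
      ++ran;
    }
    return ran;
  }

private:
  std::queue<std::function<void()>> pending_;
};

// Hypothetical user-facing resolution callback type.
using ResolveCb = std::function<void(std::vector<std::string>)>;

// Stand-in for the code that runs inside the C resolver's callback.
// It never invokes user_cb directly, even for "immediate" results;
// it only captures the addresses and posts the invocation.
void onResolveFromCLibrary(SketchDispatcher& dispatcher, ResolveCb user_cb,
                           std::vector<std::string> addresses) {
  dispatcher.post(
      [user_cb = std::move(user_cb), addresses = std::move(addresses)]() mutable {
        user_cb(std::move(addresses));
      });
}
```

With this shape, the user callback observes results only when the dispatcher drains its queue, which is consistent with the "timing changes" noted under the risk level.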
With the original exception, who is supposed to catch it? What does the callstack roughly look like when this happens?
@ggreenway the server initialization code was supposed to catch it; here's the trace:
So with this fix, where does the exception get caught? If we post() it, it won't get caught by server initialization. So where does it go?
There are two things in this PR:
So let's say that you hadn't done (2) above, so the exception could still be thrown at that time, but you moved it to be post()'d. Isn't that still an exception-unsafe context? I don't think there's any top-level try/catch. So it seems like whether you move it to be post()'d, or call it from the same place it is now, you'd need a try/catch around the call if you want to protect against future incorrect throws. So what value is there in moving it to be post()'d?
@ggreenway we have a top-level try/catch for this, see https://github.com/envoyproxy/envoy/blob/master/source/exe/main_common.cc#L136. The idea in this PR is that it's better to exit via this path if we take an accidental exception vs. the existing situation, where we have a who-knows-what-c-ares-is-doing scenario.
Oh right, I was thinking this was on a worker thread, but this always happens on the main thread.
Previously, once the callback was posted to the dispatcher, the PendingResolution was destructed. This then broke the ability to cancel() after the post. This PR restores this capability and simplifies some of the object ownership aspects of PendingResolution post envoyproxy#4307. Fixes oss-fuzz issue https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=10184. Risk level: Medium (this code has scary complicated lifetime and ownership guarantees). Testing: Additional unit test and corpus entry added. Signed-off-by: Harvey Tuch <[email protected]>
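One way to keep cancel() working after the post is sketched below. The source does not show the actual ownership model, so everything here is an assumption: `SketchPendingResolution`, `PostQueue`, `postCompletion`, and `drain` are hypothetical names, and sharing the object through `std::shared_ptr` is just one possible design. The posted closure holds a reference to the pending object and checks a cancelled flag before invoking the callback.

```cpp
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical pending query; the real PendingResolution is not shown here.
struct SketchPendingResolution {
  std::function<void()> callback;
  bool cancelled = false;
  void cancel() { cancelled = true; }
};

using PostQueue = std::queue<std::function<void()>>;

// Post the completion while keeping the pending object alive, so the
// caller can still cancel between the post and the dispatch.
std::shared_ptr<SketchPendingResolution> postCompletion(PostQueue& queue,
                                                        std::function<void()> cb) {
  auto pending = std::make_shared<SketchPendingResolution>();
  pending->callback = std::move(cb);
  queue.push([pending] {
    if (!pending->cancelled) {
      pending->callback();
    }
  });
  return pending;
}

// Run every queued closure; returns how many closures were dispatched.
int drain(PostQueue& queue) {
  int dispatched = 0;
  while (!queue.empty()) {
    auto fn = std::move(queue.front());
    queue.pop();
    fn();
    ++dispatched;
  }
  return dispatched;
}
```

If a caller posts two completions and cancels the first before `drain` runs, both closures are dispatched but only the second callback fires.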
…yproxy#4307)" This reverts commit 1d34172. Signed-off-by: Harvey Tuch <[email protected]>
* Revert "dns: fix exception unsafe behavior in c-ares callbacks. (#4307)" This reverts commit 1d34172.
* Preserve non-controversial aspects of #4307.
* Exception wrapper for c-ares callback.

Signed-off-by: Harvey Tuch <[email protected]>
Pulling the following changes from github.com/envoyproxy/envoy:

f936fc6 ssl: serialize accesses to SSL socket factory contexts (envoyproxy#4345)
e34dcd6 Fix crash in tcp_proxy (envoyproxy#4323)
ae6a252 router: fix matching when all domains have wildcards (envoyproxy#4326)
aa06142 test: Stop fake_upstream methods from accidentally succeeding (envoyproxy#4232)
5d73187 rbac: update the authenticated.user to a StringMatcher. (envoyproxy#4250)
c6bfc7d time: Event::TimeSystem abstraction to make it feasible to inject time with simulated timers (envoyproxy#4257)
752483e Fixing the fix (envoyproxy#4333)
83487f6 tls: update BoringSSL to ab36a84b (3497). (envoyproxy#4338)
7bc210e test: fixing interactions between waitFor and ignore_spurious_events (envoyproxy#4309)
69474b3 admin: order stats in clusters json admin (envoyproxy#4306)
2d155f9 ppc64le build (envoyproxy#4183)
07efc6d fix static initialization fiasco problem (envoyproxy#4314)
0b7e3b5 test: Remove declared but undefined class methods (envoyproxy#4297)
1485a13 lua: make sure resetting dynamic metadata wrapper when request info is marked dead
d243cd6 test: set to zero when start_time exceeds limit (envoyproxy#4328)
0a1e92a test: fix heap use-after-free in ~IntegrationTestServer. (envoyproxy#4319)
cddc732 CONTRIBUTING: Document 'kick-ci' trick. (envoyproxy#4335)
f13ef24 docs: remove reference to deprecated value field (envoyproxy#4322)
e947a27 router: minor doc fixes in stream idle timeout (envoyproxy#4329)
0c2e998 tcp-proxy: fixing a TCP proxy bug where we attempted to readDisable a closed connection (envoyproxy#4296)
00ffe44 utility: fix strftime overflow handling. (envoyproxy#4321)
af1183c Re-enable TcpProxySslIntegrationTest and make the tests pass again. (envoyproxy#4318)
3553461 fuzz: fix H2 codec fuzzer post envoyproxy#4262. (envoyproxy#4311)
42f6048 Proto string issue fix (envoyproxy#4320)
9c492a0 Support Envoy to fetch secrets using SDS service. (envoyproxy#4256)
a857219 ratelimit: revert `revert rate limit failure mode config` and add tests (envoyproxy#4303)
1d34172 dns: fix exception unsafe behavior in c-ares callbacks. (envoyproxy#4307)
1212423 alts: add gRPC TSI socket (envoyproxy#4153)
f0363ae fuzz: detect client-side resets in H2 codec fuzzer. (envoyproxy#4300)
01aa3f8 test: hopefully deflaking echo integration test (envoyproxy#4304)
1fc0f4b ratelimit: link legacy proto when message is being used (envoyproxy#4308)
aa4481e fix rare List::remove(&target) segfault (envoyproxy#4244)
89e0f23 headers: fixing fast fail of size-validation (envoyproxy#4269)
97eba59 build: bump googletest version. (envoyproxy#4293)
0057e22 fuzz: avoid false positives in HCM fuzzer. (envoyproxy#4262)
9d094e5 Revert ac0bd74 (envoyproxy#4295)
ddb28a4 Add validation context provider (envoyproxy#4264)
3b47cba added histogram latency information to Hystrix dashboard stream (envoyproxy#3986)
cf87d50 docs: update SNI FAQ. (envoyproxy#4285)
f952033 config: fix update empty stat for eds (envoyproxy#4276)
329e591 router: Add ability of custom headers to rely on per-request data (envoyproxy#4219)
68d20b4 thrift: refactor build files and imports (envoyproxy#4271)
5fa8192 access_log: log requested_server_name in tcp proxy (envoyproxy#4144)
fa45bb4 fuzz: libc++ clocks don't like nanos. (envoyproxy#4282)
53f8944 stats: add symbol table for future stat name encoding (envoyproxy#3927)
c987b42 test infra: Remove timeSource() from the ClusterManager api (envoyproxy#4247)
cd171d9 websocket: tunneling websockets (and upgrades in general) over H2 (envoyproxy#4188)
b9dc5d9 router: disallow :path/host rewriting in request_headers_to_add. (envoyproxy#4220)
0c91011 network: skip socket options and source address for UDS client connections (envoyproxy#4252)
da1857d build: fixing a downstream compile error by noting explicit fallthrough (envoyproxy#4265)
9857cfe fuzz: cleanup per-test environment after each fuzz case. (envoyproxy#4253)
52beb06 test: Wrap proto string in std::string before comparison (envoyproxy#4238)
f5e219e extensions/thrift_proxy: Add header matching to thrift router (envoyproxy#4239)
c9ce5d2 fuzz: track read_disable_count bidirectionally in codec_impl_fuzz_test. (envoyproxy#4260)
35103b3 fuzz: use nanoseconds for SystemTime in RequestInfo. (envoyproxy#4255)
ba6ba98 fuzz: make runtime root hermetic in server_fuzz_test. (envoyproxy#4258)
b0a9014 time: Add 'format' test to ensure no one directly instantiates Prod*Time from source. (envoyproxy#4248)
8567460 access_log: support beginning of epoch in START_TIME. (envoyproxy#4254)
28d5f41 proto: unify envoy_proto_library/api_proto_library. (envoyproxy#4233)
f7d3cb6 http: fix allocation bug introduced in envoyproxy#4211. (envoyproxy#4245)

Fixes istio/istio#8310 (once pulled into istio/istio).

Signed-off-by: Piotr Sikora <[email protected]>