cluster: destroy on main thread #14954

lambdai · 2021-02-05T13:04:42Z

Commit Message:
Always destroy the cluster info object on the main thread.
Fixes SDS churn and a rare crash case.
Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
Fixes:
#13209
istio/istio#30199
istio/istio#28315
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: Yuchen Dai <[email protected]>

lambdai · 2021-02-05T13:05:38Z

ClusterInfo is not really destroyed because the main-thread dispatcher stops dispatching posted tasks. SSLContextManager and the symbol table are not happy: both managers expect all resources to be destroyed.

Before this commit, the main-thread dispatcher would drain the post queue, but the exit flag is fragile.

Any advice on how to gracefully shut down the dispatcher? @antoniovicente
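
The problem described above can be sketched in a few lines. This is a minimal illustration, not Envoy code: `MainDispatcher`, `FakeClusterInfo`, `makeClusterInfo`, and `liveAfterRelease` are hypothetical names, and the queue is single-threaded for brevity (a real dispatcher would guard it with a lock). The assumption is that the last owner of the object does not delete it directly but posts the deletion to the main-thread dispatcher, so the object is only destroyed if the dispatcher still runs posted tasks.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical stand-in for the main-thread dispatcher: a plain callback queue.
class MainDispatcher {
public:
  void post(std::function<void()> cb) { queue_.push(std::move(cb)); }

  // Runs every queued callback on the caller's thread; returns how many ran.
  int runPosted() {
    int ran = 0;
    while (!queue_.empty()) {
      std::function<void()> cb = std::move(queue_.front());
      queue_.pop();
      cb();
      ++ran;
    }
    return ran;
  }

private:
  std::queue<std::function<void()>> queue_;
};

// Hypothetical stand-in for ClusterInfo; it only tracks a live-object count.
struct FakeClusterInfo {
  explicit FakeClusterInfo(int& live) : live_(live) { ++live_; }
  ~FakeClusterInfo() {
    assert(live_ > 0);
    --live_;
  }
  int& live_;
};

// The custom deleter defers destruction by posting it to the dispatcher.
std::shared_ptr<FakeClusterInfo> makeClusterInfo(MainDispatcher& main, int& live) {
  return std::shared_ptr<FakeClusterInfo>(
      new FakeClusterInfo(live),
      [&main](FakeClusterInfo* p) { main.post([p] { delete p; }); });
}

// Drops the last reference, optionally drains the queue, and reports how many
// objects are still alive afterwards.
int liveAfterRelease(bool drain) {
  MainDispatcher main;
  int live = 0;
  std::shared_ptr<FakeClusterInfo> info = makeClusterInfo(main, live);
  info.reset(); // Only posts the deletion; nothing is destroyed yet.
  if (drain) {
    main.runPosted();
  }
  return live;
}
```

With `drain` set to `true` the object is gone once the queue has run. With `drain` set to `false` the posted deletion never executes and the object stays alive, which is the situation in which managers that expect every resource to be released would complain.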

lambdai · 2021-02-05T18:06:05Z

The tests are failing because the symbol table is not empty when the IsolatedStoreImpl is destroyed.

I am investigating. It would also be good to hear whether this is a big issue.

antoniovicente

> ClusterInfo is not really destroyed because the main-thread dispatcher stops dispatching posted tasks. SSLContextManager and the symbol table are not happy: both managers expect all resources to be destroyed.
>
> Before this commit, the main-thread dispatcher would drain the post queue, but the exit flag is fragile.
>
> Any advice on how to gracefully shut down the dispatcher? @antoniovicente

Clean shutdown of dispatchers is a fairly involved topic. The Envoy dispatcher class hasn't really aimed for clean shutdown. Some things could be done to move in that direction, such as:

* introducing a bool tryPost(cb) that rejects additional callbacks after shutdown
* executing scheduled callbacks as part of the shutdown process
* force-closing downstream connections
* not sure what's relevant to timers
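
The first two suggestions can be sketched together as follows. This is a hedged illustration only: `SketchDispatcher`, `ShutdownResult`, and `demoShutdown` are hypothetical names, and the `tryPost` semantics shown here (return `false` once shutdown has begun) are an assumption based on the one-line description above, not the API that was eventually merged.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical dispatcher sketch; this is not Envoy's Event::Dispatcher.
class SketchDispatcher {
public:
  using Callback = std::function<void()>;

  // Queues cb unless shutdown has begun. Returns false when cb is rejected.
  bool tryPost(Callback cb) {
    std::lock_guard<std::mutex> lock(mu_);
    if (shutting_down_) {
      return false;
    }
    pending_.push_back(std::move(cb));
    return true;
  }

  // Stops accepting new callbacks, then runs everything already queued.
  // Callbacks run outside the lock, so one that calls tryPost() is simply
  // rejected instead of deadlocking. Returns the number of callbacks run.
  std::size_t shutdown() {
    std::vector<Callback> to_run;
    {
      std::lock_guard<std::mutex> lock(mu_);
      shutting_down_ = true;
      to_run.swap(pending_);
    }
    for (Callback& cb : to_run) {
      cb();
    }
    return to_run.size();
  }

private:
  std::mutex mu_;
  bool shutting_down_ = false;
  std::vector<Callback> pending_;
};

struct ShutdownResult {
  bool accepted_before;
  std::size_t ran;
  bool accepted_after;
  int counter;
};

// Posts two callbacks, shuts down, then tries to post once more.
ShutdownResult demoShutdown() {
  SketchDispatcher dispatcher;
  int counter = 0;
  ShutdownResult result{};
  result.accepted_before = dispatcher.tryPost([&counter] { ++counter; });
  // This callback tries to post again while shutdown is draining the queue.
  dispatcher.tryPost([&counter, &dispatcher] {
    ++counter;
    dispatcher.tryPost([&counter] { counter += 100; });
  });
  result.ran = dispatcher.shutdown();
  result.accepted_after = dispatcher.tryPost([&counter] { ++counter; });
  result.counter = counter;
  return result;
}
```

In this sketch the caller learns from the return value that a callback was rejected and can fall back to running the cleanup inline. Whether that fallback is safe depends on which thread the caller is on, which is part of why clean shutdown is described above as an involved topic.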

source/common/upstream/upstream_impl.cc

include/envoy/event/dispatcher.h

lambdai · 2021-02-08T21:30:51Z

Is the tsan error a red herring?
subprocess.CalledProcessError: Command '['git', 'describe', '--all']' returned non-zero exit status 128.

https://dev.azure.com/cncf/envoy/_build/results?buildId=66038&view=logs&j=d1f76054-8f79-554b-6f4a-11d6a63b8e00&t=266e17e3-d213-54b5-deef-0dcee01da137&l=25588

mattklein123

Thanks!

lambdai · 2021-02-23T01:28:12Z

fixing format

Signed-off-by: Yuchen Dai <[email protected]>

lambdai · 2021-02-23T17:50:17Z

Thank you!

lambdai · 2021-02-23T17:59:10Z

@cpakulski The buggy behavior caused a lot of churn in Istio. See the PR description. Can we backport it?

cpakulski · 2021-02-23T18:06:24Z

@lambdai Sure. To which releases?

lambdai · 2021-02-23T19:16:29Z

> @lambdai Sure. To which releases?

Thanks! The newly added SdsCdsIntegrationTest doesn't use any recent features. I am afraid all the stable releases are impacted.

cpakulski · 2021-02-23T19:18:16Z

OK - will port to all 4 latest releases.

lambdai · 2021-02-23T19:30:48Z

Awesome. Thanks!

* Dispatcher: keeps a stack of tracked objects. (#14573)

  Dispatcher will now keep a stack of tracked objects; on crash it'll "unwind" and have those objects dump their state. Moreover, it'll invoke fatal actions with the tracked objects. This allows us to dump more information during crash. See related PR: #14509. Will follow up with another PR dumping information at the codec/parser level.

  Signed-off-by: Kevin Baichoo <[email protected]>
  Signed-off-by: Christoph Pakulski <[email protected]>

* cluster: destroy on main thread (#14954)

  Signed-off-by: Yuchen Dai <[email protected]>
  Signed-off-by: Christoph Pakulski <[email protected]>

* Updated release notes.

  Signed-off-by: Christoph Pakulski <[email protected]>

Co-authored-by: Kevin Baichoo <[email protected]>
Co-authored-by: Yuchen Dai <[email protected]>
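
The tracked-object stack mentioned in the first backported commit can be pictured roughly as below. This is a sketch under stated assumptions, not the code from #14573: `TrackedObject`, `TrackingDispatcher`, `ScopedTrack`, `Named`, and `simulateCrashDump` are hypothetical names, and dumping the most recently pushed object first is an assumed reading of "unwind".

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface for anything that can describe itself on a crash.
class TrackedObject {
public:
  virtual ~TrackedObject() = default;
  virtual void dumpState(std::ostream& os) const = 0;
};

// Hypothetical dispatcher fragment that keeps a stack of tracked objects.
class TrackingDispatcher {
public:
  void push(const TrackedObject* object) { stack_.push_back(object); }
  void pop() { stack_.pop_back(); }

  // What a fatal-signal handler might call: walk the stack from the most
  // recently pushed object down to the oldest one.
  void dumpOnFatal(std::ostream& os) const {
    for (auto it = stack_.rbegin(); it != stack_.rend(); ++it) {
      (*it)->dumpState(os);
    }
  }

private:
  std::vector<const TrackedObject*> stack_;
};

// RAII helper: registers an object for the lifetime of a scope.
class ScopedTrack {
public:
  ScopedTrack(TrackingDispatcher& dispatcher, const TrackedObject& object)
      : dispatcher_(dispatcher) {
    dispatcher_.push(&object);
  }
  ~ScopedTrack() { dispatcher_.pop(); }
  ScopedTrack(const ScopedTrack&) = delete;
  ScopedTrack& operator=(const ScopedTrack&) = delete;

private:
  TrackingDispatcher& dispatcher_;
};

// Trivial tracked object that dumps its own name.
class Named : public TrackedObject {
public:
  explicit Named(std::string name) : name_(std::move(name)) {}
  void dumpState(std::ostream& os) const override { os << name_ << "\n"; }

private:
  std::string name_;
};

// Nests two tracked scopes and captures what a crash inside the inner scope
// would dump.
std::string simulateCrashDump() {
  TrackingDispatcher dispatcher;
  Named connection("connection");
  Named stream("stream");
  std::ostringstream out;
  ScopedTrack outer(dispatcher, connection);
  {
    ScopedTrack inner(dispatcher, stream);
    dispatcher.dumpOnFatal(out);
  }
  return out.str();
}
```

Calling `simulateCrashDump()` yields the inner object's state followed by the outer object's state, since the inner scope was entered last.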

ewoksly · 2023-10-05T11:46:14Z

> OK - will port to all 4 latest releases.

@cpakulski could you please confirm starting with which version of Envoy this is fixed?

cpakulski · 2023-10-12T13:46:43Z

I do not remember the starting version. It was a long, long time ago. Please check the code.

lambdai added 15 commits November 18, 2020 21:51

destroy hosts on master

b474d23

Signed-off-by: Yuchen Dai <[email protected]>

tobetrypost

7b933b1

Signed-off-by: Yuchen Dai <[email protected]>

to tryPost

65ed52e

Signed-off-by: Yuchen Dai <[email protected]>

revert hosts guard

df9beaa

Signed-off-by: Yuchen Dai <[email protected]>

Merge branch 'master' into completedestroyonmaster

3af5aac

Signed-off-by: Yuchen Dai <[email protected]>

fix cluster test

b647d0c

Signed-off-by: Yuchen Dai <[email protected]>

fix server fuzz test

8c6266d

Signed-off-by: Yuchen Dai <[email protected]>

Merge branch 'main' into clusterdestory

d8d639c

Signed-off-by: Yuchen Dai <[email protected]>

cleanup

564b17a

Signed-off-by: Yuchen Dai <[email protected]>

remove extra runPostCallbacks() in run

8e380ce

Signed-off-by: Yuchen Dai <[email protected]>

remove exit flag in dispatcher

bcad62e

Signed-off-by: Yuchen Dai <[email protected]>

rename to movePost returning void, tests failing

50ec141

Signed-off-by: Yuchen Dai <[email protected]>

relax ssl manager required empty context

4746d53

Signed-off-by: Yuchen Dai <[email protected]>

format

0ca27ec

Signed-off-by: Yuchen Dai <[email protected]>

revert accidentally touched files

ec6eb4a

Signed-off-by: Yuchen Dai <[email protected]>

mattklein123 self-assigned this Feb 5, 2021

tbarrella mentioned this pull request Feb 5, 2021

When cluster is removed and added back, Envoy fails to properly send a new SDS request, leading to all requests getting 503 (TestIngressRequestAuthentication failure) istio/istio#28315

Closed

antoniovicente reviewed Feb 5, 2021

source/common/upstream/upstream_impl.cc (outdated)

source/common/upstream/upstream_impl.cc

antoniovicente reviewed Feb 5, 2021

include/envoy/event/dispatcher.h (outdated)

add preShutdown

2f26bf4

Signed-off-by: Yuchen Dai <[email protected]>

antoniovicente reviewed Feb 6, 2021

include/envoy/event/dispatcher.h

antoniovicente self-assigned this Feb 6, 2021

lambdai added 5 commits February 5, 2021 20:39

fix typo

63b063a

Signed-off-by: Yuchen Dai <[email protected]>

fixing server tests

75c257a

Signed-off-by: Yuchen Dai <[email protected]>

clang-tidy

dd4b151

Signed-off-by: Yuchen Dai <[email protected]>

clang-tidy another try

968a611

Signed-off-by: Yuchen Dai <[email protected]>

fix format

d4dc8f8

Signed-off-by: Yuchen Dai <[email protected]>

repokitteh-read-only bot removed the waiting label Feb 23, 2021

mattklein123 previously approved these changes Feb 23, 2021

fix format

d1bb2d5

Signed-off-by: Yuchen Dai <[email protected]>

lambdai dismissed mattklein123’s stale review via d1bb2d5 February 23, 2021 01:30

lambdai mentioned this pull request Feb 23, 2021

[release-1.9] Make extension doc generates snake case. istio/proxy#3206

Merged

mattklein123 approved these changes Feb 23, 2021

mattklein123 changed the title from "cluster: destroy on master thread" to "cluster: destroy on main thread" Feb 23, 2021

mattklein123 merged commit 114d5ae into envoyproxy:main Feb 23, 2021

lambdai added the backport/review Request to backport to stable releases label Feb 23, 2021

lambdai mentioned this pull request Feb 23, 2021

Crash when updating UDP clusters through CDS #14866

Closed

Shikugawa added backport/approved Approved backports to stable releases and removed backport/review Request to backport to stable releases labels Feb 24, 2021

cpakulski pushed a commit to cpakulski/envoy that referenced this pull request Feb 25, 2021

cluster: destroy on main thread (envoyproxy#14954)

11ce2fe

Signed-off-by: Yuchen Dai <[email protected]> Signed-off-by: Christoph Pakulski <[email protected]>

lambdai mentioned this pull request Mar 25, 2021

Data race in callback manager / ClusterInfoImpl destruction #13209

Closed

This was referenced Apr 23, 2021

Admin server hangs and stops accepting all requests #16124

Closed

Ingress gateway occasionally hangs and stops accepting requests istio/istio#29334

Closed

danielfoehrKn mentioned this pull request May 25, 2021

Istio-ingressgateway getting stuck sometimes. gardener/gardener#4095

Closed

kyessenov mentioned this pull request Oct 12, 2022

manager_ could be a dangling reference when Envoy shuts down #21447

Closed
