Static/dynamic analysis of data plane exceptions on worker threads #14320

Open
htuch opened this issue Dec 8, 2020 · 39 comments
Assignees
Labels
enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue

Comments

@htuch
Member

htuch commented Dec 8, 2020

In general, there should be no exceptions on worker threads, and they should only happen on the main thread. To validate this at runtime (and during test runs which should catch most instances), I propose we replace all:

try {
  ...
} catch (..) {..}

in Envoy with

envoy_try {
  ...
} catch (..) {..}

where envoy_try is something like:

#define envoy_try \
  ASSERT(gettid() == main_thread_tid); \
  try

This bug tracks this proposal and implementation work. There are probably a number of data plane exceptions that still happen on worker threads and need to be fixed before the ASSERT can be merged, but we can convert to the new macro to facilitate this. We would also augment check_format to catch any raw try statements.
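A minimal sketch of the proposed macro, for illustration only. It uses std::thread::id from the standard library instead of gettid(), and plain assert instead of Envoy's ASSERT. The names SKETCH_TRY, registerMainThread, isMainThread and parsePortOrDefault are hypothetical and do not come from the Envoy codebase.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <thread>

namespace sketch {

// Hypothetical storage for the main thread's id. It is assumed to be set
// once at startup, before any worker thread exists.
inline std::thread::id& mainThreadIdStorage() {
  static std::thread::id id;
  return id;
}

inline void registerMainThread() { mainThreadIdStorage() = std::this_thread::get_id(); }

inline bool isMainThread() { return std::this_thread::get_id() == mainThreadIdStorage(); }

} // namespace sketch

// Checks the calling thread when the try block is entered, so the check
// fires at every guarded site even if no exception is ever thrown.
#define SKETCH_TRY                                                             \
  assert(sketch::isMainThread());                                              \
  try

// Example of a utility that signals failure by catching a thrown exception.
inline int parsePortOrDefault(const std::string& text, int fallback) {
  SKETCH_TRY { return std::stoi(text); } catch (const std::exception&) {
    return fallback;
  }
}
```

Because the macro expands to two statements, a sketch like this should not be used as the unbraced body of an if or a loop.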

@envoyproxy/maintainers WDYT?
CC @chaoqin-li1123 @asraa

@htuch htuch added enhancement Feature requests. Not bugs or questions. triage Issue requires triage and removed triage Issue requires triage labels Dec 8, 2020
@alyssawilk
Contributor

I like it, but wouldn't we need an integration test to catch the failure? Maybe swap it for ENVOY_BUG so if there are any slip-ups they're caught (without crashing) in prod?

If someone is willing to doc up the current use cases, we could maybe run a fixit, or encourage new Envoy devs to pick them up - things like fixing Utility::getResponseStatus would be good intro projects to Envoy

Do you think it would also be worth splitting out data plane code from control plane code by location or naming convention or some such? That way we could also catch it with fix format scripts, and avoid my integration test concerns above.

@htuch
Member Author

htuch commented Dec 8, 2020

Why wouldn't existing integration tests catch many of the failures? The only issue is that coverage is often in unit tests vs. integration tests, but it's probably not in scope to increase integration test coverage to 100%. Agreed on the ENVOY_BUG instead of ASSERT.
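To illustrate the difference between a crashing ASSERT and a non-crashing ENVOY_BUG-style check, here is a rough sketch that counts violations instead of aborting. It does not use Envoy's actual ENVOY_BUG macro; the thread-local flag, the counter and the names SKETCH_TRY_REPORT, markCurrentThreadAsMain and looksLikeNumber are all assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>
#include <string>

namespace sketch {

// Hypothetical per-thread flag. Only the main thread is expected to set it.
thread_local bool tls_is_main_thread = false;

// Number of try blocks entered on a thread that is not marked as main.
std::atomic<int> off_main_thread_try_count{0};

inline void markCurrentThreadAsMain() { tls_is_main_thread = true; }

// Records a violation without crashing, so production traffic keeps flowing.
inline void noteTry() {
  if (!tls_is_main_thread) {
    off_main_thread_try_count.fetch_add(1, std::memory_order_relaxed);
  }
}

} // namespace sketch

#define SKETCH_TRY_REPORT                                                      \
  sketch::noteTry();                                                           \
  try

inline bool looksLikeNumber(const std::string& text) {
  SKETCH_TRY_REPORT {
    (void)std::stoi(text);
    return true;
  } catch (const std::exception&) {
    return false;
  }
}
```

A real implementation would presumably log or export the counter as a stat; the sketch only shows that the check can report a slip-up without terminating the process.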

@alyssawilk
Contributor

We only throw on unexpected/invalid behavior and those corner cases are frequently unit tested rather than integration tested. I'd be pleasantly surprised if even with ENVOY_BUG it caught even half the issues in CI, especially after a quick scan of where we throw exceptions.

@htuch
Member Author

htuch commented Dec 8, 2020

The aim of wrapping the try, vs. the throw, is that we catch every site that might potentially throw, rather than wait until we actually see exceptions.

@alyssawilk
Contributor

oh yeah, you're right, I totally misread this.

So to do this, we'd have to remove the catch we still have in the codec dispatch. I think that's pretty dangerous unless we have plenty more safeguards making sure utilities (like the example I posted) don't throw. Alternately we could leave in that catchall, but I think it mostly defeats the purpose if we still have a catchall catch since I think most throws depend on that?

@htuch
Member Author

htuch commented Dec 8, 2020

After some offline discussion with @alyssawilk the goal here is not to remove top-level try, but rather to look for localized uses of try/throw patterns, e.g. in utilities that indicate a failure path by throw.

Another suggestion here is to write a CodeQL scanner that ensures that every throw has an enclosing try, and we can mark top-level try blocks (e.g. around dispatch) as indicative of a check failure.

@htuch
Member Author

htuch commented Dec 8, 2020

An example of this class of data plane exception use is

. Ideally we can identify and fix these cases and stop them creeping back in.

We should also ENVOY_BUG on throws to catch places that are not correctly wrapped.

@antoniovicente
Contributor

Seems like a good direction. Can we also have a clang-tidy check for use of try?

@alyssawilk
Contributor

I'm not sure how we'd do that if we continue to allow try to be used for control plane code (hence why I suggested splitting files out). Or maybe it would be possible to just say new data plane code should avoid try/catch as well, and blacklist new instances, but I think that'd be more controversial.

@snowp
Contributor

snowp commented Dec 8, 2020

Yeah, without some delineation of "this code is ok to run on the data plane" it's hard to prevent accidental usage of throwing utility functions. It's very easy to throw in something like a status code conversion function that appears innocent in code review but was actually intended for use within a try-catch.

Splitting the code base between data/control plane seems a bit heavy handed (what about all the code that is safely shared?), but it is a fairly low tech solution. Conceptually, I would imagine static analysis combined with annotations would be good (e.g. void foo() DATAPLANE_SAFE; that makes it noexcept and ties into linters), which would mean explicitly tagging all functions in shared code that can be used in the data plane. Might be too complicated to get right though, I bet there are plenty of edge cases.
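A small sketch of how such an annotation could look, assuming DATAPLANE_SAFE simply expands to noexcept. The macro name comes from the comment above, but its definition here and the function parseStatus are hypothetical, and the linter integration is not shown.

```cpp
#include <cassert>
#include <optional>
#include <string_view>

// Assumed definition: tagging a function makes it noexcept, so a thrown
// exception escaping it terminates instead of unwinding through data plane code.
#define DATAPLANE_SAFE noexcept

// Hypothetical status-code parser that reports failure through its return
// value rather than by throwing.
std::optional<int> parseStatus(std::string_view text) DATAPLANE_SAFE;

std::optional<int> parseStatus(std::string_view text) DATAPLANE_SAFE {
  if (text.size() != 3) {
    return std::nullopt;
  }
  int value = 0;
  for (const char c : text) {
    if (c < '0' || c > '9') {
      return std::nullopt;
    }
    value = value * 10 + (c - '0');
  }
  if (value < 100 || value > 599) {
    return std::nullopt;
  }
  return value;
}

// Callers on the data plane can verify the tag at compile time.
static_assert(noexcept(parseStatus("200")), "parseStatus must be DATAPLANE_SAFE");
```

The static_assert is one way a call site could reject an untagged utility at compile time, though it would need to be repeated per call, which hints at the edge cases mentioned above.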

Having the first step being to try to find existing violations makes sense to me, then we can work on prevention in the future.

@htuch
Member Author

htuch commented Dec 8, 2020

Control plane uses happen on the main thread (at least those that should be allowed to throw), data plane on worker threads. So, if we assert based on TID we can catch this. Admin endpoint is probably an exception that arguably shouldn't be on main thread (we don't want the admin endpoint to lock-up on large config updates for example).

I agree that new data plane code should avoid try/catch, the main issue is around utilities and nested stacks. Better structure as well would help here, but it will be a lot of work to get to the point that we could use things like build visibility rules to enforce (but arguably a good north star).

@alyssawilk
Contributor

" So, if we assert based on TID we can catch this."

where "this" is new try blocks being added, not new exceptions being added, right? I don't object, but I'm not convinced it'll catch enough to be worth the churn.

If we agree that new data plane code should avoid try/catch, I think check_format or clang tidy checks which disallow try/catch by default but could be overridden for legit reasons would result in more consistent future-proofing. We can skip the check for refactor PRs, but by default just disallow new additions.

@htuch
Member Author

htuch commented Dec 8, 2020

"If we agree that new data plane code should avoid try/catch"

How do we know if a utility is safe to use on data plane or control plane? I think this is the crux of the problem.

@alyssawilk
Contributor

How many utilities have local try-catch blocks, and wouldn't be caught by the audit you'd need to do to land that macro in the first place? I'm guessing not many, even without restricting what we look at to not-obviously based utilities. Based on my knowledge of what's integration tested, I'm dubious there'd be value add beyond the audit and fix_format checks. I'm not going to block the PR if you find someone to do the work, I just don't think it's going to get us the most exception-safety bang for our proverbial buck.

@htuch htuch changed the title Assert on data plane exceptions on worker threads Static/dynamic analysis of data plane exceptions on worker threads Dec 8, 2020
@htuch
Member Author

htuch commented Dec 8, 2020

"and wouldn't be caught by the audit you'd need to do to land that macro in the first place?"

I think 100% of them would be caught, that is the plan outlined today. The idea is to provide a regression framework for new ones.

I think it's becoming clear that a combination of things would make sense here. Worker thread ID-based ENVOY_BUG for both try/throw uses. CodeQL checks to verify that all throws are in fact caught (and not by the top-level dispatch). Annotations at a function level indicating exception safety or data/control plane compatibility.

I've updated the title to make clear the more general set of options. I think I'd defer to the person doing the actual work to prioritize amongst these.

@htuch htuch assigned htuch and chaoqin-li1123 and unassigned htuch Dec 8, 2020
@htuch
Member Author

htuch commented Dec 8, 2020

@chaoqin-li1123 has kindly volunteered to do some initial work on this issue.

@chaoqin-li1123
Member

Thanks! I will start with the thread id assertion macros and try to replace the raw try catch block one by one.

@github-actions

github-actions bot commented Jan 7, 2021

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Jan 7, 2021
@htuch htuch removed the stale stalebot believes this issue/PR has not been touched recently label Jan 8, 2021
@chaoqin-li1123
Member

The next step will be to add the try macro to the codebase, along with some examples that demonstrate its use.
#define envoy_try \
  ASSERT(Thread::MainThread::isMainThread()); \
  try
But currently the implementation of the main-thread verification utility is tightly coupled to the thread local instance. How do we make the main-thread check work in unit tests? Providing a fake thread local instance could add a lot of overhead to testing.
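One possible way to decouple the check from the thread local instance, sketched under assumptions: a tiny registry that treats every thread as acceptable until a main thread has been registered, so unit tests that never register one still pass. MainThreadRegistry and ScopedMainThread are hypothetical names, not Envoy's actual Thread::MainThread utility.

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <thread>

namespace sketch {

// Hypothetical registry that holds the main thread id, if one was registered.
class MainThreadRegistry {
public:
  static void set(std::optional<std::thread::id> id) {
    std::lock_guard<std::mutex> lock(mutex());
    slot() = id;
  }

  // Passes when nothing is registered (typical for unit tests) or when the
  // caller is the registered main thread.
  static bool isMainThreadOrUnregistered() {
    std::lock_guard<std::mutex> lock(mutex());
    return !slot().has_value() || *slot() == std::this_thread::get_id();
  }

private:
  static std::mutex& mutex() {
    static std::mutex m;
    return m;
  }
  static std::optional<std::thread::id>& slot() {
    static std::optional<std::thread::id> s;
    return s;
  }
};

// RAII helper: registers the current thread as main for the lifetime of the
// object, which a server or an integration test could create at startup.
class ScopedMainThread {
public:
  ScopedMainThread() { MainThreadRegistry::set(std::this_thread::get_id()); }
  ~ScopedMainThread() { MainThreadRegistry::set(std::nullopt); }
  ScopedMainThread(const ScopedMainThread&) = delete;
  ScopedMainThread& operator=(const ScopedMainThread&) = delete;
};

} // namespace sketch
```

The trade-off is that the check is silently disabled wherever no main thread is registered, so coverage would depend on integration tests and the server itself creating the scope.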

@chaoqin-li1123
Member

We can also remove exceptions from parseInternetAddressAndPort and parseInternetAddress:

Address::InstanceConstSharedPtr Utility::parseInternetAddressAndPort(const std::string& ip_address,
Address::InstanceConstSharedPtr Utility::parseInternetAddress(const std::string& ip_address,

Both interfaces take an IP string as input and return a shared_ptr to an instance. Currently, when parsing an IP address fails, throwWithMalformedIp is called to throw an EnvoyException. The exception can be removed by having the interface return nullptr on a parsing failure and checking the returned pointer in the caller.
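A simplified sketch of the return-nullptr approach, restricted to IPv4 and using a stand-in type. Ipv4Sketch and parseIpv4OrNull are hypothetical and are not Envoy's Address::Instance API; the parser is deliberately loose (for example, it accepts leading zeros).

```cpp
#include <array>
#include <cassert>
#include <charconv>
#include <cstdint>
#include <memory>
#include <string_view>
#include <system_error>

// Stand-in for an address instance; not Envoy's real class.
struct Ipv4Sketch {
  std::array<std::uint8_t, 4> octets;
};
using Ipv4SketchConstSharedPtr = std::shared_ptr<const Ipv4Sketch>;

// Returns nullptr on malformed input instead of throwing.
inline Ipv4SketchConstSharedPtr parseIpv4OrNull(std::string_view text) {
  Ipv4Sketch addr{};
  const char* p = text.data();
  const char* end = p + text.size();
  for (int i = 0; i < 4; ++i) {
    unsigned value = 0;
    const auto [next, ec] = std::from_chars(p, end, value);
    if (ec != std::errc() || value > 255) {
      return nullptr; // Not a number, or out of range for an octet.
    }
    addr.octets[i] = static_cast<std::uint8_t>(value);
    p = next;
    if (i < 3) {
      if (p == end || *p != '.') {
        return nullptr; // Missing separator between octets.
      }
      ++p;
    }
  }
  if (p != end) {
    return nullptr; // Trailing characters after the fourth octet.
  }
  return std::make_shared<Ipv4Sketch>(addr);
}
```

The caller then checks the pointer, for example `if (auto addr = parseIpv4OrNull(input); addr == nullptr) { /* reject the request */ }`, rather than wrapping the call in try/catch. Allocation failure in make_shared can still throw, which this sketch does not address.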

@jmarantz
Contributor

jmarantz commented Jan 11, 2021 via email

@chaoqin-li1123
Member

This may not be a perfect example, because a worker thread is not going to parse a udpaurl. I guess we may want to check the thread ID when threading is on and an exception may be thrown. We want to remove exceptions from the data plane, which is why we want the check. This is an algorithmic function, but some algorithmic functions can be involved and throw exceptions on worker threads, which is what we want to avoid.

@chaoqin-li1123
Member

To provide some context, we want to replace
try {
with
ASSERT(isMainThread); try {
so that all exceptions are thrown on the main thread.

@jmarantz
Contributor

Actually do you need to assert that you are on the main thread at all the places exceptions are thrown? That might be hard because library functions can throw.

Instead, maybe you could assert main-thread at all places where exceptions are caught. There are probably fewer of them, and maybe they'll have enough context to access the dispatcher. WDYT?

@chaoqin-li1123
Member

I see your point. Many library functions don't have enough threading context, but there may be enough context where the exceptions are caught. Actually, I don't want to assert at every throw; I plan to assert at every try, which I guess is almost equivalent to "assert main-thread at all places where exceptions are caught". I will do some investigation to see whether we have the necessary context in all the places where exceptions are caught. Good night!

@chaoqin-li1123
Member

chaoqin-li1123 commented Jan 17, 2021

I have a proposal somewhat related to this issue. Currently, the constructor of EnvoyException

EnvoyException(const std::string& message) : std::runtime_error(message) {}

takes a const string reference as its argument, which means that when we pass in a string literal, as in EnvoyException("error message"), a temporary string object is constructed, and that string is then copied by the constructor of std::runtime_error, the base class of EnvoyException. That means we allocate memory twice. Do you think it is reasonable to avoid the redundant string creation by adding another constructor?
EnvoyException(const char * message) : std::runtime_error(message) {}

@htuch
Member Author

htuch commented Jan 17, 2021

@chaoqin-li1123 probably not worth micro-optimizing, since this is the unhappy path. That said, I think we're trying to move to absl::string_view everywhere, and I'm guessing this probably has a cheap temporary construction, so I'd switch to that as the string reference type.
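A sketch of the string-view variant, using std::string_view as a stand-in for absl::string_view so the example needs only the standard library. SketchException is a hypothetical class, not the real EnvoyException.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical exception type with a single string_view constructor that
// accepts string literals, std::string and string views alike.
class SketchException : public std::runtime_error {
public:
  // std::runtime_error has no string_view overload, so a std::string is
  // still materialized here before the base class stores its own copy.
  explicit SketchException(std::string_view message)
      : std::runtime_error(std::string(message)) {}
};
```

As the comment notes, this does not by itself remove the extra string construction; it mainly unifies the parameter type, which matches the point that the unhappy path is not worth micro-optimizing.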

@chaoqin-li1123
Member

Makes sense. The cost of creating a temporary string is small compared to all the stack unwinding in an exceptional scenario.

@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Feb 22, 2021
@github-actions

github-actions bot commented Mar 2, 2021

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as completed Mar 2, 2021
@jmarantz jmarantz added no stalebot Disables stalebot from closing an issue and removed stale stalebot believes this issue/PR has not been touched recently labels Mar 2, 2021
@jmarantz jmarantz reopened this Mar 2, 2021
@chaoqin-li1123
Member

The next PR will change the error propagation of existing Envoy code. For code executed on worker threads, it will replace try/catch with another error propagation mechanism. Once every try/catch in the existing code has been removed or replaced with a try that carries the main-thread assertion, we can add a format check to disallow raw try.

jmarantz pushed a commit that referenced this issue Mar 30, 2021
… caught on main thread (#15251)

Signed-off-by: chaoqin-li1123 <[email protected]>

Commit Message: Define a macro that wraps try with an assertion that exceptions are only caught on the main thread. Currently, Envoy uses C++ exceptions for error propagation. This PR is one of the steps to address #14320. The long-term goal is to disallow raw try in Envoy core code and eliminate C++ exceptions from the data plane, which can improve exception safety. The try blocks in this PR run on the main thread and can be wrapped in a main-thread assertion without breaking any existing test. In a following PR, raw try blocks that cannot be replaced by TRY_ASSERT_MAIN_THREAD will be removed from the core codebase in favor of other error propagation mechanisms.
Additional Description:
Risk Level:
Testing: none
Docs Changes: none
Release Notes: none
Platform Specific Features: none
jmarantz pushed a commit that referenced this issue Jun 23, 2021
…me try catch pattern (#16122)

Commit Message: This is part of the effort to remove C++ exceptions from the data plane by asserting that the code is executing on the main thread when an exception is caught (#14320). Making the constructors of the instance classes (PipeInstance, Ipv6Instance, Ipv4Instance) non-throwing removes some try/catch code from Envoy.

Signed-off-by: chaoqin-li1123 <[email protected]>
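One common way to make a constructor non-throwing is to move validation into a factory function, sketched below. PipeSketch, its create function and the 108-byte path limit are assumptions for illustration and do not reflect how the referenced PR changed PipeInstance.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a pipe address type.
class PipeSketch {
public:
  // Validates the input and reports failure via nullptr plus an error string,
  // so callers need no try/catch around construction.
  static std::unique_ptr<PipeSketch> create(std::string path, std::string& error) {
    if (path.empty()) {
      error = "pipe path is empty";
      return nullptr;
    }
    if (path.size() >= kMaxPathLength) {
      error = "pipe path is too long";
      return nullptr;
    }
    return std::unique_ptr<PipeSketch>(new PipeSketch(std::move(path)));
  }

  const std::string& path() const noexcept { return path_; }

private:
  // Assumed limit, chosen to resemble a typical sun_path size on Linux.
  static constexpr std::size_t kMaxPathLength = 108;

  // The constructor only moves already-validated data, so it cannot throw.
  explicit PipeSketch(std::string path) noexcept : path_(std::move(path)) {}

  std::string path_;
};
```

Keeping the constructor private forces every caller through create, so no code path can construct an unvalidated instance.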
6 participants