Skip to content

Commit

Permalink
Remove catch_unwind wrapper around the rust query planner (#6397)
Browse files Browse the repository at this point in the history
This PR removes catch_unwind around the invocation of the rust query planner. As part of this change, I have changed all applicable `panic!()`s in `apollo-federation`, as well as remove adding a backtrace capture to `SingleFederationErrors`.

### Details
#### `panic()!` at the `apollo-federation`

There were several panic!() marcos still in use in `apollo-federation`. Instead of panicking, these functions now return the expected FederationError.

If you're searching in the code, you'll still see a few panics. They are in tests, in macros or function annotated with `#[cfg(test)]`, or in [unused] composition part of the codebase. More specifically:
* in internally used [CLI](https://github.com/apollographql/router/blob/85f99f19e2a2907bd5597687773d68d63644a0c6/apollo-federation/cli/src/main.rs#L214)
* in tests:
  * [operation tests](https://github.com/apollographql/router/blob/85f99f19e2a2907bd5597687773d68d63644a0c6/apollo-federation/src/operation/tests/mod.rs)
  * [build query graph tests](https://github.com/apollographql/router/blob/85f99f19e2a2907bd5597687773d68d63644a0c6/apollo-federation/src/query_graph/build_query_graph.rs#L2448)
  * [api schema tests](https://github.com/apollographql/router/blob/85f99f19e2a2907bd5597687773d68d63644a0c6/apollo-federation/tests/api_schema.rs)
  * [build query plan support tests](https://github.com/apollographql/router/blob/85f99f19e2a2907bd5597687773d68d63644a0c6/apollo-federation/tests/query_plan/build_query_plan_support.rs)
  * [context build query plan test](https://github.com/apollographql/router/blob/85f99f19e2a2907bd5597687773d68d63644a0c6/apollo-federation/tests/query_plan/build_query_plan_tests/context.rs)
  * [fetch operation name tests](https://github.com/apollographql/router/blob/85f99f19e2a2907bd5597687773d68d63644a0c6/apollo-federation/tests/query_plan/build_query_plan_tests/fetch_operation_names.rs)
* test helpers annotated with `#[cfg(test)]`:
  * [connectors' json selection test helper](https://github.com/apollographql/router/blob/85f99f19e2a2907bd5597687773d68d63644a0c6/apollo-federation/src/sources/connect/json_selection/apply_to.rs#L116)
  * [connectors selection! macro](https://github.com/apollographql/router/blob/85f99f19e2a2907bd5597687773d68d63644a0c6/apollo-federation/src/sources/connect/json_selection/helpers.rs#L10)
* [argument merge strategies](https://github.com/apollographql/router/blob/85f99f19e2a2907bd5597687773d68d63644a0c6/apollo-federation/src/schema/argument_composition_strategies.rs) used solely in composition

#### Removing backtrace capture
We were adding backtrace captures to `SingleFederationErrors` in order to help with debugging unexpected errors while were still developing the rust query planner. Capturing backtraces is quite often [a slow and memory intensive process](https://doc.rust-lang.org/stable/std/backtrace/struct.Backtrace.html#method.capture), so we want to remove this additional functionality for rust query planner's GA. Since we now have a lot of confidence in this implementation and have not faced a panic across millions of operations that were tested and compared, this is a safe thing to remove. 

**Note:** The compliance CI is failing at the moment, and will be fixed after #6395 is merged 

<!-- ROUTER-909 -->
  • Loading branch information
lrlna authored Dec 6, 2024
1 parent 215ca5f commit d169b2f
Show file tree
Hide file tree
Showing 8 changed files with 54 additions and 89 deletions.
15 changes: 15 additions & 0 deletions .changesets/maint_lrlna_remove_catch_unwind.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@

### Remove catch_unwind wrapper around the native query planner ([PR #6397](https://github.com/apollographql/router/pull/6397))

As part of internal maintenance of the query planner, we are removing the
catch_unwind wrapper around the native query planner. This wrapper was used as
an extra safeguard for potential panics the native planner could produce. The
native query planner no longer has any code paths that could panic. We have also
not witnessed a panic in the last four months, having processed 560 million real
user operations through the native planner.

This maintenance work also removes backtrace capture for federation errors which
was used for debugging and is no longer necessary as we have the confidence in
the native planner's implementation.

By [@lrlna](https://github.com/lrlna) in https://github.com/apollographql/router/pull/6397
28 changes: 5 additions & 23 deletions apollo-federation/src/error/mod.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,3 @@
use std::backtrace::Backtrace;
use std::cmp::Ordering;
use std::fmt::Display;
use std::fmt::Formatter;
Expand Down Expand Up @@ -523,8 +522,8 @@ pub struct MultipleFederationErrors {
impl MultipleFederationErrors {
pub fn push(&mut self, error: FederationError) {
match error {
FederationError::SingleFederationError { inner, .. } => {
self.errors.push(inner);
FederationError::SingleFederationError(error) => {
self.errors.push(error);
}
FederationError::MultipleFederationErrors(errors) => {
self.errors.extend(errors.errors);
Expand Down Expand Up @@ -591,22 +590,14 @@ impl Display for AggregateFederationError {
}
}

/// Work around thiserror, which when an error field has a type named `Backtrace`
/// "helpfully" implements `Error::provides` even though that API is not stable yet:
/// <https://github.com/rust-lang/rust/issues/99301>
type ThiserrorTrustMeThisIsTotallyNotABacktrace = Backtrace;

// PORT_NOTE: Often times, JS functions would either throw/return a GraphQLError, return a vector
// of GraphQLErrors, or take a vector of GraphQLErrors and group them together under an
// AggregateGraphQLError which itself would have a specific error message and code, and throw that.
// We represent all these cases with an enum, and delegate to the members.
#[derive(thiserror::Error)]
pub enum FederationError {
#[error("{inner}")]
SingleFederationError {
inner: SingleFederationError,
trace: ThiserrorTrustMeThisIsTotallyNotABacktrace,
},
#[error(transparent)]
SingleFederationError(#[from] SingleFederationError),
#[error(transparent)]
MultipleFederationErrors(#[from] MultipleFederationErrors),
#[error(transparent)]
Expand All @@ -616,22 +607,13 @@ pub enum FederationError {
impl std::fmt::Debug for FederationError {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
match self {
Self::SingleFederationError { inner, trace } => write!(f, "{inner}\n{trace}"),
Self::SingleFederationError(inner) => std::fmt::Debug::fmt(inner, f),
Self::MultipleFederationErrors(inner) => std::fmt::Debug::fmt(inner, f),
Self::AggregateFederationError(inner) => std::fmt::Debug::fmt(inner, f),
}
}
}

impl From<SingleFederationError> for FederationError {
fn from(inner: SingleFederationError) -> Self {
Self::SingleFederationError {
inner,
trace: Backtrace::capture(),
}
}
}

impl From<DiagnosticList> for FederationError {
fn from(value: DiagnosticList) -> Self {
SingleFederationError::from(value).into()
Expand Down
14 changes: 7 additions & 7 deletions apollo-federation/src/query_plan/fetch_dependency_graph.rs
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ use crate::bail;
use crate::display_helpers::DisplayOption;
use crate::error::FederationError;
use crate::error::SingleFederationError;
use crate::internal_error;
use crate::link::graphql_definition::DeferDirectiveArguments;
use crate::operation::ArgumentList;
use crate::operation::ContainmentOptions;
Expand Down Expand Up @@ -1709,7 +1710,7 @@ impl FetchDependencyGraph {
children.push(child_index);
} else {
let Some(child_defer_ref) = &child.defer_ref else {
panic!(
bail!(
"{} has defer_ref `{}`, so its child {} cannot have a top-level defer_ref.",
node.display(node_index),
DisplayOption(node.defer_ref.as_ref()),
Expand Down Expand Up @@ -2666,12 +2667,11 @@ impl FetchDependencyGraphNode {
};
let operation = operation_compression.compress(operation)?;
let operation_document = operation.try_into().map_err(|err| match err {
FederationError::SingleFederationError {
inner: SingleFederationError::InvalidGraphQL { diagnostics },
..
} => FederationError::internal(format!(
FederationError::SingleFederationError(SingleFederationError::InvalidGraphQL {
diagnostics,
}) => internal_error!(
"Query planning produced an invalid subgraph operation.\n{diagnostics}"
)),
),
_ => err,
})?;

Expand Down Expand Up @@ -3392,7 +3392,7 @@ impl DeferTracking {

if let Some(parent_ref) = &defer_context.current_defer_ref {
let Some(parent_info) = self.deferred.get_mut(parent_ref) else {
panic!("Cannot find info for parent {parent_ref} or {label}");
bail!("Cannot find info for parent {parent_ref} or {label}")
};

parent_info.deferred.insert(label.clone());
Expand Down
5 changes: 3 additions & 2 deletions apollo-federation/src/query_plan/query_planner.rs
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ use tracing::trace;

use super::fetch_dependency_graph::FetchIdGenerator;
use super::ConditionNode;
use crate::bail;
use crate::error::FederationError;
use crate::error::SingleFederationError;
use crate::operation::normalize_operation;
Expand Down Expand Up @@ -432,10 +433,10 @@ impl QueryPlanner {
.root_kinds_to_nodes()?
.get(&normalized_operation.root_kind)
else {
panic!(
bail!(
"Shouldn't have a {0} operation if the subgraphs don't have a {0} root",
normalized_operation.root_kind
);
)
};

let operation_compression = if self.config.generate_query_fragments {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,8 @@ impl ApplyToError {
}))
}

// This macro is useful for tests, but it absolutely should never be used with
// dynamic input at runtime, since it panics for any input that's not JSON.
#[cfg(test)]
fn from_json(json: &JSON) -> Self {
if let JSON::Object(error) = json {
Expand Down
16 changes: 4 additions & 12 deletions apollo-router/src/executable.rs
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
//! Main entry point for CLI command to start server.
use std::cell::Cell;
use std::env;
use std::fmt::Debug;
use std::net::SocketAddr;
Expand Down Expand Up @@ -751,18 +750,11 @@ fn setup_panic_handler() {
} else {
tracing::error!("{}", e)
}
if !USING_CATCH_UNWIND.get() {
// Once we've panic'ed the behaviour of the router is non-deterministic
// We've logged out the panic details. Terminate with an error code
std::process::exit(1);
}
}));
}

// TODO: once the Rust query planner does not use `todo!()` anymore,
// remove this and the use of `catch_unwind` to call it.
thread_local! {
pub(crate) static USING_CATCH_UNWIND: Cell<bool> = const { Cell::new(false) };
// Once we've panic'ed the behaviour of the router is non-deterministic
// We've logged out the panic details. Terminate with an error code
std::process::exit(1);
}));
}

static COPIED: AtomicBool = AtomicBool::new(false);
Expand Down
5 changes: 1 addition & 4 deletions apollo-router/src/query_planner/bridge_query_planner.rs
Original file line number Diff line number Diff line change
Expand Up @@ -170,10 +170,7 @@ impl PlannerMode {
let result = QueryPlanner::new(schema.federation_supergraph(), config);

match &result {
Err(FederationError::SingleFederationError {
inner: error,
trace: _,
}) => match error {
Err(FederationError::SingleFederationError(error)) => match error {
SingleFederationError::UnsupportedFederationVersion { .. } => {
metric_rust_qp_init(Some(UNSUPPORTED_FED1));
}
Expand Down
58 changes: 17 additions & 41 deletions apollo-router/src/query_planner/dual_query_planner.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ use apollo_compiler::validation::Valid;
use apollo_compiler::ExecutableDocument;
use apollo_compiler::Name;
use apollo_compiler::Node;
use apollo_federation::error::FederationError;
use apollo_federation::query_plan::query_planner::QueryPlanOptions;
use apollo_federation::query_plan::query_planner::QueryPlanner;
use apollo_federation::query_plan::QueryPlan;
Expand All @@ -24,7 +25,6 @@ use super::fetch::SubgraphOperation;
use super::subscription::SubscriptionNode;
use super::FlattenNode;
use crate::error::format_bridge_errors;
use crate::executable::USING_CATCH_UNWIND;
use crate::query_planner::bridge_query_planner::metric_query_planning_plan_duration;
use crate::query_planner::bridge_query_planner::JS_QP_MODE;
use crate::query_planner::bridge_query_planner::RUST_QP_MODE;
Expand Down Expand Up @@ -80,47 +80,23 @@ impl BothModeComparisonJob {
}

fn execute(self) {
// TODO: once the Rust query planner does not use `todo!()` anymore,
// remove `USING_CATCH_UNWIND` and this use of `catch_unwind`.
let rust_result = std::panic::catch_unwind(|| {
let name = self
.operation_name
.clone()
.map(Name::try_from)
.transpose()?;
USING_CATCH_UNWIND.set(true);

let start = Instant::now();

// No question mark operator or macro from here …
let result =
let start = Instant::now();

let rust_result = self
.operation_name
.as_deref()
.map(|n| Name::new(n).map_err(FederationError::from))
.transpose()
.and_then(|operation| {
self.rust_planner
.build_query_plan(&self.document, name, self.plan_options);

let elapsed = start.elapsed().as_secs_f64();
metric_query_planning_plan_duration(RUST_QP_MODE, elapsed);

metric_query_planning_plan_both_comparison_duration(RUST_QP_MODE, elapsed);
metric_query_planning_plan_both_comparison_duration(JS_QP_MODE, self.js_duration);

// … to here, so the thread can only eiher reach here or panic.
// We unset USING_CATCH_UNWIND in both cases.
USING_CATCH_UNWIND.set(false);
result
})
.unwrap_or_else(|panic| {
USING_CATCH_UNWIND.set(false);
Err(apollo_federation::error::FederationError::internal(
format!(
"query planner panicked: {}",
panic
.downcast_ref::<String>()
.map(|s| s.as_str())
.or_else(|| panic.downcast_ref::<&str>().copied())
.unwrap_or_default()
),
))
});
.build_query_plan(&self.document, operation, self.plan_options)
});

let elapsed = start.elapsed().as_secs_f64();
metric_query_planning_plan_duration(RUST_QP_MODE, elapsed);

metric_query_planning_plan_both_comparison_duration(RUST_QP_MODE, elapsed);
metric_query_planning_plan_both_comparison_duration(JS_QP_MODE, self.js_duration);

let name = self.operation_name.as_deref();
let operation_desc = if let Ok(operation) = self.document.operations.get(name) {
Expand Down

0 comments on commit d169b2f

Please sign in to comment.