Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved Error Handling #625

Merged
merged 18 commits into from
Feb 21, 2022
Merged

Improved Error Handling #625

merged 18 commits into from
Feb 21, 2022

Conversation

rylev
Copy link
Contributor

@rylev rylev commented Jan 24, 2022

This is a proposal for a new form of error handling in the SDKs I have been working on with @yoshuawuyts. This is an attempt to reduce the large amount of errors into as compact a form as possible while still offering good ergonomics and quality error messages.

In the grand scheme of things, this proposal is not a radical departure from the current approach taken, but is an attempt to rethink error handling from first principles.

The current approach

The current approach to error handling is to create a large, all-encompassing error type modeled as a Rust enum. This looks like this:

/// A general Azure error type.
#[non_exhaustive]
#[derive(Debug, thiserror::Error)]
pub enum Error {
    #[error("parse error")]
    Parse(#[from] ParseError),
    #[error("error getting token")]
    GetToken(#[source] Box<dyn std::error::Error + Send + Sync>),
    #[error("http error")]
    Http(#[from] HttpError),
    /// .... 
}

This approach does have some nice properties:

  • The thiserror crate makes it pretty easy to declare new error variants together with how they are displayed to the user with minimal boiler plate and that conform to best practices for error types including provide an implementation for the Error::source method which shows what other error caused this error.
  • Error messages are relatively consistent (i.e., if two parsing errors can happen in two different SDKs, they'll largely have the same error message)
  • More error types can be added without breaking existing user code (through the use of non_exhaustive.
  • Conversions from 3rd party error types to this error type are generally fairly easy.

However, this approach does have downsides:

  • The specific variant of Error in question is tied directly to the underlying data or error cause. For example, in the code above, an Error::Parse must always have a corresponding ParseError. This encourages an ever expanding tree of errors.
  • Context dependent error messages are hard to add. Since the data associated with an error is directly tied to the error variant, it's hard to add additional context to errors that happen for unique reasons. For example,. say a parsing error when the key "foo" must have a value with the prefix "bar" - this happens only in one place in all the SDKs. If you want to display to the user this important context, you need to add a new variant to the ParseError type that explains this. This adds even more pressure for an ever expanding tree of error types.

The above list is not exhaustive but in general should give you the impression that the way we're modelling errors encourages error types with lots of variants and sub errors. This is exactly what we see in practice, where the error types in azure_core are bloated and often hard to keep track of.

What do we need?

Before getting into solutions, I'd like to briefly discuss the general needs we have from our errors. Errors are generally used for two things:

  • Control flow (i.e., the user inspects the errors and writes code specifically to handle a certain error case)
  • Error messages (i.e., the user just needs to display an error to the end user or developer consuming the library)

The more important of these two for our purposes is likely error messages. While no doubt, users of the library will want to write code that inspects some errors and handles them in special ways, the large majority of errors will be propagated up the stack either through use of ? or by panicking.

Therefore, we need errors that can handle control flow but more importantly, encourage good error messages.

The status quo provides ok error messages and excels at handling control flow, the exact opposite of our needs.

A proposed solution

This PR introduces a proposal for a new Error type that eventually will completely replace the current Error type (but that still works with it so that the transition does not need to happen all at once).

This new error is based closely on std::io::Error. Essentially the new error type wraps other errors and/or an error message along with an ErrorKind that roughly corresponds to the category of error being modeled.

First, let's see how this new error handles control flow (arguably the weaker of its capabilities):

Control Flow

 match error.kind() {
     ErrorKind::UnexpectedOperation { status } if status == 503 => {
           // TODO: handle 503 errors specially
     }
     ErrorKind::Io => {
          if let Some(e) = error.downcast_ref::<std::io::Error>() {
               if e.kind() == std::io::ErrorKind::ConnectionAborted {
                   // TODO: handle an aborted connection
               }
          }
     }
     _ => return Err(error),
}

As you can see from this example, control flow is still very much possible, but does now rely on runtime downcasting for control flow of errors that we generally expect users to not have to handle. For example, in this case, the user wants to handle ConnectionAborted but we expect this to not be a super common use case as the SDK already does automatic retries and a policy will likely be there to handle offline scenarios in the general case.

NOTE: the rest of this section is no longer true and has been removed from the design.

The biggest difference with the current design is that the new Error is actually generic over some type that represents operation specific errors. For example, if we're dealing with an error in the Cosmos create_database operation, the error type would be Error<CreateDatabaseError>. The CreateDatabaseError would include all the documented cases specific to CreateDatabase - for example, a bad request or a conflict. These errors are the ones we generally expect users to want to handle, and they are handled in a first class way:

 if let ErrorKind::Operation(CreateDatabaseError::Conflict) = error.kind() {
     /// TODO: handle conflict
}

Error messages

The new Error type comes with an improved ability to provide good error messages. The Error type conforms to best practices of the std::error::Error including providing a source for each error and displaying errors in such a way that plays nicely with ecosystem crates like eyre for displaying good error messages.

Included are conveniences like extensions to the Result type which allow for providing additional context to the error:

HeaderValue::from_str(&self.header).with_context(
    crate::error::ErrorKind::DataConversion,
    || {
        format!(
            "user agent '{}' cannot be serialized as an ascii-only HTTP header",
            self.header
        )
    },
)?;

Comparison

The new error type improves upon the old error type by not encouraging a growing tree of error types. It provides conveniences for providing good error messages while prioritizing a small error type that plays well with the ecosystem, provides good error messages, and provides control flow handling for cases where that handling is most likely needed.

@heaths
Copy link
Member

heaths commented Jan 24, 2022

From what crate is HttpError? Across our other language SDKs, we all have some sort of general HTTP error that includes the status code (integer) and status message i.e. description of the status code e.g. 400 is "Bad request" - typically from the standard Azure error response.

Ah, we define it and do have a generic StatusCode:

#[error("HTTP error status (status: {:?}, body: {:?})", status, body)]
StatusCode { status: StatusCode, body: String },

Though, shouldn't the subsequent Utf8 error be more of a parse error? Sure, the service messed up, but the HTTP response itself would still likely indicate some sort of status like 200.

@yoshuawuyts
Copy link
Contributor

From what crate is HttpError?

This is from the hyperium/http crate. This crate is used in the hyper and reqwest crates, and in order to wrap the errors in our Error type we need to write conversions to it.


Though, shouldn't the subsequent Utf8 error be more of a parse error? Sure, the service messed up, but the HTTP response itself would still likely indicate some sort of status like 200.

I'm not sure which Utf8 error this refers to, but in general the idea is that each resource defines its own set of error conditions it expects from the server, and for all other status-code related errors we assign them ErrorKind::UnexpectedOperation. Here are the error kinds we currently have:

/// The kind of error
///
/// The classification of error is intentionally fairly coarse.
#[derive(Copy, Clone, Debug, PartialEq)]
pub enum ErrorKind<O> {
    /// An error specific to a certain operation
    Operation(O),
    /// An HTTP status code that was not expected
    UnexpectedOperation { status: u16 },
    /// An error performing IO
    Io,
    /// An error converting data
    DataConversion,
    /// An error getting an API credential token
    Credential,
    /// A catch all for other kinds of errors
    Other,
}

Every operation is expected to return a Result<Data, azure_core::Error<OperationError>> where we encode the various known error cases into the OperationError type. This means that for example for cosmos::CreateDatabase which defines the following error codes, we encode these into an operation error like so:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
/// Expected errors which may be returned by the `CreateDatabase` operation.
pub enum CreateDatabaseError {
    /// Returned when the JSON body is invalid. Check for missing curly brackets or quotes.
    BadRequest,
    /// Returned when the ID provided for the new database has been taken by an existing database.
    Conflict,
}

If the server returns a statuscode which doesn't match any of the expected error kinds, we then assign it to an error with kind ErrorKind::UnexpectedOperation which contains the status code and message body.

Other error kinds such as ErrorKind::Io and ErrorKind::DataConversion are intended to be used to encode errors originating in our pipelines and backing transports.


Overall it makes sense that this is quite a bit to cover, and I'm hoping we can find some time in the sync meeting later today to walk through how the error handling scheme we propose in this PR works.

sdk/core/src/error/mod.rs Outdated Show resolved Hide resolved
sdk/core/src/error/mod.rs Outdated Show resolved Hide resolved
@cataggar cataggar added this to the azure_core 0.2.0 milestone Jan 26, 2022
@cataggar cataggar added the Azure.Core The azure_core crate label Jan 26, 2022
sdk/core/src/error/azure_core_errors.rs Show resolved Hide resolved
sdk/core/src/error/mod.rs Outdated Show resolved Hide resolved
sdk/core/src/error/mod.rs Show resolved Hide resolved
sdk/core/src/error/mod.rs Outdated Show resolved Hide resolved
sdk/core/src/error/mod.rs Show resolved Hide resolved

/// An extention to the `Result` type that easy allows creating `Error` values from exsiting errors
///
/// This trait should not be implemented on custom types and is meant for usage with `Result`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use the sealed trait pattern to enforce this

@rylev
Copy link
Contributor Author

rylev commented Feb 2, 2022

This PR is in a state where we could merge it. However, there are additional elements that I would like to follow up on - either in this PR or a subsequent one:

  • We currently heavily favor adding context to errors by adding string messages. It was pointed out by @nrc that this makes things like structured logging difficult. Instead it would be nice to explore how more structured data can be used to add context to errors.
  • Sealing the ResultExt trait so that others don't implement it
  • @yoshuawuyts and I have had subsequent chats about providing a "best effort" mechanism for structured UnsuccessfulResponse errors which gracefully fallback to what we're currently doing now (providing the raw HTTP status code and Azure error_code to the user). The team seemed happy with just providing the raw status code and error_code, but providing structured error conditions where possible does appear to be a better user experience. Since this is hard to change in the future, I'd like to make sure we've really examined all possibilities before settling on one.

@rylev
Copy link
Contributor Author

rylev commented Feb 21, 2022

Going to merge this. For the above three points, we decided:

  • more structured error reporting is a good thing, but we lack a clear vision for how exactly to do this. We'll continue exploring, but we don't want to delay merging this more.
  • I've made the ResultExt trait sealed
  • Talking this over with @JeffreyRichter, we confirmed that we do not want to make a best effort attempt at enumerating possible error conditions. The other language SDKs have tried this and it's resulted in worse user experiences.

@yoshuawuyts yoshuawuyts merged commit 58fcc4e into Azure:main Feb 21, 2022
@cataggar cataggar changed the title [Proposal] Error Handling Improved Error Handling Mar 30, 2022
@rylev rylev deleted the new-error branch August 23, 2022 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Azure.Core The azure_core crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants