Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nexus] improve external messages and make more available to clients #4573

Conversation

sunshowers
Copy link
Contributor

@sunshowers sunshowers commented Nov 29, 2023

While developing #4520, I observed that we were producing a number of error messages that were:

  • 503 Service Unavailable,
  • With only an internal message attached
  • But where the message is both safe and useful to display to clients.

This is my attempt to make the situation slightly better. To achieve this, I made a few changes:

  1. Make all the client errors carry a new MessagePair struct, which consists of an external message. (Along the way, correct the definition of e.g. the Conflict variant: it actually is an external message, not an internal one.)
  2. Define a new InsufficientCapacity variant that consists of both an external and an internal message. This variant resolves to a 507 Insufficient Storage error, and has a more helpful message than just "Service Unavailable".
  3. Turn some current 503 errors into client errors so that the message is available externally.

Looking for feedback on this approach!

Created using spr 1.3.5
@sunshowers sunshowers mentioned this pull request Nov 29, 2023
11 tasks
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look at this. Is the motivation here or aid debugging for ourselves or to help end users and operators better understand why a 503 failed? I ask because I can definitely see this being useful for developers debugging failures or for customers for some kinds of failures, but I worry this could easily wind up confusing customers more than it helps them when messages refer to implementation details instead of user-level abstractions that we've provided. Example: if the system fails to create a Crucible volume because all available disk space has been allocated, it makes sense to me that internally we'd report something like "we couldn't find a zpool [or dataset] with enough space to allocate a region". But those aren't terms that an end user or even operator would be expected to know about, right? Ultimately I think we want the system to communicate something like: "the sum total of disks, images, etc. that exist have consumed all available storage capacity of the rack. You must add capacity or delete some of these Disks." That said, the request handler is not necessarily in a position to make the leap from "I couldn't find storage" to "there is no storage and you need to add more". Ideally, I think we'd communicate something like "we couldn't find capacity" and we'd have other interfaces (alerts? a status page?) that would communicate the capacity situation and documentation that says "an error message about being unable to find capacity may mean that you're actually out -- check this status page". I appreciate this is a lot more complicated and even we decide to go this route we don't have all these things today.

I think most of the messages that are made external in this PR don't feel suitable for consumption by an operator or an end user.

What do you think of this: instead of allowing ServiceUnavailable to have external messages, create new Error variants for the ones that we want to have external messages, and map those to 503 (preserving the external message) on the way out? We could add an InsufficientCapacity variant with a way to express what resource we failed to allocate (like "storage space" or "IP addresses" or something). The reason I suggest this is to make it harder to accidentally leak implementation details in error messages. If a developer wants to expose something to the end user, they'd have to make a new variant (which I hope will cause people to consider more exactly what they're communicating to the API consumer).

I think "out of capacity" (including both physical resources and virtual ones like IP addresses) covers all the cases we'd want to expose external messages for? If we set those aside, I feel like a 503 is almost never going to be actionable for an end user and rarely so for an operator, too. The cases I usually think about for a 503 are (1) the system suffered some very transient issue (e.g., the CockroachDB node crashed and a query failed during the request) or relatedly (2) the system has suffered too many failures to do what was requested (e.g., you tried to do something with a Crucible volume, all of whose "downstairs" instances are on sleds that were removed or panicked or whatever). I don't think an end user can do anything about these. For an operator, I think we've got to communicate this some better way (e.g., alerts, a status page showing such problems, etc.). So for all these cases I feel like a more specific message would be more confusing than helpful. The capacity-related ones do feel different because they are actionable, even to end users. They might then go check how much capacity they're using, free up capacity, or just stop trying to do what they're doing.

@@ -48,17 +48,17 @@ pub fn from_diesel(e: DieselError) -> external::Error {
if let Some(sentinel) = matches_sentinel(&e, &sentinels) {
match sentinel {
NOT_ENOUGH_DATASETS_SENTINEL => {
return external::Error::unavail(
return external::Error::unavail_external(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one feels like the sort of thing I would not expect an operator to know about, let alone an end user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've turned this and the other two below into a generic "Not enough disk space" message.

"Not enough datasets to allocate disks",
);
}
NOT_ENOUGH_ZPOOL_SPACE_SENTINEL => {
return external::Error::unavail(
return external::Error::unavail_external(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one feels like a mix. Some of this seems useful to an operator but much of it (terms like "zpool" and "region") describes implementation details that feel like a bad UX to expose to operators and end users.

"Not enough zpool space to allocate disks. There may not be enough disks with space for the requested region. You may also see this if your rack is in a degraded state, or you're running the default multi-rack topology configuration in a 1-sled development environment.",
);
}
NOT_ENOUGH_UNIQUE_ZPOOLS_SENTINEL => {
return external::Error::unavail(
return external::Error::unavail_external(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly this feels like a call generator more than it's helpful. (I'm imagining a customer wondering "What's a zpool? Do I need more physical disks? Is it somehow allocating something from an API Disk?)

@@ -737,7 +737,7 @@ fn endpoint_for_authority(
let endpoint_config = config_rx.borrow();
let endpoints = endpoint_config.as_ref().ok_or_else(|| {
error!(&log, "received request with no endpoints loaded");
Error::unavail("endpoints not loaded")
Error::unavail_external("endpoints not loaded")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels confusing and unactionable for end users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this internal-only.

@@ -362,7 +362,9 @@ impl super::Nexus {
}

if instance.runtime().migration_id.is_some() {
return Err(Error::unavail("instance is already migrating"));
return Err(Error::unavail_external(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a 409 and not a 503.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, turned this into a 409.

InstanceState::Destroyed => Err(Error::ServiceUnavailable {
internal_message: format!(
| InstanceState::Failed => {
Err(Error::unavail_external(format!(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like some kind of 400, not a 503.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turned this into a 400 and cleaned up the message.

)))
}
InstanceState::Destroyed => {
Err(Error::unavail_external(format!(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same -- 400 or 409, I'd think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

400 as well.

Err(Error::ServiceUnavailable {
internal_message: format!(
"instance is in state {:?} and has no active serial console \
Err(Error::unavail_external(format!(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

400 as well.

let info = oxs.first().ok_or_else(|| Error::ServiceUnavailable {
internal_message: String::from("no oximeter collectors available"),
let info = oxs.first().ok_or_else(|| {
Error::unavail_external("no oximeter collectors available")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this makes sense for an external user, though I suspect this is only produced in calls to the internal API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted this change.

),
},
));
return Err(ActionError::action_failed(Error::unavail_external(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand what this means, but it doesn't sound like a 503.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left this untouched.

Copy link
Contributor

@gjcolombo gjcolombo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I had thought about doing something right along these lines to fix #4136 (which this PR fixes) but hadn't gotten into the particulars yet. Thanks for putting this together! (The one caveat I'll offer is that my Omicron development history isn't the longest, and there may be prior design choices here that I'm not aware of, so you may want another reviewer or two to weigh in on the overall design before merging.)

match self.0 {
MessageVariant::Internal(msg) => write!(f, "(internal: {})", msg),
MessageVariant::External { message, internal_context } => {
write!(f, ": {message}")?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would move the : back to the message in the ServiceUnavailable variant on line 57, in part to keep all those variants uniform, and in part so that if we ever, say, display_internal into a log line we don't get the leading colon. This is pretty nitpicky, though, and I don't feel especially strongly about it, especially if there's a reason to do the formatting here that I've totally overlooked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this significantly, and followed this approach in the new version.

@sunshowers
Copy link
Contributor Author

Thanks for the detailed review, @davepacheco! Yes, I'm thinking that operators are the most likely consumers of these messages, and that end users will ask operators for help if they run into an issue. My general view (mostly informed by devtools) is that status panes and such are all really useful, but providing synchronous, immediate "inner loop" feedback is crucial.

My initial work was definitely touching on "insufficient capacity" messages (just today there's been discussion in #oxide-dogfood about 503s that don't have a message attached.) I think it's fair to say that for external 503 errors, we should write informative messages.

I'll go over all the cases you pointed out and see what I think of them.

Created using spr 1.3.5
@sunshowers sunshowers changed the title [RFC] [nexus] make more 503 Service Unavailable messages available to clients [RFC] [nexus] improve external messages and make more available to clients Dec 5, 2023
@sunshowers
Copy link
Contributor Author

@davepacheco I've addressed your specific feedback, and also broadened the scope to fix some of the other issues I spotted along the way (see updated PR description, e.g. the fact that Conflict's message was mislabeled as internal when it should have been external). Would love feedback on the update.

@zephraph
Copy link
Contributor

zephraph commented Dec 6, 2023

Thanks for putting this together @sunshowers! As I was working on #4605 I came across the need for a new error code to indicate to users that they can't provision more resources because it would exceed their set capacity. I'd also created InsufficientCapacity as an apt error so I'm excited to see this. The only question I have is if a user who's trying to provision a resource passed their granted capacity should see a 503 or a 400 error. There might be a slight difference in semantics between an internal operation failing due to running out of capacity in an unexpected / uncontrollable way vs a user hitting against their quota.

Any thoughts? Ultimately I'm in favor of landing this with or without tweaking the response code as that's something we could always iterate on.

@david-crespo
Copy link
Contributor

@zephraph I don’t like 503 for what is ultimately IMO a non-exceptional occurrence. Everything is fine, they’re just out of space in their corner of the world.

@zephraph zephraph mentioned this pull request Dec 6, 2023
6 tasks
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool -- I like this direction a lot.

FYI, there are some text and semantic changes here that seem important to UX (and compatibility) but they're a little buried in more mechanical changes that made it a little hard to make sure I spotted everything.

Thanks again for taking this on -- and fixing several cases in the process!

@@ -151,8 +227,20 @@ impl Error {
///
/// This should be used for failures due possibly to invalid client input
/// or malformed requests.
pub fn invalid_request(message: &str) -> Error {
Error::InvalidRequest { message: message.to_owned() }
pub fn invalid_request(message: impl Into<String>) -> Error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general note: it feels like we've been backtracking on a lot of use of generics, especially in cross-crate interfaces, due to monomorphization causing long compile times. Does that apply here? Do we care?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's going to be at most 3 copies of this (String, &String, &str), and there isn't a big tree of nested functions underneath (where the true nastiness with compile times lies I think) so I doubt this is going to be an issue.

}

/// Generates an [`Error::MethodNotAllowed`] error with an external message.
pub fn method_not_allowed(message: impl Into<String>) -> Error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels weird to me to customize this message. It was also surprising we even have this -- I expected dropshot would produce all of these. I see where we use it when somebody attempts to delete a system resource that they're not able to. I don't think that's necessarily wrong either (though I think it could as well be a 403 or 409 or 400). But I'm a little worried people might see this and reach for it when it's not appropriate. I dunno.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's definitely a bit weird. I think 405 Method Not Allowed is typically used in spots where the method is unconditionally not permitted (e.g. POST on a GET-only endpoint). But in this case we're using it in a case where the method is conditionally not permitted, based on the input parameters. There's also the "Allow" header which RFC 9110 says MUST be returned, but we don't do that.

Based on all this, maybe this should just be 400 Invalid Request.

fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}", self.0.external_message)?;
if !self.0.internal_context.is_empty() {
write!(f, " (with internal context: {})", self.0.internal_context)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this change in formatting intentional? It feels like kind of a big one but maybe I overestimate it. Do you happen to have any examples handy from testing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this isn't so much a formatting change as much as it is an addition. Previously, we just wouldn't track any internal context for the error variants that had associated external messages, and now we do.

In any case, this only affects internal logging; user-visible messages don't change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I think I misunderstood what this was used for. I saw that this comes from with_internal_context, which elsewhere I think prepends a context message, and I thought this was replacing that. I see that's basically still there at L110. So I think I was just confused -- sorry.

Some(String::from("InvalidRequest")),
message,
),
Error::InvalidRequest { message } => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(applies to this whole function)

aside from the internal message, are any of these changing at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, no changes.

nexus/db-queries/src/db/datastore/external_ip.rs Outdated Show resolved Hide resolved
nexus/db-queries/src/db/queries/region_allocation.rs Outdated Show resolved Hide resolved
}),
| InstanceState::Failed => {
Err(Error::invalid_request(format!(
"cannot connect to serial console of {} instance",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitty, but I like the previous format a little better. "cannot connect to serial console of instance in state "Starting"" communicates clearly that "Starting" is a discrete state that the user might know about and could go look up what it means and when it would change and how to check for it, etc. "cannot connect to serial console of starting instance" makes me thing "what do you consider starting?"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I wanted to align this with cannot connect to serial console of destroyed instance below, but I think your reasoning makes sense. As a compromise I ended up using the Display impl rather than the Debug one, so it says cannot connect to serial console of instance in state "starting". (Not wedded to any particular outcome here at all, to be clear.)

nexus/src/app/vpc_router.rs Outdated Show resolved Hide resolved
@sunshowers
Copy link
Contributor Author

FYI, there are some text and semantic changes here that seem important to UX (and compatibility) but they're a little buried in more mechanical changes that made it a little hard to make sure I spotted everything.

I do apologize for that, I thought the change would be small but it ballooned very quickly.

@sunshowers
Copy link
Contributor Author

The only question I have is if a user who's trying to provision a resource passed their granted capacity should see a 503 or a 400 error.

Thoughts on 507 Insufficient Storage?

@david-crespo
Copy link
Contributor

I don't love 507 for this, mostly because it's rare and weird and kinda WebDAV-specific. It does feel like a shame that there isn't something better. On reflection, despite what I said, 400 is a stretch since there isn't anything in the request that's wrong — you could make the exact same request 5 minutes later and it would be fine. I guess a 503 would be ok. My worry is that when I see a 503, I figure the whole system is borked rather than there being some limitation on the specific operation. But with the message clarifying the cause, 503 might be best.

@sunshowers
Copy link
Contributor Author

sunshowers commented Dec 7, 2023

Hmm, I think a thing to consider is that 507 is a very specific error that won't be used (now or in the future) for anything else. Agreed that it comes from WebDAV but I've found other WebDAV-originating codes like 422 Unprocessable Entity/Content to be very useful outside of WebDAV.

I think if 400 is ruled out, between 503 and 507 I would prefer 507 because lots of other things can generate 503.

@david-crespo
Copy link
Contributor

That's an interesting point, that the weirdness is a strength — no one will assume they know what it means. Persuasive!

Created using spr 1.3.5
Created using spr 1.3.5
@sunshowers sunshowers changed the title [RFC] [nexus] improve external messages and make more available to clients [nexus] improve external messages and make more available to clients Dec 8, 2023
Created using spr 1.3.5
@sunshowers sunshowers enabled auto-merge (squash) December 8, 2023 21:28
@sunshowers sunshowers merged commit 2a3db41 into main Dec 8, 2023
19 of 20 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/rfc-nexus-make-more-503-service-unavailable-messages-available-to-clients branch December 8, 2023 22:39
@zephraph
Copy link
Contributor

@ahl suggested I dig into how AWS/GCP handles errors a bit to get a sense of how we might differentiate errors that are human controllable (a user is trying to allocate resources that would bust their quota) vs things outside of their control (not enough hardware to take the operation). AWS has some good documentation on their errors (bifurcated on client vs server) with rich descriptions of each: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html. As I'd somewhat suggested, AWS gives 400 for client errors which include things like trying to allocate a resource that would exceed your resource limit. They give 500 errors for things outside the user's control.

As @sunshowers and I were discussing this we both agree that this means we likely need to continue increasing the explicitness of our errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants