-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[nexus] improve external messages and make more available to clients #4573
[nexus] improve external messages and make more available to clients #4573
Conversation
Created using spr 1.3.5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look at this. Is the motivation here or aid debugging for ourselves or to help end users and operators better understand why a 503 failed? I ask because I can definitely see this being useful for developers debugging failures or for customers for some kinds of failures, but I worry this could easily wind up confusing customers more than it helps them when messages refer to implementation details instead of user-level abstractions that we've provided. Example: if the system fails to create a Crucible volume because all available disk space has been allocated, it makes sense to me that internally we'd report something like "we couldn't find a zpool [or dataset] with enough space to allocate a region". But those aren't terms that an end user or even operator would be expected to know about, right? Ultimately I think we want the system to communicate something like: "the sum total of disks, images, etc. that exist have consumed all available storage capacity of the rack. You must add capacity or delete some of these Disks." That said, the request handler is not necessarily in a position to make the leap from "I couldn't find storage" to "there is no storage and you need to add more". Ideally, I think we'd communicate something like "we couldn't find capacity" and we'd have other interfaces (alerts? a status page?) that would communicate the capacity situation and documentation that says "an error message about being unable to find capacity may mean that you're actually out -- check this status page". I appreciate this is a lot more complicated and even we decide to go this route we don't have all these things today.
I think most of the messages that are made external in this PR don't feel suitable for consumption by an operator or an end user.
What do you think of this: instead of allowing ServiceUnavailable
to have external messages, create new Error
variants for the ones that we want to have external messages, and map those to 503 (preserving the external message) on the way out? We could add an InsufficientCapacity
variant with a way to express what resource we failed to allocate (like "storage space" or "IP addresses" or something). The reason I suggest this is to make it harder to accidentally leak implementation details in error messages. If a developer wants to expose something to the end user, they'd have to make a new variant (which I hope will cause people to consider more exactly what they're communicating to the API consumer).
I think "out of capacity" (including both physical resources and virtual ones like IP addresses) covers all the cases we'd want to expose external messages for? If we set those aside, I feel like a 503 is almost never going to be actionable for an end user and rarely so for an operator, too. The cases I usually think about for a 503 are (1) the system suffered some very transient issue (e.g., the CockroachDB node crashed and a query failed during the request) or relatedly (2) the system has suffered too many failures to do what was requested (e.g., you tried to do something with a Crucible volume, all of whose "downstairs" instances are on sleds that were removed or panicked or whatever). I don't think an end user can do anything about these. For an operator, I think we've got to communicate this some better way (e.g., alerts, a status page showing such problems, etc.). So for all these cases I feel like a more specific message would be more confusing than helpful. The capacity-related ones do feel different because they are actionable, even to end users. They might then go check how much capacity they're using, free up capacity, or just stop trying to do what they're doing.
@@ -48,17 +48,17 @@ pub fn from_diesel(e: DieselError) -> external::Error { | |||
if let Some(sentinel) = matches_sentinel(&e, &sentinels) { | |||
match sentinel { | |||
NOT_ENOUGH_DATASETS_SENTINEL => { | |||
return external::Error::unavail( | |||
return external::Error::unavail_external( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one feels like the sort of thing I would not expect an operator to know about, let alone an end user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've turned this and the other two below into a generic "Not enough disk space" message.
"Not enough datasets to allocate disks", | ||
); | ||
} | ||
NOT_ENOUGH_ZPOOL_SPACE_SENTINEL => { | ||
return external::Error::unavail( | ||
return external::Error::unavail_external( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one feels like a mix. Some of this seems useful to an operator but much of it (terms like "zpool" and "region") describes implementation details that feel like a bad UX to expose to operators and end users.
"Not enough zpool space to allocate disks. There may not be enough disks with space for the requested region. You may also see this if your rack is in a degraded state, or you're running the default multi-rack topology configuration in a 1-sled development environment.", | ||
); | ||
} | ||
NOT_ENOUGH_UNIQUE_ZPOOLS_SENTINEL => { | ||
return external::Error::unavail( | ||
return external::Error::unavail_external( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly this feels like a call generator more than it's helpful. (I'm imagining a customer wondering "What's a zpool? Do I need more physical disks? Is it somehow allocating something from an API Disk
?)
nexus/src/app/external_endpoints.rs
Outdated
@@ -737,7 +737,7 @@ fn endpoint_for_authority( | |||
let endpoint_config = config_rx.borrow(); | |||
let endpoints = endpoint_config.as_ref().ok_or_else(|| { | |||
error!(&log, "received request with no endpoints loaded"); | |||
Error::unavail("endpoints not loaded") | |||
Error::unavail_external("endpoints not loaded") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels confusing and unactionable for end users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this internal-only.
nexus/src/app/instance.rs
Outdated
@@ -362,7 +362,9 @@ impl super::Nexus { | |||
} | |||
|
|||
if instance.runtime().migration_id.is_some() { | |||
return Err(Error::unavail("instance is already migrating")); | |||
return Err(Error::unavail_external( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like a 409 and not a 503.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, turned this into a 409.
nexus/src/app/instance.rs
Outdated
InstanceState::Destroyed => Err(Error::ServiceUnavailable { | ||
internal_message: format!( | ||
| InstanceState::Failed => { | ||
Err(Error::unavail_external(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like some kind of 400, not a 503.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turned this into a 400 and cleaned up the message.
nexus/src/app/instance.rs
Outdated
))) | ||
} | ||
InstanceState::Destroyed => { | ||
Err(Error::unavail_external(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same -- 400 or 409, I'd think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
400 as well.
nexus/src/app/instance.rs
Outdated
Err(Error::ServiceUnavailable { | ||
internal_message: format!( | ||
"instance is in state {:?} and has no active serial console \ | ||
Err(Error::unavail_external(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
400 as well.
nexus/src/app/oximeter.rs
Outdated
let info = oxs.first().ok_or_else(|| Error::ServiceUnavailable { | ||
internal_message: String::from("no oximeter collectors available"), | ||
let info = oxs.first().ok_or_else(|| { | ||
Error::unavail_external("no oximeter collectors available") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this makes sense for an external user, though I suspect this is only produced in calls to the internal API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted this change.
), | ||
}, | ||
)); | ||
return Err(ActionError::action_failed(Error::unavail_external( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I understand what this means, but it doesn't sound like a 503.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left this untouched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I had thought about doing something right along these lines to fix #4136 (which this PR fixes) but hadn't gotten into the particulars yet. Thanks for putting this together! (The one caveat I'll offer is that my Omicron development history isn't the longest, and there may be prior design choices here that I'm not aware of, so you may want another reviewer or two to weigh in on the overall design before merging.)
common/src/api/external/error.rs
Outdated
match self.0 { | ||
MessageVariant::Internal(msg) => write!(f, "(internal: {})", msg), | ||
MessageVariant::External { message, internal_context } => { | ||
write!(f, ": {message}")?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would move the :
back to the message in the ServiceUnavailable
variant on line 57, in part to keep all those variants uniform, and in part so that if we ever, say, display_internal
into a log line we don't get the leading colon. This is pretty nitpicky, though, and I don't feel especially strongly about it, especially if there's a reason to do the formatting here that I've totally overlooked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this significantly, and followed this approach in the new version.
Thanks for the detailed review, @davepacheco! Yes, I'm thinking that operators are the most likely consumers of these messages, and that end users will ask operators for help if they run into an issue. My general view (mostly informed by devtools) is that status panes and such are all really useful, but providing synchronous, immediate "inner loop" feedback is crucial. My initial work was definitely touching on "insufficient capacity" messages (just today there's been discussion in I'll go over all the cases you pointed out and see what I think of them. |
Created using spr 1.3.5
@davepacheco I've addressed your specific feedback, and also broadened the scope to fix some of the other issues I spotted along the way (see updated PR description, e.g. the fact that |
Thanks for putting this together @sunshowers! As I was working on #4605 I came across the need for a new error code to indicate to users that they can't provision more resources because it would exceed their set capacity. I'd also created Any thoughts? Ultimately I'm in favor of landing this with or without tweaking the response code as that's something we could always iterate on. |
@zephraph I don’t like 503 for what is ultimately IMO a non-exceptional occurrence. Everything is fine, they’re just out of space in their corner of the world. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool -- I like this direction a lot.
FYI, there are some text and semantic changes here that seem important to UX (and compatibility) but they're a little buried in more mechanical changes that made it a little hard to make sure I spotted everything.
Thanks again for taking this on -- and fixing several cases in the process!
@@ -151,8 +227,20 @@ impl Error { | |||
/// | |||
/// This should be used for failures due possibly to invalid client input | |||
/// or malformed requests. | |||
pub fn invalid_request(message: &str) -> Error { | |||
Error::InvalidRequest { message: message.to_owned() } | |||
pub fn invalid_request(message: impl Into<String>) -> Error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
general note: it feels like we've been backtracking on a lot of use of generics, especially in cross-crate interfaces, due to monomorphization causing long compile times. Does that apply here? Do we care?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's going to be at most 3 copies of this (String
, &String
, &str
), and there isn't a big tree of nested functions underneath (where the true nastiness with compile times lies I think) so I doubt this is going to be an issue.
common/src/api/external/error.rs
Outdated
} | ||
|
||
/// Generates an [`Error::MethodNotAllowed`] error with an external message. | ||
pub fn method_not_allowed(message: impl Into<String>) -> Error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels weird to me to customize this message. It was also surprising we even have this -- I expected dropshot would produce all of these. I see where we use it when somebody attempts to delete a system resource that they're not able to. I don't think that's necessarily wrong either (though I think it could as well be a 403 or 409 or 400). But I'm a little worried people might see this and reach for it when it's not appropriate. I dunno.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's definitely a bit weird. I think 405 Method Not Allowed is typically used in spots where the method is unconditionally not permitted (e.g. POST
on a GET
-only endpoint). But in this case we're using it in a case where the method is conditionally not permitted, based on the input parameters. There's also the "Allow" header which RFC 9110 says MUST be returned, but we don't do that.
Based on all this, maybe this should just be 400 Invalid Request.
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { | ||
write!(f, "{}", self.0.external_message)?; | ||
if !self.0.internal_context.is_empty() { | ||
write!(f, " (with internal context: {})", self.0.internal_context)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this change in formatting intentional? It feels like kind of a big one but maybe I overestimate it. Do you happen to have any examples handy from testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this isn't so much a formatting change as much as it is an addition. Previously, we just wouldn't track any internal context for the error variants that had associated external messages, and now we do.
In any case, this only affects internal logging; user-visible messages don't change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think I misunderstood what this was used for. I saw that this comes from with_internal_context
, which elsewhere I think prepends a context message, and I thought this was replacing that. I see that's basically still there at L110. So I think I was just confused -- sorry.
Some(String::from("InvalidRequest")), | ||
message, | ||
), | ||
Error::InvalidRequest { message } => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(applies to this whole function)
aside from the internal message, are any of these changing at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, no changes.
nexus/src/app/instance.rs
Outdated
}), | ||
| InstanceState::Failed => { | ||
Err(Error::invalid_request(format!( | ||
"cannot connect to serial console of {} instance", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitty, but I like the previous format a little better. "cannot connect to serial console of instance in state "Starting"" communicates clearly that "Starting" is a discrete state that the user might know about and could go look up what it means and when it would change and how to check for it, etc. "cannot connect to serial console of starting instance" makes me thing "what do you consider starting?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I wanted to align this with cannot connect to serial console of destroyed instance
below, but I think your reasoning makes sense. As a compromise I ended up using the Display
impl rather than the Debug
one, so it says cannot connect to serial console of instance in state "starting"
. (Not wedded to any particular outcome here at all, to be clear.)
I do apologize for that, I thought the change would be small but it ballooned very quickly. |
Thoughts on 507 Insufficient Storage? |
I don't love 507 for this, mostly because it's rare and weird and kinda WebDAV-specific. It does feel like a shame that there isn't something better. On reflection, despite what I said, 400 is a stretch since there isn't anything in the request that's wrong — you could make the exact same request 5 minutes later and it would be fine. I guess a 503 would be ok. My worry is that when I see a 503, I figure the whole system is borked rather than there being some limitation on the specific operation. But with the message clarifying the cause, 503 might be best. |
Hmm, I think a thing to consider is that 507 is a very specific error that won't be used (now or in the future) for anything else. Agreed that it comes from WebDAV but I've found other WebDAV-originating codes like 422 Unprocessable Entity/Content to be very useful outside of WebDAV. I think if 400 is ruled out, between 503 and 507 I would prefer 507 because lots of other things can generate 503. |
That's an interesting point, that the weirdness is a strength — no one will assume they know what it means. Persuasive! |
Created using spr 1.3.5
Created using spr 1.3.5
Created using spr 1.3.5
@ahl suggested I dig into how AWS/GCP handles errors a bit to get a sense of how we might differentiate errors that are human controllable (a user is trying to allocate resources that would bust their quota) vs things outside of their control (not enough hardware to take the operation). AWS has some good documentation on their errors (bifurcated on client vs server) with rich descriptions of each: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html. As I'd somewhat suggested, AWS gives 400 for client errors which include things like trying to allocate a resource that would exceed your resource limit. They give 500 errors for things outside the user's control. As @sunshowers and I were discussing this we both agree that this means we likely need to continue increasing the explicitness of our errors. |
While developing #4520, I observed that we were producing a number of error messages that were:
This is my attempt to make the situation slightly better. To achieve this, I made a few changes:
MessagePair
struct, which consists of an external message. (Along the way, correct the definition of e.g. theConflict
variant: it actually is an external message, not an internal one.)InsufficientCapacity
variant that consists of both an external and an internal message. This variant resolves to a 507 Insufficient Storage error, and has a more helpful message than just "Service Unavailable".Looking for feedback on this approach!