[nexus] improve external messages and make more available to clients #4573

sunshowers · 2023-11-29T03:55:20Z

While developing #4520, I observed that we were producing a number of error messages that were:

503 Service Unavailable,
With only an internal message attached
But where the message is both safe and useful to display to clients.

This is my attempt to make the situation slightly better. To achieve this, I made a few changes:

Make all the client errors carry a new MessagePair struct, which consists of an external message. (Along the way, correct the definition of e.g. the Conflict variant: it actually is an external message, not an internal one.)
Define a new InsufficientCapacity variant that consists of both an external and an internal message. This variant resolves to a 507 Insufficient Storage error, and has a more helpful message than just "Service Unavailable".
Turn some current 503 errors into client errors so that the message is available externally.

Looking for feedback on this approach!

Created using spr 1.3.5

davepacheco

Thanks for taking a look at this. Is the motivation here or aid debugging for ourselves or to help end users and operators better understand why a 503 failed? I ask because I can definitely see this being useful for developers debugging failures or for customers for some kinds of failures, but I worry this could easily wind up confusing customers more than it helps them when messages refer to implementation details instead of user-level abstractions that we've provided. Example: if the system fails to create a Crucible volume because all available disk space has been allocated, it makes sense to me that internally we'd report something like "we couldn't find a zpool [or dataset] with enough space to allocate a region". But those aren't terms that an end user or even operator would be expected to know about, right? Ultimately I think we want the system to communicate something like: "the sum total of disks, images, etc. that exist have consumed all available storage capacity of the rack. You must add capacity or delete some of these Disks." That said, the request handler is not necessarily in a position to make the leap from "I couldn't find storage" to "there is no storage and you need to add more". Ideally, I think we'd communicate something like "we couldn't find capacity" and we'd have other interfaces (alerts? a status page?) that would communicate the capacity situation and documentation that says "an error message about being unable to find capacity may mean that you're actually out -- check this status page". I appreciate this is a lot more complicated and even we decide to go this route we don't have all these things today.

I think most of the messages that are made external in this PR don't feel suitable for consumption by an operator or an end user.

What do you think of this: instead of allowing ServiceUnavailable to have external messages, create new Error variants for the ones that we want to have external messages, and map those to 503 (preserving the external message) on the way out? We could add an InsufficientCapacity variant with a way to express what resource we failed to allocate (like "storage space" or "IP addresses" or something). The reason I suggest this is to make it harder to accidentally leak implementation details in error messages. If a developer wants to expose something to the end user, they'd have to make a new variant (which I hope will cause people to consider more exactly what they're communicating to the API consumer).

I think "out of capacity" (including both physical resources and virtual ones like IP addresses) covers all the cases we'd want to expose external messages for? If we set those aside, I feel like a 503 is almost never going to be actionable for an end user and rarely so for an operator, too. The cases I usually think about for a 503 are (1) the system suffered some very transient issue (e.g., the CockroachDB node crashed and a query failed during the request) or relatedly (2) the system has suffered too many failures to do what was requested (e.g., you tried to do something with a Crucible volume, all of whose "downstairs" instances are on sleds that were removed or panicked or whatever). I don't think an end user can do anything about these. For an operator, I think we've got to communicate this some better way (e.g., alerts, a status page showing such problems, etc.). So for all these cases I feel like a more specific message would be more confusing than helpful. The capacity-related ones do feel different because they are actionable, even to end users. They might then go check how much capacity they're using, free up capacity, or just stop trying to do what they're doing.

davepacheco · 2023-11-29T17:20:40Z

nexus/db-queries/src/db/queries/region_allocation.rs

@@ -48,17 +48,17 @@ pub fn from_diesel(e: DieselError) -> external::Error {
    if let Some(sentinel) = matches_sentinel(&e, &sentinels) {
        match sentinel {
            NOT_ENOUGH_DATASETS_SENTINEL => {
-                return external::Error::unavail(
+                return external::Error::unavail_external(


This one feels like the sort of thing I would not expect an operator to know about, let alone an end user.

I've turned this and the other two below into a generic "Not enough disk space" message.

davepacheco · 2023-11-29T17:23:49Z

nexus/db-queries/src/db/queries/region_allocation.rs

                    "Not enough datasets to allocate disks",
                );
            }
            NOT_ENOUGH_ZPOOL_SPACE_SENTINEL => {
-                return external::Error::unavail(
+                return external::Error::unavail_external(


This one feels like a mix. Some of this seems useful to an operator but much of it (terms like "zpool" and "region") describes implementation details that feel like a bad UX to expose to operators and end users.

davepacheco · 2023-11-29T17:25:27Z

nexus/db-queries/src/db/queries/region_allocation.rs

                    "Not enough zpool space to allocate disks. There may not be enough disks with space for the requested region. You may also see this if your rack is in a degraded state, or you're running the default multi-rack topology configuration in a 1-sled development environment.",
                );
            }
            NOT_ENOUGH_UNIQUE_ZPOOLS_SENTINEL => {
-                return external::Error::unavail(
+                return external::Error::unavail_external(


Similarly this feels like a call generator more than it's helpful. (I'm imagining a customer wondering "What's a zpool? Do I need more physical disks? Is it somehow allocating something from an API Disk?)

davepacheco · 2023-11-29T17:26:00Z

nexus/src/app/external_endpoints.rs

@@ -737,7 +737,7 @@ fn endpoint_for_authority(
    let endpoint_config = config_rx.borrow();
    let endpoints = endpoint_config.as_ref().ok_or_else(|| {
        error!(&log, "received request with no endpoints loaded");
-        Error::unavail("endpoints not loaded")
+        Error::unavail_external("endpoints not loaded")


This feels confusing and unactionable for end users.

Made this internal-only.

davepacheco · 2023-11-29T17:26:18Z

nexus/src/app/instance.rs

@@ -362,7 +362,9 @@ impl super::Nexus {
        }

        if instance.runtime().migration_id.is_some() {
-            return Err(Error::unavail("instance is already migrating"));
+            return Err(Error::unavail_external(


This feels like a 409 and not a 503.

Agreed, turned this into a 409.

davepacheco · 2023-11-29T17:26:55Z

nexus/src/app/instance.rs

-                InstanceState::Destroyed => Err(Error::ServiceUnavailable {
-                    internal_message: format!(
+                | InstanceState::Failed => {
+                    Err(Error::unavail_external(format!(


This feels like some kind of 400, not a 503.

Turned this into a 400 and cleaned up the message.

davepacheco · 2023-11-29T17:27:05Z

nexus/src/app/instance.rs

+                )))
+                }
+                InstanceState::Destroyed => {
+                    Err(Error::unavail_external(format!(


Same -- 400 or 409, I'd think.

400 as well.

davepacheco · 2023-11-29T17:27:11Z

nexus/src/app/instance.rs

-            Err(Error::ServiceUnavailable {
-                internal_message: format!(
-                    "instance is in state {:?} and has no active serial console \
+            Err(Error::unavail_external(format!(


400 as well.

davepacheco · 2023-11-29T17:27:53Z

nexus/src/app/oximeter.rs

-        let info = oxs.first().ok_or_else(|| Error::ServiceUnavailable {
-            internal_message: String::from("no oximeter collectors available"),
+        let info = oxs.first().ok_or_else(|| {
+            Error::unavail_external("no oximeter collectors available")


I don't think this makes sense for an external user, though I suspect this is only produced in calls to the internal API.

Reverted this change.

davepacheco · 2023-11-29T17:28:47Z

nexus/src/app/sagas/snapshot_create.rs

-                    ),
-                },
-            ));
+            return Err(ActionError::action_failed(Error::unavail_external(


I don't think I understand what this means, but it doesn't sound like a 503.

I've left this untouched.

gjcolombo

LGTM. I had thought about doing something right along these lines to fix #4136 (which this PR fixes) but hadn't gotten into the particulars yet. Thanks for putting this together! (The one caveat I'll offer is that my Omicron development history isn't the longest, and there may be prior design choices here that I'm not aware of, so you may want another reviewer or two to weigh in on the overall design before merging.)

gjcolombo · 2023-11-29T18:37:47Z

common/src/api/external/error.rs

+        match self.0 {
+            MessageVariant::Internal(msg) => write!(f, "(internal: {})", msg),
+            MessageVariant::External { message, internal_context } => {
+                write!(f, ": {message}")?;


nit: I would move the : back to the message in the ServiceUnavailable variant on line 57, in part to keep all those variants uniform, and in part so that if we ever, say, display_internal into a log line we don't get the leading colon. This is pretty nitpicky, though, and I don't feel especially strongly about it, especially if there's a reason to do the formatting here that I've totally overlooked.

Changed this significantly, and followed this approach in the new version.

sunshowers · 2023-11-29T23:03:25Z

Thanks for the detailed review, @davepacheco! Yes, I'm thinking that operators are the most likely consumers of these messages, and that end users will ask operators for help if they run into an issue. My general view (mostly informed by devtools) is that status panes and such are all really useful, but providing synchronous, immediate "inner loop" feedback is crucial.

My initial work was definitely touching on "insufficient capacity" messages (just today there's been discussion in #oxide-dogfood about 503s that don't have a message attached.) I think it's fair to say that for external 503 errors, we should write informative messages.

I'll go over all the cases you pointed out and see what I think of them.

Created using spr 1.3.5

sunshowers · 2023-12-05T04:07:10Z

@davepacheco I've addressed your specific feedback, and also broadened the scope to fix some of the other issues I spotted along the way (see updated PR description, e.g. the fact that Conflict's message was mislabeled as internal when it should have been external). Would love feedback on the update.

zephraph · 2023-12-06T01:25:12Z

Thanks for putting this together @sunshowers! As I was working on #4605 I came across the need for a new error code to indicate to users that they can't provision more resources because it would exceed their set capacity. I'd also created InsufficientCapacity as an apt error so I'm excited to see this. The only question I have is if a user who's trying to provision a resource passed their granted capacity should see a 503 or a 400 error. There might be a slight difference in semantics between an internal operation failing due to running out of capacity in an unexpected / uncontrollable way vs a user hitting against their quota.

Any thoughts? Ultimately I'm in favor of landing this with or without tweaking the response code as that's something we could always iterate on.

david-crespo · 2023-12-06T02:15:59Z

@zephraph I don’t like 503 for what is ultimately IMO a non-exceptional occurrence. Everything is fine, they’re just out of space in their corner of the world.

davepacheco

Cool -- I like this direction a lot.

FYI, there are some text and semantic changes here that seem important to UX (and compatibility) but they're a little buried in more mechanical changes that made it a little hard to make sure I spotted everything.

Thanks again for taking this on -- and fixing several cases in the process!

davepacheco · 2023-12-07T18:56:35Z

common/src/api/external/error.rs

@@ -151,8 +227,20 @@ impl Error {
    ///
    /// This should be used for failures due possibly to invalid client input
    /// or malformed requests.
-    pub fn invalid_request(message: &str) -> Error {
-        Error::InvalidRequest { message: message.to_owned() }
+    pub fn invalid_request(message: impl Into<String>) -> Error {


general note: it feels like we've been backtracking on a lot of use of generics, especially in cross-crate interfaces, due to monomorphization causing long compile times. Does that apply here? Do we care?

There's going to be at most 3 copies of this (String, &String, &str), and there isn't a big tree of nested functions underneath (where the true nastiness with compile times lies I think) so I doubt this is going to be an issue.

davepacheco · 2023-12-07T18:57:23Z

common/src/api/external/error.rs

+    }
+
+    /// Generates an [`Error::MethodNotAllowed`] error with an external message.
+    pub fn method_not_allowed(message: impl Into<String>) -> Error {


It feels weird to me to customize this message. It was also surprising we even have this -- I expected dropshot would produce all of these. I see where we use it when somebody attempts to delete a system resource that they're not able to. I don't think that's necessarily wrong either (though I think it could as well be a 403 or 409 or 400). But I'm a little worried people might see this and reach for it when it's not appropriate. I dunno.

Yeah it's definitely a bit weird. I think 405 Method Not Allowed is typically used in spots where the method is unconditionally not permitted (e.g. POST on a GET-only endpoint). But in this case we're using it in a case where the method is conditionally not permitted, based on the input parameters. There's also the "Allow" header which RFC 9110 says MUST be returned, but we don't do that.

Based on all this, maybe this should just be 400 Invalid Request.

davepacheco · 2023-12-07T19:02:06Z

common/src/api/external/error.rs

+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        write!(f, "{}", self.0.external_message)?;
+        if !self.0.internal_context.is_empty() {
+            write!(f, " (with internal context: {})", self.0.internal_context)?;


Was this change in formatting intentional? It feels like kind of a big one but maybe I overestimate it. Do you happen to have any examples handy from testing?

I think this isn't so much a formatting change as much as it is an addition. Previously, we just wouldn't track any internal context for the error variants that had associated external messages, and now we do.

In any case, this only affects internal logging; user-visible messages don't change.

Ah, I think I misunderstood what this was used for. I saw that this comes from with_internal_context, which elsewhere I think prepends a context message, and I thought this was replacing that. I see that's basically still there at L110. So I think I was just confused -- sorry.

davepacheco · 2023-12-07T19:07:50Z

common/src/api/external/error.rs

-                Some(String::from("InvalidRequest")),
-                message,
-            ),
+            Error::InvalidRequest { message } => {


(applies to this whole function)

aside from the internal message, are any of these changing at all?

No, no changes.

nexus/db-queries/src/db/datastore/external_ip.rs

nexus/db-queries/src/db/queries/region_allocation.rs

davepacheco · 2023-12-07T19:15:59Z

nexus/src/app/instance.rs

-                }),
+                | InstanceState::Failed => {
+                    Err(Error::invalid_request(format!(
+                        "cannot connect to serial console of {} instance",


nitty, but I like the previous format a little better. "cannot connect to serial console of instance in state "Starting"" communicates clearly that "Starting" is a discrete state that the user might know about and could go look up what it means and when it would change and how to check for it, etc. "cannot connect to serial console of starting instance" makes me thing "what do you consider starting?"

Sure, I wanted to align this with cannot connect to serial console of destroyed instance below, but I think your reasoning makes sense. As a compromise I ended up using the Display impl rather than the Debug one, so it says cannot connect to serial console of instance in state "starting". (Not wedded to any particular outcome here at all, to be clear.)

nexus/src/app/vpc_router.rs

sunshowers · 2023-12-07T20:32:11Z

FYI, there are some text and semantic changes here that seem important to UX (and compatibility) but they're a little buried in more mechanical changes that made it a little hard to make sure I spotted everything.

I do apologize for that, I thought the change would be small but it ballooned very quickly.

sunshowers · 2023-12-07T22:07:37Z

The only question I have is if a user who's trying to provision a resource passed their granted capacity should see a 503 or a 400 error.

Thoughts on 507 Insufficient Storage?

david-crespo · 2023-12-07T22:49:18Z

I don't love 507 for this, mostly because it's rare and weird and kinda WebDAV-specific. It does feel like a shame that there isn't something better. On reflection, despite what I said, 400 is a stretch since there isn't anything in the request that's wrong — you could make the exact same request 5 minutes later and it would be fine. I guess a 503 would be ok. My worry is that when I see a 503, I figure the whole system is borked rather than there being some limitation on the specific operation. But with the message clarifying the cause, 503 might be best.

sunshowers · 2023-12-07T22:58:54Z

Hmm, I think a thing to consider is that 507 is a very specific error that won't be used (now or in the future) for anything else. Agreed that it comes from WebDAV but I've found other WebDAV-originating codes like 422 Unprocessable Entity/Content to be very useful outside of WebDAV.

I think if 400 is ruled out, between 503 and 507 I would prefer 507 because lots of other things can generate 503.

david-crespo · 2023-12-07T23:00:09Z

That's an interesting point, that the weirdness is a strength — no one will assume they know what it means. Persuasive!

Created using spr 1.3.5

nexus/db-queries/src/db/datastore/external_ip.rs

Created using spr 1.3.5

zephraph · 2023-12-11T19:27:21Z

@ahl suggested I dig into how AWS/GCP handles errors a bit to get a sense of how we might differentiate errors that are human controllable (a user is trying to allocate resources that would bust their quota) vs things outside of their control (not enough hardware to take the operation). AWS has some good documentation on their errors (bifurcated on client vs server) with rich descriptions of each: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/errors-overview.html. As I'd somewhat suggested, AWS gives 400 for client errors which include things like trying to allocate a resource that would exceed your resource limit. They give 500 errors for things outside the user's control.

As @sunshowers and I were discussing this we both agree that this means we likely need to continue increasing the explicitness of our errors.

[𝘀𝗽𝗿] initial version

6598065

Created using spr 1.3.5

sunshowers mentioned this pull request Nov 29, 2023

[nexus] add sled provision state #4520

Merged

11 tasks

davepacheco reviewed Nov 29, 2023

View reviewed changes

gjcolombo approved these changes Nov 29, 2023

View reviewed changes

Address comments, broaden scope

610b007

Created using spr 1.3.5

sunshowers changed the title ~~[RFC] [nexus] make more 503 Service Unavailable messages available to clients~~ [RFC] [nexus] improve external messages and make more available to clients Dec 5, 2023

sunshowers requested a review from davepacheco December 5, 2023 04:05

zephraph mentioned this pull request Dec 6, 2023

Add resource limits #4605

Merged

6 tasks

davepacheco reviewed Dec 7, 2023

View reviewed changes

sunshowers added 3 commits December 7, 2023 15:36

Address review comments, switch to 507 Insufficient Storage

fde90d6

Created using spr 1.3.5

Update docs

3ae14c7

Created using spr 1.3.5

Rebase

bad2f38

Created using spr 1.3.5

sunshowers changed the title ~~[RFC] [nexus] improve external messages and make more available to clients~~ [nexus] improve external messages and make more available to clients Dec 8, 2023

davepacheco approved these changes Dec 8, 2023

View reviewed changes

nexus/db-queries/src/db/datastore/external_ip.rs Outdated Show resolved Hide resolved

Rebase, address last comment

c9098ce

Created using spr 1.3.5

sunshowers enabled auto-merge (squash) December 8, 2023 21:28

sunshowers merged commit 2a3db41 into main Dec 8, 2023
19 of 20 checks passed

sunshowers deleted the sunshowers/spr/rfc-nexus-make-more-503-service-unavailable-messages-available-to-clients branch December 8, 2023 22:39

zephraph mentioned this pull request Dec 12, 2023

Add a better error for when a user tries to provision past their quota #4680

Open

[nexus] improve external messages and make more available to clients #4573

[nexus] improve external messages and make more available to clients #4573

Conversation

sunshowers commented Nov 29, 2023 • edited Loading

davepacheco left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gjcolombo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sunshowers commented Nov 29, 2023

sunshowers commented Dec 5, 2023

zephraph commented Dec 6, 2023

david-crespo commented Dec 6, 2023

davepacheco left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sunshowers commented Dec 7, 2023

sunshowers commented Dec 7, 2023

david-crespo commented Dec 7, 2023

sunshowers commented Dec 7, 2023 • edited Loading

david-crespo commented Dec 7, 2023

zephraph commented Dec 11, 2023

sunshowers commented Nov 29, 2023 •

edited

Loading

sunshowers commented Dec 7, 2023 •

edited

Loading