-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Call out the "RESOURCE_EXHAUSTED" error metric #808
Conversation
Bigtable stream reads are not reliable, and may fail at any time. Move the chunk processing into the retry section to reduce the number of read errors that can happen. Closes: SYNC-4508
*NOTE* `cargo audit` may fail with an "Invalid version" if the `Cargo.lock` version is set to 4. Manually changing to 3 will resolve this.
Introduces notification.bridge.error.resource_exhausted Closes: SYNC-4551
} | ||
FcmError::Upstream { status, .. } if status == "RESOURCE_EXHAUSTED" => { | ||
Some("notification.bridge.error.fcm.resource_exhausted") | ||
FcmError::InvalidAppId(_) | FcmError::NoAppId | FcmError::Upstream { .. } => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want Upstream
here with handle_error
already emitting a metric for it: as we'd get a metric emitted twice for it.
I think all FcmErrors
will be passed through handle_error
(or going to sentry, which InvalidAppId
/NoAppId
are set to do)? If so I'd argue we shouldn't return any metric labels here at all (for now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right, I forgot that handle_error()
records the metric and then returns the original error for later processing.
That said, we do have some duplication already. I'll add a comment so I can remember that later.
I note that RouterError::TooMuchData(_)
produces notification.bridge.error {too_much_data}
🔗 and notification.bridge.error.too_much_data
🔗 Should I remove that as well so that we only have one? (I can add that as a different PR.)
I'll also note that we might want to have a ReportableError::metric_tags()
function similar to how we have extras()
to provide the additional information, since that seems to be a pattern we're hitting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with leaving it for now.
We already have ReportableError::tags
(it defaults to returning nothing, so you just have to implement it to override)
Introduces notification.bridge.error.resource_exhausted Closes: SYNC-4551
… feat/SYNC-4551_res-x
…topush-rs into feat/SYNC-4551_res-x
metrics, | ||
platform, | ||
app_id, | ||
&format!("upstream_{}", status), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, one last thing: status
here is either FcmErrorResponse::status
or StatusCode::to_string()
. I know the latter will contain whitespace (e.g. "200 OK") and the former probably does too? Which makes this an invalid metric tag value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sigh, you're right, with the various back & forth and merges status
is not the same. I'll modify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, no, That is right.
The status
here is from Upstream{status: String, ..}
and reflects the returned FcmErrorResponse.status
, which is a set of enums. There was as similar error, though, and I've corrected it. I've also added some comments because this was a source of confusion.
// (This may happen in the case where FCM terminates the connection abruptly | ||
// or a similar event.) Treat that as an INTERNAL error. | ||
(_, None) => FcmError::Upstream { | ||
error_code: "INTERNAL".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: how about distinguishing between their INTERNAL
type w/ the response's status code (e.g. reason: UNKNOWN_502
)
error_code: "INTERNAL".to_string(), | |
error_code: format!("UNKNOWN_{}", status.as_str()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine using "UNKNOWN" but I don't think adding the status code will get us much more.
I'm going to add a warn!()
logger message to report the error, status and the raw response, since that will probably be more useful when we see a bunch of these, since they should be (hopefully) very rare.
Introduces notification.bridge.error.resource_exhausted
Closes: SYNC-4551