Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Call out the "RESOURCE_EXHAUSTED" error metric #808

Merged
merged 17 commits into from
Dec 17, 2024

Conversation

jrconlin
Copy link
Member

@jrconlin jrconlin commented Dec 9, 2024

Introduces notification.bridge.error.resource_exhausted

Closes: SYNC-4551

Bigtable stream reads are not reliable, and may fail at any time.
Move the chunk processing into the retry section to reduce the number
of read errors that can happen.

Closes: SYNC-4508
*NOTE*
`cargo audit` may fail with an "Invalid version" if the `Cargo.lock`
version is set to 4. Manually changing to 3 will resolve this.
Introduces notification.bridge.error.resource_exhausted

Closes: SYNC-4551
@jrconlin jrconlin requested review from pjenvey and taddes December 9, 2024 23:06
autoendpoint/src/routers/fcm/error.rs Outdated Show resolved Hide resolved
autoendpoint/src/routers/mod.rs Outdated Show resolved Hide resolved
* Remove old RouterError::GCMAuthentication - gcm is dead.
* Call out `RouterError::Fcm(FcmError::Upstream{..})` since
  `handle_error` munges it.
@jrconlin jrconlin requested a review from pjenvey December 13, 2024 00:12
}
FcmError::Upstream { status, .. } if status == "RESOURCE_EXHAUSTED" => {
Some("notification.bridge.error.fcm.resource_exhausted")
FcmError::InvalidAppId(_) | FcmError::NoAppId | FcmError::Upstream { .. } => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want Upstream here with handle_error already emitting a metric for it: as we'd get a metric emitted twice for it.

I think all FcmErrors will be passed through handle_error (or going to sentry, which InvalidAppId/NoAppId are set to do)? If so I'd argue we shouldn't return any metric labels here at all (for now)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right, I forgot that handle_error() records the metric and then returns the original error for later processing.
That said, we do have some duplication already. I'll add a comment so I can remember that later.
I note that RouterError::TooMuchData(_) produces notification.bridge.error {too_much_data} 🔗 and notification.bridge.error.too_much_data 🔗 Should I remove that as well so that we only have one? (I can add that as a different PR.)
I'll also note that we might want to have a ReportableError::metric_tags() function similar to how we have extras() to provide the additional information, since that seems to be a pattern we're hitting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with leaving it for now.

We already have ReportableError::tags (it defaults to returning nothing, so you just have to implement it to override)

@jrconlin jrconlin requested a review from pjenvey December 16, 2024 16:42
.cargo/audit.toml Outdated Show resolved Hide resolved
@jrconlin jrconlin requested a review from pjenvey December 17, 2024 00:00
metrics,
platform,
app_id,
&format!("upstream_{}", status),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, one last thing: status here is either FcmErrorResponse::status or StatusCode::to_string(). I know the latter will contain whitespace (e.g. "200 OK") and the former probably does too? Which makes this an invalid metric tag value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh, you're right, with the various back & forth and merges status is not the same. I'll modify.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, no, That is right.

The status here is from Upstream{status: String, ..} and reflects the returned FcmErrorResponse.status, which is a set of enums. There was as similar error, though, and I've corrected it. I've also added some comments because this was a source of confusion.

* better describe the response enum from FCM.
pjenvey
pjenvey previously approved these changes Dec 17, 2024
// (This may happen in the case where FCM terminates the connection abruptly
// or a similar event.) Treat that as an INTERNAL error.
(_, None) => FcmError::Upstream {
error_code: "INTERNAL".to_string(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: how about distinguishing between their INTERNAL type w/ the response's status code (e.g. reason: UNKNOWN_502)

Suggested change
error_code: "INTERNAL".to_string(),
error_code: format!("UNKNOWN_{}", status.as_str()),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine using "UNKNOWN" but I don't think adding the status code will get us much more.
I'm going to add a warn!() logger message to report the error, status and the raw response, since that will probably be more useful when we see a bunch of these, since they should be (hopefully) very rare.

@jrconlin jrconlin requested a review from pjenvey December 17, 2024 17:43
@jrconlin jrconlin merged commit ebc5704 into master Dec 17, 2024
1 check passed
@jrconlin jrconlin deleted the feat/SYNC-4551_res-x branch December 17, 2024 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants