feat: Call out the "RESOURCE_EXHAUSTED" error metric #808

jrconlin · 2024-12-09T23:06:16Z

Introduces notification.bridge.error.resource_exhausted

Bigtable stream reads are not reliable, and may fail at any time. Move the chunk processing into the retry section to reduce the number of read errors that can happen. Closes: SYNC-4508
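The idea of moving chunk processing into the retry section can be illustrated with a minimal sketch. This is not the autopush-rs implementation: `StreamError`, `retry`, `read_chunks`, and `read_row_sum` are hypothetical names, and the stream is simulated with a plain `Vec`. The point is only that the retried closure covers both opening the stream and consuming its chunks, so a failure partway through restarts the whole read.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a stream read error.
#[derive(Debug, PartialEq)]
struct StreamError(&'static str);

/// Retry `op` up to `max_attempts` times, returning the first success
/// or the last error seen.
fn retry<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, StreamError>,
) -> Result<T, StreamError> {
    let mut last = StreamError("no attempts made");
    for _ in 0..max_attempts {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) => last = e,
        }
    }
    Err(last)
}

/// Simulated stream: the first `failures` calls break off mid-stream,
/// later calls yield every chunk.
fn read_chunks(calls: &Cell<u32>, failures: u32) -> Vec<Result<u8, StreamError>> {
    let n = calls.get();
    calls.set(n + 1);
    if n < failures {
        vec![Ok(1), Err(StreamError("stream reset"))]
    } else {
        vec![Ok(1), Ok(2), Ok(3)]
    }
}

/// Both the read and the chunk processing live inside the retried closure,
/// so an error on any chunk triggers a fresh attempt.
fn read_row_sum(calls: &Cell<u32>, failures: u32) -> Result<u32, StreamError> {
    retry(3, || {
        let mut sum = 0u32;
        for chunk in read_chunks(calls, failures) {
            sum += u32::from(chunk?);
        }
        Ok(sum)
    })
}
```

If only the stream open were retried, the `chunk?` failure in the loop would surface directly to the caller instead of triggering another attempt.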

*NOTE* `cargo audit` may fail with an "Invalid version" if the `Cargo.lock` version is set to 4. Manually changing to 3 will resolve this.

Introduces notification.bridge.error.resource_exhausted Closes: SYNC-4551

autoendpoint/src/routers/fcm/error.rs

autoendpoint/src/routers/mod.rs

* Remove old RouterError::GCMAuthentication - gcm is dead.
* Call out `RouterError::Fcm(FcmError::Upstream{..})` since `handle_error` munges it.

pjenvey · 2024-12-13T20:01:54Z

autoendpoint/src/routers/fcm/error.rs

-            }
-            FcmError::Upstream { status, .. } if status == "RESOURCE_EXHAUSTED" => {
-                Some("notification.bridge.error.fcm.resource_exhausted")
+            FcmError::InvalidAppId(_) | FcmError::NoAppId | FcmError::Upstream { .. } => {


I don't think we want Upstream here, since handle_error already emits a metric for it: we'd get the metric emitted twice.

I think all FcmErrors will be passed through handle_error (or going to sentry, which InvalidAppId/NoAppId are set to do)? If so I'd argue we shouldn't return any metric labels here at all (for now)
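The double-emission concern can be sketched roughly as follows. Everything here is a simplified assumption for illustration: `Metrics` is an in-memory counter rather than a statsd client, `metric_label` and `report` are hypothetical stand-ins for the generic reporting path, and the `double_count` flag exists only to compare the two behaviours.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory counter standing in for a metrics client.
#[derive(Default)]
struct Metrics {
    counts: HashMap<String, u32>,
}

impl Metrics {
    fn incr(&mut self, name: &str) {
        *self.counts.entry(name.to_string()).or_insert(0) += 1;
    }
}

/// Simplified error type; the real enum has more variants and fields.
#[derive(Debug)]
enum FcmError {
    Upstream { status: String },
    NoAppId,
}

impl FcmError {
    /// Label used by the generic reporting step, if any.
    /// `double_count = true` models returning a label for `Upstream`.
    fn metric_label(&self, double_count: bool) -> Option<&'static str> {
        match self {
            FcmError::Upstream { .. } if double_count => Some("notification.bridge.error"),
            _ => None,
        }
    }
}

/// Records a metric, then returns the original error for later processing.
fn handle_error(metrics: &mut Metrics, err: FcmError) -> FcmError {
    match &err {
        FcmError::Upstream { .. } => metrics.incr("notification.bridge.error"),
        FcmError::NoAppId => {}
    }
    err
}

/// Generic reporting step that runs after `handle_error`.
fn report(metrics: &mut Metrics, err: &FcmError, double_count: bool) {
    if let Some(label) = err.metric_label(double_count) {
        metrics.incr(label);
    }
}

/// Run one upstream error through both steps and count the emissions.
fn emitted(double_count: bool) -> u32 {
    let mut metrics = Metrics::default();
    let err = handle_error(
        &mut metrics,
        FcmError::Upstream { status: "RESOURCE_EXHAUSTED".to_string() },
    );
    report(&mut metrics, &err, double_count);
    metrics.counts.get("notification.bridge.error").copied().unwrap_or(0)
}
```

With a label returned for `Upstream`, one error increments the counter twice; with `None`, only `handle_error` counts it.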

Ah, you're right, I forgot that handle_error() records the metric and then returns the original error for later processing.
That said, we do have some duplication already. I'll add a comment so I can remember that later.
I note that RouterError::TooMuchData(_) produces both notification.bridge.error {too_much_data} and notification.bridge.error.too_much_data. Should I remove that as well so that we only have one? (I can do that in a different PR.)
I'll also note that we might want to have a ReportableError::metric_tags() function similar to how we have extras() to provide the additional information, since that seems to be a pattern we're hitting.

I'm fine with leaving it for now.

We already have ReportableError::tags (it defaults to returning nothing, so you just have to implement it to override)
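The "defaults to returning nothing, implement it to override" pattern looks roughly like the sketch below. The trait here is a simplified, hypothetical version written for illustration; the exact signature of `ReportableError::tags` in autopush-rs may differ, and the two error structs and the `reason` tag are invented.

```rust
/// Simplified, hypothetical version of a reportable-error trait.
trait ReportableError {
    /// Metric tags for this error. The default returns nothing.
    fn tags(&self) -> Vec<(&str, String)> {
        vec![]
    }
}

/// Illustrative error types, not the real ones.
struct NoAppId;
struct TooMuchData;

// Relies on the default implementation: no tags.
impl ReportableError for NoAppId {}

// Overrides the default to attach a tag.
impl ReportableError for TooMuchData {
    fn tags(&self) -> Vec<(&str, String)> {
        vec![("reason", "too_much_data".to_string())]
    }
}
```

An error type that needs no extra tags only has to write an empty `impl` block.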


pjenvey · 2024-12-17T00:17:51Z

autoendpoint/src/routers/common.rs

+            metrics,
+            platform,
+            app_id,
+            &format!("upstream_{}", status),


Sorry, one last thing: status here is either FcmErrorResponse::status or StatusCode::to_string(). I know the latter will contain whitespace (e.g. "200 OK"), and the former probably does too? That would make this an invalid metric tag value.

Sigh, you're right. With the various back-and-forth and merges, status is not the same. I'll modify.

Wait, no, that is right.

The status here is from Upstream{status: String, ..} and reflects the returned FcmErrorResponse.status, which is one of a set of enum values. There was a similar error, though, and I've corrected it. I've also added some comments because this was a source of confusion.
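The distinction in this thread can be sketched as below. `upstream_tag` mirrors the `format!("upstream_{}", status)` call from the diff; `has_whitespace` is a hypothetical check that assumes, as the review comment does, that a tag value containing whitespace is not usable.

```rust
/// Build the tag value the same way the diff does.
fn upstream_tag(status: &str) -> String {
    format!("upstream_{}", status)
}

/// Assumption: a tag value containing whitespace is invalid.
fn has_whitespace(tag: &str) -> bool {
    tag.chars().any(char::is_whitespace)
}
```

An enum-style status such as `RESOURCE_EXHAUSTED` yields a clean tag value, while an HTTP status string like `200 OK` would not.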

* better describe the response enum from FCM.

pjenvey · 2024-12-17T01:42:03Z

autoendpoint/src/routers/fcm/client.rs

+                // (This may happen in the case where FCM terminates the connection abruptly
+                // or a similar event.) Treat that as an INTERNAL error.
+                (_, None) => FcmError::Upstream {
+                    error_code: "INTERNAL".to_string(),


nit: how about distinguishing between their INTERNAL type w/ the response's status code (e.g. reason: UNKNOWN_502)

Suggested change

-                    error_code: "INTERNAL".to_string(),
+                    error_code: format!("UNKNOWN_{}", status.as_str()),

I'm fine using "UNKNOWN", but I don't think adding the status code will get us much more.
I'm going to add a warn!() logger message to report the error, status and the raw response. That will probably be more useful if we see a bunch of these, which should (hopefully) be very rare.
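The outcome described above (a plain "UNKNOWN" code plus a logged warning) might look something like this sketch. It is an assumption-heavy simplification: `classify`, its parameters, and the `message` field are hypothetical, and `eprintln!` stands in for the `warn!()` logging macro so the example needs no external crates.

```rust
/// Hypothetical, simplified error type.
#[derive(Debug, PartialEq)]
enum FcmError {
    Upstream { error_code: String, message: String },
}

/// `parsed_status` is the FCM status string from the response body,
/// or `None` when the body could not be parsed into an error response.
fn classify(http_status: u16, parsed_status: Option<&str>, raw_body: &str) -> FcmError {
    match parsed_status {
        // FCM told us what went wrong: pass its status through.
        Some(code) => FcmError::Upstream {
            error_code: code.to_string(),
            message: raw_body.to_string(),
        },
        // No usable error body (e.g. the connection was cut off).
        None => {
            // Stand-in for `warn!()`: keep the HTTP status and raw response
            // in the log rather than in the metric label.
            eprintln!(
                "unparseable FCM response: status={} body={:?}",
                http_status, raw_body
            );
            FcmError::Upstream {
                error_code: "UNKNOWN".to_string(),
                message: raw_body.to_string(),
            }
        }
    }
}
```

Keeping the HTTP status out of `error_code` keeps the set of metric labels small, while the log line retains the detail for debugging.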

jrconlin added 4 commits December 2, 2024 16:21

feat: Move BT chuck processing into retry

fd59c67

Bigtable stream reads are not reliable, and may fail at any time. Move the chunk processing into the retry section to reduce the number of read errors that can happen. Closes: SYNC-4508

f fix audit / update RUST_VER

4b5b1a0

*NOTE* `cargo audit` may fail with an "Invalid version" if the `Cargo.lock` version is set to 4. Manually changing to 3 will resolve this.

feat: Call out the "RESOURCE_EXHAUSTED" error metric

7866ccf

Introduces notification.bridge.error.resource_exhausted Closes: SYNC-4551

f move conditional

b26c66f

jrconlin requested review from pjenvey and taddes December 9, 2024 23:06

jrconlin added 2 commits December 9, 2024 15:42

f audit, make FCM::Upstream so we are consistent

90d0293

f r's

735581e

pjenvey requested changes Dec 12, 2024

autoendpoint/src/routers/fcm/error.rs

autoendpoint/src/routers/mod.rs

f r's

559b653

* Remove old RouterError::GCMAuthentication - gcm is dead.
* Call out `RouterError::Fcm(FcmError::Upstream{..})` since `handle_error` munges it.

jrconlin requested a review from pjenvey December 13, 2024 00:12

pjenvey reviewed Dec 13, 2024

f r's

dfbe578

jrconlin requested a review from pjenvey December 16, 2024 16:42

jrconlin added 3 commits December 16, 2024 08:50

feat: Call out the "RESOURCE_EXHAUSTED" error metric

352e723

Introduces notification.bridge.error.resource_exhausted Closes: SYNC-4551

f move conditional

b5e6273

f audit, make FCM::Upstream so we are consistent

1252d9a

pjenvey requested changes Dec 16, 2024

.cargo/audit.toml

jrconlin added 3 commits December 16, 2024 15:25

Merge branch 'master' of github.com:mozilla-services/autopush-rs into…

cb9a9bb

… feat/SYNC-4551_res-x

f Update audit

887e90a

Merge branch 'feat/SYNC-4551_res-x' of github.com:mozilla-services/au…

186b641

…topush-rs into feat/SYNC-4551_res-x

jrconlin requested a review from pjenvey December 17, 2024 00:00

pjenvey requested changes Dec 17, 2024

f r's

4955c73

* better describe the response enum from FCM.

pjenvey previously approved these changes Dec 17, 2024

f r's

fbbc3df

jrconlin dismissed pjenvey’s stale review via fbbc3df December 17, 2024 17:10

Merge branch 'master' into feat/SYNC-4551_res-x

f0ac662

jrconlin requested a review from pjenvey December 17, 2024 17:43

pjenvey approved these changes Dec 17, 2024

jrconlin merged commit ebc5704 into master Dec 17, 2024
1 check passed

jrconlin deleted the feat/SYNC-4551_res-x branch December 17, 2024 18:08
