From d2412bbb2eac2d261ce4e671ffdf961715462579 Mon Sep 17 00:00:00 2001
From: Taddes
Date: Wed, 20 Nov 2024 11:53:13 -0500
Subject: [PATCH] udpated ADR from team input, Glean team discussion

---
 ...ly-active-use-server-side-metrics-glean.md | 88 ++++++++++---------
 1 file changed, 48 insertions(+), 40 deletions(-)

diff --git a/docs/adr/0001-daily-active-use-server-side-metrics-glean.md b/docs/adr/0001-daily-active-use-server-side-metrics-glean.md
index fc5b22002d..b59949586d 100644
--- a/docs/adr/0001-daily-active-use-server-side-metrics-glean.md
+++ b/docs/adr/0001-daily-active-use-server-side-metrics-glean.md
@@ -1,6 +1,6 @@
 # Measuring Server-Side Daily Active Use (DAU) With Glean
 
-* Status: proposed
+* Status: approved
 * Deciders: Taddes Korris, David Durst, JR Conlin, Phillip Jenvey
 * Date: 2024-10-09
 
@@ -10,16 +10,15 @@ Technical Story:
 
 ## Context and Problem Statement
 
-There is a requirement to move away from the current measurement of Sync Daily Active Use (DAU), which measures usage via FxA/Mozilla Accounts.
-The addition of Relay, device backups/migrations, through FxA means the metric will no longer an be accurate reflection of Sync usage. This is due to increased browser sign-ins with the potential of overcounting, or not counting those logging into the browser without Sync enabled. Furthermore, there is a broader organizational movement towards measuring DAU within services themselves.
+There is an organizational requirement for each service to be able to measure its own DAU (Daily Active Users). Sync historically measured DAU via FxA through browser login. With the addition of Relay, device backups/migrations, etc., this will no longer be an accurate reflection of Sync usage, nor a direct measurement originating within Sync. This is due to increased browser sign-ins with the potential of overcounting, or not counting those logging into the browser without Sync enabled.
 See [Decision Brief - Server-Side Sync Usage Attribution from Mozilla Accounts](https://docs.google.com/document/d/1zD-ia3fP43o-dYpwavDgH5Hb6Xo_fgQzzoWqTiX_wR8/edit?tab=t.0#heading=h.mdoaoiyvqgfo).
-The goal is to measure DAU (and subsequently WAU & MAU) by emitting metrics from syncserver-rs itself. This requires the following data:
+The goal is to measure DAU (and subsequently WAU & MAU) by emitting metrics from syncstorage-rs itself. This requires the following data:
 * User identifier (hashed_fxa_uid)
 * Timestamp
-* Platform (from UserAgent: Desktop, Fenix, iOS)
+* Platform (Desktop, Fenix, iOS, Other)
+* Device Family (Desktop, Mobile, Tablet, Other)
+* Device ID (hashed_device_id) for opt-out/deletion
 
-In researching possible implementation methods, it became clear that many options did not offer us the ease and flexibility to reconcile the data after emission. This is why Glean is recommended as a clear frontrunner. This is not without some drawbacks, but they are minimal compared to other options that would, for example, require considerable data processing and querying difficulties. There is support for this implementation from the Glean team and active support in the process.
-
 ## Decision Drivers
 
 1. Simplicity of implementation, not only for internal metric emission, but for processing and querying.
@@ -30,31 +29,38 @@ In researching possible implementation methods, it became clear that many option
 
 ## Considered Options
 
-* A. Glean
-* B. StatsD and Grafana
-* C. Sql/Redash
+* A. Glean Parser for Rust - Contribute to glean team repo by implementing Rust server
+* B. Custom Glean Implementation - Our own custom implementation, internal only to Sync
+* C. StatsD and Grafana
 
 ## Decision Outcome
 
 Chosen option:
 
-* A. "Glean for server-side measurement of DAU"
+* A. "Glean Parser for Rust: for server-side measurement of DAU"
+
+In researching possible implementation methods, it became clear that many options did not offer us the ease and flexibility to reconcile the data after emission. This is why Glean is recommended as a clear frontrunner, due to its rich tooling in aggregating, querying, and visualizing data.
 
-The use of Glean appears the best choice for measuring internal DAU metrics. It meets our requirements and provides us with needed support on the data processing side. It also provides considerable support from the Glean team to implement this in a thoughtful manner. There are some challenges with this implementation (more below), namely in this being a greenfield attempt at Rust server-side metrics, however the pros outweigh the cons. Other metrics implementations like StatsD and Grafana cannot be easily used to measure and aggregate this data. Additionally, it adds considerable overhead in determining how to query the data and reconcile ping emissions of Sync events. Having dedicated organizational support means we establish best practices.
+However, this left an additional decision to either implement our own custom Glean code to emit "Glean-compliant" output, or to contribute to the Glean team's `glean_parser` repository for server-side metrics. The `glean_parser` is used for all server implementations of Glean, since the SDK is only available for client-side applications. Currently, Rust is not supported.
+
+The Glean team does/did not have capacity to implement the Rust `glean_parser` feature, so we had to decide what is the best solution not only for this use case, but to consider possible future use cases. In consultation with the Glean team, it became clear that avoiding our custom implementation and opting for the general-purpose `glean_parser` for Rust was the ideal solution.
 
 ## Pros and Cons of the Options
 
-### A. Glean
+### A. Glean Parser for Rust
 
-Glean is a widely used tool at Mozilla and provides us with a solution to the given issue and possible extensibility in the future. Not without some challenges in initial implementation, but the potential for positive impact is high.
+Glean is a widely used tool at Mozilla and provides us with a solution to the given issue and possible extensibility in the future. It is not without some challenges in initial implementation, related to coordination with the Glean team and upfront development effort. However, the potential for positive impact within our team and the organization is significant: all Rust server applications will be able to use Glean with full server support, our possible intention to integrate Glean into Push is made easier, and this is done in partnership with the Glean team.
 
 #### Pros
 
-* Satisfaction of requirement to measure internal DAU metrics.
+* Makes Glean compatible for all Rust server applications going forward.
+* Preferred option of the Glean team.
+* Glean team believes, based on FxA metrics volume, that our volume will not be a problem (180-190K per minute).
+* A collection of metrics, emitted as a single "Ping Event", makes querying of related data simpler.
 * Core of Glean's purpose is to measure user interactions and the rich metadata that accompanies it.
-* Capacity for future expansion of application metrics within Sync beyond DAU.
+* Capacity for future expansion of application metrics within Sync, beyond DAU.
 * Prepares for implementation of same measurements in autopush, also using Glean.
-* Easier to query.
+* Easier to query, with data team support to set up queries.
 * Use of standardized Mozilla tooling.
 * Establishment of team knowledge of using Glean.
 * Establishment of server-side Rust best practices, leading to easier development for backend Rust applications.
@@ -66,46 +72,48 @@ Glean is a widely used tool at Mozilla and provides us with a solution to the gi
 * Server side metrics have not yet been implemented for a Rust server application of this kind, so this is new territory.
 * There is added complexity of data review process and registration of the application to Glean's probe scraper.
 * Potential delays and challenges in new implementation.
+* Some concerns exist around volume of data emitted from the service and if it is feasible, but we won't know until we try.
 
-### StatsD and Grafana
+### B. Custom Glean Implementation
 
-StatsD and Grafana offer us core application metrics and service health. While we use this frequently, it doesn't neatly fit the measurement requirements we have for DAU and would likely be very difficult to process via queries.
+This originally appeared to be the desired approach to measuring DAU via Glean. This was predicated on the ease of prototyping a custom implementation that imitated the Glean team's logic to create "Glean-compliant" output that could be configured to be ingested. However, in consulting with the Glean team and evaluating the pros and cons of this approach, it became clear this approach had considerably more drawbacks than implementing the `glean_parser` for Rust. These drawbacks were a lack of testing, validation, less support from the Glean team, and the potential problems with maintenance and adding Glean metrics in the future.
 
 #### Pros
-
-* Well understood and used.
-* Support for SRE for more complex queries.
-* Already utilized for core metrics.
+* Gives team control over implementation and allows us to customize the Glean logging as we see fit.
+* Does not require contribution to Glean team's repos, which potentially limits the scope of understanding a new codebase.
+* Easy to prototype and make changes.
+* Doesn't require understanding the templating logic and libraries in the `glean_parser`.
 
 #### Cons
+* There is no built-in testing suite or validation, so this would put a larger development burden on us and require the Glean team's review.
+* Lack of testing and validation means higher likelihood of bugs.
+* If we decide to add new Glean metrics in the future, this may break the custom implementation and impose a greater maintenance overhead.
+* Time required to understand the Glean team's implementations anyway, in order to replicate behavior and data structures.
+* Likely won't have same support from Glean team as it is not related to their implementation.
 
-* StatsD is not a good format for measuring something like user interactions.
-* Somewhat opaque and complicated query logic required.
-* Significant difficulty in aggregation and reconciliation logic.
-* May not scale well given number of events.
-* Likely considerable overhead in understanding how to make sense of data.
-* Heavier load for team to manage data processing.
-
-### SQL/Redash
+### C. StatsD, Grafana, InfluxDB
 
-The current DAU metric used from FxA uses SQL telemetry and provides the ability to query data. It is then displayed in a redash panel. While convenient, we do not have the infrastructure in place for this option and it might involve considerable effort to establish.
+StatsD and Grafana offer us core application metrics and service health. While we use this frequently, it doesn't neatly fit the measurement requirements we have for DAU and would likely be very difficult to process via queries. This is because such application metrics are geared towards increment counters, response codes, and timers. Given DAU is a user-initiated interaction, and we need to query unique events based on the `hashed_fxa_uid`, this is not suitable for InfluxDB/StatsD. Figuring out how to query this data poses challenges as it has not been implemented for such a use case.
 
 #### Pros
 
-* Used and understood within services already.
-* Simple interface.
+* Already utilized for core service metrics (status codes, API endpoint counts, cluster health, etc).
+* Well understood and used.
+* Possible for SRE for more complex queries.
 
 #### Cons
 
-* Implemented strictly for accounts at present.
-* Lack of clarity on how to aggregate and process data after emitted.
-* Infrastructure does not exist to emit the metrics to.
-* Likely considerable overhead in understanding how to make sense of data.
-* Heavier load for team to manage data processing.
+* StatsD is not a good format for measuring something like user interactions.
+* Somewhat opaque and complicated query logic required.
+* Significant difficulty in aggregation and reconciliation logic.
+* May not scale well given number of events.
+* Likely a considerable overhead in understanding how to make sense of data.
+* Heavier load for team to manage data processing, as this approach has not been tried.
+
 
 ## Links
 
-* [Decision Brief - Server-Side Sync Usage Attribution from Mozilla Accounts](https://docs.google.com/document/d/1zD-ia3fP43o-dYpwavDgH5Hb6Xo_fgQzzoWqTiX_wR8/edit)
+* [Decision Brief - Server-Side Sync Usage Attribution from Mozilla Accounts](https://docs.google.com/document/d/1zD-ia3fP43o-dYpwavDgH5Hb6Xo_fgQzzoWqTiX_wR8/edit?tab=t.0#heading=h.mdoaoiyvqgfo)
 * [FxA DAU Metric in Redash](https://sql.telemetry.mozilla.org/queries/101007/source?p_end%20date=2024-06-26&p_start%20date=2024-05-01#248905)
 * [Working Document](https://docs.google.com/document/d/1Tk4VIuQZcn8IG-UI38kziZn5e-FMOI0Z-VrvaYTI1SM/edit#heading=h.b0mqx1fng4wa)
 * [Sync Ecosystem Infrastructure and Metrics: KPI Metrics](https://mozilla-hub.atlassian.net/wiki/spaces/CLOUDSERVICES/pages/969834589/Establish+KPI+metrics+DAU+Retention)
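
Reviewer note, not part of the patch: the data the ADR lists (hashed user ID, timestamp, platform, device family, hashed device ID) and the need to count unique `hashed_fxa_uid` values per day can be sketched in Rust roughly as follows. This is a minimal illustration under stated assumptions. The `SyncEvent` struct, the two enums, and `daily_active_users` are hypothetical names chosen here; they are not Glean, `glean_parser`, or syncstorage-rs APIs, and in practice the aggregation would presumably happen in the data pipeline rather than in the server.

```rust
use std::collections::{BTreeMap, HashSet};

/// Platform values as listed in the ADR (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Platform {
    Desktop,
    Fenix,
    Ios,
    Other,
}

/// Device family values as listed in the ADR (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum DeviceFamily {
    Desktop,
    Mobile,
    Tablet,
    Other,
}

/// Hypothetical record holding the fields the ADR requires per event.
#[derive(Debug, Clone)]
pub struct SyncEvent {
    pub hashed_fxa_uid: String,
    /// Kept so a device's events can be located for opt-out/deletion.
    pub hashed_device_id: String,
    /// Seconds since the Unix epoch (UTC).
    pub timestamp_secs: u64,
    pub platform: Platform,
    pub device_family: DeviceFamily,
}

const SECS_PER_DAY: u64 = 86_400;

/// Count distinct users per UTC day.
/// The map key is the number of whole days since the Unix epoch.
pub fn daily_active_users(events: &[SyncEvent]) -> BTreeMap<u64, usize> {
    let mut users_by_day: BTreeMap<u64, HashSet<&str>> = BTreeMap::new();
    for event in events {
        users_by_day
            .entry(event.timestamp_secs / SECS_PER_DAY)
            .or_default()
            .insert(event.hashed_fxa_uid.as_str());
    }
    users_by_day
        .into_iter()
        .map(|(day, users)| (day, users.len()))
        .collect()
}
```

Several events from the same `hashed_fxa_uid` on one day count once, which is the "unique events" property the ADR says counter-style StatsD metrics do not provide directly.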