optimize metrics queries #2124

bcb37 · 2024-11-21T16:32:55Z

This optimization removes many joins from the metrics query generated by the analysis() method in LogRepository, and consolidates the extra queries it used to issue into a single query:

In general, there is now only one join on the very large 'log' table, where before there were at least 3.
The query is now based on the 'individual_enrollment' table, which eliminates the joins on the 'metric' and 'query' tables (that data is already passed into the method, so the literal values can be used in the query).
The extra query needed to get percentage values for categorical metrics is no longer required; percentages are calculated in the SQL.
Within Subject queries are separated from standard queries, which replaces many instances of branching logic with a single branch point.
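As a minimal sketch of how a percentage can be computed inside one SQL statement rather than from two separate counts, the snippet below runs a filtered count divided by a total count against an in-memory SQLite database. The table `obs`, its columns, and the function `percentage_by_condition` are hypothetical names used only for illustration; the PR itself targets PostgreSQL and the real schema.

```python
import sqlite3


def percentage_by_condition(rows, target):
    """Return {condition_id: percent of non-null values equal to target}.

    rows is an iterable of (condition_id, value) tuples. The table and
    column names are illustrative assumptions, not the project's schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obs (condition_id TEXT, value TEXT)")
    conn.executemany("INSERT INTO obs VALUES (?, ?)", rows)
    # One pass per group: the filtered count is the numerator and the
    # unfiltered count is the denominator, so no second query is needed.
    query = """
        SELECT condition_id,
               CAST(COUNT(value) FILTER (WHERE value = ?) AS REAL)
                   / CAST(COUNT(value) AS REAL) * 100 AS result
        FROM obs
        GROUP BY condition_id
    """
    result = dict(conn.execute(query, (target,)).fetchall())
    conn.close()
    return result
```

Calling `percentage_by_condition(rows, "GRADUATED")` yields one percentage per condition. The casts matter because dividing two integer counts would otherwise truncate to an integer.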

As an example, the following two queries were generated by the current code to get the percentage value for a categorical metric with a repeated measure of 'most recent'. The results from both queries have to be combined to calculate the percentage:

SELECT "individualEnrollment"."conditionId", count(cast(extracted.value as text)) as result FROM "experiment" "experiment" INNER JOIN "query" "queries" ON "queries"."experimentId"="experiment"."id" INNER JOIN "metric" "metric" ON "metric"."key"="queries"."metricKey" INNER JOIN "metric_log" "metric_logs" ON "metric_logs"."metricKey"="metric"."key" INNER JOIN "log" "logs" ON "logs"."id"="metric_logs"."logId" INNER JOIN (SELECT DISTINCT "individualEnrollment"."userId" as "userId", "individualEnrollment"."experimentId" as "experimentId", "individualEnrollment"."conditionId" as "conditionId" FROM "individual_enrollment" "individualEnrollment") "individualEnrollment" ON "experiment"."id" = "individualEnrollment"."experimentId" AND logs."userId" = "individualEnrollment"."userId" INNER JOIN (SELECT "logs"."data" -> 'masteryWorkspace' -> 'worksheet_grapher_a1_patterns_2step_expr' ->> 'workspaceCompletionStatus' as value, "logs"."id" as id FROM "log" "logs") "extracted" ON extracted.id = "logs"."id" WHERE "metric"."key" = $1 AND "experiment"."id" = $2 AND "queries"."id" = $3 AND logs."updatedAt" = (SELECT max(sqlog."updatedAt") FROM "log" "sqlog" WHERE sqlog."userId" = logs."userId" AND "sqlog"."data" -> 'masteryWorkspace' -> 'worksheet_grapher_a1_patterns_2step_expr' ->> 'workspaceCompletionStatus' IS NOT NULL) AND (cast(extracted.value as text)) = $4 GROUP BY "individualEnrollment"."conditionId"

SELECT "individualEnrollment"."conditionId", count(cast(extracted.value as text)) as result, COUNT(DISTINCT "individualEnrollment"."userId") as "participantsLogged" FROM "experiment" "experiment" INNER JOIN "query" "queries" ON "queries"."experimentId"="experiment"."id" INNER JOIN "metric" "metric" ON "metric"."key"="queries"."metricKey" INNER JOIN "metric_log" "metric_logs" ON "metric_logs"."metricKey"="metric"."key" INNER JOIN "log" "logs" ON "logs"."id"="metric_logs"."logId" INNER JOIN (SELECT DISTINCT "individualEnrollment"."userId" as "userId", "individualEnrollment"."experimentId" as "experimentId", "individualEnrollment"."conditionId" as "conditionId" FROM "individual_enrollment" "individualEnrollment") "individualEnrollment" ON "experiment"."id" = "individualEnrollment"."experimentId" AND logs."userId" = "individualEnrollment"."userId" INNER JOIN (SELECT "logs"."data" -> 'masteryWorkspace' -> 'worksheet_grapher_a1_patterns_2step_expr' ->> 'workspaceCompletionStatus' as value, "logs"."id" as id FROM "log" "logs") "extracted" ON extracted.id = "logs"."id" WHERE "metric"."key" = $1 AND "experiment"."id" = $2 AND "queries"."id" = $3 AND (cast(extracted.value as text)) In ($4, $5) AND logs."updatedAt" = (SELECT max(sqlog."updatedAt") FROM "log" "sqlog" WHERE sqlog."userId" = logs."userId" AND "sqlog"."data" -> 'masteryWorkspace' -> 'worksheet_grapher_a1_patterns_2step_expr' ->> 'workspaceCompletionStatus' IS NOT NULL) GROUP BY "individualEnrollment"."conditionId"

By contrast, the code in this PR generates the following single query, which calculates the percentage directly:

SELECT "individualEnrollment"."conditionId", cast(count(cast(logs.datum as text)) filter (where logs.datum = 'GRADUATED') as decimal) / cast(count(cast(logs.datum as text)) as decimal) * 100 as result, COUNT(DISTINCT "individualEnrollment"."userId") as "participantsLogged" FROM "individual_enrollment" "individualEnrollment" INNER JOIN (SELECT "userId", "datum" FROM (SELECT "userId", "logs"."data" -> 'masteryWorkspace' -> 'worksheet_grapher_a1_patterns_2step_expr' ->> 'workspaceCompletionStatus' as "datum", row_number() over (partition by "userId" order by "updatedAt" DESC) AS rn FROM "log" "logs" WHERE "logs"."data" -> 'masteryWorkspace' -> 'worksheet_grapher_a1_patterns_2step_expr' ->> 'workspaceCompletionStatus' is not null) "t" WHERE rn = 1) "logs" ON logs."userId"="individualEnrollment"."userId" WHERE "experimentId" = $1 GROUP BY "individualEnrollment"."conditionId"

Note that the subquery that selects the data point of interest (in this case 'workspaceCompletionStatus') from the 'log' table also selects a row_number() for each user's rows, which allows us to keep only the first row per user. That row is the most recent one, since the numbering is computed over (partition by "userId" order by "updatedAt" DESC). In this way, we can get our data and satisfy the repeated measure with just one join on the 'log' table.
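The row_number() technique described above can be tried in isolation. The following is a small sketch under stated assumptions: it uses an in-memory SQLite database instead of PostgreSQL, a flat `datum` column in place of the JSON path extraction, and hypothetical names (`log` with `user_id`, `updated_at`, `datum`, and the function `latest_value_per_user`).

```python
import sqlite3


def latest_value_per_user(log_rows):
    """Return {user_id: datum from that user's most recent non-null row}.

    log_rows is an iterable of (user_id, updated_at, datum) tuples, where
    updated_at is an ISO-8601 string so that text ordering matches time
    ordering. All names here are illustrative assumptions.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (user_id TEXT, updated_at TEXT, datum TEXT)")
    conn.executemany("INSERT INTO log VALUES (?, ?, ?)", log_rows)
    # Number each user's rows from newest to oldest, then keep row 1.
    # Null values are filtered out before numbering, so rn = 1 is the
    # newest row that actually carries the data point.
    query = """
        SELECT user_id, datum
        FROM (
            SELECT user_id,
                   datum,
                   ROW_NUMBER() OVER (
                       PARTITION BY user_id ORDER BY updated_at DESC
                   ) AS rn
            FROM log
            WHERE datum IS NOT NULL
        ) AS t
        WHERE rn = 1
    """
    result = dict(conn.execute(query).fetchall())
    conn.close()
    return result
```

Because the numbering and the filter both happen in one scan of the table, the result can be joined to the enrollment table once, instead of joining the log table and then comparing against a correlated max("updatedAt") subquery.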

For Within Subject queries, the strategy of wrapping the normal query as a subquery and applying the repeated measure on the outer level is retained.

optimize metrics queries

5b18f35

bcb37 requested review from VivekFitkariwala and danoswaltCL November 21, 2024 16:33

ppratikcr7 assigned bcb37 Nov 22, 2024

bcb37 added 2 commits December 3, 2024 16:40

add comments and remove unused method

05704dc

Merge branch 'dev' into optimize-metric-query

078da29

VivekFitkariwala approved these changes Dec 5, 2024

Merge branch 'dev' into optimize-metric-query

891b2ba

bcb37 merged commit d2a37c5 into dev Dec 9, 2024
14 checks passed

bcb37 deleted the optimize-metric-query branch December 9, 2024 19:49
