Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize metrics queries #2124

Merged
merged 4 commits into from
Dec 9, 2024
Merged

optimize metrics queries #2124

merged 4 commits into from
Dec 9, 2024

Conversation

bcb37
Copy link
Collaborator

@bcb37 bcb37 commented Nov 21, 2024

This optimization removes many joins from the metrics query generated by the analysis() method in the LogRepository; it also removes extra queries and consolidates them into one:

  • In general, there is now only one join of the very large 'log' table, where before there were at least 3.
  • The query is now based on the 'individual_enrollment' table, which allows for the elimination of joins on the 'metric' and 'query' tables (that data is already passed into the method so those literal values can be used in the query).
  • The extra query needed to get percentage values for categorical metrics is no longer required; percentages are calculated in the SQL.
  • Within Subject queries are separated from standard queries, which replaces many instances of branching logic with one single branch point.

As an example, the following queries were generated by the current code to get the percentage value for a categorical metric with a repeated measure of 'most recent', results from both queries have to be used to calculate the percentage:

SELECT "individualEnrollment"."conditionId", count(cast(extracted.value as text)) as result FROM "experiment" "experiment" INNER JOIN "query" "queries" ON "queries"."experimentId"="experiment"."id" INNER JOIN "metric" "metric" ON "metric"."key"="queries"."metricKey" INNER JOIN "metric_log" "metric_logs" ON "metric_logs"."metricKey"="metric"."key" INNER JOIN "log" "logs" ON "logs"."id"="metric_logs"."logId" INNER JOIN (SELECT DISTINCT "individualEnrollment"."userId" as "userId", "individualEnrollment"."experimentId" as "experimentId", "individualEnrollment"."conditionId" as "conditionId" FROM "individual_enrollment" "individualEnrollment") "individualEnrollment" ON "experiment"."id" = "individualEnrollment"."experimentId" AND logs."userId" = "individualEnrollment"."userId" INNER JOIN (SELECT "logs"."data" -> 'masteryWorkspace' -> 'worksheet_grapher_a1_patterns_2step_expr' ->> 'workspaceCompletionStatus' as value, "logs"."id" as id FROM "log" "logs") "extracted" ON extracted.id = "logs"."id" WHERE "metric"."key" = $1 AND "experiment"."id" = $2 AND "queries"."id" = $3 AND logs."updatedAt" = (SELECT max(sqlog."updatedAt") FROM "log" "sqlog" WHERE sqlog."userId" = logs."userId" AND "sqlog"."data" -> 'masteryWorkspace' -> 'worksheet_grapher_a1_patterns_2step_expr' ->> 'workspaceCompletionStatus' IS NOT NULL) AND (cast(extracted.value as text)) = $4 GROUP BY "individualEnrollment"."conditionId"

SELECT "individualEnrollment"."conditionId", count(cast(extracted.value as text)) as result, COUNT(DISTINCT "individualEnrollment"."userId") as "participantsLogged" FROM "experiment" "experiment" INNER JOIN "query" "queries" ON "queries"."experimentId"="experiment"."id" INNER JOIN "metric" "metric" ON "metric"."key"="queries"."metricKey" INNER JOIN "metric_log" "metric_logs" ON "metric_logs"."metricKey"="metric"."key" INNER JOIN "log" "logs" ON "logs"."id"="metric_logs"."logId" INNER JOIN (SELECT DISTINCT "individualEnrollment"."userId" as "userId", "individualEnrollment"."experimentId" as "experimentId", "individualEnrollment"."conditionId" as "conditionId" FROM "individual_enrollment" "individualEnrollment") "individualEnrollment" ON "experiment"."id" = "individualEnrollment"."experimentId" AND logs."userId" = "individualEnrollment"."userId" INNER JOIN (SELECT "logs"."data" -> 'masteryWorkspace' -> 'worksheet_grapher_a1_patterns_2step_expr' ->> 'workspaceCompletionStatus' as value, "logs"."id" as id FROM "log" "logs") "extracted" ON extracted.id = "logs"."id" WHERE "metric"."key" = $1 AND "experiment"."id" = $2 AND "queries"."id" = $3 AND (cast(extracted.value as text)) In ($4, $5) AND logs."updatedAt" = (SELECT max(sqlog."updatedAt") FROM "log" "sqlog" WHERE sqlog."userId" = logs."userId" AND "sqlog"."data" -> 'masteryWorkspace' -> 'worksheet_grapher_a1_patterns_2step_expr' ->> 'workspaceCompletionStatus' IS NOT NULL) GROUP BY "individualEnrollment"."conditionId"

By contrast, the code in this pr generates the following single query, which calculates the percentage:

SELECT "individualEnrollment"."conditionId", cast(count(cast(logs.datum as text)) filter (where logs.datum = 'GRADUATED') as decimal) / cast(count(cast(logs.datum as text)) as decimal) * 100 as result, COUNT(DISTINCT "individualEnrollment"."userId") as "participantsLogged" FROM "individual_enrollment" "individualEnrollment" INNER JOIN (SELECT "userId", "datum" FROM (SELECT "userId", "logs"."data" -> 'masteryWorkspace' -> 'worksheet_grapher_a1_patterns_2step_expr' ->> 'workspaceCompletionStatus' as "datum", row_number() over (partition by "userId" order by "updatedAt" DESC) AS rn FROM "log" "logs" WHERE "logs"."data" -> 'masteryWorkspace' -> 'worksheet_grapher_a1_patterns_2step_expr' ->> 'workspaceCompletionStatus' is not null) "t" WHERE rn = 1) "logs" ON logs."userId"="individualEnrollment"."userId" WHERE "experimentId" = $1 GROUP BY "individualEnrollment"."conditionId"

Note that the subquery that selects the data point of interest (in this case 'workspaceCompletionStatus' ) from the log table for a given user, also selects a 'row_number()', which allows us to select only the first row. That will be the most recent, since we are selecting over (partition by "userId" order by "updatedAt" DESC). In this way, we can get our data and satisfy the repeated measure with just one join on the 'log' table.

For Within Subject queries, the strategy of wrapping the normal query as a subquery and applying the repeated measure on the outer level is retained.

@bcb37 bcb37 merged commit d2a37c5 into dev Dec 9, 2024
14 checks passed
@bcb37 bcb37 deleted the optimize-metric-query branch December 9, 2024 19:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants