-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(funnels): speed up funnel queries by preventing properties being passed across shards #13935
Conversation
… passed across shards
@@ -30,8 +30,11 @@ | |||
min(latest_1) over (PARTITION by aggregation_target | |||
ORDER BY timestamp DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 0 PRECEDING) latest_1 | |||
FROM | |||
(SELECT aggregation_target, | |||
timestamp, | |||
(SELECT e.event as event, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is the one showing the flattening of the query clearly.
Well.... Seems all's good for the "traditional funnel" and f'd for everything else :D |
Seems only unordered is the problem (besides this one tricky thing with exclusions that I know how to fix), looking into it |
Remove more unnecessary fields
+ "{step_filter}" | ||
) | ||
|
||
query = f""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: Why split the overall thing up? IMO not adding too much clarity here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We spoke about this on a call - essentially there are ordering requirements with the functions building this query (as they have side effects) and this was the cleanest way to build things now without a large refactor
query = f""" | ||
SELECT {', '.join(_fields)} FROM events {self.EVENT_TABLE_ALIAS} | ||
select_fragment = ( | ||
f"SELECT {', '.join(_fields)}" + "{extra_select_fields}" + f" FROM events {self.EVENT_TABLE_ALIAS}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do the following in f-strings: {{extra_select_fields}}
if really needed, though the logic behind it is sus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to this
Ok this is ready to go @macobo |
pdi.person_id as aggregation_target, | ||
pdi.person_id as person_id , |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we certainly don't need this :D
Kinda don't wanna touch this PR anymore though so will fix in a follow up if OK
parsed_extra_fields = f", {', '.join(extra_fields)}" if extra_fields else "" | ||
|
||
event_query, params = FunnelEventQuery( | ||
inner_query, params = FunnelEventQuery( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: Should we name this events_query
?
extra_join = "" | ||
|
||
if self._filter.breakdown: | ||
if self._filter.breakdown_type == "cohort": | ||
extra_join = self._get_cohort_breakdown_join() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Let's remove this empty line
event_query=event_query, | ||
extra_select_fields = f", {', '.join(all_step_cols)}" if all_step_cols else "" | ||
|
||
inner_query = inner_query.format( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: Should we name this events_query
or funnel_events_query
? inner_query had me going in cycles since it was ambigious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remnant of the old implementation where the events_query
had an inner_query
. Will rename
Amazing work! |
I'll get this in because every change I've been making seems to be a bit of a struggle with checks failing due to API limits and what not. Want to get this out of sight out of mind :D Will put a PR right away addressing the review points though @macobo |
… passed across shards (#13935) * feat(funnels): speed up funnel queries by preventing properties being passed across shards * Update snapshots * Update snapshots * testing * Update snapshots * Update snapshots * new approach * Update snapshots * Update snapshots * Update snapshots * remove more unnecessary fields * update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * remove more cols and see :) * remove event_columns_to_query * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * update query building * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots * Update snapshots --------- Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Problem
#13922
Changes
Derive the funnel steps in the innermost subquery in order to prevent sending full properties across shards. This can speed up queries by up to 3-5x (estimate based on the fact that the slower the query today the greater the gains).
The current query will send data in the order of megabytes to the other shard, whereas the previous query would send gigabytes and gigabytes (see e.g. first pic).
Benchmarks:
See more
Note that I chose the path of least resistance to implement this. I think there's a refactor in order to simplify a lot of the logic (which is very messy) but I decided not to do that here. I kept things mostly as they are, with few changes.
A bunch of stale snapshots got deleted as a result of my testing.
How did you test this code?
Existing tests for funnels are all passing. Also tested manually.
Marked it as a feat because of the perf improvement but from the perspective of the query results it's just a refactor - no behavior should change.