-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataflow plan for metrics in filters #1102
Conversation
MetricGroupBy
is requestedGroupByMetrics
7d7e374
to
fbdaac7
Compare
fb530ab
to
14e3b6d
Compare
dd54247
to
4022d8e
Compare
14e3b6d
to
8e4a32f
Compare
c3611f9
to
6d2d856
Compare
61a4cd0
to
371922f
Compare
f122d99
to
f8ee5df
Compare
33a0143
to
5389b6f
Compare
0d2c0a9
to
f466576
Compare
GroupByMetrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
I was expecting this particular PR to be a lot more complicated than this, very nice!
I left two minor things inline for consideration - the sequences and the check query simplification - but otherwise this looks great! Thanks for splitting things up for me, made review a lot more straightforward.
@@ -346,6 +349,7 @@ def _build_conversion_metric_output_node( | |||
queried_linkable_specs: LinkableSpecSet, | |||
filter_spec_factory: WhereSpecFactory, | |||
time_range_constraint: Optional[TimeRangeConstraint] = None, | |||
for_group_by_source_node: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oof, I guess we do have to thread this through everywhere, eh?
continue | ||
# If joining to ComputeMetricsNode, the right node is pre-aggregated. | ||
# Since we currently only allow one entity on GroupByMetric, this won't fan out. | ||
if not isinstance(right_node, ComputeMetricsNode): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this will do for now but we'll need a more robust way of handling this than the inlined isinstance check once we have other cases we need to think about. I'll keep this in mind as I read more of these updates, because we want a way to ask if the join is ok due to aggregation state, and we may need one that works across node types.
I don't think we have access to AggregationState from dataflow plan nodes, but something like that might be what we should be checking against - that and the grain, because we'll need to match the grain eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! Definitely will need more robustness here in v2.
# If joining to ComputeMetricsNode, the right node is pre-aggregated. | ||
# Since we currently only allow one entity on GroupByMetric, this won't fan out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a TODO here to make this more robust against different grains and different node types that might satisfy this condition?
user_id AS user | ||
, SUM(revenue) AS revenue_all_time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok but this is not one with a window so it resolves to the same as the active_listings example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! I didn't add any with window
or grain_to_date
yet because those require querying with metric_time
, which is not yet supported for metric group bys. So you'll get an error if you try to use them anyway.
ON | ||
subq_30.listing = subq_42.listing | ||
) subq_44 | ||
WHERE listing__bookings > 2 AND listing__bookers > 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, we need the LEFT OUTER here because of listing__bookings IS NULL OR listing__bookings > 2
or similar expressions.
ON | ||
subq_13.listing = subq_26.listing | ||
) subq_28 | ||
WHERE listing__views_times_booking_value > 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all pretty clearly generated via --compile or whatever so they're kind of hard to read (and also that means they're asserting that the compile output doesn't change semantically, but do we know the results are correct here?).
It might be worth a cleanup pass to simplify this SQL down to its most basic by hand. Then we can look at the optimizers later to see if we can cut back on the repetition here.
Also, I think this feature makes CTEs more important, so maybe we should surface that in the feature roadmap. We don't need them right away but they've got to get some higher precedence, because this output will be quite difficult to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the really complex queries I do usually just let MF compile the SQL, then I read through it to verify and simplify it a little before adding it here. I see most of the value in these tests as making sure that the query runs end to end without error and that the SQL actually executes successfully (i.e., no columns are dropped unexpectedly, etc.)
So I'll do 2 things:
- go back through these to simplify them as much as I can
- add output tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplified the SQL more - will add output tests tomorrow.
b2b0efe
to
c3c16ab
Compare
Resolves #740
Description
Build source nodes for requested
GroupByMetrics
during dataflow plan building. This is the last step needed to enable these metrics/queries! 🎉Most of these joins require a join from a primary key to a foreign key, which is normally not allowed because a fan-out join might make metric calculation inaccurate. When joining to a pre-aggregated metric, though, we can allow that type of join since the aggregation prevents duplicate rows. We can ensure that's true by allowing no more than one entity in the join columns for
GroupByMetrics
. For V0 of this feature we'll implement that restriction. Later, we will want to enable multiple join columns (both entities and dimensions), so we will need to clarify what types of joins are allowed here.Other than that, this adds some tests and updates a couple of places with group by metric logic that was missed in previous PRs. I would recommend reviewing by commit.
Note the scope of this PR - it enables metric filters on metrics (in YAML) and in metric queries, joining by exactly one entity. It does not enable 1) metrics in the group by for queries, 2) joining to metrics via dimensions or multiple group bys, or 3) metrics in filters for distinct values queries (queries without metrics). Those will be handled in lower-priority follow ups.