Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataflow plan for metrics in filters #1102

Merged
merged 18 commits into from
Apr 2, 2024
Merged

Conversation

courtneyholcomb
Copy link
Contributor

@courtneyholcomb courtneyholcomb commented Mar 25, 2024

Resolves #740

Description

Build source nodes for requested GroupByMetrics during dataflow plan building. This is the last step needed to enable these metrics/queries! 🎉

Most of these joins require a join from a primary key to a foreign key, which is normally not allowed because a fan-out join might make metric calculation inaccurate. When joining to a pre-aggregated metric, though, we can allow that type of join since the aggregation prevents duplicate rows. We can ensure that's true by allowing no more than one entity in the join columns for GroupByMetrics. For V0 of this feature we'll implement that restriction. Later, we will want to enable multiple join columns (both entities and dimensions), so we will need to clarify what types of joins are allowed here.

Other than that, this adds some tests and updates a couple of places with group by metric logic that was missed in previous PRs. I would recommend reviewing by commit.

Note the scope of this PR - it enables metric filters on metrics (in YAML) and in metric queries, joining by exactly one entity. It does not enable 1) metrics in the group by for queries, 2) joining to metrics via dimensions or multiple group bys, or 3) metrics in filters for distinct values queries (queries without metrics). Those will be handled in lower-priority follow ups.

@cla-bot cla-bot bot added the cla:yes label Mar 25, 2024
@courtneyholcomb courtneyholcomb changed the title Build source node when MetricGroupBy is requested Dataflow plan for GroupByMetrics Mar 25, 2024
@courtneyholcomb courtneyholcomb force-pushed the court/group-by-metric-dfp branch 2 times, most recently from 7d7e374 to fbdaac7 Compare March 25, 2024 22:13
@courtneyholcomb courtneyholcomb changed the base branch from court/metric-group-by-instance-convert to court/metric-call-params March 25, 2024 22:13
@courtneyholcomb courtneyholcomb force-pushed the court/group-by-metric-dfp branch 2 times, most recently from fb530ab to 14e3b6d Compare March 27, 2024 17:50
@courtneyholcomb courtneyholcomb force-pushed the court/metric-call-params branch 3 times, most recently from dd54247 to 4022d8e Compare March 27, 2024 18:06
@courtneyholcomb courtneyholcomb force-pushed the court/metric-call-params branch 3 times, most recently from c3611f9 to 6d2d856 Compare March 27, 2024 20:19
@courtneyholcomb courtneyholcomb force-pushed the court/group-by-metric-dfp branch 3 times, most recently from 61a4cd0 to 371922f Compare March 27, 2024 22:04
@courtneyholcomb courtneyholcomb force-pushed the court/metric-call-params branch 3 times, most recently from f122d99 to f8ee5df Compare March 28, 2024 18:55
@courtneyholcomb courtneyholcomb force-pushed the court/group-by-metric-dfp branch 3 times, most recently from 33a0143 to 5389b6f Compare March 28, 2024 20:21
@courtneyholcomb courtneyholcomb changed the title Dataflow plan for GroupByMetrics Dataflow plan for metrics in filters Mar 29, 2024
@dbt-labs dbt-labs deleted a comment from github-actions bot Mar 29, 2024
@courtneyholcomb courtneyholcomb marked this pull request as ready for review March 29, 2024 23:20
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Mar 30, 2024
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

I was expecting this particular PR to be a lot more complicated than this, very nice!

I left two minor things inline for consideration - the sequences and the check query simplification - but otherwise this looks great! Thanks for splitting things up for me, made review a lot more straightforward.

@@ -346,6 +349,7 @@ def _build_conversion_metric_output_node(
queried_linkable_specs: LinkableSpecSet,
filter_spec_factory: WhereSpecFactory,
time_range_constraint: Optional[TimeRangeConstraint] = None,
for_group_by_source_node: bool = False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, I guess we do have to thread this through everywhere, eh?

metricflow/dataflow/builder/dataflow_plan_builder.py Outdated Show resolved Hide resolved
metricflow/plan_conversion/node_processor.py Outdated Show resolved Hide resolved
continue
# If joining to ComputeMetricsNode, the right node is pre-aggregated.
# Since we currently only allow one entity on GroupByMetric, this won't fan out.
if not isinstance(right_node, ComputeMetricsNode):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this will do for now but we'll need a more robust way of handling this than the inlined isinstance check once we have other cases we need to think about. I'll keep this in mind as I read more of these updates, because we want a way to ask if the join is ok due to aggregation state, and we may need one that works across node types.

I don't think we have access to AggregationState from dataflow plan nodes, but something like that might be what we should be checking against - that and the grain, because we'll need to match the grain eventually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! Definitely will need more robustness here in v2.

Comment on lines +268 to +269
# If joining to ComputeMetricsNode, the right node is pre-aggregated.
# Since we currently only allow one entity on GroupByMetric, this won't fan out.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a TODO here to make this more robust against different grains and different node types that might satisfy this condition?

Comment on lines +30 to +31
user_id AS user
, SUM(revenue) AS revenue_all_time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ok but this is not one with a window so it resolves to the same as the active_listings example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I didn't add any with window or grain_to_date yet because those require querying with metric_time, which is not yet supported for metric group bys. So you'll get an error if you try to use them anyway.

ON
subq_30.listing = subq_42.listing
) subq_44
WHERE listing__bookings > 2 AND listing__bookers > 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we need the LEFT OUTER here because of listing__bookings IS NULL OR listing__bookings > 2 or similar expressions.

ON
subq_13.listing = subq_26.listing
) subq_28
WHERE listing__views_times_booking_value > 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all pretty clearly generated via --compile or whatever so they're kind of hard to read (and also that means they're asserting that the compile output doesn't change semantically, but do we know the results are correct here?).

It might be worth a cleanup pass to simplify this SQL down to its most basic by hand. Then we can look at the optimizers later to see if we can cut back on the repetition here.

Also, I think this feature makes CTEs more important, so maybe we should surface that in the feature roadmap. We don't need them right away but they've got to get some higher precedence, because this output will be quite difficult to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the really complex queries I do usually just let MF compile the SQL, then I read through it to verify and simplify it a little before adding it here. I see most of the value in these tests as making sure that the query runs end to end without error and that the SQL actually executes successfully (i.e., no columns are dropped unexpectedly, etc.)
So I'll do 2 things:

  • go back through these to simplify them as much as I can
  • add output tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified the SQL more - will add output tests tomorrow.

Base automatically changed from court/metric-call-params to main April 2, 2024 01:35
@courtneyholcomb courtneyholcomb added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Apr 2, 2024
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Apr 2, 2024
@courtneyholcomb courtneyholcomb merged commit 831727c into main Apr 2, 2024
27 checks passed
@courtneyholcomb courtneyholcomb deleted the court/group-by-metric-dfp branch April 2, 2024 19:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Allow metrics/measures to be referenced in filters and where clause
2 participants