Dataflow plan for metrics in filters #1102

courtneyholcomb · 2024-03-25T17:53:11Z

Resolves #740

Description

Build source nodes for requested GroupByMetrics during dataflow plan building. This is the last step needed to enable these metrics/queries! 🎉

Most of these joins require a join from a primary key to a foreign key, which is normally not allowed because a fan-out join might make metric calculation inaccurate. When joining to a pre-aggregated metric, though, we can allow that type of join since the aggregation prevents duplicate rows. We can ensure that's true by allowing no more than one entity in the join columns for GroupByMetrics. For V0 of this feature we'll implement that restriction. Later, we will want to enable multiple join columns (both entities and dimensions), so we will need to clarify what types of joins are allowed here.
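
To make the fan-out concern concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, columns, and values (listings, bookings, capacity) are made up for illustration and are not taken from MetricFlow's test models. Joining from the primary-key side straight to the foreign-key side repeats each listing row once per booking and inflates the sum. Joining to a subquery that is already aggregated to one row per entity does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE listings (listing INTEGER PRIMARY KEY, capacity INTEGER);
    CREATE TABLE bookings (booking INTEGER PRIMARY KEY, listing INTEGER);
    INSERT INTO listings VALUES (1, 4), (2, 2);
    INSERT INTO bookings VALUES (10, 1), (11, 1), (12, 1), (13, 2);
    """
)

# PK -> FK join: listing 1 matches three bookings, so its capacity is counted three times.
fanned_out_capacity = conn.execute(
    """
    SELECT SUM(l.capacity)
    FROM listings l
    LEFT OUTER JOIN bookings b ON l.listing = b.listing
    """
).fetchone()[0]

# Join to a pre-aggregated subquery: at most one row per listing, so no fan-out.
pre_aggregated_capacity = conn.execute(
    """
    SELECT SUM(l.capacity)
    FROM listings l
    LEFT OUTER JOIN (
        SELECT listing, COUNT(*) AS listing__bookings
        FROM bookings
        GROUP BY listing
    ) m ON l.listing = m.listing
    """
).fetchone()[0]

print(fanned_out_capacity, pre_aggregated_capacity)
```

With this data the true total capacity is 6. The direct join reports 14, while the pre-aggregated join reports 6.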

Other than that, this adds some tests and updates a couple of places with group by metric logic that was missed in previous PRs. I would recommend reviewing by commit.

Note the scope of this PR: it enables metrics in filters, both on metrics (in YAML) and in metric queries, joining by exactly one entity. It does not enable 1) metrics in the group by for queries, 2) joining to metrics via dimensions or multiple group bys, or 3) metrics in filters for distinct values queries (queries without metrics). Those will be handled in lower-priority follow-ups.

tlento

🎉

I was expecting this particular PR to be a lot more complicated than this, very nice!

I left two minor things inline for consideration - the sequences and the check query simplification - but otherwise this looks great! Thanks for splitting things up for me, made review a lot more straightforward.

tlento · 2024-04-02T00:25:45Z

metricflow/dataflow/builder/dataflow_plan_builder.py

@@ -346,6 +349,7 @@ def _build_conversion_metric_output_node(
        queried_linkable_specs: LinkableSpecSet,
        filter_spec_factory: WhereSpecFactory,
        time_range_constraint: Optional[TimeRangeConstraint] = None,
+        for_group_by_source_node: bool = False,


Oof, I guess we do have to thread this through everywhere, eh?

metricflow/dataflow/builder/dataflow_plan_builder.py

metricflow/plan_conversion/node_processor.py

tlento · 2024-04-02T00:40:42Z

metricflow/dataflow/builder/node_evaluator.py

-                    continue
+                    # If joining to ComputeMetricsNode, the right node is pre-aggregated.
+                    # Since we currently only allow one entity on GroupByMetric, this won't fan out.
+                    if not isinstance(right_node, ComputeMetricsNode):


Ok, this will do for now but we'll need a more robust way of handling this than the inlined isinstance check once we have other cases we need to think about. I'll keep this in mind as I read more of these updates, because we want a way to ask if the join is ok due to aggregation state, and we may need one that works across node types.

I don't think we have access to AggregationState from dataflow plan nodes, but something like that might be what we should be checking against - that and the grain, because we'll need to match the grain eventually.

Agreed! Definitely will need more robustness here in v2.
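
One possible shape for the more robust check discussed above, as a rough sketch only. Every name here (NodeSummary, is_aggregated, grain, join_cannot_fan_out) is hypothetical and does not exist in MetricFlow. The idea is to ask each right-hand node for its aggregation state and grain instead of testing for a specific node class.

```python
from dataclasses import dataclass
from typing import FrozenSet


# Hypothetical stand-in for whatever a dataflow plan node could report about itself.
@dataclass(frozen=True)
class NodeSummary:
    is_aggregated: bool  # True if the node's output has already been aggregated
    grain: FrozenSet[str]  # columns that uniquely identify a row in the node's output


def join_cannot_fan_out(right_node: NodeSummary, join_on: FrozenSet[str]) -> bool:
    """Return True if the right node adds at most one row per left row.

    That holds when the right node is aggregated and every column of its
    grain is part of the join condition.
    """
    return right_node.is_aggregated and right_node.grain <= join_on


# A metric aggregated to one row per listing, joined on listing, is safe.
metric_by_listing = NodeSummary(is_aggregated=True, grain=frozenset({"listing"}))
safe = join_cannot_fan_out(metric_by_listing, frozenset({"listing"}))

# The same metric at a (listing, metric_time) grain, joined only on listing, is not.
metric_by_listing_and_time = NodeSummary(
    is_aggregated=True, grain=frozenset({"listing", "metric_time"})
)
unsafe = join_cannot_fan_out(metric_by_listing_and_time, frozenset({"listing"}))
```

A predicate like this would also cover the grain-matching case raised in the review, since a node with a finer grain than the join columns fails the subset test.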

tlento · 2024-04-02T00:41:41Z

metricflow/dataflow/builder/node_evaluator.py

+                    # If joining to ComputeMetricsNode, the right node is pre-aggregated.
+                    # Since we currently only allow one entity on GroupByMetric, this won't fan out.


Can we add a TODO here to make this more robust against different grains and different node types that might satisfy this condition?

tlento · 2024-04-02T00:50:22Z

...y/SqlQueryPlan/DuckDB/test_query_with_cumulative_metric_in_where_filter__plan0_optimized.sql

+      user_id AS user
+      , SUM(revenue) AS revenue_all_time


Oh ok but this is not one with a window so it resolves to the same as the active_listings example.

Yes! I didn't add any with window or grain_to_date yet because those require querying with metric_time, which is not yet supported for metric group bys. So you'll get an error if you try to use them anyway.

tlento · 2024-04-02T00:53:11Z

...ering.py/SqlQueryPlan/DuckDB/test_query_with_multiple_metrics_in_filter__plan0_optimized.sql

+  ON
+    subq_30.listing = subq_42.listing
+) subq_44
+WHERE listing__bookings > 2 AND listing__bookers > 1


Right, we need the LEFT OUTER here because of listing__bookings IS NULL OR listing__bookings > 2 or similar expressions.
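
A small sketch of why the join type matters here, using sqlite3 with made-up data (the table names and values are assumptions for illustration only). A listing with no bookings has no row in the pre-aggregated metric subquery, so an inner join drops it before the WHERE clause runs. A LEFT OUTER join keeps it with a NULL metric value that an IS NULL predicate can still match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE listings (listing INTEGER PRIMARY KEY);
    CREATE TABLE bookings (booking INTEGER PRIMARY KEY, listing INTEGER);
    INSERT INTO listings VALUES (1), (2), (3);
    -- Listing 1 has three bookings, listing 2 has one, listing 3 has none.
    INSERT INTO bookings VALUES (10, 1), (11, 1), (12, 1), (13, 2);
    """
)

QUERY = """
    SELECT l.listing
    FROM listings l
    {join_type} JOIN (
        SELECT listing, COUNT(*) AS listing__bookings
        FROM bookings
        GROUP BY listing
    ) m ON l.listing = m.listing
    WHERE listing__bookings IS NULL OR listing__bookings > 2
    ORDER BY l.listing
"""


def matching_listings(join_type: str) -> list:
    """Run the filter query with the given join type and return the listing ids."""
    rows = conn.execute(QUERY.format(join_type=join_type)).fetchall()
    return [row[0] for row in rows]


left_outer_result = matching_listings("LEFT OUTER")  # listing 3 survives with a NULL metric
inner_result = matching_listings("INNER")  # listing 3 is dropped by the join itself
```

With this data the LEFT OUTER version returns listings 1 and 3, while the INNER version returns only listing 1.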

tlento · 2024-04-02T01:02:40Z

tests/integration/test_cases/itest_metrics.yaml

+      ON
+        subq_13.listing = subq_26.listing
+    ) subq_28
+    WHERE listing__views_times_booking_value > 2


These are all pretty clearly generated via --compile or whatever so they're kind of hard to read (and also that means they're asserting that the compile output doesn't change semantically, but do we know the results are correct here?).

It might be worth a cleanup pass to simplify this SQL down to its most basic by hand. Then we can look at the optimizers later to see if we can cut back on the repetition here.

Also, I think this feature makes CTEs more important, so maybe we should surface that in the feature roadmap. We don't need them right away but they've got to get some higher precedence, because this output will be quite difficult to read.

For the really complex queries I do usually just let MF compile the SQL, then I read through it to verify and simplify it a little before adding it here. I see most of the value in these tests as making sure that the query runs end to end without error and that the SQL actually executes successfully (e.g., no columns are dropped unexpectedly).

So I'll do 2 things:

1) go back through these to simplify them as much as I can
2) add output tests

Simplified the SQL more - will add output tests tomorrow.

courtneyholcomb added the Skip Changelog label Mar 25, 2024

cla-bot bot added the cla:yes label Mar 25, 2024

courtneyholcomb changed the title ~~Build source node when MetricGroupBy is requested~~ Dataflow plan for GroupByMetrics Mar 25, 2024

courtneyholcomb force-pushed the court/group-by-metric-dfp branch 2 times, most recently from 7d7e374 to fbdaac7 Compare March 25, 2024 22:13

courtneyholcomb changed the base branch from court/metric-group-by-instance-convert to court/metric-call-params March 25, 2024 22:13

courtneyholcomb force-pushed the court/group-by-metric-dfp branch 2 times, most recently from fb530ab to 14e3b6d Compare March 27, 2024 17:50

courtneyholcomb force-pushed the court/metric-call-params branch 3 times, most recently from dd54247 to 4022d8e Compare March 27, 2024 18:06

courtneyholcomb force-pushed the court/group-by-metric-dfp branch from 14e3b6d to 8e4a32f Compare March 27, 2024 18:07

courtneyholcomb force-pushed the court/metric-call-params branch 3 times, most recently from c3611f9 to 6d2d856 Compare March 27, 2024 20:19

courtneyholcomb force-pushed the court/group-by-metric-dfp branch 3 times, most recently from 61a4cd0 to 371922f Compare March 27, 2024 22:04

courtneyholcomb force-pushed the court/metric-call-params branch 3 times, most recently from f122d99 to f8ee5df Compare March 28, 2024 18:55

courtneyholcomb force-pushed the court/group-by-metric-dfp branch 3 times, most recently from 33a0143 to 5389b6f Compare March 28, 2024 20:21

courtneyholcomb removed the Skip Changelog label Mar 29, 2024

courtneyholcomb force-pushed the court/group-by-metric-dfp branch from 0d2c0a9 to f466576 Compare March 29, 2024 22:58

courtneyholcomb changed the title ~~Dataflow plan for GroupByMetrics~~ Dataflow plan for metrics in filters Mar 29, 2024

dbt-labs deleted a comment from github-actions bot Mar 29, 2024

courtneyholcomb marked this pull request as ready for review March 29, 2024 23:20

courtneyholcomb requested a review from tlento March 29, 2024 23:20

github-actions bot removed the Run Tests With Other SQL Engines label Mar 30, 2024

tlento approved these changes Apr 2, 2024

Base automatically changed from court/metric-call-params to main April 2, 2024 01:35

courtneyholcomb added 14 commits April 1, 2024 19:05

Build source node when MetricGroupBy is requested

043560a

Allow PK to FK join for GroupByMetric queries

f2526c6

Add test metric with metric in filter + update snapshots

139b308

Write dataflow plan tests

0b6ca02

Add render_metric_template function for configured test cases

47f923f

Update optimizer

77ab6b2

Add query rendering tests

7f6fe1e

Add integration tests

52204f5

Update missed spec pattern logic

d629c2e

Add query rendering tests

dce3af4

Add integration tests

1d9c725

Changelog

f924c46

Update SQL engine snapshots

42e6d60

Fix integration test SQL syntax for Trino compatibility

c3c16ab

courtneyholcomb force-pushed the court/group-by-metric-dfp branch from b2b0efe to c3c16ab Compare April 2, 2024 02:06

courtneyholcomb added 4 commits April 1, 2024 20:08

Fix typing on methods

f48dd19

Simplify check queries in integration tests

fe8ae77

Add query output tests

e4b2df1

Update snapshots

357a463

courtneyholcomb added the Run Tests With Other SQL Engines label Apr 2, 2024

courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS April 2, 2024 19:31 — with GitHub Actions Inactive

github-actions bot removed the Run Tests With Other SQL Engines label Apr 2, 2024

courtneyholcomb merged commit 831727c into main Apr 2, 2024
27 checks passed

courtneyholcomb deleted the court/group-by-metric-dfp branch April 2, 2024 19:48
