Add Walker to Simplify Dataflow DAG Traversal #1206
base: main
Conversation
logger = logging.getLogger(__name__)

class DataflowDagWalker(DataflowPlanNodeVisitor, Generic[VisitorOutputT], ABC):
Oof. In my opinion, the need to write another layer of abstraction to make it easier to write logic against a design-pattern abstraction is a fireworks-display-level signal that maybe the pattern - or at least the existing implementation of it - was not the right choice.
As with the generic visitor itself, this makes it a little easier to implement arbitrary things, and ESPECIALLY arbitrary things that turn out to be glorified property accessors. In exchange, it makes it harder both to read the code and to understand what's happening at runtime, because you have to use your human brain to remember that anything hitting a DagWalker means huge chunks of the stack trace can be ignored - except, of course, when they can't, and good luck distinguishing those scenarios.
An alternative to doubling and tripling down on these generic interfaces, which only make it marginally easier to do things like count the nodes in the graph by type, is to parcel the operations out into concrete needs and then develop strongly typed interfaces around them. As an example of what we currently need:
Node level -
- Optimization classes. Then nodes can be typed according to whether or not they can reasonably be optimized. At the moment the DataflowPlan only has 3 optimizable nodes, so this can refine things in a type-safe way that doesn't require us to add these nested layers of boilerplate.
- Conversion classes (well, class, really, because as a practical matter we only actually care about dataflow -> sql). Every node needs an implementation, so every node gets one.
- Recursive node properties - these can just be property methods, no visitors required, and we should move to that model regardless of what happens with the DataflowDagWalker
Plan level -
- Node classification by type, which can be done via a single walk in the initializer for the DataflowPlan
- Plan level properties, which can be bootstrapped by initializer-level node classification and recursive node properties (both ideas are sketched below this comment).
What are the future operations we're expecting to implement here? How many of these are not better resolved via updates to our type specifications? How many of them truly require heterogeneous operations coded to node subtypes vs. implementing a "default handler" that applies to 90% of the object types in the hierarchy?
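To make the alternative above concrete, here is a minimal sketch of recursive node properties written as plain property methods, plus classification by node type done with a single walk in the plan initializer. Every name below (DataflowPlanNode, DataflowPlan, max_upstream_depth, nodes_of_type, the example node subclasses) is a simplified stand-in for illustration, not the actual MetricFlow definitions.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Dict, List, Sequence, Type


class DataflowPlanNode:
    """Simplified stand-in for a dataflow plan node."""

    def __init__(self, parent_nodes: Sequence["DataflowPlanNode"] = ()) -> None:
        self.parent_nodes = tuple(parent_nodes)

    @property
    def max_upstream_depth(self) -> int:
        """A recursive node property as a plain property method - no visitor required."""
        if not self.parent_nodes:
            return 0
        return 1 + max(parent.max_upstream_depth for parent in self.parent_nodes)


class ReadSqlSourceNode(DataflowPlanNode):
    pass


class JoinNode(DataflowPlanNode):
    pass


class DataflowPlan:
    """Simplified stand-in that classifies nodes by type with one walk in the initializer."""

    def __init__(self, sink_node: DataflowPlanNode) -> None:
        self.sink_node = sink_node
        self._nodes_by_type: Dict[Type[DataflowPlanNode], List[DataflowPlanNode]] = defaultdict(list)
        seen: set = set()
        stack = [sink_node]
        while stack:
            node = stack.pop()
            if id(node) in seen:
                continue
            seen.add(id(node))
            self._nodes_by_type[type(node)].append(node)
            stack.extend(node.parent_nodes)

    def nodes_of_type(self, node_type: Type[DataflowPlanNode]) -> Sequence[DataflowPlanNode]:
        """A plan-level property bootstrapped from the initializer-time classification."""
        return tuple(self._nodes_by_type.get(node_type, ()))


# Counting nodes by type no longer requires a visitor or a walker.
plan = DataflowPlan(JoinNode(parent_nodes=[ReadSqlSourceNode(), ReadSqlSourceNode()]))
assert plan.sink_node.max_upstream_depth == 1
assert len(plan.nodes_of_type(ReadSqlSourceNode)) == 2
```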
Your points are well noted, and they are similar to the ones raised in the visitor discussion. This is intended more as an incremental improvement to the current structure. One change I'm down to make in the short term is replacing the visitor with the walker, if the additional class is the concern.
What are the future operations we're expecting to implement here? ...
I was going for the incremental improvement, so these will require an opportunity to think / brainstorm more about upcoming work.
I was going for the incremental improvement, so these will require an opportunity to think / brainstorm more about upcoming work.
Fair enough! I have a specific set of ideas and I'll set to work on making some changes once I'm out of the weeds. I also have a concept for some type structure improvements; it would be great to chat about those as well.
This won't matter much to me one way or the other, and if it eases the boilerplate of writing traversals, that seems fine.
I will say, I think a DfsWalker that does not do a walk is a baffling construct. I just followed the link you sent me along with this one, to the execution plan update, and went through this set of stages of confusion (a simplified sketch of the pattern in question follows this comment):
Step 1 - it doesn't walk the DAG
Step 2 - wait, maybe it does, let me look at the original class
Step 3 - oh wow, it DOES walk the DAG, that's a bug, better tell Paul
Step 4 - no, wait, what's this default recursion thing?
Step 5 - oh, let me check the implementing class
Step 6 - oh ok I had it right the first time
If you like this simplification we can proceed and I'll review this tomorrow. Please do consider taking out the "disable recursion" flag.
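For readers following along, here is a simplified reconstruction of the pattern being debated: a walker whose default handling recurses into parent nodes unless a flag turns that off. The names here (DataflowDagWalker, should_recurse, handle_node, NodeNameWalker) are approximations for illustration and may not match the PR's actual implementation.

```python
from abc import ABC, abstractmethod
from typing import Generic, List, Optional, TypeVar

VisitorOutputT = TypeVar("VisitorOutputT")


class Node:
    def __init__(self, parent_nodes: Optional[List["Node"]] = None) -> None:
        self.parent_nodes = parent_nodes or []


class DataflowDagWalker(Generic[VisitorOutputT], ABC):
    def __init__(self, should_recurse: bool = True) -> None:
        # When True, walking a node first walks all of its parents; when False,
        # "walking" touches only the given node - the source of the confusion above.
        self._should_recurse = should_recurse

    def walk(self, node: Node) -> List[VisitorOutputT]:
        results: List[VisitorOutputT] = []
        if self._should_recurse:
            for parent in node.parent_nodes:
                results.extend(self.walk(parent))
        results.append(self.handle_node(node))
        return results

    @abstractmethod
    def handle_node(self, node: Node) -> VisitorOutputT:
        """Subclasses implement the per-node logic; the walker supplies the recursion."""
        raise NotImplementedError


class NodeNameWalker(DataflowDagWalker[str]):
    def handle_node(self, node: Node) -> str:
        return type(node).__name__


# With should_recurse=False, walk() produces a result for only the node it was given.
leaf = Node()
root = Node(parent_nodes=[leaf])
assert len(NodeNameWalker().walk(root)) == 2
assert len(NodeNameWalker(should_recurse=False).walk(root)) == 1
```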
If you like this simplification we can proceed and I'll review this tomorrow. Please do consider taking out the "disable recursion" flag.
For that class, I was using "walk" in a very general sense. Maybe "traverse" is a better term. For a traversal, you could specify the traversal behavior. Let me see what that looks like.
Maybe a maximum traversal depth parameter, where None means a full traversal? Seems more complicated, though.
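If the depth-limit idea were pursued, it might look roughly like the sketch below. The names (walk_from, max_depth, the Node class) are hypothetical placeholders rather than anything that exists in the codebase, and the sketch does not de-duplicate shared parents.

```python
from typing import List, Optional


class Node:
    def __init__(self, parent_nodes: Optional[List["Node"]] = None) -> None:
        self.parent_nodes = parent_nodes or []


def walk_from(node: Node, max_depth: Optional[int] = None, _depth: int = 0) -> List[Node]:
    """Depth-first traversal; max_depth=None means walk the full DAG."""
    visited = [node]
    if max_depth is not None and _depth >= max_depth:
        return visited
    for parent in node.parent_nodes:
        visited.extend(walk_from(parent, max_depth=max_depth, _depth=_depth + 1))
    return visited


# max_depth=0 visits only the starting node; max_depth=None walks everything.
leaf = Node()
root = Node(parent_nodes=[Node(parent_nodes=[leaf])])
assert len(walk_from(root, max_depth=0)) == 1
assert len(walk_from(root)) == 3
```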
Description
As per title.