add subgraph logic to post and pre order traversal #345

jperez999 · 2023-06-14T01:16:12Z

This PR adds logic to post and pre order traversal methods to handle subgraphs. It flattens the subgraph while still including the actual subgraph operator in the nodes. This is important because it ensures that when we are constructing the schemas for all the operator the subgraph operator does not get skipped. Otherwise it will cause issues with downstream operators that will essentially get schemas from operators previous to the subgraph or if there are none it will get schema from root (i.e. dataset).

github-actions · 2023-06-14T01:21:24Z

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-345

oliverholworthy · 2023-06-14T12:10:01Z

Getting an error about two subgraphs with the same name in the tests now that postorder_iter_nodes returns nodes within subgraphs.

One option seems to be to remove this line in _find_subgraphs which is no longer required now that the postorder_iter_nodes returns all the nodes

core/merlin/dag/graph.py

Line 237 in c5facda

_find_subgraphs(op.graph.output_node, subgraphs)

oliverholworthy · 2023-06-14T12:19:27Z

One other thing I noticed while looking at this is related to the way iter_nodes is currently working. It behaves slightly differently to the post and pre order versions. Using the test test_subgraph_with_summed_subgraphs as an example.

iter_nodes doesn't return the Subgraph nodes (while the post and pre order versions do)

set(postorder_iter_nodes([graph.output_node])).difference(set(iter_nodes([graph.output_node])))
# => {<Node Subgraph>, <Node Subgraph>, <Node Subgraph>, <Node Subgraph>, <Node Subgraph>, <Node Subgraph output>}

iter_nodes returns duplicates nodes (while the post and pre order versions do not)

len(list(iter_nodes([graph.output_node])))  
# => 101

len(list(postorder_iter_nodes([graph.output_node])))
# => 17

len(list(preorder_iter_nodes([graph.output_node])))
# => 17

Is that a bug in iter_nodes, or is there a reason it needs to return a different set and length of nodes compared to the others?

jperez999 · 2023-06-15T01:41:05Z

Ok, so thanks to this failing test, I went down quite a deep rabbit hole to find that some assumptions we were making were not correct. The new commits to this PR work to remedy those issues that were found:

iter_nodes ended with a list that had WAY TOO MANY nodes in it. The reason for this was because we were doing LIFO instead of FIFO. In LIFO the first thing to get popped off the queue was the first thing in the graph. This led to a scenario where when you popped the next Node in the graph it would add its parent and dependencies (the first thing in the graph) to the queue again.
Flattening out the nodes in a subgraph is not ALWAYS the correct behavior. For instance when you are fitting a graph you do not want to create a list that has the subgraph node and the nodes contained in it. That will create a scenario where you will fit the subgraph and then try to fit the individual stat operators inside the subgraph. This is not the desired behavior. In the fit case, we only want to fit on the subgraph and allow it to handle fitting the nodes within it (the subgraph). To handle this we added an optional keyword argument, flatten_subgraphs to the iteration function (iter_nodes, postorder_iter_nodes, preorder_iter_nodes) that allows you to designate a flattening when you need it. The kwarg is set to False by default to allow for backwards compatibility. Currently, this should only be necessary in merlin-systems, when dealing with node specific actions like loading, exporting and saving artifacts. It is used in the testing suite of nvtabular for a function that looks to retrieve categorify nodes to check the data within them. These changes will be reflected in merlin-systems and nvtabular, where we activate flatten_subgraphs.
_combine_schemas in node was designed with the assumption that all leaf nodes in a graph would be selection operators. Selection operators have the unique characteristic of having the same input and output schemas. However, now that we have subgraphs, it is now possible to have a subgraph as a leaf node. Subgraphs do not share this characteristic so we must account for that. To do so, an optional kwarg was added to combine schemas, input_schemas, to allow the user to specify if they want to retrieve an input schema or an output schema.

karlhigley · 2023-06-16T00:15:04Z

merlin/dag/graph.py

@@ -123,7 +123,8 @@ def _validate_node_schemas(self, root_schema, nodes, strict_dtypes=False):
    @property
    def input_schema(self):
        # leaf_node input and output schemas are the same (aka selection)
-        return _combine_schemas(self.leaf_nodes)
+        # subgraphs can also be leaf nodes now, so input and output are different
+        return _combine_schemas(self.leaf_nodes, input_schemas=True)


This seems like a good place to use two separate methods (_combine_input_schemas and combine_output_schemas) instead of a boolean flag

karlhigley · 2023-06-16T00:33:34Z

merlin/dag/node.py

@@ -613,7 +616,7 @@ def iter_nodes(nodes):


 # output node (bottom) -> selection leaf nodes (top)
-def preorder_iter_nodes(nodes):
+def preorder_iter_nodes(nodes, flatten_subgraphs=False):


Noticing that this boolean param is added to multiple functions makes me wonder if there's a missing concept here. Back in the day (before Graph, Node, etc existed and everything was just a ColumnGroup), it probably made more sense to have a plain function instead of a method, but now this is starting to look like either:

we're iterating through two different kinds of things that require different iteration behavior and could use polymorphism to achieve that

we're doing two different kinds of iteration over the same kind of thing and we could use two methods to capture those behaviors

karlhigley · 2023-06-16T00:36:47Z

merlin/dag/node.py

@@ -663,15 +671,18 @@ def _filter_by_type(elements, type_):
    return results


-def _combine_schemas(elements):
+def _combine_schemas(elements, input_schemas=False):


Seems like one approach could be to implement methods like this in order to avoid the boolean flag:

`_combine_input_schemas(elements)

`_combine_output_schemas(elements)

`_combine_schemas(schemas: List[Schema])

karlhigley

Left some style suggestions, but approving as is. You can take or leave the suggestions and merge when you're ready.

oliverholworthy · 2023-06-20T13:51:42Z

2. when you are fitting a graph you do not want to create a list that has the subgraph node and the nodes contained in it. That will create a scenario where you will fit the subgraph and then try to fit the individual stat operators inside the subgraph. This is not the desired behavior. In the fit case, we only want to fit on the subgraph and allow it to handle fitting the nodes within it (the subgraph).

@jperez999 Can you explain more about this point. What part requires that we call the fit method of the Subgraph operator instead of the nodes contained within it? In other words would it work if we called fit on the subnodes but not the Subgraph (if Subgraph was not a StatOperator).

jperez999 · 2023-06-20T14:48:38Z

Left some style suggestions, but approving as is. You can take or leave the suggestions and merge when you're ready.

I will make the updates in a subsequent PR.

when you are fitting a graph you do not want to create a list that has the subgraph node and the nodes contained in it. That will create a scenario where you will fit the subgraph and then try to fit the individual stat operators inside the subgraph. This is not the desired behavior. In the fit case, we only want to fit on the subgraph and allow it to handle fitting the nodes within it (the subgraph).

@jperez999 Can you explain more about this point. What part requires that we call the fit method of the Subgraph operator instead of the nodes contained within it? In other words would it work if we called fit on the subnodes but not the Subgraph (if Subgraph was not a StatOperator).

The Subgraph is a special type of operator, it is considered a graph. This means it should allow for all the capabilities/responsibilities of a graph.
We are working toward a position where you could have certain subgraphs in different states, where subgraphs that have been fit already might get connected to other subgraphs that have not been fit. In this case, if you do not allow for subgraphs to fit then each node would have to carry that information. By allowing the subgraph to fit and transform instead of just being a placeholder/divider of nodes, it provides a way to keep that information about fitting and possibly other metadata in the future, at the graph (subgraph) level. We have not yet completely built that feature but we are taking incremental steps in this direction. If we just flattened everything out, we would lose the information about the subgraph as a whole.

add subgraph logic to post and pre order traversal

38fdb24

jperez999 added enhancement New feature or request clean up chore Maintenance for the repository labels Jun 14, 2023

jperez999 added this to the Merlin 23.06 milestone Jun 14, 2023

jperez999 requested a review from oliverholworthy June 14, 2023 01:16

jperez999 self-assigned this Jun 14, 2023

jperez999 mentioned this pull request Jun 14, 2023

add subgraph support for systems with tests NVIDIA-Merlin/systems#370

Merged

fix node iteration functions

628ef53

jperez999 mentioned this pull request Jun 15, 2023

fix get-cats function to add kwarg usage NVIDIA-Merlin/NVTabular#1844

Open

Merge branch 'main' into fix-traversal

e97a3dc

karlhigley reviewed Jun 16, 2023

View reviewed changes

karlhigley approved these changes Jun 16, 2023

View reviewed changes

jperez999 merged commit af07206 into NVIDIA-Merlin:main Jun 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add subgraph logic to post and pre order traversal #345

add subgraph logic to post and pre order traversal #345

jperez999 commented Jun 14, 2023

github-actions bot commented Jun 14, 2023

oliverholworthy commented Jun 14, 2023

oliverholworthy commented Jun 14, 2023

jperez999 commented Jun 15, 2023

karlhigley Jun 16, 2023

karlhigley Jun 16, 2023

karlhigley Jun 16, 2023

karlhigley left a comment

oliverholworthy commented Jun 20, 2023

jperez999 commented Jun 20, 2023

add subgraph logic to post and pre order traversal #345

add subgraph logic to post and pre order traversal #345

Conversation

jperez999 commented Jun 14, 2023

github-actions bot commented Jun 14, 2023

Documentation preview

oliverholworthy commented Jun 14, 2023

oliverholworthy commented Jun 14, 2023

jperez999 commented Jun 15, 2023

karlhigley Jun 16, 2023

Choose a reason for hiding this comment

karlhigley Jun 16, 2023

Choose a reason for hiding this comment

karlhigley Jun 16, 2023

Choose a reason for hiding this comment

karlhigley left a comment

Choose a reason for hiding this comment

oliverholworthy commented Jun 20, 2023

jperez999 commented Jun 20, 2023