Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add subgraph logic to post and pre order traversal #345

Merged
merged 3 commits into from
Jun 20, 2023

Conversation

jperez999
Copy link
Collaborator

This PR adds logic to post and pre order traversal methods to handle subgraphs. It flattens the subgraph while still including the actual subgraph operator in the nodes. This is important because it ensures that when we are constructing the schemas for all the operator the subgraph operator does not get skipped. Otherwise it will cause issues with downstream operators that will essentially get schemas from operators previous to the subgraph or if there are none it will get schema from root (i.e. dataset).

@jperez999 jperez999 added enhancement New feature or request clean up chore Maintenance for the repository labels Jun 14, 2023
@jperez999 jperez999 added this to the Merlin 23.06 milestone Jun 14, 2023
@jperez999 jperez999 self-assigned this Jun 14, 2023
@github-actions
Copy link

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-345

@oliverholworthy
Copy link
Member

Getting an error about two subgraphs with the same name in the tests now that postorder_iter_nodes returns nodes within subgraphs.

One option seems to be to remove this line in _find_subgraphs which is no longer required now that the postorder_iter_nodes returns all the nodes

_find_subgraphs(op.graph.output_node, subgraphs)

@oliverholworthy
Copy link
Member

One other thing I noticed while looking at this is related to the way iter_nodes is currently working. It behaves slightly differently to the post and pre order versions. Using the test test_subgraph_with_summed_subgraphs as an example.

  • iter_nodes doesn't return the Subgraph nodes (while the post and pre order versions do)
set(postorder_iter_nodes([graph.output_node])).difference(set(iter_nodes([graph.output_node])))
# => {<Node Subgraph>, <Node Subgraph>, <Node Subgraph>, <Node Subgraph>, <Node Subgraph>, <Node Subgraph output>}
  • iter_nodes returns duplicates nodes (while the post and pre order versions do not)
len(list(iter_nodes([graph.output_node])))  
# => 101

len(list(postorder_iter_nodes([graph.output_node])))
# => 17

len(list(preorder_iter_nodes([graph.output_node])))
# => 17

Is that a bug in iter_nodes, or is there a reason it needs to return a different set and length of nodes compared to the others?

@jperez999
Copy link
Collaborator Author

Ok, so thanks to this failing test, I went down quite a deep rabbit hole to find that some assumptions we were making were not correct. The new commits to this PR work to remedy those issues that were found:

  1. iter_nodes ended with a list that had WAY TOO MANY nodes in it. The reason for this was because we were doing LIFO instead of FIFO. In LIFO the first thing to get popped off the queue was the first thing in the graph. This led to a scenario where when you popped the next Node in the graph it would add its parent and dependencies (the first thing in the graph) to the queue again.
  2. Flattening out the nodes in a subgraph is not ALWAYS the correct behavior. For instance when you are fitting a graph you do not want to create a list that has the subgraph node and the nodes contained in it. That will create a scenario where you will fit the subgraph and then try to fit the individual stat operators inside the subgraph. This is not the desired behavior. In the fit case, we only want to fit on the subgraph and allow it to handle fitting the nodes within it (the subgraph). To handle this we added an optional keyword argument, flatten_subgraphs to the iteration function (iter_nodes, postorder_iter_nodes, preorder_iter_nodes) that allows you to designate a flattening when you need it. The kwarg is set to False by default to allow for backwards compatibility. Currently, this should only be necessary in merlin-systems, when dealing with node specific actions like loading, exporting and saving artifacts. It is used in the testing suite of nvtabular for a function that looks to retrieve categorify nodes to check the data within them. These changes will be reflected in merlin-systems and nvtabular, where we activate flatten_subgraphs.
  3. _combine_schemas in node was designed with the assumption that all leaf nodes in a graph would be selection operators. Selection operators have the unique characteristic of having the same input and output schemas. However, now that we have subgraphs, it is now possible to have a subgraph as a leaf node. Subgraphs do not share this characteristic so we must account for that. To do so, an optional kwarg was added to combine schemas, input_schemas, to allow the user to specify if they want to retrieve an input schema or an output schema.

@@ -123,7 +123,8 @@ def _validate_node_schemas(self, root_schema, nodes, strict_dtypes=False):
@property
def input_schema(self):
# leaf_node input and output schemas are the same (aka selection)
return _combine_schemas(self.leaf_nodes)
# subgraphs can also be leaf nodes now, so input and output are different
return _combine_schemas(self.leaf_nodes, input_schemas=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good place to use two separate methods (_combine_input_schemas and combine_output_schemas) instead of a boolean flag

@@ -613,7 +616,7 @@ def iter_nodes(nodes):


# output node (bottom) -> selection leaf nodes (top)
def preorder_iter_nodes(nodes):
def preorder_iter_nodes(nodes, flatten_subgraphs=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticing that this boolean param is added to multiple functions makes me wonder if there's a missing concept here. Back in the day (before Graph, Node, etc existed and everything was just a ColumnGroup), it probably made more sense to have a plain function instead of a method, but now this is starting to look like either:

  • we're iterating through two different kinds of things that require different iteration behavior and could use polymorphism to achieve that
  • we're doing two different kinds of iteration over the same kind of thing and we could use two methods to capture those behaviors

@@ -663,15 +671,18 @@ def _filter_by_type(elements, type_):
return results


def _combine_schemas(elements):
def _combine_schemas(elements, input_schemas=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like one approach could be to implement methods like this in order to avoid the boolean flag:

  • `_combine_input_schemas(elements)
  • `_combine_output_schemas(elements)
  • `_combine_schemas(schemas: List[Schema])

Copy link
Contributor

@karlhigley karlhigley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some style suggestions, but approving as is. You can take or leave the suggestions and merge when you're ready.

@oliverholworthy
Copy link
Member

2. when you are fitting a graph you do not want to create a list that has the subgraph node and the nodes contained in it. That will create a scenario where you will fit the subgraph and then try to fit the individual stat operators inside the subgraph. This is not the desired behavior. In the fit case, we only want to fit on the subgraph and allow it to handle fitting the nodes within it (the subgraph).

@jperez999 Can you explain more about this point. What part requires that we call the fit method of the Subgraph operator instead of the nodes contained within it? In other words would it work if we called fit on the subnodes but not the Subgraph (if Subgraph was not a StatOperator).

@jperez999 jperez999 merged commit af07206 into NVIDIA-Merlin:main Jun 20, 2023
@jperez999
Copy link
Collaborator Author

Left some style suggestions, but approving as is. You can take or leave the suggestions and merge when you're ready.

I will make the updates in a subsequent PR.

  1. when you are fitting a graph you do not want to create a list that has the subgraph node and the nodes contained in it. That will create a scenario where you will fit the subgraph and then try to fit the individual stat operators inside the subgraph. This is not the desired behavior. In the fit case, we only want to fit on the subgraph and allow it to handle fitting the nodes within it (the subgraph).

@jperez999 Can you explain more about this point. What part requires that we call the fit method of the Subgraph operator instead of the nodes contained within it? In other words would it work if we called fit on the subnodes but not the Subgraph (if Subgraph was not a StatOperator).

The Subgraph is a special type of operator, it is considered a graph. This means it should allow for all the capabilities/responsibilities of a graph.
We are working toward a position where you could have certain subgraphs in different states, where subgraphs that have been fit already might get connected to other subgraphs that have not been fit. In this case, if you do not allow for subgraphs to fit then each node would have to carry that information. By allowing the subgraph to fit and transform instead of just being a placeholder/divider of nodes, it provides a way to keep that information about fitting and possibly other metadata in the future, at the graph (subgraph) level. We have not yet completely built that feature but we are taking incremental steps in this direction. If we just flattened everything out, we would lose the information about the subgraph as a whole.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore Maintenance for the repository clean up enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants