add subgraph logic to post and pre order traversal #345
Getting an error about two subgraphs with the same name in the tests now that One option seems to be to remove this line in Line 237 in c5facda
One other thing I noticed while looking at this is related to the way
Is that a bug in |
Ok, so thanks to this failing test, I went down quite a deep rabbit hole to find that some assumptions we were making were not correct. The new commits to this PR work to remedy those issues that were found:
@@ -123,7 +123,8 @@ def _validate_node_schemas(self, root_schema, nodes, strict_dtypes=False):
     @property
     def input_schema(self):
         # leaf_node input and output schemas are the same (aka selection)
-        return _combine_schemas(self.leaf_nodes)
+        # subgraphs can also be leaf nodes now, so input and output are different
+        return _combine_schemas(self.leaf_nodes, input_schemas=True)
This seems like a good place to use two separate methods (`_combine_input_schemas` and `_combine_output_schemas`) instead of a boolean flag.
@@ -613,7 +616,7 @@ def iter_nodes(nodes):


 # output node (bottom) -> selection leaf nodes (top)
-def preorder_iter_nodes(nodes):
+def preorder_iter_nodes(nodes, flatten_subgraphs=False):
Noticing that this boolean param is added to multiple functions makes me wonder if there's a missing concept here. Back in the day (before `Graph`, `Node`, etc. existed and everything was just a `ColumnGroup`), it probably made more sense to have a plain function instead of a method, but now this is starting to look like either:
- we're iterating through two different kinds of things that require different iteration behavior and could use polymorphism to achieve that
- we're doing two different kinds of iteration over the same kind of thing and we could use two methods to capture those behaviors
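To make the first option concrete, here is a minimal sketch of how polymorphism could replace the flag. The `Node` and `SubgraphNode` classes and the `traversal_parents` method are hypothetical stand-ins invented for illustration, not the project's actual types:

```python
from typing import Iterator, List, Optional


class Node:
    """Hypothetical node: `parents` point toward the selection leaf nodes."""

    def __init__(self, name: str, parents: Optional[List["Node"]] = None):
        self.name = name
        self.parents = parents or []

    def traversal_parents(self) -> List["Node"]:
        # A plain node is followed only by its own parents.
        return self.parents


class SubgraphNode(Node):
    """Hypothetical node that wraps the output node of another graph."""

    def __init__(self, name: str, subgraph_output: Node,
                 parents: Optional[List[Node]] = None):
        super().__init__(name, parents)
        self.subgraph_output = subgraph_output

    def traversal_parents(self) -> List[Node]:
        # Visit the wrapped graph first, then this node's own parents.
        return [self.subgraph_output] + self.parents


def preorder_iter_nodes(nodes: List[Node]) -> Iterator[Node]:
    # output node (bottom) -> selection leaf nodes (top)
    stack = list(reversed(nodes))
    seen = set()
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        yield node
        # Each node type decides what the traversal should visit next.
        stack.extend(reversed(node.traversal_parents()))
```

With this shape the iteration function needs no `flatten_subgraphs` argument, because the subgraph-specific behavior lives on the node type. The second option would instead keep one node type and expose two separately named iteration functions.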
@@ -663,15 +671,18 @@ def _filter_by_type(elements, type_):
     return results


-def _combine_schemas(elements):
+def _combine_schemas(elements, input_schemas=False):
Seems like one approach could be to implement methods like this in order to avoid the boolean flag:
- `_combine_input_schemas(elements)`
- `_combine_output_schemas(elements)`
- `_combine_schemas(schemas: List[Schema])`
Left some style suggestions, but approving as is. You can take or leave the suggestions and merge when you're ready.
@jperez999 Can you explain more about this point? What part requires that we call the fit method of the Subgraph operator instead of the nodes contained within it? In other words, would it work if we called fit on the subnodes but not the Subgraph (if Subgraph was not a StatOperator)?
I will make the updates in a subsequent PR.
The Subgraph is a special type of operator: it is considered a graph. This means it should allow for all the capabilities and responsibilities of a graph.
This PR adds logic to the post-order and pre-order traversal methods to handle subgraphs. It flattens the subgraph while still including the actual subgraph operator in the nodes. This is important because it ensures that the subgraph operator does not get skipped when we construct the schemas for all the operators. Otherwise, downstream operators would get their schemas from the operators preceding the subgraph or, if there are none, from the root (i.e. the dataset).
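The flattening behavior described above can be sketched roughly as follows for the post-order case. This is an illustration under stated assumptions, not the PR's implementation: the `Graph`, `Subgraph`, and `Node` classes are hypothetical stand-ins, and only the `flatten_subgraphs` parameter name is taken from the diff.

```python
from typing import Iterator, List, Optional


class Graph:
    """Hypothetical graph that exposes a single output node."""

    def __init__(self, output_node: "Node"):
        self.output_node = output_node


class Subgraph:
    """Hypothetical operator that wraps a whole graph."""

    def __init__(self, name: str, graph: Graph):
        self.name = name
        self.graph = graph


class Node:
    """Hypothetical node; `parents` point toward the selection leaf nodes."""

    def __init__(self, name: str, op: Optional[Subgraph] = None,
                 parents: Optional[List["Node"]] = None):
        self.name = name
        self.op = op
        self.parents = parents or []


def postorder_iter_nodes(nodes: List[Node],
                         flatten_subgraphs: bool = False) -> Iterator[Node]:
    # selection leaf nodes (top) -> output node (bottom)
    seen = set()

    def visit(node: Node) -> Iterator[Node]:
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            yield from visit(parent)
        if flatten_subgraphs and isinstance(node.op, Subgraph):
            # Emit the nodes inside the subgraph...
            yield from visit(node.op.graph.output_node)
        # ...and still emit the subgraph node itself, so it is not skipped.
        yield node

    for node in nodes:
        yield from visit(node)
```

Because the subgraph node is yielded after its inner nodes, anything that walks this order to build schemas still sees the subgraph operator before the operators downstream of it.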