-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Allow join-free alignment of analytic expressions #1168
Conversation
bigframes/core/nodes.py
Outdated
def replace_child( | ||
self, new_child: BigFrameNode, validate: bool = False | ||
) -> UnaryNode: | ||
new_self = replace(self, child=new_child) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to open up the file to figure out this is from dataclasses. Per https://google.github.io/styleguide/pyguide.html#s2.2-imports please import dataclasses
, not individual functions / classes from it. dataclasses
is not one of the allowed exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed this import
bigframes/core/nodes.py
Outdated
self, new_child: BigFrameNode, validate: bool = False | ||
) -> UnaryNode: | ||
new_self = replace(self, child=new_child) # type: ignore | ||
if validate: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious, in what cases is it necessary to have a validate
argument instead of a public validate()
method? If it's for method chaining, we could have validate()
return self
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eh, I think I"ll remove that, its not being used, and yeah, could run validations it as a separate invocation
bigframes/core/__init__.py
Outdated
@@ -449,14 +449,40 @@ def relational_join( | |||
) | |||
return ArrayValue(join_node), (l_mapping, r_mapping) | |||
|
|||
def try_align_as_projection( | |||
def try_new_row_join( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just try_row_join
since we plan on this being the only one early next year?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
genned_ids.append(attempted_id) | ||
i = i + 1 | ||
return genned_ids | ||
return [ids.ColumnId.unique().name for _ in range(n)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this make the column IDs less deterministic than the previous logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not any less deterministic than we are now. If we want an isomorphism between query structure and output syntax, there is a bit more work that needs to be done to the system, which I think basically amounts to late binding identifiers serially through the tree.
@@ -2693,7 +2697,35 @@ def is_uniquely_named(self: BlockIndexProperties): | |||
return len(set(self.names)) == len(self.names) | |||
|
|||
|
|||
def try_row_join( | |||
def try_new_row_join( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, re try_row_join
here, but perhaps in a separate PR since I see that it would make it harder to identify the one's that should be replaced with try_legacy_row_join
if we did that.
bigframes/core/node_align.py
Outdated
) -> Optional[bigframes.core.nodes.BigFrameNode]: | ||
"""Joins the two nodes""" | ||
if l_node.projection_base != r_node.projection_base: | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there are return None
s should the be try_join_as_projection
?
bigframes/core/node_align.py
Outdated
return None | ||
# check join key | ||
for l_key, r_key in join_keys: | ||
# Caller is block, so they still work with raw strings rather than ids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we fix that? e.g. make block use ColumnId?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should do that refactor, but its a bit of an undertaking. For now, block uses string ids, and ArrayValue is responsible for wrapping them up as ColumnIds.
bigframes/core/node_align.py
Outdated
import bigframes.core.window_spec | ||
import bigframes.operations.aggregations | ||
|
||
ADDITIVE_NODES = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be helpful to have some comments about what these node types have in common. How would I decide if a node type should be added to this list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added comment explaining
bigframes/core/node_align.py
Outdated
|
||
|
||
def pull_up_selection( | ||
node: bigframes.core.nodes.BigFrameNode, rename_vars: bool = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of the rename_vars
parameter? A docstring would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added docstring. its intended to make sure that when we combine columns from two sides, we don't get conflicts.
bigframes/core/node_align.py
Outdated
(bigframes.core.expression.DerefOp(field.id), field.id) | ||
for field in node.fields | ||
) | ||
assert isinstance(node, (bigframes.core.nodes.SelectionNode, *ADDITIVE_NODES)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I remembering correctly that SelectionNode
represents a WHERE clause, based on Selection from relational algebra terminology?
Given the high-level that implicit joiner no longer handles these, I'm a little confused why we need to rewrite these? Is it so that they are all consolidated so that the nodes we can combine appear together in the tree? I would appreciate a bit more info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its actually poorly named. SelectionNode just renames/reorders existing columns, without any scalar transforms
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕