feat: Allow join-free alignment of analytic expressions #1168

Merged: 6 commits merged into main on Nov 26, 2024
Conversation

TrevorBergeron (Contributor):

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@TrevorBergeron TrevorBergeron requested review from a team as code owners November 22, 2024 00:58
@product-auto-label product-auto-label bot added the `size: l` (Pull request size is large) and `api: bigquery` (Issues related to the googleapis/python-bigquery-dataframes API) labels on Nov 22, 2024
    def replace_child(
        self, new_child: BigFrameNode, validate: bool = False
    ) -> UnaryNode:
        new_self = replace(self, child=new_child)  # type: ignore
Collaborator:

I had to open up the file to figure out this is from dataclasses. Per https://google.github.io/styleguide/pyguide.html#s2.2-imports please import dataclasses, not individual functions / classes from it. dataclasses is not one of the allowed exceptions.

Contributor (Author):

fixed this import
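For illustration, a minimal sketch of the style-guide-compliant form, importing the module rather than `replace` itself (the `BigFrameNode` and `UnaryNode` shown here are simplified stand-ins, not the real bigframes classes):

```python
import dataclasses


class BigFrameNode:
    """Simplified stand-in base class for this sketch."""


@dataclasses.dataclass(frozen=True)
class UnaryNode(BigFrameNode):
    child: BigFrameNode

    def replace_child(self, new_child: BigFrameNode) -> "UnaryNode":
        # The module-qualified call makes the dataclasses origin obvious at the call site.
        return dataclasses.replace(self, child=new_child)
```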

        self, new_child: BigFrameNode, validate: bool = False
    ) -> UnaryNode:
        new_self = replace(self, child=new_child)  # type: ignore
        if validate:
Collaborator:

I'm curious, in what cases is it necessary to have a validate argument instead of a public validate() method? If it's for method chaining, we could have validate() return self.

Contributor (Author):

Eh, I think I'll remove that; it's not being used, and yeah, validations could be run as a separate invocation.
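A rough sketch of the chaining alternative mentioned above, with `validate()` returning `self` (hypothetical class, not the bigframes API):

```python
from __future__ import annotations

import dataclasses


@dataclasses.dataclass(frozen=True)
class Node:
    name: str

    def validate(self) -> Node:
        # Raise on invalid state; return self so the call can be chained.
        if not self.name:
            raise ValueError("node must have a name")
        return self

    def with_name(self, new_name: str) -> Node:
        return dataclasses.replace(self, name=new_name)


# Validation runs as a separate, chainable invocation rather than a flag.
node = Node("a").with_name("b").validate()
```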

@@ -449,14 +449,40 @@ def relational_join(
        )
        return ArrayValue(join_node), (l_mapping, r_mapping)

-    def try_align_as_projection(
+    def try_new_row_join(
Collaborator:

Maybe just try_row_join since we plan on this being the only one early next year?

Contributor (Author):

done

-            genned_ids.append(attempted_id)
-            i = i + 1
-        return genned_ids
+        return [ids.ColumnId.unique().name for _ in range(n)]
Collaborator:

Does this make the column IDs less deterministic than the previous logic?

Contributor (Author):

Not any less deterministic than we are now. If we want an isomorphism between query structure and output syntax, there is a bit more work that needs to be done to the system, which I think basically amounts to late binding identifiers serially through the tree.
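For context, a toy contrast of the two id schemes being discussed: collision-free ids drawn from a process-global counter versus ids bound serially, so that tree shape alone determines the rendered names (illustrative only, not the bigframes internals):

```python
import itertools

_global_counter = itertools.count()


def unique_ids(n: int) -> list[str]:
    # Collision-free, but the names depend on how many ids were generated
    # earlier in the session, so rendered SQL is not reproducible on its own.
    return [f"uid_{next(_global_counter)}" for _ in range(n)]


def serial_ids(n: int, start: int = 0) -> list[str]:
    # Late-bound serial numbering: the same tree shape always yields the same
    # names, giving an isomorphism between query structure and output syntax.
    return [f"col_{i}" for i in range(start, start + n)]


print(unique_ids(3))           # varies with prior calls in the process
print(serial_ids(3, start=5))  # ['col_5', 'col_6', 'col_7'] every time
```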

@@ -2693,7 +2697,35 @@ def is_uniquely_named(self: BlockIndexProperties):
    return len(set(self.names)) == len(self.names)


def try_row_join(
def try_new_row_join(
Collaborator:

Ditto, re try_row_join here, but perhaps in a separate PR, since I see that it would make it harder to identify the ones that should be replaced with try_legacy_row_join if we did that.

) -> Optional[bigframes.core.nodes.BigFrameNode]:
    """Joins the two nodes"""
    if l_node.projection_base != r_node.projection_base:
        return None
Collaborator:

Since there are `return None`s, should this be try_join_as_projection?
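(The `try_` prefix here follows the common convention of returning `None` when the rewrite does not apply and letting the caller fall back. A hypothetical caller, sketched with stand-in classes rather than the real bigframes types, might look like:)

```python
from typing import Optional


class Node:
    """Stand-in for BigFrameNode in this sketch."""


def try_row_join(l_node: Node, r_node: Node) -> Optional[Node]:
    # Returns the aligned node, or None when join-free alignment is not
    # possible (e.g. the two nodes do not share a projection base).
    return None  # placeholder body for the sketch


def join(l_node: Node, r_node: Node) -> Node:
    aligned = try_row_join(l_node, r_node)
    if aligned is not None:
        return aligned
    # Fall back to a true relational join when alignment is not possible.
    return Node()
```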

        return None
    # check join key
    for l_key, r_key in join_keys:
        # Caller is block, so they still work with raw strings rather than ids
Collaborator:

Should we fix that? e.g. make block use ColumnId?

Contributor (Author):

Yes, we should do that refactor, but it's a bit of an undertaking. For now, block uses string ids, and ArrayValue is responsible for wrapping them up as ColumnIds.
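A minimal sketch of the boundary described above, where the block layer passes raw strings and the ArrayValue layer wraps them into typed ids (the class shape is simplified, not the real bigframes code):

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class ColumnId:
    """Simplified typed identifier used below the block layer."""
    name: str


def wrap_join_keys(join_keys: list[tuple[str, str]]) -> list[tuple[ColumnId, ColumnId]]:
    # The block layer still speaks in raw string ids; wrap them at the
    # ArrayValue boundary so everything beneath operates on typed ColumnIds.
    return [(ColumnId(l_key), ColumnId(r_key)) for l_key, r_key in join_keys]


keys = wrap_join_keys([("index_0", "index_0"), ("col_a", "col_b")])
```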

import bigframes.core.window_spec
import bigframes.operations.aggregations

ADDITIVE_NODES = (
Collaborator:

Would be helpful to have some comments about what these node types have in common. How would I decide if a node type should be added to this list?

Contributor (Author):

Added a comment explaining.
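A sketch of the kind of note that explains it, under the assumption that "additive" means column-adding, row-preserving nodes (the node names below are stand-ins, not necessarily the actual tuple members):

```python
class ProjectionNode: ...      # adds derived scalar columns
class WindowOpNode: ...        # adds windowed / analytic columns
class PromoteOffsetsNode: ...  # adds a row-offset column

# "Additive" nodes only append columns on top of their child: they never
# filter, reorder, or aggregate rows. Because row identity is preserved, two
# trees that differ only by additive nodes over the same projection base can
# be aligned column-wise without emitting a join. Rule of thumb for extending
# the list: if the child produces N rows, the node must produce exactly those
# N rows, plus extra columns.
ADDITIVE_NODES = (ProjectionNode, WindowOpNode, PromoteOffsetsNode)
```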



def pull_up_selection(
    node: bigframes.core.nodes.BigFrameNode, rename_vars: bool = False
Collaborator:

What's the purpose of the rename_vars parameter? A docstring would be helpful.

Contributor (Author):

Added a docstring. It's intended to make sure that when we combine columns from the two sides, we don't get conflicts.
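Based only on the reply above, the docstring presumably covers something like the following (signature copied from the diff; the wording and body here are a guess, not the actual docstring):

```python
def pull_up_selection(
    node: "bigframes.core.nodes.BigFrameNode", rename_vars: bool = False
):
    """Pull the selection (column renames/reorders) to the top of the subtree.

    Args:
        node: Root of the subtree to rewrite.
        rename_vars: If True, assign fresh column ids to the pulled-up
            selection so that, when columns from the two sides of an
            alignment are combined, their ids cannot collide.
    """
    ...  # sketch only; see the bigframes rewrite module for the real logic
```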

        (bigframes.core.expression.DerefOp(field.id), field.id)
        for field in node.fields
    )
    assert isinstance(node, (bigframes.core.nodes.SelectionNode, *ADDITIVE_NODES))
Collaborator:

Am I remembering correctly that SelectionNode represents a WHERE clause, based on Selection from relational algebra terminology?

Given the high-level goal that the implicit joiner no longer handles these, I'm a little confused about why we need to rewrite them. Is it so that they are all consolidated, so that the nodes we can combine appear together in the tree? I would appreciate a bit more info.

Contributor (Author):

It's actually poorly named. SelectionNode just renames/reorders existing columns, without any scalar transforms.
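In other words, a SelectionNode behaves like a pure column projection; a toy model of that behavior (not the actual node implementation):

```python
def apply_selection(row: dict, selection: list[tuple[str, str]]) -> dict:
    # selection is a list of (source_column, output_name) pairs: columns are
    # renamed and reordered, but no scalar expressions are evaluated.
    return {out_name: row[src] for src, out_name in selection}


row = {"a": 1, "b": 2, "c": 3}
# Keep c and a, renaming a to "x"; no values are computed or transformed.
print(apply_selection(row, [("c", "c"), ("a", "x")]))  # {'c': 3, 'x': 1}
```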

@product-auto-label product-auto-label bot added the `size: xl` (Pull request size is extra large) label and removed the `size: l` (Pull request size is large) label on Nov 23, 2024
@tswast tswast merged commit daef4f0 into main Nov 26, 2024
23 checks passed
@tswast tswast deleted the align_analytic branch November 26, 2024 23:13