Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug-fix: MemoryExec sort expressions do NOT refer to the projected schema #12876

Merged
merged 10 commits into from
Oct 12, 2024

Conversation

berkaysynnada
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

The MemTable contains a field sort_order: Arc<Mutex<Vec<Vec<SortExpr>>>>, and its columns refer to the original schema of the source. In the scan() method, we introduce an optional projection that is injected into MemoryExec. However, currently, the sort_order is not projected, which results in it referring to an invalid schema during planning.

I am surprised that existing tests do not catch this problem. I identified this bug in our downstream tests, where newly added aggregate fuzzer tests exposed the problem. For instance, the test test_basic_prim_aggr_group_by_mixed_string_int64() have queries that do not use column "d," but sort_keys_set contains an ordering involving "d". If we had supported ordered MemoryExec via sql (e.g., CREATE TABLE ... WITH ORDER ...), this problem would likely have surfaced earlier. Currently, we can only set the order of a MemTable using with_sort_order(), and that's what these tests do. They assign an order to a MemTable, but this order is not used either in the query result or in an intermediate step.

What changes are included in this PR?

Before constructing a MemoryExec, its sort expressions are projected using the equivalence projection API.

Are these changes tested?

I have added an assertion at the start of the order assignment in MemoryExec. If any column in the sort expressions does not appear in the schema, it throws an error.

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels Oct 11, 2024
@ozankabak ozankabak requested a review from alamb October 11, 2024 14:25
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks obvious to me but an extra pair of eyes would be great

@alamb
Copy link
Contributor

alamb commented Oct 11, 2024

where newly added aggregate fuzzer tests exposed the problem

@Rachelint for the win!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me -- thank you for the find and fix @berkaysynnada and for the review @ozankabak

I have some suggestions on maybe how to improve this but I don't think they are necessary.

@@ -241,6 +245,31 @@ impl TableProvider for MemTable {
)
})
.collect::<Result<Vec<_>>>()?;

// If there is a projection on the source, we also need to project orderings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adjusting the ordering here, I wonder if it would be less code / "just work" if you applied the projection to self.sort_order first? the normal output properties calculation should work I think 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I can simply apply a straightforward projection to the self.sort_order, as they may still require special handling. For example, if we have an ordering [a, b, c] and the projection excludes column a, the remaining ordering [b,c] would not be valid. To avoid missing edge cases like this, I prefer consulting the equivalence API to ensure correctness.

@@ -207,6 +208,20 @@ impl MemoryExec {
/// [`EquivalenceProperties`], we can keep track of these equivalences
/// and treat `a ASC` and `b DESC` as the same ordering requirement.
pub fn with_sort_information(mut self, sort_information: Vec<LexOrdering>) -> Self {
// All sort expressions must refer to the projected schema
debug_assert!({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than a debug assert, I think it is worth considering changing with_sort_information to try_with_sort_information and returning a Result<Self> with a real error such as which columns wasn't present, )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I have also insert the logic into try_with_sort_information() since I've observed that ExecutionPlan API's taking PhysicalExpr's as parameter refer to the input schema when that operator has a built-in projection like this MemoryExec. I believe we should follow this convention.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense -- thank you

@berkaysynnada
Copy link
Contributor Author

To trigger CI

@ozankabak ozankabak merged commit e7ac843 into apache:main Oct 12, 2024
24 checks passed
hailelagi pushed a commit to hailelagi/datafusion that referenced this pull request Oct 14, 2024
…hema (apache#12876)

* Update memory.rs

* add assert

* Update memory.rs

* Update memory.rs

* Update memory.rs

* address review

* Update memory.rs

* Update memory.rs

* final fix

* Fix comments in test_utils.rs

---------

Co-authored-by: Mehmet Ozan Kabak <[email protected]>
@alamb alamb added the api change Changes the API exposed to users of the crate label Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants