-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug-fix: MemoryExec sort expressions do NOT refer to the projected schema #12876
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks obvious to me but an extra pair of eyes would be great
@Rachelint for the win! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me -- thank you for the find and fix @berkaysynnada and for the review @ozankabak
I have some suggestions on maybe how to improve this but I don't think they are necessary.
@@ -241,6 +245,31 @@ impl TableProvider for MemTable { | |||
) | |||
}) | |||
.collect::<Result<Vec<_>>>()?; | |||
|
|||
// If there is a projection on the source, we also need to project orderings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of adjusting the ordering here, I wonder if it would be less code / "just work" if you applied the projection to self.sort_order
first? the normal output properties calculation should work I think 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I can simply apply a straightforward projection to the self.sort_order
, as they may still require special handling. For example, if we have an ordering [a, b, c]
and the projection excludes column a
, the remaining ordering [b,c]
would not be valid. To avoid missing edge cases like this, I prefer consulting the equivalence API to ensure correctness.
@@ -207,6 +208,20 @@ impl MemoryExec { | |||
/// [`EquivalenceProperties`], we can keep track of these equivalences | |||
/// and treat `a ASC` and `b DESC` as the same ordering requirement. | |||
pub fn with_sort_information(mut self, sort_information: Vec<LexOrdering>) -> Self { | |||
// All sort expressions must refer to the projected schema | |||
debug_assert!({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than a debug assert, I think it is worth considering changing with_sort_information
to try_with_sort_information
and returning a Result<Self>
with a real error such as which columns wasn't present, )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I have also insert the logic into try_with_sort_information()
since I've observed that ExecutionPlan API's taking PhysicalExpr
's as parameter refer to the input schema when that operator has a built-in projection like this MemoryExec
. I believe we should follow this convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense -- thank you
To trigger CI |
…hema (apache#12876) * Update memory.rs * add assert * Update memory.rs * Update memory.rs * Update memory.rs * address review * Update memory.rs * Update memory.rs * final fix * Fix comments in test_utils.rs --------- Co-authored-by: Mehmet Ozan Kabak <[email protected]>
Which issue does this PR close?
Closes #.
Rationale for this change
The
MemTable
contains a fieldsort_order: Arc<Mutex<Vec<Vec<SortExpr>>>>
, and its columns refer to the original schema of the source. In thescan()
method, we introduce an optional projection that is injected intoMemoryExec
. However, currently, thesort_order
is not projected, which results in it referring to an invalid schema during planning.I am surprised that existing tests do not catch this problem. I identified this bug in our downstream tests, where newly added aggregate fuzzer tests exposed the problem. For instance, the test
test_basic_prim_aggr_group_by_mixed_string_int64()
have queries that do not use column "d," butsort_keys_set
contains an ordering involving "d". If we had supported orderedMemoryExec
via sql (e.g., CREATE TABLE ... WITH ORDER ...), this problem would likely have surfaced earlier. Currently, we can only set the order of aMemTable
usingwith_sort_order()
, and that's what these tests do. They assign an order to aMemTable
, but this order is not used either in the query result or in an intermediate step.What changes are included in this PR?
Before constructing a
MemoryExec
, its sort expressions are projected using the equivalence projection API.Are these changes tested?
I have added an assertion at the start of the order assignment in
MemoryExec
. If any column in the sort expressions does not appear in the schema, it throws an error.Are there any user-facing changes?