Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix missing columns in orderby #4969

Closed

Conversation

xtcyclist
Copy link
Contributor

@xtcyclist xtcyclist commented Dec 1, 2022

What type of PR is this?

  • bug
  • feature
  • enhancement

What problem(s) does this PR solve?

Issue(s) number:

Close #4930

Description:

To comply with the openCyper, the scope of columns that an order-by can see depends on the whether the projection before it is changing the cardinality of the data set. In cases when the projection is neither aggregating or going with distinct, the ordery-by shall see columns before the projection. Or, in other words, users can sort the returned columns using a column that is not explicited used in the return clause.

This beheviour has not been implemented before this PR. Currently, order-by can only see columns that are either explicitly declared in the return clause or return *.

How do you solve it?

  • Adapted order-by's validator in the MatchValidator, expanding its scope when needed and feasible.
  • When expanded, needed columns are added to the projection.
  • These columns shall be invisible to the user. Thus, they are removed from the result dataset after executing Sort or TopN.

Special notes for your reviewer, ex. impact of this fix, design document, etc:

Checklist:

Tests:

  • Unit test(positive and negative cases)
  • Function test
  • Performance test
  • TCK

Affects:

  • Documentation affected (Please add the label if documentation needs to be modified.)
  • Incompatibility (If it breaks the compatibility, please describe it and add the label.)
  • If it's needed to cherry-pick (If cherry-pick to some branches is required, please label the destination version(s).)
  • Performance impacted: Consumes more CPU/Memory

Release notes:

Please confirm whether to be reflected in release notes and how to describe:

ex. Fixed the bug .....

@xtcyclist xtcyclist added wip Solution: work in progress ready-for-testing PR: ready for the CI test labels Dec 1, 2022
@xtcyclist xtcyclist force-pushed the fix_orderBy_column_missing branch from 9aff14b to f24450b Compare December 1, 2022 10:50
@xtcyclist xtcyclist removed the wip Solution: work in progress label Dec 1, 2022
@xtcyclist xtcyclist changed the title Fix order by column missing Fix missing columns in orderby Dec 1, 2022
@xtcyclist xtcyclist force-pushed the fix_orderBy_column_missing branch from d3caad5 to af4e7fd Compare December 2, 2022 02:19
with v, v1, v.player.name as name, count(v) as a0
where a0 > 0
return name
order by a0 desc, name desc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems the a0 variable is also invisible, why not to report the semantic error?

Copy link
Contributor Author

@xtcyclist xtcyclist Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not an error. According to cypher's standard, this case is valid. Please refer to the issue this PR links to, I pasted the standard there.

src/common/datatypes/Value.cpp Outdated Show resolved Hide resolved
@@ -44,6 +44,9 @@ folly::Future<Status> SortExecutor::execute() {

auto seqIter = static_cast<SequentialIter *>(iter);
std::sort(seqIter->begin(), seqIter->end(), comparator);
for (auto &idx : sort->inputVars()[0]->invisibleColIndicies) {
result.valuePtr()->getMutableDataSet().removeColumn(idx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do this in sort executor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, I put this in both sort and topn. These two operators requrie this hidden columns, after which they are not needed and shall be removed from the result dataset. I want to have a lightweight solution to only remove them logically but haven't find one. Please advise me if you have better ideas.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to relational algebra, the sort operator does not need to be responsible for these things. Column projection processing needs to be handled by the project operator.
The sort operator will be unfriendly to the optimizer if it handles these extra.

private:
Expression *expr_{nullptr};
std::string alias_;
// If some columns are required by order-by and are not explicitly
// declared in the return caluse, it is an invisible column and not to be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, this is not a good fix. Better to do that in optimizer,such as column pruning rule.

Copy link
Contributor Author

@xtcyclist xtcyclist Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. It is not an optimization issue to me. It's a functional issue. Because this type of query shall be executable without the optimizer optimizing the plan.

Previously the validator reports column-not-found. Now it expands the scope for order-by when needed and possible by selecting required columns and adding them in the query plan. These columns are not to be expected in the result set. So I mark them to be invisible in their respective YieldColumns.

It's not an validation issue anyway. It's a plan-gen issue. But it seems to me I have to do it in the validator in the current framework.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You just need to make the filter execute ahead of time so that it can see all columns and the unexpected columns will not be projected in result set, which is also consistent with cypher's semantic definition of with-where clause:

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. To summarize, we'd better prepare all data for potential aggregation and orderby before the final projection, instead of passing hidden column beneath the projection.

@xtcyclist xtcyclist force-pushed the fix_orderBy_column_missing branch 5 times, most recently from ab75dfa to 6eb62fb Compare December 6, 2022 02:38
@xtcyclist xtcyclist marked this pull request as draft December 6, 2022 03:34
@xtcyclist xtcyclist force-pushed the fix_orderBy_column_missing branch from 6eb62fb to 7dde0a1 Compare December 6, 2022 03:55
@xtcyclist
Copy link
Contributor Author

We have decided not to address this compatibility (with openCypher) issue. Order-by won't support sorting by columns not listed in the return clause. If a query wants to sort its results, it has to do so on a column listed in the return clause. BTW, there is no correctness issues on order-by.

@xtcyclist xtcyclist closed this Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready for review ready-for-testing PR: ready for the CI test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

order by report column not found while it does exists
3 participants