Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL version of unnest native druid function #13576

Merged
merged 30 commits into from
Jan 23, 2023

Conversation

somu-imply
Copy link
Contributor

@somu-imply somu-imply commented Dec 15, 2022

The goal here is to develop rules to add LogicalCorrelate and Uncollect to support the query form we have decided for SQL Unnest.

The SQL query that takes the form SELECT * FROM UNNEST(ARRAY['1','2','3']) or SELECT * FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3)) generates two components in the Calcite's logical plan (LogicalCorrelate and Uncollect) which do not have DruidConverters.

The logical plans that calcite generates for these type of queries are

 SELECT * FROM UNNEST(ARRAY['1','2','3'])
 
 Generates the plan

25:Uncollect
  23:LogicalProject(subset=[rel#24:Subset#1.NONE.[]], EXPR$0=[ARRAY('1', '2', '3')])
    4:LogicalValues(subset=[rel#22:Subset#0.NONE.[0]], tuples=[[{ 0 }]])

and

SELECT * FROM druid.numfoo, UNNEST(MV_TO_ARRAY(dim3))

Generates

80:LogicalCorrelate(correlation=[$cor0], joinType=[inner], requiredColumns=[{3}])
  6:LogicalTableScan(subset=[rel#74:Subset#0.NONE.[]], table=[[druid, numfoo]])
  78:Uncollect(subset=[rel#79:Subset#3.NONE.[]])
    76:LogicalProject(subset=[rel#77:Subset#2.NONE.[]], EXPR$0=[MV_TO_ARRAY($cor0.dim3)])
      7:LogicalValues(subset=[rel#75:Subset#1.NONE.[0]], tuples=[[{ 0 }]])

So we add a set of rules which helps convert these to Druid queries. In a nutshell the chain of rules look like the following:

Screen Shot 2022-12-15 at 4 25 36 PM

In this PR we define the rules and develop appropriate DruidRels with partial queries that converts these plans to druid queries. Testcases are added to establish the different use cases we deal with

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 🚀

I suggest adding some more tests with some other query types such as topN or group by

.idea/misc.xml Outdated
@@ -46,7 +46,7 @@
<option name="myDefaultNotNull" value="javax.annotation.Nonnull" />
<option name="myNullables">
<value>
<list size="12">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume changes to this file were unintended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted

// This exception is caught while returning false from isValidDruidQuery() method
if (ref.getField().getIndex() > rowSignature.size()) {
throw new CannotBuildQueryException(
"Cannot build query as index is higher than row size"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest showing column/field name that is missing from row signature instead of saying anything about index since is confusing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

final String columnName = rowSignature.getColumnName(ref.getField().getIndex());
final Optional<ColumnType> columnType = rowSignature.getColumnType(ref.getField().getIndex());
if (columnName == null) {
throw new ISE("Expression referred to nonexistent index[%d]", ref.getField().getIndex());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about error message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the error messages

public class DruidCorrelateUnnestRel extends DruidRel<DruidCorrelateUnnestRel>
{
// This may be needed for the explain plan later
// private static final TableDataSource DUMMY_DATA_SOURCE = new TableDataSource("__unnest__");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is this needed? else remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code has been modified

unnestDatasourceRel.getUnnestProject().getRowType()
);

String dimensionToUnnest;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: final? also suggest using inputToUnnest or something since the input doesn't have to be a dimension

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

final Project project = Preconditions.checkNotNull(partialQuery.getUnnestProject(), "unnestProject");

if (partialQuery.getAggregate() != null) {
throw new ISE("Cannot have both 'unnestProject' and 'aggregate', how can this be?");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed

@@ -54,6 +54,8 @@
private final RelNode scan;
private final Filter whereFilter;
private final Project selectProject;
// add an unnestProject
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove comment

}


// todo: make new projects with druidQueryRel projects + unnestRel projects shifted
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this todo seems done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@somu-imply somu-imply marked this pull request as ready for review January 12, 2023 20:30
.idea/misc.xml Outdated
Comment on lines 93 to 95
<component name="SuppressKotlinCodeStyleNotification">
<option name="disableForAll" value="true" />
</component>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this file appear to continue to exist in this PR, calling it out as it should likely be reverted before a merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I'll merge the master into this branch and update this

@@ -125,7 +125,7 @@ private static boolean rowsEqual(final Iterable<Object[]> rowsA, final Iterable<
final Object[] rowA = listA.get(i);
final Object[] rowB = listB.get(i);

if (!Arrays.equals(rowA, rowB)) {
if (!Arrays.deepEquals(rowA, rowB)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a hashCode method that claims that it is compatible with this method. I'm not sure that is actually true anymore because I don't believe it walks into Arrays-of-Arrays the same way that this one does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change got percolated here, I'll move to the equals here

@@ -181,7 +181,10 @@ public void inspectRuntimeShape(RuntimeShapeInspector inspector)
@Override
public Object getObject()
{
if (indexedIntsForCurrentRow == null) {
if (dimSelector.getObject() == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we switch away from using the stored reference? getObject() can do work, work we've already done, doing the same work multiple times is bad.

It looks to me like you just needed to switch it to be

if (indexedIntsForCurrentRow == null || indexedIntsForCurrentRow.size() == 0)

And then, you could perhaps make it even simpler if you check the size once when setting indexedIntsForCurrentRow and set to null when size is 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving to stored reference and setting the indexedIntsForCurrentRow to null at the time of assignment if there are no elements

Comment on lines 415 to 419
// to support rows which have only null values
// need to check if the value is not null and the size is greater than 0
if (indexedIntsForCurrentRow != null && indexedIntsForCurrentRow.size() > 0) {
return indexedIntsForCurrentRow.get(index);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we be getting any rows at all when unnesting a data point that was null to begin with? That is, if we had the array-of-null, we should be getting an IndexedInts back with 1 entry, which is null. If we didn't have anything to begin with, then we should be getting back null and probably don't have any work to do on the row anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is misleading, I'll update the comment we process the row only if it is not null and has atleast 1 value

int k = 0;
while (!unnestCursor.isDone()) {
if (k < 8) {
Assert.assertEquals(unnestDimSelector.getValue(), expectedResults.get(k).toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have expected and actual inverted. The junit assert has expected come first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, actually in this file all were inverted. Good catch ! I'll update all of them

Comment on lines 235 to 238
// This case arises in the case of a correlation where the rexNode points to a table from the left subtree
// while the underlying datasource is the scan stub created from LogicalValuesRule
// In such a case we throw a CannotBuildQueryException so that Calcite does not go ahead with this path
// This exception is caught while returning false from isValidDruidQuery() method
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this comment has been separated from its home. Is it inside of the if down below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this to inside of the if

);
}

final String columnName = rowSignature.getColumnName(index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we just search the rowSignature for the name? Why will we get a different name back than the one that we searched for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used the rexNodes name to find the index before. Now we are just using the column name of the row signature at the particular index

Comment on lines 250 to 252
if (columnName == null) {
throw new ISE("Expression referred to nonexistent index [%d] in row [%s]", index, rowSignature);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this if block can ever be run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True we are ensuring the index is always within the rowSignature length, will remove this

throw new ISE("Expression referred to nonexistent index [%d] in row [%s]", index, rowSignature);
}

return DruidExpression.ofColumn(columnType.orElse(null), columnName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we ever not have the columnType? Why do we need to orElse it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the index in bounds this should be just columnType.get()

Copy link
Contributor

@cheddar cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went to merge this, then realized that there isn't a great commit message to combine all of the commits under. I tend to try to write the very first commit message to be something that can be used for the whole PR, but here each of the commit messages tends to be a terse statement of fact. So, instead of merge, I'm going to just approve.

If that lets you merge, then please merge and come up with a good commit message for the squashed files. If you still cannot merge, please write a good commit message as a comment in the PR here and I will merge with that.

@somu-imply
Copy link
Contributor Author

This PR implements the SQL component of the native unnest functionality in Druid. Unnest in SQL through Calcite has been implemented as a combination of Correlate (the comma join part) and Uncollect (the unnest part). Here we have introduced rules to handle unnest SQL queries on a table dimension, virtual column or a constant array and appropriate rels to convert them into Druid convention to translate correctly into native Druid queries.

@clintropolis clintropolis merged commit 90d4455 into apache:master Jan 23, 2023
abhagraw pushed a commit to abhagraw/druid that referenced this pull request Feb 8, 2023
* adds the SQL component of the native unnest functionality in Druid to unnest SQL queries on a table dimension, virtual column or a constant array and convert them into native Druid queries
* unnest in SQL is implemented as a combination of Correlate (the comma join part) and Uncollect (the unnest part)
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants