-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Projections prototype #17214
Projections prototype #17214
Conversation
This PR begins to introduce the concept of projections to Druid datasources, which are similar to materialized views but are built into a segment, and which can automatically be used during query execution if the projection fits the query. This PR only contains the logic to build and query them for realtime queries, and does not contain the ability to serialize and actually store them in persisted segments, so it is effectively a toy right now. changes: * Adds ProjectionSpec interface, AggregateProjectionSpec implementation for defining rollup projections on druid datasources * Adds projections to DataSchema * Adds projection building and querying support to OnHeapIncrementalIndex
processing/src/main/java/org/apache/druid/data/input/impl/AggregateProjectionSpec.java
Fixed
Show fixed
Hide fixed
processing/src/main/java/org/apache/druid/data/input/impl/AggregateProjectionSpec.java
Fixed
Show fixed
Hide fixed
public final CloserRule closer = new CloserRule(false); | ||
|
||
public CursorFactoryProjectionTest( | ||
String name, |
Check notice
Code scanning / CodeQL
Useless parameter Note test
processing/src/main/java/org/apache/druid/data/input/impl/AggregateProjectionSpec.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/data/input/impl/AggregateProjectionSpec.java
Outdated
Show resolved
Hide resolved
// wtb some sort of virtual column comparison function that can check if projection granularity time column | ||
// satisifies query granularity virtual column | ||
// can rebind? q.canRebind("__time", p) | ||
// special handle time granularity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a revision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea sorry, still a bunch of todos and my rambling comments all over the place, this one is about wanting to dump using Granularity
at all in favor of giving some way that a virtual column can decide if it can replace __time to check for things like finer granularity. i'm not going to do that in this PR, its just notes for myself, i'm still working on cleaning this up.
processing/src/main/java/org/apache/druid/data/input/impl/AggregateProjectionSpec.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/data/input/impl/AggregateProjectionSpec.java
Outdated
Show resolved
Hide resolved
|
||
ColumnFormat getColumnFormat(String columnName); | ||
|
||
int size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what size is it exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
number of rows in the facts table, like after rollup if it is a rollup facts table, will add javadoc and maybe rename, i just picked this up since was previous name on IncrementalIndex
processing/src/main/java/org/apache/druid/segment/incremental/OnheapIncrementalIndex.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the outputname be logged in msg "Completed dim[%s] inverted with cardinality[%,d] in %,d millis." instead of dimension name?
rowNumConversions.add(IntBuffer.wrap(arr)); | ||
} | ||
|
||
final String section = "walk through and merge rows"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final String section = "walk through and merge rows"; | |
final String section = "walk through and merge rows for projections"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, still need to adjust a lot of these things, its sort of adapted from the regular flow since its pretty similar in a lot of ways
processing/src/main/java/org/apache/druid/segment/serde/ColumnPartSerde.java
Outdated
Show resolved
Hide resolved
…ot chill, thanks tests
processing/src/test/java/org/apache/druid/frame/processor/FrameProcessorExecutorTest.java
Dismissed
Show dismissed
Hide dismissed
processing/src/test/java/org/apache/druid/frame/processor/FrameProcessorExecutorTest.java
Dismissed
Show dismissed
Hide dismissed
* {@link AggregateProjectionMetadata.Schema#getTimeColumnName()}). Callers must verify this externally before | ||
* calling this method by examining {@link VirtualColumn#requiredColumns()}. | ||
* <p> | ||
* This method also does not handle other time expressions, or if the virtual column is just an identifier for a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing text
@@ -351,6 +355,13 @@ public TransformSpec getTransformSpec() | |||
return transformSpec; | |||
} | |||
|
|||
@JsonProperty | |||
@JsonInclude(JsonInclude.Include.NON_NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would prefer NON_EMPTY
here, so it only shows up if there are really projections. Unless we think we will ever have a semantic difference between projections: null
and projections: []
.
@@ -87,6 +88,7 @@ public class AutoTypeColumnMerger implements DimensionMergerV9 | |||
|
|||
public AutoTypeColumnMerger( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please take a look at adding this. I think what's going on is that for regular columns, name
and outputName
are the same; and for projection columns, name
is the parent name and outputName
is the projection column name.
It might be clearer to do String name
and @Nullable String parentName
, i.e., make name
the output name.
@@ -332,6 +349,193 @@ private void makeMetadataBinary( | |||
} | |||
} | |||
|
|||
private Metadata makeProjections( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This functions appears to have a bunch of stuff that is adapted and remixed from other functions in this class. It would be good to share common code, if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, i absolutely would like to do this, i feel like the base table is kind of just like another projection. This is true for building the incremental index as well, however I'd like to save both of these refactors for future work in order to minimize risk for now
@@ -124,6 +129,27 @@ public List<OrderBy> getOrdering() | |||
return ordering; | |||
} | |||
|
|||
@Nullable | |||
@JsonProperty | |||
@JsonInclude(JsonInclude.Include.NON_NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or NON_EMPTY
, assuming there isn't a meaningful difference between null
and []
.
@@ -228,7 +236,7 @@ public void reset() | |||
numAdvanced++; | |||
} | |||
|
|||
done = !foundMatched && (emptyRange || !baseIter.hasNext()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was the clause removed here always unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, intellij suggested it could be removed because emptyRange = !cursorIterable.iterator().hasNext();
was defined in the constructor, and baseIter = cursorIterable.iterator();
at the start of this method, and finally foundMatched
will advance all the way through the iterator if it cannot find a match, so !foundMatched
implies that hasNext is false, and emptyRange
/!baseIter.hasNext()
were effectively equivalent
@JsonCreator | ||
public Schema( | ||
@JsonProperty("name") String name, | ||
@JsonProperty("timeColumnName") @Nullable String timeColumnName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the ideal design there is no such thing as timeColumnName
. Through some introspection abilities, we should be able to select the right projections, even with time flooring, using just virtualColumns
and groupingColumns
. It's ok for now but something to think about for the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, i totally agree, i just did this for now to save some work of finding the time column until larger refactors can happen and should be harmless to remove later once that happens
matchBuilder.addReferenceedVirtualColumn(buildSpecVirtualColumn); | ||
final List<String> requiredInputs = buildSpecVirtualColumn.requiredColumns(); | ||
if (requiredInputs.size() == 1 && ColumnHolder.TIME_COLUMN_NAME.equals(requiredInputs.get(0))) { | ||
// wtb some sort of virtual column comparison function that can check if projection granularity time column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please clean up this comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, forgot about this one since it wasn't marked with // todo (clint): ...
like most of my ramblings
@@ -94,7 +94,7 @@ public class TimeAndDimsPointer implements Comparable<TimeAndDimsPointer> | |||
this.timestampSelector = timestampSelector; | |||
this.timePosition = timePosition; | |||
Preconditions.checkArgument( | |||
timePosition >= 0 && timePosition <= dimensionSelectors.length, | |||
timePosition <= dimensionSelectors.length, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose timePosition
can be -1
for projections, so part of this check had to go. Please update the message too.
@@ -1038,4 +1176,316 @@ public void clear() | |||
facts.clear(); | |||
} | |||
} | |||
|
|||
public static class OnHeapAggregateProjection implements IncrementalIndexRowSelector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is a static class, and the file is already quite large, please consider making this into its own file.
* abstract `IncrementalIndex` cursor stuff to prepare for using different "views" of the data based on the cursor build spec (#17064) * abstract `IncrementalIndex` cursor stuff to prepare to allow for possibility of using different "views" of the data based on the cursor build spec changes: * introduce `IncrementalIndexRowSelector` interface to capture how `IncrementalIndexCursor` and `IncrementalIndexColumnSelectorFactory` read data * `IncrementalIndex` implements `IncrementalIndexRowSelector` * move `FactsHolder` interface to separate file * other minor refactorings * add DataSchema.Builder to tidy stuff up a bit (#17065) * add DataSchema.Builder to tidy stuff up a bit * fixes * fixes * more style fixes * review stuff * Projections prototype (#17214)
…pache#17314) Follow up to apache#17214, adds implementations for substituteCombiningFactory so that more datasketches aggs can match projections, along with some projections tests for datasketches.
…17314) (#17323) Follow up to #17214, adds implementations for substituteCombiningFactory so that more datasketches aggs can match projections, along with some projections tests for datasketches. Co-authored-by: Clint Wylie <[email protected]>
Description
#17117 + some refactors + projections persisted segments = possibly usable prototype
todo
realtime segments:
historical segments:
projection metadata in benchmark segment:
smoosh layout:
Release note
todo