Projections prototype #17214

Merged (53 commits) on Oct 5, 2024
Conversation

clintropolis (Member) commented Oct 1, 2024

Description

#17117 + some refactors + projections persisted segments = possibly usable prototype

todo

SELECT string2, APPROX_COUNT_DISTINCT_DS_HLL("string5") FROM "druid"."projections" GROUP BY 1 ORDER BY 2
SELECT string2, SUM(long4) FROM "druid"."projections" GROUP BY 1 ORDER BY 2

realtime segments:

Benchmark                         (complexCompression)  (query)  (rowsPerSegment)  (schema)  (storageType)  (stringEncoding)  (useProjections)  (vectorize)  Mode  Cnt   Score   Error  Units
SqlProjectionsBenchmark.querySql                  none        0            150000  explicit    INCREMENTAL              UTF8              true        false  avgt    5   6.805 ± 0.928  ms/op
SqlProjectionsBenchmark.querySql                  none        0            150000  explicit    INCREMENTAL              UTF8             false        false  avgt    5  25.313 ± 2.134  ms/op
SqlProjectionsBenchmark.querySql                  none        1            150000  explicit    INCREMENTAL              UTF8              true        false  avgt    5   2.058 ± 0.429  ms/op
SqlProjectionsBenchmark.querySql                  none        1            150000  explicit    INCREMENTAL              UTF8             false        false  avgt    5  22.030 ± 2.828  ms/op

historical segments:

Benchmark                         (complexCompression)  (query)  (rowsPerSegment)  (schema)  (storageType)  (stringEncoding)  (useProjections)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlProjectionsBenchmark.querySql                   lz4        0           1500000  explicit           MMAP              UTF8              true        false  avgt    5   18.742 ±  0.959  ms/op
SqlProjectionsBenchmark.querySql                   lz4        0           1500000  explicit           MMAP              UTF8              true        force  avgt    5   18.638 ±  0.775  ms/op
SqlProjectionsBenchmark.querySql                   lz4        0           1500000  explicit           MMAP              UTF8             false        false  avgt    5  180.287 ± 11.416  ms/op
SqlProjectionsBenchmark.querySql                   lz4        0           1500000  explicit           MMAP              UTF8             false        force  avgt    5  138.844 ± 10.884  ms/op
SqlProjectionsBenchmark.querySql                   lz4        1           1500000  explicit           MMAP              UTF8              true        false  avgt    5    1.764 ±  0.271  ms/op
SqlProjectionsBenchmark.querySql                   lz4        1           1500000  explicit           MMAP              UTF8              true        force  avgt    5    1.751 ±  0.408  ms/op
SqlProjectionsBenchmark.querySql                   lz4        1           1500000  explicit           MMAP              UTF8             false        false  avgt    5   48.939 ±  2.440  ms/op
SqlProjectionsBenchmark.querySql                   lz4        1           1500000  explicit           MMAP              UTF8             false        force  avgt    5   13.600 ±  0.522  ms/op

projection metadata in benchmark segment:

  "projections" : [ {
    "type" : "aggregate",
    "schema" : {
      "name" : "string2_hourly_sums_hll",
      "timeColumnName" : "__gran",
      "virtualColumns" : [ {
        "type" : "expression",
        "name" : "__gran",
        "expression" : "timestamp_floor(__time,'PT1H')",
        "outputType" : "LONG"
      } ],
      "groupingColumns" : [ "string2", "__gran" ],
      "aggregators" : [ {
        "type" : "longSum",
        "name" : "long4_sum",
        "fieldName" : "long4"
      }, {
        "type" : "doubleSum",
        "name" : "double2_sum",
        "fieldName" : "double2"
      }, {
        "type" : "HLLSketchBuild",
        "name" : "hll_string5",
        "fieldName" : "string5",
        "lgK" : 12,
        "tgtHllType" : "HLL_4",
        "shouldFinalize" : false,
        "round" : true
      } ],
      "ordering" : [ {
        "columnName" : "string2",
        "order" : "ascending"
      }, {
        "columnName" : "__gran",
        "order" : "ascending"
      } ]
    },
    "numRows" : 2376
  } ]

smoosh layout:

v1,2147483647,1
__time,0,0,6017118
double1,0,323400142,325132505
double2,0,325132505,326676524
double3,0,326676524,338722195
double4,0,338722195,350768214
double5,0,350768214,352309359
float1,0,352309359,353802435
float2,0,353802435,355166332
float3,0,355166332,361190284
float4,0,361190284,367214296
float5,0,367214296,368575446
index.drd,0,369608752,369610074
long1,0,303717758,309709418
long2,0,309709418,311371091
long3,0,311371091,312846945
long4,0,312846945,318187662
long5,0,318187662,323400142
metadata.drd,0,369610074,369612129
multi-string1,0,58809016,98386815
multi-string2,0,98386815,116609876
multi-string3,0,116609876,136310542
multi-string4,0,136310542,168498666
multi-string5,0,168498666,303717758
nested,0,368575446,368575703
nested.__encodedColumn,0,368575721,368582039
nested.__stringDictionary,0,368575703,368575721
nested.__valueIndexes,0,368582039,368582386
rows,0,6017118,6068394
string1,0,6068394,12120081
string2,0,12120081,14819759
string2_hourly_sums_hll/__gran,0,369608356,369608752
string2_hourly_sums_hll/double2_sum,0,368593213,368600687
string2_hourly_sums_hll/hll_string5,0,368600687,368600942
string2_hourly_sums_hll/hll_string5.__complexColumn,0,369605372,369605377
string2_hourly_sums_hll/hll_string5.__complexColumn_compressed,0,368611234,369605372
string2_hourly_sums_hll/hll_string5.__complexColumn_offsets,0,368600942,368611234
string2_hourly_sums_hll/long4_sum,0,368582782,368593213
string2_hourly_sums_hll/string2,0,369605377,369608356
string3,0,14819759,17091334
string4,0,17091334,25219656
string5,0,25219656,58809016

Release note

todo

This PR begins to introduce the concept of projections to Druid datasources, which are similar to materialized views but are built into a segment, and which can automatically be used during query execution if the projection fits the query. This PR only contains the logic to build and query them for realtime queries, and does not contain the ability to serialize and actually store them in persisted segments, so it is effectively a toy right now.

changes:
* Adds ProjectionSpec interface, AggregateProjectionSpec implementation for defining rollup projections on druid datasources
* Adds projections to DataSchema
* Adds projection building and querying support to OnHeapIncrementalIndex
public final CloserRule closer = new CloserRule(false);

public CursorFactoryProjectionTest(
String name,

Check notice (Code scanning / CodeQL): Useless parameter (severity: note, tag: test)
The parameter 'name' is never used.
Comment on lines 347 to 350
// wtb some sort of virtual column comparison function that can check if projection granularity time column
// satisifies query granularity virtual column
// can rebind? q.canRebind("__time", p)
// special handle time granularity
Contributor:

Needs a revision.

Member Author:

Yeah, sorry, there are still a bunch of todos and my rambling comments all over the place. This one is about wanting to drop using Granularity entirely, in favor of giving virtual columns some way to decide whether they can replace __time, to check for things like finer granularity. I'm not going to do that in this PR; it's just notes for myself, and I'm still working on cleaning this up.


ColumnFormat getColumnFormat(String columnName);

int size();
Contributor:

What size is it, exactly?

Member Author:

The number of rows in the facts table, i.e. after rollup if it is a rollup facts table. I will add javadoc and maybe rename it; I just picked this up since it was the previous name on IncrementalIndex.
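To make the "number of rows in the facts table" concrete: under rollup, input rows sharing a grouping key collapse into one aggregated fact row, so the size is the distinct-key count rather than the ingested row count. A toy sketch in plain Java (not Druid code; names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RollupSizeDemo {
    // Roll up (key, value) rows by summing values per key, like a longSum aggregator.
    public static Map<String, Long> rollup(String[][] rows) {
        Map<String, Long> facts = new LinkedHashMap<>();
        for (String[] row : rows) {
            facts.merge(row[0], Long.parseLong(row[1]), Long::sum);
        }
        return facts;
    }

    public static void main(String[] args) {
        String[][] rows = {{"a", "1"}, {"b", "2"}, {"a", "3"}, {"c", "4"}, {"b", "5"}};
        Map<String, Long> facts = rollup(rows);
        // five input rows, three distinct keys: the rolled-up facts table has 3 rows
        System.out.println(facts.size()); // prints 3
        System.out.println(facts);        // prints {a=4, b=7, c=4}
    }
}
```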

Contributor:

Should the output name be logged in the message "Completed dim[%s] inverted with cardinality[%,d] in %,d millis." instead of the dimension name?

rowNumConversions.add(IntBuffer.wrap(arr));
}

final String section = "walk through and merge rows";
Contributor:

Suggested change
final String section = "walk through and merge rows";
final String section = "walk through and merge rows for projections";

Member Author:

Yeah, I still need to adjust a lot of these things; it is sort of adapted from the regular flow, since the two are pretty similar in a lot of ways.

@github-actions bot added the labels 'Area - Batch Ingestion' and 'Area - MSQ' (for multi stage queries, https://github.com/apache/druid/issues/12262) on Oct 5, 2024
@clintropolis clintropolis removed the WIP label Oct 5, 2024
* {@link AggregateProjectionMetadata.Schema#getTimeColumnName()}). Callers must verify this externally before
* calling this method by examining {@link VirtualColumn#requiredColumns()}.
* <p>
* This method also does not handle other time expressions, or if the virtual column is just an identifier for a
Contributor:

missing text

@@ -351,6 +355,13 @@ public TransformSpec getTransformSpec()
return transformSpec;
}

@JsonProperty
@JsonInclude(JsonInclude.Include.NON_NULL)
Contributor:

would prefer NON_EMPTY here, so it only shows up if there are really projections. Unless we think we will ever have a semantic difference between projections: null and projections: [].
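For context on the NON_NULL vs NON_EMPTY distinction: with Jackson, NON_NULL still serializes an empty list, while NON_EMPTY drops the field entirely. A minimal standalone sketch (class names are illustrative, not Druid code):

```java
import com.fasterxml.jackson.annotation.JsonInclude;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Collections;
import java.util.List;

public class IncludeModes {
    public static class NonNullSpec {
        @JsonInclude(JsonInclude.Include.NON_NULL)
        public List<String> projections = Collections.emptyList();
    }

    public static class NonEmptySpec {
        @JsonInclude(JsonInclude.Include.NON_EMPTY)
        public List<String> projections = Collections.emptyList();
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // NON_NULL still writes the empty list...
        System.out.println(mapper.writeValueAsString(new NonNullSpec()));  // {"projections":[]}
        // ...while NON_EMPTY omits the field entirely
        System.out.println(mapper.writeValueAsString(new NonEmptySpec())); // {}
    }
}
```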

@@ -87,6 +88,7 @@ public class AutoTypeColumnMerger implements DimensionMergerV9

public AutoTypeColumnMerger(
Contributor:

Please take a look at adding this. I think what's going on is that for regular columns, name and outputName are the same; and for projection columns, name is the parent name and outputName is the projection column name.

It might be clearer to do String name and @Nullable String parentName, i.e., make name the output name.

@@ -332,6 +349,193 @@ private void makeMetadataBinary(
}
}

private Metadata makeProjections(
Contributor:

This function appears to have a bunch of stuff that is adapted and remixed from other functions in this class. It would be good to share common code, if possible.

Member Author:

Yes, I absolutely would like to do this; I feel like the base table is kind of just another projection. The same is true for building the incremental index. However, I'd like to save both of these refactors for future work, to minimize risk for now.

@@ -124,6 +129,27 @@ public List<OrderBy> getOrdering()
return ordering;
}

@Nullable
@JsonProperty
@JsonInclude(JsonInclude.Include.NON_NULL)
Contributor:

or NON_EMPTY, assuming there isn't a meaningful difference between null and [].

@@ -228,7 +236,7 @@ public void reset()
numAdvanced++;
}

done = !foundMatched && (emptyRange || !baseIter.hasNext());
Contributor:

Was the clause removed here always unnecessary?

Member Author:

Yeah. IntelliJ suggested it could be removed: emptyRange = !cursorIterable.iterator().hasNext(); is set in the constructor, baseIter = cursorIterable.iterator(); at the start of this method, and foundMatched only stays false if the loop advances all the way through the iterator without finding a match, so !foundMatched implies that hasNext() is false. That makes emptyRange and !baseIter.hasNext() effectively equivalent here.
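The reasoning above can be checked with a tiny standalone sketch: once a scan loop has advanced through the whole iterator without finding a match, !foundMatched implies !iter.hasNext(), so the extra emptyRange clause is redundant (illustrative code, not the Druid class):

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

public class RedundantClauseDemo {
    // Mirrors the pattern: scan until a match; if none is found, the iterator is exhausted.
    public static boolean isDone(List<Integer> rows, Predicate<Integer> matcher) {
        boolean emptyRange = !rows.iterator().hasNext(); // as set in the constructor
        Iterator<Integer> baseIter = rows.iterator();
        boolean foundMatched = false;
        while (baseIter.hasNext()) {
            if (matcher.test(baseIter.next())) {
                foundMatched = true;
                break;
            }
        }
        // !foundMatched implies !baseIter.hasNext(), so the emptyRange clause adds nothing:
        boolean withClause = !foundMatched && (emptyRange || !baseIter.hasNext());
        boolean withoutClause = !foundMatched && !baseIter.hasNext();
        assert withClause == withoutClause;
        return withoutClause;
    }

    public static void main(String[] args) {
        System.out.println(isDone(List.of(1, 2, 3), x -> x == 2)); // false: match found
        System.out.println(isDone(List.of(1, 2, 3), x -> x == 9)); // true: exhausted without a match
        System.out.println(isDone(List.of(), x -> true));          // true: empty range
    }
}
```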

@JsonCreator
public Schema(
@JsonProperty("name") String name,
@JsonProperty("timeColumnName") @Nullable String timeColumnName,
Contributor:

I think in the ideal design there is no such thing as timeColumnName. Through some introspection abilities, we should be able to select the right projections, even with time flooring, using just virtualColumns and groupingColumns. It's ok for now but something to think about for the future.

Member Author:

Yeah, I totally agree. I just did this for now to save the work of finding the time column until larger refactors can happen; it should be harmless to remove later once that happens.

matchBuilder.addReferenceedVirtualColumn(buildSpecVirtualColumn);
final List<String> requiredInputs = buildSpecVirtualColumn.requiredColumns();
if (requiredInputs.size() == 1 && ColumnHolder.TIME_COLUMN_NAME.equals(requiredInputs.get(0))) {
// wtb some sort of virtual column comparison function that can check if projection granularity time column
Contributor:

please clean up this comment

Member Author:

Oops, forgot about this one, since it wasn't marked with // todo (clint): ... like most of my ramblings.

@@ -94,7 +94,7 @@ public class TimeAndDimsPointer implements Comparable<TimeAndDimsPointer>
this.timestampSelector = timestampSelector;
this.timePosition = timePosition;
Preconditions.checkArgument(
timePosition >= 0 && timePosition <= dimensionSelectors.length,
timePosition <= dimensionSelectors.length,
Contributor:

I suppose timePosition can be -1 for projections, so part of this check had to go. Please update the message too.

@@ -1038,4 +1176,316 @@ public void clear()
facts.clear();
}
}

public static class OnHeapAggregateProjection implements IncrementalIndexRowSelector
Contributor:

Given this is a static class, and the file is already quite large, please consider making this into its own file.

@clintropolis clintropolis added this to the 31.0.0 milestone Oct 5, 2024
@clintropolis clintropolis merged commit 0bd13bc into apache:master Oct 5, 2024
90 checks passed
@clintropolis clintropolis deleted the projections-prototype branch October 5, 2024 11:39
clintropolis added a commit to clintropolis/druid that referenced this pull request Oct 5, 2024
abhishekagarwal87 pushed a commit that referenced this pull request Oct 5, 2024
* abstract `IncrementalIndex` cursor stuff to prepare for using different "views" of the data based on the cursor build spec (#17064)

* abstract `IncrementalIndex` cursor stuff to prepare to allow for possibility of using different "views" of the data based on the cursor build spec
changes:
* introduce `IncrementalIndexRowSelector` interface to capture how `IncrementalIndexCursor` and `IncrementalIndexColumnSelectorFactory` read data
* `IncrementalIndex` implements `IncrementalIndexRowSelector`
* move `FactsHolder` interface to separate file
* other minor refactorings

* add DataSchema.Builder to tidy stuff up a bit (#17065)

* add DataSchema.Builder to tidy stuff up a bit

* fixes

* fixes

* more style fixes

* review stuff

* Projections prototype (#17214)
kfaraz pushed a commit that referenced this pull request Oct 10, 2024
…17314)

Follow up to #17214, adds implementations for substituteCombiningFactory so that more
datasketches aggs can match projections, along with some projections tests for datasketches.
kfaraz pushed a commit to kfaraz/druid that referenced this pull request Oct 10, 2024
…pache#17314)

Follow up to apache#17214, adds implementations for substituteCombiningFactory so that more
datasketches aggs can match projections, along with some projections tests for datasketches.
kfaraz added a commit that referenced this pull request Oct 10, 2024
…17314) (#17323)

Follow up to #17214, adds implementations for substituteCombiningFactory so that more
datasketches aggs can match projections, along with some projections tests for datasketches.

Co-authored-by: Clint Wylie <[email protected]>
Labels
Area - Batch Ingestion, Area - Ingestion, Area - MSQ (for multi stage queries, https://github.com/apache/druid/issues/12262), Area - Segment Format and Ser/De, Performance