remix nested columns #14014
Conversation
changes:
* introduce ColumnFormat to separate physical storage format from logical type. ColumnFormat is now used instead of ColumnCapabilities to get column handlers for segment creation
* introduce a new 'standard type' indexer, merger, and family of serializers, which is the next logical iteration of the nested column stuff. Essentially this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either STRING, ARRAY&lt;STRING&gt;, LONG, ARRAY&lt;LONG&gt;, DOUBLE, ARRAY&lt;DOUBLE&gt;, or COMPLEX&lt;json&gt;
* revert NestedDataColumnIndexer, NestedDataColumnMerger, NestedDataColumnSerializer to their pre-apache#13803 behavior (v4) for backwards compatibility
* fix a bug in RoaringBitmapSerdeFactory: if anything actually ever wrote out an empty bitmap using toBytes and then later tried to read it, the read would fail (the nerve!)
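The bitmap fix in the last bullet amounts to short-circuiting on a zero-length payload instead of handing it to the bitmap deserializer. A minimal sketch of that idea, using `java.util.BitSet` purely as a stand-in for the real Roaring bitmap classes (the actual fix lives in `RoaringBitmapSerdeFactory`):

```java
import java.util.BitSet;

// Illustrative sketch only: BitSet stands in for the RoaringBitmap types.
public class EmptyBitmapSafeRead
{
  public static BitSet fromBytes(byte[] bytes)
  {
    // an empty bitmap serializes to zero bytes; return an empty bitmap
    // directly rather than asking the deserializer to parse an empty buffer
    if (bytes == null || bytes.length == 0) {
      return new BitSet();
    }
    return BitSet.valueOf(bytes);
  }
}
```

The guard is the whole point: readers should treat "no bytes" as "no set bits" rather than a malformed payload.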
Compare 6561769 to 2f93beb
processing/src/main/java/org/apache/druid/segment/serde/StandardDoubleColumnSerializer.java
processing/src/main/java/org/apache/druid/segment/serde/StandardLongColumnSerializer.java
processing/src/main/java/org/apache/druid/segment/serde/StandardNestedColumnSerializer.java
processing/src/main/java/org/apache/druid/segment/serde/StandardStringColumnSerializer.java
if (o instanceof Object[]) {
  Object[] array = (Object[]) o;
  if (elementNumber < array.length) {
    return array[elementNumber];
Check failure (Code scanning / CodeQL): Improper validation of user-provided array index
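The flagged snippet only checks the upper bound. A hedged sketch of the validation CodeQL is asking for, rejecting negative indexes as well (class and method names here are illustrative, not the actual Druid code):

```java
// Sketch of a bounds check that satisfies the CodeQL finding: validate the
// user-supplied element index on both ends before dereferencing the array.
public class ArrayElementExtractor
{
  public static Object getArrayElement(Object o, int elementNumber)
  {
    if (o instanceof Object[]) {
      Object[] array = (Object[]) o;
      // reject negative indexes as well as out-of-range ones
      if (elementNumber >= 0 && elementNumber < array.length) {
        return array[elementNumber];
      }
    }
    return null; // missing element or non-array input
  }
}
```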
processing/src/main/java/org/apache/druid/segment/serde/ComplexMetricSerde.java
processing/src/main/java/org/apache/druid/segment/serde/StandardArrayColumnSerializer.java
processing/src/main/java/org/apache/druid/segment/serde/StandardArrayColumnSupplier.java
@JsonSubTypes.Type(name = NestedDataComplexTypeSerde.TYPE_NAME, value = NestedDataDimensionSchema.class)
@JsonSubTypes.Type(name = NestedDataComplexTypeSerde.TYPE_NAME, value = NestedDataDimensionSchema.class),
@JsonSubTypes.Type(name = StandardTypeColumnSchema.TYPE, value = StandardTypeColumnSchema.class),
@JsonSubTypes.Type(name = "auto", value = StandardTypeColumnSchema.class)
Why do we need this added as well? Shouldn't anything that is using this be able to use `StandardTypeColumnSchema.TYPE` when it persists, such that we don't need to add 2 names for the same thing?
default void mergeNestedFields(SortedMap<String, FieldTypeInfo.MutableTypeSet> mergedFields)
{
  mergedFields.put(
      NestedPathFinder.JSON_PATH_ROOT,
      new FieldTypeInfo.MutableTypeSet().add(getColumnCapabilities().toColumnType())
  );
}
This feels very nested-column-specific, but it's a change on the `DimensionIndexer` interface. Does it really need to be here? Can't `DimensionIndexer` instances make sure that they are all the same type and then use concrete methods instead of leaking something like this on the interface?
return new CapabilitiesBasedFormat(
    ColumnCapabilitiesImpl.snapshot(
        getColumnCapabilities(),
        CapabilitiesBasedFormat.DIMENSION_CAPABILITY_MERGE_LOGIC
A bit of a nit, but it seems really weird to have the `CapabilitiesBasedFormat` need to be given logic from itself? Maybe create a static `CapabilitiesBasedFormat.snapshot(ColumnCapabilities)` that does this on its own and keeps the implementation details local and private?
handler.makeMerger(
    indexSpec,
    segmentWriteOutMedium,
    dimCapabilities.get(i),
    dimFormats.get(i).toColumnCapabilities(),
    progress,
    closer
)
The `handler` came from the `dimFormat`, right? Why would it need the capabilities from `dimFormat`? Can we not eliminate the argument entirely?
The snag here is that right now the string dimension merger needs the 'has multiple values' flag set on the capabilities to determine whether it makes a serializer for a single-value or multi-value string.
I didn't want to rework the classic columns and indexers to use the `ColumnFormat` stuff at this time, to minimize the number of changes, but the answer is yes: once we make a dedicated format for the `String` dimension schema, I would think it would capture whether the string was multi-valued or not, and we could drop this argument.
I was thinking that because we are using `CapabilitiesBasedFormat` for all of the old ones, it would provide a nice clean place to just start ignoring this parameter. Though punting on it for later is probably also fine, as even if we started ignoring it, making the actual interface change is less easy... Okay.
ColumnDescriptor columnDesc = ColumnDescriptor
    .builder()
    .setValueType(dimCapabilities.get(i).getType())
    .setValueType(dimFormats.get(i).getLogicalType().getType())
    .addSerde(new NullColumnPartSerde(indexMergeResult.rowCount, indexSpec.getBitmapSerdeFactory()))
    .build();
Why do we need to store the type of the `null` column? Maybe we need a `null` type that indicates that it's always null, and we can store it like that?
This is from that explicit null-value columns change we did a while back, which saves any dimension schema in the segment that didn't have any values. I agree this can probably be better/smarter.
public void deserializeColumn(
    @SuppressWarnings("unused") String columnName,
    ByteBuffer buffer,
    ColumnBuilder builder,
    ColumnConfig columnConfig
)
{
  deserializeColumn(buffer, builder, columnConfig);
}
I don't believe that we need to add this columnName parameter. So far, the serde code has generally kept the name disassociated from the actual storage of the column as the name of the column is actually unimportant for the ability to serialize and deserialize the bytes.
It looks like this is being done because the column name is being used as a prefix on the other names in the file smoosher. That makes sense, but at this point it's not a "column name" so much as a "unique prefix". Given that it is a prefix that we expect this column to use on all of its things, I think it makes sense to serialize out the unique prefix as part of the bytes of the column itself and then read it back in from there. Let it be coincidence that the unique prefix just so happens to be the same thing as the column name.
This is pretty similar to how it was already working in the "older" versions where it got the information by deserializing a metadata object...
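The suggestion above (store the unique prefix inside the column's own bytes so the reader never needs the name passed in) could look roughly like this length-prefixed layout. All names are illustrative, not Druid's actual serde code:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch: write the smoosh-file "unique prefix" as a length-prefixed UTF-8
// string at the head of the column bytes, then read it back on deserialize.
public class PrefixedColumnBytes
{
  public static byte[] writeWithPrefix(String prefix, byte[] columnBytes)
  {
    byte[] prefixBytes = prefix.getBytes(StandardCharsets.UTF_8);
    ByteBuffer out = ByteBuffer.allocate(Integer.BYTES + prefixBytes.length + columnBytes.length);
    out.putInt(prefixBytes.length); // length header tells the reader where the prefix ends
    out.put(prefixBytes);
    out.put(columnBytes);
    return out.array();
  }

  public static String readPrefix(ByteBuffer buffer)
  {
    int length = buffer.getInt();
    byte[] prefixBytes = new byte[length];
    buffer.get(prefixBytes);
    // the buffer's position now sits at the start of the column payload
    return new String(prefixBytes, StandardCharsets.UTF_8);
  }
}
```

After `readPrefix` the remaining buffer is the column payload proper, so the name travels with the bytes and no `columnName` parameter is needed.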
Removed from this interface, since this wasn't needed for the v4 complex column, which has been reverted and can get the filename from its embedded metadata.
I have left the parameter on the `ColumnPartSerde.Deserializer` interface for now, because the new nested column part serde does need it: it doesn't store a separate metadata file embedded in it, and `IndexIO.V9IndexLoader.deserializeColumn` already has the interned column/filename passed in, so I took advantage of this and pushed it down (instead of writing the column name again inside of the nested part serde or its data file). This feels like an internal interface, so it doesn't seem that disruptive to change, but I can do this other ways if you're really against it.
@@ -294,7 +294,7 @@ public Deserializer getDeserializer()
  return new Deserializer()
  {
    @Override
    public void read(ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig)
    public void read(String columnName, ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig)
Once we remove the argument from `ComplexMetricSerde`, we don't need it here anymore either.
@@ -39,7 +39,8 @@
@JsonSubTypes.Type(name = "floatV2", value = FloatNumericColumnPartSerdeV2.class),
@JsonSubTypes.Type(name = "longV2", value = LongNumericColumnPartSerdeV2.class),
@JsonSubTypes.Type(name = "doubleV2", value = DoubleNumericColumnPartSerdeV2.class),
@JsonSubTypes.Type(name = "null", value = NullColumnPartSerde.class)
@JsonSubTypes.Type(name = "null", value = NullColumnPartSerde.class),
@JsonSubTypes.Type(name = "standard", value = StandardTypeColumnPartSerde.class)
What are the semantics the standard column part serde is trying to capture? If it's standard, does it subsume all of the others?
void read(ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig);
void read(String columnName, ByteBuffer buffer, ColumnBuilder builder, ColumnConfig columnConfig);
Let's not add the extra parameter here.
Assert.assertEquals(168, key.getEffectiveSizeBytes());
Assert.assertEquals(6, indexer.getCardinality());
Assert.assertEquals(276, key.getEffectiveSizeBytes());
Assert.assertEquals(5, indexer.getCardinality());
Is the difference in numbers because the test was validating v5, and now it's validating v4?
Yeah, just rolling back to the old results: https://github.com/apache/druid/pull/13803/files#diff-26f97e49d0f81a1797174646d89b826c3d175ddd7d4438e450fe03329cd030ffL97
String name,
IndexSpec indexSpec,
SegmentWriteOutMedium segmentWriteOutMedium,
@SuppressWarnings("unused") ProgressIndicator progressIndicator,

Check notice (Code scanning / CodeQL): Useless parameter
// this should use:
// columnHolder.getColumnFormat().getColumnSchema(dimension)
// someday...
But it cannot because? (Please add to the comment)
final SortedValueDictionary dimValues = mergable.getValueDictionary();
mergable.mergeFieldsInto(mergedFields);
It's unclear if the `dimValues` dictionary would be mutated or not, and if that's expected or not. I think that it's expected that it is not mutated. In which case, it might be nice to push `mergable.mergeFieldsInto()` down below the `if (!allNulls) { }` block, just to make it clear that those values are being evaluated and grabbed before any merge occurs.
@@ -17,27 +17,31 @@
 * under the License.
 */

package org.apache.druid.segment.column;
package org.apache.druid.segment.nested;
Package nit: I wonder if it shouldn't be `segment.column.nested`? Probably doesn't really make that much of a difference, but it seems more technically accurate.
We don't have a `segment.column.nested` right now; I almost left it in `segment.column`, since that also seems appropriate.
* Nested data column with optimized support for simple arrays. Not actually v5 in the segment since columns are now
* serialized using {@link org.apache.druid.segment.serde.NestedCommonFormatColumnPartSerde} instead of the generic
* complex type system.
*
* Not really stored in a segment as V5 since instead of V5 we migrated to {@link NestedCommonFormatColumn} which
* specializes physical format based on the types of data encountered during processing, and so versions are now
* {@link NestedCommonFormatColumnSerializer#V0} for all associated specializations.
I like the `NestedCommon` name that you picked for other stuff. It wouldn't bother me if you named this `NestedCommon` as well, fwiw. I'm also fine with the current naming, so this is more a stream-of-consciousness thought than a request for change.
/**
 * Deserializes a ByteBuffer and adds it to the ColumnBuilder. This method allows the ComplexMetricSerde
 * to implement its own versioning scheme to allow for changes of binary format in a forward-compatible manner.
 *
 * @param buffer       the buffer to deserialize
 * @param builder      ColumnBuilder to add the column to
 * @param columnConfig ColumnConfiguration used during deserialization
 */
Looks like the javadoc on this method got clobbered, probably by accident?
// this shouldn't happen, but if it does, try to close to prevent a leak
try {
  col.close();
}
catch (IOException e) {
  throw new RuntimeException(e);
}
A bit of a nit, but if you add this text to the exception message, you get the benefit of the comment explaining that you don't expect it to occur and if the exception ever does get thrown, the message is there too for whoever sees it.
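A small sketch of that nit, with the comment's context folded into the exception message (class name and message text are illustrative, not the actual Druid code):

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch: keep the "shouldn't happen" context in the exception message so
// it shows up in the stack trace, not just in a source comment.
public class SafeClose
{
  public static void closeOrThrow(Closeable col)
  {
    try {
      col.close();
    }
    catch (IOException e) {
      // the message carries the same context the inline comment used to
      throw new RuntimeException(
          "Unexpected failure closing column; this shouldn't happen, closing to prevent a leak",
          e
      );
    }
  }
}
```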
processing/src/main/java/org/apache/druid/segment/column/ColumnDescriptor.java
boolean allNulls = dimValues == null || dimValues.allNull();
sortedLookup = dimValues;
if (!allNulls) {
  mergable.mergeFieldsInto(mergedFields);
Just double checking: I had been thinking this was correct outside of the `if` statement because, even if it's all nulls, the others might have never seen a null, and so you want to merge it so that null is properly registered. Is that not a concern?
I realized that if it is all null this isn't doing anything useful anyway, so I put it inside the `if`.
Description

This PR reverts `NestedDataColumnIndexer`, `NestedDataColumnMerger`, and `NestedDataColumnSerializer` to their pre-#13803 (v4) behavior for backwards compatibility, in favor of splitting the new functionality into a new `"auto"` dimension schema type to more accurately reflect the new behavior that this PR adds.

This new 'auto type' indexer, merger, and associated family of serializers is the next logical iteration of the nested column stuff. Essentially, this is an automatic type column indexer that produces the most appropriate column for the given inputs, making either `STRING`, `ARRAY<STRING>`, `LONG`, `ARRAY<LONG>`, `DOUBLE`, `ARRAY<DOUBLE>`, or `COMPLEX<json>` columns, all sharing a common 'nested' format that allows segment merging to occur smoothly regardless of which physical format the column was written with.

To accompany this, a new `ColumnFormat` interface has been defined to separate physical storage format from logical type. `ColumnFormat` is now used instead of `ColumnCapabilities` to get column handlers for segment creation. This allows both the v4 legacy columns and the new common format nested columns to map to the `COMPLEX<json>` type.
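As a rough illustration of the 'most appropriate column' selection described above, here is a self-contained sketch of such an inference rule. The names, the all-null default, and the fallback behavior are illustrative assumptions, not Druid's actual implementation:

```java
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of "auto" type inference: inspect every value seen for
// a column and pick the most specific type that fits, falling back to
// COMPLEX<json> for mixed or unrecognized inputs.
public class AutoTypeInference
{
  public static String inferType(List<?> values)
  {
    String elementTypeSeen = null;
    boolean sawArray = false;
    boolean sawScalar = false;

    for (Object value : values) {
      if (value == null) {
        continue; // nulls fit any type
      }
      Object probe = value;
      if (value instanceof List) {
        sawArray = true;
        // type the array by its first non-null element
        probe = ((List<?>) value).stream().filter(Objects::nonNull).findFirst().orElse(null);
        if (probe == null) {
          continue;
        }
      } else {
        sawScalar = true;
      }
      String elementType = probe instanceof String ? "STRING"
          : probe instanceof Long ? "LONG"
          : probe instanceof Double ? "DOUBLE"
          : "COMPLEX<json>";
      if (elementTypeSeen == null) {
        elementTypeSeen = elementType;
      } else if (!elementTypeSeen.equals(elementType)) {
        return "COMPLEX<json>"; // mixed element types
      }
    }
    if (elementTypeSeen == null) {
      return "STRING"; // nothing but nulls; illustrative default
    }
    if (sawArray && sawScalar) {
      return "COMPLEX<json>"; // mix of scalars and arrays
    }
    return sawArray ? "ARRAY<" + elementTypeSeen + ">" : elementTypeSeen;
  }
}
```

For example, a column that only ever contained long scalars resolves to `LONG`, one that only contained string arrays resolves to `ARRAY<STRING>`, and anything mixed degrades to `COMPLEX<json>`.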
This PR also fixes a bug in `RoaringBitmapSerdeFactory`: if anything actually ever wrote out an empty bitmap using `toBytes` and then later tried to read it, it would fail (the nerve!).

Release note
A new "auto" type column schema and indexer has been added to native ingestion as the next logical iteration of the nested column functionality. This automatic type column indexer produces the most appropriate column for the given inputs, producing either `STRING`, `ARRAY<STRING>`, `LONG`, `ARRAY<LONG>`, `DOUBLE`, `ARRAY<DOUBLE>`, or `COMPLEX<json>` columns, all sharing a common 'nested' format.

All columns produced by 'auto' have indexes to aid in fast filtering (unlike classic `LONG` and `DOUBLE` columns) and use cardinality-based thresholds to attempt to only utilize these indexes when it is likely to actually speed up the query (unlike classic `STRING` columns).

`COMPLEX<json>` columns produced by this 'auto' indexer store arrays of simple scalar types differently than their 'json' (v4) counterparts, storing them as `ARRAY` typed columns. This means that the `JSON_VALUE` function can now extract entire arrays, e.g. `JSON_VALUE(nested, '$.array' RETURNING BIGINT ARRAY)`. There is no change in how arrays of complex objects are stored at this time.

This improvement also adds completely new functionality to Druid: `ARRAY` typed columns, which unlike classic multi-value `STRING` columns behave with `ARRAY` semantics. These columns can currently only be created via the 'auto' type indexer when all values are arrays with the same type of elements.

Key changed/added classes in this PR
ColumnFormat
This PR has: