
complex typed expressions #11853

Merged: 12 commits into apache:master from the complex-expressions branch on Nov 8, 2021

Conversation

@clintropolis (Member) commented Oct 28, 2021

Description

This PR adds support for Druid "complex" types in the native expression processing system, made possible by the type system enhancements in #11713. This means that all Druid data can now be used within expressions, provided expressions are added to handle these types.

ObjectBinding, the non-vectorized expression input data provider, now implements ColumnInspector so that it can retain type information when available, and a new constant expression, ComplexExpr, has been added which accepts the ExpressionType alongside the value to represent these values provided by the binding.

Several generic nullable-value binary serde methods have been moved out of ExprEval and into Types, to make them more generally available for writing nullable values that follow the | null (byte) | value (byte[]) | pattern, which all of the ExprEval types now use. I've adjusted the binary formats slightly to be more consistent, so there are some minor changes to the expression buffer aggregator, but this should cause no compatibility issues because this format is never written to segments and is contained within the processing of a single node.
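
For illustration, here is a minimal sketch of that layout for a single nullable long, using absolute ByteBuffer reads and writes so the buffer position, limit, and mark are untouched. The method names and flag values here are illustrative, not the actual Types signatures:

    import java.nio.ByteBuffer;

    public class NullableLayoutSketch
    {
      // Assumed flag values; Druid defines these in NullHandling.
      static final byte IS_NULL_BYTE = 0x01;
      static final byte IS_NOT_NULL_BYTE = 0x00;

      // Writes | null (byte) | value (long) | at offset; returns bytes written.
      static int writeNullableLong(ByteBuffer buffer, int offset, Long value)
      {
        if (value == null) {
          buffer.put(offset, IS_NULL_BYTE);
          return 1;
        }
        buffer.put(offset, IS_NOT_NULL_BYTE);
        buffer.putLong(offset + 1, value);
        return 1 + Long.BYTES;
      }

      // Reads a value written by writeNullableLong.
      static Long readNullableLong(ByteBuffer buffer, int offset)
      {
        if (buffer.get(offset) == IS_NULL_BYTE) {
          return null;
        }
        return buffer.getLong(offset + 1);
      }
    }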

A base interface has been extracted from ObjectStrategy in druid-processing. It is called ObjectByteStrategy (because naming is hard) and lives in druid-core, providing conversion between objects and a binary format for complex types. A registry mapping type names to ObjectByteStrategy implementations has been added, and registering a ComplexMetricSerde in ComplexMetrics will automatically register its ObjectStrategy in the lower-level ObjectByteStrategy registry. This would be less messy if druid-core and druid-processing were merged, since the ComplexMetrics registry could then be used directly for binary serialization of expressions, but they are not yet.
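
For example, the existing registration entry point looks roughly like this (a sketch; HyperUniquesSerde is Druid's built-in HLL serde, and per the description this call now also populates the lower-level ObjectByteStrategy registry):

    // Registering the complex type's serde; with this PR, this also registers
    // its ObjectStrategy for expression binary serialization.
    ComplexMetrics.registerSerde("hyperUnique", new HyperUniquesSerde());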

To showcase the new complex expressions, I have added four new 'hyperUnique' functions for Druid's built-in HyperLogLogCollector, and three new bloom filter expressions to the druid-bloom-filter extension (a combined usage example follows the list):

  • hyper_unique() - creates a Druid built-in HyperLogLogCollector
  • hyper_unique_add(expr1, expr2) - adds expr1 to the hyper-log-log collector expr2
  • hyper_unique_estimate(expr) - gets a double estimate for the hyper-log-log collector expr
  • hyper_unique_round_estimate(expr) - gets an estimate rounded to a long value for the hyper-log-log collector expr
  • bloom_filter(expr) - creates a bloom filter with expected capacity expr
  • bloom_filter_test(expr1, expr2) - checks whether expr1 is contained in the bloom filter expr2
  • bloom_filter_add(expr1, expr2) - adds expr1 to the bloom filter expr2
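
For instance, a hypothetical expression that creates a collector, adds a column value, and extracts a rounded estimate (the "user" column also appears in the aggregator example below):

    hyper_unique_round_estimate(hyper_unique_add("user", hyper_unique()))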

To allow complex values to be defined as literals, I've also added complex_decode_base64(expr1, expr2), where expr1 must be a string literal containing a valid complex type name, and expr2 must be a base64-encoded string that is a serialized value of that type (or null if the row is null).
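
For example, a sketch with a truncated placeholder payload (not a real serialized collector):

    complex_decode_base64('hyperUnique', 'AQAA...')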

I have not documented any of these yet, because I'm still considering how to position them, and several parts of the expression system, such as the native expression aggregator, are still missing documentation for the same reason. I have also not wired these up to SQL functions yet, for similar reasons.

With these expressions it is possible, for example, to re-create the native bloom filter aggregator using the expression aggregator instead:

    {
      "type": "expression",
      "name": "bloom_expression",
      "fields": ["user"],
      "initialValue": "bloom_filter(10000)",
      "fold": "bloom_filter_add(user, __acc)",
      "maxSizeBytes": 8096
    }

but I think this is just scratching the surface of what this change will make possible.


Arrays

Because of the multi-value string transformation magic that automatically translates expressions into map for selectors and into fold for the expression aggregator, it was necessary to support arrays of complex types. I have reworked the array code to collapse LongArrayExpr, DoubleArrayExpr, and StringArrayExpr into a single consolidated ArrayExpr, and likewise collapsed the array ExprEval implementations into a single ArrayExprEval. This significantly simplifies the array handling code, opens the door to arrays of complex types, and, interestingly, also to nested array types!

I have gated nested arrays behind a new feature flag, set in runtime properties with druid.expressions.allowNestedArrays, which defaults to disabled, as shown below. I think it would be better to hold off on opening this up until we support proper grouping on arrays and drop a lot of the automatic STRING coercion that currently happens.
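
To opt in (assuming the standard runtime.properties format):

    # enable nested array types in expressions (default: false)
    druid.expressions.allowNestedArrays=true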

I have added some tests with the functionality enabled, though, since it is mostly at the selector layer and above that we don't fully handle array types.

The parser doesn't directly understand nested array literals, so the array function must be used to construct nested arrays; e.g., [['a', 'b', 'c'],['d', 'e']] will not parse correctly, but array(['a', 'b', 'c'],['d', 'e']) will.

Empty arrays can be defined directly as literals, since I have added parser support for the full ExpressionType string representation; e.g., ARRAY<ARRAY<LONG>>[] is a literal for an empty nested array of longs. This syntax also works for non-nested arrays and complex arrays: ARRAY<COMPLEX<hyperUnique>>[], and so on.
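
Putting the two rules together (the trailing annotations are descriptions, not expression syntax):

    array(['a', 'b', 'c'], ['d', 'e'])    nested array built with the array function
    ARRAY<ARRAY<LONG>>[]                  empty nested array of long
    ARRAY<COMPLEX<hyperUnique>>[]         empty array of a complex type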

Type stuffs

Along the way I've further improved the quality of the RowSignature available when processing queries, mostly for IncrementalIndex column selector factories, which previously created a RowBasedColumnSelectorFactory with an empty RowSignature and now are created with the latest RowSignature available in the form of a supplier. RowBasedColumnSelectorFactory accepts this supplier instead of a direct RowSignature, since the schema might change during the lifetime of an incremental index.

I have also enriched the non-vectorized expression type information by changing Expr.ObjectBinding to also implement InputBindingInspector, which requires it to define a getType method. I have transitioned most uses of ExprEval.bestEffortOf to ExprEval.ofType, which falls back to best-effort inference if the type is null.
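
A minimal sketch of the resulting pattern (the binding variable and column name here are hypothetical):

    // With type-aware bindings, evaluation can use the declared type
    // instead of guessing from the value's Java class.
    ExpressionType type = bindings.getType("x");   // may be null if unknown
    ExprEval<?> eval = ExprEval.ofType(type, bindings.get("x"));
    // ofType falls back to best-effort inference when type is null.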

Future work

Implementing additional expressions for other complex type extensions, such as data sketches.


Key changed/added classes in this PR
  • Expr
  • ConstantExpr
  • ArrayExpr
  • ExprEval
  • Types
  • ObjectStrategy
  • IncrementalIndex
  • RowBasedColumnSelectorFactory

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@paul-rogers (Contributor) left a comment:

Partial review, random comments thus far.

}
}
- return ofStringArray(stringArray);
+ return ofStringArray(Types.readNullableStringArray(buffer, offset));
@paul-rogers (Contributor) commented Oct 29, 2021:

Nice simplification. (The code below this line was removed, leaving only this line.)

* Clear and set the 'null' byte of a nullable value to {@link NullHandling#IS_NULL_BYTE} to a {@link ByteBuffer} at
* the supplied position. This method does not change the buffer position, limit, or mark.
*
* Nullable types are stored with a leading byte to indicate if the value is null, followed by teh value bytes
Contributor:

teh -> the

* same type as the existing value in the map for said key, an {@link ISE} is thrown.
*
* @param strategy The {@link ObjectByteStrategy} object to be associated with the 'type' in the map.
*/
Contributor:

If we like our Javadoc to be Javadoc-like, might want to add <p> to mark paragraphs. Else, it all gets run together.

@clintropolis (Member, Author):

hmm, I don't think we really make an attempt to make it very friendly to read outside of the source code - this is maybe worth doing, but I think I'll leave it for now to be consistent

Contributor:

We don't publish javadoc today, and thus prefer a format that is easily readable in IDE.

*
* @return number of bytes written (1 if null, or 5 + size of Long[] if not)
*/
public static int writeNullableLongArray(ByteBuffer buffer, int offset, @Nullable Long[] array, int maxSizeBytes)
Contributor:

How do we ensure that the buffer has capacity? In the prior types, lengths are known, so code could check if there is capacity. For the array (and complex type), how do we prevent buffer exhaustion given we don't (easily) know the required capacity ahead of time?

@clintropolis (Member, Author):

added a note to the javadoc mentioning that the contract of these methods is that at least max size bytes must be available in the buffer, which is acceptable for current callers, who know their max limit

*
* @return number of bytes written (1 if null, or 5 + size of Double[] if not)
*/
public static int writeNullableDoubleArray(ByteBuffer buffer, int offset, @Nullable Double[] array, int maxSizeBytes)
Contributor:

Parachuting into the middle of the implementation. How does the max bytes relate to buffer capacity? And, since the array length is a good predictor of space used, should we enforce an array length (which the user might understand) vs. a byte length (which seems arbitrary to the user since it is unrelated to array length or buffer size)?

@clintropolis (Member, Author):

it is easier to be consistent across all types by working in bytes, because variably sized types like strings and string arrays become hard to reason about otherwise. The main reason here is that the expression aggregator has a single max-size-bytes parameter (since that aggregator can potentially aggregate any type)

// | null (byte) | array length (int) | array bytes |
if (array == null) {
return writeNull(buffer, offset);
} else {
Contributor:

Nit: if the then clause returns, the else clause can be omitted and code outdented one level to indicate that this is the "main event" assuming you get in the non-null door?

@clintropolis (Member, Author):

👍 fixed

buffer.put(offset, NullHandling.IS_NOT_NULL_BYTE);
buffer.putInt(offset + 1, array.length);
for (Double element : array) {
if (element != null) {
Contributor:

Nit: since we handle both null and non-null cases, the logic is a bit easier to read if we have the if be element == null. Thus, "if null do this else do that" rather than "if not null do this else do that".

*
* layout: | null (byte) | size (int) | {| null (byte) | long |, | null (byte) |, ... |null (byte) | long |} |
*
* This method does not change the buffer position, limit, or mark.
Contributor:

Why? The byte buffer has a nice mechanism to keep track of the write position. These functions recreate that by taking a write position, computing the amount of data written, returning that value, and asking the caller to compute the new write offset. These functions write variable amounts of data, so there is only one right place to write the next value: after the current one.

Maybe explain why we need to recreate the byte buffer write pointer?

@clintropolis (Member, Author):

Ah, these methods are not really well suited for writing a sequence of values, even though the array readers/writers use them that way. These methods are currently built with the buffer aggregators in mind, where they do not own the buffer they operate on; instead they are used to read and write values at specific positions within a shared buffer that are associated with some key. These methods are also called at high volume, reading and writing values at different offsets in the buffer potentially every row, so the overhead of duplicating buffers with clean positions and limits seems non-optimal.

I did change stuff around a bit to lean more into temporarily setting and resetting the position for some of the variably sized things and using that to compute sizes.

stringArray[i] = StringUtils.fromUtf8(stringElementBytes);
offset += Integer.BYTES + stringElementBytes.length;
}
offset++;
Contributor:

Computing the new position is error-prone. Wouldn't it be easier to save the original position and compute the offset as the final position minus the start position? The code would be simpler and we'd compute sizes once rather than twice.

@@ -143,10 +151,10 @@ public void testLongArrayEvalTooBig()
expectedException.expectMessage(StringUtils.format(
"Unable to serialize [%s], size [%s] is larger than max [%s]",
ExpressionType.LONG_ARRAY,
NullHandling.sqlCompatible() ? 33 : 30,
14,
Contributor:

Is this the length? If so, we've more than halved the number of values which can be stored. Is this a good thing? Or, did we force-set the limit somewhere?

@lgtm-com (bot) commented Nov 2, 2021:

This pull request introduces 1 alert and fixes 2 when merging 2022546 into 52539de - view on LGTM.com

new alerts:

  • 1 for Missing format argument

fixed alerts:

  • 2 for Dereferenced variable may be null

@lgtm-com (bot) commented Nov 2, 2021:

This pull request fixes 2 alerts when merging d2b460a into 52539de - view on LGTM.com

fixed alerts:

  • 2 for Dereferenced variable may be null

@lgtm-com (bot) commented Nov 2, 2021:

This pull request fixes 2 alerts when merging 65f3bd3 into a22687e - view on LGTM.com

fixed alerts:

  • 2 for Dereferenced variable may be null

@lgtm-com (bot) commented Nov 2, 2021:

This pull request fixes 2 alerts when merging 0fe88f3 into a22687e - view on LGTM.com

fixed alerts:

  • 2 for Dereferenced variable may be null

@lgtm-com (bot) commented Nov 2, 2021:

This pull request fixes 2 alerts when merging bc48d62 into 652e149 - view on LGTM.com

fixed alerts:

  • 2 for Dereferenced variable may be null

Comment on lines +405 to +407
if (coerced == null) {
return bestEffortOf(null);
}
Contributor:

This seems not possible since coerced can be null only when val is null?

@clintropolis (Member, Author):

yeah, because it is called inside of the instanceof check, I think you're right

- CartesianMapLambdaBinding lambdaBinding = new CartesianMapLambdaBinding(product, lambdaExpr, bindings);
- return applyMap(lambdaExpr, lambdaBinding);
+ CartesianMapLambdaBinding lambdaBinding = new CartesianMapLambdaBinding(elementType, product, lambdaExpr, bindings);
+ ExpressionType lambdaType = lambdaExpr.getOutputType(lambdaBinding);
Contributor:

Do you need a null check for lambdaType?

@clintropolis (Member, Author):

no, ExpressionType.asArrayType checks for nulls and applyMap will use the type of the first element as the array type if the hint is null

T fromByteBuffer(ByteBuffer buffer, int numBytes);

@Nullable
byte[] toBytes(@Nullable T val);
Contributor:

I know this interface was just split from ObjectStrategy, but could it be better as void putToBuffer(ByteBuffer buffer, @Nullable T val)? That seems more consistent with fromByteBuffer and useful for avoiding materializing a byte array when possible. This doesn't have to be done in this PR, just thinking out loud.

@clintropolis (Member, Author):

I'm currently thinking that this is a temporary state, what I really want is something like this:

core/src/main/java/org/apache/druid/segment/column/TypeStrategy.java:

package org.apache.druid.segment.column;

import org.apache.druid.common.config.NullHandling;

import javax.annotation.Nullable;
import java.nio.ByteBuffer;
import java.util.Comparator;

public interface TypeStrategy<T> extends Comparator<T>
{
  TypeSignature<?> getType();

  int estimateSizeBytes(@Nullable T value);

  T read(ByteBuffer buffer);

  void write(ByteBuffer buffer, T value);

  @Nullable
  default T readNullable(ByteBuffer buffer)
  {
    if ((buffer.get() & NullHandling.IS_NULL_BYTE) == NullHandling.IS_NULL_BYTE) {
      return null;
    }
    return read(buffer);
  }

  @Nullable
  default T readNullable(ByteBuffer buffer, int offset)
  {
    if (TypeStrategies.isNullableNull(buffer, offset)) {
      return null;
    }
    return read(buffer, offset + TypeStrategies.VALUE_OFFSET);
  }

  default T read(ByteBuffer buffer, int offset)
  {
    final int oldPosition = buffer.position();
    buffer.position(offset);
    T value = read(buffer);
    buffer.position(oldPosition);
    return value;
  }

  default int write(ByteBuffer buffer, int offset, T value)
  {
    final int oldPosition = buffer.position();
    buffer.position(offset);
    write(buffer, value);
    final int size = buffer.position() - offset;
    buffer.position(oldPosition);
    return size;
  }

  default int writeNullable(ByteBuffer buffer, int offset, @Nullable T value)
  {
    if (value == null) {
      return TypeStrategies.writeNull(buffer, offset);
    }
    buffer.put(offset, NullHandling.IS_NOT_NULL_BYTE);
    return write(buffer, offset + TypeStrategies.VALUE_OFFSET, value);
  }
}

which I've started to sketch out locally. I imagine that TypeSignature will have a getStrategy method that returns one of these, which means we have an easy way to get binary value serialization and a comparator for any given type. I will likely move ObjectByteStrategy back into ObjectStrategy, and instead make ComplexMetricsSerde define a getTypeStrategy which defaults to wrapping the ObjectStrategy, so that all of them get a free implementation out of the box, but can implement an optimized version in the event they are backed directly by the memory location.

Contributor:

If you do that, please name it ObjectRowStrategy to indicate that its primary usage should be when building row-oriented data sets. One of the fundamental problems with the ObjectStrategy that exists here is that the interface forces you to serialize the object in a row-oriented format. This is bad when serializing columns, as it doesn't allow you to take advantage of commonalities between values in the same column to achieve smaller sizes, and that is fundamentally why the ObjectStrategy method of serializing and deserializing is deprecated. This interface definitely makes sense for result sets, where the data is primarily row-oriented, but we should do everything possible to push people away from using it for column persistence.

@@ -1129,6 +1032,10 @@ public String asString()
public boolean isNumericNull()
{
if (isScalar()) {
if (arrayType.getElementType().is(ExprType.STRING)) {
Number n = computeNumber((String) getScalarValue());
Contributor:

Should we cache n so that we don't have to compute it again for the same row?

@clintropolis (Member, Author):

ah yeah, I guess I lost the caching that the separate string array impl had. It is probably worth considering adding it back, though many of the expression selectors cache their result based on the current offset so that they don't need to recompute the expression when getting the same row.

Contributor:

This seems a bit different from the use case you described, though. The caller of this method will likely call another method to get the actual value after the null check using this method. It would be nice if the getValue method (like asLong or asDouble) could just return the cached value computed in this method instead of recomputing it.

Comment on lines +1158 to +1163
if (v == null) {
return null;
}
Long lv = GuavaUtils.tryParseLong((String) v);
if (lv == null) {
Double d = Doubles.tryParse((String) v);
Contributor:

Is this different than computeNumber?

InputBindings.forFunction(
name -> {
// Sanity check. Bindings should not be used for a constant expression.
throw new UnsupportedOperationException();
Contributor:

Suggested change:
- throw new UnsupportedOperationException();
+ throw new ISE("Bindings should not be used for constant expressions");

@clintropolis (Member, Author):

I think this should never ever happen and we could probably just use InputBindings.nilBindings(), but I didn't really want to change the logic so I kept it. I can change the exception message to be better though 👍

@@ -452,6 +452,6 @@ private Number getCurrentValueAsNumber()
  @Override
  public ColumnCapabilities getColumnCapabilities(String columnName)
  {
-   return getColumnCapabilities(rowSignature, columnName);
+   return getColumnCapabilities(rowSignatureSupplier.get(), columnName);
Contributor:

nit: can cache ColumnCapabilities and return the same object if rowSignature hasn't changed.

* based algorithm.
*/
@VisibleForTesting
public static void estimateAndCheckMaxBytes(ExprEval eval, int maxSizeBytes)
Contributor:

nit: I feel there is perhaps a better place for this method, since it's nice to have size estimation methods and serialization methods in the same class so that we won't forget one when we modify the other. But maybe it's fine here since it's used only in this class.

Contributor:

Maybe package-private is better.

@clintropolis (Member, Author):

if I do the TypeStrategy thing mentioned in the other comment, then size estimation will be pushed there, which seems like a more appropriate place. I put it here for now because this is the only caller.

@jihoonson (Contributor) left a comment:

None of my comments are blockers, I will leave it to you whether you address them in this PR or in a follow-up. Thanks @clintropolis!

@clintropolis (Member, Author):

Thanks for the review @jihoonson! Since it is all the same to you, and this PR has gotten a lot bigger than I initially set out to make, I'll fix up the array expression stuff and propose the TypeStrategy thing in separate follow-up PRs.

@clintropolis merged commit 7237dc8 into apache:master on Nov 8, 2021 and deleted the complex-expressions branch.
@clintropolis mentioned this pull request on Nov 8, 2021.
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022