Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tighten up array handling, fix bug with array_slice output type inference #12914

Merged
merged 10 commits into from
Aug 25, 2022

Conversation

clintropolis
Copy link
Member

@clintropolis clintropolis commented Aug 17, 2022

Description

This PR adjusts array handling in the direction of standardizing the representations on Object[] as well as long[], double[], and float[]. This is follow-up to #12498 (comment) and a follow-up discussion I had with @gianm.

The main changes are removing the remaining leaks of typed object arrays such as Long[], Double[], and Float[] from the expression system, as well as adding support for handling arrays of java primitives such as long[], float[], and double[]. Expressions do not produce these primitive arrays, but aggregators and post-aggs can.

I've added javadocs to try to help provide additional details on native type matters.

I've adjusted the behavior of the expression column value selector when the virtual column output type is STRING to behave a bit more like a normal multi-value string column, and return the value directly when there is only a single value contained in the list. This does slightly change scan query results, which for multi-value string expressions would always produce array typed results, but now will produce a mix of array and scalar values. I think this behavior is more consistent with how the multi-value string columns themselves behave, but I can put it behind a flag if anyone is worried about this change.

I encountered a bug with array_slice output type inference, which was not indicating its output as array output, instead reporting the output type of its argument. This issue became clear due to the stronger separation between multi-value string handling and array handling that now occurs in the expression selectors.

Several redundant uses of NullHandling.emptyToNullIfNeeded have been removed from expression selectors, since it is baked into StringExprEval itself:

    private StringExprEval(@Nullable String value)
    {
      super(NullHandling.emptyToNullIfNeeded(value));
    }

In a follow-up PR, I'm going to try to see if I can remove the rest of the representation of ARRAY types as List, which is primarily the grouping engine. I decided to split it out from this PR to keep change set small.


This PR has:

  • been self-reviewed.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@lgtm-com
Copy link

lgtm-com bot commented Aug 17, 2022

This pull request introduces 4 alerts when merging 8763c66 into f70f7b4 - view on LGTM.com

new alerts:

  • 4 for Dereferenced variable may be null

@gianm
Copy link
Contributor

gianm commented Aug 23, 2022

4 for Dereferenced variable may be null

Do these matter?

public static boolean isAllConstants(Expr... exprs)
{
return isAllConstants(Arrays.asList(exprs));
for (Expr expr : exprs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why duplicate the code? To avoid wrapping the array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this change wasn't really related to anything, but it seemed a bit pointless to wrap it... though i guess probably not that expensive in the scheme of things since this is only used when flattening expressions in the Parser, so on second thought can just change it back

@@ -429,18 +443,15 @@ public static ExprEval ofType(@Nullable ExpressionType type, @Nullable Object va
case STRING:
// not all who claim to be "STRING" are always a String, prepare ourselves...
if (value instanceof String[]) {
return new ArrayExprEval(ExpressionType.STRING_ARRAY, (String[]) value);
return new ArrayExprEval(ExpressionType.STRING_ARRAY, Arrays.stream((String[]) value).toArray());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stream construction in hot paths can be expensive. May be better to avoid it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, looking around there are quite a few usages of steam in array stuffs, should I try to fix them all up in this PR or do as a follow-up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you, I'm OK either way.

if (val instanceof String[]) {
return new ArrayExprEval(ExpressionType.STRING_ARRAY, (String[]) val);
return new ArrayExprEval(ExpressionType.STRING_ARRAY, Arrays.stream((String[]) val).toArray());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stream construction in hot paths can be expensive. May be better to avoid it.

/**
* Druid string type. Values will be represented as {@link String} or {@link java.util.List<String>} in the case
* of multi-value string columns. {@link ColumnType} has insufficient information to distinguish between single
* and multi-value strings, this requires a type 'ColumnCapabilities' which is available at a higher layer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to the specific hasMultipleValues function? Or mention, if it's not available from this package.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not in this package unfortunately but will do. i do still someday wish to merge druid-core into druid-processing since i find the split a bit frustrating, but that is so disruptive i should probably give up hope that it will ever happen :trollface:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so disruptive i should probably give up hope that it will ever happen

There's always the old "do it during a quiet week" strategy 🙂

public static final ColumnType FLOAT = new ColumnType(ValueType.FLOAT, null, null);
// currently, arrays only come from expressions or aggregators
// and there are no native float expressions (or aggs which produce float arrays)
/**
* An array of Strings. Values will reprsented as Object[]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

represented (spelling)

* this complex type.
* in the form of a type name which must be registered in the complex type registry. This type is not currently
* supported as a grouping key for aggregations, and might only be supported by the specific aggregators crafted to
* handle this complex type. Filtering on this type with standard filters will most likely have limited support, and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"most likely"? Are there cases where they are treated as anything other than null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, there is isFilterable on column capabilities which is used exclusively to allow complex types to provide indexes to use for filtering, https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/ColumnSelectorColumnIndexSelector.java#L86. This could allow complex columns to directly provide indexes that filters understand, which I suppose could be used for filtering.

However, there are no examples of this in current complex types, and importantly, I don't consider this support to be fully built-out yet though because without value matcher integration its not really complete and likely not correct except in certain situations. We did start to orient some things to make this someday be possible, like adding the object predicate https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/filter/DruidPredicateFactory.java#L46, but some more plumbing needs to be done to make use of it I think.

The big win here on finishing this stuff out someday would be to allow matching constant forms of complex values with selector filters, and making is null/is not null actually work correctly (like is null matches everything since the column is treated as a null) which is like a special case of the same thing but especially jarring.

Beyond equality checks I'm not sure most of the standard filters make sense with complex types in general, but tangentially, it would probably be nice in the future to have a mode where filtering on them with filters that don't make sense is an error condition instead of silently treating it as a column of null values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seemed too complicated to explain so now just mention that filters will treat as null

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can distill this to: "filters generally treat complex values as null, although this is neither guaranteed nor required."

* input to expression virtual columns, and might only be supported by the specific aggregators crafted to handle
* this complex type.
* in the form of a type name which must be registered in the complex type registry. This type is not currently
* supported as a grouping key for aggregations, and might only be supported by the specific aggregators crafted to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"might"? How about:

Complex types are not currently supported as a grouping key for aggregations. Complex types can be used as inputs to aggregators, in cases where the specific aggregator supports the specific complex type.

*/
COMPLEX,

/**
* Placeholder for arbitrary arrays of other {@link ValueType}. This type is not currently supported as a grouping
* Placeholder for arbitrary arrays of other {@link ValueType}. This type has limited support as a grouping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "limited support" mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can group on specific array types, such as ARRAY<STRING>, ARRAY<LONG>, ARRAY<DOUBLE>, ARRAY<FLOAT>, but not fully generically, like on ARRAY<ARRAY<LONG>>, or ARRAY<COMPLEX<json>> (assuming grouping on COMPLEX types was also supported, which it isn't yet... but soon.). I'll clarify that some specific array grouping is supported, but array grouping in general is not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that makes sense. Would be good for that to come across in the comment.

* There are currently no native ARRAY typed columns, but they may be produced by expression virtual columns,
* aggregators, and post-aggregators.
*
* Arrays are typically represented as Object[], but may also be Java primitive arrays such as long[], double[], or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the idea of this paragraph. How do you feel about this adjustment? The idea is to firm up the language and give people clearer expectations.

Arrays are represented as Object[], long[], double[], or float[]. The preferred type is Object[], since the expression system is the main consumer of arrays, and the expression system uses Object[] internally. Some code represents arrays in other ways; in particular the groupBy engine and SQL result layer. Over time we expect these usages to migrate to Object[], long[], double[], and float[].

if (lResult == null) {
return false;
}

return Arrays.stream(lResult).filter(Objects::nonNull).anyMatch(Evals::asBoolean);
return Arrays.stream(lResult).filter(Objects::nonNull).anyMatch(o -> Evals.asBoolean((long) o));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this code was pre-existing, but, may be a good idea to move it away from streams since it's a hot path.

Copy link
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest patch LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants