explicit outputType for ExpressionPostAggregator, better documentation for the differences between arrays and mvds #15245

clintropolis · 2023-10-25T03:54:16Z

Description

This PR attempts to:

- more clearly lay out the differences between ARRAY types and multi-value dimensions in the docs
  - adds a new arrays page with examples and concepts
  - updates the multi-value dimensions page to call out that they are not arrays and include SQL examples
- document the new `arrayIngestMode` parameter added in #15093 (MSQ arrayIngestMode to control if arrays are ingested as ARRAY, MVD, or an exception)
  - and update MSQ examples to show how to ingest ARRAY types and MVDs going forward to prepare for the eventual default of `"arrayIngestMode":"array"`

It also adds a new `outputType` to `ExpressionPostAggregator`, which the SQL planner uses to decorate the post-aggregator with the expected output type. This is particularly useful for functions like `ARRAY_TO_MV`, which has a native expression type of `ARRAY<STRING>` but is expected to be treated as `STRING` outside of expressions (multi-value `STRING`s do not exist inside the expression engine, so native expression type inference alone isn't enough to handle this properly). This was necessary to make the new docs actually true.

gianm · 2023-10-25T20:39:32Z

docs/multi-stage-query/concepts.md

@@ -88,6 +88,9 @@ When deciding whether to use `REPLACE` or `INSERT`, keep in mind that segments g
 with dimension-based pruning but those generated with `INSERT` cannot. For more information about the requirements
 for dimension-based pruning, see [Clustering](#clustering).

+To insert [ARRAY types](../querying/arrays.md), be sure to set context flag `"arrayIngestMode":"array"` which allows


Hmm this seems like the wrong place to put this. It's generic docs about INSERT, we don't want to gunk it up with stuff about specific data that might be inserted. (Otherwise this would be, like, 10 times longer.)

I suggest cutting it, and relying on the examples and the array docs to guide people.

gianm · 2023-10-25T20:59:55Z

docs/multi-stage-query/concepts.md

-   false`](reference.md#context-parameters) in your context. This ensures that multi-value strings are left alone and
-   remain lists, instead of being [automatically unnested](../querying/sql-data-types.md#multi-value-strings) by the
-   `GROUP BY` operator.
+3. To ingest [Druid multi-value dimensions](../querying/multi-value-dimensions.md), wrap all multi-value strings 


This direction has become too complicated for people to understand, so I think we'll need an example. Or to link to one.

gianm · 2023-10-25T21:18:02Z

docs/querying/arrays.md

@@ -0,0 +1,228 @@
+---
+id: arrays
+title: "Array columns"


"Arrays" is a better title and scope — as people can use arrays even if they don't have array columns. For example they can use MV_TO_ARRAY, ARRAY_AGG, etc.

gianm · 2023-10-26T06:56:20Z

docs/querying/arrays.md

+  -->
+
+
+Apache Druid supports SQL standard `ARRAY` typed columns for `STRING`, `LONG`, and `DOUBLE` types. Other more complicated ARRAY types must be stored in [nested columns](nested-columns.md). Druid ARRAY types are distinct from [multi-value dimension](multi-value-dimensions.md), which have significantly different behavior than standard arrays.


Odd to be mixing SQL terms with native type names here. Possibly switch to VARCHAR, BIGINT, and DOUBLE so the names make sense in SQL. Or mention both the SQL name and the native name?

gianm · 2023-10-26T06:58:43Z

docs/querying/arrays.md

+],
+```
+
+Arrays can also be inserted with [multi-stage ingestion](../multi-stage-query/index.md), but must include a query context parameter `"arrayIngestMode":"array"`.


Sort of unclear what the verb "include" refers to. The sentence construction makes it sound like the arrays themselves must include the context parameter. But that isn't right. Also, "multi-stage ingestion" isn't a thing 🙂— it's "SQL-based ingestion" or "multi-stage query".

So, suggestion:

Arrays can also be inserted with SQL-based ingestion when you use the context parameter "arrayIngestMode": "array".

Also link the text "context parameter" to docs/multi-stage-query/reference.md#context.

Also include some text about what will happen if you don't set `arrayIngestMode: array`. Something like: string arrays will be converted to multi-value dimensions, and numeric arrays will cause the query to fail with an error (what error?)
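To make the suggested wording concrete, here is a minimal Python sketch of a SQL-based ingestion request body that carries the context parameter. The `query` and `context` keys follow the general shape of a Druid SQL request as I understand it; the target table `array_example_copy` and the helper name `build_ingest_request` are hypothetical, and nothing here is sent to a cluster.

```python
import json

# Hypothetical INSERT statement: copies an array column from the docs'
# "array_example" table into an assumed target table.
INSERT_SQL = """
INSERT INTO "array_example_copy"
SELECT "__time", "label", "arrayString"
FROM "array_example"
PARTITIONED BY DAY
""".strip()


def build_ingest_request(sql: str, array_ingest_mode: str = "array") -> dict:
    """Build a request body with arrayIngestMode set in the query context."""
    return {
        "query": sql,
        # The context applies to the whole query, not to individual arrays.
        "context": {"arrayIngestMode": array_ingest_mode},
    }


payload = build_ingest_request(INSERT_SQL)
print(json.dumps(payload, indent=2))
```

The point of the sketch is only that the parameter lives in the query context next to the statement, which is the ambiguity the comment above calls out.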

gianm · 2023-10-26T07:44:49Z

docs/querying/arrays.md

+#### Example: SQL grouping query with a filter
+```sql
+SELECT label, arrayString
+FROM "array_example" CROSS JOIN UNNEST(arrayString) as u(strings)


is this CROSS JOIN meant to be here? it doesn't seem to be doing much if anything

gianm · 2023-10-26T07:45:20Z

docs/querying/arrays.md

+
+- Value filters, like "equality", "range" match on entire array values
+- The "null" filter will match rows where the entire array value is null
+- Array specific functions like ARRAY_CONTAINS and ARRAY_OVERLAP follow the behavior specified by those functions


backticks around function names, & link them to the SQL function docs

gianm · 2023-10-26T08:02:16Z

docs/querying/arrays.md

+{"timestamp": "2023-01-01T00:00:00", "label": "row3", "arrayString": [],          "arrayLong":[1, 2, 3],   "arrayDouble":[null, 2.2, 1.1]} 
+{"timestamp": "2023-01-01T00:00:00", "label": "row4", "arrayString": ["a", "b"],  "arrayLong":[1, 2, 3],   "arrayDouble":[]}
+{"timestamp": "2023-01-01T00:00:00", "label": "row5", "arrayString": null,        "arrayLong":[],          "arrayDouble":null}
+```


Somewhere around here we should have a section "String arrays vs. multi-value dimensions" that sets people straight about the differences. Suggested text:

Avoid confusing string arrays with multi-value dimensions (link to MVD docs). Arrays and multi-value dimensions are stored in different column types, and query behavior is different. You can use the functions `MV_TO_ARRAY` and `ARRAY_TO_MV` to convert between the two if needed. In general, we recommend using arrays whenever possible, since they are a newer and more powerful feature.

Use care during ingestion to ensure you get the type you want.

To get arrays when performing an ingestion using JSON ingestion specs, such as native batch (link) or streaming ingestion (link), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a SQL-based ingestion, write a query that generates arrays and set the context parameter `arrayIngestMode: array`. Arrays may contain strings or numbers.

To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a SQL-based ingestion, wrap arrays in `ARRAY_TO_MV` (link to examples), which ensures you get multi-value dimensions in any `arrayIngestMode`. Multi-value dimensions can only contain strings.

You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like `SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'mytable'`. Arrays are type `ARRAY`, multi-value strings are type `VARCHAR`.

I suggest including the same exact text in multi-value-dimensions.md, or at least linking to this section prominently.
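As a rough illustration of the type check described in the suggested text, the following Python sketch builds the `INFORMATION_SCHEMA.COLUMNS` query and sorts result rows into the two cases. The helper names and the sample rows are hypothetical; the sketch assumes only what the suggested text states, namely that arrays report `ARRAY` and multi-value strings report `VARCHAR`.

```python
def column_type_query(table_name: str) -> str:
    """Return the INFORMATION_SCHEMA query for one table."""
    escaped = table_name.replace("'", "''")  # escape single quotes in the literal
    return (
        "SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS "
        f"WHERE TABLE_NAME = '{escaped}'"
    )


def classify(data_type: str) -> str:
    """Map a reported DATA_TYPE to the storage kind it suggests."""
    normalized = data_type.strip().upper()
    if "ARRAY" in normalized:
        return "array"
    if normalized == "VARCHAR":
        # VARCHAR covers plain strings as well as multi-value strings.
        return "string or multi-value string"
    return "other"


# Hypothetical result rows, shaped as (COLUMN_NAME, DATA_TYPE).
sample_rows = [("arrayString", "ARRAY"), ("tags", "VARCHAR"), ("__time", "TIMESTAMP")]

print(column_type_query("mytable"))
for name, data_type in sample_rows:
    print(f"{name}: {classify(data_type)}")
```

Note that `VARCHAR` alone cannot distinguish a single-value string column from a multi-value one, so the check rules arrays in or out rather than proving a column is multi-value.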

gianm · 2023-10-26T08:06:04Z

docs/querying/sql-data-types.md

+You can convert multi-value dimensions to standard SQL arrays explicitly with `MV_TO_ARRAY` or implicitly using [array functions](./sql-array-functions.md). You can also use the array functions to construct arrays from multiple columns.
+
+Druid serializes `ARRAY` results as a JSON string of the array by default, which can be controlled by the context parameter
+`sqlStringifyArrays`. When set to `false`, the arrays will instead be returned as regular JSON arrays instead of in stringified form.


Surely this is only true for certain result formats? I mean, in csv, everything must be stringified somehow.

gianm · 2023-10-26T08:06:33Z

docs/querying/sql-data-types.md

+
+Druid supports [ARRAY types](arrays.md), which behave as standard SQL arrays, where results are grouped by matching entire arrays. The [`UNNEST` operator](./sql-array-functions.md#unn) can be used to perform operations on individual array elements, translating each element into a separate row. 
+
+ARRAY typed columns can be stored in segments with class JSON based ingestion using the 'auto' typed dimension schema shared with [schema auto-discovery](../ingestion/schema-design.md#schema-auto-discovery-for-dimensions) to detect and ingest arrays as ARRAY typed columns. For [SQL based ingestion](../multi-stage-query/index.md), the query context parameter `arrayIngestMode` must be specified as `"array"` to ingest ARRAY types. In Druid 28, the default mode for this parameter is `'mvd'` for backwards compatibility, which instead can only handle `ARRAY<STRING>` which it stores in [multi-value string columns](#multi-value-strings). 


No real reason to have the extra single quotes in `'mvd'`. Plain `mvd` is preferred.

317brian

Just some minor copyediting nits/suggestions. Thanks for putting this together! Will re-review once Gian's suggestions make it in.

317brian · 2023-10-26T21:31:09Z

docs/querying/arrays.md

+Refer to the [Druid SQL data type documentation](sql-data-types.md#arrays) and [SQL array function reference](sql-array-functions.md) for additional details
+about the functions available to use with ARRAY columns and types in SQL.
+
+The following sections describe inserting, filtering, and grouping behavior based on the following example data, which includes 3 array typed columns.


Suggested change

-The following sections describe inserting, filtering, and grouping behavior based on the following example data, which includes 3 array typed columns.

+The following sections describe inserting, filtering, and grouping behavior based on the following example data, which includes 3 array typed columns:

317brian · 2023-10-26T21:34:28Z

docs/querying/multi-value-dimensions.md

@@ -30,21 +30,36 @@ array of values instead of a single value, such as the `tags` values in the foll
 {"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} 
 ```

-This document describes filtering and grouping behavior for multi-value dimensions. For information about the internal representation of multi-value dimensions, see
+It is important to be aware that multi-value dimensions are distinct from [array types](arrays.md), which behave like standard SQL arrays. This document describes the behavior of multi-value dimensions, and some additional details can be found in the [SQL data type documentation](sql-data-types.md#multi-value-strings-behavior).


Slight change to emphasize that they're different:

Suggested change

-It is important to be aware that multi-value dimensions are distinct from [array types](arrays.md), which behave like standard SQL arrays. This document describes the behavior of multi-value dimensions, and some additional details can be found in the [SQL data type documentation](sql-data-types.md#multi-value-strings-behavior).

+It is important to be aware that multi-value dimensions are distinct from [array types](arrays.md). While array types behave like standard SQL arrays, multi-value dimensions do not. This document describes the behavior of multi-value dimensions, and some additional details can be found in the [SQL data type documentation](sql-data-types.md#multi-value-strings-behavior).

317brian · 2023-10-26T21:35:38Z

docs/querying/multi-value-dimensions.md

@@ -61,20 +76,79 @@ By default, Druid sorts values in multi-value dimensions. This behavior is contr

 See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for information on configuring multi-value handling.

+Multi-value dimensions can also be inserted with [multi-stage ingestion](../multi-stage-query/index.md). The multi-stage query engine does not have direct handling of class Druid multi-value dimensions. A special pair of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable handling these types. Multi-value handling is not available when using the multi-stage query engine to insert data.


Suggested change

-Multi-value dimensions can also be inserted with [multi-stage ingestion](../multi-stage-query/index.md). The multi-stage query engine does not have direct handling of class Druid multi-value dimensions. A special pair of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable handling these types. Multi-value handling is not available when using the multi-stage query engine to insert data.

+Multi-value dimensions can also be inserted with [SQL-based ingestion using the multi-stage query (MSQ) task engine](../multi-stage-query/index.md). The MSQ task engine does not have direct handling of class Druid multi-value dimensions. A special pair of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable handling these types. Multi-value handling is not available when using the multi-stage query task engine to insert data.

317brian · 2023-10-26T21:35:59Z

docs/querying/multi-value-dimensions.md

-{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]}  #row3
-{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []}                #row4
+
+Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause, since we only wish to coerce the type _after_ grouping.


Suggested change

-Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause, since we only wish to coerce the type _after_ grouping.

+Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause since we only wish to coerce the type _after_ grouping.

317brian · 2023-10-26T22:31:28Z

docs/querying/multi-value-dimensions.md

+Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause, since we only wish to coerce the type _after_ grouping.
+
+
+The `EXTERN` is also able to refer to the `tags` input type as `VARCHAR`, which is also how a query on a Druid table containing a multi-value dimension would specify the type of the `tags` column. If this is the case, `MV_TO_ARRAY` must be used since the multi-stage engine only supports grouping on multi-value dimensions as arrays, and so they must be coerced first. These arrays then must be coerced back into `VARCHAR` in the `SELECT` part of the statement with `ARRAY_TO_MV`.


Suggested change

-The `EXTERN` is also able to refer to the `tags` input type as `VARCHAR`, which is also how a query on a Druid table containing a multi-value dimension would specify the type of the `tags` column. If this is the case, `MV_TO_ARRAY` must be used since the multi-stage engine only supports grouping on multi-value dimensions as arrays, and so they must be coerced first. These arrays then must be coerced back into `VARCHAR` in the `SELECT` part of the statement with `ARRAY_TO_MV`.

+The `EXTERN` is also able to refer to the `tags` input type as `VARCHAR`, which is also how a query on a Druid table containing a multi-value dimension would specify the type of the `tags` column. If this is the case, you must use `MV_TO_ARRAY` since the MSQ task engine only supports grouping on multi-value dimensions as arrays. So, they must be coerced first. These arrays must then be coerced back into `VARCHAR` in the `SELECT` part of the statement with `ARRAY_TO_MV`.
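The coercion order described in this paragraph (group on `MV_TO_ARRAY`, then wrap with `ARRAY_TO_MV` only in the `SELECT` list) can be sketched as a small Python query builder. The table names `mvd_example` and `mvd_example_rollup` and the helper name are assumptions for illustration, and the generated statement has not been run against a cluster.

```python
def mvd_rollup_query(source: str, target: str, mvd_column: str) -> str:
    """Build a rollup statement that keeps a VARCHAR multi-value column multi-value."""
    # Coerce the multi-value string to an array so it can be grouped on.
    as_array = f'MV_TO_ARRAY("{mvd_column}")'
    return "\n".join([
        f'REPLACE INTO "{target}" OVERWRITE ALL',
        # Coerce back to VARCHAR only in the SELECT list, after grouping.
        f'SELECT "__time", ARRAY_TO_MV({as_array}) AS "{mvd_column}", COUNT(*) AS "count"',
        f'FROM "{source}"',
        # The GROUP BY clause uses the array form, without ARRAY_TO_MV.
        f'GROUP BY "__time", {as_array}',
        "PARTITIONED BY DAY",
    ])


query = mvd_rollup_query("mvd_example", "mvd_example_rollup", "tags")
print(query)
```

Building the array expression once and reusing it keeps the `SELECT` and `GROUP BY` forms in sync, which is the part that is easy to get wrong by hand.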

gianm

TY for the updates; a smaller round of suggestions follows.

gianm · 2023-11-01T06:40:50Z

docs/multi-stage-query/reference.md

+| `maxNumTasks` | SELECT, INSERT, REPLACE<br /><br />The maximum total number of tasks to launch, including the controller task. The lowest possible value for this setting is 2: one controller and one worker. All tasks must be able to launch simultaneously. If they cannot, the query returns a `TaskStartTimeout` error code after approximately 10 minutes.<br /><br />May also be provided as `numTasks`. If both are present, `maxNumTasks` takes priority. | 2 |
+| `taskAssignment` | SELECT, INSERT, REPLACE<br /><br />Determines how many tasks to use. Possible values include: <ul><li>`max`: Uses as many tasks as possible, up to `maxNumTasks`.</li><li>`auto`: When file sizes can be determined through directory listing (for example: local files, S3, GCS, HDFS) uses as few tasks as possible without exceeding 512 MiB or 10,000 files per task, unless exceeding these limits is necessary to stay within `maxNumTasks`. When calculating the size of files, the weighted size is used, which considers the file format and compression format used if any. When file sizes cannot be determined through directory listing (for example: http), behaves the same as `max`.</li></ul> | `max` |
+| `finalizeAggregations` | SELECT, INSERT, REPLACE<br /><br />Determines the type of aggregation to return. If true, Druid finalizes the results of complex aggregations that directly appear in query results. If false, Druid returns the aggregation's intermediate type rather than finalized type. This parameter is useful during ingestion, where it enables storing sketches directly in Druid tables. For more information about aggregations, see [SQL aggregation functions](../querying/sql-aggregations.md). | true |
+| `arrayIngestMode` | INSERT, REPLACE<br /><br /> Controls how ARRAY type values are stored in Druid segments. When set to `'array'` (recommended for SQL compliance), Druid will store all ARRAY typed values in [ARRAY typed columns](../querying/arrays.md), and supports storing both VARCHAR and numeric typed arrays. When set to `'mvd'` (the default, for backwards compatibility), Druid only supports VARCHAR typed arrays, and will store them as [multi-value string columns](../querying/multi-value-dimensions.md). When set to `none`, Druid will throw an exception when trying to store any type of arrays, used to help migrate operators from `'mvd'` mode to `'array'` mode and force query writers to make an explicit choice between ARRAY and multi-value VARCHAR typed columns. | `'mvd'` (for backwards compatibility, recommended to use `array` for SQL compliance)|


`array` is preferred over `'array'`. In the JSON it's `"array"` anyway. (But forget that, use `array`.)

For `none`, is there a way for operators to set a default value? Otherwise it doesn't seem like it'd be useful for operators. (The useful flow would be for operators to set a default of `none`, and users to override it to either `mvd` or `array` as their preference dictates.)

I think operators would need to set the default query context with `druid.query.default.context.arrayIngestMode`, which could then be overridden on a per-query basis.
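The flow discussed here (an operator-level default that individual queries override) can be modeled with a short Python sketch. It is a model of the described behavior and not Druid's implementation: the function names are hypothetical, the three mode values and the `mvd` fallback come from the reference table row above, and the failure for numeric arrays in `mvd` mode is an assumption based on that row's "only supports VARCHAR typed arrays" wording.

```python
VALID_MODES = {"array", "mvd", "none"}


def resolve_array_ingest_mode(operator_defaults: dict, query_context: dict) -> str:
    """Per-query context wins over the operator default; 'mvd' is the fallback."""
    merged = {**operator_defaults, **query_context}
    mode = merged.get("arrayIngestMode", "mvd")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown arrayIngestMode: {mode!r}")
    return mode


def describe_storage(mode: str, element_type: str) -> str:
    """Describe how an array of element_type would be stored under a mode."""
    if mode == "none":
        raise ValueError("arrayIngestMode is 'none': choose 'array' or 'mvd' explicitly")
    if mode == "array":
        return "ARRAY column"
    if element_type == "VARCHAR":
        return "multi-value VARCHAR column"
    # Assumption: non-VARCHAR arrays are rejected in 'mvd' mode.
    raise ValueError(f"'mvd' mode cannot store arrays of {element_type}")


# Operator sets a default of 'none'; the query writer opts in to 'array'.
mode = resolve_array_ingest_mode({"arrayIngestMode": "none"}, {"arrayIngestMode": "array"})
print(mode, "->", describe_storage(mode, "BIGINT"))
```

With a default of `none`, a query that does not set the parameter fails fast, which is what forces the explicit choice the reference text describes.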

gianm · 2023-11-01T06:44:57Z

docs/querying/multi-value-dimensions.md

@@ -61,20 +77,81 @@ By default, Druid sorts values in multi-value dimensions. This behavior is contr

 See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for information on configuring multi-value handling.

+### SQL-based ingestion
+Multi-value dimensions can also be inserted with [SQL-based ingestion](../multi-stage-query/index.md). The multi-stage query engine does not have direct handling of class Druid multi-value dimensions. A special pair of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable handling these types. Multi-value handling is not available when using the multi-stage query engine to insert data.


"classic" (spelling)

Although… what does it mean to say that MSQ doesn't have "direct handling of classic Druid multi-value dimensions"? I would think it does directly handle them, if you use ARRAY_TO_MV? I guess I'm not sure what you're trying to say here.

Grammar for the sentence starting with "A special pair of functions" is kind of wonky. Please rewrite it to be clearer.

I guess I was thinking of `groupByEnableMultiValueUnnesting`, which, looking at the code, is actually allowed by default; it is the web console that sets it to false for MSQ queries by default. I'll try to clarify this.

gianm · 2023-11-01T06:49:12Z

docs/querying/sql-data-types.md

+
+Druid supports [`ARRAY` types](arrays.md), which behave as standard SQL arrays, where results are grouped by matching entire arrays. The [`UNNEST` operator](./sql-array-functions.md#unn) can be used to perform operations on individual array elements, translating each element into a separate row. 
+
+`ARRAY` typed columns can be stored in segments with class JSON based ingestion using the 'auto' typed dimension schema shared with [schema auto-discovery](../ingestion/schema-design.md#schema-auto-discovery-for-dimensions) to detect and ingest arrays as ARRAY typed columns. For [SQL based ingestion](../multi-stage-query/index.md), the query context parameter `arrayIngestMode` must be specified as `"array"` to ingest ARRAY types. In Druid 28, the default mode for this parameter is `"mvd"` for backwards compatibility, which instead can only handle `ARRAY<STRING>` which it stores in [multi-value string columns](#multi-value-strings). 


I don't think the word "class" is useful here. I suppose you meant "classic", but JSON based ingestion isn't entirely "classic" / "legacy"; for example it's the only way to do realtime still.

gianm

Some minor comments. Could you also add a case to MSQInsertTest that tests that ARRAY_TO_MV makes an MVD even if arrayIngestMode: array? Some other tests in there would also benefit from having a version for array and a version for mvd.

gianm · 2023-11-01T20:11:56Z

docs/querying/multi-value-dimensions.md

@@ -78,7 +78,7 @@ By default, Druid sorts values in multi-value dimensions. This behavior is contr
 See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for information on configuring multi-value handling.

 ### SQL-based ingestion
-Multi-value dimensions can also be inserted with [SQL-based ingestion](../multi-stage-query/index.md). The multi-stage query engine does not have direct handling of class Druid multi-value dimensions. A special pair of functions, `MV_TO_ARRAY` which converts multi-value dimensions into `VARCHAR ARRAY` and `ARRAY_TO_MV` to coerce them back into `VARCHAR` exist to enable handling these types. Multi-value handling is not available when using the multi-stage query engine to insert data.
+Multi-value dimensions can also be inserted with [SQL-based ingestion](../multi-stage-query/index.md). The functions `MV_TO_ARRAY` and `ARRAY_TO_MV` can assist in converting `VARCHAR` to `VARCHAR ARRAY` and `VARCHAR ARRAY` into `VARCHAR` respectively. Multi-value handling is not available when using the multi-stage query engine to insert data.


I think "Multi-value handling" written in plain English like that will be confusing. It sounds like we're saying that multi-value dimensions cannot be handled by MSQ. Probably clearer to use `multiValueHandling` to make it clear we're talking about a parameter.

gianm · 2023-11-01T20:13:09Z

docs/querying/post-aggregations.md

 }
 ```

+Output type is optional, and can be any native Druid type: `LONG`, `FLOAT`, `DOUBLE`, `STRING`, `ARRAY` types (e.g. `ARRAY<LONG>`), or `COMPLEX` types (e.g. `COMPLEX<json>`).


This raises questions that the docs should answer:

What benefit is there to providing outputType?

What happens if `outputType` is different from the type of the expression? Error, cast, something else?
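For reference while reading these questions, here is a minimal Python sketch of an expression post-aggregator spec with and without the optional `outputType`. The `type`, `name`, and `expression` keys reflect the native expression post-aggregator JSON as I understand it; the helper name, the expression, and the field names `total` and `rows` are hypothetical. A later commit in this PR is titled "add output coercion if outputType is defined on ExpressionPostAgg", which suggests the mismatch case ends in coercion, though the sketch does not model that.

```python
import json
from typing import Optional


def expression_post_agg(name: str, expression: str,
                        output_type: Optional[str] = None) -> dict:
    """Build an expression post-aggregator spec; outputType is optional."""
    spec = {"type": "expression", "name": name, "expression": expression}
    if output_type is not None:
        # Declares the type callers should expect from this post-aggregator.
        spec["outputType"] = output_type
    return spec


# Hypothetical expression over two aggregator outputs.
untyped = expression_post_agg("avg_size", '"total" / "rows"')
typed = expression_post_agg("avg_size", '"total" / "rows"', output_type="DOUBLE")

print(json.dumps(untyped))
print(json.dumps(typed))
```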


gianm · 2023-11-02T06:45:13Z

Thanks, the latest changes look good to me!

…n for the differences between arrays and mvds (apache#15245) * better documentation for the differences between arrays and mvds * add outputType to ExpressionPostAggregator to make docs true * add output coercion if outputType is defined on ExpressionPostAgg * updated post-aggregations.md to be consistent with aggregations.md and filters.md and use tables

…n for the differences between arrays and mvds (#15245) (#15307) * better documentation for the differences between arrays and mvds * add outputType to ExpressionPostAggregator to make docs true * add output coercion if outputType is defined on ExpressionPostAgg * updated post-aggregations.md to be consistent with aggregations.md and filters.md and use tables


better documentation for the differences between arrays and mvds

4c5a5db

clintropolis added Area - Documentation Area - Querying Area - SQL Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Oct 25, 2023

clintropolis added this to the 28.0 milestone Oct 25, 2023

github-actions bot removed Area - Querying Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Oct 25, 2023

Merge remote-tracking branch 'upstream/master' into arrays-are-not-mvds

b729e86

317brian self-requested a review October 25, 2023 22:33

gianm reviewed Oct 26, 2023


317brian reviewed Oct 26, 2023


clintropolis added 4 commits October 30, 2023 16:13

Merge remote-tracking branch 'upstream/master' into arrays-are-not-mvds

1b0ccbb

adjustments

5490ff0

fix link

8fe01fe

missed a spot

aaa1486

gianm reviewed Nov 1, 2023


doc fixes, add outputType to ExpressionPostAggregator to make docs true

bcefe45

github-actions bot added Area - Batch Ingestion Area - Querying Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Nov 1, 2023

clintropolis added 2 commits November 1, 2023 05:41

more test

7a25e0a

Merge remote-tracking branch 'upstream/master' into arrays-are-not-mvds

fa2a356

clintropolis added the Bug label Nov 1, 2023

gianm reviewed Nov 1, 2023


add output coercion if outputType is defined on ExpressionPostAgg, updated post-aggregations.md to be consistent with aggregations.md and filters.md and use tables

7168f50

clintropolis changed the title ~~better documentation for the differences between arrays and mvds~~ explicit outputType for ExpressionPostAggregator, better documentation for the differences between arrays and mvds Nov 1, 2023

clintropolis added 3 commits November 1, 2023 21:11

fix

0e6a862

more test

429f8ab

adjust

60dbc6e

gianm approved these changes Nov 2, 2023


clintropolis merged commit d261587 into apache:master Nov 2, 2023
82 checks passed

clintropolis deleted the arrays-are-not-mvds branch November 2, 2023 07:31

clintropolis mentioned this pull request Nov 2, 2023

[Backport] explicit outputType for ExpressionPostAggregator, better documentation for the differences between arrays and mvds (#15245) #15307

Merged
