
consolidate json and auto indexers, remove v4 nested column serializer #14456

Merged
merged 7 commits into apache:master from consolidate-json-and-auto on Aug 23, 2023

Conversation

@clintropolis (Member) commented Jun 21, 2023

Description

This PR consolidates the 'json' and 'auto' column indexers, bringing all the improvements of 'auto' to 'json'. 'auto' is the next generation (v5) of the 'json' (v4) nested column format; however, 'json' was kept separate and on the previous version of the format for backwards compatibility during rolling upgrades from 25 to 26. Consolidating them allows us to dump a bunch of code, leaving only the v4 reader so that we can continue to read the older columns.

I will open a separate PR to handle future backwards-compatibility concerns so that Druid can degrade gracefully when new column format features are added, which should make the kind of compatibility handling we had to do here less important.

Release note

'json' is now equivalent to 'auto' in native ingestion dimension specs, upgrading 'json' to all of the features and functionality of 'auto', such as type specializations including ARRAY typed columns, better support for nested arrays, and smarter index utilization. However, this means that 'json' columns created with Druid 28 are not backwards compatible with Druid versions older than 26. If you are upgrading from a version older than 26 and use 'json' columns, upgrade to Druid 26 before upgrading to Druid 28. Additionally, before downgrading to a version older than Druid 26, re-ingest any segments created in Druid 28 using Druid 26 or 27.
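As a minimal sketch of what this means for native ingestion (the dimension names here are hypothetical), both of the following dimension declarations now produce the same v5 nested column format:

```json
"dimensionsSpec": {
  "dimensions": [
    { "type": "json", "name": "shipTo" },
    { "type": "auto", "name": "details" }
  ]
}
```

Previously, the `json` type produced the older v4 format while `auto` produced v5; after this change both are handled by the same indexer.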


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • been tested in a test Druid cluster.

@@ -23,12 +23,14 @@ sidebar_label: Nested columns
~ under the License.
-->

Apache Druid supports directly storing nested data structures in `COMPLEX<json>` columns. `COMPLEX<json>` columns store a copy of the structured data in JSON format and specialized internal columns and indexes for nested literal values&mdash;STRING, LONG, and DOUBLE types. An optimized [virtual column](./virtual-columns.md#nested-field-virtual-column) allows Druid to read and filter these values at speeds consistent with standard Druid LONG, DOUBLE, and STRING columns.
Apache Druid supports directly storing nested data structures in `COMPLEX<json>` columns. `COMPLEX<json>` columns store a copy of the structured data in JSON format and specialized internal columns and indexes for nested literal values&mdash;STRING, LONG, and DOUBLE types, as well as ARRAY of STRING, LONG, and DOUBLE values. An optimized [virtual column](./virtual-columns.md#nested-field-virtual-column) allows Druid to read and filter these values at speeds consistent with standard Druid LONG, DOUBLE, and STRING columns.
@ektravel (Contributor) commented Jul 12, 2023
Suggested change
Apache Druid supports directly storing nested data structures in `COMPLEX<json>` columns. `COMPLEX<json>` columns store a copy of the structured data in JSON format and specialized internal columns and indexes for nested literal values&mdash;STRING, LONG, and DOUBLE types, as well as ARRAY of STRING, LONG, and DOUBLE values. An optimized [virtual column](./virtual-columns.md#nested-field-virtual-column) allows Druid to read and filter these values at speeds consistent with standard Druid LONG, DOUBLE, and STRING columns.
Apache Druid supports directly storing nested data structures in `COMPLEX<json>` columns. `COMPLEX<json>` columns store a copy of the structured data in JSON format and specialized internal columns and indexes for nested literal values&mdash;STRING, LONG, and DOUBLE types, as well as ARRAY of STRING, LONG, and DOUBLE values. An optimized [virtual column](./virtual-columns.md#nested-field-virtual-column) allows Druid to read and filter these values at speeds consistent with standard Druid LONG, DOUBLE, and STRING columns.

Escaping the <> is required for Docusaurus 2. Otherwise, it treats it as a JavaScript/HTML tag.

@clintropolis (Member, Author) replied: even inside of backticks? that seems frustrating

A contributor commented: @317brian I thought escaping <> was required for Docusaurus 2?

A contributor replied: @clintropolis I was wrong and we do not need to escape <> inside backticks.


Druid [SQL JSON functions](./sql-json-functions.md) allow you to extract, transform, and create `COMPLEX<json>` values in SQL queries, using the specialized virtual columns where appropriate. You can use the [JSON nested columns functions](math-expr.md#json-functions) in [native queries](./querying.md) using [expression virtual columns](./virtual-columns.md#expression-virtual-column), and in native ingestion with a [`transformSpec`](../ingestion/ingestion-spec.md#transformspec).
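For example (a minimal sketch, assuming a hypothetical datasource `nested_data` with a `COMPLEX<json>` column named `shipTo`):

```sql
SELECT
  JSON_VALUE(shipTo, '$.address.country') AS country,                -- extract a STRING literal value
  JSON_VALUE(shipTo, '$.items[0].price' RETURNING DOUBLE) AS price,  -- extract with an explicit SQL type
  JSON_QUERY(shipTo, '$.address') AS address,                        -- extract a nested object as COMPLEX<json>
  TO_JSON_STRING(shipTo) AS raw_json                                 -- serialize the whole value to a string
FROM nested_data
```

The `JSON_VALUE` calls are the ones that can typically be served by the specialized nested-field virtual columns.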

You can use the JSON functions in INSERT and REPLACE statements in SQL-based ingestion, or in a `transformSpec` in native ingestion as an alternative to using a [`flattenSpec`](../ingestion/data-formats.md#flattenspec) object to "flatten" nested data for ingestion.
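As a sketch of that SQL-based ingestion path (the input URI, column names, and target datasource are hypothetical), nested input can be shaped with JSON functions directly in the INSERT query instead of a `flattenSpec`:

```sql
INSERT INTO nested_data
SELECT
  TIME_PARSE("timestamp") AS __time,
  JSON_VALUE(shipTo, '$.address.country') AS country,  -- flattened literal field
  shipTo                                               -- kept whole as a COMPLEX<json> column
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/orders.json.gz"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "shipTo", "type": "COMPLEX<json>"}]'
  )
)
PARTITIONED BY DAY
```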

Columns ingested as `COMPLEX<json>` are automatically optimized to store the most appropriate physical column based on the data processed. For example, if only LONG values are processed, Druid will store a LONG column, ARRAY columns if the data consists of arrays, or `COMPLEX<json>` in the general case if the data is actually nested. This is the same functionality that powers ['type aware' schema discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
@ektravel (Contributor) commented Jul 12, 2023
Suggested change
Columns ingested as `COMPLEX<json>` are automatically optimized to store the most appropriate physical column based on the data processed. For example, if only LONG values are processed, Druid will store a LONG column, ARRAY columns if the data consists of arrays, or `COMPLEX<json>` in the general case if the data is actually nested. This is the same functionality that powers ['type aware' schema discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
Columns ingested as `COMPLEX<json>` are automatically optimized to store the most appropriate physical column based on the data processed. For example, if only LONG values are processed, Druid stores a LONG column, ARRAY columns if the data consists of arrays, or `COMPLEX<json>` in the general case if the data is actually nested. This is the same functionality that powers ['type aware' schema discovery](../ingestion/schema-design.md#type-aware-schema-discovery).
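For reference, the type-aware schema discovery mentioned in the paragraph under review is enabled in native ingestion through the `dimensionsSpec` (a minimal sketch; the rest of the ingestion spec is omitted):

```json
"dimensionsSpec": {
  "useSchemaDiscovery": true,
  "dimensions": []
}
```

With no explicit dimensions listed, Druid discovers the columns and their types from the data, using the same logic that picks the physical column type for `COMPLEX<json>` input.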

@ektravel (Contributor) left a review comment:
Reviewed nested-columns.md

clintropolis removed this from the 27.0 milestone on Jul 18, 2023
@ektravel (Contributor) left a review comment:
Docs portion is approved

clintropolis merged commit fb053c3 into apache:master on Aug 23, 2023
clintropolis deleted the consolidate-json-and-auto branch on Aug 23, 2023 at 01:50
LakshSingla added this to the 28.0 milestone on Oct 12, 2023