Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSQ arrayIngestMode to control if arrays are ingested as ARRAY, MVD, or an exception #15093

Merged
merged 6 commits into from
Oct 9, 2023

Conversation

LakshSingla
Copy link
Contributor

@LakshSingla LakshSingla commented Oct 5, 2023

Description

MSQ uses the string dimension schema for ARRAY<STRING> typed columns, which creates MVDs instead of string arrays as is correct. Therefore someone trying to ingest columns of type ARRAY<STRING> from an external data source or another data source would get STRING columns in the newly generated segments.

This patch adds a arrayIngestMode query context parameter with the following behavior:

  1. If mode is array, uses auto dimension schema to correctly ingest the ARRAY<STRING> columns as ARRAY<STRING> columns, and also allows numeric ARRAY types to be ingested
  2. If mode is mvd, currently default to ease the transition, ARRAY<STRING> will continue to ingest as STRING typed MVDs, but logs will warn operators that this is a deprecated mode. Numeric arrays cannot be ingested in this mode.
  3. If mode is none, neither string or numeric ARRAY types can be ingested, this mode will be used as a forcing mechanism to force people to choose if they explicitly want MVDs or ARRAY typed columns, and in either case, suggest setting the mode to array and explicitly use ARRAY_TO_MV if users still want MVDs.

Release note

MSQ supports a new array mode arrayIngestMode which specifies the behavior of MSQ while ingesting arrays. Please refer to the docs in the release for a detailed behavior of what each option specifies. The default 'mvd' is the existing behaviour of MSQ, therefore doesn't require immediate intervention, however it is subject to removal in future releases.


Key changed/added classes in this PR
  • MyFoo
  • OurBar
  • TheirBaz

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Oct 5, 2023
@gianm
Copy link
Contributor

gianm commented Oct 5, 2023

This change is important but we need to do it less disruptively. People are using MV_TO_ARRAY because they need it for grouping on MVs during ingestion, and our docs recommend using it there. It's a relatively common pattern, and all these people would start getting string arrays instead of MVDs, which will break their queries.

For this reason the change needs to be opt-in. I suggest documenting the context parameter and swapping the default so the old behavior is retained. In addition I'd suggest writing up a doc page about how people can migrate from MVDs to string arrays (with examples of how to rewrite queries) and pointing people at that in the docs for this parameter. And showing people how to use ARRAY_TO_MV to keep things as MVDs. That sets us up for a future where we can deprecate the parameter and swap the default.

@gianm
Copy link
Contributor

gianm commented Oct 5, 2023

Another thought: is it possible to add explicit dimension schemas for the various types that can be generated by "auto"? In MSQ we know the exact type we want so it seems odd & circuitous to use "auto".

@clintropolis
Copy link
Member

clintropolis commented Oct 5, 2023

Another thought: is it possible to add explicit dimension schemas for the various types that can be generated by "auto"? In MSQ we know the exact type we want so it seems odd & circuitous to use "auto".

I've started doing some work on this, but its sort of non-trivial and shouldn't be part of this PR, so using 'auto' is currently the only way to ingest array columns

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

Are there any tests for trying to insert numeric arrays in mvd mode or any type of arrays in none mode?

public static final boolean DEFAULT_USE_AUTO_SCHEMAS = false;

public static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";
public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.NONE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default should be MVD.

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the document changes for this as well.

public static final boolean DEFAULT_USE_AUTO_SCHEMAS = false;

public static final String CTX_ARRAY_INGEST_MODE = "arrayIngestMode";
public static final ArrayIngestMode DEFAULT_ARRAY_INGEST_MODE = ArrayIngestMode.NONE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default mode should be MVD IMHO since it will not break stuff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally disagree, I think we should default to none, then people explicitly choose to use MVDs or arrays

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly because the behavior of MVD is totally incorrect

Copy link
Member

@clintropolis clintropolis Oct 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am working on updating the docs/examples refer to the parameter and to separate them into ones that explicitly store MVDs using ARRAY_TO_MV function, and ones which only use ARRAY_ functions into examples of storing array typed columns instead.

MVD mode is not going to have examples, because it should be the first to go and no one should rely on this behavior, because again the behavior is incorrect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO we cannot break user's ingestion sql query so I propose the mode should be MVD.
The patch should also throw a warning from the controller and throw it on the console if a row signature with Array is detected. The warning clearly lays out the path of migration.

There are lot of layers build out in organizations. A change even if its adding a context flag to sometimes takes weeks to reach to production. With this warning we effectively nudge the user that we are going to break compatibility soon.

Comment on lines 97 to 102
"String arrays can not be ingested when '%s' is set to '%s'. Either set '%s' in query context "
+ "to 'array' to ingest the string array as an array, or set it to 'mvd' to ingest the string array "
+ "as MVD (which is legacy behaviour and not recommmended)",
MultiStageQueryContext.CTX_ARRAY_INGEST_MODE,
StringUtils.toLowerCase(arrayIngestMode.name()),
MultiStageQueryContext.CTX_ARRAY_INGEST_MODE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should recommend MVD mode at all here, instead we should always recommend array mode and suggest using ARRAY_TO_MV if people want to store things as a MVD.

@LakshSingla
Copy link
Contributor Author

LakshSingla commented Oct 9, 2023

Are there any tests for trying to insert numeric arrays in mvd mode or any type of arrays in none mode?

Yes, there are tests verifying that arrays cannot be inserted in none mode, and numeric arrays cannot be inserted in MVD mode.

@abhishekagarwal87 abhishekagarwal87 merged commit b0edbc3 into apache:master Oct 9, 2023
53 checks passed
@clintropolis clintropolis changed the title MSQ writes out string arrays instead of MVDs by default MSQ arrayIngestMode to control if arrays are ingested as ARRAY, MVD, or an exception Oct 9, 2023
@LakshSingla
Copy link
Contributor Author

LakshSingla commented Oct 10, 2023

Thanks, @clintropolis for updating & aligning the description with what we actually merged, and not the stale changes.

@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
@LakshSingla LakshSingla deleted the msq-write-string-array branch October 13, 2023 19:39
ektravel pushed a commit to ektravel/druid that referenced this pull request Oct 16, 2023
MSQ uses the string dimension schema for ARRAY<STRING> typed columns, which creates MVDs instead of string arrays as required. Therefore someone trying to ingest columns of type ARRAY<STRING> from an external data source or another data source would get STRING columns in the newly generated segments.

This patch changes the following:

- Use auto dimension schema to ingest the ARRAY<STRING> columns, which will create columns with the desired type.
- Add an undocumented flag ingestStringArraysAsMVDs to preserve the legacy behavior. Legacy behaviour is turned on by default. 
- Create MSQArraysInsertTest and refactor some of the tests in MSQInsertTest.
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
MSQ uses the string dimension schema for ARRAY<STRING> typed columns, which creates MVDs instead of string arrays as required. Therefore someone trying to ingest columns of type ARRAY<STRING> from an external data source or another data source would get STRING columns in the newly generated segments.

This patch changes the following:

- Use auto dimension schema to ingest the ARRAY<STRING> columns, which will create columns with the desired type.
- Add an undocumented flag ingestStringArraysAsMVDs to preserve the legacy behavior. Legacy behaviour is turned on by default. 
- Create MSQArraysInsertTest and refactor some of the tests in MSQInsertTest.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Release Notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants