Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Display the output column name in InvalidNullByteException #14780

Merged
merged 6 commits into from
Aug 24, 2023

Conversation

LakshSingla
Copy link
Contributor

@LakshSingla LakshSingla commented Aug 8, 2023

Description

InvalidNullByteFault for MSQ displayed the query column name (the one that is used internally while planning and executing the query) instead of the output column name (the one that is given by the user). Sometimes the two are similar, however, for columns involving expressions, it is not the case.
It can be misleading for someone who is not familiar with this quirk, and even then it would require reading the native plan to figure out the culprit column.

This PR maps the query column to the output column name while surfacing the fault since that is readily visible to the user while executing the query.


For a query like:

WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"inline","data":"{\"desc\":\"Normal row\",\"text\":\"Hi I am a normal row\"}\n{\"desc\":\"Row with NULL\",\"text\":\"There is a null in\\u0000 here somewhere\"}\n{\"desc\":\"Anther normal row\",\"text\":\"Nice\"}\n"}',
    '{"type":"json"}'
  )
) EXTEND ("desc" VARCHAR, "text" VARCHAR))
SELECT
  "desc",
  REPLACE("text", 'a', 'A') AS "text"
FROM "ext"

Original Behaviour:

InvalidNullByte: Invalid null byte at source [external input source: InlineInputSource{data='{"desc":"Normal row","text":"Hi I am a normal row"} {"desc":"Row with NULL","text":"There is a null in\u0000 here somewhere"} {"desc":"Anther normal row","text":"Nice"} '}], rowNumber [2], column[v0], value[There is A null in here somewhere], position[18]. Consider sanitizing the string using REPLACE("v0", U&'\0000', '') AS v0.

Notice it refers to the column as 'v0'

Modified Behaviour
InvalidNullByte: Invalid null byte at source [external input source: InlineInputSource{data='{"desc":"Normal row","text":"Hi I am a normal row"} {"desc":"Row with NULL","text":"There is a null in\u0000 here somewhere"} {"desc":"Anther normal row","text":"Nice"} '}], rowNumber [2], column[text], value[There is A null in here somewhere], position[18]. Consider sanitizing the string using REPLACE("text", U&'\0000', '') AS text.


Note, for columns without alias, it will display it as EXPR$<number> which is still more informative than something like v0.

WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"inline","data":"{\"desc\":\"Normal row\",\"text\":\"Hi I am a normal row\"}\n{\"desc\":\"Row with NULL\",\"text\":\"There is a null in\\u0000 here somewhere\"}\n{\"desc\":\"Anther normal row\",\"text\":\"Nice\"}\n"}',
    '{"type":"json"}'
  )
) EXTEND ("desc" VARCHAR, "text" VARCHAR))
SELECT
  "desc",
  REPLACE("text", 'a', 'A')
FROM "ext"

Modified Behaviour
InvalidNullByte: Invalid null byte at source [external input source: InlineInputSource{data='{"desc":"Normal row","text":"Hi I am a normal row"} {"desc":"Row with NULL","text":"There is a null in\u0000 here somewhere"} {"desc":"Anther normal row","text":"Nice"} '}], rowNumber [2], column[EXPR$1], value[There is A null in here somewhere], position[18]. Consider sanitizing the string using REPLACE("EXPR$1", U&'\0000', '') AS EXPR$1.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekagarwal87
Copy link
Contributor

can "Consider sanitizing the string using" be changed to "Consider sanitizing the input string using"

IntList outputColumnsForQueryColumn = columnMappings.getOutputColumnsForQueryColumn(columnName);

// outputColumnsForQueryColumn.size should always be 1 due to hasUniqueOutputColumnNames check that is done
if (outputColumnsForQueryColumn.size() >= 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Should we handle the case where it is more than 1 separately for neatness? That said, since this is code to handle a different type of exception, I am okay with the current code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be fine since even if there are multiple columns for a single query column, to surface the exception, we can only display one name, and displaying the first one seems okay.

If the code which throws InvalidNullByteException instead showed all the columns with values \0 in them (which I think won't be possible), we can modify it, however as the code currently is, we throw on encountering the first value with \0.

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one comment.
Rest all LGTM.

: null;
final MSQErrorReport workerError = workerErrorRef.get();
MSQErrorReport workerError = mapQueryColumnNameToOutputColumnName(workerErrorRef.get());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO: this change should go into :

public void workerError(MSQErrorReport errorReport)
{
if (workerTaskLauncher.isTaskCanceledByController(errorReport.getTaskId()) ||
!workerTaskLauncher.isTaskLatest(errorReport.getTaskId())) {
log.info("Ignoring task %s", errorReport.getTaskId());
} else {
workerErrorRef.compareAndSet(null, errorReport);
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes much more sense, incorporated the feedback in my latest commit.

@LakshSingla
Copy link
Contributor Author

LakshSingla commented Aug 23, 2023

Missed out on the latest comments. Thanks for the reviews @abhishekagarwal87 @adarshsanjeev @cryptoe!!

@LakshSingla LakshSingla added the Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 label Aug 23, 2023
@LakshSingla LakshSingla merged commit f9f734c into apache:master Aug 24, 2023
46 checks passed
@LakshSingla LakshSingla deleted the inbe-better-column-name branch August 24, 2023 04:24
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants