Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing the array behavior to start the index from 1. #1682

Conversation

hjafarpour
Copy link
Contributor

Description

Currently KSQL array index starts from 0. This PR changes this behavior so the array index starts from 1 so we are compatible with SQL standard.
This is a breaking change and we need to also include this in our release notes explicitly.
The documentation also has been updated.

This relates to #1659

Testing done

Describe the testing stategy. Unit and integration tests are expected for any behavior changes.

Reviewer checklist

  • Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
  • Ensure relevant issues are linked (description should include text like "Fixes #")

@hjafarpour hjafarpour self-assigned this Aug 1, 2018
@hjafarpour hjafarpour requested a review from a team August 1, 2018 23:07
Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @hjafarpour

This is a breaking change without providing the user to switch back to the old way of working. Don't we need to provide a configuration setting they can use to switch the array base index back to zero, if they need to be able to upgrade but maintain old apps?

This was discussed in slack and in the #1659 itself.

I've suggested a ksql.functions.arrayget.legacy.base configuration. As such would match the ksql.functions. prefix used for controlling other functions. (I know array access is not currently a UDF, but it probably should be).

See #1635 for an example: SUBSTRINGs use of ksql.functions.substring.legacy.args.

@hjafarpour
Copy link
Contributor Author

It's ok to have breaking changes as long as we communicate them clearly. Users just need to update the queries that use arrays accordingly when they upgrade. Streams and many other projects have had breaking changes and by communicating them they don't need to keep supporting the old/deprecated behavior. If we have them as early as possible it's better.
Supporting old and new behavior in the same version would result in confusion in different ways. For instance you could force users to write their new queries based on the old/deprecated behavior even when they have the new version available.
@big-andy-coates let me know what you think.

@dguy
Copy link
Contributor

dguy commented Aug 3, 2018

My vote is that we make it a non-breaking change. Make a note that 0 based indexing is effectively deprecated and that people should move to 1 based indexing as in a future release 0 based indexing will be removed.

@big-andy-coates big-andy-coates requested a review from a team August 3, 2018 12:58
@big-andy-coates
Copy link
Contributor

My vote is that we make it a non-breaking change. Make a note that 0 based indexing is effectively deprecated and that people should move to 1 based indexing as in a future release 0 based indexing will be removed.

You mean make the default still base-0, but have an option to switch to base-1, and encourage people to do so?

Yeah, works for me - can do the same for SUBSTRING.

You mean something like:

v5.1 - support new style, but default to old.
v5.2 - switch default to new. (could also be a major release...)
v6.0 - remove switch

@big-andy-coates
Copy link
Contributor

@hjafarpour - might also be worth documenting this in the release docs, similar to:
https://github.com/confluentinc/docs/pull/1095

@hjafarpour
Copy link
Contributor Author

My main objection was having to support old behavior forever. If we only have the switch for only one more release and then remove it, it sounds good. I would suggest a process the same as @big-andy-coates described except would make the default the new behavior:

v5.1 - support new style, but default to the new (base-1).
v5.2 or 6.0 - remove the switch

What do you think @big-andy-coates and @dguy?

@rodesai
Copy link
Contributor

rodesai commented Aug 3, 2018

My vote is that we make it a non-breaking change. Make a note that 0 based indexing is effectively deprecated and that people should move to 1 based indexing as in a future release 0 based indexing will be removed.

I don't necessarily agree with deprecating until we have an actual way of migrating queries. Currently the only option for users is to stop their queries and then start new ones. The new queries won't resume with the same state as the previously running queries. So if its a requirement for you to keep your app running then you cannot switch. So either we keep the settings/behavior around, or we deprecate and then our users can't upgrade and we have to support lots of different KSQL versions. Not sure which is worse.

Providing a migration path for this kind of change shouldn't be too hard. We just need some way to stop the query, update some settings and the query itself, and then resume it with the same consumer group. But IMO that should be a prereq for deprecating settings.

@dguy
Copy link
Contributor

dguy commented Aug 3, 2018

Yeah, i mean leave the default as it is for now. So that queries don't break. We can then figure out if we remove it in the future.

@hjafarpour hjafarpour force-pushed the KSQL-1659-Change-Array-Index-to-Start-From-One branch from 03f8c86 to 1ea3765 Compare August 14, 2018 01:18
@big-andy-coates
Copy link
Contributor

big-andy-coates commented Aug 14, 2018

Yeah, i mean leave the default as it is for now. So that queries don't break. We can then figure out if we remove it in the future.

Then we should also swap #1635 to currently default to the old school base-0 and args and document in the same way.

Of course, the down side of this approach is the no-one will read the docs and switch their implementation / set the property. So they'll still break when we change the default. But at least we'll have documented the fact the change is coming!

@miguno
Copy link
Contributor

miguno commented Aug 14, 2018

My opinion is to:

  • Keep the current behavior as the default (starts at 0).
  • Allow to enable the new behavior (starts at 1) with a configuration option.
  • Mark the current behavior as Deprecated.

(similar to what others have said above)

@big-andy-coates
Copy link
Contributor

Yep, agreed - currently fixing up the SUBSTRING change to follow the same pattern, and I've raised the following to ensure we don't forget to switch the default at in some future release: #1734

@big-andy-coates
Copy link
Contributor

I don't necessarily agree with deprecating until we have an actual way of migrating queries. Currently the only option for users is to stop their queries and then start new ones. The new queries won't resume with the same state as the previously running queries. So if its a requirement for you to keep your app running then you cannot switch. So either we keep the settings/behavior around, or we deprecate and then our users can't upgrade and we have to support lots of different KSQL versions. Not sure which is worse.

@rodesai I don't see how deprecating the old way of working is going to cause issues with migrating queries - can you explain? My understanding was that for command-topic based clusters they will already have the 'legacy.mode = true' property saved in the command topic, so flipping the default to use the new more SQL-like syntax won't change any running queries. For non-interactive mode, if they would need to add a 'set property' statement at the top of their sql file, or switch the use of SUBSTRING in the SQL file to match the new syntax, both of which would work without breaking the running app.

However, I agree that later removing the setting all together would break old queries/apps using the old syntax. (Was this what you meant?).

Providing a migration path for this kind of change shouldn't be too hard. We just need some way to stop the query, update some settings and the query itself, and then resume it with the same consumer group. But IMO that should be a prereq for deprecating settings.

By 'deprecating settings' to you mean 'remove settings'?

@hjafarpour hjafarpour force-pushed the KSQL-1659-Change-Array-Index-to-Start-From-One branch from 1ea3765 to 1d07f19 Compare August 14, 2018 16:47
@rodesai
Copy link
Contributor

rodesai commented Aug 14, 2018

@big-andy-coates yeah my bad - s/deprecate/remove/

@big-andy-coates
Copy link
Contributor

@rodesai

@big-andy-coates yeah my bad - s/deprecate/remove/

In which case, I'm with you! :D

Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hjafarpour

To summarise the above - as it stands:

  1. this is a breaking change
  2. users have no way of opting out or in

As discussed above, we should:

  1. switch between base-zero and base-one indexing using a configuration setting, e.g. ksql.functions.arrayget.legacy.base. I think this is probably best achieved by rewriting the query to use a UDF for array access.
  2. Have the default for the new setting as 'legacy mode' initially. (Meaning this is not a breaking change).
  3. Document the setting and the mark the base-zero as deprecated. (See substring for example, so we're consistent: Switch the default mode of SUBSTRING to be the 'legacy' mode #1735).
  4. Add section to release notes highlighting the change. (See https://github.com/confluentinc/docs/pull/1095 for an example.)
  5. Add json based test cases covering both base-zero and base-one, and proving base-zero is current default. (See https://github.com/confluentinc/ksql/pull/1735/files#diff-d627a383a783655bcd5afc320447a410).

@@ -532,7 +532,7 @@ private String visitBooleanComparisonExpression(ComparisonExpression.Type type)
switch (internalSchema.type()) {
case ARRAY:
return new Pair<>(
String.format("((%s) ((%s)%s).get((int)(%s)))",
String.format("((%s) ((%s)%s).get(((int)(%s)) - 1))",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The -1 needs to be controlled by a config setting.

Personally, I think this might be best achieved by replacing array access with a UDF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below comment.

@@ -243,13 +245,15 @@ The supported column data types are:
- ``BIGINT``
- ``DOUBLE``
- ``VARCHAR`` (or ``STRING``)
- ``ARRAY<ArrayType>`` (JSON and AVRO only. Index starts from 0)
- ``ARRAY<ArrayType>`` (JSON and AVRO only. Index starts from 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have this listed twice in the docs? Fancy pulling this out into its own section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in another PR since it's not directly related to this one.

- ``MAP<VARCHAR, ValueType>`` (JSON and AVRO only)
- ``STRUCT<FieldName FieldType, ...>`` (JSON and AVRO only) The STRUCT type requires you to specify a list of fields.
For each field you must specify the field name (FieldName) and field type (FieldType). The field type can be any of
the supported KSQL types, including the complex types ``MAP``, ``ARRAY``, and ``STRUCT``. ``STRUCT`` fields can be
accessed in expressions using the struct dereference (``->``) operator. See :ref:`operators` for more details.

Note that starting KSQL 5.1 array index starts from 1. For the versions of KSQL before 5.1, array index starts from 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put this 'note' inline with the ARRAY type in the list, otherwise users may miss it. (Even if that means changing the layout from a list to a table or something).

Also, need to change the default back to 0 and document the configuration setting that allows users to switch to the new more SQwhelly way of working. (as discussed above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an inline note in addition to this paragraph. The default is now 0.

@@ -22,7 +22,7 @@
"name": "extract JSON array field",
"statements": [
"CREATE STREAM TEST (array_field ARRAY<VARCHAR>) WITH (kafka_topic='test_topic', value_format='JSON');",
"CREATE STREAM OUTPUT AS SELECT EXTRACTJSONFIELD(array_field[0], '$.nested') AS Col1, EXTRACTJSONFIELD(array_field[1], '$.nested') AS Col2 FROM TEST;"
"CREATE STREAM OUTPUT AS SELECT EXTRACTJSONFIELD(array_field[1], '$.nested') AS Col1, EXTRACTJSONFIELD(array_field[2], '$.nested') AS Col2 FROM TEST;"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tests that test both old style and new style array access (with appropriate properties set), and one proving the default is the old base-zero. (See substring.json for examples).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the tests.

@big-andy-coates big-andy-coates requested a review from a team August 14, 2018 17:50
@hjafarpour
Copy link
Contributor Author

@big-andy-coates thanks for the latest review. Here is the current state:

  1. I added ksql.functions.array.legacy.base config variable to configure the array base. If it's set to false the array index starts from 1, otherwise it starts from 0. The default is true meaning the array index starts from 0 as you suggested.
  2. Updated the docs accordingly.
  3. Added tests for QueryTranslationTests to test both cases based on the ksql.functions.array.legacy.base value.
  4. Will update the release notes after the PR is merged.

KSQL_FUNCTIONS_ARRAY_LEGACY_BASE_CONFIG,
ConfigDef.Type.BOOLEAN,
true,
true,
Copy link
Contributor

@rodesai rodesai Aug 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you set this to false we'd get the new behavior for any new queries, without breaking anything that's already running. At least for interactive mode. For headless mode, we can document that the setting should be set to true as part of the upgrade steps. Or change the executor to always use the old defaults.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dguy flagging this to you - what do you reckon? I've switched SUBSTRING to default to old way, but maybe switching back to default to new way, (while maintaining backwards compatibility with old interactive queries), is the way to go?

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like @rodesai 's suggestion too. Seems that @big-andy-coates also is leaning this way. Will wait for @dguy 's opinion before making any changes here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll switch SUBSTRING back to this model as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SUBSTRING updated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then should this be updated to false here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then should this be updated to false here?

Yes, I believe so. To be clear: old should be true and new should be false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old is true and new is false now.

@big-andy-coates
Copy link
Contributor

big-andy-coates commented Aug 15, 2018

@hjafarpour what do you think to the idea of moving array access into a UDF? This would make SqlToJavaVisitor, which is already an overly complex class, simpler.

I think this change adds enough additional complexity to array access to warrant moving it into its own class.

Even if we don't, but think we might in the future, I would suggest renaming ksql.functions.array.legacy.base to ksql.functions.arrayget.legacy.base, as a future UDF would not be called Array, but something like ArrayGet.

Also, with reference to #1682 (comment) and #1682 (comment) ... are you proposing that you'll restructor the docs in another PR and then merge into this one so that you can provide cleaner docs here? :D

@hjafarpour
Copy link
Contributor Author

@big-andy-coates what do you mean by making array access as a UDF? Do you mean internally or externally? Can you give an example?

@big-andy-coates
Copy link
Contributor

big-andy-coates commented Aug 16, 2018

@hjafarpour by make it a UDF, I mean move the implementation of how array access is handled into a type, e.g. ArrayGet, which, given an array and an index will return the item at that index. This ArrayGet can then be injected in SqlToJavaVisitor or we could make it a new style UDF (i.e. like Substring), and use the re-write logic to introduce the ArrayGet UDF as we do for struct access. (Basically, ArrayGet should be an internal implementation detail in the same way as FetchFieldFromStruct is).

The benefits of this are:

  • it simplifies SqlToJavaVisitor
  • it moves array access into a more isolated bit of code, so, for example, your new property for setting which index base to use is only in that one class.
  • it allows better testing of array access
  • it would mean we can more easily improve edge case handling e.g. at the moment I think if you access an array element out of bounds it will throw an exception, where as I think most other implementations just return null.

@apurvam
Copy link
Contributor

apurvam commented Aug 18, 2018

Thanks for the update @hjafarpour .. I definitely think that your latest patch is an improvement.

I am fine with merging as is, but before we do, it would be good to understand why going the re-write route is more appropriate for the struct dereference, but not appropriate for the array dereference. Why can't both dereference code paths be unified?

It is worth understanding the differences between the two which merit the different treatment, IMO.

Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hjafarpour I'm with @apurvam on this one. Yes, this is definitely an improvement. But it also introduces a new way of encapsulating such functionality and wiring it in.

It seems to me that structs could be handled the same way, so why did we go with UDFs for structs??

IMHO, we should chose one way, (probably the most flexible), and standardise on it.

KSQL_FUNCTIONS_ARRAY_LEGACY_BASE_CONFIG,
ConfigDef.Type.BOOLEAN,
true,
true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then should this be updated to false here?

Yes, I believe so. To be clear: old should be true and new should be false.

@@ -539,11 +545,12 @@ private String visitBooleanComparisonExpression(final ComparisonExpression.Type
switch (internalSchema.type()) {
case ARRAY:
return new Pair<>(
String.format("((%s) ((%s)%s).get((int)(%s)))",
String.format("((%s) (ArrayGet.getItem(((%s) %s), (int)(%s), %s)))",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If index is null, then this will throw NPE. The index may be derived from the source query e.g.

SELECT array_field[index_field] FROM ...

So we may want to consider handling a null index by just returning null. (I've not checked what other vendors do, but most seem to err on the side of not-logging-loads-of-errors and just returning null)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem with returning null is that we won't know if the value of the array item is null or there has been an error. We do return null for the column in case of exception anyway but I think having the error cause in the log is informative, although I agree that we may end up having lot's of errors in the log.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem with returning null is that we won't know if the value of the array item is null or there has been an error. We do return null for the column in case of exception anyway but I think having the error cause in the log is informative, although I agree that we may end up having lot's of errors in the log.

I'd normally agree with you. I'm a fan or fail hard and fail fast. However, the pattern for most SQL implementations is to be pretty lax about error reporting. For example, substring just returns null if the start index is out of bounds.

At the very least, we should be logging a more informative error than an NPE.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented in another thread, but I think it would be better form to log the exception explicitly at a debug level so that we can control what goes into our logs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we throw an exception it will be logged in the processing log...

private ArrayGet() {}

public static Object getItem(final List<?> array, final int index, final boolean isLegacy) {
final int currectIndex = isLegacy ? index : index - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: spelling currect -> correct?

public void shouldHandleMultiDimensionalArrayWithBaseStartingFrom1() throws Exception {
final KsqlConfig localKsqlConfig =
new KsqlConfig(Collections.singletonMap(KsqlConfig.KSQL_FUNCTIONS_ARRAY_LEGACY_BASE_CONFIG, false));
final CodeGenRunner localCodeGenRunner = new CodeGenRunner(schema, localKsqlConfig, functionRegistry);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider making ksqlConfig field a mock, which means you no longer need to create a new running, just set up the mock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it into a mock.

final List<String> innerArray1 = Arrays.asList("item_11", "item_12");
final List<String> innerArray2 = Arrays.asList("item_21", "item_22");
final Object[] args = new Object[]{Arrays.asList(innerArray1, innerArray2)};
final Object result = expressionEvaluatorMetadata.getExpressionEvaluator().evaluate(args);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider refactoring this to be more like:

@Test
    public void shouldHandleMultiDimensionalArrayWithBaseStartingFrom1() throws Exception {
        // Given:
        final String simpleQuery = "SELECT col14[1][1] FROM CODEGEN_TEST;";

        final Object[] arguments = {Arrays.asList(
            Arrays.asList("item_11", "item_12"),
            Arrays.asList("item_21", "item_22"))};

        EasyMock.expect(ksqlConfig.getBoolean(KsqlConfig.KSQL_FUNCTIONS_ARRAY_LEGACY_BASE_CONFIG)).andReturn(false);
        EasyMock.replay(ksqlConfig);

        // When:
        
        final Object result = executeExpression(simpleQuery, arguments);

        // Then:
        assertThat(result, instanceOf(String.class));
        assertThat(result, equalTo("item_11"));
    }

    private Object executeExpression(final String simpleQuery, final Object[] arguments) throws Exception {
        final Analysis analysis = analyzeQuery(simpleQuery);
        final ExpressionMetadata expressionEvaluatorMetadata = codeGenRunner.buildCodeGenFromParseTree
            (analysis.getSelectExpressions().get(0));
        return expressionEvaluatorMetadata.getExpressionEvaluator().evaluate(arguments);
    }

To make the test more readable. The same could be applied to other tests, like the one above this one, for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated those two tests as you suggested.

final String javaExpression = new SqlToJavaVisitor(schema, functionRegistry)

@Test
public void shouldProcessArrayExpressionCorrectlyForNonLegacyIndexBase() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider refactoring to make this more readable, e.g. similar to the example I give above in the code runner class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the common code to a method.

}

@Test
public void shouldReturnNullForOutOfBound() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add another test for bounds < 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the tests so we test the exception. As I mentioned above I changed the behavior to throw the exception since we want to distinguish between error and null values.

@hjafarpour
Copy link
Contributor Author

@big-andy-coates @apurvam things are more complex for STRUCT. Unlike array, for struct we need to make changes in analysis and planning phases too so we can resolve nested references correctly. Rewriting the query with function calls for STRUC eliminates the need for such changes too.

@hjafarpour hjafarpour force-pushed the KSQL-1659-Change-Array-Index-to-Start-From-One branch from 842247e to 9d8b976 Compare August 21, 2018 01:20
Copy link
Contributor

@big-andy-coates big-andy-coates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hjafarpour, some comments added below.

@big-andy-coates @apurvam things are more complex for STRUCT. Unlike array, for struct we need to make changes in analysis and planning phases too so we can resolve nested references correctly. Rewriting the query with function calls for STRUC eliminates the need for such changes too.

I'm not familiar enough with the struct rewrite stuff to comment, though I did have a look at the code around KsqlParser and the rewrite stuff. As part of trying to understand it I noticed a lot of duplication and so I refactored it in #1775. The struct rewrite only happens in one place now. Not sure how this relates to your comment above about STRUCT needing to make changes in analysis and planning phases.

Personally, I can't conceptually see any difference between a STRUCT:

SELECT someStruct->f1.f2 FROM ...

Compared to ARRAY:

SELECT someArray[12] FROM ...

Both are taking some path, e.g. 'f1.f2' or '12' to resolve on some object, e.g. a struct or an array.

Maybe you could explain the differences by way of an example? I realise my knowledge in this area is not as in depth as yours, so this might help me learn and understand!

@@ -539,11 +545,12 @@ private String visitBooleanComparisonExpression(final ComparisonExpression.Type
switch (internalSchema.type()) {
case ARRAY:
return new Pair<>(
String.format("((%s) ((%s)%s).get((int)(%s)))",
String.format("((%s) (ArrayGet.getItem(((%s) %s), (int)(%s), %s)))",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One problem with returning null is that we won't know if the value of the array item is null or there has been an error. We do return null for the column in case of exception anyway but I think having the error cause in the log is informative, although I agree that we may end up having lot's of errors in the log.

I'd normally agree with you. I'm a fan or fail hard and fail fast. However, the pattern for most SQL implementations is to be pretty lax about error reporting. For example, substring just returns null if the start index is out of bounds.

At the very least, we should be logging a more informative error than an NPE.


public static Object getItem(final List<?> array, final int index, final boolean isLegacy) {
final int correctIndex = isLegacy ? index : index - 1;
return array.get(correctIndex);
Copy link
Contributor

@big-andy-coates big-andy-coates Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've removed the out of bounds handing that was here before. Just wondering why?

I'm guessing this is inline with your comment above, and you want the exception to be logged? If so, then I think this should be a general question to the team to come to a decision on how we handle such issues: do we silently handle them and return null, or do we throw an exception, log it and then return null?

Personally, I'm with you and prefer the logging approach. But if this is not what expected by our users then they might not thank us for spamming the logs. Likewise, in cloud we're going to end up with a lot of noise in the logs, (though we can turn it off I guess).

Which ever way we decide to go we should be consistent. So, for example, if we go with throwing exceptions, then we should change SubString to follow the same pattern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now we can throw an exception and return null. This way we make sure we don't lose information that user may need. This is the same behavior for other types of expressions. Let's make Substring behave the same.
I agree with the possibility of trashing the log and I think @blueedgenick 's suggestion(#1795 ) in having a data error log file would alleviate this problem.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of just dumping exception traces to the log by default. Even if we have #1795, it would be much better to explicitly log this at some sort of verbose logging level (like debug) so that we can control what is logged by setting the class level log level in the config.

If these types of things were logged at debug level, then in most cases we won't see them until we want to debug something, at which point we can change the log level for the class (potentially at run time through JMX) and then collect the necessary information.

If you see our soak cluster, we already have many NPE stack traces in the logs. These are really hard to work with because:

  1. It is hard to grep these out given that they have many lines.
  2. The sheer number crowds out everything else.

I vote for not adding to the problem in this patch.

public void shouldHandleMultiDimensionalArray() throws Exception {
public void shouldHandleMultiDimensionalArrayWithLegacyBase() throws Exception {
// Given:
final KsqlConfig localKsqlConfig = EasyMock.mock(KsqlConfig.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than creating a localKsqlConfig and localCodeGenRunner, why not just switch the ksqlConfig to a @Mock(NICE)? Same in the test below.


assertThat(javaExpression, equalTo("(TEST1_COL0 + TEST1_COL3)"));
}

@Test
public void shouldProcessArrayExpressionCorrectly() {
final String simpleQuery = "SELECT col4[0] FROM test1 WHERE col0 > 100;";
final Analysis analysis = analyzeQuery(simpleQuery);
final KsqlConfig localKsqlConfig = EasyMock.mock(KsqlConfig.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, just field version of KsqlConfig and @Mock(NICE)...

@hjafarpour
Copy link
Contributor Author

@big-andy-coates we need to resolve columns and their types in analysis and planning stages and if we have a nested field we will need to implement this functionality. For fetching array items we don't need to worry about this since the array is a column and the type of the array is known.
Consider the following example:

SELECT arrayCol[1], structCol ->field1 -> field2 FROM foo;

When we analyze the arrayCol, we can directly get the type of the column but for the nested field, structCol ->field1 -> field2, we have to have a way to resolve the nested field and its data type. We avoid this by rewriting the nested field access as function call. This way, our existing function call analysis will take care of resolving the datatype.

@hjafarpour hjafarpour force-pushed the KSQL-1659-Change-Array-Index-to-Start-From-One branch from 9d8b976 to 79cc7f2 Compare August 29, 2018 00:06
@big-andy-coates
Copy link
Contributor

I'm still voting for moving array access out of SqlVistor and leveraging our UDF framework to handle it.

@apurvam apurvam requested review from dguy and rodesai and removed request for dguy and rodesai September 7, 2018 06:17
@big-andy-coates big-andy-coates requested a review from a team September 7, 2018 09:14
Copy link
Contributor

@apurvam apurvam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hjafarpour, I left a few comments, mainly about our logging strategy for exceptions.

I totally buy that Struct dereferences are fundamentally harder to do type inference for, and hence are a natural fit for the UDF Framework. And as such, using the same mechanism for array and map dereferences is a bit of overkill.

Having said that, there is still merit in unifying all object dereference mechanisms, and thereby removing chunks of code out of the unwieldy SqlToJavaVisitor code.

But I am not really attached to using the UDF framework for maps and arrays. The latter are objectively simpler than structs, so it is not unreasonable to have them dereferenced through their own mechanism.

- ``MAP<VARCHAR, ValueType>`` (JSON and AVRO only)
- ``STRUCT<FieldName FieldType, ...>`` (JSON and AVRO only) The STRUCT type requires you to specify a list of fields.
For each field you must specify the field name (FieldName) and field type (FieldType). The field type can be any of
the supported KSQL types, including the complex types ``MAP``, ``ARRAY``, and ``STRUCT``. ``STRUCT`` fields can be
accessed in expressions using the struct dereference (``->``) operator. See :ref:`operators` for more details.

You can configure the base index for arrays to start from 0 or 1 using ``ksql.functions.array.legacy.base`` configuration value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just name this ksql.functions.array.zero.base. It seems like a much more self explanatory name.

- ``MAP<VARCHAR, ValueType>`` (JSON and AVRO only)
- ``STRUCT<FieldName FieldType, ...>`` (JSON and AVRO only) The STRUCT type requires you to specify a list of fields.
For each field you must specify the field name (FieldName) and field type (FieldType). The field type can be any of
the supported KSQL types, including the complex types ``MAP``, ``ARRAY``, and ``STRUCT``. ``STRUCT`` fields can be
accessed in expressions using the struct dereference (``->``) operator. See :ref:`operators` for more details.

You can configure the base index for arrays to start from 0 or 1 using ``ksql.functions.array.legacy.base`` configuration value.
The current default value of ``ksql.functions.array.legacy.base`` is true indicating that the array indexes start from 0.
You can set ``ksql.functions.array.legacy.base`` to false in order to have array indexes startinhg from 1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: starting


public static Object getItem(final List<?> array, final int index, final boolean isLegacy) {
final int correctIndex = isLegacy ? index : index - 1;
return array.get(correctIndex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of just dumping exception traces to the log by default. Even if we have #1795, it would be much better to explicitly log this at some sort of verbose logging level (like debug) so that we can control what is logged by setting the class level log level in the config.

If these types of things were logged at debug level, then in most cases we won't see them until we want to debug something, at which point we can change the log level for the class (potentially at run time through JMX) and then collect the necessary information.

If you see our soak cluster, we already have many NPE stack traces in the logs. These are really hard to work with because:

  1. It is hard to grep these out given that they have many lines.
  2. The sheer number crowds out everything else.

I vote for not adding to the problem in this patch.

@@ -539,11 +545,12 @@ private String visitBooleanComparisonExpression(final ComparisonExpression.Type
switch (internalSchema.type()) {
case ARRAY:
return new Pair<>(
String.format("((%s) ((%s)%s).get((int)(%s)))",
String.format("((%s) (ArrayGet.getItem(((%s) %s), (int)(%s), %s)))",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented in another thread, but I think it would be better form to log the exception explicitly at a debug level so that we can control what goes into our logs.

@@ -191,6 +202,36 @@ public void shouldHandleMultiDimensionalArray() throws Exception {
assertThat(result, equalTo("item_11"));
}

@Test
public void shouldHandleMultiDimensionalArrayWithBaseStartingFrom1() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, spell out the 1, ie One. I don't know if it is in our code style, but having digits in our function names seems a bit weird.

"name": "Json 2 Dimensional Array Base Starting from 1",
"format": ["JSON"],
"properties": {
"ksql.functions.array.legacy.base": true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't the value true imply that the base is 0? The documentation, the description of this test, and the value of this config are not consistent. Yet another reason to just spell out the name in the config ksql.functions.array.zero.base.

@miguno
Copy link
Contributor

miguno commented Nov 8, 2018

What's the status of this PR?

5.1 will include the change to SUBSTRING to index with start position 1 (before: 0). So it's rather unfortunate that the same change for Arrays will not happen in 5.1, too. Version 5.1 will then be rather confusing, because some functions start at 0, others at 1.

@agavra
Copy link
Contributor

agavra commented Dec 5, 2019

Closing out this PR in favor of #4057

@agavra agavra closed this Dec 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants