Changing the array behavior to start the index from 1. #1682

hjafarpour · 2018-08-01T23:07:50Z

Description

Currently KSQL array index starts from 0. This PR changes this behavior so the array index starts from 1 so we are compatible with SQL standard.
This is a breaking change and we need to also include this in our release notes explicitly.
The documentation also has been updated.

This relates to #1659

Testing done

Describe the testing stategy. Unit and integration tests are expected for any behavior changes.

Reviewer checklist

Ensure docs are updated if necessary. (eg. if a user visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

big-andy-coates

Hey @hjafarpour

This is a breaking change without providing the user to switch back to the old way of working. Don't we need to provide a configuration setting they can use to switch the array base index back to zero, if they need to be able to upgrade but maintain old apps?

This was discussed in slack and in the #1659 itself.

I've suggested a ksql.functions.arrayget.legacy.base configuration. As such would match the ksql.functions. prefix used for controlling other functions. (I know array access is not currently a UDF, but it probably should be).

See #1635 for an example: SUBSTRINGs use of ksql.functions.substring.legacy.args.

hjafarpour · 2018-08-02T16:50:37Z

It's ok to have breaking changes as long as we communicate them clearly. Users just need to update the queries that use arrays accordingly when they upgrade. Streams and many other projects have had breaking changes and by communicating them they don't need to keep supporting the old/deprecated behavior. If we have them as early as possible it's better.
Supporting old and new behavior in the same version would result in confusion in different ways. For instance you could force users to write their new queries based on the old/deprecated behavior even when they have the new version available.
@big-andy-coates let me know what you think.

dguy · 2018-08-03T08:12:14Z

My vote is that we make it a non-breaking change. Make a note that 0 based indexing is effectively deprecated and that people should move to 1 based indexing as in a future release 0 based indexing will be removed.

big-andy-coates · 2018-08-03T13:01:41Z

My vote is that we make it a non-breaking change. Make a note that 0 based indexing is effectively deprecated and that people should move to 1 based indexing as in a future release 0 based indexing will be removed.

You mean make the default still base-0, but have an option to switch to base-1, and encourage people to do so?

Yeah, works for me - can do the same for SUBSTRING.

You mean something like:

v5.1 - support new style, but default to old.
v5.2 - switch default to new. (could also be a major release...)
v6.0 - remove switch

big-andy-coates · 2018-08-03T13:13:54Z

@hjafarpour - might also be worth documenting this in the release docs, similar to:
https://github.com/confluentinc/docs/pull/1095

hjafarpour · 2018-08-03T16:36:14Z

My main objection was having to support old behavior forever. If we only have the switch for only one more release and then remove it, it sounds good. I would suggest a process the same as @big-andy-coates described except would make the default the new behavior:

v5.1 - support new style, but default to the new (base-1).
v5.2 or 6.0 - remove the switch

What do you think @big-andy-coates and @dguy?

rodesai · 2018-08-03T17:16:55Z

My vote is that we make it a non-breaking change. Make a note that 0 based indexing is effectively deprecated and that people should move to 1 based indexing as in a future release 0 based indexing will be removed.

I don't necessarily agree with deprecating until we have an actual way of migrating queries. Currently the only option for users is to stop their queries and then start new ones. The new queries won't resume with the same state as the previously running queries. So if its a requirement for you to keep your app running then you cannot switch. So either we keep the settings/behavior around, or we deprecate and then our users can't upgrade and we have to support lots of different KSQL versions. Not sure which is worse.

Providing a migration path for this kind of change shouldn't be too hard. We just need some way to stop the query, update some settings and the query itself, and then resume it with the same consumer group. But IMO that should be a prereq for deprecating settings.

dguy · 2018-08-03T17:25:32Z

Yeah, i mean leave the default as it is for now. So that queries don't break. We can then figure out if we remove it in the future.

big-andy-coates · 2018-08-14T13:07:00Z

Yeah, i mean leave the default as it is for now. So that queries don't break. We can then figure out if we remove it in the future.

Then we should also swap #1635 to currently default to the old school base-0 and args and document in the same way.

Of course, the down side of this approach is the no-one will read the docs and switch their implementation / set the property. So they'll still break when we change the default. But at least we'll have documented the fact the change is coming!

miguno · 2018-08-14T13:19:56Z

My opinion is to:

Keep the current behavior as the default (starts at 0).
Allow to enable the new behavior (starts at 1) with a configuration option.
Mark the current behavior as Deprecated.

(similar to what others have said above)

big-andy-coates · 2018-08-14T13:51:01Z

Yep, agreed - currently fixing up the SUBSTRING change to follow the same pattern, and I've raised the following to ensure we don't forget to switch the default at in some future release: #1734

big-andy-coates · 2018-08-14T14:51:07Z

I don't necessarily agree with deprecating until we have an actual way of migrating queries. Currently the only option for users is to stop their queries and then start new ones. The new queries won't resume with the same state as the previously running queries. So if its a requirement for you to keep your app running then you cannot switch. So either we keep the settings/behavior around, or we deprecate and then our users can't upgrade and we have to support lots of different KSQL versions. Not sure which is worse.

@rodesai I don't see how deprecating the old way of working is going to cause issues with migrating queries - can you explain? My understanding was that for command-topic based clusters they will already have the 'legacy.mode = true' property saved in the command topic, so flipping the default to use the new more SQL-like syntax won't change any running queries. For non-interactive mode, if they would need to add a 'set property' statement at the top of their sql file, or switch the use of SUBSTRING in the SQL file to match the new syntax, both of which would work without breaking the running app.

However, I agree that later removing the setting all together would break old queries/apps using the old syntax. (Was this what you meant?).

Providing a migration path for this kind of change shouldn't be too hard. We just need some way to stop the query, update some settings and the query itself, and then resume it with the same consumer group. But IMO that should be a prereq for deprecating settings.

By 'deprecating settings' to you mean 'remove settings'?

rodesai · 2018-08-14T17:19:59Z

@big-andy-coates yeah my bad - s/deprecate/remove/

big-andy-coates · 2018-08-14T17:22:06Z

@rodesai

@big-andy-coates yeah my bad - s/deprecate/remove/

In which case, I'm with you! :D

big-andy-coates

Thanks @hjafarpour

To summarise the above - as it stands:

this is a breaking change
users have no way of opting out or in

As discussed above, we should:

switch between base-zero and base-one indexing using a configuration setting, e.g. ksql.functions.arrayget.legacy.base. I think this is probably best achieved by rewriting the query to use a UDF for array access.
Have the default for the new setting as 'legacy mode' initially. (Meaning this is not a breaking change).
Document the setting and the mark the base-zero as deprecated. (See substring for example, so we're consistent: Switch the default mode of SUBSTRING to be the 'legacy' mode #1735).
Add section to release notes highlighting the change. (See https://github.com/confluentinc/docs/pull/1095 for an example.)
Add json based test cases covering both base-zero and base-one, and proving base-zero is current default. (See https://github.com/confluentinc/ksql/pull/1735/files#diff-d627a383a783655bcd5afc320447a410).

big-andy-coates · 2018-08-14T17:26:37Z

ksql-engine/src/main/java/io/confluent/ksql/codegen/SqlToJavaVisitor.java

@@ -532,7 +532,7 @@ private String visitBooleanComparisonExpression(ComparisonExpression.Type type)
      switch (internalSchema.type()) {
        case ARRAY:
          return new Pair<>(
-              String.format("((%s) ((%s)%s).get((int)(%s)))",
+              String.format("((%s) ((%s)%s).get(((int)(%s)) - 1))",


The -1 needs to be controlled by a config setting.

Personally, I think this might be best achieved by replacing array access with a UDF.

See below comment.

big-andy-coates · 2018-08-14T17:26:50Z

docs/syntax-reference.rst

@@ -243,13 +245,15 @@ The supported column data types are:
 -  ``BIGINT``
 -  ``DOUBLE``
 -  ``VARCHAR`` (or ``STRING``)
-  ``ARRAY<ArrayType>`` (JSON and AVRO only. Index starts from 0)
+-  ``ARRAY<ArrayType>`` (JSON and AVRO only. Index starts from 1)


Why do we have this listed twice in the docs? Fancy pulling this out into its own section?

Maybe in another PR since it's not directly related to this one.

big-andy-coates · 2018-08-14T17:27:03Z

docs/syntax-reference.rst

 -  ``MAP<VARCHAR, ValueType>`` (JSON and AVRO only)
 -  ``STRUCT<FieldName FieldType, ...>`` (JSON and AVRO only) The STRUCT type requires you to specify a list of fields.
   For each field you must specify the field name (FieldName) and field type (FieldType). The field type can be any of
   the supported KSQL types, including the complex types ``MAP``, ``ARRAY``, and ``STRUCT``. ``STRUCT`` fields can be
   accessed in expressions using the struct dereference (``->``) operator. See :ref:`operators` for more details.

+Note that starting KSQL 5.1 array index starts from 1. For the versions of KSQL before 5.1, array index starts from 0.


I would put this 'note' inline with the ARRAY type in the list, otherwise users may miss it. (Even if that means changing the layout from a list to a table or something).

Also, need to change the default back to 0 and document the configuration setting that allows users to switch to the new more SQwhelly way of working. (as discussed above)

Added an inline note in addition to this paragraph. The default is now 0.

big-andy-coates · 2018-08-14T17:34:02Z

ksql-engine/src/test/resources/query-validation-tests/extract-json-field.json

@@ -22,7 +22,7 @@
      "name": "extract JSON array field",
      "statements": [
        "CREATE STREAM TEST (array_field ARRAY<VARCHAR>) WITH (kafka_topic='test_topic', value_format='JSON');",
-        "CREATE STREAM OUTPUT AS SELECT EXTRACTJSONFIELD(array_field[0], '$.nested') AS Col1, EXTRACTJSONFIELD(array_field[1], '$.nested') AS Col2 FROM TEST;"
+        "CREATE STREAM OUTPUT AS SELECT EXTRACTJSONFIELD(array_field[1], '$.nested') AS Col1, EXTRACTJSONFIELD(array_field[2], '$.nested') AS Col2 FROM TEST;"


Add tests that test both old style and new style array access (with appropriate properties set), and one proving the default is the old base-zero. (See substring.json for examples).

Added the tests.

hjafarpour · 2018-08-15T00:50:29Z

@big-andy-coates thanks for the latest review. Here is the current state:

I added ksql.functions.array.legacy.base config variable to configure the array base. If it's set to false the array index starts from 1, otherwise it starts from 0. The default is true meaning the array index starts from 0 as you suggested.
Updated the docs accordingly.
Added tests for QueryTranslationTests to test both cases based on the ksql.functions.array.legacy.base value.
Will update the release notes after the PR is merged.

rodesai · 2018-08-15T01:47:58Z

ksql-common/src/main/java/io/confluent/ksql/util/KsqlConfig.java

+              KSQL_FUNCTIONS_ARRAY_LEGACY_BASE_CONFIG,
+              ConfigDef.Type.BOOLEAN,
+          true,
+          true,


If you set this to false we'd get the new behavior for any new queries, without breaking anything that's already running. At least for interactive mode. For headless mode, we can document that the setting should be set to true as part of the upgrade steps. Or change the executor to always use the old defaults.

@dguy flagging this to you - what do you reckon? I've switched SUBSTRING to default to old way, but maybe switching back to default to new way, (while maintaining backwards compatibility with old interactive queries), is the way to go?

Thoughts?

I like @rodesai 's suggestion too. Seems that @big-andy-coates also is leaning this way. Will wait for @dguy 's opinion before making any changes here.

Sounds reasonable to me.

OK, I'll switch SUBSTRING back to this model as well.

SUBSTRING updated.

So then should this be updated to false here?

So then should this be updated to false here?

Yes, I believe so. To be clear: old should be true and new should be false.

Old is true and new is false now.

big-andy-coates · 2018-08-15T16:58:19Z

@hjafarpour what do you think to the idea of moving array access into a UDF? This would make SqlToJavaVisitor, which is already an overly complex class, simpler.

I think this change adds enough additional complexity to array access to warrant moving it into its own class.

Even if we don't, but think we might in the future, I would suggest renaming ksql.functions.array.legacy.base to ksql.functions.arrayget.legacy.base, as a future UDF would not be called Array, but something like ArrayGet.

Also, with reference to #1682 (comment) and #1682 (comment) ... are you proposing that you'll restructor the docs in another PR and then merge into this one so that you can provide cleaner docs here? :D

hjafarpour · 2018-08-15T17:04:31Z

@big-andy-coates what do you mean by making array access as a UDF? Do you mean internally or externally? Can you give an example?

big-andy-coates · 2018-08-16T10:00:45Z

@hjafarpour by make it a UDF, I mean move the implementation of how array access is handled into a type, e.g. ArrayGet, which, given an array and an index will return the item at that index. This ArrayGet can then be injected in SqlToJavaVisitor or we could make it a new style UDF (i.e. like Substring), and use the re-write logic to introduce the ArrayGet UDF as we do for struct access. (Basically, ArrayGet should be an internal implementation detail in the same way as FetchFieldFromStruct is).

The benefits of this are:

it simplifies SqlToJavaVisitor
it moves array access into a more isolated bit of code, so, for example, your new property for setting which index base to use is only in that one class.
it allows better testing of array access
it would mean we can more easily improve edge case handling e.g. at the moment I think if you access an array element out of bounds it will throw an exception, where as I think most other implementations just return null.

apurvam · 2018-08-18T21:24:17Z

Thanks for the update @hjafarpour .. I definitely think that your latest patch is an improvement.

I am fine with merging as is, but before we do, it would be good to understand why going the re-write route is more appropriate for the struct dereference, but not appropriate for the array dereference. Why can't both dereference code paths be unified?

It is worth understanding the differences between the two which merit the different treatment, IMO.

big-andy-coates

@hjafarpour I'm with @apurvam on this one. Yes, this is definitely an improvement. But it also introduces a new way of encapsulating such functionality and wiring it in.

It seems to me that structs could be handled the same way, so why did we go with UDFs for structs??

IMHO, we should chose one way, (probably the most flexible), and standardise on it.

big-andy-coates · 2018-08-20T10:11:12Z

ksql-common/src/main/java/io/confluent/ksql/util/KsqlConfig.java

+              KSQL_FUNCTIONS_ARRAY_LEGACY_BASE_CONFIG,
+              ConfigDef.Type.BOOLEAN,
+          true,
+          true,


So then should this be updated to false here?

Yes, I believe so. To be clear: old should be true and new should be false.

big-andy-coates · 2018-08-20T10:16:32Z

ksql-engine/src/main/java/io/confluent/ksql/codegen/SqlToJavaVisitor.java

@@ -539,11 +545,12 @@ private String visitBooleanComparisonExpression(final ComparisonExpression.Type
      switch (internalSchema.type()) {
        case ARRAY:
          return new Pair<>(
-              String.format("((%s) ((%s)%s).get((int)(%s)))",
+              String.format("((%s) (ArrayGet.getItem(((%s) %s), (int)(%s), %s)))",


If index is null, then this will throw NPE. The index may be derived from the source query e.g.

SELECT array_field[index_field] FROM ...

So we may want to consider handling a null index by just returning null. (I've not checked what other vendors do, but most seem to err on the side of not-logging-loads-of-errors and just returning null)

One problem with returning null is that we won't know if the value of the array item is null or there has been an error. We do return null for the column in case of exception anyway but I think having the error cause in the log is informative, although I agree that we may end up having lot's of errors in the log.

One problem with returning null is that we won't know if the value of the array item is null or there has been an error. We do return null for the column in case of exception anyway but I think having the error cause in the log is informative, although I agree that we may end up having lot's of errors in the log.

I'd normally agree with you. I'm a fan or fail hard and fail fast. However, the pattern for most SQL implementations is to be pretty lax about error reporting. For example, substring just returns null if the start index is out of bounds.

At the very least, we should be logging a more informative error than an NPE.

I commented in another thread, but I think it would be better form to log the exception explicitly at a debug level so that we can control what goes into our logs.

If we throw an exception it will be logged in the processing log...

big-andy-coates · 2018-08-20T10:17:19Z

ksql-engine/src/main/java/io/confluent/ksql/codegen/helpers/ArrayGet.java

+  private ArrayGet() {}
+
+  public static Object getItem(final List<?> array, final int index, final boolean isLegacy) {
+    final int currectIndex = isLegacy ? index : index - 1;


nit: spelling currect -> correct?

big-andy-coates · 2018-08-20T10:26:59Z

ksql-engine/src/test/java/io/confluent/ksql/codegen/CodeGenRunnerTest.java

+    public void shouldHandleMultiDimensionalArrayWithBaseStartingFrom1() throws Exception {
+        final KsqlConfig localKsqlConfig =
+            new KsqlConfig(Collections.singletonMap(KsqlConfig.KSQL_FUNCTIONS_ARRAY_LEGACY_BASE_CONFIG, false));
+        final CodeGenRunner localCodeGenRunner = new CodeGenRunner(schema, localKsqlConfig, functionRegistry);


consider making ksqlConfig field a mock, which means you no longer need to create a new running, just set up the mock.

Changed it into a mock.

big-andy-coates · 2018-08-20T10:37:51Z

ksql-engine/src/test/java/io/confluent/ksql/codegen/CodeGenRunnerTest.java

+        final List<String> innerArray1 = Arrays.asList("item_11", "item_12");
+        final List<String> innerArray2 = Arrays.asList("item_21", "item_22");
+        final Object[] args = new Object[]{Arrays.asList(innerArray1, innerArray2)};
+        final Object result = expressionEvaluatorMetadata.getExpressionEvaluator().evaluate(args);


consider refactoring this to be more like:

@Test public void shouldHandleMultiDimensionalArrayWithBaseStartingFrom1() throws Exception { // Given: final String simpleQuery = "SELECT col14[1][1] FROM CODEGEN_TEST;"; final Object[] arguments = {Arrays.asList( Arrays.asList("item_11", "item_12"), Arrays.asList("item_21", "item_22"))}; EasyMock.expect(ksqlConfig.getBoolean(KsqlConfig.KSQL_FUNCTIONS_ARRAY_LEGACY_BASE_CONFIG)).andReturn(false); EasyMock.replay(ksqlConfig); // When: final Object result = executeExpression(simpleQuery, arguments); // Then: assertThat(result, instanceOf(String.class)); assertThat(result, equalTo("item_11")); } private Object executeExpression(final String simpleQuery, final Object[] arguments) throws Exception { final Analysis analysis = analyzeQuery(simpleQuery); final ExpressionMetadata expressionEvaluatorMetadata = codeGenRunner.buildCodeGenFromParseTree (analysis.getSelectExpressions().get(0)); return expressionEvaluatorMetadata.getExpressionEvaluator().evaluate(arguments); }

To make the test more readable. The same could be applied to other tests, like the one above this one, for example.

Updated those two tests as you suggested.

big-andy-coates · 2018-08-20T10:39:46Z

ksql-engine/src/test/java/io/confluent/ksql/codegen/SqlToJavaVisitorTest.java

-    final String javaExpression = new SqlToJavaVisitor(schema, functionRegistry)
+
+  @Test
+  public void shouldProcessArrayExpressionCorrectlyForNonLegacyIndexBase() {


Consider refactoring to make this more readable, e.g. similar to the example I give above in the code runner class.

Moved the common code to a method.

big-andy-coates · 2018-08-20T10:40:23Z

ksql-engine/src/test/java/io/confluent/ksql/codegen/helpers/ArrayGetTest.java

+  }
+
+  @Test
+  public void shouldReturnNullForOutOfBound() {


Add another test for bounds < 0.

Updated the tests so we test the exception. As I mentioned above I changed the behavior to throw the exception since we want to distinguish between error and null values.

hjafarpour · 2018-08-21T01:20:20Z

@big-andy-coates @apurvam things are more complex for STRUCT. Unlike array, for struct we need to make changes in analysis and planning phases too so we can resolve nested references correctly. Rewriting the query with function calls for STRUC eliminates the need for such changes too.

big-andy-coates

Thanks @hjafarpour, some comments added below.

@big-andy-coates @apurvam things are more complex for STRUCT. Unlike array, for struct we need to make changes in analysis and planning phases too so we can resolve nested references correctly. Rewriting the query with function calls for STRUC eliminates the need for such changes too.

I'm not familiar enough with the struct rewrite stuff to comment, though I did have a look at the code around KsqlParser and the rewrite stuff. As part of trying to understand it I noticed a lot of duplication and so I refactored it in #1775. The struct rewrite only happens in one place now. Not sure how this relates to your comment above about STRUCT needing to make changes in analysis and planning phases.

Personally, I can't conceptually see any difference between a STRUCT:

SELECT someStruct->f1.f2 FROM ...

Compared to ARRAY:

SELECT someArray[12] FROM ...

Both are taking some path, e.g. 'f1.f2' or '12' to resolve on some object, e.g. a struct or an array.

Maybe you could explain the differences by way of an example? I realise my knowledge in this area is not as in depth as yours, so this might help me learn and understand!

big-andy-coates · 2018-08-23T08:24:59Z

ksql-engine/src/main/java/io/confluent/ksql/codegen/SqlToJavaVisitor.java

@@ -539,11 +545,12 @@ private String visitBooleanComparisonExpression(final ComparisonExpression.Type
      switch (internalSchema.type()) {
        case ARRAY:
          return new Pair<>(
-              String.format("((%s) ((%s)%s).get((int)(%s)))",
+              String.format("((%s) (ArrayGet.getItem(((%s) %s), (int)(%s), %s)))",


One problem with returning null is that we won't know if the value of the array item is null or there has been an error. We do return null for the column in case of exception anyway but I think having the error cause in the log is informative, although I agree that we may end up having lot's of errors in the log.

I'd normally agree with you. I'm a fan or fail hard and fail fast. However, the pattern for most SQL implementations is to be pretty lax about error reporting. For example, substring just returns null if the start index is out of bounds.

At the very least, we should be logging a more informative error than an NPE.

big-andy-coates · 2018-08-23T08:28:51Z

ksql-engine/src/main/java/io/confluent/ksql/codegen/helpers/ArrayGet.java

+
+  public static Object getItem(final List<?> array, final int index, final boolean isLegacy) {
+    final int correctIndex = isLegacy ? index : index - 1;
+    return array.get(correctIndex);


You've removed the out of bounds handing that was here before. Just wondering why?

I'm guessing this is inline with your comment above, and you want the exception to be logged? If so, then I think this should be a general question to the team to come to a decision on how we handle such issues: do we silently handle them and return null, or do we throw an exception, log it and then return null?

Personally, I'm with you and prefer the logging approach. But if this is not what expected by our users then they might not thank us for spamming the logs. Likewise, in cloud we're going to end up with a lot of noise in the logs, (though we can turn it off I guess).

Which ever way we decide to go we should be consistent. So, for example, if we go with throwing exceptions, then we should change SubString to follow the same pattern.

I think for now we can throw an exception and return null. This way we make sure we don't lose information that user may need. This is the same behavior for other types of expressions. Let's make Substring behave the same.
I agree with the possibility of trashing the log and I think @blueedgenick 's suggestion(#1795 ) in having a data error log file would alleviate this problem.

I am not a fan of just dumping exception traces to the log by default. Even if we have #1795, it would be much better to explicitly log this at some sort of verbose logging level (like debug) so that we can control what is logged by setting the class level log level in the config.

If these types of things were logged at debug level, then in most cases we won't see them until we want to debug something, at which point we can change the log level for the class (potentially at run time through JMX) and then collect the necessary information.

If you see our soak cluster, we already have many NPE stack traces in the logs. These are really hard to work with because:

It is hard to grep these out given that they have many lines.

The sheer number crowds out everything else.

I vote for not adding to the problem in this patch.

big-andy-coates · 2018-08-23T08:33:14Z

ksql-engine/src/test/java/io/confluent/ksql/codegen/CodeGenRunnerTest.java

-    public void shouldHandleMultiDimensionalArray() throws Exception {
+    public void shouldHandleMultiDimensionalArrayWithLegacyBase() throws Exception {
+        // Given:
+        final KsqlConfig localKsqlConfig = EasyMock.mock(KsqlConfig.class);


Rather than creating a localKsqlConfig and localCodeGenRunner, why not just switch the ksqlConfig to a @Mock(NICE)? Same in the test below.

big-andy-coates · 2018-08-23T08:34:06Z

ksql-engine/src/test/java/io/confluent/ksql/codegen/SqlToJavaVisitorTest.java


    assertThat(javaExpression, equalTo("(TEST1_COL0 + TEST1_COL3)"));
  }

  @Test
  public void shouldProcessArrayExpressionCorrectly() {
    final String simpleQuery = "SELECT col4[0] FROM test1 WHERE col0 > 100;";
-    final Analysis analysis = analyzeQuery(simpleQuery);
+    final KsqlConfig localKsqlConfig = EasyMock.mock(KsqlConfig.class);


As above, just field version of KsqlConfig and @Mock(NICE)...

…acy.base config value.

hjafarpour · 2018-08-28T23:33:39Z

@big-andy-coates we need to resolve columns and their types in analysis and planning stages and if we have a nested field we will need to implement this functionality. For fetching array items we don't need to worry about this since the array is a column and the type of the array is known.
Consider the following example:

SELECT arrayCol[1], structCol ->field1 -> field2 FROM foo;

When we analyze the arrayCol, we can directly get the type of the column but for the nested field, structCol ->field1 -> field2, we have to have a way to resolve the nested field and its data type. We avoid this by rewriting the nested field access as function call. This way, our existing function call analysis will take care of resolving the datatype.

big-andy-coates · 2018-09-05T16:24:48Z

I'm still voting for moving array access out of SqlVistor and leveraging our UDF framework to handle it.

apurvam

Thanks @hjafarpour, I left a few comments, mainly about our logging strategy for exceptions.

I totally buy that Struct dereferences are fundamentally harder to do type inference for, and hence are a natural fit for the UDF Framework. And as such, using the same mechanism for array and map dereferences is a bit of overkill.

Having said that, there is still merit in unifying all object dereference mechanisms, and thereby removing chunks of code out of the unwieldy SqlToJavaVisitor code.

But I am not really attached to using the UDF framework for maps and arrays. The latter are objectively simpler than structs, so it is not unreasonable to have them dereferenced through their own mechanism.

apurvam · 2018-09-12T00:44:11Z

docs/syntax-reference.rst

 -  ``MAP<VARCHAR, ValueType>`` (JSON and AVRO only)
 -  ``STRUCT<FieldName FieldType, ...>`` (JSON and AVRO only) The STRUCT type requires you to specify a list of fields.
   For each field you must specify the field name (FieldName) and field type (FieldType). The field type can be any of
   the supported KSQL types, including the complex types ``MAP``, ``ARRAY``, and ``STRUCT``. ``STRUCT`` fields can be
   accessed in expressions using the struct dereference (``->``) operator. See :ref:`operators` for more details.

+You can configure the base index for arrays to start from 0 or 1 using ``ksql.functions.array.legacy.base`` configuration value.


Why can't we just name this ksql.functions.array.zero.base. It seems like a much more self explanatory name.

apurvam · 2018-09-12T00:44:18Z

docs/syntax-reference.rst

 -  ``MAP<VARCHAR, ValueType>`` (JSON and AVRO only)
 -  ``STRUCT<FieldName FieldType, ...>`` (JSON and AVRO only) The STRUCT type requires you to specify a list of fields.
   For each field you must specify the field name (FieldName) and field type (FieldType). The field type can be any of
   the supported KSQL types, including the complex types ``MAP``, ``ARRAY``, and ``STRUCT``. ``STRUCT`` fields can be
   accessed in expressions using the struct dereference (``->``) operator. See :ref:`operators` for more details.

+You can configure the base index for arrays to start from 0 or 1 using ``ksql.functions.array.legacy.base`` configuration value.
+The current default value of ``ksql.functions.array.legacy.base`` is true indicating that the array indexes start from 0.
+You can set ``ksql.functions.array.legacy.base`` to false in order to have array indexes startinhg from 1.


typo: starting

apurvam · 2018-09-12T00:51:24Z

ksql-engine/src/main/java/io/confluent/ksql/codegen/helpers/ArrayGet.java

+
+  public static Object getItem(final List<?> array, final int index, final boolean isLegacy) {
+    final int correctIndex = isLegacy ? index : index - 1;
+    return array.get(correctIndex);


I am not a fan of just dumping exception traces to the log by default. Even if we have #1795, it would be much better to explicitly log this at some sort of verbose logging level (like debug) so that we can control what is logged by setting the class level log level in the config.

If these types of things were logged at debug level, then in most cases we won't see them until we want to debug something, at which point we can change the log level for the class (potentially at run time through JMX) and then collect the necessary information.

If you see our soak cluster, we already have many NPE stack traces in the logs. These are really hard to work with because:

It is hard to grep these out given that they have many lines.

The sheer number crowds out everything else.

I vote for not adding to the problem in this patch.

apurvam · 2018-09-12T00:52:45Z

ksql-engine/src/main/java/io/confluent/ksql/codegen/SqlToJavaVisitor.java

@@ -539,11 +545,12 @@ private String visitBooleanComparisonExpression(final ComparisonExpression.Type
      switch (internalSchema.type()) {
        case ARRAY:
          return new Pair<>(
-              String.format("((%s) ((%s)%s).get((int)(%s)))",
+              String.format("((%s) (ArrayGet.getItem(((%s) %s), (int)(%s), %s)))",


I commented in another thread, but I think it would be better form to log the exception explicitly at a debug level so that we can control what goes into our logs.

apurvam · 2018-09-12T00:54:16Z

ksql-engine/src/test/java/io/confluent/ksql/codegen/CodeGenRunnerTest.java

@@ -191,6 +202,36 @@ public void shouldHandleMultiDimensionalArray() throws Exception {
        assertThat(result, equalTo("item_11"));
    }

+    @Test
+    public void shouldHandleMultiDimensionalArrayWithBaseStartingFrom1() throws Exception {


Nit, spell out the 1, ie One. I don't know if it is in our code style, but having digits in our function names seems a bit weird.

apurvam · 2018-09-12T00:58:55Z

ksql-engine/src/test/resources/query-validation-tests/project-filter.json

+      "name": "Json 2 Dimensional Array Base Starting from 1",
+      "format": ["JSON"],
+      "properties": {
+        "ksql.functions.array.legacy.base": true


doesn't the value true imply that the base is 0? The documentation, the description of this test, and the value of this config are not consistent. Yet another reason to just spell out the name in the config ksql.functions.array.zero.base.

miguno · 2018-11-08T09:41:55Z

What's the status of this PR?

5.1 will include the change to SUBSTRING to index with start position 1 (before: 0). So it's rather unfortunate that the same change for Arrays will not happen in 5.1, too. Version 5.1 will then be rather confusing, because some functions start at 0, others at 1.

agavra · 2019-12-05T19:41:31Z

Closing out this PR in favor of #4057

hjafarpour added enhancement user-experience labels Aug 1, 2018

hjafarpour self-assigned this Aug 1, 2018

hjafarpour requested a review from a team August 1, 2018 23:07

big-andy-coates reviewed Aug 2, 2018

View reviewed changes

big-andy-coates requested a review from a team August 3, 2018 12:58

hjafarpour force-pushed the KSQL-1659-Change-Array-Index-to-Start-From-One branch from 03f8c86 to 1ea3765 Compare August 14, 2018 01:18

big-andy-coates mentioned this pull request Aug 14, 2018

Switch defaults for SUBSTRING and ARRAY access #1734

Closed

big-andy-coates mentioned this pull request Aug 14, 2018

Switch the default mode of SUBSTRING to be the 'legacy' mode #1735

Merged

2 tasks

hjafarpour force-pushed the KSQL-1659-Change-Array-Index-to-Start-From-One branch from 1ea3765 to 1d07f19 Compare August 14, 2018 16:47

big-andy-coates reviewed Aug 14, 2018

View reviewed changes

big-andy-coates requested a review from a team August 14, 2018 17:50

rodesai reviewed Aug 15, 2018

View reviewed changes

big-andy-coates reviewed Aug 20, 2018

View reviewed changes

hjafarpour force-pushed the KSQL-1659-Change-Array-Index-to-Start-From-One branch from 842247e to 9d8b976 Compare August 21, 2018 01:20

big-andy-coates reviewed Aug 24, 2018

View reviewed changes

hjafarpour added 6 commits August 28, 2018 10:50

Changing the array behavior to start the index from 1.

472287f

Made the array index base configurable using ksql.functions.array.leg…

c953aaa

…acy.base config value.

More feedback from Andy.

b63959f

ENcapsulate array access in a separate helper class.

eb7034f

Removed custom our of bound behavior.

7f31600

Some refactorings.

501f0d1

hjafarpour force-pushed the KSQL-1659-Change-Array-Index-to-Start-From-One branch from 9d8b976 to 79cc7f2 Compare August 29, 2018 00:06

Rebase with master.

79cc7f2

big-andy-coates requested review from dguy, apurvam and rodesai September 5, 2018 16:23

apurvam requested review from dguy and rodesai and removed request for dguy and rodesai September 7, 2018 06:17

big-andy-coates requested a review from a team September 7, 2018 09:14

apurvam reviewed Sep 12, 2018

View reviewed changes

rodesai mentioned this pull request Nov 9, 2018

Add reference about compatibility breaking configs in upgrade docs #2136

Merged

2 tasks

big-andy-coates mentioned this pull request Nov 22, 2019

Change base index of array access to one #3949

Closed

agavra mentioned this pull request Dec 5, 2019

fix: array access is now 1-indexed instead of 0-indexed #4057

Merged

2 tasks

agavra closed this Dec 5, 2019

Changing the array behavior to start the index from 1. #1682

Changing the array behavior to start the index from 1. #1682

Conversation

hjafarpour commented Aug 1, 2018

Description

Testing done

Reviewer checklist

big-andy-coates left a comment • edited Loading

Choose a reason for hiding this comment

hjafarpour commented Aug 2, 2018

dguy commented Aug 3, 2018

big-andy-coates commented Aug 3, 2018

big-andy-coates commented Aug 3, 2018

hjafarpour commented Aug 3, 2018

rodesai commented Aug 3, 2018

dguy commented Aug 3, 2018

big-andy-coates commented Aug 14, 2018 • edited Loading

miguno commented Aug 14, 2018

big-andy-coates commented Aug 14, 2018

big-andy-coates commented Aug 14, 2018

rodesai commented Aug 14, 2018

big-andy-coates commented Aug 14, 2018

big-andy-coates left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hjafarpour commented Aug 15, 2018

rodesai Aug 15, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

big-andy-coates commented Aug 15, 2018 • edited Loading

hjafarpour commented Aug 15, 2018

big-andy-coates commented Aug 16, 2018 • edited Loading

apurvam commented Aug 18, 2018 • edited Loading

big-andy-coates left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hjafarpour commented Aug 21, 2018

big-andy-coates left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

big-andy-coates Aug 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hjafarpour commented Aug 28, 2018

big-andy-coates commented Sep 5, 2018

apurvam left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

big-andy-coates left a comment •

edited

Loading

big-andy-coates commented Aug 14, 2018 •

edited

Loading

big-andy-coates left a comment •

edited

Loading

rodesai Aug 15, 2018 •

edited

Loading

big-andy-coates commented Aug 15, 2018 •

edited

Loading

big-andy-coates commented Aug 16, 2018 •

edited

Loading

apurvam commented Aug 18, 2018 •

edited

Loading

big-andy-coates left a comment •

edited

Loading

big-andy-coates Aug 23, 2018 •

edited

Loading