-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changing the array behavior to start the index from 1. #1682
Changing the array behavior to start the index from 1. #1682
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @hjafarpour
This is a breaking change without providing the user to switch back to the old way of working. Don't we need to provide a configuration setting they can use to switch the array base index back to zero, if they need to be able to upgrade but maintain old apps?
This was discussed in slack and in the #1659 itself.
I've suggested a ksql.functions.arrayget.legacy.base
configuration. As such would match the ksql.functions.
prefix used for controlling other functions. (I know array access is not currently a UDF, but it probably should be).
See #1635 for an example: SUBSTRINGs use of ksql.functions.substring.legacy.args
.
It's ok to have breaking changes as long as we communicate them clearly. Users just need to update the queries that use arrays accordingly when they upgrade. Streams and many other projects have had breaking changes and by communicating them they don't need to keep supporting the old/deprecated behavior. If we have them as early as possible it's better. |
My vote is that we make it a non-breaking change. Make a note that 0 based indexing is effectively deprecated and that people should move to 1 based indexing as in a future release 0 based indexing will be removed. |
You mean make the default still base-0, but have an option to switch to base-1, and encourage people to do so? Yeah, works for me - can do the same for SUBSTRING. You mean something like: v5.1 - support new style, but default to old. |
@hjafarpour - might also be worth documenting this in the release docs, similar to: |
My main objection was having to support old behavior forever. If we only have the switch for only one more release and then remove it, it sounds good. I would suggest a process the same as @big-andy-coates described except would make the default the new behavior: v5.1 - support new style, but default to the new (base-1). What do you think @big-andy-coates and @dguy? |
I don't necessarily agree with deprecating until we have an actual way of migrating queries. Currently the only option for users is to stop their queries and then start new ones. The new queries won't resume with the same state as the previously running queries. So if its a requirement for you to keep your app running then you cannot switch. So either we keep the settings/behavior around, or we deprecate and then our users can't upgrade and we have to support lots of different KSQL versions. Not sure which is worse. Providing a migration path for this kind of change shouldn't be too hard. We just need some way to stop the query, update some settings and the query itself, and then resume it with the same consumer group. But IMO that should be a prereq for deprecating settings. |
Yeah, i mean leave the default as it is for now. So that queries don't break. We can then figure out if we remove it in the future. |
03f8c86
to
1ea3765
Compare
Then we should also swap #1635 to currently default to the old school base-0 and args and document in the same way. Of course, the down side of this approach is the no-one will read the docs and switch their implementation / set the property. So they'll still break when we change the default. But at least we'll have documented the fact the change is coming! |
My opinion is to:
(similar to what others have said above) |
Yep, agreed - currently fixing up the SUBSTRING change to follow the same pattern, and I've raised the following to ensure we don't forget to switch the default at in some future release: #1734 |
@rodesai I don't see how deprecating the old way of working is going to cause issues with migrating queries - can you explain? My understanding was that for command-topic based clusters they will already have the 'legacy.mode = true' property saved in the command topic, so flipping the default to use the new more SQL-like syntax won't change any running queries. For non-interactive mode, if they would need to add a 'set property' statement at the top of their sql file, or switch the use of SUBSTRING in the SQL file to match the new syntax, both of which would work without breaking the running app. However, I agree that later removing the setting all together would break old queries/apps using the old syntax. (Was this what you meant?).
By 'deprecating settings' to you mean 'remove settings'? |
1ea3765
to
1d07f19
Compare
@big-andy-coates yeah my bad - s/deprecate/remove/ |
In which case, I'm with you! :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hjafarpour
To summarise the above - as it stands:
- this is a breaking change
- users have no way of opting out or in
As discussed above, we should:
- switch between base-zero and base-one indexing using a configuration setting, e.g.
ksql.functions.arrayget.legacy.base
. I think this is probably best achieved by rewriting the query to use a UDF for array access. - Have the default for the new setting as 'legacy mode' initially. (Meaning this is not a breaking change).
- Document the setting and the mark the base-zero as deprecated. (See substring for example, so we're consistent: Switch the default mode of SUBSTRING to be the 'legacy' mode #1735).
- Add section to release notes highlighting the change. (See https://github.com/confluentinc/docs/pull/1095 for an example.)
- Add json based test cases covering both base-zero and base-one, and proving base-zero is current default. (See https://github.com/confluentinc/ksql/pull/1735/files#diff-d627a383a783655bcd5afc320447a410).
@@ -532,7 +532,7 @@ private String visitBooleanComparisonExpression(ComparisonExpression.Type type) | |||
switch (internalSchema.type()) { | |||
case ARRAY: | |||
return new Pair<>( | |||
String.format("((%s) ((%s)%s).get((int)(%s)))", | |||
String.format("((%s) ((%s)%s).get(((int)(%s)) - 1))", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The -1
needs to be controlled by a config setting.
Personally, I think this might be best achieved by replacing array access with a UDF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See below comment.
docs/syntax-reference.rst
Outdated
@@ -243,13 +245,15 @@ The supported column data types are: | |||
- ``BIGINT`` | |||
- ``DOUBLE`` | |||
- ``VARCHAR`` (or ``STRING``) | |||
- ``ARRAY<ArrayType>`` (JSON and AVRO only. Index starts from 0) | |||
- ``ARRAY<ArrayType>`` (JSON and AVRO only. Index starts from 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have this listed twice in the docs? Fancy pulling this out into its own section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe in another PR since it's not directly related to this one.
docs/syntax-reference.rst
Outdated
- ``MAP<VARCHAR, ValueType>`` (JSON and AVRO only) | ||
- ``STRUCT<FieldName FieldType, ...>`` (JSON and AVRO only) The STRUCT type requires you to specify a list of fields. | ||
For each field you must specify the field name (FieldName) and field type (FieldType). The field type can be any of | ||
the supported KSQL types, including the complex types ``MAP``, ``ARRAY``, and ``STRUCT``. ``STRUCT`` fields can be | ||
accessed in expressions using the struct dereference (``->``) operator. See :ref:`operators` for more details. | ||
|
||
Note that starting KSQL 5.1 array index starts from 1. For the versions of KSQL before 5.1, array index starts from 0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put this 'note' inline with the ARRAY type in the list, otherwise users may miss it. (Even if that means changing the layout from a list to a table or something).
Also, need to change the default back to 0 and document the configuration setting that allows users to switch to the new more SQwhelly way of working. (as discussed above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an inline note in addition to this paragraph. The default is now 0.
@@ -22,7 +22,7 @@ | |||
"name": "extract JSON array field", | |||
"statements": [ | |||
"CREATE STREAM TEST (array_field ARRAY<VARCHAR>) WITH (kafka_topic='test_topic', value_format='JSON');", | |||
"CREATE STREAM OUTPUT AS SELECT EXTRACTJSONFIELD(array_field[0], '$.nested') AS Col1, EXTRACTJSONFIELD(array_field[1], '$.nested') AS Col2 FROM TEST;" | |||
"CREATE STREAM OUTPUT AS SELECT EXTRACTJSONFIELD(array_field[1], '$.nested') AS Col1, EXTRACTJSONFIELD(array_field[2], '$.nested') AS Col2 FROM TEST;" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tests that test both old style and new style array access (with appropriate properties set), and one proving the default is the old base-zero. (See substring.json for examples).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the tests.
@big-andy-coates thanks for the latest review. Here is the current state:
|
KSQL_FUNCTIONS_ARRAY_LEGACY_BASE_CONFIG, | ||
ConfigDef.Type.BOOLEAN, | ||
true, | ||
true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you set this to false we'd get the new behavior for any new queries, without breaking anything that's already running. At least for interactive mode. For headless mode, we can document that the setting should be set to true as part of the upgrade steps. Or change the executor to always use the old defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dguy flagging this to you - what do you reckon? I've switched SUBSTRING to default to old way, but maybe switching back to default to new way, (while maintaining backwards compatibility with old interactive queries), is the way to go?
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like @rodesai 's suggestion too. Seems that @big-andy-coates also is leaning this way. Will wait for @dguy 's opinion before making any changes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'll switch SUBSTRING back to this model as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SUBSTRING updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then should this be updated to false
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then should this be updated to false here?
Yes, I believe so. To be clear: old should be true
and new should be false
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old is true and new is false now.
@hjafarpour what do you think to the idea of moving array access into a UDF? This would make I think this change adds enough additional complexity to array access to warrant moving it into its own class. Even if we don't, but think we might in the future, I would suggest renaming Also, with reference to #1682 (comment) and #1682 (comment) ... are you proposing that you'll restructor the docs in another PR and then merge into this one so that you can provide cleaner docs here? :D |
@big-andy-coates what do you mean by making array access as a UDF? Do you mean internally or externally? Can you give an example? |
@hjafarpour by make it a UDF, I mean move the implementation of how array access is handled into a type, e.g. The benefits of this are:
|
Thanks for the update @hjafarpour .. I definitely think that your latest patch is an improvement. I am fine with merging as is, but before we do, it would be good to understand why going the re-write route is more appropriate for the struct dereference, but not appropriate for the array dereference. Why can't both dereference code paths be unified? It is worth understanding the differences between the two which merit the different treatment, IMO. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hjafarpour I'm with @apurvam on this one. Yes, this is definitely an improvement. But it also introduces a new way of encapsulating such functionality and wiring it in.
It seems to me that structs could be handled the same way, so why did we go with UDFs for structs??
IMHO, we should chose one way, (probably the most flexible), and standardise on it.
KSQL_FUNCTIONS_ARRAY_LEGACY_BASE_CONFIG, | ||
ConfigDef.Type.BOOLEAN, | ||
true, | ||
true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then should this be updated to false here?
Yes, I believe so. To be clear: old should be true
and new should be false
.
@@ -539,11 +545,12 @@ private String visitBooleanComparisonExpression(final ComparisonExpression.Type | |||
switch (internalSchema.type()) { | |||
case ARRAY: | |||
return new Pair<>( | |||
String.format("((%s) ((%s)%s).get((int)(%s)))", | |||
String.format("((%s) (ArrayGet.getItem(((%s) %s), (int)(%s), %s)))", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If index is null, then this will throw NPE. The index may be derived from the source query e.g.
SELECT array_field[index_field] FROM ...
So we may want to consider handling a null index by just returning null. (I've not checked what other vendors do, but most seem to err on the side of not-logging-loads-of-errors and just returning null)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One problem with returning null is that we won't know if the value of the array item is null or there has been an error. We do return null for the column in case of exception anyway but I think having the error cause in the log is informative, although I agree that we may end up having lot's of errors in the log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One problem with returning null is that we won't know if the value of the array item is null or there has been an error. We do return null for the column in case of exception anyway but I think having the error cause in the log is informative, although I agree that we may end up having lot's of errors in the log.
I'd normally agree with you. I'm a fan or fail hard and fail fast. However, the pattern for most SQL implementations is to be pretty lax about error reporting. For example, substring just returns null if the start index is out of bounds.
At the very least, we should be logging a more informative error than an NPE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented in another thread, but I think it would be better form to log the exception explicitly at a debug level so that we can control what goes into our logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we throw an exception it will be logged in the processing log...
private ArrayGet() {} | ||
|
||
public static Object getItem(final List<?> array, final int index, final boolean isLegacy) { | ||
final int currectIndex = isLegacy ? index : index - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spelling currect
-> correct
?
public void shouldHandleMultiDimensionalArrayWithBaseStartingFrom1() throws Exception { | ||
final KsqlConfig localKsqlConfig = | ||
new KsqlConfig(Collections.singletonMap(KsqlConfig.KSQL_FUNCTIONS_ARRAY_LEGACY_BASE_CONFIG, false)); | ||
final CodeGenRunner localCodeGenRunner = new CodeGenRunner(schema, localKsqlConfig, functionRegistry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider making ksqlConfig
field a mock, which means you no longer need to create a new running, just set up the mock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it into a mock.
final List<String> innerArray1 = Arrays.asList("item_11", "item_12"); | ||
final List<String> innerArray2 = Arrays.asList("item_21", "item_22"); | ||
final Object[] args = new Object[]{Arrays.asList(innerArray1, innerArray2)}; | ||
final Object result = expressionEvaluatorMetadata.getExpressionEvaluator().evaluate(args); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider refactoring this to be more like:
@Test
public void shouldHandleMultiDimensionalArrayWithBaseStartingFrom1() throws Exception {
// Given:
final String simpleQuery = "SELECT col14[1][1] FROM CODEGEN_TEST;";
final Object[] arguments = {Arrays.asList(
Arrays.asList("item_11", "item_12"),
Arrays.asList("item_21", "item_22"))};
EasyMock.expect(ksqlConfig.getBoolean(KsqlConfig.KSQL_FUNCTIONS_ARRAY_LEGACY_BASE_CONFIG)).andReturn(false);
EasyMock.replay(ksqlConfig);
// When:
final Object result = executeExpression(simpleQuery, arguments);
// Then:
assertThat(result, instanceOf(String.class));
assertThat(result, equalTo("item_11"));
}
private Object executeExpression(final String simpleQuery, final Object[] arguments) throws Exception {
final Analysis analysis = analyzeQuery(simpleQuery);
final ExpressionMetadata expressionEvaluatorMetadata = codeGenRunner.buildCodeGenFromParseTree
(analysis.getSelectExpressions().get(0));
return expressionEvaluatorMetadata.getExpressionEvaluator().evaluate(arguments);
}
To make the test more readable. The same could be applied to other tests, like the one above this one, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated those two tests as you suggested.
final String javaExpression = new SqlToJavaVisitor(schema, functionRegistry) | ||
|
||
@Test | ||
public void shouldProcessArrayExpressionCorrectlyForNonLegacyIndexBase() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider refactoring to make this more readable, e.g. similar to the example I give above in the code runner class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the common code to a method.
} | ||
|
||
@Test | ||
public void shouldReturnNullForOutOfBound() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add another test for bounds < 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the tests so we test the exception. As I mentioned above I changed the behavior to throw the exception since we want to distinguish between error and null values.
@big-andy-coates @apurvam things are more complex for STRUCT. Unlike array, for struct we need to make changes in analysis and planning phases too so we can resolve nested references correctly. Rewriting the query with function calls for STRUC eliminates the need for such changes too. |
842247e
to
9d8b976
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hjafarpour, some comments added below.
@big-andy-coates @apurvam things are more complex for STRUCT. Unlike array, for struct we need to make changes in analysis and planning phases too so we can resolve nested references correctly. Rewriting the query with function calls for STRUC eliminates the need for such changes too.
I'm not familiar enough with the struct rewrite stuff to comment, though I did have a look at the code around KsqlParser
and the rewrite stuff. As part of trying to understand it I noticed a lot of duplication and so I refactored it in #1775. The struct rewrite only happens in one place now. Not sure how this relates to your comment above about STRUCT needing to make changes in analysis and planning phases.
Personally, I can't conceptually see any difference between a STRUCT:
SELECT someStruct->f1.f2 FROM ...
Compared to ARRAY:
SELECT someArray[12] FROM ...
Both are taking some path, e.g. 'f1.f2' or '12' to resolve on some object, e.g. a struct or an array.
Maybe you could explain the differences by way of an example? I realise my knowledge in this area is not as in depth as yours, so this might help me learn and understand!
@@ -539,11 +545,12 @@ private String visitBooleanComparisonExpression(final ComparisonExpression.Type | |||
switch (internalSchema.type()) { | |||
case ARRAY: | |||
return new Pair<>( | |||
String.format("((%s) ((%s)%s).get((int)(%s)))", | |||
String.format("((%s) (ArrayGet.getItem(((%s) %s), (int)(%s), %s)))", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One problem with returning null is that we won't know if the value of the array item is null or there has been an error. We do return null for the column in case of exception anyway but I think having the error cause in the log is informative, although I agree that we may end up having lot's of errors in the log.
I'd normally agree with you. I'm a fan or fail hard and fail fast. However, the pattern for most SQL implementations is to be pretty lax about error reporting. For example, substring just returns null if the start index is out of bounds.
At the very least, we should be logging a more informative error than an NPE.
|
||
public static Object getItem(final List<?> array, final int index, final boolean isLegacy) { | ||
final int correctIndex = isLegacy ? index : index - 1; | ||
return array.get(correctIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've removed the out of bounds handing that was here before. Just wondering why?
I'm guessing this is inline with your comment above, and you want the exception to be logged? If so, then I think this should be a general question to the team to come to a decision on how we handle such issues: do we silently handle them and return null, or do we throw an exception, log it and then return null?
Personally, I'm with you and prefer the logging approach. But if this is not what expected by our users then they might not thank us for spamming the logs. Likewise, in cloud we're going to end up with a lot of noise in the logs, (though we can turn it off I guess).
Which ever way we decide to go we should be consistent. So, for example, if we go with throwing exceptions, then we should change SubString
to follow the same pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for now we can throw an exception and return null. This way we make sure we don't lose information that user may need. This is the same behavior for other types of expressions. Let's make Substring
behave the same.
I agree with the possibility of trashing the log and I think @blueedgenick 's suggestion(#1795 ) in having a data error log file would alleviate this problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a fan of just dumping exception traces to the log by default. Even if we have #1795, it would be much better to explicitly log this at some sort of verbose logging level (like debug) so that we can control what is logged by setting the class level log level in the config.
If these types of things were logged at debug level, then in most cases we won't see them until we want to debug something, at which point we can change the log level for the class (potentially at run time through JMX) and then collect the necessary information.
If you see our soak cluster, we already have many NPE stack traces in the logs. These are really hard to work with because:
- It is hard to grep these out given that they have many lines.
- The sheer number crowds out everything else.
I vote for not adding to the problem in this patch.
public void shouldHandleMultiDimensionalArray() throws Exception { | ||
public void shouldHandleMultiDimensionalArrayWithLegacyBase() throws Exception { | ||
// Given: | ||
final KsqlConfig localKsqlConfig = EasyMock.mock(KsqlConfig.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than creating a localKsqlConfig
and localCodeGenRunner
, why not just switch the ksqlConfig
to a @Mock(NICE)
? Same in the test below.
|
||
assertThat(javaExpression, equalTo("(TEST1_COL0 + TEST1_COL3)")); | ||
} | ||
|
||
@Test | ||
public void shouldProcessArrayExpressionCorrectly() { | ||
final String simpleQuery = "SELECT col4[0] FROM test1 WHERE col0 > 100;"; | ||
final Analysis analysis = analyzeQuery(simpleQuery); | ||
final KsqlConfig localKsqlConfig = EasyMock.mock(KsqlConfig.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, just field version of KsqlConfig
and @Mock(NICE)
...
…acy.base config value.
@big-andy-coates we need to resolve columns and their types in analysis and planning stages and if we have a nested field we will need to implement this functionality. For fetching array items we don't need to worry about this since the array is a column and the type of the array is known.
When we analyze the |
9d8b976
to
79cc7f2
Compare
I'm still voting for moving array access out of SqlVistor and leveraging our UDF framework to handle it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hjafarpour, I left a few comments, mainly about our logging strategy for exceptions.
I totally buy that Struct dereferences are fundamentally harder to do type inference for, and hence are a natural fit for the UDF Framework. And as such, using the same mechanism for array and map dereferences is a bit of overkill.
Having said that, there is still merit in unifying all object dereference mechanisms, and thereby removing chunks of code out of the unwieldy SqlToJavaVisitor
code.
But I am not really attached to using the UDF framework for maps and arrays. The latter are objectively simpler than structs, so it is not unreasonable to have them dereferenced through their own mechanism.
- ``MAP<VARCHAR, ValueType>`` (JSON and AVRO only) | ||
- ``STRUCT<FieldName FieldType, ...>`` (JSON and AVRO only) The STRUCT type requires you to specify a list of fields. | ||
For each field you must specify the field name (FieldName) and field type (FieldType). The field type can be any of | ||
the supported KSQL types, including the complex types ``MAP``, ``ARRAY``, and ``STRUCT``. ``STRUCT`` fields can be | ||
accessed in expressions using the struct dereference (``->``) operator. See :ref:`operators` for more details. | ||
|
||
You can configure the base index for arrays to start from 0 or 1 using ``ksql.functions.array.legacy.base`` configuration value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we just name this ksql.functions.array.zero.base
. It seems like a much more self explanatory name.
- ``MAP<VARCHAR, ValueType>`` (JSON and AVRO only) | ||
- ``STRUCT<FieldName FieldType, ...>`` (JSON and AVRO only) The STRUCT type requires you to specify a list of fields. | ||
For each field you must specify the field name (FieldName) and field type (FieldType). The field type can be any of | ||
the supported KSQL types, including the complex types ``MAP``, ``ARRAY``, and ``STRUCT``. ``STRUCT`` fields can be | ||
accessed in expressions using the struct dereference (``->``) operator. See :ref:`operators` for more details. | ||
|
||
You can configure the base index for arrays to start from 0 or 1 using ``ksql.functions.array.legacy.base`` configuration value. | ||
The current default value of ``ksql.functions.array.legacy.base`` is true indicating that the array indexes start from 0. | ||
You can set ``ksql.functions.array.legacy.base`` to false in order to have array indexes startinhg from 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: starting
|
||
public static Object getItem(final List<?> array, final int index, final boolean isLegacy) { | ||
final int correctIndex = isLegacy ? index : index - 1; | ||
return array.get(correctIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a fan of just dumping exception traces to the log by default. Even if we have #1795, it would be much better to explicitly log this at some sort of verbose logging level (like debug) so that we can control what is logged by setting the class level log level in the config.
If these types of things were logged at debug level, then in most cases we won't see them until we want to debug something, at which point we can change the log level for the class (potentially at run time through JMX) and then collect the necessary information.
If you see our soak cluster, we already have many NPE stack traces in the logs. These are really hard to work with because:
- It is hard to grep these out given that they have many lines.
- The sheer number crowds out everything else.
I vote for not adding to the problem in this patch.
@@ -539,11 +545,12 @@ private String visitBooleanComparisonExpression(final ComparisonExpression.Type | |||
switch (internalSchema.type()) { | |||
case ARRAY: | |||
return new Pair<>( | |||
String.format("((%s) ((%s)%s).get((int)(%s)))", | |||
String.format("((%s) (ArrayGet.getItem(((%s) %s), (int)(%s), %s)))", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented in another thread, but I think it would be better form to log the exception explicitly at a debug level so that we can control what goes into our logs.
@@ -191,6 +202,36 @@ public void shouldHandleMultiDimensionalArray() throws Exception { | |||
assertThat(result, equalTo("item_11")); | |||
} | |||
|
|||
@Test | |||
public void shouldHandleMultiDimensionalArrayWithBaseStartingFrom1() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, spell out the 1
, ie One
. I don't know if it is in our code style, but having digits in our function names seems a bit weird.
"name": "Json 2 Dimensional Array Base Starting from 1", | ||
"format": ["JSON"], | ||
"properties": { | ||
"ksql.functions.array.legacy.base": true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't the value true
imply that the base is 0
? The documentation, the description of this test, and the value of this config are not consistent. Yet another reason to just spell out the name in the config ksql.functions.array.zero.base
.
What's the status of this PR? 5.1 will include the change to SUBSTRING to index with start position 1 (before: 0). So it's rather unfortunate that the same change for Arrays will not happen in 5.1, too. Version 5.1 will then be rather confusing, because some functions start at 0, others at 1. |
Closing out this PR in favor of #4057 |
Description
Currently KSQL array index starts from 0. This PR changes this behavior so the array index starts from 1 so we are compatible with SQL standard.
This is a breaking change and we need to also include this in our release notes explicitly.
The documentation also has been updated.
This relates to #1659
Testing done
Describe the testing stategy. Unit and integration tests are expected for any behavior changes.
Reviewer checklist