Support Json unquote function #8407
auto & factory = FunctionFactory::instance();
ColumnsWithTypeAndName columns({input_column});
ColumnNumbers argument_column_numbers;
for (size_t i = 0; i < columns.size(); ++i)
    argument_column_numbers.push_back(i);

ColumnsWithTypeAndName arguments;
for (const auto argument_column_number : argument_column_numbers)
    arguments.push_back(columns.at(argument_column_number));

const String func_name = "cast_json_as_string";
auto builder = factory.tryGet(func_name, context);
if (!builder)
    throw TiFlashTestException(fmt::format("Function {} not found!", func_name));
auto func = builder->build(arguments, nullptr);
auto * function_build_ptr = builder.get();
if (auto * default_function_builder = dynamic_cast<DefaultFunctionBuilder *>(function_build_ptr);
    default_function_builder)
{
    auto * function_impl = default_function_builder->getFunctionImpl().get();
    if (auto * function_cast_json_as_string = dynamic_cast<FunctionsCastJsonAsString *>(function_impl);
        function_cast_json_as_string)
    {
        function_cast_json_as_string->setOutputTiDBFieldType(field_type);
    }
    else
    {
        throw TiFlashTestException(fmt::format("Function {} not found!", func_name));
    }
}
This seems unnecessary, because DAGExpressionAnalyzerHelper will be called when raw_function_test is false.
I don't quite follow your point. This method is introduced only so that tests can set the TiDB field type.
dbms/src/Functions/FunctionsJson.h
Outdated
byte_length = std::min(byte_length, orig_length);
if (byte_length < element_write_buffer.count())
    context.getDAGContext()->handleTruncateError("Data Too Long");
write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);
It looks like if byte_length > element_write_buffer.count(), it will append random bytes. Is that the expected behavior?
Also, if there is a method to get the current position in write_buffer, we don't need to write the temporary result into element_write_buffer and copy it to write_buffer after the byte length check, right?
byte_length is expected to be less than or equal to orig_length, so the byte_length > element_write_buffer.count() case should not occur.
Also, setting a character length here is uncommon, so I use a temporary result to keep the code readable.
But theoretically speaking, we still need to handle the case of byte_length > element_write_buffer.count(). Maybe we should throw an exception in charLengthToByteLengthFromUTF8 if ret > length?
byte_length = std::min(byte_length, orig_length); is executed after charLengthToByteLengthFromUTF8, so byte_length <= orig_length always holds. I'm not sure whether this answers your question.
But charLengthToByteLengthFromUTF8 cannot guarantee this if the input is not a valid UTF-8 string, so I suggest throwing an exception in charLengthToByteLengthFromUTF8 if ret > length.
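To make the concern in this thread concrete, here is a minimal sketch of a character-length to byte-length conversion that clamps its result to the buffer size. The name charLengthToByteLengthClamped and its exact behavior are assumptions for illustration; this is not TiFlash's charLengthToByteLengthFromUTF8, and it clamps rather than throws, which is only one of the options discussed above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration only (not the TiFlash implementation).
// Returns the number of bytes occupied by the first `char_length` UTF-8
// characters in [data, data + length), never more than `length`.
inline size_t charLengthToByteLengthClamped(const char * data, size_t length, size_t char_length)
{
    assert(data != nullptr || length == 0);
    size_t pos = 0;
    for (size_t chars = 0; chars < char_length && pos < length; ++chars)
    {
        const auto lead = static_cast<unsigned char>(data[pos]);
        size_t step = 1; // ASCII, or a stray byte counted as one character
        if (lead >= 0xF0)
            step = 4;
        else if (lead >= 0xE0)
            step = 3;
        else if (lead >= 0xC0)
            step = 2;
        pos += step;
    }
    // A multi-byte lead byte near the end of an invalid string can push pos
    // past the buffer, so the result is clamped instead of trusted.
    return std::min(pos, length);
}
```

With the two-byte input "a\xE4" (a truncated three-byte sequence), the unclamped position would be 4, which is the byte_length > orig_length case raised above; the clamp brings it back to 2.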
@@ -189,7 +189,7 @@ struct TiDBConvertToString
 WriteBufferFromVector<ColumnString::Chars_t> element_write_buffer(container_per_element);
 FormatImpl<FromDataType>::execute(vec_from[i], element_write_buffer, &type, nullptr);
 size_t byte_length = element_write_buffer.count();
-if (tp.flen() > 0)
+if (tp.flen() >= 0)
Is this a bug fix?
Yes, it is an existing bug.
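A small sketch of why the comparison matters, assuming that a negative flen means "no length specified" and, for simplicity, that the data is ASCII so characters and bytes coincide. The helper name truncatedByteLength is hypothetical and does not appear in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration only.
// Assumption: flen < 0 means the target type has no length limit.
inline size_t truncatedByteLength(size_t byte_length, int flen)
{
    size_t result = byte_length;
    if (flen >= 0) // with `> 0`, a target length of 0 would skip truncation
        result = std::min(byte_length, static_cast<size_t>(flen));
    assert(result <= byte_length); // never report more bytes than were produced
    return result;
}
```

Under these assumptions, a target length of 0 now yields an empty result, while a negative length leaves the value untouched.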
Signed-off-by: yibin <[email protected]>
dbms/src/Functions/FunctionsJson.h
Outdated
    json_binary.toStringInBuffer(element_write_buffer);
}

size_t orig_length = element_write_buffer.count();
Should L475-L483 be inside the else branch above?
Yes, that avoids unnecessary work in the null case. I'll move it.
Signed-off-by: yibin <[email protected]>
dbms/src/Functions/FunctionsJson.h
Outdated
    reinterpret_cast<char *>(container_per_element.data()),
    orig_length,
    tidb_tp->flen());
byte_length = std::min(byte_length, orig_length);
It looks like this is not necessary, since charLengthToByteLengthFromUTF8 should ensure that the return value never exceeds orig_length?
dbms/src/Functions/FunctionsJson.h
Outdated
JsonBinary::JsonBinaryWriteBuffer element_write_buffer(container_per_element);
JsonBinary json_binary(
    data_from[current_offset],
    StringRef(&data_from[current_offset + 1], json_length - 1));
json_binary.toStringInBuffer(element_write_buffer);
size_t orig_length = element_write_buffer.count();
auto byte_length = charLengthToByteLengthFromUTF8(
    reinterpret_cast<char *>(container_per_element.data()),
    orig_length,
    tidb_tp->flen());
byte_length = std::min(byte_length, orig_length);
if (byte_length < element_write_buffer.count())
    context.getDAGContext()->handleTruncateError("Data Too Long");
write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);
How about the following instead?
auto start_pos = write_buffer.offset();
JsonBinary json_binary(
    data_from[current_offset],
    StringRef(&data_from[current_offset + 1], json_length - 1));
json_binary.toStringInBuffer(write_buffer);
auto end_pos = write_buffer.offset();
auto orig_length = end_pos - start_pos;
auto byte_length = charLengthToByteLengthFromUTF8(
    reinterpret_cast<char *>(write_buffer.data() + start_pos),
    orig_length,
    tidb_tp->flen());
byte_length = std::min(byte_length, orig_length);
if (byte_length < orig_length)
{
    context.getDAGContext()->handleTruncateError("Data Too Long");
    write_buffer.setOffset(start_pos + byte_length);
}
This avoids one extra memcpy.
Yeah, you're right. I just think this code path is not commonly used (casting JSON to a fixed-length char is valid but unusual), and even when it is used the performance won't drop significantly, so I chose to use the temporary buffer here to keep the code simpler.
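For readers weighing the two approaches, the write-in-place idea can be sketched with a std::string standing in for the output buffer. The function appendTruncated is hypothetical, and std::string::resize plays the role that an offset-rewinding method would play on a real write buffer; whether TiFlash's buffers expose such a method is not established in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for illustration only: `out` models the destination
// buffer. The element is written directly into it, then the bytes beyond
// `max_bytes` are dropped in place. Returns true if truncation happened.
inline bool appendTruncated(std::string & out, const std::string & element, size_t max_bytes)
{
    const size_t start_pos = out.size();
    out.append(element); // write straight into the destination
    const size_t orig_length = out.size() - start_pos;
    assert(orig_length == element.size());
    if (max_bytes < orig_length)
    {
        out.resize(start_pos + max_bytes); // rewind instead of copying from a temporary
        return true;
    }
    return false;
}
```

The temporary-buffer version kept in the PR performs one extra copy per element but keeps the destination untouched until the final length is known, which is the readability trade-off described above.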
Signed-off-by: yibin <[email protected]>
LGTM
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: SeaRise, windtalker.
What problem does this PR solve?
Issue Number: close #8334
Problem Summary:
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note