Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Json unquote function #8407

Merged
merged 29 commits into from
Nov 28, 2023
Merged

Support Json unquote function #8407

merged 29 commits into from
Nov 28, 2023

Conversation

yibin87
Copy link
Contributor

@yibin87 yibin87 commented Nov 22, 2023

What problem does this PR solve?

Issue Number: close #8334

Problem Summary:

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 22, 2023
@yibin87
Copy link
Contributor Author

yibin87 commented Nov 22, 2023

/run-all-tests

4 similar comments
@yibin87
Copy link
Contributor Author

yibin87 commented Nov 23, 2023

/run-all-tests

@yibin87
Copy link
Contributor Author

yibin87 commented Nov 23, 2023

/run-all-tests

@yibin87
Copy link
Contributor Author

yibin87 commented Nov 23, 2023

/run-all-tests

@purelind
Copy link
Collaborator

/run-all-tests

@purelind
Copy link
Collaborator

/rebuild

@yibin87
Copy link
Contributor Author

yibin87 commented Nov 23, 2023

/run-integration-test

@yibin87 yibin87 changed the title [WIP] Support Json unquote function Support Json unquote function Nov 23, 2023
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 23, 2023
@yibin87
Copy link
Contributor Author

yibin87 commented Nov 23, 2023

/run-all-tests

@yibin87
Copy link
Contributor Author

yibin87 commented Nov 23, 2023

/hold

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2023
@yibin87
Copy link
Contributor Author

yibin87 commented Nov 23, 2023

run-integration-test

dbms/src/TiDB/Decode/JsonScanner.cpp Show resolved Hide resolved
Comment on lines 38 to 67
auto & factory = FunctionFactory::instance();
ColumnsWithTypeAndName columns({input_column});
ColumnNumbers argument_column_numbers;
for (size_t i = 0; i < columns.size(); ++i)
argument_column_numbers.push_back(i);

ColumnsWithTypeAndName arguments;
for (const auto argument_column_number : argument_column_numbers)
arguments.push_back(columns.at(argument_column_number));

const String func_name = "cast_json_as_string";
auto builder = factory.tryGet(func_name, context);
if (!builder)
throw TiFlashTestException(fmt::format("Function {} not found!", func_name));
auto func = builder->build(arguments, nullptr);
auto * function_build_ptr = builder.get();
if (auto * default_function_builder = dynamic_cast<DefaultFunctionBuilder *>(function_build_ptr);
default_function_builder)
{
auto * function_impl = default_function_builder->getFunctionImpl().get();
if (auto * function_cast_json_as_string = dynamic_cast<FunctionsCastJsonAsString *>(function_impl);
function_cast_json_as_string)
{
function_cast_json_as_string->setOutputTiDBFieldType(field_type);
}
else
{
throw TiFlashTestException(fmt::format("Function {} not found!", func_name));
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems useless because DAGExpressionAnalyerHelper will be called when raw_function_test is false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't get your point here, just introduce this method for test to set tidb field type here

dbms/src/Flash/Coprocessor/DAGExpressionAnalyzerHelper.cpp Outdated Show resolved Hide resolved
@yibin87
Copy link
Contributor Author

yibin87 commented Nov 24, 2023

/run-all-tests

@yibin87
Copy link
Contributor Author

yibin87 commented Nov 24, 2023

/rebuild

@yibin87 yibin87 requested a review from SeaRise November 24, 2023 02:35
@yibin87
Copy link
Contributor Author

yibin87 commented Nov 24, 2023

/run-all-tests

3 similar comments
@yibin87
Copy link
Contributor Author

yibin87 commented Nov 24, 2023

/run-all-tests

@yibin87
Copy link
Contributor Author

yibin87 commented Nov 24, 2023

/run-all-tests

@yibin87
Copy link
Contributor Author

yibin87 commented Nov 24, 2023

/run-all-tests

dbms/src/Functions/FunctionsJson.h Outdated Show resolved Hide resolved
byte_length = std::min(byte_length, orig_length);
if (byte_length < element_write_buffer.count())
context.getDAGContext()->handleTruncateError("Data Too Long");
write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like if byte_length > element_write_buffer.count(), it will append random bytes, is it the expected behavior?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And Looks like if there is a method to get current pos in write_buffer, we don't need to write tmp result into element_write_buffer and copy it to write_buffer after the byte length check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

byte_length is expected to be equal or fewer than orig_length, thus shouldn't be byte_length > element_write_buffer.count() case.
And it is not common to set char length here, thus use tmp result to make code more readable.

Copy link
Contributor

@windtalker windtalker Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But theoretical speaking, we still need to handle the case of byte_length > element_write_buffer.count()? Maybe we should throw Exception in charLengthToByteLengthFromUTF8 if ret > length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

byte_length = std::min(byte_length, orig_length); is executed after charLengthToByteLengthFromUTF8, thus byte_length <= orig_length. Not sure if this answer your question.

Copy link
Contributor

@windtalker windtalker Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But charLengthToByteLengthFromUTF8 can not guarantee this if it is not a valid utf8 string, so I suggest to throw an exception in charLengthToByteLengthFromUTF8 if ret > length

@@ -189,7 +189,7 @@ struct TiDBConvertToString
WriteBufferFromVector<ColumnString::Chars_t> element_write_buffer(container_per_element);
FormatImpl<FromDataType>::execute(vec_from[i], element_write_buffer, &type, nullptr);
size_t byte_length = element_write_buffer.count();
if (tp.flen() > 0)
if (tp.flen() >= 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a bug fix here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is a existing bug.

@yibin87 yibin87 requested a review from windtalker November 27, 2023 03:34
@yibin87
Copy link
Contributor Author

yibin87 commented Nov 27, 2023

/hold

@yibin87
Copy link
Contributor Author

yibin87 commented Nov 28, 2023

/run-all-tests

Signed-off-by: yibin <[email protected]>
@yibin87
Copy link
Contributor Author

yibin87 commented Nov 28, 2023

/run-all-tests

@yibin87 yibin87 requested a review from SeaRise November 28, 2023 06:07
dbms/src/Functions/tests/gtest_json_array.cpp Show resolved Hide resolved
json_binary.toStringInBuffer(element_write_buffer);
}

size_t orig_length = element_write_buffer.count();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L475-L483 should be inside the above else branch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it can reduce useless code for null case. I'll move it.

byte_length = std::min(byte_length, orig_length);
if (byte_length < element_write_buffer.count())
context.getDAGContext()->handleTruncateError("Data Too Long");
write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);
Copy link
Contributor

@windtalker windtalker Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But theoretical speaking, we still need to handle the case of byte_length > element_write_buffer.count()? Maybe we should throw Exception in charLengthToByteLengthFromUTF8 if ret > length?

@SeaRise SeaRise self-requested a review November 28, 2023 07:19
Signed-off-by: yibin <[email protected]>
@yibin87 yibin87 requested a review from windtalker November 28, 2023 07:32
reinterpret_cast<char *>(container_per_element.data()),
orig_length,
tidb_tp->flen());
byte_length = std::min(byte_length, orig_length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is not necessary since charLengthToByteLengthFromUTF8 should ensure that the return value is less than orig_length?

Comment on lines 468 to 481
JsonBinary::JsonBinaryWriteBuffer element_write_buffer(container_per_element);
JsonBinary json_binary(
data_from[current_offset],
StringRef(&data_from[current_offset + 1], json_length - 1));
json_binary.toStringInBuffer(element_write_buffer);
size_t orig_length = element_write_buffer.count();
auto byte_length = charLengthToByteLengthFromUTF8(
reinterpret_cast<char *>(container_per_element.data()),
orig_length,
tidb_tp->flen());
byte_length = std::min(byte_length, orig_length);
if (byte_length < element_write_buffer.count())
context.getDAGContext()->handleTruncateError("Data Too Long");
write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);
Copy link
Contributor

@SeaRise SeaRise Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

Suggested change
JsonBinary::JsonBinaryWriteBuffer element_write_buffer(container_per_element);
JsonBinary json_binary(
data_from[current_offset],
StringRef(&data_from[current_offset + 1], json_length - 1));
json_binary.toStringInBuffer(element_write_buffer);
size_t orig_length = element_write_buffer.count();
auto byte_length = charLengthToByteLengthFromUTF8(
reinterpret_cast<char *>(container_per_element.data()),
orig_length,
tidb_tp->flen());
byte_length = std::min(byte_length, orig_length);
if (byte_length < element_write_buffer.count())
context.getDAGContext()->handleTruncateError("Data Too Long");
write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);
auto start_pos = write_buffer.offset();
JsonBinary json_binary(
data_from[current_offset],
StringRef(&data_from[current_offset + 1], json_length - 1));
json_binary.toStringInBuffer(write_buffer);
auto end_pos = write_buffer.offset();
auto orig_length = end_pos - start_pos;
auto byte_length = charLengthToByteLengthFromUTF8(
reinterpret_cast<char *>(write_buffer.data() + start_offset),
orig_length,
tidb_tp->flen());
byte_length = std::min(byte_length, orig_length);
if (byte_length < orig_length)
{
context.getDAGContext()->handleTruncateError("Data Too Long");
write_buffer.setOffset(start_pos + byte_length);
}

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid one more memcpy.

Copy link
Contributor Author

@yibin87 yibin87 Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah,you're right. I just think this code path is not common used(because cast json as fixed length char is valid but strange), and even if it is used the performance won't drop significantly, thus choose to use the temporary buffer here to make code more easier.

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Nov 28, 2023
@yibin87
Copy link
Contributor Author

yibin87 commented Nov 28, 2023

/unhold

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2023
@yibin87
Copy link
Contributor Author

yibin87 commented Nov 28, 2023

/run-all-tests

Copy link
Contributor

@windtalker windtalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Nov 28, 2023
Copy link
Contributor

ti-chi-bot bot commented Nov 28, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SeaRise, windtalker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

ti-chi-bot bot commented Nov 28, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-11-28 08:22:31.526480833 +0000 UTC m=+910980.191707013: ☑️ agreed by SeaRise.
  • 2023-11-28 08:39:03.4456276 +0000 UTC m=+911972.110853795: ☑️ agreed by windtalker.

@yibin87
Copy link
Contributor Author

yibin87 commented Nov 28, 2023

/run-all-tests

Copy link
Contributor

ti-chi-bot bot commented Nov 28, 2023

@yibin87: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

trigger some heavy tests which will not run always when PR updated.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot ti-chi-bot bot merged commit 4479df8 into pingcap:master Nov 28, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support json_unquote function
4 participants