Support Json unquote function #8407

yibin87 · 2023-11-22T06:22:07Z

What problem does this PR solve?

Issue Number: close #8334

Problem Summary:

What is changed and how it works?

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

None

yibin87 · 2023-11-22T06:23:14Z

/run-all-tests

yibin87 · 2023-11-23T03:58:33Z

/run-all-tests

yibin87 · 2023-11-23T07:46:18Z

/run-all-tests

yibin87 · 2023-11-23T08:18:10Z

/run-all-tests

purelind · 2023-11-23T08:19:42Z

/run-all-tests

purelind · 2023-11-23T08:22:18Z

/rebuild

yibin87 · 2023-11-23T08:22:49Z

/run-integration-test

yibin87 · 2023-11-23T08:49:38Z

/run-all-tests

yibin87 · 2023-11-23T08:50:11Z

/hold

yibin87 · 2023-11-23T09:11:17Z

run-integration-test

dbms/src/TiDB/Decode/JsonScanner.cpp

SeaRise · 2023-11-23T09:22:55Z

dbms/src/Functions/tests/gtest_cast_json_as_string.cpp

+        auto & factory = FunctionFactory::instance();
+        ColumnsWithTypeAndName columns({input_column});
+        ColumnNumbers argument_column_numbers;
+        for (size_t i = 0; i < columns.size(); ++i)
+            argument_column_numbers.push_back(i);
+
+        ColumnsWithTypeAndName arguments;
+        for (const auto argument_column_number : argument_column_numbers)
+            arguments.push_back(columns.at(argument_column_number));
+
+        const String func_name = "cast_json_as_string";
+        auto builder = factory.tryGet(func_name, context);
+        if (!builder)
+            throw TiFlashTestException(fmt::format("Function {} not found!", func_name));
+        auto func = builder->build(arguments, nullptr);
+        auto * function_build_ptr = builder.get();
+        if (auto * default_function_builder = dynamic_cast<DefaultFunctionBuilder *>(function_build_ptr);
+            default_function_builder)
+        {
+            auto * function_impl = default_function_builder->getFunctionImpl().get();
+            if (auto * function_cast_json_as_string = dynamic_cast<FunctionsCastJsonAsString *>(function_impl);
+                function_cast_json_as_string)
+            {
+                function_cast_json_as_string->setOutputTiDBFieldType(field_type);
+            }
+            else
+            {
+                throw TiFlashTestException(fmt::format("Function {} not found!", func_name));
+            }
+        }


This seems unnecessary, because DAGExpressionAnalyzerHelper will be called when raw_function_test is false.

I don't quite get your point here. This method was introduced only so that the test can set the TiDB field type.

dbms/src/Flash/Coprocessor/DAGExpressionAnalyzerHelper.cpp

yibin87 · 2023-11-24T01:24:10Z

/run-all-tests

yibin87 · 2023-11-24T01:35:26Z

/rebuild

yibin87 · 2023-11-24T03:14:48Z

/run-all-tests

yibin87 · 2023-11-24T05:16:04Z

/run-all-tests

yibin87 · 2023-11-24T05:52:12Z

/run-all-tests

yibin87 · 2023-11-24T08:09:12Z

/run-all-tests

dbms/src/Functions/FunctionsJson.h

windtalker · 2023-11-27T03:06:00Z

dbms/src/Functions/FunctionsJson.h

+                    byte_length = std::min(byte_length, orig_length);
+                    if (byte_length < element_write_buffer.count())
+                        context.getDAGContext()->handleTruncateError("Data Too Long");
+                    write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);


It looks like if byte_length > element_write_buffer.count(), this will append random bytes. Is that the expected behavior?

Also, if there is a method to get the current position in write_buffer, we wouldn't need to write the temporary result into element_write_buffer and then copy it to write_buffer after the byte-length check.

byte_length is expected to be less than or equal to orig_length, so the byte_length > element_write_buffer.count() case shouldn't occur.
Setting a char length here is also uncommon, so I used a temporary result to keep the code readable.

But theoretically speaking, don't we still need to handle the case of byte_length > element_write_buffer.count()? Maybe we should throw an exception in charLengthToByteLengthFromUTF8 if ret > length.

byte_length = std::min(byte_length, orig_length); is executed after charLengthToByteLengthFromUTF8, so byte_length <= orig_length. Not sure if this answers your question.

But charLengthToByteLengthFromUTF8 cannot guarantee this if the input is not a valid UTF-8 string, so I suggest throwing an exception in charLengthToByteLengthFromUTF8 if ret > length.
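For readers following this thread, here is a minimal sketch of the kind of bounds check being asked for. It is not TiFlash's charLengthToByteLengthFromUTF8; the name utf8PrefixByteLength and its signature are made up for illustration, and it only inspects lead bytes (continuation bytes are not validated). The point is that a character whose encoded width runs past the end of the input raises an error, so the returned byte length can never exceed the input length.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string_view>

// Hypothetical helper (illustration only, not the TiFlash implementation).
// Returns the number of bytes occupied by the first `char_length` UTF-8
// characters of `data`. Throws if a lead byte is invalid or if a character
// would extend past the end of the input, so the result is always
// <= data.size().
inline size_t utf8PrefixByteLength(std::string_view data, size_t char_length)
{
    size_t pos = 0;
    for (size_t chars = 0; chars < char_length && pos < data.size(); ++chars)
    {
        const auto lead = static_cast<unsigned char>(data[pos]);
        size_t width = 0;
        if (lead < 0x80)
            width = 1; // single-byte (ASCII)
        else if ((lead & 0xE0) == 0xC0)
            width = 2;
        else if ((lead & 0xF0) == 0xE0)
            width = 3;
        else if ((lead & 0xF8) == 0xF0)
            width = 4;
        else
            throw std::invalid_argument("invalid UTF-8 lead byte");

        // The check discussed above: never report more bytes than exist.
        if (pos + width > data.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        pos += width;
    }
    return pos;
}
```

With a check like this inside the helper, the caller's std::min(byte_length, orig_length) becomes a redundant safety net rather than the only guard.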

windtalker · 2023-11-27T03:09:45Z

dbms/src/Functions/FunctionsTiDBConversion.h

@@ -189,7 +189,7 @@ struct TiDBConvertToString
                WriteBufferFromVector<ColumnString::Chars_t> element_write_buffer(container_per_element);
                FormatImpl<FromDataType>::execute(vec_from[i], element_write_buffer, &type, nullptr);
                size_t byte_length = element_write_buffer.count();
-                if (tp.flen() > 0)
+                if (tp.flen() >= 0)


Is this a bug fix?

Yes, it is an existing bug.
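One plausible reading of the fix, sketched below under stated assumptions: a negative flen means no length was specified, while flen == 0 is a real limit of zero characters. Under the old `> 0` check a zero limit would have been skipped and the value left untruncated. The function applyFlen is hypothetical and handles ASCII only; it is not the code in TiDBConvertToString.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the length check in the hunk above.
// Assumption: flen < 0 means "unspecified", flen == 0 means "zero characters".
// ASCII only, so one character is one byte in this sketch.
inline std::string applyFlen(const std::string & value, int flen)
{
    if (flen >= 0 && static_cast<size_t>(flen) < value.size())
        return value.substr(0, static_cast<size_t>(flen)); // truncate to the limit
    return value; // unspecified length, or value already fits
}
```

Here applyFlen("abc", 0) yields an empty string, whereas a `flen > 0` condition would have returned "abc" unchanged.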

dbms/src/Functions/FunctionsJson.h

yibin87 · 2023-11-27T06:29:22Z

/hold

yibin87 · 2023-11-28T05:11:01Z

/run-all-tests

Signed-off-by: yibin <[email protected]>

yibin87 · 2023-11-28T05:16:54Z

/run-all-tests

dbms/src/Functions/tests/gtest_json_array.cpp

windtalker · 2023-11-28T07:09:03Z

dbms/src/Functions/FunctionsJson.h

+                        json_binary.toStringInBuffer(element_write_buffer);
+                    }
+
+                    size_t orig_length = element_write_buffer.count();


L475-L483 should be inside the above else branch?

Yeah, that avoids unnecessary work in the null case. I'll move it.

windtalker · 2023-11-28T07:13:25Z

dbms/src/Functions/FunctionsJson.h

+                    byte_length = std::min(byte_length, orig_length);
+                    if (byte_length < element_write_buffer.count())
+                        context.getDAGContext()->handleTruncateError("Data Too Long");
+                    write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);


But theoretically speaking, don't we still need to handle the case of byte_length > element_write_buffer.count()? Maybe we should throw an exception in charLengthToByteLengthFromUTF8 if ret > length.

dbms/src/Flash/Coprocessor/DAGExpressionAnalyzerHelper.cpp

Signed-off-by: yibin <[email protected]>

windtalker · 2023-11-28T07:44:22Z

dbms/src/Functions/FunctionsJson.h

+                            reinterpret_cast<char *>(container_per_element.data()),
+                            orig_length,
+                            tidb_tp->flen());
+                        byte_length = std::min(byte_length, orig_length);


It looks like this is not necessary, since charLengthToByteLengthFromUTF8 should ensure that the return value is no greater than orig_length.

SeaRise · 2023-11-28T07:50:17Z

dbms/src/Functions/FunctionsJson.h

+                        JsonBinary::JsonBinaryWriteBuffer element_write_buffer(container_per_element);
+                        JsonBinary json_binary(
+                            data_from[current_offset],
+                            StringRef(&data_from[current_offset + 1], json_length - 1));
+                        json_binary.toStringInBuffer(element_write_buffer);
+                        size_t orig_length = element_write_buffer.count();
+                        auto byte_length = charLengthToByteLengthFromUTF8(
+                            reinterpret_cast<char *>(container_per_element.data()),
+                            orig_length,
+                            tidb_tp->flen());
+                        byte_length = std::min(byte_length, orig_length);
+                        if (byte_length < element_write_buffer.count())
+                            context.getDAGContext()->handleTruncateError("Data Too Long");
+                        write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);


How about

Suggested change

-                        JsonBinary::JsonBinaryWriteBuffer element_write_buffer(container_per_element);
-                        JsonBinary json_binary(
-                            data_from[current_offset],
-                            StringRef(&data_from[current_offset + 1], json_length - 1));
-                        json_binary.toStringInBuffer(element_write_buffer);
-                        size_t orig_length = element_write_buffer.count();
-                        auto byte_length = charLengthToByteLengthFromUTF8(
-                            reinterpret_cast<char *>(container_per_element.data()),
-                            orig_length,
-                            tidb_tp->flen());
-                        byte_length = std::min(byte_length, orig_length);
-                        if (byte_length < element_write_buffer.count())
-                            context.getDAGContext()->handleTruncateError("Data Too Long");
-                        write_buffer.write(reinterpret_cast<char *>(container_per_element.data()), byte_length);
+                        auto start_pos = write_buffer.offset();
+                        JsonBinary json_binary(
+                            data_from[current_offset],
+                            StringRef(&data_from[current_offset + 1], json_length - 1));
+                        json_binary.toStringInBuffer(write_buffer);
+                        auto end_pos = write_buffer.offset();
+                        auto orig_length = end_pos - start_pos;
+                        auto byte_length = charLengthToByteLengthFromUTF8(
+                            reinterpret_cast<char *>(write_buffer.data() + start_pos),
+                            orig_length,
+                            tidb_tp->flen());
+                        byte_length = std::min(byte_length, orig_length);
+                        if (byte_length < orig_length)
+                        {
+                            context.getDAGContext()->handleTruncateError("Data Too Long");
+                            write_buffer.setOffset(start_pos + byte_length);
+                        }

?

This avoids one extra memcpy.

Yeah, you're right. I just think this code path is not commonly used (casting JSON to a fixed-length char is valid but unusual), and even when it is used the performance won't drop significantly, so I chose the temporary buffer here to keep the code simpler.
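The trade-off in this exchange can be illustrated with a small standalone sketch. It uses std::string as a stand-in for the output buffer, and max_bytes as a stand-in for the byte length derived from flen; appendTruncated is a hypothetical name, and none of this is the TiFlash WriteBuffer API. The idea is to write straight into the destination, measure what was written, and rewind the end position if the limit was exceeded, instead of staging the element in a temporary buffer and copying it over.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Sketch only: `out` stands in for the destination buffer and `max_bytes`
// for the byte length computed from the field length.
// Returns true when the appended element had to be truncated.
inline bool appendTruncated(std::string & out, std::string_view element, size_t max_bytes)
{
    const size_t start_pos = out.size();
    out.append(element); // write directly into the destination
    const size_t orig_length = out.size() - start_pos;
    if (max_bytes < orig_length)
    {
        // Rewind the end of the buffer rather than copying from a temp buffer.
        out.resize(start_pos + max_bytes);
        return true; // a caller would report "Data Too Long" here
    }
    return false;
}
```

The temporary-buffer version in the PR pays one extra copy per element but keeps the write path free of offset bookkeeping, which is the readability argument made in the reply above.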

Signed-off-by: yibin <[email protected]>

yibin87 · 2023-11-28T08:30:17Z

/unhold

yibin87 · 2023-11-28T08:30:25Z

/run-all-tests

windtalker

LGTM

ti-chi-bot · 2023-11-28T08:39:02Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SeaRise, windtalker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [SeaRise,windtalker]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2023-11-28T08:39:04Z

[LGTM Timeline notifier]

Timeline:

2023-11-28 08:22:31.526480833 +0000 UTC m=+910980.191707013: ☑️ agreed by SeaRise.
2023-11-28 08:39:03.4456276 +0000 UTC m=+911972.110853795: ☑️ agreed by windtalker.

yibin87 · 2023-11-28T08:44:56Z

/run-all-tests

ti-chi-bot · 2023-11-28T08:49:30Z

@yibin87: Your PR was out of date, I have automatically updated it for you.

At the same time I will also trigger all tests for you:

/run-all-tests

trigger some heavy tests which will not run always when PR updated.

If the CI test fails, you just re-trigger the test that failed and the bot will merge the PR for you after the CI passes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 22, 2023

yibin87 changed the title ~~[WIP] Support Json unquote function~~ Support Json unquote function Nov 23, 2023

ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 23, 2023

yibin87 requested review from SeaRise and windtalker November 23, 2023 08:47

ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2023

SeaRise reviewed Nov 23, 2023

yibin87 requested a review from SeaRise November 24, 2023 02:35

windtalker reviewed Nov 27, 2023

yibin87 requested a review from windtalker November 27, 2023 03:34

SeaRise reviewed Nov 27, 2023

Fix format issue

9b3e744

Signed-off-by: yibin <[email protected]>

yibin87 requested a review from SeaRise November 28, 2023 06:07

windtalker reviewed Nov 28, 2023

SeaRise reviewed Nov 28, 2023

SeaRise self-requested a review November 28, 2023 07:19

Address comments

28a6233

Signed-off-by: yibin <[email protected]>

yibin87 requested a review from windtalker November 28, 2023 07:32

windtalker reviewed Nov 28, 2023

SeaRise reviewed Nov 28, 2023

Address comments to throw exception when invalid utf8 code encountered

b7752df

Signed-off-by: yibin <[email protected]>

yibin87 requested review from windtalker and SeaRise November 28, 2023 08:21

SeaRise approved these changes Nov 28, 2023

ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Nov 28, 2023

ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 28, 2023

windtalker approved these changes Nov 28, 2023

ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Nov 28, 2023

Merge branch 'master' into json_unquote

d8cbc35

Merge branch 'master' into json_unquote

572fea9

ti-chi-bot bot merged commit 4479df8 into pingcap:master Nov 28, 2023
6 checks passed
