
[Enhancement] Use dynamic batch size for simdjson to parse multiple json document #53056

Open

wants to merge 2 commits into base: main
Conversation

srlch
Contributor

@srlch srlch commented Nov 20, 2024

Why I'm doing:

In the current implementation, JsonDocumentStreamParser uses simdjson::ondemand::parser::iterate_many to
parse multiple JSON documents. This API requires the caller to pass the maximum size of a single JSON
document in the file (call it max_json_length_in_file), which determines the size of the memory chunk
allocated for parsing. The problem is that the caller passes the whole file size instead of
max_json_length_in_file, allocating a huge memory chunk (much of which may never be used), roughly 5~6
times the file size. This is a huge memory amplification.
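
For context, here is a minimal standalone sketch of the simdjson API in question (plain simdjson, not StarRocks code; the sample payload and variable names are illustrative). The batch_size argument of iterate_many only needs to exceed the largest single document in the stream, which is why passing the whole file size inflates the allocation:

```cpp
#include <simdjson.h>
#include <iostream>
#include <string>

int main() {
    // Three small documents in one stream; the largest is only a few bytes.
    std::string raw = R"({"a":1} {"a":2} {"a":3})";
    simdjson::padded_string json(raw);
    simdjson::ondemand::parser parser;

    // batch_size (third argument) bounds the internal buffer; using the
    // whole payload length here is what causes the memory amplification.
    simdjson::ondemand::document_stream docs;
    auto err = parser.iterate_many(json.data(), json.length(),
                                   simdjson::dom::DEFAULT_BATCH_SIZE).get(docs);
    if (err) { std::cerr << simdjson::error_message(err) << '\n'; return 1; }

    for (auto doc : docs) {
        int64_t a = 0;
        if (doc["a"].get_int64().get(a) == simdjson::SUCCESS) {
            std::cout << a << '\n';
        }
    }
    return 0;
}
```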

What I'm doing:

Introduce json_parse_many_batch_size to control the batch_size passed to simdjson::ondemand::parser::iterate_many.
If json_parse_many_batch_size > 0, use it as the batch size; otherwise use simdjson::dom::DEFAULT_BATCH_SIZE.
In JsonDocumentStreamParser::get_current, parse the document with a relatively small buffer. If an exception is
thrown because the buffer is too small, increase the buffer size and retry (see the sketch below).
Fixes https://github.com/StarRocks/StarRocksTest/issues/8636

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

@srlch srlch requested a review from a team as a code owner November 20, 2024 09:19
@mergify mergify bot assigned srlch Nov 20, 2024
        return Status::OK();
    }
    return Status::EndOfFile("all documents of the stream are iterated");

        return _get_current_impl(row);
    } catch (simdjson::simdjson_error& e) {
        std::string err_msg;
        if (e.error() == simdjson::CAPACITY) {
The most risky bug in this code is:
If _batch_size is initially larger than len, the assignment to _doc_stream can result in iterate_many() being called with an invalid length, causing undefined behavior or a crash.

You can modify the code like this:

-        _doc_stream = _parser->iterate_many(data, len, len);
+        _doc_stream = _parser->iterate_many(data, len, std::min(_batch_size, len));

@srlch srlch changed the title [WIP] [Enhancement] Use dynamic batch size for simdjson to parse multiple json document [Enhancement] Use dynamic batch size for simdjson to parse multiple json document Nov 21, 2024
Signed-off-by: srlch <[email protected]>
@srlch srlch requested a review from a team as a code owner November 21, 2024 02:54

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)


[FE Incremental Coverage Report]

pass : 0 / 0 (0%)


[BE Incremental Coverage Report]

pass : 28 / 33 (84.85%)

file detail

path                          covered_line   new_line   coverage   not_covered_line_detail
🔵 src/exec/json_parser.h      0              1          00.00%     [77]
🔵 src/exec/json_parser.cpp    28             32         87.50%     [59, 67, 86, 91]
