
[Enhancement] Use dynamic batch size for simdjson to parse multiple json document #53056

Open

wants to merge 2 commits into base: main
Conversation

srlch
Contributor

@srlch srlch commented Nov 20, 2024

Why I'm doing:

In the current implementation, JsonDocumentStreamParser uses simdjson::ondemand::parser::iterate_many to
parse multiple JSON documents. This API requires the caller to pass the maximum size of a single JSON
document in the file (call it max_json_length_in_file), which determines the size of the memory chunk
allocated for parsing. The problem is that the caller passes the whole file size instead of
max_json_length_in_file, allocating a huge memory chunk (much of which may never be used), roughly 5~6
times the file size. This is a huge memory amplification.
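
For context, here is a minimal standalone sketch of the simdjson API in question (plain simdjson, not StarRocks code; the sample payload and variable names are illustrative). The batch_size argument of iterate_many only needs to exceed the largest single document in the stream, which is why passing the whole file size inflates the allocation:

```cpp
#include <simdjson.h>
#include <iostream>
#include <string>

int main() {
    // Three small documents in one stream; the largest is only a few bytes.
    std::string raw = R"({"a":1} {"a":2} {"a":3})";
    simdjson::padded_string json(raw);
    simdjson::ondemand::parser parser;

    // batch_size (third argument) bounds the internal buffer; using the
    // whole payload length here is what causes the memory amplification.
    simdjson::ondemand::document_stream docs;
    auto err = parser.iterate_many(json.data(), json.length(),
                                   simdjson::dom::DEFAULT_BATCH_SIZE).get(docs);
    if (err) { std::cerr << simdjson::error_message(err) << '\n'; return 1; }

    for (auto doc : docs) {
        int64_t a = 0;
        if (doc["a"].get_int64().get(a) == simdjson::SUCCESS) {
            std::cout << a << '\n';
        }
    }
    return 0;
}
```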

What I'm doing:

Introduce json_parse_many_batch_size to control the batch_size passed to simdjson::ondemand::parser::iterate_many.
If json_parse_many_batch_size > 0, use it as the batch size; otherwise use simdjson::dom::DEFAULT_BATCH_SIZE.
In JsonDocumentStreamParser::get_current, parse the document with a relatively small buffer. If an exception is
thrown because the buffer is too small, increase the buffer size and retry (see the sketch below).
Fixes https://github.com/StarRocks/StarRocksTest/issues/8636

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

@srlch srlch requested a review from a team as a code owner November 20, 2024 09:19
@mergify mergify bot assigned srlch Nov 20, 2024
        return Status::OK();
    }
    return Status::EndOfFile("all documents of the stream are iterated");

        return _get_current_impl(row);
    } catch (simdjson::simdjson_error& e) {
        std::string err_msg;
        if (e.error() == simdjson::CAPACITY) {
The most risky bug in this code is:
If _batch_size is initially larger than len, the assignment to _doc_stream can result in iterate_many() being called with an invalid length, causing undefined behavior or a crash.

You can modify the code like this:

-        _doc_stream = _parser->iterate_many(data, len, len);
+        _doc_stream = _parser->iterate_many(data, len, std::min(_batch_size, len));

@srlch srlch changed the title [WIP] [Enhancement] Use dynamic batch size for simdjson to parse multiple json document [Enhancement] Use dynamic batch size for simdjson to parse multiple json document Nov 21, 2024
Signed-off-by: srlch <[email protected]>
@srlch srlch requested a review from a team as a code owner November 21, 2024 02:54

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)


[FE Incremental Coverage Report]

pass : 0 / 0 (0%)


[BE Incremental Coverage Report]

pass : 28 / 33 (84.85%)

file detail

path                          covered_line   new_line   coverage   not_covered_line_detail
🔵 src/exec/json_parser.h      0              1          00.00%     [77]
🔵 src/exec/json_parser.cpp    28             32         87.50%     [59, 67, 86, 91]
