[Enhancement] Use dynamic batch size for simdjson to parse multiple JSON documents #53056
base: main
Conversation
return Status::OK();
}
return Status::EndOfFile("all documents of the stream are iterated");
return _get_current_impl(row);
} catch (simdjson::simdjson_error& e) {
    std::string err_msg;
    if (e.error() == simdjson::CAPACITY) {
The most risky bug in this code is: if `_batch_size` is initially larger than `len`, the assignment to `_doc_stream` can result in `iterate_many()` being called with an invalid length, causing undefined behavior or a crash.

You can modify the code like this:

- _doc_stream = _parser->iterate_many(data, len, len);
+ _doc_stream = _parser->iterate_many(data, len, std::min(_batch_size, len));
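For context, a minimal sketch of what the clamp buys, with hypothetical names (`StreamState`, `reset`) modeled on the snippet above rather than taken from the actual StarRocks code. Since simdjson sizes its working buffers from the batch size, `std::min` also keeps the allocation from ever exceeding the input length:

```cpp
#include <algorithm>
#include "simdjson.h"

// Hypothetical helper mirroring the suggested fix: (re)create the document
// stream with a batch size that never exceeds the input length.
struct StreamState {
    simdjson::ondemand::parser parser;
    simdjson::ondemand::document_stream doc_stream;
    size_t batch_size = simdjson::dom::DEFAULT_BATCH_SIZE;

    simdjson::error_code reset(const char* data, size_t len) {
        // Clamp: simdjson allocates internal buffers proportional to
        // batch_size, so a value above len only wastes memory (and, per the
        // review comment above, risks an invalid-length call).
        return parser.iterate_many(data, len, std::min(batch_size, len))
                     .get(doc_stream);
    }
};
```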
srlch force-pushed from bf7cb20 to b403633 (Signed-off-by: srlch <[email protected]>)
[Java-Extensions Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
[FE Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
[BE Incremental Coverage Report] ✅ pass : 28 / 33 (84.85%)
Why I'm doing:
In the current implementation, `JsonDocumentStreamParser` uses `simdjson::ondemand::parser::iterate_many` to parse multiple JSON documents. This API requires the caller to pass the maximum size of a single JSON document in the file (call it `max_json_length_in_file`) so it can allocate a memory chunk for the parsing process. The problem is that the caller passes the file size instead of `max_json_length_in_file`, allocating a huge memory chunk (much of which may never be used) of roughly 5~6 times the file size. This is a huge memory amplification.
What I'm doing:
Introduce `json_parse_many_batch_size` to control the batch size passed into `simdjson::ondemand::parser::iterate_many`. If `json_parse_many_batch_size > 0`, use `json_parse_many_batch_size` as the batch size; otherwise use `simdjson::dom::DEFAULT_BATCH_SIZE`. For `JsonDocumentStreamParser::get_current`, parse the doc using a relatively small buffer. If an exception is thrown because the buffer is too small, increase the buffer size and retry, as sketched below.
Fixes #issue
https://github.com/StarRocks/StarRocksTest/issues/8636
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: