-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: Speedup ndjson reader ~40%
#18197
Conversation
This simple change speeds up NDJSON reading by 30%.
Previously we use it to delimit the values. While convenience, it was not efficient (see the comment in the code). This gives a 20% speedup. This *could* break people's code since we will not split correctly (and thus error) if one object spans two lines or two objects are in the same line. However, such code was already broken, since NDJSON is not allowed to contain any line breaks. If this is a concern, it is possible (at some perf degradation) to check for `}\n` instead of `\n` alone, and that will make this basically equivalent to the splitting logic we have for threads. As a nice bonus, this allows us to avoid a dependency on `serde_json` for JSON parsing (although we still use it for other things). The original PR that introduced this usage of `serde_json` was pola-rs#5427. It was done because newline handling wasn't correct. However, as I said above, it is very simple: newlines are not allowed everywhere except between values. And even if we decide we want to handle non-spec-compliant NDJSON, we still don't handle it properly as we can break thread chunks in the middle of a string. The abovementioned PR also said this had massive perf gains. However, I cannot reproduce that. I've checked out the repo at this time, and this PR was a definite regression. It is also expected, given that `serde_json::StreamDeserializer` does a lot of additional work, and it also shows up in profiles. It was probably benchmarked incorrectly (maybe with a debug build?).
This code errors for invalid JSON. But simd_json will already error (and we'll propagate that) for invalid JSON, so I see no reason for that. In addition, a side-effect of that code is that it will also reject some valid JSON: the empty object (`{}`). An empty dataframe seems non-useful, but I see no reason to *forbid* it. Also, the empty object may appear in a non-empty dataframe, to signal an all-null row. As a nice side benefit, this also improves perf by 3.5%, but that could be just noise.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #18197 +/- ##
==========================================
- Coverage 80.35% 80.31% -0.05%
==========================================
Files 1492 1498 +6
Lines 196332 198748 +2416
Branches 2813 2833 +20
==========================================
+ Hits 157759 159618 +1859
- Misses 38052 38603 +551
- Partials 521 527 +6 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice PR and great review with those commit messages. I have one question.
~40%
This speeds it up by 40% in the following benchmark:
I had a plan to improve it more, but I can't find time for that and this will involve bigger changes, even a rewrite of the mechanism, while the changes in this PR are simple and effective, so I thought I'll just send them.
Best reviewed commit-by-commit.
A warning from the second commit, repeated here for noticeability: