Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: Speedup ndjson reader ~40% #18197

Merged
merged 3 commits into from
Aug 15, 2024

Commits on Jul 30, 2024

  1. Reuse the simd_json::Buffers for NDJSON parsing

    This simple change speeds up NDJSON reading by 30%.
    ChayimFriedman2 committed Jul 30, 2024
    Configuration menu
    Copy the full SHA
    500ccd7 View commit details
    Browse the repository at this point in the history

Commits on Aug 14, 2024

  1. Avoid going through serde_json while deserializing

    Previously we use it to delimit the values. While convenience, it was not efficient (see the comment in the code).
    
    This gives a 20% speedup.
    
    This *could* break people's code since we will not split correctly (and thus error) if one object spans two lines or two objects are in the same line. However, such code was already broken, since NDJSON is not allowed to contain any line breaks. If this is a concern, it is possible (at some perf degradation) to check for `}\n` instead of `\n` alone, and that will make this basically equivalent to the splitting logic we have for threads.
    
    As a nice bonus, this allows us to avoid a dependency on `serde_json` for JSON parsing (although we still use it for other things).
    
    The original PR that introduced this usage of `serde_json` was pola-rs#5427. It was done because newline handling wasn't correct. However, as I said above, it is very simple: newlines are not allowed everywhere except between values. And even if we decide we want to handle non-spec-compliant NDJSON, we still don't handle it properly as we can break thread chunks in the middle of a string.
    
    The abovementioned PR also said this had massive perf gains. However, I cannot reproduce that. I've checked out the repo at this time, and this PR was a definite regression. It is also expected, given that `serde_json::StreamDeserializer` does a lot of additional work, and it also shows up in profiles. It was probably benchmarked incorrectly (maybe with a debug build?).
    ChayimFriedman2 committed Aug 14, 2024
    Configuration menu
    Copy the full SHA
    79fe29a View commit details
    Browse the repository at this point in the history
  2. Remove seemingly unnecessary code

    This code errors for invalid JSON. But simd_json will already error (and we'll propagate that) for invalid JSON, so I see no reason for that.
    In addition, a side-effect of that code is that it will also reject some valid JSON: the empty object (`{}`). An empty dataframe seems non-useful, but I see no reason to *forbid* it. Also, the empty object may appear in a non-empty dataframe, to signal an all-null row.
    
    As a nice side benefit, this also improves perf by 3.5%, but that could be just noise.
    ChayimFriedman2 committed Aug 14, 2024
    Configuration menu
    Copy the full SHA
    ecf762d View commit details
    Browse the repository at this point in the history