perf: Speedup ndjson reader `~40%` #18197

This simple change speeds up NDJSON reading by 30%.

Previously we use it to delimit the values. While convenience, it was not efficient (see the comment in the code). This gives a 20% speedup. This *could* break people's code since we will not split correctly (and thus error) if one object spans two lines or two objects are in the same line. However, such code was already broken, since NDJSON is not allowed to contain any line breaks. If this is a concern, it is possible (at some perf degradation) to check for `}\n` instead of `\n` alone, and that will make this basically equivalent to the splitting logic we have for threads. As a nice bonus, this allows us to avoid a dependency on `serde_json` for JSON parsing (although we still use it for other things). The original PR that introduced this usage of `serde_json` was pola-rs#5427. It was done because newline handling wasn't correct. However, as I said above, it is very simple: newlines are not allowed everywhere except between values. And even if we decide we want to handle non-spec-compliant NDJSON, we still don't handle it properly as we can break thread chunks in the middle of a string. The abovementioned PR also said this had massive perf gains. However, I cannot reproduce that. I've checked out the repo at this time, and this PR was a definite regression. It is also expected, given that `serde_json::StreamDeserializer` does a lot of additional work, and it also shows up in profiles. It was probably benchmarked incorrectly (maybe with a debug build?).

This code errors for invalid JSON. But simd_json will already error (and we'll propagate that) for invalid JSON, so I see no reason for that. In addition, a side-effect of that code is that it will also reject some valid JSON: the empty object (`{}`). An empty dataframe seems non-useful, but I see no reason to *forbid* it. Also, the empty object may appear in a non-empty dataframe, to signal an all-null row. As a nice side benefit, this also improves perf by 3.5%, but that could be just noise.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: Speedup ndjson reader `~40%` #18197

perf: Speedup ndjson reader `~40%` #18197

Commits on Jul 30, 2024

Commits on Aug 14, 2024

perf: Speedup ndjson reader ~40% #18197

perf: Speedup ndjson reader ~40% #18197

Commits on Jul 30, 2024

Commits on Aug 14, 2024

perf: Speedup ndjson reader `~40%` #18197

perf: Speedup ndjson reader `~40%` #18197