perf: Speedup ndjson reader `~40%` #18197

ChayimFriedman2 · 2024-08-14T17:44:41Z

This speeds it up by 40% in the following benchmark:

use criterion::{criterion_group, criterion_main, Criterion};
use mimalloc::MiMalloc;
use polars::prelude::*;
use polars_core::POOL;
use rand::{thread_rng, Rng};

#[global_allocator]
static MIMALLOC: MiMalloc = MiMalloc;

fn from_json(json: &[u8]) -> DataFrame {
    JsonReader::new(std::io::Cursor::new(json))
        .with_json_format(JsonFormat::JsonLines)
        .set_rechunk(false)
        .finish()
        .unwrap()
}

fn my_benchmark(_c: &mut Criterion) {
    POOL.install(|| {
        let mut c = Criterion::default().configure_from_args();

        const SIZE: i32 = 5_000_000;
        let mut rng = thread_rng();
        let mut df = df![
            "a" => (0..SIZE).map(|_| rng.gen::<i32>()).collect::<Vec<_>>(),
            "b" => (0..SIZE).map(|v| v.to_string()).collect::<Vec<_>>(),
        ]
        .unwrap();

        let mut json = Vec::new();
        JsonWriter::new(&mut json)
            .with_json_format(JsonFormat::JsonLines)
            .finish(&mut df)
            .unwrap();

        c.bench_function("JSON Lines Deserialization", |b| {
            b.iter(|| from_json(&json))
        });
    });
}

criterion_group!(benches, my_benchmark);
criterion_main!(benches);

I had plans to improve it further, but I can't find the time, and that work would involve bigger changes, possibly even a rewrite of the mechanism. The changes in this PR are simple and effective, so I decided to send them as they are.

Best reviewed commit-by-commit.

A warning from the second commit, repeated here for noticeability:

This could break people's code since we will not split correctly (and thus error) if one object spans two lines or two objects are on the same line. However, such code was already broken, since NDJSON values are not allowed to contain line breaks. If this is a concern, it is possible (at some perf degradation) to check for `}\n` instead of `\n` alone, and that will make this basically equivalent to the splitting logic we have for threads.

This simple change speeds up NDJSON reading by 30%.
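The general idea of buffer reuse can be sketched without `simd_json`: allocate the scratch space once and hand it to every per-line parse, instead of allocating fresh buffers for each line. This is a stand-in under stated assumptions, not the code from this PR. `Scratch`, `parse_line_reusing`, `parse_line_fresh`, and `parse_all` are hypothetical names, and the "parsing" here only counts non-whitespace bytes.

```rust
/// Hypothetical scratch space, standing in for `simd_json::Buffers`.
#[derive(Default)]
struct Scratch {
    bytes: Vec<u8>,
}

/// Toy "parse": copies the non-whitespace bytes of `line` into the scratch
/// buffer and returns how many there were.
fn parse_line_reusing(line: &[u8], scratch: &mut Scratch) -> usize {
    // `clear` keeps the capacity, so after warm-up no reallocation is needed.
    scratch.bytes.clear();
    scratch
        .bytes
        .extend(line.iter().copied().filter(|b| !b.is_ascii_whitespace()));
    scratch.bytes.len()
}

/// The slow shape: a fresh scratch buffer (and allocation) on every call.
#[allow(dead_code)]
fn parse_line_fresh(line: &[u8]) -> usize {
    let mut scratch = Scratch::default();
    parse_line_reusing(line, &mut scratch)
}

/// The fast shape: one scratch buffer shared by all lines.
fn parse_all(input: &[u8]) -> Vec<usize> {
    let mut scratch = Scratch::default();
    input
        .split(|&b| b == b'\n')
        .filter(|line| !line.is_empty())
        .map(|line| parse_line_reusing(line, &mut scratch))
        .collect()
}
```

Both shapes return the same results; the difference is only in how often memory is allocated, which is where a per-line parser over millions of rows can lose time.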

Previously we used `serde_json` to delimit the values. While convenient, it was not efficient (see the comment in the code). This gives a 20% speedup.

This *could* break people's code, since we will not split correctly (and thus error) if one object spans two lines or two objects are on the same line. However, such code was already broken, since NDJSON values are not allowed to contain line breaks. If this is a concern, it is possible (at some perf degradation) to check for `}\n` instead of `\n` alone, and that will make this basically equivalent to the splitting logic we have for threads.

As a nice bonus, this allows us to avoid a dependency on `serde_json` for JSON parsing (although we still use it for other things).

The original PR that introduced this usage of `serde_json` was pola-rs#5427. It was done because newline handling wasn't correct. However, as I said above, the rule is very simple: newlines are not allowed anywhere except between values. And even if we decide we want to handle non-spec-compliant NDJSON, we still don't handle it properly, as we can break thread chunks in the middle of a string.

The above-mentioned PR also said this had massive perf gains. However, I cannot reproduce that. I checked out the repo as of that time, and that PR was a definite regression. This is also expected, given that `serde_json::StreamDeserializer` does a lot of additional work, and it also shows up in profiles. It was probably benchmarked incorrectly (maybe with a debug build?).
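The two splitting strategies discussed in this commit message can be sketched as follows. This is a minimal illustration using only the standard library, not the code in `crates/polars-io/src/ndjson/core.rs`; the function names `split_on_newline` and `split_on_brace_newline` are hypothetical.

```rust
/// Fast path: every `\n` ends a value. Blank lines are skipped.
fn split_on_newline(input: &[u8]) -> Vec<&[u8]> {
    input
        .split(|&b| b == b'\n')
        .filter(|line| !line.iter().all(|b| b.is_ascii_whitespace()))
        .collect()
}

/// Stricter variant: only break at a `\n` that directly follows `}`.
/// This tolerates an object that spans several lines, at the cost of
/// inspecting the previous byte at every newline.
fn split_on_brace_newline(input: &[u8]) -> Vec<&[u8]> {
    let mut out = Vec::new();
    let mut start = 0;
    for i in 0..input.len() {
        if input[i] == b'\n' && i > 0 && input[i - 1] == b'}' {
            out.push(&input[start..i]);
            start = i + 1;
        }
    }
    // Keep a trailing value that is not terminated by a newline.
    if start < input.len() {
        out.push(&input[start..]);
    }
    out
}
```

For spec-compliant input both functions yield the same pieces. For an object spanning two lines, such as `{"a":` followed by `1}`, the first function yields two fragments that a JSON parser would reject, while the second yields the whole object as one piece.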

This code errors on invalid JSON. But `simd_json` will already error on invalid JSON (and we'll propagate that), so I see no reason to keep it. In addition, a side effect of that code is that it also rejects some valid JSON: the empty object (`{}`). An empty dataframe does not seem useful, but I see no reason to *forbid* it. Also, the empty object may appear in a non-empty dataframe, to signal an all-null row. As a nice side benefit, this also improves perf by 3.5%, but that could be just noise.

codecov · 2024-08-14T18:14:29Z

Codecov Report

Attention: Patch coverage is 96.96970% with 1 line in your changes missing coverage. Please review.

Project coverage is 80.31%. Comparing base (aa1950c) to head (ecf762d).
Report is 106 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| crates/polars-io/src/ndjson/core.rs | 96.96% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #18197      +/-   ##
==========================================
- Coverage   80.35%   80.31%   -0.05%     
==========================================
  Files        1492     1498       +6     
  Lines      196332   198748    +2416     
  Branches     2813     2833      +20     
==========================================
+ Hits       157759   159618    +1859     
- Misses      38052    38603     +551     
- Partials      521      527       +6


ritchie46

Really nice PR and great review with those commit messages. I have one question.

crates/polars-io/src/ndjson/core.rs

ChayimFriedman2 added 3 commits July 30, 2024 12:48

Reuse the simd_json::Buffers for NDJSON parsing

500ccd7

This simple change speeds up NDJSON reading by 30%.

ChayimFriedman2 requested review from ritchie46, orlp and c-peters as code owners August 14, 2024 17:44

ritchie46 changed the title ~~Speedup ndjson reader~~ perf: Speedup ndjson reader Aug 15, 2024

github-actions bot added the performance, python, and rust labels Aug 15, 2024

ritchie46 reviewed Aug 15, 2024

ritchie46 changed the title ~~perf: Speedup ndjson reader~~ perf: Speedup ndjson reader ~40% Aug 15, 2024

ritchie46 merged commit 8476f8c into pola-rs:main Aug 15, 2024
23 of 24 checks passed

ChayimFriedman2 deleted the speedup-json-reader branch August 15, 2024 10:34
