Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(rust, python): properly handle json with unclosed strings #5427

Merged

Conversation

universalmind303
Copy link
Collaborator

fixes #5371

A little note about the implementation. The SplitLines iterator is not really designed to handle all the nuances of json parsing. So I decided to go back to leveraging the serde_json:: StreamDeserializer, but now we are using it exclusively as a mechanism to iterate over the rows. We can do this via the RawValue type in serde_json. Which just returns the raw str of each row without doing any parsing of the values, acting as a drop in replacement to SplitLines.

The actual parsing of the rows is still done by simd_json. So the performance impact should be minimal. I'll post some benchmarks once my release build is finished.

@github-actions github-actions bot added the fix Bug fix label Nov 4, 2022
@universalmind303
Copy link
Collaborator Author

universalmind303 commented Nov 4, 2022

Suprisingly, this actually gave us some pretty massive performance gains.

%%timeit -n 10
pl.scan_ndjson("big_file.json").collect()

on a 600M file shape: (500001, 18).

release 0.14.25

1.82 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

this branch

236 ms ± 4.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

on the "yelp_academic_dataset_review" 5.0G dataset shape: (6990280, 9)

release 0.14.25

error/invalid

this branch

3.72 s ± 881 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

for reference. The scan_csv operation on the same dataset.

2.21 s ± 21.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@universalmind303 universalmind303 added python Related to Python Polars rust Related to Rust Polars labels Nov 4, 2022
@ritchie46 ritchie46 changed the title fix: properly handle json with unclosed strings fix(rust, python): properly handle json with unclosed strings Nov 5, 2022
@ritchie46
Copy link
Member

Thanks!

@ritchie46 ritchie46 merged commit 3db8bf0 into pola-rs:master Nov 5, 2022
ChayimFriedman2 added a commit to ChayimFriedman2/polars that referenced this pull request Aug 14, 2024
Previously we use it to delimit the values. While convenience, it was not efficient (see the comment in the code).

This gives a 20% speedup.

This *could* break people's code since we will not split correctly (and thus error) if one object spans two lines or two objects are in the same line. However, such code was already broken, since NDJSON is not allowed to contain any line breaks. If this is a concern, it is possible (at some perf degradation) to check for `}\n` instead of `\n` alone, and that will make this basically equivalent to the splitting logic we have for threads.

As a nice bonus, this allows us to avoid a dependency on `serde_json` for JSON parsing (although we still use it for other things).

The original PR that introduced this usage of `serde_json` was pola-rs#5427. It was done because newline handling wasn't correct. However, as I said above, it is very simple: newlines are not allowed everywhere except between values. And even if we decide we want to handle non-spec-compliant NDJSON, we still don't handle it properly as we can break thread chunks in the middle of a string.

The abovementioned PR also said this had massive perf gains. However, I cannot reproduce that. I've checked out the repo at this time, and this PR was a definite regression. It is also expected, given that `serde_json::StreamDeserializer` does a lot of additional work, and it also shows up in profiles. It was probably benchmarked incorrectly (maybe with a debug build?).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_ndjson and scan_ndjson stop randomly after part of file
2 participants