Support compression codec choice in TPC-H data generation and write with pyarrow #1209
Conversation
Hrm, it's a shame about DuckDB and LZ4. I was hoping that everyone would be able to read and write this data. No comment from me on the code; it seems fine (although it's unfortunate to have to add arrow into the mix). I'll be curious how profiles change for Dask when performing queries on LZ4-compressed data, both in terms of raw performance and also line profiles. If you end up running these experiments I recommend taking a look at the
In a top-level comparison at scale 100, LZ4 is much faster: roughly half the runtime of snappy. That is highly suspect, and probably relates back to the original apache/arrow#38389, since while LZ4 is more performant, it isn't 2x faster than snappy. Also, in my experience with cramjam, the LZ4 block format is indeed faster, but not substantially faster than the raw snappy format: https://github.com/milesgranger/pyrus-cramjam/tree/master/cramjam-python/benchmarks#snappy Here are the performance reports if you want the detailed view. TL;DR: the profiles are pretty much the same, just much longer to decode snappy, stemming from pyarrow AFAICT.
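For a rough sense of what that cramjam comparison measures, a toy timing can be run locally (a sketch, not the linked benchmark; it assumes cramjam's lz4.compress_block and snappy.compress_raw entry points and a synthetic payload):

import time
import cramjam

# Synthetic, compressible payload; the linked benchmarks use real datasets.
data = b"some moderately repetitive payload " * 100_000

for name, compress in [("lz4 block", cramjam.lz4.compress_block),
                       ("snappy raw", cramjam.snappy.compress_raw)]:
    start = time.perf_counter()
    out = compress(data)
    print(f"{name}: {time.perf_counter() - start:.4f}s -> {len(out)} bytes")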
Just to double-check these results, can you also generate a parquet dataset with snappy compression that has been written by pyarrow?
Look at that... the snappy files we're producing with DuckDB are not so great after all. Metadata of the DuckDB-produced snappy file:
and the PyArrow-produced one:
For similar files, I would think/hope this affects all engines about the same, performance-wise? It's starting to feel like this is different from the original issue about deserialization performance differing on Linux, and that we simply haven't been writing the best-structured snappy parquet files here. It'd be pretty easy to modify this PR to route all compression codecs through pyarrow, though.
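Routing everything through pyarrow would amount to something like this (a minimal sketch; the tiny table and file name are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

# Tiny stand-in table; the generator would hand over the real TPC-H data.
table = pa.table({"l_orderkey": [1, 2, 3]})

# pyarrow accepts per-file codecs such as "snappy", "lz4", "zstd", "gzip".
pq.write_table(table, "lineitem.snappy.parquet", compression="snappy")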
Awesome. While this may not make us faster relative to other projects, it does help us know where to focus. I suspect that there's still plenty to do with parquet and that it's still one of the highest priorities, but its relative importance may have just dropped a little.
Thank you for running these experiments @milesgranger
Although, looking at worker profiles, it's clear that reading parquet is still our primary bottleneck. My hope is that by combining this with a switch to a new parquet reading system (similar to the POC I had written up) we can improve things considerably.
@fjetter (no rush, I don't think...)
I think it's also worth ensuring that the page index + statistics are written out to the parquet files. That way, engines that support maximal pushdown by utilizing those features can demonstrate it in the benchmarks.
Thanks for the reminder @kszlim. I've explicitly set that now in a95ab8b. But it does appear it was being written anyhow (first is pyarrow output, second is from the original DuckDB output on different but similarly sized files):

In [2]: f = pq.ParquetFile('../lineitem.snappy.parquet')
In [3]: f.metadata.row_group(0).column(0).statistics
Out[3]:
<pyarrow._parquet.Statistics object at 0x7f0512cc6a90>
has_min_max: True
min: 206729988
max: 207777761
null_count: 0
distinct_count: 0
num_values: 1048576
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
In [4]: f = pq.ParquetFile('../lineitem.parquet')
In [5]: f.metadata.row_group(0).column(0).statistics
Out[5]:
<pyarrow._parquet.Statistics object at 0x7f0512da4040>
has_min_max: True
min: 1326000001
max: 1326124934
null_count: 0
distinct_count: 0
num_values: 124928
physical_type: INT32
logical_type: Int(bitWidth=32, isSigned=true)
converted_type (legacy): INT_32

Edit: and the page index too (207cb32), which wasn't being written before. 👍
Nice, thanks!
tmp,
compression=compression.value.lower(),
write_statistics=True,
write_page_index=True,
pyarrow doesn't store page indices by default, and I'm not even sure whether it's implemented to use them during reading.
Whether this is a good idea depends on many different things (file size, row-group size, number of columns, etc.), and for some constellations it can cause overhead.
Before we enable this blindly, I would like to make sure this does not negatively impact anything.
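For context, the full write call in the diff above presumably has roughly this shape (a reconstruction, not the PR's exact code; the function wrapper and type hints are added here):

import pyarrow as pa
import pyarrow.parquet as pq

def write(table: pa.Table, tmp: str, compression) -> None:
    pq.write_table(
        table,
        tmp,
        compression=compression.value.lower(),  # e.g. "snappy" or "lz4"
        write_statistics=True,   # column-chunk min/max/null-count statistics
        write_page_index=True,   # the page index being discussed here
    )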
I see a small difference in file size (see metadata below) when adding page index, and no real performance difference:
❯ ls -lhs lineitem* (tmp)
126M -rw-r--r--. 1 milesg milesg 126M Nov 28 08:47 lineitem.duckdb.snappy.parquet
102M -rw-r--r--. 1 milesg milesg 102M Nov 28 12:09 lineitem-stats-no-index.snappy.parquet
102M -rw-r--r--. 1 milesg milesg 102M Nov 30 11:53 lineitem-stats-and-index.snappy.parquet
❯ hyperfine 'python read-no-page-index.py' 'python read-page-index.py' (tmp)
Benchmark 1: python read-no-page-index.py
Time (mean ± σ): 1.299 s ± 0.160 s [User: 2.656 s, System: 2.409 s]
Range (min … max): 0.994 s … 1.572 s 10 runs
Benchmark 2: python read-page-index.py
Time (mean ± σ): 1.272 s ± 0.159 s [User: 2.514 s, System: 2.381 s]
Range (min … max): 1.084 s … 1.538 s 10 runs
Summary
python read-page-index.py ran
1.02 ± 0.18 times faster than python read-no-page-index.py
Then metadata:
In [5]: pq.ParquetFile('lineitem-stats-no-index.snappy.parquet').metadata
Out[5]:
<pyarrow._parquet.FileMetaData object at 0x7fc36c2da520>
created_by: parquet-cpp-arrow version 13.0.0
num_columns: 16
num_rows: 2735597
num_row_groups: 3
format_version: 2.6
serialized_size: 6900
In [6]: pq.ParquetFile('lineitem-stats-and-index.snappy.parquet').metadata
Out[6]:
<pyarrow._parquet.FileMetaData object at 0x7fc36c4f4fe0>
created_by: parquet-cpp-arrow version 13.0.0
num_columns: 16
num_rows: 2735597
num_row_groups: 3
format_version: 2.6
serialized_size: 7584
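The read scripts driving the hyperfine comparison weren't posted; they presumably amount to little more than this (a hypothetical reconstruction, one script per file from the listing above):

# read-page-index.py (read-no-page-index.py would differ only in the filename)
import pyarrow.parquet as pq

table = pq.read_table("lineitem-stats-and-index.snappy.parquet")
print(table.num_rows)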
I'd like to point out that the pyarrow version above uses far fewer row groups, which makes the files much more compact (compression works better and the per-row-group overhead is smaller).
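Row-group count is a write-time choice; with pyarrow it's controlled via row_group_size (the table and value here are illustrative, not what this PR uses):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"l_orderkey": list(range(3_000_000))})

# Larger row groups mean fewer groups, better compression, and less per-group
# metadata overhead; smaller groups allow finer-grained pruning/parallelism.
pq.write_table(table, "lineitem.snappy.parquet", row_group_size=1_000_000)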
After comparing the current data (many row groups, generated by DuckDB) against new data generated by pyarrow (fewer row groups, with and without statistics), I discovered:
Here's a comparison of the old/current (red), new with stats (blue), and new without stats (orange): Will update later with a full comparison using the new data with other engines as well.
Is it hard to rerun on scale 1000? We saw very different things when we switched to scale 1000 initially. (Only row-group size; I don't care about statistics.)
You bet, will be including that in the follow-up summary. 👍
Polars definitely utilizes statistics, and DataFusion does as well. IMO it's important to keep them to demonstrate query-engine performance fairly. Yes, they bloat files a little, but I think most people keep statistics on by default anyway.
My guess is that in this particular set of queries the statistics aren't helpful for the systems that use them, mostly because the dataset is random. I think it totally makes sense to continue keeping statistics, but my guess is that they won't actually affect performance in a positive way for any of the projects.
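For illustration, this is the kind of read where row-group statistics pay off: the reader compares each row group's min/max against the predicate and skips groups that cannot match. With randomly distributed TPC-H data the ranges overlap, so little gets skipped (column name and threshold are illustrative):

import pyarrow.parquet as pq

# Row groups whose l_orderkey min/max range cannot satisfy the filter are
# skipped entirely, without decompressing their pages.
table = pq.read_table(
    "lineitem.snappy.parquet",
    filters=[("l_orderkey", ">", 5_000_000)],
)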
Ran a full engine comparison on the current and the proposed new files (no statistics). I have no strong opinion on adding statistics; I was only nudged in the direction of going with the pyarrow default (no statistics). I gather @mrocklin would have them, in which case I'll regenerate tomorrow and we'll be done with it. Otherwise, I think this PR is ready if others are happy... or even just mildly satisfied. :)

Scale 100

Current files (and also a regression in dask-expr):

New files (w/ fixed dask-expr also):

Scale 1000 (just dask for relative comparison)
I doubt that it'll matter. We probably want to have them for any final production (to avoid the appearance of tuning to make other projects look worse). I don't think this is urgent at all.
Also, to briefly explain the poor Dask performance: I suspect this is still mostly due to bad parquet performance. I think something like my WIP will be necessary before we're ready to push the "Dask is faster than Spark" messaging. Pinging @fjetter so that he's aware of my current thinking here.
I also merged a PR that I shouldn't have merged before fixing something else. This caused a slowdown as well.
I'd bet good money that we're still spending 75+% of our time in read_parquet.
Yes, that's correct, but we are doing a lot of things twice for queries 1 and 7, which is not ideal.
Yup. No disagreement there :)
DuckDB doesn't cover LZ4
LGTM, merge whenever you are ready.
FWIW: could you briefly update the top post of the PR? It still says that we want to use LZ4, which might be confusing when looking back without reading the whole discussion.
From what I've seen, we get about ~2x better performance with the new files; the remainder was due to the top charts also being run with a temporary regression in dask-expr (sorry 'bout that). Leading comment updated to note that we're not defaulting to LZ4. Thanks everyone! :)
FYI, it looks like LZ4 support has been released in duckdb=0.10.2 (duckdb/duckdb#11220).
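With that release DuckDB can write the LZ4 files too; a sketch, assuming duckdb>=0.10.2 (the accepted codec spelling may be LZ4 or LZ4_RAW depending on version):

import duckdb

# Re-encode an existing snappy file as LZ4 via DuckDB's parquet COPY options.
duckdb.sql("""
    COPY (SELECT * FROM read_parquet('lineitem.snappy.parquet'))
    TO 'lineitem.lz4.parquet' (FORMAT PARQUET, COMPRESSION LZ4)
""")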
Support setting the compression codec in the resulting parquet files, e.g. --compression lz4. Output files carry the codec in their suffix, e.g. *.snappy.parquet.
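Wiring the flag through to the writer might look roughly like this (a hypothetical sketch of the shape of the change, not the PR's actual code; the enum mirrors the compression.value.lower() call in the diff above):

import enum

import pyarrow as pa
import pyarrow.parquet as pq

class Compression(enum.Enum):
    SNAPPY = "SNAPPY"
    LZ4 = "LZ4"
    ZSTD = "ZSTD"

def write_table(table: pa.Table, stem: str, compression: Compression) -> None:
    codec = compression.value.lower()
    # Output files carry the codec in their suffix, e.g. lineitem.snappy.parquet.
    pq.write_table(
        table,
        f"{stem}.{codec}.parquet",
        compression=codec,
        write_statistics=True,
        write_page_index=True,
    )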