Support compression codec choice in TPC-H data generation and write with pyarrow #1209

Merged
milesgranger merged 8 commits into main from milesgranger/tpch-compression-codec-to-lz4
Dec 14, 2023

Conversation

milesgranger
Contributor

@milesgranger milesgranger commented Nov 27, 2023

Support setting the compression codec of the resulting parquet files via --compression, e.g. --compression lz4.

  • Choices: [lz4, snappy, zstd, gzip, none, brotli]
  • Adds the compression codec name to the filename suffix, e.g. *.snappy.parquet
  • Uses pyarrow.parquet.write_table for all data output (see the sketch below)

image
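
A minimal sketch of the kind of write this enables (not the PR's actual code; the helper name and the suffix handling are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

def write_with_codec(table: pa.Table, stem: str, compression: str = "snappy") -> str:
    # pyarrow accepts the codec name directly, including "none" for uncompressed output
    suffix = ".parquet" if compression == "none" else f".{compression}.parquet"
    path = stem + suffix
    pq.write_table(table, path, compression=compression)
    return path

# e.g. write_with_codec(lineitem_table, "lineitem", "lz4") -> "lineitem.lz4.parquet"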

@mrocklin
Member

Hrm, it's a shame about DuckDB and LZ4. I was hoping that everyone would be able to read and write this data.

No comment from me on the code. It seems fine (although it's unfortunate to have to add arrow into the mix).

I'll be curious how profiles change for Dask when performing queries on LZ4-compressed data, both in terms of raw performance and in the line profiles. If you end up running these experiments I recommend taking a look at the --performance-report flag for runs up to scale 100 (scale 1000 and above are a bit too large to fit comfortably in a performance report).
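
If it helps, the --performance-report flag presumably wraps distributed's performance_report context manager; the underlying API (with a placeholder query function) looks roughly like this:

from dask.distributed import Client, performance_report

client = Client()  # start or connect to the cluster running the TPC-H queries

# Wrap the query of interest; an interactive HTML report is written on exit.
with performance_report(filename="tpch-q1-lz4.html"):
    run_query_1()  # placeholder for whichever query/scale is being profiled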

@milesgranger
Contributor Author

A top-level comparison at scale 100: LZ4 is much faster, roughly half the runtime of snappy. That result is highly suspect, probably relating back to the original apache/arrow#38389, since while LZ4 is more performant, it isn't 2x faster than snappy.

Also, in my experience with cramjam, the LZ4 block format is indeed faster, but not substantially faster than the snappy raw format: https://github.com/milesgranger/pyrus-cramjam/tree/master/cramjam-python/benchmarks#snappy
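
For anyone who wants to reproduce that micro-comparison, a rough sketch with cramjam (assuming its snappy.compress_raw / lz4.compress_block functions; the data and timing loop are illustrative):

import time
import cramjam

data = bytes(bytearray(range(256)) * 40_000)  # ~10 MB of mildly compressible data

for name, compress in [
    ("snappy raw", cramjam.snappy.compress_raw),
    ("lz4 block", cramjam.lz4.compress_block),
]:
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed * 1e3:.1f} ms, ratio {len(data) / len(compressed):.2f}")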

visualization

If you want the detailed view, the performance reports are attached below. TL;DR: the profiles look pretty much the same, just much longer spent decoding snappy, stemming from pyarrow AFAICT.

performance-reports-snappy.zip
performance-reports-lz4.zip

@fjetter
Member

fjetter commented Nov 28, 2023

Just to double check these results, can you also generate a parquet dataset with snappy compression that has been written by pyarrow?

@milesgranger
Contributor Author

Look at that... the snappy files we're producing with DuckDB aren't so great after all.

visualization (1)

Metadata of DuckDB produced snappy file:

  created_by: DuckDB
  num_columns: 16
  num_rows: 2568534
  num_row_groups: 21
  format_version: 1.0
  serialized_size: 29792

and PyArrow produced:

  created_by: parquet-cpp-arrow version 13.0.0
  num_columns: 16
  num_rows: 2735597
  num_row_groups: 3
  format_version: 2.6
  serialized_size: 6900

For a similar num_rows, DuckDB generated a heck of a lot more row groups, along with using an older format version...

I would think/hope this affects all engines roughly equally, performance-wise? It's starting to feel like this is separate from the original issue about deserialization performance differing on Linux, and that we simply haven't been writing the best-structured snappy parquet files here.

It'd be pretty easy to modify this PR to route all compression codecs through pyarrow, though.
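
As an illustration of the knob involved, pyarrow lets you control how many rows go into each row group when writing; the target value below is just an example, not what this PR uses:

import pyarrow.parquet as pq

# Re-write an existing file with fewer, larger row groups.
table = pq.read_table("lineitem.parquet")  # e.g. the 21-row-group DuckDB output above
pq.write_table(
    table,
    "lineitem.snappy.parquet",
    compression="snappy",
    row_group_size=1_000_000,  # max rows per row group (example value)
)
print(pq.ParquetFile("lineitem.snappy.parquet").metadata.num_row_groups)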

@mrocklin
Member

Awesome. While this may not make us faster relative to other projects, it does help us know where to focus. I suspect that there's still plenty to do with parquet and that it's still one of the highest priorities, but the relative importance maybe just dropped a little.

@mrocklin
Member

Thank you for running these experiments @milesgranger

@mrocklin
Member

Although, looking at worker profiles, it's clear that reading parquet is still our primary bottleneck. My hope is that by combining this with a switch to a new parquet reading system (similar to the POC I had written up), we can improve things considerably.

@milesgranger milesgranger changed the title Support compression codec choice in TPC-H data generation Support compression codec choice in TPC-H data generation and write with pyarrow Nov 28, 2023
@milesgranger
Contributor Author

@fjetter (no rush, I don't think...), if you want to have a look at this. I think it would be in our interest afterwards to regenerate the TPC-H datasets with 'better' parquet files produced by pyarrow. I can of course handle that.

@kszlim

kszlim commented Nov 29, 2023

I think it's also worth ensuring that the page index and statistics are written out to the parquet files. This ensures that engines which support maximal pushdown via those features get to demonstrate it in the benchmarks.
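
For reference, with pyarrow both can be requested at write time (a minimal sketch, not the PR's code; the table and path are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"l_quantity": [1, 2, 3]})  # stand-in for a TPC-H table

pq.write_table(
    table,
    "lineitem.snappy.parquet",
    compression="snappy",
    write_statistics=True,     # column-chunk min/max/null-count statistics
    write_page_index=True,     # per-page column index / offset index
)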

@milesgranger
Contributor Author

milesgranger commented Nov 30, 2023

Thanks for the reminder @kszlim

I've explicitly set that now in a95ab8b

But it does appear statistics were being written anyhow (first is the pyarrow output, second is the original DuckDB output on a different but similar-sized file):

In [2]: f = pq.ParquetFile('../lineitem.snappy.parquet')

In [3]: f.metadata.row_group(0).column(0).statistics
Out[3]:
<pyarrow._parquet.Statistics object at 0x7f0512cc6a90>
  has_min_max: True
  min: 206729988
  max: 207777761
  null_count: 0
  distinct_count: 0
  num_values: 1048576
  physical_type: INT32
  logical_type: None
  converted_type (legacy): NONE

In [4]: f = pq.ParquetFile('../lineitem.parquet')

In [5]: f.metadata.row_group(0).column(0).statistics
Out[5]:
<pyarrow._parquet.Statistics object at 0x7f0512da4040>
  has_min_max: True
  min: 1326000001
  max: 1326124934
  null_count: 0
  distinct_count: 0
  num_values: 124928
  physical_type: INT32
  logical_type: Int(bitWidth=32, isSigned=true)
  converted_type (legacy): INT_32

Edit: and the page index (207cb32), which wasn't being written before. 👍

@kszlim

kszlim commented Nov 30, 2023

Nice, thanks!

tmp,
compression=compression.value.lower(),
write_statistics=True,
write_page_index=True,
Member


pyarrow doesn't store page indices by default, and I'm not even sure it is implemented to use them during reading.
Whether this is a good idea depends on many things: file size, row group size, number of columns, etc., and for some combinations it can add overhead.
Before we enable this blindly, I would like to make sure it does not negatively impact anything.

Contributor Author


I see a small difference in file size (see metadata below) when adding page index, and no real performance difference:

❯ ls -lhs lineitem*                                                                                                         (tmp)
126M -rw-r--r--. 1 milesg milesg 126M Nov 28 08:47 lineitem.duckdb.snappy.parquet
102M -rw-r--r--. 1 milesg milesg 102M Nov 28 12:09 lineitem-stats-no-index.snappy.parquet
102M -rw-r--r--. 1 milesg milesg 102M Nov 30 11:53 lineitem-stats-and-index.snappy.parquet
❯ hyperfine 'python read-no-page-index.py' 'python read-page-index.py'                                                      (tmp)
Benchmark 1: python read-no-page-index.py
  Time (mean ± σ):      1.299 s ±  0.160 s    [User: 2.656 s, System: 2.409 s]
  Range (min … max):    0.994 s …  1.572 s    10 runs

Benchmark 2: python read-page-index.py
  Time (mean ± σ):      1.272 s ±  0.159 s    [User: 2.514 s, System: 2.381 s]
  Range (min … max):    1.084 s …  1.538 s    10 runs

Summary
  python read-page-index.py ran
    1.02 ± 0.18 times faster than python read-no-page-index.py

Then metadata:

In [5]: pq.ParquetFile('lineitem-stats-no-index.snappy.parquet').metadata
Out[5]:
<pyarrow._parquet.FileMetaData object at 0x7fc36c2da520>
  created_by: parquet-cpp-arrow version 13.0.0
  num_columns: 16
  num_rows: 2735597
  num_row_groups: 3
  format_version: 2.6
  serialized_size: 6900

In [6]: pq.ParquetFile('lineitem-stats-and-index.snappy.parquet').metadata
Out[6]:
<pyarrow._parquet.FileMetaData object at 0x7fc36c4f4fe0>
  created_by: parquet-cpp-arrow version 13.0.0
  num_columns: 16
  num_rows: 2735597
  num_row_groups: 3
  format_version: 2.6
  serialized_size: 7584
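
The two read scripts from the hyperfine comparison above aren't shown in the thread; a guess at their shape (contents hypothetical) is simply a full-table read, e.g.:

# read-page-index.py (hypothetical reconstruction; the no-index variant would
# point at lineitem-stats-no-index.snappy.parquet instead)
import pyarrow.parquet as pq

table = pq.read_table("lineitem-stats-and-index.snappy.parquet")
print(table.num_rows)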

@fjetter
Member

fjetter commented Nov 30, 2023

I'd like to point out that the pyarrow version above uses far fewer row groups, which makes the files much more compact (compression works better and the overhead per row group is smaller).
The pyarrow file has only three row groups, while the duckdb file's serialized metadata (serialized_size) is about four times as large, and that likely doesn't even include the additional per-row-group metadata. I suspect the final size on disk is significantly different for the two approaches.

@milesgranger milesgranger requested a review from fjetter December 6, 2023 15:02
@milesgranger
Contributor Author

milesgranger commented Dec 13, 2023

After comparing the current data (many row groups, generated by DuckDB) against new data generated by pyarrow (fewer row groups, with and without statistics), I found:

  • It's obviously better to use fewer row groups / pyarrow-generated data.
  • LZ4 and Snappy are basically on par in performance, as we saw here.
  • Using write_statistics=True increased file size and generally made performance a bit worse. I also cannot find any existing engines that explicitly say they make use of these, so (for now) I'm going to generate the new data with write_statistics=False.

Here's a comparison of the old/current (red), new with stats (blue), and new without stats (orange):

visualization (2)

Will update later w/ full comparison using the new data with other engines as well.

@phofl
Contributor

phofl commented Dec 13, 2023

Is it hard to rerun on scale 1000?

We saw very different things when we switched to scale 1000 initially (only row group size; I don't care about statistics).

@milesgranger
Contributor Author

Is it hard to rerun on scale 1000?

You bet, will be including that in the follow-up summary. 👍

@kszlim

kszlim commented Dec 13, 2023

After comparing the current data (many row groups, generated by DuckDB) against new data generated by pyarrow (fewer row groups, with and without statistics), I found:

  • It's obviously better to use fewer row groups / pyarrow-generated data.
  • LZ4 and Snappy are basically on par in performance, as we saw here.
  • Using write_statistics=True increased file size and generally made performance a bit worse. I also cannot find any existing engines that explicitly say they make use of these, so (for now) I'm going to generate the new data with write_statistics=False.

Here's a comparison of the old/current (red), new with stats (blue), and new without stats (orange):

visualization (2)

Will update later w/ full comparison using the new data with other engines as well.

Polars definitely utilizes statistics, and datafusion does as well. IMO it's important to keep them so that engines can demonstrate query performance fairly. Yes, they bloat files a little, but I think most people keep statistics on by default anyway.
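
As a concrete example of the pushdown statistics enable, a filtered read lets a reader skip row groups whose min/max bounds exclude the predicate (a sketch; the column and threshold are just illustrative):

import pyarrow.parquet as pq

# With row-group statistics present, row groups whose min/max for l_quantity
# fall entirely outside the predicate can be skipped without decoding.
filtered = pq.read_table(
    "lineitem.snappy.parquet",
    filters=[("l_quantity", ">", 49)],
)
print(filtered.num_rows)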

@mrocklin
Member

Polars definitely utilizes statistics, and datafusion does as well. IMO it's important to keep them so that engines can demonstrate query performance fairly. Yes, they bloat files a little, but I think most people keep statistics on by default anyway.

My guess is that in this particular set of queries the statistics aren't helpful for those systems that use them, mostly because the dataset is random. I think it totally makes sense to continue keeping statistics, but my guess is that it won't actually affect performance in a positive way for any of the projects.

@milesgranger
Contributor Author

milesgranger commented Dec 13, 2023

Ran a full engine comparison on the current files and the proposed new files (no statistics). I have no strong opinion on adding statistics; I was only nudged in the direction of leaving them off. I'm gleaning @mrocklin will have them, in which case I'll regenerate tomorrow and we'll be done with it. Otherwise, I think this PR is ready if others are happy... or even just mildly satisfied. :)

Scale 100

Current files (and also a regression in dask-expr):
scale-100-old-files

New files (w/ fixed dask-expr also):
scale-100-new-files

Scale 1000 (just dask for relative comparison)

Current files (and also a regression in dask-expr):
scale-1000-old-files

New files (w/ fixed dask-expr also):
scale-1000-new-files

@mrocklin
Member

I'm gleaning @mrocklin will have them

I doubt that it'll matter. We probably want to have them for some final production run (to avoid the appearance of tuning to make other projects worse), but I don't think that this is urgent at all.

@mrocklin
Member

Also, to briefly explain the poor Dask performance: I suspect this is still mostly due to bad parquet performance. I think that something like my WIP will be necessary before we are ready to push on the "Dask is faster than Spark" messaging. Pinging @fjetter so that he's aware of my current thinking here.

@phofl
Contributor

phofl commented Dec 13, 2023

I also merged a PR that I shouldn't have merged before fixing something else. This caused a slowdown as well.

@mrocklin
Member

mrocklin commented Dec 13, 2023 via email

@phofl
Contributor

phofl commented Dec 13, 2023

Yes, that's correct, but we are doing a lot of things twice for queries 1 and 7, which is not ideal.

@mrocklin
Member

Yup. No disagreement there :)

Contributor

@phofl phofl left a comment


LGTM, merge whenever you are ready.

FWIW: could you briefly update the top post of the PR? It still says that we want to use lz4, which might be confusing when looking back without reading all the discussion.

@milesgranger
Contributor Author

From what I've seen, the new files give roughly 2x better performance, and the remainder of the earlier difference was due to the top charts also being run with a temporary regression in dask-expr (sorry 'bout that).

Leading comment updated to clarify that we're not defaulting to lz4.

Thanks everyone! :)

@milesgranger milesgranger merged commit 8df025f into main Dec 14, 2023
@milesgranger milesgranger deleted the milesgranger/tpch-compression-codec-to-lz4 branch December 14, 2023 18:03
@hendrikmakait
Member

FYI, it looks like LZ4 support has been released in duckdb=0.10.2 (duckdb/duckdb#11220).
