Support compression codec choice in TPC-H data generation and write with pyarrow #1209

Merged
milesgranger merged 8 commits into main from milesgranger/tpch-compression-codec-to-lz4
Dec 14, 2023

Conversation

milesgranger
Contributor

@milesgranger milesgranger commented Nov 27, 2023

Support setting the compression codec of the resulting parquet files via --compression, e.g. --compression lz4.

  • Choices: [lz4, snappy, zstd, gzip, none, brotli]
  • Adds the compression codec name to the filename suffix, e.g. *.snappy.parquet
  • Uses pyarrow.parquet.write_table for all data output (see the sketch below)

image
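
A minimal sketch of the kind of write this enables (not the PR's actual code; the helper name and the suffix handling are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

def write_with_codec(table: pa.Table, stem: str, compression: str = "snappy") -> str:
    # pyarrow accepts the codec name directly, including "none" for uncompressed output
    suffix = ".parquet" if compression == "none" else f".{compression}.parquet"
    path = stem + suffix
    pq.write_table(table, path, compression=compression)
    return path

# e.g. write_with_codec(lineitem_table, "lineitem", "lz4") -> "lineitem.lz4.parquet"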

@mrocklin
Member

Hrm, it's a shame about DuckDB and LZ4. I was hoping that everyone would be able to read and write this data.

No comment from me on the code. It seems fine (although it's unfortunate to have to add arrow into the mix).

I'll be curious how profiles change for Dask when performing queries on LZ4-compressed data, both in terms of raw performance and in the line profiles. If you end up running these experiments I recommend taking a look at the --performance-report flag for runs up to scale 100 (scale 1000 and above are a bit too large to fit comfortably in a performance report).
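
If it helps, the --performance-report flag presumably wraps distributed's performance_report context manager; the underlying API (with a placeholder query function) looks roughly like this:

from dask.distributed import Client, performance_report

client = Client()  # start or connect to the cluster running the TPC-H queries

# Wrap the query of interest; an interactive HTML report is written on exit.
with performance_report(filename="tpch-q1-lz4.html"):
    run_query_1()  # placeholder for whichever query/scale is being profiled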

@milesgranger
Contributor Author

A top-level comparison at scale 100: LZ4 is much faster, roughly half the runtime of snappy. That result is highly suspect, probably relating back to the original apache/arrow#38389, since while LZ4 is more performant, it isn't 2x faster than snappy.

Also, in my experience with cramjam, the LZ4 block format is indeed faster, but not substantially faster than the snappy raw format: https://github.com/milesgranger/pyrus-cramjam/tree/master/cramjam-python/benchmarks#snappy
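
For anyone who wants to reproduce that micro-comparison, a rough sketch with cramjam (assuming its snappy.compress_raw / lz4.compress_block functions; the data and timing loop are illustrative):

import time
import cramjam

data = bytes(bytearray(range(256)) * 40_000)  # ~10 MB of mildly compressible data

for name, compress in [
    ("snappy raw", cramjam.snappy.compress_raw),
    ("lz4 block", cramjam.lz4.compress_block),
]:
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed * 1e3:.1f} ms, ratio {len(data) / len(compressed):.2f}")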

visualization

If you want the detailed view, the performance reports are attached below. TL;DR: the profiles look pretty much the same, just much longer spent decoding snappy, stemming from pyarrow AFAICT.

performance-reports-snappy.zip
performance-reports-lz4.zip

@fjetter
Member

fjetter commented Nov 28, 2023

Just to double check these results, can you also generate a parquet dataset with snappy compression that has been written by pyarrow?

@milesgranger
Contributor Author

Look at that... the snappy files we're producing with DuckDB aren't so great after all.

visualization (1)

Metadata of DuckDB produced snappy file:

  created_by: DuckDB
  num_columns: 16
  num_rows: 2568534
  num_row_groups: 21
  format_version: 1.0
  serialized_size: 29792

and PyArrow produced:

  created_by: parquet-cpp-arrow version 13.0.0
  num_columns: 16
  num_rows: 2735597
  num_row_groups: 3
  format_version: 2.6
  serialized_size: 6900

For a similar num_rows, DuckDB generated a heck of a lot more row groups, along with using an older format version...

I would think/hope this affects all engines roughly equally, performance-wise? It's starting to feel like this is separate from the original issue about deserialization performance differing on Linux, and that we simply haven't been writing the best-structured snappy parquet files here.

It'd be pretty easy to modify this PR to route all compression codecs through pyarrow, though.
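
As an illustration of the knob involved, pyarrow lets you control how many rows go into each row group when writing; the target value below is just an example, not what this PR uses:

import pyarrow.parquet as pq

# Re-write an existing file with fewer, larger row groups.
table = pq.read_table("lineitem.parquet")  # e.g. the 21-row-group DuckDB output above
pq.write_table(
    table,
    "lineitem.snappy.parquet",
    compression="snappy",
    row_group_size=1_000_000,  # max rows per row group (example value)
)
print(pq.ParquetFile("lineitem.snappy.parquet").metadata.num_row_groups)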

@mrocklin
Member

Awesome. While this may not make us faster relative to other projects, it does help us know where to focus. I suspect that there's still plenty to do with parquet and that it's still one of the highest priorities, but the relative importance maybe just dropped a little.

@mrocklin
Member

Thank you for running these experiments @milesgranger

@mrocklin
Member

Although, looking at worker profiles, it's clear that reading parquet is still our primary bottleneck. My hope is that by combining this with a switch to a new parquet reading system (similar to the POC I had written up), we can improve things considerably.

@milesgranger milesgranger changed the title Support compression codec choice in TPC-H data generation Support compression codec choice in TPC-H data generation and write with pyarrow Nov 28, 2023
@milesgranger
Contributor Author

@fjetter (no rush, I don't think...), if you want to have a look at this. I think it would be in our interest afterwards to regenerate the TPC-H datasets with 'better' parquet files produced by pyarrow. I can of course handle that.

@kszlim

kszlim commented Nov 29, 2023

I think it's also worth ensuring that the page index and statistics are written out to the parquet files. This ensures that engines which support maximal pushdown via those features get to demonstrate it in the benchmarks.
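
For reference, with pyarrow both can be requested at write time (a minimal sketch, not the PR's code; the table and path are illustrative):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"l_quantity": [1, 2, 3]})  # stand-in for a TPC-H table

pq.write_table(
    table,
    "lineitem.snappy.parquet",
    compression="snappy",
    write_statistics=True,     # column-chunk min/max/null-count statistics
    write_page_index=True,     # per-page column index / offset index
)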

@milesgranger
Contributor Author

milesgranger commented Nov 30, 2023

Thanks for the reminder @kszlim

I've explicitly set that now in a95ab8b

But it does appear statistics were being written anyhow (first is the pyarrow output, second is the original DuckDB output on a different but similar-sized file):

In [2]: f = pq.ParquetFile('../lineitem.snappy.parquet')

In [3]: f.metadata.row_group(0).column(0).statistics
Out[3]:
<pyarrow._parquet.Statistics object at 0x7f0512cc6a90>
  has_min_max: True
  min: 206729988
  max: 207777761
  null_count: 0
  distinct_count: 0
  num_values: 1048576
  physical_type: INT32
  logical_type: None
  converted_type (legacy): NONE

In [4]: f = pq.ParquetFile('../lineitem.parquet')

In [5]: f.metadata.row_group(0).column(0).statistics
Out[5]:
<pyarrow._parquet.Statistics object at 0x7f0512da4040>
  has_min_max: True
  min: 1326000001
  max: 1326124934
  null_count: 0
  distinct_count: 0
  num_values: 124928
  physical_type: INT32
  logical_type: Int(bitWidth=32, isSigned=true)
  converted_type (legacy): INT_32

Edit: and the page index (207cb32), which wasn't being written before. 👍

@kszlim

kszlim commented Nov 30, 2023

Nice, thanks!

tmp,
compression=compression.value.lower(),
write_statistics=True,
write_page_index=True,
Member


pyarrow doesn't store page indices by default, and I'm not even sure it is implemented to use them during reading.
Whether this is a good idea depends on many things: file size, row group size, number of columns, etc., and for some combinations it can add overhead.
Before we enable this blindly, I would like to make sure it does not negatively impact anything.

Contributor Author


I see a small difference in file size (see metadata below) when adding page index, and no real performance difference:

❯ ls -lhs lineitem*                                                                                                         (tmp)
126M -rw-r--r--. 1 milesg milesg 126M Nov 28 08:47 lineitem.duckdb.snappy.parquet
102M -rw-r--r--. 1 milesg milesg 102M Nov 28 12:09 lineitem-stats-no-index.snappy.parquet
102M -rw-r--r--. 1 milesg milesg 102M Nov 30 11:53 lineitem-stats-and-index.snappy.parquet
❯ hyperfine 'python read-no-page-index.py' 'python read-page-index.py'                                                      (tmp)
Benchmark 1: python read-no-page-index.py
  Time (mean ± σ):      1.299 s ±  0.160 s    [User: 2.656 s, System: 2.409 s]
  Range (min … max):    0.994 s …  1.572 s    10 runs

Benchmark 2: python read-page-index.py
  Time (mean ± σ):      1.272 s ±  0.159 s    [User: 2.514 s, System: 2.381 s]
  Range (min … max):    1.084 s …  1.538 s    10 runs

Summary
  python read-page-index.py ran
    1.02 ± 0.18 times faster than python read-no-page-index.py

Then metadata:

In [5]: pq.ParquetFile('lineitem-stats-no-index.snappy.parquet').metadata
Out[5]:
<pyarrow._parquet.FileMetaData object at 0x7fc36c2da520>
  created_by: parquet-cpp-arrow version 13.0.0
  num_columns: 16
  num_rows: 2735597
  num_row_groups: 3
  format_version: 2.6
  serialized_size: 6900

In [6]: pq.ParquetFile('lineitem-stats-and-index.snappy.parquet').metadata
Out[6]:
<pyarrow._parquet.FileMetaData object at 0x7fc36c4f4fe0>
  created_by: parquet-cpp-arrow version 13.0.0
  num_columns: 16
  num_rows: 2735597
  num_row_groups: 3
  format_version: 2.6
  serialized_size: 7584
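
The two read scripts from the hyperfine comparison above aren't shown in the thread; a guess at their shape (contents hypothetical) is simply a full-table read, e.g.:

# read-page-index.py (hypothetical reconstruction; the no-index variant would
# point at lineitem-stats-no-index.snappy.parquet instead)
import pyarrow.parquet as pq

table = pq.read_table("lineitem-stats-and-index.snappy.parquet")
print(table.num_rows)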

@fjetter
Member

fjetter commented Nov 30, 2023

I'd like to point out that the pyarrow version above uses far fewer row groups, which makes the files much more compact (compression works better and the overhead per row group is smaller).
The pyarrow file has only three row groups, while the duckdb file's serialized metadata (serialized_size) is about four times as large, and that likely doesn't even include the additional per-row-group metadata. I suspect the final size on disk is significantly different for the two approaches.

@milesgranger milesgranger requested a review from fjetter December 6, 2023 15:02
@milesgranger
Contributor Author

milesgranger commented Dec 13, 2023

After comparing the current data (many row groups, generated by DuckDB) against new data generated by pyarrow (fewer row groups, with and without statistics), I found:

  • It's obviously better to use fewer row groups / pyarrow-generated data.
  • LZ4 and Snappy are basically on par in performance, as we saw here.
  • Using write_statistics=True increased file size and generally made performance a bit worse. I also cannot find any existing engines that explicitly say they make use of these, so (for now) I'm going to generate the new data with write_statistics=False.

Here's a comparison of the old/current (red), new with stats (blue), and new without stats (orange):

visualization (2)

Will update later w/ full comparison using the new data with other engines as well.

@phofl
Contributor

phofl commented Dec 13, 2023

Is it hard to rerun on scale 1000?

We saw very different things when we switched to scale 1000 initially (only row group size; I don't care about statistics).

@milesgranger
Contributor Author

Is it hard to rerun on scale 1000?

You bet, will be including that in the follow-up summary. 👍

@kszlim

kszlim commented Dec 13, 2023

After comparing the current data (many row groups, generated by DuckDB) against new data generated by pyarrow (fewer row groups, with and without statistics), I found:

  • It's obviously better to use fewer row groups / pyarrow-generated data.
  • LZ4 and Snappy are basically on par in performance, as we saw here.
  • Using write_statistics=True increased file size and generally made performance a bit worse. I also cannot find any existing engines that explicitly say they make use of these, so (for now) I'm going to generate the new data with write_statistics=False.

Here's a comparison of the old/current (red), new with stats (blue), and new without stats (orange):

visualization (2)

Will update later w/ full comparison using the new data with other engines as well.

Polars definitely utilizes statistics, and datafusion does as well. IMO it's important to keep them so that engines can demonstrate query performance fairly. Yes, they bloat files a little, but I think most people keep statistics on by default anyway.
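
As a concrete example of the pushdown statistics enable, a filtered read lets a reader skip row groups whose min/max bounds exclude the predicate (a sketch; the column and threshold are just illustrative):

import pyarrow.parquet as pq

# With row-group statistics present, row groups whose min/max for l_quantity
# fall entirely outside the predicate can be skipped without decoding.
filtered = pq.read_table(
    "lineitem.snappy.parquet",
    filters=[("l_quantity", ">", 49)],
)
print(filtered.num_rows)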

@mrocklin
Member

Polars definitely utilizes statistics, and datafusion does as well. IMO it's important to keep them so that engines can demonstrate query performance fairly. Yes, they bloat files a little, but I think most people keep statistics on by default anyway.

My guess is that in this particular set of queries the statistics aren't helpful for those systems that use them, mostly because the dataset is random. I think it totally makes sense to continue keeping statistics, but my guess is that it won't actually affect performance in a positive way for any of the projects.

@milesgranger
Contributor Author

milesgranger commented Dec 13, 2023

Ran a full engine comparison on the current files and the proposed new files (no statistics). I have no strong opinion on adding statistics; I was only nudged in the direction of leaving them off. I'm gleaning @mrocklin will have them, in which case I'll regenerate tomorrow and we'll be done with it. Otherwise, I think this PR is ready if others are happy... or even just mildly satisfied. :)

Scale 100

Current files (and also a regression in dask-expr):
scale-100-old-files

New files (w/ fixed dask-expr also):
scale-100-new-files

Scale 1000 (just dask for relative comparison)

Current files (and also a regression in dask-expr):
scale-1000-old-files

New files (w/ fixed dask-expr also):
scale-1000-new-files

@mrocklin
Member

I'm gleaning @mrocklin will have them

I doubt that it'll matter. We probably want to have them for some final production run (to avoid the appearance of tuning to make other projects worse), but I don't think that this is urgent at all.

@mrocklin
Member

Also, to briefly explain the poor Dask performance: I suspect this is still mostly due to bad parquet performance. I think that something like my WIP will be necessary before we are ready to push on the "Dask is faster than Spark" messaging. Pinging @fjetter so that he's aware of my current thinking here.

@phofl
Contributor

phofl commented Dec 13, 2023

I also merged a PR that I shouldn't have merged before fixing something else. This caused a slowdown as well.

@mrocklin
Member

mrocklin commented Dec 13, 2023 via email

@phofl
Contributor

phofl commented Dec 13, 2023

Yes, that's correct, but we are doing a lot of things twice for queries 1 and 7, which is not ideal.

@mrocklin
Member

Yup. No disagreement there :)

Contributor

@phofl phofl left a comment


LGTM, merge whenever you are ready.

FWIW: could you briefly update the top post of the PR? It still says that we want to use lz4, which might be confusing when looking back without reading all the discussion.

@milesgranger
Contributor Author

From what I've seen, the new files give roughly 2x better performance, and the remainder of the earlier difference was due to the top charts also being run with a temporary regression in dask-expr (sorry 'bout that).

Leading comment updated to clarify that we're not defaulting to lz4.

Thanks everyone! :)

@milesgranger milesgranger merged commit 8df025f into main Dec 14, 2023
@milesgranger milesgranger deleted the milesgranger/tpch-compression-codec-to-lz4 branch December 14, 2023 18:03
@hendrikmakait
Member

FYI, it looks like LZ4 support has been released in duckdb=0.10.2 (duckdb/duckdb#11220).
