Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW2: Performance benchmark #1652

Closed
houqp opened this issue Jan 24, 2022 · 24 comments
Closed

ARROW2: Performance benchmark #1652

houqp opened this issue Jan 24, 2022 · 24 comments
Labels
question Further information is requested
Milestone

Comments

@houqp
Copy link
Member

houqp commented Jan 24, 2022

Perform thorough testing to make sure the arrow2 branch will indeed result in better performance across the board.

Please give https://github.com/apache/arrow-datafusion/tree/arrow2 a try for your workloads and report any regression you find here.

@houqp houqp added the question Further information is requested label Jan 24, 2022
@houqp houqp added this to the arrow2 milestone Jan 24, 2022
@houqp
Copy link
Member Author

houqp commented Jan 24, 2022

Here are some of the TPCH results I got from running our tpch benchmark suite on an 8 cores x86-64 Linux machine.

The base commit from I used as baseline for arrow-rs is 2008b1d. The commit for arrow2 is c0c9c72.

Default single partition CSV files generated from our tpch gen script (--batch-size 4096):

image

CSV tables partitioned into 16 files and processed with 8 datafusion partitions (--batch-size 4096 --partitions 8):

image

Data generated with:

PARTITIONS=16
for i in `seq 1 ${PARTITIONS}`; do
  echo "generating partition $i"
  ./dbgen -vf -s 16 -C ${PARTITIONS} -S $i
  ls -lh
done

Parquet tables partitioned in 8 files and processed with 8 datafusion partitions (--batch-size 4096 --partitions 8):

image

Note query 7 could not complete with arrow2 due to OOM. Arrow2 parquet reader currently consuming almost double the memory compared to arrow-rs is a known issue. Related upstream issue: jorgecarleitao/arrow2#768.

Q1 is significantly slower in arrow2 compared to the other queies (perhaps related to predicate pushdown?).

I think both of these two regressions require deepdive before we merge arrow2 into master.

Overall, arrow2 is around 5% faster for CSV tables and 10-20% faster for parquet tables across the board.

@tustvold
Copy link
Contributor

tustvold commented Jan 24, 2022

Big 👍 to this, getting some concrete numbers would be really nice.

FWIW some ideas for whoever picks this up that I at least would be very interested in:

@alamb
Copy link
Contributor

alamb commented Jan 24, 2022

A number of the performance improvements will be in arrow-rs 8.0.0, though some such as apache/arrow-rs#1180 will not be released until arrow-rs 9.0.0

@houqp
Copy link
Member Author

houqp commented Jan 24, 2022

bummer that the dictionary encoding optimization will need to wait for the 9.0 release, but we can test it with a 9.0 datafusion branch if needed.

@sunchao
Copy link
Member

sunchao commented Jan 27, 2022

I think a big part of perf improvement comes from Parquet predicate pushdown (e.g., stats, dictionary, bloom filter and column index). Does either arrow-rs or arrow2 implement (some of) these currently?

@alamb
Copy link
Contributor

alamb commented Jan 27, 2022

DataFusion does implement row group pruning based on statistics (that arrow-rs creates)

arrow-rs creates statistics. It will (in version after 8.0.0) preserve dictionary encoding (meaning the output of a dictionary encoded column will also be dictionary encoded) which should be a large win for low cardinality columns

It does not currently create or use column indexes or bloom filters

I believe that @tustvold has a plan for implementing more sophisticated predicate pushdown (aka that a filter on one column could be used to avoid decoding swaths of others) but I am not sure what the timeline on that is -- apache/arrow-rs#1191

@sunchao
Copy link
Member

sunchao commented Jan 27, 2022

It will (in version after 8.0.0) preserve dictionary encoding (meaning the output of a dictionary encoded column will also be dictionary encoded) which should be a large win for low cardinality columns

This is nice 👍 . By dictionary-based row group filtering I mean we can first decode only the dictionary pages for the column chunks where pushed down predicates (e.g., in predicate) are associated to, and skip the whole row group if no keys in the dictionary match.

@alamb
Copy link
Contributor

alamb commented Jan 27, 2022

This is nice 👍 . By dictionary-based row group filtering I mean we can first decode only the dictionary pages for the column chunks where pushed down predicates (e.g., in predicate) are associated to, and skip the whole row group if no keys in the dictionary match.

this would definitely be a neat optimization

@tustvold
Copy link
Contributor

skip the whole row group if no keys in the dictionary match.

You'd need to check that column chunk hadn't spilled, but yes that is broadly speaking the access pattern I wish to support with the work in apache/arrow-rs#1191. Specially exploit apache/arrow-rs#1180 to cheaply get dictionary data, evaluate predicates on it, and use this to construct a refined row selection to fetch.

As @alamb alludes to the timelines for this effort are not clear, and there may be other more pressing things for me to spend time on, but it is definitely something that I hope to deliver in the next month or so.

@jorgecarleitao
Copy link
Member

  • arrow-rs: group filter push down
  • arrow2: group filter push down, page filter push down

afaik both support reading and writing dictionary encoded arrays to dictionary-encoded column chunks (but neither supports pushdown based on dict values atm).

TBH, imo the biggest limiting factor in implementing anything in parquet is its lack of documentation - it is just a lot of work to decipher what is being requested, and the situation is not improving. For example, I spent a lot of time in understanding the encodings, have a PR to try to help future folks implementing it, and it has been lingering for ~9 months now. I wish parquet was a bit more inspired by e.g. Arrow or ORC in this respect.

Note that datafusion's benchmarks only use "required" / non-nullable data, so most optimizations on the null values are not seen from datafusion's perspective. Last time I benched arrow2/parquet2 was much faster in nullable data; I am a bit surprised to see so many differences in non-nullable data.

@alamb
Copy link
Contributor

alamb commented Jan 27, 2022

arrow-rs 8 has some improvements for nullable columns as well now courtesy of @tustvold apache/arrow-rs#1054

@matthewmturner
Copy link
Contributor

checking in on this - has anyone rerun tpch on master and arrow2 branches since the arrow-rs 9 release?

@tustvold
Copy link
Contributor

tustvold commented Mar 5, 2022

I added some further benchmarks in #1738 which I would also be interested in the numbers for

@xudong963
Copy link
Member

More than half a year passed, and I believe arrow-rs has made great progress. Does anyone have the latest benchmark data?

@alamb
Copy link
Contributor

alamb commented Nov 22, 2022

I have not really been following arrow2 very closely -- I am not sure when the last time anyone got datafusion to work enough with arrow2 to run the TPCH benchmarks 🤷

@xudong963
Copy link
Member

I will bump arrow branch and make comparision.

@tustvold
Copy link
Contributor

My 2 cents is that as mentioned by @sunchao above, parquet predicate pushdown makes a drastic difference to performance. Whilst there are still some kinks to iron out in the DataFusion integration, arrow-rs now supports the full suite, including late materialisation.

Given they were previously roughly in the same ballpark, and significant optimisations have landed in arrow-rs since then, with predicate pushdown I think the performance story for arrow-rs is pretty compelling, at least w.r.t parquet. Arrow2 only supports row group level pruning, as page pruning requires skippable decode and more sophisticated handling of repetition levels, and so is a fair ways behind on this.

This has been born out by some benchmarks I did, but the TPCH benchmarks would be very interesting to see also.

I mention this mainly because previous attempts to update the arrow2 branch have gotten stuck in merge conflict hell, and perhaps I can save you some time, and sanity 😅

@Dandandan
Copy link
Contributor

For TPC-H It might be the pruning wouldn't help a lot, as most of the data is AFAIK very evenly distributed, so quite hard to prune values based on min/max values without sorting data first - at least on row group level this was the case. But maybe page pruning works in some cases, that would be great.

@tustvold
Copy link
Contributor

But maybe page pruning works in some cases, that would be great.

FWIW we support late materialization now, in addition to row group and page pruning, and which doesn't rely on aggregate statistics. It instead evaluates predicates on the subset of the columns needed by the predicate, and then uses the result to conditionally materialize the other projected columns. This isn't always advantageous, e.g. for predicates that don't eliminate large numbers of rows, we need to develop better heuristics / bailing logic, but where it works the benefits can be huge.

@Dandandan
Copy link
Contributor

But maybe page pruning works in some cases, that would be great.

FWIW we support late materialization now, in addition to row group and page pruning, and which doesn't rely on aggregate statistics. It instead evaluates predicates on the subset of the columns needed by the predicate, and then uses the result to conditionally materialize the other projected columns. This isn't always advantageous, e.g. for predicates that don't eliminate large numbers of rows, we need to develop better heuristics / bailing logic, but where it works the benefits can be huge.

Yes, I realized that, but also seeing some tpc-h generated data and filters, I would be (pleasantly) surprised if it can prune out many pages.

@alamb
Copy link
Contributor

alamb commented Nov 22, 2022

Yes, I realized that, but also seeing some tpc-h generated data and filters, I would be (pleasantly) surprised if it can prune out many pages.

I have it on my list to propose enabling the pushdown filtering by default (see #3828) -- I plan to run the TPCH benchmarks as part of that exercise to see how much / if at all they help

@xudong963
Copy link
Member

I plan to run the TPCH benchmarks as part of that exercise to see how much / if at all they help

From benchmarks, I think pushdown filtering is very help

@Ted-Jiang
Copy link
Member

Yes, I realized that, but also seeing some tpc-h generated data and filters, I would be (pleasantly) surprised if it can prune out many pages.

I try to modify the filter condition in tpch q1, with page index it really helped 😂 . It can boost a lot sometimes in real world.

@alamb
Copy link
Contributor

alamb commented Mar 10, 2023

Given jorgecarleitao/arrow2#1429 I don't think this issue is relevant any more -- please let me know if you feel otherwise.

@alamb alamb closed this as completed Mar 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

9 participants