
Easy DataFusion / DataFusion Benchmarking #6127

Closed
Tracked by #5505
alamb opened this issue Apr 26, 2023 · 3 comments · Fixed by #6131

alamb commented Apr 26, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Performance is a key differentiator for DataFusion. We often see third party benchmarks comparing performance to other systems e.g. #5942

However, in addition to comparing to different systems, we also need to compare the performance of DataFusion over time. I want an easier way to compare DataFusion performance with a proposed change -- ideally a single command to run and get a report that tells me "does this PR make DataFusion faster or slower". This most recently came up as part of #6034

DataFusion has several benchmark runners, but they have grown "organically": they are hard to use, require manually downloading datasets, and are not easy to run or reproduce (see discussion on #6034 (comment)).

Right now, doing this is cumbersome -- I need to know how to create the appropriate datasets, build the runners, (potentially) convert the datasets to Parquet, run the benchmarks, and then build a report.

This is made more challenging by the fact that the runners need to be built in release mode, which is slow (several minutes per cycle).

Describe the solution you'd like
I want a documented methodology (ideally captured in a script) that does the following:

  1. Setup: creates / downloads the data files needed
  2. Run: executes the benchmarks and writes timing information into log files
  3. Compare: writes out a report comparing the runs
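The "Compare" step above could be sketched as follows. This is a minimal, hypothetical illustration only: the log format (one `query,elapsed_ms` line per query) and the function names are assumptions, not DataFusion's actual benchmark output.

```python
# Hypothetical sketch of step 3 ("Compare"). The log format
# (query name, elapsed milliseconds per line) is an assumption,
# not the actual output of DataFusion's benchmark runners.

def parse_log(lines):
    """Parse 'query,elapsed_ms' lines into a dict of query -> milliseconds."""
    timings = {}
    for line in lines:
        query, ms = line.strip().split(",")
        timings[query] = float(ms)
    return timings

def compare(baseline, candidate):
    """Build a per-query report comparing two timing dicts."""
    rows = []
    for query in sorted(baseline):
        base, cand = baseline[query], candidate[query]
        ratio = cand / base
        verdict = "faster" if ratio < 1.0 else "slower" if ratio > 1.0 else "same"
        rows.append(f"{query}: {base:.1f} ms -> {cand:.1f} ms ({ratio:.2f}x, {verdict})")
    return rows

# Example with in-memory logs (main branch vs. a PR branch):
main_log = ["q1,120.0", "q2,250.0"]
pr_log = ["q1,100.0", "q2,300.0"]
report = compare(parse_log(main_log), parse_log(pr_log))
for row in report:
    print(row)
```

A real script would read the logs written by the "Run" step from disk and could additionally flag regressions above a noise threshold rather than reporting every ratio.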

We currently have the tpch benchmark (links), and I have a janky script that can compare performance against the main branch: https://github.com/alamb/datafusion-benchmarking/blob/1f0beb5d32c39b6cc576e9846cddc40e692d181f/bench.sh

Describe alternatives you've considered

Additional context
This will likely result in cleaning up the runners in #5502.

@andygrove (Member)

Many of the current benchmarks I see online query a single CSV file, so we may want to benchmark that case to measure the "first impression" of performance. A more realistic use case, IMO, is querying partitioned Parquet files, so it would be ideal to benchmark both.

Personally, I think there is less value in benchmarking partitioned CSV or a single Parquet file.


alamb commented Apr 26, 2023

I really like the "first impression" name -- I will ensure that is covered.


alamb commented Apr 28, 2023

Filed #6156 to track the first impression benchmark.
