Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Performance is a key differentiator for DataFusion. We often see third party benchmarks comparing performance to other systems e.g. #5942
However, in addition to comparing to different systems, we also need to compare the performance of DataFusion over time. I want an easier way to compare DataFusion performance with a proposed change -- ideally a single command to run and get a report that tells me "does this PR make DataFusion faster or slower". This most recently came up as part of #6034
DataFusion has several benchmark runners, but they have grown "organically" and are hard to use: they require manually downloading datasets, and are not easy to run or reproduce (see discussions on #6034 (comment))
Right now, doing so is cumbersome -- I need to know how to create the appropriate datasets, build the runners, convert the datasets to Parquet (potentially), run the benchmarks, and then build a report.
This is made more challenging by the fact that the runners need to be built in release mode, which is slow (it takes several minutes per cycle).
Describe the solution you'd like
I want a documented methodology (ideally in a script) that will:
Set up (create / download / whatever) the data files needed
Many of the current benchmarks I see online query a single CSV file, so we may want to benchmark that to measure the "first impression" of performance. However, a more realistic use case IMO is querying partitioned Parquet files, so it would be ideal to benchmark both.
Personally, I think there is less value in benchmarking partitioned CSV or single Parquet.
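To make the idea concrete, the envisioned workflow could be sketched roughly as below. This is a hypothetical sketch only: the helper script, binary name, and CLI flags are illustrative assumptions, not the actual DataFusion benchmark interface.

```python
# Hypothetical sketch of a one-command benchmark workflow.
# The commands, paths, and flags below are illustrative assumptions,
# not the actual DataFusion benchmark CLI.
import subprocess


def bench_commands(branch: str, data_dir: str = "data") -> list[list[str]]:
    """Return the commands the script would run for one branch."""
    return [
        # 1. create / download the data files needed (assumed helper script)
        ["./scripts/gen_tpch_data.sh", data_dir],
        # 2. build the runners in release mode (the slow step)
        ["cargo", "build", "--release", "--bin", "tpch"],
        # 3. run the benchmark and record results (assumed flags / binary name)
        ["./target/release/tpch", "benchmark",
         "--path", data_dir, "--format", "parquet",
         "-o", f"results-{branch}.json"],
    ]


def run_benchmarks(branch: str) -> None:
    """Check out a branch and run the full benchmark pipeline on it."""
    subprocess.run(["git", "checkout", branch], check=True)
    for cmd in bench_commands(branch):
        subprocess.run(cmd, check=True)
```

Running `run_benchmarks` once for `main` and once for the PR branch would leave two result files that a report step can diff.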
We currently have the tpch benchmark (links) and I have a janky script that can compare performance with the main branch: https://github.com/alamb/datafusion-benchmarking/blob/1f0beb5d32c39b6cc576e9846cddc40e692d181f/bench.sh
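The "faster or slower?" report over two recorded runs could be as simple as the sketch below. The per-query `{name: elapsed_ms}` layout is an assumption about what the runners would emit, not their actual output format.

```python
# Sketch of a "does this PR make DataFusion faster or slower?" report.
# Assumes each run is a mapping of query name -> elapsed time in ms;
# the exact output format of the benchmark runners is an assumption.


def compare(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return per-query speedup: >1.0 means the candidate is faster."""
    return {
        query: baseline[query] / candidate[query]
        for query in baseline
        if query in candidate and candidate[query] > 0
    }


def report(speedups: dict[str, float], threshold: float = 0.05) -> str:
    """Format speedups, treating changes within `threshold` as noise."""
    lines = []
    for query, speedup in sorted(speedups.items()):
        if speedup > 1 + threshold:
            verdict = "faster"
        elif speedup < 1 - threshold:
            verdict = "slower"
        else:
            verdict = "no change"
        lines.append(f"{query}: {speedup:.2f}x ({verdict})")
    return "\n".join(lines)
```

The `threshold` guard matters in practice: benchmark runs are noisy, so small per-query deltas should not be reported as regressions.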
Describe alternatives you've considered
Additional context
This will likely result in cleaning up the runners in #5502