Table Scan Performance Tests #497

Open
wants to merge 4 commits into
base: main

sdd
Contributor

@sdd sdd commented Jul 28, 2024

This PR adds some performance testing capabilities. It includes the following features:

  • docker-compose environment that includes containers for Minio, Spark, HAProxy and the Iceberg REST Catalog
  • Uses HAProxy to simulate real-world latency and bandwidth constraints of connections to services like S3
  • Includes scripting to create an Iceberg table in the performance testing environment and populate it with data from the widely-used NYC Taxi dataset
  • Adds a justfile for ease of creating, initialising, starting, stopping and tearing down the performance testing environment
  • Adds some Criterion benchmarks that use the performance testing environment to test the performance of TableScan.plan_files in four different representative scenarios
  • Adds some Criterion benchmarks that use the performance testing environment to test the performance of TableScan.to_arrow in four different representative scenarios
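To give a feel for what the simulated latency and bandwidth constraints mean for a single object fetch, here is a minimal back-of-the-envelope sketch. It is not taken from the PR: the `Link` type, its fields and the numbers used with it are hypothetical, and the model (one latency period plus payload size divided by bandwidth) is a deliberate simplification that ignores connection setup, retries and parallel requests.

```rust
use std::time::Duration;

/// Hypothetical description of a throttled connection (illustrative only;
/// these names do not come from the PR or from HAProxy).
struct Link {
    /// Fixed delay added to every request.
    latency: Duration,
    /// Sustained throughput in bytes per second; must be non-zero.
    bytes_per_sec: u64,
}

impl Link {
    /// Estimated time for one request: one latency period plus
    /// payload size divided by bandwidth.
    fn transfer_time(&self, payload_bytes: u64) -> Duration {
        let secs = payload_bytes as f64 / self.bytes_per_sec as f64;
        self.latency + Duration::from_secs_f64(secs)
    }
}
```

Under this model, a 5 MB file over a link with 50 ms latency and 10 MB/s of bandwidth takes about 550 ms, so scans that touch many small files tend to be dominated by latency and scans over a few large files by bandwidth.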

The performance tests can be set up and run with just perf-run. Before running the tests, this performs the following setup steps, checking each one first and skipping any that were already completed on a previous run:

  • Download NYC taxi data parquets
  • Spin up docker containers
  • Create a table
  • Insert test data from the parquets
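The check-then-skip behaviour of the setup steps can be sketched as follows. This is only an illustration of the pattern, written in Rust for consistency with the rest of the project; the PR itself drives these steps from a justfile, and the `ensure` helper below is a hypothetical name, not something the PR defines.

```rust
/// Runs `action` only if `is_done` reports that the step has not been
/// completed yet. Returns Ok(true) if the action ran and Ok(false) if the
/// step was skipped. (Hypothetical helper, for illustration only.)
fn ensure<C, A>(name: &str, is_done: C, action: A) -> Result<bool, String>
where
    C: FnOnce() -> bool,
    A: FnOnce() -> Result<(), String>,
{
    if is_done() {
        println!("skip: {}", name);
        return Ok(false);
    }
    println!("run: {}", name);
    action()?;
    Ok(true)
}
```

A download step, for example, could pass a check that tests whether the parquet file already exists on disk, so that a second invocation returns immediately without fetching anything.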

@sdd sdd mentioned this pull request Aug 2, 2024
@sdd sdd force-pushed the perf-suite branch 5 times, most recently from 6d0a7ee to 56f068e Compare August 9, 2024 23:25
@sdd sdd changed the title feat: performance testing harness and perf tests for scan file plan feat: performance testing harness and perf tests for scan file plan and execution Aug 9, 2024
@sdd sdd changed the title feat: performance testing harness and perf tests for scan file plan and execution Table Scan Performance tests Aug 10, 2024
@sdd sdd changed the title Table Scan Performance tests Table Scan Performance Tests Aug 10, 2024
@sdd sdd marked this pull request as ready for review August 13, 2024 19:18
@sdd
Contributor Author

sdd commented Aug 13, 2024

@Xuanwo and @liurenjie1024: This is now passing and ready for review.

@sdd sdd force-pushed the perf-suite branch 2 times, most recently from f90d2d4 to a00b32a Compare August 15, 2024 20:35
Member

@Xuanwo Xuanwo left a comment

Thanks a lot for driving this work!

Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks @sdd for this PR. I skimmed through it and I understand your points. I have some concerns with this approach: for example, I feel it would be difficult to maintain and to extend to other cases. I'm more interested in integrating with datafusion for this kind of thing, such as integration tests and benchmarks. What do you think?
