Add fuzzing / random SQL testing #913

Closed
alamb opened this issue Aug 21, 2021 · 3 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Aug 21, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In a past project, we had a harness that could generate random SQL queries and it found many bugs -- such tests are a wonderful way to help database software mature. Applying such technology to DataFusion would be very cool.

On apache/datafusion-sqlparser-rs#312 (comment), @PsiACE points to a good blog post from CockroachDB that describes such testing: https://www.cockroachlabs.com/blog/sqlsmith-randomized-sql-testing/

One of the tools mentioned is https://github.com/anse1/sqlsmith

Describe the solution you'd like
Add a script / way to run SQLSmith against DataFusion. As described in the blog post, this might require modifying SQLSmith to restrict itself to the subset of PostgreSQL that DataFusion supports.

I would suggest we don't put this in CI until someone has the bandwidth to review the results, but getting runnable scripts set up would be a great first step.
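To make the idea concrete, here is a minimal sketch of a random SQL harness. It is not SQLSmith and does not talk to DataFusion: it uses Python's built-in `sqlite3` purely as a stand-in engine, and the table `t`, its columns, and the function names `random_expr`, `random_query`, and `run_fuzz` are all made up for illustration. The generator is deliberately restricted to a tiny grammar, in the same spirit as restricting SQLSmith to a supported subset.

```python
import random
import sqlite3

# Hypothetical schema and operator sets used only for this sketch.
COLUMNS = ["a", "b", "c"]
BINARY_OPS = ["+", "-", "*"]
COMPARE_OPS = ["<", "<=", "=", "<>", ">", ">="]


def random_expr(rng, depth=0):
    """Return a random integer-valued SQL expression as a string."""
    if depth >= 3 or rng.random() < 0.4:
        # Leaf: a column reference or a small integer literal.
        if rng.random() < 0.5:
            return rng.choice(COLUMNS)
        return str(rng.randint(-10, 10))
    left = random_expr(rng, depth + 1)
    right = random_expr(rng, depth + 1)
    return f"({left} {rng.choice(BINARY_OPS)} {right})"


def random_query(rng):
    """Return a random single-table SELECT statement."""
    projections = ", ".join(random_expr(rng) for _ in range(rng.randint(1, 3)))
    predicate = f"{random_expr(rng)} {rng.choice(COMPARE_OPS)} {random_expr(rng)}"
    return f"SELECT {projections} FROM t WHERE {predicate}"


def run_fuzz(seed, iterations=100):
    """Run random queries and collect the ones the engine rejects."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")  # stand-in engine, not DataFusion
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?, ?)",
        [(rng.randint(-5, 5), rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(20)],
    )
    failures = []
    for _ in range(iterations):
        sql = random_query(rng)
        try:
            conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # Keep the query text so a failure can be reproduced later.
            failures.append((sql, str(exc)))
    conn.close()
    return failures
```

Seeding the generator keeps every run reproducible, which matters when someone later has to review the failures by hand. A real harness would swap the `sqlite3` connection for whatever interface the engine under test exposes.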

Describe alternatives you've considered
I haven't researched alternatives to SQLsmith.

Additional context
apache/datafusion-sqlparser-rs#312

@andygrove
Member

andygrove commented Aug 26, 2021

This paper is also worth a read for anyone interested in learning how Databricks uses query fuzzing with Spark.

I have been doing some query fuzzing myself in my day job, to compare Spark with Spark on GPU (using the RAPIDS Accelerator for Apache Spark). My approach there was to generate logical query plans directly (via Spark's DataFrame API).
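As a rough illustration of plan-level fuzzing with a comparison between two engines, here is a toy sketch. It is not Spark, RAPIDS, or DataFusion code: the operators `Scan`, `Filter`, and `Limit`, and the two executors `execute_eager` and `execute_streaming`, are hypothetical stand-ins for two independent implementations that should agree on every plan.

```python
import random
from dataclasses import dataclass


@dataclass
class Scan:
    pass  # reads every row of the input table


@dataclass
class Filter:
    child: object
    column: str
    threshold: int  # keeps rows where row[column] > threshold


@dataclass
class Limit:
    child: object
    count: int  # keeps the first `count` rows


def random_plan(rng, depth=3):
    """Build a random plan tree directly, without going through SQL text."""
    if depth == 0 or rng.random() < 0.2:
        return Scan()
    child = random_plan(rng, depth - 1)
    if rng.random() < 0.5:
        return Filter(child, rng.choice(["a", "b"]), rng.randint(-5, 5))
    return Limit(child, rng.randint(0, 10))


def execute_eager(plan, rows):
    """Reference engine: materialize a full list at every operator."""
    if isinstance(plan, Scan):
        return list(rows)
    child_rows = execute_eager(plan.child, rows)
    if isinstance(plan, Filter):
        return [r for r in child_rows if r[plan.column] > plan.threshold]
    return child_rows[: plan.count]


def execute_streaming(plan, rows):
    """Second engine: pull rows lazily through generators."""
    if isinstance(plan, Scan):
        yield from rows
    elif isinstance(plan, Filter):
        for r in execute_streaming(plan.child, rows):
            if r[plan.column] > plan.threshold:
                yield r
    else:
        for i, r in enumerate(execute_streaming(plan.child, rows)):
            if i >= plan.count:
                break
            yield r


def find_mismatches(seed, iterations=200):
    """Run random plans on both engines and return the plans that disagree."""
    rng = random.Random(seed)
    rows = [{"a": rng.randint(-5, 5), "b": rng.randint(-5, 5)} for _ in range(30)]
    mismatches = []
    for _ in range(iterations):
        plan = random_plan(rng)
        if execute_eager(plan, rows) != list(execute_streaming(plan, rows)):
            mismatches.append(plan)
    return mismatches
```

Generating plans directly sidesteps the SQL parser, so every generated case exercises the planner and execution layers. The trade-off is that parser and SQL-to-plan bugs are not covered.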

I had been contemplating something similar with DataFusion/Ballista: generating random plans in Rust, encoding them to protobuf using the Ballista serde module, and then writing Scala code to read these protobuf files and translate them to Spark plans. I have an old proof of concept of some of this in my How Query Engines Work repo.

With the new Arrow Compute IR proposal, an approach along these lines would be useful for having fuzzing tools that work across Arrow implementations as well.

@andygrove
Member

I have started work on fuzzing SQL and data using https://github.com/andygrove/sqlfuzz. I plan to eventually add tests to this project, but for now I am doing this separately. It has already been effective in finding bugs.

@alamb
Contributor Author

alamb commented Aug 22, 2024

Given #11030 / https://github.com/datafusion-contrib/datafusion-sqlancer from @2010YOUY01 I think I am going to claim this issue is closed.

Thanks again @2010YOUY01

alamb closed this as completed Aug 22, 2024