Add fuzzing / random SQL testing #913
Comments
This paper would be worth a read too for anyone interested in learning how Databricks uses query fuzzing with Spark. I have been doing some query fuzzing myself in my day job to compare Spark with Spark on GPU (using the RAPIDS Accelerator for Apache Spark). My approach there was to generate logical query plans directly (via Spark's DataFrame API). I had been contemplating doing something similar with DataFusion/Ballista: generating random plans in Rust, encoding them to protobuf using the Ballista serde module, and then writing Scala code to read these protobuf files and translate them to Spark plans. I have an old proof-of-concept of some of this already in my How Query Engines Work repo. With the new Arrow Compute IR proposal, an approach along these lines would also be useful for building fuzzing tools that work across Arrow implementations.
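The compare-two-engines approach described above can be illustrated with a minimal sketch. It does not use Spark, RAPIDS, DataFusion, or Ballista: as stand-ins, it assumes SQLite (from the Python standard library) as the engine under test and a small tree interpreter as the reference engine. The names `random_expr`, `to_sql`, `evaluate`, and `run_differential` are hypothetical and exist only for this illustration.

```python
import random
import sqlite3

OPS = ["+", "-", "*"]


def random_expr(rng, depth=3):
    """Build a random expression tree: an int leaf or an (op, left, right) tuple."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randint(0, 9)
    return (rng.choice(OPS), random_expr(rng, depth - 1), random_expr(rng, depth - 1))


def to_sql(expr):
    """Render the tree as a fully parenthesized SQL expression."""
    if isinstance(expr, int):
        return str(expr)
    op, left, right = expr
    return f"({to_sql(left)} {op} {to_sql(right)})"


def evaluate(expr):
    """Reference engine: interpret the tree directly in Python."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    lhs, rhs = evaluate(left), evaluate(right)
    if op == "+":
        return lhs + rhs
    if op == "-":
        return lhs - rhs
    return lhs * rhs


def run_differential(seed, iterations=100):
    """Run the same random expressions through both engines and collect mismatches."""
    rng = random.Random(seed)  # seeded so any mismatch can be replayed
    conn = sqlite3.connect(":memory:")
    mismatches = []
    for _ in range(iterations):
        expr = random_expr(rng)
        sql = f"SELECT {to_sql(expr)}"
        actual = conn.execute(sql).fetchone()[0]
        expected = evaluate(expr)
        if actual != expected:
            mismatches.append((sql, expected, actual))
    conn.close()
    return mismatches


if __name__ == "__main__":
    print(run_differential(seed=1))
```

Generating a tree first and rendering it to each engine's input format afterwards mirrors the plan-first approach in the comment: the same random plan can be serialized for any number of engines, and every mismatch comes with the seed needed to reproduce it.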
I have started work on fuzzing SQL and data using https://github.com/andygrove/sqlfuzz and plan on eventually adding tests to this project, but for now I am doing this separately. It has already been effective in finding bugs.
Given #11030 / https://github.com/datafusion-contrib/datafusion-sqlancer from @2010YOUY01, I think I am going to claim this issue is closed. Thanks again @2010YOUY01
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In a past project, we had a harness that could generate random SQL queries and it found many bugs -- such tests are a wonderful way to help database software mature. Applying such technology to DataFusion would be very cool.
On apache/datafusion-sqlparser-rs#312 (comment), @PsiACE points to a good blog post from CockroachDB that describes such testing: https://www.cockroachlabs.com/blog/sqlsmith-randomized-sql-testing/
One of the tools mentioned is https://github.com/anse1/sqlsmith
Describe the solution you'd like
Add a script / way to run SQLSmith against DataFusion. As described in the blog post, this might require modifying SQLSmith to restrict itself to the subset of PostgreSQL that DataFusion supports.
I would suggest we don't put this in CI initially, until someone has the bandwidth to review the results, but getting scripts set up that could be run manually would be a great first step.
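To illustrate what restricting a generator to a supported subset could look like, here is a minimal sketch. It is not SQLSmith and does not reflect what DataFusion actually supports: the schema, the feature whitelists, and the function names `random_predicate` and `random_query` are all assumptions made up for this example.

```python
import random

# Hypothetical schema and feature whitelists; neither reflects
# what DataFusion actually supports.
SCHEMA = {"t1": ["a", "b", "c"], "t2": ["x", "y"]}
SUPPORTED_AGGREGATES = ["COUNT", "MIN", "MAX", "SUM"]
SUPPORTED_COMPARISONS = ["=", "<>", "<", "<=", ">", ">="]


def random_predicate(rng, columns):
    """Build a single comparison between a column and an integer literal."""
    column = rng.choice(columns)
    op = rng.choice(SUPPORTED_COMPARISONS)
    return f"{column} {op} {rng.randint(-100, 100)}"


def random_query(rng):
    """Generate one SELECT statement using only whitelisted features."""
    table = rng.choice(sorted(SCHEMA))
    columns = SCHEMA[table]
    if rng.random() < 0.5:
        # Aggregate query over a single column.
        agg = rng.choice(SUPPORTED_AGGREGATES)
        projection = f"{agg}({rng.choice(columns)})"
    else:
        # Plain projection of a random non-empty subset of columns.
        picked = rng.sample(columns, rng.randint(1, len(columns)))
        projection = ", ".join(picked)
    sql = f"SELECT {projection} FROM {table}"
    if rng.random() < 0.7:
        sql += f" WHERE {random_predicate(rng, columns)}"
    return sql


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a failing query can be replayed
    for _ in range(5):
        print(random_query(rng))
```

Keeping the supported features in plain lists means the subset can be widened one feature at a time as the engine matures, and the generated queries could be written to a file and run manually rather than in CI.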
Describe alternatives you've considered
I haven't researched alternatives to SQLSmith.
Additional context
apache/datafusion-sqlparser-rs#312