Add fuzzing / random SQL testing #913

alamb · 2021-08-21T10:28:57Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In a past project, we had a harness that could generate random SQL queries and it found many bugs -- such tests are a wonderful way to help database software mature. Applying such technology to DataFusion would be very cool.

In apache/datafusion-sqlparser-rs#312 (comment), @PsiACE points to a good blog post from CockroachDB that describes such testing: https://www.cockroachlabs.com/blog/sqlsmith-randomized-sql-testing/

One of the tools mentioned is https://github.com/anse1/sqlsmith

Describe the solution you'd like
Add a script or other way to run SQLSmith against DataFusion. As described in the blog post, this might require modifying SQLSmith to restrict itself to the subset of PostgreSQL that DataFusion supports.

I would suggest we don't put this in CI initially, until someone has the bandwidth to review the results, but getting scripts set up that could be run would be a great first step.
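To make the idea concrete, here is a minimal sketch of the kind of harness described above: it generates random queries from a deliberately small SQL subset and records any query the engine rejects. This is not SQLSmith and not DataFusion's API. The table `t(a, b)`, the helper names (`random_expr`, `random_query`, `run_fuzz`), and the use of Python's built-in `sqlite3` as a stand-in engine are all assumptions made for illustration; a real setup would send the generated SQL to DataFusion instead.

```python
import random
import sqlite3


def random_expr(rng, depth=0):
    """Return a random integer-valued expression over the hypothetical columns a and b."""
    if depth >= 2 or rng.random() < 0.4:
        return rng.choice(["a", "b", str(rng.randint(-5, 5))])
    op = rng.choice(["+", "-", "*"])
    return f"({random_expr(rng, depth + 1)} {op} {random_expr(rng, depth + 1)})"


def random_query(rng):
    """Return a random SELECT ... FROM t WHERE ... query from a small SQL subset."""
    select = ", ".join(random_expr(rng) for _ in range(rng.randint(1, 3)))
    cmp_op = rng.choice(["<", "<=", "=", ">=", ">", "<>"])
    where = f"{random_expr(rng)} {cmp_op} {random_expr(rng)}"
    return f"SELECT {select} FROM t WHERE {where}"


def run_fuzz(n_queries, seed=0):
    """Run n_queries random queries and return (sql, error) pairs for any that fail."""
    rng = random.Random(seed)  # fixed seed so failures can be reproduced
    conn = sqlite3.connect(":memory:")  # stand-in engine (assumption), not DataFusion
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    rows = [(rng.randint(-10, 10), rng.randint(-10, 10)) for _ in range(20)]
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    failures = []
    for _ in range(n_queries):
        sql = random_query(rng)
        try:
            conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            failures.append((sql, str(exc)))
    return failures
```

Calling `run_fuzz(1000)` returns the list of failing queries with their error messages. Keeping the seed explicit means any failure can be replayed, which matters if the results are reviewed by hand rather than in CI.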

Describe alternatives you've considered
I haven't researched alternatives to SQLSmith.

Additional context
apache/datafusion-sqlparser-rs#312

andygrove · 2021-08-26T14:47:36Z

This paper would also be worth a read for anyone interested in learning how Databricks uses query fuzzing with Spark.

SparkFuzz: Searching Correctness Regressions in Modern Query Engines

I have been doing some query fuzzing myself in my day job, to compare Spark with Spark on GPU (using the RAPIDS Accelerator for Apache Spark). My approach there was to generate logical query plans directly (via Spark's DataFrame API).
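As a rough illustration of this plan-level approach (not the actual Spark or RAPIDS code), the sketch below builds random logical plans from scan, filter, and project nodes, then evaluates each plan two ways and compares the results as multisets. Everything here is an assumption for the example: the tuple-based plan encoding, the function names, the table `t(a, b)`, and the two "engines", which are a small Python interpreter and Python's built-in `sqlite3` rather than Spark on CPU and GPU.

```python
import operator
import random
import sqlite3

OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       ">": operator.gt, ">=": operator.ge}


def random_plan(rng, depth=3):
    """Build a random plan bottom-up; return (plan, output_columns)."""
    plan, cols = ("scan",), ["a", "b"]
    for _ in range(depth):
        if rng.random() < 0.5:
            # Filter only on columns that are still available at this point.
            plan = ("filter", plan, rng.choice(cols),
                    rng.choice(sorted(OPS)), rng.randint(-5, 5))
        else:
            cols = rng.sample(cols, rng.randint(1, len(cols)))
            plan = ("project", plan, tuple(cols))
    return plan, cols


def evaluate(plan, table):
    """Reference interpreter: evaluate the plan over a list of row dicts."""
    kind = plan[0]
    if kind == "scan":
        return [dict(row) for row in table]
    rows = evaluate(plan[1], table)
    if kind == "filter":
        _, _, col, op, value = plan
        return [r for r in rows if OPS[op](r[col], value)]
    _, _, cols = plan
    return [{c: r[c] for c in cols} for r in rows]


def to_sql(plan):
    """Render the same plan as nested SQL for the second engine."""
    kind = plan[0]
    if kind == "scan":
        return "SELECT a, b FROM t"
    child = to_sql(plan[1])
    if kind == "filter":
        _, _, col, op, value = plan
        return f"SELECT * FROM ({child}) AS s WHERE {col} {op} {value}"
    _, _, cols = plan
    return f"SELECT {', '.join(cols)} FROM ({child}) AS s"


def mismatches(n_plans, seed=0):
    """Return the plans on which the two evaluations disagree."""
    rng = random.Random(seed)
    table = [{"a": rng.randint(-5, 5), "b": rng.randint(-5, 5)} for _ in range(30)]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t VALUES (:a, :b)", table)
    bad = []
    for _ in range(n_plans):
        plan, cols = random_plan(rng)
        # Sort both sides because neither engine guarantees row order.
        expected = sorted(tuple(r[c] for c in cols) for r in evaluate(plan, table))
        actual = sorted(conn.execute(to_sql(plan)).fetchall())
        if expected != actual:
            bad.append(plan)
    return bad
```

The test data avoids NULLs on purpose, since the Python comparison operators and SQL three-valued logic would otherwise disagree for reasons unrelated to engine bugs.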

I had been contemplating doing something similar with DataFusion/Ballista: generating random plans in Rust, encoding them to protobuf using the Ballista serde module, and then writing Scala code to read these protobuf files and translate them to Spark plans. I already have an old proof-of-concept of some of this in my How Query Engines Work repo.

With the new Arrow Compute IR proposal, an approach along these lines would be useful for having fuzzing tools that work across Arrow implementations as well.

andygrove · 2022-08-05T13:55:56Z

I have started work on fuzzing SQL and data using https://github.com/andygrove/sqlfuzz and plan on eventually adding tests to this project but for now, I am doing this separately. It has already been effective in finding bugs.

alamb · 2024-08-22T15:35:40Z

Given #11030 / https://github.com/datafusion-contrib/datafusion-sqlancer from @2010YOUY01 I think I am going to claim this issue is closed.

Thanks again @2010YOUY01

alamb added the enhancement New feature or request label Aug 21, 2021

This was referenced Aug 21, 2021

Add fuzzer based on honggfuzz apache/datafusion-sqlparser-rs#312

Merged

Add fuzzer based on cargo-fuzz apache/datafusion-sqlparser-rs#211

Closed

PsiACE mentioned this issue Aug 25, 2021

[sqlparser] Domain-aware fuzzing databendlabs/databend#1549

Closed

Omega359 mentioned this issue Mar 23, 2024

Run sqllogictests multiple times with random fuzzed configurations #9746

Open

alamb closed this as completed Aug 22, 2024
