Make filter selectivity for statistics configurable #8243

edmondop · 2023-11-17T03:34:15Z

Expose an API to set a default value for filter selectivity, used when the exact value cannot be computed. The value is also exposed in the configuration as datafusion.optimizer.default_filter_selectivity.
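To make the fallback concrete, here is a minimal sketch in plain Rust. It does not use DataFusion's real types: `estimate_output_rows` and its parameters are hypothetical names chosen for illustration, and the percentage encoding follows the `u8` option proposed in this PR.

```rust
/// Hypothetical helper, not DataFusion's API: estimate how many rows a filter emits.
///
/// `exact_selectivity` is `Some(fraction)` when the predicate could be analyzed
/// and `None` when it could not; in the latter case the configured default
/// (a percentage between 0 and 100) is used instead.
fn estimate_output_rows(
    input_rows: usize,
    exact_selectivity: Option<f64>,
    default_selectivity_percent: u8,
) -> usize {
    // Fall back to the configured default only when no exact value is known.
    let fraction =
        exact_selectivity.unwrap_or(f64::from(default_selectivity_percent) / 100.0);
    // Round to the nearest whole row; the result is only an estimate.
    (input_rows as f64 * fraction).round() as usize
}
```

With a 1000-row input, no exact selectivity, and the default of 20, this sketch estimates 200 rows; raising the default to 50 gives 500.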

datafusion/sql/src/statement.rs

datafusion/substrait/src/logical_plan/consumer.rs

datafusion/sql/src/select.rs

datafusion/optimizer/src/push_down_filter.rs

datafusion/optimizer/src/filter_null_join_keys.rs

datafusion/optimizer/src/decorrelate.rs

datafusion/core/src/datasource/view.rs

edmondop · 2023-11-17T03:42:11Z

@andygrove I have a very early draft implementation that is still broken, but it is already enough to ask for your view on two questions: what is the right thing to do for SQL, and what to do when a filter is introduced as part of an optimization rather than as an optimization of an existing filter. Could you please review my comments?

datafusion/expr/src/logical_plan/plan.rs

Dandandan · 2023-11-25T10:09:25Z

datafusion/common/src/config.rs

+        /// The default filter selectivity used by Filter Statistics
+        /// when an exact selectivity cannot be determined. Valid values are
+        /// between 0 (no selectivity) and 100 (all rows are selected).
+        pub default_filter_selectivity: u8, default = 20


I think it makes sense to make this a float (0.2).

The two main reasons for choosing a uint are the lack of an Eq trait implementation for f32 and the problems that can arise when serializing numbers that cannot be represented exactly as f32. If you have already considered this and still think f32 is the better option, let me know and I will proceed.
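Both concerns can be illustrated with a small standalone sketch. The struct names below are hypothetical and are not DataFusion's config types; they only show what the Rust type system allows for each field type.

```rust
use std::collections::HashSet;

/// Hypothetical config struct: with a `u8` percentage, `Eq` and `Hash` can be derived.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct PercentConfig {
    default_filter_selectivity: u8,
}

/// The same struct with an `f32` field can only derive `PartialEq`;
/// adding `Eq` or `Hash` to the derive list would not compile.
#[derive(Debug, Clone, PartialEq)]
struct FractionConfig {
    default_filter_selectivity: f32,
}

/// `Eq` + `Hash` make the `u8` variant usable as a set element or map key.
fn count_distinct(configs: &[PercentConfig]) -> usize {
    configs.iter().cloned().collect::<HashSet<_>>().len()
}

/// 0.2 has no exact binary representation, so widening the `f32` value
/// to `f64` does not reproduce the `f64` literal.
fn f32_point_two_is_inexact() -> bool {
    f64::from(0.2_f32) != 0.2_f64
}

/// NaN is not equal to itself, which is why `f32` implements only `PartialEq`.
fn nan_breaks_reflexivity() -> bool {
    let nan = f32::NAN;
    nan != nan
}
```

None of this rules out a float: a fraction such as 0.2 reads more naturally, at the cost of giving up derived `Eq` on any struct that contains it.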

Dandandan · 2023-11-25T10:11:26Z

Overall looks very good! I think it might make more sense to use an f32 for this :)

Dandandan · 2023-12-01T18:41:46Z

FYI @alamb @andygrove

alamb

Thanks @edmondop and @Dandandan -- this looks pretty neat.

alamb · 2023-12-01T19:42:42Z

datafusion/physical-plan/src/filter.rs

@@ -994,4 +1014,22 @@ mod tests {

        Ok(())
    }
+
+    #[tokio::test]
+    async fn test_validation_filter_selectivity() -> Result<()> {


Shall we also add a test showing that changing the default selectivity actually affects the output statistics?

I think if the selectivity got hard coded to 0.2 again, no tests would fail 🤔 Maybe we could add another unit test here that sets the selectivity to 0.5 or something and demonstrates that the statistics are different.

I agree we should; however, I didn't know how to observe the effect of such a parameter. Is there some observable state or result that I can use to assert which selectivity has been used?

I think you should be able to look at the output of FilterExec::statistics(): the row number estimates will change with different values of selectivity.
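The shape of such a test can be sketched without DataFusion. `FilterSketch` and `RowEstimate` below are hypothetical stand-ins for the filter operator and its row-count statistics, and the 1000-row input is an assumption made for illustration, not a value taken from the PR.

```rust
/// Simplified stand-in for a row-count statistic (known exactly vs. estimated).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RowEstimate {
    Exact(usize),
    Inexact(usize),
}

/// Hypothetical stand-in for a filter operator; this is not DataFusion's `FilterExec`.
#[derive(Debug)]
struct FilterSketch {
    input_rows: usize,
    /// Percentage between 0 and 100, used when the predicate cannot be analyzed.
    default_selectivity: u8,
}

impl FilterSketch {
    fn new(input_rows: usize) -> Self {
        // 20 mirrors the default proposed for the config option in this PR.
        Self { input_rows, default_selectivity: 20 }
    }

    /// Reject percentages above 100 instead of silently clamping them.
    fn with_default_selectivity(mut self, percent: u8) -> Result<Self, String> {
        if percent > 100 {
            return Err(format!(
                "default filter selectivity must be at most 100, got {percent}"
            ));
        }
        self.default_selectivity = percent;
        Ok(self)
    }

    /// The input row count is known exactly in this sketch.
    fn input_statistics(&self) -> RowEstimate {
        RowEstimate::Exact(self.input_rows)
    }

    /// The output row count is only an estimate derived from the default selectivity.
    fn statistics(&self) -> RowEstimate {
        let rows = self.input_rows * usize::from(self.default_selectivity) / 100;
        RowEstimate::Inexact(rows)
    }
}
```

A test in this style asserts on the estimate for two different settings, so it would fail if the selectivity were hard coded again.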

Will try to finish it between today and tomorrow, thanks for the suggestion

@alamb fixed with 81034c2

Not sure however how this should be handled

That code will be invoked for 'complicated' predicates -- maybe we could fake it with something like sin(x) = 4.0.

alamb

Thanks @edmondop -- this looks good to me.

alamb · 2023-12-04T21:39:59Z

datafusion/physical-plan/src/filter.rs

@@ -994,4 +1014,22 @@ mod tests {

        Ok(())
    }
+
+    #[tokio::test]
+    async fn test_validation_filter_selectivity() -> Result<()> {


Not sure however how this should be handled

That code will be invoked for 'complicated' predicates -- maybe we could fake it with something like sin(x) = 4.0.

alamb · 2023-12-04T21:40:14Z

datafusion/physical-plan/src/filter.rs

+        ));
+        let filter = FilterExec::try_new(predicate, input)?;
+        let statistics = filter.statistics()?;
+        assert_eq!(statistics.num_rows, Precision::Inexact(200));


👌 very nice

* Turning filter selectivity as a configurable parameter
* Renaming API to be more consistent with struct value
* Adding a filter with custom selectivity

github-actions bot added the sql (SQL Planner), logical-expr (Logical plan and expressions), optimizer (Optimizer rules), core (Core DataFusion crate), and substrait labels on Nov 17, 2023