Disable optimization if we aren't sure it's faster (backport of #74260) #74563

nik9000 · 2021-06-24T14:00:15Z

This disables the filter-by-filter aggregation optimization used by
`terms`, `range`, `date_histogram`, and `date_range` aggregations unless
we're *sure* that it's faster than the "native" implementation. Mostly this
is when the top level query is empty or we can merge it into the filter
generated by the agg rewrite process.

Now that we have hard and fast rules we can drop the cost estimation
framework without too much fear. So we remove it in this change. That
stomps a bunch of complexity. Sadly, without the cost estimation stuff
we have to add a separate mechanism for blocking the optimization
against runtime fields, for which it'd be kind of garbage. For that I
added another rule preventing the filter-by-filter aggregation from
running against the queries made by runtime fields. It's not fool-proof,
but we have control over what queries we pass as a filter so it's not
wide open.
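
The two rules described above can be read as a small decision function.
The following is a minimal sketch in Java, not the code from this change:
the class `FilterByFilterRules`, the enums `TopLevelQuery` and
`FilterSource`, and the method `useFilterByFilter` are hypothetical names
invented for illustration, and the real checks presumably inspect actual
query objects rather than enums.

```java
import java.util.List;

/**
 * Illustrative sketch only. All names here are hypothetical and are not
 * the classes or methods touched by this change.
 */
final class FilterByFilterRules {

    /** How the top level query relates to the filters built by the agg rewrite. */
    enum TopLevelQuery { EMPTY, MERGEABLE_INTO_FILTER, NOT_MERGEABLE }

    /** Where a filter's query came from. */
    enum FilterSource { INDEXED_FIELD, RUNTIME_FIELD }

    private FilterByFilterRules() {}

    /**
     * Returns true only when both rules from the description hold;
     * otherwise the caller should fall back to the "native" aggregation.
     */
    static boolean useFilterByFilter(TopLevelQuery topLevel, List<FilterSource> filters) {
        // Rule added to replace cost estimation: never run filter-by-filter
        // against queries made by runtime fields.
        for (FilterSource source : filters) {
            if (source == FilterSource.RUNTIME_FIELD) {
                return false;
            }
        }
        // Main rule: only opt in when we're sure it's faster, which is
        // mostly when the top level query is empty or can be merged into
        // the filter.
        return topLevel == TopLevelQuery.EMPTY
            || topLevel == TopLevelQuery.MERGEABLE_INTO_FILTER;
    }
}
```

Under these assumptions the default is to decline the optimization: any
case the rules don't explicitly recognize falls through to the native
implementation, which matches the "unless we're sure" framing.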

I spent a lot of time working on an alternative to this that preserved
that fancy filter-by-filter collection mechanism and was much kinder
to the query cache. It detected cases where going full filter-by-filter
was bad and grouped those filters together to collect in one pass with a
funny ORing collector. It *worked*. And, if we were super concerned with
the performance of the `filters` aggregation it'd be the way to go. But
it was very complex and it was actually slower than using the native
aggregation for things like `terms` and `date_histogram`. It was
glorious. But it was wrong for us. It was too complex and it optimized
the wrong things.

So here we are. Hopefully this is a fairly simple solution to a sneaky
problem.


nik9000 added backport v7.14.0 labels Jun 24, 2021

Update skip

9440f02

nik9000 merged commit 2d5982b into elastic:7.x Jun 24, 2021
