Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Provide capability for not adding top docs collector in the query search path #13170

Closed
martin-gaievski opened this issue Apr 11, 2024 · 13 comments
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance Search Search query, autocomplete ...etc v2.15.0 Issues and PRs related to version 2.15.0 v3.0.0 Issues and PRs related to version 3.0.0

Comments

@martin-gaievski
Copy link
Member

Is your feature request related to a problem? Please describe

Right now as part of search code path both Default and Concurrent QueryPhaseSearchers are adding TopDocsCollector as first collector in the chain (code ref1 DefaultQueryPhaseSearcher, code ref2 ConcurrentQueryPhaseSearcher).

In case of certain types of extensions that may lead to a performance degradation of search. In particular that's true if custom collector and collector manager are added and those collectors are used in a way that scores are collected with a custom logic. In such case some compute (like filtering) will be done twice and depending on the extension logic that can be a thrown away work.

Example of such extension is hybrid query. Please check collector manager and collector added in neural-search plugin.

Describe the solution you'd like

I suggest core add an interface or method in existing QueryPhaseSearcher interface (or public core class QueryPhaseSearcherWrapper) that allows extension or plugin to set a flag and depending on that flag the top docs collector will be added or skipped. Default implementation should allow adding of topdocscollector so system is backward compatible.

Related component

Search

Describe alternatives you've considered

Right now there isn't much of alternatives, custom collector is executed after the top docs collector. This leads to performance degradation of search queries, that has been submitted to us by multiple customers.

Additional context

Custom query collector and collector manager were added as part of the neural-search plugin in 2.13 release (original PR)

@martin-gaievski martin-gaievski added enhancement Enhancement or improvement to existing feature or request untriaged labels Apr 11, 2024
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Apr 11, 2024
@reta
Copy link
Collaborator

reta commented Apr 17, 2024

@martin-gaievski I don't think we need to modify QueryPhaseSearcher but if needed you could (and probably should) provide your own implementation of QueryPhaseSearcher::searchWithCollector

@navneet1v
Copy link
Contributor

@reta if we modify the QueryPhaseSearcher::searchWithCollector then we will miss out on the optimizations/other functionalities happening in the SearchWithCollector functions or any new functionality that is getting added there.

I suggest core add an interface or method in existing QueryPhaseSearcher interface (or public core class QueryPhaseSearcherWrapper) that allows extension or plugin to set a flag and depending on that flag the top docs collector will be added or skipped. Default implementation should allow adding of topdocscollector so system is backward compatible.

On this I would suggest we can have a function returns the topDocsCollector instance, rather than a flag which says topdocscollector should be added or not.

@reta
Copy link
Collaborator

reta commented Apr 17, 2024

On this I would suggest we can have a function returns the topDocsCollector instance, rather than a flag which says topdocscollector should be added or not.

@navneet1v we may have an overloaded version:

        protected boolean searchWithCollector(
            QueryCollectorContext  queryCollectorContext,
            SearchContext searchContext,
            ContextIndexSearcher searcher,
            Query query,
            LinkedList<QueryCollectorContext> collectors,
            boolean hasFilterCollector,
            boolean hasTimeout
        ) throws IOException {
            ...
        }

The current searchWithCollector(...) would delegate to new searchWithCollector(...) with TopDocsCollector instance.

@navneet1v
Copy link
Contributor

navneet1v commented Apr 17, 2024

@navneet1v we may have an overloaded version:

@reta when you say we may have an overload version, are you saying we already have it? or we can build such a version? Little confused here. I think its the later one right?

@reta
Copy link
Collaborator

reta commented Apr 17, 2024

@navneet1v sorry, I should have been more clear: we could add the overloaded version (no flags or functions).

@navneet1v
Copy link
Contributor

@navneet1v sorry, I should have been more clear: we could add the overloaded version (no flags or functions).

Thanks. I myself don't agree with flags. But we can do override functions. :D Thanks @reta for helping out here.

@martin-gaievski can we start making the change.

@martin-gaievski
Copy link
Member Author

It would be great if you folks can confirm my understanding of the approach with new overloaded function with additional param QueryCollectorContext:

  • the way we can skip top docs collector in custom implementation is if return null for TopDocsCollectorContext, in such case it will not be added to the list of collector context objects LinkedList<QueryCollectorContext> collectors
  • function searchWithCollector(...) is part of both ConcurrentQPS (QPS stands for QueryPhaseSearcher) and DefaultQPS, not the QueryPhaseSearcher interface. To be able to take advantage of that overloaded function we need to extend both implementation of core QPS
  • 2 means that we need to use our implementations instead of those used in core today. For that we need to override searchWith from QueryPhaseSearcherWrapper in the plugin.

@reta
Copy link
Collaborator

reta commented Apr 17, 2024

the way we can skip top docs collector in custom implementation is if return null for TopDocsCollectorContext, in such case it will not be added to the list of collector context objects LinkedList collectors

:+1

function searchWithCollector(...) is part of both ConcurrentQPS (QPS stands for QueryPhaseSearcher) and DefaultQPS, not the QueryPhaseSearcher interface. To be able to take advantage of that overloaded function we need to extend both implementation of core QPS

That is correct, QueryPhaseSearcher is unchanged, the idea is to allow reusing the implementation only

2 means that we need to use our implementations instead of those used in core today. For that we need to override searchWith from QueryPhaseSearcherWrapper in the plugin.

I think no, the implementation would follow this chain: searchWithCollector -> searchWithCollector -> new overloaded searchWithCollector, so you would only need to override searchWithCollector to pass null as queryCollectorContext

@navneet1v
Copy link
Contributor

@martin-gaievski i know you did some deep-dive on the impact of having topdocs collector and hybrid search collector both running. Can you paste the flame graph and benchmark results here for visibility.

@martin-gaievski
Copy link
Member Author

Sure Navneet, after we've implemented one optimization on plugin side I got following numbers :

One sub-query that selects 11M documents

Bool: 84.7201
Hybrid: 256.799

One sub-query that selects 1.6K documents

Bool: 89.6258
Hybrid: 85.4563

Three sub-query that select 15M documents

Bool: 90.1481
Hybrid: 326.331

Based on profiling info (flamegraphs are attached) 45 to 80% percent of CPU time taken by the TopDocsCollector related methods, mainly by collect method.

profile_os_hybrq_perf_step1_4_23_15_50

profile_os_hybrq_perf_step1_4_23_16_20

For reference: benchmark is done for 2.13 using noaa OSB workload. Following search queries used for benchmark

Bool queries:

Query 1
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "term": {
                          "station.country_code": "JA"
                      }
                  },
                  {
                      "range": {
                          "TRANGE": {
                              "gte": 0,
                              "lte": 30
                          }
                      }
                  },
                  {
                    "range": {
                        "date": {
                            "gte": "2016-06-04",
                            "format":"yyyy-MM-dd"
                        }
                    }
                  }
              ]
          }
        }

Query 2
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "range": {
                          "TRANGE": {
                              "gte": -100,
                              "lte": -50
                          }
                      }
                  }
              ]
          }
        }

Query 3
        "size": 100,
        "query": {
          "bool": {
              "should": [
                  {
                      "range": {
                        "TRANGE": {
                          "gte": 1,
                          "lte": 35
                        }
                      }
                  }
              ]
          }

equivalent hybrid queres are:

Query 1
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "term": {
                        "station.country_code": "JA"
                    }
                },
                {
                    "range": {
                        "TRANGE": {
                            "gte": 0,
                            "lte": 30
                        }
                    }
                },
                {
                  "range": {
                      "date": {
                          "gte": "2016-06-04",
                          "format":"yyyy-MM-dd"
                      }
                  }
                }
            ]
          }
        }

Query 2
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "range": {
                        "TRANGE": {
                          "gte": -100,
                          "lte": -50
                        }
                    }
                }
            ]
          }

Query 3
        "size": 100,
        "query": {
          "hybrid": {
            "queries": [
                {
                    "range": {
                      "TRANGE": {
                        "gte": 1,
                        "lte": 35
                      }
                    }
                }
            ]
          }

@navneet1v
Copy link
Contributor

Thanks martin for providing the details. Let start working on the POC to remove the topdocsCollector and see how much improvement we can get here.

@vibrantvarun
Copy link
Member

Is there a scope of making this work little more generic?

I mean like to disable or enable any collector?

@reta
Copy link
Collaborator

reta commented Apr 28, 2024

I mean like to disable or enable any collector?

@vibrantvarun I don't think the idea is to enable or disable any collector, instead the QueryCollectorContext could be provided w/o query collectors. In that sense, the solution is generic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Performance Search Search query, autocomplete ...etc v2.15.0 Issues and PRs related to version 2.15.0 v3.0.0 Issues and PRs related to version 3.0.0
Projects
Status: Done
Development

No branches or pull requests

5 participants