Relevance Based Search With SQL/PPL Query Engine #182
Comments
Project Breakdown
WBS overview
Implementation work planning
Note that combined_fields #189 and common #190 are excluded: combined_fields is missing in OpenSearch, and the common function is already deprecated.
What do you think about the following syntax for queries that take a field list, like multi_match and simple_query_string? Treat each value in the fields list as either an identifier, a boosted identifier, or a boosted string. For example, instead of option (A):
it'd be better if we supported option (B):
This will be more natural for end users and keep input parsing in one place.
#636
Tagged all in the 2.2 release.
I think the Highlight function in PPL is still missing. Do we have a plan to add it too?
Yes.
Okay. Please check if we want to reopen this or create a new issue to track. Thanks!
Thanks, but no change is required. EDIT: I see the above ticket was updated to add highlight to the scope. We can re-open this ticket if that helps tracking. But 636 is still open to track progress.
1 Overview
In a search engine, relevance measures how accurately a search result matches the search query. The higher the relevance, the higher the quality of the search result, and the more relevant the content users get from it. For searches in the OpenSearch engine, the returned results (documents) are ordered by relevance by default, so the top documents have the highest relevance. In OpenSearch, relevance is indicated by the field _score. This float field scores the current document by how relevant it is to the search query; a higher score indicates a more relevant document.
The OpenSearch query engine does the query planning for user input queries. Currently the query engine is interfaced with SQL and PPL (Piped Processing Language), so users can write SQL and PPL queries to explore their data in OpenSearch. Most of the queries supported in the query engine follow SQL use cases, which are mapped to structured queries in OpenSearch.
This design adds support for relevance based search queries to the query engine; in other words, it enables OpenSearch users to write SQL/PPL queries that search by relevance in the search engine.
1.1 Problem Statement
1. DSL is not commonly used
OpenSearch query language (DSL) is not commonly used in regular databases, especially by users in the realm of analytics rather than development. This is also the reason we created the SQL plugin, where the query engine lives. Many other SQL servers enable full text search features to support relevance based search in SQL; likewise, the SQL plugin in OpenSearch is a natural place for users to search indexed data by relevance, provided we support the search features in the query engine.
2. OpenSearch search features are not present in the new query engine
The current query engine works more like a traditional SQL server that retrieves exact data from the OpenSearch indices, so from the query engine's standpoint, OpenSearch is treated as an ordinary database for storing data rather than as a search engine. One of the gaps between the two is that the search features of the search engine are not supported in the query engine. By bringing search by relevance, one of the most important features of the search engine, into the query engine, our users would be able to explore and search their data using SQL or PPL directly.
3. Full text functions in old engine
Last year (2020) we migrated the legacy SQL plugin to a newly constructed architecture with a new query engine that does the query planning for SQL and PPL queries. Some of the search functions were already enabled in the old engine. However, the full text functions in the old engine were not migrated to the new engine, so when users run a search query with SQL, it falls back to the old engine rather than being planned in the new query engine. This does not cause any issue in the short term, since users are not aware of which engine is used. From a long-term perspective, however, we need to support these functions in the new engine as well, both to get rid of the inconsistency and to prepare for the future deprecation of the old engine.
1.2 Use Cases
Use case 1: Full text search
Full text search functions are now very common in most SQL servers because of the high demand for search features, and also because of their high efficiency compared to the wildcard match (like) operator. With search features supported in the query engine, users can execute full text functions in SQL on top of the OpenSearch search engine, covering not only simple searches but also more complicated ones such as prefix search, multi-field search and so forth.
Use case 2: Field independent search with SQL/PPL
In many cases users want to search the entire document for a term rather than a specific field. For example, a user might want to search an index for the keyword “Seattle”, which might appear in fields like “DestinationCity”, “OriginCity”, “AirportLocation” etc., and all of these results matter to the user. The search features proposed in this design also cover this case, enabling users to do multi-field search or even field independent search with SQL and PPL.
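As a rough illustration of this use case, the sketch below builds the kind of OpenSearch multi_match request body that such a search could map to. The helper name multi_field_search and its fields argument are hypothetical and not part of the design; the field names come from the example above.

```python
import json


def multi_field_search(term, fields=None):
    """Build a multi_match request body for `term` (hypothetical helper).

    When `fields` is omitted, no "fields" key is sent, which leaves the
    choice of fields to the index's default field setting.
    """
    clause = {"query": term}
    if fields:
        clause["fields"] = list(fields)
    return {"query": {"multi_match": clause}}


# Search three named fields for the keyword from the use case.
body = multi_field_search(
    "Seattle", ["DestinationCity", "OriginCity", "AirportLocation"]
)
print(json.dumps(body, indent=2))

# Field independent variant: no explicit field list.
anywhere = multi_field_search("Seattle")
```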
Use case 3: Keyword highlighting
The highlighters supported in the search engine are another feature built on top of relevance based search. By enabling highlighters, users can get highlighted snippets from one or more fields in the search results, so they can clearly see where the query matches are.
Use case 4: Observability project essential
Observability is a project that aims to let users explore all types of data in one place by bringing logs, metrics and traces together. Relevance based search would be a key feature for users to search and filter the data in observability.
1.3 Requests from the community
2 Requirements
2.1 Functional Requirements
_score field in either ascending or descending order.
2.2 Non-functional Requirements
A. Reliability
B. Extensibility
2.3 Tenets
2.4 Out of Scope
3 High Level Design
3.1 Search functions to support
Since relevance based search depends heavily on the OpenSearch core engine, we define the search functions to follow the existing search queries (see appendix A1), but in the language style of SQL full text search functions. Here is the list of functions, with their basic functionality (i.e. parameter options), to support as relevance based search functions in the query engine:
match_phrase: The match_phrase query analyzes the text and creates a phrase query out of the analyzed text.
match_bool_prefix: Analyzes its input and constructs a bool query from the terms. Each term except the last is used in a term query. The last term is used in a prefix query.
multi_match (best_fields): Finds documents which match any field, but uses the _score from the best field.
multi_match (most_fields): Finds documents which match any field and combines the _score from each field.
multi_match (cross_fields): Treats fields with the same analyzer as though they were one big field. Looks for each word in any field.
multi_match (phrase): Runs a match_phrase query on each field and uses the _score from the best field.
multi_match (phrase_prefix): Runs a match_phrase_prefix query on each field and uses the _score from the best field.
multi_match (bool_prefix): Creates a match_bool_prefix query on each field and combines the _score from each field.
common: The common terms query is a modern alternative to stopwords which improves the precision and recall of search results (by taking stopwords into account), without sacrificing performance.
The function names follow the query type names directly, to avoid confusion when using exactly the same functionality from different languages. In addition, all the available options for the queries are passed as parameters to the functions. See 4.3 Search function details for more details.
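To make the match_bool_prefix description concrete, here is a minimal sketch of the bool query it describes: every term except the last becomes a term query, and the last becomes a prefix query. The function below is illustrative only; lowercasing and whitespace splitting are an assumption standing in for the real analyzer.

```python
def match_bool_prefix(field, text):
    """Build the bool query described for match_bool_prefix (illustrative).

    Assumption: lowercasing plus whitespace splitting stands in for the
    analyzer that OpenSearch would actually apply to `text`.
    """
    terms = text.lower().split()
    if not terms:
        raise ValueError("query text produced no terms")
    *leading, last = terms
    # Each term except the last is used in a term query.
    should = [{"term": {field: term}} for term in leading]
    # The last term is used in a prefix query.
    should.append({"prefix": {field: last}})
    return {"bool": {"should": should}}


query = match_bool_prefix("message", "quick brown f")
print(query)
```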
3.2 Architecture diagram
Option A: Keep the original query plans and register all the search functions as SQL/PPL scalar functions.
The current engine architecture is well constructed, with plan components that each have their own unique job. For example, all the conditions in a query are converted to filter plans, and aggregation functions are likewise converted to an aggregation plan.
In terms of functionality, the search features essentially play the role of a filter that finds the documents containing specific terms. Therefore, the search functions fit naturally into filter conditions, just like the other SQL functions in a condition expression.
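The idea of treating a search function as an ordinary filter condition can be sketched with a toy logical plan. The class names below (Project, Filter, Relation, Function) are hypothetical stand-ins, not the query engine's actual classes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Function:
    """A scalar function call used as a condition (hypothetical)."""
    name: str
    args: Tuple[str, ...]


@dataclass(frozen=True)
class Relation:
    index: str


@dataclass(frozen=True)
class Filter:
    condition: Function
    child: Relation


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: Filter


def explain(node, depth=0):
    """Render the plan tree, one node per line, children indented."""
    pad = "  " * depth
    if isinstance(node, Project):
        head = f"{pad}Project({', '.join(node.columns)})"
        return head + "\n" + explain(node.child, depth + 1)
    if isinstance(node, Filter):
        cond = node.condition
        head = f"{pad}Filter({cond.name}({', '.join(cond.args)}))"
        return head + "\n" + explain(node.child, depth + 1)
    return f"{pad}Relation({node.index})"


# SELECT * FROM my_index WHERE match(message, "this is a test")
plan = Project(
    ("*",),
    Filter(Function("match", ("message", "'this is a test'")), Relation("my_index")),
)
print(explain(plan))
```

In this sketch no new plan node type is needed: match sits inside the filter exactly as any other condition function would.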
Besides, one of the query engine architecture tenets is to keep the job of every plan node simple and unique, and to construct the entire query plan from existing plan nodes rather than creating new types of plan. This keeps each operator in the physical plans simple and well defined. The following figure shows a simplified logical query plan with only project, filter and relation plans for a match query like SELECT * FROM my_index WHERE match(message, "this is a test").
Pros:
Cons:
match to support the search features.
Option B: Create a new query plan node dedicated to the search features.
The diagram below is simplified with only the logical plan and physical plan sections and leaves out others. Please check out OpenSearch SQL Engine Architecture for the complete architecture of the query engine.
The match component in the logical plans stands for the logical match in the logical planning stage. It could be an extension of the filter plan, dedicated to handling the search functions.
Ideally, a relevance based search query is optimized to a plan in which the score operation can be merged with the logical index scan, which also means the score operation can be turned into an OpenSearch query and pushed down to the search engine. However, the query engine should also be able to handle scoring operations locally when query planning runs into a complicated case and fails to push the score down to the search engine.
Since in-memory execution of search queries is out of scope for the current phase, we do not implement the corresponding match operator in the query engine, and instead throw an error if a query tries to reach the match operator.
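The push-down-or-fail behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions: plan_match, UnsupportedSearchError and the can_push_down flag are hypothetical names, and the flag stands in for whatever decision the optimizer actually makes.

```python
class UnsupportedSearchError(Exception):
    """Raised when a search function would need in-memory execution."""


def plan_match(field, query, can_push_down):
    """Return the DSL to push down, or fail if push-down is impossible.

    `can_push_down` is a hypothetical flag representing the optimizer's
    verdict on whether the match condition can merge with the index scan.
    """
    if can_push_down:
        # Pushed down: the condition becomes an OpenSearch match query.
        return {"query": {"match": {field: {"query": query}}}}
    # No in-memory match operator in the current phase, so report an error.
    raise UnsupportedSearchError(
        "match could not be pushed down to the search engine; "
        "in-memory execution is out of scope for this phase"
    )


pushed = plan_match("message", "this is a test", can_push_down=True)
print(pushed)
```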
Pros:
Cons:
4 Detailed Design
4.1 SQL and PPL language interface
In the legacy engine, the solution is to let the user write a DSL segment as the parameter and pass it directly to the filter. This syntax is very flexible for users, but it skips the syntax analysis and type checking in between. So we propose a more SQL-like syntax for the search functions, passing the options as named arguments instead. However, this might make future maintenance harder if more contexts for the supported queries are added to the core engine in future releases.
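The benefit of named arguments over a raw DSL segment can be sketched like this: each option is checked against a known list before the DSL is built. The function and the option table below are hypothetical and cover only an illustrative subset of options; they are not the plugin's actual API.

```python
# Illustrative subset of options and their expected types (an assumption,
# not the full list supported by the match query).
ALLOWED_MATCH_OPTIONS = {"analyzer": str, "max_expansions": int}


def match(field, query, **options):
    """Validate named options, then build a match query (hypothetical)."""
    for name, value in options.items():
        expected = ALLOWED_MATCH_OPTIONS.get(name)
        if expected is None:
            raise ValueError(f"unknown option for match: {name}")
        if not isinstance(value, expected):
            raise TypeError(f"option {name} expects {expected.__name__}")
    return {"match": {field: {"query": query, **options}}}


dsl = match("message", "this is a test", analyzer="standard")
print(dsl)
```

A misspelled or mistyped option is rejected before any request reaches the search engine, which is the check that a pass-through DSL segment skips.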
For the sake of user experience, we decided on the SQL-like language interface. From the SQL standpoint, the search functions can be used like the existing SQL functions in SELECT, WHERE, GROUP BY clauses etc., and their return type is boolean. For example:
Similarly, from the PPL standpoint, we could define it either as functions or as a command named match:
Option 1: served as eval functions, similar to the SQL language
Option 2: define a new command
4.2 Search mode
The search type for the new engine defaults to query_then_fetch, the default setting of the OpenSearch engine. It usually works fine unless the number of documents is too small to smooth out the term/document frequency statistics. The alternative search type is dfs_query_then_fetch, which adds an extra step that pre-queries each shard for term and document frequencies before the query_then_fetch procedure.
As with DSL queries, the search mode options for the SQL plugin could be supported through the request endpoint, for example to set dfs_query_then_fetch as the search mode for a specific query:
4.3 Search function details
In this section we focus on defining the syntax and functionality of each function to support, and how these functions map to the search engine queries. Since most of the functions reuse common arguments, the arguments that can be used by the search functions are listed here.
Required parameters:
<field>. For functions that do not require a specific field, the fields parameter may be used to specify the fields to search for terms.
Optional parameters:
analyzer: The analyzer used to convert the text in query into tokens. Available values: (default) standard analyzer | simple analyzer | whitespace analyzer | stop analyzer | keyword analyzer | pattern analyzer | language analyzer | fingerprint analyzer | custom analyzer
max_expansions: The maximum number of terms to which the query value will expand. Defaults to 50.
cutoff_frequency: The cutoff_frequency value can either be relative to the total number of documents if in the range [0..1) or absolute if greater than or equal to 1.0.
The following links lead to the issue pages with the details of each search function, including syntax, functionality and available parameters:
1. match function
#184
2. match_phrase function
#185
3. match_phrase_prefix function
#186
4. match_bool_prefix function
#187
5. multi_match function
#188
6. combined_fields function
#189
7. common function
#190
8. query_string function
#191
9. simple_query_string function
#192
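As a worked illustration of the cutoff_frequency rule listed among the optional parameters above, the sketch below converts a cutoff value into an absolute document count. The function name is hypothetical, and rounding a relative cutoff up to a whole document is an assumption made for this example rather than documented behavior.

```python
import math


def cutoff_to_doc_count(cutoff_frequency, total_docs):
    """Interpret cutoff_frequency as an absolute document count.

    Values in [0, 1) are relative to the total number of documents;
    values >= 1.0 are already absolute. Rounding the relative case up
    is an assumption of this sketch.
    """
    if cutoff_frequency < 0:
        raise ValueError("cutoff_frequency must not be negative")
    if cutoff_frequency >= 1.0:
        return int(cutoff_frequency)
    return math.ceil(cutoff_frequency * total_docs)


relative = cutoff_to_doc_count(0.25, 1000)  # a quarter of all documents
absolute = cutoff_to_doc_count(5.0, 1000)   # exactly five documents
print(relative, absolute)
```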
5 Implementation
The implementation covers the planner and optimizer in the query engine but skips changes to the in-memory executor, as discussed in the design for the current phase. One of the tricky parts of the implementation is how to recognize custom values as query parameters or options.
The current solution is to bypass type checking and environment resolution for the parameter values, since they do not participate in in-memory computation; instead, all values are taken as string expressions and resolved in the optimizer layer when translated to query DSL. Take analyzer as an example:
6 Testing
All code changes should be test driven. Pull requests should include unit test cases, integration test cases and comparison test cases where applicable.
Appendix
A1. Relevance based search queries in OpenSearch
A2. Search flow in search engine