diff --git a/README.md b/README.md
index 0c220838b5..d3afa82ad4 100644
--- a/README.md
+++ b/README.md
@@ -122,11 +122,11 @@ The following projects have been merged into this repository as separate folders

Besides basic filtering and aggregation, OpenSearch SQL also supports complex queries, such as querying semi-structured data, JOINs, set operations, sub-queries etc. Beyond the standard functions, OpenSearch functions are provided for better analytics and visualization. Please check our [documentation](#documentation) for more details.

-Recently we have been actively improving our query engine primarily for better correctness and extensibility. Behind the scene, the new enhanced engine has already supported both SQL and Piped Processing Language. Please find more details in [SQL Engine V2 - Release Notes](./docs/dev/NewSQLEngine.md).
+Recently we have been actively improving our query engine, primarily for better correctness and extensibility. Behind the scenes, the new enhanced engine already supports both SQL and Piped Processing Language. Please find more details in [SQL Engine V2 - Release Notes](./docs/dev/intro-v2-engine.md).

## Documentation

-Please refer to the [SQL Language Reference Manual](./docs/user/index.rst), [Piped Processing Language (PPL) Reference Manual](./docs/user/ppl/index.rst) and [Technical Documentation](https://opensearch.org/docs/latest/search-plugins/sql/index/) for detailed information on installing and configuring plugin.
+Please refer to the [SQL Language Reference Manual](./docs/user/index.rst), [Piped Processing Language (PPL) Reference Manual](./docs/user/ppl/index.rst), [OpenSearch SQL/PPL Engine Development Manual](./docs/dev/index.md) and [Technical Documentation](https://opensearch.org/docs/latest/search-plugins/sql/index/) for detailed information on installing and configuring the plugin.

## Forum

diff --git a/docs/dev/datasource-prometheus.md b/docs/dev/datasource-prometheus.md
new file mode 100644
index 0000000000..8b07e4dc58
--- /dev/null
+++ b/docs/dev/datasource-prometheus.md
@@ -0,0 +1,331 @@
# Overview
Build a federated PPL query engine to fetch data from multiple data sources.

## 1. Motivation
PPL (Piped Processing Language) serves as the de-facto query language for all the observability solutions (Event Explorer, Notebooks, Trace Analytics) built on OpenSearch Dashboards. In its current shape, the PPL engine can only query OpenSearch, which prevents observability solutions from leveraging data in other data sources. As part of this project, we want to build a framework to support multiple data sources and initially implement support for metric data stores like Prometheus and AWS CloudWatch. This federation will also help in feeding data from different sources into ML commands and in correlating metrics and logs from different data sources.

## 2. Glossary

* *Observability:* The ability to understand what's happening inside your business/application using logs, traces, metrics and other data emitted from the application.

## 3. Tenets
* The query interface should be agnostic of the underlying data store.
* Changes to the PPL query language grammar should be simple and easy to onboard.
* Component design should be extensible for supporting new data stores.

## 4. Out of Scope for the Design

* Join queries across data sources are out of scope.
* Distributed query execution is out of scope; the current execution will happen on a single coordination node.
## 5. Requirements

### 5.1 *Functional*

* As an OpenSearch user, I should be able to configure and add a new data source with available connectors.
* As an OpenSearch user, I should be able to query data from all the configured data sources using PPL.
* As an OpenSearch contributor, I should be able to add connectors for different data sources.

### 5.2 *Non Functional*

* The query execution plan should be optimized and pushed down as much as possible to the underlying data stores.
* Latency added by the query engine should be small compared to that of the underlying data stores.

## 6. High Level Design

![prometheus drawio(2)](https://user-images.githubusercontent.com/99925918/166970105-09855a1a-6db9-475f-9638-eb240022e399.svg)

At a high level, our design is broken down into the major sections below.

* Data source representation.
  * This section describes the new constructs introduced in the query engine to support additional data sources.
* Query engine architecture changes.
  * This section gives an overview of the architecture and the changes required for supporting multiple data sources.
* PPL grammar changes for metric data and details for implementing the Prometheus connector.

### 6.1 Data source representation

Below are the new constructs introduced to support additional data sources. These constructs help in identifying and referring to the correct data source for a particular query.

* *Connector:* A connector is a component that adapts the query engine to a data source. For example, the Prometheus connector would adapt and help execute queries that run on a Prometheus data source.

* *Catalog:* We can describe a catalog as the union of a connector and the underlying instance of the data source being referred to. This gives us the flexibility to refer to different instances of a similar data store with the same connector, i.e. we can refer to data from multiple Prometheus instances using the Prometheus connector. The name of each catalog must be unique.

Example Prometheus catalog definition:
```
[{
    "name" : "prometheus_1",
    "connector": "prometheus",
    "uri" : "http://localhost:9090",
    "authentication" : {
        "type" : "basicauth",
        "username" : "admin",
        "password" : "admin"
    }
}]
```

* *Table:* A table is a set of unordered rows which are organized into named columns with types. The mapping from source data to tables is defined by the connector.

* In order to query data from the above Prometheus catalog, a query would look like the following.
  * `source = prometheus_1.total_cpu_usage | stats avg(@value) by region`
* What will be the interface to add new catalogs?
  * In the initial phase, we are adopting a simple approach of storing the above configuration in the opensearch-keystore. Since this catalog configuration contains credential data, we take this input from the user via the keystore (secure settings).
```
bin/opensearch-keystore add-file plugins.query.federation.catalog.config catalog.json
```

* Catalog is optional; it defaults to the OpenSearch instance in which the PPL query engine plugin is running.
  * `source = accounts` is a valid query, and the engine defaults to the OpenSearch instance to get the data from.

### 6.2 Query Engine Architecture

There are no major changes in the query engine flow, as the current state is extensible enough to support new data stores.

Changes required:
* The current StorageEngine and Table injection beans default to the OpenSearch connector implementation. This should be dynamic, based on the catalog mentioned in the source command.

![PPL Execution flow](https://user-images.githubusercontent.com/99925918/163804980-8e1f5c07-4776-4497-97d5-7b2eec1ffb8e.svg)

Interfaces to be implemented for supporting a new data source connector:

```java
public interface StorageEngine {

  /**
   * Get {@link Table} from storage engine.
   */
  Table getTable(String name);
}
```

```java
public interface Table {

  /**
   * Get the {@link ExprType} for each field in the table.
   */
  Map<String, ExprType> getFieldTypes();

  /**
   * Implement a {@link LogicalPlan} by {@link PhysicalPlan} in storage engine.
   *
   * @param plan logical plan
   * @return physical plan
   */
  PhysicalPlan implement(LogicalPlan plan);

  /**
   * Optimize the {@link LogicalPlan} by storage engine rule.
   * The default optimize solution is no optimization.
   *
   * @param plan logical plan.
   * @return logical plan.
   */
  default LogicalPlan optimize(LogicalPlan plan) {
    return plan;
  }

}
```

```java
public interface ExecutionEngine {

  /**
   * Execute physical plan and call back response listener.
   *
   * @param plan executable physical plan
   * @param listener response listener
   */
  void execute(PhysicalPlan plan, ResponseListener<QueryResponse> listener);

}
```
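As an illustration of how a new connector plugs into these interfaces, below is a minimal sketch of what a Prometheus storage engine might look like. `PrometheusClient` and `PrometheusMetricTable` are hypothetical names used only for this example, not the final implementation.

```java
/**
 * A minimal sketch (not the actual implementation) of a Prometheus
 * connector's storage engine. PrometheusClient and PrometheusMetricTable
 * are hypothetical names used for illustration.
 */
public class PrometheusStorageEngine implements StorageEngine {

  /** Thin wrapper around the HTTP API of one Prometheus instance. */
  private final PrometheusClient client;

  public PrometheusStorageEngine(PrometheusClient client) {
    this.client = client;
  }

  @Override
  public Table getTable(String name) {
    // Each Prometheus metric is exposed to PPL as a table named after the
    // metric, e.g. `source = prometheus_1.total_cpu_usage` resolves here.
    return new PrometheusMetricTable(client, name);
  }
}
```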
### 6.3 PPL Metric Grammar and Prometheus Connector

This section discusses the metric grammar for PPL. Until now, PPL catered only to event data, and the language grammar was built around that. With the introduction of support for new metric data stores, we want to analyse and add changes to support metric data. As called out in the tenets, we want to build a simple, intuitive grammar that developers can easily onboard.

*Key Characteristics of Metric Data*

* Metric data is a timeseries vector. Below are the four major attributes of any metric data.
  * Metric name
  * Measured value
  * Dimensions
  * Timestamp
* Every time series is uniquely identified by its metric name and a combination of its dimension values.

Since we don't have column names similar to relational databases, `@timestamp` and `@value` are new internal keywords used to represent the time and measurement of a metric.

*Types of metric queries*
***These queries are inspired by the widely used Prometheus data store.***

### Passthrough PromQL Support
* PromQL is an easy-to-use, powerful language over metrics, and we want to support PromQL queries on catalogs with Prometheus connectors.
* Function signature: source = ``promcatalog.nativeQuery(`promQLCommand`, startTime = "{{startTime}}", endTime="{{endTime}}", step="{{resolution}}")``

### PPL queries on Prometheus
### 1. Selecting Time Series

|PromQL |PPL |Remarks |
|--- |--- |--- |
|`node_cpu_seconds_total` |source = promcatalog.`node_cpu_seconds_total` | |
|`node_cpu_seconds_total[5m]` |source = promcatalog.`node_cpu_seconds_total` \| where endtime = now() and starttime = now()-5m |We can either use `starttime and endtime` or `@timestamp < now() and @timestamp > now() - 5m` |
|`node_cpu_seconds_total{cpu="0",mode="idle"}` |source = promcatalog.`node_cpu_seconds_total` \| where cpu = '0' and mode = 'idle' |This has the same limitations as the first query: where should we stop the result set? |
|`process_resident_memory_bytes offset 1d` |source = promcatalog.`process_resident_memory_bytes` \| where starttime = now()-1d and endtime = 1d | |
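For reference, a bounded selection like the second row above could be pushed down to Prometheus' range query HTTP endpoint. The sketch below assumes a straightforward translation; the actual parameter mapping (such as the `step` resolution) is an open implementation detail.

```
# PPL: source = promcatalog.`node_cpu_seconds_total`
#      | where endtime = now() and starttime = now()-5m
GET http://localhost:9090/api/v1/query_range
    ?query=node_cpu_seconds_total
    &start=<now()-5m as epoch seconds>
    &end=<now() as epoch seconds>
    &step=<resolution>
```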
###### Limitations and Solutions

* How do we limit the output size of a time series select query? We can't output the entire history.
  * **[Solution 1]** Make `starttime` and `endtime` compulsory?
    * This would add validations based on the catalog used while keeping the grammar agnostic of the underlying catalog.
  * **[Solution 2]** Have a hard limit on the result set length. Let the user omit the time range and dig deep into the time series until it hits the hard limit. How can we execute this under the hood? In the case of Prometheus, there is no limit operator in PromQL; the only limit we could set is the time range limit.
  * **[Proposed Solution 3]** Have a default time range specified in the catalog configuration. If the user specifies no filter on time range in the where clause, this default time range will be applied, similar to limiting the event data result set to the current day for OpenSearch.

### 2. Rates of increase for counters

|PromQL |PPL |Remarks |
|--- |--- |--- |
|`rate(demo_api_request_duration_seconds_count[5m])` |source = promcatalog.`demo_api_request_duration_seconds_count` \| eval x = rate(@value, 5m) | |
|`irate(demo_api_request_duration_seconds_count[1m])` |source = promcatalog.`demo_api_request_duration_seconds_count` \| eval x = irate(@value, 1m) | |
|`increase(demo_api_request_duration_seconds_count[1h])` |source = promcatalog.`demo_api_request_duration_seconds_count` \| eval x = increase(@value, 1h) | |

* Should we output x along with @value in two columns, or restrict the output to the rate function output?
  * [timestamp, value, rate] or [timestamp, x]
* If we push down the query to Prometheus, we only get the rate as output. Should we not push down and instead calculate on the coordination node?

### 3. Aggregating over multiple series

|PromQL |PPL |Remarks |
|--- |--- |--- |
|`sum(node_filesystem_size_bytes)` |source = promcatalog.`node_filesystem_size_bytes` \| stats sum(@value) by span(@timestamp, 5m) | |
|`sum by(job, instance) (node_filesystem_size_bytes)` |source = promcatalog.`node_filesystem_size_bytes` \| stats sum(@value) by instance, job, span(@timestamp, 5m) | |
|`sum without(instance, job) (node_filesystem_size_bytes)` |source = promcatalog.`node_filesystem_size_bytes` \| stats sum(@value) without instance, job |We don't have `without` support; we need to build it so that we can group by fields other than instance, job. |

`sum()`, `min()`, `max()`, `avg()`, `stddev()`, `stdvar()`, `count()`, `count_values()`, `group()`, `bottomk()`, `topk()` and `quantile()` are additional aggregation functions.

* On time series, we calculate aggregations within small buckets over a large period of time, e.g. calculate the average over 5 minutes for the last one hour. PromQL gets the time parameters from API parameters; how can we get these for PPL? We could make them compulsory, but that would make the language specific to metric data stores, which doesn't apply to event data stores.
* Can we have a separate command `mstats` for metric data stores with a compulsory span expression? (See the sketch below.)
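A possible shape for such an `mstats` command (syntax is a hypothetical sketch, not finalized):

```
# compulsory span expression: average in 5-minute buckets per instance/job,
# over the default (or explicitly filtered) time range
source = promcatalog.`node_filesystem_size_bytes`
| mstats avg(@value) by instance, job, span(@timestamp, 5m)
```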
### 4. Math Between Time Series [Vector Operation Arithmetic]

Arithmetic between series. Allowed operators: `+`, `-`, `*`, `/`, `%`, `^`

|PromQL |PPL |Remarks |
|--- |--- |--- |
|`node_memory_MemFree_bytes + node_memory_Cached_bytes` |source = promcatalog.`node_memory_MemFree_bytes` \| `vector_op`(+) promcatalog.`node_memory_Cached_bytes` |Add all equally-labelled series from both sides: |
|`node_memory_MemFree_bytes + on(instance, job) node_memory_Cached_bytes` |source = promcatalog.`node_memory_MemFree_bytes` \| `vector_op`(+) on(instance, job) promcatalog.`node_memory_Cached_bytes` |Add series, matching only on the `instance` and `job` labels: |
|`node_memory_MemFree_bytes + ignoring(instance, job) node_memory_Cached_bytes` |source = promcatalog.`node_memory_MemFree_bytes` \| `vector_op`(+) ignoring(instance, job) promcatalog.`node_memory_Cached_bytes` |Add series, ignoring the `instance` and `job` labels for matching: |
|`rate(demo_cpu_usage_seconds_total[1m]) / on(instance, job) group_left demo_num_cpus` |source = `rate(promcatalog.demo_cpu_usage_seconds_total[1m])` \| `vector_op`(/) on(instance, job) group_left promcatalog.`demo_num_cpus` |Explicitly allow many-to-one matching: |
|`node_filesystem_avail_bytes * on(instance, job) group_left(version) node_exporter_build_info` |source = promcatalog.`node_filesystem_avail_bytes` \| `vector_op`(*) on(instance, job) group_left(version) promcatalog.`node_exporter_build_info` |Include the `version` label from the "one" (right) side in the result: |

* Event data joins can have the grammar below.
  * `source = lefttable | join righttable on columnName`
  * `source = lefttable | join righttable on $left.leftColumn = $right.rightColumn`
* Metric data grammar, since joins are mostly for vector arithmetic:
  * `source=leftTable | vector_op(operator) on|ignoring group_left|group_right rightTable`
* Which would be the better keyword: `vector_op`, `mjoin`, or something new?
* Does vector_op have any meaning for event data?
  * Can we restrict vector_op to metric data only?
  * If yes, how should we identify that something is of metric data type?
  * This can break the tenet of designing the language grammar irrespective of the underlying data store.

### 5. Filtering series by value [Vector Operation Comparative]

|PromQL |PPL |Remarks |
|--- |--- |--- |
|`node_filesystem_avail_bytes > 10*1024*1024` |source = promcatalog.`node_filesystem_avail_bytes` \| `vector_op`(>) 10*1024*1024 |Only keep series with a sample value greater than a given number: |
|`go_goroutines > go_threads` |source = promcatalog.`go_goroutines` \| `vector_op`(>) promcatalog.`go_threads` |Only keep series from the left-hand side whose sample values are larger than their right-hand-side matches: |
|`go_goroutines > bool go_threads` |source = promcatalog.`go_goroutines` \| `vector_op`(> bool) promcatalog.`go_threads` |Instead of filtering, return `0` or `1` for each compared series: |
|`go_goroutines > bool on(job, instance) go_threads` |source = promcatalog.`go_goroutines` \| `vector_op`(> bool) on(job,instance) promcatalog.`go_threads` |Match only on specific labels: |

* The above operations are similar to section 4, except the operators are comparison operators instead of arithmetic ones.
* The operations are always between `a vector and a scalar` or `a vector and a vector`.

### 6. Set operations [Vector Operation Logical]

|PromQL |PPL |Remarks |
|--- |--- |--- |
|`up{job="prometheus"} or up{job="node"}` |source = promcatalog.`up` \| where job="prometheus" \| vector_op(or) inner(promcatalog.`up` \| where job="node") |Include series present on either side: |
|`node_network_mtu_bytes and (node_network_address_assign_type == 0)` |source = promcatalog.`node_network_mtu_bytes` \| vector_op(and) inner(promcatalog.`node_network_address_assign_type` \| vector_op(==) 0) |Include any label sets that are present both on the left and right side: |
|`node_network_mtu_bytes unless (node_network_address_assign_type == 1)` |source = promcatalog.`node_network_mtu_bytes` \| vector_op(unless) inner(promcatalog.`node_network_address_assign_type` \| vector_op(==) 1) |Include any label sets from the left side that are not present on the right side: |
|`node_network_mtu_bytes and on(device) (node_network_address_assign_type == 0)` |source = promcatalog.`node_network_mtu_bytes` \| vector_op(and) on(device) inner(promcatalog.`node_network_address_assign_type` \| vector_op(==) 0) |Match only on specific labels: |

### 7. Quantiles from histograms

|PromQL |PPL |Remarks |
|--- |--- |--- |
|`histogram_quantile(0.9, rate(demo_api_request_duration_seconds_bucket[5m]))` |source = promcatalog.`demo_api_request_duration_seconds_bucket` \| eval x = `rate`(@value, 5m) \| eval k = `histogram_quantile`(le=0.9,x) |90th percentile request latency over last 5 minutes, for every label dimension: |
|`histogram_quantile(0.9,sum by(le, path, method) (rate(demo_api_request_duration_seconds_bucket[5m])))` |source = promcatalog.`demo_api_request_duration_seconds_bucket` \| eval x = `rate`(@value, 5m) \| stats `sum(x)` by (le,path,method) |90th percentile request latency over last 5 minutes, for only the `path` and `method` dimensions. |

* Can we apply nested functions?

### 8. Changes in gauges

|PromQL |PPL |Remarks |
|--- |--- |--- |
|`deriv(demo_disk_usage_bytes[1h])` |source = promcatalog.`demo_disk_usage_bytes` \| eval x = deriv(@value, 1h) |Per-second derivative using linear regression: |
|`predict_linear(demo_disk_usage_bytes[4h], 3600)` |source = promcatalog.`demo_disk_usage_bytes` \| eval x = `predict_linear`(@value, 4h, 3600) |Predict value in 1 hour, based on last 4 hours: |

Has the same drawbacks as the earlier eval function commands.

### 9. Aggregating over time

|PromQL |PPL |Remarks |
|--- |--- |--- |
|`avg_over_time(go_goroutines[5m])` |source = promcatalog.`go_goroutines` \| eval k = `avg_over_time`(@value, 5m) |Average within each series over a 5-minute period: |
|`max_over_time(process_resident_memory_bytes[1d])` |source = promcatalog.`process_resident_memory_bytes` \| eval k = `max_over_time`(@value, 1d) |Get the maximum for each series over a one-day period: |
|`count_over_time(process_resident_memory_bytes[5m])` |source = promcatalog.`process_resident_memory_bytes` \| eval k = `count_over_time`(@value, 5m) |Count the number of samples for each series over a 5-minute period: |

## 7. Implementation

Tasks and phase-wise division:

|Goal |Description |Phase |Area |
|--- |--- |--- |--- |
|Catalog configuration for external metric data stores |Provides a rudimentary way of capturing catalog connection information using opensearch-keystore. |P0 |Backend (OS SQL Plugin) |
|PPL grammar to support basic Prometheus metric operations |This involves support of the PPL search, stats and where commands on Prometheus data. |P0 |Backend (OS SQL Plugin) |
|PromQL support |PromQL support for querying Prometheus |P0 |Backend (OS SQL Plugin) |
|Event analytics page enhancements |This includes UI support in the existing event analytics page for visualizations of metric data from Prometheus. |P0 |UI (OSD Observability Plugin) |
|PPL grammar to support advanced Prometheus metric operations |This includes design and implementation of PPL grammar for advanced query operations like cross-metric commands, rate commands, histogram and summary commands. |P1 |Backend (OS SQL Plugin) |
|Metric analytics page |Build a new page explicitly for viewing metrics from all sources. |P1 |UI (OSD Observability Plugin) |

Quick Demo:

https://user-images.githubusercontent.com/99925918/164787195-86a17d34-cf75-4a40-943f-24d1cf7b9a51.mov

diff --git a/docs/dev/datasource-query-s3.md b/docs/dev/datasource-query-s3.md
new file mode 100644
index 0000000000..6bf2036801
--- /dev/null
+++ b/docs/dev/datasource-query-s3.md
@@ -0,0 +1,252 @@
## 1. Overview

In this document, we propose a solution in OpenSearch Observability to query log data stored in S3.

### 1.1 Problem Statements

Currently, OpenSearch Observability is a collection of plugins and applications that let you visualize data-driven events by using Piped Processing Language to explore, discover, and query data stored in OpenSearch. The major requirements we heard from customers are

* **cost**, regarding the hardware cost of setting up an OpenSearch cluster.
* **ingestion performance**, as it is not easy to support high-throughput raw log ingestion.
* **flexibility**, as OpenSearch indices require users to know their query patterns before ingesting data, which is not flexible.

We can build a new solution for OpenSearch Observability that leverages S3 as storage. The benefits are

* **cost efficiency**: compared with OpenSearch, S3 is cheap.
* **high ingestion throughput**: S3 is designed to support high-throughput writes.
* **flexibility**: users do not need to worry about index mapping definition and reindexing; everything can be defined at the query tier.
* **scalability**: users do not need to optimize their OpenSearch cluster for writes; S3 scales automatically.
* **data durability**: S3 provides eleven 9s of data durability.

With all these benefits, are there any concerns? The **ability to query S3 from OpenSearch and query performance** are the major concerns. In this doc, we provide a solution to address these two major concerns.

## 2. Terminology

* **Catalog**: OpenSearch accesses external data sources through catalogs, for example an S3 catalog. Each catalog has attributes; the most important attribute is the data source access credentials.
* **Table**: To access an external data source, users create an external table to describe the schema and location. A table is a virtual concept which does not map to an OpenSearch index.
* **Materialized View**: Users can create views from existing tables. Each view maps 1:1 to an OpenSearch index. There are two types of views:
  * (1) Permanent view (default), which is fully managed by the user.
  * (2) Transient view, which is maintained by Maximus automatically. The user can decide when to drop a transient view.

## 3. Requirements

### 3.1 Use Cases

#### Use Case - 1: pre-build and query metrics from logs on S3

**_Create_**

* User could ingest logs or events directly to S3 with existing ingestion tools (e.g. [fluentbit](https://docs.fluentbit.io/manual/pipeline/outputs/s3)). The ingested log should be partitioned, e.g. _s3://my-raw-httplog/region/yyyy/mm/dd_.
* User configures the S3 catalog in OpenSearch. By default, the S3 connector uses [ProfileCredentialsProvider](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/profile/ProfileCredentialsProvider.html).

```
settings.s3.access_key: xxxxx
settings.s3.secret_key: xxxxx
```

* User creates a table in OpenSearch for their log data on S3. Maximus will create the table httplog in the s3 catalog.

```
CREATE EXTERNAL TABLE `s3`.`httplog`(
   @timestamp timestamp,
   clientip string,
   request string,
   state integer,
   size long
   )
ROW FORMAT SERDE
  'json'
  'grok',
PARTITION BY
  ${region}/${year}/${month}/${day}/
LOCATION
  's3://my-raw-httplog/';
```
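For illustration, with the `PARTITION BY` specification above, ingested objects would be laid out under keys such as the following (region value and file names are hypothetical):

```
s3://my-raw-httplog/us-east-1/2022/07/01/httplog-000001.json.gz
s3://my-raw-httplog/us-east-1/2022/07/02/httplog-000001.json.gz
```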
* User creates the view *failEventsMetrics*. Note: users can only create views from schema-defined tables in Maximus.

```
CREATE MATERIALIZED VIEW failEventsMetrics (
   cnt long,
   time timestamp,
   status string
)
AS source=`s3`.`httpLog` status != 200 | stats count() as cnt by span(5mins), status
WITH (
   REFRESH=AUTO
)
```

* Maximus will (1) create the view *failEventsMetrics* in the default catalog, (2) create the index *failEventsMetrics* in the OpenSearch cluster, and (3) refresh the *failEventsMetrics* index with logs on S3. _Note: Create view returns success when operations (1) and (2) succeed. Refresh view is an async task that will be executed in the background._
* User could describe the view to monitor its status.

```
DESCRIBE/SHOW MATERIALIZED VIEW failEventsMetrics

# Return
status: INIT | IN_PROGRESS | READY
```

_**Query**_

* User could query the view.

```
source=failEventsMetrics time in last 7 days
```

* User could query the table; Thunder will rewrite the table query as a view query when optimizing the query.

```
source=`s3`.`httpLog` status != 200 | stats count() as cnt by span(1hour), status
```

_**Drop**_

* User could drop the table. Maximus will delete the httpLog metadata from the catalog and drop the associated views.

```
DROP TABLE `s3`.`httpLog`
```

* User could drop the view. Maximus will delete the failEventsMetrics metadata from the catalog and delete the failEventsMetrics indices.

```
DROP MATERIALIZED VIEW failEventsMetrics
```

**Access Control**

* User could not configure any permission on the table.
* User could not directly configure any permission on the view, but could configure permissions on the index associated with the view.
  ![image1](https://user-images.githubusercontent.com/2969395/182239505-a8541ec1-f46f-4b91-882a-8a4be36f5aea.png)

#### Use Case - 2: Ad-hoc S3 query in OpenSearch

**_Create_**

* Similar to the previous example, user could ingest logs or events directly to S3 with existing ingestion tools, and create a catalog and table from the data on S3.

_**Query**_

* User could query the table without creating a view. At runtime, Maximus will create a transient view and populate the view with the required data.

```
source=`s3`.`httpLog` status != 200 | stats count() as cnt by span(5mins), status
```
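As a sketch of what happens under the hood (the view name is hypothetical, matching the LIST output below), the optimizer could rewrite the ad-hoc table query above to scan the transient view, which is backed by an OpenSearch index, instead of S3:

```
# original ad-hoc table query
source=`s3`.`httpLog` status != 200 | stats count() as cnt by span(5mins), status

# rewritten internally to scan the auto-created transient view
source=httplog-tempView-xxxx | stats count() as cnt by span(5mins), status
```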
_**List**_

* User could list all the views of a table, no matter how the view was created. User could query/drop/describe a temp view created by Maximus the same as a user-created view.

```
LIST VIEW on `s3`.`httpLog`

# return
httplog-tempView-xxxx
```

**Access Control**

* User could not configure any permission on the table. During query time, Maximus will use the credentials to access S3 data. It is the user's responsibility to configure the permissions on the S3 data.
* User could not directly configure any permission on the view, but could configure permissions on the index associated with the view.
  ![image2](https://user-images.githubusercontent.com/2969395/182239672-72b2cfc6-c22e-4279-b33e-67a85ee6a778.png)

### 3.2 Function Requirements

#### Query S3

* User could query time series data on S3 by using PPL in OpenSearch.
* For querying time series data in S3, user must create a **Catalog** for S3 with the required settings:
  * access key and secret key
  * endpoint
* For querying time series data in S3, user must create a **Table** of the time series data on S3.

#### View

* Support creating a materialized view from time series data on S3.
* Support full materialized view refresh.
* Support manual incremental materialized view refresh.
* Support automatic incremental materialized view refresh.
* Support hybrid scan.
* Support dropping a materialized view.
* Support showing materialized views.

#### Query Optimization

* Inject an optimizer rule to rewrite queries with an MV to avoid S3 scans.

#### Automatic query acceleration

* Automatically select view candidates based on OpenSearch - S3 query execution metrics.
* Store workload and selected view info for visualization and user intervention.
* Automatically create/refresh/drop views.

#### S3 data format

* The time series data could be compressed with gzip.
* The time series data file format could be:
  * JSON
  * TSV
* If the time series data is partitioned and has a snapshot ID, the query engine can support automatic incremental refresh and hybrid scan.

#### Resource Management

* Support circuit-breaker-based resource control when executing a query.
* Support a task-based resource manager.

#### Fault Tolerance

* For fast query processing, we sacrifice fault tolerance. Support fast query failure in case of hardware failure.

#### Setting

* Support configuring the percentage of disk that automatically created views may use.

### 3.3 Non Function Requirements

#### Security

There are two types of privileges that are related to materialized views:

* Privileges directly on the materialized view itself.
* Privileges on the objects (e.g. table) that the materialized view accesses.

*_materialized view itself_*

* User could not directly configure any access control on the view, but could configure access control on the index associated with the view. In other words, a materialized view inherits all the access control on the index associated with the view.
* When **automatically refreshing a view**, Maximus will use the backend role. This requires that the backend role has permission to index data.

*_objects (e.g. table) that the materialized view accesses_*

* As with non-materialized views, a user who wishes to access a materialized view needs privileges only on the view, not on the underlying object(s) that the view references.

*_table_*

* User could not configure any privileges on the table. _Note: because a table query can be rewritten as a view query, a user who does not have the required permission to access the view can still get a no-permission exception._

*_objects (e.g. table) that the table refers to_*

* The S3 access control is only evaluated during S3 access. If the table access is rewritten as view access, the S3 access control will not take effect.

*_Encryption_*

* Materialized view data will be encrypted at rest.

#### Others

* Easy to use: the solution should be designed to be easy to use. 
It should just work out of the box and provide good performance with minimal setup. +* **Performance**: Use **http_log** dataset to benchmark with OpenSearch cluster. +* Scalability: The solution should be scale horizontally. +* **Serverless**: The solution should be designed easily deployed as Serverless infra. +* **Multi-tenant**: The solution should support multi tenant use case. +* **Metrics**: Todo +* **Log:** Todo + +## 4.What is, and what is not + +* We design and optimize for observability use cases only. Not OLTP and OLAP. +* We only support time series log data on S3. We do not support query arbitrary data on S3. \ No newline at end of file diff --git a/docs/dev/index.md b/docs/dev/index.md new file mode 100644 index 0000000000..96248ecf48 --- /dev/null +++ b/docs/dev/index.md @@ -0,0 +1,68 @@ + +# OpenSearch SQL/PPL Engine Development Manual + +## Introduction + ++ [Architecture](intro-architecture.md): a quick overview of architecture ++ [V2 Engine](intro-v2-engine.md): introduces why we developed new V2 engine ++ Concepts ++ Quickstart + +--- +## Clients + ++ **CLI** ++ **JDBC Driver** ++ **ODBC Driver** ++ **Query Workbench** + +--- +## Deployment + ++ **Standalone Mode** ++ **OpenSearch Cluster** + +--- +## Programming Guides + ++ **API** ++ **JavaDoc** + +--- +## Development Guides + +### Language Processing + ++ **SQL** + + [Aggregate Window Function](sql-aggregate-window-function.md): aggregate window function support ++ **Piped Processing Language** + +### Query Processing + ++ **Query Analyzing** + + [Semantic Analysis](query-semantic-analysis.md): performs semantic analysis to ensure semantic correctness + + [Type Conversion](query-type-conversion.md): implement implicit data type conversion ++ **Query Planning** + + [Logical Optimization](query-optimizier-improvement.md): improvement on logical optimizer and physical implementer ++ **Query Execution** + + [Query Manager](query-manager.md): query management ++ **Query Acceleration** + + [Automatic Acceleration](query-automatic-acceleration.md): workload based automatic query acceleration proposal + +### Data Sources + ++ **OpenSearch** + + [Relevancy Search](opensearch-relevancy-search.md): OpenSearch relevancy search functions + + [Sub Queries](opensearch-nested-field-subquery.md): support sub queries on OpenSearch nested field + + [Pagination](opensearch-pagination.md): pagination implementation by OpenSearch scroll API ++ [Prometheus](datasource-prometheus.md): Prometheus query federation ++ **File System** + + [Querying S3](datasource-query-s3.md): S3 query federation proposal + +--- +## Other Documents + ++ **Test Framework** + + [Doc Test](testing-doctest.md): makes our doc live and runnable to ensure documentation correctness + + [Comparison Test](testing-comparison-test.md): compares with other databases to ensure functional correctness ++ **Operation Tools** \ No newline at end of file diff --git a/docs/dev/Architecture.md b/docs/dev/intro-architecture.md similarity index 100% rename from docs/dev/Architecture.md rename to docs/dev/intro-architecture.md diff --git a/docs/dev/NewSQLEngine.md b/docs/dev/intro-v2-engine.md similarity index 97% rename from docs/dev/NewSQLEngine.md rename to docs/dev/intro-v2-engine.md index cd82e61571..a4e9d0c678 100644 --- a/docs/dev/NewSQLEngine.md +++ b/docs/dev/intro-v2-engine.md @@ -31,7 +31,7 @@ With the architecture and extensibility improved significantly, the following SQ * [Semi-structured data query](../../docs/user/beyond/partiql.rst#example-2-selecting-deeper-levels): 
support querying OpenSearch object fields on arbitrary level * OpenSearch multi-field: handled automatically and users won't have the access, ex. `text` is converted to `text.keyword` if it’s a multi-field -As for correctness, besides full coverage of unit and integration test, we developed a new comparison test framework to ensure correctness by comparing with other databases. Please find more details in [Testing](./Testing.md). +As for correctness, besides full coverage of unit and integration test, we developed a new comparison test framework to ensure correctness by comparing with other databases. Please find more details in [Testing](./testing-comparison-test.md). --- @@ -63,7 +63,7 @@ You can find all the limitations in [Limitations](../../docs/user/limitations/li --- ## 4.How it's Implemented -If you're interested in the new query engine, please find more details in [Developer Guide](../../DEVELOPER_GUIDE.rst), [Architecture](./Architecture.md) and other docs in the dev folder. +If you're interested in the new query engine, please find more details in [Developer Guide](../../DEVELOPER_GUIDE.rst), [Architecture](./intro-architecture.md) and other docs in the dev folder. --- diff --git a/docs/dev/SubQuery.md b/docs/dev/opensearch-nested-field-subquery.md similarity index 100% rename from docs/dev/SubQuery.md rename to docs/dev/opensearch-nested-field-subquery.md diff --git a/docs/dev/Pagination.md b/docs/dev/opensearch-pagination.md similarity index 100% rename from docs/dev/Pagination.md rename to docs/dev/opensearch-pagination.md diff --git a/docs/dev/opensearch-relevancy-search.md b/docs/dev/opensearch-relevancy-search.md new file mode 100644 index 0000000000..9b25cc757d --- /dev/null +++ b/docs/dev/opensearch-relevancy-search.md @@ -0,0 +1,294 @@ +## 1 Overview + +In a search engine, the relevance is the measure of the relationship accuracy between the search query and the search result. The higher the relevance is, the higher quality of the search result, then the users are able to get more relevant content from the search result. For the searches in OpenSearch engine, the returned results (documents) are ordered by the relevance by default, the top documents are of the highest relevance. In OpenSearch, the relevance is indicated by a field `_score`. This float type field gives a score to the current document to measure how relevant it is related to the search query, with a higher score the document is indicated to be more relevant to the search. + +The OpenSearch query engine is the engine to do the query planning for user input queries. Currently the query engine is interfaced with SQL and PPL (Piped Processing Language), thus the users are able to write SQL and PPL queries to explore their data in OpenSearch. Most of the queries supported in the query engine are following the SQL use cases, which are mapped to the structured queries in OpenSearch. + +This design is to support the relevance based search queries with query engine, in another word to enable the OpenSearch users to write SQL/PPL languages to do the search by relevance in the search engine. + + +### 1.1 Problem Statement + +**1. DSL is not commonly used** +OpenSearch query language (DSL) is not commonly used in regular databases, especially for the users in the realm of analytics rather than development. This is also the reason we created the SQL plugin, where the query engine lies in. 
Like many other SQL servers where full text search features are enabled to support relevance based search with the SQL language, the SQL plugin in OpenSearch is also a perfect place for users to search by relevance on their indexed data if we support the search features in the query engine.

**2. OpenSearch search features are not present in the new query engine**
The current query engine works more like a traditional SQL server to retrieve exact data from the OpenSearch indices, so from the query engine standpoint, OpenSearch is treated as an ordinary database to store data rather than a search engine. One of the gaps in between is that the search features of the search engine are not supported in the query engine. By bringing search by relevance, one of the most relevant features of the search engine, into the query engine, our users would be able to explore and search their data using the SQL or PPL language directly.

**3. Full text functions in the old engine**
Last year (2020) we migrated the legacy SQL plugin to a newly constructed architecture with a new query engine to do the query planning for SQL and PPL queries. Some of the search functions were already enabled in the old engine. However, the full text functions in the old engine were not migrated to the new engine, so when users try to run a search query with SQL, it falls back to the old engine rather than being planned in the new query engine. This is not causing any issue in the short term, since the choice of old or new engine is transparent to users. But from a long-term perspective, we need to support these functions in the new engine as well to get rid of the inconsistency, which is also good for the future plan of deprecating the old engine.

### 1.2 Use Cases

**Use case 1: Full text search**
Full text search functions are very common in most SQL servers now due to the high demand for search features, and also because of their high efficiency compared to the wildcard match (LIKE) operator. With the support of search features in the query engine, users are able to execute full text functions with the SQL language on top of the OpenSearch search engine, and are not limited to simple search but can also run complicated searches, such as search with a prefix, search over multiple fields and so forth.

**Use case 2: Field independent search with SQL/PPL**
In many cases the users want to search the entire document for a term rather than in a specific field. For example a user might want to search an index for the keyword "Seattle", which might occur in fields like "DestinationCity", "OriginCity", "AirportLocation" etc., and all these results matter to the user. The search features proposed in this design also cover this case, enabling users to do multi-field or even field-independent search with the SQL and PPL languages (see the sketch after these use cases).

**Use case 3: Keyword highlighting**
The highlighters supported in the search engine are another feature built on top of relevance based search. By enabling highlighters, users are able to get the highlighted snippets from one or more fields in the search results, so they can clearly see where the query matches are.

**Use case 4: Observability project essential**
Observability is a project that aims to enable users to explore all types of data like logs, metrics etc. in one place by bringing logs, metrics and traces together. The relevance based search feature would be a key feature for users to search and filter the data in observability.
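As a sketch of what the field-independent search in use case 2 could look like (the function signature and index name here are illustrative; the exact syntax of each function is defined later in this design):

```
-- search only the given fields for the term
SELECT * FROM flights WHERE multi_match(['DestinationCity', 'OriginCity'], 'Seattle')

-- search all fields for the term
SELECT * FROM flights WHERE multi_match(['*'], 'Seattle')
```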
+ + +### 1.3 Requests from the community +- https://github.com/opendistro-for-elasticsearch/sql/issues/1137 +- https://github.com/opensearch-project/sql/issues/160 +- https://github.com/opensearch-project/sql/issues/183 + + +## 2 Requirements + +### 2.1 Functional Requirements + +* Support relevance based search functions with SQL/PPL language. + * Define user friendly SQL/PPL syntax for the search use. + * Should push the search query down to OpenSearch engine if possible. For queries that are unable to push down, the engine should return error with explicit message to raise users’ awareness. +* Support custom input values for function parameters to adjust the scoring strategy. +* (Optional) Enable user actions on the scoring field, for example sort by `_score` field in either ascending or descending order. +* (Optional) Enable in memory execution within the query engine if the plan fails to optimize to search engine. + +### 2.2 Non-functional Requirements + +#### A. Reliability +* The search is entirely executed in the OpenSearch engine, so the search result is consistent with the result in OpenSearch search engine with corresponding DSL. +* The old engine already has supported some of the full text functions, so the new supported features should not break the existing full text functions in syntax prospective for users. + +#### B. Extensibility +* Extensible to supplement more relevance based search functions or other relevant features. +* The architecture and implementation should be extensible to enable in memory execution for long term improvement of the query engine. + +### 2.3 Tenets + +1. The search with query engine is on the top of the search query in OpenSearch search engine, the relevance based search should be entirely executed in the search engine. Different from the ordinary queries in the query engine, the search query is not going to be executed in memory in the query engine even when it fails to push down into search engine. +2. The search with SQL/PPL interface respects the search query with OpenSearch DSL in syntax and results. This means the users are able to search with DSL + +### 2.4 Out of Scope + +1. All the search feature is built on the top of the search engine. Basically a complex query that fails to push down to the search engine is executed in the query engine in node level. But for the search features, the in memory search is out of scope for this design. +2. This design focuses on enabling users to do the search with SQL/PPL interface, but the analyzers and search algorithms are actually of the search engine job, so the features or enhancement of the search queries within the search engine are out of scope for this design. + + +## 3 High Level Design + +### 3.1 Search functions to support + +Since the relevance based search is highly depending on the OpenSearch core engine, we are therefore defining the search functions following the existing search queries (see **appendix A1**) but in the language style of SQL full text search functions. Here comes the list of functions with basic functionalities (i.e. parameter options) to support as the relevance based search functions in the query engine: + +|Query Type |Function Name In Query Engine |Description | +|--- |--- |--- | +|Match query |match |Returns documents that match a provided text, number, date or boolean value. The provided text is analyzed before matching. | +|Match phrase query |match_phrase |The `match_phrase` query analyzes the text and creates a `phrase` query out of the analyzed text. 
| +|Match phrase prefix query |match_phrase_prefix |Returns documents that contain the words of a provided text, in the **same order** as provided. The last term of the provided text is treated as a prefix, matching any words that begin with that term. | +|Match boolean prefix query |match_bool_prefix |analyzes its input and constructs a `bool` query from the terms. Each term except the last is used in a `term` query. The last term is used in a `prefix` query. | +|Multi match query - best fields |multi_match |Finds documents which match any field, but uses the `_score` from the best field. | +|Multi match query - most fields |multi_match |Finds documents which match any field and combines the `_score` from each field. | +|Multi match query - cross fields |multi_match |Treats fields with the same `analyzer` as though they were one big field. Looks for each word in **any** field. | +|Multi match query - phrase |multi_match |Runs a `match_phrase` query on each field and uses the `_score` from the best field. | +|Multi match query - phrase prefix |multi_match |Runs a `match_phrase_prefix` query on each field and uses the `_score` from the best field. | +|Multi match query - bool prefix |multi_match |Creates a `match_bool_prefix` query on each field and combines the `_score` from each field. | +|Combined fields |combined_fields |analyzes the query string into individual terms, then looks for each term in any of the fields | +|Common terms query |common |The `common` terms query is a modern alternative to stopwords which improves the precision and recall of search results (by taking stopwords into account), without sacrificing performance. | +|Query string |query_string |Returns documents based on a provided query string, using a parser with a strict syntax. | +|Simple query string |simple_query_string |Returns documents based on a provided query string, using a parser with a limited but fault-tolerant syntax. | + +The function names follow the query type names directly to avoid confusion when using the exactly the same functionalities with different languages. Besides, all the available options for the queries are passed as parameters in the functions. See **4.2 Search function details** for more details. + + +### 3.2 Architecture diagram + +#### Option A: Remain the original query plans, register all the search functions as SQL/PPL scalar functions. +The current engine architecture is well constructed with plan component that have their own unique job. For example, all the conditions in a query would be converted to the filter plans, and an aggregation plan is converted also the aggregation functions. + +Regarding the functionalities of search features, they are essentially performing the filter roles to find the documents that contain specific terms. Therefore, the search functions could be perfectly fitted into the filter conditions just like the other SQL functions in the condition expression. + +Besides, one of the query engine architecture tenets is to keep the job of every plan node simple and unique, and construct the entire query planning using existing plan node rather than creating new type of plan. This can keep each of operators in the physical plans simple and well defined. The following figure shows a simplified logical query plan with only project, filter and relation plans for a match query like `SELECT * FROM my_index WHERE match(message, "this is a test")`. 
+ +![relevance-Page-3](https://user-images.githubusercontent.com/33583073/130142789-bd78f8e6-c7b9-43e2-abb5-46c845af5a8e.png) + +**Pros:** +- Obey the tenet to avoid logical and physical plans with overlapped work +- Could follow the similar pattern of scalar functions to do the implementation work + +**Cons:** +- The search queries are served as scalar functions, this could be a limitation in extensibility for PPL language, for example this option is over complicated if we create commands like `match` to support the search features. + + +#### Option B: Create a new query plan node dedicated for the search features. + +The diagram below is simplified with only the logical plan and physical plan sections and leaves out others. Please check out [OpenSearch SQL Engine Architecture](https://github.com/opensearch-project/sql/blob/main/docs/dev/Architecture.md) for the complete architecture of the query engine. + +![relevance-Page-2 (4)](https://user-images.githubusercontent.com/33583073/129938534-28fa4845-4246-4707-9519-e68c9e86d174.png) + +The match component in the logical plans stands for the logical match in logical planning stage. This could be an extension of the filter plan, which is specially to handle the search functions. + +Ideally a relevance based search query is optimized to a plan where the score is able to merge with the logical index scan, this also means the score operation is eligible to turn to the OpenSearch query and pushed down to the search engine. However the query engine should also have the capability to deal with the scoring operations locally when the query planning runs into a complicated case and fails to push score down to the search engine. + +Since the in memory execution of the search queries are out of scope for current phase, we are not setting the corresponding match operator in the query engine, and throwing out errors instead if a query tries to reach the match operator. + +Pros: +- All the search plans are aggregated in one plan node and distinguished from other scalar function expression. +- Extensible for the implementation of in-memory execution since they share common algorithm base (BM25). + +Cons: +- Redundant plan node and disobey the query plan tenet. + + +## 4 Detailed Design + +### 4.1 SQL and PPL language interface + +In the legacy engine, the solution is to allow user to write a DSL segment as the parameter and pass it directly to the filter. This syntax could be much flexible for users but it skips the syntax analysis and type check in between. So we are proposing a more SQL-like syntax for the search functions by passing the options as named arguments instead. However this might bring some challenge in future maintenance if more contexts for the supported queries in core engine come up in future release. + +For the sake of user experience, we decided to put on the SQL like language interface. So from the SQL language standpoint, the search functions could be performed similar to the SQL functions existing in SELECT, WHERE, GROUP BY clauses etc., and the returned data type is set to boolean. 
For example:

```
SELECT match(message, "this is a test") FROM my_index
SELECT * FROM my_index WHERE match(message, "this is a test")
```

Similarly, from the PPL standpoint, we could define it either as functions or as a command named `match`:

**Option 1: served as eval functions, similar to the SQL language**

```
search source=my_index | eval f=match(message, "this is a test")
search source=my_index | where match(message, "this is a test") | fields message
```

**Option 2: define a new command**

```
search source=my_index | match field=message query="this is a test"
search source=my_index | match type=match_phrase field=message query="this is a test" | fields event, message, timestamp
```

### 4.2 Search mode
The search type for the new engine defaults to the `query_then_fetch` mode, the default setting of the OpenSearch engine, and usually it works fine unless the document count is too small to smooth out the term/document frequency statistics. The alternative option for the query type is `dfs_query_then_fetch`, which adds an additional step of pre-querying each shard, asking about term and document frequencies, before the `query_then_fetch` procedure.

Similarly to DSL queries, the search mode options for the SQL plugin could be supported through the request endpoint, for example to set `dfs_query_then_fetch` as the search mode for a specific query:
```
POST _plugins/_sql?type=dfs_query_then_fetch
{
  "query": "SELECT message FROM my_index WHERE match(message, 'this is a test')"
}
```

### 4.3 Search function details

In this section we focus on defining the syntax and functionalities for each of the functions to support, and how these functions are mapped to the search engine queries. Since most of the functions reuse common arguments, here is a list of the arguments that could be used by the search functions.

Required parameters:

* **field(s):** Field(s) you wish to search. Defaults to the index.query.default_field index setting, which has a default value of *.
* **query:** Text, number, boolean value or date you wish to find in the provided field. For the functions that do not require a specific field, the `fields` parameter might be an option to specify the fields to search for terms.

Optional parameters:

* **fields**: Field list to search for the query string. Set to "*" for all fields.
* **analyzer:** (string) Used to convert the text in `query` into tokens. Available values: (default) standard analyzer | simple analyzer | whitespace analyzer | stop analyzer | keyword analyzer | pattern analyzer | language analyzer | fingerprint analyzer | custom analyzer
* **auto_generate_synonyms_phrase_query:** (boolean) If true, match phrase queries are automatically created for multi-term synonyms. Defaults to true.
* **fuzziness**: (string) Maximum edit distance allowed for matching.
* **max_expansions:** (integer) Maximum number of terms to which the last provided term of the `query` value will expand. Defaults to 50.
* **prefix_length**: (integer) Number of beginning characters left unchanged for fuzzy matching. Defaults to 0.
* **fuzzy_transpositions**: (boolean) If true, edits for fuzzy matching include transpositions of two adjacent characters (ab → ba). Defaults to true.
* **fuzzy_rewrite**: (string) Method used to rewrite the query.
* **lenient**: (boolean) If true, format-based errors, such as providing a text query value for a numeric field, are ignored. Defaults to false.
* **operator**: (string) Boolean logic used to interpret text in the query value. Available operators: (default) OR | AND.
* **minimum_should_match**: (string) Minimum number of clauses that must match for a document to be returned.
* **zero_terms_query**: (string) Indicates whether no documents are returned if the analyzer removes all tokens, such as when using a stop filter. Available values: (default) none | all.
* **boost**: (float) Boost value toward the relevance score. Defaults to 1.0.
* **tie_breaker**: (float) Floating point number between 0 and 1.0 used to increase the relevance scores of documents matching multiple query clauses. Defaults to 0.0.
* **cutoff_frequency**: (float) Specifies an absolute or relative document frequency where high frequency terms are moved into an optional subquery and are only scored if one of the low frequency (below the cutoff) terms, in the case of an or operator, or all of the low frequency terms, in the case of an and operator, match. The `cutoff_frequency` value can either be relative to the total number of documents, if in the range [0..1), or absolute, if greater than or equal to 1.0.
* **slop:** (integer) Maximum number of positions allowed between matching tokens. Defaults to 0.
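Putting the arguments together, a call combining several named arguments could look like the following sketch (parameter values are illustrative):

```
SELECT * FROM my_index
WHERE match(message, 'this is a test', analyzer='standard', operator='and', boost=2.0)
```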
The following links redirect to the issue pages with the details of each search function, including syntax, functionality and available parameters:

**1. match function**
https://github.com/opensearch-project/sql/issues/184

**2. match_phrase function**
https://github.com/opensearch-project/sql/issues/185

**3. match_phrase_prefix function**
https://github.com/opensearch-project/sql/issues/186

**4. match_bool_prefix function**
https://github.com/opensearch-project/sql/issues/187

**5. multi_match function**
https://github.com/opensearch-project/sql/issues/188

**6. combined_fields function**
https://github.com/opensearch-project/sql/issues/189

**7. common function**
https://github.com/opensearch-project/sql/issues/190

**8. query_string function**
https://github.com/opensearch-project/sql/issues/191

**9. simple_query_string function**
https://github.com/opensearch-project/sql/issues/192


## 5 Implementation
The implementation covers the planner and optimizer in the query engine but, as discussed in the design, skips the changes to the in-memory executor for the current phase. One of the tricky things in the implementation is how to recognize the custom values as query parameters or options.

The current solution is to bypass the type check and environment resolution for the parameter values, since they do not participate in the in-memory computing, and instead take the values as string expressions and resolve them in the optimizer layer when translating to query DSL. Take `analyzer` as an example (generic type parameters below are reconstructed, as they were lost in rendering):
```
// define the analyzer action
private final BiFunction<MatchQueryBuilder, ExprValue, MatchQueryBuilder> analyzer =
    (b, v) -> b.analyzer(v.stringValue());

...

ImmutableMap<String, BiFunction<MatchQueryBuilder, ExprValue, MatchQueryBuilder>> argAction =
    ImmutableMap.<String, BiFunction<MatchQueryBuilder, ExprValue, MatchQueryBuilder>>builder()
        .put("analyzer", analyzer)
        .put(...)
        .build();

...

// build match query: apply each named argument's action to the query builder
while (iterator.hasNext()) {
  NamedArgumentExpression arg = (NamedArgumentExpression) iterator.next();
  if (!argAction.containsKey(arg.getArgName())) {
    throw new SemanticCheckException(String
        .format("Parameter %s is invalid for match function.", arg.getArgName()));
  }
  argAction
      .get(arg.getArgName())
      .apply(queryBuilder, arg.getValue().valueOf(null));
}
```

## 6 Testing
All the code changes should be test driven. 
+
+## 6 Testing
+All the code changes should be test driven. The pull requests should include unit test cases, integration test cases and comparison test cases where applicable.
+
+
+
+## Appendix
+
+### A1. Relevance based search queries in OpenSearch
+
+|Query Type |Advanced options and common parameters |Example of basic query |
+|--- |--- |--- |
+|Match query |enable synonyms, fuzziness options, max expansions, prefix length, lenient, operator, minimum should match, zero terms query |{"query": {"match": {"message": {"query": "this is a test"}}}} |
+|Match phrase query |zero terms query |{"query": {"match_phrase": {"message": "this is a test"}}} |
+|Match phrase prefix query |max expansions, slop, zero terms query |{"query": {"match_phrase_prefix": {"message": {"query": "quick brown f"}}}} |
+|Match boolean prefix query |fuzziness options, max expansions, prefix length, operator, minimum should match |{"query": {"match_bool_prefix" : {"message" : "quick brown f"}}} |
+|Multi match query |type (best fields, most fields, cross fields, phrase, phrase prefix, bool prefix) |{"query": {"multi_match" : {"query": "brown fox","type": "best_fields","fields": [ "subject", "message" ],"tie_breaker": 0.3}}} |
+|Combined fields |enable synonyms, operator, minimum should match, zero terms query |{"query": {"combined_fields" : {"query": "database systems","fields": [ "title", "abstract", "body"],"operator": "and"}}} |
+|Common terms query |minimum should match, low/high frequency operator, cutoff frequency, boost |{"query": {"common": {"body": {"query": "this is bonsai cool","cutoff_frequency": 0.001}}}} |
+|Query string |wildcard options, enable synonyms, boost, operator, enable position increments, fields (multi fields search), fuzziness options, lenient, max determinized states, minimum should match, quote analyzer, phrase slop, quote field suffix, rewrite, time zone |{"query": {"query_string": {"query": "(new york city) OR (big apple)","default_field": "content"}}} |
+|Simple query string |operator, enable all fields search, analyze wildcard, enable synonyms, flags, fuzziness options, lenient, minimum should match, quote field suffix |{"query": {"simple_query_string" : {"query": "\"fried eggs\" +(eggplant \| potato) -frittata","fields": ["title^5", "body"],"default_operator": "and"}}} |
+
+### A2. Search flow in search engine
+
+![relevance](https://user-images.githubusercontent.com/33583073/129938659-5b49f43d-a83f-47d5-be5b-937b1c96e5bc.png)
\ No newline at end of file
diff --git a/docs/dev/query-automatic-acceleration.md b/docs/dev/query-automatic-acceleration.md
new file mode 100644
index 0000000000..e509fa194e
--- /dev/null
+++ b/docs/dev/query-automatic-acceleration.md
@@ -0,0 +1,75 @@
+In a database engine, there are different ways to optimize query performance. For instance, the rule-based/cost-based optimizer and the distributed execution layer try to find the best execution plan through cost estimation and equivalent transformations of the query plan. Here we're proposing an alternative approach, which is to accelerate query execution with materialized views as a time-space tradeoff.
+
+### Architecture
+
+Here is a reference architecture that illustrates the components and the entire workflow, which is essentially a *workload-driven feedback loop*:
+
+1. *Input*: query plan telemetry is collected
+2. *Generating feedback*: the telemetry is fed into a workload-driven feedback generator
+3. *Output*: feedback for the optimizer to rewrite query plans in the future
+
+Here, *feedback* refers to the various materialized views prebuilt (either online or offline) that hint acceleration opportunities to the query optimizer.
+
+![AutoMV (1) (1)](https://user-images.githubusercontent.com/99925918/166970105-09855a1a-6db9-475f-9638-eb240022e399.svg)
+
+There are 2 areas and paths moving forward, both of which lack open source solutions:
+
+* OpenSearch engine acceleration: accelerate DSL or SQL/PPL engine execution
+* MPP/Data Lake engine acceleration: accelerate Spark, Presto, Trino
+
+### General Acceleration Workflow
+
+#### 1. Workload Telemetry Collecting
+
+Collect the query plan telemetry generated during query execution and emit it as input for feedback generation.
+
+* *Query Plan Telemetry:* Specifically, query plan telemetry means the statistics collected on each physical node (sub-query or sub-expression) during execution. Generally, the statistics include input/output size, column cardinality, running time etc. (see the sketch after this list). Eventually the logical plan is rewritten to reuse the materialized view, so the execution statistics may need to be linked to the logical plan before emitting telemetry data.
+* *Challenge:* The effort required in this stage depends on to what extent the query engine is observable and how easily telemetry can be collected.
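+
+For illustration only (the type and field names below are assumptions, not an existing API), the telemetry emitted per plan node could be as small as a record like this:
+
+```
+// Hypothetical per-node telemetry record; the fields mirror the statistics listed above.
+public record PlanNodeTelemetry(
+    String logicalPlanNodeId,  // links the physical node back to its logical plan node
+    long inputRows,            // input size
+    long outputRows,           // output size
+    long columnCardinality,    // estimated number of distinct values
+    long runningTimeMillis) {} // wall-clock execution time
+```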
+
+#### 2. Workload Telemetry Preprocessing
+
+Preprocess the query plan telemetry into a uniform workload representation.
+
+* *Workload Representation*: a uniform workload representation decouples the subsequent stages from the specific telemetry data format and store.
+
+#### 3. View Selection
+
+Analyze the workload data and select sub-queries as materialization candidates according to the view selection algorithm.
+
+* *Algorithm*
+    * The view selection algorithm can be a heuristic rule, such as estimating a high performance boost and low materialization cost, or a more complex learning algorithm.
+    * Alternatively, the selection can be done manually by customers with access to all workload statistics.
+    * In between is giving acceleration suggestions via an advisor and allowing customers to intervene and change the default acceleration strategy.
+* *Select Timing*
+    * *Online:* analyze and select view candidates at query processing time, which benefits interactive/ad-hoc queries
+    * *Offline:* shown as in the figure above
+* *Challenge*: Automatic workload analysis and view selection is challenging and may require machine learning capability. The simple heuristic rules mentioned above may be acceptable. Alternative options include a view recommendation advisor or manual selection by customers.
+
+#### 4. View Materialization and Maintenance
+
+Materialize the selected view and maintain consistency between the source data and the materialized view data, for example by incremental refresh.
+
+* *Materialized View:* a database object that contains the results of a query. The result may be a subset of a single table or a multi-table join, or may be a summary using an aggregate function
+    * *Query Result Cache*
+        * *Full Result*: MV that stores the entire result set and can only be reused by the same deterministic query
+        * *Intermediate Result*: MV that stores the result of a subquery (similar to a Covering Index if there is only a filtering operation)
+    * *Secondary Index*
+        * *Data Skipping Index*: MV that stores column statistics in a coarse-grained way, known as a Small Materialized Aggregate (SMA), and thus skips blocks that cannot possibly produce a match (see the sketch after this list). Common SMAs include Min-Max, Bloom Filter, Value List.
+        * *Covering Index*: MV that stores indexed column value(s) and included column value(s) so that the index itself can answer the query without accessing the original data. Common index implementations include the B-tree family, Hash and Z-Ordering indexes.
+    * *Approximate Data Structure*
+* *Materialization Timing*
+    * *Ingestion Time:* a view defined and materialized at ingestion time can be “registered” to the Selected View table in the figure above (e.g. by a DDL CREATE MV statement). In this way the query acceleration framework can take care of query plan optimization
+        * Parquet: min-max SMA
+        * S3 query federation: user defined transformation as the final resulting MV. More details in https://github.com/opensearch-project/sql/issues/595
+        * Loki: labels as a skipping hash index
+        * DataSketches: approximate data structures
+    * *Online (Query Processing Time):* adds materialization overhead to the first query to benefit queries in the future
+    * *Offline:* shown as in the figure above
+* *Challenge:* To ensure consistency, the materialized view needs to be kept in sync with the source data. Without real-time notifications to refresh or invalidate, hybrid scan or a similar mechanism is required to reuse a partially stale materialized view.
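+
+To make the data skipping idea concrete, here is a minimal, self-contained sketch of a min-max SMA (illustrative code only, not an existing OpenSearch API) that decides whether a block of rows can be skipped for an equality predicate:
+
+```
+// Minimal min-max Small Materialized Aggregate (SMA) for one numeric column.
+// A block whose [min, max] range cannot contain the predicate value is skipped.
+public final class MinMaxSma {
+  private final long min;
+  private final long max;
+
+  public MinMaxSma(long min, long max) {
+    this.min = min;
+    this.max = max;
+  }
+
+  /** True if the block may contain rows where column == value; false means safe to skip. */
+  public boolean mayContain(long value) {
+    return value >= min && value <= max;
+  }
+}
+```
+
+A table scan would consult `mayContain` against each block's stored SMA and read only the blocks that return true; Bloom Filter and Value List variants answer the same question with different precision/space tradeoffs.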
+
+#### 5. Query Plan Optimization
+
+At last, the query optimizer checks the existing materialized views and replaces the original operator with a scan on the materialized view.
+
+* *View Matching:* match a sub-query with an existing materialized view
+* *Query Rewrite:* replace the query plan node with a materialized view scan operator
\ No newline at end of file
diff --git a/docs/dev/query-manager.md b/docs/dev/query-manager.md
new file mode 100644
index 0000000000..68c82d0ea7
--- /dev/null
+++ b/docs/dev/query-manager.md
@@ -0,0 +1,34 @@
+## Query Manager
+
+### New abstraction
+**QueryManager is the high level interface of the core engine**: the Parser parses the raw query into a Plan, which is submitted to the QueryManager.
+
+1. **AstBuilder** analyzes the raw query string and creates a **Statement**.
+2. **QueryPlanFactory** creates a **Plan** for each kind of **Statement**.
+3. **QueryManager** executes the **Plan**.
+
+*(figure: Screen Shot 2022-10-25 at 5 05 46 PM)*
+
+The core engine defines the interface of **QueryManager**; each execution engine should provide an implementation of QueryManager bound to its execution environment.
+**QueryManager** manages all the submitted plans and defines the following interface (see the sketch after the figure below):
+
+* submit: submit a query execution.
+* cancel: cancel a query execution.
+* get: get the query execution info of a specific query.
+
+![image](https://user-images.githubusercontent.com/2969395/197908982-74a30a56-9b0c-4a1a-82bf-176d25e86bc1.png)
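+
+A minimal sketch of that contract, assuming hypothetical `QueryId` and `QueryExecutionInfo` types (the exact signatures live in the core engine, not here):
+
+```
+// Hedged sketch of the QueryManager interface described above.
+public interface QueryManager {
+
+  /** Submit a plan for execution and return an id to track it. */
+  QueryId submit(AbstractPlan plan);
+
+  /** Cancel the execution of a specific query. */
+  void cancel(QueryId queryId);
+
+  /** Get the execution info of a specific query. */
+  QueryExecutionInfo get(QueryId queryId);
+}
+```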
+
+The Parser parses the raw query into a Statement and creates an AbstractPlan. Each AbstractPlan decides how to execute the query in the QueryManager.
+
+![image (1)](https://user-images.githubusercontent.com/2969395/197909011-c53bf5e0-9418-4d9d-8f4d-353147158526.png)
+
+
+**QueryService is the low level interface of the core engine**: each **Plan** decides how to execute the query and uses the QueryService to analyze, plan, optimize and execute the query.
+
+*(figure: Screen Shot 2022-10-25 at 5 07 29 PM)*
+
+
+### Change of existing logic
+1. Remove the scheduling logic from the NIO thread. After the change,
+   a. the Parser will be executed in the NIO thread.
+   b. the QueryManager decides the query execution strategy, e.g. OpenSearchQueryManager schedules the QueryExecution to run in the **sql-worker** thread pool.
\ No newline at end of file
diff --git a/docs/dev/query-null-missing-value.md b/docs/dev/query-null-missing-value.md
new file mode 100644
index 0000000000..a8a3b32e36
--- /dev/null
+++ b/docs/dev/query-null-missing-value.md
@@ -0,0 +1,193 @@
+## Description
++ Auto add FIELDS * for PPL queries.
++ In the Analyzer, expand SELECT * to SELECT all fields.
++ Extract type info from the QueryPlan and add it to the QueryResponse.
++ Support NULL and MISSING values in the response. https://github.com/penghuo/sql/blob/pr-select-all/docs/user/general/values.rst
+
+## Problem Statements
+### Background
+Before explaining the current issues, let's first set up the context.
+
+#### Sample Data
+Let's explain the problem with an example: the bank index, which has 2 fields.
+
+```
+  "mappings" : {
+    "properties" : {
+      "account_number" : {
+        "type" : "long"
+      },
+      "age" : {
+        "type" : "integer"
+      }
+    }
+  }
+```
+Then we add some data to the index.
+
+```
+POST bank/_doc/1
+{"account_number":1,"age":31}
+
+// age is null
+POST bank/_doc/2
+{"account_number":2,"age":null}
+
+// age is missing
+POST bank/_doc/3
+{"account_number":3}
+```
+
+#### JDBC and JSON data format
+Then, we define the response data format for the query “SELECT account_number FROM bank” as follows.
+
+JDBC format. There are mandatory fields: “field name”, “field type” and “data”. e.g.
+```
+{
+  "schema": [{
+    "name": "account_number",
+    "type": "long"
+  }],
+  "total": 3,
+  "datarows": [
+    [1],
+    [2],
+    [3]
+  ],
+  "size": 3
+}
+```
+
+JSON format. Compared with the JDBC format, it doesn't have the schema field.
+
+```
+{
+  "datarows": [
+    {"account_number": 1},
+    {"account_number": 2},
+    {"account_number": 3}
+  ]
+}
+```
+
+### Issue 1. Represent NULL and MISSING in Response
+With the sample data and response data formats in mind, let's go through more queries and examine their results.
+Consider the query **SELECT age, account_number FROM bank**.
+The JDBC format doesn't have a MISSING value: if a field exists in the schema but is missing in the document, it should be treated as a NULL value.
+The JSON format can represent the MISSING value properly.
+
+* JDBC Format
+```
+{
+  "schema": [
+    {
+      "name": "age",
+      "type": "integer"
+    },
+    {
+      "name": "account_number",
+      "type": "long"
+    }
+  ],
+  "total": 3,
+  "datarows": [
+    [31, 1],
+    [null, 2],
+    [null, 3]
+  ],
+  "size": 3
+}
+```
+* JSON Format
+```
+{
+  "datarows": [
+    {"age": 31, "account_number": 1},
+    {"age": null, "account_number": 2},
+    {"account_number": 3}
+  ]
+}
+```
+
+### Issue 2. ExprValue to JDBC format
+
+In our current implementation, every SQL operator is translated to a chain of PhysicalOperators, and each PhysicalOperator provides ExprValues as its return values. The protocol pulls the result from the PhysicalOperator and translates it to the expected format. Taking the above query as an example, the result of the PhysicalOperator is a list of ExprValues:
+
+```
+[
+  {"age": ExprIntegerValue(31), "account_number": ExprIntegerValue(1)},
+  {"age": ExprNullValue(), "account_number": ExprIntegerValue(2)},
+  {"account_number": ExprIntegerValue(3)}
+]
+```
+The current solution is to extract the field name and field type from the data itself. This solution has two problems:
+
+1. It is impossible to derive the type from a NULL value.
+2. If the field is missing from the ExprValue, there is no way to derive it at all.
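+
+A hypothetical snippet (plain Java maps standing in for the ExprValue notation above, not the plugin's actual classes) makes both failure modes visible:
+
+```
+import java.util.LinkedHashMap;
+import java.util.Map;
+
+public class NaiveSchemaDerivation {
+  public static void main(String[] args) {
+    Map<String, Object> row2 = new LinkedHashMap<>();
+    row2.put("age", null);            // problem 1: NULL carries no type information
+    row2.put("account_number", 2L);
+
+    Map<String, Object> row3 = new LinkedHashMap<>();
+    row3.put("account_number", 3L);   // problem 2: "age" is absent entirely
+
+    // Deriving the schema from row2 cannot tell whether "age" is an integer or a
+    // string; deriving it from row3 would not even discover the "age" column.
+  }
+}
+```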
+
+### Issue 3. The Type info is missing
+In the current design, the Protocol is a separate module which works independently of the QueryEngine. The Protocol module receives the list of ExprTupleValues from the QueryEngine, then formats the result based on the type of each ExprValue. The problem is that ExprNullValue and ExprMissingValue don't have a type associated with them, thus the Protocol module can't derive the type info from the input ExprTupleValue directly.
+
+### Issue 4. What does * (all fields) mean in SELECT
+In the current design, the SELECT * clause is ignored in the AST builder logic, because it means selecting all the data from the input operator. The issue is similar to Issue 3: if the input operator produces NULL or MISSING values, then the Protocol has no way to derive type info from them.
+
+#### Requirements
+1. The JDBC format should be supported. The MISSING and NULL values should be represented as NULL.
+2. The JSON format should be supported.
+3. The Protocol module should receive a QueryResponse which includes both schema and data.
+
+#### Solution
+##### Include NULL and MISSING values in the QueryResult (Issues 1, 2)
+The SELECT operator will be translated to a PhysicalOperator with a list of expressions to resolve ExprValues from the input data. With the above example, when handling NULL and MISSING values, the expected output data should be as follows.
+
+```
+[
+  {"age": ExprIntegerValue(31), "account_number": ExprIntegerValue(1)},
+  {"age": ExprNullValue(), "account_number": ExprIntegerValue(2)},
+  {"age": ExprMissingValue(), "account_number": ExprIntegerValue(3)}
+]
+```
+
+An additional list of Schema is also required when the protocol is JDBC.
+
+```
+{
+  "schema": [
+    {
+      "name": "age",
+      "type": "integer"
+    },
+    {
+      "name": "account_number",
+      "type": "long"
+    }
+  ]
+}
+```
+Then the protocol module can easily translate to the JDBC or JSON format.
+
+##### Expand SELECT * to SELECT ...fields (Issue 4)
+In our current implementation, SELECT * is ignored in SQL, and in PPL there is not even a fields * command. This works fine for the JSON format, which doesn't require a schema, but it doesn't work for the JDBC format.
+The proposal here is:
+
++ Automatically add the fields command to PPL queries
++ Expand SELECT * to SELECT ...fields.
+
+##### Automatically add fields * to PPL queries
+Unlike SQL, the PPL grammar doesn't require the Fields command to be the last command, so a fields * command should be added automatically.
+The logic is: if the last operator is not a Fields command, a Fields * command will be added, as shown in the example below.
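+
+For instance, with an illustrative query against the sample bank index above, the rewrite would behave as follows:
+
+```
+search source=bank | where age > 30
+```
+would be implicitly rewritten to
+```
+search source=bank | where age > 30 | fields *
+```
+whereas a query that already ends with a Fields command, such as `search source=bank | fields age`, is left unchanged.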
+
+##### Expand SELECT * to SELECT ...fields
+In the Analyzer, we should expand the * to all the fields in the current scope. There are two issues we need to address:
+
+1. Not all the fields in the current scope should be used to expand *. The original scope comes from the Elasticsearch mapping, which includes nested mappings. In the current design, only the top-level fields will be retrieved from the current scope; all the nested fields will be ignored.
+2. The scope should be dynamically maintained in the Analyzer. For example, the stats command defines a new scope.
+
+#### Retrieve Type Info from ProjectOperator and Expose It to the Protocol (Issue 3)
+After expanding the * and automatically adding fields, the type info can be retrieved from the ProjectOperator. Then the Protocol can get both schema and data from the QueryEngine.
\ No newline at end of file
diff --git a/docs/dev/query-optimizier-improvement.md b/docs/dev/query-optimizier-improvement.md
new file mode 100644
index 0000000000..753abcc844
--- /dev/null
+++ b/docs/dev/query-optimizier-improvement.md
@@ -0,0 +1,106 @@
+### Background
+
+This section introduces the current architecture of the logical optimizer and physical transformation.
+
+#### Logical-to-Logical Optimization
+
+Currently each storage engine adds its own logical operators as concrete implementations of the `TableScanOperator` abstraction. Typically each data source needs to add 2 logical operators for table scan, with and without aggregation. Take OpenSearch for example: there are `OpenSearchLogicalIndexScan` and `OpenSearchLogicalIndexAgg`, and a bunch of pushdown optimization rules for each accordingly.
+
+```
+class LogicalPlanOptimizer:
+  /*
+   * OpenSearch rules include:
+   * MergeFilterAndRelation
+   * MergeAggAndIndexScan
+   * MergeAggAndRelation
+   * MergeSortAndRelation
+   * MergeSortAndIndexScan
+   * MergeSortAndIndexAgg
+   * MergeLimitAndRelation
+   * MergeLimitAndIndexScan
+   * PushProjectAndRelation
+   * PushProjectAndIndexScan
+   *
+   * that return *OpenSearchLogicalIndexAgg*
+   * or *OpenSearchLogicalIndexScan* finally
+   */
+  val rules: List<Rule>
+
+  def optimize(plan: LogicalPlan):
+    for rule in rules:
+      if rule.match(plan):
+        plan = rule.apply(plan)
+    return plan.children().forEach(this::optimize)
+```
+
+#### Logical-to-Physical Transformation
+
+After the logical transformation, the planner lets the `Table` in the `LogicalRelation` (identified before the logical transformation above) transform the logical plan to a physical plan.
+
+```
+class OpenSearchIndex:
+
+  def implement(plan: LogicalPlan):
+    return plan.accept(
+      DefaultImplementor():
+        def visitNode(node):
+          if node is OpenSearchLogicalIndexScan:
+            return OpenSearchIndexScan(...)
+          else if node is OpenSearchLogicalIndexAgg:
+            return OpenSearchIndexScan(...)
+```
+
+### Problem Statement
+
+The current planning architecture causes 2 serious problems:
+
+1. Each data source adds special logical operators and explodes the optimizer rule space. For example, Prometheus also has `PrometheusLogicalMetricAgg` and `PrometheusLogicalMetricScan` accordingly. They match the query plan tree with exactly the same pattern as OpenSearch.
+2. A bigger problem is the difficulty of transforming from logical to physical when there are 2 `Table`s in the query plan, because only 1 of them has the chance to do the `implement()`. This is a blocker for supporting the `INSERT ... SELECT ...` statement or JOIN queries. See the code below.
+
+```
+  public PhysicalPlan plan(LogicalPlan plan) {
+    Table table = findTable(plan);
+    if (table == null) {
+      return plan.accept(new DefaultImplementor<>(), null);
+    }
+    return table.implement(
+        table.optimize(optimize(plan)));
+  }
+```
+
+### Solution
+
+#### TableScanBuilder
+
+A new abstraction `TableScanBuilder` is added as a transition operator during logical planning and optimization. Each data source provides its implementation class via the `Table` interface. The push down difference between non-aggregate and aggregate queries is hidden inside the specific scan builder, for example `OpenSearchIndexScanBuilder`, rather than exposed to the core module. A sketch of the abstraction follows the figure.
+
+![TableScanBuilder](https://user-images.githubusercontent.com/46505291/204355538-e54f7679-3585-423e-97d5-5832b2038cc1.png)
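+
+For illustration, the abstraction could look roughly like the sketch below; the push down hooks mirror the rule names above, but the exact method set and signatures are assumptions rather than the final API:
+
+```
+// Sketch: a transition operator each data source extends. A push down hook returns
+// true when the scan absorbed the operator, so the optimizer can remove that node.
+public abstract class TableScanBuilder extends TableScanOperator {
+
+  public boolean pushDownFilter(LogicalFilter filter) {
+    return false;
+  }
+
+  public boolean pushDownAggregation(LogicalAggregation aggregation) {
+    return false;
+  }
+
+  public boolean pushDownSort(LogicalSort sort) {
+    return false;
+  }
+
+  public boolean pushDownLimit(LogicalLimit limit) {
+    return false;
+  }
+
+  public boolean pushDownProject(LogicalProject project) {
+    return false;
+  }
+
+  /** Finally build the concrete physical scan, e.g. OpenSearchIndexScan. */
+  public abstract TableScanOperator build();
+}
+```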
+
+#### TablePushDownRules
+
+In this way, the `LogicalPlanOptimizer` in the core module always has the same set of rules for all push down optimization.
+
+![LogicalPlanOptimizer](https://user-images.githubusercontent.com/46505291/203142195-9b38f1e9-1116-469d-9709-3cbf893ec522.png)
+
+
+### Examples
+
+The following diagram illustrates how `TableScanBuilder`, along with `TablePushDownRule`, solves the problems mentioned above.
+
+![optimizer-Page-1](https://user-images.githubusercontent.com/46505291/203645359-3f2fff73-a210-4bc0-a582-951a27de684d.jpg)
+
+
+Similarly, a `TableWriteBuilder` will be added and will work in the same way in a separate PR: https://github.com/opensearch-project/sql/pull/1094
+
+![optimizer-Page-2](https://user-images.githubusercontent.com/46505291/203645380-5155fd22-71b4-49ca-8ed7-9652b005f761.jpg)
+
+### TODO
+
+1. Refactor the Prometheus optimize rules and enforce the table scan builder
+2. Figure out how to implement AD commands
+3. Deprecate `optimize()` and `implement()` once items 1 and 2 are complete
+4. Introduce a fixed point or maximum iteration limit for iterative optimization
+5. Investigate whether CBO should be part of the current optimizer or of a distributed planner in the future
+6. Remove `pushdownHighlight` once it's moved to OpenSearch storage
+7. Move `TableScanOperator` to the new `read` package (left in this PR to avoid even more files changed)
\ No newline at end of file
diff --git a/docs/dev/SemanticAnalysis.md b/docs/dev/query-semantic-analysis.md
similarity index 100%
rename from docs/dev/SemanticAnalysis.md
rename to docs/dev/query-semantic-analysis.md
diff --git a/docs/dev/TypeConversion.md b/docs/dev/query-type-conversion.md
similarity index 100%
rename from docs/dev/TypeConversion.md
rename to docs/dev/query-type-conversion.md
diff --git a/docs/dev/AggregateWindowFunction.md b/docs/dev/sql-aggregate-window-function.md
similarity index 100%
rename from docs/dev/AggregateWindowFunction.md
rename to docs/dev/sql-aggregate-window-function.md
diff --git a/docs/dev/Testing.md b/docs/dev/testing-comparison-test.md
similarity index 100%
rename from docs/dev/Testing.md
rename to docs/dev/testing-comparison-test.md
diff --git a/docs/dev/Doctest.md b/docs/dev/testing-doctest.md
similarity index 100%
rename from docs/dev/Doctest.md
rename to docs/dev/testing-doctest.md