
[RFC] Accelerate Spark SQL queries by covering index #298

Closed
dai-chen opened this issue Apr 1, 2024 · 5 comments

dai-chen commented Apr 1, 2024

Is your feature request related to a problem?

Currently, covering indexes are ingested and exposed directly in OpenSearch. While users can utilize this data for visualization, alerting, and reporting, and manage its lifecycle independently, Spark SQL queries on the source table are not accelerated by the covering index data.

What solution would you like?

We propose improving the performance of Spark SQL queries by leveraging covering indexes. Specifically, this involves rewriting queries to use the covering index data built from the same source table.

Use Cases

After visualizing the data to gain initial insights, users typically seek more in-depth analysis to uncover underlying trends, patterns, and potential root causes driving the observed phenomena. This phase often involves examining various data points, and conducting comprehensive root cause analysis. The proposed feature aims to expedite this in-depth analysis process by utilizing the pre-existing covering index data within OpenSearch, thereby eliminating the necessity for redundant scans of the source data.

Open questions regarding this proposal:

  1. Use Case: Is it acceptable for users to query the OpenSearch index directly, as proposed in the Alternatives section below? Is that more natural, since visualization and DSL queries are already built on the index, or is a Spark SQL query experience required?
  2. Consistency:
    a. Currently, we only guarantee eventual consistency, and the freshness of OpenSearch index data is determined by the latency of index refresh.
    b. Users may have difficulty viewing the latest data if query rewrite is always enabled.
  3. Data Management: Unlike the current mental model, the index will be exposed to both external customers and internal use. For example, if users delete the OpenSearch index data fully or partially, it may impact the correctness of rewritten queries.
  4. Security: If query rewrite happens in Spark and any access control is enabled, could the rewrite pose a security risk?
  5. Performance: It is uncertain whether query rewriting in Spark will improve latency. Comparing scanning S3 directly versus scanning the OpenSearch index in Spark and transmitting the data back to OpenSearch, the outcome depends on the covering index size, the query result size, and predicate pushdown to OpenSearch.
  6. Dev Efforts: One of the challenges is how to rewrite a query using a partial covering index. A proof of concept is necessary to evaluate feasibility.

What alternatives have you considered?

Alternatively, users can:

  1. Query the index data directly via OpenSearch DSL or SQL; or
  2. Query the OpenSearch table directly via SparkSQL (PoC item 1)

Do you have any additional context?

Most of the discussion here applies to query rewrite for materialized views. For clarity, this will be discussed separately.

@dai-chen dai-chen added the enhancement label Apr 1, 2024
@dai-chen dai-chen changed the title [RFC] Accelerate Spark SQL Queries by Covering Index [RFC] Accelerate Spark SQL queries by covering index Apr 1, 2024
@dai-chen dai-chen removed the untriaged label Apr 1, 2024

anirudha commented Apr 1, 2024

  1. The OS index will be used for the rewrite; we may not depend on the freshness of the S3 data. OS-Flint doesn't do edit/modify: it's an append-only use case.

@dai-chen dai-chen added the feature and 0.4 labels and removed the enhancement label Apr 1, 2024

dai-chen commented Apr 1, 2024

Design Option: Query Rewriting in Spark using Index Data Only

One design option involves leveraging the Flint optimizer within Spark to perform query rewriting, similar to today's skipping index query rewriting. In this approach, eligible queries are rewritten to use covering index data exclusively from OpenSearch.

[Screenshot: design diagram, 2024-04-01]

Workflow

  1. Users send queries to the asynchronous query endpoint in OpenSearch SQL.
  2. OpenSearch SQL submits the queries to Spark.
  3. The Flint optimizer in Spark rewrites the query plan using OpenSearch index data.
  4. Spark executes the queries and scans the covering index data in OpenSearch.
  5. The Flint application stores the query results in the query result index within OpenSearch.
  6. OpenSearch SQL reads the query results and returns them to the users.

Example

# Create covering index
CREATE INDEX all_cols_idx ON http_logs;

# Spark rewrites the query and executes it, pulling data from the OS index into the result index
SELECT * FROM http_logs;

# Logical plan before and after rewriting:
#   LogicalRelation("spark_catalog.default.http_logs")
#     => LogicalRelation("opensearch.default.flint_opensearch_default_http_logs_all_all_cols_idx")

# Execution:
# Query OS: /flint_opensearch_default_http_logs_all_all_cols_idx/_search
# Load into Spark InternalRow
# Write OS: /<query result index>/_bulk

# UI pulls result from query result index
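The plan substitution shown above can be sketched as a simple rule. This is an illustrative Python model, not Flint's actual optimizer API; the class names and matching logic are assumptions for demonstration.

```python
# Hypothetical sketch of the covering-index rewrite rule: swap the source-table
# relation for the index relation when the index covers all referenced columns.
from dataclasses import dataclass

@dataclass(frozen=True)
class LogicalRelation:
    table: str       # e.g. "spark_catalog.default.http_logs"
    columns: tuple   # columns the query references

@dataclass(frozen=True)
class CoveringIndex:
    source_table: str
    index_table: str  # e.g. "opensearch.default.flint_..._all_cols_idx"
    indexed_columns: tuple

def rewrite(relation, indexes):
    """Replace a source-table scan with a covering-index scan when some
    index on that table contains every column the query needs."""
    for idx in indexes:
        if (idx.source_table == relation.table
                and set(relation.columns) <= set(idx.indexed_columns)):
            return LogicalRelation(idx.index_table, relation.columns)
    return relation  # no eligible index: leave the plan unchanged
```

A query touching only indexed columns is redirected to the index table; any query referencing a column the index lacks falls through untouched, which is the "partial covering index" gap the Dev Efforts item calls out.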

Limitations

  1. User Expertise Required: Without a pre-analyze API, users must know in advance whether their query can be accelerated. Otherwise, investing in building a covering index can incur significant cost.
  2. Covering Indexes are not Updatable: If users create a covering index on a subset of the columns, there is no way to add more columns in the future.
  3. Stale Query Results:
    a. The freshness of the covering index directly impacts query accuracy, potentially leading to stale or incorrect outcomes.
    b. User manipulation of the covering index data can cause inconsistencies and partial results.
  4. Untracked Permission Changes: Changes in permissions on the source table are not reflected in the covering index data, affecting query accuracy.
  5. Restricted Predicate Pushdown: Currently, the Flint data source pushes only a limited set of predicates down to the OpenSearch DSL query.
    a. Only basic operators are supported
    b. Others, such as functions and aggregations, are not supported
  6. Performance Degradation: Large covering indexes or queries lacking effective filtering conditions may result in degraded performance. [To be verified in PoC]
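The pushdown limitation in item 5 can be illustrated with a small translator. This is a sketch under assumptions, not Flint's actual pushdown code; it only shows the shape of the problem, mapping simple comparisons to OpenSearch DSL clauses and rejecting everything else.

```python
# Illustrative predicate pushdown: basic comparison operators map cleanly to
# OpenSearch DSL term/range clauses; anything else must stay in Spark.
def predicate_to_dsl(column, op, value):
    """Translate one simple predicate into an OpenSearch DSL clause.
    Returns None for predicates this sketch cannot push down (functions,
    aggregations, pattern matching), which Spark then evaluates itself."""
    if op == "=":
        return {"term": {column: value}}
    if op in (">", ">=", "<", "<="):
        range_op = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}[op]
        return {"range": {column: {range_op: value}}}
    return None  # unsupported operator: no pushdown
```

When the predicate cannot be pushed down, the whole index must be scanned and filtered in Spark, which is exactly the scenario item 6 flags for performance testing.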

Proof of Concept

High Priorities (~3 weeks):

  1. Catalog Integration for OpenSearch Index: Integrating the OpenSearch index into the catalog system to facilitate its utilization in the query plan post-rewrite. [Efforts will focus on query rewrite requirements and avoid full integration, a task designated for [EPIC] Zero-ETL - OpenSearch Table #185]

    • Implementing a catalog and table interface for the OpenSearch index (3 days).
    • Mapping basic OpenSearch field types to SparkSQL types (2 days).
  2. Full Covering Index Rewrite: Developing query rewriting mechanisms to utilize a full covering index for optimizing query execution.

    • Adding a rewrite rule to identify queries suitable for full covering index utilization with all columns (2 days).
    • Enhancing the rule to accommodate queries matching covering index with some columns (3 days).
  3. Performance Testing: Evaluating the effectiveness of query rewriting in reducing latency or costs under specific conditions. (~1 week)

    • Test scenarios include: queries without filtering conditions, queries with filtering conditions or aggregations that cannot be pushed down, and queries with highly selective filtering conditions.
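The type-mapping task in PoC item 1 amounts to a small translation table. The mapping below is an assumption for illustration (OpenSearch types on the left, Spark SQL type names on the right); the actual Flint mapping may differ in coverage and naming.

```python
# Assumed mapping of basic OpenSearch field types to Spark SQL type names.
OS_TO_SPARK_TYPE = {
    "keyword": "string",
    "text": "string",
    "long": "bigint",
    "integer": "int",
    "double": "double",
    "float": "float",
    "boolean": "boolean",
    "date": "timestamp",
}

def to_spark_schema(os_mapping):
    """Convert an OpenSearch index mapping's properties into (name, type)
    pairs, skipping field types this sketch does not handle (e.g. nested
    objects, which have no "type" key)."""
    return [(name, OS_TO_SPARK_TYPE[props["type"]])
            for name, props in os_mapping["properties"].items()
            if props.get("type") in OS_TO_SPARK_TYPE]
```

Fields the mapping cannot translate are silently dropped here; a real implementation would need to decide whether to fail, warn, or exclude such columns from the rewritten plan.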

Secondary Priorities (~2 weeks):

  1. Partial Covering Index Rewrite: Facilitate query rewriting with a partial covering index, incorporating a WHERE clause for more precise data retrieval.
    a. Convert WHERE clause string in Flint metadata to expression (1 day)
    b. Perform subsumption test between expressions in query and those in partial covering index definition (1 week)

  2. User Hints for Freshness: Introduce hints that users can employ to retrieve the most recent results without the need for query rewriting.
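The subsumption test in secondary item 1.b can be sketched for the simplest case: conjunctions of numeric range predicates, one closed interval per column. This is a toy model for intuition only; real subsumption testing over arbitrary expressions is the part estimated at a week.

```python
# Toy subsumption check: does the partial index's WHERE clause contain every
# row the query needs? Predicates are dicts mapping column -> (low, high)
# closed-interval bounds, with None meaning unbounded on that side.
def subsumes(index_pred, query_pred):
    """Return True if every row satisfying query_pred also satisfies
    index_pred, so the partial covering index is safe to use."""
    for col, (ilo, ihi) in index_pred.items():
        qlo, qhi = query_pred.get(col, (None, None))
        # The query's bound must be at least as tight as the index's bound.
        if ilo is not None and (qlo is None or qlo < ilo):
            return False
        if ihi is not None and (qhi is None or qhi > ihi):
            return False
    return True
```

For example, an index built with WHERE status >= 400 subsumes a query for status = 500, but not an unconstrained scan; the latter must fall back to the source table.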


penghuo commented Apr 3, 2024

One question: what blocks users from directly querying the covering index? In my understanding, we allow users to query the original table; for that, the background refresh job needs to keep the data fresh.

vamsimanohar (Member) commented:

  1. What would the rewritten query look like?
  2. Is this for improving the performance of queries on the table, or for exposing OpenSearch capabilities via Spark SQL?
  3. As you rightly pointed out, we have the data, and another copy of a subset of it is stored in a different format (Lucene). Are we looking for use cases that exploit these different storages? Is there any precedent for this type of use case in other products?


dai-chen commented Apr 4, 2024

@penghuo @vamsi-amazon Thanks for the comments!

  1. As discussed, the proposed feature is for a unified SparkSQL user experience.
  2. Updated the Example section above with the query plan before and after rewriting.
  3. This is mostly for performance improvement. Exposing full OS via SparkSQL is tracked in [EPIC] Zero-ETL - OpenSearch Table #185. (Will update the PoC item to clarify.)
  4. For the use case, I think our main motivation is as mentioned in item 1.
