[RFC] Accelerate Spark SQL queries by covering index #298
Design Option: Query Rewriting in Spark using Index Data Only
One design option is to use the Flint optimizer within Spark to rewrite queries, similar to how skipping index query rewriting works today. In this approach, eligible queries are rewritten to read exclusively from the covering index data in OpenSearch.
Workflow
Example
Limitations
Proof of Concept
High Priorities (~3 weeks):
Secondary Priorities (~2 weeks):
One question: what blocks users from querying the covering index directly? My understanding is that we let users query the original table. For that to work, the background refresh job needs to keep the index data fresh.
@penghuo @vamsi-amazon Thanks for the comments!
Is your feature request related to a problem?
Currently, covering indexes are ingested and exposed directly in OpenSearch. Users can use this data for visualization, alerting, and reporting, and can manage its lifecycle independently, but Spark SQL queries on the source table are not accelerated by the covering index data.
What solution would you like?
We propose improving the performance of Spark SQL queries by leveraging covering indexes. Specifically, this involves rewriting queries on a source table to use the covering index data built from that same table.
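To make the idea concrete, here is a minimal sketch of the eligibility check that such a rewrite implies: a query can be answered from a covering index only if it targets the index's source table and every column it references is stored in the index. This is not the Flint API. `CoveringIndex`, `can_rewrite`, and `choose_index` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoveringIndex:
    """Hypothetical description of a covering index (not a Flint class)."""
    name: str
    source_table: str
    columns: frozenset


def can_rewrite(query_table, query_columns, index):
    """Return True when the index alone can answer the query.

    The query must target the index's source table, and every column it
    references (projection, filter, grouping) must be stored in the index.
    """
    return query_table == index.source_table and set(query_columns) <= index.columns


def choose_index(query_table, query_columns, indexes):
    """Pick the narrowest eligible index, or None to fall back to the source table."""
    eligible = [i for i in indexes if can_rewrite(query_table, query_columns, i)]
    return min(eligible, key=lambda i: len(i.columns), default=None)
```

Returning `None` when no index qualifies keeps the fallback explicit: the query then runs unchanged against the source table.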
Use Cases
After visualizing the data to gain initial insights, users typically seek more in-depth analysis to uncover the underlying trends, patterns, and potential root causes behind what they observed. This phase often involves examining various data points and conducting comprehensive root cause analysis. The proposed feature aims to speed up this in-depth analysis by using the covering index data that already exists in OpenSearch, eliminating redundant scans of the source data.
Open questions regarding this proposal:
a. Currently, we only guarantee eventual consistency, and the freshness of OpenSearch index data is determined by the latency of index refresh.
b. Users may have difficulty seeing the latest data if query rewrite is always enabled.
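One way to reason about the trade-off in (a) and (b) is a staleness bound: rewrite only when the index was refreshed recently enough for the caller. The sketch below is purely illustrative and is not something this RFC proposes. `should_use_index`, `last_refresh`, and `max_staleness` are hypothetical names, and how the last refresh time would be obtained is left open.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_use_index(last_refresh: datetime,
                     max_staleness: timedelta,
                     now: Optional[datetime] = None) -> bool:
    """Return True if the index was refreshed within the tolerated staleness window.

    Assumes `last_refresh` is a timezone-aware timestamp of the most recent
    completed index refresh (a hypothetical input, not a Flint field).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_refresh <= max_staleness
```

A query that needs the latest data could pass a small `max_staleness` (or zero) and fall back to the source table, while exploratory queries could accept a larger bound.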
What alternatives have you considered?
Alternatively, users can:
Do you have any additional context?
Most of the discussion here applies to query rewrite for materialized views. For clarity, this will be discussed separately.