From 214b7dafaa9e66a693e12d231820bce9f185cde7 Mon Sep 17 00:00:00 2001 From: Fanit Kolchina Date: Thu, 19 Dec 2024 10:13:25 -0500 Subject: [PATCH] Tech and editorial review Signed-off-by: Fanit Kolchina --- .../scheduled-query-acceleration.md | 106 +++++------------- 1 file changed, 31 insertions(+), 75 deletions(-) diff --git a/_dashboards/management/scheduled-query-acceleration.md b/_dashboards/management/scheduled-query-acceleration.md index faa9eff0b6..916f1ea186 100644 --- a/_dashboards/management/scheduled-query-acceleration.md +++ b/_dashboards/management/scheduled-query-acceleration.md @@ -10,21 +10,21 @@ has_children: false Introduced 2.17 {: .label .label-purple } -Scheduled Query Acceleration (SQA) is designed to optimize direct queries from OpenSearch to external data sources, such as Amazon Simple Storage Service (Amazon S3). It addresses issues often faced when managing and refreshing indexes, views, and data in an automated way. +Scheduled Query Acceleration (SQA) is designed to optimize queries sent directly from OpenSearch to external data sources, such as Amazon Simple Storage Service (Amazon S3). It uses automation to address issues commonly encountered when managing and refreshing indexes, views, and data. -Query acceleration is facilitated by secondary indexes like [skipping indexes]({{site.url}}{{site.baseurl}}/dashboards/management/accelerate-external-data/#skipping-indexes), [covering indexes]({{site.url}}{{site.baseurl}}/dashboards/management/accelerate-external-data/#covering-indexes), or [materialized views]({{site.url}}{{site.baseurl}}/dashboards/management/accelerate-external-data/#materialized-views). When queries run, they use these indexes instead of directly querying S3. 
+Query acceleration is facilitated by secondary indexes like [skipping indexes]({{site.url}}{{site.baseurl}}/dashboards/management/accelerate-external-data/#skipping-indexes), [covering indexes]({{site.url}}{{site.baseurl}}/dashboards/management/accelerate-external-data/#covering-indexes), or [materialized views]({{site.url}}{{site.baseurl}}/dashboards/management/accelerate-external-data/#materialized-views). When queries run, they use these indexes instead of directly querying Amazon S3. -The secondary indexes need to be refreshed periodically to stay current with the Amazon S3 data. This refresh can be scheduled using an internal scheduler (within Spark) or an external scheduler. +The secondary indexes need to be refreshed periodically in order to remain current with the Amazon S3 data. This refresh operation can be scheduled using either an internal scheduler (within Spark) or an external scheduler. -Using SQA provides the following benefits: +SQA provides the following benefits: - **Cost reduction through optimized resource usage**: SQA reduces the operational load on driver nodes, lowering the costs associated with maintaining auto-refresh for indexes and views. -- **Improved observability of refresh operations**: SQA provides visibility into index states and refresh timings, offering insights into data processing and the current system state. +- **Improved observability of refresh operations**: SQA provides visibility into index states and refresh timing, offering insights into data processing and the current system state. -- **Better control over refresh scheduling**: SQA allows flexible scheduling of refresh intervals, helping manage resource usage and refresh frequency according to specific requirements. +- **Better control over refresh scheduling**: SQA allows flexible scheduling of refresh intervals, helping you to manage resource usage and refresh frequency according to specific requirements. 
-- **Simplified index management**: SQA enables updates to index settings, such as refresh intervals, in a single query, simplifying workflows. +- **Simplified index management**: SQA enables updates to index settings, such as refresh intervals, in a single query, which simplifies workflows. ## Concepts @@ -39,16 +39,13 @@ Before configuring SQA, familiarize yourself with the following topics: Before configuring SQA, verify that the following requirements are met: - Ensure you're running OpenSearch version 2.17 or later. -- Ensure you have the SQL plugin installed. The SQL plugin is part of most OpenSearch distributions. For more information, see [Installing plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/). -- Ensure you have configured an Amazon S3 and Amazon EMR Serverless (needed for access to Apache Spark). +- Ensure you have the SQL plugin installed. The SQL plugin is included in most OpenSearch distributions. For more information, see [Installing plugins]({{site.url}}{{site.baseurl}}/install-and-configure/plugins/). +- Ensure you have configured a data source (in this example, Amazon S3): Configure a skipping index, covering index, or materialized view. These secondary data sources are additional data structures that improve query performance by optimizing queries sent to external data sources, such as Amazon S3. For more information, see [Optimizing query performance using OpenSearch indexing]({{site.url}}{{site.baseurl}}/dashboards/management/accelerate-external-data/). +- Configure Amazon EMR Serverless (needed for access to Apache Spark). -## Configuring SQA +## Configuring SQA settings -To configure SQA, perform the following steps. 
- -### Step 1: Configure the OpenSearch cluster settings - -Configure the following cluster settings: +If you want to override default configuration values, change the following cluster settings: - **Enable asynchronous query execution**: Set `plugins.query.executionengine.async_query.enabled` to `true` (default value): ```json @@ -63,7 +60,7 @@ Configure the following cluster settings: For more information, see [Settings](https://github.com/opensearch-project/sql/blob/main/docs/user/admin/settings.rst#pluginsqueryexecutionengineasync_queryenabled). -- **Configure the external scheduler interval for asynchronous queries**: This setting defines how often the external scheduler checks for tasks, allowing customization of refresh frequency. There is no default value for this setting so you must explicitly configure it. Adjusting the interval based on workload can optimize resources and manage costs: +- **Configure the external scheduler interval for asynchronous queries**: This setting defines how often the external scheduler checks for tasks, allowing customization of refresh frequency. This setting has no default value of its own: if it is empty, the default of `5 minutes` comes from `opensearch-spark`. Adjusting the interval based on workload volume can help you to optimize resources and manage costs: ```json PUT /_cluster/settings { @@ -76,18 +73,6 @@ Configure the following cluster settings: For more information, see [Settings](https://github.com/opensearch-project/sql/blob/main/docs/user/admin/settings.rst#pluginsqueryexecutionengineasync_queryexternal_schedulerinterval). -### Step 2: Configure a data source - -Connect OpenSearch to your Amazon S3 data source using the OpenSearch Dashboards interface. For more information, see [Connecting Amazon S3 to OpenSearch]({{site.url}}{{site.baseurl}}/dashboards/management/S3-data-source/). 
- -After this step, you can directly query your S3 data (the primary data source) using [Query Workbench]({{site.url}}{{site.baseurl}}/dashboards/query-workbench/). - -### Step 3: Configure query acceleration - -Configure a skipping index, covering index, or materialized view. These secondary data sources are additional data structures that improve query performance by optimizing queries on external data sources, such as Amazon S3. For more information, see [Optimize query performance using OpenSearch indexing]({{site.url}}{{site.baseurl}}/dashboards/management/accelerate-external-data/). - -After this step, you can [run accelerated queries](#running-an-accelerated-query) using one of the secondary data sources. - ## Running an accelerated query You can run accelerated queries in [Query Workbench]({{site.url}}{{site.baseurl}}/dashboards/query-workbench/). To run an accelerated query, use the following syntax: @@ -101,7 +86,7 @@ WITH ( ``` {% include copy.html %} -By default, the query uses an external scheduler. To specify an internal scheduler, set `scheduler_mode` to `internal`: +By default, the query uses an external scheduler. To use an internal scheduler, set `scheduler_mode` to `internal`: ```sql CREATE SKIPPING INDEX example_index @@ -115,12 +100,12 @@ WITH ( ## Parameters -When creating indexes using an accelerated query, you can specify the following parameters in the `WITH` clause to control the refresh behavior, scheduling, and timing. +When creating indexes using an accelerated query, you can specify the following parameters in the `WITH` clause to control refresh behavior, scheduling, and timing. | Parameter | Description | |:--- | :--- | -| `auto_refresh` | Enables automatic refresh for the index. If `true`, the index refreshes automatically at the specified interval. If `false`, refresh must be triggered manually using the `REFRESH` statement. Default is `false`. 
| -| `refresh_interval` | Defines the time interval between refresh operations for the index, which determines how frequently new data is integrated into the index. This is applicable only when `auto_refresh` is enabled. The interval determines how frequently new data is integrated and can be specified in formats like `1 minute` or `10 seconds`. For valid time units, see [Time units](#time-units).| +| `auto_refresh` | Enables automatic refresh for the index. If `true`, the index refreshes automatically at the specified interval. If `false`, the refresh operation must be triggered manually using the `REFRESH` statement. Default is `false`. | +| `refresh_interval` | Defines the amount of time between refresh operations for the index, which determines how frequently new data is ingested into the index. This is applicable only when `auto_refresh` is enabled. The interval can be specified in formats like `1 minute` or `10 seconds`. For valid time units, see [Time units](#time-units).| | `scheduler_mode` | Specifies the scheduling mode for auto-refresh (internal or external scheduling). The external scheduler requires a `checkpoint_location` (a path for refresh job checkpoints) for state management. For more information, see [Starting streaming queries](https://spark.apache.org/docs/3.5.1/structured-streaming-programming-guide.html#starting-streaming-queries). Valid values are `internal` and `external`.| For more information and additional available parameters, see [Flint index refresh](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md#flint-index-refresh). 
@@ -135,23 +120,9 @@ You can specify the following time units when defining time intervals: - Hours: `h`, `hour`, or `hours` - Days: `d`, `day`, or `days` -## Creating a scheduled refresh job - -To create an index with a scheduled refresh job, use the following statement: - -```sql -CREATE SKIPPING INDEX example_index -WITH ( - auto_refresh = true, - refresh_interval = '15 minutes', - scheduler_mode = 'external' -); -``` -{% include copy.html %} - ## Monitoring index status -To monitor index status, use the following statement: +To monitor the status of an index, use the following statement: ```sql SHOW FLINT INDEXES IN spark_catalog.default; @@ -164,23 +135,16 @@ Use the following commands to manage scheduled jobs. ### Enabling jobs -To disable the external scheduler, use the ALTER command with a manual refresh: +To disable auto refresh using an internal or external scheduler, set `auto_refresh` to `false`: ```sql ALTER MATERIALIZED VIEW myglue_test.default.count_by_status_v9 WITH (auto_refresh = false); ``` {% include copy.html %} -To enable the external scheduler, use the ALTER command with an auto-refresh: - -```sql -ALTER MATERIALIZED VIEW myglue_test.default.count_by_status_v9 WITH (auto_refresh = true); -``` -{% include copy.html %} - ### Updating schedules -To update the schedule and modify refresh settings, specify the `refresh_interval` in the `WITH` clause: +To update the schedule and modify the refresh settings, specify the `refresh_interval` in the `WITH` clause: ```sql ALTER INDEX example_index @@ -188,18 +152,18 @@ WITH (refresh_interval = '30 minutes'); ``` {% include copy.html %} -### Updating the scheduler mode +### Switching the scheduler mode -To update the scheduler mode, specify the `scheduler_mode` in the `WITH` clause: +To switch the scheduler mode, specify the `scheduler_mode` in the `WITH` clause: ```sql ALTER MATERIALIZED VIEW myglue_test.default.count_by_status_v9 WITH (scheduler_mode = 'internal'); ``` {% include copy.html %} -### 
Verifying scheduler job status +### Inspecting scheduler metadata -To verify scheduler job status, use the following request: +To inspect scheduler metadata, use the following request: ```json GET /.async-query-scheduler/_search @@ -212,41 +176,33 @@ We recommend the following best practices when using SQA. ### Performance optimization -We recommend the following practices for best performance: - -- **Recommended refresh intervals**: Choosing the right refresh interval is crucial for balancing resource usage and system performance. Consider your workload requirements and the freshness of data you need when setting intervals. +- **Recommended refresh intervals**: Choosing the right refresh interval is crucial for balancing resource usage and system performance. Consider your workload requirements and the freshness of the data you need when setting intervals. -- **Concurrent job limits**: Limit the number of concurrent jobs running to avoid overloading system resources. Monitor system capacity and adjust job limits accordingly to ensure optimal performance. +- **Concurrent job limits**: Limit the number of concurrently running jobs to avoid overloading system resources. Monitor system capacity and adjust job limits accordingly to ensure optimal performance. - **Resource usage**: Efficient resource allocation is key to maximizing performance. Properly allocate memory, CPU, and I/O based on the workload and the type of queries you're running. ### Cost management -We recommend the following practices to reduce costs: - - **Use an external scheduler**: An external scheduler offloads refresh operations, reducing the demand on core driver nodes. - **Configure a refresh interval for your use case**: Longer refresh intervals lead to reduced costs but may impact data freshness. - **Optimize the refresh schedule**: Adjust refresh intervals based on workload patterns to reduce unnecessary refresh operations. 
-- **Monitor costs**: Regularly monitor the costs related to scheduled queries and refresh operations. Using observability tools can help you gain insights into resource usage and costs over time. +- **Monitor costs**: Regularly monitor costs related to scheduled queries and refresh operations. Using observability tools can help you gain insights into resource usage and costs over time. -## Validations +## Validating settings -You can validate your settings by running a test query and verifying the scheduler configurations: +You can validate your settings by running a test query and verifying the scheduler configuration: ```sql SHOW FLINT INDEXES EXTENDED ``` {% include copy.html %} -For more information, see [OpenSearch Spark documentation](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md#all-indexes). +For more information, see the [OpenSearch Spark documentation](https://github.com/opensearch-project/opensearch-spark/blob/main/docs/index.md#all-indexes). ## Troubleshooting -If the refresh operation is not triggering as expected, ensure the `auto_refresh` setting is enabled and the refresh interval is properly configured. - -## Next steps - -For answers to more technical questions, see the [OpenSearch Spark RFC](https://github.com/opensearch-project/opensearch-spark/issues/416). \ No newline at end of file +If the refresh operation is not triggering as expected, ensure that the `auto_refresh` setting is enabled and the refresh interval is properly configured.