docs: msq autocompaction (#16681)
Co-authored-by: Kashif Faraz <[email protected]>
Co-authored-by: Vishesh Garg <[email protected]>
Co-authored-by: Victoria Lim <[email protected]>
(cherry picked from commit d1b81f3)
317brian committed Oct 17, 2024
1 parent 9dac0f7 commit 37df4d0
Showing 7 changed files with 237 additions and 78 deletions.
8 changes: 7 additions & 1 deletion docs/api-reference/automatic-compaction-api.md
@@ -27,7 +27,13 @@ import TabItem from '@theme/TabItem';
~ under the License.
-->

This topic describes the status and configuration API endpoints for [automatic compaction](../data-management/automatic-compaction.md) in Apache Druid. You can configure automatic compaction in the Druid web console or API.
This topic describes the status and configuration API endpoints for [automatic compaction using Coordinator duties](../data-management/automatic-compaction.md#auto-compaction-using-coordinator-duties) in Apache Druid. You can configure automatic compaction in the Druid web console or API.

:::info Experimental

Instead of the automatic compaction API, you can use the supervisor API to submit auto-compaction jobs using compaction supervisors. For more information, see [Auto-compaction using compaction supervisors](../data-management/automatic-compaction.md#auto-compaction-using-compaction-supervisors).

:::

In this topic, `http://ROUTER_IP:ROUTER_PORT` is a placeholder for your Router service address and port. Replace it with the information for your deployment. For example, use `http://localhost:8888` for quickstart deployments.
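
For illustration, a minimal auto-compaction configuration submitted to the Coordinator's compaction config endpoint (`POST /druid/coordinator/v1/config/compaction`) might look like the following sketch. The datasource name and offset values are placeholders, not part of this change:

```json
{
  "dataSource": "wikipedia",
  "skipOffsetFromLatest": "P1D",
  "tuningConfig": {
    "type": "index_parallel",
    "maxRowsInMemory": 100000
  }
}
```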

3 changes: 2 additions & 1 deletion docs/configuration/index.md
@@ -1046,7 +1046,7 @@ The following table shows the supported configurations for auto-compaction.

|Property|Description|Required|
|--------|-----------|--------|
|type|The task type, this should always be `index_parallel`.|yes|
|type|The task type. If you're using Coordinator duties for auto-compaction, set it to `index_parallel`. If you're using compaction supervisors, set it to `autocompact`.|yes|
|`maxRowsInMemory`|Used in determining when intermediate persists to disk should occur. Normally user does not need to set this, but depending on the nature of data, if rows are short in terms of bytes, user may not want to store a million rows in memory and this value should be set.|no (default = 1000000)|
|`maxBytesInMemory`|Used in determining when intermediate persists to disk should occur. Normally this is computed internally and user does not need to set it. This value represents number of bytes to aggregate in heap memory before persisting. This is based on a rough estimate of memory usage and not actual usage. The maximum heap memory usage for indexing is `maxBytesInMemory` * (2 + `maxPendingPersists`)|no (default = 1/6 of max JVM memory)|
|`splitHintSpec`|Used to give a hint to control the amount of data that each first phase task reads. This hint could be ignored depending on the implementation of the input source. See [Split hint spec](../ingestion/native-batch.md#split-hint-spec) for more details.|no (default = size-based split hint spec)|
@@ -1063,6 +1063,7 @@ The following table shows the supported configurations for auto-compaction.
|`taskStatusCheckPeriodMs`|Polling period in milliseconds to check running task statuses.|no (default = 1000)|
|`chatHandlerTimeout`|Timeout for reporting the pushed segments in worker tasks.|no (default = PT10S)|
|`chatHandlerNumRetries`|Retries for reporting the pushed segments in worker tasks.|no (default = 5)|
|`engine`|Engine for compaction. Can be either `native` or `msq`. `msq` uses the MSQ task engine and is only supported with [compaction supervisors](../data-management/automatic-compaction.md#auto-compaction-using-compaction-supervisors).|no (default = `native`)|
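
As a sketch, an auto-compaction configuration that opts into the MSQ task engine might set the property like this. The datasource name and offset are illustrative, and `msq` requires compaction supervisors as noted in the table:

```json
{
  "dataSource": "wikipedia",
  "skipOffsetFromLatest": "P1D",
  "engine": "msq"
}
```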

###### Automatic compaction granularitySpec

Expand Down
276 changes: 207 additions & 69 deletions docs/data-management/automatic-compaction.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/ingestion/concurrent-append-replace.md
@@ -34,7 +34,7 @@ If you want to append data to a datasource while compaction is running, you need

In the **Compaction config** for a datasource, enable **Use concurrent locks (experimental)**.

For details on accessing the compaction config in the UI, see [Enable automatic compaction with the web console](../data-management/automatic-compaction.md#web-console).
For details on accessing the compaction config in the UI, see [Enable automatic compaction with the web console](../data-management/automatic-compaction.md#manage-auto-compaction-using-the-web-console).

### Update the compaction settings with the API

12 changes: 6 additions & 6 deletions docs/ingestion/supervisor.md
@@ -23,22 +23,22 @@ sidebar_label: Supervisor
~ under the License.
-->

A supervisor manages streaming ingestion from external streaming sources into Apache Druid.
Supervisors oversee the state of indexing tasks to coordinate handoffs, manage failures, and ensure that the scalability and replication requirements are maintained.
Apache Druid uses supervisors to manage streaming ingestion from external streaming sources into Druid.
Supervisors oversee the state of indexing tasks to coordinate handoffs, manage failures, and ensure that the scalability and replication requirements are maintained. They can also be used to perform [automatic compaction](../data-management/automatic-compaction.md) after data has been ingested.

This topic uses the Apache Kafka term *offset* to refer to the identifier for records in a partition. If you are using Amazon Kinesis, the equivalent is *sequence number*.

## Supervisor spec

Druid uses a JSON specification, often referred to as the supervisor spec, to define streaming ingestion tasks.
The supervisor spec specifies how Druid should consume, process, and index streaming data.
Druid uses a JSON specification, often referred to as the supervisor spec, to define tasks used for streaming ingestion or auto-compaction.
The supervisor spec specifies how Druid should consume, process, and index data from an external stream or Druid itself.

The following table outlines the high-level configuration options for a supervisor spec:

|Property|Type|Description|Required|
|--------|----|-----------|--------|
|`type`|String|The supervisor type. One of `kafka`or `kinesis`.|Yes|
|`spec`|Object|The container object for the supervisor configuration.|Yes|
|`type`|String|The supervisor type. For streaming ingestion, this can be `kafka`, `kinesis`, or `rabbit`. For automatic compaction, set the type to `autocompact`.|Yes|
|`spec`|Object|The container object for the supervisor configuration. For automatic compaction, this is the same as the compaction configuration.|Yes|
|`spec.dataSchema`|Object|The schema for the indexing task to use during ingestion. See [`dataSchema`](../ingestion/ingestion-spec.md#dataschema) for more information.|Yes|
|`spec.ioConfig`|Object|The I/O configuration object to define the connection and I/O-related settings for the supervisor and indexing tasks.|Yes|
|`spec.tuningConfig`|Object|The tuning configuration object to define performance-related settings for the supervisor and indexing tasks.|No|
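
To sketch the auto-compaction case described in the table above, an `autocompact` supervisor spec wraps a compaction configuration in its `spec` field, roughly like this. The datasource name, offset, and engine values are illustrative assumptions, not a definitive spec:

```json
{
  "type": "autocompact",
  "spec": {
    "dataSource": "wikipedia",
    "skipOffsetFromLatest": "P1D",
    "engine": "msq"
  }
}
```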
13 changes: 13 additions & 0 deletions docs/multi-stage-query/known-issues.md
@@ -68,3 +68,16 @@ properties, and the `indexSpec` [`tuningConfig`](../ingestion/ingestion-spec.md#
- The maximum number of elements in a window cannot exceed a value of 100,000.
- To avoid `leafOperators` in the MSQ engine, window functions have an extra scan stage after the window stage for cases
where the native engine has a non-empty `leafOperator`.

## Automatic compaction

<!-- If you update this list, also update data-management/automatic-compaction.md -->

The following known issues and limitations affect automatic compaction with the MSQ task engine:

- The `metricSpec` field is only supported for certain aggregators. For more information, see [Supported aggregators](../data-management/automatic-compaction.md#supported-aggregators).
- Only dynamic and range-based partitioning are supported.
- Set `rollup` to `true` if and only if `metricSpec` is not empty or null.
- You can only partition on string dimensions; multi-valued string dimensions are not supported.
- The `maxTotalRows` config is not supported in `DynamicPartitionsSpec`. Use `maxRowsPerSegment` instead.
- Segments can only be sorted on `__time` as the first column.
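
To illustrate the `maxTotalRows` limitation above, a dynamic partitions spec for compaction with the MSQ task engine would rely on `maxRowsPerSegment` only. The value shown is an illustrative placeholder:

```json
{
  "type": "dynamic",
  "maxRowsPerSegment": 3000000
}
```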
1 change: 1 addition & 0 deletions website/.spelling
@@ -410,6 +410,7 @@ maxNumSegments
max_map_count
memcached
mergeable
mergeability
metadata
metastores
millis
