docs: msq autocompaction #16681

Merged
merged 28 commits into from
Oct 17, 2024
Changes from 9 commits
7d5c8d3
docs: msq autocompaction docs
317brian Jul 2, 2024
63bc906
cleanup
317brian Jul 2, 2024
4c204e7
Merge branch 'master' into msq-autocompact-docs
317brian Jul 2, 2024
0f96df3
Merge branch 'master' into msq-autocompact-docs
317brian Sep 18, 2024
d118847
update for overlord-based autocompact
317brian Sep 19, 2024
b698acf
parallelism
317brian Sep 19, 2024
0699168
update list in ki
317brian Sep 19, 2024
c58b26b
update supervisor docs
317brian Sep 20, 2024
5e08d07
fix typos
317brian Sep 20, 2024
802af43
Apply suggestions from code review
317brian Sep 27, 2024
30db2ba
address comments
317brian Sep 27, 2024
56ede75
fix typo
317brian Sep 30, 2024
ea6c767
address comments
317brian Oct 1, 2024
9defd6e
Apply suggestions from code review
317brian Oct 3, 2024
4fe38a7
address comments
317brian Oct 3, 2024
790dcf0
fix link etc
317brian Oct 4, 2024
ae5829b
update aggregator section
317brian Oct 7, 2024
ba1434a
fix link
317brian Oct 8, 2024
1dfbb6b
Merge branch 'master' into msq-autocompact-docs
317brian Oct 9, 2024
52dfd51
Apply suggestions from code review
317brian Oct 10, 2024
a0e2b61
Update docs/data-management/automatic-compaction.md
317brian Oct 15, 2024
3e1bbf3
Apply suggestions from code review
317brian Oct 15, 2024
82baf46
Apply suggestions from code review
317brian Oct 15, 2024
632da59
address review comments
317brian Oct 15, 2024
390dd6a
Apply suggestions from code review
317brian Oct 16, 2024
1ea4fbd
update config page
317brian Oct 16, 2024
9667d59
fix link
317brian Oct 16, 2024
7fda723
update spelling file
317brian Oct 16, 2024
232 changes: 179 additions & 53 deletions docs/data-management/automatic-compaction.md
@@ -22,76 +22,45 @@ title: "Automatic compaction"
~ under the License.
-->

In Apache Druid, compaction is a special type of ingestion task that reads data from a Druid datasource and writes it back into the same datasource. A common use case for this is to [optimally size segments](../operations/segment-optimization.md) after ingestion to improve query performance. Automatic compaction, or auto-compaction, refers to the system for automatic execution of compaction tasks managed by the [Druid Coordinator](../design/coordinator.md).
This topic guides you through setting up automatic compaction for your Druid cluster. See the [examples](#examples) for common use cases for automatic compaction.

## How Druid manages automatic compaction

The Coordinator [indexing period](../configuration/index.md#coordinator-operation), `druid.coordinator.period.indexingPeriod`, controls the frequency of compaction tasks.
The default indexing period is 30 minutes, meaning that the Coordinator first checks for segments to compact at most 30 minutes from when auto-compaction is enabled.
This time period affects other Coordinator duties including merge and conversion tasks.
To configure the auto-compaction time period without interfering with `indexingPeriod`, see [Set frequency of compaction runs](#set-frequency-of-compaction-runs).
In Apache Druid, compaction is a special type of ingestion task that reads data from a Druid datasource and writes it back into the same datasource. A common use case for this is to [optimally size segments](../operations/segment-optimization.md) after ingestion to improve query performance. Automatic compaction, or auto-compaction, refers to the system for automatic execution of compaction tasks managed by the [Druid Coordinator](../design/coordinator.md) or the [Overlord](../design/overlord.md).

At every invocation of auto-compaction, the Coordinator initiates a [segment search](../design/coordinator.md#segment-search-policy-in-automatic-compaction) to determine eligible segments to compact.
When there are eligible segments to compact, the Coordinator issues compaction tasks based on available worker capacity.
If a compaction task takes longer than the indexing period, the Coordinator waits for it to finish before resuming the period for segment search.
You can run automatic compaction on the Coordinator using the native engine, or on the Overlord using either the native engine or the multi-stage query (MSQ) task engine. Using the Overlord with the MSQ task engine provides faster compaction times as well as better memory tuning and usage. Both approaches use the same syntax, but you submit the automatic compaction configuration differently for each.

:::info
Auto-compaction skips datasources that have a segment granularity of `ALL`.
:::

As a best practice, you should set up auto-compaction for all Druid datasources. You can run compaction tasks manually for cases where you want to allocate more system resources. For example, you may choose to run multiple compaction tasks in parallel to compact an existing datasource for the first time. See [Compaction](compaction.md) for additional details and use cases.

This topic guides you through setting up automatic compaction for your Druid cluster. See the [examples](#examples) for common use cases for automatic compaction.

## Enable automatic compaction
## Coordinator-based

You can enable automatic compaction for a datasource using the web console or programmatically via an API.
This process differs for manual compaction tasks, which can be submitted from the [Tasks view of the web console](../operations/web-console.md) or the [Tasks API](../api-reference/tasks-api.md).
The Coordinator [indexing period](../configuration/index.md#coordinator-operation), `druid.coordinator.period.indexingPeriod`, controls the frequency of compaction tasks.
The default indexing period is 30 minutes, meaning that the Coordinator first checks for segments to compact at most 30 minutes from when auto-compaction is enabled.
This time period affects other Coordinator duties including merge and conversion tasks.
To configure the auto-compaction time period without interfering with `indexingPeriod`, see [Compaction frequency](#compaction-frequency).

### Web console
At every invocation of auto-compaction, the Coordinator initiates a [segment search](../design/coordinator.md#segment-search-policy-in-automatic-compaction) to determine eligible segments to compact.
When there are eligible segments to compact, the Coordinator issues compaction tasks based on available worker capacity.
If a compaction task takes longer than the indexing period, the Coordinator waits for it to finish before resuming the period for segment search.

Use the web console to enable automatic compaction for a datasource as follows.
No additional configuration is needed to run automatic compaction tasks using the Coordinator and native engine. This is the default behavior for Druid.

1. Click **Datasources** in the top-level navigation.
2. In the **Compaction** column, click the edit icon for the datasource to compact.
3. In the **Compaction config** dialog, configure the auto-compaction settings. The dialog offers a form view as well as a JSON view. Editing the form updates the JSON specification, and editing the JSON updates the form field, if present. Form fields not present in the JSON indicate default values. You may add additional properties to the JSON for auto-compaction settings not displayed in the form. See [Configure automatic compaction](#configure-automatic-compaction) for supported settings for auto-compaction.
4. Click **Submit**.
5. Refresh the **Datasources** view. The **Compaction** column for the datasource changes from “Not enabled” to “Awaiting first run.”
## Overlord-based

The following screenshot shows the compaction config dialog for a datasource with auto-compaction enabled.
![Compaction config in web console](../assets/compaction-config.png)
You can run automatic compaction using the Overlord rather than the Coordinator. When compaction runs on the Overlord, polling task status and running compaction at a higher frequency are more efficient than with comparable compaction on the Coordinator. With Overlord-based compaction, Druid checks every 5 seconds whether a datasource has data to compact.

To disable auto-compaction for a datasource, click **Delete** from the **Compaction config** dialog. Druid does not retain your auto-compaction configuration.
* In your Overlord runtime properties, set the following properties:
  * `druid.supervisor.compaction.enabled` to `true` so that compaction tasks can run as supervisor tasks
  * `druid.supervisor.compaction.defaultEngine` to `msq` to use the MSQ task engine as the compaction engine, or to `native` to use the native engine

### Compaction configuration API
After making these changes, you can submit automatic compaction tasks as supervisors. For more general information about supervisors, see [Supervisors](../ingestion/supervisor.md).

Use the [Automatic compaction API](../api-reference/automatic-compaction-api.md#manage-automatic-compaction) to configure automatic compaction.
To enable auto-compaction for a datasource, create a JSON object with the desired auto-compaction settings.
See [Configure automatic compaction](#configure-automatic-compaction) for the syntax of an auto-compaction spec.
Send the JSON object as a payload in a [`POST` request](../api-reference/automatic-compaction-api.md#create-or-update-automatic-compaction-configuration) to `/druid/coordinator/v1/config/compaction`.
The following example configures auto-compaction for the `wikipedia` datasource:

```sh
curl --location --request POST 'http://localhost:8081/druid/coordinator/v1/config/compaction' \
--header 'Content-Type: application/json' \
--data-raw '{
"dataSource": "wikipedia",
"granularitySpec": {
"segmentGranularity": "DAY"
}
}'
```

To disable auto-compaction for a datasource, send a [`DELETE` request](../api-reference/automatic-compaction-api.md#remove-automatic-compaction-configuration) to `/druid/coordinator/v1/config/compaction/{dataSource}`. Replace `{dataSource}` with the name of the datasource for which to disable auto-compaction. For example:

```sh
curl --location --request DELETE 'http://localhost:8081/druid/coordinator/v1/config/compaction/wikipedia'
```

## Configure automatic compaction
## Automatic compaction syntax

You can configure automatic compaction dynamically without restarting Druid.
The automatic compaction system uses the following syntax:
Both the native engine and the MSQ task engine use the following automatic compaction syntax:

```json
{
@@ -104,10 +73,15 @@ The automatic compaction system uses the following syntax:
"granularitySpec": <compaction task granularitySpec>,
"skipOffsetFromLatest": <time period to avoid compaction>,
"taskPriority": <compaction task priority>,
"taskContext": <task context>
"taskContext": <task context>,
"engine": <native|msq>
}
```

For Coordinator-based automatic compaction, you submit the spec to the [Compaction config UI](#ui-for-coordinator-based-compaction) or the [Compaction configuration API](#api-for-coordinator-based-compaction).

For Overlord-based automatic compaction, you submit a supervisor spec with the `type` set to `autocompact`.

Most fields in the auto-compaction configuration correlate to a typical [Druid ingestion spec](../ingestion/ingestion-spec.md).
The following properties only apply to auto-compaction:
* `skipOffsetFromLatest`
@@ -131,7 +105,54 @@ maximize performance and minimize disk usage of the `compact` tasks launched by

For more details on each of the specs in an auto-compaction configuration, see [Automatic compaction dynamic configuration](../configuration/index.md#automatic-compaction-dynamic-configuration).

### Set frequency of compaction runs
## Use Coordinator-based compaction

The default engine for compaction is the native engine running on the Coordinator. Prior to the availability of the Overlord for automatic compaction, the native engine was the only compaction engine available.

You can use Coordinator-based automatic compaction for a datasource through the web console or programmatically via an API.
This process differs for manual compaction tasks, which can be submitted from the [Tasks view of the web console](../operations/web-console.md) or the [Tasks API](../api-reference/tasks-api.md).

### UI for Coordinator-based compaction

Use the web console to enable automatic compaction for a datasource as follows:

1. Click **Datasources** in the top-level navigation.
2. In the **Compaction** column, click the edit icon for the datasource to compact.
3. In the **Compaction config** dialog, configure the auto-compaction settings. The dialog offers a form view as well as a JSON view. Editing the form updates the JSON specification, and editing the JSON updates the form field, if present. Form fields not present in the JSON indicate default values. You may add additional properties to the JSON for auto-compaction settings not displayed in the form. See [Automatic compaction syntax](#automatic-compaction-syntax) for supported settings for auto-compaction.
4. Click **Submit**.
5. Refresh the **Datasources** view. The **Compaction** column for the datasource changes from “Not enabled” to “Awaiting first run.”

The following screenshot shows the compaction config dialog for a datasource with auto-compaction enabled.
![Compaction config in web console](../assets/compaction-config.png)

To disable auto-compaction for a datasource, click **Delete** from the **Compaction config** dialog. Druid does not retain your auto-compaction configuration.

### API for Coordinator-based compaction

Use the [Automatic compaction API](../api-reference/automatic-compaction-api.md#manage-automatic-compaction) to configure automatic compaction.
To enable auto-compaction for a datasource, create a JSON object with the desired auto-compaction settings.
See [Automatic compaction syntax](#automatic-compaction-syntax) for the syntax of an auto-compaction spec.
Send the JSON object as a payload in a [`POST` request](../api-reference/automatic-compaction-api.md#create-or-update-automatic-compaction-configuration) to `/druid/coordinator/v1/config/compaction`.
The following example configures auto-compaction for the `wikipedia` datasource:

```sh
curl --location --request POST 'http://localhost:8081/druid/coordinator/v1/config/compaction' \
--header 'Content-Type: application/json' \
--data-raw '{
"dataSource": "wikipedia",
"granularitySpec": {
"segmentGranularity": "DAY"
}
}'
```

To disable auto-compaction for a datasource, send a [`DELETE` request](../api-reference/automatic-compaction-api.md#remove-automatic-compaction-configuration) to `/druid/coordinator/v1/config/compaction/{dataSource}`. Replace `{dataSource}` with the name of the datasource for which to disable auto-compaction. For example:

```sh
curl --location --request DELETE 'http://localhost:8081/druid/coordinator/v1/config/compaction/wikipedia'
```
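
To confirm that a configuration was stored, you could read it back from the same API. The following is a minimal sketch, assuming the Coordinator is reachable at `http://localhost:8081` as in the examples above. The `COORDINATOR_URL` variable and both function names are hypothetical helpers introduced here for illustration; only the endpoint path comes from the Automatic compaction API.

```sh
# Base URL of the Coordinator. localhost:8081 mirrors the examples above (assumption).
COORDINATOR_URL="${COORDINATOR_URL:-http://localhost:8081}"

# Hypothetical helper: build the config URL for one datasource ($1).
compaction_config_url() {
  printf '%s/druid/coordinator/v1/config/compaction/%s' "$COORDINATOR_URL" "$1"
}

# Hypothetical helper: fetch the stored auto-compaction config for a datasource ($1).
get_compaction_config() {
  curl --silent --show-error --location "$(compaction_config_url "$1")"
}
```

For example, `get_compaction_config wikipedia` would request the configuration that the `POST` example above submitted. Defining the functions does not contact the cluster; only calling `get_compaction_config` does.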

### Compaction frequency

If you want the Coordinator to check for compaction more frequently than its indexing period, create a separate group to handle compaction duties.
Set the time period of the duty group in the `coordinator/runtime.properties` file.
@@ -142,6 +163,108 @@ druid.coordinator.compaction.duties=["compactSegments"]
druid.coordinator.compaction.period=PT60S
```

## Use Overlord-based automatic compaction

When you use the Overlord for automatic compaction, Druid uses a supervisor task on the Overlord to perform the compaction. Since it's a supervisor task, automatic compaction using the Overlord can run frequently while providing faster compaction times as well as better memory tuning and usage.

When you use Overlord-based automatic compaction, you can use either the native engine, as with Coordinator-based automatic compaction, or the [MSQ task engine](#use-msq-for-automatic-compaction).

By default, Druid checks every 5 seconds to see whether or not compaction is required.

### Use MSQ for automatic compaction

The MSQ task engine is available as a compaction engine if you configure compaction tasks to run on the Overlord as a supervisor. To use the MSQ task engine for automatic compaction, make sure the following requirements are met:

* Have the [MSQ task engine extension loaded](../multi-stage-query/index.md#load-the-extension).
* In your Overlord runtime properties, set the following properties:
  * `druid.supervisor.compaction.enabled` to `true` so that compaction tasks can be run as a supervisor task
  * `druid.supervisor.compaction.defaultEngine` to `msq` to specify the MSQ task engine as the compaction engine
* Have at least two compaction task slots available or set `compactionConfig.taskContext.maxNumTasks` to two or more. The MSQ task engine requires at least two tasks to run, one controller task and one worker task.
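
The Overlord properties in the list above could be written to the runtime properties file as in the following sketch. The file path and the `OVERLORD_PROPS` variable are assumptions for illustration; use the location of the Overlord `runtime.properties` file in your own deployment.

```sh
# Path to the Overlord runtime properties file (hypothetical; adjust to your deployment).
OVERLORD_PROPS="${OVERLORD_PROPS:-./overlord-runtime.properties}"

# Append the two supervisor compaction properties named in the list above.
cat >> "$OVERLORD_PROPS" <<'EOF'
druid.supervisor.compaction.enabled=true
druid.supervisor.compaction.defaultEngine=msq
EOF
```

Because the snippet appends, running it twice duplicates the lines, so check the file before rerunning. Runtime properties are read at startup, so the Overlord needs a restart to pick up the change.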

You can use [MSQ task engine context parameters](../multi-stage-query/) in `compactionConfig.taskContext` when configuring your datasource for automatic compaction, such as setting the maximum number of tasks using the `compactionConfig.taskContext.maxNumTasks` parameter. Some of the MSQ task engine context parameters overlap with automatic compaction parameters. When these settings overlap, set one or the other.
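
As a rough illustration of how these pieces fit together, the following sketch writes a supervisor spec that selects the MSQ task engine and sets `maxNumTasks` in the task context. The file name, the `SPEC_FILE` variable, and the choice of the `wikipedia` datasource are assumptions; the field names come from the syntax described in this topic. A real spec would likely need more fields, such as a `granularitySpec` or `tuningConfig`.

```sh
# Output file for the spec (hypothetical name).
SPEC_FILE="${SPEC_FILE:-./autocompact-wikipedia.json}"

# Write a minimal supervisor spec that uses the MSQ task engine.
# maxNumTasks is 2 because MSQ needs one controller task and one worker task.
cat > "$SPEC_FILE" <<'EOF'
{
  "type": "autocompact",
  "spec": {
    "dataSource": "wikipedia",
    "engine": "msq",
    "taskContext": {
      "maxNumTasks": 2
    }
  }
}
EOF
```

You could then paste the contents of the file into the web console or send it as the payload of an API request.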

To submit an automatic compaction task, submit a supervisor spec through the UI or API with the `type` set to `autocompact` and a `spec` that defines the compaction behavior using the [automatic compaction syntax](#automatic-compaction-syntax). You can use the [web console](#ui-for-overlord-based-compaction) or the [API](#api-for-overlord-based-compaction).

### UI for Overlord-based compaction

To submit a supervisor spec for MSQ task engine automatic compaction, perform the following steps:

1. In the web console, go to the **Supervisors** tab.
1. Click **...** > **Submit JSON supervisor**.
1. In the dialog, include the following:
   - The type of supervisor spec by setting `"type": "autocompact"`
   - The compaction configuration by adding it to the `spec` field
   ```json
   {
     "type": "autocompact",
     "spec": {
       "dataSource": YOUR_DATASOURCE,
       ...
     }
   }
   ```
1. Submit the supervisor.

To stop the automatic compaction task, suspend or terminate the supervisor through the UI or API.

### API for Overlord-based compaction

Submitting an automatic compaction as a supervisor task uses the same endpoint as supervisor tasks for streaming ingestion.

The following example configures auto-compaction for the `wikipedia` datasource:

```sh
curl --location --request POST 'http://localhost:8081/druid/indexer/v1/supervisor' \
--header 'Content-Type: application/json' \
--data-raw '{
"type": "autocompact", // required
"suspended": false, // optional
"spec": { // required
"dataSource": "wikipedia", // required
"tuningConfig": {...}, // optional
"granularitySpec": {...}, // optional
...
Review comment (Contributor): We should also add the `engine` parameter here:

"granularitySpec": {...},
"engine": <native|msq>,            // optional
...

Review comment (Contributor): I understand that skipping it here and specifying it later in the supervisor-based spec may be confusing. If we keep it here, we just want to make sure that users realize that it's only supported with supervisors.

Also, we need to add this field to the Automatic compaction dynamic configuration page. Maybe this info can reside there, similar to below:

engine | Engine for compaction. Can be either native or msq. MSQ is only supported with compaction supervisors | no (default = native)

Review comment (Contributor): Shall we also include the above on the Automatic compaction dynamic configuration page?

}
}'
```

To stop the automatic compaction task, suspend or terminate the supervisor through the UI or API.
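
Suspending or terminating through the API could look like the following sketch. It assumes that `autocompact` supervisors respond to the standard supervisor API actions (`suspend`, `resume`, `terminate`) and that the Overlord API is reachable at `http://localhost:8081`, as in the example above. The `OVERLORD_URL` variable and the function names are hypothetical helpers, and the supervisor ID is passed in rather than guessed.

```sh
# Base URL for the Overlord API. localhost:8081 mirrors the example above (assumption).
OVERLORD_URL="${OVERLORD_URL:-http://localhost:8081}"

# Hypothetical helper: build the URL for an action on one supervisor.
# $1 = supervisor ID, $2 = action (suspend, resume, or terminate)
supervisor_action_url() {
  printf '%s/druid/indexer/v1/supervisor/%s/%s' "$OVERLORD_URL" "$1" "$2"
}

# Hypothetical helper: list supervisor IDs so you can find the compaction supervisor.
list_supervisors() {
  curl --silent --show-error --location "$OVERLORD_URL/druid/indexer/v1/supervisor"
}

# Hypothetical helper: run an action against a supervisor.
supervisor_action() {
  curl --silent --show-error --location --request POST "$(supervisor_action_url "$1" "$2")"
}
```

A typical sequence would be to call `list_supervisors`, pick the ID of the compaction supervisor from the response, and then call `supervisor_action <id> suspend`.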

### MSQ task engine limitations

When using the MSQ task engine for auto-compaction, keep the following limitations in mind:

- The `metricSpec` field is only supported for idempotent aggregators. For more information, see [Idempotent aggregators](#idempotent-aggregators).
- Only dynamic and range-based partitioning are supported.
- Set `rollup` to `true` if `metricSpec` is not empty or null. If `metricSpec` is empty or null, set `rollup` to `false`.
- You cannot group on multi-value dimensions.
- The `maxTotalRows` config is not supported in `DynamicPartitionsSpec`. Use `maxRowsPerSegment` instead.

#### Idempotent aggregators

Idempotent aggregators are aggregators that can be applied repeatedly on a column and each run produces the same results, such as the following `longSum` aggregator:

```
{"name": "added", "type": "longSum", "fieldName": "added"}
```

where the input and output columns are both `added`.

The following are some examples of non-idempotent aggregators where each run of the aggregator produces different results:

* `longSum` aggregator where the `added` column rolls up into the `sum_added` column:
  ```
  {"name": "sum_added", "type": "longSum", "fieldName": "added" }
  ```
* Partial sketches:
  ```
  {"name": "added", "type": "", "fieldName": "added"}
  ```
* `count` aggregators, since each one rolls up into a different count column:
  ```
  { "type" : "count", "name" : "count" }
  ```
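
One way to picture the difference is a toy calculation outside Druid. The sketch below is an illustration only, not how Druid implements rollup: it uses `awk` on three made-up values of `added` and simulates two compaction runs, where the second run reads the single rolled-up row that the first run produced.

```sh
# Three hypothetical raw values of the `added` column in one rollup bucket.
raw_rows='1
2
3'

# First run: roll the raw rows up into a single row.
sum_run1=$(printf '%s\n' "$raw_rows" | awk '{ s += $1 } END { print s }')
count_run1=$(printf '%s\n' "$raw_rows" | awk 'END { print NR }')

# Second run: the input is now the single rolled-up row from the first run.
sum_run2=$(printf '%s\n' "$sum_run1" | awk '{ s += $1 } END { print s }')
count_run2=$(printf '%s\n' "$sum_run1" | awk 'END { print NR }')

echo "sum:   run1=$sum_run1 run2=$sum_run2"
echo "count: run1=$count_run1 run2=$count_run2"
```

Summing `added` back into `added` gives 6 on both runs, while counting rows gives 3 on the first run and 1 on the second, which is why a sum into the same column behaves idempotently in this toy model and a row count does not.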

## Avoid conflicts with ingestion

Compaction tasks may be interrupted when they interfere with ingestion. For example, this occurs when an ingestion task needs to write data to a segment for a time interval locked for compaction. If there are continuous failures that prevent compaction from making progress, consider one of the following strategies:
@@ -221,6 +344,9 @@ The following auto-compaction configuration compacts updates the `wikipedia` seg
}
```




## Learn more

See the following topics for more information:
2 changes: 1 addition & 1 deletion docs/ingestion/concurrent-append-replace.md
@@ -38,7 +38,7 @@ If you want to append data to a datasource while compaction is running, you need

In the **Compaction config** for a datasource, enable **Use concurrent locks (experimental)**.

For details on accessing the compaction config in the UI, see [Enable automatic compaction with the web console](../data-management/automatic-compaction.md#web-console).
For details on accessing the compaction config in the UI, see [Enable automatic compaction with the web console](../data-management/automatic-compaction.md#ui-for-coordinator-based-compaction).

### Update the compaction settings with the API
