diff --git a/administration/scheduling-and-retries.md b/administration/scheduling-and-retries.md
index eb865096d..d5d7496b1 100644
--- a/administration/scheduling-and-retries.md
+++ b/administration/scheduling-and-retries.md
@@ -2,30 +2,39 @@
-[Fluent Bit](https://fluentbit.io) has an Engine that helps to coordinate the data ingestion from input plugins and calls the _Scheduler_ to decide when it is time to flush the data through one or multiple output plugins. The Scheduler flushes new data at a fixed time of seconds and the _Scheduler_ retries when asked.
+[Fluent Bit](https://fluentbit.io) has an engine that helps to coordinate the data
+ingestion from input plugins. The engine calls the _scheduler_ to decide when it's time to
+flush the data through one or multiple output plugins. The scheduler flushes new data
+at a fixed number of seconds, and retries when asked.
 
-Once an output plugin gets called to flush some data, after processing that data it can notify the Engine three possible return statuses:
+When an output plugin gets called to flush some data, after processing that data it
+can notify the engine using these possible return statuses:
 
-* OK
-* Retry
-* Error
+- `OK`: Data successfully processed and flushed.
+- `Retry`: If a retry is requested, the engine asks the scheduler to retry flushing
+  that data. The scheduler decides how many seconds to wait before retry.
+- `Error`: An unrecoverable error occurred and the engine shouldn't try to flush that data again.
 
-If the return status was **OK**, it means it was successfully able to process and flush the data. If it returned an **Error** status, it means that an unrecoverable error happened and the engine should not try to flush that data again. If a **Retry** was requested, the _Engine_ will ask the _Scheduler_ to retry to flush that data, the Scheduler will decide how many seconds to wait before that happens.
+## Configure wait time for retry
 
-## Configuring Wait Time for Retry
+The scheduler provides two configuration options, called `scheduler.cap` and
+`scheduler.base`, which can be set in the Service section. These determine the waiting
+time before a retry happens.
 
-The Scheduler provides two configuration options called **scheduler.cap** and **scheduler.base** which can be set in the Service section.
+| Key | Description | Default |
+| --- | ------------| --------------|
+| `scheduler.cap` | Set a maximum retry time in seconds. Supported in v1.8.7 or later. | `2000` |
+| `scheduler.base` | Set a base of exponential backoff. Supported in v1.8.7 or later. | `5` |
 
-| Key | Description | Default Value |
-| -- | ------------| --------------|
-| scheduler.cap | Set a maximum retry time in seconds. The property is supported from v1.8.7. | 2000 |
-| scheduler.base | Set a base of exponential backoff. The property is supported from v1.8.7. | 5 |
+The `scheduler.base` determines the lower bound of time and the `scheduler.cap`
+determines the upper bound for each retry.
 
-These two configuration options determine the waiting time before a retry will happen.
+Fluent Bit uses an exponential backoff and jitter algorithm to determine the waiting
+time before a retry. The waiting time is a random number between a configurable upper
+and lower bound. For a detailed explanation of the exponential backoff and jitter algorithm, see
+[Exponential Backoff And Jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/).
 
-Fluent Bit uses an exponential backoff and jitter algorithm to determine the waiting time before a retry.
-
-The waiting time is a random number between a configurable upper and lower bound.
+For example:
 
 For the Nth retry, the lower bound of the random number will be:
 
@@ -35,23 +44,26 @@ The upper bound will be:
 
 `min(base * (Nth power of 2), cap)`
 
-Given an example where `base` is set to 3 and `cap` is set to 30.
-
-1st retry: The lower bound will be 3, the upper bound will be 3 * 2 = 6. So the waiting time will be a random number between (3, 6).
+For example:
 
-2nd retry: the lower bound will be 3, the upper bound will be 3 * (2 * 2) = 12. So the waiting time will be a random number between (3, 12).
+When `base` is set to 3 and `cap` is set to 30:
 
-3rd retry: the lower bound will be 3, the upper bound will be 3 * (2 * 2 * 2) = 24. So the waiting time will be a random number between (3, 24).
+First retry: The lower bound will be 3. The upper bound will be `3 * 2 = 6`.
+The waiting time will be a random number between (3, 6).
 
-4th retry: the lower bound will be 3, since 3 * (2 * 2 * 2 * 2) = 48 > 30, the upper bound will be 30. So the waiting time will be a random number between (3, 30).
+Second retry: The lower bound will be 3. The upper bound will be `3 * (2 * 2) = 12`.
+The waiting time will be a random number between (3, 12).
 
-Basically, the **scheduler.base** determines the lower bound of time between each retry and the **scheduler.cap** determines the upper bound.
+Third retry: The lower bound will be 3. The upper bound will be `3 * (2 * 2 * 2) = 24`.
+The waiting time will be a random number between (3, 24).
 
-For a detailed explanation of the exponential backoff and jitter algorithm, please check this [blog](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/).
+Fourth retry: The lower bound will be 3. Because `3 * (2 * 2 * 2 * 2) = 48` > `30`,
+the upper bound will be 30. The waiting time will be a random number between (3, 30).
 
-### Example
+### Wait time example
 
-The following example configures the **scheduler.base** as 3 seconds and **scheduler.cap** as 30 seconds.
+The following example configures the `scheduler.base` as `3` seconds and
+`scheduler.cap` as `30` seconds.
 
 ```text
 [SERVICE]
@@ -64,26 +76,29 @@ The following example configures the **scheduler.base** as 3 seconds and **sched
 
 The waiting time will be:
 
-| Nth retry | waiting time range (seconds) |
-| --- | --- |
+| Nth retry | Waiting time range (seconds) |
+| --- | --- |
 | 1 | (3, 6) |
 | 2 | (3, 12) |
 | 3 | (3, 24) |
 | 4 | (3, 30) |
 
-## Configuring Retries
+## Configure retries
 
-The Scheduler provides a simple configuration option called **Retry\_Limit**, which can be set independently on each output section. This option allows us to disable retries or impose a limit to try N times and then discard the data after reaching that limit:
+The scheduler provides a configuration option called `Retry_Limit`, which can be set
+independently for each output section. This option lets you disable retries or
+impose a limit to try N times and then discard the data after reaching that limit:
 
 | | Value | Description |
 | :--- | :--- | :--- |
-| Retry\_Limit | N | Integer value to set the maximum number of retries allowed. N must be >= 1 \(default: 1\) |
-| Retry\_Limit | `no_limits` or `False` | When Retry\_Limit is set to `no_limits` or`False`, means that there is not limit for the number of retries that the Scheduler can do. |
-| Retry\_Limit | no\_retries | When Retry\_Limit is set to no\_retries, means that retries are disabled and Scheduler would not try to send data to the destination if it failed the first time. |
+| `Retry_Limit` | N | Integer value to set the maximum number of retries allowed. N must be >= 1 (default: `1`) |
+| `Retry_Limit` | `no_limits` or `False` | When set, there is no limit for the number of retries that the scheduler can do. |
+| `Retry_Limit` | `no_retries` | When set, retries are disabled and the scheduler doesn't try to send data to the destination if it failed the first time. |
 
-### Example
+### Retry example
 
-The following example configures two outputs where the HTTP plugin has an unlimited number of while the Elasticsearch plugin have a limit of 5 retries:
+The following example configures two outputs, where the HTTP plugin has an unlimited
+number of retries, and the Elasticsearch plugin has a limit of `5` retries:
 
 ```text
 [OUTPUT]
@@ -99,4 +114,3 @@ The following example configures two outputs where the HTTP plugin has an unlimi
     Logstash_Format On
     Retry_Limit 5
 ```
-
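
The retry bounds documented in this patch can be checked with a small Python sketch. It is illustrative only and is not Fluent Bit's implementation: the function names `retry_wait_bounds` and `retry_wait_seconds` are hypothetical, and uniform sampling between the bounds is an assumption based on the "random number between a lower and upper bound" wording.

```python
import random


def retry_wait_bounds(n, base=5, cap=2000):
    """Return (lower, upper) wait bounds in seconds for the n-th retry (n >= 1).

    Follows the formulas in the documented text:
    lower = base, upper = min(base * 2**n, cap).
    Defaults mirror the documented scheduler.base and scheduler.cap defaults.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    upper = min(base * 2 ** n, cap)
    return base, upper


def retry_wait_seconds(n, base=5, cap=2000, rng=random):
    """Pick a wait time between the bounds (uniform sampling is an assumption)."""
    lower, upper = retry_wait_bounds(n, base, cap)
    return rng.uniform(lower, upper)


if __name__ == "__main__":
    # Reproduce the documented table for base = 3 and cap = 30.
    for n in range(1, 5):
        print(n, retry_wait_bounds(n, base=3, cap=30))
```

With `base=3` and `cap=30`, the bounds for retries 1 through 4 come out as (3, 6), (3, 12), (3, 24), and (3, 30), matching the table in the patch.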