Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs(specs): Add specification for output buffer persistence strategy #14928

Merged
merged 3 commits into from
Mar 15, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
78 changes: 78 additions & 0 deletions docs/specs/tsd-005-output-buffer-strategy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
# Telegraf Output Buffer Strategy

## Objective

Introduce a new agent-level config option to choose a disk buffer strategy for
output plugin metric queues.

## Overview

Currently, when a Telegraf output metric queue fills, either due to incoming
metrics being too fast or various issues with writing to the output, new
metrics are dropped and never written to the output. This specification
defines a set of options to make this output queue more durable by persisting
pending metrics to disk rather than only an in-memory limited size queue.

## Keywords

output plugins, agent configuration, persist to disk

## Agent Configuration

The configuration is at the agent-level, with options for:

- **Memory**, the current implementation, with no persistence to disk
- **Write-through**, all metrics are also written to disk using a
Write Ahead Log (WAL) file
- **Disk-overflow**, when the memory buffer fills, metrics are flushed to a
WAL file to avoid dropping overflow metrics

As well as an option to specify a directory to store the WAL files on disk,
with a default value. These configurations are global, and no change means
memory only mode, retaining current behavior.

## Metric Ordering and Tracking

Tracking metrics will be accepted either on a successful write to the output
source like currently, or on write to the WAL file. Metrics will be written
to their appropriate output in the order they are received in the buffer
regardless of which buffer strategy is chosen.

## Disk Utilization and File Handling

Each output plugin has its own in-memory output buffer, and therefore will
each have their own WAL file for buffer persistence. This file may not exist
if Telegraf is successfully able to write all of its metrics without filling
the in-memory buffer in disk-overflow mode, or not at all in memory mode.
Telegraf should use one file per output plugin, and remove entries from the
WAL file as they are written to the output.

Telegraf will not make any attempt to limit the size on disk taken by these
files beyond cleaning up WAL files for metrics that have successfully been
flushed to their output source. It is the user's responsibility to ensure
these files do not entirely fill the disk, both during Telegraf uptime and
with lingering files from previous instances of the program.

If WAL files exist for an output plugin from previous instances of Telegraf,
they will be picked up and flushed before any new metrics that are written
to the output. This is to ensure that these metrics are not lost, and to
ensure that output write order remains consistent.

Telegraf must additionally provide a way to manually flush WAL files via
some separate plugin or similar. This could be used as a way to ensure that
WAL files are properly written in the event that the output plugin changes
and the WAL file is unable to be detected by a new instance of Telegraf.
This plugin should not be required for use to allow the buffer strategy to
work.

## Is/Is-not

- Is a way to prevent metrics from being dropped due to a full memory buffer
- Is not a way to guarantee data safety in the event of a crash or system failure
- Is not a way to manage file system allocation size, file space will be used
until the disk is full

## Prior art

[Initial issue](https://github.com/influxdata/telegraf/issues/802)
[Loose specification issue](https://github.com/influxdata/telegraf/issues/14805)
Loading