internal_write.buffer_size metric not reset on timed writes #5298

pberlowski · 2019-01-16T15:52:39Z

Relevant telegraf.conf:

[global_tags]
 test = "test"

# Configuration for telegraf agent
[agent]
  interval = "30s"
  round_interval = true
  metric_batch_size = 500
  metric_buffer_limit = 500000
  collection_jitter = "0s"
  flush_interval = "30s"
  flush_jitter = "5s"

  ## By default, precision will be set to the same timestamp order as the
  ## collection interval, with the maximum being 1s.
  ## Precision will NOT be used for service inputs, such as logparser and statsd.
  ## Valid values are "ns", "us" (or "µs"), "ms", "s".
  precision = "1ns"

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = true
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = "/var/log/telegraf/telegraf.log"

  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false


[[inputs.http_listener]]
  # Gateway listens globally
  service_address = "0.0.0.0:8186"
  read_timeout = "10s"
  write_timeout = "10s"

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false

# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default, telegraf gathers stats for all mountpoints.
  ## Setting mountpoints will restrict the stats to the specified mountpoints.
  # mount_points = ["/"]
  ## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
  ## present on /run, /var/run, /dev/shm or /dev).
  ignore_fs = ["tmpfs", "devtmpfs"]

[[inputs.diskio]]

# Collect statistics about itself
[[inputs.internal]]
  ## If true, collect telegraf memory stats.
  collect_memstats = true

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration

# Read metrics about memory usage
[[inputs.mem]]
  # no configuration

# Read metrics about network interface usage
[[inputs.net]]
  ## By default, telegraf gathers stats from any up interface (excluding loopback)
  ## Setting interfaces will tell it to gather these explicit interfaces,
  ## regardless of status.
  ##
  # interfaces = ["eth0"]

[[inputs.nstat]]
  fieldpass = ["Tcp*Opens","TcpCurrEstab"]

# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration

[[inputs.system]]
  fielddrop = [ "uptime_format" ]

[[inputs.netstat]]

[[inputs.processes]]

[[inputs.ntpq]]
  ## If false, set the -n ntpq flag. Can reduce metric gather times.
  dns_lookup = false

[[inputs.procstat]]
  systemd_unit = "telegraf"
  pid_tag = true
  fieldpass = ["*rss", "*rss_hard"]

System info:

Telegraf version: 1.9.2
OS: Centos 7

Steps to reproduce:

Create a chart of the internal_write.buffer_size metric.
Set metric_batch_size high enough that normal traffic never triggers a flush by batch size.
Overflow the batch size once (e.g. send 1000 metrics while metric_batch_size is 500).
Do not overflow the batch size again (the agent will then flush only on the configured flush interval).
Observe the reported buffer_size.

Expected behavior:

internal_write.buffer_size drops to 0, as there are no metrics left in the buffer.

Actual behavior:

internal_write.buffer_size is reported as batch_size forever.

Additional info:

buffer_size is set and emitted only in the AddMetric method of running_output, and only when a full batch is written to the buffer before the flush time.

buffer_size is not set in the Write method, so the metric is not reset when the buffer is flushed.

As a result, buffer_size is updated only when the batch overflows, and it never resets to 0.
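
The behavior described above can be illustrated with a toy model. This is a simplified sketch, not Telegraf's actual source: toyOutput, bufferSizeGauge, updateAfterWrite and scenario are hypothetical names, and Write is assumed to always succeed.

```go
package main

import "fmt"

// toyOutput is a hypothetical, heavily simplified stand-in for a running
// output. It only models the batch, the buffer and the reported gauge.
type toyOutput struct {
	batchSize        int
	batch            []string
	buffer           []string
	bufferSizeGauge  int  // value that would be reported as buffer_size
	updateAfterWrite bool // false models the behavior reported in this issue
}

// AddMetric collects metrics into a batch. The gauge is refreshed only when
// a full batch is moved into the buffer.
func (o *toyOutput) AddMetric(m string) {
	o.batch = append(o.batch, m)
	if len(o.batch) == o.batchSize {
		o.buffer = append(o.buffer, o.batch...)
		o.batch = o.batch[:0]
		o.bufferSizeGauge = len(o.buffer)
	}
}

// Write flushes the pending batch and the buffer, assuming the write succeeds.
func (o *toyOutput) Write() {
	o.buffer = append(o.buffer, o.batch...)
	o.batch = o.batch[:0]
	o.buffer = o.buffer[:0] // pretend every metric was delivered
	if o.updateAfterWrite {
		o.bufferSizeGauge = len(o.buffer)
	}
}

// scenario fills one full batch, flushes, adds a partial batch, flushes
// again (the timed write), and returns the gauge value left behind.
func scenario(updateAfterWrite bool) int {
	o := &toyOutput{batchSize: 500, updateAfterWrite: updateAfterWrite}
	for i := 0; i < 500; i++ {
		o.AddMetric("m")
	}
	o.Write()
	for i := 0; i < 100; i++ {
		o.AddMetric("m")
	}
	o.Write()
	return o.bufferSizeGauge
}

func main() {
	fmt.Println("gauge set only in AddMetric:", scenario(false)) // 500
	fmt.Println("gauge also set after Write: ", scenario(true))  // 0
}
```

In the first case the gauge keeps the value recorded when the batch overflowed, even though the buffer is empty. Setting it after each write lets it fall back to 0.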

pberlowski · 2019-01-16T17:30:41Z

We'll be testing this patch in our environment:
https://gist.github.com/pberlowski/6855a647f74b4d3c647e2d1ab344e525

pberlowski · 2019-01-17T15:34:42Z

The patch successfully zeroes the buffer count when relevant. One additional behavior we noticed is that buffer_size is always reported as a multiple of the batch size, because the metric is reported after a full batch is added to the buffer. This is interesting but not necessarily a problem.

danielnelson added this to the 1.9.3 milestone Jan 16, 2019

danielnelson self-assigned this Jan 16, 2019

danielnelson mentioned this issue Jan 18, 2019: Update the buffer_size internal metric after writes #5314 (Merged, 3 tasks)

danielnelson added the bug (unexpected problem or unintended behavior) and area/agent labels Jan 18, 2019

danielnelson closed this as completed in #5314 Jan 22, 2019
