All metrics values delayed when inserting in the past beyond the default retention policy #3144

lgosselin · 2017-08-21T16:01:41Z

I have a process that posts measurements to telegraf, with timestamps tied to the data being processed. Because it usually works on (almost) real-time data, the timestamps are more or less current. Occasionally, however, it is asked to reprocess old data, and the measurements are sent again with their original timestamps.

If the database has a default retention policy, then while older data is being reprocessed, all the metrics in the Chronograf dashboard are delayed by a few minutes (how much seems to vary between environments). When that process stops emitting past events, the dashboard still lags for a while before returning to normal.

Setting the retention policy is critical to reproducing the problem. It causes partial writes at the InfluxDB level; telegraf seems confused by them and appears to hold other measurements hostage, even those issued by another input plugin. However, I have not seen any missing measurement values once it gets back to normal.

Environment setup using docker on linux:

Use the attached docker-compose.yml (a shameless rip-off of https://github.com/influxdata/TICK-docker/tree/master/1.2 with updated versions, small adjustments, and http_listener enabled).
Use the telegraf configuration provided in the attachment.
Start the environment using: docker-compose up
Create a default retention policy on the telegraf database: docker-compose run influxdb-cli -execute 'CREATE RETENTION POLICY realtime ON telegraf DURATION 4w REPLICATION 1 DEFAULT;'
Open Chronograf in a browser (localhost:8888), go to the host list, and open the "system" dashboard for your host.
Set refresh to "Every 10s" and the time range to "Past 15 minutes".
Wait a few minutes for data points to be collected.
Validate that the collected data is up to date (for example, use the tooltip on the CPU usage chart to check the time).

Then begin to reproduce:

Post a few events with timestamps older than the retention policy allows: curl -i -XPOST "http://localhost:8186/write?db=telegraf&precision=ns" --data-binary "@test.txt"
Wait 1 or 2 minutes and confirm that most of the measurements no longer reach the dashboard. You should see a gap on almost all charts (at least those that refresh their X axis).

If it does not work for you, try posting several times (5 times, 1 or 2 seconds apart seems to be enough for me).
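The contents of test.txt are only available in the attached zip, so the following is a minimal sketch of how a comparable file could be generated, not a copy of the original. It assumes InfluxDB line protocol with nanosecond timestamps (matching precision=ns in the curl command) and places the points at twice the 4w retention duration in the past. The function name `old_points`, the measurement name `reprocessed_event`, the tag `source=replay`, and the field `value` are all hypothetical.

```python
import time

NS_PER_SECOND = 1_000_000_000
# 4 weeks in seconds, mirroring "DURATION 4w" from the retention policy above.
RETENTION_SECONDS = 4 * 7 * 24 * 3600


def old_points(now_s, count=5, measurement="reprocessed_event"):
    """Build line-protocol lines whose timestamps are well beyond the retention window.

    The measurement, tag, and field names are illustrative assumptions.
    """
    # Start at twice the retention duration in the past to stay clearly outside it.
    base_s = int(now_s) - 2 * RETENTION_SECONDS
    lines = []
    for i in range(count):
        ts_ns = (base_s + i) * NS_PER_SECOND
        # "<measurement>,<tag>=<v> <field>=<int>i <timestamp in ns>"
        lines.append(f"{measurement},source=replay value={i}i {ts_ns}")
    return lines


if __name__ == "__main__":
    # Write a file that can be posted with the curl command above.
    with open("test.txt", "w") as fh:
        fh.write("\n".join(old_points(time.time())) + "\n")
```

Running the script once before each curl call keeps the timestamps relative to the current time, so the points stay outside the retention window however late the test is run.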

The telegraf logs should reveal something along the lines of:

E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: points beyond retention policy dropped=xx]
E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster

The expected behaviour would be no delay at all (or close to none) in unrelated metrics, especially those coming from other plugins.

The actual behaviour: no new metric values are available for a (variable) period of time (at least 4-5 min, sometimes much longer).
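For illustration only, the difference between a failure worth retrying and the partial-write rejection in the log above could be sketched like this. It is not Telegraf's code (Telegraf is written in Go), and `should_retry` and `dropped_count` are hypothetical helpers. The matched substring is taken from the error message in this report, and treating every other failure as retryable is a simplifying assumption of the sketch.

```python
import re

# Substring copied from the error message in this report (assumed stable here).
BEYOND_RP = "points beyond retention policy"


def should_retry(status_code, body):
    """Return True if resending the same batch could plausibly succeed."""
    if status_code == 204:
        return False  # write accepted, nothing to resend
    if status_code == 400 and BEYOND_RP in body:
        # The server already dropped the out-of-range points; resending the
        # same batch would be rejected again and keep newer points waiting.
        return False
    # Simplification: every other failure is treated as retryable.
    return True


def dropped_count(body):
    """Extract N from 'dropped=N' in the response body, or None if absent."""
    match = re.search(r"dropped=(\d+)", body)
    return int(match.group(1)) if match else None
```

With a rule like this, a batch rejected only because of old timestamps would be discarded rather than kept at the front of the queue ahead of current points from other plugins.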

issue_telegraf.zip


danielnelson added the bug (unexpected problem or unintended behavior) label Aug 21, 2017

danielnelson added this to the 1.4.0 milestone Aug 21, 2017

danielnelson added the area/influxdb label Aug 21, 2017

danielnelson mentioned this issue Aug 22, 2017 in "Don't retry points beyond retention policy" #3155 (Merged)

danielnelson closed this as completed in #3155 Aug 22, 2017
