
Cloudwatch output is blocked by metric values close to 0 #2523

Closed
kamalvpm777 opened this issue Mar 9, 2017 · 15 comments · Fixed by #3183
Labels: bug (unexpected problem or unintended behavior)
Milestone

Comments

@kamalvpm777 commented Mar 9, 2017

Our telegraf is integrated with the cloudwatch output. For the last 4 to 5 weeks it was working fine, but in the last 2 days we see an inconsistency where metrics are not being pushed to the cloudwatch output even though the telegraf setup is unchanged. Based on the analysis below, we are looking for your opinions and troubleshooting suggestions. Let me know if something is not clear; I can add more details.

Scenario 1 -

As per the telegraf conf below, we have the metric buffer limit set to 10000 and a flush interval of 20 seconds. After reaching the 10000-metric buffer limit, the buffer never gets flushed and the gathered metrics keep accumulating, which in turn increases the dropped-metrics count.
i) What does "dropped metrics" mean here? Does it mean that these 195956 metrics were not sent to the cloudwatch output?
ii) We generate 222 metrics/min, i.e. 13320 metrics/hour. In the past we captured a similar volume with the same 10000 buffer limit and never had an issue posting the stats to cloudwatch over the last 2 months.
iii) Does exceeding the buffer limit cause metrics not to be sent to cloudwatch?


2017/03/09 16:39:00 Output [cloudwatch] buffer fullness: 10074 / 10000 metrics. Total gathered metrics: 206030. Total dropped metrics: 195956.
2017/03/09 16:39:20 Output [cloudwatch] buffer fullness: 10074 / 10000 metrics. Total gathered metrics: 206104. Total dropped metrics: 196030.

Scenario 2

i) To check whether reaching the buffer limit is the issue, we disabled a few metrics to reduce the volume. With this change we generate only 36 metrics/min, and the total gathered metrics stay well under the 10000 limit.
ii) With this change telegraf connected successfully to cloudwatch at 12:09 PM, reported the metrics once at 12:10 PM, and then stopped reporting, even though we are under the 10000 limit.

2017/03/09 17:34:20 Output [cloudwatch] buffer fullness: 1116 / 10000 metrics. Total gathered metrics: 1116. Total dropped metrics: 0.
2017/03/09 17:34:40 Output [cloudwatch] buffer fullness: 1128 / 10000 metrics. Total gathered metrics: 1128. Total dropped metrics: 0.

Telegraf conf

# Configuration for telegraf agent
[agent]
  # Default data collection interval for all inputs
  interval = "20s"
  # Rounds collection interval to 'interval'
  # ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  # Telegraf will cache metric_buffer_limit metrics for each output, and will
  # flush this buffer on a successful write.
  metric_buffer_limit = 10000
  # Flush the buffer whenever full, regardless of flush_interval.
  flush_buffer_when_full = true

  # Collection jitter is used to jitter the collection by a random amount.
  collection_jitter = "0s"

  # Default flushing interval for all outputs. You shouldn't set this below
  # interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "20s"
  # ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  # Run telegraf in debug mode
  debug = true
  # Run telegraf in quiet mode
  quiet = false
  # Override default hostname, if empty use os.Hostname()
  hostname = ""

# Configuration for cloudwatch api to send metrics to
[[outputs.cloudwatch]]
  # Amazon Region (required)
  region = 'us-east-1'

Current behavior:

The telegraf logs look fine, but it is not reporting/updating the metrics to cloudwatch.

Use case: Our production monitoring with telegraf is currently completely broken; we need to get this fixed as soon as possible.

Telegraf logs

2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 7.261833ms
2017/03/09 21:45:40 Input [mesos] gathered metrics, (20s interval) in 7.050785ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 2.375689ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 3.618668ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 6.121121ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 5.218891ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 11.845889ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 14.763179ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 10.747364ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 15.15531ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 23.128821ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 24.935601ms
2017/03/09 21:45:40 Output [cloudwatch] buffer fullness: 10012 / 10000 metrics. Total gathered metrics: 10164. Total dropped metrics: 152.

@danielnelson (Contributor)

iii) Does exceeding the buffer limit cause metrics not to be sent to cloudwatch?

Yes, the buffer is emptied on every successful push and should remain empty. We need to determine why nothing is being successfully written to cloudwatch.
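Roughly speaking, the buffer behaves like the sketch below (a simplified illustration only, not the actual telegraf code): new metrics push out the oldest ones once metric_buffer_limit is reached, and the buffer is only cleared after a successful write.

// Simplified sketch of the output buffer behaviour described above.
// Not the actual telegraf code; names and types are illustrative only.
package main

import "fmt"

type outputBuffer struct {
	limit   int
	metrics []string
	dropped int
}

// add stores a gathered metric; once the limit is reached the oldest entry
// is discarded and counted as "dropped".
func (b *outputBuffer) add(m string) {
	if len(b.metrics) >= b.limit {
		b.metrics = b.metrics[1:]
		b.dropped++
	}
	b.metrics = append(b.metrics, m)
}

// flush empties the buffer only if the write succeeds; on failure everything
// stays buffered, so fullness and the dropped count keep growing.
func (b *outputBuffer) flush(write func([]string) error) {
	if err := write(b.metrics); err != nil {
		return
	}
	b.metrics = b.metrics[:0]
}

func main() {
	b := &outputBuffer{limit: 3}
	for i := 1; i <= 5; i++ {
		b.add(fmt.Sprintf("metric%d", i))
	}
	fmt.Println(b.metrics, "dropped:", b.dropped) // [metric3 metric4 metric5] dropped: 2

	// A failing write leaves the buffer untouched, so fullness never resets.
	b.flush(func([]string) error { return fmt.Errorf("rejected by CloudWatch") })
	fmt.Println("after failed flush:", len(b.metrics), "metrics still buffered")
}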

@danielnelson (Contributor)

What version of telegraf are you using? What type of AWS credentials are you using?

@kamalvpm777 (Author) commented Mar 10, 2017

Thanks for your reply; answers are below, hope it helps.
What version of telegraf are you using?
Telegraf - version 0.13.2

What type of AWS credentials are you using?
Access Key and Secret Access Key

Scenario 1:
If I understand correctly, once it reaches the buffer limit of 10000 and a flush is triggered, the buffer is supposed to reset to 0. But looking at the telegraf logs, it doesn't report anything about a flush, and the dropped-metrics count just keeps accumulating. Also, below is the AWS CloudWatch console, where stats were updated inconsistently until March 09, 11 AM.

2017/03/10 04:43:40 Output [cloudwatch] buffer fullness: 10012 / 10000 metrics. Total gathered metrics: 25212. Total dropped metrics: 15200.

(CloudWatch console screenshot)

Scenario 2:
Even after restarting telegraf, when the total gathered metrics were just 100 out of the 10000 buffer limit, it wrote metrics only once at 12:09 PM and then stopped updating cloudwatch entirely, as shown in the cloudwatch console below. Looking at the telegraf log from 12 - 1 PM, the accumulated metrics were still under the buffer limit and telegraf connected successfully to the AWS cloudwatch output, but still no metrics were sent to AWS.

(CloudWatch console screenshot)

It would be great if you could suggest a troubleshooting approach to identify the root cause and fix this ASAP so we can get our production monitoring back up.

@danielnelson (Contributor)

There have been a lot of changes since that version; can you test with the latest 1.2.1 release?

@kamalvpm777 (Author) commented Mar 13, 2017

Sure, I just upgraded to version 1.2.1. With the new version we still see the issue, but we were able to narrow down the root cause that prevents metrics from being written to AWS cloudwatch. Could we get some suggestions on why we are getting this error and how to fix it?

2017-03-11T07:56:42Z E! CloudWatch: Unable to write to CloudWatch : InvalidParameterValue: The value 0 for parameter MetricData.member.10.Value is invalid.
The value 0 for parameter MetricData.member.12.Value is invalid.

For every Spring Boot application we retrieve Dropwizard metrics through the httpjson input in telegraf as shown below. In the past we were able to ship metrics for all the apps to cloudwatch successfully, but now the same telegraf conf works for one app and fails for the other app's metrics.

WORKS (Writes metrics to cloudwatch)

[[inputs.httpjson]]
  name = "registration"
  servers = [
    "http://localhost:5481/metrics"
  ]
  # HTTP method to use (case-sensitive)
  method = "GET"

TELEGRAF Log : Output Buffer (Reset on successful write)

2017-03-11T08:02:00Z D! Output [cloudwatch] buffer fullness: 68 / 10000 metrics.
2017-03-11T08:02:06Z D! Output [cloudwatch] wrote batch of 68 metrics in 6.065242049s
2017-03-11T08:02:20Z D! Output [cloudwatch] buffer fullness: 68 / 10000 metrics.
2017-03-11T08:02:26Z D! Output [cloudwatch] wrote batch of 68 metrics in 6.19724333s

DOES NOT WORK (Fails to write metrics to cloudwatch)

[[inputs.httpjson]]
  name = "Login"
  servers = [
    "http://localhost:5482/metrics"
  ]
  # HTTP method to use (case-sensitive)
  method = "GET"

TELEGRAF LOG : Output Buffer (Keeps accumulating without reset)

2017-03-11T07:56:40Z D! Output [cloudwatch] buffer fullness: 212 / 10000 metrics.
2017-03-11T07:56:42Z E! CloudWatch: Unable to write to CloudWatch : InvalidParameterValue: The value 0 for parameter MetricData.member.10.Value is invalid.
The value 0 for parameter MetricData.member.12.Value is invalid.
2017-03-11T07:56:42Z E! Error writing to output [cloudwatch]: InvalidParameterValue: The value 0 for parameter MetricData.member.10.Value is invalid.
The value 0 for parameter MetricData.member.12.Value is invalid.
2017-03-11T07:57:00Z D! Output [cloudwatch] buffer fullness: 280 / 10000 metrics.
2017-03-11T07:57:02Z E! CloudWatch: Unable to write to CloudWatch : InvalidParameterValue: The value 0 for parameter MetricData.member.12.Value is invalid.
2017-03-11T07:57:02Z E! Error writing to output [cloudwatch]: InvalidParameterValue: The value 0 for parameter MetricData.member.12.Value is invalid.

Attached is a sample of the HTTP JSON metrics captured by Telegraf:
HTTPJSON_Metrics.DOCX

@danielnelson (Contributor)

I found this blacklocus/metrics-cloudwatch#13 which seems to indicate that if you have a very small number you can receive this error from Cloudwatch.

In your sample JSON there is one value that is 4.784335455263847e-50; I wonder if this is what is causing the problem. Are you able to change the http server to return 0 for very small numbers?

@kamalvpm777 (Author) commented Mar 15, 2017

Thanks. It is fine for cloudwatch to reject these small values, but I am still curious why telegraf does not ignore these cloudwatch errors for very small numbers (e-50) and just write the other metrics, whose values are larger, to cloudwatch.

I believe that would probably fix this whole issue. Right now, when this error occurs, the logs show the output [cloudwatch] buffer fullness accumulating without reset and say "Unable to write to CloudWatch", but they don't say anything about whether the metrics > 0 were written to cloudwatch or not.

Telegraf log (with errors - Unable to write to cloudwatch / output buffer accumulating 73 -> 146)

2017-03-15T00:27:20Z D! Output [cloudwatch] buffer fullness: 73 / 10000 metrics.
2017-03-15T00:27:26Z E! CloudWatch: Unable to write to CloudWatch : InvalidParameterValue: The value 0 for parameter MetricData.member.19.Value is invalid.
status code: 400, request id: 6c39d867-0916-11e7-ae62-41eb59694862
2017-03-15T00:27:26Z E! Error writing to output [cloudwatch]: InvalidParameterValue: The value 0 for parameter MetricData.member.19.Value is invalid.
status code: 400, request id: 6c39d867-0916-11e7-ae62-41eb59694862
2017-03-15T00:27:40Z D! Output [cloudwatch] buffer fullness: 146 / 10000 metrics.

Telegraf log (without errors - Wrote batch of ## metrics in seconds / output buffer reset 73 -> 73)

2017-03-14T22:16:00Z D! Output [cloudwatch] buffer fullness: 74 / 10000 metrics.
2017-03-14T22:16:10Z D! Output [cloudwatch] wrote batch of 74 metrics in 10.476617767s

Please advise on how to get telegraf to simply ignore these very small numbers and go ahead with the other metrics whose values are larger than 0. Thanks for your help on this.

It would also be great if you could provide some references or an approach for getting the HTTP server to return 0 for small numbers. I also just found the link below, which says CloudWatch rejects values that are either too small or too large: values must be in the range of 8.515920e-109 to 1.174271e+108 for PutMetricData.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricData.html

@danielnelson (Contributor)

It is fine for cloudwatch to reject these small values, but I am still curious why telegraf does not ignore these cloudwatch errors for very small numbers (e-50) and just write the other metrics, whose values are larger, to cloudwatch.

I think what telegraf should do is round these close to zero values down to exactly zero. In order to do this we need to determine the correct cutoff point for the conversion.

It appears that the inability to push the bad cloudwatch metric is blocking telegraf from processing the good metrics; at some point the bad values will drop off the buffer if they only occur occasionally. If the error occurs infrequently you could lower the metric_buffer_limit in your telegraf.conf, which will cause telegraf to become stuck for shorter periods of time.

Please advise on how to get telegraf to simply ignore these very small numbers and go ahead with the other metrics whose values are larger than 0.

The quickest solution for you may be to update the server that your httpjson input uses as its source, so that the values are adjusted before they enter telegraf. We will try to get a fix done for this before the next release; if someone wants to work on this, that would be great.
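For illustration, here is a rough sketch of the kind of conversion I mean, in Go. This is not telegraf's actual code; the cutoff below is just the PutMetricData bound quoted above, and the correct threshold still needs to be confirmed.

// Rough sketch of the "round close-to-zero values down to zero" idea.
// Not telegraf's actual code; the cutoff is an assumption taken from the
// PutMetricData limits quoted earlier in this thread.
package main

import (
	"fmt"
	"math"
)

// Smallest magnitude that CloudWatch documents as accepted.
const cloudWatchMinValue = 8.515920e-109

// roundTinyToZero converts values too small for CloudWatch into exactly 0,
// which CloudWatch does accept, and leaves everything else unchanged.
func roundTinyToZero(v float64) float64 {
	if v != 0 && math.Abs(v) < cloudWatchMinValue {
		return 0
	}
	return v
}

func main() {
	fmt.Println(roundTinyToZero(1e-120)) // 0
	fmt.Println(roundTinyToZero(12.5))   // 12.5
}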

@danielnelson added the bug (unexpected problem or unintended behavior) label Mar 15, 2017
@danielnelson added this to the 1.3.0 milestone Mar 15, 2017
@danielnelson changed the title from "[Issue] Telegraf stopped sending metrics to cloudwatch output, total dropped metrics keeps accumulating though flush interval is set ON" to "Cloudwatch output is blocked by metric values close to 0" Mar 15, 2017
@kamalvpm777 (Author)

Thanks for the response; I appreciate you taking care of this as part of the 1.3.0 milestone release.

Regarding the quickest solution, could you please elaborate a little more? Here is what our environment currently looks like; any suggestions/recommendations on what needs to be done on the source server (the Mesos master) to adjust the values to zero?

  • Spring Boot apps run on Mesos slaves and expose metrics on their own ports.
  • App metrics on a Mesos slave are directly available on the Mesos master by calling a REST endpoint ("http://localhost:2960/metrics").
  • Telegraf is set up and running on the Mesos master and calls each app's metrics endpoint through its port.
  • "localhost" here refers to the Mesos master host, which telegraf uses as its source.

    [[inputs.httpjson]]
      servers = [
        "http://localhost:2960/metrics"
      ]

Meanwhile, these small numbers (e-179) occur very often and consistently in our case. We currently have a metric buffer limit of 10000; even after lowering it to 1000, I am not sure whether the bad values will drop off the buffer completely. But I will definitely try the default buffer limit of 1000 and let you know how it goes.

@danielnelson (Contributor)

Based on the metrics above, the problematic one is:

"com.wipr.lac.shop.rest.controller.ControllerV1.getproductsproductData.oneMinuteRate": 4.784335455263847e-50,

Perhaps you can round this where this metric is produced, presumably in your code somewhere.

Another suggestion is to filter out the problematic field; it should be something like:

[[inputs.httpjson]]
  name = "registration"
  servers = [
    "http://localhost:5481/metrics"
  ]
  method = "GET"
  fielddrop = ["com.wipr.lac.shop.rest.controller.ControllerV1.getproductsproductData.oneMinuteRate"]

You may have to filter additional fields; here are the docs on filtering: https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#measurement-filtering

@danielnelson modified the milestones: 1.3.0, 1.4.0 Apr 27, 2017
@danielnelson modified the milestones: 1.4.0, 1.5.0 Aug 14, 2017
@allingeek (Contributor)

I'm encountering the same issue with inputs.http_listener and outputs.cloudwatch:

2017-08-14T23:06:30Z E! Error writing to output [cloudwatch]: InvalidParameterValue: The value 0 for parameter MetricData.member.4.Value is invalid.
The value 0 for parameter MetricData.member.13.Value is invalid.
The value 0 for parameter MetricData.member.14.Value is invalid.

@allingeek (Contributor) commented Aug 28, 2017

The linked PR adds code to actually enforce the constraints on CloudWatch metrics. Data points that fail those constraints are omitted in the same way the plugin omits metrics with unsupported data types.

I've tested the change in an environment that always encounters this problem and it seems to have been fixed. Boundary checking is important.
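For anyone curious, the enforcement amounts to something along these lines (a simplified sketch, not the exact PR code; the bounds are the PutMetricData limits quoted earlier in the thread):

// Simplified sketch of the boundary checking described above.
// Not the exact PR code; bounds are assumed from the PutMetricData docs.
package main

import (
	"fmt"
	"math"
)

const (
	minAbsValue = 8.515920e-109
	maxAbsValue = 1.174271e+108
)

// acceptable reports whether CloudWatch would accept the value: it must be
// finite and either zero or within the documented magnitude range.
func acceptable(v float64) bool {
	if math.IsNaN(v) || math.IsInf(v, 0) {
		return false
	}
	abs := math.Abs(v)
	return abs == 0 || (abs >= minAbsValue && abs <= maxAbsValue)
}

// filterDatums omits data points that fail the constraints instead of
// letting one bad value fail the whole batch.
func filterDatums(values []float64) []float64 {
	var kept []float64
	for _, v := range values {
		if acceptable(v) {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	in := []float64{12.5, 0, 1e-120, math.NaN()}
	fmt.Println(filterDatums(in)) // [12.5 0]
}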

@allingeek (Contributor)

I don't suppose we could pull this change into a 1.3.X release and 1.4.X-rcs?

@danielnelson (Contributor)

@allingeek I was somewhat hoping to avoid another 1.3.X release, but only because it takes me several hours to prepare a release. I'm planning to release 1.4 by the end of the week; assuming that happens, would you still want this as a new 1.3 release?

@allingeek (Contributor)

1.4 would be fine. Thanks for the quick merge. :)

@danielnelson modified the milestones: 1.5.0, 1.4.0 Aug 29, 2017
@danielnelson modified the milestones: 1.4.0, 1.5.0 Aug 29, 2017