Cloudwatch output is blocked by metric values close to 0 #2523
Comments
Yes, the buffer is emptied on every successful push, and should remain empty. Need to determine why nothing is being successfully output to CloudWatch.
What version of telegraf are you using? What type of AWS credentials are you using?
Thanks for your reply; I have provided the answers below, hope it helps.
What type of AWS credentials are you using?
Scenario 1:
2017/03/10 04:43:40 Output [cloudwatch] buffer fullness: 10012 / 10000 metrics. Total gathered metrics: 25212. Total dropped metrics: 15200.
Scenario 2:
It would be great if you could provide some approach for troubleshooting to identify the root cause and fix this ASAP so we can get our production monitoring live again.
There have been a lot of changes since this version; can you test with the latest 1.2.1 release?
Sure, I just upgraded to version 1.2.1. With the new version we still see the issue, but we were able to narrow down the particular root cause that prevents metrics from being written to AWS CloudWatch. Could we get some suggestions on why we are getting this error and how to fix it?
2017-03-11T07:56:42Z E! CloudWatch: Unable to write to CloudWatch : InvalidParameterValue: The value 0 for parameter MetricData.member.10.Value is invalid.
For every Spring Boot application we retrieve Dropwizard metrics through the httpjson input in Telegraf, as below. In the past we were able to ship metrics to CloudWatch successfully for all the apps, but now the same Telegraf conf works for one app and fails for the other app's metrics.
WORKS (writes metrics to CloudWatch):
[[inputs.httpjson]]
  # HTTP method to use (case-sensitive)
  method = "GET"
Telegraf log (output buffer resets on successful write):
2017-03-11T08:02:00Z D! Output [cloudwatch] buffer fullness: 68 / 10000 metrics.
DOES NOT WORK (fails to write metrics to CloudWatch):
[[inputs.httpjson]]
  # HTTP method to use (case-sensitive)
  method = "GET"
Telegraf log (output buffer keeps accumulating without reset):
2017-03-11T07:56:40Z D! Output [cloudwatch] buffer fullness: 212 / 10000 metrics.
Attached is the sample HTTP JSON metrics payload captured by Telegraf.
I found blacklocus/metrics-cloudwatch#13, which seems to indicate that you can receive this error from CloudWatch if you have a very small number. In your sample JSON there is one value that is 4.784335455263847e-50; I wonder if this is what is causing the problem. Are you able to change the HTTP server to return 0 for very small numbers?
Thanks. It is even okay for CloudWatch to reject these small values, but I am still curious why Telegraf does not ignore these CloudWatch errors for very small numbers (e-50) and just write the other metrics, whose values are at or above zero, to CloudWatch. I believe that would probably fix this whole issue. Right now, when this error occurs, the logs show the output [cloudwatch] buffer fullness accumulating without reset and say "Unable to write to CloudWatch", but they say nothing about whether the metrics > 0 were written to CloudWatch or not.
Telegraf log (with errors - Unable to write to CloudWatch / output buffer accumulating 73 -> 146):
2017-03-15T00:27:20Z D! Output [cloudwatch] buffer fullness: 73 / 10000 metrics.
Telegraf log (without errors - Wrote batch of ## metrics in seconds / output buffer reset 73 -> 73):
2017-03-14T22:16:00Z D! Output [cloudwatch] buffer fullness: 74 / 10000 metrics.
Please advise on how to get Telegraf to simply ignore these very small numbers and go ahead with the other metrics whose values are closer to or greater than 0. Thanks for your help on this. It would also be great if you could provide some references or an approach for getting the HTTP server to return 0 for small numbers. Also, I just found in the link below that CloudWatch rejects values that are either too small or too large: values must be in the range of 8.515920e-109 to 1.174271e+108 for CloudWatch PutMetricData.
I think what Telegraf should do is round these close-to-zero values down to exactly zero. In order to do this we need to determine the correct cutoff point for the conversion. It appears that the inability to push the bad CloudWatch metric is blocking Telegraf from processing the good metrics; at some point the bad values will drop off the buffer if they only occur occasionally. If the error occurs infrequently, you could lower the metric_buffer_limit so the bad values drop off sooner.
The quickest solution for you may be to update the server that your httpjson input uses as its source so it adjusts the values before they enter Telegraf. We will try to get a fix done for this before the next release; if someone wants to work on this, that would be great.
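To illustrate the rounding idea, here is a minimal sketch in Go. The nearZeroCutoff constant is a hypothetical placeholder, since the comment above notes the correct cutoff still has to be determined; this is not the plugin's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// nearZeroCutoff is a hypothetical threshold for illustration only; the
// correct cutoff for CloudWatch still needs to be determined.
const nearZeroCutoff = 1e-40

// roundNearZero maps values whose magnitude falls below the cutoff to exactly
// zero, leaving all other values untouched.
func roundNearZero(v float64) float64 {
	if v != 0 && math.Abs(v) < nearZeroCutoff {
		return 0
	}
	return v
}

func main() {
	fmt.Println(roundNearZero(4.784335455263847e-50)) // prints 0
	fmt.Println(roundNearZero(0.25))                  // prints 0.25
}
```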
Thanks for the response, and I appreciate you taking care of this as part of the 1.3.0 milestone release. Regarding the quickest solution, could you please elaborate a little more? Here is what our environment currently looks like; any suggestions/recommendations on what needs to be done on the source server (the mesos master) to adjust the values to zero?
Meanwhile, these small numbers (e-179) come up very often and consistently in our case. Currently we have a metric buffer limit of 10000; even on lowering it to 1000 I am not sure whether the bad values will drop off the buffer completely. But I will definitely give it a try with the default buffer limit of 1000 and let you know how it goes.
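One way to read "adjust the values before they enter telegraf" is to put a small proxy between the metrics endpoint and the httpjson input and zero out tiny numbers there. The sketch below is only an illustration: the listen address, the /metrics path, and the nearZeroLimit cutoff are made-up placeholders, and the upstream URL is taken from the config shown later in this thread.

```go
package main

import (
	"encoding/json"
	"log"
	"math"
	"net/http"
)

// Hypothetical values for illustration only.
const (
	upstreamURL   = "http://localhost:5481/metrics" // endpoint the httpjson input reads today
	listenAddr    = ":5482"                         // point the httpjson input here instead
	nearZeroLimit = 1e-40                           // cutoff below which values are zeroed
)

// clampSmall walks an arbitrary decoded JSON value and zeroes tiny numbers.
func clampSmall(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, val := range t {
			t[k] = clampSmall(val)
		}
		return t
	case []interface{}:
		for i, val := range t {
			t[i] = clampSmall(val)
		}
		return t
	case float64:
		if t != 0 && math.Abs(t) < nearZeroLimit {
			return float64(0)
		}
		return t
	default:
		return v
	}
}

// handler fetches the upstream JSON, clamps near-zero values, and re-serves it.
func handler(w http.ResponseWriter, r *http.Request) {
	resp, err := http.Get(upstreamURL)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	var payload interface{}
	if err := json.NewDecoder(resp.Body).Decode(&payload); err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(clampSmall(payload))
}

func main() {
	http.HandleFunc("/metrics", handler)
	log.Fatal(http.ListenAndServe(listenAddr, nil))
}
```

This keeps the Spring Boot / mesos side untouched; only the httpjson `servers` entry would need to point at the proxy instead of the original endpoint.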
Based on the metrics above, the problematic one is the com.wipr.lac.shop.rest.controller.ControllerV1.getproductsproductData.oneMinuteRate value (4.784335455263847e-50).
Perhaps you can round this where the metric is produced, presumably somewhere in your code. Another suggestion is to filter the problematic metric; it should be something like:
[[inputs.httpjson]]
name = "registration"
servers = [
"http://localhost:5481/metrics"
]
method = "GET"
fielddrop = ["com.wipr.lac.shop.rest.controller.ControllerV1.getproductsproductData.oneMinuteRate"]
You may have to filter additional fields; here are the docs on filtering: https://github.com/influxdata/telegraf/blob/master/docs/CONFIGURATION.md#measurement-filtering
I'm encountering the same issue with inputs.http_listener and outputs.cloudwatch.
The linked PR adds code to actually enforce the constraints on CloudWatch metrics. Data points that fail those constraints are omitted in the same way the plugin omits metrics with unsupported data types. I've tested the change in an environment that always encounters this problem and it seems to have been fixed. Boundary checking is important.
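For reference, here is a minimal sketch of that kind of boundary check, using the value range quoted earlier in the thread. It is only an illustration of the idea, not the code from the linked PR; datums that fail the check are simply skipped.

```go
package main

import (
	"fmt"
	"math"
)

// Value bounds quoted earlier in the thread for CloudWatch PutMetricData.
const (
	minAbsValue = 8.515920e-109
	maxAbsValue = 1.174271e+108
)

// isAcceptable reports whether a value can be sent to CloudWatch: it must be
// a finite number and, unless it is exactly zero, its magnitude must fall
// inside the allowed range.
func isAcceptable(v float64) bool {
	if math.IsNaN(v) || math.IsInf(v, 0) {
		return false
	}
	if v == 0 {
		return true
	}
	abs := math.Abs(v)
	return abs >= minAbsValue && abs <= maxAbsValue
}

func main() {
	values := []float64{0, 0.25, 4.784335455263847e-50, 1e-179, math.Inf(1)}
	for _, v := range values {
		if !isAcceptable(v) {
			fmt.Printf("skipping %g\n", v) // omitted, like unsupported data types
			continue
		}
		fmt.Printf("keeping  %g\n", v)
	}
}
```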
I don't suppose we could pull this change into a 1.3.X release and the 1.4.X RCs?
@allingeek I was somewhat hoping to avoid another 1.3.X release, but only because it takes me several hours to prepare a release. I'm planning to release 1.4 by the end of the week; assuming that happens, would you still want this in a new 1.3 release?
1.4 would be fine. Thanks for the quick merge. :)
Our Telegraf is integrated with the CloudWatch output; for the last 4 to 5 weeks it was working fine. Suddenly, in the last 2 days, we see an inconsistency where metrics are not being pushed to the CloudWatch output even though the Telegraf setup is still the same. Based on the analysis below, I am looking for your opinions and troubleshooting help. Let me know if something is not clear and I can add more details.
Scenario 1
As per the Telegraf conf below, we have the metric buffer limit set to 10000 and a flush interval of 20 seconds. However, after reaching the 10000 limit, the buffer never gets flushed; the gathered metrics keep accumulating, which subsequently increases the count of dropped metrics.
i) What do dropped metrics mean here? Does it mean that many (195956) metrics were not sent to the CloudWatch output? (A toy sketch of this buffer behavior follows the log lines below.)
ii) We generate 222 metrics/min, which is a total of 13320 metrics/hour. In the past we have captured a similar volume of metrics with the same buffer limit of 10000, and over the last 2 months we never had an issue posting the stats to CloudWatch.
iii) Does going above the buffer limit cause metrics not to be sent to CloudWatch?
2017/03/09 16:39:00 Output [cloudwatch] buffer fullness: 10074 / 10000 metrics. Total gathered metrics: 206030. Total dropped metrics: 195956.
2017/03/09 16:39:20 Output [cloudwatch] buffer fullness: 10074 / 10000 metrics. Total gathered metrics: 206104. Total dropped metrics: 196030.
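As a rough model of those counters, here is a toy buffer that evicts its oldest entry once the limit is reached, which is what the maintainer's earlier comment ("the bad values will drop off the buffer") implies. Telegraf's real buffer is more involved, so treat this only as an illustration of why the gathered and dropped counts keep climbing while writes fail.

```go
package main

import "fmt"

// metricBuffer is a toy fixed-size buffer: when full, adding a new metric
// evicts the oldest one and counts it as dropped.
type metricBuffer struct {
	limit    int
	buf      []string
	gathered int
	dropped  int
}

func (b *metricBuffer) add(m string) {
	b.gathered++
	if len(b.buf) >= b.limit {
		b.buf = b.buf[1:] // the oldest entry falls off and counts as dropped
		b.dropped++
	}
	b.buf = append(b.buf, m)
}

func main() {
	b := &metricBuffer{limit: 3}
	for i := 1; i <= 5; i++ {
		b.add(fmt.Sprintf("metric-%d", i))
	}
	// While writes keep failing, gathered grows without bound and dropped
	// climbs once the limit is exceeded, which is the pattern in the log
	// lines above (fullness near the limit, dropped counting up).
	fmt.Printf("fullness: %d / %d, gathered: %d, dropped: %d\n",
		len(b.buf), b.limit, b.gathered, b.dropped)
}
```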
Scenario 2
i) To troubleshoot whether reaching the buffer limit is the issue, I disabled a few metrics to reduce the volume. With this change we generate only 36 metrics/min, and our total of gathered metrics stays well under the limit of 10000.
ii) With this change, Telegraf connected successfully to CloudWatch at 12:09 PM, reported the metrics once at 12:10 PM, and then stopped reporting again afterwards, even though we are under the limit of 10000.
2017/03/09 17:34:20 Output [cloudwatch] buffer fullness: 1116 / 10000 metrics. Total gathered metrics: 1116. Total dropped metrics: 0.
2017/03/09 17:34:40 Output [cloudwatch] buffer fullness: 1128 / 10000 metrics. Total gathered metrics: 1128. Total dropped metrics: 0.
Telegraf conf

# Configuration for telegraf agent
[agent]
  # Default data collection interval for all inputs
  interval = "20s"
  # Rounds collection interval to 'interval'
  # ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true
  # Telegraf will cache metric_buffer_limit metrics for each output, and will
  # flush this buffer on a successful write.
  metric_buffer_limit = 10000
  # Flush the buffer whenever full, regardless of flush_interval.
  flush_buffer_when_full = true
  # Collection jitter is used to jitter the collection by a random amount.
  collection_jitter = "0s"
  # Default flushing interval for all outputs. You shouldn't set this below
  # interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "20s"
  # ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"
  # Run telegraf in debug mode
  debug = true
  # Run telegraf in quiet mode
  quiet = false
  # Override default hostname, if empty use os.Hostname()
  hostname = ""

# Configuration for cloudwatch api to send metrics to
[[outputs.cloudwatch]]
  # Amazon Region (required)
  region = 'us-east-1'
Current behavior:
Telegraf logs look fine, but it is not reporting/updating the metrics in CloudWatch.
Use case: Currently our production monitoring with Telegraf is completely broken; we need to get this fixed as soon as possible.
Telegraf logs
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 7.261833ms
2017/03/09 21:45:40 Input [mesos] gathered metrics, (20s interval) in 7.050785ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 2.375689ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 3.618668ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 6.121121ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 5.218891ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 11.845889ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 14.763179ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 10.747364ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 15.15531ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 23.128821ms
2017/03/09 21:45:40 Input [httpjson] gathered metrics, (20s interval) in 24.935601ms
2017/03/09 21:45:40 Output [cloudwatch] buffer fullness: 10012 / 10000 metrics. Total gathered metrics: 10164. Total dropped metrics: 152.