AMQP Consumer Stops function with bad messages #5285
Comments
This should be fixed in 1.9.2: #5170
@danielnelson Is this different? That one appears to be about sending a single message that is empty. Just want to confirm, since #5170 doesn't have a ton of info.
Oh, I think I misunderstood the issue. It is stopped because it is trying to send to InfluxDB but is unable to make progress?
Telegraf will throw an error in the log, then it won't ack the message or drop it but will just hang on to it. Telegraf as a service will continue to run, but once it has done that enough times it will stop grabbing messages. If you look in RabbitMQ, the Unacked stat will equal either max_undelivered_messages or the prefetch, whichever is lower. So if your prefetch is 50 and you have 100 messages in the queue, and the first 50 are messages that will cause an error while writing, Telegraf will never pick up the next 50 messages because it will hang onto those first 50 that are bad. Not sure if I am explaining it clearly.
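To make the stall concrete, here is a minimal toy model of it in Python. This is not Telegraf or RabbitMQ code; `simulate` and `is_good` are hypothetical names, and the only assumption is the behavior described above: a bad message is neither acked nor rejected, so it keeps occupying a prefetch slot.

```python
from collections import deque


def simulate(queue, prefetch, is_good):
    """Toy model of a broker that enforces a prefetch limit.

    A message leaves the unacked set only when the consumer acks it.
    Bad messages are never acked or rejected, so they are held forever.
    Returns (processed, unacked, left_in_queue).
    """
    pending = deque(queue)
    unacked = []
    processed = 0
    # The broker delivers only while the consumer has free prefetch slots.
    while pending and len(unacked) < prefetch:
        msg = pending.popleft()
        if is_good(msg):
            processed += 1       # written and acked, slot is freed at once
        else:
            unacked.append(msg)  # write fails, slot stays occupied
    return processed, len(unacked), len(pending)


# 100 messages in the queue, the first 50 are bad, prefetch is 50.
messages = ["bad"] * 50 + ["good"] * 50
print(simulate(messages, prefetch=50, is_good=lambda m: m == "good"))
# (0, 50, 50)
```

The 50 good messages are never delivered, which matches the Unacked count sitting at the limit in RabbitMQ.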
I can also retest this against 1.9.2 to see if it is still an issue. I know this impacts 1.9.0 and 1.9.1 but doesn't impact <1.9, because messages are acked immediately there.
Can you show the log output?
It is worth noting these messages can be very large, as currently all Telegraf agents collecting metrics batch them into a single AMQP message before sending them.
Looks like we are hitting the prefetch limit because the message is neither acked nor rejected when a parse error occurs.
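The general remedy for this kind of stall is to settle every delivery, including the ones that fail to parse. The sketch below illustrates that idea only; it is not the actual Telegraf change. `Delivery`, `parse_metric`, and `handle` are hypothetical stand-ins, and the "name value" message format is an assumption made for the example.

```python
class Delivery:
    """Hypothetical stand-in for an AMQP delivery, not a real client class."""

    def __init__(self, body):
        self.body = body
        self.state = "unacked"

    def ack(self):
        self.state = "acked"

    def reject(self, requeue=False):
        self.state = "requeued" if requeue else "rejected"


def parse_metric(body):
    """Hypothetical parser: expects 'name value' with a numeric value."""
    name, value = body.split()  # raises ValueError on the wrong field count
    return name, float(value)   # raises ValueError on a non-numeric value


def handle(delivery, write):
    """Settle the delivery on every path so it never holds a prefetch slot."""
    try:
        metric = parse_metric(delivery.body)
    except ValueError:
        # Reject without requeue so the bad message is not redelivered.
        delivery.reject(requeue=False)
        return False
    write(metric)
    delivery.ack()
    return True


written = []
good, bad = Delivery("cpu 0.5"), Delivery("garbage")
handle(good, written.append)
handle(bad, written.append)
print(good.state, bad.state)  # acked rejected
```

Rejecting without requeue also fits the dead letter request in the issue: in RabbitMQ, a message rejected with requeue disabled is routed to the queue's dead letter exchange when one is configured.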
@danielnelson do we have an expected release date for 1.9.3?
Should be on the 22nd, I can get you a pre-release sooner though if it would be helpful.
System info:
RHEL 7
Telegraf 1.9.X
Steps to reproduce:
Set a reasonable max_undelivered_messages in the amqp_consumer input
Set the output to InfluxDB
Send bad metrics over RabbitMQ
Expected behavior:
Telegraf should either support a dead letter exchange in RabbitMQ or "consume" the message but drop the metric.
Actual behavior:
After the AMQP consumer has grabbed X messages (either the prefetch or max_undelivered_messages, whichever is lower), it will stop functioning. Basically Telegraf will say "I have 50 messages and my max is 50", but the messages are bad so it won't write them to InfluxDB, causing Telegraf to just stop doing anything.
Additional Context
This occurred inside our environment once we upgraded our main Telegraf writers (amqp_consumer input and influx output) to 1.9.X.
We realized that with different groups using [[inputs.mysql]], some had metric_version not set, some had it set to 1, and some had it set to 2. This is what was causing the error buildup in the writers.