Output to Kafka does not recover after a Kafka node goes down #1113

Closed
elvarb opened this issue Apr 28, 2016 · 4 comments · Fixed by #1131
Labels
bug unexpected problem or unintended behavior

Comments

@elvarb

elvarb commented Apr 28, 2016

3x CentOS Kafka/ZooKeeper cluster
1x Windows machine to run Telegraf
1x CentOS InfluxDB

On the Windows machine I have two instances of Telegraf running: one gathers metrics and writes to Kafka, and the other reads from Kafka and writes to InfluxDB.

I'm testing shutting down the Kafka nodes to see how Telegraf handles it. Shutting down the ZooKeeper service that the Kafka consumer is using works fine: it automatically tries the next ZooKeeper node and continues.

On the other hand, the Kafka output that is connected to the Kafka brokers fails completely when I shut down the Kafka service on the first node, and it continues to fail even after the service is up and running again.

2016/04/28 10:04:20 Wrote 66 metrics to output kafka in 4.5028ms
2016/04/28 10:04:30 Gathered metrics, (10s interval), from 1 inputs in 21.4846ms
2016/04/28 10:04:30 Wrote 66 metrics to output kafka in 13.4982ms
2016/04/28 10:04:40 Gathered metrics, (10s interval), from 1 inputs in 48.9969ms
2016/04/28 10:04:40 Error writing to output [kafka]: FAILED to send kafka message: write tcp 192.168.32.1:18041->192.168.32.11:9092: wsasend: An established connection was aborted by the software in your host machine.

2016/04/28 10:04:50 Gathered metrics, (10s interval), from 1 inputs in 37.8644ms
2016/04/28 10:04:50 Error writing to output [kafka]: FAILED to send kafka message: write tcp 192.168.32.1:18041->192.168.32.11:9092: wsasend: An established connection was aborted by the software in your host machine.

The only way to get it working again is to restart the Telegraf process.

This is my output config

[outputs.kafka]
    # URLs of kafka brokers
    brokers = ["confluent-1:9092","confluent-2:9092","confluent-3:9092"] # EDIT THIS LINE
    # Kafka topic for producer messages
    topic = "telegraf3"
    data_format = "influx"
@elvarb
Author

elvarb commented Apr 28, 2016

This obviously has something to do with the required_acks setting. If that setting is missing from the config, what default value is used?

This is how my config looks now and everything works

[outputs.kafka]
    # URLs of kafka brokers
    brokers = ["confluent-1:9092","confluent-2:9092","confluent-3:9092"] # EDIT THIS LINE
    # Kafka topic for producer messages
    topic = "telegraf3"
    data_format = "influx"
    required_acks = 1
    max_retry = 3

@sparrc
Contributor

sparrc commented Apr 28, 2016

it would use 0

@elvarb
Author

elvarb commented Apr 29, 2016

It makes sense that it would fail when the broker it is connected to fails, but why does it not recover when the broker comes back up?

@sparrc sparrc added the bug unexpected problem or unintended behavior label Apr 29, 2016
@sparrc
Contributor

sparrc commented Apr 30, 2016

Oh, I think I see the issue: when you don't specify max_retry, that value ends up as 0 in Go, so no retries happen and you get the error messages.

I can change that by giving the Kafka producer a more reasonable default, such as 3.
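
For illustration, here is a minimal sketch of why a missing max_retry ends up as 0 and how a default of 3 could be applied. It is not the actual Telegraf code: the kafkaConfig struct and the newKafkaConfig and loadConfig helpers are hypothetical, and it decodes JSON from the standard library instead of TOML so that it stays self-contained. The idea is to fill in the defaults before decoding, so that keys missing from the config keep the default instead of Go's zero value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kafkaConfig is a hypothetical stand-in for the output plugin's settings.
type kafkaConfig struct {
	RequiredAcks int `json:"required_acks"`
	MaxRetry     int `json:"max_retry"`
}

// newKafkaConfig returns a config with defaults already filled in.
// Only max_retry gets a non-zero default here (3, as suggested above).
func newKafkaConfig() kafkaConfig {
	return kafkaConfig{MaxRetry: 3}
}

// loadConfig decodes raw on top of the defaults. Keys absent from raw
// leave the pre-populated values untouched; keys present override them.
func loadConfig(raw []byte) (kafkaConfig, error) {
	cfg := newKafkaConfig()
	err := json.Unmarshal(raw, &cfg)
	return cfg, err
}

func main() {
	// Decoding into a bare struct: the missing key stays at Go's zero value.
	var bare kafkaConfig
	if err := json.Unmarshal([]byte(`{}`), &bare); err != nil {
		panic(err)
	}
	fmt.Println("no defaults, max_retry missing:", bare.MaxRetry)

	// Decoding on top of defaults: the missing key keeps the default.
	withDefaults, err := loadConfig([]byte(`{}`))
	if err != nil {
		panic(err)
	}
	fmt.Println("with defaults, max_retry missing:", withDefaults.MaxRetry)

	// An explicit value in the config still wins over the default.
	explicit, err := loadConfig([]byte(`{"max_retry": 5}`))
	if err != nil {
		panic(err)
	}
	fmt.Println("with defaults, max_retry = 5:", explicit.MaxRetry)
}
```

With this approach an explicit max_retry = 0 in the config is still honored, because the decoder only overwrites fields whose keys are actually present.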
