Filebeat TCP input panics after 2^31 events received #7202

Closed
adriansr opened this issue May 30, 2018 · 4 comments
A user reports a panic ("sync: negative WaitGroup counter") after a week of running Filebeat. Filebeat stays running but no longer accepts connections.

Stack trace:

github.com/elastic/beats/libbeat/logp.Recover
	/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88
runtime.call32
	/usr/lib/go-1.10/src/runtime/asm_amd64.s:573
runtime.gopanic
	/usr/lib/go-1.10/src/runtime/panic.go:502
sync.(*WaitGroup).Add
	/usr/lib/go-1.10/src/sync/waitgroup.go:73
github.com/elastic/beats/filebeat/beater.(*eventCounter).Add
	/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61
github.com/elastic/beats/filebeat/channel.(*outlet).OnEvent
	/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43
github.com/elastic/beats/filebeat/harvester.(*Forwarder).Send
	/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33
github.com/elastic/beats/filebeat/input/tcp.NewInput.func1
	/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59
github.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle
	/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71
github.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1
	/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99

Log (two consecutive panics, same stack trace):

2018-05-29T10:55:39.577-0600 ERROR sync/waitgroup.go:73 recovering from a tcp client crash. Recovering, but please report this. {"panic": "sync: negative WaitGroup counter", "stack": "github.com/elastic/beats/libbeat/logp.Recover\n\t/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88\nruntime.call32\n\t/usr/lib/go-1.10/src/runtime/asm_amd64.s:573\nruntime.gopanic\n\t/usr/lib/go-1.10/src/runtime/panic.go:502\nsync.(*WaitGroup).Add\n\t/usr/lib/go-1.10/src/sync/waitgroup.go:73\ngithub.com/elastic/beats/filebeat/beater.(*eventCounter).Add\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61\ngithub.com/elastic/beats/filebeat/channel.(*outlet).OnEvent\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43\ngithub.com/elastic/beats/filebeat/harvester.(*Forwarder).Send\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33\ngithub.com/elastic/beats/filebeat/input/tcp.NewInput.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99"}
2018-05-29T10:55:39.637-0600 ERROR sync/waitgroup.go:73 recovering from a tcp client crash. Recovering, but please report this. {"panic": "sync: negative WaitGroup counter", "stack": "github.com/elastic/beats/libbeat/logp.Recover\n\t/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88\nruntime.call32\n\t/usr/lib/go-1.10/src/runtime/asm_amd64.s:573\nruntime.gopanic\n\t/usr/lib/go-1.10/src/runtime/panic.go:502\nsync.(*WaitGroup).Add\n\t/usr/lib/go-1.10/src/sync/waitgroup.go:73\ngithub.com/elastic/beats/filebeat/beater.(*eventCounter).Add\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61\ngithub.com/elastic/beats/filebeat/channel.(*outlet).OnEvent\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43\ngithub.com/elastic/beats/filebeat/harvester.(*Forwarder).Send\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33\ngithub.com/elastic/beats/filebeat/input/tcp.NewInput.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99"}



adriansr commented May 30, 2018

After having a look at the code, the only plausible explanation is that the WaitGroup's internal counter has overflowed (it is stored as an int32).

It can be reproduced with this example:

wg := sync.WaitGroup{}
wg.Add(math.MaxInt32)
wg.Add(1) // panic: sync: negative WaitGroup counter
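
The same snippet as a complete program, for anyone who wants to run it (a minimal sketch, standard library only, nothing Beats-specific):

package main

import (
	"math"
	"sync"
)

func main() {
	wg := sync.WaitGroup{}
	wg.Add(math.MaxInt32) // internal int32 counter now sits at 2^31 - 1
	wg.Add(1)             // wraps negative: panic: sync: negative WaitGroup counter
}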

That means wg.Done() is not being called for the Redis output.
Edit: it is not being called for any kind of output when the TCP input is used, as the logic that calls finishedLogger.Done is currently embedded in the Registrar.

adriansr changed the title from "Filebeat panics in TCP client (negative WaitGroup counter)" to "Filebeat panics in TCP input (negative WaitGroup counter) after 2^31 events received" May 30, 2018
adriansr changed the title from "Filebeat panics in TCP input (negative WaitGroup counter) after 2^31 events received" to "Filebeat TCP input panics after 2^31 events received" May 30, 2018
ph self-assigned this May 30, 2018

ph commented May 30, 2018

Thanks for the report, I will verify that.

ph added and then removed the blocker label May 30, 2018

ph commented May 30, 2018

I believe the TCP, UDP, and Redis inputs are affected by this issue as well.


ph commented May 30, 2018

Just to leave some feedback on this issue.

This is the current flow of events for the Redis, TCP, and UDP inputs:

  1. Events are sent to the pipeline.
  2. Elasticsearch receives the events.
  3. An ACK is generated.
  4. The global ACK handler receives the ACK.
  5. The Private field on the event is empty; that field is what drives the registry update.
  6. Done() is never called on the wait group.

The problem is that the current implementation of the registry is global, even when we don't need it.
I have started to look at the refactoring, but it will take more time; I think we need a short-term fix in the meantime.

I will check whether I can use the same strategy as the stdin input.
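
To make steps 4 to 6 concrete, here is a sketch of the broken flow. Event, State, and onACK are hypothetical stand-ins, not the actual Beats API:

package main

import "fmt"

// Hypothetical stand-ins: the log input stores a file state in the
// event's Private field, while TCP/UDP/Redis leave it nil.
type Event struct{ Private interface{} }
type State struct{ Source string }

// onACK models the global ACK handler: only events carrying a state
// reach the registrar, and the registrar is the only place the wait
// group's Done() is called. Stateless events fall through, so the
// counter incremented on publish is never decremented for them.
func onACK(events []Event, registrarUpdates chan<- []State) {
	var states []State
	for _, e := range events {
		if st, ok := e.Private.(State); ok {
			states = append(states, st) // stateful: registrar will call Done()
		}
		// stateless: dropped here, Done() never happens
	}
	registrarUpdates <- states
}

func main() {
	updates := make(chan []State, 1)
	onACK([]Event{{Private: State{Source: "/var/log/app.log"}}, {}}, updates)
	fmt.Println("states forwarded to registrar:", len(<-updates)) // 1, not 2
}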

ph added a commit to ph/beats that referenced this issue May 30, 2018
Filebeat: Allow stateless and stateful ACKer on the global ack handler

This commit changes how Filebeat handles ACKs by default. Previously,
the ACK handler used the event's private field to retrieve a state; the
updated state was sent to the registrar, and the registrar finalized
the ACK.

With the introduction of the TCP/UDP and Redis inputs, events don't
have any state attached, so in that scenario Filebeat was not correctly
acking these events to the wait group.

The ACKer was modified to handle both stateless (the default) and
stateful events: when stateful handling is required, the states are
sent to the registry; otherwise, the wait group is updated directly.

Fixes: elastic#7202
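
The shape of that fix, sketched with the same hypothetical stand-ins as the sketch above (this is not the actual PR code): the ACKer splits ACKed events into stateful and stateless, sends states to the registry, and decrements the wait group directly for the rest.

package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins, as in the earlier sketch.
type Event struct{ Private interface{} }
type State struct{ Source string }

// onACK sketches the fixed ACKer: stateful events still go through the
// registrar (which calls Done() after persisting the state), while
// stateless events now decrement the wait group directly instead of
// being dropped.
func onACK(events []Event, registrarUpdates chan<- []State, wg *sync.WaitGroup) {
	var states []State
	stateless := 0
	for _, e := range events {
		if st, ok := e.Private.(State); ok {
			states = append(states, st)
		} else {
			stateless++
		}
	}
	if len(states) > 0 {
		registrarUpdates <- states
	}
	for i := 0; i < stateless; i++ {
		wg.Done() // the previously missing decrement: no more overflow
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1) // one stateless (e.g. TCP) event published
	updates := make(chan []State, 1)
	onACK([]Event{{}}, updates, &wg)
	wg.Wait() // returns: the stateless event was acked directly
	fmt.Println("stateless event acked; counter back to zero")
}
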
tsg closed this as completed in #7214 Jun 4, 2018
tsg pushed a commit that referenced this issue Jun 4, 2018: "Filebeat: Allow stateless and stateful ACKer on the global ack handler" (#7214), with the same commit message as above.
tsg pushed a commit to tsg/beats that referenced this issue Jun 5, 2018, with the same commit message (cherry picked from commit b9d2150).
urso pushed a commit that referenced this issue Jun 5, 2018: the backport of #7214 (#7258), with the same commit message (cherry picked from commit b9d2150).
leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023, with the same commit message (cherry picked from commit 31c46c9).