Filebeat TCP input panics after 2^31 events received #7202

Closed
adriansr opened this issue May 30, 2018 · 4 comments
A user reports a panic ("sync: negative WaitGroup counter") after a week of running Filebeat. Filebeat stays running but no longer accepts connections.

Stack trace:

github.com/elastic/beats/libbeat/logp.Recover
	/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88
runtime.call32
	/usr/lib/go-1.10/src/runtime/asm_amd64.s:573
runtime.gopanic
	/usr/lib/go-1.10/src/runtime/panic.go:502
sync.(*WaitGroup).Add
	/usr/lib/go-1.10/src/sync/waitgroup.go:73
github.com/elastic/beats/filebeat/beater.(*eventCounter).Add
	/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61
github.com/elastic/beats/filebeat/channel.(*outlet).OnEvent
	/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43
github.com/elastic/beats/filebeat/harvester.(*Forwarder).Send
	/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33
github.com/elastic/beats/filebeat/input/tcp.NewInput.func1
	/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59
github.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle
	/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71
github.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1
	/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99

Log (two consecutive panics, same stack trace):

2018-05-29T10:55:39.577-0600 ERROR sync/waitgroup.go:73 recovering from a tcp client crash. Recovering, but please report this. {"panic": "sync: negative WaitGroup counter", "stack": "github.com/elastic/beats/libbeat/logp.Recover\n\t/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88\nruntime.call32\n\t/usr/lib/go-1.10/src/runtime/asm_amd64.s:573\nruntime.gopanic\n\t/usr/lib/go-1.10/src/runtime/panic.go:502\nsync.(*WaitGroup).Add\n\t/usr/lib/go-1.10/src/sync/waitgroup.go:73\ngithub.com/elastic/beats/filebeat/beater.(*eventCounter).Add\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61\ngithub.com/elastic/beats/filebeat/channel.(*outlet).OnEvent\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43\ngithub.com/elastic/beats/filebeat/harvester.(*Forwarder).Send\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33\ngithub.com/elastic/beats/filebeat/input/tcp.NewInput.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99"}
2018-05-29T10:55:39.637-0600 ERROR sync/waitgroup.go:73 recovering from a tcp client crash. Recovering, but please report this. {"panic": "sync: negative WaitGroup counter", "stack": "github.com/elastic/beats/libbeat/logp.Recover\n\t/home/jason/go/src/github.com/elastic/beats/libbeat/logp/global.go:88\nruntime.call32\n\t/usr/lib/go-1.10/src/runtime/asm_amd64.s:573\nruntime.gopanic\n\t/usr/lib/go-1.10/src/runtime/panic.go:502\nsync.(*WaitGroup).Add\n\t/usr/lib/go-1.10/src/sync/waitgroup.go:73\ngithub.com/elastic/beats/filebeat/beater.(*eventCounter).Add\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/beater/channels.go:61\ngithub.com/elastic/beats/filebeat/channel.(*outlet).OnEvent\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/channel/outlet.go:43\ngithub.com/elastic/beats/filebeat/harvester.(*Forwarder).Send\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/harvester/forwarder.go:33\ngithub.com/elastic/beats/filebeat/input/tcp.NewInput.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/input/tcp/input.go:59\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*client).handle\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/client.go:71\ngithub.com/elastic/beats/filebeat/inputsource/tcp.(*Server).run.func1\n\t/home/jason/go/src/github.com/elastic/beats/filebeat/inputsource/tcp/server.go:99"}



adriansr commented May 30, 2018

After having a look at the code, the only plausible explanation is that the WaitGroup's internal counter has overflowed (it is stored as an int32).

It can be reproduced with this example:

wg := sync.WaitGroup{}
wg.Add(math.MaxInt32)
wg.Add(1) // panic: sync: negative WaitGroup counter
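
The same snippet as a complete program, for anyone who wants to run it (a minimal sketch, standard library only, nothing Beats-specific):

package main

import (
	"math"
	"sync"
)

func main() {
	wg := sync.WaitGroup{}
	wg.Add(math.MaxInt32) // internal int32 counter now sits at 2^31 - 1
	wg.Add(1)             // wraps negative: panic: sync: negative WaitGroup counter
}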

That means wg.Done() is not being called for the Redis output.
Edit: it is not being called for any kind of output when the TCP input is used, as the logic that calls finishedLogger.Done is currently embedded in the Registrar.

adriansr changed the title from "Filebeat panics in TCP client (negative WaitGroup counter)" to "Filebeat panics in TCP input (negative WaitGroup counter) after 2^31 events received" May 30, 2018
adriansr changed the title from "Filebeat panics in TCP input (negative WaitGroup counter) after 2^31 events received" to "Filebeat TCP input panics after 2^31 events received" May 30, 2018
ph self-assigned this May 30, 2018

ph commented May 30, 2018

Thanks for the report, I will verify that.

ph added and then removed the blocker label May 30, 2018

ph commented May 30, 2018

I believe the TCP, UDP, and Redis inputs are affected by this issue as well.


ph commented May 30, 2018

Just to leave some feedback on this issue.

This is the current flow of events for the Redis, TCP, and UDP inputs:

  1. Events are sent to the pipeline.
  2. Elasticsearch receives the events.
  3. An ACK is generated.
  4. The global ACK handler receives the ACK.
  5. The Private field on the event is empty; that field is what drives the registry update.
  6. Done() is never called on the wait group.

The problem is that the current implementation of the registry is global, even when we don't need it.
I have started to look at the refactoring, but it will take more time; I think we need a short-term fix in the meantime.

I will check whether I can use the same strategy as the stdin input.
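
To make steps 4 to 6 concrete, here is a sketch of the broken flow. Event, State, and onACK are hypothetical stand-ins, not the actual Beats API:

package main

import "fmt"

// Hypothetical stand-ins: the log input stores a file state in the
// event's Private field, while TCP/UDP/Redis leave it nil.
type Event struct{ Private interface{} }
type State struct{ Source string }

// onACK models the global ACK handler: only events carrying a state
// reach the registrar, and the registrar is the only place the wait
// group's Done() is called. Stateless events fall through, so the
// counter incremented on publish is never decremented for them.
func onACK(events []Event, registrarUpdates chan<- []State) {
	var states []State
	for _, e := range events {
		if st, ok := e.Private.(State); ok {
			states = append(states, st) // stateful: registrar will call Done()
		}
		// stateless: dropped here, Done() never happens
	}
	registrarUpdates <- states
}

func main() {
	updates := make(chan []State, 1)
	onACK([]Event{{Private: State{Source: "/var/log/app.log"}}, {}}, updates)
	fmt.Println("states forwarded to registrar:", len(<-updates)) // 1, not 2
}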

ph added a commit to ph/beats that referenced this issue May 30, 2018
Filebeat: Allow stateless and stateful ACKer on the global ack handler

This commit changes how Filebeat handles ACKs by default. Previously,
the ACK handler used the event's private field to retrieve a state; the
updated state was sent to the registrar, and the registrar finalized
the ACK.

With the introduction of the TCP/UDP and Redis inputs, events don't
have any state attached, so in that scenario Filebeat was not correctly
acking these events to the wait group.

The ACKer was modified to handle both stateless (the default) and
stateful events: when stateful handling is required, the states are
sent to the registry; otherwise, the wait group is updated directly.

Fixes: elastic#7202
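
The shape of that fix, sketched with the same hypothetical stand-ins as the sketch above (this is not the actual PR code): the ACKer splits ACKed events into stateful and stateless, sends states to the registry, and decrements the wait group directly for the rest.

package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins, as in the earlier sketch.
type Event struct{ Private interface{} }
type State struct{ Source string }

// onACK sketches the fixed ACKer: stateful events still go through the
// registrar (which calls Done() after persisting the state), while
// stateless events now decrement the wait group directly instead of
// being dropped.
func onACK(events []Event, registrarUpdates chan<- []State, wg *sync.WaitGroup) {
	var states []State
	stateless := 0
	for _, e := range events {
		if st, ok := e.Private.(State); ok {
			states = append(states, st)
		} else {
			stateless++
		}
	}
	if len(states) > 0 {
		registrarUpdates <- states
	}
	for i := 0; i < stateless; i++ {
		wg.Done() // the previously missing decrement: no more overflow
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1) // one stateless (e.g. TCP) event published
	updates := make(chan []State, 1)
	onACK([]Event{{}}, updates, &wg)
	wg.Wait() // returns: the stateless event was acked directly
	fmt.Println("stateless event acked; counter back to zero")
}
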
tsg closed this as completed in #7214 Jun 4, 2018
tsg pushed a commit that referenced this issue Jun 4, 2018: "Filebeat: Allow stateless and stateful ACKer on the global ack handler" (#7214), with the same commit message as above.
tsg pushed a commit to tsg/beats that referenced this issue Jun 5, 2018, with the same commit message (cherry picked from commit b9d2150).
urso pushed a commit that referenced this issue Jun 5, 2018: the backport of #7214 (#7258), with the same commit message (cherry picked from commit b9d2150).
leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023, with the same commit message (cherry picked from commit 31c46c9).