[libbeat][outputs/logstash] - The logstash output waits for acknowledgements forever #41534
I managed to reproduce this locally and I can confirm it always returns
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
I think fundamentally we need a timeout for how long we are willing to wait for acknowledgements to come back. It can be quite generous, given this is a rare condition that requires Logstash to be alive and responsive to network requests but otherwise unable to make progress. Something like 5 minutes. It needs to be configurable, and there needs to be an obvious error-level log message indicating what the problem is when this happens. If a Logstash instance is stuck in this situation, this approach will keep blocking individual batches on it for the length of the timeout until the problem is solved, but no batch will go unsent indefinitely.
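A minimal sketch of the timeout idea, not the actual libbeat or go-lumber code: `awaitACKWithTimeout`, `errACKTimeout`, `defaultACKTimeout` and `stalledRead` are hypothetical names introduced here for illustration. It assumes each acknowledgement read returns within a short interval (for example because a read deadline is set on the connection), so the loop can re-check the overall deadline between reads.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// defaultACKTimeout mirrors the "something like 5 minutes" suggestion above.
// Hypothetical: a real output would read this from its configuration.
const defaultACKTimeout = 5 * time.Minute

// errACKTimeout is a hypothetical sentinel error for a stalled batch.
var errACKTimeout = errors.New("timed out waiting for acknowledgements")

// awaitACKWithTimeout calls readACK until it reports that count events have
// been acknowledged, or until timeout has elapsed. readACK stands in for a
// read on the connection and must itself return periodically.
func awaitACKWithTimeout(readACK func() (uint32, error), count uint32, timeout time.Duration) (uint32, error) {
	deadline := time.Now().Add(timeout)
	var acked uint32
	for acked < count {
		if time.Now().After(deadline) {
			// The caller is expected to log this at error level.
			return acked, fmt.Errorf("%w: %d of %d events acknowledged after %s",
				errACKTimeout, acked, count, timeout)
		}
		n, err := readACK()
		if err != nil {
			return acked, err
		}
		acked = n
	}
	return acked, nil
}

// stalledRead simulates a host that keeps the connection alive but never
// acknowledges anything.
func stalledRead() (uint32, error) {
	time.Sleep(time.Millisecond)
	return 0, nil
}
```

Calling `awaitACKWithTimeout(stalledRead, 100, defaultACKTimeout)` would return an error wrapping `errACKTimeout` once the timeout elapses instead of blocking forever; the caller can then log it and retry the batch.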
There are requests for an unconditional connection TTL, but I don't think this actually helps here.
I agree. What would be the behaviour for future batches? If we have timed out on a given host while waiting for acks, should we mark it as "unhealthy", avoid sending new events to this host for some time (perhaps another configuration option), and try to re-establish the connection later?
@faec any update on this?
After discussion with @amitkanfer and @cmacknz and the team, the current plan is a compromise: rather than add a new retry mode to the full pipeline, the Logstash output itself will track when its batches may have stopped making progress and log an error informing the user that the upstream hosts are likely crashed.
Disadvantages:
Advantages:
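The compromise plan, in which the output tracks whether batches are still making progress and logs an error when they are not, could look roughly like the sketch below. This is an illustration under assumptions, not the implemented change: `progressTracker`, its fields, and the log message are all hypothetical.

```go
package main

import (
	"log"
	"time"
)

// progressTracker (hypothetical) accumulates the time spent without any new
// acknowledgements and reports once when that time crosses stallAfter.
type progressTracker struct {
	stallAfter time.Duration // how long without progress counts as a stall
	idle       time.Duration // time accumulated since the last acknowledgement
	reported   bool          // whether the current stall was already logged
}

// observe records the outcome of one acknowledgement read: elapsed is the
// time that read took and newlyAcked is the number of events it acknowledged.
// It returns true while the output is considered stalled.
func (p *progressTracker) observe(elapsed time.Duration, newlyAcked uint32) bool {
	if newlyAcked > 0 {
		// Any progress resets the stall state.
		p.idle = 0
		p.reported = false
		return false
	}
	p.idle += elapsed
	if p.idle < p.stallAfter {
		return false
	}
	if !p.reported {
		// Log once per stall rather than on every read.
		log.Printf("ERROR: no events acknowledged for %s; the upstream Logstash host may be stalled or crashed", p.idle)
		p.reported = true
	}
	return true
}
```

Unlike a hard timeout, this variant only reports the condition; the batch keeps waiting, which matches the stated goal of informing the user without adding a new retry mode to the pipeline.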
Note: This is different from elastic/go-lumber#35, but can cause the same effect (queue stalling).
Synchronous data sending to a Logstash host occurs in the following way:
AwaitACK is designed as follows. For example, let's say we send 100 events in a request to Logstash:
- When AwaitACK gets called, it waits until all 100 events are acknowledged.
- It uses conn.Read(..) to read acknowledged events from Logstash. You can find this here.
There's a problem with this approach.
This approach works completely fine with a healthy Logstash. It would even work well for a slow Logstash (which would return acks at a slower rate).
But if the internals of Logstash have hit a permanent failure (e.g. one of the pipelines crashed, but the connection is still active), we get stuck in the AwaitACK loop forever, because Logstash will return 0 when we read for acknowledged events, indicating no acknowledgement.
I had a brief discussion with @jsvd, and he confirmed that Logstash can return a 0 when reading events that have been acknowledged.
This means we will always be stuck in this loop.
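To make the failure mode concrete, here is a simplified model of the wait loop described above. It is a sketch under assumptions rather than the go-lumber source: `ackSource`, `stuckLogstash`, `slowLogstash`, and the `maxReads` bound are hypothetical, and the bound exists only so the demonstration terminates (the behaviour reported in this issue corresponds to having no bound at all).

```go
package main

import "errors"

// ackSource is a hypothetical stand-in for the connection: each call reports
// how many events of the current batch have been acknowledged so far.
type ackSource interface {
	ReadACK() (uint32, error)
}

// stuckLogstash models a host whose pipeline has crashed while the connection
// stays open: every read succeeds but reports 0 acknowledged events.
type stuckLogstash struct{}

func (stuckLogstash) ReadACK() (uint32, error) { return 0, nil }

// slowLogstash models a healthy but slow host that acknowledges step more
// events on every read.
type slowLogstash struct{ acked, step uint32 }

func (s *slowLogstash) ReadACK() (uint32, error) {
	s.acked += s.step
	return s.acked, nil
}

var errNoProgress = errors.New("read budget exhausted before all events were acknowledged")

// awaitACK loops until count events are acknowledged. Without the maxReads
// check, a source that always reports 0 keeps this loop running forever.
func awaitACK(src ackSource, count uint32, maxReads int) (uint32, error) {
	var acked uint32
	for reads := 0; acked < count; reads++ {
		if reads >= maxReads {
			return acked, errNoProgress
		}
		n, err := src.ReadACK()
		if err != nil {
			return acked, err
		}
		acked = n
	}
	return acked, nil
}
```

With `&slowLogstash{step: 10}` and a batch of 100 the loop finishes after ten reads, while `stuckLogstash{}` never gets past 0 and only stops because of the artificial read budget.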