
[libbeat][outputs/logstash] - The logstash output waits for acknowledgements forever #41534

Open
VihasMakwana opened this issue Nov 6, 2024 · 7 comments · May be fixed by #41960
Assignees
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@VihasMakwana
Contributor

VihasMakwana commented Nov 6, 2024

Note: This is different from elastic/go-lumber#35, but it can cause the same effect (queue stalling)

Synchronous data sending to the Logstash host works as follows:

func (c *SyncClient) Send(data []interface{}) (int, error) {
	if err := c.cl.Send(data); err != nil {
		return 0, err
	}

	seq, err := c.cl.AwaitACK(uint32(len(data)))
	return int(seq), err
}
  • It first sends the data.
  • Then it waits for the ACK, synchronously.

The AwaitACK is designed as follows:

func (c *Client) AwaitACK(count uint32) (uint32, error) {
	var ackSeq uint32
	var err error

	// read until all ACKs
	for ackSeq < count {
		ackSeq, err = c.ReceiveACK()
		if err != nil {
			return ackSeq, err
		}
	}
	...
	return ackSeq, nil
}

For example, let's say we send 100 events in a single request to Logstash:

  • The client sends 100 events to Logstash.
  • AwaitACK gets called and waits until all 100 events are acknowledged.
    • Internally, it calls conn.Read(..) to read the acknowledgements from Logstash. You can find this here.

There's a problem with this approach.
It works completely fine with a healthy Logstash. It would even work well with a slow Logstash (which simply returns ACKs at a slower rate).
But if Logstash has hit a permanent internal failure (e.g. one of its pipelines crashed while the connection is still active), we get stuck in the AwaitACK loop forever, because Logstash returns 0 when we read the acknowledgements, indicating that nothing has been acknowledged.

Like this,

func (c *Client) AwaitACK(count uint32) (uint32, error) {
	var ackSeq uint32
	var err error

	// read until all ACKs
	for ackSeq < count {			// ackSeq will always be 0 if logstash is facing some issues,
		ackSeq, err = c.ReceiveACK()	// indicating no acknowledgements.
		if err != nil {
			return ackSeq, err
		}
	}

	...
	return ackSeq, nil
}

I had a brief discussion with @jsvd, and he confirmed that Logstash can return 0 when we read for acknowledged events.
This means we will always be stuck in this loop.
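One way to bound this at the connection level (a minimal sketch, not the actual go-lumber API; receiveACK here stands in for the client's ACK-reading logic) is to set an absolute read deadline on the connection for the whole batch, so a stalled-but-open Logstash surfaces as a timeout error instead of an endless loop:

package logstash

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// awaitACKWithDeadline mirrors AwaitACK, but sets an absolute read deadline on
// the connection for the whole batch. Because the deadline is absolute, even a
// stream of "0 events acknowledged" replies cannot keep the loop alive past
// the timeout.
func awaitACKWithDeadline(conn net.Conn, receiveACK func() (uint32, error), count uint32, timeout time.Duration) (uint32, error) {
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return 0, err
	}
	defer conn.SetReadDeadline(time.Time{}) // clear the deadline once the batch is done

	var ackSeq uint32
	for ackSeq < count {
		seq, err := receiveACK() // ultimately calls conn.Read, so it honors the deadline
		if err != nil {
			var nerr net.Error
			if errors.As(err, &nerr) && nerr.Timeout() {
				return ackSeq, fmt.Errorf("logstash acknowledged %d/%d events, no further progress within %s", ackSeq, count, timeout)
			}
			return ackSeq, err
		}
		if seq > ackSeq {
			ackSeq = seq
		}
	}
	return ackSeq, nil
}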

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 6, 2024
@VihasMakwana
Contributor Author

I managed to reproduce this locally and can confirm it always returns 0 until you fix the Logstash issue and restart the failed Logstash.

@VihasMakwana VihasMakwana added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 6, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 6, 2024
@cmacknz
Member

cmacknz commented Nov 6, 2024

I think fundamentally we need a timeout for how long we are willing to wait for acknowledgements to come back. It can be quite generous given this is a rare condition that requires Logstash to be alive and responsive to network requests, but otherwise unable to make progress. Something like 5 minutes. It needs to be configurable and there needs to be an obvious error level log message indicating what the problem is when this happens.

If a Logstash instance is stuck in this situation, this approach will keep blocking individual batches on it for the length of the timeout until the problem is solved, but no batch will be left unsent forever.
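A rough sketch of that suggestion around the existing synchronous send path (lumberClient, sendWithACKTimeout, and the error message are illustrative names, not existing libbeat code; a real fix would make the timeout configurable and also close the connection on timeout):

package logstash

import (
	"fmt"
	"log"
	"time"
)

// lumberClient captures just the two calls shown earlier in this issue; in the
// real code this is the go-lumber sync client wrapped by SyncClient.
type lumberClient interface {
	Send(data []interface{}) error
	AwaitACK(count uint32) (uint32, error)
}

// sendWithACKTimeout sends a batch and waits for its acknowledgement, but only
// up to ackTimeout. On timeout it logs an error-level message and returns an
// error so the caller can retry the batch or tear down the connection.
func sendWithACKTimeout(cl lumberClient, data []interface{}, ackTimeout time.Duration) (int, error) {
	if err := cl.Send(data); err != nil {
		return 0, err
	}

	type result struct {
		seq uint32
		err error
	}
	done := make(chan result, 1)
	go func() {
		seq, err := cl.AwaitACK(uint32(len(data)))
		done <- result{seq, err}
	}()

	select {
	case r := <-done:
		return int(r.seq), r.err
	case <-time.After(ackTimeout): // e.g. a generous 5m default
		// The abandoned AwaitACK goroutine stays blocked until the connection is
		// closed, so a real implementation would also close the connection here.
		log.Printf("ERROR: no ACK from Logstash for %d events within %s; the host may be alive but deadlocked",
			len(data), ackTimeout)
		return 0, fmt.Errorf("timed out after %s waiting for Logstash ACK", ackTimeout)
	}
}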

@cmacknz
Member

cmacknz commented Nov 6, 2024

There are requests for an unconditional connection TTL, but I don't think this actually helps here.

@VihasMakwana
Contributor Author

VihasMakwana commented Nov 7, 2024

I think fundamentally we need a timeout for how long we are willing to wait for acknowledgements to come back. It can be quite generous given this is a rare condition that requires Logstash to be alive and responsive to network requests, but otherwise unable to make progress. Something like 5 minutes. It needs to be configurable and there needs to be an obvious error level log message indicating what the problem is when this happens.

I agree.

What would be the behaviour for future batches? If we have timed out waiting for ACKs on a given host, should we mark it as "unhealthy", avoid sending new events to it for some time (perhaps another configuration option), and try to re-establish the connection later?
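Purely illustrative (none of these names exist in libbeat, and nothing here was decided in this thread): one shape the "unhealthy host" idea above could take is a small cooldown tracker consulted before picking a host for the next batch:

package logstash

import (
	"sync"
	"time"
)

// hostHealth tracks hosts that recently timed out waiting for ACKs. A host is
// skipped until its cooldown expires, after which the output may try to
// reconnect. The cooldown duration would presumably be another config option.
type hostHealth struct {
	mu       sync.Mutex
	retryAt  map[string]time.Time
	cooldown time.Duration
}

func newHostHealth(cooldown time.Duration) *hostHealth {
	return &hostHealth{retryAt: make(map[string]time.Time), cooldown: cooldown}
}

// markUnhealthy records an ACK timeout on host; the host is avoided until the
// cooldown has elapsed.
func (h *hostHealth) markUnhealthy(host string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.retryAt[host] = time.Now().Add(h.cooldown)
}

// usable reports whether new batches may be sent to host right now.
// Hosts that never timed out have a zero retryAt and are always usable.
func (h *hostHealth) usable(host string) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return time.Now().After(h.retryAt[host])
}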

@jlind23
Collaborator

jlind23 commented Dec 9, 2024

@faec any update on this?

@faec faec linked a pull request Dec 10, 2024 that will close this issue
@faec
Contributor

faec commented Dec 13, 2024

After discussion with @amitkanfer and @cmacknz and the team, the current plan is a compromise: rather than add a new retry mode to the full pipeline, the Logstash output itself will track when its batches may have stopped making progress and log an error informing the user that the upstream hosts are likely crashed.

Disadvantages:

  • This will not actually route around a presumed-deadlocked host, only inform the user of it.

Advantages:

  • Effective on all Logstash configurations instead of requiring manual configuration of a new field (the configurable field only worked for users who already knew their Logstash hosts were silently deadlocked. The error logs instead drive users towards the preferred solution, which is to upgrade to a version of Logstash that doesn't leave a deadlocked connection open when a host crashes.)
  • Reports errors even if load balancing and/or multiple hosts are not enabled.
  • Requires no changes to the pipeline itself (parallel retry required special concurrency cases that were only used by the Logstash output).
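A minimal sketch of that plan, assuming a small watchdog inside the Logstash output (this is not the implementation in #41960; the names and the window are illustrative): track when ACKs last advanced and, if batches are outstanding with no progress for a generous window, emit an error-level log identifying the likely-deadlocked host.

package logstash

import (
	"log"
	"sync"
	"time"
)

type progressWatchdog struct {
	mu          sync.Mutex
	lastACK     time.Time     // last time any event was acknowledged
	outstanding int           // events sent but not yet acknowledged
	window      time.Duration // how long "no progress" is tolerated, e.g. 5m
	host        string
}

func (w *progressWatchdog) onSend(n int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.outstanding == 0 {
		w.lastACK = time.Now() // start the clock when a batch first becomes outstanding
	}
	w.outstanding += n
}

func (w *progressWatchdog) onACK(n int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.outstanding -= n
	w.lastACK = time.Now()
}

// check is meant to be called periodically, e.g. from a ticker goroutine.
func (w *progressWatchdog) check() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.outstanding > 0 && time.Since(w.lastACK) > w.window {
		log.Printf("ERROR: %d events sent to %s have not been acknowledged for %s; "+
			"the Logstash pipeline may have crashed while leaving the connection open",
			w.outstanding, w.host, time.Since(w.lastACK).Round(time.Second))
	}
}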
