
[libbeat][outputs/logstash] - The logstash output waits for acknowledgements forever #41534

Open
VihasMakwana opened this issue Nov 6, 2024 · 7 comments · May be fixed by #41960
Assignees
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@VihasMakwana
Contributor

VihasMakwana commented Nov 6, 2024

Note: This is different from elastic/go-lumber#35, but it can cause the same effect (queue stalling)

Synchronous data sending to the Logstash host works as follows:

func (c *SyncClient) Send(data []interface{}) (int, error) {
	if err := c.cl.Send(data); err != nil {
		return 0, err
	}

	seq, err := c.cl.AwaitACK(uint32(len(data)))
	return int(seq), err
}
  • It first sends the data.
  • Then it waits for the ACK, synchronously.

The AwaitACK is designed as follows:

func (c *Client) AwaitACK(count uint32) (uint32, error) {
	var ackSeq uint32
	var err error

	// read until all ACKs
	for ackSeq < count {
		ackSeq, err = c.ReceiveACK()
		if err != nil {
			return ackSeq, err
		}
	}
	...
	return ackSeq, nil
}

For example, let's say we send 100 events in a single request to Logstash:

  • The client sends 100 events to Logstash.
  • AwaitACK gets called and waits until all 100 events are acknowledged.
    • Internally, it calls conn.Read(..) to read the acknowledgements from Logstash. You can find this here.

There's a problem with this approach.
It works completely fine with a healthy Logstash. It would even work well with a slow Logstash (which simply returns ACKs at a slower rate).
But if Logstash has hit a permanent internal failure (e.g. one of its pipelines crashed while the connection is still active), we get stuck in the AwaitACK loop forever, because Logstash returns 0 when we read the acknowledgements, indicating that nothing has been acknowledged.

Like this,

func (c *Client) AwaitACK(count uint32) (uint32, error) {
	var ackSeq uint32
	var err error

	// read until all ACKs
	for ackSeq < count {			// ackSeq will always be 0 if logstash is facing some issues,
		ackSeq, err = c.ReceiveACK()	// indicating no acknowledgements.
		if err != nil {
			return ackSeq, err
		}
	}

	...
	return ackSeq, nil
}

I had a brief discussion with @jsvd, and he confirmed that Logstash can return 0 when we read for acknowledged events.
This means we will always be stuck in this loop.
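One way to bound this at the connection level (a minimal sketch, not the actual go-lumber API; receiveACK here stands in for the client's ACK-reading logic) is to set an absolute read deadline on the connection for the whole batch, so a stalled-but-open Logstash surfaces as a timeout error instead of an endless loop:

package logstash

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// awaitACKWithDeadline mirrors AwaitACK, but sets an absolute read deadline on
// the connection for the whole batch. Because the deadline is absolute, even a
// stream of "0 events acknowledged" replies cannot keep the loop alive past
// the timeout.
func awaitACKWithDeadline(conn net.Conn, receiveACK func() (uint32, error), count uint32, timeout time.Duration) (uint32, error) {
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return 0, err
	}
	defer conn.SetReadDeadline(time.Time{}) // clear the deadline once the batch is done

	var ackSeq uint32
	for ackSeq < count {
		seq, err := receiveACK() // ultimately calls conn.Read, so it honors the deadline
		if err != nil {
			var nerr net.Error
			if errors.As(err, &nerr) && nerr.Timeout() {
				return ackSeq, fmt.Errorf("logstash acknowledged %d/%d events, no further progress within %s", ackSeq, count, timeout)
			}
			return ackSeq, err
		}
		if seq > ackSeq {
			ackSeq = seq
		}
	}
	return ackSeq, nil
}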

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 6, 2024
@VihasMakwana
Contributor Author

I managed to reproduce this locally and can confirm it always returns 0 until you fix the Logstash issue and restart the failed Logstash.

@VihasMakwana VihasMakwana added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 6, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 6, 2024
@cmacknz
Member

cmacknz commented Nov 6, 2024

I think fundamentally we need a timeout for how long we are willing to wait for acknowledgements to come back. It can be quite generous given this is a rare condition that requires Logstash to be alive and responsive to network requests, but otherwise unable to make progress. Something like 5 minutes. It needs to be configurable and there needs to be an obvious error level log message indicating what the problem is when this happens.

If a Logstash instance is stuck in this situation, this approach will keep blocking individual batches on it for the length of the timeout until the problem is solved, but no batch will be left unsent forever.
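A rough sketch of that suggestion around the existing synchronous send path (lumberClient, sendWithACKTimeout, and the error message are illustrative names, not existing libbeat code; a real fix would make the timeout configurable and also close the connection on timeout):

package logstash

import (
	"fmt"
	"log"
	"time"
)

// lumberClient captures just the two calls shown earlier in this issue; in the
// real code this is the go-lumber sync client wrapped by SyncClient.
type lumberClient interface {
	Send(data []interface{}) error
	AwaitACK(count uint32) (uint32, error)
}

// sendWithACKTimeout sends a batch and waits for its acknowledgement, but only
// up to ackTimeout. On timeout it logs an error-level message and returns an
// error so the caller can retry the batch or tear down the connection.
func sendWithACKTimeout(cl lumberClient, data []interface{}, ackTimeout time.Duration) (int, error) {
	if err := cl.Send(data); err != nil {
		return 0, err
	}

	type result struct {
		seq uint32
		err error
	}
	done := make(chan result, 1)
	go func() {
		seq, err := cl.AwaitACK(uint32(len(data)))
		done <- result{seq, err}
	}()

	select {
	case r := <-done:
		return int(r.seq), r.err
	case <-time.After(ackTimeout): // e.g. a generous 5m default
		// The abandoned AwaitACK goroutine stays blocked until the connection is
		// closed, so a real implementation would also close the connection here.
		log.Printf("ERROR: no ACK from Logstash for %d events within %s; the host may be alive but deadlocked",
			len(data), ackTimeout)
		return 0, fmt.Errorf("timed out after %s waiting for Logstash ACK", ackTimeout)
	}
}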

@cmacknz
Member

cmacknz commented Nov 6, 2024

There are requests for an unconditional connection TTL, but I don't think this actually helps here.

@VihasMakwana
Contributor Author

VihasMakwana commented Nov 7, 2024

I think fundamentally we need a timeout for how long we are willing to wait for acknowledgements to come back. It can be quite generous given this is a rare condition that requires Logstash to be alive and responsive to network requests, but otherwise unable to make progress. Something like 5 minutes. It needs to be configurable and there needs to be an obvious error level log message indicating what the problem is when this happens.

I agree.

What would be the behaviour for future batches? If we have timed out waiting for ACKs on a given host, should we mark it as "unhealthy", avoid sending new events to it for some time (perhaps another configuration option), and try to re-establish the connection later?
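Purely illustrative (none of these names exist in libbeat, and nothing here was decided in this thread): one shape the "unhealthy host" idea above could take is a small cooldown tracker consulted before picking a host for the next batch:

package logstash

import (
	"sync"
	"time"
)

// hostHealth tracks hosts that recently timed out waiting for ACKs. A host is
// skipped until its cooldown expires, after which the output may try to
// reconnect. The cooldown duration would presumably be another config option.
type hostHealth struct {
	mu       sync.Mutex
	retryAt  map[string]time.Time
	cooldown time.Duration
}

func newHostHealth(cooldown time.Duration) *hostHealth {
	return &hostHealth{retryAt: make(map[string]time.Time), cooldown: cooldown}
}

// markUnhealthy records an ACK timeout on host; the host is avoided until the
// cooldown has elapsed.
func (h *hostHealth) markUnhealthy(host string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.retryAt[host] = time.Now().Add(h.cooldown)
}

// usable reports whether new batches may be sent to host right now.
// Hosts that never timed out have a zero retryAt and are always usable.
func (h *hostHealth) usable(host string) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return time.Now().After(h.retryAt[host])
}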

@jlind23
Collaborator

jlind23 commented Dec 9, 2024

@faec any update on this?

@faec faec linked a pull request Dec 10, 2024 that will close this issue
@faec
Contributor

faec commented Dec 13, 2024

After discussion with @amitkanfer and @cmacknz and the team, the current plan is a compromise: rather than add a new retry mode to the full pipeline, the Logstash output itself will track when its batches may have stopped making progress and log an error informing the user that the upstream hosts are likely crashed.

Disadvantages:

  • This will not actually route around a presumed-deadlocked host, only inform the user of it.

Advantages:

  • Effective on all Logstash configurations instead of requiring manual configuration of a new field (the configurable field only worked for users who already knew their Logstash hosts were silently deadlocked. The error logs instead drive users towards the preferred solution, which is to upgrade to a version of Logstash that doesn't leave a deadlocked connection open when a host crashes.)
  • Reports errors even if load balancing and/or multiple hosts are not enabled.
  • Requires no changes to the pipeline itself (parallel retry required special concurrency cases that were only used by the Logstash output).
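A minimal sketch of that plan, assuming a small watchdog inside the Logstash output (this is not the implementation in #41960; the names and the window are illustrative): track when ACKs last advanced and, if batches are outstanding with no progress for a generous window, emit an error-level log identifying the likely-deadlocked host.

package logstash

import (
	"log"
	"sync"
	"time"
)

type progressWatchdog struct {
	mu          sync.Mutex
	lastACK     time.Time     // last time any event was acknowledged
	outstanding int           // events sent but not yet acknowledged
	window      time.Duration // how long "no progress" is tolerated, e.g. 5m
	host        string
}

func (w *progressWatchdog) onSend(n int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.outstanding == 0 {
		w.lastACK = time.Now() // start the clock when a batch first becomes outstanding
	}
	w.outstanding += n
}

func (w *progressWatchdog) onACK(n int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.outstanding -= n
	w.lastACK = time.Now()
}

// check is meant to be called periodically, e.g. from a ticker goroutine.
func (w *progressWatchdog) check() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.outstanding > 0 && time.Since(w.lastACK) > w.window {
		log.Printf("ERROR: %d events sent to %s have not been acknowledged for %s; "+
			"the Logstash pipeline may have crashed while leaving the connection open",
			w.outstanding, w.host, time.Since(w.lastACK).Round(time.Second))
	}
}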
