Allow queue configuration to be specified under the output type #35615
Comments
Pinging @elastic/elastic-agent (Team:Elastic-Agent) |
Relates: elastic/elastic-agent#284 |
Inability to configure the underlying Beat queue parameters is currently one of the largest blockers for scaling the agent. High-volume use cases need a larger queue to achieve the required throughput, and we prevent this right now. Implementing this will unblock those use cases, although it will result in some potentially unintuitive behavior until we have the shipper: the total queue size will not be what is in the output configuration but will instead be proportional to the number of running Beat inputs. |
Some relevant comments on elastic/kibana#158699 (comment). When we do this, it will allow agent to configure both the memory and disk queue. I think we want to explicitly prevent configuring the agent with a disk queue until we support it properly, for example by correctly sharing the queue contents across upgrades by default. |
The easiest way to do this is to just not allow the disk queue to be configured in output settings yet, otherwise we'll need the outputs to special-case their config parsing based on whether they're running under Agent, which ideally they shouldn't know or care about. This sounds like a reasonable limitation to me -- there's no reason for us to advertise this new way of configuring the queue (to Beats users, at least), it just lets Beats work with Agent's unit-based configs, so a lack of disk queue support is unlikely to bite anyone. Any concerns about this approach? |
No, the simple approach is best here. |
So the Fleet YAML syntax would be something like?
or? |
Likely yes, the syntax will be confirmed once we have finished the implementation. |
Hmm, something seems to be off
|
got it working with the correct yaml syntax. |
@leehinman as discussed, I assigned you this issue for the next sprint. |
Can we ensure that |
This would allow configuring every parameter of the Beats memory queue through an agent policy: https://www.elastic.co/guide/en/beats/filebeat/current/configuring-internal-queue.html#configuration-internal-queue-memory |
That includes both This would technically also allow configuring the Beats disk queue, but we will likely disallow it initially since this would create one disk queue per unique input type with the current architecture in the agent policy which is probably not what most people would expect. |
- add support for `idle_connection_timeout` for ES output - add support for queue settings under output Closes elastic#35615
When we merge this let's add an agent changelog entry as well explaining where this can be set. |
This change should probably have a docs issue associated with it to make sure that this functionality is explained properly in the agent documentation. I don't know that we need to document this for standalone Beats because it doesn't enable anything that wasn't already possible. |
@kilfoyle FYI: the configuration items are already described in this section: https://www.elastic.co/guide/en/beats/filebeat/current/configuring-internal-queue.html We are just exposing them in the Agent output (via the advanced YAML box). The possible settings are now: output.elasticsearch: (any output) |
@cmacknz if we do this, we can claim disk queue support also. Fair enough that this queue is in every beat and not what user expects but then again so is the internal queue (until this change). The main usage for the spooling is resiliency when we are disconnected, is there a reason why we can't do spooling on the input rather than once on output? My vote is to enable this while we are at it. |
What do you mean by:
? The Agent is quite often upgraded, most of the time together with the whole stack. Can you please better describe the issue with the queue contents after upgrade? |
Thanks Craig and Nima for the heads up! I'll look after this docs issue in the upcoming sprint. |
See elastic/elastic-agent#3490 which explains what needs to happen for the Elastic Agent to support the Beats disk queue. This requires some understanding of the internal architecture of the agent. See https://github.com/elastic/elastic-agent/blob/main/docs/architecture.md. The disk queue will be preserved between upgrades, but we need to do it correctly and without having to copy it since it can be quite large. We need to special case this in the agent to make sure it happens properly. |
Do we really not want to copy it? What happens when it's corrupted on upgrade and now rollback fails because we didn't copy it and the new version corrupted it? |
The primary reason is that it can be GBs in size; the default size is 10 GB. I don't want us to block the completion of upgrades on copying a possibly 10 GB file. Copying a file that large is also likely to run into disk space constraints, since the system needs to be able to store it twice. We need to handle the corrupted disk queue case, but I don't think we can solve it by copying on upgrade: even though it is conceptually the simplest option, I don't think it is practical. |
FYI: When I was using Graylog we had 300 GB journal files for queue buffering. This was more of a safety net for those moments when we had really high (>100k EPS) load or when we had to perform administration tasks that brought our Elasticsearch cluster offline. When the 300 GB was reaching its full (>90%) capacity, we were declaring that node dead to our LB. My goal is to get the queue status from the Agent status (when the shipper is finalized) or directly from the underlying Filebeat via an HTTP endpoint. |
Beats queue configurations are specified at the top level of the config, grouped by queue type, e.g. `queue.mem` or `queue.disk`. Recent work to support the shipper involved exposing queue configuration hooks to the output itself during initialization. We should follow up on this by moving the queue configuration entirely into the output block, for example: or
This settings block, when present, should use the same structure and behavior as the existing Beats queue settings, and should replace the root-level configuration (though we will still support root-level as a fallback).
Outputs that support this will automatically gain the ability to specify queue configurations through Agent, both with and without the shipper enabled. (However without the shipper it will apply these settings separately in each input process, duplicating the queue for each input type, so we should be careful how we communicate this option.)
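As a rough sketch of the layout proposed above, the fragment below shows the existing root-level form next to the same settings nested under an output. The key names (`events`, `flush.min_events`, `flush.timeout`) are the documented memory queue settings; the host and all values are illustrative assumptions, and the nesting under `output.elasticsearch` is the proposal in this issue, not confirmed final syntax.

```yaml
# Existing root-level form (documented Beats syntax).
# Under this proposal it would remain as a fallback.
queue.mem:
  events: 4096
  flush.min_events: 512
  flush.timeout: 5s

# Proposed form: the same block nested under the output.
# Assumption: structure and behavior mirror the root-level settings.
output.elasticsearch:
  hosts: ["https://localhost:9200"]   # placeholder host
  queue.mem:
    events: 4096
    flush.min_events: 512
    flush.timeout: 5s
```

Without the shipper, each input process would apply these settings to its own queue, so the effective total scales with the number of running input types rather than matching the single value written here.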
In addition, please add the `IdleConnectionTimeout` setting for the Elasticsearch Idle Timeout to the ES Output settings.
beats/libbeat/esleg/eslegclient/connection.go
Line 139 in ad64f28
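For illustration, the requested timeout might be set as follows. The key name `idle_connection_timeout` is taken from the commit message referenced in this thread; the host and the value are assumptions chosen only for the example.

```yaml
output.elasticsearch:
  hosts: ["https://localhost:9200"]   # placeholder host
  # Illustrative value: how long an idle connection is kept before closing.
  idle_connection_timeout: 3s
```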