Fix input ID uniqueness #31512
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
I haven't started coding it yet, but I have a feeling it might be good to also keep an eye out for this kind of effect/behaviour: #31767
I've been looking into how to solve that from the input side of things, here are my findings.

Elastic-Agent

The kubernetes autodiscover runs on the Elastic-Agent and generates filestream input configurations that are sent to Filebeat. There is a good high-level description of how it works here. Unfortunately, there isn't a direct connection between the input configuration and the kubernetes provider. The data generated by the autodiscovery event at some point gets converted by transpiler.RenderInputs, which does not know about any input specifics. Adding code there to handle a kubernetes exception would be a very ugly hack. I believe the best fix for the Elastic-Agent case is to update the integration to include a unique ID.

Filebeat

Filebeat kubernetes autodiscovery does not currently suffer from the filestream input ID issue because under the hood it uses the container input, which is itself a wrapper around the log input. For each kubernetes container a new container input is created. If we ever migrate the container input to filestream, or use filestream directly, we would face the input ID uniqueness problem; however, on Filebeat it is manageable. The kubernetes autodiscover provider contains a config with the template used to render the container input. So, if needed, we could parse this part of the YAML and ensure there is an ID and that it's unique. One way to do that is to append a unique suffix such as the container ID.
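To make the Filebeat side concrete, here is a minimal sketch of what an autodiscover template with an explicit unique ID could look like. The variable names follow the standard kubernetes provider fields, but treat the exact layout as an assumption rather than the final fix:

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - config:
            # Hypothetical template: if the user omits "id", the provider
            # could append a unique per-container value such as the
            # container ID, making each rendered input unique.
            - type: filestream
              id: filestream-${data.kubernetes.container.id}
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
```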
> Elastic-Agent
Which integration? Agent only supports autodiscover in standalone mode. Looking through the agent autodiscover provider I can see that it is dynamically generating processors with the container metadata: https://github.com/elastic/elastic-agent/blob/a18cfadcd38fef27be0307ae5307b562ceed44fc/internal/pkg/composable/providers/kubernetes/pod.go#L314 Those processors are ultimately going to end up in the input configuration block, so there is already a path to modify the filebeat configuration based on container metadata. Can we just extend the code path used for this to also return an optional input ID in addition to the processors?
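A sketch of what the rendered input could look like if the provider returned an ID alongside the processors it already generates — the processor shape below is illustrative, not the provider's exact output:

```yaml
- type: filestream
  # Hypothetically returned by the kubernetes provider together with the
  # generated processors, instead of relying on the user to set it.
  id: filestream-${kubernetes.container.id}
  processors:
    - add_fields:
        target: kubernetes
        fields:
          container.id: ${kubernetes.container.id}
```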
Filebeat

I would create a separate issue for migrating the container input to use filestream.
Here's the original issue and PR for the agent dynamic provider variable substitution that we probably need to hook into:
Looking through the variable substitution examples we would probably need the code to add an ID template variable automatically, like:

```yaml
inputs:
  - type: filestream
    id: filestream-${kubernetes.container.id} # Implicitly added if not specified by the user
    enabled: true
    streams:
      - paths:
          - /var/log/{{local_dynamic.key}}
```

Maybe when the agent loads the agent policy for the first time in standalone mode we can check if k8s autodiscovery is in use, and then just inject the ID template variable into the policy if there isn't an ID defined already? Hopefully the autodiscovery variable substitution would just work without modification if we did that.
The Kubernetes Container Logs one. It exists in both modes. As far as I know, and have seen, the main difference between managed-by-Fleet and standalone modes is the options exposed to the user. They use the same codebase with different options configured. For example, if managed by Fleet, Elastic-Agent always assumes it's running inside the kubernetes cluster, while standalone exposes the option to configure the cluster access.
I understood that as being the processors that run on the event, not a policy processor. It adds this processor configuration, which leads to those fields being added to the event. I looked at the links you posted on #31512 (comment), but they only reinforced my understanding regarding the processors. We could have a f-2-f chat to clarify that ;)
Compared to modifying the policy when it's being loaded, I believe updating the integration would be a more robust solution. In my tests the Kubernetes Container Logs integration generates this entry in the policy:

```yaml
- id: filestream-container-logs-46cdfbd0-12ba-49b1-8006-aa501f145459
  name: kubernetes-1
  revision: 1
  type: filestream
  use_output: default
  meta:
    package:
      name: kubernetes
      version: 1.21.1
  data_stream:
    namespace: default
  streams:
    - id: >-
        filestream-kubernetes.container_logs-46cdfbd0-12ba-49b1-8006-aa501f145459
      data_stream:
        dataset: kubernetes.container_logs
        type: logs
      paths:
        - '/var/log/containers/*${kubernetes.container.id}.log'
      prospector.scanner.symlinks: true
      parsers:
        - container:
            stream: all
            format: auto
```

It's easy to know it's filestream because of the `type: filestream` field.
Tiago and I had a quick sync on this. We decided the best path is to fix the ID in the Kubernetes Container Logs integration. I'm going to write up a separate issue documenting a way to try to prevent data duplication when filestream inputs are created without IDs in general, in case this continues to be a larger issue.
PR to update the integration created: elastic/integrations#3672

I'm still seeing the duplicate-ID log messages; I believe they are due to #31767. I'll keep the integrations PR as a draft until I have an answer.
This is part of the fix for the data duplication that happens when there are multiple filestream inputs configured with the same ID. The container ID and pod name are added to the filestream input ID. The container ID ensures uniqueness and the pod name helps with debugging. Issue: elastic/beats#31512
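Based on that description, the updated integration template presumably renders a stream ID along these lines — a sketch, since the exact prefix and variable order come from the integration package, not this thread:

```yaml
streams:
  # Hypothetical rendered stream: pod name for debuggability, container ID
  # for uniqueness, per the commit message above.
  - id: >-
      filestream-container-logs-${kubernetes.pod.name}-${kubernetes.container.id}
    data_stream:
      dataset: kubernetes.container_logs
      type: logs
    paths:
      - '/var/log/containers/*${kubernetes.container.id}.log'
```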
One possible way to deal with this on the input side is #32335, but that would be complex and is not a 100% guaranteed fix.
elastic/integrations#3672 and #32309 have been merged, so this issue is resolved for now.
Hello all - I have a customer that is still seeing duplicate filebeat data from version 8.3.3 of the Elastic Agent. Here is his filebeat.yaml:

```yaml
filebeat.inputs:
  - type: filestream
```

(his indents are proper :) What could be the problem here? TiA for any input
@rstatsinger looks like the same report as #31239 (comment); can we get a separate issue or an SDH opened for tracking this so it isn't spread across multiple other issues?
@cmacknz The error is:

```json
{"log.level":"error","@timestamp":"2023-03-22T17:11:22.014Z","log.logger":"input","log.origin":{"file.name":"input-logfile/manager.go","file.line":182},"message":"filestream input with ID 'spark-executors-file-stream-e40cd871-dabd-46b5-a245-decac624262a' already exists, this will lead to data duplication, please use a different ID","service.name":"filebeat","ecs.version":"1.6.0"}
```

My filebeat configuration is:

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - condition:
            equals:
              kubernetes.namespace: "spark"
          config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
              multiline.type: pattern
              multiline.pattern: '^[[:space:]]+(at|\.{3})[[:space:]]+\b|^Caused by:|^java'
              multiline.negate: false
              multiline.match: after
            - type: log
              id: spark-executors-file-stream-${data.kubernetes.node.uid}
              paths:
                - /var/log/spark/work/*/*/stdout
                - /var/log/spark/work/*/*/stderr
processors:
  - add_cloud_metadata:
  - add_host_metadata:
output.logstash:
  hosts: ["logstash:5044"]
```

I am using the latest version, i.e. 8.6.2, of filebeat. I don't understand why this issue is closed if it still has not been resolved in the latest release. There is another similar issue: #34613
@keenborder786 your problem is that your configuration is creating multiple inputs with the same ID, specifically the ID spark-executors-file-stream-${data.kubernetes.node.uid}. It looks like ${data.kubernetes.node.uid} resolves to the same value for every pod on a node, so each matching pod generates an input with an identical ID. This doesn't immediately look like a bug in Beats, or related to the original issue that was fixed here. If you keep having problems start a thread on the discussion forums: https://discuss.elastic.co/c/elastic-stack/beats.
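One way to avoid that collision — assuming the goal is one input per pod rather than one per node — is to build the ID from a per-pod variable. A sketch, not a verified fix for this setup:

```yaml
- type: log
  # Hypothetical change: the pod UID is unique per pod, so autodiscover
  # renders a distinct ID for each matching pod instead of one ID per node.
  id: spark-executors-file-stream-${data.kubernetes.pod.uid}
  paths:
    - /var/log/spark/work/*/*/stdout
    - /var/log/spark/work/*/*/stderr
```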
Description
The Kubernetes provider dynamically creates a new stream for each container, but all of the resulting streams have the same id.

It is possible in a non-managed policy to bypass that problem by providing an id with a dynamic variable, like this:
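A minimal sketch of such an id, mirroring the snippet earlier in the thread (a reconstruction, not the issue's exact example):

```yaml
inputs:
  - type: filestream
    # A dynamic variable in the id makes each rendered input unique.
    id: filestream-${kubernetes.container.id}
    streams:
      - paths:
          - /var/log/containers/*${kubernetes.container.id}.log
```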
How should we solve that for managed solutions?
Potential solution
A fix should be done on the input side of things; we should first discuss what mechanism will be used for this.
Follow-up of: elastic/kibana#129851