
Fix input ID uniqueness #31512

Closed
jlind23 opened this issue May 4, 2022 · 16 comments
Labels: discuss (Issue needs further discussion), estimation:Week (Task that represents a week of work), QA:Needs Validation (Needs validation by the QA Team), Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team), v8.4.0

Comments

@jlind23
Collaborator

jlind23 commented May 4, 2022

Description

The Kubernetes provider dynamically creates a new stream for each container, for example for the config below, but all of the resulting streams will have the same ID:

streams:
      - id: filestream-kubernetes.container_logs
        data_stream:
          dataset: kubernetes.container_logs
          type: logs
        prospector.scanner.symlinks: true
        paths:
          - '/var/log/containers/*${kubernetes.container.id}.log'

It is possible, in a non-managed policy, to work around that problem by providing an ID with a dynamic variable, like this:

streams:
     - id: >-
         filestream-kubernetes.container_logs-${kubernetes.container.id}
       data_stream:
         dataset: kubernetes.container_logs
         type: logs
       prospector.scanner.symlinks: true
       paths:
         - '/var/log/containers/*${kubernetes.container.id}.log'

How should we solve that for managed solutions?

Potential solution

A fix should be done on the input side of things; we should first discuss what mechanism will be used for this.

Follow up of: elastic/kibana#129851

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 4, 2022
@jlind23 jlind23 added discuss Issue needs further discussion. Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels May 4, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@belimawr
Contributor

I haven't started coding it yet, but I have a feeling it might be good to also keep an eye out for this kind of effect/behaviour: #31767

@jlind23 jlind23 added the QA:Needs Validation Needs validation by the QA Team label Jun 29, 2022
@belimawr
Contributor

belimawr commented Jul 7, 2022

I've been looking into how to solve this from the input side of things; here are my findings.

Elastic-Agent

The kubernetes autodiscover runs on the Elastic-Agent and generates filestream input configurations that are sent to Filebeat. There is a good description of how it works on a high level here.

Unfortunately, there isn't a direct connection between the input configuration and the kubernetes provider. The data generated by the autodiscovery event is, at some point, converted by transpiler.RenderInputs, which does not know about any input specifics.

Trying to add any code there to handle a kubernetes exception would be a very ugly hack.

I believe the best fix for the Elastic-Agent case is to update the integration to include ${kubernetes.container.id} as part of the filestream input ID.

Filebeat

Filebeat kubernetes autodiscovery does not currently suffer from the filestream input ID issue because under the hood it uses the container input, which is itself a wrapper around the log input. For each kubernetes container a new container input is created.

If we ever migrate the container input to filestream or use filestream directly we would face the input ID uniqueness problem, however on Filebeat it is manageable.

The kubernetes autodiscover provider contains a config holding the template used to render the container input. So, if needed, we could parse this part of the YAML and ensure there is an ID and that it is unique. One way to do that is to append ${data.kubernetes.container.id} to the input ID.

@cmacknz
Member

cmacknz commented Jul 7, 2022

Elastic-Agent

I believe the best fix for the Elastic-Agent case is to update the integration to include ${kubernetes.container.id} as part of the filestream input ID.

Which integration? Agent only supports autodiscover in standalone mode.

Looking through the agent autodiscover provider I can see that it is dynamically generating processors with the container metadata: https://github.com/elastic/elastic-agent/blob/a18cfadcd38fef27be0307ae5307b562ceed44fc/internal/pkg/composable/providers/kubernetes/pod.go#L314

Those processors are ultimately going to end up in the input configuration block, so there is already a path to modify the filebeat configuration based on container metadata. Can we just extend the code path used for this to also return an optional input ID in addition to the processors?

Filebeat

If we ever migrate the container input to filestream or use filestream directly we would face the input ID uniqueness problem, however on Filebeat it is manageable.

I would create a separate issue for migrating the container input to use filestream instead of log and link it from our overall issue to migrate everything to filestream. We can deal with this problem later.

@cmacknz
Member

cmacknz commented Jul 7, 2022

Here's the original issue and PR for the agent dynamic provider variable substitution that we probably need to hook into:

@cmacknz
Member

cmacknz commented Jul 7, 2022

Looking through the variable substitution examples we would probably need the code to add an ID template variable automatically like:

inputs:
  - type: filestream
    id: filestream-${kubernetes.container.id} # Implicitly added if not specified by the user
    enabled: true
    streams:
      - paths:
          - /var/log/{{local_dynamic.key}}

Maybe when the agent loads the agent policy for the first time in standalone mode we can check if k8s autodiscovery is in use, and then just inject the ID template variable into the policy if there isn't an ID defined already?

Hopefully the autodiscovery variable substitution would just work without modification if we did that.

@belimawr
Contributor

belimawr commented Jul 8, 2022

Which integration? Agent only supports autodiscover in standalone mode.

The Kubernetes Container Logs one. It exists in both modes. As far as I know, and have seen, the main difference between managed-by-Fleet and standalone modes is the options exposed to the user. They use the same codebase with different options configured. E.g. if managed by Fleet, Elastic-Agent always assumes it's running inside the kubernetes cluster, while standalone exposes the option to configure the cluster access.

Looking through the agent autodiscover provider I can see that it is dynamically generating processors with the container metadata: https://github.com/elastic/elastic-agent/blob/a18cfadcd38fef27be0307ae5307b562ceed44fc/internal/pkg/composable/providers/kubernetes/pod.go#L314

I understood that as being the processors that run on the event, not a policy processor. It adds this processor configuration, which leads to those fields being added to the event (screenshot of the resulting event fields omitted).

I looked at the links you posted on: #31512 (comment), but they only reinforced my understanding regarding the processors. We could have a f-2-f chat to clarify that ;)

@belimawr
Contributor

belimawr commented Jul 8, 2022

Looking through the variable substitution examples we would probably need the code to add an ID template variable automatically like:

inputs:
  - type: filestream
    id: filestream-${kubernetes.container.id} # Implicitly added if not specified by the user
    enabled: true
    streams:
      - paths:
          - /var/log/{{local_dynamic.key}}

Maybe when the agent loads the agent policy for the first time in standalone mode we can check if k8s autodiscovery is in use, and then just inject the ID template variable into the policy if there isn't an ID defined already?

Hopefully the autodiscovery variable substitution would just work without modification if we did that.

Rather than modifying the policy when it's being loaded, I believe updating the integration would be a more robust solution.

In my tests the Kubernetes Container Logs integration generates this entry in the policy:

  - id: filestream-container-logs-46cdfbd0-12ba-49b1-8006-aa501f145459
    name: kubernetes-1
    revision: 1
    type: filestream
    use_output: default
    meta:
      package:
        name: kubernetes
        version: 1.21.1
    data_stream:
      namespace: default
    streams:
      - id: >-
          filestream-kubernetes.container_logs-46cdfbd0-12ba-49b1-8006-aa501f145459
        data_stream:
          dataset: kubernetes.container_logs
          type: logs
        paths:
          - '/var/log/containers/*${kubernetes.container.id}.log'
        prospector.scanner.symlinks: true
        parsers:
          - container:
              stream: all
              format: auto

It's easy to know it's filestream because of type: filestream, but it's hard to know it comes from kubernetes autodiscover. We do have meta.package.name: kubernetes, but that's not reliable and is specific to this integration.

@cmacknz
Member

cmacknz commented Jul 8, 2022

Tiago and I had a quick sync on this. We decided the best path is to fix the ID in the Kubernetes Container Logs integration.

I'm going to write up a separate issue documenting a way to try and prevent data duplication when filestream inputs are created without IDs in general if this continues to be a larger issue.

@belimawr
Contributor

belimawr commented Jul 8, 2022

PR to update the integration created: elastic/integrations#3672

I'm still seeing log messages like

filestream input with ID 'kubernetes-container-logs-d9ba19b3cc48c2231bd648f3f573e984b6c332093f250a6facdb92090bfaa4b9-' already exists, this will lead to data duplication, please use a different ID

I believe it is due to #31767. I'll keep the integrations PR as a draft until I have an answer.

belimawr added a commit to belimawr/integrations that referenced this issue Jul 11, 2022
This is part of the fix for data duplication that happens when there
are multiple filestream inputs configured with the same ID.

The container ID and pod name are added to the filestream input
ID. The container ID ensures uniqueness and the pod name helps with
debugging.

Issue: elastic/beats#31512
@cmacknz
Member

cmacknz commented Jul 13, 2022

One possible way to deal with this on the input side is #32335, but that would be complex and is not a 100% guaranteed fix.

gizas pushed a commit to elastic/integrations that referenced this issue Jul 14, 2022
This is part of the fix for data duplication that happens when there
are multiple filestream inputs configured with the same ID.

The container ID and pod name are added to the filestream input
ID. The container ID ensures uniqueness and the pod name helps with
debugging.

Issue: elastic/beats#31512
@belimawr
Contributor

elastic/integrations#3672 and #32309 have been merged, so this issue is resolved for now.

@rstatsinger

rstatsinger commented Sep 19, 2022

Hello all - I have a customer that is still seeing duplicate filebeat data from version 8.3.3 of the Elastic Agent. Here is his filebeat.yaml:

filebeat.inputs:
  - type: filestream
    enabled: true
    id: "ecs-app-logging"
    paths:
      - C:\MVD2\Logs\ECS*.log
    json.keys_under_root: true
    json.overwrite_keys: true
    json.add_error_key: true
    json.expand_keys: true
(his indents are proper :)

What could be the problem here? TiA for any input

@cmacknz
Member

cmacknz commented Sep 20, 2022

@rstatsinger looks like the same report as #31239 (comment), can we get a separate issue or an SDH opened for tracking this so it isn't spread across multiple other issues?

@keenborder786

keenborder786 commented Mar 22, 2023

@cmacknz
I am also facing a similar error.

The error is:

{"log.level":"error","@timestamp":"2023-03-22T17:11:22.014Z","log.logger":"input","log.origin":{"file.name":"input-logfile/manager.go","file.line":182},"message":"filestream input with ID 'spark-executors-file-stream-e40cd871-dabd-46b5-a245-decac624262a' already exists, this will lead to data duplication, please use a different ID","service.name":"filebeat","ecs.version":"1.6.0"}

My filebeat configuration is:

  filebeat.autodiscover:
    providers:
    - type: kubernetes
      templates:
        - condition:
            equals:
              kubernetes.namespace: "spark"
          config:
            - type: container
              paths:
              - /var/log/containers/*-${data.kubernetes.container.id}.log
              multiline.type: pattern
              multiline.pattern: '^[[:space:]]+(at|\.{3})[[:space:]]+\b|^Caused by:|^java'
              multiline.negate: false
              multiline.match: after
            - type: log
              id: spark-executors-file-stream-${data.kubernetes.node.uid}
              paths:
              - /var/log/spark/work/*/*/stdout
              - /var/log/spark/work/*/*/stderr
  processors:
    - add_cloud_metadata:
    - add_host_metadata:
  output.logstash:
    hosts: ["logstash:5044"]

I am using the latest version, i.e. 8.6.2, of filebeat. I don't understand why this issue is closed if it still has not been resolved in the latest release.

There is another similar issue: #34613

@cmacknz
Member

cmacknz commented Mar 23, 2023

@keenborder786 your problem is that your configuration is creating multiple inputs with the same ID, specifically the ID spark-executors-file-stream-e40cd871-dabd-46b5-a245-decac624262a

It looks like e40cd871-dabd-46b5-a245-decac624262a is the node ID, are there multiple instances of this container running on a single node? If so you need to use the pod or container ID instead of the node ID when creating the input ID.

This doesn't immediately look like a bug in Beats, or related to the original issue that was fixed here. If you keep having problems, start a thread on the discussion forums: https://discuss.elastic.co/c/elastic-stack/beats.
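For reference, one way to make that second input's ID unique per pod would be to key it on the pod UID rather than the node UID. A hedged sketch of the corrected template (assuming the autodiscover event exposes data.kubernetes.pod.uid, as it does for the other data.kubernetes.* fields used in this config):

```yaml
- type: log
  # pod UID instead of node UID, so concurrent pods on one node get distinct IDs
  id: spark-executors-file-stream-${data.kubernetes.pod.uid}
  paths:
    - /var/log/spark/work/*/*/stdout
    - /var/log/spark/work/*/*/stderr
```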
