
Fix input ID uniqueness #31512

Closed
jlind23 opened this issue May 4, 2022 · 16 comments
Labels: discuss (Issue needs further discussion), estimation:Week (Task that represents a week of work), QA:Needs Validation (Needs validation by the QA Team), Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team), v8.4.0

Comments

@jlind23
Collaborator

jlind23 commented May 4, 2022

Description

The Kubernetes provider dynamically creates a new stream for each container, for example for the config below, but all of the resulting streams will have the same ID:

streams:
      - id: filestream-kubernetes.container_logs
        data_stream:
          dataset: kubernetes.container_logs
          type: logs
        prospector.scanner.symlinks: true
        paths:
          - '/var/log/containers/*${kubernetes.container.id}.log'

It is possible, in a non-managed policy, to work around that problem by providing an ID with a dynamic variable, like this:

streams:
     - id: >-
         filestream-kubernetes.container_logs-${kubernetes.container.id}
       data_stream:
         dataset: kubernetes.container_logs
         type: logs
       prospector.scanner.symlinks: true
       paths:
         - '/var/log/containers/*${kubernetes.container.id}.log'

How should we solve that for managed solutions?

Potential solution

A fix should be done on the input side of things; we should first discuss what mechanism will be used for this.

Follow up of: elastic/kibana#129851

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 4, 2022
@jlind23 jlind23 added discuss Issue needs further discussion. Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels May 4, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@belimawr
Contributor

I haven't started coding it yet, but I have a feeling it might be good to also keep an eye out for this kind of effect/behaviour: #31767

@jlind23 jlind23 added the QA:Needs Validation Needs validation by the QA Team label Jun 29, 2022
@belimawr
Contributor

belimawr commented Jul 7, 2022

I've been looking into how to solve this from the input side of things; here are my findings.

Elastic-Agent

The kubernetes autodiscover runs on the Elastic-Agent and generates filestream input configurations that are sent to Filebeat. There is a good description of how it works on a high level here.

Unfortunately, there isn't a direct connection between the input configuration and the kubernetes provider. The data generated by the autodiscovery event is, at some point, converted by transpiler.RenderInputs, which does not know about any input specifics.

Trying to add any code there to handle a kubernetes exception would be a very ugly hack.

I believe the best fix for the Elastic-Agent case is to update the integration to include ${kubernetes.container.id} as part of the filestream input ID.

Filebeat

Filebeat kubernetes autodiscovery does not currently suffer from the filestream input ID issue because under the hood it uses the container input, which is itself a wrapper around the log input. For each kubernetes container a new container input is created.

If we ever migrate the container input to filestream or use filestream directly we would face the input ID uniqueness problem, however on Filebeat it is manageable.

The kubernetes autodiscover provider contains a config holding the template used to render the container input. So, if needed, we could parse this part of the YAML and ensure there is an ID and that it is unique. One way to do that is to append ${data.kubernetes.container.id} to the input ID.

@cmacknz
Member

cmacknz commented Jul 7, 2022

Elastic-Agent

I believe the best fix for the Elastic-Agent case is to update the integration to include ${kubernetes.container.id} as part of the filestream input ID.

Which integration? Agent only supports autodiscover in standalone mode.

Looking through the agent autodiscover provider I can see that it is dynamically generating processors with the container metadata: https://github.com/elastic/elastic-agent/blob/a18cfadcd38fef27be0307ae5307b562ceed44fc/internal/pkg/composable/providers/kubernetes/pod.go#L314

Those processors are ultimately going to end up in the input configuration block, so there is already a path to modify the filebeat configuration based on container metadata. Can we just extend the code path used for this to also return an optional input ID in addition to the processors?

Filebeat

If we ever migrate the container input to filestream or use filestream directly we would face the input ID uniqueness problem, however on Filebeat it is manageable.

I would create a separate issue for migrating the container input to use filestream instead of log and link it from our overall issue to migrate everything to filestream. We can deal with this problem later.

@cmacknz
Member

cmacknz commented Jul 7, 2022

Here's the original issue and PR for the agent dynamic provider variable substitution that we probably need to hook into:

@cmacknz
Member

cmacknz commented Jul 7, 2022

Looking through the variable substitution examples we would probably need the code to add an ID template variable automatically like:

inputs:
  - type: filestream
    id: filestream-${kubernetes.container.id} # Implicitly added if not specified by the user
    enabled: true
    streams:
      - paths:
          - /var/log/{{local_dynamic.key}}

Maybe when the agent loads the agent policy for the first time in standalone mode we can check if k8s autodiscovery is in use, and then just inject the ID template variable into the policy if there isn't an ID defined already?

Hopefully the autodiscovery variable substitution would just work without modification if we did that.

@belimawr
Contributor

belimawr commented Jul 8, 2022

Which integration? Agent only supports autodiscover in standalone mode.

The Kubernetes Container Logs one. It exists in both modes. As far as I know, and have seen, the main difference between managed-by-Fleet and standalone modes is the options exposed to the user. They use the same codebase with different options configured. E.g. if managed by Fleet, Elastic-Agent always assumes it's running inside the kubernetes cluster, while standalone exposes the option to configure the cluster access.

Looking through the agent autodiscover provider I can see that it is dynamically generating processors with the container metadata: https://github.com/elastic/elastic-agent/blob/a18cfadcd38fef27be0307ae5307b562ceed44fc/internal/pkg/composable/providers/kubernetes/pod.go#L314

I understood that as being the processors that run on the event, not a policy processor. It adds this processor configuration, which leads to those fields being added to the event (screenshot of the resulting event fields omitted).

I looked at the links you posted on: #31512 (comment), but they only reinforced my understanding regarding the processors. We could have a f-2-f chat to clarify that ;)

@belimawr
Contributor

belimawr commented Jul 8, 2022

Looking through the variable substitution examples we would probably need the code to add an ID template variable automatically like:

inputs:
  - type: filestream
    id: filestream-${kubernetes.container.id} # Implicitly added if not specified by the user
    enabled: true
    streams:
      - paths:
          - /var/log/{{local_dynamic.key}}

Maybe when the agent loads the agent policy for the first time in standalone mode we can check if k8s autodiscovery is in use, and then just inject the ID template variable into the policy if there isn't an ID defined already?

Hopefully the autodiscovery variable substitution would just work without modification if we did that.

Rather than modifying the policy when it's being loaded, I believe updating the integration would be a more robust solution.

In my tests the Kubernetes Container Logs integration generates this entry in the policy:

  - id: filestream-container-logs-46cdfbd0-12ba-49b1-8006-aa501f145459
    name: kubernetes-1
    revision: 1
    type: filestream
    use_output: default
    meta:
      package:
        name: kubernetes
        version: 1.21.1
    data_stream:
      namespace: default
    streams:
      - id: >-
          filestream-kubernetes.container_logs-46cdfbd0-12ba-49b1-8006-aa501f145459
        data_stream:
          dataset: kubernetes.container_logs
          type: logs
        paths:
          - '/var/log/containers/*${kubernetes.container.id}.log'
        prospector.scanner.symlinks: true
        parsers:
          - container:
              stream: all
              format: auto

It's easy to know it's filestream because of type: filestream, but it's hard to know it comes from kubernetes autodiscover. We do have meta.package.name: kubernetes, but that's not reliable and is specific to this integration.

@cmacknz
Member

cmacknz commented Jul 8, 2022

Tiago and I had a quick sync on this. We decided the best path is to fix the ID in the Kubernetes Container Logs integration.

I'm going to write up a separate issue documenting a way to try and prevent data duplication when filestream inputs are created without IDs in general if this continues to be a larger issue.

@belimawr
Contributor

belimawr commented Jul 8, 2022

PR to update the integration created: elastic/integrations#3672

I'm still seeing log messages like

filestream input with ID 'kubernetes-container-logs-d9ba19b3cc48c2231bd648f3f573e984b6c332093f250a6facdb92090bfaa4b9-' already exists, this will lead to data duplication, please use a different ID

I believe it is due to #31767. I'll keep the integrations PR as a draft until I have an answer.

belimawr added a commit to belimawr/integrations that referenced this issue Jul 11, 2022
This is part of the fix for data duplication that happens when there
are multiple filestream inputs configured with the same ID.

The container ID and pod name are added to the filestream input
ID. The container ID ensures uniqueness and the pod name helps with
debugging.

Issue: elastic/beats#31512
@cmacknz
Member

cmacknz commented Jul 13, 2022

One possible way to deal with this on the input side is #32335, but that would be complex and is not a 100% guaranteed fix.

gizas pushed a commit to elastic/integrations that referenced this issue Jul 14, 2022
This is part of the fix for data duplication that happens when there
are multiple filestream inputs configured with the same ID.

The container ID and pod name are added to the filestream input
ID. The container ID ensures uniqueness and the pod name helps with
debugging.

Issue: elastic/beats#31512
@belimawr
Contributor

elastic/integrations#3672 and #32309 have been merged, so this issue is resolved for now.

@rstatsinger

rstatsinger commented Sep 19, 2022

Hello all - I have a customer that is still seeing duplicate filebeat data from version 8.3.3 of the Elastic Agent. Here is his filebeat.yaml:

filebeat.inputs:
  - type: filestream
    enabled: true
    id: "ecs-app-logging"
    paths:
      - C:\MVD2\Logs\ECS*.log
    json.keys_under_root: true
    json.overwrite_keys: true
    json.add_error_key: true
    json.expand_keys: true
(his indents are proper :)

What could be the problem here? TiA for any input

@cmacknz
Member

cmacknz commented Sep 20, 2022

@rstatsinger looks like the same report as #31239 (comment), can we get a separate issue or an SDH opened for tracking this so it isn't spread across multiple other issues?

@keenborder786

keenborder786 commented Mar 22, 2023

@cmacknz
I am also facing a similar error.

The error is:

{"log.level":"error","@timestamp":"2023-03-22T17:11:22.014Z","log.logger":"input","log.origin":{"file.name":"input-logfile/manager.go","file.line":182},"message":"filestream input with ID 'spark-executors-file-stream-e40cd871-dabd-46b5-a245-decac624262a' already exists, this will lead to data duplication, please use a different ID","service.name":"filebeat","ecs.version":"1.6.0"}

My filebeat configuration is:

  filebeat.autodiscover:
    providers:
    - type: kubernetes
      templates:
        - condition:
            equals:
              kubernetes.namespace: "spark"
          config:
            - type: container
              paths:
              - /var/log/containers/*-${data.kubernetes.container.id}.log
              multiline.type: pattern
              multiline.pattern: '^[[:space:]]+(at|\.{3})[[:space:]]+\b|^Caused by:|^java'
              multiline.negate: false
              multiline.match: after
            - type: log
              id: spark-executors-file-stream-${data.kubernetes.node.uid}
              paths:
              - /var/log/spark/work/*/*/stdout
              - /var/log/spark/work/*/*/stderr
  processors:
    - add_cloud_metadata:
    - add_host_metadata:
  output.logstash:
    hosts: ["logstash:5044"]

I am using the latest version, i.e. 8.6.2, of filebeat. I don't understand why this issue is closed if it still has not been resolved in the latest release.

There is another similar issue: #34613

@cmacknz
Member

cmacknz commented Mar 23, 2023

@keenborder786 your problem is that your configuration is creating multiple inputs with the same ID, specifically the ID spark-executors-file-stream-e40cd871-dabd-46b5-a245-decac624262a

It looks like e40cd871-dabd-46b5-a245-decac624262a is the node ID, are there multiple instances of this container running on a single node? If so you need to use the pod or container ID instead of the node ID when creating the input ID.

This doesn't immediately look like a bug in Beats, or related to the original issue that was fixed here. If you keep having problems, start a thread on the discussion forums: https://discuss.elastic.co/c/elastic-stack/beats.
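For reference, one way to make that second input's ID unique per pod would be to key it on the pod UID rather than the node UID. A hedged sketch of the corrected template (assuming the autodiscover event exposes data.kubernetes.pod.uid, as it does for the other data.kubernetes.* fields used in this config):

```yaml
- type: log
  # pod UID instead of node UID, so concurrent pods on one node get distinct IDs
  id: spark-executors-file-stream-${data.kubernetes.pod.uid}
  paths:
    - /var/log/spark/work/*/*/stdout
    - /var/log/spark/work/*/*/stderr
```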
