Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Metricbeat][Kubernetes] Share watchers between metricsets #37332

Merged
merged 76 commits into from
Apr 3, 2024

Conversation

constanca-m
Copy link
Contributor

@constanca-m constanca-m commented Dec 7, 2023

Proposed commit message

Details

The needed watchers for each resource are defined in getExtraWatchers function. You can check which are required in the table of section Expected watchers in this issue.

Note: only state_resourcequota does not have the expected watchers with this change. This is because we need to change the implementation of that metricset first.

We have a global map that saves all the watchers:

type watchers struct {
	watchersMap map[string]*watcherData
	lock        sync.RWMutex
}

The key to this map is the resource name, and the values are defined as:

type watcherData struct {
	metricsetsUsing []string // list of metricsets using this watcher

	watcher kubernetes.Watcher
	started bool // true if watcher has started, false otherwise

	enrichers map[string]*enricher // map of enrichers using this watcher. The key is the metricset name

	metadataObjects map[string]bool // map of ids of each object received by the handler functions
}
  • metricsetsUsing contains the list of metricsets that are using this watcher. We need this because when the enricher calls Start() or Stop(), the watchers start/stop. We cannot start a watcher more than once. We only stop a watcher if the list of metricsets using it is empty. We use metricset to avoid conflicts between metricsets that use the same resource, like state_pod and pod.
  • watcher is the kubernetes watcher for the resource.
  • started just tells us if the watcher is started. This is mainly needed for the enricher.Start() and for testing purposes.
  • enrichers is the list of enrichers for this watcher per each metricset
  • metadataEvents is the resulted metadata events from the resource event handler. Please see the next list, point 6.2, why this is necessary

The algorithm goes like this when NewResourceMetadataEnricher is called:

  1. The configuration is validated. It will return a nil enricher if it fails.
  2. We create the configuration needed for the metadata generator. It will return a nil enricher if it fails.
  3. We create the K8s client. It will return a nil enricher if it fails.
  4. We start all the watchers:
    1. We first check if the resource exists. If it fails, we stop.
    2. We build the kubernetes.WatchOptions{} needed for the watcher. If it fails, we stop.
    3. We start the watcher for this specific resource:
      1. We first check if the watcher is already created.
      2. If it is, then we don't do anything.
      3. Otherwise, we create a new watcher and put it in the map with key = resource name.
      4. We add this metricset to the list of metricsets we have that are using this watcher.
    4. We get all needed extra resources for this resource, and repeat step 3.
  5. We create the metadata generators.
  6. Lastly, create the enricher.
    Considerations:
    1. Because each watcher only has one function for UpdateFunc / addFunc and DeleteFunc, we need to save which metricsets and respective enrichers need that handler function. For this, we keep track of the enrichers using a map, and iterate over that map when one of the functions is triggered.
    2. It is possible that AddFunc is called for one metricset first, and when the other metricset starts, the AddFunc is no longer triggered. To avoid the loss of metadata, we have the map metadataObjects, that saves the id of the object that triggered the handler function. This way, for each enricher upon creation, we iterate over all this map and using the id saved there, we get the object from the watcher store. Using this object, we call the update function and ensure all enrichers have up to date metadata.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Clone this branch.
  2. Follow the steps of this README file to launch metricbeat with the changes.
  3. Check it is working as expected.

Related issues

Results

Metricbeat

These results come from running metricbeat with this configuration for kubernetes module (all metricsets that launch watchers are enabled, the others are not).
metricbeat.autodiscover:
  providers:
    - type: kubernetes
      scope: cluster
      node: ${NODE_NAME}
      unique: true
      templates:
        - config:
            - module: kubernetes
              hosts: ["kube-state-metrics:8080"]
              period: 10s
            #  #add_metadata: true
              metricsets:
                - state_node
                - state_deployment
                - state_daemonset
                - state_replicaset
                - state_pod
                - state_container
                - state_cronjob
                - state_job
                #- state_resourcequota
                - state_statefulset
                - state_service
                - state_persistentvolume
                - state_persistentvolumeclaim
                - state_storageclass
                - state_namespace
            - module: kubernetes
              metricsets:
                - node
                - pod
                - container
              period: 10s
              host: ${NODE_NAME}
              hosts: ["https://${NODE_NAME}:10250"]
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              ssl.verification_mode: "none"

The logs for the watchers initialization will look like this (only message field displayed for simplicity):

{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.094Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher node successfully, created by node.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.116Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher pod successfully, created by pod.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.116Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":368},"message":"Created watcher state_namespace successfully, created by pod.","service.name":"metricbeat","ecs.version":"1.6.0"}                <-------------------------------------
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.191Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher deployment successfully, created by state_deployment.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.206Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher daemonset successfully, created by state_daemonset.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:02.516Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher replicaset successfully, created by state_replicaset.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.118Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher cronjob successfully, created by state_cronjob.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.126Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher statefulset successfully, created by state_statefulset.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.132Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher service successfully, created by state_service.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.139Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher persistentvolume successfully, created by state_persistentvolume.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.146Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher persistentvolumeclaim successfully, created by state_persistentvolumeclaim.","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-02-08T08:03:03.152Z","log.logger":"kubernetes","log.origin":{"function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.createAllWatchers","file.name":"util/kubernetes.go","file.line":354},"message":"Created watcher storageclass successfully, created by state_storageclass.","service.name":"metricbeat","ecs.version":"1.6.0"}

Notice the line with <----------: the pod was the resource that created the watcher for namespace, since it is one of the required resources, and it did not exist yet. This is also the reason why we don't see the line "message":"Created watcher state_namespace successfully, created by state_namespace.", because by the time state_namespace is iterating over the needed watchers, they are already created.

In Discover:
image

Elastic Agent

These results come from running EA with this standalone manifest, but with the custom image.

Logs:

These are the logs for starting the watchers (working as expected).
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.051Z","message":"Started watcher statefulset successfully, created by statefulset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.051Z","message":"Started watcher state_namespace successfully, created by statefulset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":321,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.257Z","message":"Started watcher node successfully, created by node.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.361Z","message":"Started watcher pod successfully, created by pod.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.470Z","message":"Started watcher deployment successfully, created by deployment.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.577Z","message":"Started watcher persistentvolume successfully, created by persistentvolume.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.679Z","message":"Started watcher persistentvolumeclaim successfully, created by persistentvolumeclaim.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.782Z","message":"Started watcher replicaset successfully, created by replicaset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:48.887Z","message":"Started watcher service successfully, created by service.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.017Z","message":"Started watcher storageclass successfully, created by storageclass.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.120Z","message":"Started watcher cronjob successfully, created by cronjob.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.225Z","message":"Started watcher daemonset successfully, created by daemonset.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"ecs.version":"1.6.0"}
...
{"log.level":"debug","@timestamp":"2023-12-08T10:17:49.329Z","message":"Started watcher job successfully, created by job.","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"ecs.version":"1.6.0","log.logger":"kubernetes","log.origin":{"file.line":307,"file.name":"util/kubernetes.go","function":"github.com/elastic/beats/v7/metricbeat/module/kubernetes/util.startAllWatchers"},"service.name":"metricbeat","ecs.version":"1.6.0"}

The results for the dashboards are (check if it still works):

  • [Metrics Kubernetes] Cronjobs
  • [Metrics Kubernetes] StatefulSets
  • [Metrics Kubernetes] Pods
  • [Metrics Kubernetes] Deployments
  • [Metrics Kubernetes] DaemonSets
  • [Metrics Kubernetes] Jobs
  • [Metrics Kubernetes] Nodes
  • [Metrics Kubernetes] PV/PVC
  • [Metrics Kubernetes] Cluster Overview
  • [Metrics Kubernetes] Services - It is broken, but it is not related with the changes on this PR.

Note: only dashboards for resources that launch watchers are considered. There were no changes in the others.

Notes for testing

Important things to consider when testing this PR code changes:

  • This PR changes only affect the kubernetes module metricsets that use metadata enrichment. These are state_namespace state_node state_deployment state_daemonset state_replicaset state_pod state_container state_job state_cronjob state_statefulset state_service state_persistentvolume state_persistentvolumeclaim state_storageclass pod container node
  • Everything that was working before this PR changes should still be working. The changes only reduce the number of watchers created from the different metricsets, thus reducing the k8s API calls.
  • Thorough regression testing is needed. In more details:
    a. All events coming from the affected metricsets in Kibana should be enriched with own resource metadata (labels, annotations) and kubernetes node metadata and kubernetes namespace metadata when applicable.
    b. When a new metadata(like a new label) is added on a resource(i.e. pod) then the new events from the related metricset(pod, container, state_pod, state_container) should contain the new metadata
    c. When a new node or namespace label and annotation is added to a node/namespace, then the events from relevant metricsets(state_node, node or state_namespace) should include the new metadata.
    d. The events of the rest of the metricsets(i.e. state_pod or state_deployment) coming from resources in the updated node/namespace won't get the updated node or namespace metadata out of the box.
    e. In order for those events to be updated, there should be first an update in the metadata of these resources. For example if a node is labeled then the pods of that node won't get the new node label immediately. In order to get it, we should also add a label to these pod to trigger a watcher event. Then the new events will include the new pod label and node label
    f. Test with addition/removal of metadata on pods that run on the leader node and also on the non-leader nodes.

@constanca-m constanca-m added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Dec 7, 2023
@constanca-m constanca-m self-assigned this Dec 7, 2023
@constanca-m constanca-m requested a review from a team as a code owner December 7, 2023 09:16
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Dec 7, 2023
Copy link
Contributor

mergify bot commented Dec 7, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @constanca-m? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Start Time: 2023-12-07T09:16:30.968+0000

  • Duration: 8 min 49 sec

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

❕ Build Aborted

Either there was a build timeout or someone aborted the build.

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Duration: 22 min 52 sec

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

❕ Build Aborted

Either there was a build timeout or someone aborted the build.

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Duration: 18 min 18 sec

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Duration: 50 min 17 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

// NewResourceMetadataEnricher returns an Enricher configured for kubernetes resource events
// getExtraWatchers returns a list of the extra resources to watch based on some resource.
// The full list can be seen in https://github.com/elastic/beats/issues/37243, at Expected Watchers section.
func getExtraWatchers(resourceName string, config *kubernetesConfig) []string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also as a general comment all this functions I think can move to util folder and keep same structure (as we have util.NewResourceMetadataEnricher), because now we go back and forth

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean in a different file inside util? They are in the util folder right now.

@gizas
Copy link
Contributor

gizas commented Dec 7, 2023

  • This PR will also need testing with Agent for sure. We need to build the agent and repeat the same tests and see if we dont break anything.

  • Also I would need to run some E2E tests to see that metadata enrichemnt are ok.

  • We will need to decide what a configuration like eg add_resource_metadata.namespace.enabled: false will do in our case

Copy link
Contributor

@MichaelKatsoulis MichaelKatsoulis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without having checked the code line by line, I believe that this approach is not aligned with kubeStateMetricsCache and kubeletStatsCache approach where we try to solve a similar issue.

GetStateMetricsFamilies(prometheus p.Prometheus) ([]*p.MetricFamily, error)

The watches need to be set in a module level. We need to take into account elastic-agent where each metricset starts as a new kubernetes module. So when we need to share stuff between metricsets of kubernetes module, we need to define a global cache on Module's level shared across all Module's instances.
See @ChrsMark implementation in #25640 (comment)

@constanca-m
Copy link
Contributor Author

We will need to decide what a configuration like eg add_resource_metadata.namespace.enabled: false will do in our case

I think maybe we should move those new decisions to a new PR @gizas

@constanca-m
Copy link
Contributor Author

constanca-m commented Dec 7, 2023

Without having checked the code line by line, I believe that this approach is not aligned with kubeStateMetricsCache and kubeletStatsCache approach where we try to solve a similar issue.

I think this approach still works this way and does basically the same thing.

It is harder to use util/kubernetes with the kubernetes in the parent folder, because we would have the import cycle error in go at all times. The only workaround i could find for that, would be to pass the funcitons as parameters, but it is very hard to read the code that way.

I added unit tests for every function, and they work just fine. @MichaelKatsoulis

Edit: It is the same approach we are already using for state_metricset shared map.

@MichaelKatsoulis
Copy link
Contributor

It is harder to use util/kubernetes with the kubernetes in the parent folder, because we would have the import cycle error in go at all times

You need to define a watchersCache in utils like we do with MetricsRepo

metricsRepo *util.MetricsRepo
and

Then its pointer can be passed to NewResourceMetadataEnricher like metricsRepo.

Did you test how many watchers are created under the hood with e2e tests? Did you test elastic-agent and metricbeat?

@constanca-m
Copy link
Contributor Author

constanca-m commented Dec 7, 2023

Did you test how many watchers are created under the hood with e2e tests? Did you test elastic-agent and metricbeat?

Yes, the number of watchers are correct. I posted the results in Results in the description from running metricbeat. I also added the unit tests. EA was also tested, results are now in the description. @MichaelKatsoulis

Any test in specific I should do? Any specific situation?

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 13, 2024

💔 Build Failed

Failed CI Steps

History

cc @constanca-m

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @constanca-m

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @constanca-m

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @constanca-m

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @constanca-m

@MichaelKatsoulis MichaelKatsoulis requested a review from gizas March 14, 2024 10:07
Copy link
Contributor

@tetianakravchenko tetianakravchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding the documentation:

  • I think it makes more sense to keep structs and function parameters details as comments in the code, otherwise there is high chances this doc will be not kept in sync with code changes. I believe it would be helpful to keep implementation details in the code, and enrichers.md for more high level overview
  • could the documentation be restructured to have sub-topics? it is hard to read 100+ lines of text as one piece
  • could maybe be added a high level architecture diagram? It would simplify the documentation a lot


- The following description is irrelevant to the metadata enrichment that happens due to the `add_kubernetes_metadata` processor and Kubernetes provider.
- The `add_kubernetes_metadata` processor is skipped in cases of Kubernetes module for metrics collection. It is only used in the case of logs collection when the Kubernetes autodiscover provider is not used.
-The Kubernetes autodiscover provider enriches with metadata mainly when it comes to log collection when it is configured. It is by default in the `container_logs` integration in the Elastic Agent.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not only?

Suggested change
-The Kubernetes autodiscover provider enriches with metadata mainly when it comes to log collection when it is configured. It is by default in the `container_logs` integration in the Elastic Agent.
-The Kubernetes autodiscover provider enriches with metadata only when it comes to log collection when it is configured. It is by default in the `container_logs` integration in the Elastic Agent.

maybe instead of referring to different project (Elastic Agent) - add link to the filebeat https://github.com/elastic/beats/blob/main/deploy/kubernetes/filebeat-kubernetes.yaml#L133-L150?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kubernetes provider is also used with templates to start a metricbeat module (https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-autodiscover.html#configuration-autodiscover) and enrich the metrics with kubernetes metadata

like state_pod, state_container, node, state_job etc.
- The shared watchers are stored in the module-level struct [resourceWatchers](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/kubernetes.go#L102). The struct is initialized when the Kubernetes module [starts](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/util/kubernetes.go#L124).
- Each time a watcher gets created it is added to the struct along with some watcher important data. The [struct](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/util/kubernetes.go#L99) consists of a metaWatchersMap which has as key the name of the resource this watcher is watching(i.e. pod or node) and as value a [metaWatcher](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/util/kubernetes.go#L86C6-L86C17) struct.
- The metaWatcher struct holds all the relevant info for the specific watcher. These info include
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move such detailed explanation on the metaWatcher struct to the code, otherwise there is high chances this doc will be not kept in sync with code changes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These details of the variables are already in the code like this:

	watcher kubernetes.Watcher
	started bool // true if watcher has started, false otherwise

	metricsetsUsing []string // list of metricsets using this watcher

	enrichers       map[string]*enricher // map of enrichers using this watcher. The key is the metricset name
	metadataObjects map[string]bool      // map of ids of each object received by the handler functions

	nodeScope      bool               // whether this watcher is only for current node
	restartWatcher kubernetes.Watcher // whether this watcher needs a restart

metricbeat/module/kubernetes/util/enrichers.md Outdated Show resolved Hide resolved
- If it does not exist it creates it and [stores](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/util/kubernetes.go#L288C3-L288C18) it in the `resourceWatchers` struct. When creating a new metaWatcher struct, thet `started` options set to false as the watcher has not been started yet(just created). The `metadataObjects`, `enrichers` and `metricsetsUsing` initiliazied to empty, the `restartWatcher` set to nil and the `nodeScope` according to the function input(except from cases when it is an extra watcher where it is hardcoded to false).
- If the watcher for that resource already exists, we check if it needs to be restarted. This situation can arise in specific cases. For instance, if a pod watcher has been created from a metricset like pod or container with nodeScope set to true (watching only from one node), and then another metricset like state_pod or state_container tries to create a pod watcher (which can happen only in the leader node), the watcher already exists. However, if we don't take any action in this scenario, the state_pod watcher would use a watcher that watches only from one node, leading to missing metadata, as described earlier. To address this, we need to update the watcher, mainly changing its watch options (removing options.Node). Unfortunately, a running watcher cannot be updated directly. Instead, we must stop it and create a new one with the correct watch options. The new restartWatcher must be identical to the old watcher, including the same handler function (more on that later), with the only difference being in the watch options. Consequently, the metaWatcher struct of the watcher gets updated with the restartWatcher. The process of stopping the old watcher and starting the new one is handled later on.
- After createAllWatchers function creates all the watchers that are needed and update the resourceWatchers struct, the code flow returns to NewResourceMetadataEnricher
- Now, let's delve into the creation of metadata generators and handler functions. Watchers, on their own, are responsible for subscribing to the Kubernetes API and monitoring changes in the resources they are configured to watch. Their primary function is to update their internal cache or store. However, to determine what actions to take when a change occurs, we rely on the so-called event handlers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can maybe this list be structured somehow with sub-titles? for example metadata generation can be split to another list?

// Enrich the given list of events
Enrich([]mapstr.M)
}

type enricher struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add comments regarding this struct and the struct fields here, instead of in documentation - to keep the implementation details in code and the high level overview in the https://github.com/elastic/beats/blob/a7f709a153a66bdb5369128342bd21c36da6439e/metricbeat/module/kubernetes/util/enrichers.md instead ?

} else if resourceMetaWatcher.nodeScope != nodeScope && resourceMetaWatcher.nodeScope {
// It might happen that the resourceMetaWatcher already exists, but is only being used to monitor the resources
// of a single node. In that case, we need to check if we are trying to create a new resourceMetaWatcher that will track
// the resources of multiple nodes. If it is the case, then we need to update the resourceMetaWatcher.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we need to check if we are trying to create a new resourceMetaWatcher that will track the resources of multiple nodes.

it is not the multiple nodes, but rather the whole cluster/clusterScope, right?
Such behavior is expected only on the leader node, right?
what will happen in case of the re-election, is it needed in that case to switch from clusterScope to nodeScope (now this condition would only work for the nodeScope: true)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not the multiple nodes, but rather the whole cluster/clusterScope, right?

You are right

what will happen in case of the re-election, is it needed in that case to switch from clusterScope to nodeScope

Under what conditions the re-election happens ? I think it only happens if an agent is down/restarted, so in that case the whole metricsets initialisation will take place again and the ones with clusterscope will not start at this node.

-The Kubernetes autodiscover provider enriches with metadata mainly when it comes to log collection when it is configured. It is by default in the `container_logs` integration in the Elastic Agent.
- Metadata enrichment from the enrichers happens for all the following Kubernetes metricsets: `state_namespace`, `state_node`, `state_deployment`, `state_daemonset`, `state_replicaset`, `state_pod`, `state_container`, `state_job`, `state_cronjob`, `state_statefulset`, `state_service`, `state_persistentvolume`, `state_persistentvolumeclaim`, `state_storageclass`, `pod`, `container`, `node`.
- The reason that these metricsets trigger the metadata enrichment is because of the way they start.
- All `state_metricsets` (except `state_container`) trigger the shared [kubernetes.Init](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/state_daemonset/state_daemonset.go#L45) function when they get initialized.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost all the links are from the private repo - https://github.com/constanca-m/beats, that can be outdated quite soon or removed when the feature branch will be merged in

- The pod watcher event handler will only call the updateFunc of the one enricher it has, when triggered
- A new pod appears, the watcher handler function gets triggered and it executes the updatefunc of only the pod metricset's enricher
- So the pod metricset's enricher's metadata map will have the metadata for all pod resources
- state_pod metricsets gets initiliazied. This happens in leader node. First pod metricsets starts and then state_pod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pod metricsets starts and then state_pod
does it always happen like this? or it is an example here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It always happens. The reason is that for the state_* metricsets to start , the leader election needs to take place first which takes sometime.

- A new enricher for state_pod gets created, it uses the same existing pod watcher. Inside buildMetadataEnricher the watcher.enrichers map gets updated to include the state_pod enricher as well
- So whenever a new pod appears or gets updated, both enricher's updateFunc will be triggered, so both enrichers' metadata map will be up to date
- But what happens with pods that triggered a watcher event, in between the pod and state_pod initilization. So after the pod metricsets got initiliazed and before the state_pod got initiliazed
- This is very common. The moment a watcher starts watching(from pod metricset) it immidiately gets notified for all pods in the cluster and executes whatever its updateFunc says. In that case the pod enricher's updateFunc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case the pod enricher's updateFunc

this sentence is not finished, no?

- This is very common. The moment a watcher starts watching(from pod metricset) it immidiately gets notified for all pods in the cluster and executes whatever its updateFunc says. In that case the pod enricher's updateFunc
- So when the watcher got updated with state_pod enricher, the events for the existing pods have already arrived and at that point the watcher did not call the state_pod's enricher updateFunc
- Outcome is that the state_pod enricher's metadata map won't have the existing pods metadata, because the metadata generate method of that enricher was never called
- So we need a way to handle those cases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it an explanation of the restartWatcher ? this seems to be not finished

Copy link
Contributor

mergify bot commented Mar 25, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-multiple-watchers upstream/fix-multiple-watchers
git merge upstream/main
git push upstream fix-multiple-watchers

- The reason that these metricsets trigger the metadata enrichment is because of the way they start.
- All `state_metricsets` (except `state_container`) trigger the shared [kubernetes.Init](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/state_daemonset/state_daemonset.go#L45) function when they get initialized.
- [kubernetes.Init](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/helper/kubernetes/state_metricset.go#L44) calls the [New metricsets method](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/helper/kubernetes/state_metricset.go#L80), which returns a new metricset with resource metadata enricher as part of it.
- Node, pod, container, and `state_container` metricsets do not trigger the `kubernetes.Init` rather they implement their own [New method](https://github.com/constanca-m/beats/blob/8fcdfc205ebe0221ed7a1046c673c9babfb5280d/metricbeat/module/kubernetes/node/node.go#L66) with the enricher as part of the metricsets returned.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all points up until this one are not necessary in this file, and they might add confusion. We just need to say the enrichers are created in the metricset struct. We talk a lot about the kubernetes.Init, and I think that code might change in the future, and then this file won't be updated. Can we remove all these points and just add a sentence explaining where we start enrichers? @MichaelKatsoulis

go.mod Outdated Show resolved Hide resolved
// removeFromMetricsetsUsing removes the metricset from the list of resources using the shared watcher.
// It returns true if element was removed and new size of array.
// The cache should be locked when called.
func removeFromMetricsetsUsing(resourceName string, notUsingName string, resourceWatchers *Watchers) (bool, int) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inverse of this function addToMetricsetsUsing grabs the lock and this function doesn't. That inconsistency is not clear. I would prefer to see that consistent between the two functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the time we call this function, we are already holding the lock. That is why it is not consistent. I will add a comment to the addToMetricsetsUsing mentioning the lock

Signed-off-by: constanca <[email protected]>
Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the additional comment and cleanup of the go.mod file.

@constanca-m constanca-m requested a review from a team as a code owner April 3, 2024 07:23
@constanca-m constanca-m merged commit 460b5c4 into elastic:main Apr 3, 2024
156 of 160 checks passed
@constanca-m constanca-m deleted the fix-multiple-watchers branch April 3, 2024 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants