AWS CloudWatch logs for Container Insights contain no CPU usage metrics when setting collection_interval to more than 300s #36109

Open
oleksandr-san opened this issue Oct 31, 2024 · 2 comments
Labels
bug Something isn't working needs triage New item requiring triage receiver/awscontainerinsight Stale

oleksandr-san commented Oct 31, 2024

Component(s)

receiver/awscontainerinsight

What happened?

Description

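We tried increasing the collection_interval parameter of the awscontainerinsightreceiver receiver to reduce AWS CloudWatch costs. With collection_interval set above 300s, the Container log events in CloudWatch no longer contain any CPU usage metrics.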

I've found that this is related to the TTL of the map that stores the previous values used for delta calculation: when the collection interval exceeds 5 minutes, the stored values expire and are removed before the next collection can use them, so no deltas (and therefore no CPU usage rates) are computed.

Increasing the cleanInterval to 15 minutes helps.

Steps to Reproduce

  1. Create any EKS cluster
  2. Install the OpenTelemetry Collector as a DaemonSet configured to collect AWS Container Insights
  3. Set receivers.awscontainerinsightreceiver.collection_interval to 600s
  4. Restart the collector DaemonSet
  5. Wait for 15-20 minutes

Expected Result

Log events in CloudWatch contain CPU usage metrics

Actual Result

Log events in CloudWatch do not contain CPU usage metrics

Collector version

0.41.1

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

extensions:
  health_check:

receivers:
  awscontainerinsightreceiver:
    collection_interval: 600s

processors:
  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # cluster metrics
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - cluster_node_count
          - cluster_failed_node_count

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf]

  extensions: [health_check]

Log output

No response

Additional context

Log event with collection_interval == 600s:

{
    "AutoScalingGroupName": "eks-agent-ng-arm64-4ac815a7-3a71-20b4-a604-aa35acfabcd4",
    "ClusterName": "cluster-with-agent",
    "InstanceId": "i-019f99ea685e48c83",
    "InstanceType": "t4g.medium",
    "Namespace": "kube-system",
    "NodeName": "ip-172-31-28-91.eu-north-1.compute.internal",
    "PodName": "aws-node",
    "Sources": [
        "cadvisor",
        "pod",
        "calculated"
    ],
    "Timestamp": "1730302312567",
    "Type": "Container",
    "Version": "0",
    "container_memory_cache": 106377216,
    "container_memory_failcnt": 0,
    "container_memory_mapped_file": 811008,
    "container_memory_max_usage": 160075776,
    "container_memory_rss": 28655616,
    "container_memory_swap": 0,
    "container_memory_usage": 136433664,
    "container_memory_utilization": 1.1755803143695827,
    "container_memory_working_set": 47341568,
    "container_status": "Running",
    "kubernetes": {
        "container_name": "aws-node",
        "containerd": {
            "container_id": "aabb7c4bea02cfe72371bb5a36bbcd23eff478078c6e920b77e1e9e0ade591b9"
        },
        "host": "ip-172-31-28-91.eu-north-1.compute.internal",
        "labels": {
            "app.kubernetes.io/instance": "aws-vpc-cni",
            "app.kubernetes.io/name": "aws-node",
            "controller-revision-hash": "588469c5c6",
            "k8s-app": "aws-node",
            "pod-template-generation": "2"
        },
        "namespace_name": "kube-system",
        "pod_id": "c3476737-e9d4-44cb-a20f-dcb812ac9091",
        "pod_name": "aws-node-wghkn",
        "pod_owners": [
            {
                "owner_kind": "DaemonSet",
                "owner_name": "aws-node"
            }
        ]
    },
    "number_of_container_restarts": 0
}

Log event with the default configuration:

{
    "AutoScalingGroupName": "eks-agent-ng-1ac79c42-2aa5-ff45-0c1e-b03d703c0d47",
    "ClusterName": "cluster-with-agent",
    "InstanceId": "i-0becbf3535f001cb4",
    "InstanceType": "t3.medium",
    "Namespace": "kube-system",
    "NodeName": "ip-172-31-25-41.eu-north-1.compute.internal",
    "PodName": "aws-node",
    "Sources": [
        "cadvisor",
        "pod",
        "calculated"
    ],
    "Timestamp": "1730371819323",
    "Type": "Container",
    "Version": "0",
    "container_cpu_request": 25,
    "container_cpu_usage_system": 1.3264307613654849,
    "container_cpu_usage_total": 2.9252373450029627,
    "container_cpu_usage_user": 1.393591812573864,
    "container_cpu_utilization": 0.14626186725014814,
    "container_memory_cache": 24600576,
    "container_memory_failcnt": 0,
    "container_memory_hierarchical_pgfault": 267.61999880258816,
    "container_memory_hierarchical_pgmajfault": 0,
    "container_memory_mapped_file": 270336,
    "container_memory_max_usage": 56954880,
    "container_memory_pgfault": 267.61999880258816,
    "container_memory_pgmajfault": 0,
    "container_memory_rss": 26337280,
    "container_memory_swap": 0,
    "container_memory_usage": 52269056,
    "container_memory_utilization": 1.1655047122298874,
    "container_memory_working_set": 47063040,
    "container_status": "Running",
    "kubernetes": {
        "container_name": "aws-node",
        "containerd": {
            "container_id": "b038c0f909602224fa9e1b1351379ff2dc48d0de3e96f720ed80316ada28aca2"
        },
        "host": "ip-172-31-25-41.eu-north-1.compute.internal",
        "labels": {
            "app.kubernetes.io/instance": "aws-vpc-cni",
            "app.kubernetes.io/name": "aws-node",
            "controller-revision-hash": "588469c5c6",
            "k8s-app": "aws-node",
            "pod-template-generation": "2"
        },
        "namespace_name": "kube-system",
        "pod_id": "5e453328-d24c-45d8-9451-7274248cd447",
        "pod_name": "aws-node-wt85g",
        "pod_owners": [
            {
                "owner_kind": "DaemonSet",
                "owner_name": "aws-node"
            }
        ]
    },
    "number_of_container_restarts": 0
}
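To make the difference between the two events easy to see, a small Go helper (hypothetical, not part of the collector) can list the top-level fields present in the default event but absent from the 600s one. Run against the full events above, it should report the five container_cpu_* fields plus the four rate-based page-fault fields (container_memory_pgfault, container_memory_pgmajfault and their hierarchical variants). The main function below uses abbreviated copies of the events, keeping only a few fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// missingKeys returns, sorted, the top-level keys present in want but
// absent from got. Both arguments must be JSON objects (EMF log events).
func missingKeys(want, got string) ([]string, error) {
	var w, g map[string]json.RawMessage
	if err := json.Unmarshal([]byte(want), &w); err != nil {
		return nil, err
	}
	if err := json.Unmarshal([]byte(got), &g); err != nil {
		return nil, err
	}
	var missing []string
	for k := range w {
		if _, ok := g[k]; !ok {
			missing = append(missing, k)
		}
	}
	sort.Strings(missing)
	return missing, nil
}

func main() {
	// Abbreviated copies of the two events above (most fields omitted).
	defaultEvent := `{"Type": "Container", "container_cpu_usage_total": 2.9252373450029627, "container_cpu_utilization": 0.14626186725014814, "container_memory_pgfault": 267.61999880258816, "container_memory_usage": 52269056}`
	longIntervalEvent := `{"Type": "Container", "container_memory_usage": 136433664}`

	missing, err := missingKeys(defaultEvent, longIntervalEvent)
	if err != nil {
		panic(err)
	}
	fmt.Println(missing)
	// [container_cpu_usage_total container_cpu_utilization container_memory_pgfault]
}
```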
oleksandr-san added the bug and needs triage labels Oct 31, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.
