Use kubernetes app label in logGroupTemplate #927

Open
Mattie112 opened this issue Mar 21, 2023 · 9 comments · May be fixed by #1168
Labels
bug Something isn't working

Comments

@Mattie112

Describe the bug
We have always used the following Terraform module: https://github.com/DNXLabs/terraform-aws-eks-cloudwatch-logs, but we have now switched to using this chart directly because that module is no longer maintained. That TF module would, in the background, use the aws-for-fluent-bit chart.

The logGroup used was like this:

  set {
    name  = "cloudWatch.logGroupName"
    value = "/aws/eks/${var.cluster_name}/$(kubernetes['labels']['app'])"
  }

So this resulted in /aws/eks/staging/some_app_label. I'd like to have the same structure when using this chart directly (and now also on the latest chart version).

I used the following in my terraform:

  set {
    name  = "cloudWatchLogs.logGroupTemplate"
    value = "/aws/eks/${local.cluster_name}/$kubernetes['labels']['app']"
  }

However, this makes Fluent Bit really spam CloudWatch with these log messages:

[2023/03/21 13:43:57] [ warn] [record accessor] translation failed, root key=kubernetes
[2023/03/21 13:43:57] [ warn] [record accessor] translation failed, root key=kubernetes

If I use the format /aws/eks/${local.cluster_name}/$kubernetes['namespace_name'] it works fine.

How can I get the ['labels']['app'] format to work? Is logGroupTemplate indeed correct?

Steps to reproduce
See above

Expected outcome
See above

Environment

  • Chart name: aws-for-fluent-bit
  • Chart version: 0.1.24
  • Kubernetes version: 1.22
  • Using EKS (yes/no), if so version? 1.22.17-eks-48e63af

Additional Context:
I did find this PR: #903, but I'm not 100% sure whether it is related to my issue or not.

@PettitWesley
Contributor

Background

There are two cloudwatch plugins and two different templating features.

See here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#migrating-to-or-from-cloudwatch_logs-c-plugin-to-or-from-cloudwatch-go-plugin

This chart calls the go plugin cloudWatch and the C plugin cloudWatchLogs.
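In terms of this chart's values, the mapping is roughly as follows (a sketch using only the keys that appear in this thread, not a complete configuration; each flag independently enables the corresponding output):

  // Enables the Go plugin output (Name cloudwatch),
  // templated through log_group_name with $(kubernetes['...']) syntax
  set {
    name  = "cloudWatch.enabled"
    value = true
  }

  // Enables the C plugin output (Name cloudwatch_logs),
  // templated through log_group_template with $kubernetes['...'] syntax
  set {
    name  = "cloudWatchLogs.enabled"
    value = true
  }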

Advice

If I use the format /aws/eks/${local.cluster_name}/$kubernetes['namespace_name'] it works fine.

How can I get the ['labels']['app'] format to work? Is logGroupTemplate indeed correct?

You're using the correct templating key and your format looks good.

Can you attach example logs? Is there a label called app? It should work if there is...
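One thing to double check is the syntax difference between the two templating features; as a sketch based on the values already posted above (with ${local.cluster_name} standing in for whatever holds your cluster name):

  // Go plugin (cloudWatch.*): record accessor wrapped in $( ... )
  set {
    name  = "cloudWatch.logGroupName"
    value = "/aws/eks/${local.cluster_name}/$(kubernetes['labels']['app'])"
  }

  // C plugin (cloudWatchLogs.*): plain record accessor, no parentheses
  set {
    name  = "cloudWatchLogs.logGroupTemplate"
    value = "/aws/eks/${local.cluster_name}/$kubernetes['labels']['app']"
  }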

@Mattie112
Author

Thank you for your reply. I have spent a lot more time on this (much more than I should have, haha), and I understand it a bit better now. I did not understand the whole "plugin" part at first, but now I get that this is the Fluent Bit plugin for AWS. It was a bit hard for me to track down what went wrong and on which project I needed to report it.

The previous setup used an old version that is still running in our production cluster and seems to use cloudwatch as the plugin. The new version (staging) uses cloudwatch_logs. The template DOES seem to work, but not all workloads have the "app" label, so the warning is actually correct. However, this causes a hell of a lot of logs to be generated by Fluent Bit, as each line that could not be parsed (for example k8s cluster logs etc.) logs the warning. See this issue: aws/aws-for-fluent-bit#599. This just costs too much $$$.

So I think I now have 2 options: either stick with a different naming scheme (based on namespace_name, because everything has that) OR revert to the old plugin.
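For reference, option 1 would look roughly like this in terraform (a sketch reusing the namespace-based template that already works for me):

  set {
    name  = "cloudWatchLogs.logGroupTemplate"
    value = "/aws/eks/${local.cluster_name}/$kubernetes['namespace_name']"
  }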

So, for my sanity, I now choose to go back to the old plugin (as we were using that without issues) and then switch over once the new plugin has the suggested fix.

So I updated my terraform config to be as follows:

resource "helm_release" "aws-for-fluent-bit" {
  name             = "aws-for-fluent-bit"
  chart            = "aws-for-fluent-bit"
  repository       = "https://aws.github.io/eks-charts"
  version          = "0.1.24"
[...]
  // Disable the new logging plugin
  set {
    name  = "cloudWatchLogs.enabled"
    value = false
  }

  // Enable the old logging plugin
  set {
    name  = "cloudWatch.enabled"
    value = true
  }

  set {
    name  = "cloudWatch.logGroupName"
    value = "/aws/eks/${local.cluster_name}/$(kubernetes['labels']['app'])"
  }
[...]
}

That should be it, right? Deactivating the new plugin and activating the old one? However, I am not sure this works:

time="2023-03-22T12:41:32Z" level=error msg="[cloudwatch 0] parsing log_group_name template '/aws/eks/staging/$(kubernetes['labels']['app'])' (using value of default_log_group_name instead): app: sub-tag name not found"
time="2023-03-22T12:41:32Z" level=error msg="[cloudwatch 0] parsing log_group_name template '/aws/eks/staging/$(kubernetes['labels']['app'])' (using value of default_log_group_name instead): app: sub-tag name not found"
time="2023-03-22T12:41:32Z" level=error msg="[cloudwatch 0] parsing log_group_name template '/aws/eks/staging/$(kubernetes['labels']['app'])' (using value of default_log_group_name instead): app: sub-tag name not found"

The generated ConfigMap for fluent-bit.conf is:

[SERVICE]
    Parsers_File /fluent-bit/parsers/parsers.conf
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Parser            docker
    Docker_Mode       On
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
    Refresh_Interval  10
[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc.cluster.local:443
    Merge_Log           On
    Merge_Log_Key       data
    Keep_Log            On
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
    Buffer_Size         32k
[OUTPUT]
    Name                  cloudwatch_logs
    Match                 *
    region                eu-west-1
    log_group_name        /aws/eks/fluentbit-cloudwatch/logs
    log_group_template    /aws/eks/staging/$kubernetes['namespace_name']
    log_stream_prefix     fluentbit-
    log_stream_template   $kubernetes['pod_name'].$kubernetes['container_name']
    auto_create_group     true
    log_retention_days    30

Is this correct? I was expecting Name cloudwatch and not cloudwatch_logs.

@Mattie112
Author

And just for fun:

On production we have chart version 0.1.7 (yes, really old; that is why we want to update).

The generated fluentbit config there is:

[SERVICE]
    Parsers_File /fluent-bit/parsers/parsers.conf

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Parser            docker
    Docker_Mode       On
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
    Refresh_Interval  10

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc.cluster.local:443
    Merge_Log           On
    Merge_Log_Key       data
    Keep_Log            On
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
[OUTPUT]
    Name                  cloudwatch
    Match                 *
    region                eu-west-1
    log_group_name        /aws/eks/prod/$(kubernetes['labels']['app'])
    log_stream_prefix     fluentbit-
    auto_create_group     true

However, here we run it "through" https://registry.terraform.io/modules/DNXLabs/eks-cloudwatch-logs/aws/latest (which is no longer maintained), so I am not setting the variables in exactly the same way. Not sure if this is helpful.

@PettitWesley
Contributor

@Mattie112 yes, the cloudWatch chart config maps to the cloudwatch plugin: https://github.com/aws/eks-charts/blob/master/stable/aws-for-fluent-bit/templates/configmap.yaml#L64

  set {
    name  = "cloudWatch.logGroupName"
    value = "/aws/eks/${local.cluster_name}/$(kubernetes['labels']['app'])"
  }

log_group_template    /aws/eks/staging/$kubernetes['namespace_name']
log_stream_prefix     fluentbit-
log_stream_template   $kubernetes['pod_name'].$kubernetes['container_name']

These do not match up, so I think these are from two different test runs.

@Mattie112
Author

Well no, they are from the same run. It seems like whatever I set is ignored. I can double-check it again tomorrow, but I wanted to use the "old" plugin (to keep the naming equal) and that was the result. I will try it again and let you know.

@Mattie112
Author

Alright, I tested it again. The error is still the same:

time="2023-03-23T10:07:01Z" level=error msg="[cloudwatch 0] parsing log_group_name template '/aws/eks/staging/$(kubernetes['labels']['app'])' (using value of default_log_group_name instead): app: sub-tag name not found"

For reference, the full terraform config (minus some comments):

resource "helm_release" "aws-for-fluent-bit" {
  name             = "aws-for-fluent-bit"
  chart            = "aws-for-fluent-bit"
  repository       = "https://aws.github.io/eks-charts"
  version          = "0.1.24"
  namespace        = "logging"
  create_namespace = true
  max_history      = 20
  timeout          = 300

  set {
    name  = "clusterName"
    value = module.eks.cluster_id
  }

  // Disable the new logging plugin
  set {
    name  = "cloudWatchLogs.enabled"
    value = false
  }

  // Enable the old logging plugin
  set {
    name  = "cloudWatch.enabled"
    value = true
  }

  set {
    name  = "cloudWatch.logGroupName"
    value = "/aws/eks/${local.cluster_name}/$(kubernetes['labels']['app'])"
  }

  set {
    name  = "cloudWatch.region"
    value = data.aws_region.current.name
  }

  set {
    name  = "cloudWatch.logRetentionDays"
    value = "30"
  }

  set {
    name  = "priorityClassName"
    value = kubernetes_priority_class.high_priority.metadata[0].name
  }
}

The generated fluent-bit.conf (checking one aws-for-fluent-bit pod):

[OUTPUT]
    Name                  cloudwatch
    Match                 *
    region                eu-west-1
    log_group_name        /aws/eks/fluentbit-cloudwatch/logs
    log_stream_prefix     fluentbit-
    log_retention_days    30
    auto_create_group     true

I am not really sure why; the logging would indicate it is trying to use the correct one, but the configmap does not reflect that.

(Screenshot of OpenLens)

I will try a couple of other things, but since not all apps have the "app" label, it is not really a solution for us if it "spams" the log like this. Possibly the Fluent Bit 2.0 update will help, so we can change the log level. I double-checked production (with the old version): there are no errors logged about this, even though there are also workloads there with no 'app' label. My guess is that this behaviour changed somewhere over the last few years.

@Mattie112
Author

I also deleted the entire helm chart and applied it again (using terraform). Same result. The configmap still has log_group_name /aws/eks/fluentbit-cloudwatch/logs in it.

When checking the helm chart values, they do look correct:

(screenshot of the helm chart values)

So yeah, I am not sure what to do. If it still logs all the messages that could not be parsed, then this is not even an option for us due to the excessive logging. If we could hide/filter those errors, that would be better. For now I will set it back to namespace_name again, but if you have any suggestions I'd love to hear them!

@PettitWesley
Contributor

@Mattie112

time="2023-03-23T10:07:01Z" level=error msg="[cloudwatch 0] parsing log_group_name template '/aws/eks/staging/$(kubernetes['labels']['app'])' (using value of default_log_group_name instead): app: sub-tag name not found"

I think this error just means the app label is not found. See: https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit/blob/mainline/cloudwatch/cloudwatch.go#L537

@Mattie112
Author

Yeah, so it does seem to work; strange that I don't see it in the fluent-bit config. But hey, that does not really matter, as this spams the log by generating an error for each log line that cannot be mapped.

I think I will keep it on namespace_name for now and then wait for aws/aws-for-fluent-bit#599 before I switch back to the format I'd like.
