
out_cloudwatch: Segmentation fault #6616

Closed
supergillis opened this issue Dec 23, 2022 · 24 comments

Comments

@supergillis

Bug Report

Describe the bug

fluent-bit immediately crashes with a segmentation fault when cloudwatch_logs output is used on ARM64.

To Reproduce

Configure fluent-bit to output to cloudwatch_logs.

[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    DB                /var/log/flb_kube.db
    Parser            docker
    Docker_Mode       On
    Mem_Buf_Limit     5MB
    Skip_Long_Lines   On
    Refresh_Interval  10

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            https://kubernetes.default.svc.cluster.local:443
    Merge_Log           On
    Merge_Log_Key       data
    Keep_Log            On
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
    Buffer_Size         32k

[OUTPUT]
    Name loki
    Match *
    host loki-gateway.monitoring.svc.cluster.local
    port 80
    auto_kubernetes_labels false
    labels job=fluent-bit
    label_map_path /etc/labelmap.json
    remove_keys kubernetes

[OUTPUT]
    Name                  cloudwatch_logs
    Match                 *
    region                eu-west-1
    log_group_name        /aws/eks/fluentbit-cloudwatch/logs
    log_stream_prefix     fluentbit-
    log_retention_days    30
    auto_create_group     true

The following error is logged on startup and fluent-bit crashes:

[2022/12/23 14:06:57] [ info] [fluent bit] version=2.0.8, commit=9444fdc5ee, pid=1                                                                                                            
[2022/12/23 14:06:57] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128                                                                                  
[2022/12/23 14:06:57] [ info] [cmetrics] version=0.5.8                                                                                                                                        
[2022/12/23 14:06:57] [ info] [ctraces ] version=0.2.7                                                                                                                                        
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] initializing                                                                                                                                
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)                                                                                                     
[2022/12/23 14:06:57] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc.cluster.local port=443                                                                     
[2022/12/23 14:06:57] [ info] [filter:kubernetes:kubernetes.0]  token updated                                                                                                                 
[2022/12/23 14:06:57] [ info] [filter:kubernetes:kubernetes.0] local POD info OK                                                                                                              
[2022/12/23 14:06:57] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...                                                                                        
[2022/12/23 14:06:57] [ info] [filter:kubernetes:kubernetes.0] connectivity OK                                                                                                                
[2022/12/23 14:06:57] [ info] [output:loki:loki.0] configured, hostname=loki-gateway.monitoring.svc.cluster.local:80                                                                          
[2022/12/23 14:06:57] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] worker #0 started                                                                                                    
[2022/12/23 14:06:57] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020                                                                                                                
[2022/12/23 14:06:57] [ info] [sp] stream processor started
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528997 watch_fd=2 name=/var/log/containers/aws-load-balancer-controller-5bfbd747bd-59v4h_kube-system_aws-load-balanc
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=134898 watch_fd=3 name=/var/log/containers/aws-load-balancer-controller-5bfbd747bd-vwxlb_kube-system_aws-load-balanc
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=134905 watch_fd=4 name=/var/log/containers/aws-load-balancer-controller-5bfbd747bd-vwxlb_kube-system_aws-load-balanc
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=266650 watch_fd=5 name=/var/log/containers/aws-node-bndp7_kube-system_aws-node-25e1ce4793370ef1b2386200c27bc6c0f2007
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=266644 watch_fd=6 name=/var/log/containers/aws-node-bndp7_kube-system_aws-vpc-cni-init-1798466d7955c1f09fa5be21ef65c
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=134906 watch_fd=7 name=/var/log/containers/cert-manager-cainjector-86f7f4749-kfndw_cert-manager_cert-manager-cainjec
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=134907 watch_fd=8 name=/var/log/containers/cert-manager-cainjector-86f7f4749-kfndw_cert-manager_cert-manager-cainjec
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=269677 watch_fd=9 name=/var/log/containers/cert-manager-webhook-66c85f8577-lhbfz_cert-manager_cert-manager-webhook-9
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=266446 watch_fd=10 name=/var/log/containers/cluster-proportional-autoscaler-coredns-6ccfb4d9b5-8v892_kube-system_clu
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=266427 watch_fd=11 name=/var/log/containers/coredns-68b4d7cb47-sk5lc_kube-system_coredns-65f7eea6cd0dd159aa4bcba818a
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=529055 watch_fd=12 name=/var/log/containers/ebs-csi-controller-ddd4f6984-225ng_kube-system_csi-attacher-5c87844421e8
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=529031 watch_fd=13 name=/var/log/containers/ebs-csi-controller-ddd4f6984-225ng_kube-system_csi-provisioner-ee9da4c02
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=529104 watch_fd=14 name=/var/log/containers/ebs-csi-controller-ddd4f6984-225ng_kube-system_csi-resizer-b8d5c1b1e0569
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=529078 watch_fd=15 name=/var/log/containers/ebs-csi-controller-ddd4f6984-225ng_kube-system_csi-snapshotter-280f7d129
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528990 watch_fd=16 name=/var/log/containers/ebs-csi-controller-ddd4f6984-225ng_kube-system_ebs-plugin-09adedebe40a79
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=530317 watch_fd=17 name=/var/log/containers/ebs-csi-controller-ddd4f6984-225ng_kube-system_liveness-probe-5631541006
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528598 watch_fd=18 name=/var/log/containers/ebs-csi-node-bzvwg_kube-system_ebs-plugin-5be936e176dc41bee8d7cfb3d9a358
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528843 watch_fd=19 name=/var/log/containers/ebs-csi-node-bzvwg_kube-system_liveness-probe-ba8c6f18f5865fcaf56b061e93
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528725 watch_fd=20 name=/var/log/containers/ebs-csi-node-bzvwg_kube-system_node-driver-registrar-7ebb2d392e1c655e89e
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528542 watch_fd=22 name=/var/log/containers/fluent-bit-gkclz_monitoring_fluent-bit-b40b1d6a925d1bbfa8aae336fa00b37c8
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=7765 watch_fd=32 name=/var/log/containers/kube-proxy-qwmsq_kube-system_kube-proxy-8e9f8234db195425e6db081745b8fe6d57
[2022/12/23 14:06:57] [ info] [input:tail:tail.0] inotify_fs_add(): inode=528530 watch_fd=33 name=/var/log/containers/fluent-bit-gkclz_monitoring_fluent-bit-5df98c2f5e66a525ba1cd91b5ddcb6e2c
[2022/12/23 14:06:57] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Creating log stream fluentbit-kube.var.log.containers.fluent-bit-gkclz_monitoring_fluent-bit-5df98c2f5e66a525ba1cd91b
[2022/12/23 14:06:57] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Created log stream fluentbit-kube.var.log.containers.fluent-bit-gkclz_monitoring_fluent-bit-5df98c2f5e66a525ba1cd91b5
[2022/12/23 14:06:57] [engine] caught signal (SIGSEGV)                                                                                                                                        
#0  0xaaaab52df2b0      in  template_execute() at lib/msgpack-c/include/msgpack/unpack_template.h:172                                                                                         
#1  0xaaaab52e0f6b      in  msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:677                                                                                                           
#2  0xaaaab4c7ab0b      in  process_and_send() at plugins/out_cloudwatch_logs/cloudwatch_api.c:859                                                                                            
#3  0xaaaab4c741bf      in  cb_cloudwatch_flush() at plugins/out_cloudwatch_logs/cloudwatch_logs.c:417                                                                                        
#4  0xaaaab48a860f      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:527                                                                                                      
#5  0xaaaab52fc0ef      in  co_switch() at lib/monkey/deps/flb_libco/aarch64.c:133                                                                                                            
#6  0xffffffffffffffff  in  ???() at ???:0                                                                                                                                                    
Stream closed EOF for monitoring/fluent-bit-gkclz (fluent-bit)

Expected behavior

The application should start successfully and begin logging to CloudWatch.

Your Environment

  • Version used: 2.0.8
  • Configuration: See above.
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes v1.24.7-eks-fb459a0
  • Server type and version: arm64
  • Operating System and version: Bottlerocket OS 1.11.1 (aws-k8s-1.24)
  • Filters and plugins: kubernetes, cloudwatch_logs

Additional context

There is another issue logged with a similar error. My expectation was that this would be resolved by version 2.0.8, which includes the fix, but that does not seem to be the case. See #6451

@cdancy

cdancy commented Jan 27, 2023

@supergillis did you ever find a workaround for this? We are hitting the exact same thing, but only when we enable the S3 plugin as well. We're only enabling cloudwatch_logs and s3 at the moment. If we enable only one or the other, everything works as expected. It almost feels like it could be a resource issue, but I don't know.

[2023/01/27 19:06:47] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Creating log stream logging-fluentbit-kube.var.log.containers.logging-p578l_logging_fluent-bit-123456789.log in log group /aws/eks/helloworld/cluster/namespace/logging
[2023/01/27 19:06:47] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Created log stream logging-fluentbit-kube.var.log.containers.logging-p578l_logging_fluent-bit-123456789a.log
[2023/01/27 19:06:47] [engine] caught signal (SIGSEGV)
#0  0x561a391bbaf5      in  template_execute() at lib/msgpack-c/include/msgpack/unpack_template.h:172
#1  0x561a391bd59b      in  msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:677
#2  0x561a38b60f9e      in  process_and_send() at plugins/out_cloudwatch_logs/cloudwatch_api.c:859
#3  0x561a38b5a717      in  cb_cloudwatch_flush() at plugins/out_cloudwatch_logs/cloudwatch_logs.c:417
#4  0x561a3879c267      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:527
#5  0x561a391d8266      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117 

Tagging some of the devs as this is stopping us from moving to production with this solution.

@patrick-stephens @nikitka @edsiper @PettitWesley

@PettitWesley
Contributor

Does this only happen on 2.0?

If anyone is able to obtain a core dump that'd be super helpful: https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/tutorials/remote-core-dump/README.md

@PettitWesley
Contributor

The AWS distro does not support 2.x yet. We will soon. Currently we recommend our 2.29.0+ versions for CloudWatch customers.

We have an upcoming concurrency issue fix for S3 in 2.31.0: aws/aws-for-fluent-bit@21c287e

@PettitWesley
Contributor

Enabling debug logging may give more info. Does anyone have a stack trace for S3?

@cdancy can you share your config?
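
For reference, a minimal sketch of turning on debug logging through the service section of the Fluent Bit configuration (the rest of the config stays as-is):

[SERVICE]
    Log_Level  debug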

@cdancy

cdancy commented Jan 27, 2023

@PettitWesley we're using OOTB fluent-bit here and not the aws distro. I'm assuming that's OK?

We've only been playing with 2.x but I can downgrade to 1.x and see what happens.

I can certainly enable debug logs to see if it dumps anything else as well as get a core dump.

config is as follows (being done through helm values.yaml):

    inputs: |
      [INPUT]
          Name tail
          Path /var/log/containers/*.log
          multiline.parser docker, cri
          Tag kube.*
          Mem_Buf_Limit 5MB
          Refresh_Interval 1
          Buffer_Chunk_Size 100k
          Buffer_Max_Size 1M
          Threaded on
          Skip_Long_Lines On
          Skip_Empty_Lines On

    outputs: |
      [OUTPUT]
          Name s3
          Match *
          region us-east-1
          bucket devtest
          total_file_size 50M
          upload_timeout 1m
          s3_key_format /logs/devtest/\$UUID.gz
          s3_key_format_tag_delimiters .-
          workers 1
          compression gzip
      
      [OUTPUT]
          Name cloudwatch_logs
          Match *
          region us-east-1
          log_group_name /aws/eks/devtest/cluster/namespace/logging
          log_stream_prefix logging-fluentbit-
          auto_create_group On

EDIT: it should be noted that running with either only the S3 or only the cloudwatch_logs plugin works perfectly. It's only when we add the other that things go sideways, and when that happens, it's only the cloudwatch_logs plugin that fails.

@cdancy

cdancy commented Jan 27, 2023

Another data point: when things do fail, the fluent-bit pods crash a few times and eventually work themselves out, but there is always a single one that doesn't. Every time it plays out in exactly the same way.

NAME            READY   STATUS             RESTARTS        AGE
helloworld-2wqkp   1/1     Running            0               7m57s
helloworld-clczn   1/1     Running            3 (7m33s ago)   7m57s
helloworld-f2kbz   1/1     Running            0               7m57s
helloworld-jq9vn   1/1     Running            2 (7m51s ago)   7m56s
helloworld-lvc7x   0/1     CrashLoopBackOff   6 (105s ago)    7m57s
helloworld-pnbfr   1/1     Running            0               7m57s

@cdancy

cdancy commented Jan 27, 2023

@cdancy

cdancy commented Jan 30, 2023

@PettitWesley another data point: downgrading to latest 1.9 produces the same issue.

@PettitWesley
Contributor

@cdancy thanks for letting me know.

I see this:

[2023/01/27 20:20:31] [debug] [input chunk] tail.0 is paused, cannot append records
[2023/01/27 20:20:31] [debug] [input chunk] tail.0 is paused, cannot append records
#0  0x55e4e178aaf5      in  template_execute() at lib/msgpack-c/include/msgpack/unpack_template.h:172
#1  0x55e4e178c59b      in  msgpack_unpack_next() at lib/msgpack-c/src/unpack.c:677
#2  0x55e4e112ff9e      in  process_and_send() at plugins/out_cloudwatch_logs/cloudwatch_api.c:859
#3  0x55e4e1129717      in  cb_cloudwatch_flush() at plugins/out_cloudwatch_logs/cloudwatch_logs.c:417
#4  0x55e4e0d6b267      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:527
#5  0x55e4e17a7266      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

Similar to the trace already in this issue. I am wondering if it's related to the S3 concurrency issues we recently discovered: #6573

This will be released in AWS4FB 2.31.0 very soon; not sure when it will get into this distro: https://github.com/aws/aws-for-fluent-bit/releases

@cdancy

cdancy commented Jan 30, 2023

@PettitWesley is there anything we can do config-wise to get around this for our purposes? Any way we can impress upon whomever to get that PR merged and a new release out?

@PettitWesley
Contributor

@cdancy I have a pre-release build of S3 concurrency fix here: aws/aws-for-fluent-bit#495 (comment)

Keep checking AWS distro, we will release it very very soon.

If you can repro it with an AWS distro version and open an issue at our repo, that will also help with tracking.

@cdancy

cdancy commented Jan 30, 2023

@PettitWesley thanks! Is it "better" to use the aws-for-fluent-bit project over this one? We don't seem to need all the bells and whistles the aws-for-fluent-bit project adds, but if it has more fixes than this one, that could be a reason to use the other.

@PettitWesley
Contributor

@cdancy AWS distro is focused on AWS customers and thus yes, is always the most up to date distro in terms of fixes for AWS customers.

We don't seem to need all the bells and whistles the aws-for-fluent-bit project uses

We're distributing the same code; the only thing we add is the old AWS Go plugins. What do you mean? https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#aws-go-plugins-vs-aws-core-c-plugins

@cdancy

cdancy commented Jan 30, 2023

We're distributing the same code, the only thing we add is the old AWS go plugins. What do you mean?

It's for that reason; at least when we were looking into things, it seemed unnecessary for our use cases to go that route, since there was nothing in it that we needed, nor did we need to use the older plugins. With that said, I will keep an eye on the aws-for-fluent-bit project to see if something is coming out sooner rather than later.

@cdancy

cdancy commented Jan 31, 2023

Keep checking AWS distro, we will release it very very soon.

@PettitWesley how soon is "very very soon"? Any timetable one way or another? Just wondering if we should be spinning our wheels testing out hacks, etc., or if the release is imminent and we should just hold off.

@cdancy

cdancy commented Jan 31, 2023

@PettitWesley so, another data point: we switched to using aws-for-fluent-bit and still hit the same problem. However, when we changed from the cloudwatch_logs plugin to the cloudwatch plugin, we were able to get things to work. Something fishy is going on, but we're going to go with the "normal performance" plugin for now over the fluent-bit "high performance" plugin in order to keep things moving forward. Please let us know when a new release is out and we'll grab that.
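
For anyone wanting to try the same switch, a rough sketch of what the cloudwatch (Go plugin) output section could look like, reusing the group and stream values from the earlier example config; option names are taken from the Go plugin's documentation and should be checked against the version in use:

[OUTPUT]
    Name               cloudwatch
    Match              *
    region             us-east-1
    log_group_name     /aws/eks/devtest/cluster/namespace/logging
    log_stream_prefix  logging-fluentbit-
    auto_create_group  true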

@PettitWesley
Contributor

@cdancy I wrote a guide on reproducing crash reports; it's focused on ECS FireLens, but the core dump S3 uploader technique can be used anywhere: https://github.com/aws/aws-for-fluent-bit/pull/533/files

You'd need to customize it to build 2.0.

My team is not set up to support 2.0 yet, the AWS distro is still on 1.9, but if you can use this to get a core dump from 2.0 and send it to us, we could help fix it.

@PettitWesley
Contributor

I want to note here that we think we've found a potential issue when you have multiple cloudwatch_logs outputs matching the same tag. Not sure which versions this impacts yet.
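
Until that is confirmed, one possible (untested) way to avoid two outputs matching the same tag is to duplicate the stream under a second tag with the rewrite_tag filter and point each output at its own tag; a sketch only, assuming the records carry a log key:

[FILTER]
    Name   rewrite_tag
    Match  kube.*
    Rule   $log .* s3.$TAG true

[OUTPUT]
    Name   s3
    Match  s3.*
    # region, bucket, etc. as before

[OUTPUT]
    Name   cloudwatch_logs
    Match  kube.*
    # region, log_group_name, etc. as before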

@cdancy

cdancy commented Feb 7, 2023

@PettitWesley we're only using a single cloudwatch_logs output here, for what that's worth. I see there is a new version you all released, which we have yet to try.

@cdancy

cdancy commented Feb 8, 2023

@PettitWesley we tried using the cloudwatch_logs plugin again with the new version you all just released, and things are still failing, albeit slightly differently now. It's weird that things work and work and work and then all of a sudden fall over with a SIGSEGV.

[2023/02/08 18:16:51] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Using stream=logging-s54vn.aws-for-fluent-bit, group=application-logs-ip-10-0-54-53.ec2.internal.logging
[2023/02/08 18:16:51] [debug] [input chunk] update output instances with new chunk size diff=16064
[2023/02/08 18:16:51] [debug] [input:tail:tail.0] inode=25165971 events: IN_MODIFY
[2023/02/08 18:16:51] [debug] [input chunk] update output instances with new chunk size diff=11416
[2023/02/08 18:16:51] [debug] [input chunk] update output instances with new chunk size diff=2145
[2023/02/08 18:16:51] [debug] [input:tail:tail.0] inode=25165971 events: IN_MODIFY
[2023/02/08 18:16:51] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Using stream=logging-s54vn.aws-for-fluent-bit, group=application-logs-ip-10-0-54-53.ec2.internal.logging
[2023/02/08 18:16:51] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Using stream=logging-s54vn.aws-for-fluent-bit, group=application-logs-ip-10-0-54-53.ec2.internal.logging
[2023/02/08 18:16:51] [debug] [out flush] cb_destroy coro_id=5
[2023/02/08 18:16:51] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Using stream=logging-s54vn.aws-for-fluent-bit, group=application-logs-ip-10-0-54-53.ec2.internal.logging
[2023/02/08 18:16:51] [debug] [input:tail:tail.0] inode=25165971 events: IN_MODIFY
[2023/02/08 18:16:51] [debug] [input chunk] update output instances with new chunk size diff=5506
[2023/02/08 18:16:51] [engine] caught signal (SIGSEGV)

@PettitWesley
Contributor

PettitWesley commented Feb 8, 2023

@cdancy actually, the issue seems to be caused simply by having any two outputs match the same tag. We've also discovered another issue on 1.9 (not sure if it applies to 2.0) which is much more likely to occur if you have net.keepalive On, the default.

The stack trace in this issue matches the "two outputs match same tag" issue we are still investigating.

Sorry that's all I can say right now. Will post more when I have more clear info.
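
For anyone who wants to test whether keepalive is a factor, it can be turned off per output; a hedged sketch against the earlier example (not a confirmed fix), keeping the rest of the output section unchanged:

[OUTPUT]
    Name           cloudwatch_logs
    Match          *
    # ... region, log_group_name, etc. as before ...
    net.keepalive  Off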

@PettitWesley
Contributor

Please see the notice for known high impact issues as of the beginning of this year: aws/aws-for-fluent-bit#542

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label May 10, 2023
@PettitWesley
Contributor
