
[all versions] exec input SIGSEGV/crash due to uninitialized memory [fix in 2.31.12] #661

Open
SoamA opened this issue May 23, 2023 · 80 comments
Labels: bug (Something isn't working)

Comments

SoamA commented May 23, 2023

Describe the question/issue

Hi folks,

We've been basing our EKS logging infrastructure on AWS-for-fluent-bit. Recently, however, some Fluent Bit pods have been crashing with a SIGSEGV on startup and going into a CrashLoopBackOff on deployment. Redeploying reproduces the problem on the very same physical hosts, while pods on other hosts in the same cluster run fine. The EKS workers are configured identically, so it's a head scratcher why this keeps happening, persistently, on a handful of seemingly random nodes while the majority of the pods run fine.

If we let the pod retries run long enough on a host, it will eventually succeed, but that can take anywhere from an hour to a day, which is unacceptable for a production environment.

Configuration

Please find the config map for aws-for-fluent-bit attached; it contains the Fluent Bit config file as well. Note that the DaemonSet runs in its own namespace ("logging"), which is attached to a dedicated service account, "fb-service-account".

aws-for-fluent-bit-conf.txt

Fluent Bit Log Output

DebugLog.txt

Fluent Bit Version Info

AWS for Fluent Bit image: 2.31.11, though we've also seen the same behavior with 2.31.10 and 2.31.6.
For debugging, we used debug-2.31.11.

Cluster Details

EKS cluster: K8s version 1.25 (v1.25.9-eks-0a21954), though we've also noticed this on earlier versions
Instance types: mostly r6gd.16xl and the occasional r5d.16xlarge
We base our internal AMIs on the following images:

  • arm64 - ami-0aa7aa4c87fe47ff6
  • amd64 - ami-071432800334eb200

Application Details

This occurred on an essentially idle cluster with no applications running. Fluent Bit crashes immediately on startup, so load wasn't a factor.

Steps to reproduce issue

We reviewed the suggestions provided in:

We generated the debug log as follows:

  • on a machine where FB was repeatedly crashing, we copied the fluent-bit config files over to /home/myname/etc
  • docker run -it --entrypoint=/bin/bash --ulimit core=-1 -v /var/log:/var/log -v /home/sacharya:/cores -v $(pwd):/fluent-bit/etc public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11
  • Note that while the image name is the same,
  • once in the pod shell, we had to install some libraries to get rid of gdb warnings:
yum -y install yum-utils
debuginfo-install bzip2-libs-1.0.6-13.amzn2.0.3.aarch64 cyrus-sasl-lib-2.1.26-24.amzn2.aarch64 elfutils-libelf-0.176-2.amzn2.aarch64 elfutils-libs-0.176-2.amzn2.aarch64 keyutils-libs-1.5.8-3.amzn2.0.2.aarch64 krb5-libs-1.15.1-55.amzn2.2.5.aarch64 libcap-2.54-1.amzn2.0.1.aarch64 libcom_err-1.42.9-19.amzn2.0.1.aarch64 libcrypt-2.26-63.amzn2.aarch64 libgcc-7.3.1-15.amzn2.aarch64 libgcrypt-1.5.3-14.amzn2.0.3.aarch64 libgpg-error-1.12-3.amzn2.0.3.aarch64 libselinux-2.5-12.amzn2.0.2.aarch64 libyaml-0.1.4-11.amzn2.0.2.aarch64 lz4-1.7.5-2.amzn2.0.1.aarch64 pcre-8.32-17.amzn2.0.2.aarch64 systemd-libs-219-78.amzn2.0.22.aarch64 xz-libs-5.2.2-1.amzn2.0.3.aarch64 zlib-1.2.7-19.amzn2.0.2.aarch64

yum remove openssl-debuginfo-1:1.0.2k-24.amzn2.0.6.aarch64
debuginfo-install openssl11-libs-1.1.1g-12.amzn2.0.13.aarch64
  • and then, finally:
export FLB_LOG_LEVEL=debug
gdb /fluent-bit/bin/fluent-bit
r -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -c /fluent-bit/etc/fluent-bit.conf

Note that we were able to run FB manually via docker on the other hosts in the cluster where FB pods had been launched successfully by our usual (Terraform/Helm based) deployment process.

Related Issues

@PettitWesley (Contributor)

@SoamA Thank you for your report. It would really help us if you could deploy one of our pre-built debug images without changing the entrypoint; that will output a full stacktrace. Why did you customize the image entrypoint?

Here are some references:

These show our entrypoint, which both prints the core stacktrace and uploads it to S3. If you want to customize, please start from these:

SoamA (Author) commented May 23, 2023

Hi @PettitWesley - appreciate the response! We tried to generate core dumps using the prebuilt images but couldn't find any actually being created, hence going down the gdb/entrypoint path. I'll retry and update.

PettitWesley (Contributor) commented May 23, 2023

@SoamA if a core was created, our pre-built debug images without changes will BOTH output the stacktrace to stdout and upload the full core to S3. If there was no core generated, it should output a message like "No core to upload".
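
For reference, an invocation along these lines should exercise that path (host paths and the bucket/prefix values here are placeholders; the env var names are the ones the image prints on startup):

docker run --ulimit core=-1 \
  -v /var/log:/var/log \
  -v /path/to/your/etc:/fluent-bit/etc \
  -e S3_BUCKET=my-debug-bucket \
  -e S3_KEY_PREFIX=my-team/flb-crash \
  public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11
# the container needs AWS credentials that can write to that bucket for the upload step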

SoamA (Author) commented May 23, 2023

@PettitWesley - yes, I get the "No core to upload" message. Here's the full log run:

# docker run --ulimit core=-1 -v /var/log:/var/log -v /home/sacharya:/cores -v /home/sacharya/etc:/fluent-bit/etc --env FLB_LOG_LEVEL=debug public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11
Unable to find image 'public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11' locally
debug-2.31.11: Pulling from aws-observability/aws-for-fluent-bit
7ed20885ae48: Pulling fs layer 
e1e9cac6d36f: Pull complete 
b8b9260725cf: Pull complete 
e55ff44b904d: Pull complete 
4f39e777b183: Pull complete 
d3dd74b6712d: Pull complete 
49b63aad221e: Pull complete 
e11e7fe40475: Pull complete 
e1e3c7548ee4: Pull complete 
f62f2598a3f0: Pull complete 
378e6e4f1434: Pull complete 
a1954a714f61: Pull complete 
b90cc0664476: Pull complete 
f1aabba20e84: Pull complete 
85e713a0eeed: Pull complete 
4fe2be406f9e: Pull complete 
b69d560aaca4: Pull complete 
31245aca59a9: Pull complete 
d03b55a8e192: Pull complete 
ca1904f257a4: Pull complete 
76eec49be659: Pull complete 
5b2b2d7663f2: Pull complete 
21133f660f80: Pull complete 
51e9829c0be9: Pull complete 
694e2a0ae2a2: Pull complete 
ada7b99d6756: Pull complete 
58a4ffdf8625: Pull complete 
2f8627c95613: Pull complete 
Digest: sha256:6f254a095c478d6da6fe271b141c0026eec0c664fb44247bec0ea6ecda85e615
Status: Downloaded newer image for public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11
AWS for Fluent Bit Container Image Version 2.31.11 - Debug Image with S3 Core Uploader
Note: Please set S3_BUCKET environment variable to your crash symbol upload destination S3 bucket
Note: Please set S3_KEY_PREFIX environment variable to a useful identifier - e.g. company name, team name, customer name
RUN_ID is set to 17894158985717
Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/05/23 19:19:21] [ info] Configuration:
[2023/05/23 19:19:21] [ info]  flush time     | 1.000000 seconds
[2023/05/23 19:19:21] [ info]  grace          | 5 seconds
[2023/05/23 19:19:21] [ info]  daemon         | 0
[2023/05/23 19:19:21] [ info] ___________
[2023/05/23 19:19:21] [ info]  inputs:
[2023/05/23 19:19:21] [ info]      tail
[2023/05/23 19:19:21] [ info]      tail
[2023/05/23 19:19:21] [ info]      tail
[2023/05/23 19:19:21] [ info]      exec
[2023/05/23 19:19:21] [ info] ___________
[2023/05/23 19:19:21] [ info]  filters:
[2023/05/23 19:19:21] [ info]      kubernetes.0
[2023/05/23 19:19:21] [ info]      rewrite_tag.1
[2023/05/23 19:19:21] [ info]      kubernetes.2
[2023/05/23 19:19:21] [ info]      rewrite_tag.3
[2023/05/23 19:19:21] [ info] ___________
[2023/05/23 19:19:21] [ info]  outputs:
[2023/05/23 19:19:21] [ info]      s3.0
[2023/05/23 19:19:21] [ info]      s3.1
[2023/05/23 19:19:21] [ info]      s3.2
[2023/05/23 19:19:21] [ info]      s3.3
[2023/05/23 19:19:21] [ info]      s3.4
[2023/05/23 19:19:21] [ info]      null.5
[2023/05/23 19:19:21] [ info] ___________
[2023/05/23 19:19:21] [ info]  collectors:
[2023/05/23 19:19:21] [ info] [fluent bit] version=1.9.10, commit=101f9fab76, pid=10
[2023/05/23 19:19:21] [debug] [engine] coroutine stack size: 196608 bytes (192.0K)
[2023/05/23 19:19:21] [ info] [storage] version=1.3.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2023/05/23 19:19:21] [ info] [cmetrics] version=0.3.7
[2023/05/23 19:19:21] [debug] [tail:tail.0] created event channels: read=27 write=28
[2023/05/23 19:19:21] [ info] [input:tail:tail.0] multiline core started
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] flb_tail_fs_inotify_init() initializing inotify tail input
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] inotify watch fd=37
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] scanning path /var/log/containers/aws-node*
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] inode=6442451735 with offset=637 appended as /var/log/containers/aws-node-jrgj8_kube-system_aws-node-45f5e3c893d2470390fadb475c9517fbe156e39604c45a7c2c30f2bdaaed8858.log
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] scan_glob add(): /var/log/containers/aws-node-jrgj8_kube-system_aws-node-45f5e3c893d2470390fadb475c9517fbe156e39604c45a7c2c30f2bdaaed8858.log, inode 6442451735
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] inode=5905580814 with offset=1239 appended as /var/log/containers/aws-node-jrgj8_kube-system_aws-vpc-cni-init-9e5e21e9c7d59c5b2bf4cda947c509231519257d342cbea8c28ff02f127846e1.log
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] scan_glob add(): /var/log/containers/aws-node-jrgj8_kube-system_aws-vpc-cni-init-9e5e21e9c7d59c5b2bf4cda947c509231519257d342cbea8c28ff02f127846e1.log, inode 5905580814
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] 2 new files found on path '/var/log/containers/aws-node*'
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] scanning path /var/log/containers/kube-proxy*
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] inode=5905604236 with offset=154004 appended as /var/log/containers/kube-proxy-qdddr_kube-system_kube-proxy-6819205653fee0885ef8c389c7df7d44ce5082a53f8dc6fa402bc3e324c9031d.log
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] scan_glob add(): /var/log/containers/kube-proxy-qdddr_kube-system_kube-proxy-6819205653fee0885ef8c389c7df7d44ce5082a53f8dc6fa402bc3e324c9031d.log, inode 5905604236
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] 1 new files found on path '/var/log/containers/kube-proxy*'
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] scanning path /var/log/containers/aws-for-fluent-bit*
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] inode=7516371205 with offset=851 appended as /var/log/containers/aws-for-fluent-bit-f7qmf_logging_aws-for-fluent-bit-40179b36ad2b62ff4a4ee6838f7d3f4b735438367bbc845170149c338969c224.log
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] scan_glob add(): /var/log/containers/aws-for-fluent-bit-f7qmf_logging_aws-for-fluent-bit-40179b36ad2b62ff4a4ee6838f7d3f4b735438367bbc845170149c338969c224.log, inode 7516371205
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] 1 new files found on path '/var/log/containers/aws-for-fluent-bit*'
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] scanning path /var/log/containers/spark-history-server*
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] cannot read info from: /var/log/containers/spark-history-server*
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] 0 new files found on path '/var/log/containers/spark-history-server*'
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] scanning path /var/log/containers/spark-operator*
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] cannot read info from: /var/log/containers/spark-operator*
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] 0 new files found on path '/var/log/containers/spark-operator*'
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] scanning path /var/log/containers/*_yunikorn_*
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] cannot read info from: /var/log/containers/*_yunikorn_*
[2023/05/23 19:19:21] [debug] [input:tail:tail.0] 0 new files found on path '/var/log/containers/*_yunikorn_*'
[2023/05/23 19:19:21] [debug] [tail:tail.1] created event channels: read=42 write=43
[2023/05/23 19:19:21] [ info] [input:tail:tail.1] multiline core started
[2023/05/23 19:19:21] [debug] [input:tail:tail.1] flb_tail_fs_inotify_init() initializing inotify tail input
[2023/05/23 19:19:21] [debug] [input:tail:tail.1] inotify watch fd=52
[2023/05/23 19:19:21] [debug] [input:tail:tail.1] scanning path /var/log/containers/*_default_*
[2023/05/23 19:19:21] [debug] [input:tail:tail.1] inode=1342221990 with offset=848612 appended as /var/log/containers/generate-related-pins-calendar-day-denom-metrics-country-8188828812617950-exec-15_default_spark-kubernetes-executor-9db6f71994793b467c600e63ff21055f1f8d814a752fcfb8ec8ec23f241a5993.log
[2023/05/23 19:19:21] [debug] [input:tail:tail.1] scan_glob add(): /var/log/containers/generate-related-pins-calendar-day-denom-metrics-country-8188828812617950-exec-15_default_spark-kubernetes-executor-9db6f71994793b467c600e63ff21055f1f8d814a752fcfb8ec8ec23f241a5993.log, inode 1342221990
[2023/05/23 19:19:21] [debug] [input:tail:tail.1] inode=1879221443 with offset=133657 appended as /var/log/containers/pinlogsdatasetinsightsjob-local-adhoc-001-f226a9880873a7be-exec-26_default_spark-kubernetes-executor-04552d85b28fe68fd29dbf2b6ad0f9fa3a974e1596ed9198ca50ddd1a4cb493f.log
[2023/05/23 19:19:21] [debug] [input:tail:tail.1] scan_glob add(): /var/log/containers/pinlogsdatasetinsightsjob-local-adhoc-001-f226a9880873a7be-exec-26_default_spark-kubernetes-executor-04552d85b28fe68fd29dbf2b6ad0f9fa3a974e1596ed9198ca50ddd1a4cb493f.log, inode 1879221443
[2023/05/23 19:19:21] [debug] [input:tail:tail.1] inode=8321502110 with offset=128992 appended as /var/log/containers/pinlogsdatasetinsightsjob-s3-adhoc-001-57ae29880885048c-exec-11_default_spark-kubernetes-executor-883d140f31ac6763cf2906902f74fa99742297696977bd2d31d13245eb07ba05.log
[2023/05/23 19:19:21] [debug] [input:tail:tail.1] scan_glob add(): /var/log/containers/pinlogsdatasetinsightsjob-s3-adhoc-001-57ae29880885048c-exec-11_default_spark-kubernetes-executor-883d140f31ac6763cf2906902f74fa99742297696977bd2d31d13245eb07ba05.log, inode 8321502110
[2023/05/23 19:19:21] [debug] [input:tail:tail.1] inode=2684357349 with offset=1702227 appended as /var/log/containers/pinlogss3accesssummaryjobgraviton-local-bhavin-adhoc001-fff63e87d95352ec-exec-1_default_spark-kubernetes-executor-fbaf7c26086ac6adc6c12fbcc899de3173f097df5e429e4159b2333d13e6e18a.log
[2023/05/23 19:19:21] [debug] [input:tail:tail.1] scan_glob add(): /var/log/containers/pinlogss3accesssummaryjobgraviton-local-bhavin-adhoc001-fff63e87d95352ec-exec-1_default_spark-kubernetes-executor-fbaf7c26086ac6adc6c12fbcc899de3173f097df5e429e4159b2333d13e6e18a.log, inode 2684357349
[2023/05/23 19:19:21] [debug] [input:tail:tail.1] 4 new files found on path '/var/log/containers/*_default_*'
[2023/05/23 19:19:21] [debug] [tail:tail.2] created event channels: read=57 write=58
[2023/05/23 19:19:21] [ info] [input:tail:tail.2] multiline core started
[2023/05/23 19:19:21] [debug] [input:tail:tail.2] flb_tail_fs_inotify_init() initializing inotify tail input
[2023/05/23 19:19:21] [debug] [input:tail:tail.2] inotify watch fd=67
[2023/05/23 19:19:21] [debug] [input:tail:tail.2] scanning path /var/log/containers/eventlogs/*\.inprogress
[2023/05/23 19:19:21] [debug] [input:tail:tail.2] cannot read info from: /var/log/containers/eventlogs/*\.inprogress
[2023/05/23 19:19:21] [debug] [input:tail:tail.2] 0 new files found on path '/var/log/containers/eventlogs/*\.inprogress'
[2023/05/23 19:19:21] [debug] [exec:exec.3] created event channels: read=68 write=69
[2023/05/23 19:19:21] [engine] caught signal (SIGSEGV)
/bin/sh: line 1:    10 Aborted                 (core dumped) /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -c /fluent-bit/etc/fluent-bit.conf
No core file to upload

@PettitWesley (Contributor)

@SoamA

-v /home/sacharya:/cores

Did you get anything here?

I think this might be an issue where we need to update our guides/docs.

The cores go to /cores-out first: https://github.com/aws/aws-for-fluent-bit/blob/mainline/scripts/dockerfiles/Dockerfile.main-debug-s3#L27

See also: https://github.com/aws/aws-for-fluent-bit/blob/mainline/scripts/core_uploader.sh#L24

Then if the script detects a core there it moves it to /cores: https://github.com/aws/aws-for-fluent-bit/blob/mainline/scripts/core_uploader.sh#L82

But if it detects no core it does nothing: https://github.com/aws/aws-for-fluent-bit/blob/mainline/scripts/core_uploader.sh#L42

Can you try mounting both /cores-out and /cores on your host?

@PettitWesley (Contributor)

I will test this myself and manually trigger a core, to check where it goes...

SoamA (Author) commented May 23, 2023

Ah, noted. Following your guidance, I tried

# docker run --ulimit core=-1 -v /var/log:/var/log -v /home/sacharya/cores:/cores -v /home/sacharya/cores-out:/cores-out -v /home/sacharya/etc:/fluent-bit/etc --env FLB_LOG_LEVEL=debug public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11

but sadly no dice. Same complaint about no core files and both directories were empty.

@PettitWesley (Contributor)

@SoamA I'm sorry, I don't know why this isn't working for you. I double checked that image and I can generate a core dump easily:

$ docker run -it -v /home/ec2-user/cores:/cores -v /home/ec2-user/cores-out:/cores-out --ulimit core=-1 public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11

Then in another window, trigger core:

$ docker ps
CONTAINER ID   IMAGE                                                               COMMAND                   CREATED         STATUS                 PORTS      NAMES
b53d4a2ae069   public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11   "/bin/sh -c 'echo \"A…"   5 seconds ago   Up 1 second            2020/tcp   reverent_golick
1239ffb33a36   amazon/amazon-ecs-agent:latest                                      "/agent"                  3 weeks ago     Up 3 weeks (healthy)              ecs-agent
$ docker exec -it b53d4a2ae069 /bin/bash
bash-4.2# ps -e | grep ^C
bash-4.2# pgrep flu
9
bash-4.2# kill -SIGSEGV 9

and it writes a core to /cores-out, then zips it up, moves it to /cores, and prints the stacktrace... everything works as expected.

This makes me think that somehow a core dump is not actually being generated in your case 🧐

Another option would be to run under Valgrind; we have a make debug-valgrind target. I can also build you an image via that target if you don't want to build it yourself. Valgrind will often also give a stacktrace when an issue occurs.
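
If you'd rather build it yourself, roughly (a sketch; debug-valgrind is the make target mentioned above, everything else is standard git/make):

git clone https://github.com/aws/aws-for-fluent-bit.git
cd aws-for-fluent-bit
make debug-valgrind
# run the resulting image the same way as the debug image; the Valgrind report
# should show up in the container's stdout/stderr when something goes wrong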

SoamA (Author) commented May 23, 2023

Yes, if you could build a valgrind target for Graviton (we run r6gd.16xl), that'd be great. I was going to get around to it, but it'd take a bit of time to provision a dev Graviton node. We do run Intel nodes (r5d.16xl) as well, but far fewer, essentially one per node, at our current configuration.

In the meantime, I'll investigate whether we have any system settings turned on that'd prevent core dumps.
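
Concretely, the knobs I plan to check on the host (standard Linux settings, nothing AWS-specific):

ulimit -c                          # per-process core size limit; 0 disables cores
cat /proc/sys/kernel/core_pattern  # where the kernel writes cores, or a pipe to a crash handler
sysctl kernel.core_pattern         # same setting via sysctl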

BTW, I created an AWS support ticket - https://support.console.aws.amazon.com/support/home?region=us-east-1#/case/?displayId=12844872531&language=en - just to make sure the EKS team is tracking this as well.

SoamA (Author) commented May 23, 2023

Looks like the core files were actually going to /var/log/core/ on the machine. Working on getting you some of the core dumps. Stay tuned.

@PettitWesley (Contributor)

@SoamA cool. I did build you an arm image here:

144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.31.11-debug-valgrind

SoamA (Author) commented May 23, 2023

I uploaded two core files to the support ticket - https://support.console.aws.amazon.com/support/home?region=us-east-1#/case/?displayId=12844872531&language=en. I used the S3 uploader from AWS support. As it doesn't confirm where the final files ended up, there might be multiple copies.

SoamA (Author) commented May 23, 2023

And yes, thanks for the valgrind image. We'll need it as we make our way through the debugging process.

@PettitWesley (Contributor)

@SoamA thanks. I'll check for the core.

SoamA (Author) commented May 24, 2023

Thanks @PettitWesley. Were you able to access the files and did they yield any actionable insights? Let me know what else I can run on my side or any additional data we can capture.

@PettitWesley (Contributor)

@SoamA the files that I got from support seemed to just have your log files from /var/log... there was a lot of stuff in there... is there a specific path I should look at? We didn't see any cores. Sorry.

SoamA (Author) commented May 24, 2023

Hey @PettitWesley - I uploaded two files (twice) via S3 upload:

  • core.flb-pipeline.10.1d99da825c76.1684878593
  • core.flb-pipeline.1.moka-adhoc-001-dpp-moka-worker-adhoc-0ab50023.1684878937

Did you see those files uploaded at all in that location?

SoamA (Author) commented May 25, 2023

I checked the support ticket and AWS confirmed the uploads hadn't gone through the first time. My bad. I've retried, this time with the correct commands, and it looks like it succeeded. Please check again.

@PettitWesley (Contributor)

@SoamA I think I got it now. Just to be clear, exactly which arch (ARM or Intel) and which image were these cores from? I need the right executable in order to read the core.

SoamA (Author) commented May 25, 2023

@PettitWesley - these are from ARM images, produced on an r6gd.16xl box:

core.flb-pipeline.10.1d99da825c76.1684878593: core file produced manually using the debug-2.31.11 ARM image.
core.flb-pipeline.1.moka-adhoc-001-dpp-moka-worker-adhoc-0ab50023.1684878937: core file produced by a Fluent Bit pod crashing as part of the K8s deployment, using the standard 2.31.11 ARM image. There are a lot of these files, but I only uploaded one example.

PettitWesley (Contributor) commented May 25, 2023

@SoamA When you say "core file produced manually", what exactly do you mean? Please be specific.

SoamA (Author) commented May 25, 2023

@PettitWesley - to clarify, core.flb-pipeline.10.1d99da825c76.1684878593 was produced by running the following from the offending machine's command line:

docker run --ulimit core=-1 -v /var/log:/var/log -v /home/sacharya/cores:/cores -v /home/sacharya/cores-out:/cores-out -v /home/sacharya/etc:/fluent-bit/etc --env FLB_LOG_LEVEL=debug public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11

@PettitWesley (Contributor)

@SoamA I'm not able to read the cores with either of the 2.31.11 arm images (debug and non-debug).

GDB output

Below is a sample of what the output looks like; I'm not able to get a stacktrace out of the core.

warning: .dynamic section for "/lib64/libkrb5.so.3" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib64/libk5crypto.so.3" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib64/libcom_err.so.2" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib64/libkrb5support.so.0" is not at the expected address (wrong library or version mismatch?)

warning: .dynamic section for "/lib64/libelf.so.1" is not at the expected address (wrong library or version mismatch?)

warning: Can't open file /usr/lib64/libkeyutils.so.1.5 during file-backed mapping note processing

warning: Can't open file /usr/lib64/libbz2.so.1.0.6 during file-backed mapping note processing

warning: Can't open file /usr/lib64/libelf-0.176.so during file-backed mapping note processing

warning: Can't open file /usr/lib64/libpcre.so.1.2.0 during file-backed mapping note processing

warning: Can't open file /usr/lib64/libcrypt-2.26.so during file-backed mapping note processing

warning: Can't open file /usr/lib64/libdw-0.176.so during file-backed mapping note processing

warning: Could not load shared library symbols for 8 libraries, e.g. /lib64/libssl.so.1.1.
Use the "info sharedlibrary" command to see the complete listing.
Do you need "set solib-search-path" or "set sysroot"?
Warning: couldn't activate thread debugging using libthread_db: Cannot find new threads: generic error

Core was generated by `/fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch'.
Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0  0x0000ffffb110dd28 in printf_positional () from /lib64/libc.so.6
[Current thread is 1 (LWP 26)]

SIGTRAP suggests that you had set breakpoints. We do not set breakpoints in our debug image, so I am guessing you were using the debugger, possibly with a custom version?

/fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch

This also isn't the command used in our entrypoint: https://github.com/aws/aws-for-fluent-bit/blob/mainline/entrypoint.sh

When I install debuginfos, I get:

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Core was generated by `/fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch'.
Program terminated with signal SIGTRAP, Trace/breakpoint trap.
#0  0x0000ffffb110dd28 in printf_positional (s=0xffffc07674ef, format=<optimized out>, readonly_format=<optimized out>,
    ap=<error reading variable: Cannot access memory at address 0x8>, ap_savep=<optimized out>, done=-1322930176, nspecs_done=-1322927416, lead_str_end=<optimized out>,
    work_buffer=<optimized out>, save_errno=<optimized out>, grouping=<optimized out>, thousands_sep=<optimized out>, mode_flags=<optimized out>)
    at /usr/src/debug/glibc-2.34-52.amzn2023.0.2.aarch64/stdio-common/vfprintf-internal.c:1996
1996		  process_arg ((&specs[nspecs_done]));
[Current thread is 1 (LWP 26)]

The above was from the core with adhoc in the name (not posting the full name since parts of it look like they might be related to your service, and I don't want to post that publicly without your permission).

The core file core.flb-pipeline.10.1d99da825c76.1684878593 has:

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Core was generated by `/fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch'.
Program terminated with signal SIGABRT, Aborted.
#0  0x0000ffff83f63530 in group_number (front_ptr=0xfffeffdf26b0 L"", w=0x6 <error: Cannot access memory at address 0x6>,
    rear_ptr=0x6 <error: Cannot access memory at address 0x6>, grouping=0xffffeedd56df "", thousands_sep=2221330432 L'\x8466d000')
    at /usr/src/debug/glibc-2.34-52.amzn2023.0.2.aarch64/stdio-common/vfprintf-internal.c:2121

Steps followed

sudo docker pull public.ecr.aws/aws-observability/aws-for-fluent-bit:2.31.11
sudo docker create -ti --name ssmdemo public.ecr.aws/aws-observability/aws-for-fluent-bit:2.31.11
docker cp ssmdemo:/fluent-bit/bin/fluent-bit .
sudo docker cp ssmdemo:/fluent-bit/bin/fluent-bit .
gdb fluent-bit-2.31.11 core.flb-pipeline.10.1d99da825c76.1684878593

I performed the same steps for the debug-2.31.11 image as well.
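
One thing I may try next, since gdb itself suggests it above, is pointing gdb at the container's own libraries rather than the host's (a sketch using your uploaded core's file name):

mkdir flb-rootfs
docker create --name flb-rootfs public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11
docker export flb-rootfs | tar -x -C flb-rootfs
gdb flb-rootfs/fluent-bit/bin/fluent-bit
(gdb) set sysroot ./flb-rootfs
(gdb) core-file core.flb-pipeline.10.1d99da825c76.1684878593
(gdb) bt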

SoamA (Author) commented May 25, 2023

Hi @PettitWesley - hmm, we didn't use gdb to create those particular core files, so it's puzzling why this is happening. Thanks for sharing the steps you used to view the core files. I'll see if there's an easy way of producing some cores that are more consumable.

@PettitWesley (Contributor)

@SoamA it should be possible for me to read the cores as long as I can get the exact binary used... can you confirm you used the public ECR images that I listed above? The binary in all of those images should be exactly the same.
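
If it helps to confirm we are looking at the same binary, one way to check (standard docker commands, nothing specific to our tooling):

docker pull public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11
docker inspect --format '{{index .RepoDigests 0}}' public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11
# or checksum the executable itself
docker create --name flb-check public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11
docker cp flb-check:/fluent-bit/bin/fluent-bit ./fluent-bit-check
sha256sum ./fluent-bit-check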

SoamA (Author) commented May 25, 2023

Hi @PettitWesley - to produce core.flb-pipeline.10.1d99da825c76.1684878593, I used public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11, and the other core file, core.flb-pipeline.1.moka-adhoc-001-dpp-moka-worker-adhoc-0ab50023.1684878937, should have been produced by public.ecr.aws/aws-observability/aws-for-fluent-bit:2.31.11

I am going to produce another core dump via public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11 and upload. Stay tuned!

SoamA (Author) commented May 25, 2023

Just uploaded another file core.flb-pipeline.10.91cade59032c.1685052744 produced by running the following:

docker run --ulimit core=-1 -v /var/log:/var/log -v /home/sacharya/cores:/cores -v /home/sacharya/cores-out:/cores-out -v /home/sacharya/etc:/fluent-bit/etc --env FLB_LOG_LEVEL=debug public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11

This is using the image public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11

SoamA (Author) commented May 25, 2023

Also, I uploaded the console output of the docker run command above as a separate file console.log.core.flb-pipeline.10.91cade59032c.1685052744

SoamA (Author) commented May 26, 2023

BTW, not sure if this will be useful for you or not, given how it messes with entrypoints, but I opened a shell into a debug-2.31.11 image and set up and ran Fluent Bit via gdb exactly as described in my original post. This time, however, I ran backtrace in gdb after it crashed to get a stack trace:

<omitting lots of lines>
[2023/05/26 00:19:41] [debug] [input:tail:tail.2] cannot read info from: /var/log/containers/eventlogs/*\.inprogress
[2023/05/26 00:19:41] [debug] [input:tail:tail.2] 0 new files found on path '/var/log/containers/eventlogs/*\.inprogress'
[2023/05/26 00:19:41] [debug] [exec:exec.3] created event channels: read=69 write=70

Thread 20 "flb-pipeline" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff1c7f2da0 (LWP 85)]
strcmp () at ../sysdeps/aarch64/strcmp.S:147
147		ldr	data3, [src2], 8
(gdb) backtrace
#0  strcmp () at ../sysdeps/aarch64/strcmp.S:147
#1  0x00000000004f9f10 in flb_parser_get (name=0x14 <error: Cannot access memory at address 0x14>, config=0xffffa0019cc0) at /tmp/fluent-bit-1.9.10/src/flb_parser.c:904
#2  0x0000000000567444 in in_exec_config_read (ctx=0xffff1ba675f0, in=0xffffa000b180, config=0xffffa0019cc0) at /tmp/fluent-bit-1.9.10/plugins/in_exec/in_exec.c:164
#3  0x0000000000567718 in in_exec_init (in=0xffffa000b180, config=0xffffa0019cc0, data=0x0) at /tmp/fluent-bit-1.9.10/plugins/in_exec/in_exec.c:229
#4  0x00000000004caf64 in flb_input_instance_init (ins=0xffffa000b180, config=0xffffa0019cc0) at /tmp/fluent-bit-1.9.10/src/flb_input.c:700
#5  0x00000000004cb040 in flb_input_init_all (config=0xffffa0019cc0) at /tmp/fluent-bit-1.9.10/src/flb_input.c:736
#6  0x00000000004e1e58 in flb_engine_start (config=0xffffa0019cc0) at /tmp/fluent-bit-1.9.10/src/flb_engine.c:668
#7  0x00000000004bebe4 in flb_lib_worker (data=0xffffa0018000) at /tmp/fluent-bit-1.9.10/src/flb_lib.c:626
#8  0x0000ffffa10ad22c in start_thread (arg=0xffffa10d6000) at pthread_create.c:465
#9  0x0000ffffa0a6ea1c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:80

Hope this helps!
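
If it's useful, I can also dump more state from that same gdb session, e.g. (just a sketch; ctx and name are the argument names shown in the frames above):

(gdb) frame 2        # in_exec_config_read
(gdb) print *ctx     # the exec input context that was passed in
(gdb) frame 1        # flb_parser_get
(gdb) print name     # the parser name pointer (0x14 above, i.e. not a valid string)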

SoamA (Author) commented May 26, 2023

Also, where is the Fluent Bit source used in the AWS for Fluent Bit image? I didn't see it in https://github.com/aws/aws-for-fluent-bit/ so I was wondering.

SoamA (Author) commented May 31, 2023

Thanks @PettitWesley. Checking...

SoamA (Author) commented May 31, 2023

Yes, it does appear to work when I remove the exec input. Let me use this as a workaround while you dig into the deeper cause. Thanks for the find!
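
For anyone else hitting this, the workaround is just removing (or commenting out) the exec [INPUT] block from fluent-bit.conf before deploying. Ours looks roughly like this (values shown here are generic placeholders; the real ones are in the config attached above):

# [INPUT]
#     Name          exec
#     Tag           host.metrics
#     Command       <some shell command>
#     Interval_Sec  60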

@PettitWesley (Contributor)

I think this is the upstream fix we are missing: fluent/fluent-bit@62431ad

Originally reported in 1.9.4 but looks like the fix was never backported: fluent/fluent-bit#5715

This commit is probably worth including as well: fluent/fluent-bit@6ed4aaa

@PettitWesley (Contributor)

@SoamA when I add the commits above, I can no longer reproduce it. I'm going to tack this onto the 2.31.12 release, which hasn't been completed yet: https://github.com/PettitWesley/aws-for-fluent-bit/pull/new/2_31_12-more

Let me know if you'd like a pre-release build. You can also create one yourself by running make release or make debug after cloning my PR branch.
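
Roughly, something like this should work (the fork URL and branch name here are inferred from the PR link above):

git clone https://github.com/aws/aws-for-fluent-bit.git
cd aws-for-fluent-bit
git fetch https://github.com/PettitWesley/aws-for-fluent-bit.git 2_31_12-more
git checkout FETCH_HEAD
make release    # or: make debug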

SoamA (Author) commented May 31, 2023

@PettitWesley - sounds great! What's the ETA for the 2.31.12 release? That'd impact whether we can hold off a bit longer or if we'd need a custom build to tide us over until then.

@PettitWesley PettitWesley changed the title aws-for-fluent-bit 2.31.11 randomly failing with SIGSEGV on both Graviton / Intel nodes [all versions] exec input SIGSEGV/crash due to uninitialized memory [pending fix in 2.31.12] May 31, 2023
@PettitWesley PettitWesley added the bug Something isn't working label May 31, 2023
@PettitWesley (Contributor)

@SoamA probably either Friday or, more likely, next Monday. I can't promise any date for certain.

SoamA (Author) commented May 31, 2023

@PettitWesley - with the aforementioned mitigation, we should be able to wait until Friday/Monday. If you expect it to take longer, let us know. Thanks!

SoamA (Author) commented Jun 1, 2023

@PettitWesley - a quick question for you. Any reason we wouldn't be able to create a custom image for FB containing a more recent Fluent Bit release (e.g. 2.1.4) by bumping up the FB version in https://github.com/aws/aws-for-fluent-bit/blob/mainline/scripts/dockerfiles/Dockerfile.build#LL4C16-L4C16? We don't use any of the other AWS specific plugins that are built as part of the image but would love to run with the latest FB. We haven't tried it yet, but I was wondering if anything would break.

@PettitWesley (Contributor)

@SoamA yes, you can do that; you just need to also clear out our custom patches file, since those will not cleanly rebase onto 2.x (and 2.x already includes alternate versions of most of the same commits): https://github.com/aws/aws-for-fluent-bit/blob/mainline/AWS_FLB_CHERRY_PICKS
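
Sketching that out (the exact line/variable to bump is whatever Dockerfile.build pins at the line you linked; the build may also need tweaks if it expects entries in the cherry-picks file):

git clone https://github.com/aws/aws-for-fluent-bit.git
cd aws-for-fluent-bit
# 1. edit scripts/dockerfiles/Dockerfile.build and bump the pinned Fluent Bit version to 2.1.4
# 2. empty the cherry-pick list so none of the 1.9.x patches are applied to the 2.x tree
> AWS_FLB_CHERRY_PICKS
make release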

@PettitWesley (Contributor)

Also @SoamA I think the release will now not happen till Monday, since there are build issues with some of the other fixes I added. I'm also planning to push out some other fixes from 2.x into this new version since, as you saw, this issue was already fixed in 2.x but the patch wasn't in our distro.

SoamA (Author) commented Jun 1, 2023

@PettitWesley - got it, thanks!

SoamA (Author) commented Jun 7, 2023

Hi @PettitWesley - any updates on the 2.31.12 release?

@PettitWesley (Contributor)

@SoamA Sorry we've been having issues with our release automation... I'm working on getting it out ASAP.

PettitWesley (Contributor) commented Jun 8, 2023

@SoamA We want to apologize again for how long it is taking to get this release out. For you and anyone else who is waiting, here are pre-release images that you can use. They're built using the same code as our pending 2.31.12 prod release, just on a fresh EC2 instance instead of our pipeline:

144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.31.12-pre
144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:init-2.31.12-pre

The tag with "init" in the name is our ECS init release: https://github.com/aws/aws-for-fluent-bit/blob/mainline/use_cases/init-process-for-fluent-bit/README.md

These images can be pulled from any AWS account.

For example, in your Dockerfile:

FROM 144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.31.12-pre

Or task def:

"image": "144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.31.12-pre"

Or with ECS CLI:

ecs-cli pull --region us-west-2 --registry-id 144718711470 144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.31.12-pre
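
Or with plain Docker (standard cross-account ECR pull; your local credentials just need to be able to call ECR):

aws ecr get-login-password --region us-west-2 | \
  docker login --username AWS --password-stdin 144718711470.dkr.ecr.us-west-2.amazonaws.com
docker pull 144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.31.12-pre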

See this comment for ARM: #661 (comment)

How much can we trust your pre-release?

Strictly speaking, we must add a disclaimer for all pre-released images that they must be used at your own risk. However, we can quantify this with key details on the state of testing that these images have undergone.

These images were built on an EC2 instance, by running make release on our source code, the same as our prod pipeline does for a "real" release: https://github.com/aws/aws-for-fluent-bit

Recall our release testing for normal releases here: https://github.com/aws/aws-for-fluent-bit#aws-distro-for-fluent-bit-release-testing

As of June 12th, 2023, the 2.31.12 release has passed the following testing:

  1. Stability tests: we have passed our 1-day latest release bar with zero crashes/failures. We started those tests on June 4th, and there are still no failures, so it has met 8 days of stability thus far.
  2. Integration tests: while our pipeline is currently blocked, the release has passed our integration testing stage without failure.
  3. Load tests: our pipeline is currently blocked on these. The failures do not appear to be due to the code contents of the image. Of the 24 load tests that you see in our release notes, 20/24 have so far passed across different test runs. The failures in the remaining 4 do not appear to indicate an issue in this 2.31.12 release, but we're still investigating and will post an update when we know more. This PR should fix the tests (the issue is in the test code, not the image): Fix load test failures #682

SoamA (Author) commented Jun 8, 2023

That'll be useful. Thanks @PettitWesley for making the image available!

BTW, I had another FB question/issue, this time around the behavior of one of the features of the S3 output plugin. I'll start another thread for it though.

@PettitWesley (Contributor)

This should unblock the pipeline so we can move forward with the release for this fix: #676

SoamA (Author) commented Jun 9, 2023

Hey @PettitWesley - I tried using the image you built. Unfortunately, it's an amd64 image whereas we're mostly running Graviton instances. Do you think you could provide a multi-arch image (i.e. the same format as the images in https://gallery.ecr.aws/aws-observability/aws-for-fluent-bit) for the pre-build, assuming the proper 2.31.12 image is still a couple of days out?

@PettitWesley (Contributor)

@SoamA I pulled this image out of our release pipeline infrastructure... the pipeline is blocked in our load testing framework, but it has built and integration tested the images. This should be the image for arm64:

144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.31.12-pre-arm64

It should run on ARM and should print this on startup:

AWS for Fluent Bit Container Image Version 2.31.12
Fluent Bit v1.9.10
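
A quick sanity check that the tag you pull is actually arm64 (standard docker inspect fields):

docker pull 144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.31.12-pre-arm64
docker inspect --format '{{.Os}}/{{.Architecture}}' 144718711470.dkr.ecr.us-west-2.amazonaws.com/aws-for-fluent-bit:2.31.12-pre-arm64
# expected: linux/arm64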

@lubingfeng (Contributor)

@SoamA as mentioned by Wesley in the above #661 (comment), this is a pre-release version and has not gone through the complete release testing process. Please use it in your non-production environment for testing purposes only.

AWS will announce the release once the testing is fully complete.

SoamA (Author) commented Jun 12, 2023

@PettitWesley @lubingfeng - thanks for the heads up! This morning, I deployed FB 2.31.11 with the workaround (took out the exec plugin) but will look into using these temporary images.

@PettitWesley (Contributor)

With this latest fix the pipeline seems to be happier and I think we will have this release out by tomorrow morning at the latest...

#686

@PettitWesley PettitWesley changed the title [all versions] exec input SIGSEGV/crash due to uninitialized memory [pending fix in 2.31.12] [all versions] exec input SIGSEGV/crash due to uninitialized memory [fix in 2.31.12] Jun 15, 2023
@PettitWesley (Contributor)

@SoamA The release is finally in progress now! Windows images are out and Linux should be out within 2 hours. I want to apologize again for how long this has taken.

@PettitWesley (Contributor)

SoamA (Author) commented Jun 20, 2023

Great! Looking forward to deploying and trying it out.

@saurabh1git

Hi Team,
We are encountering a similar issue on 2.31.12 as well. Any suggestions?
Below are the logs:
Fluent Bit v1.9.10

* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2023/12/12 04:05:23] [ info] [fluent bit] version=1.9.10, commit=2a8644893d, pid=1
[2023/12/12 04:05:23] [ info] [storage] version=1.4.0, type=memory+filesystem, sync=normal, checksum=disabled, max_chunks_up=128
[2023/12/12 04:05:23] [ info] [storage] backlog input plugin: storage_backlog.8
[2023/12/12 04:05:23] [ info] [cmetrics] version=0.3.7
[2023/12/12 04:05:23] [ info] [input:tail:tail.0] multiline core started
[2023/12/12 04:05:23] [ info] [input:tail:tail.1] multiline core started
[2023/12/12 04:05:23] [ info] [input:tail:tail.2] multiline core started
[2023/12/12 04:05:23] [ info] [input:systemd:systemd.3] seek_cursor=s=75864b5ffe234403b3b41539656f7bf8;i=760... OK
[2023/12/12 04:05:23] [ info] [input:tail:tail.4] multiline core started
[2023/12/12 04:05:23] [ info] [input:storage_backlog:storage_backlog.8] queue memory limit: 4.8M
[2023/12/12 04:05:23] [ info] [filter:kubernetes:kubernetes.0] https=1 host=127.0.0.1 port=10250
[2023/12/12 04:05:23] [ info] [filter:kubernetes:kubernetes.0] token updated
[2023/12/12 04:05:23] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2023/12/12 04:05:23] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with Kubelet...
[2023/12/12 04:05:23] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2023/12/12 04:05:23] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] worker #0 started
[2023/12/12 04:05:23] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] worker #0 started
[2023/12/12 04:05:23] [ info] [output:cloudwatch_logs:cloudwatch_logs.2] worker #0 started
[2023/12/12 04:05:23] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2023/12/12 04:05:23] [ info] [sp] stream processor started
[2023/12/12 04:05:23] [ info] [input:storage_backlog:storage_backlog.8] register tail.0/1-1702348687.339960689.flb
[2023/12/12 04:05:24] [ info] [input:tail:tail.0] inotify_fs_add(): inode=144756739 watch_fd=19 name=/var/log/containers/XXXXXXXXXXXXXXXXXXXX-775bdb8644-fjvqq_daas_istio-proxy-2789cc3fdd129df4401cc967c6c159c4c4224b9db3f7f2d9a730f78ede1e50bc.log
[2023/12/12 04:05:24] [engine] caught signal (SIGSEGV)
AWS for Fluent Bit Container Image Version 2.31.12.20231011
