[all versions] exec input SIGSEGV/crash due to uninitialized memory [fix in 2.31.12] #661
Comments
@SoamA Thank you for your report. It would really help us if you could deploy one of our pre-built debug images without modifying the entrypoint; that will output a full stacktrace. Why did you customize the image entrypoint? Here are some references:
These show our entrypoint, which both prints the core stacktrace and uploads it to S3. If you want to customize, please start from these:
Hi @PettitWesley - appreciate the response! We were trying to generate core dumps using the prebuilt images but couldn't find any core dumps actually being created, which is why we went down the gdb/entrypoint path. I'll retry and update.
@SoamA if a core was created, our pre-built debug images without changes will BOTH output the stacktrace to stdout and upload the full core to S3. If there was no core generated, it should output a message like "No core to upload".
@PettitWesley - yes, I get the
Did you get anything here? I think this might be an issue where we need to update our guides/docs. The cores go to the directory referenced here: https://github.com/aws/aws-for-fluent-bit/blob/mainline/scripts/core_uploader.sh#L24. If the script detects a core there, it moves it to another location; but if it detects no core, it does nothing: https://github.com/aws/aws-for-fluent-bit/blob/mainline/scripts/core_uploader.sh#L42. Can you try mounting both directories?
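For anyone following along, here is a rough sketch of the kind of mount being suggested; the in-container core locations below are placeholders (check core_uploader.sh and entrypoint.sh in the image for the real paths), and the host directory name is made up:

    # Hypothetical sketch: mount host directories over both candidate core
    # locations so any core survives the container restarting. The container
    # paths /cores and /tmp/cores are assumptions, not confirmed values.
    mkdir -p /var/fluent-bit-cores/a /var/fluent-bit-cores/b
    docker run -d --name fluent-bit-debug \
        --ulimit core=-1 \
        -v /var/fluent-bit-cores/a:/cores \
        -v /var/fluent-bit-cores/b:/tmp/cores \
        -v /var/log:/var/log \
        public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11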
I will test this myself and manually trigger a core, to check where it goes...
Ah, noted. Following your guidance, I tried
but sadly no dice. Same complaint about no core files, and both directories were empty.
@SoamA I'm sorry, I don't know why this isn't working for you. I double checked that image and I can generate a core dump easily:
Then, in another window, trigger a core:
and it sends a core to the expected location. This makes me think that somehow a core dump is not actually being generated in your case 🧐 Another option would be to run under Valgrind; we have a
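For reference, a generic way to confirm that core capture works end to end, independent of Fluent Bit itself, is to crash a disposable shell inside the running container (a sketch; the container name is the placeholder from the mount example above):

    # Hypothetical sketch: force a core from a throwaway shell inside the
    # running debug container. Where the core lands is decided by the host's
    # kernel.core_pattern, so we cd into the mounted core directory first in
    # case the pattern is a relative "core".
    docker exec fluent-bit-debug /bin/sh -c 'ulimit -c unlimited; cd /cores && kill -SEGV $$'
    ls -l /var/fluent-bit-cores/a/    # placeholder host dir mounted at /cores above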
Yes, if you could build a valgrind target for Graviton (we run r6gd.16xl), that'd be great. I was going to get around to it, but it'd take a bit of time to provision a dev Graviton node. We do run Intel nodes (r5d.16xl) as well, but far fewer, essentially one per node at our current configuration. In the meantime, I'll investigate whether we have any system settings turned on that'd prevent core dumps. BTW, I created an AWS support ticket - https://support.console.aws.amazon.com/support/home?region=us-east-1#/case/?displayId=12844872531&language=en - just to make sure the EKS team is tracking this as well.
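On the "system settings that'd prevent core dumps" point, these are the standard Linux knobs worth checking on the worker node itself; they are not specific to this image, and kernel.core_pattern in particular is host-global and shared with containers:

    # Check host-level core dump settings on the EKS worker:
    cat /proc/sys/kernel/core_pattern   # if this starts with '|', cores go to a host-side
                                        # handler (e.g. systemd-coredump), not the container filesystem
    ulimit -c                           # soft core-size limit for this shell; 0 disables cores here
    sysctl fs.suid_dumpable             # 0 suppresses cores for "non-dumpable" (e.g. setuid) processes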
Looks like the core files were actually going to
@SoamA cool. I did build you an arm image here:
I uploaded two core files to the support ticket - https://support.console.aws.amazon.com/support/home?region=us-east-1#/case/?displayId=12844872531&language=en. I used the S3 uploader from AWS support. As it doesn't confirm where the final files ended up, there might be multiple copies.
And yes, thanks for the valgrind image. We'll need it as we make our way through the debugging process.
@SoamA thanks. I'll check for the core.
Thanks @PettitWesley. Were you able to access the files, and did they yield any actionable insights? Let me know what else I can run on my side or any additional data we can capture.
@SoamA the files that I got from support seemed to just have your log files in /var/log... there was a lot of stuff in there... is there a specific path I should look at? We didn't see any cores. Sorry.
Hey @PettitWesley - I uploaded two files (twice) via S3 upload:
I checked the support ticket and AWS confirmed the uploads hadn't gone through the first time. My bad. I've retried with the correct commands, and it looks like it succeeded this time. Please check again.
@SoamA I think I got it now. Just to be clear, exactly which arch (ARM or Intel) and which image were these cores from? I need the right executable in order to read the core.
@PettitWesley - these are ARM images, produced on an r6gd.16xl box.
core.flb-pipeline.10.1d99da825c76.1684878593: core file produced manually using the debug-2.31.11 ARM image
@SoamA When you say "core file produced manually", what exactly do you mean? Please be specific.
@PettitWesley - to clarify,
@SoamA I'm not able to read the cores with either of the 2.31.11 ARM images (debug and non-debug). GDB output: below is a sample of what the output looks like; I'm not able to get a stacktrace out of the core.
SIGTRAP suggests that you had set breakpoints? We do not set breakpoints in our debug image, so I am guessing you were using the debugger, possibly with a custom version?
This also isn't the command used in our entrypoint: https://github.com/aws/aws-for-fluent-bit/blob/mainline/entrypoint.sh When I install the debuginfo packages, I get:
The above was from the core with The core
Steps followed
I performed the same steps for the debug-2.31.11 image as well.
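(For anyone trying to reproduce the core analysis, the general shape of those steps is something like the sketch below. The Fluent Bit binary path is assumed from the standard upstream layout, and the core filename is the one mentioned earlier in the thread; neither is a confirmed detail of these images.)

    # Hypothetical sketch: read the core inside the matching image so the
    # executable and shared libraries line up with whatever produced it.
    docker run -it --rm --entrypoint /bin/bash \
        -v /path/to/cores:/cores \
        public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11
    # inside the container (binary path is an assumption):
    gdb -batch \
        -ex 'bt' \
        -ex 'thread apply all bt' \
        /fluent-bit/bin/fluent-bit \
        /cores/core.flb-pipeline.10.1d99da825c76.1684878593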
Hi @PettitWesley - hmm, we didn't use gdb to create those particular core files, so it's puzzling why this is happening. Thanks for sharing the steps you used to view the core files. I'll see if there's an easy way of producing some cores that are more consumable.
@SoamA it should be possible for me to read the cores as long as I can get the exact binary used... can you confirm you used the public ECR image I showed above? Because the binary in all of those images should be exactly the same.
Hi @PettitWesley - to produce I am going to produce another core dump via
Just uploaded another file
This is using the image |
Also, I uploaded the console output of the
BTW, not sure if this will be useful for you or not, given how this messes with entrypoints, but I opened up a shell into a
Hope this helps!
Also, where is the Fluent Bit source used in the AWS For Fluent Bit image? I didn't see it in https://github.com/aws/aws-for-fluent-bit/ so I was wondering.
Thanks @PettitWesley. Checking...
Yes, it does appear to work when I remove the
I think this is the upstream fix we are missing: fluent/fluent-bit@62431ad. It was originally reported in 1.9.4, but it looks like the fix was never backported: fluent/fluent-bit#5715. This commit is probably worth including as well: fluent/fluent-bit@6ed4aaa
@SoamA when I add those above commits, I can no longer reproduce it. I'm going to tack this onto the 2.31.12 release, which hasn't been completed yet: https://github.com/PettitWesley/aws-for-fluent-bit/pull/new/2_31_12-more Let me know if you'd like a pre-release build. You can also create one yourself by running
@PettitWesley - sounds great! What's the ETA for the 2.31.12 release? That'd determine whether we can hold off a bit longer or whether we'd need a custom build to tide us over until then.
@SoamA probably either Friday or, more likely, next Monday. I can't promise any date for certain.
@PettitWesley - with the aforementioned mitigation, we should be able to wait until Friday/Monday. If you expect it to take longer, let us know. Thanks!
@PettitWesley - a quick question for you. Any reason we wouldn't be able to create a custom image for FB containing a more recent Fluent Bit release (e.g. 2.1.4) by bumping up the FB version in https://github.com/aws/aws-for-fluent-bit/blob/mainline/scripts/dockerfiles/Dockerfile.build#LL4C16-L4C16? We don't use any of the other AWS-specific plugins that are built as part of the image, but we'd love to run with the latest FB. We haven't tried it yet but were wondering if anything would break.
@SoamA yes, you can do that; you just need to also clear out our custom patches file, since those patches will not cleanly rebase onto 2.x (and 2.x includes alternate versions of most of the same commits): https://github.com/aws/aws-for-fluent-bit/blob/mainline/AWS_FLB_CHERRY_PICKS
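A rough sketch of what that change looks like end to end; the build target and the exact version line in Dockerfile.build are assumptions, so treat this as an outline rather than the exact recipe:

    # Hypothetical outline for a custom build against a newer upstream Fluent Bit.
    git clone https://github.com/aws/aws-for-fluent-bit.git
    cd aws-for-fluent-bit
    # 1. Edit the Fluent Bit version pinned in scripts/dockerfiles/Dockerfile.build
    #    (the line linked above) to the release you want, e.g. 2.1.4.
    # 2. Empty the cherry-pick list so the 1.9.x-only patches are not applied:
    > AWS_FLB_CHERRY_PICKS
    # 3. Build with the repo's own tooling (target name assumed; check the Makefile):
    make release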
Also @SoamA, I think the release now won't happen until Monday, since there are build issues with some of the other fixes I added. I'm also planning to push out some other fixes from 2.x into this new version, since, as you saw, this issue was already fixed in 2.x but the patch wasn't in our distro.
@PettitWesley - got it, thanks!
Hi @PettitWesley - any updates on the 2.31.12 release?
@SoamA Sorry, we've been having issues with our release automation... I'm working on getting it out ASAP.
@SoamA We want to apologize again for how long it is taking to get this release out. For you and anyone else who is waiting, here are pre-release images that you can use. They're built using the same code as our pending 2.31.12 prod release, just on a fresh EC2 instance instead of our pipeline:
The tag with "init" in the name is our ECS init release: https://github.com/aws/aws-for-fluent-bit/blob/mainline/use_cases/init-process-for-fluent-bit/README.md These images can be pulled from any AWS account. For example, in your Dockerfile:
Or in a task definition:
Or with the ECS CLI:
See this comment for ARM: #661 (comment)
How much can we trust your pre-release? Strictly speaking, we must add a disclaimer for all pre-released images that they must be used at your own risk. However, we can quantify this with key details on the state of testing that these images have undergone. These images were built on an EC2 instance, by running
Recall our release testing for normal releases here: https://github.com/aws/aws-for-fluent-bit#aws-distro-for-fluent-bit-release-testing As of June 12th, 2023, the 2.31.12 release has passed the following testing:
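As a generic illustration of consuming a pre-release image (the real image URI is the one given in this comment and is not reproduced here; PRE_RELEASE_IMAGE_URI below is a placeholder, as is the binary path used for the smoke test):

    # Hypothetical usage with a placeholder URI.
    # In a Dockerfile:
    #   FROM PRE_RELEASE_IMAGE_URI
    # Or pull it directly and check the version as a quick smoke test:
    docker pull PRE_RELEASE_IMAGE_URI
    docker run --rm --entrypoint /fluent-bit/bin/fluent-bit PRE_RELEASE_IMAGE_URI --version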
That'll be useful. Thanks @PettitWesley for making the image available! BTW, I had another FB question/issue, this time around the behavior of one of the features of the S3 output plugin. I'll start another thread for it though.
This should unblock the pipeline so we can move forward with the release for this fix: #676
Hey @PettitWesley - I tried using the image you built. Unfortunately, it's a
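(Side note for anyone hitting the same architecture mismatch: standard Docker commands can tell you what an image was built for before you deploy it; IMAGE_URI is a placeholder.)

    # Check the architecture of an already-pulled image:
    docker inspect --format '{{.Os}}/{{.Architecture}}' IMAGE_URI
    # Or inspect the remote manifest without pulling it:
    docker manifest inspect IMAGE_URI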
@SoamA I pulled this image out of our release pipeline infrastructure... the pipeline is blocked in our load-testing framework, but it has built and integration-tested the images. This should be the image for arm64:
It should run on ARM and should print this on startup:
@SoamA as mentioned by Wesley in the above #661 (comment), this is a pre-release version and has not gone through the complete release testing process. Please use it in your non-production environment for testing purposes only. AWS will announce the release once the testing is fully complete.
@PettitWesley @lubingfeng - thanks for the heads up! This morning, I deployed FB 2.31.11 with the workaround (took out the exec plugin) but will look into using these temporary images.
With this latest fix, the pipeline seems to be happier, and I think we will have this release out by tomorrow morning at the latest...
@SoamA The release is finally in progress! Windows images are out, and Linux should be out within 2 hours. I want to apologize again for how long this has taken.
Great! Looking forward to deploying and trying it out.
Hi Team,
[2023/12/12 04:05:23] [ info] [fluent bit] version=1.9.10, commit=2a8644893d, pid=1
Describe the question/issue
Hi folks,
We've been basing our EKS logging infrastructure on AWS for Fluent Bit. Recently, however, we've been noticing some Fluent Bit pods crash with a SIGSEGV on startup and go into CrashLoopBackOff on deployment. Redoing the deployment leads to the same problem on the very same physical hosts, while pods on other hosts in the same cluster run fine. The EKS workers are configured identically, so it's a bit of a head-scratcher why this happens persistently on a handful of random nodes in a cluster while the majority of the pods run fine.
If we let the pod retries run long enough on a host, it will eventually succeed, but that can take anywhere from an hour to a day, which is unacceptable for a production environment.
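The kind of data that's easy to pull from the crash-looping pods with standard kubectl commands (the namespace is from the configuration below; the pod name is a placeholder):

    # Hypothetical sketch for inspecting the crash-looping pods:
    kubectl -n logging get pods -o wide            # which nodes the crashing pods are scheduled on
    kubectl -n logging logs POD_NAME --previous    # output of the last crashed container
    kubectl -n logging describe pod POD_NAME       # restart count, last state, exit code/signal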
Configuration
Please find the ConfigMap for aws-for-fluent-bit attached. This contains the Fluent Bit config file as well. Note that the DaemonSet is running in its own namespace ("logging"). The namespace is attached to a dedicated service account, "fb-service-account".
aws-for-fluent-bit-conf.txt
Fluent Bit Log Output
DebugLog.txt
Fluent Bit Version Info
AWS For Fluent Bit Image: 2.31.11, though we've also seen the same behavior with 2.31.10 and 2.31.6
For debugging, we used debug-2.31.11.
Cluster Details
EKS cluster: K8s version 1.25 (v1.25.9-eks-0a21954), though we've also noticed this on earlier versions
Instance types: mostly r6gd.16xl and the occasional r5d.16xlarge
We base our internal AMIs on the following images:
Application Details
This occurred on an essentially idle cluster with no applications running. Fluent Bit crashes immediately on startup, so load wasn't a factor.
Steps to reproduce issue
We reviewed the suggestions provided in:
We generated the debug log as follows:
docker run -it --entrypoint=/bin/bash --ulimit core=-1 -v /var/log:/var/log -v /home/sacharya:/cores -v $(pwd):/fluent-bit/etc public.ecr.aws/aws-observability/aws-for-fluent-bit:debug-2.31.11
Note that we were able to manually run FB successfully via docker on the other hosts in the cluster where FB pods had been successfully launched by our usual (terraform/helm based) deployment process.
Related Issues