[engine] caught signal (SIGSEGV) on 2 pods in the daemonset #4195
Comments
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
This issue was closed because it has been stalled for 5 days with no activity. |
I got a similar issue:
Is it possible that a parsing issue is creating a segfault? |
"Is it possible a parsing issue is creating a segfault ?" I think so. I am currently experimenting with adding "more and more" functionality back into my parser. Here's the parser line that I was using when I see the segfaults:
|
I did a test replacing the Elasticsearch output with a simple stdout output, and I no longer get the segfault.
stdout output used:
@bstolz can you try stdout on your side to confirm it could be an issue with the Elasticsearch exporter? |
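For reference, a minimal stdout output section looks roughly like the following. This is a generic debugging sketch, not necessarily the exact block used in the comment above:

[OUTPUT]
    # Temporary output for debugging: print every record to standard output
    Name   stdout
    Match  *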
Ahh you're using the elasticsearch exporter - I am exporting to stackdriver and getting the same. I can try standard out on a non-production environment. |
I upgraded to
I have
|
First time I saw this line. UPDATE: issue fixed for
|
As mentioned previously, I am logging to stackdriver, and upon changing it to stdout I still see the same issue. I'm slowly removing some of the "extra" lines I have in the input block - multiline, skip long lines, and reducing buffer_max_size - to see how it goes. Then I will set all of the buffers to default. After that, I will start investigating the regex. |
Thanks for the info. Would you please provide a simple repro case so we can troubleshoot? |
I am observing this on all of our Elasticsearch 7 clusters (ECK) in GKE - it happens a few times in a 24-hour period. I am using the same configmap for other clusters not running Elasticsearch and I do not see the issue there. I am not sure how to provide a "simple" repro case given that an Elasticsearch cluster is involved. I've included the relevant bits of the configmap. Let me know how else I can provide what you need. I can also extend an offer to help, as I do have a demo environment. |
I did a new install of the latest chart for Elasticsearch (docker image: ...). I cleaned my configuration to keep fewer parsers/filters, and it still crashes with this configuration:
log:
Another log after the pod crashed:
Full log with trace enabled: ... I saw the CPU ... I added ...
because I read an issue where it was recommended to enable it when there is too much warning log output. UPDATE: removed ... |
I ran valgrind on a debug build of the master branch (9ccc6f8): valgrind-fluent-bit-w67h2-2.log. They crash with the same stack:
It seems |
@bstolz I notice you use
and I am using:
I think there is an issue in the flushing part of the multiline parser.
|
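For context, the comparison above is between two tail inputs that feed the multiline parser. A minimal sketch of that kind of input is shown below; the path, parsers, and limits here are illustrative and not taken from either comment:

[INPUT]
    # Tail container logs and reassemble multi-line records using the
    # built-in docker and cri multiline parsers
    Name              tail
    Path              /var/log/containers/*.log
    multiline.parser  docker, cri
    Skip_Long_Lines   On
    Mem_Buf_Limit     50MB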
I added a debug log in the es plugin (GuillaumeSmaha@5b908dd) for a case which should not occur: from what I understand, it means that even after resizing the buffer, it is still smaller than the required size computed in fluent-bit/plugins/out_es/es_bulk.c, line 105 (at 3197e97).
And I got a SIGSEGV just after:
|
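To make the reasoning above concrete: the pattern being described is a bulk buffer that is grown on demand before appending a record, where the crash implies the post-resize capacity check can still fail. Below is a simplified, self-contained C sketch of that pattern; it is illustrative only, not the actual es_bulk.c code, and the names bulk_buf and bulk_append are made up:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer, loosely modelled on the bulk buffer
 * discussed above; not fluent-bit's real data structure. */
struct bulk_buf {
    char   *ptr;
    size_t  len;   /* bytes currently used */
    size_t  size;  /* bytes allocated */
};

/* Append 'req' bytes of 'data', growing the buffer if needed.
 * Returns 0 on success, -1 on failure. */
static int bulk_append(struct bulk_buf *b, const char *data, size_t req)
{
    if (b->len + req > b->size) {
        /* Grow by a fixed chunk; if the chunk is smaller than the record
         * being appended, the capacity check below can still fail - the
         * kind of "should not occur" case mentioned in the comment. */
        size_t new_size = b->size + 4096;
        char *tmp = realloc(b->ptr, new_size);
        if (!tmp) {
            return -1;
        }
        b->ptr  = tmp;
        b->size = new_size;
    }

    if (b->len + req > b->size) {
        fprintf(stderr, "resized buffer is still too small\n");
        return -1;
    }

    memcpy(b->ptr + b->len, data, req);
    b->len += req;
    return 0;
}

int main(void)
{
    struct bulk_buf b = { NULL, 0, 0 };
    char big[8192];
    memset(big, 'x', sizeof(big));

    /* A single record larger than one growth chunk exercises the
     * "resized but still too small" branch. */
    if (bulk_append(&b, big, sizeof(big)) != 0) {
        fprintf(stderr, "append failed\n");
    }
    free(b.ptr);
    return 0;
}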
@GuillaumeSmaha great digging, thank you. |
@bstolz I will create another issue because the problem found is inside the elasticsearch output plugin. Reverting #3788 seems to fix my issue (fluent-bit doesn't crash anymore). I just created #4412 |
@bstolz My fix on es output was merged on |
@GuillaumeSmaha wonderful! Thank you! That was quick ;-) I appreciate you diving in and posting that example. |
I have tried fluent-bit versions 1.7.x and 1.8.x and they all crash using the tail input and the elastic output with heavy JSON-formatted logs. I tried the change in commit 0a061fb; fluent-bit was a bit more stable but still crashed. I have gone back to version 1.6.10, which "just runs". I have noticed another issue with the 1.8.x series: it is continually logging errors on writing to Elasticsearch, but the "status" element in the Elasticsearch response only ever shows status 201, which should be "document created" and not an error. When I see errors reported with 1.6 I always see a real error status in the response, like a 429 for example. I think the Elasticsearch response handling has also been broken. |
So I took the nokute78:fix_4311_es_fix branch and it ran for a few minutes before SEGV-ing. I'm back to 1.6.20. |
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the |
Hi, I just built and tested the 1.8.13 fluent-bit release and it crashed after processing for about 5 minutes. Again, 1.6.20 works flawlessly, so something has been badly broken in the elastic plugin's memory management since 1.7. It does not appear to be related to the input data: I have tried reprocessing the same file and it does not crash in the same place. If I replace the output with stdout it does not crash. Fluent Bit v1.8.13
* Copyright (C) 2015-2021 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2022/03/21 14:59:12] [engine] caught signal (SIGSEGV)
#0 0x7ffff6647fbe in ???() at ???:0
#1 0x44dde2 in flb_free() at include/fluent-bit/flb_mem.h:122
#2 0x44ef11 in flb_sds_destroy() at src/flb_sds.c:408
#3 0x4e1dfb in elasticsearch_format() at plugins/out_es/es.c:540
#4 0x4e2b6b in cb_es_flush() at plugins/out_es/es.c:772
#5 0x456a08 in output_pre_cb_flush() at include/fluent-bit/flb_output.h:517
#6 0x787905 in co_init() at lib/monkey/deps/flb_libco/amd64.c:117
Segmentation fault |
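A crash that lands inside free() while destroying a string, as in the trace above, is most commonly caused either by freeing the same pointer twice or by earlier heap corruption (for example, writing past the end of a buffer). A tiny standalone C illustration of the double-free case follows; it is generic C, not fluent-bit code, and is only meant to show why such a backtrace ends in the allocator:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Allocate a small string, free it once, then free it again.
     * The second free() is undefined behaviour and typically aborts
     * or segfaults inside the allocator, producing a backtrace that
     * ends in free(), much like the one above. */
    char *s = malloc(32);
    if (!s) {
        return 1;
    }
    strcpy(s, "hello");
    free(s);
    free(s);   /* double free: undefined behaviour */
    return 0;
}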
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the |
This issue was closed because it has been stalled for 5 days with no activity. |
Similar or related issue: #3585
Info 1:
Info 2:
Info 3:
Info 4:
# fluent-bit.conf
fluent_bit_conf = """
[SERVICE]
Flush 5
Grace 30
Log_Level info
Daemon off
Parsers_File parsers.conf
HTTP_Server ${HTTP_SERVER}
HTTP_Listen 0.0.0.0
HTTP_Port ${HTTP_PORT}
storage.path /var/fluent-bit/state/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 5M
@INCLUDE application-log.conf
@INCLUDE dataplane-log.conf
@INCLUDE host-log.conf
"""
# application-log.conf
application_log_conf = """
[INPUT]
Name tail
Tag application.<namespace_name>.<deployment_name>.log
Tag_Regex ^\/var\/log\/containers\/(?<pod_name>[^_]+)_(?<namespace_name>[^_]+)_(?<deployment_name>[^_]+)-(?<docker_id>[a-z0-9]{64})\.log$
Exclude_Path /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
Path /var/log/containers/*.log
multiline.parser docker, cri
DB /var/fluent-bit/state/flb_container.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Rotate_Wait 30
storage.type filesystem
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag application.*
Path /var/log/containers/cloudwatch-agent*
multiline.parser docker, cri
DB /var/fluent-bit/state/flb_cwagent.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name kubernetes
Match application.*
Kube_URL https://kubernetes.default.svc:443
Kube_Tag_Prefix application.
Regex_Parser k8s-custom-tag
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On
K8S-Logging.Exclude Off
Labels Off
Annotations Off
Use_Kubelet On
Kubelet_Port 10250
Buffer_Size 0
[OUTPUT]
Name cloudwatch_logs
Match application.*
region ${AWS_REGION}
log_group_name /${CLUSTER_NAME}/application
log_stream_prefix log-
auto_create_group true
extra_user_agent container-insights
"""
# dataplane-log.conf
dataplane_log_conf = """
[INPUT]
Name systemd
Tag dataplane.systemd.*
Systemd_Filter _SYSTEMD_UNIT=docker.service
Systemd_Filter _SYSTEMD_UNIT=containerd.service
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
DB /var/fluent-bit/state/systemd.db
Path /var/log/journal
Read_From_Tail ${READ_FROM_TAIL}
[INPUT]
Name tail
Tag dataplane.tail.*
Path /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
multiline.parser docker, cri
DB /var/fluent-bit/state/flb_dataplane_tail.db
Mem_Buf_Limit 50MB
Skip_Long_Lines On
Refresh_Interval 10
Rotate_Wait 30
storage.type filesystem
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name modify
Match dataplane.systemd.*
Rename _HOSTNAME hostname
Rename _SYSTEMD_UNIT systemd_unit
Rename MESSAGE message
Remove_regex ^((?!hostname|systemd_unit|message).)*$
[FILTER]
Name aws
Match dataplane.*
imds_version v1
[OUTPUT]
Name cloudwatch_logs
Match dataplane.*
region ${AWS_REGION}
log_group_name /${CLUSTER_NAME}/dataplane
log_stream_prefix log-
auto_create_group true
extra_user_agent container-insights
"""
# host-log.conf
host_log_conf = """
[INPUT]
Name tail
Tag host.dmesg
Path /var/log/dmesg
Key message
DB /var/fluent-bit/state/flb_dmesg.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag host.messages
Path /var/log/messages
Parser syslog
DB /var/fluent-bit/state/flb_messages.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[INPUT]
Name tail
Tag host.secure
Path /var/log/secure
Parser syslog
DB /var/fluent-bit/state/flb_secure.db
Mem_Buf_Limit 5MB
Skip_Long_Lines On
Refresh_Interval 10
Read_from_Head ${READ_FROM_HEAD}
[FILTER]
Name aws
Match host.*
imds_version v1
[OUTPUT]
Name cloudwatch_logs
Match host.*
region ${AWS_REGION}
log_group_name /${CLUSTER_NAME}/host
log_stream_prefix log-
auto_create_group true
extra_user_agent container-insights
"""
# parsers.conf
parsers_conf = r"""
[PARSER]
Name syslog
Format regex
Regex ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
Time_Key time
Time_Format %b %d %H:%M:%S
[PARSER]
Name container_firstline
Format regex
Regex (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%LZ
[PARSER]
Name cwagent_firstline
Format regex
Regex (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%LZ
[PARSER]
Name k8s-custom-tag
Format regex
Regex ^(?<namespace_name>[^.]+)\.(?<deployment_name>[^.]+)\.log$
""" Codes:https://github.com/omidraha/pulumi_example/blob/main/cw/cw.py |
Bug Report
Describe the bug
Over time I have observed that our Fluent Bit pods running 1.8.4 occasionally show pod restarts due to termination (Kubernetes exit code 139, i.e. 128 + SIGSEGV). Interestingly, the restarts seem to happen on only a few pods in the daemonset:
To Reproduce
There doesn't seem to be anything specific that we do other than just running for a period of time.
Expected behavior
I suppose in an ideal world this error would not happen.
Your Environment
Additional context
It doesn't seem to affect us other than tripping our restart monitors. We're not sure why it's restarting (e.g. is it something we're doing?)