fluent-bit 1.8.3 occasional crashes (caught signal SIGSEGV) #3955

Closed
psyhomb opened this issue Aug 15, 2021 · 10 comments

psyhomb commented Aug 15, 2021

Bug Report

Describe the bug

Fluent Bit, running as a systemd unit, occasionally crashes and cannot recover automatically. The only way to recover it is to remove the DB and WAL files and then manually restart the service unit.

Aug 15 07:44:53 test-instance-1 td-agent-bit[16104]: [2021/08/15 07:44:53] [error] [multiline] invalid stream_id 1095162821569803272, could not append content to multiline context
Aug 15 07:44:54 test-instance-1 td-agent-bit[16104]: [2021/08/15 07:44:54] [error] [multiline] invalid stream_id 1095162821569803272, could not append content to multiline context
Aug 15 07:44:54 test-instance-1 td-agent-bit[16104]: [2021/08/15 07:44:54] [error] [multiline] invalid stream_id 14222019020935586222, could not append content to multiline context
Aug 15 07:44:55 test-instance-1 td-agent-bit[16104]: [2021/08/15 07:44:55] [error] [multiline] invalid stream_id 1095162821569803272, could not append content to multiline context
Aug 15 07:44:58 test-instance-1 td-agent-bit[16104]: [2021/08/15 07:44:58] [error] [multiline] invalid stream_id 14222019020935586222, could not append content to multiline context
Aug 15 07:44:59 test-instance-1 td-agent-bit[16104]: [2021/08/15 07:44:59] [engine] caught signal (SIGSEGV)
Aug 15 07:44:59 test-instance-1 td-agent-bit[16104]: ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)
Aug 15 07:44:59 test-instance-1 systemd[1]: td-agent-bit.service: main process exited, code=killed, status=6/ABRT
Aug 15 07:44:59 test-instance-1 systemd[1]: Unit td-agent-bit.service entered failed state.
Aug 15 07:44:59 test-instance-1 systemd[1]: td-agent-bit.service failed.
Aug 15 07:44:59 test-instance-1 systemd[1]: td-agent-bit.service holdoff time over, scheduling restart.
Aug 15 07:44:59 test-instance-1 systemd[1]: Stopped TD Agent Bit.
Aug 15 07:44:59 test-instance-1 systemd[1]: Starting TD Agent Bit...
Aug 15 07:44:59 test-instance-1 systemd[1]: Started TD Agent Bit.
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: Fluent Bit v1.8.3
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: * Copyright (C) 2019-2021 The Fluent Bit Authors
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: * Copyright (C) 2015-2018 Treasure Data
--
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.27] multiline core started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.28] multiline core started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.29] multiline core started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.30] multiline core started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.31] multiline core started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.32] multiline core started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [sp] stream processor started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.3] inotify_fs_add(): inode=1185289 watch_fd=1 name=/var/log/test/service-1/request.log
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.3] inotify_fs_add(): inode=1185429 watch_fd=2 name=/var/log/test/service-1/service.log
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [engine] caught signal (SIGSEGV)
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)
Aug 15 07:44:59 test-instance-1 systemd[1]: td-agent-bit.service: main process exited, code=killed, status=6/ABRT
Aug 15 07:44:59 test-instance-1 systemd[1]: Unit td-agent-bit.service entered failed state.
Aug 15 07:44:59 test-instance-1 systemd[1]: td-agent-bit.service failed.
Aug 15 07:44:59 test-instance-1 systemd[1]: td-agent-bit.service holdoff time over, scheduling restart.
Aug 15 07:44:59 test-instance-1 systemd[1]: Stopped TD Agent Bit.
Aug 15 07:44:59 test-instance-1 systemd[1]: Starting TD Agent Bit...
Aug 15 07:44:59 test-instance-1 systemd[1]: Started TD Agent Bit.
Aug 15 07:44:59 test-instance-1 td-agent-bit[20489]: Fluent Bit v1.8.3
Aug 15 07:44:59 test-instance-1 td-agent-bit[20489]: * Copyright (C) 2019-2021 The Fluent Bit Authors
Aug 15 07:44:59 test-instance-1 td-agent-bit[20489]: * Copyright (C) 2015-2018 Treasure Data
--
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:44:59] [ info] [input:tail:tail.28] multiline core started
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:44:59] [ info] [input:tail:tail.29] multiline core started
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:44:59] [ info] [input:tail:tail.30] multiline core started
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:44:59] [ info] [input:tail:tail.31] multiline core started
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:44:59] [ info] [input:tail:tail.32] multiline core started
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:45:00] [ info] [sp] stream processor started
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:45:00] [ info] [input:tail:tail.3] inotify_fs_add(): inode=1185289 watch_fd=1 name=/var/log/test/service-1/request.log
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:45:00] [ info] [input:tail:tail.3] inotify_fs_add(): inode=1185429 watch_fd=2 name=/var/log/test/service-1/service.log
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:45:00] [ info] [input:tail:tail.4] inotify_fs_add(): inode=1185422 watch_fd=1 name=/var/log/test-2/service-1/request.log
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:45:00] [engine] caught signal (SIGSEGV)
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)
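
One rough way to summarize a journal excerpt like the one above is to count crashes per process ID and invalid stream_id errors per stream. The following Python sketch is only an illustration: the regular expressions are assumptions based on the line shapes quoted in this report, and `summarize` is a hypothetical helper, not part of Fluent Bit.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the journal lines quoted in this report.
PID_RE = re.compile(r"td-agent-bit\[(\d+)\]")
STREAM_RE = re.compile(r"invalid stream_id (\d+)")
CRASH_MARKER = "caught signal (SIGSEGV)"


def summarize(lines):
    """Return (crashes per PID, invalid stream_id errors per stream id)."""
    crashes = Counter()
    bad_streams = Counter()
    for line in lines:
        pid_match = PID_RE.search(line)
        if pid_match is None:
            continue  # not a td-agent-bit line (for example, systemd messages)
        if CRASH_MARKER in line:
            crashes[pid_match.group(1)] += 1
        stream_match = STREAM_RE.search(line)
        if stream_match:
            bad_streams[stream_match.group(1)] += 1
    return crashes, bad_streams
```

Fed the output of `journalctl -u td-agent-bit` line by line, this would show each restarted PID crashing once, which matches the restart loop described in this report.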

Steps to reproduce the problem

There is no easy way to reproduce it, because it is not quite clear to me what is causing the issue.

Expected behavior

Fluent Bit should be recovered automatically by systemd (Restart=always) without the DB and WAL files having to be removed.

Your Environment

  • Version used: 1.8.3
  • Environment name and version (e.g. Kubernetes? What version?): systemd 219
  • Server type and version: AWS EC2
  • Operating System and version: Amazon Linux 2
  • Filters and plugins: record_modifier and none
  • Configuration:
[INPUT]
    name                tail
    path                /var/log/test/service-1/*.log
    tag                 test-service-1
    multiline.parser    multiline_java

    Skip_Long_Lines     on
    Refresh_Interval    5
    Rotate_Wait         10

    Buffer_Chunk_Size   128KB
    Buffer_Max_Size     5MB
    Mem_Buf_Limit       100MB

    DB                  /tmp/test-td-agent-bit.db
    DB.sync             normal
    DB.locking          true
    DB.journal_mode     WAL


########################################################################################################


[FILTER]
    Name                record_modifier
    Match               test-service-1
    Record              source file
    Record              hostname ${HOSTNAME}
    Record              ec2_instance_id ${EC2_INSTANCE_ID}
    Record              service_name test-service-1


########################################################################################################


[OUTPUT]
    name                    http
    match                   *-service-1
    host                    c.example.com
    port                    443
    tls                     on

    http_user               ${CUSTOMER_NAME}
    http_passwd             ${CUSTOMER_TOKEN}
    uri                     /v1/http/fluentbit

    format                  msgpack
    header                  X-Example-Decoder fluent
    compress                gzip
    log_response_payload    true
    Retry_Limit             2

Additional context

We have to automate the restarts (remove the DB and WAL files and then restart the service) to prevent further loss of log data.
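
The workaround described above could be automated roughly as follows. This is a minimal sketch, not a recommended fix: the database path and unit name are taken from the configuration and journal output in this report, the helper names are hypothetical, and stopping the unit before deleting the files is a design choice of this sketch. It relies on SQLite in WAL mode keeping `-wal` and `-shm` files next to the database file.

```python
import subprocess
from pathlib import Path

# Both values are taken from this report; adjust them for your setup.
DB_PATH = Path("/tmp/test-td-agent-bit.db")
UNIT = "td-agent-bit.service"


def state_files(db_path):
    """Return the SQLite database plus its WAL-mode sidecar files."""
    db_path = Path(db_path)
    return [db_path, Path(f"{db_path}-wal"), Path(f"{db_path}-shm")]


def remove_state(db_path):
    """Delete whichever state files exist and return the ones removed."""
    removed = []
    for path in state_files(db_path):
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed


def reset_and_restart(db_path=DB_PATH, unit=UNIT):
    """Stop the unit, drop the tail DB state, then start the unit again."""
    subprocess.run(["systemctl", "stop", unit], check=True)
    removed = remove_state(db_path)
    subprocess.run(["systemctl", "start", unit], check=True)
    return removed
```

Keep in mind that deleting the database also discards the file offsets the tail input stored in it.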

autero1 commented Aug 27, 2021

SIGSEGV seems to be a common theme in a lot of the issues. We're seeing exactly the same behaviour in EKS, running as a DaemonSet. The Pods just crash with exit code 139, often right after start. There is nothing useful in the logs:

[2021/08/26 19:16:42] [ info] [input:tail:tail.0] inotify_fs_add(): inode=147854471 watch_fd=5 name=/var/log/containers/xxxx.log
[2021/08/26 19:16:42] [engine] caught signal (SIGSEGV)
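
Exit code 139 is consistent with the SIGSEGV in the log: shells and container runtimes conventionally report a process killed by signal N as 128 + N, and SIGSEGV is signal 11 on Linux. A small Python sketch of that decoding, assuming a Linux host (`describe_exit_code` is a hypothetical helper name):

```python
import signal


def describe_exit_code(code):
    """Map a shell-style exit code above 128 to a signal name, else None."""
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


if __name__ == "__main__":
    print(describe_exit_code(139))
```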

autero1 commented Aug 27, 2021

I tried downgrading all the way to version 1.6.0 and the issue went away, so the problem seems to have been introduced somewhere after 1.6.0.

jpfreeley commented Aug 27, 2021

+1, we are struggling with the same issue.
No meaningful log output.
We are running v1.8.4 on AWS EC2:
Linux ******* 4.14.203-156.332.amzn2.aarch64 #1 SMP Fri Oct 30 19:19:46 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux

Our CPUs spike to 100%. Stopping and starting the service via systemctl restart seems to fix the problem. Our service is set to "restart always".

The cause is completely unclear; it happens at random times of day and under varying load. No further information was found in the debug-level logs.

We have not yet tried previous versions.

We have only recently begun experiencing this issue.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Sep 27, 2021
@psyhomb
Author

psyhomb commented Sep 27, 2021

I've had all kinds of different issues with fluent-bit since day one and I've had enough. I've simply switched to Filebeat OSS and everything is running perfectly smoothly, without a single issue so far.

@github-actions github-actions bot removed the Stale label Sep 28, 2021
@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Oct 30, 2021
@psyhomb
Author

psyhomb commented Oct 30, 2021

Anything new on this? Is this issue resolved?

@github-actions github-actions bot removed the Stale label Oct 31, 2021
@nokute78
Collaborator

nokute78 commented Nov 8, 2021

I sent a patch, #4197, to fix the invalid stream_id error in in_tail.
The patch has been included since v1.8.9.
https://fluentbit.io/announcements/v1.8.9/

Could you check it?
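
To tell whether a given host already runs a release that contains the patch, the startup banner from the journal (for example `Fluent Bit v1.8.3` in the original report) can be compared against v1.8.9. A minimal Python sketch, with the caveat that the banner pattern is an assumption based on the log lines quoted in this issue and the helper names are hypothetical:

```python
import re

# v1.8.9 is the release named in the comment above as containing the patch.
FIXED_IN = (1, 8, 9)
# Banner shape assumed from the journal output quoted in this issue.
BANNER_RE = re.compile(r"Fluent Bit v(\d+)\.(\d+)\.(\d+)")


def banner_version(line):
    """Extract (major, minor, patch) from a startup banner line, or None."""
    match = BANNER_RE.search(line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def has_stream_id_fix(line):
    """True if the banner reports a version at or above v1.8.9."""
    version = banner_version(line)
    return version is not None and version >= FIXED_IN
```

Comparing integer tuples rather than strings avoids ordering `1.8.10` before `1.8.9`.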

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 10, 2021
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.
