fluent-bit 1.8.3 occasional crashes (caught signal SIGSEGV) #3955

psyhomb · 2021-08-15T17:10:44Z

Bug Report

Describe the bug

Fluent Bit, running as a systemd unit, occasionally crashes and cannot recover automatically. The only way to recover is to remove the DB and WAL files and then restart the service unit manually.

Aug 15 07:44:53 test-instance-1 td-agent-bit[16104]: [2021/08/15 07:44:53] [error] [multiline] invalid stream_id 1095162821569803272, could not append content to multiline context
Aug 15 07:44:54 test-instance-1 td-agent-bit[16104]: [2021/08/15 07:44:54] [error] [multiline] invalid stream_id 1095162821569803272, could not append content to multiline context
Aug 15 07:44:54 test-instance-1 td-agent-bit[16104]: [2021/08/15 07:44:54] [error] [multiline] invalid stream_id 14222019020935586222, could not append content to multiline context
Aug 15 07:44:55 test-instance-1 td-agent-bit[16104]: [2021/08/15 07:44:55] [error] [multiline] invalid stream_id 1095162821569803272, could not append content to multiline context
Aug 15 07:44:58 test-instance-1 td-agent-bit[16104]: [2021/08/15 07:44:58] [error] [multiline] invalid stream_id 14222019020935586222, could not append content to multiline context
Aug 15 07:44:59 test-instance-1 td-agent-bit[16104]: [2021/08/15 07:44:59] [engine] caught signal (SIGSEGV)
Aug 15 07:44:59 test-instance-1 td-agent-bit[16104]: ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)
Aug 15 07:44:59 test-instance-1 systemd[1]: td-agent-bit.service: main process exited, code=killed, status=6/ABRT
Aug 15 07:44:59 test-instance-1 systemd[1]: Unit td-agent-bit.service entered failed state.
Aug 15 07:44:59 test-instance-1 systemd[1]: td-agent-bit.service failed.
Aug 15 07:44:59 test-instance-1 systemd[1]: td-agent-bit.service holdoff time over, scheduling restart.
Aug 15 07:44:59 test-instance-1 systemd[1]: Stopped TD Agent Bit.
Aug 15 07:44:59 test-instance-1 systemd[1]: Starting TD Agent Bit...
Aug 15 07:44:59 test-instance-1 systemd[1]: Started TD Agent Bit.
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: Fluent Bit v1.8.3
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: * Copyright (C) 2019-2021 The Fluent Bit Authors
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: * Copyright (C) 2015-2018 Treasure Data
--
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.27] multiline core started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.28] multiline core started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.29] multiline core started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.30] multiline core started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.31] multiline core started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.32] multiline core started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [sp] stream processor started
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.3] inotify_fs_add(): inode=1185289 watch_fd=1 name=/var/log/test/service-1/request.log
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [ info] [input:tail:tail.3] inotify_fs_add(): inode=1185429 watch_fd=2 name=/var/log/test/service-1/service.log
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: [2021/08/15 07:44:59] [engine] caught signal (SIGSEGV)
Aug 15 07:44:59 test-instance-1 td-agent-bit[20430]: ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)
Aug 15 07:44:59 test-instance-1 systemd[1]: td-agent-bit.service: main process exited, code=killed, status=6/ABRT
Aug 15 07:44:59 test-instance-1 systemd[1]: Unit td-agent-bit.service entered failed state.
Aug 15 07:44:59 test-instance-1 systemd[1]: td-agent-bit.service failed.
Aug 15 07:44:59 test-instance-1 systemd[1]: td-agent-bit.service holdoff time over, scheduling restart.
Aug 15 07:44:59 test-instance-1 systemd[1]: Stopped TD Agent Bit.
Aug 15 07:44:59 test-instance-1 systemd[1]: Starting TD Agent Bit...
Aug 15 07:44:59 test-instance-1 systemd[1]: Started TD Agent Bit.
Aug 15 07:44:59 test-instance-1 td-agent-bit[20489]: Fluent Bit v1.8.3
Aug 15 07:44:59 test-instance-1 td-agent-bit[20489]: * Copyright (C) 2019-2021 The Fluent Bit Authors
Aug 15 07:44:59 test-instance-1 td-agent-bit[20489]: * Copyright (C) 2015-2018 Treasure Data
--
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:44:59] [ info] [input:tail:tail.28] multiline core started
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:44:59] [ info] [input:tail:tail.29] multiline core started
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:44:59] [ info] [input:tail:tail.30] multiline core started
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:44:59] [ info] [input:tail:tail.31] multiline core started
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:44:59] [ info] [input:tail:tail.32] multiline core started
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:45:00] [ info] [sp] stream processor started
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:45:00] [ info] [input:tail:tail.3] inotify_fs_add(): inode=1185289 watch_fd=1 name=/var/log/test/service-1/request.log
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:45:00] [ info] [input:tail:tail.3] inotify_fs_add(): inode=1185429 watch_fd=2 name=/var/log/test/service-1/service.log
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:45:00] [ info] [input:tail:tail.4] inotify_fs_add(): inode=1185422 watch_fd=1 name=/var/log/test-2/service-1/request.log
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: [2021/08/15 07:45:00] [engine] caught signal (SIGSEGV)
Aug 15 07:45:00 test-instance-1 td-agent-bit[20489]: ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)ERROR: no debug info in ELF executable (-1)

Steps to reproduce the problem

There is no easy way to reproduce it, because it is not quite clear to me what is causing the issue.

Expected behavior

The service should recover automatically through systemd (Restart=always) without the DB and WAL files having to be removed.

Your Environment

Version used: 1.8.3
Environment name and version (e.g. Kubernetes? What version?): systemd 219
Server type and version: AWS EC2
Operating System and version: Amazon Linux 2
Filters and plugins: record_modifier and none
Configuration:

[INPUT]
    name                tail
    path                /var/log/test/service-1/*.log
    tag                 test-service-1
    multiline.parser    multiline_java

    Skip_Long_Lines     on
    Refresh_Interval    5
    Rotate_Wait         10

    Buffer_Chunk_Size   128KB
    Buffer_Max_Size     5MB
    Mem_Buf_Limit       100MB

    DB                  /tmp/test-td-agent-bit.db
    DB.sync             normal
    DB.locking          true
    DB.journal_mode     WAL


########################################################################################################


[FILTER]
    Name                record_modifier
    Match               test-service-1
    Record              source file
    Record              hostname ${HOSTNAME}
    Record              ec2_instance_id ${EC2_INSTANCE_ID}
    Record              service_name test-service-1


########################################################################################################


[OUTPUT]
    name                    http
    match                   *-service-1
    host                    c.example.com
    port                    443
    tls                     on

    http_user               ${CUSTOMER_NAME}
    http_passwd             ${CUSTOMER_TOKEN}
    uri                     /v1/http/fluentbit

    format                  msgpack
    header                  X-Example-Decoder fluent
    compress                gzip
    log_response_payload    true
    Retry_Limit             2

Additional context

We have to automate restarts (remove the DB and WAL files, then restart the service) to prevent further loss of log data.
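The workaround described above could be scripted roughly as follows. This is only a sketch under assumptions: the helper names `sqlite_state_files` and `reset_tail_state` are made up for illustration, the unit name is taken from the journal output above, and the script stops the unit before deleting files rather than deleting them while the process is still running. The `-wal` and `-shm` suffixes are the sidecar files SQLite creates next to a database in WAL mode.

```python
import subprocess
from pathlib import Path


def sqlite_state_files(db_path):
    """Return the DB file plus the sidecar files SQLite creates in WAL mode."""
    db = Path(db_path)
    # SQLite names its write-ahead log and shared-memory files by appending
    # "-wal" and "-shm" to the database file name.
    return [db, Path(f"{db}-wal"), Path(f"{db}-shm")]


def reset_tail_state(db_path, unit="td-agent-bit.service", restart=True):
    """Stop the unit, delete the tail state files, then start the unit again.

    Returns the list of files that were actually removed.
    With restart=False only the files are deleted (useful for dry testing).
    """
    if restart:
        subprocess.run(["systemctl", "stop", unit], check=True)
    removed = []
    for path in sqlite_state_files(db_path):
        if path.exists():
            path.unlink()
            removed.append(path)
    if restart:
        subprocess.run(["systemctl", "start", unit], check=True)
    return removed
```

With the configuration above, the call would be `reset_tail_state("/tmp/test-td-agent-bit.db")`. Deleting the database discards the saved file offsets, so this trades possible duplicated or skipped lines for getting the service running again.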


autero1 · 2021-08-27T03:36:11Z

The SIGSEGV seems to be a common theme in a lot of the issues. We're seeing exactly the same behaviour in EKS, running as a DaemonSet. The Pods just crash with exit code 139, often right after start. There is absolutely nothing useful in the logs:

[2021/08/26 19:16:42] [ info] [input:tail:tail.0] inotify_fs_add(): inode=147854471 watch_fd=5 name=/var/log/containers/xxxx.log
[2021/08/26 19:16:42] [engine] caught signal (SIGSEGV)
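For readers wondering how exit code 139 relates to the SIGSEGV line: by common shell and container-runtime convention, a process killed by signal N is reported as 128 + N, and SIGSEGV is signal 11 on Linux. A small sketch of that decoding, where the function name `describe_exit_code` is an illustrative assumption rather than part of any tool mentioned here:

```python
import signal


def describe_exit_code(code):
    """Return the signal name encoded in a 128+N exit status, or None."""
    # Statuses above 128 conventionally mean "terminated by signal (code - 128)".
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # Not a signal number known on this platform.
            return None
    return None
```

On Linux, `describe_exit_code(139)` yields `"SIGSEGV"`, which matches the last line logged by the crashing Pods.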

autero1 · 2021-08-27T06:58:31Z

I downgraded all the way to version 1.6.0 and the issue went away, so the problem seems to have been introduced somewhere between 1.6.0 and the version we were running.

jpfreeley · 2021-08-27T14:51:00Z

+1 .. we are struggling with the same ..
No meaningful log output.
We are running v1.8.4 on aws ec2:
Linux ******* 4.14.203-156.332.amzn2.aarch64 #1 SMP Fri Oct 30 19:19:46 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux

Our CPUs spike to 100%. Stopping and starting the service via systemctl restart seems to fix the problem. Our service is set to "restart always".

Completely unclear what the cause is, happening at random times of day/load. No further information found in "DEBUG level" logs.

We have not yet tried previous versions.

We have only recently begun experiencing this issue.

github-actions · 2021-09-27T01:48:15Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

psyhomb · 2021-09-27T08:36:40Z

I have had all kinds of different issues with fluent-bit since day one and I have had enough. I've simply switched to Filebeat OSS and everything is running perfectly smoothly, with not a single issue so far.

github-actions · 2021-10-30T01:47:27Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

psyhomb · 2021-10-30T08:36:08Z

Is there anything new on this? Has this issue been resolved?

nokute78 · 2021-11-08T04:43:39Z

I sent a patch, #4197, to fix the invalid stream_id error in in_tail.
The patch has been merged and is included as of v1.8.9.
https://fluentbit.io/announcements/v1.8.9/

Could you check it?
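One way to check whether a given host is already on a release containing that patch is to compare the version in the startup banner (for example `Fluent Bit v1.8.3` in the journal output earlier in this thread) against 1.8.9. The sketch below assumes the banner format seen in those logs; the helper names are hypothetical. It only reports whether the stream_id patch should be present, not whether the SIGSEGV itself is resolved.

```python
import re

# First release that, per the comment above, includes the stream_id patch.
FIXED_IN = (1, 8, 9)


def parse_banner_version(line):
    """Extract (major, minor, patch) from a 'Fluent Bit vX.Y.Z' banner line."""
    match = re.search(r"Fluent Bit v(\d+)\.(\d+)\.(\d+)", line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def has_stream_id_patch(line):
    """True if the banner line reports a version at or after FIXED_IN."""
    version = parse_banner_version(line)
    # Tuples compare numerically field by field, so 1.8.10 sorts after 1.8.9.
    return version is not None and version >= FIXED_IN
```

Comparing integer tuples rather than strings avoids the usual pitfall where "1.8.10" would sort before "1.8.9".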

github-actions · 2021-12-10T02:03:09Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions · 2021-12-16T01:49:11Z

This issue was closed because it has been stalled for 5 days with no activity.

github-actions bot added the Stale label Sep 27, 2021

github-actions bot removed the Stale label Sep 28, 2021

github-actions bot added the Stale label Oct 30, 2021

github-actions bot removed the Stale label Oct 31, 2021

github-actions bot added the Stale label Dec 10, 2021

github-actions bot closed this as completed Dec 16, 2021
