invalid stream_id x, could not append content to multiline context #4190

Closed
Tinche opened this issue Oct 14, 2021 · 16 comments · Fixed by #4197
Comments

@Tinche

Tinche commented Oct 14, 2021

Bug Report

Describe the bug
Since upgrading to 1.8.8 using the fluent/fluent-bit Helm chart, I see these errors in the logs of essentially every daemonset pod.

To Reproduce
I'm not actually sure what's causing this. Is there a way to make FluentBit show me more information about the exact cause? I would also be OK with a way to disable multiline completely while still parsing structured logs from our services. (We log single JSON lines for the most part, so multiline isn't necessary.)

Expected behavior
FluentBit doesn't log tons of errors.

Screenshots
(Attached screenshot: Screenshot 2021-10-14 at 18 32 30)

Your Environment

  • Version used: 1.8.8
  • Configuration:
serviceMonitor:
  enabled: true

dashboards:
  enabled: true
  labelKey: grafana_dashboard
  annotations: {}

config:
  inputs: |
    [INPUT]
        Name             tail
        Path             /var/log/containers/*.log
        multiline.parser docker, cri
        Tag              kube.*
        Mem_Buf_Limit    50MB
        Skip_Long_Lines  Off
        Skip_Empty_Lines On
    [INPUT]
        Name systemd
        Tag host.*
        Systemd_Filter _SYSTEMD_UNIT=kubelet.service
        Read_From_Tail On
  outputs: |
    [OUTPUT]
        Name                es
        Match               kube.*
        Host                x
        Port                443
        Logstash_Format     On
        Retry_Limit         False
        Time_Key            @timestamp
        Replace_Dots        On
        Logstash_Prefix     kubernetes_cluster
        Logstash_DateFormat %y.%m.%d.%H
        tls                 on
        tls.verify          on
        tls.debug           1
        Trace_Error         On
        Suppress_Type_Name  On
  • Environment name and version: Kubernetes v1.21.4

Additional context
I'm trying to discover if there's something wrong on our end emitting logs or a problem with FluentBit. Thanks in advance!

@RalfWenzel

Hi @Tinche, I was just considering filing the same issue when I saw you opened yours.

I can confirm the issue and add some more context:

  • Our Kubernetes version is v1.20.7.
  • We detected the problem in 1.8.4 and hoped it would be fixed in 1.8.8 -> it is still a problem in 1.8.8.
  • We tested various output/filter combinations -> same result -> I think you can rule out a problem on your emitting side -> this seems to be a problem at the very beginning of the pipeline (tail, multiline.parser).
  • I configured tail to filter on the output of a container I know very well, so maybe this will help narrow down the problem (ongoing work, will add a comment later).
    The relevant part of my config is:
    [INPUT]
        Name tail
        Path /var/log/containers/my-app-0_*.log
        multiline.parser docker, cri
        Tag kube.*
        Mem_Buf_Limit 5MB
        Skip_Long_Lines On

My preliminary conclusion: this seems to be a Fluent-Bit problem in the multiline core, not something on your end (to be confirmed).

@RalfWenzel

RalfWenzel commented Oct 14, 2021

More context:

I tracked the error message "invalid stream_id x, could not append content to multiline context" down to /fluent-bit/blob/master/src/multiline/flb_ml_stream.c:

struct flb_ml_stream *flb_ml_stream_get(struct flb_ml_parser_ins *parser,
                                        uint64_t stream_id)
{
    struct mk_list *head;
    struct flb_ml_stream *mst = NULL;

    mk_list_foreach(head, &parser->streams) {
        mst = mk_list_entry(head, struct flb_ml_stream, _head);
        if (mst->id == stream_id) {
            return mst;
        }
    }

    return NULL;
}

Given this, I guess:

  • that parser->streams does not contain the stream_id passed as a parameter,
  • because the list is updated at some point in time (after log rotation?),
  • but ml_append_try_parser from /master/src/multiline/flb_ml.c still expects the stream to be open (a toy model of this failure mode is sketched below).
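
To make this hypothesis concrete, here is a small, self-contained toy model in plain C. It is not fluent-bit code: the struct, list handling, and function names are made up for illustration, and only the stream_id value and the error text mirror the logs in this thread. It shows a stream list keyed by a 64-bit id, a "destroy all streams with this id" step, and an append that fails once the entry is gone.

/* Toy model of the suspected failure mode: the stream list loses the
 * entry for a stream_id, but the tail input still tries to append to it.
 * This is NOT fluent-bit code, just an illustration of the hypothesis. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

struct stream {
    uint64_t id;
    struct stream *next;
};

static struct stream *streams = NULL;

/* look up a stream by id, analogous to iterating parser->streams */
static struct stream *stream_get(uint64_t id)
{
    struct stream *s;
    for (s = streams; s != NULL; s = s->next) {
        if (s->id == id) {
            return s;
        }
    }
    return NULL;
}

static void stream_create(uint64_t id)
{
    struct stream *s = malloc(sizeof(*s));
    s->id = id;
    s->next = streams;
    streams = s;
}

/* remove every stream with the given id from the list */
static void stream_destroy_all(uint64_t id)
{
    struct stream **p = &streams;
    while (*p) {
        if ((*p)->id == id) {
            struct stream *victim = *p;
            *p = victim->next;
            free(victim);
            continue;
        }
        p = &(*p)->next;
    }
}

static int append(uint64_t id)
{
    if (stream_get(id) == NULL) {
        fprintf(stderr, "[error] [multiline] invalid stream_id %llu, "
                "could not append content to multiline context\n",
                (unsigned long long) id);
        return -1;
    }
    return 0;
}

int main(void)
{
    uint64_t id = 11907403896869813768ULL;

    stream_create(id);      /* stream registered when the file is opened     */
    append(id);             /* works while the stream is still in the list   */
    stream_destroy_all(id); /* cleanup triggered for the rotated file's id   */
    append(id);             /* same id is now gone -> prints the error above */
    return 0;
}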

@ggermis

ggermis commented Oct 14, 2021

We too are seeing this issue on our production server after starting to use multiline.parser cri. We are using fluent-bit version 1.8.8 on Kubernetes 1.21.4, running fluent-bit as a DaemonSet, installed through the official Helm chart (0.19.1).

The errors always seem to start appearing on log rotation. This is how the typical log looks:

fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:27] [ info] [input:tail:tail.0] inode=2304023 handle rotation(): /var/log/containers/xyz-v2-67866f8657-7vhnw_prd_xyz-v2-e07b9f25a6fe6cfc23664406a102cf5f83994f1345a8a2d48c2e8976643a0cbf.log => /var/log/pods/prd_xyz-v2-67866f8657-7vhnw_df972d75-2a31-4979-9203-5ccbf3b1a3c1/xyz-v2/0.log.20211014-183627
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:27] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=2304023 watch_fd=38
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2304023 watch_fd=42 name=/var/log/pods/prd_xyz-v2-67866f8657-7vhnw_df972d75-2a31-4979-9203-5ccbf3b1a3c1/xyz-v2/0.log.20211014-183627
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:27] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2304022 watch_fd=43 name=/var/log/containers/xyz-v2-67866f8657-7vhnw_prd_xyz-v2-e07b9f25a6fe6cfc23664406a102cf5f83994f1345a8a2d48c2e8976643a0cbf.log
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:42] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=2304023 watch_fd=42
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:42] [error] [multiline] invalid stream_id 12936912804296098966, could not append content to multiline context
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:43] [error] [multiline] invalid stream_id 12936912804296098966, could not append content to multiline context
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:43] [error] [multiline] invalid stream_id 12936912804296098966, could not append content to multiline context
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:43] [error] [multiline] invalid stream_id 12936912804296098966, could not append content to multiline context
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:43] [error] [multiline] invalid stream_id 12936912804296098966, could not append content to multiline context
fluent-bit-8jbzn fluent-bit [2021/10/14 18:36:43] [error] [multiline] invalid stream_id 12936912804296098966, could not append content to multiline context
...

I set Rotate_Wait to 15 seconds. That is why the inotify_fs_remove() entry lags 15 seconds behind the log rotation entries. Immediately after the inotify_fs_remove(), the errors start appearing.

The same happens for another pod:

fluent-bit-449l8 fluent-bit [2021/10/14 18:37:44] [ info] [input:tail:tail.0] inode=2825607 handle rotation(): /var/log/containers/xyz-mail-7b84b65fc8-9fjc9_prd_xyz-mail-3b451c8d00cea53af69a49adef3eae1c20021672d386f845383514ffcf5e4063.log => /var/log/pods/prd_xyz-mail-7b84b65fc8-9fjc9_70036b92-7094-46e4-89d1-10a45406d146/xyz-mail/0.log.20211014-183744
fluent-bit-449l8 fluent-bit [2021/10/14 18:37:44] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=2825607 watch_fd=16
fluent-bit-449l8 fluent-bit [2021/10/14 18:37:44] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2825607 watch_fd=22 name=/var/log/pods/prd_xyz-mail-7b84b65fc8-9fjc9_70036b92-7094-46e4-89d1-10a45406d146/xyz-mail/0.log.20211014-183744
fluent-bit-449l8 fluent-bit [2021/10/14 18:37:44] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2825637 watch_fd=23 name=/var/log/containers/xyz-mail-7b84b65fc8-9fjc9_prd_xyz-mail-3b451c8d00cea53af69a49adef3eae1c20021672d386f845383514ffcf5e4063.log
fluent-bit-449l8 fluent-bit [2021/10/14 18:38:03] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=2825607 watch_fd=22
fluent-bit-449l8 fluent-bit [2021/10/14 18:38:04] [error] [multiline] invalid stream_id 1184272597832188412, could not append content to multiline context
fluent-bit-449l8 fluent-bit [2021/10/14 18:38:04] [error] [multiline] invalid stream_id 1184272597832188412, could not append content to multiline context
fluent-bit-449l8 fluent-bit [2021/10/14 18:38:04] [error] [multiline] invalid stream_id 1184272597832188412, could not append content to multiline context
fluent-bit-449l8 fluent-bit [2021/10/14 18:38:04] [error] [multiline] invalid stream_id 1184272597832188412, could not append content to multiline context
fluent-bit-449l8 fluent-bit [2021/10/14 18:38:04] [error] [multiline] invalid stream_id 1184272597832188412, could not append content to multiline context
...

After the errors start happening, no more logs are processed for the pod whose log files were rotated until the fluent-bit daemon is restarted.

@ggermis

ggermis commented Oct 15, 2021

I am able to reproduce the issue (both with the 1.8.8 build and on master) using the following config.

Configuration

~/fluent-bit.conf:

[SERVICE]
    Flush        1
    Daemon       Off
    Log_Level    info
    HTTP_Monitor Off
    HTTP_Port    2020
    storage.path              /tmp/fluent-bit-data/
    storage.sync              normal
    storage.checksum          off
    storage.max_chunks_up     128
    storage.backlog.mem_limit 512M
    storage.metrics           on

[INPUT]
    Name              tail
    Path              /tmp/logs/*.log
    multiline.parser  cri
    Tag               kube.*
    Refresh_Interval  1
    Mem_Buf_Limit     50MB
    Buffer_Chunk_Size 1MB
    Buffer_Max_Size   2MB
    Skip_Empty_Lines  On
    Skip_Long_Lines   On
    Rotate_Wait       5
    DB                /tmp/tail-containers-state.db
    DB.Sync           Normal
    storage.type      filesystem

[OUTPUT]
    Name stdout
    Match *

~/fluent-bit.logrotate:

/tmp/logs/*.log {
	rotate 5
	monthly
	rotate 12
	compress
	delaycompress
	missingok
	notifempty
	create 644 root root
}

The logrotate config needs to create a new file (new inode) on rotation to match the kubelet log rotation behavior, so no copytruncate.

Steps to reproduce the error:

  1. Create the initial log file:
$ echo "$(date -u +'%Y-%m-%dT%H:%M:%S.%NZ') stdout F testing..." >> /tmp/logs/x.log
  2. Start fluent-bit:
$ bin/fluent-bit -c ~/fluent-bit.conf
Fluent Bit v1.9.0
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/10/15 13:29:55] [ info] [engine] started (pid=127012)
[2021/10/15 13:29:55] [ info] [storage] version=1.1.4, initializing...
[2021/10/15 13:29:55] [ info] [storage] root path '/tmp/fluent-bit-data/'
[2021/10/15 13:29:55] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/10/15 13:29:55] [ info] [storage] backlog input plugin: storage_backlog.1
[2021/10/15 13:29:55] [ info] [cmetrics] version=0.2.2
[2021/10/15 13:29:55] [ info] [input:tail:tail.0] multiline core started
[2021/10/15 13:29:55] [ info] [input:storage_backlog:storage_backlog.1] queue memory limit: 488.3M
[2021/10/15 13:29:55] [ info] [sp] stream processor started
[2021/10/15 13:29:55] [ info] [input:tail:tail.0] inotify_fs_add(): inode=521755 watch_fd=1 name=/tmp/logs/x.log
  3. Force a log rotation:
$ logrotate --force ~/fluent-bit.logrotate
  4. fluent-bit will output:
[2021/10/15 13:31:31] [ info] [input:tail:tail.0] inode=521755 handle rotation(): /tmp/logs/x.log => /tmp/logs/x.log.1
[2021/10/15 13:31:31] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=521755 watch_fd=1
[2021/10/15 13:31:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=521755 watch_fd=2 name=/tmp/logs/x.log.1
[2021/10/15 13:31:31] [ info] [input:tail:tail.0] inotify_fs_add(): inode=521756 watch_fd=3 name=/tmp/logs/x.log
  5. After Rotate_Wait seconds it shows:
[2021/10/15 13:31:40] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=521755 watch_fd=2
  6. Try to add another entry to the log:
echo "$(date -u +'%Y-%m-%dT%H:%M:%S.%NZ') stdout F testing..." >> /tmp/logs/x.log
  7. The fluent-bit logs will now show the error:
[2021/10/15 13:32:53] [error] [multiline] invalid stream_id 11907403896869813768, could not append content to multiline context

Notes

I added some log statements to the flb_ml_stream_get method mentioned by @RalfWenzel and to the flb_ml_stream_id_destroy_all method. This is what I found:

  • Adding an entry to the log file shows this output in fluent-bit:
streams empty? false
11907403896869813768: (null)
[0] kube.tmp.logs.x.log: [1634306811.731850320, {"time"=>"2021-10-15T14:06:51.731850320Z", "stream"=>"stdout", "_p"=>"F", "log"=>"testing..."}]
  • Forcing a log rotation shows
[2021/10/15 14:07:03] [ info] [input:tail:tail.0] inode=521755 handle rotation(): /tmp/logs/x.log => /tmp/logs/x.log.1
streams empty? false
11907403896869813768: (null)
[2021/10/15 14:07:03] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=521755 watch_fd=1
[2021/10/15 14:07:03] [ info] [input:tail:tail.0] inotify_fs_add(): inode=521755 watch_fd=2 name=/tmp/logs/x.log.1
[2021/10/15 14:07:03] [ info] [input:tail:tail.0] inotify_fs_add(): inode=521756 watch_fd=3 name=/tmp/logs/x.log
  • After the Rotate_Wait seconds it shows:
Destroying stream 11907403896869813768: (null)
[2021/10/15 14:07:08] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=521755 watch_fd=2
  • Trying to add another entry to the log file after the rotation shows
streams empty? true
[2021/10/15 14:07:33] [error] [multiline] invalid stream_id 11907403896869813768, could not append content to multiline context

The changes to the two methods mentioned are (I just added the fprintf statements):

struct flb_ml_stream *flb_ml_stream_get(struct flb_ml_parser_ins *parser,
                                        uint64_t stream_id)
{
    struct mk_list *head;
    struct flb_ml_stream *mst = NULL;

    fprintf(stdout, "streams empty? %s\n", ((mk_list_is_empty(&parser->streams) == 0) ? "true" : "false"));
    mk_list_foreach(head, &parser->streams) {
        mst = mk_list_entry(head, struct flb_ml_stream, _head);
        fprintf(stdout, "%lu: %s\n", mst->id, mst->name);
        if (mst->id == stream_id) {
            return mst;
        }
    }

    return NULL;
}

void flb_ml_stream_id_destroy_all(struct flb_ml *ml, uint64_t stream_id)
{
    struct mk_list *tmp;
    struct mk_list *head;
    struct mk_list *head_group;
    struct mk_list *head_stream;
    struct flb_ml_group *group;
    struct flb_ml_stream *mst;
    struct flb_ml_parser_ins *parser_i;

    /* groups */
    mk_list_foreach(head, &ml->groups) {
        group = mk_list_entry(head, struct flb_ml_group, _head);

        /* parser instances */
        mk_list_foreach(head_group, &group->parsers) {
            parser_i = mk_list_entry(head_group, struct flb_ml_parser_ins, _head);

            /* streams */
            mk_list_foreach_safe(head_stream, tmp, &parser_i->streams) {
                mst = mk_list_entry(head_stream, struct flb_ml_stream, _head);
                if (mst->id != stream_id) {
                    continue;
                }
                fprintf(stdout, "Destroying stream %lu: %s\n", mst->id, mst->name);

                /* flush any pending data */
                flb_ml_flush_parser_instance(ml, parser_i, stream_id);

                /* destroy internal groups of the stream */
                flb_ml_stream_destroy(mst);
            }
        }
    }
}

@ggermis

ggermis commented Oct 15, 2021

Not completely related, but I would like to add that the test (tests/runtime_shell/in_tail) also fails. This is the test for plain tailing (normal, not multiline) and rotating files. Does this mean there's a more general issue with fluent-bit?

However, I first ran the script on Ubuntu Focal, which doesn't have python but has python3. In this case the script says it can't find python but still exits successfully! Perhaps this is the reason this wasn't flagged as an issue? (I don't know whether those tests run automatically on build or not.)

$ ./test_rotation.sh -- test_normal_rotation
==========> UNIT TEST: test_normal_rotation
Fluent Bit started, pid=179799
./test_rotation.sh: 66: eval: python: not found
Logger finished...wait 10 seconds
Fluent Bit v1.9.0
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/10/15 16:38:37] [ info] Configuration:
[2021/10/15 16:38:37] [ info]  flush time     | 1.000000 seconds
[2021/10/15 16:38:37] [ info]  grace          | 5 seconds
[2021/10/15 16:38:37] [ info]  daemon         | 0
[2021/10/15 16:38:37] [ info] ___________
[2021/10/15 16:38:37] [ info]  inputs:
[2021/10/15 16:38:37] [ info]      tail
[2021/10/15 16:38:37] [ info]      tail
[2021/10/15 16:38:37] [ info]      tail
[2021/10/15 16:38:37] [ info]      tail
[2021/10/15 16:38:37] [ info]      tail
[2021/10/15 16:38:37] [ info] ___________
[2021/10/15 16:38:37] [ info]  filters:
[2021/10/15 16:38:37] [ info] ___________
[2021/10/15 16:38:37] [ info]  outputs:
[2021/10/15 16:38:37] [ info]      file.0
[2021/10/15 16:38:37] [ info] ___________
[2021/10/15 16:38:37] [ info]  collectors:
cat: 'tmp_test/[abcdefghij]': No such file or directory
> write lines: 0
> read lines : 0
> database file tmp_test/a.db is OK
> database file tmp_test/b.db is OK
> database file tmp_test/c.db is OK
> database file tmp_test/d.db is OK
> database file tmp_test/e.db is OK
[2021/10/15 16:38:48] [engine] caught signal (SIGTERM)
[2021/10/15 16:38:48] [ info] [input] pausing tail.0
[2021/10/15 16:38:48] [ info] [input] pausing tail.1
[2021/10/15 16:38:48] [ info] [input] pausing tail.2
[2021/10/15 16:38:48] [ info] [input] pausing tail.3
[2021/10/15 16:38:48] [ info] [input] pausing tail.4

Ran 1 test.

OK

If I change test_rotation.sh to use python3 instead of python, I get this failure:

$ ./test_rotation.sh -- test_normal_rotation
==========> UNIT TEST: test_normal_rotation
Fluent Bit started, pid=188724
Fluent Bit v1.9.0
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/10/15 17:08:48] [ info] Configuration:
[2021/10/15 17:08:48] [ info]  flush time     | 1.000000 seconds
[2021/10/15 17:08:48] [ info]  grace          | 5 seconds
[2021/10/15 17:08:48] [ info]  daemon         | 0
[2021/10/15 17:08:48] [ info] ___________
[2021/10/15 17:08:48] [ info]  inputs:
[2021/10/15 17:08:48] [ info]      tail
[2021/10/15 17:08:48] [ info]      tail
[2021/10/15 17:08:48] [ info]      tail
[2021/10/15 17:08:48] [ info]      tail
[2021/10/15 17:08:48] [ info]      tail
[2021/10/15 17:08:48] [ info] ___________
[2021/10/15 17:08:48] [ info]  filters:
[2021/10/15 17:08:48] [ info] ___________
[2021/10/15 17:08:48] [ info]  outputs:
[2021/10/15 17:08:48] [ info]      file.0
[2021/10/15 17:08:48] [ info] ___________
[2021/10/15 17:08:48] [ info]  collectors:
Logger thread for 'tmp_test/a.log' has started
Logger thread for 'tmp_test/b.log' has started
Logger thread for 'tmp_test/c.log' has started
Logger thread for 'tmp_test/d.log' has started
Logger thread for 'tmp_test/e.log' has started
Logger thread finished
Logger thread finished
Logger thread finished
Logger thread finished
Logger thread finished
Logger finished...wait 10 seconds
> write lines: 500000
> read lines : 1877579
ASSERT:expected:<500000> but was:<1877579>
> database file tmp_test/a.db is OK
> database file tmp_test/b.db is OK
> database file tmp_test/c.db is OK
> database file tmp_test/d.db is OK
> database file tmp_test/e.db is OK
[2021/10/15 17:10:48] [engine] caught signal (SIGTERM)
[2021/10/15 17:10:48] [ info] [input] pausing tail.0
[2021/10/15 17:10:48] [ info] [input] pausing tail.1
[2021/10/15 17:10:48] [ info] [input] pausing tail.2
[2021/10/15 17:10:48] [ info] [input] pausing tail.3
[2021/10/15 17:10:48] [ info] [input] pausing tail.4

Ran 1 test.

FAILED (failures=1)

The output from the test files:

$ wc -l tmp_test/?.log*
    3148 tmp_test/a.log
    6918 tmp_test/a.log.1
    6918 tmp_test/a.log.10
    6918 tmp_test/a.log.11
    6918 tmp_test/a.log.12
    6918 tmp_test/a.log.13
    6918 tmp_test/a.log.14
    6918 tmp_test/a.log.2
    6918 tmp_test/a.log.3
    6918 tmp_test/a.log.4
    6918 tmp_test/a.log.5
    6918 tmp_test/a.log.6
    6918 tmp_test/a.log.7
    6918 tmp_test/a.log.8
    6918 tmp_test/a.log.9
    3148 tmp_test/b.log
    6918 tmp_test/b.log.1
    6918 tmp_test/b.log.10
    6918 tmp_test/b.log.11
    6918 tmp_test/b.log.12
    6918 tmp_test/b.log.13
    6918 tmp_test/b.log.14
    6918 tmp_test/b.log.2
    6918 tmp_test/b.log.3
    6918 tmp_test/b.log.4
    6918 tmp_test/b.log.5
    6918 tmp_test/b.log.6
    6918 tmp_test/b.log.7
    6918 tmp_test/b.log.8
    6918 tmp_test/b.log.9
    3148 tmp_test/c.log
    6918 tmp_test/c.log.1
    6918 tmp_test/c.log.10
    6918 tmp_test/c.log.11
    6918 tmp_test/c.log.12
    6918 tmp_test/c.log.13
    6918 tmp_test/c.log.14
    6918 tmp_test/c.log.2
    6918 tmp_test/c.log.3
    6918 tmp_test/c.log.4
    6918 tmp_test/c.log.5
    6918 tmp_test/c.log.6
    6918 tmp_test/c.log.7
    6918 tmp_test/c.log.8
    6918 tmp_test/c.log.9
    3148 tmp_test/d.log
    6918 tmp_test/d.log.1
    6918 tmp_test/d.log.10
    6918 tmp_test/d.log.11
    6918 tmp_test/d.log.12
    6918 tmp_test/d.log.13
    6918 tmp_test/d.log.14
    6918 tmp_test/d.log.2
    6918 tmp_test/d.log.3
    6918 tmp_test/d.log.4
    6918 tmp_test/d.log.5
    6918 tmp_test/d.log.6
    6918 tmp_test/d.log.7
    6918 tmp_test/d.log.8
    6918 tmp_test/d.log.9
    3148 tmp_test/e.log
    6918 tmp_test/e.log.1
    6918 tmp_test/e.log.10
    6918 tmp_test/e.log.11
    6918 tmp_test/e.log.12
    6918 tmp_test/e.log.13
    6918 tmp_test/e.log.14
    6918 tmp_test/e.log.2
    6918 tmp_test/e.log.3
    6918 tmp_test/e.log.4
    6918 tmp_test/e.log.5
    6918 tmp_test/e.log.6
    6918 tmp_test/e.log.7
    6918 tmp_test/e.log.8
    6918 tmp_test/e.log.9
  500000 total

and the others:

$ wc -l tmp_test/?
   386182 tmp_test/a
   333947 tmp_test/b
   421381 tmp_test/c
   379345 tmp_test/d
   356724 tmp_test/e
  1877579 total

So it would seem that the output configuration is not working as expected. The docs say that when you don't specify a File parameter on the file output, it will use the tag, but that doesn't seem to be the case, since there are more entries in those output files combined than were written to the log files themselves. Duplicates? The config used in the test is:

[SERVICE]
    flush     1
    daemon    off
    log_level debug
    log_file  ${TEST_DIR}/out.log

[INPUT]
    name             tail
    tag              a
    path             ${TEST_DIR}/a.log
    db               ${TEST_DIR}/a.db
    db.sync          full
    rotate_wait      5
    refresh_interval 2

[INPUT]
    name             tail
    tag              b
    path             ${TEST_DIR}/b.log
    db               ${TEST_DIR}/b.db
    db.sync          full
    rotate_wait      5
    refresh_interval 2

[INPUT]
    name             tail
    tag              c
    path             ${TEST_DIR}/c.log
    db               ${TEST_DIR}/c.db
    db.sync          full
    rotate_wait      5
    refresh_interval 2

[INPUT]
    name             tail
    tag              d
    path             ${TEST_DIR}/d.log
    db               ${TEST_DIR}/d.db
    db.sync          full
    rotate_wait      5
    refresh_interval 2

[INPUT]
    name             tail
    tag              e
    path             ${TEST_DIR}/e.log
    db               ${TEST_DIR}/e.db
    db.sync          full
    rotate_wait      5
    refresh_interval 2

[OUTPUT]
    name   file
    match  *
    path   ${TEST_DIR}

It would probably also be good to add a test for the multiline parser once the normal one is working again :)

@RalfWenzel

RalfWenzel commented Oct 15, 2021

From everything @ggermis has said above, I think it can be assumed that there is a problem somewhere in the flb_ml_stream ecosystem.
Whatever solution is found, it will be fixed in an upcoming release. In the meantime, a workaround could be helpful.
One option, with respect to the question asked by @Tinche in the very first post of this thread, is to disable multiline support if the application does not require it.

For the configuration shown by @Tinche, simply change the:

config:
  inputs: |
    [INPUT]
        Name             tail
        Path             /var/log/containers/*.log
        multiline.parser docker, cri
        Tag              kube.*
        Mem_Buf_Limit    50MB
        Skip_Long_Lines  Off
        Skip_Empty_Lines On

section to

config:
  inputs: |
    [INPUT]
        Name             tail
        Path             /var/log/containers/*.log
        Parser           cri
        Tag              kube.*
        Mem_Buf_Limit    50MB
        Skip_Long_Lines  Off
        Skip_Empty_Lines On

With this you will lose the multiline context, but fluent-bit won't hang.

@Tinche
Author

Tinche commented Oct 15, 2021

@RalfWenzel Thanks a lot, that would be very useful, but it doesn't work - the JSON payload doesn't get parsed at all.

@RalfWenzel

@Tinche This looks to me as if we are discussing problems on different levels.

  1. My expectation was that the log record written by your container engine is not JSON, but the log written by the app is a single line of JSON. Example: the app writes to stdout:

{ "myprop" : "myval" }

The resulting log record in "/var/log/containers/my_pod_....log" then looks like:

2021-10-15T09:53:08.896924582Z stdout F { "myprop" : "myval" }

This should work (at least it does in my installation).

  2. If your container engine writes JSON, you need another parser (not CRI).

  3. Your container engine writes either 1) or 2), but your app writes JSON as multiline:

{
  "myprop" : "myval"
}

then multiline parsing (or changing the app) is required.

Can you clarify (or provide some records from the original logs on the node filesystem in /var/log/containers)?

@Tinche
Author

Tinche commented Oct 15, 2021

@RalfWenzel No, it's a single line just as you correctly assumed, it's just not getting parsed.

2021-10-15T21:16:12.600253126Z stderr F {"host":"crew-service:80","user-agent":"Go-http-client/1.1","transfer-encoding":"chunked","adult":"1","api_version":"40","appid":"com.pz.life","buildnumber":"1840","installsource":"release","is_masquerade":"0","locale":"en","pin_status":"0","privilege":"normal","version":"3.21.2","accept-encoding":"gzip","path":"/session-metadata","event":"Processing","logger":"hr.crew.handlers_direct","level":"info"}

So at the final destination (Elasticsearch) it all ends up in a single string under the log key.

@RalfWenzel

RalfWenzel commented Oct 15, 2021

@Tinche I think I can help with this, but it's off-topic in this thread, which should stay focused on the problem with the multiline parsing. I assume that the "invalid stream_id..." message was gone after you deployed the approach I proposed?
If yes, can you please open another issue with something like "JSON parsing not working as expected"?

@Tinche
Author

Tinche commented Oct 15, 2021

@RalfWenzel Well, I rolled back the configuration since unstructured logs aren't useful to us. That said, I can apply your change on Monday and let it run for a few hours and see if any errors pop up.

@RalfWenzel

@Tinche Then a quick answer: your JSON is not parsed because you did not tell fluent-bit to do so. You need a filter between the input and the output. Try:

  filters: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off

@Tinche
Author

Tinche commented Oct 16, 2021

@RalfWenzel Since I'm using the Helm chart, the filter should be active by default (https://github.com/fluent/helm-charts/blob/55dd89e76a914800eec6a1bee57641b24f46744f/charts/fluent-bit/values.yaml#L208), but I guess it doesn't get used?

In addition, thanks a lot for your help with this issue and coming up with a workaround. :)

@nokute78
Collaborator

I found the root cause.
The stream_id is calculated from the filename, so in_tail creates the same stream_id after rotation.
When in_tail then tries to delete the old stream_id (which is equal to the new one), it ends up destroying the new stream_id's instance (illustrated below).
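
To illustrate why the ids collide, here is a standalone example in plain C. The hash function below (FNV-1a) is not the function fluent-bit actually uses to derive stream ids; it merely stands in for "a 64-bit id computed from a string". The point is that an id derived only from the file name is identical before and after rotation, while an id that also mixes in the inode is not. The inode values are taken from the reproduction log further down; the name+inode scheme is shown only as one possible way to keep the ids distinct, see #4197 for the actual patch.

/* Illustration only: a name-derived stream id collides across rotation,
 * while a name+inode-derived id does not. FNV-1a is a stand-in hash and
 * is NOT what fluent-bit uses internally. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint64_t fnv1a64(const char *s)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */

    while (*s) {
        h ^= (unsigned char) *s++;
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

int main(void)
{
    const char *path = "/tmp/logs/x.log";
    unsigned long inode_old = 1703983;    /* inode before rotation (from the log below) */
    unsigned long inode_new = 1703973;    /* inode of the newly created x.log           */
    char buf[256];

    /* Name-only id: the old and the new file get exactly the same id. */
    printf("name-only id (old file): %llu\n", (unsigned long long) fnv1a64(path));
    printf("name-only id (new file): %llu\n", (unsigned long long) fnv1a64(path));

    /* Name + inode id: the ids differ, so destroying the stream of the
       rotated file can no longer take the new file's stream with it. */
    snprintf(buf, sizeof(buf), "%s:%lu", path, inode_old);
    printf("name+inode id (old file): %llu\n", (unsigned long long) fnv1a64(buf));
    snprintf(buf, sizeof(buf), "%s:%lu", path, inode_new);
    printf("name+inode id (new file): %llu\n", (unsigned long long) fnv1a64(buf));
    return 0;
}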

I added the diff below to print the old/new stream_id.

diff --git a/plugins/in_tail/tail_file.c b/plugins/in_tail/tail_file.c
index 958bd58d..59433dea 100644
--- a/plugins/in_tail/tail_file.c
+++ b/plugins/in_tail/tail_file.c
@@ -894,6 +894,7 @@ int flb_tail_file_append(char *path, struct stat *st, int mode,
             goto error;
         }
         file->ml_stream_id = stream_id;
+        flb_error("new stream_id= %lu", stream_id);
     }
 
     /* Local buffer */
diff --git a/src/multiline/flb_ml_stream.c b/src/multiline/flb_ml_stream.c
index 9220e7b9..1e6b442a 100644
--- a/src/multiline/flb_ml_stream.c
+++ b/src/multiline/flb_ml_stream.c
@@ -289,7 +289,7 @@ void flb_ml_stream_id_destroy_all(struct flb_ml *ml, uint64_t stream_id)
     struct flb_ml_group *group;
     struct flb_ml_stream *mst;
     struct flb_ml_parser_ins *parser_i;
-
+    flb_error("%s stream_id=%lu", __FUNCTION__, stream_id);
     /* groups */
     mk_list_foreach(head, &ml->groups) {
         group = mk_list_entry(head, struct flb_ml_group, _head);

The log then shows that these stream_ids are the same.

Fluent Bit v1.9.0
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/10/17 10:48:07] [ info] [engine] started (pid=19267)
[2021/10/17 10:48:07] [ info] [storage] version=1.1.4, initializing...
[2021/10/17 10:48:07] [ info] [storage] root path '/tmp/fluent-bit-data/'
[2021/10/17 10:48:07] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/10/17 10:48:07] [ info] [storage] backlog input plugin: storage_backlog.1
[2021/10/17 10:48:07] [ info] [cmetrics] version=0.2.2
[2021/10/17 10:48:07] [ info] [input:tail:tail.0] multiline core started
[2021/10/17 10:48:07] [error] new stream_id= 11907403896869813768
[2021/10/17 10:48:07] [ info] [input:storage_backlog:storage_backlog.1] queue memory limit: 488.3M
[2021/10/17 10:48:07] [ info] [sp] stream processor started
[2021/10/17 10:48:07] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1703983 watch_fd=1 name=/tmp/logs/x.log
[0] kube.tmp.logs.x.log: [1634435253.383217136, {"time"=>"2021-10-17T01:47:33.383217136Z", "stream"=>"stdout", "_p"=>"F", "log"=>"testing..."}]
[2021/10/17 10:48:12] [ info] [input:tail:tail.0] inode=1703983 handle rotation(): /tmp/logs/x.log => /tmp/logs/x.log.1
[2021/10/17 10:48:12] [error] new stream_id= 11907403896869813768
[2021/10/17 10:48:12] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1703983 watch_fd=1
[2021/10/17 10:48:12] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1703983 watch_fd=2 name=/tmp/logs/x.log.1
[2021/10/17 10:48:12] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1703973 watch_fd=3 name=/tmp/logs/x.log
[2021/10/17 10:48:21] [error] flb_ml_stream_id_destroy_all stream_id=11907403896869813768
[2021/10/17 10:48:21] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1703983 watch_fd=2
^C[2021/10/17 10:48:28] [engine] caught signal (SIGINT)
[2021/10/17 10:48:28] [ info] [input] pausing tail.0
[2021/10/17 10:48:28] [ info] [input] pausing storage_backlog.1
[2021/10/17 10:48:28] [ warn] [engine] service will stop in 5 seconds
[2021/10/17 10:48:32] [ info] [engine] service stopped
[2021/10/17 10:48:32] [error] flb_ml_stream_id_destroy_all stream_id=11907403896869813768
[2021/10/17 10:48:32] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1703973 watch_fd=3

nokute78 added a commit to nokute78/fluent-bit that referenced this issue Oct 17, 2021
If stream_id is created by filename, rotated file id will be same.
It causes releasing new multiline instance after file rotation.

Signed-off-by: Takahiro Yamashita <[email protected]>
@nokute78
Collaborator

I sent a patch #4197.

edsiper pushed a commit that referenced this issue Oct 26, 2021
If stream_id is created by filename, rotated file id will be same.
It causes releasing new multiline instance after file rotation.

Signed-off-by: Takahiro Yamashita <[email protected]>
@Tinche
Author

Tinche commented Oct 29, 2021

Confirming this looks to be fixed, thanks!

0Delta pushed a commit to 0Delta/fluent-bit that referenced this issue Jan 20, 2022
If stream_id is created by filename, rotated file id will be same.
It causes releasing new multiline instance after file rotation.

Signed-off-by: Takahiro Yamashita <[email protected]>