could not enqueue records into the ring buffer #7071

Closed
qnapnickchang opened this issue Mar 26, 2023 · 17 comments · Fixed by #7812

@qnapnickchang

Bug Report

Describe the bug
Error Log:

[2023/03/26 07:32:13] [error] [input:tail:tail.0] could not enqueue records into the ring buffer
[tail.0] failed buffer write, retries=0
[tail.0] failed buffer write, retries=1
[tail.0] failed buffer write, retries=2
[tail.0] failed buffer write, retries=3
[tail.0] failed buffer write, retries=4
[tail.0] failed buffer write, retries=5
[tail.0] failed buffer write, retries=0
[tail.0] failed buffer write, retries=1
[tail.0] failed buffer write, retries=2
[tail.0] failed buffer write, retries=3
[tail.0] failed buffer write, retries=4
[tail.0] failed buffer write, retries=5
[tail.0] failed buffer write, retries=6
[tail.0] failed buffer write, retries=7
[tail.0] failed buffer write, retries=8
[tail.0] failed buffer write, retries=0
[tail.0] failed buffer write, retries=0
[tail.0] failed buffer write, retries=0
[tail.0] failed buffer write, retries=0
[tail.0] failed buffer write, retries=1
[tail.0] failed buffer write, retries=2
[tail.0] failed buffer write, retries=0

fluent-bit config

apiVersion: v1
data:
  custom_parsers.conf: |
    [PARSER]
        Name docker_no_time
        Format json
        Time_Keep Off
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
    [PARSER]
        Name custom-nginx
        Format regex
        Regex ^(?<domain>[^ ]*) (?<remote_ip>[^\, ]*)(?:,? ?[^\[]*) \[(?<time>[^\]]*)\]\[(?<msec>[^ ]*)\] "(?<method>[^ ]*) (?<url>[^\?]*)(?:\??)(?<url_params>.*) (HTTP\/)?(?<http_version>[^ ]*)" (?<status>[^ ]*) (?<body_bytes>[^ ]*) "(?<referer>[^\"]*)" "(?<http_user_agent>[^\"]*)" "(?<body>[^ ]*)" (?<request_time>[^ ]*) (?<response_time>[^ ]*) (?<app_id>[^ ]*) (?<digest>[^ ]*) "(?<firmware_version>[^ ]*)" "(?<device_model>[^ ]*)" "(?<app_version>[^ ]*)" "(?<token>[^\?]*)" "(?<resp_body>[^"]*)" "(?<request_id>[^ "]*)"
        Time_Format %d/%b/%Y:%H:%M:%S %z
        Time_Key time
    [PARSER]
        Name   logfmt
        Format logfmt
  fluent-bit.conf: |
    [SERVICE]
        Daemon Off

        Flush 5
        Log_Level warn
        Parsers_File parsers.conf
        Parsers_File custom_parsers.conf
        HTTP_Server On
        HTTP_Listen 0.0.0.0
        HTTP_Port 2020
        storage.path              /var/log/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 300M
        storage.max_chunks_up     256
        storage.metrics           on

    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        multiline.parser docker, cri
        Tag kube.*
        DB /var/log/flb_kube.db
        Mem_Buf_Limit  256MB
        Rotate_Wait 30
        Buffer_Chunk_Size 5MB
        Buffer_Max_Size 512MB
        threaded on
    [INPUT]
        Name              tail
        Tag               kube_audit.*
        Path              /var/log/kube-apiserver-audit.log
        Parser            docker_no_time
        DB                /var/log/flb_kube_audit.db
        Mem_Buf_Limit  32MB
        Rotate_Wait    30
        Buffer_Chunk_Size 5MB
        Buffer_Max_Size 32MB
        threaded on

    [FILTER]
        Name kubernetes
        Match kube.*
        Kube_URL       https://kubernetes.default.svc:443
        Merge_Log On
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude Off
    [FILTER]
        Name           grep
        Match          *
        Exclude        http_user_agent ELB-HealthChecker
    [FILTER]
        Name             geoip2
        Match            *
        Database         /fluent-bit/geo/GeoLite2-City.mmdb
        Lookup_key       remote_ip
        Record  country_name  remote_ip %{country.names.en}
        Record  country_code2 remote_ip %{country.iso_code}
    [FILTER]
        Name           nest
        Match          *
        Operation      nest
        Wildcard       country*
        Nest_under     geoip
    [FILTER]
        Name           lua
        Match          *
        script         /fluent-bit/scripts/qnap_filter.lua
        call           ip_filter

    [OUTPUT]
        Name loki
        Match *
        Host loki-loki-distributed-distributor.monitoring.svc
        Port 3100
        labels job=fluent-bit
        label_map_path  /fluent-bit/etc/labelmap.json
        remove_keys kubernetes.container_hash, kubernetes.docker_id, kubernetes.annotations, stream, _p
        auto_kubernetes_labels off
        line_format json
        Retry_Limit False
  labelmap.json: |
    {
      "kubernetes": {
        "container_name": "container",
        "host": "node",
        "labels": {
          "app": "app",
          "k8s_app": "app",
          "release": "release"
        },
        "namespace_name": "namespace",
        "pod_name": "instance"
      },
      "stream": "stream"
    }

**To Reproduce**
When I edit the deployment and add the fluentbit.io/parser annotation:

  template:
    metadata:
      annotations:
        fluentbit.io/parser: custom-nginx

Fluent Bit can't send logs to Loki and logs go missing.

**Your Environment**
* Version used: Kubernetes 1.25.6, Fluent Bit 2.0.8

Does anyone have any suggestions?
@leonardo-albertovich
Collaborator

It seems like your filter stack is causing a bottleneck, meaning both inputs combined are ingesting data faster than it can move through the system. I think we need to address this, but I'm curious about the kubernetes filter: I'd like to know if it's the culprit (because it makes HTTP requests in synchronous mode), but I'm not entirely sure how to get more information about it in this case.
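
One way to get more information here, for what it's worth: the SERVICE section in the config above already enables the HTTP server on port 2020 and storage.metrics, so Fluent Bit's built-in monitoring API can report per-plugin record counts and storage chunk state, which should show whether chunks are piling up behind the filter stack. A minimal sketch of the relevant settings (already present above, repeated only for reference):

[SERVICE]
    HTTP_Server      On
    HTTP_Listen      0.0.0.0
    HTTP_Port        2020
    storage.metrics  on

With that enabled, querying http://127.0.0.1:2020/api/v1/metrics and http://127.0.0.1:2020/api/v1/storage from the node running Fluent Bit exposes those counters.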

@dakkkob

dakkkob commented Apr 12, 2023

Hello,

I'm having the very same issue with a much simpler configuration: 4 tail inputs and 1 output to Splunk. The logs are not particularly chatty (usually a batch of a few lines appears every 10 seconds), and yet once Fluent Bit reaches the "could not enqueue records into the ring buffer" error state it never recovers; no new records appear in Splunk until Fluent Bit is restarted. See my config below. I used to have all logs in a single tail (using a wildcard *), but I tried splitting it as shown below and the issue is still present.

Edit: Fluent-bit 2.0.9 on Windows.

[SERVICE]
    flush        5
    daemon       Off
    log_level    debug
    log_file     C:\ProgramData\fluent-bit.log
    parsers_file parsers.conf
    plugins_file plugins.conf
    http_server  Off
    http_listen  0.0.0.0
    http_port    2020
    storage.metrics on
[INPUT]
    name tail
    parser json
    Path C:\Products\A.log
    threaded on
[INPUT]
    name tail
    parser json
    Path C:\Products\B.log
    threaded on
[INPUT]
    name tail
    parser json
    Path C:\Products\C.log
    threaded on
[INPUT]
    name tail
    parser json
    Path C:\Products\D.log
    threaded on
[OUTPUT]
    name  splunk
    match *
    Host http-inputs.splunkcloud.com
    Port 443
    Tls On
    Splunk_Token ******
    event_index ******
    event_host ******
    # LEGACY DNS resolver due to memory leak since v2.0.0
    # https://github.com/fluent/fluent-bit/issues/6525
    net.dns.resolver LEGACY

And the log (attached as a screenshot). Maybe it's the output that gets stuck?

@amolbms

amolbms commented May 5, 2023

@leonardo-albertovich I am having the same issue after enabling the threaded feature for the tail input. Where is the ring buffer used? I have set storage to filesystem for the tail input, so why does it not flush to the filesystem? Where is the bottleneck?

[2023/05/05 16:16:31] [error] [input:tail:tail.2] could not enqueue records into the ring buffer
[tail.2] failed buffer write, retries=0
[tail.2] failed buffer write, retries=1
[tail.2] failed buffer write, retries=2
[tail.2] failed buffer write, retries=3
[tail.2] failed buffer write, retries=4
[tail.2] failed buffer write, retries=5
[tail.2] failed buffer write, retries=6
[tail.2] failed buffer write, retries=7
[tail.2] failed buffer write, retries=8
[tail.2] failed buffer write, retries=9
[tail.2] failed buffer write, retries=0

@leonardo-albertovich
Collaborator

Hi @amolbms, the ring buffer is used to move ingested records from the input plugin threads to the main pipeline thread where they are filtered, persisted and routed.

I'd need to know a bit more about your setup to make a proper assessment, but if you are running Fluent Bit 2.1 (or are able to upgrade) you'll find that moving your filters to the processor stack of the input plugin (which requires using YAML as your configuration file format) will probably eliminate this issue.

Something else you can do is ensure your output plugins are running in threaded mode (with workers set to a value of 1 or more), which should be the default for most plugins in 2.1.
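
For reference, a minimal classic-format sketch of that workers suggestion; the output plugin and destination below are placeholders, and only the workers line is the relevant part:

[OUTPUT]
    # placeholder output; keep your real destination plugin here
    name    stdout
    match   *
    # run the output in one or more dedicated worker threads
    workers 2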

If you share a bit more information about your setup I might be able to give you better feedback.

@amolbms

amolbms commented May 5, 2023

@leonardo-albertovich Thanks for the quick feedback. We are using a Lua script in the filter. Here is my config for the input that is having issues. Sure, I will look into how I can move the filter to a processor.

[INPUT]
    Name              tail
    Path              /varlog/<service>/access.log
    DB                /varlog/<service>/access.db.pos
    DB.Sync           Normal
    DB.locking        true
    Buffer_Chunk_Size 15M
    Buffer_Max_Size   500MB
    Mem_Buf_Limit     500MB
    read_from_head    true
    Refresh_Interval  5
    Rotate_Wait       20
    threaded          on
    Tag               mdsd.xxx

[FILTER]
    Name      lua
    Match     mdsd.xxx
    script    fluentbit_filter.lua
    call      modify_record_for_xxx

@amolbms

amolbms commented May 5, 2023

Does this config look correct?

  pipeline:
    inputs:
      - name: tail
         Path /varlog//access.log
         DB /varlog//access.db.pos
         DB.Sync Normal
         DB.locking true
         Buffer_Chunk_Size 15M
         Buffer_Max_Size 500MB
         Mem_Buf_Limit 500MB
         read_from_head true
        Refresh_Interval 5
        Rotate_Wait 20
        threaded on
        Tag mdsd.xxx

        processors:
          logs:
            - name: lua
              call: modify_record_for_xxx
              script: fluentbit_filter.lua

@leonardo-albertovich
Collaborator

The processor block looks correct, and the rest does as well save for the indentation issue; I assume the path /varlog/ is correct and not a typo.

The one thing there that's counterproductive, in my opinion, is the values of Buffer_Chunk_Size and Buffer_Max_Size: those are meant to regulate the ingestion buffer and ideally should be much lower. In fact, the documentation lists the default value as 32 kilobytes.

Here's an example; you'll have to replace the output component, but other than that it should be compliant with your configuration:

pipeline:
  inputs:
  - name: tail
    path: /varlog/access.log
    db: /varlog/access.db.pos
    db.sync: normal
    db.locking: true
    buffer_chunk_size: 2MB
    buffer_max_size: 2MB
    mem_buf_limit: 500MB
    read_from_head: true
    refresh_interval: 5
    rotate_wait: 20
    threaded: on
    tag: mdsd.xxx

    processors:
      logs:
        - name: lua
          call: modify_record_for_xxx
          script: fluentbit_filter.lua

  outputs:
  - name: stdout
    match: "*"

@leonardo-albertovich
Collaborator

In the example I shared, the filter step is performed in the input thread before the records are inserted into the ring buffer, which should eliminate the bottleneck.

@amolbms

amolbms commented May 5, 2023

Thanks @leonardo-albertovich. I tried the processor and the error went away. After adding rewrite_tag inside the processor I get the exception below. Is it not supported?

My configuration is:

          processors:
              logs:
                  - name: lua
                    match: mdsd.fdaceesslogs
                    script: fluentbit_filter.lua
                    call: modify_record_for_fdaccesslogs

                  - name: rewrite_tag
                    match: mdsd.fdaceesslogs
                    rule: ${MDSD_REWRITE_TAG_FILTER_KEY} ^.*$ mdsd.azuremonitorlogs true

                  - name: lua
                    match: mdsd.azuremonitorlogs
                    script: fluentbit_filter.lua
                    call: modify_record_for_azuremonitorlogs
[2023/05/05 21:20:14] [engine] caught signal (SIGSEGV)
#0  0x55e070a463d2      in  mk_event_channel_create() at lib/monkey/mk_core/mk_event.c:175
#1  0x55e06ff24055      in  input_instance_channel_events_init() at src/flb_input.c:838
#2  0x55e06ff251be      in  flb_input_instance_init() at src/flb_input.c:1195
#3  0x55e07044b68d      in  emitter_create() at plugins/filter_rewrite_tag/rewrite_tag.c:75
#4  0x55e07044c189      in  cb_rewrite_tag_init() at plugins/filter_rewrite_tag/rewrite_tag.c:290
#5  0x55e06ff3866f      in  flb_filter_init() at src/flb_filter.c:594
#6  0x55e06ff9e0ee      in  flb_processor_unit_init() at src/flb_processor.c:214
#7  0x55e06ff9e1f2      in  flb_processor_init() at src/flb_processor.c:249
#8  0x55e06ff9ecc9      in  flb_processors_load_from_config_format_group() at src/flb_processor.c:619
#9  0x55e06ff4ea2e      in  configure_plugins_type() at src/flb_config.c:829
#10 0x55e06ff4edd7      in  flb_config_load_config_format() at src/flb_config.c:909
#11 0x55e06fe71e8e      in  service_configure() at src/fluent-bit.c:712
#12 0x55e06fe72852      in  flb_main() at src/fluent-bit.c:1032
#13 0x55e06fe72a8b      in  main() at src/fluent-bit.c:1131
#14 0x7fe5fc058d09      in  ???() at ???:0
#15 0x55e06fe6cb19      in  ???() at ???:0
#16 0xffffffffffffffff  in  ???() at ???:0

@leonardo-albertovich
Collaborator

That's not expected at all. Could you please share more context with me? A complete and properly escaped config file would probably be enough for me to find the root of the issue and fix it.

If you can't or don't want to share the configuration file publicly feel free to send me a private message in slack.

Thank you!

@amolbms

amolbms commented May 6, 2023

Sure, let me share the config on Slack.

Thank you!

@leonardo-albertovich
Collaborator

Hi @amolbms, it seems like you couldn't find me in Slack. I'd really appreciate it if you shared that configuration so we can fix this issue.

Thank you!

@msolters
Copy link

We have almost exactly the same problem. This config doesn't crash, but reports the ring buffer error.

    pipeline:
      inputs:
        - name: tail
          alias: foo
          path: $the_paths
          db: /var/log/fluent-bit-s3.db
          db.sync: normal
          db.locking: true
          db.journal_mode: WAL
          key: message
          tag: foo.*
          buffer_chunk_size: 2MB
          buffer_max_size: 5MB
          mem_buf_limit: 50MB
          path_key: log_file_path
          skip_long_lines: On
          offset_key: offset
          storage.type: filesystem
          threaded: on
          processors:
            logs:
              - alias: lua-s3
                name: lua
                match: foo.*
                script: custom_filters.lua
                call: foo_rewrite_for_s3

There are multiple filters downstream. The first is a rewrite_tag. If I move that filter into the processor stack, like so:

    pipeline:
      inputs:
        - name: tail
          alias: foo
          path: $the_paths
          db: /var/log/fluent-bit-s3.db
          db.sync: normal
          db.locking: true
          db.journal_mode: WAL
          key: message
          tag: foo.*
          buffer_chunk_size: 2MB
          buffer_max_size: 5MB
          mem_buf_limit: 50MB
          path_key: log_file_path
          skip_long_lines: On
          offset_key: offset
          storage.type: filesystem
          threaded: on
          processors:
            logs:
              - alias: lua-s3
                name: lua
                match: foo.*
                script: custom_filters.lua
                call: foo_rewrite_for_s3
              - alias: rewrite-foo-pre-k8s
                name: rewrite_tag
                match: foo.*
                rule: $k8s_pod_name ^.*$ kube-foo.all.$k8s_namespace.$k8s_pod_name.$log_file false
                emitter_name: k8s_meta_into_tag_partial_all
                emitter_storage.type: filesystem
                emitter_mem_buf_limit: 300M

Then Fluent Bit immediately segfaults on boot:

[2023/05/10 19:04:33] [engine] caught signal (SIGSEGV)
#0  0x55d97e7c784a      in  mk_event_channel_create() at lib/monkey/mk_core/mk_event.c:175
#1  0x55d97dca5215      in  input_instance_channel_events_init() at src/flb_input.c:838
#2  0x55d97dca637e      in  flb_input_instance_init() at src/flb_input.c:1195
#3  0x55d97e1cca1e      in  emitter_create() at plugins/filter_rewrite_tag/rewrite_tag.c:75
#4  0x55d97e1cd51a      in  cb_rewrite_tag_init() at plugins/filter_rewrite_tag/rewrite_tag.c:290
#5  0x55d97dcb982f      in  flb_filter_init() at src/flb_filter.c:594
#6  0x55d97dd1f2ae      in  flb_processor_unit_init() at src/flb_processor.c:214
#7  0x55d97dd1f3b2      in  flb_processor_init() at src/flb_processor.c:249
#8  0x55d97dd1fe89      in  flb_processors_load_from_config_format_group() at src/flb_processor.c:619
#9  0x55d97dccfbee      in  configure_plugins_type() at src/flb_config.c:829
#10 0x55d97dccff97      in  flb_config_load_config_format() at src/flb_config.c:909
#11 0x55d97dbf2e8e      in  service_configure() at src/fluent-bit.c:712
#12 0x55d97dbf3852      in  flb_main() at src/fluent-bit.c:1032
#13 0x55d97dbf3a8b      in  main() at src/fluent-bit.c:1131
#14 0x7ff68bfd2d09      in  ???() at ???:0
#15 0x55d97dbedb19      in  ???() at ???:0
#16 0xffffffffffffffff  in  ???() at ???:0

@leonardo-albertovich
Collaborator

Yes, that's an issue we're actively working on and expect to fix within the week.

@leonardo-albertovich
Collaborator

Quick update: We have already solved the initialization issue and are in the process of improving how the rewrite_tag filter operates in the context of processor stacks.

@ganga1980

Hi @leonardo-albertovich, with version 2.0.9 I am also seeing a ton of "could not enqueue records into the ring buffer" errors, and Fluent Bit doesn't recover from them. Is there any workaround? Is the bottleneck here the record_modifier filter plugin? Can you please advise?

Here is the config

https://github.com/microsoft/Docker-Provider/blob/ci_prod/build/linux/installer/conf/fluent-bit-geneva.conf
https://github.com/microsoft/Docker-Provider/blob/ci_prod/build/linux/installer/conf/fluent-bit-geneva-logs_tenant.conf

Here are the full logs:

[2023/07/14 14:18:48] [ info] [fluent bit] version=2.0.9, commit=, pid=174
[2023/07/14 14:18:48] [ info] [storage] ver=1.4.0, type=memory+filesystem, sync=normal, checksum=off, max_chunks_up=128
[2023/07/14 14:18:48] [ info] [storage] backlog input plugin: storage_backlog.8
[2023/07/14 14:18:48] [ info] [cmetrics] version=0.5.8
[2023/07/14 14:18:48] [ info] [ctraces ] version=0.2.7
[2023/07/14 14:18:48] [ info] [input:tail:ama-logs_stdout_tail] initializing
[2023/07/14 14:18:48] [ info] [input:tail:ama-logs_stdout_tail] storage_strategy='memory' (memory only)
[2023/07/14 14:18:48] [ info] [input:tail:ama-logs_mdsd_err_tail] initializing
[2023/07/14 14:18:48] [ info] [input:tail:ama-logs_mdsd_err_tail] storage_strategy='memory' (memory only)
[2023/07/14 14:18:48] [ info] [input:tail:ama-logs_termination_log_tail] initializing
[2023/07/14 14:18:48] [ info] [input:tail:ama-logs_termination_log_tail] storage_strategy='memory' (memory only)
[2023/07/14 14:18:48] [ info] [input:tcp:telegraf_tcp] initializing
[2023/07/14 14:18:48] [ info] [input:tcp:telegraf_tcp] storage_strategy='memory' (memory only)
[2023/07/14 14:18:48] [ info] [input:tail:ama-logs_mdsd_qos_tail] initializing
[2023/07/14 14:18:48] [ info] [input:tail:ama-logs_mdsd_qos_tail] storage_strategy='memory' (memory only)
[2023/07/14 14:18:48] [ info] [input:tail:infra_fake-ns_tail] initializing
[2023/07/14 14:18:48] [ info] [input:tail:infra_fake-ns_tail] storage_strategy='memory' (memory only)
[2023/07/14 14:18:48] [ info] [input:tail:infra_fake-ns_tail] thread instance initialized
[2023/07/14 14:18:48] [ info] [input:tail:tenant_metricsrp_tail] initializing
[2023/07/14 14:18:48] [ info] [input:tail:tenant_metricsrp_tail] storage_strategy='memory' (memory only)
[2023/07/14 14:18:48] [ info] [input:tail:tenant_metricsrp_tail] thread instance initialized
[2023/07/14 14:18:48] [ info] [input:fluentbit_metrics:fluentbit_metrics_input] initializing
[2023/07/14 14:18:48] [ info] [input:fluentbit_metrics:fluentbit_metrics_input] storage_strategy='memory' (memory only)
[2023/07/14 14:18:48] [ info] [input:storage_backlog:storage_backlog.8] initializing
[2023/07/14 14:18:48] [ info] [input:storage_backlog:storage_backlog.8] storage_strategy='memory' (memory only)
[2023/07/14 14:18:48] [ info] [input:storage_backlog:storage_backlog.8] queue memory limit: 9.5M
[2023/07/14 14:18:48] [ info] [output:forward:tenant_metricsrp_forward] worker #0 started
[2023/07/14 14:18:48] [ info] [output:forward:tenant_metricsrp_forward] worker #1 started
[2023/07/14 14:18:48] [ info] [output:forward:tenant_metricsrp_forward] worker #2 started
[2023/07/14 14:18:48] [ info] [output:forward:tenant_metricsrp_forward] worker #3 started
[2023/07/14 14:18:48] [ info] [output:forward:tenant_metricsrp_forward] worker #7 started
[2023/07/14 14:18:48] [ info] [output:forward:tenant_metricsrp_forward] worker #5 started
[2023/07/14 14:18:48] [ info] [output:forward:tenant_metricsrp_forward] worker #6 started
[2023/07/14 14:18:48] [ info] [output:forward:tenant_metricsrp_forward] worker #4 started
[2023/07/14 14:18:48] [ info] [output:forward:tenant_metricsrp_forward] worker #8 started
[2023/07/14 14:18:48] [ info] [output:prometheus_exporter:fluentbit_metrics_output] listening iface=0.0.0.0 tcp_port=9102
[2023/07/14 14:18:48] [ info] [sp] stream processor started
[2023/07/14 14:18:48] [ info] [input:tail:tenant_metricsrp_tail] inotify_fs_add(): inode=4966719 watch_fd=1 name=/var/log/containers/geneva-services-vhw24_metricsrp_mdm-1dc92a1b8d4471da4a8355b14001fb369c0866a1139d1ae14c690382730a4b74.log
[2023/07/14 14:18:48] [ info] [input:tail:ama-logs_mdsd_err_tail] inotify_fs_add(): inode=8526629 watch_fd=1 name=/var/opt/microsoft/linuxmonagent/log/mdsd.err
[2023/07/14 14:18:48] [ info] [input:tail:ama-logs_mdsd_qos_tail] inotify_fs_add(): inode=8526630 watch_fd=1 name=/var/opt/microsoft/linuxmonagent/log/mdsd.qos
[2023/07/14 14:18:48] [ info] [output:forward:tenant_metricsrp_forward] worker #9 started
[2023/07/14 14:18:48] [ info] [input:tail:ama-logs_stdout_tail] inotify_fs_add(): inode=4902961 watch_fd=1 name=/var/log/containers/ama-logs-f24bp_kube-system_ama-logs-6e8f65fe11a0a83ba6d740b74e94c402f5fd24b4a8a1999e0c65faa87551d9bf.log
[2023/07/14 14:18:48] [ info] [input:tail:ama-logs_stdout_tail] inotify_fs_add(): inode=4903554 watch_fd=2 name=/var/log/containers/ama-logs-f24bp_kube-system_ama-logs-prometheus-e4a8475a8215873c1434ae7e1ca58d935847faf0ee9468f6671a06655a18005a.log
[2023/07/14 14:19:48] [ info] [input:tail:ama-logs_termination_log_tail] inotify_fs_add(): inode=211 watch_fd=1 name=/dev/write-to-traces
[2023/07/14 19:03:09] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:03:10] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:03:39] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:03:50] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:04:09] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:04:39] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:04:50] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:05:09] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:05:39] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:05:50] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:06:09] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:06:39] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:06:50] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:07:02] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:07:09] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:07:39] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:07:50] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:08:00] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:08:01] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:08:09] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:08:10] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:08:39] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:08:50] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:09:09] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:09:39] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:09:50] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer
[2023/07/14 19:10:09] [error] [input:tail:tenant_metricsrp_tail] could not enqueue records into the ring buffer

@danlenar
Contributor

danlenar commented Aug 9, 2023

The issue is how resume and pause are handled by input threads.

flb_input_pause sends a signal to the input thread's event loop:
https://github.com/fluent/fluent-bit/blob/v2.1.8/src/flb_input.c#L1673

flb_input_resume is executed by the main thread, which causes a race condition:
https://github.com/fluent/fluent-bit/blob/v2.1.8/src/flb_input.c#L1695

The proper fix is to have flb_input_resume also send a signal to the input thread, so pause and resume don't happen out of order or stomp on each other.

I will have a PR ready soon. I just want the fix to run in my environment for a couple of hours to see if the race condition is truly fixed.
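
As a rough, self-contained illustration of that idea (only a sketch, not Fluent Bit code or the eventual PR): the controlling thread never flips the worker's paused flag directly; it writes a command byte to a pipe, and the worker's own loop applies pause and resume in the order they arrive.

/* Standalone sketch of serializing pause/resume through the worker's own
 * event loop instead of mutating its state from another thread. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

enum { CMD_PAUSE = 'P', CMD_RESUME = 'R', CMD_STOP = 'S' };

static int cmd_pipe[2];

static void *input_thread(void *arg)
{
    char cmd;
    int paused = 0;

    (void) arg;

    /* The worker owns `paused`; it only changes when a command is read. */
    while (read(cmd_pipe[0], &cmd, 1) == 1) {
        if (cmd == CMD_PAUSE) {
            paused = 1;
            printf("input thread: paused\n");
        }
        else if (cmd == CMD_RESUME) {
            paused = 0;
            printf("input thread: resumed\n");
        }
        else if (cmd == CMD_STOP) {
            break;
        }
    }

    (void) paused;
    return NULL;
}

int main(void)
{
    pthread_t tid;
    char cmds[] = { CMD_PAUSE, CMD_RESUME, CMD_STOP };

    if (pipe(cmd_pipe) != 0) {
        return 1;
    }
    pthread_create(&tid, NULL, input_thread, NULL);

    /* The main thread only sends commands, so a pause followed by a resume
     * cannot be observed out of order by the worker. */
    write(cmd_pipe[1], &cmds[0], 1);
    write(cmd_pipe[1], &cmds[1], 1);
    write(cmd_pipe[1], &cmds[2], 1);

    pthread_join(tid, NULL);
    return 0;
}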
