Malformed HTTP response from splunk and cannot increase buffer on fluentBit v1.8.9 #4723

hegderohit89 · 2022-01-31T19:14:15Z

Bug Report

Describe the bug
We consistently observe this issue when forwarding logs to Splunk.
We are running fluentBit v1.8.9 (the same behavior was observed on 1.8.4 as well).

The common pattern is a "malformed HTTP response" warning for the Splunk host together with "cannot increase buffer" warnings, shortly before the process receives SIGSEGV:

[2021/11/17 23:19:28] [ warn] [engine] failed to flush chunk '9-1637191162.481275663.flb', retry in 10 seconds: task_id=0, input=tail.0 > output=splunk.0 (out_id=0)
[2021/11/17 23:19:32] [ warn] [http_client] malformed HTTP response from splunk-sh01:8088 on connection #168
[2021/11/17 23:19:32] [ warn] [output:splunk:splunk.0] http_do=-1
[2021/11/17 23:19:32] [ warn] [http_client] cannot increase buffer: current=2000000 requested=2032768 max=2000000
[2021/11/17 23:19:32] [ warn] [http_client] cannot increase buffer: current=2000000 requested=2032768 max=2000000
[2021/11/17 23:19:32] [ warn] [http_client] cannot increase buffer: current=2000000 requested=2032768 max=2000000
[2021/11/17 23:19:32] [ warn] [engine] failed to flush chunk '9-1637191167.486176165.flb', retry in 8 seconds: task_id=1, input=tail.0 > output=splunk.0 (out_id=0)
[2021/11/17 23:19:32] [engine] caught signal (SIGSEGV)
#0  0x7f7bc0ac3aec      in  ???() at ???:0
#1  0x7f7bc0acdbed      in  ???() at ???:0
#2  0x7f7bc0e84a18      in  ???() at ???:0
#3  0x7f7bc0e85c56      in  ???() at ???:0
#4  0x7f7bc0e82fa0      in  ???() at ???:0
#5  0x55a5d06becdb      in  tls_session_destroy() at src/tls/openssl.c:338
#6  0x55a5d06bf84c      in  flb_tls_session_destroy() at src/tls/flb_tls.c:394
#7  0x55a5d06ae6dc      in  destroy_conn() at src/flb_upstream.c:425
#8  0x55a5d06af3c6      in  flb_upstream_conn_pending_destroy() at src/flb_upstream.c:815
#9  0x55a5d06af51b      in  flb_upstream_conn_pending_destroy_list() at src/flb_upstream.c:865
#10 0x55a5d06a8711      in  flb_engine_start() at src/flb_engine.c:717
#11 0x55a5d068cf1d      in  flb_lib_worker() at src/flb_lib.c:628
#12 0x7f7bc1149608      in  ???() at ???:0
#13 0x7f7bc0364292      in  ???() at ???:0
#14 0xffffffffffffffff  in  ???() at ???:0
/docker-entrypoint.sh: line 19:     9 Aborted                 (core dumped) /fluen

To Reproduce
Set up fluentBit with the Splunk output plugin.
Send some traffic to the applications so that fluentBit starts forwarding logs.
Within a few hours, the fluentBit pods start restarting with the above issue.

Your Environment

Version used: 1.8.9
Configuration:

[SERVICE]
    Flush         5
    Config_Watch  On
    Log_Level     info
    Daemon        Off
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_PORT     2020
    Parsers_File  parsers.conf

@INCLUDE input-kubernetes.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE config/*

[INPUT]
    Name               tail
    Tag                kube.*
    Path               /var/log/containers/*.log
    Parser             docker
    DB                 ${DB_FILE_PATH}
    Mem_Buf_Limit      5MB
    Skip_Long_Lines    On
    Refresh_Interval   10
    Rotate_Wait        10 
    Docker_Mode        On
    Docker_Mode_Flush  5
    Docker_Mode_Parser test

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            ${KUBERNETES_SVC_URL}
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_Tag_Prefix     kube.var.log.containers.
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
    Annotations         Off
    Merge_Log           On
    Labels              On

Environment name and version : Kubernetes v13
Server type and version: Splunk
Operating System and version: RHEL

Additional context

This is causing a lot of issues for our customers, who use our log forwarding configuration (which uses fluentBit internally) to forward production software logs to Splunk.
We are looking for guidance on how to go about debugging this issue.

A few ideas we are trying:

Enabling core dumps for the docker container (since the error trace mentions /docker-entrypoint.sh: line 19: 9 Aborted (core dumped) /fluen).
Asking the customers to run over a plain HTTP connection and capture a TCP dump for further analysis of the request and response headers.
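
For the plain HTTP idea, a minimal sketch of a Splunk output section with TLS disabled might look like the following. The host and port are taken from the log lines above. The Match pattern and the SPLUNK_HEC_TOKEN environment variable are assumptions for illustration, since the actual output section is not shown in this report.

```
# Hypothetical output section for a temporary plain-HTTP capture.
# Host and Port come from the warning in the log above.
[OUTPUT]
    Name          splunk
    Match         *
    Host          splunk-sh01
    Port          8088
    # Assumed environment variable holding the HEC token.
    Splunk_Token  ${SPLUNK_HEC_TOKEN}
    # Disable TLS so the request and response are readable in a TCP dump.
    TLS           Off
```

This presumably also requires the Splunk HTTP Event Collector on that port to accept non-SSL connections, and it is intended only for the duration of the capture.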

@edsiper It would be really helpful if you could shed some light on this or give some pointers on how we can continue and find the root cause.
Is there a way to enable debugging at the network layer, so we can see what is causing the [2021/11/17 23:19:32] [ warn] [http_client] malformed HTTP response from splunk-sh01:8088 on connection #168 warning?


hegderohit89 · 2022-01-31T19:17:42Z

We saw this issue (or a similar one) even when we were running fBit v1.8.3:
#4166 (comment)
We upgraded to v1.8.9, but still no luck.

hegderohit89 · 2022-01-31T19:26:31Z

I got a doc link from @agup006 on this: https://docs.fluentbit.io/manual/pipeline/outputs/splunk
I will be heading in this direction to debug this issue. Thanks

pmula-onbe · 2022-01-31T20:21:52Z

We are on v1.8.11, shipping logs to Loki, and are having the same issue. It does not happen all the time, but I have been seeing that error lately in one of my fluent-bit pods.

nokute78 · 2022-02-12T05:38:45Z

Could you share Log_Level debug logs? It may be related to #4098.
Without the patch (#4584), tls_net_read of openssl can return a positive value even when an error occurs.
This means flb_io_net_read can also return a positive value even if an openssl error occurs, because the call chain is:
flb_io_net_read -> flb_tls_net_read/_async -> tls_net_read of openssl.

https://github.com/fluent/fluent-bit/blob/v1.8.9/src/flb_http_client.c#L1206-L1223

The patch has been included since v1.8.12:
https://fluentbit.io/announcements/v1.8.12/

tls: openssl: fix error handling for OpenSSL apis (#4584)
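
The debug logs requested above can be produced by raising Log_Level in the [SERVICE] section of the configuration posted in the report. A minimal sketch, assuming every other key in that section stays as it is:

```
[SERVICE]
    Flush         5
    # Changed from "info" to get debug-level output.
    Log_Level     debug
```

Debug output is much more verbose, so it is probably best enabled only on an affected pod while reproducing the issue.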

hegderohit89 · 2022-02-21T17:21:31Z

Thanks @nokute78, yes, once I have that info I will share it here. I am also bumping the fBit version to 1.8.12 in the hope that this issue is fixed.

thunder-spb · 2022-03-22T15:00:56Z

@nokute78 Hey, did the new version fix this issue?

hegderohit89 · 2022-03-22T15:02:24Z

@thunder-spb Yes, bumping to the new version (1.8.12) fixed this issue for us.

thunder-spb · 2022-03-22T15:03:27Z

Cool, thank you @hegderohit89. I'm having the same issue with the malformed HTTP response and the buffer increase warnings. Will give it a shot.

hegderohit89 added the status: waiting-for-triage label Jan 31, 2022

nokute78 mentioned this issue Feb 12, 2022

malformed HTTP response from logging.googleapis.com:443 on connection #186 #4528

Closed

hegderohit89 closed this as completed Mar 22, 2022
