Malformed HTTP response from splunk and cannot increase buffer on fluentBit v1.8.9 #4723

Closed
hegderohit89 opened this issue Jan 31, 2022 · 8 comments


@hegderohit89

Bug Report

Describe the bug
We consistently observe this issue when forwarding logs to Splunk.
Running Fluent Bit v1.8.9 (also observed on 1.8.4).

The common pattern is a "malformed HTTP response" warning from Splunk, followed by "cannot increase buffer" warnings, just before the process receives SIGSEGV:

[2021/11/17 23:19:28] [ warn] [engine] failed to flush chunk '9-1637191162.481275663.flb', retry in 10 seconds: task_id=0, input=tail.0 > output=splunk.0 (out_id=0)
[2021/11/17 23:19:32] [ warn] [http_client] malformed HTTP response from splunk-sh01:8088 on connection #168
[2021/11/17 23:19:32] [ warn] [output:splunk:splunk.0] http_do=-1
[2021/11/17 23:19:32] [ warn] [http_client] cannot increase buffer: current=2000000 requested=2032768 max=2000000
[2021/11/17 23:19:32] [ warn] [http_client] cannot increase buffer: current=2000000 requested=2032768 max=2000000
[2021/11/17 23:19:32] [ warn] [http_client] cannot increase buffer: current=2000000 requested=2032768 max=2000000
[2021/11/17 23:19:32] [ warn] [engine] failed to flush chunk '9-1637191167.486176165.flb', retry in 8 seconds: task_id=1, input=tail.0 > output=splunk.0 (out_id=0)
[2021/11/17 23:19:32] [engine] caught signal (SIGSEGV)
#0  0x7f7bc0ac3aec      in  ???() at ???:0
#1  0x7f7bc0acdbed      in  ???() at ???:0
#2  0x7f7bc0e84a18      in  ???() at ???:0
#3  0x7f7bc0e85c56      in  ???() at ???:0
#4  0x7f7bc0e82fa0      in  ???() at ???:0
#5  0x55a5d06becdb      in  tls_session_destroy() at src/tls/openssl.c:338
#6  0x55a5d06bf84c      in  flb_tls_session_destroy() at src/tls/flb_tls.c:394
#7  0x55a5d06ae6dc      in  destroy_conn() at src/flb_upstream.c:425
#8  0x55a5d06af3c6      in  flb_upstream_conn_pending_destroy() at src/flb_upstream.c:815
#9  0x55a5d06af51b      in  flb_upstream_conn_pending_destroy_list() at src/flb_upstream.c:865
#10 0x55a5d06a8711      in  flb_engine_start() at src/flb_engine.c:717
#11 0x55a5d068cf1d      in  flb_lib_worker() at src/flb_lib.c:628
#12 0x7f7bc1149608      in  ???() at ???:0
#13 0x7f7bc0364292      in  ???() at ???:0
#14 0xffffffffffffffff  in  ???() at ???:0
/docker-entrypoint.sh: line 19:     9 Aborted                 (core dumped) /fluen

To Reproduce
Set up Fluent Bit with the Splunk output plugin.
Send traffic to the applications so that Fluent Bit starts forwarding logs.
Within a few hours, the Fluent Bit pods start restarting with the error above.

Your Environment

  • Version used: 1.8.9
  • Configuration:
[SERVICE]
    Flush         5
    Config_Watch  On
    Log_Level     info
    Daemon        Off
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_PORT     2020
    Parsers_File  parsers.conf

@INCLUDE input-kubernetes.conf
@INCLUDE filter-kubernetes.conf
@INCLUDE config/*

[INPUT]
    Name               tail
    Tag                kube.*
    Path               /var/log/containers/*.log
    Parser             docker
    DB                 ${DB_FILE_PATH}
    Mem_Buf_Limit      5MB
    Skip_Long_Lines    On
    Refresh_Interval   10
    Rotate_Wait        10 
    Docker_Mode        On
    Docker_Mode_Flush  5
    Docker_Mode_Parser test

[FILTER]
    Name                kubernetes
    Match               kube.*
    Kube_URL            ${KUBERNETES_SVC_URL}
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Kube_Tag_Prefix     kube.var.log.containers.
    K8S-Logging.Parser  On
    K8S-Logging.Exclude On
    Annotations         Off
    Merge_Log           On
    Labels              On

Environment name and version : Kubernetes v13
Server type and version: Splunk
Operating System and version: RHEL

Additional context

  1. This is causing a lot of issues for our customers, who use our log forwarding configuration (which uses Fluent Bit internally) to forward production software logs to Splunk.
  2. We are looking for guidance on how to debug this issue.

A few ideas we are trying:

  1. Enabling core dumps for the Docker container (since the error trace mentions /docker-entrypoint.sh: line 19: 9 Aborted (core dumped) /fluen).
  2. Asking customers to run over a plain HTTP connection and capture a TCP dump so we can analyze the request and response headers.
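
For the plain-HTTP capture in idea 2, the Splunk output could be switched to unencrypted HTTP along these lines. This is a minimal sketch, not the configuration from this report (the [OUTPUT] section is not shown above): the host and port are taken from the warning lines, the Match pattern from the tail input's tag, and SPLUNK_TOKEN is a hypothetical environment variable. It also assumes the HEC endpoint on the Splunk side is set to accept non-SSL connections.

```ini
[OUTPUT]
    # Sketch only; the real [OUTPUT] section is not part of this report.
    Name          splunk
    Match         kube.*
    # Host and port as they appear in the warning lines above.
    Host          splunk-sh01
    Port          8088
    # SPLUNK_TOKEN is a hypothetical environment variable holding the HEC token.
    Splunk_Token  ${SPLUNK_TOKEN}
    # Plain HTTP so a packet capture shows the raw request and response.
    TLS           Off
```

With TLS off, the HEC token and the log payload cross the network unencrypted, so this is only suitable for a short debugging window.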

@edsiper It would be really helpful if you could shed some light on this or give some pointers on how to find the root cause.
Is there a way to enable debugging at the network layer, so we can see what is causing the [2021/11/17 23:19:32] [ warn] [http_client] malformed HTTP response from splunk-sh01:8088 on connection #168 warning?

@hegderohit89
Author

We saw this issue (or a similar one) even when we were running Fluent Bit v1.8.3:
#4166 (comment)
We upgraded to v1.8.9, but the problem persists.

@hegderohit89
Author

I got a doc link from @agup006 on this https://docs.fluentbit.io/manual/pipeline/outputs/splunk
I will be heading in this direction to debug this issue. Thanks
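
The Splunk output page linked above lists two options that look relevant to these warnings: http_buffer_size, the buffer used to receive Splunk HTTP responses, and http_debug_bad_request, which prints the full request and response when Splunk answers with HTTP 400. A minimal sketch, assuming both options are available in the release being run (not verified here for 1.8.9). The host, port, and Match values come from the report, SPLUNK_TOKEN is a hypothetical environment variable, and the 4M value is only an illustrative choice:

```ini
[OUTPUT]
    # Sketch only; the real [OUTPUT] section is not part of this report.
    Name                    splunk
    Match                   kube.*
    Host                    splunk-sh01
    Port                    8088
    # SPLUNK_TOKEN is a hypothetical environment variable holding the HEC token.
    Splunk_Token            ${SPLUNK_TOKEN}
    TLS                     On
    # Assumed to be available in the running release; 4M is an illustrative value.
    http_buffer_size        4M
    # Assumed to be available; dumps the request and response on HTTP 400.
    http_debug_bad_request  On
```

A larger buffer only changes when the "cannot increase buffer" warning appears. It is unlikely, by itself, to address a crash in the TLS session teardown path shown in the stack trace.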

@pmula-onbe

We are running v1.8.11, shipping logs to Loki, and have the same issue. It does not happen all the time, but I have been seeing this error lately in one of my fluent-bit pods.

@nokute78
Collaborator

Could you share logs captured with Log_Level debug? This may be related to #4098.
Without the patch (#4584), tls_net_read in the OpenSSL backend can return a positive value even when an error occurs.
That means flb_io_net_read can also return a positive value even when an OpenSSL error occurs.
The call chain is flb_io_net_read -> flb_tls_net_read/_async -> tls_net_read (OpenSSL).

https://github.com/fluent/fluent-bit/blob/v1.8.9/src/flb_http_client.c#L1206-L1223

The patch is included as of v1.8.12:
https://fluentbit.io/announcements/v1.8.12/

tls: openssl: fix error handling for OpenSSL apis (#4584)
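
The debug logs requested above can be produced by raising the log level in the [SERVICE] section. A minimal sketch based on the configuration in the report, with only Log_Level changed. Debug output is considerably more verbose, so it is best enabled only while reproducing the problem:

```ini
[SERVICE]
    Flush         5
    Config_Watch  On
    # Raised from info to capture debug-level messages from Fluent Bit's components.
    Log_Level     debug
    Daemon        Off
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_PORT     2020
    Parsers_File  parsers.conf
```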

@hegderohit89
Author

Thanks @nokute78. Yes, once I have that info I will share it here. I am also bumping the Fluent Bit version to 1.8.12 in the hope that it fixes this issue.

@thunder-spb

@nokute78 Hey, did the new version fix this issue?

@hegderohit89
Author

@thunder-spb Yes, bumping to the new version (1.8.12) fixed this issue for us.

@thunder-spb

Cool, thank you @hegderohit89. I'm having the same issue with the malformed HTTP response and the buffer increase warnings. I will give it a shot.

4 participants