
Fluentbit S3 output causes prometheus metrics endpoint to be briefly unavailable. #4165

Closed
bharatnc opened this issue Oct 5, 2021 · 15 comments


bharatnc commented Oct 5, 2021

Bug Report

Describe the bug

The Prometheus metrics endpoint /api/v1/metrics/prometheus is not available immediately when the S3 output is configured; it takes a long time for the interface to become available.

To Reproduce

Add an output to S3:

Note: For credentials, the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables are used. These are exported and present in the systemd environment when running Fluent Bit.

[SERVICE]
    flush        10
    daemon       Off
    log_level    info
    parsers_file /etc/flb/parsers.conf
    plugins_file /etc/flb/plugins.conf
    http_server  On
    http_listen  0.0.0.0
    http_port    3281
    storage.metrics On
    storage.path /fluentbit-buffer
    storage.sync normal
    storage.checksum off
    storage.backlog.mem_limit  8M
    storage.max_chunks_up  128

[INPUT]
    Name  tail
    Alias  test_tail
    Path  /var/log/test.log
    Read_from_Head  Off
    Path_Key  filename
    Tag  syslog
    Key  event
    exit_on_eof  false
    Rotate_Wait  5
    Refresh_Interval  60
    Skip_Long_Lines  Off
    DB.sync  normal
    DB.locking  false
    Buffer_Chunk_Size  32k
    Buffer_Max_Size  8M
    Multiline  Off
    Multiline_Flush  4
    Parser_Firstline  8192
    Docker_Mode  Off
    Docker_Mode_Flush  4

[FILTER]
    Name record_modifier
    Match *
    Record hostname ${HOSTNAME}

[OUTPUT]
    Name  s3
    Match  syslog
    endpoint  https://storage.googleapis.com
    bucket  fluentbit-test
    use_put_object true
    content_type  application/gzip
    compression gzip
    store_dir  /fluentbit/s3
    upload_timeout 1m
    region  us-west2
    total_file_size  1M
    s3_key_format  /99com25-test-k8s-s3/$UUID.gz
    s3_key_format_tag_delimiters .-

curl the metrics endpoint:

curl  http://localhost:3281/api/v1/metrics/prometheus
curl: (7) Failed to connect to localhost port 3281: Connection refused

Expected behavior

I should be able to curl the endpoint immediately without a connection refused error. Curl starts working only after a long time; the delay varies between test attempts, generally 5 to 10 minutes, and happens to coincide with the first successful upload after buffering 1M of data (per the settings above).

Screenshots
NA

Your Environment

  • Version used: 1.8.3
  • Configuration: On bare metal using systemd unit
  • Environment name and version: Debian stretch
  • Operating System and version: Debian stretch
  • Filters and plugins: record_modifier, s3 output, tail input

Additional context

This causes a delay in metrics reporting, since the metrics interface is not available for Prometheus scrapes. I am observing a varying amount of time before the interface becomes accessible to curl, and I'm not sure what's going on or whether I am missing some settings.
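For what it's worth, the window of unavailability can be measured from the outside with a small probe. This is a generic sketch, not Fluent Bit code; `wait_for_port` and its parameters are made up for illustration:

```python
import socket
import time

def wait_for_port(host, port, timeout=600.0, interval=0.5):
    """Poll until a TCP connect to (host, port) succeeds.

    Returns the number of seconds waited, or None if the port never
    accepted a connection within `timeout` seconds.
    """
    start = time.monotonic()
    deadline = start + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the HTTP server is listening.
            with socket.create_connection((host, port), timeout=1.0):
                return time.monotonic() - start
        except OSError:
            time.sleep(interval)  # connection refused; retry shortly
    return None
```

Calling something like `wait_for_port("localhost", 3281)` right after starting the service would quantify the 5 to 10 minute delay described above.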

@bharatnc
Author

bharatnc commented Oct 6, 2021

cc: @PettitWesley

@PettitWesley
Contributor

@bharatnc if possible can you do two things to help me:

  1. Open an issue linking this one at the AWS repo for tracking purposes: https://github.com/aws/aws-for-fluent-bit
  2. Can you repro it with a simple test case like the following:
[SERVICE]
    flush        10
    daemon       Off
    log_level    info
    http_server  On
    http_listen  0.0.0.0
    http_port    3281

[INPUT]
    Name dummy
    Tag dummy

[OUTPUT]
    Name s3
    Match *
    bucket your-bucket
    region us-east-1
    total_file_size 1M
    upload_timeout 5m
    use_put_object On

@PettitWesley
Contributor

endpoint  https://storage.googleapis.com

Also wait, you can use my S3 output to send to Google??

@JeffLuoo @qingling128 Curious if this is something you folks recommend? 😬

@qingling128
Collaborator

That's news to me. Does it actually work? If so, I guess that means the s3 output plugin is implemented in a generic way? I do see a related feature request here: #1032


bharatnc commented Oct 6, 2021

That's news to me. Does it actually work? If so, I guess that means the s3 output plugin is implemented in a generic way? I do see a related feature request here: #1032

Yes, it works quite well (though I haven't tested it thoroughly). I guess the s3 output plugin is generic enough that all I had to do was generate HMAC keys on GCS and use them as the access key and access secret.


bharatnc commented Oct 6, 2021

[INPUT]
Name dummy
Tag dummy

Thank you @PettitWesley.

Re 1: Filed a new issue under https://github.com/aws/aws-for-fluent-bit
Re 2: Yes, I am able to reproduce the issue using the simple dummy input.

@PettitWesley
Contributor

s3 output plugin is generic enough that all I had to do was to generate HMAC keys on GCS and use it for access-key and access-secret.

Interesting... this doesn't really make sense to me... does GCS have an option to not use auth?

Because AWS has its own auth algorithm called SigV4 (for which Eduardo and I had to write a custom module in Fluent Bit). No one else uses that; I think Google uses OAuth 2. If you take someone else's secret and put it into the AWS SigV4 algorithm... the output shouldn't be something GCS would accept if it's checking auth headers...
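For readers unfamiliar with SigV4, the signing-key derivation referred to here is a chain of HMAC-SHA256 operations over the secret, date, region, and service. A minimal stdlib sketch of just that derivation step (this is not Fluent Bit's implementation, which lives in the custom module mentioned above):

```python
import hashlib
import hmac

def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the AWS SigV4 signing key via chained HMAC-SHA256.

    date is in YYYYMMDD form, e.g. "20211005"; the final key is then
    used to sign the request's string-to-sign.
    """
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")
```

The reason the GCS trick later in this thread works is that the server side must implement the same derivation to verify the signature; a service that accepts SigV4-signed requests with HMAC credentials can therefore interoperate with any SigV4 client.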

@bharatnc
Author

bharatnc commented Oct 7, 2021

s3 output plugin is generic enough that all I had to do was to generate HMAC keys on GCS and use it for access-key and access-secret.

Interesting... this doesn't really make sense to me... does GCS have an option to not use auth?

Because AWS has its own auth algorithm called SigV4 (for which Eduardo and I had to write a custom module in Fluent Bit). No one else uses that; I think Google uses OAuth 2. If you take someone else's secret and put it into the AWS SigV4 algorithm... the output shouldn't be something GCS would accept if it's checking auth headers...

I don't know much about the auth algorithm. But this doc: https://cloud.google.com/storage/docs/migrating#migration-simple gives a generic example of using the AWS Go SDK with HMAC keys (for interoperability). I tried the same thing with Fluent Bit and the AWS S3 output plugin, assuming it was written in an S3-generic way. Guess what? It was: everything started working with just the settings above, with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set to the Google HMAC key/value pair in the environment.

@PettitWesley
Contributor

@bharatnc interesting! I didn't know that GCP had built compatibility with S3 and with AWS auth into GCS.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Nov 10, 2021
@leonardo-albertovich
Collaborator

This functionality is provided by the embedded HTTP server implemented in src/http_server/flb_hs.c. That component is configured and initialized by flb_engine after the in/out plugins are initialized, but before the event loop starts iterating events.

The HTTP server is implemented through Monkey, which in this case means it runs in its own thread. So when flb_hs_start is called in flb_engine, it should start listening and accepting connections almost immediately, regardless of what the calling thread is up to.

What I'm wondering is whether the S3 plugin could be causing this in cb_s3_init (which is executed through flb_output_init_all, before the HTTP server is initialized) when it synchronously issues the credentials request at line 840. I'm not familiar enough with this part of the code, so I could be wrong. Would you be able to shed some light on it @PettitWesley?
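The hypothesis above (a blocking call in plugin init delaying the listener) can be illustrated with a toy model. Everything here is a stand-in: the function names only echo cb_s3_init / flb_output_init_all / flb_hs_start, and none of this is Fluent Bit code:

```python
import socket
import threading
import time

INIT_DELAY = 0.5  # stand-in for a slow synchronous credentials request

def fetch_credentials_blocking():
    # In the hypothesis, cb_s3_init blocks on an HTTP round trip here.
    time.sleep(INIT_DELAY)

def engine_start(server_sock):
    fetch_credentials_blocking()  # plugin init (like flb_output_init_all)
    server_sock.listen(1)         # HTTP server start (like flb_hs_start)

server = socket.socket()
server.bind(("127.0.0.1", 0))
port = server.getsockname()[1]
threading.Thread(target=engine_start, args=(server,), daemon=True).start()

# A scrape during plugin init is refused, matching the reported symptom.
refused_during_init = False
try:
    socket.create_connection(("127.0.0.1", port), timeout=0.2).close()
except OSError:
    refused_during_init = True

# Once init completes and listen() runs, the same connect succeeds.
time.sleep(INIT_DELAY + 0.2)
socket.create_connection(("127.0.0.1", port), timeout=1.0).close()
available_after_init = True
```

Since the listener only comes up after every plugin's init returns, any slow synchronous work in an output plugin's init directly extends the window where the metrics endpoint refuses connections.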

@PettitWesley
Contributor

@leonardo-albertovich Yea we do try to fetch credentials in cb_s3_init which can lead to synchronous http requests being made. The credential code is here: https://github.com/fluent/fluent-bit/tree/master/src/aws

But all our http requests are made using this: https://github.com/fluent/fluent-bit/blob/master/src/aws/flb_aws_util.c#L151

What are you looking for? How could this code cause the issue?

@leonardo-albertovich
Collaborator

I was trying to make sense of the symptom which is "the http server takes longer than expected to start when the s3 output plugin is enabled".

That's why I mentioned the initialization order. I wouldn't expect the credentials request to take very long; it wouldn't make sense for the process to be convoluted or for those services to be slow to answer. But it is one possible cause, considering that blocking connections are used there.

I think the safest way to determine it is to print timestamps at key lines in flb_engine and then dig deeper.
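A sketch of the kind of timestamp instrumentation suggested here, in Python rather than C for brevity; the phase labels are hypothetical:

```python
import time

class PhaseTimer:
    """Record elapsed time at key lines, the way one might instrument flb_engine."""

    def __init__(self):
        self.t0 = time.monotonic()
        self.marks = {}

    def mark(self, label):
        # Store and print seconds elapsed since startup for this phase.
        self.marks[label] = time.monotonic() - self.t0
        print(f"[{self.marks[label]:8.3f}s] {label}")

timer = PhaseTimer()
timer.mark("engine start")
timer.mark("outputs initialized")    # e.g. after flb_output_init_all returns
timer.mark("http server listening")  # e.g. after flb_hs_start returns
```

Comparing the gap between "engine start" and "http server listening" with and without the S3 output would confirm or rule out the blocking-init theory.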

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Feb 25, 2022
@github-actions
Contributor

github-actions bot commented Mar 2, 2022

This issue was closed because it has been stalled for 5 days with no activity.
