
[windows] upstream connection failed to logs.eu-west-2.amazonaws.com #4682

Closed

tmetodie opened this issue Jan 25, 2022 · 9 comments
Labels
Stale, status: waiting-for-triage, Windows (Bugs and requests about Windows platforms)

Comments

@tmetodie

Bug Report

Describe the bug
Brand new setup.
OS Name: Microsoft Windows Server 2019 Datacenter
OS Version: 10.0.17763 N/A Build 17763
Fluent Bit Version: 1.8.11-win64

After fluent-bit is started with the following command - .\bin\fluent-bit.exe -c .\conf\fluent-bit.conf - the following errors are seen:

[2022/01/24 15:34:22] [ info] [output:cloudwatch_logs:cloudwatch_logs.0] Creating log stream application.C.var.log.containers.deployment-56bcb45977-4x72l_default_container-fcaeb2263bf6a440a2224bc7139cc1c6db99990d1dfe21bf579d9f201b00e2ca.log in loggroup /application
[2022/01/24 15:34:22] [debug] [upstream] connection #996 failed to logs.eu-west-2.amazonaws.com:443
[2022/01/24 15:34:22] [error] [aws_client] connection initialization error
[2022/01/24 15:34:22] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to create log stream

fluent-bit.conf

[SERVICE]
Flush 5
Log_Level trace
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
storage.path /var/fluent-bit/state/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 5M

@include application-log.conf

application-log.conf

[INPUT]
Name tail
Tag application.*
Path C:\var\log\containers*.log
multiline.parser docker

[OUTPUT]
Name cloudwatch_logs
Match application.*
region eu-west-2
log_group_name /application
log_stream_prefix testing-
auto_create_group false
extra_user_agent container-insights

IAM Role permissions:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "",
"Effect": "Allow",
"Action": [
"logs:PutLogEvents",
"logs:DescribeLogStreams",
"logs:CreateLogStream",
"logs:CreateLogGroup"
],
"Resource": "arn:aws:logs:eu-west-2::*"
}
]
}

IAM Role logs:

[2022/01/25 07:45:28] [debug] [aws_credentials] Initialized Env Provider in standard chain
[2022/01/25 07:45:28] [ warn] [aws_credentials] Failed to initialize profile provider: HOME, AWS_CONFIG_FILE, and AWS_SHARED_CREDENTIALS_FILE not set.
[2022/01/25 07:45:28] [debug] [aws_credentials] Not initializing EKS provider because AWS_ROLE_ARN was not set
[2022/01/25 07:45:28] [debug] [aws_credentials] Not initializing ECS Provider because AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is not set
[2022/01/25 07:45:28] [debug] [aws_credentials] Initialized EC2 Provider in standard chain
[2022/01/25 07:45:28] [debug] [aws_credentials] Sync called on the EC2 provider
[2022/01/25 07:45:28] [debug] [aws_credentials] Init called on the env provider
[2022/01/25 07:45:28] [debug] [aws_credentials] Init called on the EC2 IMDS provider
[2022/01/25 07:45:28] [debug] [aws_credentials] requesting credentials from EC2 IMDS
[2022/01/25 07:45:28] [debug] [http_client] not using http_proxy for header
[2022/01/25 07:45:28] [debug] [http_client] server 169.254.169.254:80 will close connection #736
[2022/01/25 07:45:28] [debug] [aws_client] (null): http_do=0, HTTP Status: 401
[2022/01/25 07:45:28] [debug] [http_client] not using http_proxy for header
[2022/01/25 07:45:28] [debug] [http_client] server 169.254.169.254:80 will close connection #736
[2022/01/25 07:45:28] [debug] [imds] using IMDSv2
[2022/01/25 07:45:28] [debug] [http_client] not using http_proxy for header
[2022/01/25 07:45:28] [debug] [http_client] server 169.254.169.254:80 will close connection #736
[2022/01/25 07:45:28] [debug] [aws_credentials] Requesting credentials for instance role eks-NodeInstanceRole-XXXXXXXXX
[2022/01/25 07:45:28] [debug] [imds] using IMDSv2
[2022/01/25 07:45:28] [debug] [http_client] not using http_proxy for header
[2022/01/25 07:45:28] [debug] [http_client] server 169.254.169.254:80 will close connection #736
[2022/01/25 07:45:28] [debug] [aws_credentials] upstream_set called on the EC2 provider
[2022/01/25 07:45:28] [debug] [router] match rule tail.0:cloudwatch_logs.0
[2022/01/25 07:45:28] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/01/25 07:45:28] [ info] [sp] stream processor started
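The lookup order visible in the log above is a standard credential provider chain: environment variables, then the shared profile, then EKS/ECS, then EC2 IMDS, taking the first provider that yields credentials. A minimal sketch of that pattern (function names and the chain structure here are illustrative, not Fluent Bit's actual C API):

```python
import os

# Illustrative provider chain mirroring the order in the log above:
# env vars -> shared profile -> (EKS/ECS, omitted) -> EC2 IMDS.
# Each provider returns credentials or None; the chain takes the first hit.

def env_provider():
    key = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    return {"key": key, "secret": secret} if key and secret else None

def profile_provider():
    # Fluent Bit warns when HOME, AWS_CONFIG_FILE, and
    # AWS_SHARED_CREDENTIALS_FILE are all unset, as in the log above.
    if not any(os.environ.get(v) for v in
               ("HOME", "AWS_CONFIG_FILE", "AWS_SHARED_CREDENTIALS_FILE")):
        return None
    return None  # real config-file lookup omitted in this sketch

def resolve(chain):
    for provider in chain:
        creds = provider()
        if creds:
            return creds
    return None

# Demo: with env vars set, the env provider wins before any other is tried.
os.environ["AWS_ACCESS_KEY_ID"] = "AKIAEXAMPLE"
os.environ["AWS_SECRET_ACCESS_KEY"] = "secret-example"
print(resolve([env_provider, profile_provider])["key"])
```

Note that in this issue the chain resolves successfully (the EC2 IMDS provider fetches the instance role), so the failure is downstream of credential resolution.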

  • Connectivity to logs.eu-west-2.amazonaws.com:443 is successful, verified with netcat (under WSL) and PowerShell.

To Reproduce

  • Use above configurations for fluent-bit and run the following command ".\bin\fluent-bit.exe -c .\conf\fluent-bit.conf"
  • AWS EC2 machine with instance profile with above permissions

Expected behavior

  • Fluent-bit to be able to start, create CloudWatch log groups and streams and send logs.

Actual behavior

  • Fluent-bit starts but is unable to create log streams and therefore keeps retrying.
@tmetodie
Author

tmetodie commented Feb 1, 2022

Just as an update: when using WSL on the Windows server to push a single log, it works fine. However, since the logs on the server do not have the same directory structure from the WSL point of view, and the Linux version of fluent-bit does not provide a /some/directory/**/*.log pattern, I cannot use WSL as a workaround. The test does show that the machine has access to CloudWatch Logs from both a networking and an IAM perspective.
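One way around the missing recursive-glob support described above would be to mirror the nested container logs into a single flat directory that a plain *.log Path pattern can match. A sketch of that idea (the directory layout and the copy-based mirroring are assumptions for illustration; a real setup might use symlinks and would need to handle name collisions and log rotation):

```python
import shutil
import tempfile
from pathlib import Path

# Build a nested layout like /var/log/containers/<pod>/<file>.log
# (hypothetical paths -- temp dirs are used so the sketch is self-contained).
src = Path(tempfile.mkdtemp())
flat = Path(tempfile.mkdtemp())
(src / "pod-a").mkdir()
(src / "pod-a" / "container.log").write_text("line 1\n")

# Python's rglob does the recursive matching the tail input reportedly
# cannot; the flat directory is then tailed with a simple *.log pattern.
for log in src.rglob("*.log"):
    shutil.copy(log, flat / log.name)

print(sorted(p.name for p in flat.iterdir()))
```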

@awsitcloudpro

awsitcloudpro commented Apr 22, 2022

A little more info that might help in troubleshooting this issue. As per the logs I posted in #4727, the error comes from this function, called here. Since -1 is used as the return code for every error path, it's not clear exactly which line of code causes the issue. Perhaps adding more debug statements, or using distinct error codes, would help?
Please note that this error happens only in the fluent-bit pod's C code. I installed AWS CLI v2 on the pod after it started running and executed aws logs create-log-group. That worked without any error, so there is no issue with the pod's network connectivity to AWS STS / CloudWatch Logs.
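The suggestion above (distinct error codes instead of a blanket -1) can be sketched as follows. The code names and values here are hypothetical illustrations, not Fluent Bit's actual constants:

```python
# Hypothetical distinct return codes for the connection-setup steps,
# instead of -1 for everything. With a mapping like this, the log line
# can say which step failed rather than just "connection failed".
ERR_DNS = -2
ERR_TCP_CONNECT = -3
ERR_TLS_HANDSHAKE = -4

ERROR_MESSAGES = {
    ERR_DNS: "DNS resolution failed",
    ERR_TCP_CONNECT: "TCP connect failed",
    ERR_TLS_HANDSHAKE: "TLS handshake failed",
}

def describe(ret):
    # A generic -1 tells the operator nothing; a mapped code pinpoints
    # the failing step.
    return ERROR_MESSAGES.get(ret, "unknown error (generic -1)")

print(describe(ERR_TLS_HANDSHAKE))
print(describe(-1))
```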

@leonardo-albertovich
Collaborator

I noticed that you are running at TRACE level, but I don't see much there; would you be able to share the complete log file? Also (maybe someone else remembers this better), since there is a socket number in the log line and no indication of a timeout, I'm wondering if this could be related to SSL. I remember there was a PR that improved certificate loading on Windows, but I can't remember which version it was included in. Would you be able to test the latest version of the 1.8 branch? (Master would be great too.)

@bryangardner

bryangardner commented May 3, 2022

I am seeing this issue as well running FluentBit v1.9.1 in a Windows container based on mcr.microsoft.com/windows/servercore:ltsc2019 on Kubernetes 1.21.
Underlying Windows node is running Windows Server 2019.

My logs include an additional message, [tls] error: unexpected EOF, output by this line (introduced in v1.9.0):

flb_error("[tls] error: unexpected EOF");

This issue appears to be similar to #5381.
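The "unexpected EOF" symptom above means the peer closed the TCP connection while TLS data was still expected, typically during or right after the handshake. A self-contained sketch reproducing that symptom locally (this is generic Python ssl code, not Fluent Bit's TLS layer; the server here is a stand-in for whatever closes the connection in the real issue):

```python
import socket
import ssl
import threading

# A local server that accepts the TCP connection and closes it
# immediately, so the client's TLS handshake sees EOF instead of
# TLS records -- the same symptom "[tls] error: unexpected EOF" reports.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

def close_on_accept():
    conn, _ = srv.accept()
    conn.close()  # abrupt close: no TLS bytes ever sent back

t = threading.Thread(target=close_on_accept)
t.start()

ctx = ssl.create_default_context()
ctx.check_hostname = False        # local test socket, no hostname to check
ctx.verify_mode = ssl.CERT_NONE   # no certificate will ever arrive anyway

result = None
raw = socket.create_connection(("127.0.0.1", port))
try:
    ctx.wrap_socket(raw)
except (ssl.SSLError, OSError):
    # Depending on timing/platform this surfaces as an SSL EOF error or
    # a reset; either way the handshake never completes.
    result = "handshake failed: connection closed by peer"
finally:
    raw.close()
t.join()
print(result)
```

If something similar happens against logs.eu-west-2.amazonaws.com:443 only from the Fluent Bit process (while netcat and the AWS CLI succeed), that points at the TLS layer rather than at networking or IAM, consistent with the certificate-loading suspicion earlier in the thread.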

@github-actions
Contributor

github-actions bot commented Aug 2, 2022

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Aug 2, 2022
@bryangardner

Please remove stale

@github-actions github-actions bot removed the Stale label Aug 5, 2022
@dpryden

dpryden commented Sep 16, 2022

I am seeing what appears to be the same issue (I came here from #4727 which has a clearer description of how to repro the problem).

The problem looks similar to #4735, which appears to have been fixed in 1.9.0. I see that a large change has modified src/tls/openssl.c since that time. Is it possible that something regressed this functionality?

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Dec 16, 2022
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Dec 21, 2022