Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

out_http: Add AWS SigV4 Authentication Option to out_http #5165

Merged
merged 2 commits into from
Apr 5, 2022

Conversation

matthewfala
Copy link
Contributor

@matthewfala matthewfala commented Mar 23, 2022

Documentation

Documentation PR in review.
See: fluent/fluent-bit-docs#760

SigV4 Results with Valgrind

Config File

[SERVICE]
     Grace 30
     Log_Level info

[INPUT]
     Name        tcp
     Listen      127.0.0.1
     Port        4560
     Chunk_Size  80
     Buffer_Size 100
     Tag         me-tcp

[OUTPUT]
     Name http
     Match *
     Host 127.0.0.1
     Port 6435
     URI  /
     Format json
     aws_auth true
     aws_region us-west-2
     aws_service myservice

Valgrind output

[[...] bin]$ valgrind --leak-check=full ./fluent-bit -c "[...]/fluent-bit/.vscode/fluent-bit-config/fluent-bit.conf"
==6323== Memcheck, a memory error detector
==6323== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==6323== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==6323== Command: ./fluent-bit -c /home/ec2-user/fala-forks/fluent-bit/.vscode/fluent-bit-config/fluent-bit.conf
==6323== 
Fluent Bit v1.9.1
* Git commit: 47671b467b02ac0e3d22835f9edeb3dfd2ecb44c
* Copyright (C) 2015-2021 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/03/23 22:43:19] [ info] [engine] started (pid=6323)
[2022/03/23 22:43:19] [ info] [storage] version=1.1.6, initializing...
[2022/03/23 22:43:19] [ info] [storage] in-memory
[2022/03/23 22:43:19] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2022/03/23 22:43:19] [ info] [cmetrics] version=0.3.0
[2022/03/23 22:43:19] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:4560
[2022/03/23 22:43:20] [ info] [sp] stream processor started
[2022/03/23 22:43:20] [ info] [output:http:http.0] worker #0 started
[2022/03/23 22:43:20] [ info] [output:http:http.0] worker #1 started
==6323== Warning: client switching stacks?  SP change: 0xa0578c8 --> 0x933cd90
==6323==          to suppress, use: --max-stackframe=13740856 or greater
==6323== Warning: client switching stacks?  SP change: 0x933cd08 --> 0xa0578c8
==6323==          to suppress, use: --max-stackframe=13740992 or greater
==6323== Warning: client switching stacks?  SP change: 0xa0578c8 --> 0x933cd08
==6323==          to suppress, use: --max-stackframe=13740992 or greater
==6323==          further instances of this message will not be shown.
[2022/03/23 22:43:39] [ info] [output:http:http.0] 127.0.0.1:6435, HTTP status=200
OK
[2022/03/23 22:43:40] [ info] [output:http:http.0] 127.0.0.1:6435, HTTP status=200
OK
[2022/03/23 22:43:41] [ info] [output:http:http.0] 127.0.0.1:6435, HTTP status=200
OK
[2022/03/23 22:43:42] [ info] [output:http:http.0] 127.0.0.1:6435, HTTP status=200
OK
[2022/03/23 22:43:43] [ info] [output:http:http.0] 127.0.0.1:6435, HTTP status=200
OK
[2022/03/23 22:43:44] [ info] [output:http:http.0] 127.0.0.1:6435, HTTP status=200
OK
[2022/03/23 22:43:45] [ info] [output:http:http.0] 127.0.0.1:6435, HTTP status=200
OK
[2022/03/23 22:43:46] [ info] [output:http:http.0] 127.0.0.1:6435, HTTP status=200
OK
[2022/03/23 22:43:47] [ info] [output:http:http.0] 127.0.0.1:6435, HTTP status=200
OK
[2022/03/23 22:43:48] [ info] [output:http:http.0] 127.0.0.1:6435, HTTP status=200
OK
^C[2022/03/23 22:44:14] [engine] caught signal (SIGINT)
[2022/03/23 22:44:14] [ warn] [engine] service will shutdown in max 30 seconds
[2022/03/23 22:44:15] [ info] [engine] service has stopped (0 pending tasks)
[2022/03/23 22:44:15] [ info] [output:http:http.0] thread worker #0 stopping...
[2022/03/23 22:44:15] [ info] [output:http:http.0] thread worker #0 stopped
[2022/03/23 22:44:15] [ info] [output:http:http.0] thread worker #1 stopping...
[2022/03/23 22:44:15] [ info] [output:http:http.0] thread worker #1 stopped
==6323== 
==6323== HEAP SUMMARY:
==6323==     in use at exit: 101,728 bytes in 3,410 blocks
==6323==   total heap usage: 31,771 allocs, 28,361 frees, 8,704,474 bytes allocated
==6323== 
==6323== LEAK SUMMARY:
==6323==    definitely lost: 0 bytes in 0 blocks
==6323==    indirectly lost: 0 bytes in 0 blocks
==6323==      possibly lost: 0 bytes in 0 blocks
==6323==    still reachable: 101,728 bytes in 3,410 blocks
==6323==         suppressed: 0 bytes in 0 blocks
==6323== Reachable blocks (those to which a pointer was found) are not shown.
==6323== To see them, rerun with: --leak-check=full --show-leak-kinds=all

Result Headers (from http server)

[
  "Host",
  "127.0.0.1:6435",
  "Content-Length",
  "491",
  "Content-Type",
  "application/json",
  "User-Agent",
  "Fluent-Bit",
  "x-amz-date",
  "20220323T224920Z",
  "Authorization",
  "AWS4-HMAC-SHA256 Credential=XXXXXXXX/20220323/us-west-2/myservice/aws4_request, SignedHeaders=content-length;content-type;host;user-agent;x-amz-date, Signature=8de8f8ec640f571d6313b67578d64505f9d464652b8f3f71c5479eb7a854fe62",
]

Valgrind SigV4 Error Test

Config File

[SERVICE]
     Grace 30
     Log_Level info

[INPUT]
     Name        tcp
     Listen      127.0.0.1
     Port        4560
     Chunk_Size  80
     Buffer_Size 100
     Tag         me-tcp

[OUTPUT]
     Name http
     Match *
     Host 127.0.0.1
     Port 6435
     URI  /
     Format json
     aws_auth true
     aws_region us-west-2
     aws_service myservice
     aws_role_arn bob
     aws_external_id ext_bob
     aws_sts_endpoint sts.fpaoeijfapeowijfpoaewijfapeowijfaepwoijfa

Valgrind Output

[... bin]$ ./fluent-bit -c "[...]/fluent-bit/.vscode/fluent-bit-config/fluent-bit.conf"
Fluent Bit v1.9.1
* Git commit: 47671b467b02ac0e3d22835f9edeb3dfd2ecb44c
* Copyright (C) 2015-2021 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/03/23 22:15:48] [ info] [engine] started (pid=17119)
[2022/03/23 22:15:48] [ info] [storage] version=1.1.6, initializing...
[2022/03/23 22:15:48] [ info] [storage] in-memory
[2022/03/23 22:15:48] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2022/03/23 22:15:48] [ info] [cmetrics] version=0.3.0
[2022/03/23 22:15:48] [ info] [input:tcp:tcp.0] listening on 127.0.0.1:4560
[2022/03/23 22:15:48] [ warn] [net] getaddrinfo(host='sts.fpaoeijfapeowijfpoaewijfapeowijfaepwoijfa', err=-2): Name or service not known
[2022/03/23 22:15:48] [ info] [sp] stream processor started
[2022/03/23 22:15:48] [ info] [output:http:http.0] worker #0 started
[2022/03/23 22:15:48] [ info] [output:http:http.0] worker #1 started
[2022/03/23 22:19:11] [ warn] [net] getaddrinfo(host='sts.fpaoeijfapeowijfpoaewijfapeowijfaepwoijfa', err=4): Domain name not found
[2022/03/23 22:19:11] [error] [aws_client] connection initialization error
[2022/03/23 22:19:11] [error] [aws_credentials] STS assume role request failed
[2022/03/23 22:19:11] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.

[...]


==14574== 
==14574== 
==14574== HEAP SUMMARY:
==14574==     in use at exit: 101,728 bytes in 3,410 blocks
==14574==   total heap usage: 58,098 allocs, 54,688 frees, 13,083,154 bytes allocated
==14574== 
==14574== LEAK SUMMARY:
==14574==    definitely lost: 0 bytes in 0 blocks
==14574==    indirectly lost: 0 bytes in 0 blocks
==14574==      possibly lost: 0 bytes in 0 blocks
==14574==    still reachable: 101,728 bytes in 3,410 blocks
==14574==         suppressed: 0 bytes in 0 blocks
==14574== Rerun with --leak-check=full to see details of leaked memory
==14574== 
==14574== For counts of detected and suppressed errors, rerun with: -v
==14574== Use --track-origins=yes to see where uninitialised values come from
==14574== ERROR SUMMARY: 1509411 errors from 997 contexts (suppressed: 169 from 56)

One Click Replication Configuration

via Firelens Datajet

  • Sets up a server
  • Connects to fluent bit via tcp
  • Sends 1 70,000 byte log per second 10 times
    HTTP loopback validator featured only on the http-validator branch

Here's a link to the firelens-datajet project which aims to make replicating fluent bit end to end testing portable.
https://github.com/aws/firelens-datajet/tree/http-validator

{
	"component": "wrap",
	"comment": "validator currently under development. this test will yield incorrect results",
	"comment2": "validator currently used as HTTP loopback address for fluent bit logs",
	"library": [
		{
			"referenceId": "stageGenerator",
			"component": "generator",
			"generator": {
				"name": "increment",
				"config": {
					"payloadSize": 70000,
					"disableSignal": true
				}
			}
		},
		{
			"referenceId": "validationGenerator",
			"component": "generator",
			"generator": {
				"name": "http",
				"config": {
                    "port": 6435
				}
			}
		}
	],
	"config": {
		"wrapper": {
			"name": "comparator-validator",
			"config": {
				"validationGraceTimeout": 3,
				"validationIdleTimeout": 0.2,
				"stageGeneratorRef": "stageGenerator",
				"validationGeneratorRef": "validationGenerator"
			}
		}
	},
	"child": {
		"component": "synchronizer",
		"config": {
			"repeat": 1,
			"waitBefore": 0.5,
			"waitAfter": 0.5,
			"waitBetween": 0.01,
			"isAsync": false
		},
		"children": [
			{
				"generator": {
					"name": "reference",
					"config": {
						"ref": "stageGenerator"
					}
				},
				"datajet": {
					"name": "tcp",
					"config": {
						"port": 4560
					}
				},
				"stage": {
					"batchRate": 1,
					"maxBatches": 10
				}
			}
		]
	}
}

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@matthewfala matthewfala changed the title Add AWS Sigv4 to out_http WIP: Add AWS Sigv4 to out_http Mar 23, 2022
@matthewfala matthewfala changed the title WIP: Add AWS Sigv4 to out_http Add AWS Sigv4 to out_http Mar 23, 2022
@matthewfala matthewfala changed the title Add AWS Sigv4 to out_http Add AWS SigV4 Authentication Option to out_http Mar 23, 2022
@matthewfala matthewfala changed the title Add AWS SigV4 Authentication Option to out_http out_http: Add AWS SigV4 Authentication Option to out_http Mar 23, 2022
#ifdef FLB_HAVE_SIGNV4
#ifdef FLB_HAVE_AWS
if (ctx->has_aws_auth) {
ctx->aws_service = flb_output_get_property("aws_service", ctx->ins);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we cannot set aws_service in FLB_AWS_CREDENTIAL_BASE_CONFIG_MAP as well and put it in lb_managed_chain_provider_create

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the managed_chain_provider_create is meant to be a general credential provider base init for all plugins using credential services. We can easily migrate cloudwatch, firehose, kinesis, s3, and es, to use this base as well since they all share the same 4 output properties.

aws_service is only used for sigv4, and doesn't generalize well to be part of the credential provider.

}

/* Use null sts_endpoint if none provided */
sts_endpoint = flb_output_get_property(config_key_sts_endpoint, ins);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could add a debug log here say that we use null sts_endpoint as none provided

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Null sts endpoint means that we are not using sts. I don't think a debug statement is needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this needs a debug statement either. Also, if the custom sts endpoint is NULL we will later construct the regional STS endpoint in the provider itself.

strcpy(config_key_external_id + key_prefix_len, "external_id");

/* AWS provider needs a separate TLS instance */
cred_tls = flb_tls_create(FLB_TRUE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we want to check the format a little bit? more spaces in rest lines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, sounds good.


/* STS provider needs yet another separate TLS instance */
sts_tls = flb_tls_create(FLB_TRUE,
ins->tls_debug,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@DrewZhang13 DrewZhang13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall LGTM, Code is pretty clean and well structured. I like the idea to make the credential provider generic so would be easy to reuse by other plugin in the future.

{ \
FLB_CONFIG_MAP_STR, prefix "role_arn", NULL, \
0, FLB_FALSE, 0, \
"ARN of an IAM role to assume (ex. for cross account access)" \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to AWS IAM Role so align with line 186 to reduce confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aligning with value in Cloudwatch, Firehose, Kinesis, and S3. Future plans to potentially unify the code for credential provider setup, need to keep this

ARN of an IAM role to assume 

if (ctx->has_aws_auth == FLB_TRUE) {
flb_plg_debug(ctx->ins, "signing request with AWS Sigv4");
signature = flb_signv4_do(c,
FLB_TRUE, /* normalize URI ? */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are the question mark for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a boolean

flb_plg_debug(ctx->ins, "signing request with AWS Sigv4");
signature = flb_signv4_do(c,
FLB_TRUE, /* normalize URI ? */
FLB_TRUE, /* add x-amz-date header ? */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Booleans. Actually copied this comment from other plugins

}

/* initialize credentials in sync mode */
aws_provider->provider_vtable->sync(aws_provider);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain why it need to be set as sync at first and set back to async quickly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way i understand it is that during plugin init, there is no such thing as async networking (this might have changed recently). That is because setup occurs before the coroutines are allowed to swap out. Really the reason why I did this was because everywhere else we setup the cred provider, we do this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I added this a while ago to fix a bug... AFAIK the init callbacks still must run in sync mode.

ins->tls_key_file,
ins->tls_key_passwd);
if (!sts_tls) {
flb_errno();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add an error log for this error too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

ins->tls_key_file,
ins->tls_key_passwd);
if (!cred_tls) {
flb_errno();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add an error log for this error too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

0, FLB_TRUE, offsetof(struct flb_out_http, aws_service),
"AWS destination service code, used by SigV4 authentication"
},
FLB_AWS_CREDENTIAL_BASE_CONFIG_MAP("aws_"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea to make the config general to help on reusing it.
The aws_ value looks to me like a static value and has been used several different place below.
Could we create a static field somewhere and refer to the same place so in case in the future this name is not suitable we don't need to search everywhere to change it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it.

Copy link
Contributor Author

@matthewfala matthewfala Mar 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is the only place it is used. Regardless, I'll make it a #define

It's used multiple times actually.

ctx->aws_provider = flb_managed_chain_provider_create(
ins,
config,
"aws_",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the aws_ refer to a static field

@matthewfala
Copy link
Contributor Author

Resolved all comments. Would you take another look @DrewZhang13 and @zhonghui12?

DrewZhang13
DrewZhang13 previously approved these changes Mar 25, 2022
Copy link
Contributor

@DrewZhang13 DrewZhang13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, approved


session_name = flb_sts_session_name();
if (!session_name) {
flb_plg_error(ins, "Failed to create aws iam role "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use the word generate not create to help make it clear this was our code failing vs some create call to STS

Comment on lines +135 to +140
/* Provider managed dependencies; to delete on destroy */
struct flb_aws_provider *base_aws_provider;
struct flb_tls *cred_tls; /* tls instances can't be re-used; aws provider requires
a separate one */
struct flb_tls *sts_tls; /* one for the standard chain provider, one for sts
assume role */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm this is only used by the new flb_managed_chain_provider_create function to use? the other providers don't use these new fields I guess?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. It's just if we want aws_provider_destroy to handle clean up operations. If null (due to calloc) clean up will not occur

PettitWesley
PettitWesley previously approved these changes Mar 28, 2022
@edsiper edsiper merged commit bc905f6 into fluent:master Apr 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants