Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43535: [C++] Support the AWS S3 SSE-C encryption #43601

Merged
merged 18 commits into from
Nov 3, 2024

Conversation

ripplehang
Copy link
Contributor

@ripplehang ripplehang commented Aug 7, 2024

Rationale for this change

server-side encryption with customer-provided keys is an important security feature for aws s3, it's useful when user want to manager the encryption key themselves, say, they don't want the data to be exposed to the aws system admin, and ensure the object is safe even the ACCESS_KEY and SECRET_KEY is somehow leaked.
Some comparison of S3 encryption options :
https://www.linkedin.com/pulse/delusion-s3-encryption-benefits-ravi-ivaturi/

What changes are included in this PR?

  1. Add the sse_customer_key member for S3Options to support server-side encryption with customer-provided keys (SSE-C keys).

    • The sse_customer_key was expected to be 256 bits (32 bytes) according to aws doc
    • The sse_customer_key was expected to be the raw key rather than base64 encoded value, arrow would calculate the base64 and MD5 on the fly.
    • By default the sse_customer_key is empty, and when the sse_customer_key is empty, there is no impact on the existing workflow. When the sse_customer_key is configured, it would require the aws sdk version to newer than 1.9.201.
  2. Add the tls_ca_file_path, tls_ca_dir_path and tls_verify_certificates members for S3Options.

    • the tls_ca_file_path, tls_ca_dir_path member for S3Options would override the value configured by arrow::fs::FileSystemGlobalOptions.
    • for s3, according to aws sdk doc, the tls_ca_file_path and tls_ca_dir_path only take effect in Linux, in order to support connect to the the storage server like minio with self-signed certificates on non-linux platform, we expose the tls_verify_certificates.
  3. Refine the unit test to start the minio server with self-signed certificate on linux platform, so the unit test could cover the https case on linux, and http case on non-linux platform.

Are these changes tested?

Yes

Are there any user-facing changes?

Only additional members to S3Options.

Copy link

github-actions bot commented Aug 7, 2024

⚠️ GitHub issue #43535 has been automatically assigned in GitHub to PR creator.

@ripplehang
Copy link
Contributor Author

@pitrou @mapleFU @felipecrv can you help to review the PR, Thanks

@pitrou
Copy link
Member

pitrou commented Aug 22, 2024

I took a quick look and here are some assorted thoughts:

  • there are no unit tests
  • only the encryption key needs to be on S3Options, not the MD5 and algorithm string (which can be computed on the fly)

@pitrou
Copy link
Member

pitrou commented Aug 22, 2024

Also, for the other readers wondering, I've found a useful (though possibly outdated?) comparison of S3 encryption options here:
https://www.linkedin.com/pulse/delusion-s3-encryption-benefits-ravi-ivaturi/

@ripplehang
Copy link
Contributor Author

I took a quick look and here are some assorted thoughts:

@pitrou Thanks for your review, for #1, i would try to add ut for the change, for #2, yes, actually ONLY the Key is needed, that's why i only expose one SetSSECKey function, and three Get function, and private the new added datamember.
do you suggest to only add one datamember for S3Options? but if there is only one memeber, we may needs to calculate everywhere?

@ripplehang
Copy link
Contributor Author

@pitrou I've submitted another commit to only expose the ssec-key in S3Options, and then calculate the MD5 every time on the fly, can you help to review whether there are other comments? I would try to add the ut if there is no interface change required, Thanks

@ripplehang ripplehang force-pushed the support_ssec_for_aws_s3 branch 3 times, most recently from bb2e024 to 539fa7c Compare August 27, 2024 06:00
@ripplehang
Copy link
Contributor Author

@pitrou Can you help to review the PR again?Thanks

@ripplehang
Copy link
Contributor Author

@kou @mapleFU can you help to review the change, Thanks

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a careful around this weekend

/// \param expect_input_key_size, default 32
/// \return true if the decode and calculate MD5 success, otherwise return false
ARROW_EXPORT
bool CalculateSSECKeyMD5(const std::string& base64_encoded_key, std::string& md5_result,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind change to Result<std::string>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I would try to refine the return value for the function.

Comment on lines 278 to 280
// Decode the Base64-encoded key to get the raw binary key
Aws::Utils::ByteBuffer rawKey =
Aws::Utils::HashingUtils::Base64Decode(base64_encoded_key);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do this handles error if base64_encoded_key is not valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From https://github.com/aws/aws-sdk-cpp/blob/ac2da09e6930e3988d1289717e2df5d4b7408f17/src/aws-cpp-sdk-core/source/utils/base64/Base64.cpp#L91-L121, seems this aws util API didn't validate the input properly,but directly calculate the output size and allocate the buffer, so i add the check to ensure every character is the valid character here.
Meanwhile, I also add the related Unit test for the CalculateSSECKeyMD5 to test the different input value.


// the key needs to be // 256 bits(32 bytes)according to
// https://docs.aws.amazon.com/AmazonS3/latest/userguide/ServerSideEncryptionCustomerKeys.html#specifying-s3-c-encryption
if (rawKey.GetLength() != expect_input_key_size) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would this not equal to 256?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the expect_input_key_size is exposed via S3Options, so it's possible for this value to be set as any length.

@github-actions github-actions bot added awaiting committer review Awaiting committer review awaiting review Awaiting review and removed awaiting review Awaiting review awaiting committer review Awaiting committer review labels Aug 30, 2024
@ripplehang ripplehang force-pushed the support_ssec_for_aws_s3 branch 3 times, most recently from de05328 to 8649388 Compare August 30, 2024 12:24
@kou kou changed the title GH-43535:[C++] support the AWS S3 SSE-C encryption GH-43535: [C++] support the AWS S3 SSE-C encryption Aug 31, 2024
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix style by pre-commit run --show-diff-on-failure --color=always --all?

Comment on lines 58 to 61
ASSERT_FALSE(CalculateSSECKeyMD5("").ok()); // invalid base64
ASSERT_FALSE(CalculateSSECKeyMD5("%^H").ok()); // invalid base64
ASSERT_FALSE(CalculateSSECKeyMD5("INVALID").ok()); // invalid base64
ASSERT_FALSE(CalculateSSECKeyMD5("MTIzNDU2Nzg5").ok()); // invalid, the input key size not match
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use ASSERT_RAISES_WITH_MESSAGES() instead?

cpp/src/arrow/filesystem/filesystem_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/filesystem_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/s3fs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/s3fs.cc Outdated Show resolved Hide resolved
@@ -443,6 +443,7 @@ class TestS3FS : public S3TestMixin {
// Most tests will create buckets
options_.allow_bucket_creation = true;
options_.allow_bucket_deletion = true;
options_.sse_customer_key = "1WH9aTJ0+Tn0NLbTMHZn9aCW3Li3ViAdBsoIldPCREw=";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to cover the workflow with sse_customer_key nonempty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we always set sse_custom_key, we can't test no sse_custom_key case.
Can we add a test that uses sse_custom_key instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, and ideally the SSE-C encryption test would write a file with the encryption key, and check that trying to read it with another key (or with no key at all) fails.

cpp/src/arrow/filesystem/util_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/util_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/util_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/util_internal.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting review Awaiting review and removed awaiting review Awaiting review labels Aug 31, 2024
@ripplehang
Copy link
Contributor Author

@kou Do you think we could merge the pr since there is no response from @pitrou for two weeks.

@kou
Copy link
Member

kou commented Nov 3, 2024

I'll merge this in the next week if nobody objects it.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will merge if CI is green, thanks for the updates @ripplehang !

@pitrou
Copy link
Member

pitrou commented Nov 3, 2024

@github-actions crossbow submit -g cpp

This comment was marked as outdated.

@pitrou
Copy link
Member

pitrou commented Nov 3, 2024

@github-actions crossbow submit -g cpp

Copy link

github-actions bot commented Nov 3, 2024

Revision: ce67a27

Submitted crossbow builds: ursacomputing/crossbow @ actions-7d5b001eaa

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-20.04-cuda-11.2.2 GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou pitrou merged commit 00e7c65 into apache:main Nov 3, 2024
38 of 39 checks passed
@pitrou pitrou removed the awaiting merge Awaiting merge label Nov 3, 2024
Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 00e7c65.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants