-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35903: [C++] Skeleton for Azure Blob Storage filesystem implementation #35701
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
|
|
|
|
|
@kou Thank you for the suggestions! This is my first time opening a PR for Arrow, so I'm a bit unfamiliar with the codebase infrastructure.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you enable GitHub Actions in your fork?
.github/workflows/cpp.yml
Outdated
@@ -226,6 +227,7 @@ jobs: | |||
- os: windows-2019 | |||
name: Windows 2019 | |||
env: | |||
ARROW_AZURE: OFF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you revert this because ARROW_AZURE
is OFF
by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I reverted it.
.github/workflows/cpp.yml
Outdated
@@ -329,6 +331,7 @@ jobs: | |||
- msystem_lower: clang64 | |||
msystem_upper: CLANG64 | |||
env: | |||
ARROW_AZURE: OFF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also reverted this.
cpp/src/arrow/filesystem/azurefs.cc
Outdated
// ----------------------------------------------------------------------- | ||
// AzureFilesystem Implementation | ||
|
||
class AzureFileSystem::Impl : public std::enable_shared_from_this<AzureFileSystem::Impl> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove std::enable_shared_from_this
?
class AzureFileSystem::Impl : public std::enable_shared_from_this<AzureFileSystem::Impl> { | |
class AzureFileSystem::Impl { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed std::enable_shared_from_this
is probably not useful if the Impl is stored by its parent using a std::unique_ptr
.
That said, once we start implementing async APIs, a std::shared_ptr
will probably be required.
cpp/src/arrow/filesystem/azurefs.h
Outdated
struct ARROW_EXPORT AzureOptions { | ||
std::string scheme; | ||
std::string account_dfs_url; | ||
std::string account_blob_url; | ||
bool is_azurite = false; | ||
AzureCredentialsKind credentials_kind = AzureCredentialsKind::Anonymous; | ||
|
||
std::string sas_token; | ||
std::string connection_string; | ||
std::shared_ptr<Azure::Storage::StorageSharedKeyCredential> | ||
storage_credentials_provider; | ||
std::shared_ptr<Azure::Core::Credentials::TokenCredential> | ||
service_principle_credentials_provider; | ||
|
||
AzureOptions(); | ||
|
||
bool Equals(const AzureOptions& other) const; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about removing AzureOptions
and related codes for now and adding it as the next step?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since they're already there and not terribly long, I think we can keep them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to keep AzureOptions
in this PR, we should pick up tests for AzureOptions
from the original PR too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will add some of the basic tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I looked at the tests for AzureOptions and all of them require a lot of other functions to be implemented first. I think adding those functions to this PR would make it too complicated. Personally, I don't feel like it is horrible to keep AzureOptions in this PR without adding tests, but it is up to you. Alternatively, I can remove some of the additional fields that we don't immediately need, like sas_token
, connection_string
, storage_credentials_provider
, and service_principle_credentials_provider
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about removing AzureOptions
related codes from this PR to merge this PR quickly?
diff --git a/cpp/src/arrow/filesystem/azurefs.cc b/cpp/src/arrow/filesystem/azurefs.cc
index 0158c0cec..9fdf903b0 100644
--- a/cpp/src/arrow/filesystem/azurefs.cc
+++ b/cpp/src/arrow/filesystem/azurefs.cc
@@ -23,43 +23,21 @@
namespace arrow {
namespace fs {
-// -----------------------------------------------------------------------
-// AzureOptions Implementation
-
-AzureOptions::AzureOptions() {}
-
-bool AzureOptions::Equals(const AzureOptions& other) const {
- return (account_dfs_url == other.account_dfs_url &&
- account_blob_url == other.account_blob_url &&
- credentials_kind == other.credentials_kind);
-}
-
// -----------------------------------------------------------------------
// AzureFilesystem Implementation
class AzureFileSystem::Impl {
public:
io::IOContext io_context_;
- bool is_hierarchical_namespace_enabled_;
- AzureOptions options_;
- explicit Impl(AzureOptions options, io::IOContext io_context)
- : io_context_(io_context), options_(std::move(options)) {}
+ explicit Impl(io::IOContext io_context)
+ : io_context_(io_context) {}
Status Init() {
- if (options_.backend == AzureBackend::Azurite) {
- // gen1Client_->GetAccountInfo().Value.IsHierarchicalNamespaceEnabled
- // throws error in azurite
- is_hierarchical_namespace_enabled_ = false;
- }
return Status::OK();
}
-
- const AzureOptions& options() const { return options_; }
};
-const AzureOptions& AzureFileSystem::options() const { return impl_->options(); }
-
bool AzureFileSystem::Equals(const FileSystem& other) const {
if (this == &other) {
return true;
@@ -67,8 +45,7 @@ bool AzureFileSystem::Equals(const FileSystem& other) const {
if (other.type_name() != type_name()) {
return false;
}
- const auto& azure_fs = ::arrow::internal::checked_cast<const AzureFileSystem&>(other);
- return options().Equals(azure_fs.options());
+ return true;
}
Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path) {
@@ -138,15 +115,14 @@ Result<std::shared_ptr<io::OutputStream>> AzureFileSystem::OpenAppendStream(
}
Result<std::shared_ptr<AzureFileSystem>> AzureFileSystem::Make(
- const AzureOptions& options, const io::IOContext& io_context) {
- std::shared_ptr<AzureFileSystem> ptr(new AzureFileSystem(options, io_context));
+ const io::IOContext& io_context) {
+ std::shared_ptr<AzureFileSystem> ptr(new AzureFileSystem(io_context));
RETURN_NOT_OK(ptr->impl_->Init());
return ptr;
}
-AzureFileSystem::AzureFileSystem(const AzureOptions& options,
- const io::IOContext& io_context)
- : FileSystem(io_context), impl_(std::make_unique<Impl>(options, io_context)) {
+AzureFileSystem::AzureFileSystem(const io::IOContext& io_context)
+ : FileSystem(io_context), impl_(std::make_unique<Impl>(io_context)) {
default_async_is_sync_ = false;
}
diff --git a/cpp/src/arrow/filesystem/azurefs.h b/cpp/src/arrow/filesystem/azurefs.h
index 45087668d..31c014dcb 100644
--- a/cpp/src/arrow/filesystem/azurefs.h
+++ b/cpp/src/arrow/filesystem/azurefs.h
@@ -25,63 +25,9 @@
#include "arrow/util/macros.h"
#include "arrow/util/uri.h"
-namespace Azure {
-namespace Core {
-namespace Credentials {
-
-class TokenCredential;
-
-} // namespace Credentials
-} // namespace Core
-namespace Storage {
-
-class StorageSharedKeyCredential;
-
-} // namespace Storage
-} // namespace Azure
-
namespace arrow {
namespace fs {
-enum class AzureCredentialsKind : int8_t {
- /// Anonymous access (no credentials used), public
- Anonymous,
- /// Use explicitly-provided access key pair
- StorageCredentials,
- /// Use ServicePrincipleCredentials
- ServicePrincipleCredentials,
- /// Use Sas Token to authenticate
- Sas,
- /// Use Connection String
- ConnectionString
-};
-
-enum class AzureBackend : bool {
- /// Official Azure Remote Backend
- Azure,
- /// Local Simulated Storage
- Azurite
-};
-
-/// Options for the AzureFileSystem implementation.
-struct ARROW_EXPORT AzureOptions {
- std::string account_dfs_url;
- std::string account_blob_url;
- AzureBackend backend = AzureBackend::Azure;
- AzureCredentialsKind credentials_kind = AzureCredentialsKind::Anonymous;
-
- std::string sas_token;
- std::string connection_string;
- std::shared_ptr<Azure::Storage::StorageSharedKeyCredential>
- storage_credentials_provider;
- std::shared_ptr<Azure::Core::Credentials::TokenCredential>
- service_principle_credentials_provider;
-
- AzureOptions();
-
- bool Equals(const AzureOptions& other) const;
-};
-
/// \brief Azure-backed FileSystem implementation for ABFS and ADLS.
///
/// ABFS (Azure Blob Storage - https://azure.microsoft.com/en-us/products/storage/blobs/)
@@ -94,7 +40,7 @@ struct ARROW_EXPORT AzureOptions {
/// compatibility. Gen1 exists as a separate object that will retired
/// on Feb 29, 2024. New ADLS accounts will use Gen2 instead, which is
/// implemented on top of ABFS.
-///
+///
/// TODO: GH-18014 Complete the internal implementation
/// and review the documentation
class ARROW_EXPORT AzureFileSystem : public FileSystem {
@@ -103,9 +49,6 @@ class ARROW_EXPORT AzureFileSystem : public FileSystem {
std::string type_name() const override { return "abfs"; }
- /// Return the original Azure options when constructing the filesystem
- const AzureOptions& options() const;
-
bool Equals(const FileSystem& other) const override;
Result<FileInfo> GetFileInfo(const std::string& path) override;
@@ -146,10 +89,10 @@ class ARROW_EXPORT AzureFileSystem : public FileSystem {
const std::shared_ptr<const KeyValueMetadata>& metadata = {}) override;
static Result<std::shared_ptr<AzureFileSystem>> Make(
- const AzureOptions& options, const io::IOContext& = io::default_io_context());
+ const io::IOContext& = io::default_io_context());
private:
- explicit AzureFileSystem(const AzureOptions& options, const io::IOContext& io_context);
+ explicit AzureFileSystem(const io::IOContext& io_context);
class Impl;
std::unique_ptr<Impl> impl_;
diff --git a/cpp/src/arrow/filesystem/azurefs_test.cc b/cpp/src/arrow/filesystem/azurefs_test.cc
index 0f03e8839..d7f97d1e4 100644
--- a/cpp/src/arrow/filesystem/azurefs_test.cc
+++ b/cpp/src/arrow/filesystem/azurefs_test.cc
@@ -34,11 +34,9 @@ using ::testing::IsEmpty;
using ::testing::Not;
using ::testing::NotNull;
-// Placeholder test for file structure
-// TODO: GH-18014 Remove once a proper test is added
-TEST(AzureFileSystem, OptionsCompare) {
- AzureOptions options;
- EXPECT_TRUE(options.Equals(options));
+TEST(AzureFileSystem, type_name) {
+ ASSERT_OK_AND_ASSIGN(auto fs, AzureFileSystem::Make());
+ ASSERT_EQ(fs->type_name(), "abfs");
}
} // namespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if this PR is ready, it probably doesn't make sense to remove AzureOptions
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srilman It seems that you can choose whichever you want. If you remove AzureOptions
, I'll merge this PR, pitrou will merge this PR otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srilman Are you still working on this? Would you be opposed to some assistance with getting this up and running in the coming weeks? |
cpp/src/arrow/filesystem/azurefs.cc
Outdated
|
||
const AzureOptions& options() const { return options_; } | ||
|
||
protected: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but since it's a private implementation class you don't really need to differentiate between public and protected members.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, removed the protected section and moved AzureOptions options_
above
cpp/src/arrow/filesystem/azurefs.h
Outdated
std::string scheme; | ||
std::string account_dfs_url; | ||
std::string account_blob_url; | ||
bool is_azurite = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several comments / questions:
- Will this command special behaviour? Are there other possible backends?
- Does this need to be parametered by the user or can it be auto-detected, e.g. from response headers?
- A boolean is not future-proof if we need to support more backends in the future. Our S3 internals use an enum:
enum class S3Backend { Amazon, Minio, Other };
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- There are a couple of areas with some special behavior. In particular,
- Azurite does not support hierarchical namespaces, then
AzureFileSystem::Impl::is_hierarchical_namespace_enabled_
needs to always be set to false - In the original PR, when
is_azurite
is set, then it will configure the following with account key credentials:
account_blob_url = "http://127.0.0.1:10000/" + account_name + "/";
account_dfs_url = "http://127.0.0.1:10000/" + account_name + "/";
This is instead of the official URLs. I'm don't like the built-in localhost URLs, so I'll probably modify it a bit. I'm not aware of any other backends, unless ADLS or ADLS Gen1 happens to differ significantly from ABFS.
-
I'm not aware if it can be detected from response headers or not. With the hierarchical namespaces issue, it seems like this might be necessary even for setting up the connection.
-
Good point, will use a enum instead.
/// ADLS provides filesystem semantics, file-level security, and Hadoop | ||
/// compatibility. Gen1 exists as a separate object that will retired | ||
/// on Feb 29, 2024. New ADLS accounts will use Gen2 instead, which is | ||
/// implemented on top of ABFS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the filesystem implementation can provide access to both storage types? If so, then nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, it should. I did an initial test a while ago on the original PR and was able to.
cpp/src/arrow/filesystem/azurefs.h
Outdated
std::string type_name() const override { return "abfs"; } | ||
|
||
/// Return the original Azure options when constructing the filesystem | ||
AzureOptions options() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can probably return a const-ref:
AzureOptions options() const; | |
const AzureOptions& options() const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. AzureFileSystem::Impl::options
already returns a const-ref anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed
@srilman Sorry if this was discussed already before, but what will be the testing strategy? Will we run Azurite from the C++ unit tests (as we already do e.g. for S3 using Minio)? |
I think so. |
@pitrou It looks like CI is failing due to flaky tests. Is there anything I should do, or do you need to manually merge it? |
@srilman Some of the flakiness should be fixed by rebasing/merging latest git main. Also, we are nearing a release, so we are prioritizing fixing current CI issues. I'll take a look again at this PR a bit later. Thanks a lot for your patience! |
ee29a3b
to
45d9c8f
Compare
I rebased on latest git main. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll merge this in the interesting of moving things forward.
The next step should be to start adding some code that makes use of the Azure SDK, even something minimal such as actually instantiating the filesystem internals.
CI failures are unrelated. |
…ementation (apache#35701) ### What changes are included in this PR? This PR splits out the overall skeleton of apache#12914 in order to make merging of the overall Azure Filesystem easier to do. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * Closes: apache#35903 Authored-by: Srinivas Lade <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 085a0ba. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
azurite
in preparation for testing Azure C++ filesystem
Tom-Newton/arrow#3
…C++ filesystem (#36988) ### Rationale for this change We need to write tests for #18014. azurite is like a fake Azure blob storage so it can be used to write integration tests ### What changes are included in this PR? Extract the `azurite` related changes from #12914 to create a smaller PR that's easier to review. I have made very minimal changes compared to that PR. Currently `azurite` is configured for all the environments where `ARROW_AZURE` was enabled by #35701. I assume its deliberate that its not enabled yet for windows, alpine, conda, debian or fedora builds. ### Are these changes tested? Its tested by there aren't really any good tests in this PR. I used this `azurite` config in #36835 to make an integration test that uses the Azure C++ SDK. On its own we can't really write tests for this `azurite` setup PR. ### Are there any user-facing changes? No * Closes: #36886 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…ementation (apache#35701) ### What changes are included in this PR? This PR splits out the overall skeleton of apache#12914 in order to make merging of the overall Azure Filesystem easier to do. ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * Closes: apache#35903 Authored-by: Srinivas Lade <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…Azure C++ filesystem (apache#36988) ### Rationale for this change We need to write tests for apache#18014. azurite is like a fake Azure blob storage so it can be used to write integration tests ### What changes are included in this PR? Extract the `azurite` related changes from apache#12914 to create a smaller PR that's easier to review. I have made very minimal changes compared to that PR. Currently `azurite` is configured for all the environments where `ARROW_AZURE` was enabled by apache#35701. I assume its deliberate that its not enabled yet for windows, alpine, conda, debian or fedora builds. ### Are these changes tested? Its tested by there aren't really any good tests in this PR. I used this `azurite` config in apache#36835 to make an integration test that uses the Azure C++ SDK. On its own we can't really write tests for this `azurite` setup PR. ### Are there any user-facing changes? No * Closes: apache#36886 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
What changes are included in this PR?
This PR splits out the overall skeleton of #12914 in order to make merging of the overall Azure Filesystem easier to do.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.