-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-2034: [C++] Filesystem implementation for Azure Blob Storage #12914
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add -DARROW_AZURE
to ci/scripts/cpp_build.sh
like ARROW_GCS
and ARROW_S3
?
cpp/CMakeLists.txt
Outdated
set(ARROW_STATIC_LINK_LIBS) | ||
set(ARROW_AZURE_STATIC_LINK_LIBS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use ARROW_LINK_LIBS
and ARROW_STATIC_LINK_LIBS
instead of add new ARROW_AZURE_*
variables like S3 and GCS?
cpp/cmake_modules/BuildUtils.cmake
Outdated
@@ -46,6 +46,260 @@ if(WIN32 AND CMAKE_CXX_COMPILER_ID STREQUAL "GNU") | |||
list(APPEND ARROW_BOOST_PROCESS_COMPILE_DEFINITIONS "BOOST_USE_WINDOWS_H=1") | |||
endif() | |||
|
|||
function(ADD_ARROW_LIB_AZURE LIB_NAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use add_arrow_lib
instead of defining this?
cpp/cmake_modules/BuildUtils.cmake
Outdated
endif() | ||
endfunction() | ||
|
||
function(ADD_TEST_CASE_AZURE REL_TEST_NAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use add_test_case
instead of defining this?
@@ -4513,6 +4513,53 @@ if(ARROW_S3) | |||
endif() | |||
endif() | |||
|
|||
macro(build_azuresdk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you build Azure C++ SDK by externalproject_add
in this?
message(STATUS "Found AZURE SDK headers: ${AZURESDK_INCLUDE_DIR}") | ||
message(STATUS "Found AZURE SDK libraries: ${AZURESDK_LINK_LIBRARIES}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
message(STATUS "Found AZURE SDK headers: ${AZURESDK_INCLUDE_DIR}") | |
message(STATUS "Found AZURE SDK libraries: ${AZURESDK_LINK_LIBRARIES}") | |
message(STATUS "Found Azure SDK headers: ${AZURESDK_INCLUDE_DIR}") | |
message(STATUS "Found Azure SDK libraries: ${AZURESDK_LINK_LIBRARIES}") |
set(CMAKE_CXX_STANDARD 14) | ||
set(CMAKE_CXX_STANDARD_REQUIRED ON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use CXX_STANDARD
and CXX_STANDARD_REQUIRED
target properties instead?
#ifdef _WIN32 | ||
// Undefine preprocessor macros that interfere with AWS function / method names | ||
#ifdef GetMessage | ||
#undef GetMessage | ||
#endif | ||
#ifdef GetObject | ||
#undef GetObject | ||
#endif | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this for Azure SDK for C++?
void AzureOptions::ConfigureConnectionStringCredentials( | ||
const std::string& connection_string_uri) { | ||
auto account_name = | ||
Azure::Storage::_internal::ParseConnectionString(connection_string_uri).AccountName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use public API instead of internal API here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will check for a public API and update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No public API is available for this, Added implementation of ParseConnectionString instead.
|
||
Result<AzureOptions> AzureOptions::FromUri(const Uri& uri, std::string* out_path) { | ||
AzureOptions options; | ||
AZURE_ASSERT(uri.has_host()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this may call std::abort()
.
Generally, we should not use std::abort()
in a library.
If we need to use std::abort()
, please use ARROW_DCHECK
family.
|
||
// private: | ||
// const std::string& AdlsGen2AccountName = std::getenv("ADLS_GEN2_ACCOUNT_NAME"); | ||
// const std::string& AdlsGen2AccountKey = std::getenv("ADLS_GEN2_ACCOUNT_KEY"); | ||
// const std::string& AdlsGen2ConnectionStringValue = std::getenv( | ||
// "ADLS_GEN2_CONNECTION_STRING"); | ||
// const std::string& AdlsGen2SasUrl = std::getenv("ADLS_GEN2_SASURL"); | ||
// const std::string& AadTenantIdValue = std::getenv("AAD_TENANT_ID"); | ||
// const std::string& AadClientIdValue = std::getenv("AAD_CLIENT_ID"); | ||
// const std::string& AadClientSecretValue = std::getenv("AAD_CLIENT_SECRET"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove them?
ci/appveyor-cpp-build.bat
Outdated
@@ -108,6 +108,7 @@ cmake -G "%GENERATOR%" %CMAKE_ARGS% ^ | |||
-DARROW_PARQUET=ON ^ | |||
-DARROW_PYTHON=ON ^ | |||
-DARROW_S3=%ARROW_S3% ^ | |||
-DARROW_AZURE=OFF ^ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you keep this list in alphabetical order?
foreach(AZURESDK_LIBRARY_CPP ${AZURESDK_LIBRARIES_CPP}) | ||
find_package(${AZURESDK_LIBRARY_CPP} CONFIG REQUIRED) | ||
endforeach() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needless.
foreach(AZURESDK_LIBRARY_CPP ${AZURESDK_LIBRARIES_CPP}) | ||
find_package(${AZURESDK_LIBRARY_CPP} CONFIG REQUIRED) | ||
endforeach() | ||
include_directories(SYSTEM ${AZURESDK_INCLUDE_DIR}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needless.
|
||
set_property(TARGET Azure::azure-core | ||
APPEND | ||
PROPERTY INTERFACE_LINK_LIBRARIES CURL::libcurl LibXml2::LibXml2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to call find_curl()
for CURL::libcurl
and find_package(LibXml2 REQUIRED)
for LibXml2::LibXml2
.
cpp/src/arrow/filesystem/azurefs.h
Outdated
Status CreateDir(const std::string& path, bool recursive = true) override; | ||
|
||
Status DeleteDir(const std::string& path) override; | ||
Status DeleteDirContents(const std::string& path) override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Status DeleteDirContents(const std::string& path) override; | |
Status DeleteDirContents(const std::string& path, | |
bool missing_dir_ok = false) override; |
STATIC_LINK_LIBS | ||
${ARROW_STATIC_LINK_LIBS}) | ||
|
||
set_target_properties(azurefs_objlib PROPERTIES CXX_STANDARD 14 CXX_STANDARD_REQUIRED ON) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry. I forgot that we embed filesystem modules to libarrow.so
instead of creating separated libarrow_XXX.so
. (We don't have separated CMake target for libarrow.so
and filesystem modules.)
How about changing the default C++ version?
diff --git a/cpp/cmake_modules/SetupCxxFlags.cmake b/cpp/cmake_modules/SetupCxxFlags.cmake
index d3a2a1a2d2..cdaafe379b 100644
--- a/cpp/cmake_modules/SetupCxxFlags.cmake
+++ b/cpp/cmake_modules/SetupCxxFlags.cmake
@@ -118,12 +118,16 @@ if(NOT DEFINED CMAKE_C_STANDARD)
set(CMAKE_C_STANDARD 11)
endif()
-# This ensures that things like c++11 get passed correctly
+# This ensures that things like c++11/c++14 get passed correctly
if(NOT DEFINED CMAKE_CXX_STANDARD)
- set(CMAKE_CXX_STANDARD 11)
+ if(ARROW_AZURE)
+ set(CMAKE_CXX_STANDARD 14)
+ else()
+ set(CMAKE_CXX_STANDARD 11)
+ endif()
endif()
-# We require a C++11 compliant compiler
+# We require a C++11/14 compliant compiler
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# ARROW-6848: Do not use GNU (or other CXX) extensions
diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt
index ec6cada1cd..1ded8e59d4 100644
--- a/cpp/src/arrow/CMakeLists.txt
+++ b/cpp/src/arrow/CMakeLists.txt
@@ -469,6 +469,12 @@ if(ARROW_FILESYSTEM)
filesystem/path_util.cc
filesystem/util_internal.cc)
+ if(ARROW_AZURE)
+ list(APPEND ARROW_SRCS filesystem/azurefs.cc filesystem/azurefs_mock.cc)
+ set_source_files_properties(filesystem/azurefs.cc filesystem/azurefs_mock.cc
+ PROPERTIES SKIP_PRECOMPILE_HEADERS ON
+ SKIP_UNITY_BUILD_INCLUSION ON)
+ endif()
if(ARROW_GCS)
list(APPEND ARROW_SRCS filesystem/gcsfs.cc filesystem/gcsfs_internal.cc)
set_source_files_properties(filesystem/gcsfs.cc filesystem/gcsfs_internal.cc
diff --git a/cpp/src/arrow/filesystem/CMakeLists.txt b/cpp/src/arrow/filesystem/CMakeLists.txt
index 819eca08cf..bbca231baf 100644
--- a/cpp/src/arrow/filesystem/CMakeLists.txt
+++ b/cpp/src/arrow/filesystem/CMakeLists.txt
@@ -28,8 +28,8 @@ add_arrow_test(filesystem-test
EXTRA_LABELS
filesystem)
-if(ARROW_GCS)
- add_arrow_test(gcsfs_test
+if(ARROW_AZURE)
+ add_arrow_test(azurefs_test
EXTRA_LABELS
filesystem
EXTRA_LINK_LIBS
@@ -37,32 +37,13 @@ if(ARROW_GCS)
Boost::system)
endif()
-if(ARROW_AZURE)
- set(AZURE_SRCS)
- list(APPEND
- AZURE_SRCS
- azurefs_mock.cc
- azurefs.cc)
-
- add_arrow_lib(azurefs
- SOURCES
- ${AZURE_SRCS}
- SHARED_LINK_LIBS
- ${ARROW_LINK_LIBS}
- SHARED_PRIVATE_LINK_LIBS
- ${ARROW_SHARED_PRIVATE_LINK_LIBS}
- STATIC_LINK_LIBS
- ${ARROW_STATIC_LINK_LIBS})
-
- set_target_properties(azurefs_objlib PROPERTIES CXX_STANDARD 14 CXX_STANDARD_REQUIRED ON)
-
- set(TEST_LIBS_AZURE ${ARROW_TEST_LINK_LIBS})
- list(APPEND TEST_LIBS_AZURE azurefs_shared)
- add_arrow_test(azurefs_test EXTRA_LABELS filesystem
- STATIC_LINK_LIBS
- ${TEST_LIBS_AZURE}
- )
- set_target_properties(arrow-azurefs-test PROPERTIES CXX_STANDARD 14 CXX_STANDARD_REQUIRED ON)
+if(ARROW_GCS)
+ add_arrow_test(gcsfs_test
+ EXTRA_LABELS
+ filesystem
+ EXTRA_LINK_LIBS
+ Boost::filesystem
+ Boost::system)
endif()
if(ARROW_S3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we change the default C++ version, then it will pick C++14 for the entire compilation, and looks like from this conversation - https://issues.apache.org/jira/browse/ARROW-2034?focusedCommentId=17463318&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17463318, it is not desired, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that @pitrou said that we can use C++14 features in cpp/src/arrow/filesystem/azurefs*.cc
but we must not use C++14 features in other *.cc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Also, .h
files should not use any C++14 features either.
cpp/src/arrow/CMakeLists.txt
Outdated
@@ -573,6 +579,7 @@ add_arrow_lib(arrow | |||
SHARED_INSTALL_INTERFACE_LIBS | |||
${ARROW_SHARED_INSTALL_INTERFACE_LIBS}) | |||
|
|||
target_link_libraries(arrow_shared PUBLIC LibXml2::LibXml2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you do this for Azure::azure-storage-common
for not arrow_shared
?
@github-actions autotune |
@github-actions rebase |
Hi @kou, Currently |
Could you enable it in |
Rebase should fix the R build failures, though there are a couple of merge conflicts to resolve |
Is there a plan for when ADLS/Azure Blob file system will be available in an Arrow release? In the meantime, what is the best way to use Apache Arrow with ADLS? Has anyone figured out how to authenticate to Azure using the existing HDFS API (which seems to only support Kerberos)? |
15.0.0 or 16.0.0? If you join developing this, you may be able to control it. |
After a considerable amount of working through the legal aspect of contributing back to open source that is in place at the company where I am presently employed, I am cleared to work on this ADLS stuff. So where are things? The last I recall someone was planning on contributing the skeleton iteration. Did that come to pass? Thanks |
Skeleton: #35701 I can probably write some more GitHub issues for next few parts, if that would be helpful. |
### Rationale for this change We want a C++ implementation of an Azure filesystem. Reading files is the first step. ### What changes are included in this PR? Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from #12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. I've made a few changes to the implementation from #12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. ### Are these changes tested? Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on #12914 which recommended this approach. There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. ### Are there any user-facing changes? Yes. File reads using the Azure filesystem are now supported. * Closes: #37511 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…he#38269) ### Rationale for this change We want a C++ implementation of an Azure filesystem. Reading files is the first step. ### What changes are included in this PR? Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. ### Are these changes tested? Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on apache#12914 which recommended this approach. There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. ### Are there any user-facing changes? Yes. File reads using the Azure filesystem are now supported. * Closes: apache#37511 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
Thanks Tom. Nearest I can tell, the skeleton and file reads have been implemented. Is that correct? Also, we are in need of the ability to read/write from/to ADLS. In a previous comment, it was mentioned that this was included in the code that had been written for the blob stuff. Is that correct? Finally, it seems that the write is being developed. Where can I help? Thanks, Jerry |
Currently skeleton, file reads and Regarding ADLS vs blob the plan is to support blob and ADLS gen2. I think ADLS gen1 will not be supported since its deprecated by microsoft and its very different to the other 2. |
…Azure C++ filesystem (apache#36988) ### Rationale for this change We need to write tests for apache#18014. azurite is like a fake Azure blob storage so it can be used to write integration tests ### What changes are included in this PR? Extract the `azurite` related changes from apache#12914 to create a smaller PR that's easier to review. I have made very minimal changes compared to that PR. Currently `azurite` is configured for all the environments where `ARROW_AZURE` was enabled by apache#35701. I assume its deliberate that its not enabled yet for windows, alpine, conda, debian or fedora builds. ### Are these changes tested? Its tested by there aren't really any good tests in this PR. I used this `azurite` config in apache#36835 to make an integration test that uses the Azure C++ SDK. On its own we can't really write tests for this `azurite` setup PR. ### Are there any user-facing changes? No * Closes: apache#36886 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change We want to use the Azure SDK for C++ to read/write to Azure blob storage. Obviously this is pretty important for building an `AzureFileSystem`. ### What changes are included in this PR? Builds the the relevant parts of the azure SDK as a cmake external project. Adds a couple of simple tests that just assert that the Azure SDK is working and a couple of lines in `AzureFileSystem` to initialise the blob storage client to ensure the build is working correctly in all environments. I started with the build setup from apache#12914 but I did make few changes. 1. Although its atypical for this project we chose to switch from cmake's `ExternalProject` to `FetchContent`. `FetchContent` is recomended by the Azure docs https://github.com/Azure/azure-sdk-for-cpp#cmake-project--fetch-content. It also solves a few problems including: automatically linking system curl and ssl instead of bootstrapping vcpkg and installing curl and ssl from there. 2. Only build one version of the Azure SDK for C++ because it contains all the components. Previously we were unnecessarily building 5 different versions of the whole thing on top of each other. This created race conditions for which version each component came from. 3. We are using `azure-core_1.10.2` which is a very recent version. There are a couple of important reasons for this 1. [an important managed identity fix](Azure/azure-sdk-for-cpp#4723), 2. [fixed support for curl versions < 7.71.0](Azure/azure-sdk-for-cpp#4792). There will be follow up PRs to enable Azure in the manylinux builds. We need to update `vcpkg` first so we can get a version of the Azure SDK which contains [an important managed identity fix](Azure/azure-sdk-for-cpp#4723). ### Are these changes tested? Yes. There is a simple test that just runs the Azure client against azurite. Additionally just initialising the client in `AzureFileSystem` goes a long way towards ensuring the build is working. ### Are there any user-facing changes? No * Closes: apache#29847 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Co-authored-by: shefali singh <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…he#38269) ### Rationale for this change We want a C++ implementation of an Azure filesystem. Reading files is the first step. ### What changes are included in this PR? Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. ### Are these changes tested? Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on apache#12914 which recommended this approach. There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. ### Are there any user-facing changes? Yes. File reads using the Azure filesystem are now supported. * Closes: apache#37511 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
### Rationale for this change Writing files is an important part of the filesystem ### What changes are included in this PR? Implements `OpenOutputStream` and `OpenAppendStream` for Azure. - Initially I started with the implementation from #12914 but I made quite a few changes: - Removed the different code path for hierarchical namespace accounts. There should not be any performance advantage to using special APIs only available on hierachical namespace accounts. - Only implement `ObjectAppendStream`, not `ObjectOutputStream`. `OpenOutputStream` is implemented by truncating the existing file then returning a `ObjectAppendStream`. - More precise use of `try` `catch`. Every call to Azure is wrapped in a `try` `catch` and should return a descriptive error status. - Avoid unnecessary calls to Azure. For example we now maintain the block list in memory and commit it only once on flush. #12914 committed the block list after each block that was staged and on flush queried Azure to get the list of uncommitted blocks. The new approach is consistent with the Azure fsspec implementation https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/adlfs/spec.py#L2009 - Adjust the block_ids slightly to minimise the risk of them conflicting with blocks written by other blob storage clients. - Implement metadata writes. Includes adding default metadata to `AzureOptions`. - Tests are based on the `gscfs_test.cc` but I added a couple of extra. - Handle the TODO(GH-38780) comments for using the Azure fs to write data in tests ### Are these changes tested? Yes. Everything should be covered by azurite tests ### Are there any user-facing changes? Yes. The Azure filesystem now supports file writes. * Closes: #38333 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change We want to use the Azure SDK for C++ to read/write to Azure blob storage. Obviously this is pretty important for building an `AzureFileSystem`. ### What changes are included in this PR? Builds the the relevant parts of the azure SDK as a cmake external project. Adds a couple of simple tests that just assert that the Azure SDK is working and a couple of lines in `AzureFileSystem` to initialise the blob storage client to ensure the build is working correctly in all environments. I started with the build setup from apache#12914 but I did make few changes. 1. Although its atypical for this project we chose to switch from cmake's `ExternalProject` to `FetchContent`. `FetchContent` is recomended by the Azure docs https://github.com/Azure/azure-sdk-for-cpp#cmake-project--fetch-content. It also solves a few problems including: automatically linking system curl and ssl instead of bootstrapping vcpkg and installing curl and ssl from there. 2. Only build one version of the Azure SDK for C++ because it contains all the components. Previously we were unnecessarily building 5 different versions of the whole thing on top of each other. This created race conditions for which version each component came from. 3. We are using `azure-core_1.10.2` which is a very recent version. There are a couple of important reasons for this 1. [an important managed identity fix](Azure/azure-sdk-for-cpp#4723), 2. [fixed support for curl versions < 7.71.0](Azure/azure-sdk-for-cpp#4792). There will be follow up PRs to enable Azure in the manylinux builds. We need to update `vcpkg` first so we can get a version of the Azure SDK which contains [an important managed identity fix](Azure/azure-sdk-for-cpp#4723). ### Are these changes tested? Yes. There is a simple test that just runs the Azure client against azurite. Additionally just initialising the client in `AzureFileSystem` goes a long way towards ensuring the build is working. ### Are there any user-facing changes? No * Closes: apache#29847 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Co-authored-by: shefali singh <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…he#38269) ### Rationale for this change We want a C++ implementation of an Azure filesystem. Reading files is the first step. ### What changes are included in this PR? Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. ### Are these changes tested? Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on apache#12914 which recommended this approach. There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. ### Are there any user-facing changes? Yes. File reads using the Azure filesystem are now supported. * Closes: apache#37511 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
### Rationale for this change Writing files is an important part of the filesystem ### What changes are included in this PR? Implements `OpenOutputStream` and `OpenAppendStream` for Azure. - Initially I started with the implementation from apache#12914 but I made quite a few changes: - Removed the different code path for hierarchical namespace accounts. There should not be any performance advantage to using special APIs only available on hierachical namespace accounts. - Only implement `ObjectAppendStream`, not `ObjectOutputStream`. `OpenOutputStream` is implemented by truncating the existing file then returning a `ObjectAppendStream`. - More precise use of `try` `catch`. Every call to Azure is wrapped in a `try` `catch` and should return a descriptive error status. - Avoid unnecessary calls to Azure. For example we now maintain the block list in memory and commit it only once on flush. apache#12914 committed the block list after each block that was staged and on flush queried Azure to get the list of uncommitted blocks. The new approach is consistent with the Azure fsspec implementation https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/adlfs/spec.py#L2009 - Adjust the block_ids slightly to minimise the risk of them conflicting with blocks written by other blob storage clients. - Implement metadata writes. Includes adding default metadata to `AzureOptions`. - Tests are based on the `gscfs_test.cc` but I added a couple of extra. - Handle the TODO(apacheGH-38780) comments for using the Azure fs to write data in tests ### Are these changes tested? Yes. Everything should be covered by azurite tests ### Are there any user-facing changes? Yes. The Azure filesystem now supports file writes. * Closes: apache#38333 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
We've implemented this by separated PRs. |
No description provided.