Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-2034: [C++] Filesystem implementation for Azure Blob Storage #12914

Closed
wants to merge 38 commits into from

Conversation

shefali163
Copy link

No description provided.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@nealrichardson nealrichardson requested a review from kou April 21, 2022 11:57
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add -DARROW_AZURE to ci/scripts/cpp_build.sh like ARROW_GCS and ARROW_S3?

set(ARROW_STATIC_LINK_LIBS)
set(ARROW_AZURE_STATIC_LINK_LIBS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use ARROW_LINK_LIBS and ARROW_STATIC_LINK_LIBS instead of add new ARROW_AZURE_* variables like S3 and GCS?

@@ -46,6 +46,260 @@ if(WIN32 AND CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
list(APPEND ARROW_BOOST_PROCESS_COMPILE_DEFINITIONS "BOOST_USE_WINDOWS_H=1")
endif()

function(ADD_ARROW_LIB_AZURE LIB_NAME)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use add_arrow_lib instead of defining this?

endif()
endfunction()

function(ADD_TEST_CASE_AZURE REL_TEST_NAME)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use add_test_case instead of defining this?

@@ -4513,6 +4513,53 @@ if(ARROW_S3)
endif()
endif()

macro(build_azuresdk)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you build Azure C++ SDK by externalproject_add in this?

Comment on lines 4559 to 4560
message(STATUS "Found AZURE SDK headers: ${AZURESDK_INCLUDE_DIR}")
message(STATUS "Found AZURE SDK libraries: ${AZURESDK_LINK_LIBRARIES}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
message(STATUS "Found AZURE SDK headers: ${AZURESDK_INCLUDE_DIR}")
message(STATUS "Found AZURE SDK libraries: ${AZURESDK_LINK_LIBRARIES}")
message(STATUS "Found Azure SDK headers: ${AZURESDK_INCLUDE_DIR}")
message(STATUS "Found Azure SDK libraries: ${AZURESDK_LINK_LIBRARIES}")

Comment on lines 18 to 19
set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 37 to 45
#ifdef _WIN32
// Undefine preprocessor macros that interfere with AWS function / method names
#ifdef GetMessage
#undef GetMessage
#endif
#ifdef GetObject
#undef GetObject
#endif
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this for Azure SDK for C++?

void AzureOptions::ConfigureConnectionStringCredentials(
const std::string& connection_string_uri) {
auto account_name =
Azure::Storage::_internal::ParseConnectionString(connection_string_uri).AccountName;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use public API instead of internal API here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will check for a public API and update

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No public API is available for this, Added implementation of ParseConnectionString instead.


Result<AzureOptions> AzureOptions::FromUri(const Uri& uri, std::string* out_path) {
AzureOptions options;
AZURE_ASSERT(uri.has_host());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this may call std::abort().
Generally, we should not use std::abort() in a library.

If we need to use std::abort(), please use ARROW_DCHECK family.

Comment on lines 99 to 108

// private:
// const std::string& AdlsGen2AccountName = std::getenv("ADLS_GEN2_ACCOUNT_NAME");
// const std::string& AdlsGen2AccountKey = std::getenv("ADLS_GEN2_ACCOUNT_KEY");
// const std::string& AdlsGen2ConnectionStringValue = std::getenv(
// "ADLS_GEN2_CONNECTION_STRING");
// const std::string& AdlsGen2SasUrl = std::getenv("ADLS_GEN2_SASURL");
// const std::string& AadTenantIdValue = std::getenv("AAD_TENANT_ID");
// const std::string& AadClientIdValue = std::getenv("AAD_CLIENT_ID");
// const std::string& AadClientSecretValue = std::getenv("AAD_CLIENT_SECRET");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove them?

@shefali163 shefali163 requested a review from kou May 26, 2022 14:52
@@ -108,6 +108,7 @@ cmake -G "%GENERATOR%" %CMAKE_ARGS% ^
-DARROW_PARQUET=ON ^
-DARROW_PYTHON=ON ^
-DARROW_S3=%ARROW_S3% ^
-DARROW_AZURE=OFF ^
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you keep this list in alphabetical order?

Comment on lines 4687 to 4689
foreach(AZURESDK_LIBRARY_CPP ${AZURESDK_LIBRARIES_CPP})
find_package(${AZURESDK_LIBRARY_CPP} CONFIG REQUIRED)
endforeach()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needless.

foreach(AZURESDK_LIBRARY_CPP ${AZURESDK_LIBRARIES_CPP})
find_package(${AZURESDK_LIBRARY_CPP} CONFIG REQUIRED)
endforeach()
include_directories(SYSTEM ${AZURESDK_INCLUDE_DIR})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needless.


set_property(TARGET Azure::azure-core
APPEND
PROPERTY INTERFACE_LINK_LIBRARIES CURL::libcurl LibXml2::LibXml2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to call find_curl() for CURL::libcurl and find_package(LibXml2 REQUIRED) for LibXml2::LibXml2.

Status CreateDir(const std::string& path, bool recursive = true) override;

Status DeleteDir(const std::string& path) override;
Status DeleteDirContents(const std::string& path) override;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Status DeleteDirContents(const std::string& path) override;
Status DeleteDirContents(const std::string& path,
bool missing_dir_ok = false) override;

STATIC_LINK_LIBS
${ARROW_STATIC_LINK_LIBS})

set_target_properties(azurefs_objlib PROPERTIES CXX_STANDARD 14 CXX_STANDARD_REQUIRED ON)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry. I forgot that we embed filesystem modules to libarrow.so instead of creating separated libarrow_XXX.so. (We don't have separated CMake target for libarrow.so and filesystem modules.)

How about changing the default C++ version?

diff --git a/cpp/cmake_modules/SetupCxxFlags.cmake b/cpp/cmake_modules/SetupCxxFlags.cmake
index d3a2a1a2d2..cdaafe379b 100644
--- a/cpp/cmake_modules/SetupCxxFlags.cmake
+++ b/cpp/cmake_modules/SetupCxxFlags.cmake
@@ -118,12 +118,16 @@ if(NOT DEFINED CMAKE_C_STANDARD)
   set(CMAKE_C_STANDARD 11)
 endif()
 
-# This ensures that things like c++11 get passed correctly
+# This ensures that things like c++11/c++14 get passed correctly
 if(NOT DEFINED CMAKE_CXX_STANDARD)
-  set(CMAKE_CXX_STANDARD 11)
+  if(ARROW_AZURE)
+    set(CMAKE_CXX_STANDARD 14)
+  else()
+    set(CMAKE_CXX_STANDARD 11)
+  endif()
 endif()
 
-# We require a C++11 compliant compiler
+# We require a C++11/14 compliant compiler
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
 
 # ARROW-6848: Do not use GNU (or other CXX) extensions
diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt
index ec6cada1cd..1ded8e59d4 100644
--- a/cpp/src/arrow/CMakeLists.txt
+++ b/cpp/src/arrow/CMakeLists.txt
@@ -469,6 +469,12 @@ if(ARROW_FILESYSTEM)
        filesystem/path_util.cc
        filesystem/util_internal.cc)
 
+  if(ARROW_AZURE)
+    list(APPEND ARROW_SRCS filesystem/azurefs.cc filesystem/azurefs_mock.cc)
+    set_source_files_properties(filesystem/azurefs.cc filesystem/azurefs_mock.cc
+                                PROPERTIES SKIP_PRECOMPILE_HEADERS ON
+                                           SKIP_UNITY_BUILD_INCLUSION ON)
+  endif()
   if(ARROW_GCS)
     list(APPEND ARROW_SRCS filesystem/gcsfs.cc filesystem/gcsfs_internal.cc)
     set_source_files_properties(filesystem/gcsfs.cc filesystem/gcsfs_internal.cc
diff --git a/cpp/src/arrow/filesystem/CMakeLists.txt b/cpp/src/arrow/filesystem/CMakeLists.txt
index 819eca08cf..bbca231baf 100644
--- a/cpp/src/arrow/filesystem/CMakeLists.txt
+++ b/cpp/src/arrow/filesystem/CMakeLists.txt
@@ -28,8 +28,8 @@ add_arrow_test(filesystem-test
                EXTRA_LABELS
                filesystem)
 
-if(ARROW_GCS)
-  add_arrow_test(gcsfs_test
+if(ARROW_AZURE)
+  add_arrow_test(azurefs_test
                  EXTRA_LABELS
                  filesystem
                  EXTRA_LINK_LIBS
@@ -37,32 +37,13 @@ if(ARROW_GCS)
                  Boost::system)
 endif()
 
-if(ARROW_AZURE)
-  set(AZURE_SRCS)
-  list(APPEND
-        AZURE_SRCS
-        azurefs_mock.cc
-        azurefs.cc)
-
-  add_arrow_lib(azurefs
-                SOURCES
-                ${AZURE_SRCS}
-                SHARED_LINK_LIBS
-                ${ARROW_LINK_LIBS}
-                SHARED_PRIVATE_LINK_LIBS
-                ${ARROW_SHARED_PRIVATE_LINK_LIBS}
-                STATIC_LINK_LIBS
-                ${ARROW_STATIC_LINK_LIBS})
-
-  set_target_properties(azurefs_objlib PROPERTIES CXX_STANDARD 14 CXX_STANDARD_REQUIRED ON)
-
-  set(TEST_LIBS_AZURE ${ARROW_TEST_LINK_LIBS})
-  list(APPEND TEST_LIBS_AZURE azurefs_shared)
-  add_arrow_test(azurefs_test EXTRA_LABELS filesystem
-                  STATIC_LINK_LIBS
-                  ${TEST_LIBS_AZURE}
-  )
-  set_target_properties(arrow-azurefs-test PROPERTIES CXX_STANDARD 14 CXX_STANDARD_REQUIRED ON)
+if(ARROW_GCS)
+  add_arrow_test(gcsfs_test
+                 EXTRA_LABELS
+                 filesystem
+                 EXTRA_LINK_LIBS
+                 Boost::filesystem
+                 Boost::system)
 endif()
 
 if(ARROW_S3)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change the default C++ version, then it will pick C++14 for the entire compilation, and looks like from this conversation - https://issues.apache.org/jira/browse/ARROW-2034?focusedCommentId=17463318&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17463318, it is not desired, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that @pitrou said that we can use C++14 features in cpp/src/arrow/filesystem/azurefs*.cc but we must not use C++14 features in other *.cc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Also, .h files should not use any C++14 features either.

@shefali163 shefali163 requested a review from kou June 1, 2022 00:10
@@ -573,6 +579,7 @@ add_arrow_lib(arrow
SHARED_INSTALL_INTERFACE_LIBS
${ARROW_SHARED_INSTALL_INTERFACE_LIBS})

target_link_libraries(arrow_shared PUBLIC LibXml2::LibXml2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you do this for Azure::azure-storage-common for not arrow_shared?

@kou
Copy link
Member

kou commented Jun 1, 2022

@github-actions autotune

@kou
Copy link
Member

kou commented Jun 1, 2022

@github-actions rebase

@shefali163
Copy link
Author

Hi @kou, Currently ARROW_AZURE is turned OFF for all the PR builds, and azurefs.cc/azurefs_mock.cc is not being included in the target, For which particular builds we should turn this ON?

@kou
Copy link
Member

kou commented Jun 3, 2022

Could you enable it in .github/workflows/cpp.yml (Linux, macOS and Windows) and ci/docker/ubuntu-*-cpp.dockerfile?

@nealrichardson
Copy link
Member

Rebase should fix the R build failures, though there are a couple of merge conflicts to resolve

Tom-Newton added a commit to Tom-Newton/arrow that referenced this pull request Sep 29, 2023
@samkumar
Copy link

samkumar commented Oct 13, 2023

Is there a plan for when ADLS/Azure Blob file system will be available in an Arrow release?

In the meantime, what is the best way to use Apache Arrow with ADLS? Has anyone figured out how to authenticate to Azure using the existing HDFS API (which seems to only support Kerberos)?

@kou
Copy link
Member

kou commented Oct 15, 2023

15.0.0 or 16.0.0?

If you join developing this, you may be able to control it.

@av8or1
Copy link
Contributor

av8or1 commented Oct 15, 2023

After a considerable amount of working through the legal aspect of contributing back to open source that is in place at the company where I am presently employed, I am cleared to work on this ADLS stuff. So where are things? The last I recall someone was planning on contributing the skeleton iteration. Did that come to pass? Thanks

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Oct 15, 2023

So where are things? The last I recall someone was planning on contributing the skeleton iteration. Did that come to pass? Thanks

Skeleton: #35701
Build with Azure SDK: #36835 (there are a few related ones to fix some things)
In progress implementing file reads: #38269

I can probably write some more GitHub issues for next few parts, if that would be helpful.

bkietz added a commit that referenced this pull request Oct 19, 2023
### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step. 

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from #12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. 

I've made a few changes to the implementation from #12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. 

### Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on #12914 which recommended this approach. 
There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. 

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported. 

* Closes: #37511

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…he#38269)

### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step. 

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. 

I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. 

### Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on apache#12914 which recommended this approach. 
There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. 

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported. 

* Closes: apache#37511

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Tom-Newton added a commit to Tom-Newton/arrow that referenced this pull request Nov 9, 2023
@av8or1
Copy link
Contributor

av8or1 commented Nov 13, 2023

Thanks Tom. Nearest I can tell, the skeleton and file reads have been implemented. Is that correct? Also, we are in need of the ability to read/write from/to ADLS. In a previous comment, it was mentioned that this was included in the code that had been written for the blob stuff. Is that correct? Finally, it seems that the write is being developed. Where can I help? Thanks, Jerry

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Nov 13, 2023

Thanks Tom. Nearest I can tell, the skeleton and file reads have been implemented. Is that correct? Also, we are in need of the ability to read/write from/to ADLS. In a previous comment, it was mentioned that this was included in the code that had been written for the blob stuff. Is that correct? Finally, it seems that the write is being developed. Where can I help? Thanks, Jerry

Currently skeleton, file reads and GetFileInfo for a single file have been implemented. I started working on writes but I have not made much progress yet. Probably the best place to track the status is on #18014. I have been creating child github issues from there. Currently #38598 and #38597 are both un-claimed.

Regarding ADLS vs blob the plan is to support blob and ADLS gen2. I think ADLS gen1 will not be supported since its deprecated by microsoft and its very different to the other 2.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…Azure C++ filesystem (apache#36988)

### Rationale for this change
We need to write tests for apache#18014. azurite is like a fake Azure blob storage so it can be used to write integration tests

### What changes are included in this PR?
Extract the `azurite` related changes from apache#12914 to create a smaller PR that's easier to review. I have made very minimal changes compared to that PR. 

Currently `azurite` is configured for all the environments where `ARROW_AZURE` was enabled by apache#35701. I assume its deliberate that its not enabled yet for windows, alpine, conda, debian or fedora builds. 

### Are these changes tested?
Its tested by there aren't really any good tests in this PR. I used this `azurite` config in apache#36835 to make an integration test that uses the Azure C++ SDK. On its own we can't really write tests for this `azurite` setup PR. 

### Are there any user-facing changes?
No

* Closes: apache#36886

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
### Rationale for this change
We want to use the Azure SDK for C++ to read/write to Azure blob storage. Obviously this is pretty important for building an `AzureFileSystem`. 

### What changes are included in this PR?
Builds the the relevant parts of the azure SDK as a cmake external project. Adds a couple of simple tests that just assert that the Azure SDK is working and a couple of lines in `AzureFileSystem` to initialise the blob storage client to ensure the build is working correctly in all environments.

I started with the build setup from apache#12914 but I did make few changes.

1. Although its atypical for this project we chose to switch from cmake's `ExternalProject` to `FetchContent`. `FetchContent` is recomended by the Azure docs https://github.com/Azure/azure-sdk-for-cpp#cmake-project--fetch-content. It also solves a few problems including: automatically linking system curl and ssl instead of bootstrapping vcpkg and installing curl and ssl from there. 
2. Only build one version of the Azure SDK for C++ because it contains all the components. Previously we were unnecessarily building 5 different versions of the whole thing on top of each other. This created race conditions for which version each component came from.
3. We are using `azure-core_1.10.2` which is a very recent version. There are a couple of important reasons for this 1.  [an important managed identity fix](Azure/azure-sdk-for-cpp#4723), 2. [fixed support for curl versions < 7.71.0](Azure/azure-sdk-for-cpp#4792).

There will be follow up PRs to enable Azure in the manylinux builds. We need to update `vcpkg` first so we can get a version of the Azure SDK which contains [an important managed identity fix](Azure/azure-sdk-for-cpp#4723).

### Are these changes tested?
Yes. There is a simple test that just runs the Azure client against azurite. Additionally just initialising the client in `AzureFileSystem` goes a long way towards ensuring the build is working. 

### Are there any user-facing changes?
No
* Closes: apache#29847

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: shefali singh <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…he#38269)

### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step. 

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. 

I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. 

### Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on apache#12914 which recommended this approach. 
There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. 

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported. 

* Closes: apache#37511

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Tom-Newton added a commit to Tom-Newton/arrow that referenced this pull request Nov 16, 2023
Tom-Newton added a commit to Tom-Newton/arrow that referenced this pull request Nov 18, 2023
Tom-Newton added a commit to Tom-Newton/arrow that referenced this pull request Nov 18, 2023
Tom-Newton added a commit to Tom-Newton/arrow that referenced this pull request Nov 19, 2023
Tom-Newton added a commit to Tom-Newton/arrow that referenced this pull request Nov 19, 2023
Tom-Newton added a commit to Tom-Newton/arrow that referenced this pull request Nov 19, 2023
Tom-Newton added a commit to Tom-Newton/arrow that referenced this pull request Nov 19, 2023
kou added a commit that referenced this pull request Nov 21, 2023
### Rationale for this change
Writing files is an important part of the filesystem

### What changes are included in this PR?
Implements `OpenOutputStream` and `OpenAppendStream` for Azure.

- Initially I started with the implementation from #12914 but I made quite a few changes:
  - Removed the different code path for hierarchical namespace accounts. There should not be any performance advantage to using special APIs only available on hierachical namespace accounts. 
  - Only implement `ObjectAppendStream`, not `ObjectOutputStream`. `OpenOutputStream` is implemented by truncating the existing file then returning a `ObjectAppendStream`.
  - More precise use of `try` `catch`. Every call to Azure is wrapped in a `try` `catch` and should return a descriptive error status. 
  - Avoid unnecessary calls to Azure. For example we now maintain the block list in memory and commit it only once on flush. #12914 committed the block list after each block that was staged and on flush queried Azure to get the list of uncommitted blocks. The new approach is consistent with the Azure fsspec implementation https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/adlfs/spec.py#L2009
  - Adjust the block_ids slightly to minimise the risk of them conflicting with blocks written by other blob storage clients. 
  - Implement metadata writes. Includes adding default metadata to `AzureOptions`.
- Tests are based on the `gscfs_test.cc` but I added a couple of extra. 
- Handle the TODO(GH-38780) comments for using the Azure fs to write data in tests

### Are these changes tested?
Yes. Everything should be covered by azurite tests

### Are there any user-facing changes?
Yes. The Azure filesystem now supports file writes. 

* Closes: #38333

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
### Rationale for this change
We want to use the Azure SDK for C++ to read/write to Azure blob storage. Obviously this is pretty important for building an `AzureFileSystem`. 

### What changes are included in this PR?
Builds the the relevant parts of the azure SDK as a cmake external project. Adds a couple of simple tests that just assert that the Azure SDK is working and a couple of lines in `AzureFileSystem` to initialise the blob storage client to ensure the build is working correctly in all environments.

I started with the build setup from apache#12914 but I did make few changes.

1. Although its atypical for this project we chose to switch from cmake's `ExternalProject` to `FetchContent`. `FetchContent` is recomended by the Azure docs https://github.com/Azure/azure-sdk-for-cpp#cmake-project--fetch-content. It also solves a few problems including: automatically linking system curl and ssl instead of bootstrapping vcpkg and installing curl and ssl from there. 
2. Only build one version of the Azure SDK for C++ because it contains all the components. Previously we were unnecessarily building 5 different versions of the whole thing on top of each other. This created race conditions for which version each component came from.
3. We are using `azure-core_1.10.2` which is a very recent version. There are a couple of important reasons for this 1.  [an important managed identity fix](Azure/azure-sdk-for-cpp#4723), 2. [fixed support for curl versions < 7.71.0](Azure/azure-sdk-for-cpp#4792).

There will be follow up PRs to enable Azure in the manylinux builds. We need to update `vcpkg` first so we can get a version of the Azure SDK which contains [an important managed identity fix](Azure/azure-sdk-for-cpp#4723).

### Are these changes tested?
Yes. There is a simple test that just runs the Azure client against azurite. Additionally just initialising the client in `AzureFileSystem` goes a long way towards ensuring the build is working. 

### Are there any user-facing changes?
No
* Closes: apache#29847

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: shefali singh <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…he#38269)

### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step. 

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. 

I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. 

### Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on apache#12914 which recommended this approach. 
There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. 

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported. 

* Closes: apache#37511

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
### Rationale for this change
Writing files is an important part of the filesystem

### What changes are included in this PR?
Implements `OpenOutputStream` and `OpenAppendStream` for Azure.

- Initially I started with the implementation from apache#12914 but I made quite a few changes:
  - Removed the different code path for hierarchical namespace accounts. There should not be any performance advantage to using special APIs only available on hierachical namespace accounts. 
  - Only implement `ObjectAppendStream`, not `ObjectOutputStream`. `OpenOutputStream` is implemented by truncating the existing file then returning a `ObjectAppendStream`.
  - More precise use of `try` `catch`. Every call to Azure is wrapped in a `try` `catch` and should return a descriptive error status. 
  - Avoid unnecessary calls to Azure. For example we now maintain the block list in memory and commit it only once on flush. apache#12914 committed the block list after each block that was staged and on flush queried Azure to get the list of uncommitted blocks. The new approach is consistent with the Azure fsspec implementation https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/adlfs/spec.py#L2009
  - Adjust the block_ids slightly to minimise the risk of them conflicting with blocks written by other blob storage clients. 
  - Implement metadata writes. Includes adding default metadata to `AzureOptions`.
- Tests are based on the `gscfs_test.cc` but I added a couple of extra. 
- Handle the TODO(apacheGH-38780) comments for using the Azure fs to write data in tests

### Are these changes tested?
Yes. Everything should be covered by azurite tests

### Are there any user-facing changes?
Yes. The Azure filesystem now supports file writes. 

* Closes: apache#38333

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou
Copy link
Member

kou commented Apr 18, 2024

We've implemented this by separated PRs.

@kou kou closed this Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.