GH-37511: [C++] Implement file reads for Azure filesystem #38269

Merged

bkietz merged 40 commits into apache:main from Tom-Newton:tomnewton/azure_filesystem_reads/GH-37511

Oct 19, 2023

Contributor

Tom-Newton commented Oct 14, 2023

Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step.

What changes are included in this PR?

Adds an implementation of io::RandomAccessFile for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from #12914. Using this io::RandomAccessFile implementation we implement the input file and stream methods of the AzureFileSystem.
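
As a rough illustration of the random-access contract mentioned above, the sketch below models a blob as an in-memory byte string and serves `ReadAt`-style requests by clamping the requested range to the blob size, which is roughly what a ranged download has to do. `InMemoryBlob` and its methods are hypothetical names for this sketch only; they are not Arrow or Azure SDK APIs, and the real implementation issues network requests and may report an error for out-of-range positions instead of returning an empty result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical stand-in for a remote blob; not an Arrow or Azure SDK type.
class InMemoryBlob {
 public:
  explicit InMemoryBlob(std::string content) : content_(std::move(content)) {}

  int64_t size() const { return static_cast<int64_t>(content_.size()); }

  // Returns up to `nbytes` bytes starting at `position` without moving the
  // sequential cursor. Requests that run past the end are truncated
  // (an assumption of this sketch).
  std::string ReadAt(int64_t position, int64_t nbytes) const {
    assert(position >= 0 && nbytes >= 0);
    if (position >= size()) return "";
    const int64_t length = std::min(nbytes, size() - position);
    return content_.substr(static_cast<std::size_t>(position),
                           static_cast<std::size_t>(length));
  }

  // Sequential read built on ReadAt: advances the cursor by the number of
  // bytes actually returned.
  std::string Read(int64_t nbytes) {
    std::string out = ReadAt(cursor_, nbytes);
    cursor_ += static_cast<int64_t>(out.size());
    return out;
  }

  int64_t Tell() const { return cursor_; }

 private:
  std::string content_;
  int64_t cursor_ = 0;
};
```

With a blob holding `hello world`, `ReadAt(6, 100)` in this sketch yields `world` and leaves `Tell()` unchanged, while `Read(5)` yields `hello` and moves the cursor to 5. Building the stream-style `Read` on top of `ReadAt` is one way a single implementation could back both the input file and input stream methods.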

I've made a few changes to the implementation from #12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with azurite; they are only beneficial for listing operations on HNS-enabled accounts, and detecting an HNS-enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding two different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me, except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 and GCS filesystems.

Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem, with minimal changes. I remember reading a review comment on #12914 which recommended this approach.
There are a few places where the GCS tests relied on file writes or file info methods, so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFileSystem when the relevant methods are implemented.

Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported.

Closes: [C++] Implement file reads for Azure filesystem #37511

Tom-Newton added 11 commits

September 29, 2023 12:45


          Paste in AzurePath and ObjectInputFile from apache#12914

dbbaf92


          Paste in TestAzureFileSystem and an example test (TestAzureFileSystem…

1db80b3

…, FromAccountKey) from apache#12914


          Paste in input file test cases from gscfs_test.cc

81d1523


          Minimal changes for successful build

9e31d1a


          Paste in ConfigureAccountKeyCredentials from apache#12914

5e4803b


          TestAzureFileSystem builds successfully

cc552bf


          Paste in OpenInputFile from apache#12914

8874fbe


          First test builds successfully


          First ReadAt test passes

8d574dc


          Majority of tests working

07947f3


          Implement open file from info and enable the relevant tests

bb8421f

github-actions bot commented Oct 14, 2023

⚠️ GitHub issue #37511 has been automatically assigned in GitHub to PR creator.

github-actions bot added Component: C++ awaiting review labels

Tom-Newton added 12 commits

October 14, 2023 18:10


          Add OpenInputStream implementation

e8cde8f


          Paste in input stream tests from gcsfd_test.cc

0a693c8


          Fix input stream tests

8ff5684


          Fix metadata test

bd85d5a


          Use basic blob client for tests

6cb904f


          Rename file_client -> blob_client

3ff7051


          Fix implementation to pass OpenInputStreamUri test

e70d01d


          Adjust some error handling

00b8139


          Tidy tests

b14b83e


          Make AzurePath consistent with changes from apache#11997

918c68d


          Tidy path validation

f696aee


          Remote un-needed includes for datalake client

98e019c

Tom-Newton force-pushed the tomnewton/azure_filesystem_reads/GH-37511 branch from a872890 to 98e019c Compare

October 15, 2023 14:20

Tom-Newton added 3 commits

October 15, 2023 15:22


          Remove one of the placeholder tests

0c58509


          Tidy

8505e93


          Better error messges

2b048ad

felipecrv requested changes

Resolved review threads: 4 on cpp/src/arrow/filesystem/azurefs_test.cc (all outdated), 6 on cpp/src/arrow/filesystem/azurefs.cc (4 outdated).


          PR comments

cad62df

Tom-Newton requested a review from felipecrv

October 18, 2023 09:31

felipecrv requested changes

Resolved review threads: 3 on cpp/src/arrow/filesystem/azurefs.cc (2 outdated), 1 on cpp/src/arrow/filesystem/azurefs_test.cc.

bkietz requested changes

Member

bkietz left a comment

Thanks for working on this!

Resolved review threads: 4 on cpp/src/arrow/filesystem/azurefs.cc (all outdated), 1 on cpp/src/arrow/filesystem/azurefs_test.cc.

cpp/src/arrow/filesystem/azurefs_test.cc

+TEST_F(TestAzureFileSystem, OpenInputFileInfoInvalid) {
+  // TODO: When implemented use ASSERT_OK_AND_ASSIGN(info,
+  // fs->GetFileInfo(PreexistingContainerPath()));

Member

bkietz Oct 18, 2023

Not necessary in the scope of this PR but FWIW this should be as simple as a call to BlobClient::GetProperties right?

Contributor Author

Tom-Newton Oct 18, 2023

Basically yes, but I was planning to do it in a separate PR to keep this as minimal as possible to file reads.

Contributor Author

Tom-Newton Oct 18, 2023

Actually it might require a check to determine if the storage account has hierarchical namespace enabled, at which point it could get quite complicated... but that is for another PR.

Contributor Author

Tom-Newton Oct 18, 2023

I had a look in a bit more detail. Some logic will be required around the root of the container and the root of the storage account, but nothing too complicated is needed for hierarchical namespace (ADLS gen2) storage accounts. I've created a GitHub issue for it: #38335
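
To make the "root of the container and the storage account" cases concrete, here is a minimal sketch of how a path might be classified before deciding what kind of lookup to perform. All names (`PathKind`, `SplitPath`, `SplitContainerAndPath`) are hypothetical and introduced only for illustration; the sketch assumes paths of the form `container/dir/file` with no leading slash and no URI scheme, and it does not reflect how the follow-up PR implements this.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification of a filesystem path; names are for this
// sketch only.
enum class PathKind { kAccountRoot, kContainerRoot, kBlobPath };

struct SplitPath {
  std::string container;
  std::string path_in_container;
  PathKind kind;
};

// Splits "container/dir/file" at the first '/'. An empty string denotes the
// storage account root; a bare container name (optionally followed by a
// single trailing slash) denotes the container root.
SplitPath SplitContainerAndPath(const std::string& full_path) {
  assert(full_path.empty() || full_path.front() != '/');
  if (full_path.empty()) {
    return {"", "", PathKind::kAccountRoot};
  }
  const auto slash = full_path.find('/');
  if (slash == std::string::npos || slash + 1 == full_path.size()) {
    return {full_path.substr(0, slash), "", PathKind::kContainerRoot};
  }
  return {full_path.substr(0, slash), full_path.substr(slash + 1),
          PathKind::kBlobPath};
}
```

Under this sketch, only the `kBlobPath` case would map naturally onto a per-blob properties request such as the `BlobClient::GetProperties` call suggested above; the two root cases would presumably need separate handling.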

Resolved review thread: 1 on cpp/src/arrow/filesystem/azurefs_test.cc (outdated).

github-actions bot added awaiting changes and removed awaiting committer review labels

Tom-Newton mentioned this pull request

[C++] Return filesystem properties not user defined metadata in Azure file reads #38330

Closed


          Reference issue for better metadata response

10c25e2

github-actions bot added awaiting change review and removed awaiting changes labels

Tom-Newton and others added 7 commits

October 18, 2023 18:49


          Avoid unnecessary copying of AzurePath

b9f1eaf


          Better status message for invalid path

835e6ab


          Remove unnecessary and dangerous path string parsing

8e9a985


          Update cpp/src/arrow/filesystem/azurefs_test.cc

7f329cc

Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>


          Avoid designated initializers

13929d8


          Make another test more concise

f652a4f


          Fix a clang build warning

26d425e

Tom-Newton requested a review from bkietz

October 18, 2023 22:28

Contributor

felipecrv commented Oct 18, 2023

There is a formatting issue:

--- /arrow/cpp/src/arrow/filesystem/azurefs_test.cc
+++ /arrow/cpp/src/arrow/filesystem/azurefs_test.cc (after clang format)
@@ -308,8 +308,8 @@
 
   std::shared_ptr<const KeyValueMetadata> actual;
   ASSERT_OK_AND_ASSIGN(actual, stream->ReadMetadata());
-  // TODO(GH-38330): This is asserting that the user defined metadata is returned but this is 
-  // probably not the correct behaviour.
+  // TODO(GH-38330): This is asserting that the user defined metadata is returned but this
+  // is probably not the correct behaviour.
   ASSERT_OK_AND_EQ("value0", actual->Get("key0"));
 }
 
/arrow/cpp/src/arrow/filesystem/azurefs_test.cc had clang-format style issues


          Reference follow up github issues in TODO comments

13c0d1c

felipecrv approved these changes

bkietz approved these changes

bkietz merged commit 23dfd0e into apache:main

35 checks passed

bkietz removed the awaiting change review label

github-actions bot added the awaiting merge label

conbench-apache-arrow bot commented Oct 22, 2023

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 23dfd0e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request


          apacheGH-37511: [C++] Implement file reads for Azure filesystem (apac…

f87077d

…he#38269)

### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step. 

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. 

I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`; they are only beneficial for listing operations on HNS-enabled accounts, and detecting an HNS-enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding two different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me, except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 and GCS filesystems.

### Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem, with minimal changes. I remember reading a review comment on apache#12914 which recommended this approach.
There are a few places where the GCS tests relied on file writes or file info methods, so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFileSystem when the relevant methods are implemented.

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported. 

* Closes: apache#37511

Lead-authored-by: Thomas Newton <thomas.w.newton@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request


          apacheGH-37511: [C++] Implement file reads for Azure filesystem (apac…

2783b95

…he#38269)

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request


          apacheGH-37511: [C++] Implement file reads for Azure filesystem (apac…

e67fefa

…he#38269)

Labels

awaiting merge Component: C++