Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37511: [C++] Implement file reads for Azure filesystem #38269

Merged

Conversation

Tom-Newton
Copy link
Contributor

@Tom-Newton Tom-Newton commented Oct 14, 2023

Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step.

What changes are included in this PR?

Adds an implementation of io::RandomAccessFile for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from #12914. Using this io::RandomAccessFile implementation we implement the input file and stream methods of the AzureFileSystem.

I've made a few changes to the implementation from #12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with azurite, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems.

Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on #12914 which recommended this approach.
There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented.

Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported.

@github-actions
Copy link

⚠️ GitHub issue #37511 has been automatically assigned in GitHub to PR creator.

@Tom-Newton Tom-Newton force-pushed the tomnewton/azure_filesystem_reads/GH-37511 branch from a872890 to 98e019c Compare October 15, 2023 14:20
cpp/src/arrow/filesystem/azurefs_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs_test.cc Show resolved Hide resolved
Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs_test.cc Show resolved Hide resolved

TEST_F(TestAzureFileSystem, OpenInputFileInfoInvalid) {
// TODO: When implemented use ASSERT_OK_AND_ASSIGN(info,
// fs->GetFileInfo(PreexistingContainerPath()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary in the scope of this PR but FWIW this should be as simple as a call to BlobClient::GetProperties right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically yes, but I was planning to do it in a separate PR to keep this as minimal as possible to file reads.

Copy link
Contributor Author

@Tom-Newton Tom-Newton Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it might require a check to determine if the storage account has hierarchical namespace enabled, at which point it could get quite complicated... but that is for another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a look in a bit more detail. There will be a bit of logic required around the root of the container and the storage account but no need for anything too complicated for hierarchical namespace (ADLS gen2) storage accounts. I've create a github issue for it #38335

cpp/src/arrow/filesystem/azurefs_test.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 18, 2023
@felipecrv
Copy link
Contributor

There is a formatting issue:

--- /arrow/cpp/src/arrow/filesystem/azurefs_test.cc
+++ /arrow/cpp/src/arrow/filesystem/azurefs_test.cc (after clang format)
@@ -308,8 +308,8 @@
 
   std::shared_ptr<const KeyValueMetadata> actual;
   ASSERT_OK_AND_ASSIGN(actual, stream->ReadMetadata());
-  // TODO(GH-38330): This is asserting that the user defined metadata is returned but this is 
-  // probably not the correct behaviour.
+  // TODO(GH-38330): This is asserting that the user defined metadata is returned but this
+  // is probably not the correct behaviour.
   ASSERT_OK_AND_EQ("value0", actual->Get("key0"));
 }
 
/arrow/cpp/src/arrow/filesystem/azurefs_test.cc had clang-format style issues

@bkietz bkietz merged commit 23dfd0e into apache:main Oct 19, 2023
36 checks passed
@bkietz bkietz removed the awaiting change review Awaiting change review label Oct 19, 2023
@github-actions github-actions bot added the awaiting merge Awaiting merge label Oct 19, 2023
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 23dfd0e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…he#38269)

### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step. 

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. 

I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. 

### Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on apache#12914 which recommended this approach. 
There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. 

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported. 

* Closes: apache#37511

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…he#38269)

### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step. 

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. 

I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. 

### Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on apache#12914 which recommended this approach. 
There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. 

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported. 

* Closes: apache#37511

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…he#38269)

### Rationale for this change

We want a C++ implementation of an Azure filesystem. Reading files is the first step. 

### What changes are included in this PR?

Adds an implementation of `io::RandomAccessFile` for Azure blob storage (with or without hierarchical namespace (HNS) a.k.a datalake gen 2). This is largely copied from apache#12914. Using this `io::RandomAccessFile` implementation we implement the input file and stream methods of the `AzureFileSystem`. 

I've made a few changes to the implementation from apache#12914. The biggest one is removing use of the Azure SDK datalake APIs. These APIs cannot be tested with `azurite`, they are only beneficial for listing operations on HNS enabled accounts and detecting a HNS enabled account is quite difficult (unless you use significantly elevated Azure permissions). Adding 2 different code paths for normal blob storage and datalake gen 2 seems like a bad idea to me except in cases where there is a performance advantage. I also made a few other tweaks to some of the error handling and to make things more consistent with the S3 or GCS filesystems. 

### Are these changes tested?

Yes. The tests are all based on the tests from the GCS filesystem with minimal chantges. I remember reading a review comment on apache#12914 which recommended this approach. 
There are a few places where the GCS tests relied on file writes or file info methods so I've replaced those with direct calls to the Azure blob client and left TODO comments saying to switch them to use the AzureFilesystem when the relevant methods are implemented. 

### Are there any user-facing changes?

Yes. File reads using the Azure filesystem are now supported. 

* Closes: apache#37511

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Implement file reads for Azure filesystem
4 participants