GH-38335: [C++] Implement GetFileInfo for a single file in Azure filesystem (#38505)

### Rationale for this change

`GetFileInfo` is an important part of an Arrow filesystem implementation. 

### What changes are included in this PR?
- Start `azurefs_internal` similar to GCS and S3 filesystems. 
- Implement `HierarchicalNamespaceDetector`. 
  - This does not use the obvious and simple implementation. It uses a more complicated option inspired by `hadoop-azure` that avoids requiring the significantly elevated permissions needed for `blob_service_client->GetAccountInfo()`.
  - This can't be detected at initialisation time of the filesystem because it requires a `container_name`. It's packed into its own class so that the result can be cached.
- Implement `GetFileInfo` for single paths. 
  - Supports hierarchical or flat namespace accounts and takes advantage of hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is noticeable just from running the `GetFileInfoObjectWithNestedStructure` test against real flat and hierarchical accounts: it takes about 3 seconds with hierarchical namespace and about 5 seconds with a flat namespace.
- Update tests with TODO(GH-38335) to now use this implementation of `GetFileInfo` to replace the temporary direct Azure SDK usage.
- Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage are not provided, the real blob storage tests are skipped.
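
The lookup order described in the list above can be sketched without the Azure SDK. This is a minimal, hypothetical model and not the code in this PR: `FakeContainer`, `NamespaceDetector` and `GetFileType` are illustrative names, an in-memory set stands in for blob storage, and trailing-slash handling, sizes and modification times are left out.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>

// Hypothetical stand-ins for illustration; not Arrow or Azure SDK types.
enum class FileType { NotFound, File, Directory };

// Models one container. `namespace_checks` counts simulated remote calls.
struct FakeContainer {
  bool hierarchical = false;          // account has explicit directories
  std::set<std::string> blobs;        // file blobs, sorted by name
  std::set<std::string> directories;  // explicit directories (hierarchical only)
  int namespace_checks = 0;
};

// Caches the namespace check, since the setting is shared by the whole account.
class NamespaceDetector {
 public:
  bool Enabled(FakeContainer& container) {
    if (!enabled_.has_value()) {
      ++container.namespace_checks;  // stands in for one request to the service
      enabled_ = container.hierarchical;
    }
    return enabled_.value();
  }

 private:
  std::optional<bool> enabled_;
};

FileType GetFileType(FakeContainer& container, NamespaceDetector& detector,
                     const std::string& path) {
  assert(!path.empty());  // container-level paths are handled elsewhere
  // Step 1: one properties lookup settles files and explicit directories.
  if (container.directories.count(path) > 0) return FileType::Directory;
  if (container.blobs.count(path) > 0) return FileType::File;
  // Step 2: with a hierarchical namespace a miss is final, so no listing is needed.
  if (detector.Enabled(container)) return FileType::NotFound;
  // Step 3: on a flat namespace, any blob under "path/" implies a directory.
  std::string prefix = path;
  if (prefix.back() != '/') prefix += '/';
  auto it = container.blobs.lower_bound(prefix);
  bool has_child =
      it != container.blobs.end() && it->compare(0, prefix.size(), prefix) == 0;
  return has_child ? FileType::Directory : FileType::NotFound;
}
```

In this model, a path that is missing on a hierarchical account costs one lookup, while the same miss on a flat account also needs a prefix listing. That is consistent with the timing difference reported above.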

### Are these changes tested?

Yes. There are new Azurite-based tests for everything that can be tested with Azurite.

There are also some tests that are designed to test against a real blob storage account. This is because [Azurite cannot emulate a hierarchical namespace account](Azure/Azurite#553). Additionally some of the behaviour used to detect a hierarchical namespace account is different on Azurite compared to a real flat namespace account. These tests will be automatically skipped unless environment variables are provided with details for connecting to the relevant real storage accounts. 
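
One way to gate tests on environment variables is sketched below. This is an assumption-laden illustration, not the fixture code from this PR: `RealAccount`, `AccountFromEnv` and the variable names used in the usage note are hypothetical, and the lookup function is injectable only so the sketch can be exercised without touching the real environment.

```cpp
#include <cassert>
#include <cstdlib>
#include <functional>
#include <optional>
#include <string>

// Hypothetical connection details for a real storage account.
struct RealAccount {
  std::string name;
  std::string key;
};

using EnvLookup = std::function<const char*(const char*)>;

// Default lookup: read from the process environment.
inline const char* LookupProcessEnv(const char* variable) {
  return std::getenv(variable);
}

// Returns std::nullopt when either variable is unset or empty, which a test
// fixture can treat as "skip this test".
std::optional<RealAccount> AccountFromEnv(const std::string& name_var,
                                          const std::string& key_var,
                                          const EnvLookup& lookup = LookupProcessEnv) {
  assert(static_cast<bool>(lookup));  // a callable lookup is required
  const char* name = lookup(name_var.c_str());
  const char* key = lookup(key_var.c_str());
  if (name == nullptr || key == nullptr || *name == '\0' || *key == '\0') {
    return std::nullopt;
  }
  return RealAccount{name, key};
}
```

With GoogleTest, a fixture's `SetUp()` could call something like `AccountFromEnv("EXAMPLE_ACCOUNT_NAME", "EXAMPLE_ACCOUNT_KEY")` and invoke `GTEST_SKIP()` when the result is empty, so the suite still passes on machines without credentials.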

Initially I based the tests on the GCS filesystem but I added a few extras where I thought it was appropriate. 

### Are there any user-facing changes?
Yes. `GetFileInfo` is now supported on the Azure filesystem. 

* Closes: #38335

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Tom-Newton and kou authored Nov 9, 2023
1 parent db19a35 commit 75a0403
Showing 7 changed files with 497 additions and 97 deletions.
4 changes: 2 additions & 2 deletions cpp/src/arrow/CMakeLists.txt
@@ -502,8 +502,8 @@ if(ARROW_FILESYSTEM)
filesystem/util_internal.cc)

if(ARROW_AZURE)
list(APPEND ARROW_SRCS filesystem/azurefs.cc)
set_source_files_properties(filesystem/azurefs.cc
list(APPEND ARROW_SRCS filesystem/azurefs.cc filesystem/azurefs_internal.cc)
set_source_files_properties(filesystem/azurefs.cc filesystem/azurefs_internal.cc
PROPERTIES SKIP_PRECOMPILE_HEADERS ON
SKIP_UNITY_BUILD_INCLUSION ON)
endif()
158 changes: 134 additions & 24 deletions cpp/src/arrow/filesystem/azurefs.cc
@@ -16,8 +16,10 @@
// under the License.

#include "arrow/filesystem/azurefs.h"
#include "arrow/filesystem/azurefs_internal.h"

#include <azure/storage/blobs.hpp>
#include <azure/storage/files/datalake.hpp>

#include "arrow/buffer.h"
#include "arrow/filesystem/path_util.h"
@@ -59,6 +61,7 @@ Status AzureOptions::ConfigureAccountKeyCredentials(const std::string& account_n
credentials_kind = AzureCredentialsKind::StorageCredentials;
return Status::OK();
}

namespace {

// An AzureFileSystem represents a single Azure storage account. AzurePath describes a
@@ -79,18 +82,17 @@ struct AzurePath {
"Expected an Azure object path of the form 'container/path...', got a URI: '",
s, "'");
}
const auto src = internal::RemoveTrailingSlash(s);
auto first_sep = src.find_first_of(internal::kSep);
auto first_sep = s.find_first_of(internal::kSep);
if (first_sep == 0) {
return Status::Invalid("Path cannot start with a separator ('", s, "')");
}
if (first_sep == std::string::npos) {
return AzurePath{std::string(src), std::string(src), "", {}};
return AzurePath{std::string(s), std::string(s), "", {}};
}
AzurePath path;
path.full_path = std::string(src);
path.container = std::string(src.substr(0, first_sep));
path.path_to_file = std::string(src.substr(first_sep + 1));
path.full_path = std::string(s);
path.container = std::string(s.substr(0, first_sep));
path.path_to_file = std::string(s.substr(first_sep + 1));
path.path_to_file_parts = internal::SplitAbstractPath(path.path_to_file);
RETURN_NOT_OK(Validate(path));
return path;
@@ -146,11 +148,6 @@ Status ValidateFilePath(const AzurePath& path) {
return Status::OK();
}

Status ErrorToStatus(const std::string& prefix,
const Azure::Storage::StorageException& exception) {
return Status::IOError(prefix, " Azure Error: ", exception.what());
}

template <typename ArrowType>
std::string FormatValue(typename TypeTraits<ArrowType>::CType value) {
struct StringAppender {
@@ -316,11 +313,13 @@ class ObjectInputFile final : public io::RandomAccessFile {
return Status::OK();
} catch (const Azure::Storage::StorageException& exception) {
if (exception.StatusCode == Azure::Core::Http::HttpStatusCode::NotFound) {
// Could be either container or blob not found.
return PathNotFound(path_);
}
return ErrorToStatus(
"When fetching properties for '" + blob_client_->GetUrl() + "': ", exception);
return internal::ExceptionToStatus(
"GetProperties failed for '" + blob_client_->GetUrl() +
"' with an unexpected Azure error. Can not initialise an ObjectInputFile "
"without knowing the file size.",
exception);
}
}

@@ -397,10 +396,12 @@ class ObjectInputFile final : public io::RandomAccessFile {
->DownloadTo(reinterpret_cast<uint8_t*>(out), nbytes, download_options)
.Value.ContentRange.Length.Value();
} catch (const Azure::Storage::StorageException& exception) {
return ErrorToStatus("When reading from '" + blob_client_->GetUrl() +
"' at position " + std::to_string(position) + " for " +
std::to_string(nbytes) + " bytes: ",
exception);
return internal::ExceptionToStatus("DownloadTo from '" + blob_client_->GetUrl() +
"' at position " + std::to_string(position) +
" for " + std::to_string(nbytes) +
" bytes failed with an Azure error. ReadAt "
"failed to read the required byte range.",
exception);
}
}

@@ -444,7 +445,6 @@ class ObjectInputFile final : public io::RandomAccessFile {
int64_t content_length_ = kNoSize;
std::shared_ptr<const KeyValueMetadata> metadata_;
};

} // namespace

// -----------------------------------------------------------------------
@@ -453,27 +453,136 @@ class ObjectInputFile final : public io::RandomAccessFile {
class AzureFileSystem::Impl {
public:
io::IOContext io_context_;
std::shared_ptr<Azure::Storage::Blobs::BlobServiceClient> service_client_;
std::unique_ptr<Azure::Storage::Files::DataLake::DataLakeServiceClient>
datalake_service_client_;
std::unique_ptr<Azure::Storage::Blobs::BlobServiceClient> blob_service_client_;
AzureOptions options_;
internal::HierarchicalNamespaceDetector hierarchical_namespace_;

explicit Impl(AzureOptions options, io::IOContext io_context)
: io_context_(io_context), options_(std::move(options)) {}

Status Init() {
service_client_ = std::make_shared<Azure::Storage::Blobs::BlobServiceClient>(
blob_service_client_ = std::make_unique<Azure::Storage::Blobs::BlobServiceClient>(
options_.account_blob_url, options_.storage_credentials_provider);
datalake_service_client_ =
std::make_unique<Azure::Storage::Files::DataLake::DataLakeServiceClient>(
options_.account_dfs_url, options_.storage_credentials_provider);
RETURN_NOT_OK(hierarchical_namespace_.Init(datalake_service_client_.get()));
return Status::OK();
}

const AzureOptions& options() const { return options_; }

public:
Result<FileInfo> GetFileInfo(const AzurePath& path) {
FileInfo info;
info.set_path(path.full_path);

if (path.container.empty()) {
DCHECK(path.path_to_file.empty()); // The path is invalid if the container is empty
// but not path_to_file.
// path must refer to the root of the Azure storage account. This is a directory,
// and there isn't any extra metadata to fetch.
info.set_type(FileType::Directory);
return info;
}
if (path.path_to_file.empty()) {
// path refers to a container. This is a directory if it exists.
auto container_client =
blob_service_client_->GetBlobContainerClient(path.container);
try {
auto properties = container_client.GetProperties();
info.set_type(FileType::Directory);
info.set_mtime(
std::chrono::system_clock::time_point(properties.Value.LastModified));
return info;
} catch (const Azure::Storage::StorageException& exception) {
if (exception.StatusCode == Azure::Core::Http::HttpStatusCode::NotFound) {
info.set_type(FileType::NotFound);
return info;
}
return internal::ExceptionToStatus(
"GetProperties for '" + container_client.GetUrl() +
"' failed with an unexpected Azure error. GetFileInfo is unable to "
"determine whether the container exists.",
exception);
}
}
auto file_client = datalake_service_client_->GetFileSystemClient(path.container)
.GetFileClient(path.path_to_file);
try {
auto properties = file_client.GetProperties();
if (properties.Value.IsDirectory) {
info.set_type(FileType::Directory);
} else if (internal::HasTrailingSlash(path.path_to_file)) {
// For a path with a trailing slash a hierarchical namespace may return a blob
// with that trailing slash removed. For consistency with flat namespace and
// other filesystems we chose to return NotFound.
info.set_type(FileType::NotFound);
return info;
} else {
info.set_type(FileType::File);
info.set_size(properties.Value.FileSize);
}
info.set_mtime(
std::chrono::system_clock::time_point(properties.Value.LastModified));
return info;
} catch (const Azure::Storage::StorageException& exception) {
if (exception.StatusCode == Azure::Core::Http::HttpStatusCode::NotFound) {
ARROW_ASSIGN_OR_RAISE(auto hierarchical_namespace_enabled,
hierarchical_namespace_.Enabled(path.container));
if (hierarchical_namespace_enabled) {
// If the hierarchical namespace is enabled, then the storage account will have
// explicit directories. Neither a file nor a directory was found.
info.set_type(FileType::NotFound);
return info;
}
// On flat namespace accounts there are no real directories. Directories are only
// implied by using `/` in the blob name.
Azure::Storage::Blobs::ListBlobsOptions list_blob_options;

// If listing the prefix `path.path_to_file` with trailing slash returns at least
// one result then `path` refers to an implied directory.
auto prefix = internal::EnsureTrailingSlash(path.path_to_file);
list_blob_options.Prefix = prefix;
// We only need to know if there is at least one result, so minimise page size
// for efficiency.
list_blob_options.PageSizeHint = 1;

try {
auto paged_list_result =
blob_service_client_->GetBlobContainerClient(path.container)
.ListBlobs(list_blob_options);
if (paged_list_result.Blobs.size() > 0) {
info.set_type(FileType::Directory);
} else {
info.set_type(FileType::NotFound);
}
return info;
} catch (const Azure::Storage::StorageException& exception) {
return internal::ExceptionToStatus(
"ListBlobs for '" + prefix +
"' failed with an unexpected Azure error. GetFileInfo is unable to "
"determine whether the path should be considered an implied directory.",
exception);
}
}
return internal::ExceptionToStatus(
"GetProperties for '" + file_client.GetUrl() +
"' failed with an unexpected "
"Azure error. GetFileInfo is unable to determine whether the path exists.",
exception);
}
}

Result<std::shared_ptr<ObjectInputFile>> OpenInputFile(const std::string& s,
AzureFileSystem* fs) {
ARROW_RETURN_NOT_OK(internal::AssertNoTrailingSlash(s));
ARROW_ASSIGN_OR_RAISE(auto path, AzurePath::FromString(s));
RETURN_NOT_OK(ValidateFilePath(path));
auto blob_client = std::make_shared<Azure::Storage::Blobs::BlobClient>(
service_client_->GetBlobContainerClient(path.container)
blob_service_client_->GetBlobContainerClient(path.container)
.GetBlobClient(path.path_to_file));

auto ptr =
@@ -494,7 +603,7 @@ class AzureFileSystem::Impl {
ARROW_ASSIGN_OR_RAISE(auto path, AzurePath::FromString(info.path()));
RETURN_NOT_OK(ValidateFilePath(path));
auto blob_client = std::make_shared<Azure::Storage::Blobs::BlobClient>(
service_client_->GetBlobContainerClient(path.container)
blob_service_client_->GetBlobContainerClient(path.container)
.GetBlobClient(path.path_to_file));

auto ptr = std::make_shared<ObjectInputFile>(blob_client, fs->io_context(),
@@ -518,7 +627,8 @@ bool AzureFileSystem::Equals(const FileSystem& other) const {
}

Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path) {
return Status::NotImplemented("The Azure FileSystem is not fully implemented");
ARROW_ASSIGN_OR_RAISE(auto p, AzurePath::FromString(path));
return impl_->GetFileInfo(p);
}

Result<FileInfoVector> AzureFileSystem::GetFileInfo(const FileSelector& select) {
88 changes: 88 additions & 0 deletions cpp/src/arrow/filesystem/azurefs_internal.cc
@@ -0,0 +1,88 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#include "arrow/filesystem/azurefs_internal.h"

#include <azure/storage/files/datalake.hpp>

#include "arrow/result.h"

namespace arrow::fs::internal {

Status ExceptionToStatus(const std::string& prefix,
const Azure::Storage::StorageException& exception) {
return Status::IOError(prefix, " Azure Error: ", exception.what());
}

Status HierarchicalNamespaceDetector::Init(
Azure::Storage::Files::DataLake::DataLakeServiceClient* datalake_service_client) {
datalake_service_client_ = datalake_service_client;
return Status::OK();
}

Result<bool> HierarchicalNamespaceDetector::Enabled(const std::string& container_name) {
// Hierarchical namespace can't easily be changed after the storage account is created
// and it's common across all containers in the storage account. Do nothing until we've
// checked for a cached result.
if (enabled_.has_value()) {
return enabled_.value();
}

// This approach is inspired by hadoop-azure
// https://github.com/apache/hadoop/blob/7c6af6a5f626d18d68b656d085cc23e4c1f7a1ef/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystemStore.java#L356.
// Unfortunately `blob_service_client->GetAccountInfo()` requires significantly
// elevated permissions.
// https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-service-properties?tabs=azure-ad#authorization
auto filesystem_client = datalake_service_client_->GetFileSystemClient(container_name);
auto directory_client = filesystem_client.GetDirectoryClient("/");
try {
directory_client.GetAccessControlList();
enabled_ = true;
} catch (const Azure::Storage::StorageException& exception) {
// GetAccessControlList will fail on storage accounts without hierarchical
// namespace enabled.

if (exception.StatusCode == Azure::Core::Http::HttpStatusCode::BadRequest ||
exception.StatusCode == Azure::Core::Http::HttpStatusCode::Conflict) {
// Flat namespace storage accounts with soft delete enabled return
// Conflict - This endpoint does not support BlobStorageEvents or SoftDelete
// otherwise it returns: BadRequest - This operation is only supported on a
// hierarchical namespace account.
enabled_ = false;
} else if (exception.StatusCode == Azure::Core::Http::HttpStatusCode::NotFound) {
// Azurite returns NotFound.
try {
filesystem_client.GetProperties();
enabled_ = false;
} catch (const Azure::Storage::StorageException& exception) {
return ExceptionToStatus("Failed to confirm '" + filesystem_client.GetUrl() +
"' is an accessible container. Therefore the "
"hierarchical namespace check was invalid.",
exception);
}
} else {
return ExceptionToStatus(
"GetAccessControlList for '" + directory_client.GetUrl() +
"' failed with an unexpected Azure error, while checking "
"whether the storage account has hierarchical namespace enabled.",
exception);
}
}
return enabled_.value();
}

} // namespace arrow::fs::internal
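
The status-code handling in `HierarchicalNamespaceDetector::Enabled` above can be summarised as a small decision function. The following is a simplified sketch under stated assumptions, not part of the commit: `HttpStatus` is a hypothetical enum standing in for the SDK's status codes, `std::nullopt` stands in for returning an error `Status`, and a successful call is modelled as `Ok`.

```cpp
#include <cassert>
#include <optional>

// Hypothetical subset of HTTP status codes, for illustration only.
enum class HttpStatus {
  Ok = 200,
  BadRequest = 400,
  Forbidden = 403,
  NotFound = 404,
  Conflict = 409
};

// `acl_status` models the outcome of the access control list request on "/".
// `container_status` models the follow-up container properties request, which
// only matters when the first request reports NotFound.
// Returns true/false for hierarchical/flat, or std::nullopt for an error.
std::optional<bool> HierarchicalNamespaceEnabled(HttpStatus acl_status,
                                                 HttpStatus container_status) {
  switch (acl_status) {
    case HttpStatus::Ok:
      return true;  // the ACL request only succeeds on hierarchical accounts
    case HttpStatus::BadRequest:
    case HttpStatus::Conflict:
      return false;  // real flat namespace accounts (Conflict with soft delete)
    case HttpStatus::NotFound:
      // Azurite reports NotFound, so confirm the container really exists.
      if (container_status == HttpStatus::Ok) return false;
      return std::nullopt;
    default:
      return std::nullopt;  // unexpected error; the check is inconclusive
  }
}

// Usage examples, written as assertions.
inline void CheckExamples() {
  assert(HierarchicalNamespaceEnabled(HttpStatus::Ok, HttpStatus::Ok) == true);
  assert(HierarchicalNamespaceEnabled(HttpStatus::Conflict, HttpStatus::Ok) == false);
  assert(HierarchicalNamespaceEnabled(HttpStatus::NotFound, HttpStatus::Ok) == false);
  assert(!HierarchicalNamespaceEnabled(HttpStatus::NotFound, HttpStatus::NotFound)
              .has_value());
}
```

Only the conclusive outcomes are worth caching. The inconclusive cases return early in the real code too, leaving `enabled_` unset so that a later call can retry.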
42 changes: 42 additions & 0 deletions cpp/src/arrow/filesystem/azurefs_internal.h
@@ -0,0 +1,42 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include <optional>

#include <azure/storage/files/datalake.hpp>

#include "arrow/result.h"

namespace arrow::fs::internal {

Status ExceptionToStatus(const std::string& prefix,
const Azure::Storage::StorageException& exception);

class HierarchicalNamespaceDetector {
public:
Status Init(
Azure::Storage::Files::DataLake::DataLakeServiceClient* datalake_service_client);
Result<bool> Enabled(const std::string& container_name);

private:
Azure::Storage::Files::DataLake::DataLakeServiceClient* datalake_service_client_;
std::optional<bool> enabled_;
};

} // namespace arrow::fs::internal