forked from apache/arrow
apacheGH-38335: [C++] Implement `GetFileInfo` for a single file in Azure filesystem (apache#38505)

### Rationale for this change

`GetFileInfo` is an important part of an Arrow filesystem implementation.

### What changes are included in this PR?

- Start `azurefs_internal` similar to GCS and S3 filesystems.
- Implement `HierarchicalNamespaceDetector`.
  - This does not use the obvious and simple implementation. It uses a more complicated option inspired by `hadoop-azure` that avoids requiring the significantly elevated permissions needed for `blob_service_client->GetAccountInfo()`.
  - This can't be detected at initialisation time of the filesystem because it requires a `container_name`. It's packed into its own class so that the result can be cached.
- Implement `GetFileInfo` for single paths.
  - Supports hierarchical or flat namespace accounts and takes advantage of hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is actually noticeable just from running the `GetFileInfoObjectWithNestedStructure` test against real flat and hierarchical accounts. It's about 3 seconds with hierarchical namespace or 5 seconds with a flat namespace.
- Update tests with TODO(apacheGH-38335) to now use this implementation of `GetFileInfo` to replace the temporary direct Azure SDK usage.
- Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage are not provided then the real blob storage tests will be skipped.

### Are these changes tested?

Yes. There are new Azurite based tests for everything that can be tested with Azurite.

There are also some tests that are designed to test against a real blob storage account. This is because [Azurite cannot emulate a hierarchical namespace account](Azure/Azurite#553). Additionally, some of the behaviour used to detect a hierarchical namespace account is different on Azurite compared to a real flat namespace account. These tests will be automatically skipped unless environment variables are provided with details for connecting to the relevant real storage accounts.

Initially I based the tests on the GCS filesystem but I added a few extras where I thought it was appropriate.

### Are there any user-facing changes?

Yes. `GetFileInfo` is now supported on the Azure filesystem.

* Closes: apache#38335

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
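The commit message notes that a flat namespace account needs extra calls to answer `GetFileInfo` for a single path. The sketch below illustrates one plausible reading of why: without real directories, a path counts as a directory only if some blob name starts with `path/`. It is not the code from this commit. `FileType`, `ClassifyFlatPath`, and the in-memory `std::set` standing in for a container listing are all hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for a filesystem's file type enum.
enum class FileType { NotFound, File, Directory };

// Classifies `path` against a flat listing of blob names.
// `blobs` is an assumed in-memory substitute for a real container listing.
FileType ClassifyFlatPath(const std::set<std::string>& blobs,
                          const std::string& path) {
  // First lookup: an exact match means the path is a blob, i.e. a file.
  if (blobs.count(path) > 0) {
    return FileType::File;
  }
  // Second lookup: any blob named "path/..." implies a directory.
  // In a sorted set, names sharing a prefix are contiguous and start at
  // lower_bound(prefix), so checking that single element is enough.
  const std::string prefix = path + "/";
  auto it = blobs.lower_bound(prefix);
  if (it != blobs.end() && it->compare(0, prefix.size(), prefix) == 0) {
    return FileType::Directory;
  }
  return FileType::NotFound;
}
```

With blobs `dir/a.txt` and `top.txt`, the path `dir/a.txt` is classified as a file, `dir` as a directory, and `di` as not found. The two lookups mirror the extra round trip a flat account may need, whereas a hierarchical account can report the directory directly.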
1 parent 0a714b9, commit c39c31b

Showing 7 changed files with 497 additions and 97 deletions.
@@ -0,0 +1,88 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

#include "arrow/filesystem/azurefs_internal.h"

#include <azure/storage/files/datalake.hpp>

#include "arrow/result.h"

namespace arrow::fs::internal {

Status ExceptionToStatus(const std::string& prefix,
                         const Azure::Storage::StorageException& exception) {
  return Status::IOError(prefix, " Azure Error: ", exception.what());
}

Status HierarchicalNamespaceDetector::Init(
    Azure::Storage::Files::DataLake::DataLakeServiceClient* datalake_service_client) {
  datalake_service_client_ = datalake_service_client;
  return Status::OK();
}

Result<bool> HierarchicalNamespaceDetector::Enabled(const std::string& container_name) {
  // Hierarchical namespace can't easily be changed after the storage account is created
  // and it's common across all containers in the storage account. Do nothing until we've
  // checked for a cached result.
  if (enabled_.has_value()) {
    return enabled_.value();
  }

  // This approach is inspired by hadoop-azure
  // https://github.com/apache/hadoop/blob/7c6af6a5f626d18d68b656d085cc23e4c1f7a1ef/hadoop-tools/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/AzureBlobFileSystemStore.java#L356.
  // Unfortunately `blob_service_client->GetAccountInfo()` requires significantly
  // elevated permissions.
  // https://learn.microsoft.com/en-us/rest/api/storageservices/get-blob-service-properties?tabs=azure-ad#authorization
  auto filesystem_client = datalake_service_client_->GetFileSystemClient(container_name);
  auto directory_client = filesystem_client.GetDirectoryClient("/");
  try {
    directory_client.GetAccessControlList();
    enabled_ = true;
  } catch (const Azure::Storage::StorageException& exception) {
    // GetAccessControlList will fail on storage accounts without hierarchical
    // namespace enabled.

    if (exception.StatusCode == Azure::Core::Http::HttpStatusCode::BadRequest ||
        exception.StatusCode == Azure::Core::Http::HttpStatusCode::Conflict) {
      // Flat namespace storage accounts with soft delete enabled return
      // Conflict - This endpoint does not support BlobStorageEvents or SoftDelete
      // otherwise it returns: BadRequest - This operation is only supported on a
      // hierarchical namespace account.
      enabled_ = false;
    } else if (exception.StatusCode == Azure::Core::Http::HttpStatusCode::NotFound) {
      // Azurite returns NotFound.
      try {
        filesystem_client.GetProperties();
        enabled_ = false;
      } catch (const Azure::Storage::StorageException& exception) {
        return ExceptionToStatus("Failed to confirm '" + filesystem_client.GetUrl() +
                                     "' is an accessible container. Therefore the "
                                     "hierarchical namespace check was invalid.",
                                 exception);
      }
    } else {
      return ExceptionToStatus(
          "GetAccessControlList for '" + directory_client.GetUrl() +
              "' failed with an unexpected Azure error, while checking "
              "whether the storage account has hierarchical namespace enabled.",
          exception);
    }
  }
  return enabled_.value();
}

}  // namespace arrow::fs::internal
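The branching in `Enabled` can be read as a small decision table plus a cache. The following is a minimal, SDK-free sketch of that idea, assuming a caller-supplied probe function in place of the real `GetAccessControlList` call. `ProbeStatus`, `InterpretAclProbe`, and `CachedDetector` are hypothetical names, not part of Arrow or the Azure SDK. Unlike the code above, this sketch treats `NotFound` as simply inconclusive instead of issuing a follow-up container check.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <utility>

// Hypothetical stand-in for the HTTP status codes the detector branches on.
enum class ProbeStatus {
  Ok = 200,
  BadRequest = 400,
  Forbidden = 403,
  NotFound = 404,
  Conflict = 409
};

// Maps the outcome of the ACL probe to "is hierarchical namespace enabled?".
// Returns std::nullopt when the status does not settle the question.
std::optional<bool> InterpretAclProbe(ProbeStatus status) {
  switch (status) {
    case ProbeStatus::Ok:
      return true;  // The ACL call succeeded: hierarchical namespace.
    case ProbeStatus::BadRequest:
    case ProbeStatus::Conflict:
      return false;  // Statuses reported for flat namespace accounts.
    default:
      return std::nullopt;  // Inconclusive: the caller must handle this.
  }
}

// Caches the first conclusive answer, so the probe runs at most once after
// it has produced a definite result.
class CachedDetector {
 public:
  explicit CachedDetector(std::function<ProbeStatus()> probe)
      : probe_(std::move(probe)) {}

  std::optional<bool> Enabled() {
    if (!enabled_.has_value()) {
      // An inconclusive result leaves the cache empty, so the next call
      // probes again instead of remembering a non-answer.
      enabled_ = InterpretAclProbe(probe_());
    }
    return enabled_;
  }

 private:
  std::function<ProbeStatus()> probe_;
  std::optional<bool> enabled_;
};
```

Only conclusive answers are cached here, which matches the behaviour above where the error paths return without assigning `enabled_`.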
@@ -0,0 +1,42 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements.  See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership.  The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License.  You may obtain a copy of the License at
//
//   http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied.  See the License for the
// specific language governing permissions and limitations
// under the License.

#pragma once

#include <optional>

#include <azure/storage/files/datalake.hpp>

#include "arrow/result.h"

namespace arrow::fs::internal {

Status ExceptionToStatus(const std::string& prefix,
                         const Azure::Storage::StorageException& exception);

class HierarchicalNamespaceDetector {
 public:
  Status Init(
      Azure::Storage::Files::DataLake::DataLakeServiceClient* datalake_service_client);
  Result<bool> Enabled(const std::string& container_name);

 private:
  Azure::Storage::Files::DataLake::DataLakeServiceClient* datalake_service_client_;
  std::optional<bool> enabled_;
};

}  // namespace arrow::fs::internal