Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41034: [C++][FS][Azure] Adjust DeleteDir/DeleteDirContents/GetFileInfoSelector behaviors against Azure for generic filesystem tests #41068

Merged
merged 5 commits into from
Apr 10, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
49 changes: 40 additions & 9 deletions cpp/src/arrow/filesystem/azurefs.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1642,11 +1642,27 @@ class AzureFileSystem::Impl {
options.Prefix = {};
found = true; // Unless the container itself is not found later!
} else {
options.Prefix = internal::EnsureTrailingSlash(base_location.path);
ARROW_ASSIGN_OR_RAISE(
auto prefix, AzureLocation::FromString(
std::string(internal::EnsureTrailingSlash(select.base_dir))));
ARROW_ASSIGN_OR_RAISE(auto info, GetFileInfo(container_client, prefix));
if (info.type() == FileType::NotFound) {
if (select.allow_not_found) {
return Status::OK();
} else {
return PathNotFound(base_location);
}
} else if (info.type() != FileType::Directory) {
return NotADir(base_location);
}
options.Prefix = prefix.path;
}
options.PageSizeHint = page_size_hint;
options.Include = Blobs::Models::ListBlobsIncludeFlags::Metadata;

auto adlfs_client = GetFileSystemClient(base_location.container);
ARROW_ASSIGN_OR_RAISE(auto hns_support, HierarchicalNamespaceSupport(adlfs_client));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix is very confusing to me. This function is GetFileInfoWithSelectorFromContainer, it's meant to be called when HNS support has been detected as disabled.

The right fix (IMO) is to create GetFileInfoWithSelectorFromFileSystem(adlfs_client, ...) and then in GetFileInfoWithSelector dispatch to the different implementations according to the HNS support detection.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand what you say but the current implementation doesn't:

GetFileInfoWithSelector() isn't dispatched based on HNS support:

Result<FileInfoVector> AzureFileSystem::GetFileInfo(const FileSelector& select) {
Core::Context context;
Azure::Nullable<int32_t> page_size_hint; // unspecified
FileInfoVector results;
RETURN_NOT_OK(
impl_->GetFileInfoWithSelector(context, page_size_hint, select, &results));
return {std::move(results)};
}

Status GetFileInfoWithSelector(const Core::Context& context,
Azure::Nullable<int32_t> page_size_hint,
const FileSelector& select,
FileInfoVector* acc_results) {
ARROW_ASSIGN_OR_RAISE(auto base_location, AzureLocation::FromString(select.base_dir));
if (base_location.container.empty()) {
// Without a container, the base_location is equivalent to the filesystem
// root -- `/`. FileSelector::allow_not_found doesn't matter in this case
// because the root always exists.
auto on_container = [&](const Blobs::Models::BlobContainerItem& container) {
// Deleted containers are not listed by ListContainers.
DCHECK(!container.IsDeleted);
// Every container is considered a directory.
FileInfo info{container.Name, FileType::Directory};
info.set_mtime(
std::chrono::system_clock::time_point{container.Details.LastModified});
acc_results->push_back(std::move(info));
// Recurse into containers (subdirectories) if requested.
if (select.recursive && select.max_recursion > 0) {
FileSelector sub_select;
sub_select.base_dir = container.Name;
sub_select.allow_not_found = true;
sub_select.recursive = true;
sub_select.max_recursion = select.max_recursion - 1;
ARROW_RETURN_NOT_OK(
GetFileInfoWithSelector(context, page_size_hint, sub_select, acc_results));
}
return Status::OK();
};
return VisitContainers(context, std::move(on_container));
}
auto container_client =
blob_service_client_->GetBlobContainerClient(base_location.container);
return GetFileInfoWithSelectorFromContainer(container_client, context, page_size_hint,
select, acc_results);
}

The FromContainer suffix doesn't mean only for no HNS support. We use OnContainer/OnFileSystem suffix for it. FromContainer just means that it's for one container and it doesn't work with multiple containers.


Should we create GetFileInfoWithSelectorFromContainerOn{Container,FileSystem}()? But their names are too long...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand what you say but the current implementation doesn't:

GetFileInfoWithSelector() isn't dispatched based on HNS support:

But what I meant was changing/fixing GetFileInfoWithSelector to do that.

Should we create GetFileInfoWithSelectorFromContainerOn{Container,FileSystem}()? But their names are too long...

ADLFS calls "filesystem" what "blobs" calls "container", so you replace the last word of the name:

  Status GetFileInfoWithSelectorFromContainer(const Blobs::BlobContainerClient& container_client, ...);
  Status GetFileInfoWithSelecitonFromFilesystem(const DataLake::DataLakeFileSystemClient& adlfs_client, ...);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ADLFS calls "filesystem" what "blobs" calls "container"

Wow! I didn't notice it!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented.


auto recurse = [&](const std::string& blob_prefix) noexcept -> Status {
if (select.recursive && select.max_recursion > 0) {
FileSelector sub_select;
Expand All @@ -1671,7 +1687,15 @@ class AzureFileSystem::Impl {
};
auto process_prefix = [&](const std::string& prefix) noexcept -> Status {
const auto path = internal::ConcatAbstractPath(base_location.container, prefix);
acc_results->push_back(DirectoryFileInfoFromPath(path));
if (hns_support == HNSSupport::kEnabled) {
ARROW_ASSIGN_OR_RAISE(
auto location,
AzureLocation::FromString(std::string(internal::RemoveTrailingSlash(path))));
ARROW_ASSIGN_OR_RAISE(auto info, GetFileInfo(adlfs_client, location));
acc_results->push_back(std::move(info));
} else {
acc_results->push_back(DirectoryFileInfoFromPath(path));
}
return recurse(prefix);
};

Expand Down Expand Up @@ -2157,6 +2181,17 @@ class AzureFileSystem::Impl {
Azure::Nullable<std::string> lease_id = {}) {
DCHECK(!location.container.empty());
DCHECK(!location.path.empty());
ARROW_ASSIGN_OR_RAISE(auto file_info, GetFileInfo(adlfs_client, location));
if (file_info.type() == FileType::NotFound) {
if (require_dir_to_exist) {
return PathNotFound(location);
} else {
return Status::OK();
}
}
if (file_info.type() != FileType::Directory) {
return NotADir(location);
}
auto directory_client = adlfs_client.GetDirectoryClient(
std::string(internal::RemoveTrailingSlash(location.path)));
DataLake::DeleteDirectoryOptions options;
Expand All @@ -2168,13 +2203,6 @@ class AzureFileSystem::Impl {
// All the others either succeed or throw an exception.
DCHECK(response.Value.Deleted);
} catch (const Storage::StorageException& exception) {
if (exception.ErrorCode == "FilesystemNotFound" ||
exception.ErrorCode == "PathNotFound") {
if (require_dir_to_exist) {
return PathNotFound(location);
}
return Status::OK();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that you check existence above doesn't mean these errors can't happen down here.

To use terms that @pitrou once shared with me [1], you're moving from EAFP [2] "easier to ask for permission" to LBYL [3] "look before you leap".

From [3]:

LBYL

Look before you leap. This coding style explicitly tests for pre-conditions before making calls or lookups. This style contrasts with the EAFP approach and is characterized by the presence of many if statements.

In a multi-threaded environment, the LBYL approach can risk introducing a race condition between “the looking” and “the leaping”. For example, the code, if key in mapping: return mapping[key] can fail if another thread removes key from mapping after the test, but before the lookup. This issue can be solved with locks or by using the EAFP approach.

[1] https://devblogs.microsoft.com/python/idiomatic-python-eafp-versus-lbyl/
[2] https://docs.python.org/3.5/glossary.html#term-eafp
[3] https://docs.python.org/3.5/glossary.html#term-lbyl

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the TZ differences, I will give myself a chance of fixing these bugs to see if I can come up with something that doesn't introduce pessimistic round trips to the backend.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll revert this for now.
You can introduce better approach later if you can find it.

FYI: We want to report a NotADir() for file location. But DeleteRecursive()/DeleteEmpty() can delete both of file and directory. See also: https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/delete?view=rest-storageservices-datalakestoragegen2-2019-12-12

Delete the file or directory.

// Not a directory
CreateFile(fs, "AB/def", "");
ASSERT_RAISES(IOError, fs->DeleteDir("AB/def"));
is a test for this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we block changing the target location from other connections by using lease_id for GetFileInfo() too?

diff --git a/cpp/src/arrow/filesystem/azurefs.cc b/cpp/src/arrow/filesystem/azurefs.cc
index df6db3d959..67232b3156 100644
--- a/cpp/src/arrow/filesystem/azurefs.cc
+++ b/cpp/src/arrow/filesystem/azurefs.cc
@@ -2181,7 +2181,7 @@ class AzureFileSystem::Impl {
                                Azure::Nullable<std::string> lease_id = {}) {
     DCHECK(!location.container.empty());
     DCHECK(!location.path.empty());
-    ARROW_ASSIGN_OR_RAISE(auto file_info, GetFileInfo(adlfs_client, location));
+    ARROW_ASSIGN_OR_RAISE(auto file_info, GetFileInfo(adlfs_client, location, lease_id));
     if (file_info.type() == FileType::NotFound) {
       if (require_dir_to_exist) {
         return PathNotFound(location);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But DeleteRecursive()/DeleteEmpty() can delete both of file and directory. See also: https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/delete?view=rest-storageservices-datalakestoragegen2-2019-12-12

This scenario sounds familiar. 🤔 Does that happen even if you append a trailing slash to the path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add a trailing slash, API returns an error...:

'fs->DeleteDir("AB/CD")' failed with IOError: Failed to delete a directory: AB/CD: https://clearcodearrow.blob.core.windows.net/aeouspyxzsz16iliz71ya6yt5qrur6b9/AB/CD/ Azure Error: [InvalidUri] 400 The request URI is invalid.

return ExceptionToStatus(exception, "Failed to delete a directory: ", location.path,
": ", directory_client.GetUrl());
}
Expand All @@ -2200,6 +2228,9 @@ class AzureFileSystem::Impl {
kDelimiter, path.Name, ": ", sub_directory_client.GetUrl());
}
} else {
if (path.Name == location.path) {
return NotADir(location);
}
auto sub_file_client = adlfs_client.GetFileClient(path.Name);
try {
sub_file_client.Delete();
Expand Down
6 changes: 5 additions & 1 deletion cpp/src/arrow/filesystem/azurefs_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -392,7 +392,7 @@ class TestGeneric : public ::testing::Test, public GenericFileSystemTest {
bool allow_move_dir() const override { return false; }
bool allow_move_file() const override { return true; }
bool allow_append_to_file() const override { return true; }
bool have_directory_mtimes() const override { return false; }
bool have_directory_mtimes() const override { return true; }
bool have_flaky_directory_tree_deletion() const override { return false; }
bool have_file_metadata() const override { return true; }
// calloc() used in libxml2's xmlNewGlobalState() is detected as a
Expand Down Expand Up @@ -429,6 +429,8 @@ class TestAzuriteGeneric : public TestGeneric {
protected:
// Azurite doesn't support moving files over containers.
bool allow_move_file() const override { return false; }
// Azurite doesn't support directory mtime.
bool have_directory_mtimes() const override { return false; }
// DeleteDir() doesn't work with Azurite on macOS
bool have_flaky_directory_tree_deletion() const override {
return env_->HasSubmitBatchBug();
Expand All @@ -449,6 +451,8 @@ class TestAzureFlatNSGeneric : public TestGeneric {
protected:
// Flat namespace account doesn't support moving files over containers.
bool allow_move_file() const override { return false; }
// Flat namespace account doesn't support directory mtime.
bool have_directory_mtimes() const override { return false; }
};

class TestAzureHierarchicalNSGeneric : public TestGeneric {
Expand Down
Loading