[C++] Filesystem implementation for Azure Blob Storage #18014

Closed
asfimport opened this issue Jan 25, 2018 · 48 comments

Comments

@asfimport
Collaborator

asfimport commented Jan 25, 2018

Subissues:


Reporter: Wes McKinney / @wesm
Assignee: Shefali Singh

Related issues:

Original Issue Attachments:

PRs and other links:

Note: This issue was originally created as ARROW-2034. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Wes McKinney / @wesm:
I see that TileDB (MIT license) has built a C++ wrapper for Azure

https://github.com/TileDB-Inc/TileDB/blob/dev/tiledb/sm/filesystem/azure.cc

No, this has not moved to fsspec; this is a C++ ticket.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
I'm not sure I understand the relationship between Blob Store and Data Lake. Is Data Lake a higher-level layer above Blob Store? Or are they two different services that would need separate filesystem implementations?

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
According to https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction ,

> Data Lake Storage Gen2 converges the capabilities of Azure Data Lake Storage Gen1 with Azure Blob storage. For example, Data Lake Storage Gen2 provides file system semantics, file-level security, and scale. Because these capabilities are built on Blob storage, you'll also get low-cost, tiered storage, with high availability/disaster recovery capabilities.
I'm not sure this means the same C++ API can be used to access both, though.

@asfimport
Collaborator Author

Uwe Korn / @xhochy:
It is just as confusing in reality. Here is what they all are (though my knowledge is already about a year out of date):

  • Blob Store: Like S3, simple but limited API
  • Data Lake Gen 1: HDFS-like deployment with different but more user-friendly API / attributes
  • Data Lake Gen 2: Some improvements were made to the Blob Store so that there is no longer any need for a special (more expensive) Data Lake service; everything is now on the Blob Store. A new set of APIs was released, though, that exposes some nice features the initial Blob Store API didn't have. Probably for marketing purposes this was named Data Lake Gen 2, although technically Blob Store Gen 2 would have been more appropriate.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Ha, and here are some interesting resources:

@asfimport
Collaborator Author

Yesh:
There is also https://github.com/Azure/azure-sdk-for-cpp, which I've tested against ADLS Gen2.

@asfimport
Collaborator Author

Tom Augspurger / @TomAugspurger:
Does Arrow support C++14 features now (or more specifically, is the SDK being C++14 a problem)? From https://issues.apache.org/jira/browse/ARROW-13744 it seems like C++14 is at least tested, but https://github.com/apache/arrow/blame/master/docs/source/developers/cpp/building.rst#L40 says "A C++11-enabled compiler" is required.

@asfimport
Collaborator Author

Neal Richardson / @nealrichardson:
I'm not an expert here, but I think we could use C++14 if required and if the compiler supports it. If the compiler doesn't support C++14, we wouldn't be able to build the Azure SDK. So the line would be that Arrow requires C++11 at a minimum, and some features are only available with C++14.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
We require only C++11 in the codebase. We might add optional components that require C++14 if desired, but that will complicate the build setup.

@asfimport
Collaborator Author

Shashanka Balakuntala Srinivasa:
Hi @pitrou, we were looking into implementing this feature from our side. I tried compiling the whole Arrow code base with C++14 and ran the unit tests as well. Everything passes locally and, as mentioned before, the ticket "[ARROW-13744] [CI] c++14 and 17 nightly job fails" (ASF JIRA, apache.org) mentions that we have a daily build run for validation, which is passing.

Since the Azure SDKs depend on C++14 features, and since the code already compiles as C++14, can we look into upgrading the C++ version to 14?
Let me know if there are any issues. I will be happy to take those and work on them.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
[~balakuntala] We can probably require C++14 for the Azure filesystem implementation only, but the rest of Arrow should remain C++11-compatible.
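
One way to picture this split is a compile-time guard: the core code stays within C++11, while an optional component is only compiled when the compiler runs in C++14 mode or later. The sketch below is illustrative only. `EXAMPLE_HAS_CXX14`, `CoreFeature()`, `OptionalFeature()` and `AzureSupportCompiledIn()` are hypothetical names, not Arrow macros or functions, and Arrow's real build may gate optional components quite differently (for instance through build-system options rather than a header check).

```cpp
#include <string>

// Hypothetical macro (not part of Arrow): 1 when this translation unit is
// compiled as C++14 or later. 201402L is the value of __cplusplus for C++14.
// Caveat: MSVC reports 199711L unless /Zc:__cplusplus is passed.
#if __cplusplus >= 201402L
#define EXAMPLE_HAS_CXX14 1
#else
#define EXAMPLE_HAS_CXX14 0
#endif

// Core code: restricted to C++11 features so it builds everywhere.
inline std::string CoreFeature() { return "core (C++11)"; }

#if EXAMPLE_HAS_CXX14
// Optional component: free to use C++14 features such as generic lambdas.
inline std::string OptionalFeature() {
  auto describe = [](auto standard) {
    return "optional (C++" + std::to_string(standard) + ")";
  };
  return describe(14);
}
#endif

// Hypothetical helper reporting whether the optional component was built.
inline bool AzureSupportCompiledIn() { return EXAMPLE_HAS_CXX14 == 1; }
```

Callers of the core code never see `OptionalFeature()` unless the build is C++14 or newer, which mirrors the idea of keeping the rest of the library C++11-compatible.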

@asfimport
Collaborator Author

Dean MacGregor:
If someone wants to work on this but doesn't have an Azure account, let me know. I can create a storage account for development and testing.

@asfimport
Collaborator Author

Bipin Mathew:
I have a rudimentary implementation of this that supports import and export to Azure. I tried to align it as closely as possible with the S3 implementation; however, as I needed it only for a specific use case, I have not had a chance to implement all the same methods. Depending on bandwidth, I could probably build out the implementation further. I have never contributed to this project. Do all endpoints need to be implemented before it can be merged into the code base? Can it be built out over time? I attach what I have so far here.

azfs.h, azfs.cc

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
[~mathewb001] Well, there's a Github PR open already, did you take a look?

@asfimport
Collaborator Author

Bipin Mathew:
Oh, this is much further along than what I have offered. Looking forward to its release so I can cut over to it.

@av8or1
Contributor

av8or1 commented Jan 11, 2023

We too are in need of this feature. Any word as to whether it will be included in the version 11 release? Seems like the deadline is kinda tight at this point.

@pitrou
Member

pitrou commented May 23, 2023

@h-vetinari Do you know if there's a conda-forge package for the Azure Blob Storage C++ library? I couldn't find one.

@h-vetinari
Contributor

> @h-vetinari Do you know if there's a conda-forge package for the Azure Blob Storage C++ library? I couldn't find one.

I'm not aware of any either, but I don't know. What would be the sources of that C++ lib? If it's open source we could bring it to conda-forge eventually of course.

@pitrou
Member

pitrou commented May 24, 2023

@raulcd modified the milestones: 15.0.0, 16.0.0 (Jan 8, 2024)

@raulcd
Member

raulcd commented Jan 8, 2024

I am moving this umbrella issue to 16.0.0

@felipecrv
Contributor

> I am moving this umbrella issue to 16.0.0

That's fine. Thanks.

@av8or1
Contributor

av8or1 commented Mar 5, 2024

kou/felipecrv - How close is this to being done? I see a few green-colored items in the list above, but they seem to be completed already (at least the ones I looked at did). Is there anything else we can work on? Will this make it into version 16.0.0? Thanks

@felipecrv
Contributor

@av8or1 most of it works now. s3fs doesn't even support Move. AzureFileSystem supports Move on accounts with HNS enabled. You don't have to wait for this issue to be closed to start using what's already merged and ready to be part of the 16.0.0 release.

@av8or1
Contributor

av8or1 commented Mar 5, 2024

Hi felipe- Thanks. Well, the company can't utilize any library that isn't an official release. I suppose that I could begin writing the code that will utilize the ADLS stuff now, then when 16.0.0 is released (April?), I would be able to produce our product shortly thereafter. What remains to be completed, by the way? Anything I could help with? Thanks

@felipecrv
Contributor

@av8or1 as I said above: Move with Blobs API (not a critical feature at all, s3fs doesn't even support Move). Python bindings (PR is open), and URI parsing (PR is open). Are any of these a dealbreaker for you? Everything else will be available in 16.0.0.
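
To illustrate why Move is harder through the Blobs API: in a flat blob namespace a "directory" is only a shared name prefix, so a move has to be emulated by copying and then deleting each blob, whereas an HNS-enabled account offers a rename operation. The toy model below is a sketch under those assumptions. `FlatStore` and `MoveByCopyAndDelete` are hypothetical names, an in-memory `std::map` stands in for the storage service, and this is not how AzureFileSystem is implemented.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a flat blob container: blob name -> contents.
using FlatStore = std::map<std::string, std::string>;

// Emulates moving everything under `src_prefix` to `dest_prefix` by copying
// each blob to its new name and then deleting the original. Returns the
// number of blobs moved. Unlike a native rename this is not atomic: a failure
// midway would leave blobs under both prefixes. Overlapping prefixes are not
// handled, apart from the trivial case where both are equal.
inline std::size_t MoveByCopyAndDelete(FlatStore& store,
                                       const std::string& src_prefix,
                                       const std::string& dest_prefix) {
  if (src_prefix == dest_prefix) return 0;  // nothing to do

  // Names sharing a prefix are contiguous in a sorted map, so start at the
  // first candidate and stop at the first name that no longer matches.
  // Collect them first so the map is not modified while iterating.
  std::vector<std::string> names;
  for (auto it = store.lower_bound(src_prefix);
       it != store.end() &&
       it->first.compare(0, src_prefix.size(), src_prefix) == 0;
       ++it) {
    names.push_back(it->first);
  }

  for (const auto& name : names) {
    const std::string new_name = dest_prefix + name.substr(src_prefix.size());
    store[new_name] = store[name];  // "copy" step
    store.erase(name);              // "delete" step
  }
  return names.size();
}
```

Every blob under the prefix costs one copy and one delete here, which is why a per-blob emulation scales with the number of blobs while a single rename does not.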

@av8or1
Contributor

av8or1 commented Mar 20, 2024

@felipecrv OK thank you. Work has been busy. Just now looking at this again. It appears that @kou has completed the URI parsing business (#40028). Thus I will prepare on my end to use the library when it is released. Hopefully in April. Thanks

@wirable23

> Everything else will be available in 16.0.0.

@felipecrv do you know when 16.0.0 would be available?

@felipecrv
Contributor

> > Everything else will be available in 16.0.0.
>
> @felipecrv do you know when 16.0.0 would be available?

In April. The time it takes for the release also depends on how smooth the packaging and publishing process goes.

@Tom-Newton
Contributor

I think ideally #40036 would be taken care of before the 16.0.0 release. I don't have any real world performance numbers but I suspect write performance is currently a bit disappointing.

@raulcd
Member

raulcd commented Mar 26, 2024

Hi @Tom-Newton, there's no planned release before 16.0.0.
The feature freeze for 16.0.0 is planned for the 8th of April.

@felipecrv
Contributor

I opened a MINOR PR expanding some of the docstrings: #40838

@raulcd
Member

raulcd commented Apr 8, 2024

There are some subtasks still opened. @felipecrv should I tag this umbrella issue as 17.0.0?

@felipecrv
Contributor

felipecrv commented Apr 8, 2024

> There are some subtasks still opened. @felipecrv should I tag this umbrella issue as 17.0.0?

AzureFileSystem is already usable and feature-complete even without these open issues being fixed. I'm in favor of closing this issue (marking it as complete) and making it part of the 16.0 release in the logs.

@kou
Member

kou commented Apr 8, 2024

Let's close this as complete.
We don't need an umbrella issue for AzureFileSystem. We can just use separated issues like other components.

@kou closed this as completed (Apr 8, 2024)

@kou
Member

kou commented Apr 9, 2024

Note that the current AzureFileSystem's CopyFile() doesn't work with Azure hierarchical namespace support. See also: #41095

Some other AzureFileSystem implementations for Azure hierarchical namespace support have some problems: #41034
I want to include the fix for this (#41068) in 16.0.0.
