Is there a plan to support AdlsGen2 (Datalake Storage) on top of the blobstore emulator? #553
Comments
Hi @Arnaud-Nauwynck, can you describe your usage scenarios further? For example, are you using a Data Lake Gen2 account with or without hierarchical namespace enabled? Do you use any Data Lake Storage SDKs? Which Data Lake Gen2 features are you most interested in? Interop between Blob and Data Lake Gen2?
As for me, I'd like to use Spark and hadoop-azure, as described here: https://stackoverflow.com/questions/65050695/how-can-i-read-write-data-from-azurite-using-spark
Hi @XiaoningLiu. However, we need to add test coverage to our code: not just basic unit tests, but something that can verify the integration is OK end to end, without being tightly coupled to the network for real calls to Azure. We implemented what we needed with testcontainers, but Hadoop inside Docker is a bit tricky to use and requires a rather unconventional setup (host resolution, some env vars, permissions, etc.), while all we need is the filesystem for tests.
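As context for the testcontainers approach mentioned above, here is a minimal Python sketch (my own illustration, not from the thread) that starts the official Azurite image with testcontainers-python and points a blob client at it, using Azurite's well-known development account:

```python
# A minimal sketch, assuming testcontainers-python and azure-storage-blob are
# installed. It starts the official Azurite image and connects a blob client
# using Azurite's well-known development account and key.
from testcontainers.core.container import DockerContainer
from azure.storage.blob import BlobServiceClient

ACCOUNT = "devstoreaccount1"
KEY = (
    "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/"
    "KBHBeksoGMGw=="
)

with DockerContainer("mcr.microsoft.com/azure-storage/azurite").with_exposed_ports(10000) as azurite:
    host = azurite.get_container_host_ip()
    port = azurite.get_exposed_port(10000)  # Azurite's default blob port
    conn_str = (
        f"DefaultEndpointsProtocol=http;AccountName={ACCOUNT};AccountKey={KEY};"
        f"BlobEndpoint=http://{host}:{port}/{ACCOUNT};"
    )
    client = BlobServiceClient.from_connection_string(conn_str)
    client.create_container("it-works")
```

This only exercises the blob endpoint, of course; the point of this issue is that there is no DFS endpoint to target the same way.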
When will you release the ADLS feature?
Any updates on this?
Hi guys, we hear your feedback and the asks to support Data Lake Gen2 on Azurite. We do think it's a valid ask, and we'll keep this open for collecting requirements and feedback. Unluckily, it's not among our current priorities yet. At the same time, if possible, please try to leverage your contacts with Azure and reach out to the Azure AdlsGen2 team with your asks directly. We have discussed this ask with the AdlsGen2 team before; it's better if they get more direct feedback to understand the scenario and its importance.
Another request for Data Lake Gen2: #909
+1 This would be really helpful for testing features using ADLS Gen2.
Any updates on this?
@XiaoningLiu Is there any update on when this feature will arrive?
Any update on this?
@blueww @XiaoningLiu Any progress? It would be good to have this feature for integration tests.
Yes please. Especially since it just straight up fails when testing .NET AzureDataLakeFileClient connections and writes, without a particularly useful error message.
Is there any chance that we will receive any information on this? For some scenarios, the lack of ADLS Gen2 support is a killer.
Attempting to build a dockerized, fully self-contained environment for doing some end-to-end testing using Cypress. Would love it if ADLS Gen2 support were available for this. Has any further discussion about this taken place, @XiaoningLiu?
Yes, it's on our radar and being regularly reviewed. Pending Azurite features are scheduled per customer asks, importance, workload, and team resources. Currently we are working on prioritized work items like Blob Batch, User Delegation SAS, etc.
Support for ADLS Gen2 in this project is pretty critical for any team building support for ABFS. Without this simulator, testing integrations in projects like Trino and Iceberg will require coordinating volunteers who are trusted enough to have real Azure credentials, which slows development. I have used this project for building and testing integrations with the blob APIs, and it makes this work enjoyable (I can assure you integrating with most cloud systems is just painful).
Would like to join the request, especially given that more and more users are transitioning to ADLS Gen2.
This is on our radar and being regularly reviewed. We will need Azure AdlsGen2 team support to implement this feature in Azurite. If possible, please try to leverage your contacts with Azure and reach out to the Azure AdlsGen2 team with your asks directly. It's better if they get more direct feedback to understand the scenario and its importance.
@MahmoudGSaleh, @N-o-Z, @Arnaud-Nauwynck, @liabozarth, @dain, @arony, @kacperniwczykr1, @felixnext, @barry-jones, @ac710k, @Amro77, @jonycodes Would you please share how you would like to use AdlsGen2 with Azurite? ADLS Gen2, though exposed as a REST API, was designed to be used by drivers (mostly ABFS). This information will help us better prioritize the feature for Azurite.
Directory creation and manipulation, for one. (The client may end up using the ABFS, but any C# server code that sets up a filesystem for e.g. integration testing will want to use the SDK.)
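To make that concrete, here is a minimal Python sketch (Python standing in for the C# mentioned above) of the kind of SDK-driven filesystem setup an integration test wants to run against an emulator. The local endpoint URL is hypothetical: Azurite does not expose a DFS endpoint today, which is exactly what this issue asks for.

```python
# A minimal sketch using azure-storage-file-datalake. The account URL is a
# hypothetical local DFS endpoint; today this code only works against a real
# HNS-enabled storage account.
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT = "devstoreaccount1"
KEY = "<well-known Azurite account key>"

service = DataLakeServiceClient(
    account_url=f"http://127.0.0.1:10000/{ACCOUNT}",  # hypothetical emulator DFS endpoint
    credential=KEY,
)

# Create a filesystem (container), a nested directory, and a small file.
fs = service.create_file_system("test-fs")
directory = fs.create_directory("raw/events/2024")
file_client = directory.create_file("sample.json")
data = b'{"hello": "world"}'
file_client.append_data(data, offset=0, length=len(data))
file_client.flush_data(len(data))
```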
@blueww We need Azurite to simulate ADLS Gen2 behavior, specifically in the way it deals with directories and object listing (HNS).
@blueww
I work on Trino, and we are in the process of replacing Hadoop dependencies with custom code, because the Hadoop code is leaky, rarely updated, and kind of not well maintained. As part of this we are building new file system interfaces that use the cloud storage APIs directly (instead of going through HDFS). To write this code we need to be able to test, and Azurite is a great way for the volunteer open source developers to test changes without needing access to an Azure account. The key to making this work is the
We make heavy use of ACLs. Azure doesn't provide very useful tools to manage ACLs, so we made our own commands. However, we'd rather test against a local test server than a real storage account.
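For illustration, a hedged Python sketch of the kind of ACL round-trip such custom commands perform, using azure-storage-file-datalake (the account, path, and group OID are made up, and this only works against a real HNS-enabled account today):

```python
# A minimal ACL round-trip sketch with azure-storage-file-datalake. The
# account URL, filesystem/directory names, and group OID are illustrative.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://myaccount.dfs.core.windows.net",  # hypothetical account
    credential="<account-key>",
)
directory = service.get_file_system_client("data").get_directory_client("raw")

# Read the current ACL...
acl_props = directory.get_access_control()
print(acl_props["acl"])  # e.g. "user::rwx,group::r-x,other::---"

# ...then grant a (hypothetical) AAD group read/execute access.
new_acl = acl_props["acl"] + ",group:00000000-0000-0000-0000-000000000000:r-x"
directory.set_access_control(acl=new_acl)
```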
We have added a wiki with our requirements and general expectations for PRs that add new ADLS Gen2 support to Azurite. Azurite welcomes contributions!
The DFS endpoint is not available in Azure if hierarchical namespace is not enabled. The blob endpoint works on HNS accounts with some limitations (indexed tags don't work with HNS, for example).
@mlongtin0
My bad, I could swear I tried it and it failed. Seems to work fine.
+1 for HNS support for build pipeline unit tests
…lesystem (#38505)

### Rationale for this change
`GetFileInfo` is an important part of an Arrow filesystem implementation.

### What changes are included in this PR?
- Start `azurefs_internal` similar to the GCS and S3 filesystems.
- Implement `HierarchicalNamespaceDetector`.
  - This does not use the obvious and simple implementation. It uses a more complicated option inspired by `hadoop-azure` that avoids requiring the significantly elevated permissions needed for `blob_service_client->GetAccountInfo()`.
  - This can't be detected at initialisation time of the filesystem because it requires a `container_name`. It's packed into its own class so that the result can be cached.
- Implement `GetFileInfo` for single paths.
  - Supports hierarchical or flat namespace accounts and takes advantage of a hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is actually noticeable just from running the `GetFileInfoObjectWithNestedStructure` test against real flat and hierarchical accounts: about 3 seconds with a hierarchical namespace, 5 seconds with a flat namespace.
- Update tests with TODO(GH-38335) to now use this implementation of `GetFileInfo`, replacing the temporary direct Azure SDK usage.
- Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage are not provided, the real blob storage tests will be skipped.

### Are these changes tested?
Yes. There are new Azurite-based tests for everything that can be tested with Azurite. There are also some tests designed to run against a real blob storage account, because [Azurite cannot emulate a hierarchical namespace account](Azure/Azurite#553). Additionally, some of the behaviour used to detect a hierarchical namespace account differs on Azurite compared to a real flat namespace account. These tests are automatically skipped unless environment variables provide details for connecting to the relevant real storage accounts. Initially I based the tests on the GCS filesystem, but I added a few extras where I thought it was appropriate.

### Are there any user-facing changes?
Yes. `GetFileInfo` is now supported on the Azure filesystem.

* Closes: #38335

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
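The `hadoop-azure`-inspired detection trick described in that commit amounts to probing a DFS-only operation and classifying the failure, rather than calling `GetAccountInfo()`, which needs elevated permissions. A hedged Python rendering of the idea (the Arrow implementation is C++; the error classification here is an assumption and may vary by service version):

```python
# Sketch of an HNS probe: ask for the ACL of a container's root directory,
# which only HNS-enabled accounts support, instead of GetAccountInfo().
from azure.core.exceptions import HttpResponseError
from azure.storage.filedatalake import DataLakeServiceClient


def has_hierarchical_namespace(service: DataLakeServiceClient, container: str) -> bool:
    root = service.get_file_system_client(container).get_directory_client("/")
    try:
        root.get_access_control()  # DFS-only operation
        return True
    except HttpResponseError as err:
        # Flat-namespace accounts reject ACL operations. Treating a 400 as
        # "not HNS" is an assumption; auth or missing-container errors should
        # be surfaced, not swallowed.
        if err.status_code == 400:
            return False
        raise
```

As the commit message notes, the result is cached per container, since the probe can't run at filesystem initialisation time.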
Bummer, just wasted 4h getting my tests to use Azurite, and now I see that AdlsGen2 isn't supported (my fault!). Any guidance on how to test AdlsGen2 calls locally otherwise (using Python)?
We are using It Azurite supports
@XiaoningLiu writes
I will apologize in advance for being abrasive here, but this just feels like Microsoft is trolling the developers at this point. The only alternative that exists (testing against an actual storage account) incurs significant costs. Looking at the code base, this should just be a week of work for the Azure Storage team to implement an in-memory version of a file system adhering to the ADLS Gen2 features :(. It has already been close to 4 years since this issue was opened, and the Azure teams responsible for this effort have still not "prioritized" work that improves the dev experience of writing code against one of the core features of Azure Storage!!
From the implementation design discussion here, I see the following:
Would it be better to use some form of trie as the underlying data structure, where each node corresponds to a path segment?
Each node will contain the following metadata
File nodes can be restricted to not contain children. File nodes will additionally have a file_data pointer that points to a byte array. To start with, you could restrict the byte array length (say 4 MB, and support files up to 4 MB). Allocate the byte arrays as an array of arrays and reuse the byte arrays themselves for memory reasons. A sketch of this idea follows below.
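Here is a minimal, self-contained Python sketch of that suggestion (names like `Node` and `MAX_FILE_BYTES` are made up for illustration; Azurite itself is TypeScript and persists metadata, so this is only the in-memory shape of the idea):

```python
# A trie-backed hierarchical namespace sketch: one node per path segment,
# directories hold a child map, files hold a capped byte buffer.
from __future__ import annotations
from typing import Dict, Optional

MAX_FILE_BYTES = 4 * 1024 * 1024  # the suggested 4 MB starting cap


class Node:
    def __init__(self, name: str, is_file: bool = False) -> None:
        self.name = name
        self.is_file = is_file
        self.children: Dict[str, Node] = {}  # always empty for file nodes
        self.file_data: Optional[bytearray] = bytearray() if is_file else None


class NamespaceTrie:
    def __init__(self) -> None:
        self.root = Node("")

    def mkdirs(self, path: str) -> Node:
        """Create intermediate directory nodes, one per path segment."""
        node = self.root
        for segment in filter(None, path.split("/")):
            if segment not in node.children:
                node.children[segment] = Node(segment)
            node = node.children[segment]
            if node.is_file:
                raise ValueError(f"'{segment}' is a file, not a directory")
        return node

    def create_file(self, path: str, data: bytes) -> Node:
        if len(data) > MAX_FILE_BYTES:
            raise ValueError("file exceeds the 4 MB sketch limit")
        parent_path, _, name = path.rstrip("/").rpartition("/")
        parent = self.mkdirs(parent_path)
        file_node = Node(name, is_file=True)
        file_node.file_data = bytearray(data)
        parent.children[name] = file_node
        return file_node
```

With this shape, the HNS operations that are awkward on flat blob storage become cheap: renaming or listing a directory is a single child-map operation on one node rather than a prefix scan over every blob.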
@cool-mist For your suggestion to use a trie: it might be applicable; however, the current design comes from:
We can revisit it and discuss more when we finish Phase I and start the Phase II implementation. Azurite welcomes contributions!
Which service (blob, file, queue, table) does this issue concern?
Not currently existing in V3: AdlsGen2 (Datalake).
This is a question: is there a plan to support AdlsGen2 (Datalake Storage) on top of the blobstore emulator?
If yes, when would it be available?
Which version of Azurite was used?
V3; unusable yet, lacking AdlsGen2 support.
Where do you get Azurite? (npm, DockerHub, NuGet, Visual Studio Code Extension)
npm
What's the Node.js version?
What problem was encountered?
Steps to reproduce the issue?
Have you found a mitigation/solution?
No.
Not able to develop locally with the emulator; forced to connect to Azure.