Is there a plan to support AdlsGen2 (Datalake Storage) on top of blobstore emulator ? #553

Open
Arnaud-Nauwynck opened this issue Sep 10, 2020 · 38 comments

@Arnaud-Nauwynck

Arnaud-Nauwynck commented Sep 10, 2020

Which service(blob, file, queue, table) does this issue concern?

Not currently available in V3: AdlsGen2 (Datalake)

This is a question: Is there a plan to support AdlsGen2 (Datalake Storage) on top of blobstore emulator ?
If yes, when would it be available?

Which version of the Azurite was used?

V3; not usable for us yet because it lacks ADLS Gen2 support

Where do you get Azurite? (npm, DockerHub, NuGet, Visual Studio Code Extension)

npm

What's the Node.js version?

What problem was encountered?

Steps to reproduce the issue?

Have you found a mitigation/solution?

No.
We are not able to develop locally with the emulator and are forced to connect to Azure.

@XiaoningLiu XiaoningLiu self-assigned this Sep 14, 2020
@XiaoningLiu XiaoningLiu added datalake featureparity Tracking issues for catching up feature parity labels Sep 14, 2020
@XiaoningLiu
Member

Hi @Arnaud-Nauwynck, can you describe your usage scenarios further? For example, are you using a datalake gen2 account with or without hierarchical namespace enabled? Do you use any datalake storage SDKs? Which datalake gen2 features are you most interested in? Interop between blob and datalake gen2?

@rh99

rh99 commented Dec 7, 2020

As for me, I'd like to use Spark and hadoop-azure, as described here: https://stackoverflow.com/questions/65050695/how-can-i-read-write-data-from-azurite-using-spark

@yuranos

yuranos commented Mar 16, 2021

Hi @XiaoningLiu
We are using Gen 2 with Hierarchical namespace enabled.
We write our apps with Akka Streams and need to write to and read from Datalake Gen 2.
Akka Streams has HDFS connector and since Datalake Gen 2 is HDFS-compatible, so far so good.

However, we need to add test coverage to our code: not just basic unit tests, but something that can verify that the integration is OK end to end without being tightly coupled to the network for real calls to Azure.

We implemented what we needed with testcontainers, but Hadoop inside Docker is a bit tricky to use and requires quite an unconventional setup (with host resolutions, some env vars, permissions, etc.), while all we need is the filesystem for tests.

@KlaudiuszBryjaRelativity

When will you release the ADLS feature?
BTW, do you have a plan to add support for file shares?

@karimdabbagh

Any updates on this?

@XiaoningLiu
Member

XiaoningLiu commented Jul 12, 2021

Hi all, we have received your feedback and the requests to support datalake gen2 on Azurite. We do think it's a valid ask, and we are keeping this issue open to collect requirements and feedback. Unfortunately, it's not among our current priorities yet.

In the meantime, if possible, please leverage your contacts at Azure and reach out to the Azure AdlsGen2 team directly with your asks. We have discussed this ask with the AdlsGen2 team once; it's better if they get more direct feedback so they can better understand the scenarios and their importance.

@blueww
Member

blueww commented Jul 14, 2021

Another request for datalake gen2 #909

@jonycodes

+1 This should be really helpful for testing features using ADLS Gen2.

@Amro77

Amro77 commented Oct 25, 2021

Any updates on this

@felixnext

@XiaoningLiu Is there any update on when this feature will arrive?

@ac710k

ac710k commented Jul 19, 2022

Any update on this?

@arony

arony commented Jul 25, 2022

@blueww @XiaoningLiu any progress? It would be good to have this feature for integration tests.

@barry-jones

Yes please, especially since it just straight up fails when testing .NET AzureDataLakeFileClient connections and writes, without a particularly useful error message.

@kacperniwczykr1

Is there any chance that we will receive any information on this? For some scenarios, the lack of ADLS Gen2 support is a killer.

@darena-patrick

Attempting to build a dockerized, fully self-contained environment for doing some end to end testing using Cypress. Would love if ADLS Gen2 support were available for this. Has any further discussion about this taken place @XiaoningLiu?

@XiaoningLiu
Member

Attempting to build a dockerized, fully self-contained environment for doing some end to end testing using Cypress. Would love if ADLS Gen2 support were available for this. Has any further discussion about this taken place @XiaoningLiu?

Yes, it's on our radar and being regularly reviewed. Azurite pending features are scheduled per customer asks, importance, workload and team resource. Currently we are working on prioritized work items like Blob Batch, User Delegation SAS etc.

@dain

dain commented Mar 16, 2023

Support for ADLS Gen2 in this project is pretty critical for any team building support for ABFS. Without this simulator, testing integrations in projects like Trino and Iceberg will require coordination of volunteers that are trusted enough to have real Azure credentials, which slows development. I have used this project for building and testing integrations with blob APIs, and it makes this work enjoyable (I can assure you integrating with most cloud systems is just painful).

@blueww
Member

blueww commented Apr 25, 2023

@MahmoudGSaleh , @N-o-Z, @Arnaud-Nauwynck, @liabozarth, @dain , @arony , @kacperniwczykr1, @felixnext , @barry-jones , @ac710k, @Amro77, @jonycodes

Would you please share how you would like to use AdlsGen2 with Azurite?

Although ADLS Gen2 is exposed as a REST API, it was designed to be used by drivers (mostly ABFS).
Could you please share which features of the ADLS Gen2 DFS endpoint you are interested in using via REST that are not exposed via Blob?

This information will help us better prioritize the feature for Azurite.

@Arithmomaniac

Directory creation and manipulation, for one. (The client may end up using the ABFS, but any C# server code that sets up a filesystem for e.g. integration testing will want to use the SDK.)
And once you need to use ADLS REST for anything, there's a decent chance your application won't use the Blob API at all, even for things that it could.

@N-o-Z

N-o-Z commented Apr 25, 2023

@blueww We need Azurite to simulate ADLS Gen2 behavior, specifically the way it deals with directories and object listing (HNS).
We provide our clients with services over their Azure storage accounts, and we want to be able to test that our logic works for both blob storage and ADLS Gen2. We are using the Azurite simulator for our unit tests and as of now can only verify correctness against the blob storage behavior.
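To illustrate the kind of difference being described, here is a minimal in-memory sketch (not Azurite or Azure SDK code). It assumes the usual distinction that in a flat namespace a "directory" is only inferred from blob name prefixes, while in a hierarchical namespace a directory is an entry of its own. The names `list_flat` and `HierarchicalStore` are hypothetical.

```python
from typing import Dict, List, Set


def list_flat(blobs: Dict[str, bytes], prefix: str = "", delimiter: str = "/") -> List[str]:
    """List one level of a flat namespace; directories are inferred from blob names."""
    entries: Set[str] = set()
    for name in blobs:
        if not name.startswith(prefix):
            continue
        head, sep, _ = name[len(prefix):].partition(delimiter)
        # A delimiter after the prefix makes the entry a virtual directory.
        entries.add(prefix + head + sep)
    return sorted(entries)


class HierarchicalStore:
    """Toy hierarchical namespace: directories exist independently of files."""

    def __init__(self) -> None:
        self.dirs: Set[str] = set()
        self.files: Dict[str, bytes] = {}

    def create_directory(self, path: str) -> None:
        # Register the directory and every ancestor, each with a trailing slash.
        parts = path.strip("/").split("/")
        for i in range(1, len(parts) + 1):
            self.dirs.add("/".join(parts[:i]) + "/")

    def upload(self, path: str, data: bytes) -> None:
        parent = path.rpartition("/")[0]
        if parent:
            self.create_directory(parent)
        self.files[path] = data

    def list_paths(self, directory: str = "") -> List[str]:
        """List the direct children of `directory` ("" means the root)."""
        children: Set[str] = set()
        for d in self.dirs:
            if d.startswith(directory) and d != directory and "/" not in d[len(directory):-1]:
                children.add(d)
        for f in self.files:
            if f.startswith(directory) and "/" not in f[len(directory):]:
                children.add(f)
        return sorted(children)
```

In the flat model, deleting the last blob under `a/` makes `a/` disappear from listings, and an empty directory cannot be represented at all. In the hierarchical model both survive, which is the behavior a test against blob-only emulation cannot verify.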

@kacperniwczykr1

@blueww We have a scenario very similar to the one described by @N-o-Z. We want to make sure that the data we are working on is properly structured within our tests. HNS is the key feature that we are missing.

@karimdabbagh

@blueww
Our service (c# code) primarily works with directories (and files). So, in order to have our unit/integration tests use the same client in our code we would need to either emulate HNS ourselves or use Azurite with HNS enabled.

@dain

dain commented Apr 25, 2023

Would you please share how you would like to use AdlsGen2 with Azurite?

I work on Trino, and we are in the process of replacing Hadoop dependencies with custom code, because the Hadoop code is leaky, rarely updated, and kind of not well maintained. As part of this, we are building new file system interfaces that use the cloud storage APIs directly (instead of through HDFS). To write this code we need to be able to test, and Azurite is a great way for the volunteer open source developers to test changes without needing access to an Azure account. The key to making this work is that the blob and dfs APIs need to behave exactly the same as Azure, which I have found not to be the case even for the blob APIs (e.g., paths are not normalized in Azurite like they are in Azure). In general, without something like Azurite, maintaining Azure integration will be harder (and generally that means less maintenance).

@mlongtin0

We make heavy use of ACLs. Azure doesn't provide very useful tools to manage ACLs, so we made our own commands. However, we'd rather test against a local test server than a real storage account.

@blueww
Member

blueww commented Jul 20, 2023

We have added a wiki page describing our requirements and general expectations for PRs that add ADLS Gen2 support to Azurite.
https://github.com/Azure/Azurite/wiki/ADLS-Gen2-Implementation-Guidance

Azurite welcomes contributions!
If you would like to help implement ADLS Gen2 in Azurite, please read the wiki and follow it when designing and implementing the feature (ideally review the detailed design with us first) to get a smooth PR review and merge.

@mlongtin0

mlongtin0 commented Jul 20, 2023

The DFS endpoint is not available in Azure if hierarchical namespace is not enabled. The blob endpoint works on HNS accounts with some limitation (indexed tags don't work with HNS, for example).

@blueww
Member

blueww commented Jul 21, 2023

@mlongtin0
Per our tests, the DFS endpoint is currently available on storage accounts that do not have hierarchical namespace enabled, although some DFS APIs/parameters are not supported on this kind of account.
And in the DFS REST API doc, you can see many parameters marked "only valid if Hierarchical Namespace is enabled for the account", which means the API is available on non-HNS storage accounts, but these parameters are not.

@mlongtin0

My bad, I could swear I tried it and it failed. Seems to work fine.

@jasonmohyla

+1 for HNS support for build pipeline unit tests

kou added a commit to apache/arrow that referenced this issue Nov 9, 2023
…lesystem (#38505)

### Rationale for this change

`GetFileInfo` is an important part of an Arrow filesystem implementation. 

### What changes are included in this PR?
- Start `azurefs_internal` similar to GCS and S3 filesystems. 
- Implement `HierarchicalNamespaceDetector`. 
  - This does not use the obvious and simple implementation. It uses a more complicated option inspired by `hadoop-azure` that avoids requiring the significantly elevated permissions needed for `blob_service_client->GetAccountInfo()`.
  - This can't be detected at initialisation time of the filesystem because it requires a `container_name`. It's packed into its own class so that the result can be cached.
- Implement `GetFileInfo` for single paths. 
  - Supports hierarchical or flat namespace accounts and takes advantage of hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is actually noticeable just from running the `GetFileInfoObjectWithNestedStructure` test against real flat and hierarchical accounts. It's about 3 seconds with hierarchical namespace or 5 seconds with a flat namespace.
- Update tests with TODO(GH-38335) to now use this implementation of `GetFileInfo` to replace the temporary direct Azure SDK usage.
- Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage are not provided, the real blob storage tests will be skipped.
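
The caching idea behind `HierarchicalNamespaceDetector` can be sketched roughly as follows. This is a Python illustration only, not the Arrow C++ implementation: `probe` is a hypothetical callable standing in for whatever low-privilege request is used, and the sketch assumes the answer is the same for every container in the account, so one cached value suffices.

```python
from typing import Callable, Optional


class HierarchicalNamespaceDetector:
    """Remembers whether the account has a hierarchical namespace.

    The check needs a container name, so it cannot run when the
    filesystem object is created; it runs lazily on first use.
    """

    def __init__(self, probe: Callable[[str], bool]) -> None:
        # `probe` is hypothetical: it takes a container name and returns
        # True when the response indicates a hierarchical namespace.
        self._probe = probe
        self._enabled: Optional[bool] = None

    def enabled(self, container_name: str) -> bool:
        if self._enabled is None:
            # Only the first call reaches the service; later calls are cached.
            self._enabled = self._probe(container_name)
        return self._enabled
```

Wrapping the result in its own object keeps the cache next to the logic that fills it, so callers such as a `GetFileInfo` implementation only ask `enabled(container_name)`.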

### Are these changes tested?

Yes. There are new Azurite based tests for everything that can be tested with Azurite. 

There are also some tests that are designed to test against a real blob storage account. This is because [Azurite cannot emulate a hierarchical namespace account](Azure/Azurite#553). Additionally some of the behaviour used to detect a hierarchical namespace account is different on Azurite compared to a real flat namespace account. These tests will be automatically skipped unless environment variables are provided with details for connecting to the relevant real storage accounts. 
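
As a rough Python analogue of that skipping behaviour (the Arrow tests themselves are C++), a test class can be gated on environment variables. The variable names below are hypothetical placeholders, not the ones used by Arrow.

```python
import os
import unittest

# Hypothetical variable names, used only for this illustration.
REAL_ACCOUNT = os.environ.get("EXAMPLE_HNS_ACCOUNT_NAME")
REAL_KEY = os.environ.get("EXAMPLE_HNS_ACCOUNT_KEY")


@unittest.skipUnless(REAL_ACCOUNT and REAL_KEY, "real storage account details not provided")
class RealHierarchicalAccountTest(unittest.TestCase):
    """Runs only when credentials for a real account are supplied."""

    def test_credentials_present(self) -> None:
        # A real test would connect to the account here; this sketch only
        # shows the gating, so it just checks the variables are set.
        self.assertTrue(REAL_ACCOUNT)
        self.assertTrue(REAL_KEY)
```

With neither variable set, the class is reported as skipped rather than failed, so emulator-only CI runs stay green.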

Initially I based the tests on the GCS filesystem but I added a few extras where I thought it was appropriate. 

### Are there any user-facing changes?
Yes. `GetFileInfo` is now supported on the Azure filesystem. 

* Closes: #38335

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…ure filesystem (apache#38505)

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…ure filesystem (apache#38505)

@dekiesel

dekiesel commented May 7, 2024

Bummer, just wasted 4h to get my tests to use azurite and now I see that adlsgen2 isn't supported (my fault!). Any guidance on how to test adlsgen2 calls locally otherwise (using python)?

@stenneepro

We are using Azure Data Lake Storage Gen2 just to control access at the folder level.
For example, there are three roles: admin, supervisor, and manager.
Admin can access all folders and files in the container, supervisor can access only the supervisor folder, and manager can access only the manager folder.
We generate a SAS token when a user logs in, and they use the SAS token to access files.

If Azurite supported ADLS, that would be great for local development.
Or is there any other solution to implement folder-level access with Azure Blob Storage?
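
The role rule described here can be written down as a plain prefix check. This is only an illustration of the role-to-folder mapping in application code, not an Azure or Azurite feature and not a replacement for SAS scoping; the folder names and the `can_access` helper are assumptions.

```python
from posixpath import normpath

# Hypothetical mapping from role to the folder prefixes it may touch.
# The empty prefix means "everything in the container".
ROLE_FOLDERS = {
    "admin": [""],
    "supervisor": ["supervisor/"],
    "manager": ["manager/"],
}


def can_access(role: str, path: str) -> bool:
    """Return True if `role` may access `path` under the mapping above."""
    # Normalise first so "supervisor/../manager/x" cannot escape its folder.
    clean = normpath("/" + path).lstrip("/")
    # The trailing slash lets the folder itself match its own prefix.
    return any((clean + "/").startswith(p) for p in ROLE_FOLDERS.get(role, []))
```

For example, `can_access("supervisor", "manager/report.csv")` is false while `can_access("admin", "manager/report.csv")` is true.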

@cool-mist

@XiaoningLiu writes

Yes, it's on our radar and being regularly reviewed. Azurite pending features are scheduled per customer asks, importance, workload and team resource. Currently we are working on prioritized work items like Blob Batch, User Delegation SAS etc.

I will apologize in advance for being abrasive here, but this just feels like Microsoft is trolling the developers at this point. The only alternative that exists (testing this against an actual storage account) incurs significant costs. Looking at the code base, this should just be a week of work for the Azure storage team to implement an in-memory version of a file system adhering to the ADLS Gen2 features :(. It has already been close to 4 years since the issue was opened, and the Azure teams responsible for this effort have still not "prioritized" this work, which would improve the dev experience of writing code against one of the core features of Azure storage!!

@cool-mist

From the implementation design discussion here, I see the following

  1. Implement HNS metadata Store in Azurite
    i. Any schema change or new table design should be reviewed and signed off.
    ii. We need to maintain hierarchical relationships between parent-child dir/file. For example, we can add a table to match each item (blob/dir) with its parent, and integrate existing blob tables and the new table added above (Detail design need discussion).
    iii. Blob/file binary payload persistency based on local files shouldn’t be changed.
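
For reference, the parent table idea in item ii could look roughly like the sketch below. It uses Python's built-in `sqlite3` purely for illustration; the table name, column names and helper functions are assumptions, not Azurite's actual schema.

```python
import sqlite3
from typing import List, Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE items (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES items(id),
        name      TEXT NOT NULL,
        is_dir    INTEGER NOT NULL,
        UNIQUE (parent_id, name)
    )
    """
)


def add_item(parent_id: Optional[int], name: str, is_dir: bool) -> int:
    """Insert a directory or blob under `parent_id` and return its row id."""
    cur = conn.execute(
        "INSERT INTO items (parent_id, name, is_dir) VALUES (?, ?, ?)",
        (parent_id, name, int(is_dir)),
    )
    return cur.lastrowid


def children(parent_id: int) -> List[str]:
    """List the names of the direct children of a directory."""
    rows = conn.execute(
        "SELECT name FROM items WHERE parent_id = ? ORDER BY name", (parent_id,)
    )
    return [row[0] for row in rows]


def rename(item_id: int, new_parent_id: int, new_name: str) -> None:
    """Move or rename an item; its descendants follow because they reference its id."""
    conn.execute(
        "UPDATE items SET parent_id = ?, name = ? WHERE id = ?",
        (new_parent_id, new_name, item_id),
    )


# An explicit root row keeps every other item's parent_id non-NULL.
root = add_item(None, "", True)
```

With this layout a directory rename touches a single row, whereas a flat blob table would need every blob name under the prefix rewritten.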

Would it be better to use some form of trie as the underlying data structure, where each node corresponds to a path segment?

  1. Create path/directory = add a node.
  2. Delete path/directory = delete a node. If recursive flag is set, drop the sub-tree, else error
  3. Rename path/directory = detach the sub-trie and parent it under the new path.
  4. Update paths

Each node will contain the following metadata

  1. x-ms-properties dictionary
  2. acl rules for the path
  3. isDir flag
  4. file_data pointer to a utf-8 byte array

File nodes can be restricted to not contain children. File nodes will additionally have a file_data pointer that points to a byte array. To start with, you could restrict the byte array length (say 4MB, supporting files of at most 4MB). Allocate the byte arrays as an array of arrays and reuse the byte arrays themselves for memory reasons.
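
A minimal sketch of the trie described above, written in Python for brevity (Azurite itself is TypeScript). Class and field names are hypothetical, the "update paths" operation is omitted, and the byte-array pooling and 4MB limit are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    is_dir: bool = True
    properties: Dict[str, str] = field(default_factory=dict)  # x-ms-properties
    acl: str = ""                                             # ACL rules for the path
    data: Optional[bytes] = None                              # payload, files only
    children: Dict[str, "Node"] = field(default_factory=dict)


class PathTrie:
    def __init__(self) -> None:
        self.root = Node()

    @staticmethod
    def _segments(path: str) -> List[str]:
        return [s for s in path.split("/") if s]

    def find(self, path: str) -> Optional[Node]:
        node: Optional[Node] = self.root
        for segment in self._segments(path):
            node = node.children.get(segment)
            if node is None:
                return None
        return node

    def create(self, path: str, is_dir: bool = True, data: Optional[bytes] = None) -> Node:
        *parents, name = self._segments(path)
        node = self.root
        for segment in parents:
            if not node.is_dir:
                raise ValueError("file nodes cannot have children")
            # Missing intermediate directories are created on the way down.
            node = node.children.setdefault(segment, Node())
        if not node.is_dir:
            raise ValueError("file nodes cannot have children")
        node.children[name] = Node(is_dir=is_dir, data=data)
        return node.children[name]

    def delete(self, path: str, recursive: bool = False) -> None:
        *parents, name = self._segments(path)
        parent = self.find("/".join(parents))
        if parent is None or name not in parent.children:
            raise KeyError(path)
        if parent.children[name].children and not recursive:
            raise ValueError("directory not empty")
        # Dropping the node drops its whole sub-tree.
        del parent.children[name]

    def rename(self, src: str, dst: str) -> None:
        src_segments, dst_segments = self._segments(src), self._segments(dst)
        if dst_segments[: len(src_segments)] == src_segments:
            raise ValueError("cannot move a path into itself")
        src_parent = self.find("/".join(src_segments[:-1]))
        dst_parent = self.find("/".join(dst_segments[:-1]))
        if src_parent is None or src_segments[-1] not in src_parent.children:
            raise KeyError(src)
        if dst_parent is None or not dst_parent.is_dir:
            raise KeyError(dst)
        # Detach the sub-trie and re-parent it under the new name.
        dst_parent.children[dst_segments[-1]] = src_parent.children.pop(src_segments[-1])
```

Rename is a single pointer move regardless of how many descendants the directory has, which is the main appeal of the trie for directory operations.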

@blueww
Member

blueww commented Aug 23, 2024


@cool-mist
Thanks for the suggestion!

Regarding your suggestion to use a trie: it might be applicable. However, the current design comes from the following considerations:

  1. We should use a structure similar to the Azure server implementation to get behavior/performance similar to the Azure server.
  2. Utilize the current Azurite implementation and change as little as possible to lower cost and regression risk.

We can revisit it and discuss more when we finish Phase I and start the Phase II implementation.

Azurite welcomes contributions!
If you are interested in implementing Datalake Gen2 in Azurite, would you please share your detailed design? Once we reach agreement on it, you could raise implementation PRs (they should be split into several small PRs to help review).
We could start with Phase I (DFS endpoint on FNS account), then Phase II (DFS/Blob endpoint on HNS account).

@fatih-celonis

Hey, this feature was requested 4 years ago initially. Is there any plan to implement it?
