Add Support for Microsoft OneLake #1418
This is of great interest to my team as well -- the promise of OneLake as a managed access layer for a data lake (but being open to any compute engine to write) would be absolutely incredible!
From the error and the fact that a temp file is remaining, it does look like the rename operation is failing. Can you share the URL scheme that the Delta table is being loaded with? I'm wondering if Azure is exposing "raw" ADLSv2 or masking it somehow. This may require us to have a specific ...
It seems like accessing the underlying OneLake storage should be compatible with the ADLS Gen2 APIs, but requires a quite different kind of URL: https://learn.microsoft.com/en-us/fabric/onelake/onelake-access-api. It looks like they are using the Gen2 APIs. Unfortunately, object_store currently always calls the Blob APIs, which work with both Gen1 and Gen2 storage but may not work with OneLake; we would have to find out. While the docs are very limited right now, I have not found any blocking disparity between Gen2 and OneLake storage on the API side. The authorization docs are very vague, but it sounds like things may just work using the same tokens as storage accounts do. So if we are lucky we just need to handle URLs correctly; we may also need to reimplement some things to use Gen2-style requests only, as the OneLake APIs may no longer support Blob API requests.
Hi all, I'm on the OneLake team. I was trying to repro this on my side and was running into a "URL did not match any known pattern for scheme" error - does DeltaLake perform URL validation checks and is blocking the 'onelake.dfs.fabric.microsoft.com' format? Also, @roeap - right now you can use the format "onelake.blob.fabric.microsoft.com" for OneLake to support Blob API calls. However, this isn't supported for calls to the ABFS driver, so I'm not sure if that will help you here.
It looks like there is some validation of endpoints for Azure - see delta-rs/rust/src/storage/config.rs, line 124 (at b17f286).
If you are supporting only Blob, you can replace dfs in the URL with blob, which you might already be doing for ADLS Gen2.
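A minimal sketch of that substitution, assuming the endpoint names mentioned above (the workspace and item segments are placeholders):

```python
# Sketch: rewrite a OneLake DFS URL to its Blob-endpoint equivalent.
# Assumes the endpoint names discussed above; workspace/item names are placeholders.
from urllib.parse import urlparse, urlunparse

def to_blob_endpoint(url: str) -> str:
    """Swap the OneLake 'dfs' endpoint for the 'blob' endpoint."""
    parts = urlparse(url)
    host = parts.netloc.replace(
        "onelake.dfs.fabric.microsoft.com",
        "onelake.blob.fabric.microsoft.com",
    )
    return urlunparse(parts._replace(netloc=host))

print(to_blob_endpoint(
    "https://onelake.dfs.fabric.microsoft.com/myworkspace/mylakehouse.Lakehouse/Files/data"
))
# https://onelake.blob.fabric.microsoft.com/myworkspace/mylakehouse.Lakehouse/Files/data
```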
@mabasile-MSFT, @jocaplan - thanks for the info, this is great news! The code in delta-rs is actually just there to choose the type of storage backend; we will have to do some (small) updates in the object_store crate, where the actual client implementation lives.
object_store treats all variants of URLs (including ...). For the convenience of the user, we also like to allow for shorthand URLs, like ... Do you think it makes sense to maybe create a convention for OneLake as well? Maybe something like ...
Also, are there some conventions / guarantees around the top-level folders - i.e. would ...? Last but not least, we could allow for an optional ...
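For what it's worth, the OneLake paths that show up later in this thread take the form abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<item>/<path>. A rough sketch of splitting such a URL into those parts (the helper is purely illustrative, not an existing delta-rs or object_store API):

```python
# Sketch: split an abfss:// OneLake URL into workspace, item and path.
# Purely illustrative; not an existing delta-rs or object_store helper.
from urllib.parse import urlparse

def parse_onelake_url(url: str) -> dict:
    parts = urlparse(url)          # abfss://<workspace>@<host>/<item>/<path>
    workspace = parts.username or ""
    item, _, path = parts.path.lstrip("/").partition("/")
    return {"workspace": workspace, "item": item, "path": path}

print(parse_onelake_url(
    "abfss://[email protected]/test.Lakehouse/Tables/sample_table"
))
# {'workspace': 'tmp', 'item': 'test.Lakehouse', 'path': 'Tables/sample_table'}
```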
May I ask one thing that you may be able to answer or maybe forward to the storage PG? Since listing the storage is a key concern in Delta, some optimisations are possible if the storage supports sorted results and offsets, as S3 and GCS do. IIRC, Azure does not support that thus far. Spark, and soon delta-rs, leverages these optimizations whenever possible. Since it seems MSFT went "all in" on Delta, is there a chance for this to become a feature? :)
Happy to pull in the storage team and discuss in more detail. It might be easier to have a quick call if you are open to it - I can send an invite. There are certainly a lot of optimizations that you could do for OneLake since we have a well-known structure at the top and our tables are in Delta Lake. However, I am curious why the normal ADLS path doesn't work. Other than the domain in the URL, we did a lot of work to match the ADLS DFS/Blob endpoints so that existing ADLS apps just work, even if they call the REST APIs directly. We mapped our top-level concepts to ADLS concepts like storage accounts and containers. So far, any apps that didn't work were failing for one of two reasons: either we had a bug on our side, or the app was validating the domain in the URL.
Any update?
We should have a PR out this week with our fix!
@jocaplan @mabasile-MSFT - sorry for being MIA for a while.
Does this mean you are adding OneLake support to object_store, or is there another way in the works? Regarding the talk with the storage team - if that is still on the table... I am right now working on the Rust delta kernel implementation, trying to apply what we learned during delta-rs. As it turns out, there seems to be an assumption in Spark that one gets lexicographically sorted results from list, and it seems to work out. But I could not find any reference in the docs. A second question would be if there is any chance we can one day pass in a blob's key rather than an opaque marker to have list with offset. Any info on that would be great :)
Yes, we have two draft PRs - one adding OneLake support to object_store, and one adding support in delta-rs (by updating the URL validation). We're just performing the last checks on our side and then we'll have the PRs officially submitted!
Great! I'll see that we integrate with the URL parsing from object_store on this end.
@roeap - The latest version of the object_store crate (v0.7.0) has been published, which includes the changes required to support OneLake. The dependency needs to be updated in delta-rs. Is there any specific procedure for upgrading package dependencies? I raised a query, #1597, with more details.
@roeap - just wondering if you had an update here. Is there a way to get delta-rs to take up this latest object_store version so we can also publish the changes for delta-rs to support OneLake?
I believe support for this has landed via #1642 and will be in the 0.16 version of the crate.
We already tested #1642 in our Rust-based solution that pushes data directly to MS Fabric; this works wonderfully.
I hope we are not going to wait months for the Python package?
@rtyler please reopen the bug report, I just tested it with Python 0.11 and got the same error. @mabasile-MSFT is it fixed or not?
@rtyler please reopen the bug report until the issue is resolved; this simple example does not work.
@djouallah - It seems you are referencing a local path, in which case delta-rs will use the local filesystem implementation. To work with OneLake, you have to write that as an Azure path. Additionally, the option ...
@roeap thanks, the local filesystem is just a FUSE mount to Azure storage, so I guess that is a Fabric problem, not delta-rs. Is there any example of how to write to an Azure path with credentials and all?
There are some integration tests that are currently not executed in CI since they require a live credential - see delta-rs/rust/tests/integration_object_store.rs, lines 25 to 49 (at 18eec38).
In essence though, it's just like reading from any other Azure storage. The container name corresponds to the workspace and would be a GUID - I think. The ...
Could you share any error messages?
Since you did not configure a credential, it tries to query the metadata endpoint, which seems to not be available. I have never worked with Fabric, so I am not sure what the environments look like and whether that is expected. We have seen working instances with object_store and managed identities though... So you need to figure out what credential is available and configure it based on the options for Azure.
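As an illustration, a sketch of passing an explicit service-principal credential through storage_options (the option names follow object_store's Azure options; the tenant, client, and URL values are placeholders, and whether a given principal can access the workspace is a Fabric-side question):

```python
# Sketch: configure an explicit credential instead of relying on the metadata endpoint.
# Option names follow object_store's Azure options; all values are placeholders.
import pandas as pd
from deltalake import write_deltalake

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

storage_options = {
    "azure_tenant_id": "<tenant-id>",
    "azure_client_id": "<client-id>",
    "azure_client_secret": "<client-secret>",
    "use_fabric_endpoint": "true",
}

write_deltalake(
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse/Tables/sample_table",
    df,
    storage_options=storage_options,
)
```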
Also, the specified path is still wrong; either configure the URL in short form ...
Thanks @roeap, I will just pass. I am not an expert in Azure authentication, I don't know how to get a token, and the account key is not exposed in Azure Fabric. Microsoft should add more documentation on how this thing is supposed to work. Thanks for your help.
@djouallah - You can generate a token from PowerShell using ...
If you do not have the Azure PowerShell Az module installed, please follow the steps mentioned in https://learn.microsoft.com/en-us/powershell/azure/install-azure-powershell. Once this is done, please set the token in storage_options ...
@vmuddassir-msft thanks, my scenario is to ingest data into Fabric OneLake using a cloud function; I can't use PowerShell in that case.
I meant you can use PowerShell to generate a token, which can then be used with deltalake.
@vmuddassir-msft for how long will the token be valid?
Hi @rtyler, @vmuddassir-msft, please reopen the bug report until the issue is resolved; this simple example does not work:

```python
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
write_deltalake(
    "abfss://[email protected]/test.Lakehouse/Tables/sample_table2",
    df,
    storage_options={"bearer_token": aadToken, "use_fabric_endpoint": "true"},
)
```

error:
As for the FUSE issue, I'd like to answer it. Firstly, it will write a tmp#1 file locally; then it will rename tmp#1 to the tmp file, so FUSE will call the rename system call. So, I hope delta-rs can fix it. Thanks!
Hi @RobinLin666, based on the error message we see, it seems there is an artifact from a failed commit in your table that contains a character deemed illegal in the object_store crate (specifically '#'). While it may be legal for Azure Blob, this is not the case for all object stores out there, and object_store tries to be fully consistent across all its implementations.
Thanks for the context. Just making sure I understand: so in Fabric environments users will work against a filesystem mounted via blobfuse. This filesystem then makes the calls you describe, i.e. buffers files locally before sending them along to the remote store? The file suffixed with ... I have to think on that a little more :).
I opened a bug report with GlareDB, which uses delta-rs, and it seems it is a blobfuse bug.
Hi @roeap, thanks for your quick reply. I am using the latest version (0.12.0) of deltalake. I think ...
Thank you @djouallah. Just as @scsmithr said, it needs to close (drop) the file before calling rename.
Thanks @RobinLin666 for the code showing how to get a token, it does work for me using this code. Thank you everyone for your help, and sorry for my tone :)
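For reference, a minimal sketch of obtaining such a token from Python with the azure-identity package (an assumption, not necessarily the snippet referred to above) and passing it on:

```python
# Sketch: obtain a bearer token via azure-identity and pass it to write_deltalake.
# Assumes azure-identity is installed and a credential (CLI login, managed identity,
# environment variables, ...) is available; the table URL is a placeholder.
import pandas as pd
from azure.identity import DefaultAzureCredential
from deltalake import write_deltalake

token = DefaultAzureCredential().get_token("https://storage.azure.com/.default").token

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
write_deltalake(
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>.Lakehouse/Tables/sample_table",
    df,
    storage_options={"bearer_token": token, "use_fabric_endpoint": "true"},
)
```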
@roeap @jocaplan @mabasile-MSFT was there any discussion off this thread about opening a request to the Azure storage team to support pushing down an offset filter to the list API?
@alexwilcoxson-rel - unfortunately, due to me not following up, we never reached out to the storage team. So right now, the list_from method does some internal work to emulate this behaviour, but actually does a normal list, since we do not have a marker. However, given MSFT's commitment to Delta, maybe the storage team is open to the optimization to allow for pushing down actual blob key names ...
@roeap after speaking with our Microsoft contact I opened this: apache/arrow-rs#5653
Description
Microsoft recently released, with great fanfare, OneLake, which is a managed lakehouse offering based on ADLS Gen2.
Reading works fine, but writing generates an error: the Parquet file is written, but in the log we get a tmp JSON file.
Using the Polars writer, which I believe is based on the Deltalake writer, I get this error:
ldf.write_delta("/lakehouse/default/Files/regionn")