Simple delta write in Fabric notebook failing with SSL error #1803

Closed
bcdobbs opened this issue Nov 4, 2023 · 20 comments
Labels
bug Something isn't working

Comments

@bcdobbs

bcdobbs commented Nov 4, 2023

Delta-rs version: Python 0.12.0
Cloud provider: Microsoft (UK South)
Environment: Fabric Notebook


Bug

What happened:
When trying to write a pandas dataframe to a delta table in Microsoft Fabric it fails with an SSL error:

OSError: Generic MicrosoftAzure error: response error "request error", after 10 retries: error sending request for url (https://onelake.blob.fabric.microsoft.com/xxx/yyy.Lakehouse/Tables/Test/_delta_log/_last_checkpoint): error trying to connect: error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:1889: (self-signed certificate)

How to reproduce it:
Installed deltalake in fabric library management and then ran the following in a notebook:

import pandas as pd
from deltalake.writer import write_deltalake
from trident_token_library_wrapper import PyTridentTokenLibrary

token = PyTridentTokenLibrary.get_access_token("storage")

TablePath = "abfss://xxx@onelake.dfs.fabric.microsoft.com/yyy.Lakehouse/Tables/Test"
aadToken = PyTridentTokenLibrary.get_access_token("storage")

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

write_deltalake(TablePath, df, storage_options={"bearer_token": aadToken, "use_fabric_endpoint": "true"})
@bcdobbs added the bug label Nov 4, 2023
@r3stl355
Contributor

r3stl355 commented Nov 5, 2023

@bcdobbs is there any chance that in the environment where the notebook is running, traffic goes through some appliance with SSL inspection (I don't know much about Fabric notebooks)? https://onelake.blob.fabric.microsoft.com/ has a valid SSL certificate but the error says it has a self-signed one which may happen if SSL inspection is used.
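
A minimal sketch of one way to check this from Python's standard library (not taken from the thread): attempt a verified TLS handshake and report OpenSSL's verification code, where 18 is the code openssl prints for a self-signed certificate. The function name check_tls is hypothetical, and Python's ssl module may use a different trust store than deltalake does, so treat the result as a hint only.

```python
import socket
import ssl


def check_tls(host, port=443, timeout=10.0):
    """Try a verified TLS handshake and describe the outcome as a string."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                peer_ip = tls.getpeername()[0]
                return f"verified OK (peer {peer_ip})"
    except ssl.SSLCertVerificationError as exc:
        # verify_code mirrors the "verify error:num=..." value printed by openssl
        return f"verify failed: code={exc.verify_code} message={exc.verify_message}"
    except OSError as exc:
        # DNS failures, refused connections, timeouts and other TLS errors
        return f"connection failed: {exc}"


# Example call in a notebook cell:
# print(check_tls("onelake.blob.fabric.microsoft.com"))
```

The peer IP in the success message is worth a look as well, since a loopback address there would suggest the traffic is being redirected locally.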

@r3stl355
Contributor

r3stl355 commented Nov 5, 2023

btw, an easy way to answer my question would be running something like this in the notebook.

import requests
requests.get('https://onelake.blob.fabric.microsoft.com/').content

If you get Healthy then my earlier theory is wrong, but if you get an error - then it holds.

I was trying to replicate the issue on my side, but I get a different error when I run from deltalake.writer import write_deltalake:

Error: /home/trusted-service-user/cluster-env/trident_env/lib/python3.10/site-packages/pyarrow/libarrow_acero.so.1200: undefined symbol: _ZN5arrow7compute4callESsSt6vectorINS0_10ExpressionESaIS2_EESt10shared_ptrINS0_15FunctionOptionsEE

@bcdobbs
Author

bcdobbs commented Nov 5, 2023

Thanks @r3stl355, I'd assumed that there was some redirect going on but your test returned Healthy.
With regard to your error how did you make deltalake library available? I'd used the workspace library management GUI from workspace settings (https://learn.microsoft.com/en-us/fabric/data-science/python-guide/python-library-management), not sure if you can install them at a notebook level; still learning myself!

@bcdobbs
Author

bcdobbs commented Nov 5, 2023

@r3stl355 based on your suggestion I tried running:

import requests
aadToken = PyTridentTokenLibrary.get_access_token("storage")

headersAuth = {
    "Authorization": f"Bearer {aadToken}"
}
output = requests.get("https://onelake.blob.fabric.microsoft.com/xxx/yyy.Lakehouse/Tables", headers=headersAuth)

I get a 200 status code which suggests it's authenticating. (If I remove the auth header it tells me there is an authentication issue.)

@r3stl355
Contributor

r3stl355 commented Nov 5, 2023

OK @bcdobbs, ignore everything I wrote before 😁. This looks like a problem with the writer, because the Spark writer works and so does the direct API call (i.e. I can create a file under Files with a PUT). I'll carry on digging.

@r3stl355
Contributor

r3stl355 commented Nov 5, 2023

as for the other error I had - I installed deltalake with pip but then figured out it doesn't work with the pyarrow in the cluster. Can you please check which version of pyarrow you are running, e.g. with pip list?

@r3stl355
Contributor

r3stl355 commented Nov 5, 2023

lastly - this is not just a writer but also a reader problem, I get the same error if I do DeltaTable("abfss://<ws-id>@onelake.dfs.fabric.microsoft.com/<lh-id>/Tables/test", storage_options={"bearer_token": aadToken, "use_fabric_endpoint": "true"})

@bcdobbs
Author

bcdobbs commented Nov 5, 2023

pyarrow is 12.0.0.

@r3stl355
Contributor

r3stl355 commented Nov 5, 2023

hmm, strange, I had to lower the pyarrow version to avoid that other error I was getting. Actually, just re-installing v12.0.0 also works - maybe it comes with some incomplete install. Anyways, that folder _delta_log/_last_checkpoint in the error does not actually exist, I wonder if that could be a cause of the problem (resulting in incorrect message perhaps 🤷 )

@djouallah

there is an issue with pyarrow
#1743

@djouallah

hmm, ok, tried with deltalake 0.13 and got the same errors. I think the regression was introduced in the Fabric 1.2 runtime; for now it is better to use runtime 1.1, where it works fine.

@bcdobbs
Author

bcdobbs commented Nov 6, 2023

Thanks @r3stl355 and @djouallah, really appreciate your time. Indeed, reverting the Fabric runtime lets it work fine! Really excited, as I work for a group of schools, so data volumes aren't huge and I'm always looking for ways to keep compute costs low.

Much appreciated

Ben

@r3stl355
Contributor

r3stl355 commented Nov 6, 2023

I think I got to the bottom of this. The issue is likely related to the way ADLS access is configured in Fabric: though onelake.blob.fabric.microsoft.com resolves to a public IP in the notebook, there is an entry in /etc/hosts pointing it to a loopback IP, 127.0.0.2, which serves a self-signed certificate. The same code that fails in the Fabric notebook works elsewhere (I tried on a local Mac and in Azure Web Terminal, using the token issued in the Fabric notebook), so this is unlikely to be a delta-rs problem; it is more likely one for Microsoft to solve.

Some extra supporting/interesting data:

  • Azure Web Terminal actually uses the same OS as the Fabric notebook: NAME="Common Base Linux Mariner", VERSION="2.0.20231004"

  • curl in the Fabric server works, but shows that the connection is to a loopback IP, 127.0.0.2, which means it may be using a self-signed certificate. (I have a good table there named bad)

> !curl -H "Authorization: Bearer $TOKEN" https://onelake.blob.fabric.microsoft.com/.../.../Tables/bad --verbose

Connected to onelake.blob.fabric.microsoft.com (127.0.0.2) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/pki/tls/certs/ca-bundle.trust.crt
*  CApath: /etc/pki/ca-trust/extracted/openssl
...
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted h2
* Server certificate:
*  subject: C=US; ST=Washington; L=Redmond; O=MicrosoftData; OU=SparkDepartment; [email protected]; CN=microsoft.com
*  start date: Nov  6 09:14:13 2023 GMT
*  expire date: Nov  5 09:14:13 2024 GMT
....
*  SSL certificate verify ok.
  • The same command in Web Terminal shows that it's using a public IP this time and a different certificate (e.g. different validity dates)
* Connected to onelake.blob.fabric.microsoft.com (20.50.0.27) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/pki/tls/certs/ca-bundle.trust.crt
*  CApath: /etc/ssl/certs
...
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted h2
* Server certificate:
*  subject: C=US; ST=WA; L=Redmond; O=Microsoft Corporation; CN=westeurope.onelake.fabric.microsoft.com
*  start date: Oct  7 14:27:33 2023 GMT
*  expire date: Apr  4 14:27:33 2024 GMT
...
*  SSL certificate verify ok.
  • OpenSSL cert verification confirms the earlier theory about a self-signed cert:
  1. In the Fabric notebook
> !openssl s_client -connect onelake.blob.fabric.microsoft.com:443 -showcerts

CONNECTED(00000003)
depth=0 C = US, ST = Washington, L = Redmond, O = MicrosoftData, OU = SparkDepartment, emailAddress = [email protected], CN = microsoft.com
verify error:num=18:self signed certificate
...
---
SSL handshake has read 1867 bytes and written 407 bytes
Verification error: self signed certificate
---
  2. In Web Terminal
depth=1 C = US, O = Microsoft Corporation, CN = Microsoft Azure TLS Issuing CA 06
verify return:1
depth=0 C = US, ST = WA, L = Redmond, O = Microsoft Corporation, CN = westeurope.onelake.fabric.microsoft.com
verify return:1
...
---
SSL handshake has read 4564 bytes and written 799 bytes
Verification: OK
---
  • Using curl with IPs - works in the Fabric notebook if I use a loopback IP (e.g. curl https://127.0.0.2/...) but fails with certificate error if I use a public IP returned by nslookup (e.g. curl https://40.82.254.113/...). Using IP in Azure Web Terminal does not work as expected
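
The hosts-file override described above can also be checked from Python. This is a rough sketch under assumptions, not something from the thread: the helper names are hypothetical, and the resolver call normally honours /etc/hosts, whereas nslookup queries DNS directly, which would explain the public IP it returned.

```python
import socket


def hosts_file_entries(hosts_text, hostname):
    """Return the IPs that a hosts-file body maps to `hostname`."""
    ips = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        ip, *names = line.split()
        if hostname.lower() in (name.lower() for name in names):
            ips.append(ip)
    return ips


def resolved_addresses(hostname, port=443):
    """Addresses returned by the system resolver for `hostname`."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


# Example use in a notebook cell:
# host = "onelake.blob.fabric.microsoft.com"
# with open("/etc/hosts") as f:
#     print(hosts_file_entries(f.read(), host))
# print(resolved_addresses(host))
```

If the first call lists 127.0.0.2 for the OneLake host, connections from the notebook are presumably being sent to the local endpoint rather than the public one.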

@r3stl355
Contributor

r3stl355 commented Nov 6, 2023

A shorter version of the answer: curl on Fabric runtime 1.1 seems to use a different CA file (/etc/ssl/certs/ca-certificates.crt) than on runtime 1.2 (/etc/pki/tls/certs/ca-bundle.trust.crt); the 1.2 file also has an extra suffix attached to the certificate value. The /etc/ssl/certs/ca-certificates.crt file is still there on runtime 1.2, but it does not contain the certificate used by the endpoint.

Maybe openssl is trying to use the /etc/ssl/certs/ca-certificates.crt, or maybe it is unable to properly find the cert in /etc/pki/tls/certs/ca-bundle.trust.crt because of that extra suffix
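
To see which CA locations an OpenSSL-based client would pick up by default, something like the following sketch may help. It reports what Python's own ssl module is linked against, which is an assumption-laden proxy: the OpenSSL used by deltalake's native code may be a different build with different defaults. The function name describe_ca_locations is hypothetical.

```python
import os
import ssl


def describe_ca_locations():
    """Collect the default CA file/dir and the env vars that override them."""
    paths = ssl.get_default_verify_paths()
    return {
        # Resolved locations; None when the file or directory does not exist
        "cafile_in_use": paths.cafile,
        "capath_in_use": paths.capath,
        # Defaults compiled into the linked OpenSSL library
        "compiled_default_cafile": paths.openssl_cafile,
        "compiled_default_capath": paths.openssl_capath,
        # Current values of the override variables (SSL_CERT_FILE, SSL_CERT_DIR)
        paths.openssl_cafile_env: os.environ.get(paths.openssl_cafile_env),
        paths.openssl_capath_env: os.environ.get(paths.openssl_capath_env),
    }


for key, value in describe_ca_locations().items():
    print(f"{key}: {value}")
```

Running it on both runtimes and comparing the output would be one way to confirm or rule out the CA file theory.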

@r3stl355
Contributor

r3stl355 commented Nov 6, 2023

and lastly, here is a really ugly solution if you are still keen on trying runtime 1.2.

  1. Run !openssl s_client -connect onelake.blob.fabric.microsoft.com:443 to get the certificate.
  2. Copy the certificate value between -----BEGIN CERTIFICATE----- and -----END CERTIFICATE----- and write it out to a local file, e.g.
cert = """-----BEGIN CERTIFICATE-----
MIIFGzCCBAOgAwIBAgIUFO5FzvkmKVoyIlO8gQM8vkcNJ0kwDQYJKoZIhvcNAQEL
BQAwgZ8xCzAJBgNVBAYTAlVTMRMwEQYDVQQIDApXYXNoaW5ndG9uMRAwDgYDVQQH
<rest of the cert value here, shortened for brevity>
-----END CERTIFICATE-----
"""
with open("ca.cert", "w") as out:
    out.write(cert)
  3. Export the created file name into an ENV var and things should work, e.g.
import os
from deltalake import DeltaTable

os.environ["SSL_CERT_FILE"] = "./ca.cert"

# aadToken is the storage token obtained earlier in the notebook
workspace_id = "<your workspace id here>"
lakehouse_id = "<your lakehouse id here>"
dt = DeltaTable(f"abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}/Tables/bad", storage_options={"bearer_token": aadToken, "use_fabric_endpoint": "true"})
print(dt.version())

With this you may actually consider closing this ticket, not the place to be resolved imo
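
The manual copy-and-paste in the steps above could probably be scripted. The sketch below is an assumption-laden variation, not the author's method: it uses ssl.get_server_certificate, which fetches the PEM certificate without validating it when no ca_certs argument is given, then writes it to a file and exports SSL_CERT_FILE with an absolute path. The function name and the fetch parameter are hypothetical. Note that this trusts whatever certificate the endpoint presents, so it only makes sense as a stopgap inside the Fabric notebook.

```python
import os
import ssl
from pathlib import Path


def trust_server_certificate(host, port=443, cert_path="onelake_cert.crt",
                             fetch=ssl.get_server_certificate):
    """Save the server's PEM certificate and point SSL_CERT_FILE at it."""
    pem = fetch((host, port))  # no validation is done at this point
    path = Path(cert_path).resolve()  # absolute, so a later chdir is harmless
    path.write_text(pem)
    os.environ["SSL_CERT_FILE"] = str(path)
    return path


# Example call, presumably needed before deltalake opens its first connection:
# trust_server_certificate("onelake.blob.fabric.microsoft.com")
```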

@ion-elgreco
Collaborator

If you have the option, try to reach out to the Microsoft Fabric product team directly to flag the regression.

@bcdobbs
Author

bcdobbs commented Nov 6, 2023

Thanks all, will reach out to Microsoft.

@RobinLin666
Contributor

please try:

os.environ["SSL_CERT_DIR"] = "/etc/pki/ca-trust/extracted/openssl:/opt/olcclient"

The Microsoft Fabric OneLake team is fixing it.
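
Before relying on SSL_CERT_DIR, it may be worth checking that each listed directory exists in the notebook and is not empty. A small sketch of such a check follows; the helper name inspect_cert_dirs is hypothetical, and it assumes the value is split on the platform's path separator (a colon on Linux).

```python
import os


def inspect_cert_dirs(value):
    """Map each directory in an SSL_CERT_DIR-style value to its entry count.

    Directories that do not exist are reported as None.
    """
    report = {}
    for directory in value.split(os.pathsep):
        if not directory:
            continue  # skip empty segments such as a trailing separator
        if os.path.isdir(directory):
            report[directory] = len(os.listdir(directory))
        else:
            report[directory] = None
    return report


print(inspect_cert_dirs("/etc/pki/ca-trust/extracted/openssl:/opt/olcclient"))
```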

@ion-elgreco
Collaborator

Maybe it's good to close this, since the issue is caused by Fabric.

@rtyler closed this as completed Feb 6, 2024
@martroben

@RobinLin666
Could you give a link to the bug report with the MS Fabric OneLake team that I could follow, regarding the self-signed certificate problem, please? Or is it all just back channels?

re:

os.environ["SSL_CERT_DIR"] = "/etc/pki/ca-trust/extracted/openssl:/opt/olcclient"

This does not seem to work. I'm currently using a variation of the ugly solution suggested by @r3stl355

import os

# Fetch the certificate only once, but set the env var on every run
if not os.path.exists("onelake_cert.crt"):
    os.system("openssl s_client -showcerts -connect onelake.blob.fabric.microsoft.com:443 | awk '/BEGIN CERTIFICATE/,/END CERTIFICATE/' >> onelake_cert.crt")
os.environ["SSL_CERT_FILE"] = "./onelake_cert.crt"

Hopefully MS will come through with a solution soon. Along with the other delta table write issue, deltalake and polars currently have severely limited usability in the Fabric environment, which is a pity since I love both.
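
For anyone who prefers to avoid the awk step, the certificate blocks can be pulled out of the openssl s_client output in Python instead. This is a minimal sketch with a hypothetical helper name; it only does the text extraction and assumes the output has already been captured as a string.

```python
import re

# Matches each PEM certificate block, including its BEGIN/END marker lines
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def extract_pem_certificates(text):
    """Return every PEM certificate block found in `text`, in order."""
    return [match.group(0) for match in PEM_BLOCK.finditer(text)]


# Example use with captured openssl output (variable name is illustrative):
# pem = "\n".join(extract_pem_certificates(s_client_output)) + "\n"
```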


7 participants