This project aims to perform data transformation using Databricks Pyspark and SparkSQL. The data was mounted from Azure Data Lake Storage Gen2 and transformed within Databricks, and the transformed data was then loaded back to the Data Lake. The notebooks were then combined into a pipeline using Azure Data Factory.
- Databricks Pyspark
- Python
- SparkSQL
- Azure Data Lake Storage Gen2
- Azure Storage Account
- Azure resource group
- Azure Key Vault
- Azure Data Factory
- Power BI
- Azure Storage Explorer
- Mount the data from Azure Data Lake Storage Gen2 to Databricks.
- Use Pyspark within Databricks to perform data transformations using Delta Tables (see the sketch after this list).
- Load the transformed data back to Azure Data Lake Storage Gen2.
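To illustrate the transformation step, here is a minimal sketch of a notebook cell: it reads raw data from a mounted path, applies a simple clean-up, and writes the result out as a Delta table. The paths and the column handling are illustrative assumptions, not the project's actual schema.

```python
from pyspark.sql import functions as F

# Read the raw CSV files from the mounted storage
# (the path "/mnt/flightdata/raw" is a placeholder).
raw_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/mnt/flightdata/raw"))

# Example transformation: normalise column names and add a load timestamp.
transformed_df = (raw_df
                  .toDF(*[c.strip().lower().replace(" ", "_") for c in raw_df.columns])
                  .withColumn("load_date", F.current_timestamp()))

# Write the result back to the lake as a Delta table
# (the target path is also a placeholder).
(transformed_df.write
 .format("delta")
 .mode("overwrite")
 .save("/mnt/flightdata/transformed"))
```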
The data can be found in the data folder. There is either the raw data or the raw_incremental_load data. These are essentially the same data, but in raw_incremental_load the data is ordered to mimic data that would normally be generated over time and would therefore be loaded incrementally.
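As a rough sketch of what such an incremental load could look like with Delta tables, the cell below appends only records newer than what has already been loaded. The paths and the `load_date` watermark column are assumptions made for illustration, not the project's actual schema.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

target_path = "/mnt/flightdata/transformed"  # assumed Delta table location
new_batch = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv("/mnt/flightdata/raw_incremental_load"))

if DeltaTable.isDeltaTable(spark, target_path):
    # Keep only rows newer than the latest value already loaded,
    # using an assumed watermark column called "load_date".
    max_loaded = (spark.read.format("delta").load(target_path)
                  .agg(F.max("load_date")).collect()[0][0])
    if max_loaded is not None:
        new_batch = new_batch.filter(F.col("load_date") > F.lit(max_loaded))
    new_batch.write.format("delta").mode("append").save(target_path)
else:
    # First run: create the Delta table from the initial batch.
    new_batch.write.format("delta").mode("overwrite").save(target_path)
```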
- An active Azure subscription with access to Azure Data Lake Storage Gen2
- Databricks account set up
- Python
- Pyspark
- SQL
- Azure Storage Explorer installed
This project demonstrates how to perform data transformation using Databricks Pyspark and Azure Data Lake Storage Gen2. This setup can be used for larger scale data processing and storage needs.
- Storing data in the FileStore of Databricks, loading it into a Workspace notebook, and performing data science.
- Storing data in Azure Blob Storage and mounting it to Databricks. This includes the following steps:
- Create a Resource Group in Azure.
- Create a Storage Account and assign it to the Resource Group.
- Create an App registration (this creates a service principal), which we will use to connect Databricks to the Storage Account.
  - Create a client secret and copy it.
- Create a Key Vault (assigned to the same Resource Group).
  - Add the client secret here.
- Create a secret scope within Databricks.
  - Use the Key Vault DNS name (URL) and the Resource ID to allow Databricks to access the Key Vault's secrets within a specific scope.
- Use this scope to retrieve secrets and connect to the Storage Account container where the data is stored in Azure:
```python
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<appId>",
           "fs.azure.account.oauth2.client.secret": "<clientSecret>",
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant>/oauth2/token",
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}
```
- Finally we can mount the data:
```python
dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder1",
    mount_point = "/mnt/flightdata",
    extra_configs = configs)
```
- Now we can load the data from the mount point into a DataFrame and perform actions, for example:
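A minimal sketch, assuming a CSV file exists under the mount point (the file name is a placeholder):

```python
# List the files now available under the mount point.
display(dbutils.fs.ls("/mnt/flightdata"))

# Read one of the mounted files into a DataFrame and inspect it
# ("flights.csv" is a placeholder file name).
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/mnt/flightdata/flights.csv"))
df.printSchema()
df.show(5)
```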
We can connect the Databricks instance to the Data Factory instance either through Azure Active Directory or through a personal access token, which we can generate in Databricks and pass as an authentication method.