Data Accelerator with Databricks
The Data Accelerator environment can now be set up to run jobs on either Databricks or HDInsight. When setting up the environment, you can choose the platform on which you want to run the Spark jobs: Databricks or HDInsight.
In this tutorial we will go over:
- Install Azure CLI from here
- Install Databricks CLI from here
- Download the scripts and templates locally via this link: template
- Open common.parameters.txt under DeploymentCloud/Deployment.DataX and provide TenantId and SubscriptionId. Also set useDatabricks = y (see the example snippet after this list)
- For Windows OS, open a command prompt as an admin under the downloaded folder DeploymentCloud/Deployment.DataX and run:
deploy.bat
- If you are not the admin of the tenant (typically when using an AAD account), please copy the DeploymentCloud folder over to your admin's machine and ask your admin to run the following command:
runAdminSteps.bat
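For reference, the relevant entries in common.parameters.txt look roughly like the following (a minimal sketch, assuming the file uses simple name=value entries; the values shown are placeholders):
TenantId=<your tenant id>
SubscriptionId=<your subscription id>
useDatabricks=y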
The above steps will set up the Azure resources required by Data Accelerator, including a Databricks resource. To finish setting up the Databricks resource, you will further need to generate a Databricks token, create a secret scope, upload the jars required by the Spark jobs to DBFS, and finally create a Databricks cluster for Live query.
The following steps walk you through creating a Databricks token. This token will be required both to run the Databricks CLI commands later in the setup process and to run flows on Databricks.
- On https://portal.azure.com, go to the ‘Azure Databricks Service’ resource created by the ARM deployment step and click on ‘Launch Workspace’.
- On the Databricks portal, click on the account icon and select User Settings. Then click on the ‘Generate New Token’ button.
- You can set the lifetime of the token here or keep the default lifetime, then click ‘Generate’. Please copy the token, as it will be required later and you will not be able to see it again.
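The Databricks CLI (installed in the prerequisites above) can also read the workspace URL and token from the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, which saves re-entering them in every session. A minimal PowerShell sketch, where the host URL is only an example and the token is the one you just copied:
$env:DATABRICKS_HOST = 'https://eastus.azuredatabricks.net'   # your workspace URL
$env:DATABRICKS_TOKEN = '<token generated above>'             # token copied in the previous step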
Here we will be creating an Azure Key Vault-backed secret scope which will be required to read secrets from Azure Key Vault.
- On https://portal.azure.com, go to the ‘Key vault’ resource named ‘kvSpark****’. Note: Do not go to the key vault that has RDP in the name.
- Click on the Properties blade and copy the following:
- Name
- DNS Name
- Resource ID
- Go to https://<your_azure_databricks_url>#secrets/createScope (for example, https://eastus.azuredatabricks.net#secrets/createScope) and paste the info copied in the above step:
- Key Vault Name as ‘Scope Name’
- DNS Name
- Resource ID
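Once the scope has been created, you can verify it from the Databricks CLI (after configuring CLI authentication, which is covered in the next step); this lists all secret scopes in the workspace, including the one you just created:
databricks secrets list-scopes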
We will run the DBFS CLI to upload the jar files required by the Data Accelerator Spark jobs to the Databricks File System (DBFS). Before running the following steps, install the Databricks CLI if you have not done so (Note: it is recommended to use the latest Python version when installing the Databricks CLI) and then set up authentication by running 'databricks configure --token' in a command prompt. When prompted, enter 'https://<your_azure_databricks_url>' as the host (e.g. https://eastus.azuredatabricks.net) and, when prompted for the token, the Databricks token generated in the previous step.
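The authentication setup looks roughly like this (a sketch of the interactive prompts; the host URL is only an example):
databricks configure --token
# Databricks Host (should begin with https://): https://eastus.azuredatabricks.net
# Token: <paste the Databricks token generated earlier>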
- Unpack the Microsoft.DataX.Spark NuGet package
- Open PowerShell. Enter the folder path of the extracted NuGet package in the command below and run it.
dbfs cp -r <path of extracted Microsoft.DataX.Spark>\lib dbfs:/datax
- To verify that all the jars got uploaded, you can run the following; it will list out the files (see also the optional check after this list)
dbfs ls dbfs:/datax --absolute
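Optionally, a quick PowerShell sketch that compares the number of jars in the local lib folder with the number of files now under dbfs:/datax (the local path placeholder is the same one used in the copy command above):
$localJarCount = @(Get-ChildItem '<path of extracted Microsoft.DataX.Spark>\lib' -File).Count   # jars shipped in the NuGet package
$dbfsFileCount = @(dbfs ls dbfs:/datax --absolute).Count                                        # files uploaded to DBFS
Write-Host "Local jars: $localJarCount, files in dbfs:/datax: $dbfsFileCount"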
We will now create a dedicated cluster to run Live queries. In the following script, set the values of $clusterName, $defaultVault and $defaultStorageAccount in the first three lines and run the script.
$clusterName = '<Enter your databricks workspace name here eg:dx123>'
$defaultVault = '<Enter SparkKeyVault name that was used to create secret scope eg:kvSpark123>'
$defaultStorageAccount = '<Enter the default storage account name that was created by the ARM template eg:saspark1234xd123>'
$jsonCommand = '{
    \"cluster_name\": \"' + $clusterName + '\",
    \"spark_version\": \"5.3.x-scala2.11\",
    \"node_type_id\": \"Standard_DS3_v2\",
    \"autoscale\": {
        \"min_workers\": \"2\",
        \"max_workers\": \"8\"
    },
    \"autotermination_minutes\": \"0\",
    \"spark_conf\": {
        \"spark.databricks.delta.preview.enabled\": true,
        \"spark.sql.hive.metastore.version\": \"1.2.1\",
        \"spark.driver.userClassPathFirst\": true,
        \"spark.executor.userClassPathFirst\": true,
        \"spark.sql.hive.metastore.jars\": \"builtin\"
    },
    \"spark_env_vars\": {
        \"DATAX_AZURESTORAGEJARPATH\": \"/datax/bin/azure-storage-3.1.0.jar\",
        \"DATAX_DEFAULTCONTAINER\": \"defaultdx\",
        \"DATAX_DEFAULTSTORAGEACCOUNT\": \"' + $defaultStorageAccount + '\",
        \"DATAX_DEFAULTVAULTNAME\": \"' + $defaultVault + '\"
    }
}'
$clusterId = (databricks clusters create --json $jsonCommand | ConvertFrom-Json).cluster_id
$dbfsFiles = (dbfs ls dbfs:/datax --absolute)
foreach($dbfsFile in $dbfsFiles) {
    databricks libraries install --cluster-id $clusterId --jar $dbfsFile
}
It can take about 10 minutes for the cluster to start.
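While waiting, you can check on the cluster and the library installation from the same PowerShell session; these are standard Databricks CLI commands, and $clusterId comes from the script above:
databricks clusters get --cluster-id $clusterId               # cluster details; the state field shows PENDING/RUNNING
databricks libraries cluster-status --cluster-id $clusterId   # installation status of each attached jar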
Please see Run Data Accelerator Flows on Databricks for instructions on how to run Data Accelerator flows on Databricks.