From 94341ce7f14980d38007a61bd78f5040f95576ff Mon Sep 17 00:00:00 2001
From: James Horrocks <15187219+james-horrocks@users.noreply.github.com>
Date: Tue, 29 Oct 2024 13:10:32 +0000
Subject: [PATCH] Fix: Data stacks documentation - 971 (#546)

* 917 - updates to data stacks documentation and fixed installation instructions for stacks-cli

* 1041 - updated requirements page for data stacks

* 917 - updates to data stacks documentation and fixed installation instructions for stacks-cli

* 1048 - changes to '1. Generate a data project' page

* Revert "917 - updates to data stacks documentation and fixed installation instructions for stacks-cli"

This reverts commit 4aaac07678a06d102fb68770372e42f005ece219.

---------

Co-authored-by: Mehdi Kimakhe <58773700+mehdi-kimakhe-amido@users.noreply.github.com>
Co-authored-by: Jack Blower
---
 docs/stackscli/about.md                        |  4 +-
 .../data/getting_started/generate_project.md   | 34 ++++-----
 .../data/getting_started/getting_started.md    | 17 ++---
 .../requirements_data_azure.md                 | 71 ++++++++++---------
 sidebars.js                                    |  2 +-
 5 files changed, 67 insertions(+), 61 deletions(-)

diff --git a/docs/stackscli/about.md b/docs/stackscli/about.md
index f7f549817..3bff95a2b 100644
--- a/docs/stackscli/about.md
+++ b/docs/stackscli/about.md
@@ -19,10 +19,10 @@ As the CLI is a single binary, the quickest way to install it is to download it
 ```bash
 # Download the binary to a location in the PATH
 ## Mac OS
-curl https://github.com/Ensono/stacks-cli/releases/download/v{stackscli_version}/stacks-cli-darwin-amd64-{stackscli_version} -o /usr/local/bin/stacks-cli
+curl -L https://github.com/Ensono/stacks-cli/releases/download/v{stackscli_version}/stacks-cli-darwin-amd64-{stackscli_version} -o /usr/local/bin/stacks-cli
 
 ## Linux
-curl https://github.com/Ensono/stacks-cli/releases/download/v{stackscli_version}/stacks-cli-linux-amd64-{stackscli_version} -o /usr/local/bin/stacks-cli
+curl -L https://github.com/Ensono/stacks-cli/releases/download/v{stackscli_version}/stacks-cli-linux-amd64-{stackscli_version} -o /usr/local/bin/stacks-cli
 
 ## Ensure that the command is executable
 chmod +x /usr/local/bin/stacks-cli
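The `-L` flag added above matters because GitHub release downloads respond with an HTTP redirect; without it, `curl` saves the redirect response instead of the binary. A quick sanity check after installing — a minimal sketch, assuming the binary landed at `/usr/local/bin/stacks-cli` as in the commands above (the exact help output is illustrative only):

```bash
# Confirm the binary resolves on the PATH and is executable
command -v stacks-cli

# Print the CLI usage/help text; an "exec format error" here usually
# means a binary for the wrong architecture was downloaded
stacks-cli --help
```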
diff --git a/docs/workloads/azure/data/getting_started/generate_project.md b/docs/workloads/azure/data/getting_started/generate_project.md
index 285597877..12aae9cc4 100644
--- a/docs/workloads/azure/data/getting_started/generate_project.md
+++ b/docs/workloads/azure/data/getting_started/generate_project.md
@@ -14,10 +14,12 @@ keywords:
 
 This section provides an overview of scaffolding and generating a new data platform project using the [Ensono Stacks CLI](/docs/stackscli/about).
 
-It assumes the following [requirements](./requirements_data_azure.md) are in place:
+It assumes the following [pre-requisites](./requirements_data_azure.md) are in place:
 
 * A [remote git repository](./requirements_data_azure.md#git-repository) for hosting the generated project
-* [Terraform state storage](./requirements_data_azure.md#terraform-state-storage)
+* A [storage account](./requirements_data_azure.md#terraform-state-storage) for the Terraform state
+
+For more information on the pre-requisites, see [here](./requirements_data_azure.md).
 
 ## Step 1: Install the Ensono Stacks CLI
 
@@ -30,20 +32,20 @@ We will be using the `stacks-cli scaffold` command to generate a new data project
 A [sample data project config file](https://github.com/Ensono/stacks-azure-data/blob/main/stacks-cli/data-scaffold-example.yml) is provided.
 Prepare a copy of this file, and update the following entries as required for your new project:
 
-| Config field | Example value | Description |
-| ----- | ----- | ----- |
-| directory.working | `stacks` | Target directory for the scaffolded project. |
-| directory.export | `~` | Path to your Ensono Stacks CLI installation. |
-| business.company | `mycompany` | Used for resource naming. |
-| business.domain | `mydomain` | Used for environment & Terraform state key naming. |
-| business.component | `data` | Used for resource naming. |
-| project.name | `stacks-data-platform` | Name of project created & used for resource naming. |
-| project.sourcecontrol.type | `github` | Remote repository type. |
-| project.sourcecontrol.url | `https://github.com/mycompany/stacks-data-platform` | Used for setting up the remote repository - see [Git repository](./requirements_data_azure.md#git-repository). |
-| project.cloud.region | `ukwest` | The Azure region you'll be deploying into. Using the Azure CLI, you can use `az account list-locations -o Table` to see available region names. |
-| terraform.backend.storage | `tfstorage` | Storage account name for Terraform state - see [Terraform state storage](./requirements_data_azure.md#terraform-state-storage). |
-| terraform.backend.group | `tfgroup` | Resource group account name for Terraform state. |
-| terraform.backend.container | `tfcontainer` | Container name account name for Terraform state. |
+| Config field                | Example value                                        | Description                                                                                                                                     |
+|-----------------------------|------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
+| directory.working           | `stacks`                                             | Target directory for the scaffolded project.                                                                                                    |
+| directory.export            | `~`                                                  | Path to your Ensono Stacks CLI installation.                                                                                                    |
+| business.company            | `mycompany`                                          | Used for resource naming.                                                                                                                       |
+| business.domain             | `mydomain`                                           | Used for environment & Terraform state key naming.                                                                                              |
+| business.component          | `data`                                               | Used for resource naming.                                                                                                                       |
+| project.name                | `stacks-data-platform`                               | Name of project created & used for resource naming.                                                                                             |
+| project.sourcecontrol.type  | `github`                                             | Remote repository provider, e.g. GitHub or Azure DevOps.                                                                                        |
+| project.sourcecontrol.url   | `https://github.com/mycompany/stacks-data-platform`  | Used for setting up the remote repository - see [Git repository](./requirements_data_azure.md#git-repository).                                  |
+| project.cloud.region        | `ukwest`                                             | The Azure region you'll be deploying into. Using the Azure CLI, you can use `az account list-locations -o Table` to see available region names. |
+| terraform.backend.storage   | `tfstorage`                                          | Storage account name for Terraform state - see [Terraform state storage](./requirements_data_azure.md#terraform-state-storage).                 |
+| terraform.backend.group     | `tfgroup`                                            | Resource group name for Terraform state.                                                                                                        |
+| terraform.backend.container | `tfcontainer`                                        | Storage container name for Terraform state.                                                                                                     |
 
 All other values can be left as they are. For full documentation of all fields in the config file, refer to the Stacks CLI Manual.
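Assembled from the dotted field names and example values in the updated table, a copy of the config might look like the sketch below. The nesting is inferred from the dotted paths, so treat the layout as an assumption and verify it against the linked `data-scaffold-example.yml` before running `stacks-cli scaffold`:

```bash
# Illustrative scaffold config built from the table above; field nesting is
# inferred from the dotted names -- check against data-scaffold-example.yml
cat > data-scaffold.yml <<'EOF'
directory:
  working: stacks                  # target directory for the scaffolded project
  export: ~                        # path to the Ensono Stacks CLI installation
business:
  company: mycompany               # used for resource naming
  domain: mydomain                 # environment & Terraform state key naming
  component: data                  # used for resource naming
project:
  name: stacks-data-platform       # project name, used for resource naming
  sourcecontrol:
    type: github                   # remote repository provider
    url: https://github.com/mycompany/stacks-data-platform
  cloud:
    region: ukwest                 # az account list-locations -o Table
terraform:
  backend:
    storage: tfstorage             # storage account for Terraform state
    group: tfgroup                 # resource group for Terraform state
    container: tfcontainer         # blob container for Terraform state
EOF
```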
diff --git a/docs/workloads/azure/data/getting_started/getting_started.md b/docs/workloads/azure/data/getting_started/getting_started.md
index d353d8b26..dd8290db0 100644
--- a/docs/workloads/azure/data/getting_started/getting_started.md
+++ b/docs/workloads/azure/data/getting_started/getting_started.md
@@ -19,11 +19,12 @@ A more [detailed workflow diagram](../architecture/architecture_data_azure.md#de
 
 ## Steps
 
-1. [Generate a Data Project](./generate_project.md) - Generate a new data project.
-2. [Infrastructure Deployment](./core_data_platform_deployment_azure.md) - Deploy the data platform infrastructure into your cloud environment.
-3. [Local Development Quickstart](./dev_quickstart_data_azure.md) - Once your project has been generated, set up your local environment to start developing.
-4. [Shared Resources Deployment](./shared_resources_deployment_azure.md) - Deploy common resources to be shared across data pipelines.
-5. (Optional) [Example Data Source](./example_data_source.md) - To assist with the 'Getting Started' steps, you may wish to set up the Example Data Source.
-6. [Data Ingest Pipeline Deployment](./ingest_pipeline_deployment_azure.md) - Generate and deploy a data ingest pipeline using the Datastacks CLI.
-7. [Data Processing Pipeline Deployment](./processing_pipeline_deployment_azure.md) - Generate and deploy a data processing pipeline using the Datastacks CLI.
-8. [Fabric Lakehouse Deployment](./fabric_deployment_guide.md) - Steps to implement a Microsoft Fabric Lakehouse over the data platform.
+1. [Prerequisites](./requirements_data_azure.md) - Ensure you have the necessary tools and resources to get started.
+2. [Generate a Data Project](./generate_project.md) - Generate a new data project.
+3. [Infrastructure Deployment](./core_data_platform_deployment_azure.md) - Deploy the data platform infrastructure into your cloud environment.
+4. [Local Development Quickstart](./dev_quickstart_data_azure.md) - Once your project has been generated, set up your local environment to start developing.
+5. [Shared Resources Deployment](./shared_resources_deployment_azure.md) - Deploy common resources to be shared across data pipelines.
+6. (Optional) [Example Data Source](./example_data_source.md) - To assist with the 'Getting Started' steps, you may wish to set up the Example Data Source.
+7. [Data Ingest Pipeline Deployment](./ingest_pipeline_deployment_azure.md) - Generate and deploy a data ingest pipeline using the Datastacks CLI.
+8. [Data Processing Pipeline Deployment](./processing_pipeline_deployment_azure.md) - Generate and deploy a data processing pipeline using the Datastacks CLI.
+9. [Fabric Lakehouse Deployment](./fabric_deployment_guide.md) - Steps to implement a Microsoft Fabric Lakehouse over the data platform.
diff --git a/docs/workloads/azure/data/getting_started/requirements_data_azure.md b/docs/workloads/azure/data/getting_started/requirements_data_azure.md
index 15e13c77d..334acaf59 100644
--- a/docs/workloads/azure/data/getting_started/requirements_data_azure.md
+++ b/docs/workloads/azure/data/getting_started/requirements_data_azure.md
@@ -1,31 +1,34 @@
 ---
 id: requirements_data_azure
-title: Requirements
-sidebar_label: Requirements
+title: Prerequisites
+sidebar_label: Prerequisites
 hide_title: false
 hide_table_of_contents: false
-description: Requirements
+description: Prerequisites for developing with Ensono Stacks Data Platform
 keywords:
   - requirements
+  - prerequisites
 ---
 
 ## Local development
 
 The following tools are recommended for developing while using the Ensono Stacks data solution:
 
-| Tool | Notes |
-| ----- | ----- |
-| [Python 3.9+](https://www.python.org/downloads/) | Use of Python 3.12+ is not currently supported. You may wish to use a utility such as [pyenv](https://pypi.org/project/pyenv/) to manage your local versions of Python. |
-| [Poetry](https://python-poetry.org/docs/) | Used for Python dependency management in Stacks. |
-| (Windows users) a Linux distribution, e.g. [WSL](https://docs.microsoft.com/en-us/windows/wsl/install) | A Unix-based environment is recommended for developing the solution (e.g. macOS, or WSL for Windows users). |
-| Java 8/11/17 | Optional: Java is required to develop and run tests using PySpark locally - see [Spark documentation](https://spark.apache.org/docs/latest/). |
-| [Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) | Optional: Azure CLI allows you to interact with Azure resources locally, including running end-to-end tests. |
+| Tool | Notes |
+|------|-------|
+| [Python 3.9 - 3.11](https://www.python.org/downloads/) | Use of Python 3.12+ is not currently supported. You may wish to use a utility such as [pyenv](https://pypi.org/project/pyenv/) to manage your local versions of Python. |
+| [Poetry](https://python-poetry.org/docs/) | Used for Python dependency management in Stacks. |
+| (Windows users) a Linux distribution, e.g. [WSL](https://docs.microsoft.com/en-us/windows/wsl/install) | A Unix-based environment is recommended for developing the solution (e.g. macOS, Linux, or WSL for Windows users). |
+| Java 8/11/17 runtime | Optional: Java is required to develop and run tests using PySpark locally - see [Spark documentation](https://spark.apache.org/docs/latest/). |
+| [Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli) | Optional: Azure CLI allows you to interact with Azure resources locally, including running end-to-end tests. |
 
 See [development quickstart](./dev_quickstart_data_azure.md) for further details on getting started with developing the solution.
 
 ## Git repository
 
-A remote Git repository is required for storing and managing a data project's code. When scaffolding a new data project, you will need the HTTPS URL of the repo.
+A remote Git repository is required for storing and managing a data project's code. This can be in either **GitHub** or **Azure DevOps**. When scaffolding a new data project, you will need the HTTPS URL of the repo.
+
+While Ensono Stacks supports storing code in both GitHub and Azure DevOps, it does not currently support CI/CD pipelines using GitHub Actions. Requirements for Azure DevOps are detailed in the [CI/CD - Azure DevOps](#cicd---azure-devops) section below.
 
 The examples and quickstart documentation assume that `main` is the primary branch in the repo.
 
@@ -34,7 +37,7 @@ The examples and quickstart documentation assume that `main` is the primary branch
 
 In order to deploy an Ensono Stacks Data Platform into Azure, you will need:
 
 * One or more Azure subscriptions – for deploying the solution into
-* Azure service principal (Application) – with permissions to deploy and configure all required
+* Azure service principal (Application) – must have `Contributor` access to deploy and configure all required
   resources into the target subscription(s)
 
 ### Terraform state storage
 
@@ -45,9 +48,9 @@ Deployment of Azure resources in Ensono Stacks is done through Terraform. Within
 
 * Resource group name
 * Container name
 
-## Azure DevOps
+## CI/CD - Azure DevOps
 
-CI/CD processes within the Ensono Stacks Data Platform are designed to be run in Azure DevOps Pipelines[^1]. Therefore, it is a requirement to [create a project in Azure DevOps](https://learn.microsoft.com/en-us/azure/devops/organizations/projects/create-project?view=azure-devops&tabs=browser).
+CI/CD processes within the Ensono Stacks Data Platform are currently designed to be run in Azure DevOps Pipelines[^1]. Therefore, it is a requirement to [create a project in Azure DevOps](https://learn.microsoft.com/en-us/azure/devops/organizations/projects/create-project?view=azure-devops&tabs=browser).
 
 [^1]: More general information on [using Azure Pipelines in Stacks](/docs/infrastructure/azure/pipelines/azure_devops) is also available.
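The Terraform state storage hunk above lists the values a deployment needs: a storage account, its resource group, and a blob container. A minimal sketch of provisioning that storage with the Azure CLI, reusing the example names `tfgroup`/`tfstorage`/`tfcontainer` and the `ukwest` region from the scaffold config table (all of these are placeholder assumptions, and the storage account name must be globally unique in practice):

```bash
# Resource group to hold the Terraform state storage
az group create --name tfgroup --location ukwest

# Storage account (name must be globally unique, 3-24 lowercase alphanumerics)
az storage account create \
  --name tfstorage \
  --resource-group tfgroup \
  --location ukwest \
  --sku Standard_LRS

# Blob container that will hold the Terraform state files
az storage container create --name tfcontainer --account-name tfstorage
```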
 
 amido-stacks-de-pipeline-env
 
-| Variable Name                    | When Needed      | Description                                 |
-|----------------------------------|------------------|---------------------------------------------|
-| ADLS_DataLake_URL                | After core infra | Azure Data Lake Storage Gen2 URL            |
-| blob_adls_storage                | After core infra | Azure Data Lake Storage Gen2 name           |
-| blob_configStorage               | After core infra | Blob storage name                           |
-| Blob_ConfigStore_serviceEndpoint | After core infra | Blob service URL                            |
-| databricksHost                   | After core infra | Databricks URL                              |
-| databricksWorkspaceResourceId    | After core infra | Databricks workspace resource id            |
-| datafactoryname                  | After core infra | Azure Data Factory name                     |
-| github_token                     | After core infra | GitHub PAT token, see below for more details|
-| integration_runtime_name         | After core infra | Azure Data Factory integration runtime name |
-| KeyVault_baseURL                 | After core infra | Vault URI                                   |
-| keyvault_name                    | After core infra | Key Vault name                              |
-| location                         | Project start    | Azure region                                |
-| resource_group                   | Project start    | Name of the resource group                  |
-| sql_connection                   | After core infra | Connection string to Azure SQL database     |
+| Variable Name                    | When Needed      | Description                                  |
+|----------------------------------|------------------|----------------------------------------------|
+| ADLS_DataLake_URL                | After core infra | Azure Data Lake Storage Gen2 URL             |
+| blob_adls_storage                | After core infra | Azure Data Lake Storage Gen2 name            |
+| blob_configStorage               | After core infra | Blob storage name                            |
+| Blob_ConfigStore_serviceEndpoint | After core infra | Blob service URL                             |
+| databricksHost                   | After core infra | Databricks URL                               |
+| databricksWorkspaceResourceId    | After core infra | Databricks workspace resource id             |
+| datafactoryname                  | After core infra | Azure Data Factory name                      |
+| github_token                     | After core infra | GitHub PAT token, see below for more details |
+| integration_runtime_name         | After core infra | Azure Data Factory integration runtime name  |
+| KeyVault_baseURL                 | After core infra | Vault URI                                    |
+| keyvault_name                    | After core infra | Key Vault name                               |
+| location                         | Project start    | Azure region                                 |
+| resource_group                   | Project start    | Name of the resource group                   |
+| sql_connection                   | After core infra | Connection string to Azure SQL database      |
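The variable group above can be created through the Azure DevOps Library UI, or scripted with the Azure DevOps extension for the Azure CLI. A sketch, assuming a hypothetical `mycompany` organisation and the `stacks-data-platform` project, seeding only the two "Project start" variables from the table; the "After core infra" values would be added once the core infrastructure outputs exist:

```bash
# Requires the Azure DevOps CLI extension and a signed-in session
az extension add --name azure-devops

# Create the variable group with the values known at project start
az pipelines variable-group create \
  --organization https://dev.azure.com/mycompany \
  --project stacks-data-platform \
  --name amido-stacks-de-pipeline-env \
  --variables location=ukwest resource_group=my-resource-group
```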
@@ -135,7 +138,7 @@ You do not require any permissions on this token because GitHub only needs to re
 
 Service Connections are used in Azure DevOps Pipelines to connect to external services, like Azure and GitHub. You must create the following Service Connections:
 
-| Name | When Needed | Description |
-|-----------------------|---------------|-------------------------------------------------------|
-| Stacks.Pipeline.Builds | Project start | The Service Connection to Azure. The service principal or managed identity that is used to create the connection must have contributor access to the Azure Subscription. |
-| GitHubReleases | Project start | The Service Connection to GitHub for releases. The access token that is used to create the connection must have read/write access to the GitHub repository. |
+| Name                   | When Needed   | Description |
+|------------------------|---------------|-------------|
+| Stacks.Pipeline.Builds | Project start | The Service Connection to Azure. The service principal or managed identity that is used to create the connection must have contributor access to the Azure Subscription. See [here](https://learn.microsoft.com/en-us/azure/devops/pipelines/library/connect-to-azure?view=azure-devops) for more information. |
+| GitHubReleases         | Project start | The Service Connection to GitHub for releases. The access token that is used to create the connection must have read/write access to the GitHub repository. See [here](https://learn.microsoft.com/en-us/azure/devops/pipelines/library/service-endpoints?view=azure-devops#github-service-connection) for more information. |
diff --git a/sidebars.js b/sidebars.js
index a9f96295c..6f0b7179c 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -293,8 +293,8 @@ module.exports = {
         type: "category",
         label: "Getting Started",
         items: [
-          "workloads/azure/data/getting_started/requirements_data_azure",
           "workloads/azure/data/getting_started/getting_started",
+          "workloads/azure/data/getting_started/requirements_data_azure",
           "workloads/azure/data/getting_started/generate_project",
           "workloads/azure/data/getting_started/core_data_platform_deployment_azure",
           "workloads/azure/data/getting_started/dev_quickstart_data_azure",
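The service connections in the table above (before the sidebars.js change) are typically created through the Azure DevOps project settings UI, but they can also be scripted. A sketch using the Azure DevOps CLI extension, assuming the names from the table and an organisation/project already set via `az devops configure`; the `<...>` values are placeholders, and the service principal key and GitHub PAT are supplied through the extension's `AZURE_DEVOPS_EXT_*` environment variables rather than flags:

```bash
# Service Connection to Azure (reads the service principal key from
# the AZURE_DEVOPS_EXT_AZURE_RM_SERVICE_PRINCIPAL_KEY environment variable)
az devops service-endpoint azurerm create \
  --name Stacks.Pipeline.Builds \
  --azure-rm-service-principal-id <app-id> \
  --azure-rm-subscription-id <subscription-id> \
  --azure-rm-subscription-name <subscription-name> \
  --azure-rm-tenant-id <tenant-id>

# Service Connection to GitHub (reads the PAT from the
# AZURE_DEVOPS_EXT_GITHUB_PAT environment variable)
az devops service-endpoint github create \
  --name GitHubReleases \
  --github-url https://github.com/mycompany/stacks-data-platform
```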