Documentation: Open Cap Stack "Lake House" Setup
Overview
This document provides a detailed breakdown of the system architecture for the Open Cap Stack "Lake House" project. It outlines the applications and services used, along with their respective versions, configurations, and the overall data flow. The goal of this lake house is to efficiently manage metadata and data ingestion processes, ensuring scalability and flexibility for future data-related operations.
1. Applications and Components
a. PostgreSQL Database
Version: PostgreSQL 14.12 (installed via Homebrew)
Purpose: Serves as the metadata store for the lake house, storing information about datasets, schema definitions, and ingestion logs.
Configuration:
Database Name: lakehouse_metadata
User: lakehouse_user
Tables:
datasets: Stores metadata about the datasets ingested into the lakehouse.
dataset_schema: Stores schema information for each dataset.
ingestion_logs: Logs details of data ingestion processes, including statuses and error messages.
b. Apache Spark
Version: Apache Spark 3.3.0
Delta Lake Version: Delta Core 1.2.1 (Stable for current setup)
Purpose: Provides distributed data processing and enables Delta Lake operations for reliable storage and transaction support in the lake house.
Configuration:
Delta Table Location: /tmp/delta-table
Delta Table Schema: Supports versioned updates with the ability to merge new data.
Delta Lake Features:
Schema merging for flexible data structure evolution.
Support for both historical queries (using versionAsOf) and time travel to previous versions of data.
c. Delta Lake
Version: Delta Core 1.2.1
Purpose: Delta Lake sits on top of the Spark engine to provide ACID transactions and time travel for data stored in the lake.
Usage:
Writing and reading Delta tables using Spark.
Schema merging and overwriting through mergeSchema and overwriteSchema options.
Maintaining history of datasets with version control for recovery and auditing.
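As a brief illustration of these operations, a spark-shell snippet along the following lines could be used. This is a sketch that assumes a Spark shell started with the Delta package (as shown in the guide below) and an existing Delta table at /tmp/delta-table; the column names and sample values are illustrative:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.lit
// Append new rows with an extra column; mergeSchema lets the table schema evolve
val updates = spark.range(0, 3).toDF("id").withColumn("source", lit("demo"))
updates.write.format("delta").mode("append").option("mergeSchema", "true").save("/tmp/delta-table")
// Inspect the table's version history for auditing and recovery
DeltaTable.forPath(spark, "/tmp/delta-table").history().show()
// Time travel: read the table as it looked at an earlier version
spark.read.format("delta").option("versionAsOf", 0L).load("/tmp/delta-table").show()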
d. Apache Airflow
Version: Apache Airflow 2.7.2 (Installed in a Python virtual environment)
Purpose: Orchestrates data workflows, automating the ingestion, transformation, and management of data pipelines.
Configuration:
Executor: SequentialExecutor (suitable for the current single-node deployment).
Scheduler and Webserver: Airflow's components are used to run workflows and provide a user interface for DAG (Directed Acyclic Graph) management.
Database: SQLite (for quick testing; PostgreSQL or MySQL is recommended for production).
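For reference, the corresponding settings in airflow.cfg would look roughly like the following sketch; the connection strings are placeholders rather than the project's actual values:
[core]
executor = SequentialExecutor
[database]
# Default SQLite backend used for quick testing
sql_alchemy_conn = sqlite:////path/to/airflow/airflow.db
# Recommended for production: point at PostgreSQL instead, e.g.
# sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/airflow_db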
e. Python Virtual Environment
Version: Python 3.8 (used in the virtual environment airflow-venv)
Installed Packages:
Apache Airflow 2.7.2: Installed with constraints to ensure compatibility across the ecosystem.
Dependencies such as Flask, Gunicorn, Pydantic, and SQLAlchemy for Airflow and DAG execution.
2. System Architecture
The system architecture consists of multiple integrated components working together to manage, process, and orchestrate data:
PostgreSQL: Stores metadata about datasets, schema definitions, and logs ingestion processes.
datasets: Keeps track of datasets in the Lake House.
dataset_schema: Stores the schema definitions of each dataset.
ingestion_logs: Records ingestion activity and logs errors or issues.
Spark with Delta Lake: Facilitates large-scale data processing and reliable storage.
Delta Lake provides the ACID transaction layer, ensuring that data is stored in a reliable and versioned format.
Spark reads and writes to the lake house using Delta tables stored in /tmp/delta-table.
Apache Airflow: Orchestrates workflows for data ingestion, processing, and maintenance.
Airflow's DAGs define how data moves from ingestion to storage and how it’s transformed within the lake.
3. Installation & Configuration Notes
PostgreSQL Setup:
A PostgreSQL instance has been configured locally with a user lakehouse_user and a database lakehouse_metadata to track dataset metadata.
PostgreSQL has been set up with table structures designed for logging dataset ingestion, schema, and dataset information.
Spark and Delta Lake:
Spark is configured with Delta Lake for transaction support, schema enforcement, and version control.
The Delta table at /tmp/delta-table provides a sample of how data is ingested and versioned.
Schema updates and merges have been demonstrated using Delta Lake's schema evolution features.
Apache Airflow:
Airflow 2.7.2 was installed in a Python virtual environment, allowing orchestration of data pipelines.
Airflow uses an SQLite backend for now, though it is recommended to switch to PostgreSQL or MySQL for production.
Step-by-Step Instructional Guide: Creating a Lake House for Open Cap Stack Platform
1. Install Apache Spark 3.3.0
Apache Spark is the foundation for distributed data processing in the lake house.
Download and install Apache Spark:
cd /usr/local
sudo curl -O https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
sudo tar -xzf spark-3.3.0-bin-hadoop3.tgz
sudo mv spark-3.3.0-bin-hadoop3 /usr/local/spark-3.3.0
Set environment variables in .zshrc or .bash_profile:
Reload the environment:
Verify Spark installation:
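A sketch of these three steps, assuming the install location used above (adjust the profile file to your shell):
# Add to ~/.zshrc or ~/.bash_profile
export SPARK_HOME=/usr/local/spark-3.3.0
export PATH="$SPARK_HOME/bin:$PATH"
# Reload the environment
source ~/.zshrc
# Verify the Spark installation
spark-shell --version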
2. Set Up Delta Lake
Delta Lake is used for managing ACID transactions and time travel in the lake house.
Launch Spark shell with Delta Lake support:
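With the versions listed above, the launch command would look roughly like this (double-check that the Delta release matches your Spark version):
spark-shell --packages io.delta:delta-core_2.12:1.2.1 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"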
Create and manage a Delta table:
Verify Delta table creation:
Interact with Delta table:
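Inside the Spark shell, the create, verify, and interact steps above could look like the following sketch; the sample data is illustrative:
// Create and write a small Delta table at the path used throughout this document
spark.range(0, 5).toDF("id").write.format("delta").mode("overwrite").save("/tmp/delta-table")
// Verify the table by reading it back
val df = spark.read.format("delta").load("/tmp/delta-table")
df.show()
// Interact with the table: append more rows and re-read
spark.range(5, 10).toDF("id").write.format("delta").mode("append").save("/tmp/delta-table")
spark.read.format("delta").load("/tmp/delta-table").count()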
3. Install PostgreSQL for Metadata Database
PostgreSQL will be used to store the metadata for the datasets and ingestion logs.
Install PostgreSQL via Homebrew:
Access PostgreSQL shell:
Create a user and database for metadata:
Grant necessary privileges:
Create tables for metadata:
Verify connection and test data:
Insert sample data:
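A consolidated sketch of the commands for the steps above. The GRANT and INSERT statements are the ones used elsewhere in this document; the Homebrew formula name, the password, and the CREATE TABLE column definitions are illustrative assumptions:
# Install and start PostgreSQL (formula name may differ on your system)
brew install postgresql@14
brew services start postgresql@14
# Access the PostgreSQL shell
psql postgres
-- Create a user and database for metadata
CREATE USER lakehouse_user WITH PASSWORD 'change_me';
CREATE DATABASE lakehouse_metadata OWNER lakehouse_user;
GRANT ALL PRIVILEGES ON DATABASE lakehouse_metadata TO lakehouse_user;
-- Connect to the database and create the metadata tables (columns are illustrative)
\c lakehouse_metadata
CREATE TABLE datasets (
    dataset_id SERIAL PRIMARY KEY,
    dataset_name TEXT NOT NULL,
    description TEXT,
    storage_location TEXT,
    created_at TIMESTAMP DEFAULT now()
);
CREATE TABLE dataset_schema (
    schema_id SERIAL PRIMARY KEY,
    dataset_id INTEGER REFERENCES datasets(dataset_id),
    column_name TEXT NOT NULL,
    column_type TEXT NOT NULL
);
CREATE TABLE ingestion_logs (
    log_id SERIAL PRIMARY KEY,
    dataset_id INTEGER REFERENCES datasets(dataset_id),
    status TEXT,
    error_message TEXT,
    ingested_at TIMESTAMP DEFAULT now()
);
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO lakehouse_user;
-- Verify the connection and insert sample data
INSERT INTO datasets (dataset_name, description, storage_location)
VALUES ('Sample Dataset', 'This is a sample dataset for testing', '/data/sample-dataset');
SELECT * FROM datasets;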
4. Install and Configure MinIO Object Storage
MinIO provides object storage functionality for raw and structured data in the lake.
Download and install MinIO:
Start MinIO:
Access the MinIO UI: Navigate to http://localhost:9000 in your browser and log in with the default credentials (minioadmin/minioadmin).
Create a bucket for data storage: In the MinIO UI, create a bucket (e.g., datalake) for your datasets.
Test MinIO bucket access:
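One way to test bucket access from the command line is the MinIO client (mc); the formula name, alias, and sample file below are assumptions to adapt:
# Install the MinIO client and register the local server (default credentials shown above)
brew install minio-mc
mc alias set local http://localhost:9000 minioadmin minioadmin
# Create the bucket (skip if it already exists), copy a file in, and list the contents
mc mb local/datalake
mc cp ./sample.csv local/datalake/
mc ls local/datalake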
5. Install Apache Airflow for Orchestration
Apache Airflow orchestrates workflows and DAGs (Directed Acyclic Graphs) for your lake house operations.
Set up Python virtual environment:
python3 -m venv airflow-venv
source airflow-venv/bin/activate
Install Airflow:
Initialize Airflow DB:
Create an Airflow admin user:
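The commands for the three steps above might look like this; the constraints URL follows Airflow's documented pattern for Python 3.8, and the user details are placeholders:
# Install Airflow 2.7.2 with the matching constraints file
pip install "apache-airflow==2.7.2" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.2/constraints-3.8.txt"
# Initialize the Airflow metadata database (SQLite by default)
airflow db init
# Create an admin user for the web UI
airflow users create \
  --username admin --password admin \
  --firstname Admin --lastname User \
  --role Admin --email admin@example.com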
Start Airflow scheduler and webserver:
airflow scheduler &
airflow webserver --port 8080
Access Airflow UI: Navigate to http://localhost:8080 and log in with the admin user credentials.
6. Integrating Components into the Lake House
Using Spark with MinIO: To read/write data to MinIO with Spark, use the following configuration:
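The original configuration block was not captured here; a typical spark-shell invocation against a local MinIO endpoint might look like the following sketch (the hadoop-aws version and the credentials are assumptions to adjust):
spark-shell \
  --packages io.delta:delta-core_2.12:1.2.1,org.apache.hadoop:hadoop-aws:3.3.2 \
  --conf spark.hadoop.fs.s3a.endpoint=http://localhost:9000 \
  --conf spark.hadoop.fs.s3a.access.key=minioadmin \
  --conf spark.hadoop.fs.s3a.secret.key=minioadmin \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false
Data can then be read and written with s3a:// paths, for example s3a://datalake/raw/.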
Orchestrate with Airflow: Use Airflow to manage Spark jobs, MinIO interactions, and metadata updates. Create DAGs in Airflow to automate data ingestions and metadata registration.
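A minimal DAG along these lines could serve as a starting point; the dag_id, schedule, and spark-submit command are illustrative placeholders rather than part of the original setup:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal example DAG: submit a Spark ingestion job; further tasks could
# register dataset metadata in PostgreSQL once the job succeeds.
with DAG(
    dag_id="lakehouse_ingestion_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # trigger manually while developing
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_to_delta",
        bash_command=(
            "spark-submit --packages io.delta:delta-core_2.12:1.2.1 "
            "/path/to/ingest_job.py"
        ),
    )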
Developer Instructions: Setting Up the Open Cap Stack Lake House Locally for Development
These instructions will guide developers through the process of setting up the Open Cap Stack "Lake House" environment on a local machine for development purposes. This setup includes Apache Spark with Delta Lake, PostgreSQL for metadata, MinIO for object storage, and Apache Airflow for orchestration.
1. Install Prerequisites
Ensure the following software is installed on your local machine:
Homebrew (macOS): Used to install many of the necessary tools. Install Homebrew if you haven't already by running the following command:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Java: Apache Spark requires Java. Install it with Homebrew:
Python 3.8: Ensure Python 3.8 or above is installed. You can install Python via Homebrew:
PostgreSQL: Install PostgreSQL locally using Homebrew:
MinIO: MinIO object storage will serve as the local S3-compatible storage solution:
Apache Airflow: Install Airflow via pip within a virtual environment:
python3 -m venv airflow-venv
source airflow-venv/bin/activate
pip install apache-airflow==2.7.2
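The Homebrew installs for the other prerequisites above might look like this (formula names are assumptions that can differ between systems):
brew install openjdk@11
brew install python@3.8
brew install postgresql@14
brew install minio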
2. Apache Spark and Delta Lake Setup
Download Apache Spark: Download and extract Spark (version 3.3.0) to your local machine from the official Spark website:
Set Up Environment Variables: Add Spark-related variables to your .bash_profile or .zshrc, as sketched in the instructional guide above.
Install Delta Lake Dependencies: Launch Spark with the Delta Lake package (the same spark-shell command shown in the instructional guide above).
Verify Delta Table Creation: Once Spark is running, create and write a Delta table to the local file system, as shown in the instructional guide above.
3. PostgreSQL Metadata Database
Start PostgreSQL: Ensure PostgreSQL is running:
Create Database and User: As the postgres user, create the lakehouse_user role and the lakehouse_metadata database (the same CREATE USER and CREATE DATABASE statements shown in the instructional guide above).
Create Metadata Tables: Connect to the database and create the tables for metadata storage.
Run the SQL from the table sketch in the instructional guide above to create the datasets, dataset_schema, and ingestion_logs tables.
Grant Necessary Privileges:
GRANT ALL PRIVILEGES ON DATABASE lakehouse_metadata TO lakehouse_user;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO lakehouse_user;
4. MinIO Object Storage
Start MinIO: Run MinIO in standalone mode:
minio server /path/to/minio/storage
Access MinIO UI: Navigate to http://127.0.0.1:9000 in your browser to access the MinIO management UI. You can create a new bucket (e.g., lakehouse-bucket) for storing your datasets.
5. Apache Airflow Setup
Activate Airflow Environment:
source airflow-venv/bin/activate
Initialize the Airflow Database and Create an Admin User: Run the same airflow db init and airflow users create commands shown in the instructional guide above.
Start Airflow Web Server and Scheduler:
airflow webserver --port 8080
Start the Airflow scheduler in a separate terminal:
airflow scheduler
The Airflow web UI can be accessed at http://localhost:8080.
6. Verify Environment
Once all components are running:
Confirm Delta tables can be written and read using Spark.
Check that metadata can be stored in PostgreSQL by inserting sample datasets.
Verify MinIO object storage by uploading and retrieving files via the web interface.
Test Airflow workflow automation by creating simple DAGs and verifying execution in the web UI.
This development environment mimics the architecture of the Open Cap Stack lake house, allowing you to develop and test workflows, storage operations, and metadata management processes locally.