Add admin-guide for hubble (#669)
chowbao authored Jun 17, 2024
1 parent 96cf8ab commit 88eeb4a
Showing 29 changed files with 570 additions and 2 deletions.
2 changes: 1 addition & 1 deletion network/hubble/README.mdx
@@ -7,7 +7,7 @@ sidebar_position: 0

Hubble is an open-source, publicly available dataset that provides a complete historical record of the Stellar network. Similar to Horizon, it ingests and presents the data produced by the Stellar network in a format that is easier to consume than the performance-oriented data representations used by Stellar Core. The dataset is hosted on BigQuery, which makes it suitable for large analytic workloads, historical data retrieval and complex data aggregation. **Hubble should not be used for real-time data retrieval and cannot submit transactions to the network.** For real-time use cases, we recommend [running an API server](../horizon/admin-guide/README.mdx).

-This guide describes when to use Hubble and how to connect. To view the underlying data structures, queries and examples, use the [Viewing Metadata](./viewing-metadata.mdx) and [Optimizing Queries](./optimizing-queries.mdx) tutorials.
+This guide describes when to use Hubble and how to connect. To view the underlying data structures, queries and examples, use the [Viewing Metadata](./analyst-guide/viewing-metadata.mdx) and [Optimizing Queries](./analyst-guide/optimizing-queries.mdx) tutorials.

## Why Use Hubble?

10 changes: 10 additions & 0 deletions network/hubble/admin-guide/README.mdx
@@ -0,0 +1,10 @@
---
title: Admin Guide
sidebar_position: 15
---

import DocCardList from "@theme/DocCardList";

All you need to know about running a Hubble analytics platform.

<DocCardList />
10 changes: 10 additions & 0 deletions network/hubble/admin-guide/data-curation/README.mdx
@@ -0,0 +1,10 @@
---
title: Data Curation
sidebar_position: 20
---

import DocCardList from "@theme/DocCardList";

Running stellar-dbt-public to transform raw Stellar network data into analytics-friendly tables.

<DocCardList />
22 changes: 22 additions & 0 deletions network/hubble/admin-guide/data-curation/architecture.mdx
@@ -0,0 +1,22 @@
---
title: Architecture
sidebar_position: 10
---

import stellar_dbt_arch from '/img/hubble/stellar_dbt_architecture.png';

## Architecture Overview

<img src={stellar_dbt_arch} width="300"/>

In general, stellar-dbt-public runs by:

* Selecting a dbt model to run
* Within the model run:
  * Sources are referenced and used to create staging tables
  * Staging tables then undergo various transformations and are stored in intermediate tables
  * Finishing touches and joins are done on the intermediate tables, which produce the final analytics-friendly mart tables

We try to adhere to the best practices set by the [dbt docs](https://docs.getdbt.com/docs/build/projects).

More detailed information about stellar-dbt-public and examples can be found in the [stellar-dbt-public](https://github.com/stellar/stellar-dbt-public/tree/master) repo.
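
As a rough illustration of the layering described above, the following Python sketch mimics the source, staging, intermediate and mart flow on a few in-memory rows. It is not how dbt or stellar-dbt-public is implemented (dbt models are written in SQL); the row fields and function names are hypothetical and exist only to show how each layer builds on the previous one.

```python
from collections import defaultdict

# Hypothetical raw "source" rows; in Hubble the real sources are BigQuery tables.
SOURCE_ROWS = [
    {"ledger": "100", "asset": "XLM", "amount": "5.0"},
    {"ledger": "100", "asset": "USDC", "amount": "2.5"},
    {"ledger": "101", "asset": "XLM", "amount": "1.5"},
]


def staging(rows):
    """Staging layer: rename and cast source columns, no business logic."""
    return [
        {
            "ledger_sequence": int(r["ledger"]),
            "asset_code": r["asset"],
            "amount": float(r["amount"]),
        }
        for r in rows
    ]


def intermediate(rows):
    """Intermediate layer: apply a transformation, here a sum per asset."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["asset_code"]] += r["amount"]
    return dict(totals)


def mart(totals):
    """Mart layer: final analytics-friendly shape, sorted for stable output."""
    return [
        {"asset_code": code, "total_amount": total}
        for code, total in sorted(totals.items())
    ]


if __name__ == "__main__":
    print(mart(intermediate(staging(SOURCE_ROWS))))
```

Each function consumes only the output of the layer before it, which mirrors the way a mart model depends on intermediate models rather than reading sources directly.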
140 changes: 140 additions & 0 deletions network/hubble/admin-guide/data-curation/getting-started.mdx
@@ -0,0 +1,140 @@
---
title: Getting Started
sidebar_position: 20
---

[stellar-dbt-public GitHub repository](https://github.com/stellar/stellar-dbt-public/tree/master)

[stellar/stellar-dbt-public docker images](https://hub.docker.com/r/stellar/stellar-dbt-public)

## Recommended Usage

### Docker Image

If you do not need to modify the stellar-dbt-public code, we recommend using the [stellar/stellar-dbt-public docker images](https://hub.docker.com/r/stellar/stellar-dbt-public).

To run the image locally with Docker:

```
docker run --platform linux/amd64 -ti stellar/stellar-dbt-public:latest <parameters>
```

### Import stellar-dbt-public as a dbt Package

Alternatively, if you need to build your own models on top of stellar-dbt-public, you can import stellar-dbt-public as a dbt package into a separate dbt project.

Example instructions:

* Create a new file `packages.yml` in your dbt project (not the stellar-dbt-public project) with the YAML below

```
packages:
  - git: "https://github.com/stellar/stellar-dbt-public.git"
    revision: v0.0.28
```

* (Optional) Update your `profiles.yml` to include profile configurations for stellar-dbt-public

```
new_project:
  target: test
  outputs:
    test:
      project: <project>
      dataset: <dataset>
      <other configurations>
stellar_dbt_public:
  target: test
  outputs:
    test:
      project: <project>
      dataset: <dataset>
      <other configurations>
```

* (Optional) Update your `dbt_project.yml` to include project configurations for stellar-dbt-public

```
name: 'stellar_dbt'
version: '1.0.0'
config-version: 2
profile: 'new_project'
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"
models:
  new_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
  stellar_dbt_public:
    staging:
      +materialized: ephemeral
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
```

* Models from the stellar-dbt-public package/repo will now be available in your new dbt project

## Building and Running Locally

### Clone the repo

```
git clone https://github.com/stellar/stellar-dbt-public
```

### Install required python packages

```
pip install --upgrade pip && pip install -r requirements.txt
```

### Install required dbt packages

```
dbt deps
```

### Running dbt

* There are many useful commands that come with dbt which can be found in the [dbt documentation](https://docs.getdbt.com/reference/dbt-commands#available-commands)
* stellar-dbt-public is designed to use the `dbt build` command which will `run` the model and `test` the model table output
* (Optional) Run with the `--full-refresh` option to rebuild incremental models from scratch

```
dbt build --full-refresh
```

* Subsequent runs can use incremental mode, which inserts only the newest data instead of rebuilding all of history every time

```
dbt build
```

* You can also specify just a single model if you don't want to run all stellar-dbt-public models

```
dbt build --select <model name or tag>
```

Please see the [stellar-dbt-public/models/marts](https://github.com/stellar/stellar-dbt-public/tree/master/models/marts) directory for a full list of the available models that dbt can run.
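
If you drive these runs from a script, the `dbt build` variants above can be assembled programmatically. The following is a minimal sketch, assuming `dbt` is on your `PATH`; the helper names and the `my_model` selector are placeholders invented for this example, not part of stellar-dbt-public. Only the `dbt build`, `--full-refresh` and `--select` arguments shown earlier are used.

```python
import shlex
import subprocess


def dbt_build_args(select=None, full_refresh=False):
    """Return the argv list for a `dbt build` invocation."""
    args = ["dbt", "build"]
    if full_refresh:
        # Rebuild all of history instead of running incrementally.
        args.append("--full-refresh")
    if select:
        # Restrict the run to one model name or tag.
        args += ["--select", select]
    return args


def run_dbt_build(select=None, full_refresh=False, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    args = dbt_build_args(select=select, full_refresh=full_refresh)
    print(shlex.join(args))
    if not dry_run:
        # check=True raises CalledProcessError if dbt exits non-zero.
        subprocess.run(args, check=True)
    return args


if __name__ == "__main__":
    # "my_model" is a placeholder, not a real stellar-dbt-public model name.
    run_dbt_build(select="my_model", full_refresh=True)
```

With `dry_run=True` (the default in this sketch) nothing is executed, which makes it easy to check the command before pointing it at a real BigQuery project.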
15 changes: 15 additions & 0 deletions network/hubble/admin-guide/data-curation/overview.mdx
@@ -0,0 +1,15 @@
---
title: "Overview"
sidebar_position: 0
---

Data curation in Hubble is done through [stellar-dbt-public](https://github.com/stellar/stellar-dbt-public). stellar-dbt-public transforms raw Stellar network data from BigQuery datasets and tables into aggregates for more user-friendly analytics.

It is worth noting that most users will not need to stand up and run their own stellar-dbt-public instance. The Stellar Development Foundation provides public access to fully transformed Stellar network data through the public datasets and tables in GCP BigQuery. Instructions on how to access this data can be found in the [Connecting](https://developers.stellar.org/network/hubble/analyst-guide/connecting) section.

## Why Run stellar-dbt-public?

Running stellar-dbt-public within your own infrastructure provides a number of benefits. You can:

- Have full operational control without dependency on the Stellar Development Foundation for network data
- Run modified ETL/ELT pipelines that fit your individual business needs
10 changes: 10 additions & 0 deletions network/hubble/admin-guide/scheduling-and-orchestration/README.mdx
@@ -0,0 +1,10 @@
---
title: Scheduling and Orchestration
sidebar_position: 100
---

import DocCardList from "@theme/DocCardList";

Stitching all the components together.

<DocCardList />
@@ -0,0 +1,18 @@
---
title: Architecture
sidebar_position: 10
---

import stellar_etl_airflow_arch from '/img/hubble/stellar_etl_airflow_architecture.png';

## Architecture Overview

<img src={stellar_etl_airflow_arch} width="300"/>

In general, stellar-etl-airflow runs by:

* Scheduling DAGs to run `stellar-etl` and upload its output to BigQuery
* Scheduling DAGs to run `stellar-dbt-public` using the data in BigQuery
* We try to adhere to the best practices set by the [dbt docs](https://docs.getdbt.com/docs/build/projects)

More detailed information about stellar-etl-airflow can be found in the [stellar-etl-airflow](https://github.com/stellar/stellar-etl-airflow/tree/master) repo.
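
The ordering described above can be pictured as a small dependency graph. The sketch below uses Python's standard-library `graphlib` (Python 3.9+) rather than Airflow itself, and the task names are illustrative only; they do not correspond to actual stellar-etl-airflow DAG or task ids.

```python
from graphlib import TopologicalSorter

# Each key depends on the tasks in its set. Names are hypothetical and
# only mirror the high-level steps listed in this page.
PIPELINE = {
    "export_with_stellar_etl": set(),
    "load_into_bigquery": {"export_with_stellar_etl"},
    "run_stellar_dbt_public": {"load_into_bigquery"},
}


def execution_order(graph):
    """Return tasks in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(PIPELINE), start=1):
        print(step, task)
```

A scheduler such as Airflow adds retries, schedules and monitoring on top of this basic idea: a task only starts once everything it depends on has finished.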
@@ -0,0 +1,87 @@
---
title: Getting Started
sidebar_position: 20
---

import history_table_export from '/img/hubble/history_table_export.png';
import state_table_export from '/img/hubble/state_table_export.png';
import dbt_enriched_base_tables from '/img/hubble/dbt_enriched_base_tables.png';

[stellar-etl-airflow GitHub repository](https://github.com/stellar/stellar-etl-airflow/tree/master)

## GCP Account Setup

The Stellar Development Foundation runs Hubble in GCP using Composer and BigQuery. To follow the same deployment, you will need access to a GCP project. Instructions can be found in the [Get Started](https://cloud.google.com/docs/get-started) documentation from Google.

Note: BigQuery and Composer should be available by default. If they are not, you can find instructions for enabling them in the [BigQuery](https://cloud.google.com/bigquery?hl=en) or [Composer](https://cloud.google.com/composer?hl=en) Google documentation.

## Create GCP Composer Instance to Run Airflow

Instructions on bringing up a GCP Composer instance to run Hubble can be found in the [Installation and Setup](https://github.com/stellar/stellar-etl-airflow?tab=readme-ov-file#installation-and-setup) section in the [stellar-etl-airflow](https://github.com/stellar/stellar-etl-airflow) repository.

:::note

Hardware requirements can be very different depending on the Stellar network data you require. The default GCP settings may be higher/lower than actually required.

:::

## Configuring GCP Composer Airflow

There are two things required for the configuration and setup of GCP Composer Airflow:

* Upload DAGs to the Composer Airflow Bucket
* Configure the Airflow variables for your GCP setup

For more detailed instructions please see the [stellar-etl-airflow Installation and Setup](https://github.com/stellar/stellar-etl-airflow?tab=readme-ov-file#installation-and-setup) documentation.

### Uploading DAGs

Within the [stellar-etl-airflow](https://github.com/stellar/stellar-etl-airflow) repo there is an [upload_static_to_gcs.sh](https://github.com/stellar/stellar-etl-airflow/blob/master/upload_static_to_gcs.sh) shell script that will upload all the DAGs and schemas into your Composer Airflow bucket.

This can also be done using the [gcloud CLI or console](https://cloud.google.com/storage/docs/uploading-objects) by manually selecting the DAGs and schemas you wish to upload.

### Configuring Airflow Variables

Please see the [Airflow Variables Explanation](https://github.com/stellar/stellar-etl-airflow?tab=readme-ov-file#airflow-variables-explanation) documentation for more information about what should and needs to be configured.

## Running the DAGs

To run a DAG, toggle it on in the Airflow UI, as shown below:

![Toggle DAGs](/img/hubble/airflow_dag_toggle.png)

More information about each DAG can be found in the [DAG Diagrams](https://github.com/stellar/stellar-etl-airflow?tab=readme-ov-file#dag-diagrams) documentation.

## Available DAGs

More information can be found in the [Public DAGs](https://github.com/stellar/stellar-etl-airflow/blob/master/README.md#public-dags) section of the stellar-etl-airflow README.

### History Table Export DAG

[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/history_tables_dag.py):

- Exports a subset of the sources (ledgers, operations, transactions, trades, effects and assets) from Stellar using the data lake of LedgerCloseMeta files
  - Optionally, this can ingest data using captive-core, but that is neither ideal nor recommended for use with Airflow
- Inserts into BigQuery

<img src={history_table_export} width="300"/>

### State Table Export DAG

[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/state_table_dag.py):

- Exports accounts, account_signers, offers, claimable_balances, liquidity_pools, trustlines, contract_data, contract_code, config_settings and ttl
- Inserts into BigQuery

<img src={state_table_export} width="300"/>

### DBT Enriched Base Tables DAG

[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/dbt_enriched_base_tables_dag.py):

- Creates the dbt staging views for models
- Updates the enriched_history_operations table
- Updates the current state tables
- (Optional) Sends warnings and errors to Slack

<img src={dbt_enriched_base_tables} width="300"/>
@@ -0,0 +1,15 @@
---
title: "Overview"
sidebar_position: 0
---

Hubble uses [stellar-etl-airflow](https://github.com/stellar/stellar-etl-airflow) to schedule and orchestrate all its workflows. This includes the scheduling and running of stellar-etl and stellar-dbt-public.

It is worth noting that most users will not need to stand up and run their own Hubble. The Stellar Development Foundation provides public access to the data through the public datasets and tables in GCP BigQuery. Instructions on how to access this data can be found in the [Connecting](https://developers.stellar.org/network/hubble/connecting) section.

## Why Run stellar-etl-airflow?

Running stellar-etl-airflow within your own infrastructure provides a number of benefits. You can:

- Have full operational control without dependency on the Stellar Development Foundation for network data
- Run modified ETL/ELT pipelines that fit your individual business needs
10 changes: 10 additions & 0 deletions network/hubble/admin-guide/source-system-ingestion/README.mdx
@@ -0,0 +1,10 @@
---
title: Source System Ingestion
sidebar_position: 10
---

import DocCardList from "@theme/DocCardList";

Running stellar-etl for Stellar network data ingestion.

<DocCardList />
@@ -0,0 +1,25 @@
---
title: Architecture
sidebar_position: 10
---

import stellar_arch from '/img/hubble/stellar_overall_architecture.png';
import stellar_etl_arch from '/img/hubble/stellar_etl_architecture.png';

## Architecture Overview

<img src={stellar_arch} width="600"/>

<img src={stellar_etl_arch} width="300"/>

In general, stellar-etl runs by:

* Reading raw data from the Stellar network
  * This can be done by running a stellar-etl export command to export data between a start and end ledger
  * stellar-etl has the ability to read from two different sources:
    * Captive-core directly to get LedgerCloseMeta
    * A data lake of compressed LedgerCloseMeta files from Ledger Exporter
* Transforming the LedgerCloseMeta XDR into an easy-to-parse JSON format
* Optionally uploading the JSON files to GCS or any other cloud storage service

More detailed information about stellar-etl and examples can be found in the [stellar-etl](https://github.com/stellar/stellar-etl/tree/master) repo.
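
Because exports are bounded by a start and end ledger, a long history is usually processed as a series of smaller ranges. The sketch below shows one way to split a range into batches; the `ledger_batches` helper is hypothetical, and treating both bounds as inclusive is an assumption made for this example rather than a statement about how stellar-etl interprets its arguments.

```python
def ledger_batches(start_ledger, end_ledger, batch_size):
    """Yield inclusive (start, end) ledger ranges covering the full range."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if end_ledger < start_ledger:
        raise ValueError("end_ledger must not be below start_ledger")
    current = start_ledger
    while current <= end_ledger:
        # The last batch may be shorter than batch_size.
        batch_end = min(current + batch_size - 1, end_ledger)
        yield current, batch_end
        current = batch_end + 1


if __name__ == "__main__":
    for start, end in ledger_batches(1, 10, 4):
        print(f"export ledgers {start} to {end}")
```

Each yielded pair could then be passed to a single export run, so a failed batch can be retried without repeating the whole range.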