Add admin-guide for hubble #669

Merged
merged 9 commits into from
Jun 17, 2024
Changes from 7 commits
2 changes: 1 addition & 1 deletion network/hubble/README.mdx
Original file line number Diff line number Diff line change
@@ -7,7 +7,7 @@ sidebar_position: 0

Hubble is an open-source, publicly available dataset that provides a complete historical record of the Stellar network. Similar to Horizon, it ingests and presents the data produced by the Stellar network in a format that is easier to consume than the performance-oriented data representations used by Stellar Core. The dataset is hosted on BigQuery–meaning it is suitable for large, analytic workloads, historical data retrieval and complex data aggregation. **Hubble should not be used for real-time data retrieval and cannot submit transactions to the network.** For real time use cases, we recommend [running an API server](../horizon/admin-guide/README.mdx).

This guide describes when to use Hubble and how to connect. To view the underlying data structures, queries and examples, use the [Viewing Metadata](./viewing-metadata.mdx) and [Optimizing Queries](./optimizing-queries.mdx) tutorials.
This guide describes when to use Hubble and how to connect. To view the underlying data structures, queries and examples, use the [Viewing Metadata](./analyst-guide/viewing-metadata.mdx) and [Optimizing Queries](./analyst-guide/optimizing-queries.mdx) tutorials.

## Why Use Hubble?

10 changes: 10 additions & 0 deletions network/hubble/admin-guide/README.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
---
title: Admin Guide
sidebar_position: 15
---

import DocCardList from "@theme/DocCardList";

All you need to know about running a Hubble analytics platform.

<DocCardList />
10 changes: 10 additions & 0 deletions network/hubble/admin-guide/data-curation/README.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
---
title: Data Curation
sidebar_position: 20
---

import DocCardList from "@theme/DocCardList";

Running stellar-dbt-public to transform raw Stellar network data into analytics-friendly tables.

<DocCardList />
22 changes: 22 additions & 0 deletions network/hubble/admin-guide/data-curation/architecture.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
---
title: Architecture
sidebar_position: 10
---

import stellar_dbt_arch from '/img/hubble/stellar_dbt_architecture.png';

## Architecture Overview

<img src={stellar_dbt_arch} width="300"/>

In general, stellar-dbt-public runs by:

* Selecting a dbt model to run
* Within the model run:
  * Sources are referenced and used to create staging tables
  * Staging tables then undergo various transformations and are stored in intermediate tables
  * Finishing touches and joins are applied to the intermediate tables, producing the final analytics-friendly mart tables

We try to adhere to the best practices set by the [dbt docs](https://docs.getdbt.com/docs/build/projects).

More detailed information about stellar-dbt-public and examples can be found in the [stellar-dbt-public](https://github.com/stellar/stellar-dbt-public/tree/master) repo.
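
As a rough illustration of the layering described above, the sketch below prints the dbt commands that would list the models in each layer. It assumes the repository keeps its models under `models/staging`, `models/intermediate` and `models/marts` (only `models/marts` is confirmed by the repository links in this guide), and `layer_ls_cmd` is a hypothetical helper name. The commands are only printed, not executed.

```shell
# Print the dbt command that lists the models in one layer.
# Layer directory names other than models/marts are assumptions.
layer_ls_cmd() {
  echo "dbt ls --resource-type model --select path:models/$1"
}

# Layers in the order the architecture describes them.
for layer in staging intermediate marts; do
  layer_ls_cmd "$layer"
done
```

Running one of the printed commands inside a configured dbt project should show which models belong to that layer before you build them.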
140 changes: 140 additions & 0 deletions network/hubble/admin-guide/data-curation/getting-started.mdx
Contributor

There are two options that we can recommend for running stellar-dbt-public:

  1. Recommend that the operator stand up their own dbt project, and import it as a library. This recommendation means we expect operators to build their own models on top of our dims and denormalized tables. This will be a lot of overhead for operators that don't intend to develop custom analytics in dbt.
  2. Recommend that the operator runs this dbt package in an isolated docker container. This means they don't necessarily need to build additional models but would rather maintain parity with our warehouse. Running in this manner will make dependency management and data orchestration difficult if they build their own dbt project.

@harsha-stellar-data do you have an opinion on the above? Should we document both, or just one?

Contributor Author

Hmm I think it does make sense to document both. Below already does option 2.
I'll add another section like Advanced Usage - Importing as a dbt package in this getting-started doc

Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
---
title: Getting Started
sidebar_position: 20
---

[stellar-dbt-public GitHub repository](https://github.com/stellar/stellar-dbt-public/tree/master)

[stellar/stellar-dbt-public docker images](https://hub.docker.com/r/stellar/stellar-dbt-public)

## Recommended Usage

### Docker Image

If you do not need to modify any of the stellar-dbt-public code, we recommend using the [stellar/stellar-dbt-public docker images](https://hub.docker.com/r/stellar/stellar-dbt-public).

Example to run locally with docker:

```
docker run --platform linux/amd64 -ti stellar/stellar-dbt-public:latest <parameters>
```

### Import stellar-dbt-public as a dbt Package

Alternatively, if you need to build your own models on top of stellar-dbt-public, you can import stellar-dbt-public as a dbt package into a separate dbt project.

Example instructions:

* Create a new file `packages.yml` in your dbt project (not the stellar-dbt-public project) with the yml below

```
packages:
  - git: "https://github.com/stellar/stellar-dbt-public.git"
    revision: v0.0.28
```

* (Optional) Update your profiles.yml to include profile configurations for stellar-dbt-public

```
new_project:
  target: test
  outputs:
    test:
      project: <project>
      dataset: <dataset>
      <other configurations>

stellar_dbt_public:
  target: test
  outputs:
    test:
      project: <project>
      dataset: <dataset>
      <other configurations>
```

* (Optional) Update your dbt_project.yml to include project configurations for stellar-dbt-public

```
name: 'stellar_dbt'
version: '1.0.0'
config-version: 2

profile: 'new_project'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"

models:
  new_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table

  stellar_dbt_public:
    staging:
      +materialized: ephemeral
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
```

* Models from the stellar-dbt-public package/repo will now be available in your new dbt project
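
A minimal sketch of how the package might be installed and checked once `packages.yml` is in place. `dbt deps` is the same command used in the local setup section of this guide. The package name `stellar_dbt_public` is an assumption taken from the configuration keys shown above, and `DRY_RUN`, `run` and `install_package` are hypothetical names introduced for illustration. By default the commands are only printed.

```shell
# Hypothetical switch: 1 prints each command, 0 executes it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_package() {
  # Download the packages listed in packages.yml.
  run dbt deps
  # List the resources the package contributes (package name is an assumption).
  run dbt ls --select "package:stellar_dbt_public"
}

install_package
```

If the listing comes back empty after a real run, the package name in your project probably differs from the one assumed here.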

## Building and Running Locally

### Clone the repo

```
git clone https://github.com/stellar/stellar-dbt-public
```

### Install required python packages

```
pip install --upgrade pip && pip install -r requirements.txt
```

### Install required dbt packages

```
dbt deps
```

### Running dbt

* There are many useful commands that come with dbt, which can be found in the [dbt documentation](https://docs.getdbt.com/reference/dbt-commands#available-commands)
* Most stellar-dbt-public workflows use the `dbt build` command, which will `run` each model and then `test` the model's table output
* The first time you run stellar-dbt-public, run the following to create the tables

```
dbt build --full-refresh
```

* Subsequent runs can use incremental mode, which inserts only the newest data instead of rebuilding all of history every time

```
dbt build
```

* You can also specify just a single model if you don't want to run all stellar-dbt-public models

```
dbt build --select <model name or tag>
```

Please see the [stellar-dbt-public/models/marts](https://github.com/stellar/stellar-dbt-public/tree/master/models/marts) directory for a full list of the available models that dbt can run.
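
The three invocations above differ only in their flags, so a scheduler script could assemble them from a run mode and an optional selector. The helper below is a sketch under that assumption: `dbt_build_cmd` and the model name `my_model` are hypothetical, and the function only prints the command rather than running it.

```shell
# Build the dbt command for a given run mode.
# $1: "full" for the first run, anything else (or empty) for incremental.
# $2: optional model name or tag passed to --select.
dbt_build_cmd() {
  cmd="dbt build"
  if [ "${1:-incremental}" = "full" ]; then
    cmd="$cmd --full-refresh"
  fi
  if [ -n "${2:-}" ]; then
    cmd="$cmd --select $2"
  fi
  echo "$cmd"
}

dbt_build_cmd full                  # prints: dbt build --full-refresh
dbt_build_cmd incremental           # prints: dbt build
dbt_build_cmd incremental my_model  # prints: dbt build --select my_model
```

Keeping the mode explicit makes it harder to trigger a full rebuild of history by accident on a routine run.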
15 changes: 15 additions & 0 deletions network/hubble/admin-guide/data-curation/overview.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
---
title: "Overview"
sidebar_position: 0
---

Data curation in Hubble is done through [stellar-dbt-public](https://github.com/stellar/stellar-dbt-public). stellar-dbt-public transforms raw Stellar network data from BigQuery datasets and tables into aggregates for more user-friendly analytics.

It is worth noting that most users will not need to stand up and run their own stellar-dbt-public instance. The Stellar Development Foundation provides public access to fully transformed Stellar network data through the public datasets and tables in GCP BigQuery. Instructions on how to access this data can be found in the [Connecting](https://developers.stellar.org/network/hubble/analyst-guide/connecting) section.

## Why Run stellar-dbt-public?

Running stellar-dbt-public within your own infrastructure provides a number of benefits. You can:

- Have full operational control without dependency on the Stellar Development Foundation for network data
- Run modified ETL/ELT pipelines that fit your individual business needs
10 changes: 10 additions & 0 deletions network/hubble/admin-guide/scheduling-and-orchestration/README.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
---
title: Scheduling and Orchestration
sidebar_position: 100
---

import DocCardList from "@theme/DocCardList";

Stitching all the components together.

<DocCardList />
Contributor

It may be worthwhile to reference dbt docs and that our project adheres to many best practices laid out by dbt.

Contributor Author

@chowbao chowbao Jun 13, 2024

Do we want to mention that here or in our data-curation/architecture section? Oh NVM I think it should be both.

Here we follow it for orchestration and scheduling purposes
In data-curation we follow the directory structure and model format that's recommended

Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
---
title: Architecture
sidebar_position: 10
---

import stellar_etl_airflow_arch from '/img/hubble/stellar_etl_airflow_architecture.png';

## Architecture Overview

<img src={stellar_etl_airflow_arch} width="300"/>

In general, stellar-etl-airflow runs by:

* Scheduling DAGs to run `stellar-etl` and upload its output data to BigQuery
* Scheduling DAGs to run `stellar-dbt-public` using the data in BigQuery

We try to adhere to the best practices set by the [dbt docs](https://docs.getdbt.com/docs/build/projects).

More detailed information about stellar-etl-airflow can be found in the [stellar-etl-airflow](https://github.com/stellar/stellar-etl-airflow/tree/master) repo.
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
---
title: Getting Started
sidebar_position: 20
---

import history_table_export from '/img/hubble/history_table_export.png';
import state_table_export from '/img/hubble/state_table_export.png';
import dbt_enriched_base_tables from '/img/hubble/dbt_enriched_base_tables.png';

[stellar-etl-airflow GitHub repository](https://github.com/stellar/stellar-etl-airflow/tree/master)

## GCP Account Setup

The Stellar Development Foundation runs Hubble in GCP using Composer and BigQuery. To follow the same deployment, you will need access to a GCP project. Instructions can be found in the [Get Started](https://cloud.google.com/docs/get-started) documentation from Google.

Note: BigQuery and Composer should be available by default. If they are not, you can find instructions for enabling them in the [BigQuery](https://cloud.google.com/bigquery?hl=en) or [Composer](https://cloud.google.com/composer?hl=en) Google documentation.

## Create GCP Composer Instance to Run Airflow

Instructions on bringing up a GCP Composer instance to run Hubble can be found in the [Installation and Setup](https://github.com/stellar/stellar-etl-airflow?tab=readme-ov-file#installation-and-setup) section in the [stellar-etl-airflow](https://github.com/stellar/stellar-etl-airflow) repository.

:::note

Hardware requirements can be very different depending on the Stellar network data you require. The default GCP settings may be higher/lower than actually required.

:::

## Configuring GCP Composer Airflow

There are two things required for the configuration and setup of GCP Composer Airflow:

* Upload DAGs to the Composer Airflow Bucket
* Configure the Airflow variables for your GCP setup

For more detailed instructions please see the [stellar-etl-airflow Installation and Setup](https://github.com/stellar/stellar-etl-airflow?tab=readme-ov-file#installation-and-setup) documentation.

### Uploading DAGs

Within the [stellar-etl-airflow](https://github.com/stellar/stellar-etl-airflow) repo there is an [upload_static_to_gcs.sh](https://github.com/stellar/stellar-etl-airflow/blob/master/upload_static_to_gcs.sh) shell script that will upload all the DAGs and schemas into your Composer Airflow bucket.

This can also be done using the [gcloud CLI or console](https://cloud.google.com/storage/docs/uploading-objects) and manually selecting the DAGs and schemas you wish to upload.
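
For the manual route, a sketch along these lines could copy a local `dags` directory into the bucket with the gcloud CLI. The bucket name `my-composer-bucket`, the `DRY_RUN` switch and the helper names are hypothetical, and the sketch assumes the bucket expects DAGs under a top-level `dags/` folder. It does not cover schemas; check `upload_static_to_gcs.sh` for where those belong. By default the command is only printed.

```shell
# Hypothetical switch: 1 prints each command, 0 executes it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Copy the local dags/ directory into the given bucket.
# $1 is the Composer bucket name (placeholder value below).
upload_dags() {
  run gcloud storage cp --recursive dags "gs://$1/"
}

upload_dags my-composer-bucket
```

Reviewing the printed command first is a cheap way to confirm the destination before anything is written to the bucket.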

### Configuring Airflow Variables

Please see the [Airflow Variables Explanation](https://github.com/stellar/stellar-etl-airflow?tab=readme-ov-file#airflow-variables-explanation) documentation for more information about which variables must be configured.

## Running the DAGs

To run a DAG, toggle it on, as shown below:

![Toggle DAGs](/img/hubble/airflow_dag_toggle.png)

More information about each DAG can be found in the [DAG Diagrams](https://github.com/stellar/stellar-etl-airflow?tab=readme-ov-file#dag-diagrams) documentation.

## Available DAGs

More information about the public DAGs can be found in the [stellar-etl-airflow README](https://github.com/stellar/stellar-etl-airflow/blob/master/README.md#public-dags).

### History Table Export DAG

[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/history_tables_dag.py):

- Exports a subset of the sources (ledgers, operations, transactions, trades, effects and assets) from Stellar using the data lake of LedgerCloseMeta files
  - Optionally, this can ingest data using captive-core, but that is neither ideal nor recommended for use with Airflow
- Inserts into BigQuery

<img src={history_table_export} width="300"/>

### State Table Export DAG

[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/state_table_dag.py):

- Exports accounts, account_signers, offers, claimable_balances, liquidity pools, trustlines, contract_data, contract_code, config_settings and ttl.
- Inserts into BigQuery

<img src={state_table_export} width="300"/>

### DBT Enriched Base Tables DAG

[This DAG](https://github.com/stellar/stellar-etl-airflow/blob/master/dags/dbt_enriched_base_tables_dag.py):

- Creates the DBT staging views for models
- Updates the enriched_history_operations table
- Updates the current state tables
- (Optional) Sends warnings and errors to Slack

<img src={dbt_enriched_base_tables} width="300"/>
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
---
title: "Overview"
sidebar_position: 0
---

Hubble uses [stellar-etl-airflow](https://github.com/stellar/stellar-etl-airflow) to schedule and orchestrate all its workflows. This includes the scheduling and running of stellar-etl and stellar-dbt.

It is worth noting that most users will not need to stand up and run their own Hubble. The Stellar Development Foundation provides public access to the data through the public datasets and tables in GCP BigQuery. Instructions on how to access this data can be found in the [Connecting](https://developers.stellar.org/network/hubble/connecting) section.

## Why Run stellar-etl-airflow?

Running stellar-etl-airflow within your own infrastructure provides a number of benefits. You can:

- Have full operational control without dependency on the Stellar Development Foundation for network data
- Run modified ETL/ELT pipelines that fit your individual business needs
10 changes: 10 additions & 0 deletions network/hubble/admin-guide/source-system-ingestion/README.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
---
title: Source System Ingestion
sidebar_position: 10
---

import DocCardList from "@theme/DocCardList";

Running stellar-etl for Stellar network data ingestion.

<DocCardList />
Contributor

wdyt about adding the High Level systems context diagram found here as well?

Contributor Author

Oh yeah I like that. We should update that diagram though

Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
---
title: Architecture
sidebar_position: 10
---

import stellar_arch from '/img/hubble/stellar_overall_architecture.png';
import stellar_etl_arch from '/img/hubble/stellar_etl_architecture.png';

## Architecture Overview

<img src={stellar_arch} width="300"/>

<img src={stellar_etl_arch} width="300"/>

In general, stellar-etl runs by:

* Accepting an export command to export data between a start and end ledger
Contributor

If you're new to Hubble architecture this part is confusing. Suggestion to add more details about using captive core to read and write ledgerclose meta to a file store

Contributor Author

I updated this but not sure if it's what you wanted

Contributor

I think we just need more context and details in this section in general. This should be done in a follow up PR and does not need to be included in this PR

* Reading the LedgerCloseMeta files from the data lake created by Ledger Exporter
* Transforming the LedgerCloseMeta XDR into an easy-to-parse JSON format
* Optionally uploading the JSON files to GCS or any other cloud storage service

More detailed information about stellar-etl and examples can be found in the [stellar-etl](https://github.com/stellar/stellar-etl/tree/master) repo.
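
To make the "export between a start and end ledger" step concrete, the sketch below assembles a ledger export command. The `export_ledgers` subcommand and its `--start-ledger`, `--end-ledger` and `--output` flags are written as I recall them from the stellar-etl README and should be checked against `stellar-etl --help` for your version. `export_ledgers_cmd` is a hypothetical helper that only prints the command.

```shell
# Build a stellar-etl ledger export command for a ledger range.
# $1 = start ledger, $2 = end ledger, $3 = output file.
export_ledgers_cmd() {
  # Reject ranges where the start ledger comes after the end ledger.
  if [ "$1" -gt "$2" ]; then
    echo "start ledger must not exceed end ledger" >&2
    return 1
  fi
  echo "stellar-etl export_ledgers --start-ledger $1 --end-ledger $2 --output $3"
}

export_ledgers_cmd 1000 500000 exported_ledgers.txt
```

Validating the range before invoking the exporter keeps a mistyped ledger number from starting a long-running job.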