From 81ff8c6acab250897e4e06cfd0113ca1ad340809 Mon Sep 17 00:00:00 2001
From: shawn
Date: Wed, 19 Jul 2023 17:46:12 -0700
Subject: [PATCH] Partner Experience: revamp ingestion docs and guidelines (#168)

---
 .../ingestion-filtering.mdx             | 123 +++-----
 docs/run-platform-server/ingestion.mdx  | 298 +++---------------
 docs/run-platform-server/installing.mdx |   6 +-
 docs/run-platform-server/monitoring.mdx |  96 +-----
 4 files changed, 107 insertions(+), 416 deletions(-)

diff --git a/docs/run-platform-server/ingestion-filtering.mdx b/docs/run-platform-server/ingestion-filtering.mdx
index 6e330246f..536633cb2 100644
--- a/docs/run-platform-server/ingestion-filtering.mdx
+++ b/docs/run-platform-server/ingestion-filtering.mdx
@@ -3,89 +3,61 @@ title: Ingestion Filtering
 order: 46
 ---

-The Ingestion Filtering feature is now released for public beta testing available from Horizon [version 2.18.0](https://github.com/stellar/go/releases/tag/horizon-v2.18.0) and up.
-
 ## Overview

-Ingestion Filtering enables Horizon operators to drastically reduce storage footprint of their Horizon DB by whitelisting Assets and/or Accounts that are relevant to their operations. This feature is ideally suited for private Horizon operators who do not need full history for all assets and accounts on the Stellar network.
+Ingestion Filtering enables Horizon operators to drastically reduce the storage footprint of the historical data in the Horizon database by white-listing Assets and/or Accounts that are relevant to their operations.

 ### Why is it useful:

-Previously, the only way to limit data storage is by limiting the amount of history Horizon ingests, either by configuring the starting ledger to be later than genesis block or via rolling retention (ie: last 30 days). This feature allows users to store the full history of assets and accounts (and related entities) that they care about.
+Previously, the only way to limit data storage was by limiting the temporal range of history via rolling retention (e.g. the last 30 days). The filtering feature allows users to store a longer historical timeframe in the Horizon database for only whitelisted assets, accounts, and their related historical entities (transactions, operations, trades, etc.).

-For further context, running a full history Horizon instance currently takes ~ 15TB of disk space (as of June 2022) with storage growing at a rate of ~ 1TB / month. As a benchmark, filtering by even 100 of the most active accounts and assets reduces storage by over 90%. For the majority of users who care about an even more limited set of assets and accounts, storage savings should be well over 99%. Other benefits are reducing operating costs for maintaining storage, improved DB health metrics and query performance.
+For further context, running an unfiltered `full` history Horizon instance currently requires over 30TB of disk space (as of June 2023) with storage growing at a rate of about 1TB/month. As a benchmark, filtering by even 100 of the most active accounts and assets reduces storage by over 90%. For the majority of applications which are interested in an even more limited set of assets and accounts, storage savings should be well over 99%. Other benefits include reducing operating costs for maintaining storage, improved DB health metrics and query performance.

 ### How does it work:

-This feature provides an ability to select which ledger transactions are accepted at ingestion time to be stored in Horizon’s historical database. Filter whitelists are maintained via an admin REST API (and persisted in the DB). The ingestion process checks the list and persists transactions related to Accounts and Assets that are whitelisted. Note that the feature does not filter the current state of the ledger and related DB tables, only history tables.
+The filtering feature operates during ingestion in **live** and **historical range** processes. It tells the ingestion process to accept only incoming ledger transactions that match a filter rule; any transactions that don't match the filter rules are skipped by ingestion and therefore not stored in the database.
+
+Some key aspects to note about filtering behavior:
+
+- Filtering applies only to ingestion of historical data in the database; it does not affect how the ingestion process maintains current state data stored in the database, which is the last known ledger entry for each unique entity within accounts, trustlines, liquidity pools, offers. However, current state data consumes a relatively small amount of the overall storage capacity.
+- When filter rules are changed, they only apply to existing, running ingestion processes (**live** and **historical range**). They don't trigger any retroactive filtering or back-filling of existing historical data in the database.
+  - When the filter rules are updated to include additional accounts or assets in the white-list, the related transactions from **live** ingestion will only appear in the historical database data once the filter rules have been updated using the Admin API. The same applies to **historical range** ingestion, where the new filter rules will only affect the data from the current ledger within its configured range at the time of the update.
+  - Updating the filter rules to include additional accounts or assets does not trigger automatic back-filling related to new entities in the historical database. To include prior history of newly white-listed entities in the database you can manually run a new [Historical Ingestion Range](ingestion.mdx#ingesting-historical-data) after updating the filter rules.
+  - When the filter rules are updated to remove accounts or assets previously defined on the white-list, the historical data in the database will not be retroactively purged or filtered based on the updated rules. The data is stored in the history tables for the lifetime of the database or until the `HISTORY_RETENTION_COUNT` is exceeded. Once the retention limit is reached, Horizon will purge all historical data related to older ledgers, regardless of any filtering rules.
+- Filtering will not affect the performance or throughput rate of an ingestion process; it will remain consistent whether filter rules are present or not.

-Whitelisting can include the following supported entities:
+Filter rules define white-lists of the following supported entities:

 - Account id
 - Asset id (canonical)

-Given that all transactions related to the white listed entities are included, all historical time series data related to those transactions are saved in horizon's history db as well. For example, whitelisting an Asset will also persist all Accounts that interact with that Asset and vice versa, if an Account is whitelisted, all assets that are held by that Account will also be included.
+Given that all transactions related to the white listed entities are included, all historical time series data related to those transactions are saved in horizon's history db, including the transaction itself, all operations in the transaction, and references to any ancillary entities from operations.

 ## Configuration:

-The filters and their configuration are optional features and must be enabled with horizon command line or environmental parameters:
+Filtering is enabled by default with no filter rules defined. When no filter rules are defined, it effectively means no filtering of ingested data occurs. To start filtering ingestion, you need to define at least one filter rule:

-```
-admin-port=[your_choice]
-```
+- enable the Horizon admin port with the environment configuration parameter `ADMIN_PORT=XXXXX`; this will allow you to access the port.
+- define filter whitelists. Submit Admin HTTP API requests to view and update the filter rules:

-and
+  Refer to the [Horizon Admin API Docs](https://github.com/stellar/go/blob/master/services/horizon/internal/httpx/static/admin_oapi.yml) which are also published on Horizon running instances as Open API 3.0 doc on the Admin Port when enabled at `http://localhost:/`. You can paste the contents from that url into any OAPI tool such as [Swagger](https://editor.swagger.io/) which will render a visual explorer of the API endpoints. On the swagger editor you can also load the published Horizon admin.oapi.yml directly as a url, choose `File->Import URL`:

-```
-exp-enable-ingestion-filtering=true
-```
+  ```
+  https://raw.githubusercontent.com/stellar/go/master/services/horizon/internal/httpx/static/admin_oapi.yml
+  ```

-As Environment properties:
+  Follow details and examples of request/response payloads to read and update the filter rules for these endpoints:

-```
-ADMIN_PORT=
-```
-
-and
-
-```
-EXP-ENABLE-INGESTION-FILTERING=True
-```
-
-These should be included in addition to the standard ingestion parameters that must be set also to enable the ingestion engine to be running, such as `ingest=true`, etc. Once these flags are included at horizon runtime, filter configurations and their rules are initially empty and the filters are disabled by default. To enable filters, update the configuration settings, refer to the Admin API Docs which are published as Open API 3.0 doc on the Admin Port at `http://localhost:/`. You can paste the contents from that url into any OAPI tool such as [Swagger](https://editor.swagger.io/) which will render a visual explorer of the API endpoints. Follow details and examples for endpoints:
-
-```
-/ingestion/filters/account
-/ingestion/filters/asset
-```
+  ```
+  /ingestion/filters/account
+  /ingestion/filters/asset
+  ```

-## Operation:
-
-Adding and Removing Entities can be done by submitting PUT requests to the `http://localhost:/` endpoint.
-
-To add new filtered entities, submit an `HTTP PUT` request to the admin API endpoints for either Asset or Account filters. The PUT request body will be JSON that expresses the filter rules, currently the rules model is a whitelist format and expressed as JSON string array. To remove entities, submit an `HTTP PUT` request to update the list accordingly. To retrieve what is currently configured, submit an `HTTP GET` request.
-
-The OAPI doc published by the Admin Server can be pulled directly from the Github repo [here](https://github.com/stellar/go/blob/horizon-v2.18.0/services/horizon/internal/httpx/static/admin_oapi.yml).
-
-### Reverting Options:
-
-1. Disable both Asset and Account Filter config rules via the [Admin API](https://github.com/stellar/go/blob/master/services/horizon/internal/httpx/static/admin_oapi.yml) by setting `enabled=false` in each filter rule, or set `--exp-enable_ingestion_filtering=false`, this will open up forward ingestion to include all data again. It is then your choice whether to run a Re-ingestion to capture older data from past that would have been dropped by filters but could now be re-imported with filters off, e.g. `horizon db reingest `
-
-2. If you have a DB backup:
-
-- restore the DB
-- run a Reingestion Gap Fill command to fill in the gaps to current tip of the chain
-- resume Ingestion Sync
-
-3. Start over with a fresh DB (or see Patching Historical Data below)
-
-### Patching Historical Data:
-
-If new Assets or Accounts are added to the whitelist and you would like to patch in its missing historical data, Reingestion can be run. The Reingestion process is idempotent and will re-ingest the data from the designated ledger range and overwrite or insert new data if not already on current DB.
+  Choosing the `Try it out` button on either endpoint will display `curl` examples of the entire HTTP request.

 ## Sample Use Case:

-As an Asset Issuer, I have issued 4 assets and am interested in all transaction data related to those assets including customer Accounts that interact with those assets and the following:
+As an Asset Issuer, I have issued 4 assets and am interested in all transaction data related to those assets including customer Accounts that interact with those assets through the following operations:

 - Operations
 - Effects
@@ -97,29 +69,28 @@ I would like to store the full history of all transactions related from the gene

 ### Pre-requisites:

-You have an existing Horizon installed, configured and has forward ingestion enabled at a minimum to be able to successfully sync to the current state of the Stellar network. Bonus if you are familiar with running re-ingestion.
+You have installed Horizon with an empty database and it has **live** ingestion enabled.

-Steps:
+### Steps:

-1. Configure 4 whitelisted Assets via the Admin API. Also check the `HISTORY_RETENTION_COUNT` and set it to `0` if you don’t want any history purged anymore now that you are filtering, otherwise it will continue to reap all data older than the retention.
-
-2. Decide if you want to wipe existing history data on the DB first before the filtering starts running, you can effectively clear the history by running
+1. Configure a filter rule with 4 white-listed Assets by sending a `PUT` request to the Horizon Admin API `:/ingestion/filters/asset`.

 ```
-HISTORY_RETENTION_COUNT=1 stellar-horizon db reap
+curl -X 'PUT' \
+  'http://localhost:4200/ingestion/filters/asset' \
+  -H 'accept: application/json' \
+  -H 'Content-Type: application/json' \
+  -d '{
+  "whitelist": [
+    "USDC:GAFRNZHK4DGH6CSF4HB5EBKK6KARUOVWEI2Y2OIC5NSQ4UBSN4DR456U",
+    "DOTT:GAFRNZHK4DGH6CSF4HB5EBKK6KARUOVWEI2Y2OIC5NSQ4UBSN4DR456U",
+    "ABCD:GAFRNZHK4DGH6CSF4HB5EBKK6KARUOVWEI2Y2OIC5NSQ4UBSN4DR456U",
+    "EFGH:GAFRNZHK4DGH6CSF4HB5EBKK6KARUOVWEI2Y2OIC5NSQ4UBSN4DR456U"
+  ],
+  "enabled": true
+}'
 ```

-or drop/create the db and run `stellar-horizon db init`.
-
-Alternatively, if you do not need to free up old history tables, you can effectively stop here, anytime changes or enablement of filter rules are done, the history tables will immediately reflect filtered data per those latest rules from the time the filter config is updated and forward.
-
-3. If starting with a fresh DB, decide if you want to re-run ingestion from the earliest ledger # related to the whitelisted entities to populate history for just the allowed data from filters.
-
-- Tip: To find this ledger number, you can check for the earliest transaction of the Account issuing that asset.
-- Also consider running parallel workers to speed up the process.
-
-4. Optional: When re-ingestion is finished, run an ingestion gap fill `stellar-horizon db fill-gaps` to fill any gaps that may have been missed.
-
-5. Verify that your data is there
+2. Since this is a new Horizon database with its first filter rules, there is nothing more to do; you can effectively stop here.

-- Do a spot check of Accounts that should be automatically be ingested against a full history Horizon instance such as SDF Horizon
+3. However, for the sake of the exercise, suppose you already had Horizon running for a while and the database populated based on some filter rules, and these new rules were additional white-listings you just added. In this case, you choose whether you want to retroactively back-fill historical data in the Horizon database for these new white-listed entities from a prior time up to the present time, because they were originally dropped at prior ingestion time and not included in the database. If you decide you want to back-fill, then you run a separate Horizon **historical range** ingestion process, refer to [Historical Ingestion Range](ingestion.mdx#ingesting-historical-data) for steps:
diff --git a/docs/run-platform-server/ingestion.mdx b/docs/run-platform-server/ingestion.mdx
index 7b75156fa..f380d00bd 100644
--- a/docs/run-platform-server/ingestion.mdx
+++ b/docs/run-platform-server/ingestion.mdx
@@ -5,26 +5,35 @@ sidebar_position: 45

 import { CodeExample } from "@site/src/components/CodeExample";

-Horizon provides access to both current and historical state on the Stellar network through a process called **ingestion**.
-
-Horizon provides most of its utility through ingested data, and your Horizon server can be configured to listen for and ingest transaction results from the Stellar network. Ingestion enables API access to both current (e.g. someone's balance) and historical state (e.g. someone's transaction history).
+The Horizon API provides most of its utility through ingested data, and your Horizon server can be configured to listen for and ingest transaction results from the Stellar network. Ingestion enables API access to both current state (e.g. someone's balance) and historical state (e.g. someone's transaction history).

 ## Ingestion Types

 There are two primary ingestion use-cases for Horizon operations:

-- ingesting **live** data to stay up to date with the latest, real-time changes to the Stellar network, and
-- ingesting **historical** data to peek how the Stellar ledger has changed over time
+- ingesting **live** data to stay up to date with the latest ledgers from the network, accumulating a sliding window of aged ledgers
+- ingesting **historical** data to retroactively add network data from a time range in the past to the database
+
+## Determine storage space
+
+You should think carefully about the historical timeframe of ingested data you'd like to retain in Horizon's database. The storage requirements for transactions on the Stellar network are substantial and are growing unbounded over time. This is something that you may need to continually monitor and reevaluate as the network continues to grow. We have found that most organizations need only a small fraction of recent historical data to satisfy their use cases. Through analyzing traffic patterns on SDF's Horizon instance, we see that most requests are for very recent data.
+
+To keep your storage footprint small, we recommend the following:
+
+- use **live** ingestion; use **historical** ingestion only in limited, exceptional cases
+- if your application requires access to all network data, so that no filtering can be done, we recommend limiting historical retention of ingested data to a sliding window of 1 month (`HISTORY_RETENTION_COUNT=518400`), which is the default set by Horizon
+- if your application can work on a [filtered network dataset](./ingestion-filtering.mdx) based on specific accounts and assets, then we recommend applying ingestion filter rules. Filter rules give you the option of a longer historical retention timeframe: because filtering reduces the overall database size to such a degree, historical retention (`HISTORY_RETENTION_COUNT`) can be set in terms of years rather than months, or even disabled (`HISTORY_RETENTION_COUNT=0`)
+- if you cannot limit your history retention window to 30 days and cannot use filter rules, we recommend considering [Stellar Hubble Data Warehouse](https://developers.stellar.org/docs/accessing-data/overview) for any historical data

 ### Ingesting Live Data

-Though this option is disabled by default, in this guide we've [assumed](./configuring.mdx) you turned it on. If you haven't, pass the `--ingest` flag or set `INGEST=true` in your environment.
+This option is enabled by default and is the recommended mode of ingestion to run. It is controlled with the environment configuration flag `INGEST`. Refer to [Configuration](./configuring.mdx) for how an instance of Horizon performs the ingestion role.

-For a serious setup, **we highly recommend having more than one live ingesting instance**, as this makes it easier to avoid downtime during upgrades and adds resilience to your infrastructure, ensuring you always have the latest network data.
+For high availability requirements, **we recommend deploying more than one live ingesting instance**, as this makes it easier to avoid downtime during upgrades and adds resilience, ensuring you always have the latest network data. Refer to [Ingestion Role Instance](./configuring.mdx#multiple-instance-deployment).

 ### Ingesting Historical Data

-Providing API access to historical data is facilitated by a Horizon subcommand:
+Import network data from a past date range into the database:

@@ -34,29 +43,15 @@ stellar-horizon db reingest range

-_(The command name is a bit of a misnomer: you can use `reingest` both to ingest new ledger data and reingest old data.)_
-
-You can run this process in the background while your Horizon server is up. It will continuously decrement the `history.elder_ledger` in your `/metrics` endpoint until the `` ledger is reached and the backfill is complete. If Horizon receives a request for a ledger it hasn't ingested, it returns a 503 error and clarify that it's `Still Ingesting` (see [below](#some-endpoints-are-not-available-during-state-ingestion)).
-
-#### Deciding on how much history to ingest
+Running any historical range of ingestion requires coordination with the data retention configuration chosen. When setting a temporal limit on history with `HISTORY_RETENTION_COUNT=`, the temporal limit takes precedence, and any data ingested beyond that limit will be automatically purged.

-You should think carefully about the amount of ingested data you'd like to keep around. Though the storage requirements for the entire Stellar network are substantial, **most organizations and operators only need a small fraction of the history** to fit their use case. For example,
+Typically the only time you need to run historical ingestion is once, when bootstrapping a system after first deployment; from that point forward **live** ingestion will keep the database populated with the expected sliding window of trailing historical data. One exception is a gap in the database caused by **live** ingestion being down, in which case you can run a historical ingestion range to fill the gap.

-- If you just started developing a new application or service, you can probably get away with just doing live ingestion, since nothing you do requires historical data.
+You can run historical ingestion in parallel in the background while your main Horizon server separately performs **live** ingestion. If the range specified overlaps with data already in the database, that is fine: the existing data is simply overwritten, so the operation is effectively idempotent.

-- If you're moving an existing service away from reliance on SDF's Horizon, you likely only need history from the point at which you started using the Stellar network.
+#### Parallel ingestion workers

-- If you provide temporal guarantees to your users--a 6-month guarantee of transaction history like some online banks do, or history only for the last thousand ledgers (see [below](#managing-storage)), for example--then you similarly don't have heavy ingestion requirements.
-
-Even a massively-popular, well-established custodial service probably doesn't need full history to service its users. It will, however, need full history to be a [Full Validator](../run-core-node/index.mdx#full-validator) with published history archives.
-
-#### Reingestion
-
-Regardless of whether you are running live ingestion or building up historical data, you may occasionally need to \_re_ingest ledgers anew (for example on certain upgrades of Horizon). For this, you use the same command as above.
-
-#### Parallel ingestion
-
-Note that historical (re)ingestion happens independently for any given ledger range, so you can reingest in parallel across multiple Horizon processes:
+You can break up a historical date range into slices and run each slice in parallel as a separate process:

@@ -69,240 +64,29 @@ horizon3> stellar-horizon db reingest range 20001 30000

-#### Managing storage
-
-Over time, the recorded network history will grow unbounded, increasing storage used by the database. Horizon needs sufficient disk space to expand the data ingested from Stellar Core. Unless you need to maintain a [history archive](../run-core-node/publishing-history-archives.mdx), you should configure Horizon to only retain a certain number of ledgers in the database.
-
-This is done using the `--history-retention-count` flag or the `HISTORY_RETENTION_COUNT` environment variable. Set the value to the number of recent ledgers you wish to keep around, and every hour the Horizon subsystem will reap expired data. Alternatively, Horizon provides a command to force a collection:
-
-
-```bash
-stellar-horizon db reap
-```
-
-
-
-### Common Issues
-
-Ingestion is a complicated process, so there are a number of things to look out for.
-
-#### Some endpoints are not available during state ingestion
-
-Endpoints that display state information are not available during initial state ingestion and will return a `503 Service Unavailable`/`Still Ingesting` error. An example is the `/paths` endpoint (built using offers). Such endpoints will become available after state ingestion is done (usually within a couple of minutes).
-
-#### State ingestion is taking a lot of time
-
-State ingestion shouldn't take more than a couple of minutes on an AWS `c5.xlarge` instance or equivalent.
-
-It's possible that the progress logs (see [below](#reading-the-logs)) will not show anything new for a longer period of time or print a lot of progress entries every few seconds. This happens because of the way history archives are designed.
-
-The ingestion is still working but it's processing entries of type `DEADENTRY`. If there is a lot of them in the bucket, there are no _active_ entries to process. We plan to improve the progress logs to display actual percentage progress so it's easier to estimate an ETA.
-
-If you see that ingestion is not proceeding for a very long period of time:
-
-1. Check the RAM usage on the machine. It's possible that system ran out of RAM and is using swap memory that is extremely slow.
-1. If above is not the case, file a [new issue](https://github.com/stellar/go/issues/new/choose) in the [Horizon repository](https://github.com/stellar/go/tree/master/services/horizon).
-
-#### CPU usage goes high every few minutes
-
-**This is by design**. Horizon runs a state verifier routine that compares state in local storage to history archives every 64 ledgers to ensure data changes are applied correctly. If data corruption is detected, Horizon will block access to endpoints serving invalid data.
-
-We recommend keeping this security feature turned on; however, if it's causing problems (due to CPU usage) this can be disabled via the `--ingest-disable-state-verification`/`INGEST_DISABLE_STATE_VERIFICATION` parameter.
-
-## Ingesting Full Public Network History
-
-In some (albeit rare) cases, it can be convenient to (re)ingest the full Stellar Public Network history into Horizon (e.g. when running Horizon for the first time). Using multiple Captive Core workers on a high performance environment (powerful machines on which to run Horizon + a powerful database) makes this possible in ~1.5 days.
-
-The following instructions assume the reingestion is done on AWS. However, they should be applicable to any other environment with equivalent capacity. In the same way, the instructions can be adapted to reingest only specific parts of the history.
-
-### Prerequisites
-
-Before we begin, we make some assumptions around the environment required. Please refer to the [Prerequisites](./prerequisites.mdx) section for the current HW requirements to run Horizon reingestion for either historical catch up or real-time ingestion (for staying in sync with the ledger). A few things to keep in mind:
-
-1. For reingestion, the more parallel workers are provisioned to speed up the process, the larger the machine size is required in terms of RAM, CPU, IOPS and disk size. The size of the RAM per worker also increases over time (14GB RAM / worker as of mid 2022) due to the growth of the ledger. HW specs can be downsized once reingestion is completed.
-
-1. [Horizon](./installing.mdx) latest version installed on the machine from (1).
-
-1. [Core](https://github.com/stellar/stellar-core) latest version installed on the machine from (1).
-
-1. A Horizon database where to reingest the history. Preferably, the database should be empty to minimize storage (Postgres accumulates data during usage, which is only deleted when `VACUUM`ed) and have the minimum spec's for reingestion as outlined in [Prerequisites](./prerequisites.mdx).
-
-As the DB storage grows, the IO capacity will grow along with it. The number of workers (and the size of the instance created in (1), should be increased accordingly if we want to take advantage of it. To make sure we are minimizing reingestion time, we should watch write IOPS. It should ideally always be close to the theoretical limit of the DB.
-
-### Parallel Reingestion
-
-Once the prerequisites are satisfied, we can spawn two Horizon reingestion processes in parallel:
-
-1. One for the first 17 million ledgers (which are almost empty).
-1. Another one for the rest of the history.
-
-This is due to first 17 million ledgers being almost empty whilst the rest are much more packed. Having a single Horizon instance with enough workers to saturate the IO capacity of the machine for the first 17 million would kill the machine when reingesting the rest (during which there is a higher CPU and memory consumption per worker).
-
-64 workers for (1) and 20 workers for (2) saturates instance with RAM and 15K IOPS. Again, as the DB storage grows, a larger number of workers and faster storage should be considered.
-
-In order to run the reingestion, first set the following environment variables in the [configuration](./configuring.mdx) (updating values to match your database environment, of course):
-
-
-```bash
-export DATABASE_URL=postgres://postgres:secret@db.local:5432/horizon
-export APPLY_MIGRATIONS=true
-export HISTORY_ARCHIVE_URLS=https://s3-eu-west-1.amazonaws.com/history.stellar.org/prd/core-live/core_live_001
-export NETWORK_PASSPHRASE="Public Global Stellar Network ; September 2015"
-export STELLAR_CORE_BINARY_PATH=$(which stellar-core)
-export ENABLE_CAPTIVE_CORE_INGESTION=true
-# Number of ledgers per job sent to the workers.
-# The larger the job, the better performance from Captive Core's perspective,
-# but, you want to choose a job size which maximizes the time all workers are
-# busy.
-export PARALLEL_JOB_SIZE=100000
-# Retries per job
-export RETRIES=10
-export RETRY_BACKOFF_SECONDS=20
-
-# Enable optional config when running captive core ingestion
-
-# For stellar-horizon to download buckets locally at specific location.
-# If not enabled, stellar-horizon would download data in the current working directory.
-# export CAPTIVE_CORE_STORAGE_PATH="/var/lib/stellar"
-
-# For stellar-horizon to use local disk file for ledger states rather than in memory(RAM), approximately
-# 8GB of space and increasing as size of ledger entries grows over time.
-# export CAPTIVE_CORE_USE_DB=true
-```
-
-
-(Naturally, you can also edit the configuration file at `/etc/default/stellar-horizon` directly if you installed [from a package manager](./installing.mdx#package-manager).)
-
-If Horizon was previously running, first ensure it is stopped. Then, run the following commands in parallel:
-
-1. `stellar-horizon db reingest range --parallel-workers=64 1 16999999`
-1. `stellar-horizon db reingest range --parallel-workers=20 17000000 `
-
-(Where you can find `` under [SDF Horizon's](https://horizon.stellar.org/) `core_latest_ledger` field.)
-
-When saturating a database instance with 15K IOPS capacity:
-
-(1) should take a few hours to complete.
-
-(2) should take about 3 days to complete.
-
-Although there is a retry mechanism, reingestion may fail half-way. Horizon will print the recommended range to use in order to restart it.
-
-When reingestion is complete it's worth running `ANALYZE VERBOSE [table]` on all tables to recalculate the stats. This should improve the query speed.
-
-### Monitoring reingestion process
-
-This script should help monitor the reingestion process by printing the ledger subranges being reingested:
-
-
-```bash
-#!/bin/bash
-echo "Current ledger ranges being reingested:"
-echo
-I=1
-for S in $(ps aux | grep stellar-core | grep catchup | awk '{print $15}' | sort -n); do
-  printf '%15s' $S
-  if [ $(( I % 5 )) = 0 ]; then
-    echo
-  fi
-  I=$(( I + 1))
-done
-```
-
-
-
-Ideally we would be using Prometheus metrics for this, but they haven't been implemented yet.
- -Here is an example run: - - - -``` -Current ledger ranges being reingested: - 99968/99968 199936/99968 299904/99968 399872/99968 499840/99968 - 599808/99968 699776/99968 799744/99968 899712/99968 999680/99968 - 1099648/99968 1199616/99968 1299584/99968 1399552/99968 1499520/99968 - 1599488/99968 1699456/99968 1799424/99968 1899392/99968 1999360/99968 - 2099328/99968 2199296/99968 2299264/99968 2399232/99968 2499200/99968 - 2599168/99968 2699136/99968 2799104/99968 2899072/99968 2999040/99968 - 3099008/99968 3198976/99968 3298944/99968 3398912/99968 3498880/99968 - 3598848/99968 3698816/99968 3798784/99968 3898752/99968 3998720/99968 - 4098688/99968 4198656/99968 4298624/99968 4398592/99968 4498560/99968 - 4598528/99968 4698496/99968 4798464/99968 4898432/99968 4998400/99968 - 5098368/99968 5198336/99968 5298304/99968 5398272/99968 5498240/99968 - 5598208/99968 5698176/99968 5798144/99968 5898112/99968 5998080/99968 - 6098048/99968 6198016/99968 6297984/99968 6397952/99968 17099967/99968 - 17199935/99968 17299903/99968 17399871/99968 17499839/99968 17599807/99968 - 17699775/99968 17799743/99968 17899711/99968 17999679/99968 18099647/99968 - 18199615/99968 18299583/99968 18399551/99968 18499519/99968 18599487/99968 - 18699455/99968 18799423/99968 18899391/99968 18999359/99968 19099327/99968 - 19199295/99968 19299263/99968 19399231/99968 -``` - - - -## Reading Logs - -In order to check the progress and status of ingestion you should check your logs regularly; all logs related to ingestion are tagged with `service=ingest`. - -It starts with informing you about state ingestion: - - - -``` -INFO[...] Starting ingestion system from empty state... pid=5965 service=ingest temp_set="*io.MemoryTempSet" -INFO[...] 
Reading from History Archive Snapshot pid=5965 service=ingest ledger=25565887 -``` - - -During state ingestion, Horizon will log the number of processed entries every 100,000 entries (there are currently around 10M entries in the public network): - - +### Notes -``` -INFO[...] Processing entries from History Archive Snapshot ledger=25565887 numEntries=100000 pid=5965 service=ingest -INFO[...] Processing entries from History Archive Snapshot ledger=25565887 numEntries=200000 pid=5965 service=ingest -INFO[...] Processing entries from History Archive Snapshot ledger=25565887 numEntries=300000 pid=5965 service=ingest -INFO[...] Processing entries from History Archive Snapshot ledger=25565887 numEntries=400000 pid=5965 service=ingest -INFO[...] Processing entries from History Archive Snapshot ledger=25565887 numEntries=500000 pid=5965 service=ingest -``` - - +#### Some endpoints may be unavailable during **live** ingestion -When state ingestion is finished, it will proceed to ledger ingestion starting from the next ledger after the checkpoint ledger (25565887+1 in this example) to update the state using transaction metadata: +Endpoints that display current state information from **live** ingestion may return a `503 Service Unavailable`/`Still Ingesting` error. An example is the `/paths` endpoint (built using offers). Such endpoints become available once **live** ingestion has finished synchronizing with the network and catching up (usually within a couple of minutes). - - -``` -INFO[...] Processing entries from History Archive Snapshot ledger=25565887 numEntries=5400000 pid=5965 service=ingest -INFO[...] Processing entries from History Archive Snapshot ledger=25565887 numEntries=5500000 pid=5965 service=ingest -INFO[...] Processed ledger ledger=25565887 pid=5965 service=ingest type=state_pipeline -INFO[...] Finished processing History Archive Snapshot duration=2145.337575904 ledger=25565887 numEntries=5529931 pid=5965 service=ingest shutdown=false -INFO[...]
Reading new ledger ledger=25565888 pid=5965 service=ingest -INFO[...] Processing ledger ledger=25565888 pid=5965 service=ingest type=ledger_pipeline updating_database=true -INFO[...] Processed ledger ledger=25565888 pid=5965 service=ingest type=ledger_pipeline -INFO[...] Finished processing ledger duration=0.086024492 ledger=25565888 pid=5965 service=ingest shutdown=false transactions=14 -INFO[...] Reading new ledger ledger=25565889 pid=5965 service=ingest -INFO[...] Processing ledger ledger=25565889 pid=5965 service=ingest type=ledger_pipeline updating_database=true -INFO[...] Processed ledger ledger=25565889 pid=5965 service=ingest type=ledger_pipeline -INFO[...] Finished processing ledger duration=0.06619956 ledger=25565889 pid=5965 service=ingest shutdown=false transactions=29 -INFO[...] Reading new ledger ledger=25565890 pid=5965 service=ingest -INFO[...] Processing ledger ledger=25565890 pid=5965 service=ingest type=ledger_pipeline updating_database=true -INFO[...] Processed ledger ledger=25565890 pid=5965 service=ingest type=ledger_pipeline -INFO[...] Finished processing ledger duration=0.071039012 ledger=25565890 pid=5965 service=ingest shutdown=false transactions=20 -``` - - +#### If more than five minutes have elapsed with no new ingested data: -## Managing Stale Historical Data +- Verify the host machine meets the recommended [Prerequisites](./prerequisites.mdx). -Horizon ingests ledger data from a managed, pared-down Captive Stellar Core instance. In the event that Captive Core crashes, lags, or if Horizon stops ingesting data for any other reason, the view provided by Horizon will start to lag behind reality. For simpler applications, this may be fine, but in many cases this lag is unacceptable and the application should not continue operating until the lag is resolved. +- Check the Horizon log output. + - If there are many `level=error` messages, they may point to an environmental issue, such as an inability to access the database.
+ - **live** ingestion emits two key log lines about once every 5 seconds, following the latest ledger emitted by the network. Tail the Horizon log output and grep for the presence of these lines with a filter: + ``` + tail -f horizon.log | grep -E 'Processed ledger|Closed ledger' + ``` + If this pipeline does not print output for a new ledger every few seconds, ingestion is not proceeding. Look at the full logs and check whether other messages explain why. When connecting to pubnet you may see lines mentioning 'catching up', as it can take up to 5 minutes for the captive core process started by Horizon to catch up to the pubnet network. + - Check RAM usage on the machine. The system may have run low on RAM and started using swap memory, which results in slow performance. Verify the host machine meets the minimum RAM [prerequisites](./prerequisites.mdx). + - Verify the read/write throughput of the volume that holds the current working directory of the Horizon process. Based on [prerequisites](./prerequisites.mdx), the volume should sustain at least 10 MB/s. One way to roughly verify this from the host machine (linux/mac) command line: + ``` + sudo dd if=/dev/zero of=/tmp/test_speed.img bs=1G count=1 + ``` -To help applications that cannot tolerate lag, Horizon provides a configurable "staleness" threshold. If enough lag accumulates to surpass this threshold (expressed in number of ledgers), Horizon will only respond with an error: [`stale_history`](https://github.com/stellar/go/blob/master/services/horizon/internal/docs/reference/errors/stale-history.md). To configure this option, use the `--history-stale-threshold`/`HISTORY_STALE_THRESHOLD` parameter. +#### Monitoring ingestion process -**Note:** Non-historical requests (such as submitting transactions or checking account balances) will not error out if the staleness threshold is surpassed.
+For high availability deployments, we recommend monitoring the ingestion process for visibility into its performance and health. Refer to [Monitoring](./monitoring.mdx) for accessing logs and metrics from Horizon. Stellar publishes an example [Horizon Grafana Dashboard](https://grafana.com/grafana/dashboards/13793-stellar-horizon/) which demonstrates queries against key Horizon ingestion metrics. In particular, look at `Local Ingestion Delay [Ledgers]` and `Last ledger age` in the `Health Summary` panel. diff --git a/docs/run-platform-server/installing.mdx b/docs/run-platform-server/installing.mdx index c5c96519c..5cf40f059 100644 --- a/docs/run-platform-server/installing.mdx +++ b/docs/run-platform-server/installing.mdx @@ -5,14 +5,16 @@ sidebar_position: 20 import { CodeExample } from "@site/src/components/CodeExample"; -To install Horizon, you have choices, we recommend the following for target infrastructure: +To install Horizon in production or non-development environments, we recommend the following based on target infrastructure: - bare-metal: - if host is debian linux, install prebuilt binaries [from repositories](#package-manager) using package manager. - for any other hosts, download [prebuilt release binaries](#prebuilt-releases) of Stellar Horizon and Core for host target architecture and operation system. - containerized: - non-Helm Chart, if the target envrironment for container to run does not support Helm chart usage, run the prebuilt docker image of Horizon published on [dockerhub.com/stellar/horizon](https://hub.docker.com/r/stellar/stellar-horizon). - - Helm charts, when the target envrionment uses container orchestration such as Kubernetes and has enabled Helm Charts on cluster. The Horizon Helm chart manages installation life cycle. Use the [Helm Install command](https://helm.sh/docs/helm/helm_install/), it will accept Horizon's configuration parameters.
Please review [Configuration](./configuring.mdx) first, to identify any specific configuration params needed. + - Helm charts, when the target environment uses container orchestration such as Kubernetes and has enabled Helm Charts on the cluster. The [Horizon Helm chart](https://github.com/stellar/helm-charts/tree/main/charts/horizon) manages the installation life cycle. Use the [Helm Install command](https://helm.sh/docs/helm/helm_install/), which accepts Horizon's configuration parameters. Please review [Configuration](./configuring.mdx) first, to identify any specific configuration params needed. + +For installation in development environments, please refer to the [Horizon README](https://github.com/stellar/go/blob/master/services/horizon/README.md#try-it-out) in the source code repo for options suited to a development context. ### Notes diff --git a/docs/run-platform-server/monitoring.mdx b/docs/run-platform-server/monitoring.mdx index 582961d32..35d25a77b 100644 --- a/docs/run-platform-server/monitoring.mdx +++ b/docs/run-platform-server/monitoring.mdx @@ -5,85 +5,13 @@ sidebar_position: 60 import { CodeExample } from "@site/src/components/CodeExample"; -To ensure that your instance of Horizon is performing correctly, we encourage you to monitor it and provide both logs and metrics to do so. - ## Metrics -Metrics are collected while a Horizon process is running and they are exposed _privately_ via the `/metrics` path, accessible only through the Horizon admin port. You need to configure this via `--admin-port` or `ADMIN_PORT`, since it's disabled by default. If you're running such an instance locally, you can access this endpoint: - - - -``` -$ stellar-horizon --admin-port=4200 & -$ curl localhost:4200/metrics -# HELP go_gc_duration_seconds A summary of the GC invocation durations.
-# TYPE go_gc_duration_seconds summary -go_gc_duration_seconds{quantile="0"} 1.665e-05 -go_gc_duration_seconds{quantile="0.25"} 2.1889e-05 -go_gc_duration_seconds{quantile="0.5"} 2.4062e-05 -go_gc_duration_seconds{quantile="0.75"} 3.4226e-05 -go_gc_duration_seconds{quantile="1"} 0.001294239 -go_gc_duration_seconds_sum 0.002469679 -go_gc_duration_seconds_count 25 -# HELP go_goroutines Number of goroutines that currently exist. -# TYPE go_goroutines gauge -go_goroutines 23 -and so on... -``` - - - -## Logs - -Horizon will output logs to standard out. Information about what requests are coming in will be reported, but more importantly, warnings or errors will also be emitted by default. A correctly running Horizon instance will not output any warning or error log entries. - -Below we present a few standard log entries with associated fields. You can use them to build metrics and alerts. Please note that these represent Horizon app metrics only. You should also monitor your hardware metrics like CPU or RAM Utilization. - -### Starting HTTP request +Metrics are emitted from a running Horizon process and exposed _privately_ via the `/metrics` path, accessible only through the Horizon admin port bound to the host machine. The admin port is disabled by default, so you need to set the environment configuration parameter `ADMIN_PORT=XXXXX`. Metrics are then published on the host machine at `localhost:XXXXX/metrics`.
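As a rough illustration of reading the admin endpoint described above, the sketch below builds the local metrics URL and pulls a single unlabeled value out of the Prometheus text output. The function names `metrics_url` and `metric_value` are hypothetical helpers, the port `4200` is only an example value, and `go_goroutines` is used because it is a standard Go runtime gauge rather than a Horizon-specific metric.

```shell
# Sketch only: helper names and the example port are illustrative assumptions.

# Print the local metrics URL for the admin port given as $1.
metrics_url() {
  printf 'http://localhost:%s/metrics\n' "$1"
}

# Read Prometheus text format on stdin and print the value of the
# unlabeled metric named $1. Comment lines and labeled series
# (e.g. name{quantile="0.5"}) do not match and are skipped.
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Example, assuming Horizon is running with ADMIN_PORT=4200:
#   curl -s "$(metrics_url 4200)" | metric_value go_goroutines
```

Keeping the parsing separate from the `curl` call means the same filter can be run against a saved copy of the metrics output when the admin port is not reachable from where you are working.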
-| Key | Value | -| --- | --- | -| **`msg`** | **`Starting request`** | -| `client_name` | Value of `X-Client-Name` HTTP header representing client name | -| `client_version` | Value of `X-Client-Version` HTTP header representing client version | -| `app_name` | Value of `X-App-Name` HTTP header representing app name | -| `app_version` | Value of `X-App-Version` HTTP header representing app version | -| `forwarded_ip` | First value of `X-Forwarded-For` header | -| `host` | Value of `Host` header | -| `ip` | IP of a client sending HTTP request | -| `ip_port` | IP and port of a client sending HTTP request | -| `method` | HTTP method (`GET`, `POST`, ...) | -| `path` | Full request path, including query string (ex. `/transactions?order=desc`) | -| `streaming` | Boolean, `true` if request is a streaming request | -| `referer` | Value of `Referer` header | -| `req` | Random value that uniquely identifies a request, attached to all logs within this HTTP request | +### Queries -### Finished HTTP request - -| Key | Value | -| --- | --- | -| **`msg`** | **`Finished request`** | -| `bytes` | Number of response bytes sent | -| `client_name` | Value of `X-Client-Name` HTTP header representing client name | -| `client_version` | Value of `X-Client-Version` HTTP header representing client version | -| `app_name` | Value of `X-App-Name` HTTP header representing app name | -| `app_version` | Value of `X-App-Version` HTTP header representing app version | -| `duration` | Duration of request in seconds | -| `forwarded_ip` | First value of `X-Forwarded-For` header | -| `host` | Value of `Host` header | -| `ip` | IP of a client sending HTTP request | -| `ip_port` | IP and port of a client sending HTTP request | -| `method` | HTTP method (`GET`, `POST`, ...) | -| `path` | Full request path, including query string (ex. `/transactions?order=desc`) | -| `route` | Route pattern without query string (ex. `/accounts/{id}`) | -| `status` | HTTP status code (ex. 
`200`) | -| `streaming` | Boolean, `true` if request is a streaming request | -| `referer` | Value of `Referer` header | -| `req` | Random value that uniquely identifies a request, attached to all logs within this HTTP request | - -### Metrics - -Using the entries above you can build metrics that will help understand performance of a given Horizon node. For example: +Build queries that highlight the performance of a given Horizon deployment. Refer to Stellar's [Grafana Horizon Dashboard](https://grafana.com/grafana/dashboards/13793-stellar-horizon/) for examples of metrics queries that surface application performance: - Number of requests per minute. - Number of requests per route (the most popular routes). @@ -99,15 +27,21 @@ Using the entries above you can build metrics that will help understand performa ### Alerts -Below are example alerts with potential causes and solutions. Feel free to add more alerts using your metrics: +Once queries are developed on a Grafana dashboard, a convenient follow-on step is to add [alert rules](https://grafana.com/docs/grafana/latest/alerting/alerting-rules/create-grafana-managed-rule/) based on specific queries, so that notifications are triggered when thresholds are exceeded. + +Here are some example alerts to consider, with potential causes and solutions. | Alert | Cause | Solution | | --- | --- | --- | -| Spike in number of requests | Potential DoS attack | Lower rate-limiting threshold | -| Large number of rate-limited requests | Rate-limiting threshold too low | Increase rate-limiting threshold | -| Ingestion is slow | Horizon server spec too low | Increase hardware spec | -| Spike in average response time of a single route | Possible bug in a code responsible for rendering a route | Report an issue in Horizon repository.
| +| Spike in number of requests | Potential DoS attack | Review network load balancer or content switch configurations | +| Ingestion is slow | Host server compute resources are low | Increase compute specs | + +## Logs + +Horizon outputs logs to standard out. It logs all aspects of runtime, including HTTP requests and ingestion. Typically, very few `warn` or `error` severity level messages are emitted. The default severity level logged by Horizon is `LOG_LEVEL=info`. This environment configuration parameter can be set to one of `trace, debug, info, warn, error`. Log verbosity is the inverse of the severity level chosen: for the most verbose logs use `trace`, for the least verbose logs use `error`. + +For production deployments, we recommend keeping the default `info` severity level, redirecting standard out to a file, and applying a log rotation tool such as [logrotate](https://man7.org/linux/man-pages/man8/logrotate.8.html) to that file to manage disk space usage. ## I'm Stuck! Help! -If any of the above steps don't work or you are otherwise prevented from correctly setting up Horizon, please join our community and let us know. Either post a question at [our Stack Exchange](https://stellar.stackexchange.com/) or chat with us on [Keybase in #dev_discussion](https://keybase.io/team/stellar.public) to ask for help. +If any of the above steps don't work or you are otherwise prevented from correctly setting up Horizon, please join our community and let us know. Either post a question at [our Stack Exchange](https://stellar.stackexchange.com/) or chat with us on [Horizon Discord](https://discord.com/channels/897514728459468821/912466080960766012) to ask for help.
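The live-ingestion log check from the ingestion notes in this patch (grepping for `Processed ledger` and `Closed ledger`) can also be scripted rather than watched by eye. The sketch below is one possible approach, assuming Horizon's standard out has been redirected to a file such as `horizon.log`. The function names `count_ledger_lines` and `ingestion_advancing` are hypothetical, and the 30 second wait in the example is an arbitrary choice, not a documented threshold.

```shell
# Sketch only: function names and the wait interval are illustrative assumptions.

# Count log lines in file $1 that indicate a ledger was processed or closed.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable under "set -e" while still printing 0.
count_ledger_lines() {
  grep -cE 'Processed ledger|Closed ledger' "$1" || true
}

# Succeed if new matching lines were appended to file $1 within $2 seconds.
ingestion_advancing() {
  before=$(count_ledger_lines "$1")
  sleep "$2"
  after=$(count_ledger_lines "$1")
  [ "$after" -gt "$before" ]
}

# Example, assuming logs are written to ./horizon.log:
#   if ! ingestion_advancing horizon.log 30; then
#     echo "no new ledgers logged in the last 30 seconds" >&2
#   fi
```

Because the check only compares line counts, it assumes the log file is not rotated during the wait. A rotation in that window would look like a stall, so a longer-lived monitor would be better served by the metrics endpoint.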