diff --git a/developer-docs-site/docs/nodes/node-health-checker-faq.md b/developer-docs-site/docs/nodes/node-health-checker-faq.md index 02efe538f2763..03576b8ad8516 100755 --- a/developer-docs-site/docs/nodes/node-health-checker-faq.md +++ b/developer-docs-site/docs/nodes/node-health-checker-faq.md @@ -1,28 +1,29 @@ --- title: "Node Health Checker FAQ" slug: "node-health-checker-faq" -sidebar_position: 10 --- -import BlockQuote from "@site/src/components/BlockQuote"; # Node Health Checker FAQ -The Aptos Node Health Checker (NHC) is a tool Aptos offers to the community for a few different key use cases. For now you can see more about NHC, why we have it, how to run it, and more at the [Node Health Checker README](https://github.com/aptos-labs/aptos-core/tree/main/ecosystem/node-checker) in our repo. -The purpose of this FAQ is to help users understand why their node did not pass a particular evaluation from NHC. If you couldn't find the information you wanted in this FAQ, please [open an issue](https://github.com/aptos-labs/aptos-core/issues/new/choose) and we can add it. Even better, feel free to [open a PR](https://github.com/aptos-labs/aptos-core/pulls) and add the information yourself! +The Aptos Node Health Checker (NHC) service can be used to check the health of your node(s). See [Node Health Checker](/nodes/node-health-checker) for full documentation on the NHC. + +The purpose of this FAQ is to help you understand why your node did not pass a particular health check when you ran NHC for it. If you didn't find the information you wanted in this FAQ, [open an issue](https://github.com/aptos-labs/aptos-core/issues/new/choose), or [open a PR](https://github.com/aptos-labs/aptos-core/pulls) and add the information yourself. ## How does the latency evaluator work? 
-You are likely here because you were given an evaluation result like this: + +You are likely here because you were given an NHC evaluation result like this: + ``` Average latency too high: The average latency was 1216ms, which is higher than the maximum allowed latency of 1000ms. ``` -When faced with this error, you might see that the validation reports something like 1200ms above, but then when you `ping`, you see something more like 600ms. This difference comes from a misunderstanding in how our latency test works. When you `ping` an IP, the result you see is a single round trip (where the latency is RTT, round trip time). Our latency test is not doing an ICMP ping though, but timing a request to the API running on your node. In effect, this means we're timing 2 round trips, because it does the following: +While the NHC reports 1216ms above, when you `ping` you might see a latency like 600ms. This difference is because when you `ping` an IP, the result you see is a single round trip (where the latency is the round trip time (RTT)). On the other hand, the NHC latency test will time a request to the API running on your node. In effect, this means that the NHC will time 2 round trips, because it does the following: 1. SYN 2. SYNACK 3. ACK + Send HTTP request 4. Receive HTTP response -Because we must do a TCP handshake (one round trip) and then make an HTTP request (another round trip). +That is, the NHC must do a TCP handshake (one round trip) and then make an HTTP request (a second round trip). -The reason we have the latency evaluator is to ensure we can maintain good network performance. In particular, if the latency to your node is too high, it will result in low TPS and high time to finality, both of which are very important to running a highly performant L1 blockchain. **If you receive this error, you will need to try and improve the latency to your node, we have set high thresholds on this value with the understanding that nodes will be running all over the world**. 
+The reason the NHC uses the latency evaluator is to ensure that we can maintain good network performance. In particular, if the latency to your node is too high, it will result in a low TPS and high time to finality, both of which are very important to running a highly performant L1 blockchain. **If you receive this error, you will need to try and improve the latency to your node. We have set high thresholds on this value with the understanding that nodes will be running all over the world**. diff --git a/developer-docs-site/docs/nodes/node-health-checker.md b/developer-docs-site/docs/nodes/node-health-checker.md new file mode 100755 index 0000000000000..d5ec5d7fd41f1 --- /dev/null +++ b/developer-docs-site/docs/nodes/node-health-checker.md @@ -0,0 +1,211 @@ +--- +title: "Node Health Checker" +slug: "node-health-checker" +--- + +# Node Health Checker + +The Aptos Node Health Checker (NHC) service can be used to check the health of the following Aptos node types: + +- Validator nodes. +- Validator fullnodes, and +- Public fullnodes. + +If you are a node operator, use this NHC service to check if your node is running correctly. The NHC service evaluates your node's health by comparing against a baseline node configuration, and outputs the evaluation results. + +:::tip Node health check for AIT +If you are participating in the [Aptos Incentivized Testnet](/nodes/ait/ait-2), then use the NHC service to demonstrate that you can run your validator node successfully. The Aptos team uses this service continuously to check your node's health. +::: + +This document describes how to run NHC when you are operating a node. + +## Quickstart + +Before you get into the details of how NHC works, you can run the below steps to start the NHC service and send it a request. This quickstart uses a baseline configuration for a devnet fullnode, i.e., it will evaluate your node against a devnet fullnode that is configured with the baseline configuration YAML. 
+ +### Step 1: Download the baseline configuration YAML + +Download a baseline configuration YAML file for a devnet fullnode. The below command will download the `devnet_fullnode.yaml` configuration file: + +``` +mkdir /tmp/nhc && cd /tmp/nhc && wget https://raw.githubusercontent.com/aptos-labs/aptos-core/main/ecosystem/node-checker/configurations/devnet_fullnode.yaml +``` + +### Step 2: Start the NHC service + +Start the NHC service by providing the above-downloaded `devnet_fullnode.yaml` baseline configuration YAML file: + +``` +docker run -v /tmp/nhc:/nhc -t aptoslabs/node-checker:nightly /usr/local/bin/aptos-node-checker server run --baseline-node-config-paths /nhc/devnet_fullnode.yaml +``` + +### Step 3: Send a request to NHC service + +Finally, send a request to the NHC service you started above. The following command runs health checks of your node that is at `node_url=http://mynode.mysite.com` and compares these results with the downloaded baseline configuration `devnet_fullnode`: + +``` +curl 'http://localhost:20121/check_node?node_url=http://mynode.mysite.com&baseline_configuration_name=devnet_fullnode' +``` + +You will see output similar to this: + +``` +{ + "evaluation_results": [ + { + "headline": "Chain ID reported by baseline and target match", + "score": 100, + "explanation": "The node under investigation reported the same Chain ID 18 as is reported by the baseline node", + "evaluator_name": "node_identity", + "category": "api", + "links": [] + }, + { + "headline": "Role Type reported by baseline and target match", + "score": 100, + "explanation": "The node under investigation reported the same Role Type full_node as is reported by the baseline node", + "evaluator_name": "node_identity", + "category": "api", + "links": [] + }, + { + "headline": "Target node produced valid recent transaction", + "score": 100, + "explanation": "We were able to pull the same transaction (version: 3238616) from both your node and the baseline node. Great! 
 This implies that your node is keeping up with other nodes in the network.", + "evaluator_name": "transaction_availability", + "category": "api", + "links": [] + } + ], + "summary_score": 100, + "summary_explanation": "100: Awesome!" +} +``` + +## How NHC works + +The NHC runs as a service. When you want to run a health check of your node, you send HTTP requests to this service. + +A single NHC instance can be configured to check the health of multiple node configurations, each of a different type, for example: + +- A validator node running in a single node testnet. +- A public fullnode connected to the Aptos devnet. +- A validator node connected to a testnet, for example, as part of an Aptos Incentivized Testnet. + +The NHC service can reasonably be run both as an external tool and as a sidecar process for the operator use case. Both are described in this documentation. + +### Baseline configuration + +In all the above cases, your node's health is compared against a baseline node. For example, for a public fullnode connected to the Aptos devnet, the baseline node might be a node run by the Aptos team that demonstrates optimal performance and participation characteristics. + +You will download the baseline configuration YAML before running the NHC service for your node. The baseline node's configuration YAML describes where to find this baseline node (URL + port), what evaluators (e.g. metrics checks, TPS tests, API validations, etc.) the NHC service should run, what parameters the NHC should use for those evaluators, what name the configuration has, and so on. See some [example baseline configuration YAML files here](https://github.com/aptos-labs/aptos-core/tree/b183a232784e4c77991b23e8728d3c7669a95d47/ecosystem/node-checker/configuration_examples). + +When you send requests to the NHC service, you must include a baseline configuration. 
 For example, a request to NHC to use `devnet_fullnode` as the baseline configuration will look like this: + +``` +curl 'http://nhc.aptoslabs.com/check_node?node_url=http://myfullnode.mysite.com&baseline_configuration_name=devnet_fullnode' +``` + +### Getting baseline configurations ready + +In order to run the NHC service, you must have a baseline configuration that the service can use. You have two options here: + +#### Configure a pre-existing YAML + +You can find a few [example baseline configuration YAML files here](https://github.com/aptos-labs/aptos-core/tree/b183a232784e4c77991b23e8728d3c7669a95d47/ecosystem/node-checker/configuration_examples) that work for each of the above use cases and more. + +Next, download these configuration YAML files into the `/etc/nhc` folder in your host system. For example: + +``` +mkdir /etc/nhc +cd /etc/nhc +configs=(single_node_validator devnet_fullnode ait2_validator); for c in ${configs[@]}; do wget https://raw.githubusercontent.com/aptos-labs/aptos-core/main/ecosystem/node-checker/configurations/$c.yaml; done +``` + +These configurations are not quite ready to be used as they are. You will need to modify certain fields, such as the baseline node address or the evaluator set used (`evaluators` and `evaluator_args` in the YAML). The best way to iterate on this is to run the NHC with a downloaded baseline configuration and see what it says on startup. + +#### Generate your own baseline configuration YAML + +To generate your own baseline configuration, you must first run the NHC service with the `create` option. 
The below command shows how to create a baseline configuration YAML by running the NHC service using Docker: + +``` +docker run -it aptoslabs/node-checker:nightly /usr/local/bin/aptos-node-checker configuration create --url 'http://baseline-fullnode.aptoslabs.com' --configuration-name devnet_fullnode --configuration-name-pretty "Devnet FullNode" --evaluators network_minimum_peers api_latency --api-port 80 > /etc/nhc/devnet_fullnode.yaml +``` + +The above command specifies the bare minimum for a baseline configuration. You can tune each evaluator as you see fit. See the fields `evaluators` and `evaluator_args` in the YAML. For more guidance on this, pass the `-h` flag to the above command to see all the flags you can work with. + +### Required files + +For some NHC configurations, you will need accompanying files, e.g. `mint.key` to use for running a TPS test against a validator. You should make sure these files are also available to NHC, either on disk or mounted into your container. NHC expects them on startup at a path specified in the baseline configuration YAML. + +## Running NHC: Docker + +:::tip +While the Aptos team hosts our own instances of this service, we encourage the node operators to run their own instances. You may choose to either run a publicly available NHC or run it as a sidecar, where it only works against your own node. +::: + +When you are ready with baseline configuration YAML and the required files, you can run the NHC server with a command like this, for example, with Docker: + +``` +docker run -v /etc/nhc:/etc/nhc -p 20121:20121 -t aptoslabs/node-checker:nightly /usr/local/bin/aptos-node-checker server run --baseline-node-config-paths /etc/nhc/ait2_validator.yaml /etc/nhc/devnet_fullnode.yaml +``` + +:::tip + +You may want to include other environment variables such as `RUST_LOG=info`. As you can see, by default NHC runs on port 20121. 
 Make sure to publish it from the container, as shown in the above command, and ensure the port is open on your host. You may change the port NHC runs on with `--listen-port`. +::: + +## Running NHC: Source + +First, check out the source: + +``` +git clone git@github.com:aptos-labs/aptos-core.git +cd aptos-core +``` + +Depending on your setup, you may want to check out a particular branch, to ensure NHC is compatible with your node, e.g. `git checkout --track origin/devnet`. + +Run NHC: + +``` +cargo run --release -- server run --baseline-node-config-paths /etc/nhc/ait2_validator.yaml /etc/nhc/devnet_fullnode.yaml +``` + + +## Running NHC as a sidecar + +When you run NHC as a sidecar, you preconfigure a node that NHC should use as the node under investigation by default: + +``` +--target-node-url http://localhost +``` + +Running NHC as a sidecar can be handy when you want to close the API / metrics ports on your machine to the public internet, but would still like to run NHC to validate the setup of your node. + +If you want, you can even restrict NHC to test only that node: + +``` +--allow-preconfigured-test-node-only +``` + +With this flag, the `/check_node` endpoint will always return 400s; you must instead use `/check_preconfigured_node`. + +Once you have configured your NHC instance in sidecar mode, you can send requests that omit the target node address: + +``` +curl 'http://nhc.aptoslabs.com/check_preconfigured_node?baseline_configuration_name=devnet_fullnode' +``` + +There are more options available for which ports to use. Pass `-h` to see more options. + +## Generating the OpenAPI specs + +To generate the OpenAPI specs, run the following commands: + +``` +cargo run -- server generate-openapi -f yaml > openapi.yaml +cargo run -- server generate-openapi -f json > openapi.json +``` + +You can also hit the `/spec_yaml` and `/spec_json` endpoints of the running service. 
\ No newline at end of file diff --git a/developer-docs-site/sidebars.js b/developer-docs-site/sidebars.js index 7f5429bcf4017..7e3b90ca8de42 100644 --- a/developer-docs-site/sidebars.js +++ b/developer-docs-site/sidebars.js @@ -113,6 +113,7 @@ const sidebars = { ], }, "nodes/run-a-local-testnet", + "nodes/node-health-checker", "nodes/node-health-checker-faq", ], }, diff --git a/ecosystem/node-checker/DEVELOPING.md b/ecosystem/node-checker/DEVELOPING.md deleted file mode 100644 index 248f5b057dc60..0000000000000 --- a/ecosystem/node-checker/DEVELOPING.md +++ /dev/null @@ -1,24 +0,0 @@ -# Developing NHC -To develop NHC, you should first run two nodes of the same type. See [this wiki](https://aptos.dev/nodes/full-node/fullnode-for-devnet) for guidance on how to do this. You may also target a known existing FullNode with its metrics port open. - -The below command assumes we have a fullnode running locally, the target node (the node under investigation), and another running on a machine in our network, the baseline node (the node we compare the target to): -``` -cargo run -- --baseline-node-url 'http://192.168.86.2' --target-node-url http://localhost --evaluators state_sync_version --allow-preconfigured-test-node-only -``` -This runs NHC in sidecar mode, where only the `/check_preconfigured_node` endpoint can be called, which will target the node running on localhost. - -Once the service is running, you can query it like this: -``` -$ curl -s localhost:20121/check_preconfigured_node | jq . -{ - "evaluations": [ - { - "headline": "State sync version is within tolerance", - "score": 100, - "explanation": "Successfully pulled metrics from target node twice, saw the version was progressing, and saw that it is within tolerance of the baseline node. Target version: 1882004. Baseline version: 549003. Tolerance: 1000" - } - ], - "summary_score": 100, - "summary_explanation": "100: Awesome!" 
-} -``` diff --git a/ecosystem/node-checker/README.md b/ecosystem/node-checker/README.md index b5609b72bcfcd..bba489a6f58d2 100644 --- a/ecosystem/node-checker/README.md +++ b/ecosystem/node-checker/README.md @@ -1,156 +1,4 @@ # Aptos Node Health Checker -The Aptos Node Health Checker (NHC) is the reference implementation of a node health checker for Validator Nodes (Validators), Validator FullNodes (VFNs), and Public FullNodes (PFNs). The node health checker aims to serve 3 major user types: -- **AIT Registration**: As part of sign up for the Aptos Incentivized Testnets (AIT), we request that users demonstrate that they can run a ValidatorNode successfully. We use this tool to encode precisely what that means. -- **Operator Support**: As node operators, you will want to know whether your node is running correctly. This service can help you figure that out. While we host our own instances of this service, we encourage node operators to run their own instances. You may choose to either run a publicly available NHC or run it as a sidecar, where it only works against your own node. -- **Continuous Evaluation**: As part of the AITs, Aptos Labs needs a tool to help confirm that participants are running their nodes in a way that meets our criteria. We run this tool continuously throughout each AIT to help us evaluate this. -In this README we describe how to run NHC for the **Operator Support** use case. NHC can reasonably be run both as an external tool as well as a sidecar process for this use case. Both are described below. For more information on how NHC works, see [How NHC works](#how-nhc-works) below. +The Aptos Node Health Checker (NHC) service can be used to check the health of the various Aptos node types. See [Node Health Checker](http://aptos.dev/nodes/node-health-checker) for documentation. -## tl;dr -While we highly recommend you read this whole README, you can get NHC working in a basic form by doing the following. 
This baseline configuration is for a devnet FullNode. - -First, get a baseline configuration YAML file. The command below will download the `devnet_fullnode.yaml` configuration file: -``` -cd /tmp/nhc && wget https://raw.githubusercontent.com/aptos-labs/aptos-core/main/ecosystem/node-checker/configurations/devnet_fullnode.yaml -``` - -Then, start the NHC service by providing the above-downloaded `devnet_fullnode.yaml` configuration file: -``` -docker run -v /tmp/nhc:/nhc -t aptoslabs/node-checker:nightly /usr/local/bin/aptos-node-checker server run --baseline-node-config-paths /nhc/devnet_fullnode.yaml -``` - -Now that you have started up the NHC service, send it a request. The below command runs validations of your node by using the downloaded baseline configuration `devnet_fullnode` for comparison: -``` -curl 'http://localhost:20121/check_node?node_url=http://mynode.mysite.com&baseline_configuration_name=devnet_fullnode' -``` - -You should expect to see output similar to this: -``` -{ - "evaluation_results": [ - { - "headline": "Chain ID reported by baseline and target match", - "score": 100, - "explanation": "The node under investigation reported the same Chain ID 18 as is reported by the baseline node", - "evaluator_name": "node_identity", - "category": "api", - "links": [] - }, - { - "headline": "Role Type reported by baseline and target match", - "score": 100, - "explanation": "The node under investigation reported the same Role Type full_node as is reported by the baseline node", - "evaluator_name": "node_identity", - "category": "api", - "links": [] - }, - { - "headline": "Target node produced valid recent transaction", - "score": 100, - "explanation": "We were able to pull the same transaction (version: 3238616) from both your node and the baseline node. Great! 
This implies that your node is keeping up with other nodes in the network.", - "evaluator_name": "transaction_availability", - "category": "api", - "links": [] - } - ], - "summary_score": 100, - "summary_explanation": "100: Awesome!" -} -``` - -## How NHC works -Before running NHC, it is important to know at a high level how NHC works. In short, NHC runs as a service. When you want to run a set of validations against your node, you send HTTP requests to this service. - -A single NHC instance can be configured to test multiple different node configurations, for example: - -- Validator Node running in single node testnet. -- Public FullNode connected to devnet. -- Validator Node connected to testnet, e.g. as part of an Aptos Incentivized Testnet. - -In all cases, validations are performed compared to a baseline node. For example, for the second configuration above (Public FullNode connected to devnet), the baseline node might be a node run by the Aptos team that demonstrates optimal performance and participation characteristics. The baseline node's configuration YAML describes where to find this node (URL + port), what evaluators (e.g. metrics checks, TPS tests, API validations, etc.) NHC should run, what parameters to use for those evaluators, what name the configuration has, and so on. Your node will be compared to this baseline node. - -When you send requests to NHC, you must include a baseline configuration. For example, a request to NHC to use `devnet_fullnode` as the baseline configuration will look like this: -``` -curl 'http://nhc.aptoslabs.com/check_node?node_url=http://myfullnode.mysite.com&baseline_configuration_name=devnet_fullnode' -``` - -## Getting configurations ready -In order to run NHC, you must have baseline configurations that it can use. You have two options here: - -### Start from a pre-existing configuration -In [./configuration_examples](./configuration_examples) you can find configurations that work for each of the use cases above and more. 
- -You might want to setup configurations in your host system like this: -``` -mkdir /etc/nhc -cd /etc/nhc -configs=(single_node_validator devnet_fullnode ait2_validator); for c in ${configs[@]}; do wget https://raw.githubusercontent.com/aptos-labs/aptos-core/main/ecosystem/node-checker/configurations/$c.yaml; done -``` - -These configurations are not quite ready to be used as they are, you will need to modify certain fields, such as the node address or evaluator set used. The best way to iterate on this is to just try to run NHC with the configuration and see what it says on startup. - -### Generate your own configurations -To generate your own configurations, you must first get your hands on NHC. Follow one of the guides below for that. Assuming you're using NHC from an image, you could generate a configuration with a command like this: -``` -docker run -it aptoslabs/node-checker:nightly /usr/local/bin/aptos-node-checker configuration create --url 'http://baseline-fullnode.aptoslabs.com' --configuration-name devnet_fullnode --configuration-name-pretty "Devnet FullNode" --evaluators network_minimum_peers api_latency --api-port 80 > /etc/nhc/devnet_fullnode.yaml -``` - -This command just specifies the bare minimum for a baseline configuration, you can tune each evaluator as you see fit. For more guidance on this, try passing `-h` to the above command and seeing all the flags you can work with. - -### Getting necessary files -For some NHC configurations, you will need accompanying files, e.g. `mint.key` to use for running a TPS test against a validator. You should make sure those are also avilable to NHC, either on disk or mounted into your container. NHC will expect them on startup at a path determined by the baseline configuration. 
- -## Running NHC: Docker -Assuming you've followed the configuration guide above, you can mount and use the configurations and then run the server with a command like this: -``` -docker run -v /etc/nhc:/etc/nhc -p 20121:20121 -t aptoslabs/node-checker:nightly /usr/local/bin/aptos-node-checker server run --baseline-node-config-paths /etc/nhc/ait2_validator.yaml /etc/nhc/devnet_fullnode.yaml -``` - -You may want to include other env vars such as `RUST_LOG=info`. As you can see, by default NHC runs on port 20121. Make sure to publish it from the container like in the above command and ensure the port is open on your host. You may change the port NHC runs on with `--listen-port`. - -## Running NHC: Source -First, check out the source: -``` -git clone git@github.com:aptos-labs/aptos-core.git -cd aptos-core -``` - -Depending on your setup, you may want to check out a particular branch, to ensure NHC is compatible with your node, e.g. `git checkout --track devnet`. - -From here, assuming you have followed the above configuration guide, you can run NHC: -``` -cargo run --release -- server run --baseline-node-config-paths /etc/nhc/ait2_validator.yaml /etc/nhc/devnet_fullnode.yaml -``` - -## Running NHC: Terraform / Helm -Down the line we will have easier pre-packaged configs in which you only need to specify key pieces of the configuration. Coming soon! - -## Running NHC as a sidecar -When you run NHC as a sidecar, you preconfigure a node that NHC should use as the node under investigation by default: -``` ---target-node-url http://localhost -``` - -Running NHC as a sidecar can be handy when you want to close the API / metrics ports on your machine to the public internet, but would still like to run NHC to validate the setup of your node. - -If you want, you can even restrict NHC to test only that node: -``` ---allow-preconfigured-test-node-only -``` -With this flag, the `/check_node` endpoint will always return 400s, you must instead use `/check_preconfigured_node`. 
- -Once you have configured your NHC instance in sidecar mode, you can send requests that omit the target node address. -``` -curl 'http://nhc.aptoslabs.com/check_preconfigured_node?baseline_configuration_name=devnet_fullnode' -``` - -There are more options than these, e.g. around which ports to use. Pass `-h` to see more options. - -## Generating the OpenAPI specs -To generate the OpenAPI specs, run the following commands: -``` -cargo run -- server generate-openapi -f yaml > openapi.yaml -cargo run -- server generate-openapi -f json > openapi.json -``` - -You can also hit the `/spec_yaml` and `/spec_json` endpoints of the running service.
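
The `/check_node` request and the `evaluation_results` response documented in this change can also be handled from a script. The following is a minimal Python sketch, not part of the change itself. The function names `build_check_node_url` and `failing_evaluations` are hypothetical, and so is the idea of treating any score below 100 as a failure. The sketch also percent-encodes `node_url`, on the assumption that the NHC service decodes query parameters in the usual way; the `curl` examples in the docs pass it unencoded.

```python
import json
from urllib.parse import urlencode


def build_check_node_url(nhc_base_url, node_url, baseline_configuration_name):
    """Build a /check_node URL using the two query parameters shown in the docs."""
    query = urlencode(
        {
            "node_url": node_url,
            "baseline_configuration_name": baseline_configuration_name,
        }
    )
    return f"{nhc_base_url.rstrip('/')}/check_node?{query}"


def failing_evaluations(response_body, passing_score=100):
    """Return (evaluator_name, headline, score) for each result below passing_score.

    passing_score is an assumption made for this sketch; the documented
    response only shows results with a score of 100.
    """
    report = json.loads(response_body)
    return [
        (result["evaluator_name"], result["headline"], result["score"])
        for result in report["evaluation_results"]
        if result["score"] < passing_score
    ]


if __name__ == "__main__":
    # Same host, port, and baseline name as the quickstart curl command.
    print(
        build_check_node_url(
            "http://localhost:20121",
            "http://mynode.mysite.com",
            "devnet_fullnode",
        )
    )
```

The response body from `curl` (or any HTTP client) can be passed to `failing_evaluations` to list only the evaluators that need attention.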