Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'main' into deploy/bitforex-metagraph-intnet
# Conflicts:
#	event.json
#	index.js
#	src/currency-l1/index.js
#	src/data-l1/index.js
#	src/external/opsgenie/index.js
#	src/metagraph-l0/index.js
#	src/shared/restart_operations.js
#	src/utils/types.js
Showing 13 changed files with 3,249 additions and 120 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
name: Deploy DOR Metagraph Monitor - Mainnet | ||
|
||
on: | ||
push: | ||
branches: | ||
- "deploy/dor-metagraph-mainnet" | ||
jobs: | ||
deploy-mainnet-dor-metagraph: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- uses: actions/checkout@v3 | ||
|
||
- name: Deploy DOR Metagraph Monitor - Mainnet | ||
uses: "./.github/templates/deploy" | ||
with: | ||
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_DOR }} | ||
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_DOR }} | ||
AWS_REGION: ${{ secrets.AWS_REGION_DOR }} | ||
AWS_LAMBDA_FUNCTION_NAME: MainnetMetagraphMonitor |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,73 +1,147 @@ | ||
|
||
Network Monitoring Tools | ||
Metagraph Monitoring Tools | ||
======================== | ||
|
||
This project has been developed for monitoring metagraphs and initiating a restart if necessary. | ||
This project has been developed to monitor metagraphs and initiate a restart if necessary. We have a Lambda function dedicated to monitoring each metagraph node individually as well as the snapshot generation process. The lambda can trigger two types of restarts: **FULL_CLUSTER** or **INDIVIDUAL_NODES**. | ||
|
||
We have a lambda function dedicated to monitoring a metagraph and restarting it if the snapshot production stops. It can be enhanced to handle other scenarios requiring a restart. | ||
### Full Cluster Restart | ||
|
||
To run the lambda function, you should provide the parameters described in `event.json`. Some parameters are mandatory and need to be populated as SSM Parameters in the Parameter Store. | ||
This restart will involve restarting all the nodes and layers. In other words, it will completely restart the metagraph (3ml0, 3cl1, and 3dl1). We always aim to avoid a complete restart of the metagraph, but there are certain conditions that can trigger this type of restart, such as: | ||
|
||
- Snapshots no longer being produced. | ||
- Snapshots not reaching the global networks (MainNet, IntegrationNet). | ||
- All nodes from layer 0 down (which would cause the snapshots to stop being created). | ||
- All nodes from all layers down. | ||
|
||
### Individual Nodes Restart | ||
|
||
This restart will involve restarting nodes individually. This means that we won't need to restart the full cluster but only specific nodes in certain layers. For example, if node 2 in layer cl1 is down, we can restart only the process of node 2 in layer cl1 instead of the entire cluster or node. The conditions that can trigger this type of restart are: | ||
|
||
- Unhealthy node in layer ml0 | ||
- Unhealthy node in layer cl1 | ||
- Unhealthy node in layer dl1 | ||
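
The conditions above can be read as a small decision rule. The following is a minimal sketch, not the repository's implementation: the function name `chooseRestartType`, the input shape, and the `NONE` value are all hypothetical and exist only for illustration.

```javascript
// Hypothetical sketch: the input shape and the NONE value are assumptions,
// not the repository's actual data model.
const RESTART_TYPES = Object.freeze({
  FULL_CLUSTER: "FULL_CLUSTER",
  INDIVIDUAL_NODES: "INDIVIDUAL_NODES",
  NONE: "NONE", // hypothetical marker meaning "nothing to restart"
});

const LAYERS = ["ml0", "cl1", "dl1"];

function chooseRestartType({ snapshotsProduced, snapshotsReachNetwork, unhealthyNodes, nodesPerLayer }) {
  // Snapshot problems affect the whole metagraph, so restart everything.
  if (!snapshotsProduced || !snapshotsReachNetwork) {
    return RESTART_TYPES.FULL_CLUSTER;
  }

  // If every ml0 node is down, snapshots stop being created.
  // This check also covers the "all nodes from all layers down" case.
  const ml0Down = (unhealthyNodes.ml0 || []).length;
  if (ml0Down >= nodesPerLayer) {
    return RESTART_TYPES.FULL_CLUSTER;
  }

  // Otherwise restart only the unhealthy nodes, if there are any.
  const anyUnhealthy = LAYERS.some((layer) => (unhealthyNodes[layer] || []).length > 0);
  return anyUnhealthy ? RESTART_TYPES.INDIVIDUAL_NODES : RESTART_TYPES.NONE;
}
```

For example, with node 2 of layer cl1 unhealthy and snapshots still flowing, `chooseRestartType` returns `INDIVIDUAL_NODES`, which matches the example given above.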
|
||
## Guide | ||
|
||
We are using **Node.js 16** to package this Lambda function. You can manage Node.js versions using [nvm](https://github.com/nvm-sh/nvm). | ||
|
||
Please remember to create a new role for the lambda function, which should include the following policies: AmazonEventBridgeFullAccess, AWSLambda_FullAccess, AWSLambdaBasicExecutionRole. Additionally, we need to create a custom policy with the following permissions: | ||
For local execution, we are utilizing the [node-lambda](https://www.npmjs.com/package/node-lambda) library. To run it locally, install node-lambda by executing the following command: | ||
`npm install -g node-lambda` | ||
|
||
**This repository was made to run using AWS Cloud.** | ||
|
||
On AWS we will need to use some services: | ||
|
||
- IAM (Policies) | ||
- Lambda | ||
- Event Bridge | ||
- Systems Manager | ||
- Dynamo | ||
- Cloud Watch | ||
|
||
### IAM | ||
When creating a Lambda function, a corresponding role must be provided. This Lambda function will require access to other services such as DynamoDB and Systems Manager (SSM). To establish this role, access to AWS Identity and Access Management (IAM) is required. | ||
|
||
The initial step involves creating a new policy on AWS, named `MetagraphMonitor`. This policy should include the following JSON configuration: | ||
|
||
``` | ||
{ | ||
"Version": "2012-10-17", | ||
"Statement": [ | ||
{ | ||
"Sid": "VisualEditor0", | ||
"Effect": "Allow", | ||
"Action": [ | ||
"ssm:SendCommand", | ||
"ssm:CreateAssociation", | ||
"ssm:GetParameter" | ||
], | ||
"Resource": "*" | ||
} | ||
] | ||
"Version": "2012-10-17", | ||
"Statement": [ | ||
{ | ||
"Sid": "VisualEditor0", | ||
"Effect": "Allow", | ||
"Action": [ | ||
"ssm:SendCommand", | ||
"ssm:CreateAssociation", | ||
"ssm:GetParameter" | ||
], | ||
"Resource": "*" | ||
} | ||
] | ||
} | ||
``` | ||
|
||
We also access instances using SSM, so ensure that your instances have the SSM client set up. | ||
Upon policy creation, the subsequent step is to establish the role. This role should also be named `MetagraphMonitor`. Attach the following policies to this role: | ||
|
||
To deploy the function, simply package the changes using `zip -r my_deployment_package.zip .` and then deploy the ZIP file to your function. | ||
1. MetagraphMonitor (previously created) | ||
2. AmazonDynamoDBFullAccess (AWS default policy) | ||
3. AmazonEventBridgeFullAccess (AWS default policy) | ||
4. AWSLambdaBasicExecutionRole (AWS default policy) | ||
|
||
Dependencies | ||
------------ | ||
These policies ensure access to the required services. | ||
|
||
This project is designed to run as a lambda function and monitor EC2 instances. Therefore, we require the following: | ||
### Lambda | ||
This codebase represents a Lambda function designed for deployment on AWS. To deploy the Lambda function, execute the following script: | ||
`npm run package` | ||
|
||
- 3 EC2 instances with an authorized SSM agent and zip installed (use `sudo apt install zip` to install zip for compressing logs). | ||
Running this command will generate a file named `my_deployment_package.zip`. Upload this file to your Lambda function on AWS. | ||
|
||
Inside each instance, we should follow a specific directory structure. To run this lambda function, your instance should contain the following files: | ||
To run the lambda function, you should provide the parameters described in `event.json`. Some parameters are mandatory and need to be populated as SSM Parameters in the Parameter Store. | ||
|
||
- `your_metagraph_l0_directory` | ||
Ensure that you set the Lambda concurrency to only `1` and the timeout to `15 minutes`. | ||
|
||
- `genesis.csv` | ||
- `metagraph-l0.jar` | ||
- `your_currency_l1_directory` | ||
### Event Bridge | ||
This service is responsible for scheduling the Lambda function to be triggered. Currently, we recommend creating a schedule to run every 5 minutes to check the health of the metagraph. The service should provide the Lambda payload, which includes information about **metagraph**, **network**, **aws**, **force_metagraph_restart**, and **enable_opsgenie_alerts**. | ||
|
||
- `currency-l1.jar` | ||
- `your_data_l1_directory` | ||
- **metagraph**: This section contains information about the metagraph, such as metagraphID, metagraphName, layers to be monitored besides ml0, ports, file_system, additional environment variables, required environment variables, and seed lists (if necessary). | ||
- **network**: This section contains information about the network on which the metagraph will run. It could be Integrationnet or Mainnet. | ||
- **aws**: This section includes information about the AWS region where we are running our instances and details about the instances, such as **ids** and **ips**. | ||
- **force_metagraph_restart**: This will initiate a restart, even if one is already in progress. Further details about restarts in progress will be discussed in the **Dynamo** section. | ||
- **enable_opsgenie_alerts**: Additionally, we offer support for creating alerts on Opsgenie. Set this to `true` if you want to enable this integration. | ||
|
||
The template for this payload can be found in the file `event.json` in the root directory of this repository. | ||
**Ensure that you fill in this payload correctly, as it is crucial for the proper execution of the Lambda function.** | ||
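
Because a malformed payload is easy to miss until the Lambda runs, a quick pre-flight check can help. The sketch below only verifies that the five top-level sections listed above are present. The function name `findMissingPayloadKeys` is hypothetical, and treating every section as mandatory is an assumption; the nested structure should still be checked against `event.json`.

```javascript
// Top-level sections named in the Event Bridge description above.
// Assumption: all five are expected to be present in every payload.
const REQUIRED_PAYLOAD_KEYS = [
  "metagraph",
  "network",
  "aws",
  "force_metagraph_restart",
  "enable_opsgenie_alerts",
];

// Hypothetical helper: returns the names of the sections missing from the payload.
function findMissingPayloadKeys(event) {
  const payload = event || {};
  return REQUIRED_PAYLOAD_KEYS.filter((key) => !(key in payload));
}
```

An empty array means every expected section exists; it says nothing about whether the values inside each section are valid.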
|
||
- `data-l1.jar` | ||
### Systems Manager | ||
|
||
We need to create variables in the SSM Parameter Store, following this pattern: | ||
This service plays a crucial role in two key aspects of our monitoring system: enabling commands to be sent to instances (without the need for `ssh`) and securely storing the sensitive parameters required for the proper execution of the Lambda function. | ||
|
||
``` | ||
/metagraph-nodes/:ec2_instance_id/l0/keystore | ||
/metagraph-nodes/:ec2_instance_id/l0/keyalias | ||
/metagraph-nodes/:ec2_instance_id/l0/password | ||
It is imperative to ensure SSM access is enabled for your EC2 instances running the metagraph. Refer to the official documentation on how to enable SSM on EC2 instances [here](https://docs.aws.amazon.com/systems-manager/latest/userguide/sysman-install-ssm-agent.html). | ||
**This is critically important; without SSM on EC2 instances, the script will not function, as we rely on it to send instructions to the instances.** | ||
|
||
/metagraph-nodes/:ec2_instance_id/cl1/keystore | ||
/metagraph-nodes/:ec2_instance_id/cl1/keyalias | ||
/metagraph-nodes/:ec2_instance_id/cl1/password | ||
Additionally, we rely on Systems Manager to store sensitive parameters, such as `p12` key information for the metagraph. These parameters are stored in `Systems Manager -> Parameter Store` with the following pattern: | ||
|
||
/metagraph-nodes/:ec2_instance_id/dl1/keystore | ||
/metagraph-nodes/:ec2_instance_id/dl1/keyalias | ||
/metagraph-nodes/:ec2_instance_id/dl1/password | ||
``` | ||
- `/metagraph-nodes/{instance_id}/{layer}/keyalias` | ||
- `/metagraph-nodes/{instance_id}/{layer}/keystore` | ||
- `/metagraph-nodes/{instance_id}/{layer}/password` | ||
|
||
The layers include: `l0`, `cl1`, and `dl1`. | ||
The instance IDs correspond to the IDs from **EC2**. | ||
Considering we have **3 instances** and **3 layers**, we should have a total of **9 parameters** for each instance (3 for ml0, 3 for cl1, and 3 for dl1). Therefore, **27 parameters** in total should be stored in the end. | ||
|
||
**Note: The example above assumes 3 instances, each running all 3 layers; the total number of parameters will vary with the number of instances and layers.** | ||
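
To double-check the expected parameter count, the names can be generated from the pattern above. This is a minimal sketch: the function name `buildParameterNames` and the instance IDs in the usage note are hypothetical placeholders.

```javascript
// Layer segments and key fields taken from the pattern described above.
const PARAMETER_LAYERS = ["l0", "cl1", "dl1"];
const PARAMETER_FIELDS = ["keyalias", "keystore", "password"];

// Hypothetical helper: lists every Parameter Store name expected
// for the given EC2 instance IDs and layers.
function buildParameterNames(instanceIds, layers = PARAMETER_LAYERS) {
  const names = [];
  for (const instanceId of instanceIds) {
    for (const layer of layers) {
      for (const field of PARAMETER_FIELDS) {
        names.push(`/metagraph-nodes/${instanceId}/${layer}/${field}`);
      }
    }
  }
  return names;
}
```

Calling `buildParameterNames(["i-aaa", "i-bbb", "i-ccc"])` with three placeholder IDs yields 27 names, 9 per instance, which matches the count above.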
|
||
As mentioned earlier, we support **Opsgenie** integration. However, to enable this integration, you need the **Opsgenie API-KEY**. Therefore, an additional parameter is required for this integration: | ||
|
||
- `/metagraph-nodes/opsgenie-api-key` | ||
|
||
**Note: The integration will not work if you enable Opsgenie integration in the payload but forget to provide the API-KEY.** | ||
|
||
### Dynamo | ||
|
||
To prevent multiple restarts in parallel, we utilize **Dynamo** to store the restart state. Currently, the possible states for restarts are: `NEW`, `ROLLBACK_IN_PROGRESS`, `READY_TO_JOIN`, `JOINING`, and `READY`. | ||
|
||
- **NEW**: This is the initial state, and all restarts begin with this state. Even if the metagraph is healthy, we initiate the script with this status. | ||
- **ROLLBACK_IN_PROGRESS**: This state indicates that we've started a restart, but the node is still in the process of starting and is not ready to join the metagraph yet. | ||
- **READY_TO_JOIN**: As the name suggests, it signifies that the node is ready to join the metagraph. | ||
- **JOINING**: This state indicates that the node is currently joining the metagraph. | ||
- **READY**: This is the expected final state, signaling that the restart has successfully concluded. | ||
|
||
After a successful execution, the row will be removed from **Dynamo**. You can check the current state by accessing the table directly on the AWS console. | ||
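
The states can be pictured as a simple progression. The sketch below assumes that a restart advances linearly through the states in the order listed above; that ordering is an assumption for illustration, and the function name `nextRestartState` is hypothetical.

```javascript
// Restart states in the order they are listed above.
// Assumption: a restart moves through them one step at a time.
const RESTART_STATES = ["NEW", "ROLLBACK_IN_PROGRESS", "READY_TO_JOIN", "JOINING", "READY"];

// Hypothetical helper: returns the state that follows `current`,
// or null when `current` is the final state (READY).
function nextRestartState(current) {
  const index = RESTART_STATES.indexOf(current);
  if (index === -1) {
    throw new Error(`Unknown restart state: ${current}`);
  }
  return index === RESTART_STATES.length - 1 ? null : RESTART_STATES[index + 1];
}
```

Rejecting unknown values early makes a mistyped state in the table surface as an error instead of a restart that never completes.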
|
||
To create the table on **Dynamo**, you can run the script `src/utils/scripts/create_dynamo_table.sh`. | ||
|
||
**Note: Ensure that you set all environment variables pointing to the correct AWS account (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION) before running the script.** | ||
|
||
### Cloud Watch | ||
|
||
This service is responsible for storing the logs of our Lambda execution. You can search by Lambda name and review the logs of both the current and past executions. | ||
|
||
## Additional Information | ||
This repository is designed to aid in monitoring the health of the metagraph and initiate a restart if needed. In the file `src/utils/types.js`, you will find a variable named `ROLLBACK_IN_PROGRESS_TIMEOUT_IN_MINUTES`. This variable determines the timeout for the current execution. If this timeout is exceeded, a new restart will be triggered, and the process will be retried. | ||
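
The timeout check described above amounts to a comparison between elapsed time and the configured limit. A minimal sketch follows; the function name `isRestartTimedOut` and its parameters are hypothetical, and the timeout is passed in as an argument because the actual value of `ROLLBACK_IN_PROGRESS_TIMEOUT_IN_MINUTES` is not shown here.

```javascript
// Hypothetical helper: true when a restart that began at `startedAtMs`
// (epoch milliseconds) has run longer than `timeoutInMinutes`.
function isRestartTimedOut(startedAtMs, timeoutInMinutes, nowMs = Date.now()) {
  const elapsedMs = nowMs - startedAtMs;
  return elapsedMs > timeoutInMinutes * 60 * 1000;
}
```

Passing `nowMs` explicitly keeps the check deterministic in tests; in the Lambda it would default to the current time.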
|
||
After deploying the Lambda, monitor the initial executions to ensure everything is functioning correctly and that no parameters are missing or provided incorrectly. | ||
|
||
You should repeat the above parameters for your 3 instances. Additionally, there is a parameter in the SSM Parameter Store for Opsgenie integration, so you need to create the following parameter: | ||
In the `.github` directory, you will find examples of actions to automate the deployment of the Lambda function to AWS. You can use these examples as a reference for automating your deploys. | ||
|
||
`/metagraph-nodes/opsgenie-api-key` | ||
Seedlists are not required. You can leave the `seedlists` field as an empty object in the `event.json` file. |
Binary file not shown.