feat: critical alerts by modules - 2 #264

Open
wants to merge 3 commits into
base: develop
Changes from 1 commit
164 changes: 148 additions & 16 deletions README.md
@@ -262,7 +262,8 @@ Holesky) this value should be omitted.
* **Default:** ./docker/validators/lido_mainnet.db
* **Note:** it makes sense to change default value if `VALIDATOR_REGISTRY_SOURCE` is set to "lido"
---
`VALIDATOR_REGISTRY_KEYSAPI_SOURCE_URLS` - Comma-separated list of URLs to [Lido Keys API service](https://github.com/lidofinance/lido-keys-api).
`VALIDATOR_REGISTRY_KEYSAPI_SOURCE_URLS` - Comma-separated list of URLs to
[Lido Keys API service](https://github.com/lidofinance/lido-keys-api).
* **Required:** false
* **Note:** will be used only if `VALIDATOR_REGISTRY_SOURCE` is set to "keysapi"
---
@@ -278,55 +279,186 @@ Holesky) this value should be omitted.
* **Required:** false
* **Default:** 2
---
`VALIDATOR_USE_STUCK_KEYS_FILE` - Use a file with list of validators that are stuck and should be excluded from the monitoring metrics.
`VALIDATOR_USE_STUCK_KEYS_FILE` - Use a file with list of validators that are stuck and should be excluded from the
monitoring metrics.
* **Required:** false
* **Values:** true / false
* **Default:** false
---
`VALIDATOR_STUCK_KEYS_FILE_PATH` - Path to file with list of validators that are stuck and should be excluded from the monitoring metrics.
`VALIDATOR_STUCK_KEYS_FILE_PATH` - Path to file with list of validators that are stuck and should be excluded from the
monitoring metrics.
* **Required:** false
* **Default:** ./docker/validators/stuck_keys.yaml
* **Note:** will be used only if `VALIDATOR_USE_STUCK_KEYS_FILE` is true
---
`SYNC_PARTICIPATION_DISTANCE_DOWN_FROM_CHAIN_AVG` - Distance (down) from Blockchain Sync Participation average after which we think that our sync participation is bad.
`SYNC_PARTICIPATION_DISTANCE_DOWN_FROM_CHAIN_AVG` - Distance (down) from Blockchain Sync Participation average after
which we think that our sync participation is bad.
* **Required:** false
* **Default:** 0
---
`SYNC_PARTICIPATION_EPOCHS_LESS_THAN_CHAIN_AVG` - Number epochs after which we think that our sync participation is bad and alert about that.
`SYNC_PARTICIPATION_EPOCHS_LESS_THAN_CHAIN_AVG` - Number of epochs after which we consider our sync participation bad
and alert about it.
* **Required:** false
* **Default:** 3
---
`BAD_ATTESTATION_EPOCHS` - Number of epochs after which we consider our attestation performance bad and alert about it.
* **Required:** false
* **Default:** 3
---
`CRITICAL_ALERTS_ALERTMANAGER_URL` - If passed, application sends additional critical alerts about validators performance to Alertmanager.
`CRITICAL_ALERTS_ALERTMANAGER_URL` - If set, the application sends additional critical alerts about validator
performance to Alertmanager.
* **Required:** false
---
`CRITICAL_ALERTS_MIN_VAL_COUNT` - Critical alerts will be sent for Node Operators with validators count greater this value.
`CRITICAL_ALERTS_MIN_VAL_COUNT` - Critical alerts will be sent for Node Operators with a validator count greater than or
equal to this value.
* **Required:** false
* **Default:** 100
---
`CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` - Sets, per module, the minimum number of active validators a node operator must
have for critical alerts to be sent. A critical alert is sent if all of the following hold: the number of the node
operator's validators in the module is greater than or equal to `minActiveCount`; the number of the node operator's
validators affected by the critical alert is greater than or equal to either the node operator's total number of
validators multiplied by `affectedShare`, or `minAffectedCount`; and the values for the module are not overridden by
`CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT`.

It must be in JSON string format. Example:
`{ "0": { "minActiveCount": 100, "affectedShare": 0.33, "minAffectedCount": 1000 }}`.

The numeric key in this structure is the module ID. Values specified for the zero key apply to all modules. Values
specified for a non-zero key apply only to that module and take priority over the values specified for the zero key.

If this variable has no values for a particular module and no values for the zero key, the rule is applied as if the
following values were set:
`{ "minActiveCount": CRITICAL_ALERTS_MIN_VAL_COUNT, "affectedShare": 0.33, "minAffectedCount": 1000 }`.
* **Required:** false
* **Default:** {}
---
`CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT` - If the number of a node operator's validators in the specified module
affected by the critical alert is greater than or equal to this value, the critical alert will be sent.

It must be in JSON string format. Example: `{ "0": 100, "3": 50 }`.

The numeric key in this structure is the module ID. Values specified for the zero key apply to all modules. Values
specified for a non-zero key apply only to that module and take priority over the values specified for the zero key.

This variable has priority over the `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` and `CRITICAL_ALERTS_MIN_VAL_COUNT` values.
If this variable doesn't have values for the particular module and no values for the zero key are set, rules defined in
the `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` and `CRITICAL_ALERTS_MIN_VAL_COUNT` variables are applied.
* **Required:** false
* **Default:** {}
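
The per-module lookup described for this variable (module key first, then the zero key) can be sketched as below. This
is a minimal illustration and not the application's code: `parseJsonEnv` and `forModule` are hypothetical helper names,
and only the JSON shape and the example value come from this README. It covers the lookup inside a single variable
only, not the priority between different variables.

```typescript
// Hypothetical helpers; only the JSON shape is taken from the README.
type ByModule<T> = Record<string, T>;

// Parse a JSON-string env value; an unset variable behaves like the documented default `{}`.
function parseJsonEnv<T>(raw: string | undefined): ByModule<T> {
  return raw ? (JSON.parse(raw) as ByModule<T>) : {};
}

// Prefer the value for the module's own key, otherwise fall back to the zero key.
// Returns undefined when neither key is present.
function forModule<T>(values: ByModule<T>, moduleIndex: number): T | undefined {
  return values[String(moduleIndex)] ?? values['0'];
}

// Example value from the README: { "0": 100, "3": 50 }
const minAffected = parseJsonEnv<number>('{ "0": 100, "3": 50 }');
console.log(forModule(minAffected, 3)); // module 3 has its own value
console.log(forModule(minAffected, 1)); // module 1 falls back to the zero key
console.log(forModule(parseJsonEnv<number>(undefined), 1)); // nothing configured
```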
---
`CRITICAL_ALERTS_ALERTMANAGER_LABELS` - Additional labels for critical alerts.
Must be in JSON string format. Example - '{"a":"valueA","b":"valueB"}'.
Must be in JSON string format. Example: `{ "a": "valueA", "b": "valueB" }`.
* **Required:** false
* **Default:** {}
---

## Application critical alerts (via Alertmanager)

In addition to alerts based on Prometheus metrics you can receive special critical alerts based on beaconchain aggregates from app.
In addition to alerts based on Prometheus metrics, you can receive special critical alerts based on Beacon Chain
aggregates from the app.

You should pass env var `CRITICAL_ALERTS_ALERTMANAGER_URL=http://<alertmanager_host>:<alertmanager_port>`.
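
For orientation, the sketch below shows how an alert payload could be assembled for Alertmanager's v2 endpoint
(`/api/v2/alerts`), with the labels from `CRITICAL_ALERTS_ALERTMANAGER_LABELS` merged in. It is an assumption-laden
illustration, not the application's actual request body: `buildAlert`, `alertsEndpoint`, the `severity` label, and the
`summary` annotation are hypothetical.

```typescript
// Subset of the alert object accepted by Alertmanager's POST /api/v2/alerts.
interface PostableAlert {
  startsAt: string;
  labels: Record<string, string>;
  annotations: Record<string, string>;
}

// Hypothetical builder: merges extra labels (JSON string) under the alert's own labels.
function buildAlert(alertname: string, summary: string, extraLabelsJson: string | undefined, now: Date): PostableAlert {
  const extraLabels = extraLabelsJson ? (JSON.parse(extraLabelsJson) as Record<string, string>) : {};
  return {
    startsAt: now.toISOString(),
    // `severity` is an assumed label; extra labels cannot override `alertname`.
    labels: { ...extraLabels, alertname, severity: 'critical' },
    annotations: { summary },
  };
}

// Build the endpoint URL from CRITICAL_ALERTS_ALERTMANAGER_URL, tolerating a trailing slash.
function alertsEndpoint(baseUrl: string): string {
  return `${baseUrl.replace(/\/+$/, '')}/api/v2/alerts`;
}

const alert = buildAlert('CriticalSlashing', 'At least one validator was slashed', '{ "a": "valueA", "b": "valueB" }', new Date());
console.log(alertsEndpoint('http://localhost:9093/'), JSON.stringify([alert]));
```

The body sent to the endpoint is a JSON array of such objects.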

And if `ethereum_validators_monitoring_data_actuality < 1h` it allows you to receive alerts from table bellow
There are 3 environment variables that control how critical alerts are sent for certain modules:
```
CRITICAL_ALERTS_MIN_VAL_COUNT: number;
CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT: {
  <moduleIndex>: {
    minActiveCount: number,
    affectedShare: number,
    minAffectedCount: number,
  }
};
CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT: {
  <moduleIndex>: number
};
```

The following rules are applied, listed in order of increasing priority; each rule overrides the previous ones.

1. (lowest priority) `CRITICAL_ALERTS_MIN_VAL_COUNT`. If only this variable is set, the app behaves as if
`CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` had the following value:
```
{
  "0": {
    "minActiveCount": CRITICAL_ALERTS_MIN_VAL_COUNT,
    "affectedShare": 0.33,
    "minAffectedCount": 1000
  }
}
```

2. Default rules (the zero key) are set in the `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` variable.
```
CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT = {
  "0": {
    "minActiveCount": <integer>,
    "affectedShare": <0.xx>,
    "minAffectedCount": <integer>
  }
}
```
Values specified for the zero key are applied to all modules. A critical alert is triggered for a module if both
conditions are met:

a. the number of active validators for the given node operator is greater than `minActiveCount`;

b. the number of validators affected by the critical alert is greater than `minAffectedCount`, or the share of the node
operator's validators affected by the critical alert is greater than `affectedShare`.

3. Default rules (the zero key) are set in the `CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT` variable.
```
CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT = {
  "0": <integer>
}
```
The value specified for the zero key is applied to all modules. A critical alert is triggered for a module if the
number of the node operator's validators affected by the critical alert is greater than the specified value.

4. Values for specific modules are set in the `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` variable.
```
CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT = {
  "n": {
    "minActiveCount": <integer>,
    "affectedShare": <0.xx>,
    "minAffectedCount": <integer>
  }
}
```
A critical alert is triggered for the specified modules if both conditions are met:

a. the number of active validators for the given node operator is greater than `minActiveCount`;

b. the number of validators affected by the critical alert is greater than `minAffectedCount`, or the share of the node
operator's validators affected by the critical alert is greater than `affectedShare`.

For modules that have no key in the `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` value, the rules defined in the previous
steps are applied.

5. (highest priority) Values for specific modules are set in the `CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT` variable.
```
CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT = {
  "n": <integer>
}
```
A critical alert is triggered for the specified modules if the number of the node operator's validators affected by the
critical alert is greater than the value specified for the module.

For modules that have no key in the `CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT` value, the rules defined in the previous
steps are applied.
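
The five rules above can be read as a lookup from highest to lowest priority. The sketch below is a minimal
illustration, not the application's code: `resolveThresholds`, `shouldSend`, and the `CriticalAlertsConfig` shape are
hypothetical names. Comparisons are strict, as in the list above; the variable reference in this README says "greater
or equal", so the behaviour at the boundary is an assumption. The affected share is computed against the active
validator count, which is also an assumption.

```typescript
type ActiveRule = { minActiveCount: number; affectedShare: number; minAffectedCount: number };

// Hypothetical container for the three parsed env variables.
interface CriticalAlertsConfig {
  minValCount: number;                           // CRITICAL_ALERTS_MIN_VAL_COUNT
  minActiveValCount: Record<string, ActiveRule>; // CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT
  minAffectedValCount: Record<string, number>;   // CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT
}

type Thresholds = { kind: 'affected'; minAffected: number } | { kind: 'active'; rule: ActiveRule };

// Walk the rules from highest (5) to lowest (1) priority and return the first that applies.
function resolveThresholds(cfg: CriticalAlertsConfig, moduleIndex: number): Thresholds {
  const key = String(moduleIndex);
  if (cfg.minAffectedValCount[key] !== undefined) return { kind: 'affected', minAffected: cfg.minAffectedValCount[key] }; // rule 5
  if (cfg.minActiveValCount[key] !== undefined) return { kind: 'active', rule: cfg.minActiveValCount[key] };               // rule 4
  if (cfg.minAffectedValCount['0'] !== undefined) return { kind: 'affected', minAffected: cfg.minAffectedValCount['0'] }; // rule 3
  if (cfg.minActiveValCount['0'] !== undefined) return { kind: 'active', rule: cfg.minActiveValCount['0'] };               // rule 2
  // Rule 1: fall back to CRITICAL_ALERTS_MIN_VAL_COUNT with the documented defaults.
  return { kind: 'active', rule: { minActiveCount: cfg.minValCount, affectedShare: 0.33, minAffectedCount: 1000 } };
}

// Decide whether a critical alert fires for one node operator.
function shouldSend(t: Thresholds, active: number, affected: number): boolean {
  if (t.kind === 'affected') return affected > t.minAffected;
  const { minActiveCount, affectedShare, minAffectedCount } = t.rule;
  return active > minActiveCount && (affected > minAffectedCount || affected > active * affectedShare);
}

const cfg: CriticalAlertsConfig = {
  minValCount: 100,
  minActiveValCount: { '2': { minActiveCount: 10, affectedShare: 0.5, minAffectedCount: 20 } },
  minAffectedValCount: { '0': 100, '3': 50 },
};
console.log(resolveThresholds(cfg, 3).kind); // module-specific affected count (rule 5)
console.log(resolveThresholds(cfg, 2).kind); // module-specific active rule (rule 4)
console.log(shouldSend(resolveThresholds(cfg, 1), 500, 150)); // zero-key affected count (rule 3)
```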

If `ethereum_validators_monitoring_data_actuality < 1h`, the alerts from the table below are sent.

| Alert name | Description | If fired repeat | If value increased repeat |
|----------------------------|-----------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|
| CriticalSlashing | At least one validator was slashed | instant | - |
| CriticalMissedProposes | More than 1/3 blocks from Node Operator duties was missed in the last 12 hours | every 6h | - |
| CriticalNegativeDelta | More than 1/3 or more than 1000 Node Operator validators with negative balance delta (between current and 6 epochs ago) | every 6h | every 1h |
| CriticalMissedAttestations | More than 1/3 or more than 1000 Node Operator validators with missed attestations in the last {{ BAD_ATTESTATION_EPOCHS }} epochs | every 6h | every 1h |
| Alert name | Description | If fired repeat | If value increased repeat |
|----------------------------|---------------------------------------------------------------------------------------------------------|-----------------|---------------------------|
| CriticalSlashing | At least one validator was slashed | instant | - |
| CriticalMissedProposes | More than 1/3 of blocks from Node Operator duties were missed in the last 12 hours | every 6h | - |
| CriticalNegativeDelta | A certain number of validators with negative balance delta (between current and 6 epochs ago) | every 6h | every 1h |
| CriticalMissedAttestations | A certain number of validators with missed attestations in the last `{{BAD_ATTESTATION_EPOCHS}}` epochs | every 6h | every 1h |


## Application metrics
27 changes: 20 additions & 7 deletions docker/validators/custom_mainnet.yaml
@@ -1,9 +1,22 @@
operators:
- name: Operator1
module_1:
- name: Operator_1_0
keys:
- "0xa8088b23b6e9eaecb04c7dfd194d9e47df966605a1cf03004b7d671708421da4cb2836447f73a5f25c2cfb567b181f80"
- "0x84f6ffe8d2285b76d5076165cec8b298c8ed3dc123379de8d49ecf2e27137ebe479fec0e667322a450283c990bfe9995"
- name: Operator2
- "0x800429af2ff9e4581b3a800cec1604de49538a50659c0cbb2b79493b5d888b2b2075f9e7163bc11024088b17c2b78107"
- "0x8004a4ddb445add99be6e41fce54ae0ceba0d802817585c900e3b43d2a35ab09a8b451d02592fa105249af07122887b8"
- name: Operator_1_1
keys:
- "0xa015a5fcd78cb52e2b1f9c1a833868f9da8dfee31c919e8e1c19aa64defdd140390a16d133b500d5a90bc99bca409908"
- "0xb9b74aaec50f74e484862b5b6bf0174ffa7344f2de2b1b89aeb233722d4bc9812ee346d99a6b0740e2c14c1580257247"
- "0x8004d6da4e9228cb0efbf383ce259338d5626029e3f80913ad1c89098d3289977ba10d873cf88c61e1b2572e26fbd318"
- "0x800532e962039d57e63d1da433e26f6bbff8b15f07b90deb5be8038b7f24ddb2d71d2b26a1693a7fb9a7657f3b8b5fef"

# Optional
module_2:
- name: Operator_2_0
keys:
- "0x80081580eefc89c95874ca868cb439a0c51b4b6f97483632ea597e4801c47f03a8f45360a44411c2320296c737c89bc6"
- "0x8008b169609ee48ef4bd36c37bb2d0c5f9fe0335f28396d5aa8620409912e16c06b4ae2048542492007a2005928b074c"
- "0x800e4b8fa424ff35feef522592f3e711a46b426320a7dc40044fb02537e0faf25566e47c72172a3020d0c6bc1648ecc8"
- name: Operator_2_1
keys:
- "0x80096ff18d55b9b08c1778568867210d9110f5a2200962a962846d09a75bfa29177c42b83903ed0cb0b69f8a061e3e11"
- "0x800c8cb0fcd6104cbdf76120352c1651e858eef2fad8142ebca37d26f76a16c5f692f9b987bb22dd6eb5dd0dc9e021a4"
- "0x800cd7cf64998da8d95ac0e864012922904b78cccc28f2fa88f3bf019ecc8779833d1c7e09d62700b14d2b015f002a52"
27 changes: 20 additions & 7 deletions docker/validators/custom_testnet.yaml
@@ -1,9 +1,22 @@
operators:
- name: Operator1
module_1:
- name: Operator_1_0
keys:
- "0xafbf5b06e7953b095a9946cc7ee8f2ecf1312878bd196af4d06661bc7718f1ae2d5c9f8b635f5924bf5d8266234607f8"
- "0x925c1f368524be3fa83c52f40151724b38fb4ebfe64f64f70942aa9a307b81843d9514c1d8f3c8689236f0f1ccd6c6d4"
- name: Operator2
- "0x8000011bc03bbf99ac5964d14d3bb52de983c848cc3734d736235a19715e8cbbd5e963163eb4bd2d8cd473d103b95c12"
- "0x80000b1388d41e2cb346e6a85d94fccc6510a11d5bd91699e156907b53e1f5c265effa87f492b7cba7fe218f232c6c39"
- name: Operator_1_1
keys:
- "0xb5b9b79942fcce7ddd2c3b00dae34e571fb77f0630d4fdeccba3721b6549013b55cbfe643d96cbe920864795c5f01db6"
- "0xb3ddd2b56dbf80ba035d948709099f8ad7241929a051140ce2698fae216293d98c792314c414afb0ed3b849323b523c6"
- "0x800010c6cde9a31d218347c9d042ceff227a1dbec3970336bd8cd6d767fd0f2e587332ef6a3010b1b0f5d04288483d44"
- "0x80001887f6c44f54e043866a6536b940f1c2bdf0a99203f217940fab8684e77fa1c9cc64537464d7d2b681115eec446a"

# Optional
module_2:
- name: Operator_2_0
keys:
- "0x80002248327da011001f38ab78e277ed5ddc1448078a1ba3f1cb47fd20f65f6de07808d7c3c96a2a795011b25100cc1d"
- "0x800037d7c5468fb960d7e5cb40c2d9c39d6713676d9bc971e92692759ac7ba5b0f12d034282e0cfd4cf2c1212d38dd2a"
- "0x80003ad67e896cb261a17398e77e474a7ffc7898a40cf004a74ea8d20b2b562ac7906a3a62656bfbc1d3033748cdd972"
- name: Operator_2_1
keys:
- "0x80004546cdf353788bd0fb2048c80ecaae4dbd72ed1b9e51d90c0457d57f5e3577778a9710f267aa1e50ce0d5df6fa28"
- "0x80008083f7eb1366eaef3992c48e0ced5dadef0e4405c7b9a0a662322847f98022d970e6a13cf12da9d199b7518562f7"
- "0x80009e291a1e81be05ffce78180bb0a240242466af9613ef8dd34a8f1289f9b9dfc2c98c5d40be4d61f1eb4dec559217"
23 changes: 17 additions & 6 deletions src/common/alertmanager/alerts/BasicAlert.ts
@@ -1,6 +1,6 @@
import { ConfigService } from 'common/config';
import { Epoch } from 'common/consensus-provider/types';
import { ClickhouseService } from 'storage';
import { NOsValidatorsStatusStats } from 'storage/clickhouse/clickhouse.types';
import { RegistrySourceOperator } from 'validators-registry';

export interface AlertRequestBody {
@@ -26,22 +26,33 @@ export abstract class Alert {
protected readonly config: ConfigService;
protected readonly storage: ClickhouseService;
protected readonly operators: RegistrySourceOperator[];

protected constructor(name: string, config: ConfigService, storage: ClickhouseService, operators: RegistrySourceOperator[]) {
protected readonly moduleIndex: number;
protected readonly nosStats: NOsValidatorsStatusStats[];

protected constructor(
name: string,
config: ConfigService,
storage: ClickhouseService,
operators: RegistrySourceOperator[],
moduleIndex: number,
nosStats: NOsValidatorsStatusStats[],
) {
this.alertname = name;
this.config = config;
this.storage = storage;
this.operators = operators;
this.moduleIndex = moduleIndex;
this.nosStats = nosStats;
}

abstract alertRule(bySlot: number): Promise<AlertRuleResult>;
abstract alertRule(): AlertRuleResult;

abstract sendRule(ruleResult?: AlertRuleResult): boolean;

abstract alertBody(ruleResult: AlertRuleResult): AlertRequestBody;

async toSend(epoch: Epoch): Promise<PreparedToSendAlert | undefined> {
const ruleResult = await this.alertRule(epoch);
async toSend(): Promise<PreparedToSendAlert | undefined> {
const ruleResult = await this.alertRule();
if (this.sendRule(ruleResult)) return { timestamp: this.sendTimestamp, body: this.alertBody(ruleResult), ruleResult };
}
}