feat: critical alerts by modules - 2 #264

Open
wants to merge 3 commits into
base: develop
184 changes: 168 additions & 16 deletions README.md
@@ -262,7 +262,8 @@ Holesky) this value should be omitted.
* **Default:** ./docker/validators/lido_mainnet.db
* **Note:** it makes sense to change default value if `VALIDATOR_REGISTRY_SOURCE` is set to "lido"
---
`VALIDATOR_REGISTRY_KEYSAPI_SOURCE_URLS` - Comma-separated list of URLs to [Lido Keys API service](https://github.com/lidofinance/lido-keys-api).
`VALIDATOR_REGISTRY_KEYSAPI_SOURCE_URLS` - Comma-separated list of URLs to
[Lido Keys API service](https://github.com/lidofinance/lido-keys-api).
* **Required:** false
* **Note:** will be used only if `VALIDATOR_REGISTRY_SOURCE` is set to "keysapi"
---
@@ -278,55 +279,206 @@ Holesky) this value should be omitted.
* **Required:** false
* **Default:** 2
---
`VALIDATOR_USE_STUCK_KEYS_FILE` - Use a file with a list of validators that are stuck and should be excluded from the monitoring metrics.
`VALIDATOR_USE_STUCK_KEYS_FILE` - Use a file with a list of validators that are stuck and should be excluded from the
monitoring metrics.
* **Required:** false
* **Values:** true / false
* **Default:** false
---
`VALIDATOR_STUCK_KEYS_FILE_PATH` - Path to a file with a list of validators that are stuck and should be excluded from the monitoring metrics.
`VALIDATOR_STUCK_KEYS_FILE_PATH` - Path to a file with a list of validators that are stuck and should be excluded from the
monitoring metrics.
* **Required:** false
* **Default:** ./docker/validators/stuck_keys.yaml
* **Note:** will be used only if `VALIDATOR_USE_STUCK_KEYS_FILE` is true
---
`SYNC_PARTICIPATION_DISTANCE_DOWN_FROM_CHAIN_AVG` - Distance (down) from Blockchain Sync Participation average after which we think that our sync participation is bad.
`SYNC_PARTICIPATION_DISTANCE_DOWN_FROM_CHAIN_AVG` - Distance (down) from Blockchain Sync Participation average after
which we think that our sync participation is bad.
* **Required:** false
* **Default:** 0
---
`SYNC_PARTICIPATION_EPOCHS_LESS_THAN_CHAIN_AVG` - Number of epochs after which we think that our sync participation is bad and alert about that.
`SYNC_PARTICIPATION_EPOCHS_LESS_THAN_CHAIN_AVG` - Number of epochs after which we think that our sync participation is
bad and alert about that.
* **Required:** false
* **Default:** 3
---
`BAD_ATTESTATION_EPOCHS` - Number of epochs after which we think that our attestation is bad and alert about that.
* **Required:** false
* **Default:** 3
---
`CRITICAL_ALERTS_ALERTMANAGER_URL` - If passed, application sends additional critical alerts about validators performance to Alertmanager.
`CRITICAL_ALERTS_ALERTMANAGER_URL` - If passed, application sends additional critical alerts about validators
performance to Alertmanager.
* **Required:** false
---
`CRITICAL_ALERTS_MIN_VAL_COUNT` - Critical alerts will be sent for Node Operators with a validator count greater than this value.
`CRITICAL_ALERTS_MIN_VAL_COUNT` - Critical alerts will be sent for Node Operators with a validator count greater than or
equal to this value.
* **Required:** false
* **Default:** 100
---
`CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` - Sets the minimum conditions for triggering critical alerts based on the number
of active validators for node operators in a specific module.

The value must be in JSON format. Example:
`{ "0": { "minActiveCount": 100, "affectedShare": 0.33, "minAffectedCount": 1000 } }`.

The numeric key represents the module ID. Settings under the `0` key apply to all modules unless overridden by settings
for specific module IDs. Settings for specific module IDs take precedence over the `0` key.

A critical alert is sent if:

* The number of active validators for a node operator is greater than or equal to `minActiveCount`.
* The number of affected validators:
  * is at least `affectedShare` of the total validators for the node operator, OR
  * is greater than or equal to `minAffectedCount`.
* The value in `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` for the specific module is not overridden by
  `CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT`.

If no settings are provided for a specific module or the `0` key, default values are used:
`{ "minActiveCount": CRITICAL_ALERTS_MIN_VAL_COUNT, "affectedShare": 0.33, "minAffectedCount": 1000 }`.
* **Required:** false
* **Default:** {}
---
`CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT` - Defines the minimum number of affected validators for a node operator in a
specific module for which a critical alert should be sent.

The value must be in JSON format, for example: `{ "0": 100, "3": 50 }`. The numeric key represents the module ID. The
value for the key `0` applies to all modules. Values for non-zero keys apply only to the specified module and take
precedence over the `0` key.

This variable takes priority over `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` and `CRITICAL_ALERTS_MIN_VAL_COUNT`. If no
value is set for a specific module or the `0` key, the rules from the other two variables will apply instead.
* **Required:** false
* **Default:** {}
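
As an illustration of the key lookup described for this variable, the sketch below resolves the threshold for one module, preferring the module's own key over the `0` key. The helper name `resolveMinAffected` is hypothetical, and the sketch looks only at this one variable; how it interacts with the other two variables is not covered here.

```typescript
// Hypothetical helper: resolves the threshold for a module from the raw JSON
// value of CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT.
function resolveMinAffected(rawJson: string, moduleId: number): number | undefined {
  // An empty value is treated like the documented default, `{}`.
  const parsed: Partial<Record<string, number>> = JSON.parse(rawJson || '{}');
  // A module-specific key takes precedence over the catch-all `0` key.
  return parsed[String(moduleId)] ?? parsed['0'];
}

const raw = '{ "0": 100, "3": 50 }';
console.log(resolveMinAffected(raw, 3)); // 50
console.log(resolveMinAffected(raw, 1)); // 100
console.log(resolveMinAffected('{}', 1)); // undefined
```

An `undefined` result corresponds to the case where no value is set, so the rules from the other two variables would apply.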
---
`CRITICAL_ALERTS_ALERTMANAGER_LABELS` - Additional labels for critical alerts.
Must be in JSON string format. Example - '{"a":"valueA","b":"valueB"}'.
Must be in JSON string format. Example: `{ "a": "valueA", "b": "valueB" }`.
* **Required:** false
* **Default:** {}
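
For orientation, here is a rough sketch of how the extra labels could be merged into a request for Alertmanager's `POST /api/v2/alerts` endpoint. It only builds the URL and body and sends nothing. The function name `buildAlertRequest`, the `severity` label, the `summary` annotation, and the host name in the usage lines are assumptions made for illustration; the application's actual request body may differ.

```typescript
// Subset of the alert object accepted by Alertmanager's v2 API.
interface PostableAlert {
  labels: Record<string, string>;
  annotations?: Record<string, string>;
  startsAt?: string;
}

// Hypothetical helper: builds (but does not send) a request for POST /api/v2/alerts.
function buildAlertRequest(
  baseUrl: string, // value of CRITICAL_ALERTS_ALERTMANAGER_URL
  alertname: string,
  extraLabelsJson: string, // value of CRITICAL_ALERTS_ALERTMANAGER_LABELS
  summary: string,
  now: Date,
): { url: string; body: string } {
  const extraLabels: Record<string, string> = JSON.parse(extraLabelsJson || '{}');
  const alert: PostableAlert = {
    // `severity` is an assumed label; extra labels are merged in last.
    labels: { alertname, severity: 'critical', ...extraLabels },
    annotations: { summary },
    startsAt: now.toISOString(),
  };
  // Strip trailing slashes so the path is joined cleanly.
  const url = `${baseUrl.replace(/\/+$/, '')}/api/v2/alerts`;
  // The endpoint expects a JSON array of alerts.
  return { url, body: JSON.stringify([alert]) };
}

const req = buildAlertRequest(
  'http://alertmanager:9093/',
  'CriticalSlashing',
  '{ "a": "valueA", "b": "valueB" }',
  'At least one validator was slashed',
  new Date(0),
);
console.log(req.url); // http://alertmanager:9093/api/v2/alerts
```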
---

## Application critical alerts (via Alertmanager)

In addition to alerts based on Prometheus metrics you can receive special critical alerts based on beaconchain aggregates from app.
In addition to alerts based on Prometheus metrics, you can receive special critical alerts based on Beacon Chain
aggregates from the app.

You should pass env var `CRITICAL_ALERTS_ALERTMANAGER_URL=http://<alertmanager_host>:<alertmanager_port>`.

And if `ethereum_validators_monitoring_data_actuality < 1h` it allows you to receive alerts from the table below
Critical alerts for modules are controlled by three environment variables, listed here with their priority (from lowest
to highest):
```
CRITICAL_ALERTS_MIN_VAL_COUNT: number;
CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT: {
<moduleIndex>: {
minActiveCount: number,
affectedShare: number,
minAffectedCount: number,
}
};
CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT: {
<moduleIndex>: number
};
```

The following rules are applied in order of increasing priority: each rule overrides the previous one.

| Alert name | Description | If fired repeat | If value increased repeat |
|----------------------------|-----------------------------------------------------------------------------------------------------------------|-----------------|---------------------------|
| CriticalSlashing | At least one validator was slashed | instant | - |
| CriticalMissedProposes | More than 1/3 blocks from Node Operator duties was missed in the last 12 hours | every 6h | - |
| CriticalNegativeDelta | More than 1/3 or more than 1000 Node Operator validators with negative balance delta (between current and 6 epochs ago) | every 6h | every 1h |
| CriticalMissedAttestations | More than 1/3 or more than 1000 Node Operator validators with missed attestations in the last {{ BAD_ATTESTATION_EPOCHS }} epochs | every 6h | every 1h |
1. **Global Fallback** (`CRITICAL_ALERTS_MIN_VAL_COUNT`). If this variable is set, it acts as a default for modules by
creating an implicit rule:
```
{
"0": {
"minActiveCount": CRITICAL_ALERTS_MIN_VAL_COUNT,
"affectedShare": 0.33,
"minAffectedCount": 1000
}
}
```

2. **Global Rules for Active Validators** (`CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT`). Default rules apply to all modules
(key `0`) unless overridden.
```
CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT = {
"0": {
"minActiveCount": <integer>,
"affectedShare": <0.xx>,
"minAffectedCount": <integer>
}
}
```
A critical alert is triggered for a module if **both** conditions are met:
* The number of active validators is greater than or equal to `minActiveCount`.
* The number of affected validators is greater than or equal to either `minAffectedCount` or `affectedShare` of the
total active validators.

3. **Global Rules for Affected Validators** (`CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT`). Default rules apply to all
modules (key `0`) unless overridden.
```
CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT = {
"0": <integer>
}
```
A critical alert is triggered if the number of affected validators is greater than or equal to this value.

4. **Per-Module Rules for Active Validators** (`CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT`). If specific module keys are
defined, those values override the global rules for `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT` and
`CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT`.
```
CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT = {
"n": {
"minActiveCount": <integer>,
"affectedShare": <0.xx>,
"minAffectedCount": <integer>
}
}
```
A critical alert is triggered for those modules if **both** conditions are met:

* The number of active validators is greater than or equal to `minActiveCount`.
* The number of affected validators is greater than or equal to either `minAffectedCount` or `affectedShare` of the
total validators.

For modules that have no key in `CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT`, the rules defined in the previous steps
are applied.

5. **Per-Module Rules for Affected Validators** (`CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT`). If specific module keys are
defined, those values override all other rules for the module.
```
CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT = {
"n": <integer>
}
```
A critical alert is triggered if the number of affected validators is greater than or equal to the specified value.

To illustrate these rules, consider the following sample config:
```
CRITICAL_ALERTS_MIN_ACTIVE_VAL_COUNT = {
  "0": {
    "minActiveCount": 100,
    "affectedShare": 0.3,
    "minAffectedCount": 1000
  },
  "3": {
    "minActiveCount": 10,
    "affectedShare": 0.5,
    "minAffectedCount": 200
  }
};
CRITICAL_ALERTS_MIN_AFFECTED_VAL_COUNT = {
  "2": 30
};
```
In this case, critical alerts for all modules except 2 and 3 are triggered for operators with at least 100 active
validators, and only if at least 1000 validators or 30% of active validators are affected (whichever is smaller).
For operators in module 3, the thresholds are lower: a critical alert is triggered for operators with at least
10 active validators, and only if at least 200 validators or 50% of validators are affected.

These rules do not apply to module 2. For this module, critical alerts are triggered for all operators with at least
30 affected validators, regardless of how many active validators they have.
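
The priority order can also be read as a lookup chain. The sketch below is one possible reading of the five numbered rules, applied to the sample config; `shouldAlert` and its types are hypothetical names, and the share is assumed to be taken from the active validator count.

```typescript
interface ActiveRule {
  minActiveCount: number;
  affectedShare: number;
  minAffectedCount: number;
}
type ActiveConfig = Partial<Record<string, ActiveRule>>;
type AffectedConfig = Partial<Record<string, number>>;

// Hypothetical helper: checks the rules from highest priority (5) to lowest (1).
function shouldAlert(
  moduleId: number,
  active: number,
  affected: number,
  activeCfg: ActiveConfig,
  affectedCfg: AffectedConfig,
  minValCount = 100, // documented default of CRITICAL_ALERTS_MIN_VAL_COUNT
): boolean {
  const key = String(moduleId);
  const byActive = (r: ActiveRule): boolean =>
    active >= r.minActiveCount && (affected >= r.minAffectedCount || affected >= active * r.affectedShare);

  const moduleAffected = affectedCfg[key];
  if (moduleAffected !== undefined) return affected >= moduleAffected; // rule 5
  const moduleActive = activeCfg[key];
  if (moduleActive !== undefined) return byActive(moduleActive); // rule 4
  const globalAffected = affectedCfg['0'];
  if (globalAffected !== undefined) return affected >= globalAffected; // rule 3
  const globalActive = activeCfg['0'];
  if (globalActive !== undefined) return byActive(globalActive); // rule 2
  return byActive({ minActiveCount: minValCount, affectedShare: 0.33, minAffectedCount: 1000 }); // rule 1
}

// The sample config from above.
const activeCfg: ActiveConfig = {
  '0': { minActiveCount: 100, affectedShare: 0.3, minAffectedCount: 1000 },
  '3': { minActiveCount: 10, affectedShare: 0.5, minAffectedCount: 200 },
};
const affectedCfg: AffectedConfig = { '2': 30 };

console.log(shouldAlert(1, 200, 80, activeCfg, affectedCfg)); // true: 80 is more than 30% of 200
console.log(shouldAlert(1, 200, 40, activeCfg, affectedCfg)); // false: below both affected thresholds
console.log(shouldAlert(3, 20, 12, activeCfg, affectedCfg)); // true: 12 is more than 50% of 20
console.log(shouldAlert(2, 40, 30, activeCfg, affectedCfg)); // true: 30 affected, active count ignored
```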

If `ethereum_validators_monitoring_data_actuality < 1h`, the alerts from the table below are sent.

| Alert name | Description | If fired repeat | If value increased repeat |
|----------------------------|---------------------------------------------------------------------------------------------------------|-----------------|---------------------------|
| CriticalSlashing | At least one validator was slashed | instant | - |
| CriticalMissedProposes | More than 1/3 of the blocks from Node Operator duties were missed in the last 12 hours | every 6h | - |
| CriticalNegativeDelta | A certain number of validators with negative balance delta (between current and 6 epochs ago) | every 6h | every 1h |
| CriticalMissedAttestations | A certain number of validators with missed attestations in the last `{{BAD_ATTESTATION_EPOCHS}}` epochs | every 6h | every 1h |


## Application metrics
27 changes: 20 additions & 7 deletions docker/validators/custom_mainnet.yaml
@@ -1,9 +1,22 @@
operators:
- name: Operator1
module_1:
- name: Operator_1_0
keys:
- "0xa8088b23b6e9eaecb04c7dfd194d9e47df966605a1cf03004b7d671708421da4cb2836447f73a5f25c2cfb567b181f80"
- "0x84f6ffe8d2285b76d5076165cec8b298c8ed3dc123379de8d49ecf2e27137ebe479fec0e667322a450283c990bfe9995"
- name: Operator2
- "0x800429af2ff9e4581b3a800cec1604de49538a50659c0cbb2b79493b5d888b2b2075f9e7163bc11024088b17c2b78107"
- "0x8004a4ddb445add99be6e41fce54ae0ceba0d802817585c900e3b43d2a35ab09a8b451d02592fa105249af07122887b8"
- name: Operator_1_1
keys:
- "0xa015a5fcd78cb52e2b1f9c1a833868f9da8dfee31c919e8e1c19aa64defdd140390a16d133b500d5a90bc99bca409908"
- "0xb9b74aaec50f74e484862b5b6bf0174ffa7344f2de2b1b89aeb233722d4bc9812ee346d99a6b0740e2c14c1580257247"
- "0x8004d6da4e9228cb0efbf383ce259338d5626029e3f80913ad1c89098d3289977ba10d873cf88c61e1b2572e26fbd318"
- "0x800532e962039d57e63d1da433e26f6bbff8b15f07b90deb5be8038b7f24ddb2d71d2b26a1693a7fb9a7657f3b8b5fef"

# Optional
module_2:
- name: Operator_2_0
keys:
- "0x80081580eefc89c95874ca868cb439a0c51b4b6f97483632ea597e4801c47f03a8f45360a44411c2320296c737c89bc6"
- "0x8008b169609ee48ef4bd36c37bb2d0c5f9fe0335f28396d5aa8620409912e16c06b4ae2048542492007a2005928b074c"
- "0x800e4b8fa424ff35feef522592f3e711a46b426320a7dc40044fb02537e0faf25566e47c72172a3020d0c6bc1648ecc8"
- name: Operator_2_1
keys:
- "0x80096ff18d55b9b08c1778568867210d9110f5a2200962a962846d09a75bfa29177c42b83903ed0cb0b69f8a061e3e11"
- "0x800c8cb0fcd6104cbdf76120352c1651e858eef2fad8142ebca37d26f76a16c5f692f9b987bb22dd6eb5dd0dc9e021a4"
- "0x800cd7cf64998da8d95ac0e864012922904b78cccc28f2fa88f3bf019ecc8779833d1c7e09d62700b14d2b015f002a52"
27 changes: 20 additions & 7 deletions docker/validators/custom_testnet.yaml
@@ -1,9 +1,22 @@
operators:
- name: Operator1
module_1:
- name: Operator_1_0
keys:
- "0xafbf5b06e7953b095a9946cc7ee8f2ecf1312878bd196af4d06661bc7718f1ae2d5c9f8b635f5924bf5d8266234607f8"
- "0x925c1f368524be3fa83c52f40151724b38fb4ebfe64f64f70942aa9a307b81843d9514c1d8f3c8689236f0f1ccd6c6d4"
- name: Operator2
- "0x8000011bc03bbf99ac5964d14d3bb52de983c848cc3734d736235a19715e8cbbd5e963163eb4bd2d8cd473d103b95c12"
- "0x80000b1388d41e2cb346e6a85d94fccc6510a11d5bd91699e156907b53e1f5c265effa87f492b7cba7fe218f232c6c39"
- name: Operator_1_1
keys:
- "0xb5b9b79942fcce7ddd2c3b00dae34e571fb77f0630d4fdeccba3721b6549013b55cbfe643d96cbe920864795c5f01db6"
- "0xb3ddd2b56dbf80ba035d948709099f8ad7241929a051140ce2698fae216293d98c792314c414afb0ed3b849323b523c6"
- "0x800010c6cde9a31d218347c9d042ceff227a1dbec3970336bd8cd6d767fd0f2e587332ef6a3010b1b0f5d04288483d44"
- "0x80001887f6c44f54e043866a6536b940f1c2bdf0a99203f217940fab8684e77fa1c9cc64537464d7d2b681115eec446a"

# Optional
module_2:
- name: Operator_2_0
keys:
- "0x80002248327da011001f38ab78e277ed5ddc1448078a1ba3f1cb47fd20f65f6de07808d7c3c96a2a795011b25100cc1d"
- "0x800037d7c5468fb960d7e5cb40c2d9c39d6713676d9bc971e92692759ac7ba5b0f12d034282e0cfd4cf2c1212d38dd2a"
- "0x80003ad67e896cb261a17398e77e474a7ffc7898a40cf004a74ea8d20b2b562ac7906a3a62656bfbc1d3033748cdd972"
- name: Operator_2_1
keys:
- "0x80004546cdf353788bd0fb2048c80ecaae4dbd72ed1b9e51d90c0457d57f5e3577778a9710f267aa1e50ce0d5df6fa28"
- "0x80008083f7eb1366eaef3992c48e0ced5dadef0e4405c7b9a0a662322847f98022d970e6a13cf12da9d199b7518562f7"
- "0x80009e291a1e81be05ffce78180bb0a240242466af9613ef8dd34a8f1289f9b9dfc2c98c5d40be4d61f1eb4dec559217"
23 changes: 17 additions & 6 deletions src/common/alertmanager/alerts/BasicAlert.ts
@@ -1,6 +1,6 @@
import { ConfigService } from 'common/config';
import { Epoch } from 'common/consensus-provider/types';
import { ClickhouseService } from 'storage';
import { NOsValidatorsStatusStats } from 'storage/clickhouse/clickhouse.types';
import { RegistrySourceOperator } from 'validators-registry';

export interface AlertRequestBody {
@@ -26,22 +26,33 @@ export abstract class Alert {
protected readonly config: ConfigService;
protected readonly storage: ClickhouseService;
protected readonly operators: RegistrySourceOperator[];

protected constructor(name: string, config: ConfigService, storage: ClickhouseService, operators: RegistrySourceOperator[]) {
protected readonly moduleIndex: number;
protected readonly nosStats: NOsValidatorsStatusStats[];

protected constructor(
name: string,
config: ConfigService,
storage: ClickhouseService,
operators: RegistrySourceOperator[],
moduleIndex: number,
nosStats: NOsValidatorsStatusStats[],
) {
this.alertname = name;
this.config = config;
this.storage = storage;
this.operators = operators;
this.moduleIndex = moduleIndex;
this.nosStats = nosStats;
}

abstract alertRule(bySlot: number): Promise<AlertRuleResult>;
abstract alertRule(): AlertRuleResult;

abstract sendRule(ruleResult?: AlertRuleResult): boolean;

abstract alertBody(ruleResult: AlertRuleResult): AlertRequestBody;

async toSend(epoch: Epoch): Promise<PreparedToSendAlert | undefined> {
const ruleResult = await this.alertRule(epoch);
async toSend(): Promise<PreparedToSendAlert | undefined> {
const ruleResult = await this.alertRule();
if (this.sendRule(ruleResult)) return { timestamp: this.sendTimestamp, body: this.alertBody(ruleResult), ruleResult };
}
}