use sum stat for lambda detectors #369

Merged
merged 7 commits into from
Feb 18, 2022
21 changes: 21 additions & 0 deletions modules/integration_aws-lambda/README.md
@@ -8,6 +8,10 @@
- [What are the available detectors in this module?](#what-are-the-available-detectors-in-this-module)
- [How to collect required metrics?](#how-to-collect-required-metrics)
- [Metrics](#metrics)
- [Notes](#notes)
- [Lambda Metrics Lag](#lambda-metrics-lag)
- [About `pct_errors` detector](#about-pct_errors-detector)
- [About `invocations` detector](#about-invocations-detector)
- [Related documentation](#related-documentation)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->
@@ -100,6 +104,23 @@ Here is the list of required metrics for detectors in this module.
* `Throttles`


## Notes

### Lambda Metrics Lag

* All detectors have a `max_delay` of 600s to accommodate the lag that can occur when ingesting AWS CloudWatch metrics for the Lambda namespace.
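
As a rough illustration of why a fixed `max_delay` helps, the sketch below models a detector that only counts datapoints arriving within the allowed delay. The function name and the sample lags are hypothetical, and this is a simplified model, not a description of how the backend is implemented.

```python
def datapoints_seen(arrival_lags_s, max_delay_s):
    """Count the datapoints that arrived within the allowed delay."""
    return sum(1 for lag in arrival_lags_s if lag <= max_delay_s)


# Hypothetical ingestion lags (in seconds) for five Lambda datapoints.
lags = [45, 120, 310, 540, 590]

seen_short = datapoints_seen(lags, max_delay_s=60)
seen_long = datapoints_seen(lags, max_delay_s=600)
print(seen_short, seen_long)
```

With these made-up lags, a short delay would evaluate the signal on a fraction of the data, while 600s leaves room for all of it.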

### About `pct_errors` detector

* This detector uses a `last_value` extrapolation so that the alert persists until a new execution of the same Lambda function
succeeds. Depending on how often the function executes, the alert may take time to resolve itself.
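
The effect of holding the last known value can be sketched as follows. This is a minimal model, assuming one datapoint per interval and `None` for intervals in which the function did not run; the helper name, the sample series, and the threshold of 90 are hypothetical.

```python
def extrapolate_last_value(series):
    """Replace missing datapoints (None) with the most recent known value."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# The function fails once, stays idle for two intervals, then succeeds.
errors = [0, 1, None, None, 0]
invocations = [1, 1, None, None, 1]
threshold = 90  # hypothetical critical threshold, in percent

filled_errors = extrapolate_last_value(errors)
filled_invocations = extrapolate_last_value(invocations)
pct = [e / i * 100 for e, i in zip(filled_errors, filled_invocations)]
alerting = [p > threshold for p in pct]
print(alerting)
```

In this sketch the alert stays raised through the idle intervals and clears only when the next successful execution reports a datapoint.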

### About `invocations` detector

* The goal of this detector is to trigger an alert if a function did not execute at least `invocations_threshold_major` times
(once by default) over the timeframe defined by `invocations_transformation_function`. It can be useful to ensure that a regularly
scheduled function, such as a cron-based one, has run as expected. You should disable it for erratic or event-based functions (unless
you expect to see enough events in your timeframe).
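
A minimal sketch of this check, assuming the window is already split into per-interval invocation counts and that missing datapoints count as zero (mirroring `extrapolation: zero` in this detector's configuration). The function name is hypothetical.

```python
def missed_schedule(invocations_per_interval, threshold=1):
    """Return True when the window's total invocations are below the threshold.

    Missing datapoints (None) are counted as zero.
    """
    total = sum(v or 0 for v in invocations_per_interval)
    return total < threshold


# A cron-like function that never ran during the window.
print(missed_schedule([None, None, None]))
# The same function after one execution inside the window.
print(missed_schedule([None, 1, None]))
```

Counting missing datapoints as zero is what lets the window total drop below the threshold when the function stops running entirely.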


## Related documentation
@@ -3,19 +3,20 @@ name: errors percentage
id: pct_errors

transformation: true
-aggregation: true
-filtering: "filter('namespace', 'AWS/Lambda') and filter('stat', 'mean') and filter('Resource', '*')"
+aggregation: ".mean(by=['FunctionName'])"
+filtering: "filter('namespace', 'AWS/Lambda') and filter('stat', 'sum') and filter('Resource', '*')"
value_unit: "%"
+max_delay: 600

signals:
errors:
metric: Errors
extrapolation: last_value
-rollup: average
+rollup: sum
invocations:
metric: Invocations
extrapolation: last_value
-rollup: average
+rollup: sum
signal:
formula: (errors/invocations).scale(100).fill(value=0)
rules:
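
The `formula` above can be read as the following Python sketch. It assumes that a division producing no value (missing input or zero invocations) is treated as missing and then replaced by 0, which is how `.fill(value=0)` is interpreted here; the function name is hypothetical.

```python
def pct_errors(errors, invocations):
    """Error percentage, with missing or undefined results filled with 0."""
    if errors is None or invocations is None or invocations == 0:
        return 0.0  # stands in for .fill(value=0)
    return errors / invocations * 100  # stands in for .scale(100)


print(pct_errors(3, 12))
print(pct_errors(0, 0))
```

With the `sum` stat both inputs are counts over the same interval, so their ratio is a true error rate.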
7 changes: 4 additions & 3 deletions modules/integration_aws-lambda/conf/02-throttles.yaml
@@ -3,14 +3,15 @@ name: invocations throttled
id: throttles

transformation: ".sum(over='1h')"
-aggregation: true
-filtering: "filter('namespace', 'AWS/Lambda') and filter('stat', 'mean') and filter('Resource', '*')"
+aggregation: ".mean(by=['FunctionName'])"
+filtering: "filter('namespace', 'AWS/Lambda') and filter('stat', 'sum') and filter('Resource', '*')"
+max_delay: 600

signals:
signal:
metric: Throttles
extrapolation: last_value
-rollup: average
+rollup: sum
rules:
critical:
threshold: 1
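
To see why the rollup matters for a counter-like metric such as `Throttles`, the sketch below rolls five hypothetical one-minute datapoints up into a single coarser datapoint using both methods. The helper and the sample values are illustrative only, and the `.sum(over='1h')` transformation is ignored for simplicity.

```python
def rollup(values, method):
    """Combine fine-grained datapoints into one coarser datapoint."""
    if method == "sum":
        return sum(values)
    if method == "average":
        return sum(values) / len(values)
    raise ValueError(f"unsupported rollup: {method}")


throttles_per_minute = [0, 3, 0, 2, 0]  # hypothetical counts
critical_threshold = 1  # from the rules above

rolled_sum = rollup(throttles_per_minute, "sum")
rolled_avg = rollup(throttles_per_minute, "average")
print(rolled_sum > critical_threshold, rolled_avg > critical_threshold)
```

With these made-up counts, the summed datapoint crosses the threshold while the averaged one does not, even though five throttles occurred.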
9 changes: 5 additions & 4 deletions modules/integration_aws-lambda/conf/03-invocations.yaml
@@ -2,15 +2,16 @@ module: AWS Lambda
name: invocations

transformation: ".sum(over='1h')"
-aggregation: true
-filtering: "filter('namespace', 'AWS/Lambda') and filter('stat', 'mean') and filter('Resource', '*')"
+aggregation: ".mean(by=['FunctionName'])"
+filtering: "filter('namespace', 'AWS/Lambda') and filter('stat', 'sum') and filter('Resource', '*')"
disabled: true
+max_delay: 600

signals:
signal:
metric: Invocations
-extrapolation: last_value
-rollup: average
+extrapolation: zero
+rollup: sum
rules:
major:
threshold: 1
21 changes: 19 additions & 2 deletions modules/integration_aws-lambda/conf/readme.yaml
@@ -1,5 +1,22 @@
documentations:
- name: CloudWatch metrics
url: 'https://docs.aws.amazon.com/lambda/latest/dg/monitoring-metrics.html'
url: "https://docs.aws.amazon.com/lambda/latest/dg/monitoring-metrics.html"
- name: Splunk Observability metrics
url: 'https://docs.splunk.com/Observability/gdi/get-data-in/connect/aws/aws-metrics.html#aws-lambda'
url: "https://docs.splunk.com/Observability/gdi/get-data-in/connect/aws/aws-metrics.html#aws-lambda"

notes: |
### Lambda Metrics Lag

* All detectors have a `max_delay` of 600s to accommodate the lag that can occur when ingesting AWS CloudWatch metrics for the Lambda namespace.

### About `pct_errors` detector

* This detector uses a `last_value` extrapolation so that the alert persists until a new execution of the same Lambda function
succeeds. Depending on how often the function executes, the alert may take time to resolve itself.

### About `invocations` detector

* The goal of this detector is to trigger an alert if a function did not execute at least `invocations_threshold_major` times
(once by default) over the timeframe defined by `invocations_transformation_function`. It can be useful to ensure that a regularly
scheduled function, such as a cron-based one, has run as expected. You should disable it for erratic or event-based functions (unless
you expect to see enough events in your timeframe).
14 changes: 7 additions & 7 deletions modules/integration_aws-lambda/detectors-gen.tf
@@ -11,9 +11,9 @@ resource "signalfx_detector" "pct_errors" {
}

program_text = <<-EOF
-base_filtering = filter('namespace', 'AWS/Lambda') and filter('stat', 'mean') and filter('Resource', '*')
-errors = data('Errors', filter=base_filtering and ${module.filtering.signalflow}, rollup='average', extrapolation='last_value')${var.pct_errors_aggregation_function}${var.pct_errors_transformation_function}
-invocations = data('Invocations', filter=base_filtering and ${module.filtering.signalflow}, rollup='average', extrapolation='last_value')${var.pct_errors_aggregation_function}${var.pct_errors_transformation_function}
+base_filtering = filter('namespace', 'AWS/Lambda') and filter('stat', 'sum') and filter('Resource', '*')
+errors = data('Errors', filter=base_filtering and ${module.filtering.signalflow}, rollup='sum', extrapolation='last_value')${var.pct_errors_aggregation_function}${var.pct_errors_transformation_function}
+invocations = data('Invocations', filter=base_filtering and ${module.filtering.signalflow}, rollup='sum', extrapolation='last_value')${var.pct_errors_aggregation_function}${var.pct_errors_transformation_function}
signal = (errors/invocations).scale(100).fill(value=0).publish('signal')
detect(when(signal > ${var.pct_errors_threshold_critical}, lasting=%{if var.pct_errors_lasting_duration_critical == null}None%{else}'${var.pct_errors_lasting_duration_critical}'%{endif}, at_least=${var.pct_errors_at_least_percentage_critical})).publish('CRIT')
detect(when(signal > ${var.pct_errors_threshold_major}, lasting=%{if var.pct_errors_lasting_duration_major == null}None%{else}'${var.pct_errors_lasting_duration_major}'%{endif}, at_least=${var.pct_errors_at_least_percentage_major}) and (not when(signal > ${var.pct_errors_threshold_critical}, lasting=%{if var.pct_errors_lasting_duration_critical == null}None%{else}'${var.pct_errors_lasting_duration_critical}'%{endif}, at_least=${var.pct_errors_at_least_percentage_critical}))).publish('MAJOR')
@@ -54,8 +54,8 @@ resource "signalfx_detector" "throttles" {
tags = compact(concat(local.common_tags, local.tags, var.extra_tags))

program_text = <<-EOF
-base_filtering = filter('namespace', 'AWS/Lambda') and filter('stat', 'mean') and filter('Resource', '*')
-signal = data('Throttles', filter=base_filtering and ${module.filtering.signalflow}, rollup='average', extrapolation='last_value')${var.throttles_aggregation_function}${var.throttles_transformation_function}.publish('signal')
+base_filtering = filter('namespace', 'AWS/Lambda') and filter('stat', 'sum') and filter('Resource', '*')
+signal = data('Throttles', filter=base_filtering and ${module.filtering.signalflow}, rollup='sum', extrapolation='last_value')${var.throttles_aggregation_function}${var.throttles_transformation_function}.publish('signal')
detect(when(signal > ${var.throttles_threshold_critical}, lasting=%{if var.throttles_lasting_duration_critical == null}None%{else}'${var.throttles_lasting_duration_critical}'%{endif}, at_least=${var.throttles_at_least_percentage_critical})).publish('CRIT')
detect(when(signal > ${var.throttles_threshold_major}, lasting=%{if var.throttles_lasting_duration_major == null}None%{else}'${var.throttles_lasting_duration_major}'%{endif}, at_least=${var.throttles_at_least_percentage_major}) and (not when(signal > ${var.throttles_threshold_critical}, lasting=%{if var.throttles_lasting_duration_critical == null}None%{else}'${var.throttles_lasting_duration_critical}'%{endif}, at_least=${var.throttles_at_least_percentage_critical}))).publish('MAJOR')
EOF
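
The two `detect(...)` statements above make the severities mutually exclusive: MAJOR fires only when the major condition holds and the critical one does not. A minimal Python sketch of that selection, ignoring the `lasting` and `at_least` parameters and using hypothetical thresholds:

```python
def severity(signal, major_threshold, critical_threshold):
    """Return the single severity that applies to the signal, or None."""
    crit = signal > critical_threshold
    major = signal > major_threshold and not crit
    if crit:
        return "CRIT"
    if major:
        return "MAJOR"
    return None


print(severity(5, major_threshold=1, critical_threshold=3))
print(severity(2, major_threshold=1, critical_threshold=3))
print(severity(0, major_threshold=1, critical_threshold=3))
```

This avoids raising two alerts for the same function when the signal exceeds both thresholds.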
@@ -95,8 +95,8 @@ resource "signalfx_detector" "invocations" {
tags = compact(concat(local.common_tags, local.tags, var.extra_tags))

program_text = <<-EOF
-base_filtering = filter('namespace', 'AWS/Lambda') and filter('stat', 'mean') and filter('Resource', '*')
-signal = data('Invocations', filter=base_filtering and ${module.filtering.signalflow}, rollup='average', extrapolation='last_value')${var.invocations_aggregation_function}${var.invocations_transformation_function}.publish('signal')
+base_filtering = filter('namespace', 'AWS/Lambda') and filter('stat', 'sum') and filter('Resource', '*')
+signal = data('Invocations', filter=base_filtering and ${module.filtering.signalflow}, rollup='sum', extrapolation='zero')${var.invocations_aggregation_function}${var.invocations_transformation_function}.publish('signal')
detect(when(signal < ${var.invocations_threshold_major}, lasting=%{if var.invocations_lasting_duration_major == null}None%{else}'${var.invocations_lasting_duration_major}'%{endif}, at_least=${var.invocations_at_least_percentage_major})).publish('MAJOR')
EOF

12 changes: 6 additions & 6 deletions modules/integration_aws-lambda/variables-gen.tf
@@ -9,7 +9,7 @@ variable "pct_errors_notifications" {
variable "pct_errors_aggregation_function" {
description = "Aggregation function and group by for pct_errors detector (i.e. \".mean(by=['host'])\")"
type = string
-default = ""
+default = ".mean(by=['FunctionName'])"
}

variable "pct_errors_transformation_function" {
@@ -21,7 +21,7 @@ variable "pct_errors_transformation_function" {
variable "pct_errors_max_delay" {
description = "Enforce max delay for pct_errors detector (use \"0\" or \"null\" for \"Auto\")"
type = number
-default = null
+default = 600
}

variable "pct_errors_tip" {
@@ -99,7 +99,7 @@ variable "throttles_notifications" {
variable "throttles_aggregation_function" {
description = "Aggregation function and group by for throttles detector (i.e. \".mean(by=['host'])\")"
type = string
-default = ""
+default = ".mean(by=['FunctionName'])"
}

variable "throttles_transformation_function" {
@@ -111,7 +111,7 @@ variable "throttles_transformation_function" {
variable "throttles_max_delay" {
description = "Enforce max delay for throttles detector (use \"0\" or \"null\" for \"Auto\")"
type = number
-default = null
+default = 600
}

variable "throttles_tip" {
@@ -189,7 +189,7 @@ variable "invocations_notifications" {
variable "invocations_aggregation_function" {
description = "Aggregation function and group by for invocations detector (i.e. \".mean(by=['host'])\")"
type = string
-default = ""
+default = ".mean(by=['FunctionName'])"
}

variable "invocations_transformation_function" {
@@ -201,7 +201,7 @@ variable "invocations_transformation_function" {
variable "invocations_max_delay" {
description = "Enforce max delay for invocations detector (use \"0\" or \"null\" for \"Auto\")"
type = number
-default = null
+default = 600
}

variable "invocations_tip" {