Skip to content
This repository has been archived by the owner on Jan 12, 2024. It is now read-only.

Commit

Permalink
fix: Warnings from informational SLOs should not affect overall resul…
Browse files Browse the repository at this point in the history
…t of `get-sli.finished` events (#974)

* Remove unneeded problem status code

Signed-off-by: Arthur Pitman <[email protected]>

* Improve tests and messages

Signed-off-by: Arthur Pitman <[email protected]>

* Produce informational SLOs for invalid SLO tile configurations

Signed-off-by: Arthur Pitman <[email protected]>

* Fix dashboard errors

Signed-off-by: Arthur Pitman <[email protected]>

* Clarify GetSLIFinished event factories

Signed-off-by: Arthur Pitman <[email protected]>

* Move `dashboard.TileResult` to `result.SLIWithSLO` and improve function names

Signed-off-by: Arthur Pitman <[email protected]>

* Require SLO in result

Signed-off-by: Arthur Pitman <[email protected]>

* Allow no SLIs to be retrieved and upload SLOs within `dashboard.Processing`

Signed-off-by: Arthur Pitman <[email protected]>

* Require `slo.yaml` file for file-based use case

Signed-off-by: Arthur Pitman <[email protected]>

* Add initial tests demonstrating current behavior

Signed-off-by: Arthur Pitman <[email protected]>

* Remove redundant error message

Signed-off-by: Arthur Pitman <[email protected]>

* Warnings from informational SLOs don't affect overall result

Signed-off-by: Arthur Pitman <[email protected]>

* Add documentation

Signed-off-by: Arthur Pitman <[email protected]>

* Small documentation update

Signed-off-by: Arthur Pitman <[email protected]>

* Polish dashboard processing

Signed-off-by: Arthur Pitman <[email protected]>

* Simplify struct name

Signed-off-by: Arthur Pitman <[email protected]>

* Retrigger

Signed-off-by: Arthur Pitman <[email protected]>

* Retrigger

Signed-off-by: Arthur Pitman <[email protected]>

Signed-off-by: Arthur Pitman <[email protected]>
  • Loading branch information
arthurpitman authored Dec 19, 2022
1 parent 5119781 commit dc9b415
Show file tree
Hide file tree
Showing 48 changed files with 2,150 additions and 817 deletions.
Binary file modified documentation/images/get-sli-finished-event-payload.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file not shown.
3 changes: 3 additions & 0 deletions documentation/sli-provider.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,9 @@ To help you understand the queries used for obtaining the SLIs, the dynatrace-se
## Specifying the units of SLIs based on the Metrics v2 API
The dynatrace-service always returns SLIs in the same units as the underlying metric expression. To convert between units, append a [`:toUnit(<sourceUnit>,<targetUnit>)` transformation](https://www.dynatrace.com/support/help/dynatrace-api/environment-api/metric-v2/metric-selector#to-unit) to the metric expression (e.g. in the **Code** tab of the Data Explorer). For example, `builtin:service.response.time:toUnit(MicroSecond,MilliSecond)` will produce a service response time metric in milliseconds. Alternatively, for file-based SLIs, the [`MV2` prefix](slis-via-files.md#converted-metrics-prefix-mv2) may be used to convert microseconds to milliseconds or bytes to kilobytes in a concise way. For example, `MV2;MicroSecond;metricSelector=builtin:service.response.time:splitBy():avg&entitySelector=type(SERVICE)` will convert the `builtin:service.response.time` metric from microseconds to milliseconds.

## Informational SLOs
SLIs associated with informational SLOs, i.e. without pass or warning criteria, will be retrieved, however any warnings generated while processing them (e.g. *no metric series*) will not affect the overall result of the `sh.keptn.event.get-sli.finished` event.

## Known Limitations

- The Dynatrace Metrics API provides data with the "eventual consistency" approach. Therefore, the metrics data retrieved can be incomplete or even contain inconsistencies for timeframes within two hours of the current time. Usually, it takes a minute to catch up, but in extreme situations this might not be enough. The dynatrace-service tries to mitigate this issue by delaying SLI retrieval by up to 120 seconds in situations where the evaluation end time is close to the current time.
55 changes: 5 additions & 50 deletions documentation/troubleshooting_evaluation-fails.md
Original file line number Diff line number Diff line change
@@ -1,25 +1,11 @@
# Evaluation fails


## Could not retrieve any SLI results

Evaluation fails but the shipyard-controller reports:
> `could not retrieve any SLI results`
![Could not retrieve any SLI results](images/could-not-retrieve-any-sli-results.png)

Likely cause:
- SLIs and SLOs should be defined using YAML files but no SLOs have been defined i.e. no `slo.yaml` file can be found

Suggested solution:


## No evaluation performed by Lighthouse because SLI failed
## Lighthouse failed because

Evaluation fails and the shipyard-controller reports:
> `no evaluation performed by lighthouse because SLI failed....`
> `lighthouse failed because...`
![No evaluation performed by Lighthouse](images/no-evaluation-performed-by-lighthouse.png)
![No evaluation performed by Lighthouse](images/lighthouse-failed-because.png)

In this case the Lighthouse performed no evaluation because the received `sh.keptn.event.get-sli.finished` event had a `result` of `fail`. The root cause can be identified from the `message` field of this event.

Expand All @@ -29,38 +15,7 @@ In instances where this is related to a specific SLI, check the `message` field

The following subsections detail some common messages as well as likely causes and solutions:

### Message: `Could not retrieve any SLI results`

**Likely cause**

- SLIs and SLOs should be defined using YAML files but no SLOs have been defined, i.e. no `slo.yaml` file can be found

**Suggested solution**

- If SLIs and SLOs should be defined using YAML files, define SLOs by creating a `slo.yaml` file, or
- If SLIs and SLOs should be sourced from a dashboard, add a [`dashboard` entry to the `dynatrace/dynatrace.conf.yaml` configuration file](dynatrace-conf-yaml-file.md#dashboard-sli-mode-configuration-dashboard)

### Message: `Metrics API v2 returned zero data points`

**Likely cause**

- No data is available for an SLI during the evaluation timeframe

**Suggested solution**

- Check the availability of data for the associated SLIs during the evaluation timeframe, e.g. using the Data Explorer in the Dynatrace tenant

### Message: `Metrics API v2 returned more than one data point`

**Likely cause**

- Multiple data points are returned for an SLI during the evaluation timeframe where only one is expected

**Suggested solution**

- Ensure the SLI only returns a single data point by e.g. including a `merge` or `splitBy` transformation in the query

### Message: `could not query Dynatrace dashboard for SLIs: error while processing dashboard config '12345678-1234-1234-1234-12345678abcd' - Dynatrace API error (404): Dashboard 12345678-1234-1234-1234-12345678abcd not found`
### Message: `could not query Dynatrace dashboard for SLIs: error while processing dashboard config '12345678-1234-1234-1234-12345678abcd': Dynatrace API error (404): Dashboard 12345678-1234-1234-1234-12345678abcd not found`

**Likely cause**

Expand All @@ -71,7 +26,7 @@ The following subsections detail some common messages as well as likely causes a
- If SLIs and SLOs should be sourced from a dashboard, ensure `dynatrace/dynatrace.conf.yaml` contains a `dashboard: <dashboard-id>` entry with the correct dashboard ID, or
- If SLIs and SLOs should be defined using YAML files, remove the `dashboard: <dashboard-id>` entry from `dynatrace/dynatrace.conf.yaml`

### Message: `could not query Dynatrace dashboard for SLIs: error while processing dashboard config '' - could not find a matching dashboard name - e.g. KQG;project=<project>;service=<service>;stage=<stage>`
### Message: `could not query Dynatrace dashboard for SLIs: error while processing dashboard config '': no dashboard name matches the name specification...`

**Likely cause**

Expand Down
21 changes: 0 additions & 21 deletions internal/dynatrace/problems_v2_client.go
Original file line number Diff line number Diff line change
Expand Up @@ -3,16 +3,12 @@ package dynatrace
import (
"context"
"encoding/json"
"net/url"
"time"

"github.com/keptn-contrib/dynatrace-service/internal/common"
"github.com/keptn-contrib/dynatrace-service/internal/sli/problems"
)

// ProblemStatusOpen is the status of an open problem
const ProblemStatusOpen = "OPEN"

// ProblemsV2Path is the base endpoint for Problems API v2
const ProblemsV2Path = "/api/v2/problems"

Expand Down Expand Up @@ -100,20 +96,3 @@ func (pc *ProblemsV2Client) GetTotalCountByQuery(ctx context.Context, request Pr

return result.TotalCount, nil
}

// GetStatusByID calls the Dynatrace API to retrieve the status of a given problemID.
func (pc *ProblemsV2Client) GetStatusByID(ctx context.Context, problemID string) (string, error) {
body, err := pc.client.Get(ctx, ProblemsV2Path+"/"+url.PathEscape(problemID))
if err != nil {
return "", err
}

// parse response json
var result problem
err = json.Unmarshal(body, &result)
if err != nil {
return "", err
}

return result.Status, nil
}
13 changes: 0 additions & 13 deletions internal/dynatrace/problems_v2_client_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,16 +26,3 @@ func TestProblemsV2Client_GetTotalCountByQuery(t *testing.T) {
assert.NoError(t, err)
assert.EqualValues(t, 1, totalProblemCount)
}

func TestProblemsV2Client_GetStatusById(t *testing.T) {
handler := test.NewFileBasedURLHandler(t)
handler.AddExact("/api/v2/problems/-6004362228644432354_1638271020000V2", "./testdata/test_problemsv2client_getstatusbyid.json")

dtClient, _, teardown := createDynatraceClient(t, handler)
defer teardown()

status, err := NewProblemsV2Client(dtClient).GetStatusByID(context.TODO(), "-6004362228644432354_1638271020000V2")

assert.NoError(t, err)
assert.EqualValues(t, ProblemStatusOpen, status)
}
2 changes: 1 addition & 1 deletion internal/event_handler/error_handler.go
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,7 @@ func (eh ErrorHandler) sendErroredGetSLIFinishedEvent(ctx context.Context, event
if err != nil {
return eh.sendErrorEvent(ctx, eventSenderClient)
}
return eventSenderClient.SendCloudEvent(sli.NewErroredGetSLIFinishedEventFactory(adapter, nil, eh.err))
return eventSenderClient.SendCloudEvent(sli.NewErroredGetSLIFinishedEventFactory(adapter, eh.err))
}

func (eh ErrorHandler) sendErrorEvent(ctx context.Context, eventSenderClient keptn.EventSenderClientInterface) error {
Expand Down
14 changes: 8 additions & 6 deletions internal/keptn/config_client.go
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@ package keptn
import (
"context"
"errors"
"fmt"

keptn "github.com/keptn/go-utils/pkg/lib"
keptnapi "github.com/keptn/go-utils/pkg/lib/keptn"
Expand Down Expand Up @@ -54,6 +53,9 @@ const shipyardFilename = "shipyard.yaml"
const sloFilename = "slo.yaml"
const sliFilename = "dynatrace/sli.yaml"
const configFilename = "dynatrace/dynatrace.conf.yaml"
const shipyardContext = "shipyard"
const slosContext = "SLOs"
const slisContext = "SLIs"

// SLI struct for SLI.yaml
type SLI struct {
Expand Down Expand Up @@ -83,7 +85,7 @@ func (rc *ConfigClient) GetSLOs(ctx context.Context, project string, stage strin
slos := &keptn.ServiceLevelObjectives{}
err = yaml.Unmarshal([]byte(resource), slos)
if err != nil {
return nil, errors.New("invalid SLO file format")
return nil, common.NewUnmarshalYAMLError(slosContext, err)
}

return slos, nil
Expand All @@ -93,7 +95,7 @@ func (rc *ConfigClient) GetSLOs(ctx context.Context, project string, stage strin
func (rc *ConfigClient) UploadSLOs(ctx context.Context, project string, stage string, service string, slos *keptn.ServiceLevelObjectives) error {
yamlAsByteArray, err := yaml.Marshal(slos)
if err != nil {
return fmt.Errorf("could not convert SLOs to YAML: %s", err)
return common.NewMarshalYAMLError(slosContext, err)
}

return rc.client.UploadResource(ctx, yamlAsByteArray, sloFilename, project, stage, service)
Expand All @@ -103,7 +105,7 @@ func (rc *ConfigClient) UploadSLOs(ctx context.Context, project string, stage st
func (rc *ConfigClient) UploadSLIs(ctx context.Context, project string, stage string, service string, slis *SLI) error {
yamlAsByteArray, err := yaml.Marshal(slis)
if err != nil {
return fmt.Errorf("could not convert SLIs to YAML: %s", err)
return common.NewMarshalYAMLError(slisContext, err)
}

return rc.client.UploadResource(ctx, yamlAsByteArray, sliFilename, project, stage, service)
Expand All @@ -124,7 +126,7 @@ func (rc *ConfigClient) GetShipyard(ctx context.Context, project string) (*keptn
shipyard := keptnv2.Shipyard{}
err = yaml.Unmarshal([]byte(shipyardResource), &shipyard)
if err != nil {
return nil, common.NewUnmarshalYAMLError("shipyard", err)
return nil, common.NewUnmarshalYAMLError(shipyardContext, err)
}

return &shipyard, nil
Expand Down Expand Up @@ -200,7 +202,7 @@ func readSLIsFromResource(resource string) (map[string]string, error) {
sliConfig := keptnapi.SLIConfig{}
err := yaml.Unmarshal([]byte(resource), &sliConfig)
if err != nil {
return nil, common.NewUnmarshalYAMLError("SLIs", err)
return nil, common.NewUnmarshalYAMLError(slisContext, err)
}

if len(sliConfig.Indicators) == 0 {
Expand Down
11 changes: 6 additions & 5 deletions internal/sli/dashboard/custom_charting_tile_processing.go
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ import (
"github.com/keptn-contrib/dynatrace-service/internal/common"
"github.com/keptn-contrib/dynatrace-service/internal/dynatrace"
"github.com/keptn-contrib/dynatrace-service/internal/sli/metrics"
"github.com/keptn-contrib/dynatrace-service/internal/sli/result"
)

// CustomChartingTileProcessing represents the processing of a Custom Charting dashboard tile.
Expand All @@ -36,7 +37,7 @@ func NewCustomChartingTileProcessing(client dynatrace.ClientInterface, eventData
}

// Process processes the specified Custom Charting dashboard tile.
func (p *CustomChartingTileProcessing) Process(ctx context.Context, tile *dynatrace.Tile, dashboardFilter *dynatrace.DashboardFilter) []TileResult {
func (p *CustomChartingTileProcessing) Process(ctx context.Context, tile *dynatrace.Tile, dashboardFilter *dynatrace.DashboardFilter) []result.SLIWithSLO {
if tile.FilterConfig == nil {
log.Debug("Skipping custom charting tile as it is missing a filterConfig element")
return nil
Expand All @@ -55,7 +56,7 @@ func (p *CustomChartingTileProcessing) Process(ctx context.Context, tile *dynatr
}

if err != nil {
return []TileResult{newFailedTileResultFromSLODefinition(sloDefinition, "Custom charting tile title parsing error: "+err.Error())}
return []result.SLIWithSLO{result.NewFailedSLIWithSLO(sloDefinition, "Custom charting tile title parsing error: "+err.Error())}
}

// get the tile specific management zone filter that might be needed by different tile processors
Expand All @@ -65,16 +66,16 @@ func (p *CustomChartingTileProcessing) Process(ctx context.Context, tile *dynatr
chartConfig := tile.FilterConfig.ChartConfig
targetUnitID := chartConfig.LeftAxisCustomUnit
if len(chartConfig.Series) != 1 {
return []TileResult{newFailedTileResultFromSLODefinition(sloDefinition, "Custom charting tile must have exactly one series")}
return []result.SLIWithSLO{result.NewFailedSLIWithSLO(sloDefinition, "Custom charting tile must have exactly one series")}
}

return p.processSeries(ctx, sloDefinition, &chartConfig.Series[0], targetUnitID, tileManagementZoneFilter, tile.FilterConfig.FiltersPerEntityType)
}

func (p *CustomChartingTileProcessing) processSeries(ctx context.Context, sloDefinition keptnapi.SLO, series *dynatrace.Series, targetUnitID string, tileManagementZoneFilter *ManagementZoneFilter, filtersPerEntityType map[string]dynatrace.FilterMap) []TileResult {
func (p *CustomChartingTileProcessing) processSeries(ctx context.Context, sloDefinition keptnapi.SLO, series *dynatrace.Series, targetUnitID string, tileManagementZoneFilter *ManagementZoneFilter, filtersPerEntityType map[string]dynatrace.FilterMap) []result.SLIWithSLO {
metricsQuery, err := p.generateMetricQueryFromChartSeries(ctx, series, tileManagementZoneFilter, filtersPerEntityType)
if err != nil {
return []TileResult{newFailedTileResultFromSLODefinition(sloDefinition, "Custom charting tile could not be converted to a metric query: "+err.Error())}
return []result.SLIWithSLO{result.NewFailedSLIWithSLO(sloDefinition, "Custom charting tile could not be converted to a metric query: "+err.Error())}
}

return NewMetricsQueryProcessing(p.client, targetUnitID).Process(ctx, sloDefinition, *metricsQuery, p.timeframe)
Expand Down
86 changes: 54 additions & 32 deletions internal/sli/dashboard/dashboard_error.go
Original file line number Diff line number Diff line change
Expand Up @@ -2,60 +2,82 @@ package dashboard

import "fmt"

// ProcessingError represents a base error that happened while processing a dashboard
// DashboardError represents a base error that happened while getting SLIs and SLOs from a dashboard.
type DashboardError struct {
cause error
}

// NewDashboardError will create a new DashboardError.
func NewDashboardError(cause error) *DashboardError {
return &DashboardError{
cause: cause,
}
}

func (e *DashboardError) Error() string {
return fmt.Sprintf("could not get SLIs from dashboard: %v", e.cause)
}

func (e *DashboardError) Unwrap() error {
return e.cause
}

// ProcessingError represents an error that happened while processing a dashboard.
type ProcessingError struct {
cause error
}

func (pe *ProcessingError) Error() string {
return pe.cause.Error()
// NewProcessingError will create a new ProcessingError.
func NewProcessingError(cause error) *ProcessingError {
return &ProcessingError{
cause: cause,
}
}

func (pe *ProcessingError) Unwrap() error {
return pe.cause
func (e *ProcessingError) Error() string {
return fmt.Sprintf("could not process dashboard: %v", e.cause)
}

// QueryError represents an error that happened while querying a dashboard
type QueryError struct {
err *ProcessingError
func (e *ProcessingError) Unwrap() error {
return e.cause
}

// NewQueryError will create a new QueryError
func NewQueryError(cause error) *QueryError {
return &QueryError{
err: &ProcessingError{
cause: cause,
},
// RetrievalError represents an error that happened while retrieving a dashboard.
type RetrievalError struct {
cause error
}

// NewRetrievalError will create a new RetrievalError.
func NewRetrievalError(cause error) *RetrievalError {
return &RetrievalError{
cause: cause,
}
}

func (e *QueryError) Error() string {
return fmt.Sprintf("could not query Dynatrace dashboard for SLIs: %v", e.err.Error())
func (e *RetrievalError) Error() string {
return fmt.Sprintf("could not retrieve dashboard: %v", e.cause)
}

func (e *QueryError) Unwrap() error {
return e.err
func (e *RetrievalError) Unwrap() error {
return e.cause
}

type UploadFileError struct {
context string
err *ProcessingError
// UploadSLOsError respresents an error that happened while uploading the SLO file.
type UploadSLOsError struct {
cause error
}

// NewUploadFileError will create a new UploadFileError
func NewUploadFileError(context string, cause error) *UploadFileError {
return &UploadFileError{
context: context,
err: &ProcessingError{
cause: cause,
},
// NewUploadSLOsError will create a new UploadSLOsError.
func NewUploadSLOsError(cause error) *UploadSLOsError {
return &UploadSLOsError{
cause: cause,
}
}

func (e *UploadFileError) Error() string {
return fmt.Sprintf("could not upload %s file: %v", e.context, e.err.Error())
func (e *UploadSLOsError) Error() string {
return fmt.Sprintf("could not upload SLO file: %v", e.cause)
}

func (e *UploadFileError) Unwrap() error {
return e.err
func (e *UploadSLOsError) Unwrap() error {
return e.cause
}
Loading

0 comments on commit dc9b415

Please sign in to comment.