Uptime - Enable Suites Alerts #123969

Closed
dominiqueclarke opened this issue Jan 27, 2022 · 6 comments
Labels
enhancement New value added to drive a business result Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability

Comments

@dominiqueclarke
Contributor

As an Uptime user, I want to be notified when I have suites errors, to ensure that all the monitors I expect to run are actually running.

Background

As part of #123426, individual journeys in a suite will be identified during Heartbeat's discovery phase and reported to Uptime as suite documents with a list of individual journeys.

If there is an error reported on a journey, that journey will not run, but the error will be displayed on the Suites tab.

AC

User should be able to create an alert to notify on new Suites error documents.

TBC: How do we want to handle configuration for this? Will it be a standard alert criterion that we choose, or will we allow the user to have some customization of the alert criteria or thresholds like Monitor Status alerts?

@botelastic botelastic bot added the needs-team Issues missing a team label label Jan 27, 2022
@dominiqueclarke dominiqueclarke added the Team:Uptime - DEPRECATED Synthetics & RUM sub-team of Application Observability label Jan 27, 2022
@elasticmachine
Contributor

Pinging @elastic/uptime (Team:uptime)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 27, 2022
@dominiqueclarke dominiqueclarke added enhancement New value added to drive a business result needs-team Issues missing a team label labels Jan 27, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 27, 2022
@dominiqueclarke
Contributor Author

@paulb-elastic We've shifted strategy a lot on suites, and I'm not sure if this is still relevant.

The push command itself seems like the appropriate place to report back configuration errors to the user in most cases, preventing them from pushing the configuration entirely and failing that CI step.

For errors that are not caught, we do have elastic/uptime#439 in place, which will display errors when the journey cannot run. What we don't have is an alerting system for this. Should this still be prioritized for 8.3?

cc: @andrewvc @vigneshshanmugam

@paulb-elastic
Contributor

I agree, this no longer seems to fit in with the new push approach.

I've made a comment here about error reporting from running push.

What we don't have is an alerting system for this. Should this still be prioritized for 8.3?

I'm not sure I understand what this refers to. Do you mean that we need to extend elastic/uptime#439 to show errors for externally generated (suite) monitors, not just inline ones?

@dominiqueclarke
Contributor Author

@paulb-elastic As far as I understand, @shahzad31 correct me if I'm wrong, "inline" is a bit of a misnomer for elastic/uptime#439 as the implementation picks up on error documents generated by heartbeat. These errors should be generated when a journey fails to run, regardless of what the source is. So the existing error behavior found in elastic/uptime#439 should "just work" with monitors created via push.

That being said, I think we're at a lower risk of ever having monitors with configuration errors pushed, since the push command could evaluate the code and report any syntax errors ahead of time. Indeed, in a suites model, the engineering team will also likely be using eslint to its fullest. The biggest error we discussed with suites was two monitors having the same id or the same name, and I would expect the push command to report that back to the user before pushing.
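As a rough illustration, the duplicate id/name check described above could look something like the following. This is a hypothetical sketch, not the actual push implementation; `MonitorConfig` and `findDuplicates` are made-up names.

```typescript
// Hypothetical pre-push validation sketch (not how push actually works):
// flag monitors that share an id or a name, which this thread identifies
// as the most likely suite configuration error.
interface MonitorConfig {
  id: string;
  name: string;
}

function findDuplicates(monitors: MonitorConfig[], key: 'id' | 'name'): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const m of monitors) {
    const value = m[key];
    if (seen.has(value)) dupes.add(value);
    seen.add(value);
  }
  return Array.from(dupes);
}
```

Running a check like this at push time would let the CLI reject the suite before anything reaches Uptime, which is the behavior speculated about above.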

What we don't have is an alerting system for this. Should this still be prioritized for 8.3?

I'm realizing this was kinda misspoken on my part. Because the journeys that failed to run still generate a summary document, they are considered down, and we already have alerting for down state.

I think it's a product question whether our alerting should be more sensitive to the differences in "down" states. Perhaps we can filter out down states for journeys that failed to run, and then create a new alert for those journeys. I'll leave it up to @drewpost to determine whether we want to look into this type of alerting for Uptime, or punt to Synthetics.
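For illustration, a filter along these lines might distinguish "down because the check failed" from "down because the journey never ran" in a KQL query. This is a sketch only: `monitor.status` and `error.message` are standard Heartbeat fields, but whether failed-to-run journeys are reliably identifiable this way is an assumption, not something decided in this thread.

```
monitor.status: "down" and error.message: *
```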

@paulb-elastic
Contributor

Thanks for the clarification @dominiqueclarke

I am thinking about...

Because the journeys that failed to run still generate a summary document, they are considered down, and we already have alerting for down state

... I am trying to work out what push does to evaluate the correctness of a journey, and whether a monitor could be pushed and saved but still fail to execute (e.g. due to a syntax error, though I wonder if there are other cases)?

If it does fail to execute and is marked as down, is that even correct (i.e. it would trigger an alert that "your website is down", which is incorrect)? That's what this issue was originally about: being able to have a different type of alert (quite possibly notifying a different team that there's a problem).

@andrewvc
Contributor

I agree that this issue is no longer relevant. I'm going to close this one so we can move the discussion to #123928 and elsewhere.

The push command does not and cannot evaluate the correctness of a journey beyond it:

  1. Being valid JavaScript/TypeScript
  2. Having at least the journey and step syntax more or less correct
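As a rough illustration of what such a shallow check could look like: this is a hypothetical sketch, not how push actually validates. `looksLikeJourney` is a made-up name, and a real implementation would parse the source (e.g. with the TypeScript compiler API) rather than use regexes.

```typescript
// Hypothetical sketch of a shallow pre-push check: verify the source at
// least appears to contain `journey(...)` and `step(...)` calls. A real
// check would parse the code; regexes are used here only for brevity.
function looksLikeJourney(source: string): boolean {
  const hasJourney = /\bjourney\s*\(/.test(source);
  const hasStep = /\bstep\s*\(/.test(source);
  return hasJourney && hasStep;
}
```

A check like this can only catch gross structural problems; runtime failures (a bad selector, an unreachable URL) would still surface only when the journey actually executes, which is the point being made above.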

The push approach really reduces the surface area of possible errors, so that's generally good news. It already, by nature of how it works, handles all the cases this would have handled, at least in any initial implementation.

If there's any disagreement re: closing this, feel free to re-open or ping in Slack.
