Add new telemetry data from event-log index. #140943

Merged
merged 17 commits into elastic:main from 138996-telemetry-event-log on Sep 20, 2022

Conversation

@ersin-erdal ersin-erdal commented Sep 19, 2022

Resolves #138996

This PR is a follow-up to #139901 and covers the rest of the requested telemetry data (numbers 7-9 in the issue) by aggregating data from the event-log index.

To Verify:

  • Create a couple of rules with always-triggering conditions (e.g. an index-threshold rule that checks whether the doc count is above 0).
  • Add at least one action to each rule.

Change the alerting telemetry task run interval to something very short (e.g. 1 minute) by editing this line:

return moment().add(1, 'd').startOf('d').toDate();

Do the same for the actions telemetry task:

return moment().add(1, 'd').startOf('d').toDate();
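
As a minimal sketch of that temporary change (it mirrors the diff discussed in the review below and must be reverted before merging), getNextMidnight() can be switched from days to minutes:

function getNextMidnight() {
  // temporary, for local verification only: schedule the telemetry task one minute out
  return moment().add(1, 'm').startOf('m').toDate();
}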

Check the following fields at https://localhost:5601/api/stats?extended=true&legacy=true:
count_rules_by_execution_status_per_day,
count_connector_types_by_action_run_outcome_per_day,
avg_actions_run_duration_by_connector_type_per_day,
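
For example, against a locally running Kibana (the credentials are placeholders for whatever your dev setup uses):

curl -sk -u <username>:<password> "https://localhost:5601/api/stats?extended=true&legacy=true"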

Edit: Realised that avg_actions_run_duration_by_connector_type_per_day is already in telemetry under the name avg_execution_time_by_type_per_day, so I removed that field and its tests.

count_rules_by_execution_status_per_day,
count_connector_types_by_action_run_outcome_per_day,
avg_actions_run_duration_by_connector_type
@ersin-erdal ersin-erdal added the Feature:Telemetry, release_note:skip, Team:ResponseOps, and v8.5.0 labels Sep 19, 2022
@ersin-erdal ersin-erdal marked this pull request as ready for review September 19, 2022 14:18
@ersin-erdal ersin-erdal requested review from a team as code owners September 19, 2022 14:18
@elasticmachine (Contributor)

Pinging @elastic/response-ops (Team:ResponseOps)

@ersin-erdal ersin-erdal self-assigned this Sep 19, 2022
@mikecote mikecote self-requested a review September 19, 2022 16:03
@afharo (Member) left a comment

Telemetry schema changes LGTM

@@ -140,5 +144,5 @@ export function telemetryTaskRunner(
 }

 function getNextMidnight() {
-  return moment().add(1, 'd').startOf('d').toDate();
+  return moment().add(1, 'm').startOf('m').toDate();
Contributor

Should revert this :)

Contributor Author

haha, fixed. Thanks for catching it.

Contributor Author

> For the actions telemetry, I'm seeing both the connector type id and the rule type id in the results. For example, I currently have a rule with a server log action. The telemetry looks like:
>
> avg_actions_run_duration_by_connector_type_per_day: {
>   __server-log: 814815,
>   example.always-firing: 814815
> },
> count_connector_types_by_action_run_outcome_per_day: {
>   __server-log: {
>     success: 54
>   },
>   example.always-firing: {
>     success: 54
>   }
> },
>
> I think example.always-firing shouldn't be there. Is that right?

Is this from the integration tests or from an instance you ran locally?
I also saw this in the integration tests and thought that somehow the rule id was being used for a test connector, because it is not happening when I run and test Kibana locally.

Contributor

I saw this when running locally with an example.always-firing rule that has a server log connector.

Contributor Author

Solved with b9c8252. I forgot to filter saved objects by type...

@ymao1 (Contributor) left a comment

For the actions telemetry, I'm seeing both the connector type id and the rule type id in the results. For example, I currently have a rule with a server log action. The telemetry looks like:

avg_actions_run_duration_by_connector_type_per_day: {
  __server-log: 814815,
  example.always-firing: 814815
},
count_connector_types_by_action_run_outcome_per_day: {
  __server-log: {
    success: 54
  },
  example.always-firing: {
    success: 54
  }
},

I think example.always-firing shouldn't be there. Is that right?

@mikecote (Contributor) left a comment

I noticed the same issue that @ymao1 reported (#140943 (review)). Once that is fixed, PR LGTM 👍 Tested locally for other issues and didn't uncover any.

@@ -754,6 +784,16 @@ Object {
     __slack: 7,
   },
   countTotal: 120,
+  countRunOutcomeByConnectorType: {
Contributor

I believe the paradigm we follow in other telemetry objects is to have the rule type id or connector type id last. That way, we could search by countRunOutcomeByConnectorType.failed.* and get all the different connector types that have failed, instead of having to look inside each connector type object. WDYT of switching this around to be similar?
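
For illustration only (the counts and the failure entries are made up; the field and connector type names are reused from the examples above), the two shapes being compared look like this:

// current shape: connector type id first, outcome last
countRunOutcomeByConnectorType: {
  __server-log: { success: 54, failure: 2 },
  __slack: { success: 7, failure: 1 },
}

// suggested alternative: outcome first, connector type id last, so a query on
// countRunOutcomeByConnectorType.failure.* lists every connector type that failed
countRunOutcomeByConnectorType: {
  success: { __server-log: 54, __slack: 7 },
  failure: { __server-log: 2, __slack: 1 },
}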

Contributor Author

But the request is "We'd like to know which connector types are failing the most relative to their successful runs."
Wouldn't it be more difficult to get a connector's success/failure ratio?

Contributor

Hmm...I'm not sure either of these options will make calculating the ratio per rule type easier :). We can leave it as is.

Contributor Author

My idea was getting it like:

count_connector_types_by_action_run_outcome_per_day.__slack.failure / count_connector_types_by_action_run_outcome_per_day.__slack.success

but

count_connector_types_by_action_run_outcome_per_day.failure.__slack / count_connector_types_by_action_run_outcome_per_day.success.__slack

would also do the same thing... IDK, I can change it :)
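
Either way, the per-connector ratio is a one-line lookup once the object is in hand; a rough sketch assuming the two shapes sketched above (illustrative only, not code from this PR):

// shape A: { [connectorTypeId]: { success, failure } }
const ratioA = (stats, id) => stats[id].failure / stats[id].success;

// shape B: { success: { [connectorTypeId]: count }, failure: { [connectorTypeId]: count } }
const ratioB = (stats, id) => stats.failure[id] / stats.success[id];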

@ymao1 (Contributor) left a comment

LGTM

@kibana-ci (Collaborator)

💚 Build Succeeded

Metrics [docs]: ✅ unchanged

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ersin-erdal

@ersin-erdal ersin-erdal merged commit 17a25b8 into elastic:main Sep 20, 2022
@kibanamachine kibanamachine added the backport:skip label Sep 20, 2022
@ersin-erdal ersin-erdal deleted the 138996-telemetry-event-log branch September 20, 2022 21:43