Add new telemetry data from event-log index. #140943
count_rules_by_execution_status_per_day, count_connector_types_by_action_run_outcome_per_day, avg_actions_run_duration_by_connector_type
Pinging @elastic/response-ops (Team:ResponseOps)
Telemetry schema changes LGTM
… 138996-telemetry-event-log
@@ -140,5 +144,5 @@ export function telemetryTaskRunner(
 }

 function getNextMidnight() {
-  return moment().add(1, 'd').startOf('d').toDate();
+  return moment().add(1, 'm').startOf('m').toDate();
Should revert this :)
haha, fixed. Thanks for catching it.
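For anyone following along, here is a rough sketch of what the two variants in the hunk above compute. It uses plain `Date` instead of moment so it runs on its own; `nextMidnight` and `nextMinute` are illustrative names, not functions from the PR, and the sample timestamp is made up.

```typescript
// Plain-Date approximation of moment().add(1, 'd').startOf('d').toDate().
// Assumption: local time, same as moment's default behaviour.
function nextMidnight(now: Date): Date {
  const next = new Date(now.getTime());
  next.setHours(24, 0, 0, 0); // hour 24 rolls over to 00:00 of the next day
  return next;
}

// Plain-Date approximation of moment().add(1, 'm').startOf('m').toDate().
// In moment, lowercase 'm' means minutes, so this schedules the next run
// about a minute away rather than at midnight.
function nextMinute(now: Date): Date {
  const next = new Date(now.getTime());
  next.setMinutes(next.getMinutes() + 1, 0, 0); // zero out seconds and ms
  return next;
}

// Hypothetical "current time": 16 Sep 2022, 13:45:30 local time.
const now = new Date(2022, 8, 16, 13, 45, 30);
console.log(nextMidnight(now).toString());
console.log(nextMinute(now).toString());
```

The minute-based variant is handy for local verification but would make the telemetry task run far too often if it were merged.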
For the actions telemetry, I'm seeing both the connector type id and the rule type id in the results. For example, I currently have a rule with a server log action. The telemetry looks like:
avg_actions_run_duration_by_connector_type_per_day: {
  __server-log: 814815,
  example.always-firing: 814815
},
count_connector_types_by_action_run_outcome_per_day: {
  __server-log: { success: 54 },
  example.always-firing: { success: 54 }
},
I think example.always-firing shouldn't be there. Is that right?
Is this from the integration test or from an instance you ran locally?
I also saw this in the integration tests and assumed the rule id was somehow being used for a test connector, because it doesn't happen when I run and test Kibana locally.
I saw this when running locally with an example.always-firing rule with a server log connector.
Solved with b9c8252
Forgot to filter saved objects by type...
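To illustrate the kind of bug described here, a minimal sketch assuming each action-execute event references both the connector and the rule that triggered it. The `SavedObjectRef` and `ActionExecuteEvent` shapes and the `countOutcomesByConnectorType` helper are simplified, hypothetical stand-ins, not the actual event-log schema or the code in b9c8252.

```typescript
// Hypothetical, simplified saved-object reference on an event-log document.
interface SavedObjectRef {
  type: string; // assumed: 'action' for connectors, 'alert' for rules
  type_id: string; // connector type id or rule type id
}

// Hypothetical, simplified action-execute event.
interface ActionExecuteEvent {
  outcome: 'success' | 'failure';
  saved_objects: SavedObjectRef[];
}

function countOutcomesByConnectorType(
  events: ActionExecuteEvent[]
): Record<string, Record<string, number>> {
  const counts: Record<string, Record<string, number>> = {};
  for (const event of events) {
    for (const ref of event.saved_objects) {
      // Without this check the rule reference is counted too, which is how
      // a rule type id can end up next to the connector type ids.
      if (ref.type !== 'action') continue;
      const byOutcome = counts[ref.type_id] ?? (counts[ref.type_id] = {});
      byOutcome[event.outcome] = (byOutcome[event.outcome] ?? 0) + 1;
    }
  }
  return counts;
}

// Sample data only: one server log action run triggered by a rule.
const sampleEvents: ActionExecuteEvent[] = [
  {
    outcome: 'success',
    saved_objects: [
      { type: 'action', type_id: '__server-log' },
      { type: 'alert', type_id: 'example.always-firing' },
    ],
  },
];
console.log(JSON.stringify(countOutcomesByConnectorType(sampleEvents)));
```

With the type check in place only the connector type key remains in the result; dropping the check reproduces the extra rule type key reported above.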
I noticed the same issue that @ymao1 reported (#140943 (review)). Once that is fixed, PR LGTM 👍 Tested locally for other issues and didn't uncover any.
@@ -754,6 +784,16 @@ Object {
     __slack: 7,
   },
   countTotal: 120,
+  countRunOutcomeByConnectorType: {
I believe the paradigm we follow in other telemetry objects is to have the rule type id or connector type id last. That way, we could search by countRunOutcomeByConnectorType.failed.* and get all the different connector types that have failed, instead of having to look inside each connector type object. WDYT of switching this around to be similar?
But the request is "We'd like to know which connector types are failing the most relative to their successful runs."
Wouldn't that make it more difficult to get a connector's success/failure ratio?
Hmm...I'm not sure either of these options will make calculating the ratio per rule type easier :). We can leave it as is.
My idea was getting it like:
count_connector_types_by_action_run_outcome_per_day.__slack.failure / count_connector_types_by_action_run_outcome_per_day.__slack.success
but
count_connector_types_by_action_run_outcome_per_day.failure.__slack / count_connector_types_by_action_run_outcome_per_day.success.__slack
would also do the same thing... IDK, I can change it :)
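A quick sketch of the two layouts discussed in this thread, to show that the ratio comes out the same either way. The helper names and the sample counts are made up for illustration; only the `success` / `failure` keys and the `__slack` key come from the discussion.

```typescript
type Counts = Record<string, Record<string, number>>;

// Current layout: counts[connectorType][outcome]
function ratioByTypeFirst(counts: Counts, connectorType: string): number | undefined {
  const success = counts[connectorType]?.success ?? 0;
  const failure = counts[connectorType]?.failure ?? 0;
  // Undefined when there are no successful runs to compare against.
  return success > 0 ? failure / success : undefined;
}

// Suggested layout: counts[outcome][connectorType]
function ratioByOutcomeFirst(counts: Counts, connectorType: string): number | undefined {
  const success = counts.success?.[connectorType] ?? 0;
  const failure = counts.failure?.[connectorType] ?? 0;
  return success > 0 ? failure / success : undefined;
}

// Hypothetical sample counts, same data in both layouts.
const byType: Counts = { __slack: { success: 50, failure: 5 } };
const byOutcome: Counts = { success: { __slack: 50 }, failure: { __slack: 5 } };

console.log(ratioByTypeFirst(byType, '__slack'));
console.log(ratioByOutcomeFirst(byOutcome, '__slack'));
```

Either layout needs two lookups per connector type, so the choice mostly affects wildcard searches such as failure.* rather than the ratio itself.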
LGTM
💚 Build Succeeded
To update your PR or re-run it, just comment with: cc @ersin-erdal
Resolves #138996
This PR is a follow-on of #139901 and covers the rest of the requested telemetry data (numbers 7-9 in the issue) by aggregating data from the event-log index.
To Verify:
Change the alerting telemetry task run interval to something very short (e.g. 1 min)
kibana/x-pack/plugins/alerting/server/usage/task.ts
Line 197 in 01ecbd4
Do the same for the actions telemetry task too
kibana/x-pack/plugins/actions/server/usage/task.ts
Line 143 in 01ecbd4
Check the data below at https://localhost:5601/api/stats?extended=true&legacy=true:
count_rules_by_execution_status_per_day
count_connector_types_by_action_run_outcome_per_day
avg_actions_run_duration_by_connector_type_per_day
Edit: Realised that avg_actions_run_duration_by_connector_type_per_day is already in telemetry under the name avg_execution_time_by_type_per_day, so I removed that field and its tests.
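If it helps with verification, here is a small sketch for pulling the new fields out of the stats response once it has been saved as JSON. The exact nesting of the response is not described in this PR, so `findKey` (a hypothetical helper) just searches the whole object for a key name, and the sample payload below is invented.

```typescript
// Depth-first search for the first property with the given name.
// Assumption: the field name is unique enough that the first match is the one we want.
function findKey(node: unknown, key: string): unknown {
  if (node === null || typeof node !== 'object') return undefined;
  const obj = node as Record<string, unknown>;
  if (key in obj) return obj[key];
  for (const child of Object.values(obj)) {
    const found = findKey(child, key);
    if (found !== undefined) return found;
  }
  return undefined;
}

// Invented payload; the real response will be nested differently.
const statsResponse: unknown = {
  usage: {
    alerts: { count_rules_by_execution_status_per_day: { success: 10 } },
    actions: {
      count_connector_types_by_action_run_outcome_per_day: {
        '__server-log': { success: 54 },
      },
    },
  },
};

for (const field of [
  'count_rules_by_execution_status_per_day',
  'count_connector_types_by_action_run_outcome_per_day',
]) {
  console.log(field, JSON.stringify(findKey(statsResponse, field)));
}
```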