
[licensing] intermittent "license is not available" causing alerting rules to fail to execute #117394

Closed
pmuellr opened this issue Nov 3, 2021 · 9 comments · Fixed by #170006

Comments

pmuellr (Member) commented Nov 3, 2021

We currently do license checks when running alerting rules, to ensure the rules can be run with the given license. We track the licensing state in this code: x-pack/plugins/alerting/server/lib/license_state.ts.

We've seen errors in the logs occasionally, especially when Kibana is under stress, where the license information gets set internally in our code as "unavailable". From looking at the alerting code, I think this comes in from the license subscription we are subscribed to. Typically these "outages" are fairly short, around 20 seconds, but for those 20 seconds rules will fail with logged messages like:

Alert type siem.signals is disabled because license information is not available at this time

Once the license info is available again, everything works normally.

It feels like a 20-second "outage" in licensing shouldn't actually cause the rules to fail. We could assume the license is unchanged since it was last checked, though it's not clear what the limit on that time frame should be. Or, whenever the state goes from anything -> unavailable, just assume the most recent license seen is still in effect.
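
For the sake of discussion, a rough sketch of the "fall back to the last seen license" idea; everything here (LicenseLike, MAX_STALE_MS, resolveLicense) is a hypothetical name, not the actual license_state.ts API:

// Minimal sketch, assuming a license object that exposes an isAvailable flag
// (similar in spirit to the ILicense type the licensing plugin hands out).
interface LicenseLike {
  isAvailable: boolean;
}

// Hypothetical grace period during which a stale-but-valid license is
// preferred over an "unavailable" one.
const MAX_STALE_MS = 60_000;

let lastValidLicense: LicenseLike | undefined;
let lastValidAt = 0;

function resolveLicense(incoming: LicenseLike, now = Date.now()): LicenseLike {
  if (incoming.isAvailable) {
    lastValidLicense = incoming;
    lastValidAt = now;
    return incoming;
  }
  // License info is temporarily unavailable: reuse the last good license
  // if it is recent enough, otherwise surface the unavailable state.
  if (lastValidLicense && now - lastValidAt <= MAX_STALE_MS) {
    return lastValidLicense;
  }
  return incoming;
}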

Labelling for both Core and Alerting: it seems like if we could agree on some "common" behavior here in licensing, that would be the place to fix it, so other plugins using licensing also won't be affected. But maybe this is something we should just do for Alerting, in the module referenced above.

elasticmachine (Contributor) commented
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

elasticmachine (Contributor) commented
Pinging @elastic/kibana-core (Team:Core)

pgayvallet (Contributor) commented
Yea, at the moment the licensing plugin does not perform any kind of retry in case of a network error, or if ES returns an error of any kind, meaning that this kind of (potentially temporary) upstream outage results in the license$ observable emitting an error until the next (successful) license fetch:

} catch (error) {
  this.logger.warn(
    `License information could not be obtained from Elasticsearch due to ${error} error`
  );
  const errorMessage = this.getErrorMessage(error);
  const signature = sign({ error: errorMessage });
  return new License({
    error: this.getErrorMessage(error),
    signature,
  });
}

Labelling for both Core and Alerting: it seems like if we could agree on some "common" behavior here in licensing, that would be the place to fix it, so other plugins using licensing also won't be affected. But maybe this is something we should just do for Alerting, in the module referenced above.

That's a good question. I think it would make sense to have the licensing plugin be more robust against these temporary failures, and I think all consumers of this API could benefit from such robustness.

Maybe improving createLicensePoller and/or createLicenseUpdate to perform retries in case of such failures would make sense. The hardest part would probably be to identify which errors should be considered as retry-able (kinda similar to what we're doing with retries in the SO migration).
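
For illustration only, a retry inside the license$ pipeline could look roughly like this; fetchLicense and isRetryableError are hypothetical placeholders, not the plugin's actual functions:

import { defer, timer, throwError } from 'rxjs';
import { mergeMap, retryWhen } from 'rxjs/operators';

// Placeholder stand-ins for the real license fetch and error classification
// (hypothetical names, not the licensing plugin's actual API).
async function fetchLicense(): Promise<{ isAvailable: boolean }> {
  return { isAvailable: true };
}
function isRetryableError(error: unknown): boolean {
  return true; // in practice: connection/timeout/503-style errors only
}

const MAX_RETRIES = 3;
const BASE_DELAY_MS = 1_000;

// Retry the fetch a few times with exponential backoff before letting the
// error reach license$ subscribers.
const license$ = defer(() => fetchLicense()).pipe(
  retryWhen((failures) =>
    failures.pipe(
      mergeMap((error, attempt) =>
        attempt < MAX_RETRIES && isRetryableError(error)
          ? timer(BASE_DELAY_MS * 2 ** attempt)
          : throwError(() => error)
      )
    )
  )
);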

cc @mshustov wdyt?

mshustov (Contributor) commented Nov 8, 2021

I think it would make sense to have the licensing plugin be more robust against these temporary failures, and I think all consumers of this API could benefit from such robustness.

yeah, it makes sense for the licensing API to prevent these cascading side effects.

The hardest part would probably be to identify which errors should be considered as retry-able (kinda similar to what we're doing with retries in the SO migration).

It might be better to extract this logic and provide it from the Elasticsearch service, maybe?
All of the network-level problems can be considered intermittent:

  • NoLivingConnectionsError
  • ConnectionError
  • TimeoutError
  • 503 ServiceUnavailable
  • 408 Request timeout

Not sure about others from https://github.com/elastic/kibana/blob/main/src/core/server/saved_objects/migrationsv2/actions/catch_retryable_es_client_errors.ts
Also, we can cap the maximum number of retries and use exponential backoff to avoid DDoSing the Elasticsearch server.
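
As a sketch of what that predicate could look like (using the error classes exported by @elastic/elasticsearch; the exact retryable set is still the open question above, and isRetryableEsError is a hypothetical name):

import { errors } from '@elastic/elasticsearch';

// HTTP status codes treated as transient here, per the list above.
const RETRYABLE_STATUS_CODES = new Set([408, 503]);

export function isRetryableEsError(error: unknown): boolean {
  if (
    error instanceof errors.NoLivingConnectionsError ||
    error instanceof errors.ConnectionError ||
    error instanceof errors.TimeoutError
  ) {
    return true;
  }
  if (error instanceof errors.ResponseError) {
    return RETRYABLE_STATUS_CODES.has(error.statusCode ?? 0);
  }
  return false;
}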

rudolf (Contributor) commented Nov 8, 2021

410 gets returned by some proxies (including on cloud)

I'm not sure about 401 / 403, but probably even if there's an authentication issue we want to retry because someone might fix the credentials.

We could probably use migrationRetryCallCluster as a starting point. I agree exponential back-off would be good to add to it.
https://github.com/elastic/kibana/blob/test-es-deprecations/src/core/server/elasticsearch/client/retry_call_cluster.ts#L61-L92
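
Not the actual migrationRetryCallCluster implementation, just a sketch of what a generic retry wrapper with capped exponential back-off could look like (retryWithBackoff is a hypothetical name):

const MAX_ATTEMPTS = 5;
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 30_000;

const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  isRetryable: (error: unknown) => boolean
): Promise<T> {
  let attempt = 0;
  // Retry retryable failures with exponentially growing (capped) delays;
  // rethrow anything else, or the last error once attempts are exhausted.
  while (true) {
    try {
      return await operation();
    } catch (error) {
      attempt++;
      if (attempt >= MAX_ATTEMPTS || !isRetryable(error)) {
        throw error;
      }
      await delay(Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS));
    }
  }
}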

joshdover (Contributor) commented
Related to this, I recently stole the catch_retryable_es_client_errors implementation to add similar error handling behavior to some idempotent API calls that we make in Fleet: https://github.com/elastic/kibana/pull/118587/files#diff-33164209fde96e38b4365a9902a919b9ba5f07e8078dbf0945c4001603e3cefd

It'd be nice if we had a single source of truth for which types of errors are 'transient' and should be retried.

rudolf (Contributor) commented Nov 16, 2021

I found the reason we added the 401/403 handling: #51242 (comment)

So it's particularly relevant to migrations because they're some of the first calls to Elasticsearch and I guess it's annoying if you're setting up a cluster and Kibana keeps crashing while you're still busy setting up credentials for Elasticsearch.

But in general, I'd say if an operation fails on a 401/403 we should retry it rather than error out and drop whatever data should have been written to Elasticsearch.

joshdover (Contributor) commented
Got it, that's about what I expected the reason to be. In our case, we're not making any of these API calls until start anyway, by which point SO migrations have already succeeded. For that reason, I chose not to retry 401s and 403s.

Thanks for doing some digging, @rudolf

pgayvallet (Contributor) commented Oct 27, 2023

Linking to #169788 (comment), as the recent discussions in that linked issue are related to this one, and a potential fix there could close the current issue.

pgayvallet added a commit that referenced this issue Oct 30, 2023
## Summary

Related to #169788
Fix #117394

---------

Co-authored-by: kibanamachine <[email protected]>