
[licensing] intermittent "license is not available" causing alerting rules to fail to execute #117394

Closed
pmuellr opened this issue Nov 3, 2021 · 9 comments · Fixed by #170006

Comments

pmuellr (Member) commented Nov 3, 2021

We currently do license checks when running alerting rules, to ensure the rules can be run with the given license. We track the licensing state in this code: x-pack/plugins/alerting/server/lib/license_state.ts.

We've seen errors in the logs occasionally, especially when Kibana is under stress, where the license information gets set internally in our code as "unavailable". From looking at the alerting code, I think this comes in from the license subscription we are subscribed to. Typically these "outages" are fairly short, around 20 seconds, but for those 20 seconds rules will fail with logged messages like:

Alert type siem.signals is disabled because license information is not available at this time

Once the license info is available again, everything works normally.

It feels like a 20-second "outage" in licensing shouldn't actually cause the rules to fail. We could assume the license is unchanged since it was last checked, though it's not clear what the limit on that time frame should be. Or, whenever the state goes from anything -> unavailable, just assume the most recent license seen is still in effect.
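
For the sake of discussion, a rough sketch of the "fall back to the last seen license" idea; everything here (LicenseLike, MAX_STALE_MS, resolveLicense) is a hypothetical name, not the actual license_state.ts API:

// Minimal sketch, assuming a license object that exposes an isAvailable flag
// (similar in spirit to the ILicense type the licensing plugin hands out).
interface LicenseLike {
  isAvailable: boolean;
}

// Hypothetical grace period during which a stale-but-valid license is
// preferred over an "unavailable" one.
const MAX_STALE_MS = 60_000;

let lastValidLicense: LicenseLike | undefined;
let lastValidAt = 0;

function resolveLicense(incoming: LicenseLike, now = Date.now()): LicenseLike {
  if (incoming.isAvailable) {
    lastValidLicense = incoming;
    lastValidAt = now;
    return incoming;
  }
  // License info is temporarily unavailable: reuse the last good license
  // if it is recent enough, otherwise surface the unavailable state.
  if (lastValidLicense && now - lastValidAt <= MAX_STALE_MS) {
    return lastValidLicense;
  }
  return incoming;
}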

Labelling for both Core and Alerting: it seems like if we could agree on some "common" behavior here in licensing, that would be the place to fix it, so other plugins using licensing also won't be affected. But maybe this is something we should just do for Alerting, in the module referenced above.

elasticmachine (Contributor) commented
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

elasticmachine (Contributor) commented
Pinging @elastic/kibana-core (Team:Core)

pgayvallet (Contributor) commented
Yea, at the moment the licensing plugin does not perform any kind of retry in case of a network error, or if ES returns an error of any kind, meaning that this kind of (potentially temporary) upstream outage results in the license$ observable emitting an error until the next (successful) license fetch:

} catch (error) {
  this.logger.warn(
    `License information could not be obtained from Elasticsearch due to ${error} error`
  );
  const errorMessage = this.getErrorMessage(error);
  const signature = sign({ error: errorMessage });
  return new License({
    error: this.getErrorMessage(error),
    signature,
  });
}

Labelling for both Core and Alerting: it seems like if we could agree on some "common" behavior here in licensing, that would be the place to fix it, so other plugins using licensing also won't be affected. But maybe this is something we should just do for Alerting, in the module referenced above.

That's a good question. I think it would make sense to have the licensing plugin be more robust against these temporary failures, and I think all consumers of this API could benefit from such robustness.

Maybe improving createLicensePoller and/or createLicenseUpdate to perform retries in case of such failures would make sense. The hardest part would probably be to identify which errors should be considered as retry-able (kinda similar to what we're doing with retries in the SO migration).
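
For illustration only, a retry inside the license$ pipeline could look roughly like this; fetchLicense and isRetryableError are hypothetical placeholders, not the plugin's actual functions:

import { defer, timer, throwError } from 'rxjs';
import { mergeMap, retryWhen } from 'rxjs/operators';

// Placeholder stand-ins for the real license fetch and error classification
// (hypothetical names, not the licensing plugin's actual API).
async function fetchLicense(): Promise<{ isAvailable: boolean }> {
  return { isAvailable: true };
}
function isRetryableError(error: unknown): boolean {
  return true; // in practice: connection/timeout/503-style errors only
}

const MAX_RETRIES = 3;
const BASE_DELAY_MS = 1_000;

// Retry the fetch a few times with exponential backoff before letting the
// error reach license$ subscribers.
const license$ = defer(() => fetchLicense()).pipe(
  retryWhen((failures) =>
    failures.pipe(
      mergeMap((error, attempt) =>
        attempt < MAX_RETRIES && isRetryableError(error)
          ? timer(BASE_DELAY_MS * 2 ** attempt)
          : throwError(() => error)
      )
    )
  )
);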

cc @mshustov wdyt?

mshustov (Contributor) commented Nov 8, 2021

I think it would make sense to have the licensing plugin be more robust against these temporary failures, and I think all consumers of this API could benefit from such robustness.

yeah, it makes sense for the licensing API to prevent these cascading side effects.

The hardest part would probably be to identify which errors should be considered as retry-able (kinda similar to what we're doing with retries in the SO migration).

It might be better to extract this logic and provide it from the Elasticsearch service, maybe?
All of the network-level problems can be considered intermittent:

  • NoLivingConnectionsError
  • ConnectionError
  • TimeoutError
  • 503 ServiceUnavailable
  • 408 Request timeout

Not sure about others from https://github.com/elastic/kibana/blob/main/src/core/server/saved_objects/migrationsv2/actions/catch_retryable_es_client_errors.ts
Also, we can cap the maximum number of retries and use exponential backoff to avoid DDoSing the Elasticsearch server.
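
As a sketch of what that predicate could look like (using the error classes exported by @elastic/elasticsearch; the exact retryable set is still the open question above, and isRetryableEsError is a hypothetical name):

import { errors } from '@elastic/elasticsearch';

// HTTP status codes treated as transient here, per the list above.
const RETRYABLE_STATUS_CODES = new Set([408, 503]);

export function isRetryableEsError(error: unknown): boolean {
  if (
    error instanceof errors.NoLivingConnectionsError ||
    error instanceof errors.ConnectionError ||
    error instanceof errors.TimeoutError
  ) {
    return true;
  }
  if (error instanceof errors.ResponseError) {
    return RETRYABLE_STATUS_CODES.has(error.statusCode ?? 0);
  }
  return false;
}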

rudolf (Contributor) commented Nov 8, 2021

410 gets returned by some proxies (including on cloud)

I'm not sure about 401 / 403, but probably even if there's an authentication issue we want to retry because someone might fix the credentials.

We could probably use migrationRetryCallCluster as a starting point. I agree exponential back-off would be good to add to it.
https://github.com/elastic/kibana/blob/test-es-deprecations/src/core/server/elasticsearch/client/retry_call_cluster.ts#L61-L92
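
Not the actual migrationRetryCallCluster implementation, just a sketch of what a generic retry wrapper with capped exponential back-off could look like (retryWithBackoff is a hypothetical name):

const MAX_ATTEMPTS = 5;
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 30_000;

const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  isRetryable: (error: unknown) => boolean
): Promise<T> {
  let attempt = 0;
  // Retry retryable failures with exponentially growing (capped) delays;
  // rethrow anything else, or the last error once attempts are exhausted.
  while (true) {
    try {
      return await operation();
    } catch (error) {
      attempt++;
      if (attempt >= MAX_ATTEMPTS || !isRetryable(error)) {
        throw error;
      }
      await delay(Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS));
    }
  }
}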

joshdover (Contributor) commented
Related to this, I recently stole the catch_retryable_es_client_errors implementation to add similar error handling behavior to some idempotent API calls that we make in Fleet: https://github.com/elastic/kibana/pull/118587/files#diff-33164209fde96e38b4365a9902a919b9ba5f07e8078dbf0945c4001603e3cefd

It'd be nice if we had a single source of truth for which types of errors are 'transient' and should be retried.

rudolf (Contributor) commented Nov 16, 2021

I found the reason we added the 401/403 handling: #51242 (comment)

So it's particularly relevant to migrations because they're some of the first calls to Elasticsearch and I guess it's annoying if you're setting up a cluster and Kibana keeps crashing while you're still busy setting up credentials for Elasticsearch.

But in general, I'd say if an operation fails on a 401/403 we should retry it rather than error out and drop whatever data should have been written to Elasticsearch.

joshdover (Contributor) commented
Got it, that's about what I expected the reason to be. In our case, we're not making any of these API calls until start anyway, by which point SO migrations have already succeeded. For that reason, I chose not to retry 401s and 403s.

Thanks for doing some digging, @rudolf

pgayvallet (Contributor) commented Oct 27, 2023

Linking to #169788 (comment), as the recent discussions in that linked issue are related to this one, and a potential fix there could close the current issue.

pgayvallet added a commit that referenced this issue Oct 30, 2023
## Summary

Related to #169788
Fix #117394

---------

Co-authored-by: kibanamachine <[email protected]>