# [licensing] intermittent "license is not available" causing alerting rules to fail to execute #117394
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

Pinging @elastic/kibana-core (Team:Core)
Yea, atm the … [snippet: kibana/x-pack/plugins/licensing/server/plugin.ts, lines 203 to 213 @ 3c8fa52]
That's a good question. I think it would make sense to have the … Maybe improving … cc @mshustov, wdyt?
yeah, it makes sense for the licensing API to prevent these cascading side effects.
It might be better to extract this logic and provide it from the Elasticsearch service, maybe?
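For illustration only, here is a minimal sketch of the source-side idea, i.e. retrying the license fetch a few times before ever surfacing an "unavailable" license to subscribers. It assumes RxJS 7's `retry` config form and a hypothetical `fetchLicenseFromEs` helper; the licensing plugin's actual polling code is different.

```ts
import { defer, timer, Observable } from 'rxjs';
import { retry } from 'rxjs/operators';

// Hypothetical license shape and fetcher; the real ones live in the licensing plugin.
interface LicenseLike {
  isAvailable: boolean;
}
declare function fetchLicenseFromEs(): Promise<LicenseLike>;

// Retry transient fetch failures a few times with a short delay, so a brief
// Elasticsearch hiccup doesn't immediately propagate as an "unavailable" license.
export function fetchLicenseWithRetry(maxRetries = 3, delayMs = 5_000): Observable<LicenseLike> {
  return defer(() => fetchLicenseFromEs()).pipe(
    retry({ count: maxRetries, delay: () => timer(delayMs) })
  );
}
```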
Not sure about others from https://github.com/elastic/kibana/blob/main/src/core/server/saved_objects/migrationsv2/actions/catch_retryable_es_client_errors.ts
I'm not sure about 401 / 403, but probably even if there's an authentication issue we want to retry, because someone might fix the credentials. We could probably use …
Related to this, I recently stole the … It'd be nice if we had a single source of truth for which types of errors are 'transient' and should be retried.
I found the source of why we added the 401/403 handling: #51242 (comment). So it's particularly relevant to migrations because they're some of the first calls to Elasticsearch, and I guess it's annoying if you're setting up a cluster and Kibana keeps crashing while you're still busy setting up credentials for Elasticsearch. But in general, I'd say if an operation fails we should rather retry it on 401/403 than error out and just drop whatever data should have been written to Elasticsearch.
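To make the "single source of truth" idea concrete, a shared classifier could look something like the sketch below. It uses the error classes exported by `@elastic/elasticsearch`; the set of retryable status codes is illustrative and not necessarily the exact list migrations uses today.

```ts
import { errors } from '@elastic/elasticsearch';

// Illustrative set of HTTP statuses treated as transient; tune to taste.
const RETRYABLE_STATUS_CODES = new Set([401, 403, 408, 410, 429, 503, 504]);

// Returns true when an Elasticsearch client error looks transient and the
// caller should retry rather than fail and drop the operation.
export function isTransientEsError(e: unknown): boolean {
  if (
    e instanceof errors.NoLivingConnectionsError ||
    e instanceof errors.ConnectionError ||
    e instanceof errors.TimeoutError
  ) {
    return true;
  }
  if (e instanceof errors.ResponseError) {
    return RETRYABLE_STATUS_CODES.has(e.statusCode ?? 0);
  }
  return false;
}
```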
Got it, that's about what I was expecting to be the reason. For our case, we're not making any of these API calls until …

Thanks for doing some digging, @rudolf
Linking to #169788 (comment), as the recent discussions in the linked issue are related to this one, and a potential fix there could close the current issue.
## Summary

Related to #169788
Fix #117394

---------

Co-authored-by: kibanamachine <[email protected]>
We currently do license checks when running alerting rules, to ensure the rules can be run with the given license. We track the licensing state in this code: x-pack/plugins/alerting/server/lib/license_state.ts.
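For readers unfamiliar with that module, a heavily simplified sketch of the pattern is below; the type and method names are hypothetical stand-ins, and the real license_state.ts is considerably more involved.

```ts
import { Observable, Subscription } from 'rxjs';

// Hypothetical shapes; the real types live in the licensing and alerting plugins.
type LicenseTier = 'basic' | 'gold' | 'platinum';

interface LicenseLike {
  isAvailable: boolean;
  hasAtLeast(minimum: LicenseTier): boolean;
}

export class LicenseState {
  private license?: LicenseLike;
  private readonly subscription: Subscription;

  constructor(license$: Observable<LicenseLike>) {
    // Every emission (including "unavailable" ones) overwrites the cached
    // license, which is why a short licensing blip immediately fails rules.
    this.subscription = license$.subscribe((license) => {
      this.license = license;
    });
  }

  public ensureLicenseForRule(minimumLicense: LicenseTier): void {
    if (this.license === undefined || !this.license.isAvailable) {
      throw new Error('Alerting rule cannot run: license information is not available');
    }
    if (!this.license.hasAtLeast(minimumLicense)) {
      throw new Error(`Alerting rule requires at least a ${minimumLicense} license`);
    }
  }

  public clean(): void {
    this.subscription.unsubscribe();
  }
}
```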
We've seen errors in logs occasionally, especially when Kibana is under stress, where the license information gets set internally in our code as "unavailable". From looking at the alerting code, I think this is coming from the license subscription we are subscribed to. Typically these "outages" are fairly short, on the order of 20 seconds. But for those 20 seconds, rules will fail with logged messages indicating the license is not available.
Once the license info is available again, everything works normally.
It feels like a 20 second "outage" in licensing shouldn't actually cause the rules to fail. We could assume the license is the same as when it was last checked; I'm not sure what the limit on that time frame should be. Or maybe, if the license goes from any state to unavailable, we just assume it's the most recent license seen.
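One way to express the "assume the most recent license seen" idea is to wrap the license feed so that short unavailable blips re-emit the last known good license. A minimal sketch, assuming a hypothetical `LicenseLike` shape and a configurable staleness limit:

```ts
import { Observable } from 'rxjs';
import { map, scan } from 'rxjs/operators';

// Hypothetical license shape; the real ILicense interface lives in the licensing plugin.
interface LicenseLike {
  isAvailable: boolean;
}

interface CachedLicense {
  license: LicenseLike;
  cachedAt: number;
}

// Wraps a license$ feed so that a short "unavailable" blip re-emits the last
// known good license for up to maxStaleMs, instead of failing consumers.
export function withLastKnownLicense(
  license$: Observable<LicenseLike>,
  maxStaleMs = 60_000
): Observable<LicenseLike> {
  return license$.pipe(
    scan<LicenseLike, CachedLicense | undefined>((cached, license) => {
      if (license.isAvailable) {
        // Good license: cache it with a timestamp.
        return { license, cachedAt: Date.now() };
      }
      if (
        cached !== undefined &&
        cached.license.isAvailable &&
        Date.now() - cached.cachedAt < maxStaleMs
      ) {
        // Feed says "unavailable", but we saw a good license recently: keep using it.
        return cached;
      }
      // Unavailable for too long (or never available): propagate it.
      return { license, cachedAt: Date.now() };
    }, undefined),
    map((cached) => (cached as CachedLicense).license)
  );
}
```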
Labelling for both Core and Alerting - seems like if we could agree on some "common" behavior here, in licensing, that would be the place to fix it, so other plugins using licensing also won't be affected. But maybe this is something we should just do for Alerting, in the module referenced ^^^.