/api/stats endpoint responds with 503 if a plugin is 'critical' #169788

Closed
gsoldevila opened this issue Oct 25, 2023 · 9 comments
Assignees: gsoldevila
Labels: bug (Fixes for quality problems that affect the customer experience), Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)

Comments

@gsoldevila
Contributor

Kibana version:
Fails on current main

Steps to reproduce:

  1. Update the "license fetch" code so that it occasionally throws an error (screenshot omitted; a sketch of such a change follows the example response below).
  2. Try to access the /api/stats endpoint. It responds with:
{
  "statusCode": 503,
  "error": "Service Unavailable",
  "message": "License is not available."
}
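
The screenshot for step 1 is not available; the following is a minimal, hypothetical sketch of the kind of change it showed, assuming the patch goes into the licensing plugin's fetchLicense (quoted further down in this thread):

  // Hypothetical repro patch: make the licensing plugin's fetchLicense fail at
  // random, so the catch block builds an errored License and dependent plugins
  // see the license as unavailable.
  try {
    if (Math.random() < 0.5) {
      throw new Error('Simulated license fetch failure');
    }
    const response = await client.asInternalUser.xpack.info();
    // ... rest of the original success path ...
  } catch (error) {
    // ... original error handling, which returns a License carrying the error ...
  }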

Problems it causes:

When this endpoint fails, Metricbeat cannot properly collect Kibana status.
NB: the information collected by Metricbeat is used by the Stack Management UI to show Kibana status.
Thus, the Stack Management UI will rely on potentially stale information for some time (up to 2 minutes).

  • During that period, there might be discrepancies between observed state and actual state.
  • After that period, Kibana status will be considered Stale:

[screenshot: Stack Management UI showing Kibana status as "Stale"]

I believe this explains issues such as https://github.com/elastic/sdh-kibana/issues/4194

Proposed solution:
Make the /api/stats endpoint more resilient to degraded plugins, and allow it to correctly report the appropriate status.

gsoldevila added the bug and Team:Core labels Oct 25, 2023
gsoldevila self-assigned this Oct 25, 2023
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@afharo
Member

afharo commented Oct 25, 2023

IMO, I don't think this is an issue exclusive to /api/stats.

That error comes from the authentication service:

if (!license.isLicenseAvailable()) {
  this.logger.error('License is not available, authentication is not possible.');
  return response.customError({
    body: 'License is not available.',
    statusCode: 503,
    headers: { 'Retry-After': '30' },
  });
}

The license becomes unavailable because, when we fail to fetch it, the licensing plugin emits a new errored License:

private fetchLicense = async (clusterClient: MaybePromise<IClusterClient>): Promise<ILicense> => {
  const client = isPromise(clusterClient) ? await clusterClient : clusterClient;
  try {
    const response = await client.asInternalUser.xpack.info();
    const normalizedLicense =
      response.license && response.license.type !== 'missing'
        ? normalizeServerLicense(response.license)
        : undefined;
    const normalizedFeatures = response.features
      ? normalizeFeatures(response.features)
      : undefined;
    const signature = sign({
      license: normalizedLicense,
      features: normalizedFeatures,
      error: '',
    });
    return new License({
      license: normalizedLicense,
      features: normalizedFeatures,
      signature,
    });
  } catch (error) {
    this.logger.warn(
      `License information could not be obtained from Elasticsearch due to ${error} error`
    );
    const errorMessage = this.getErrorMessage(error);
    const signature = sign({ error: errorMessage });
    return new License({
      error: this.getErrorMessage(error),
      signature,
    });
  }
};
(mind the catch block)

I wonder if we should handle the errors and reuse the cached license for some ES errors (e.g. network glitches). WDYT?
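
Something along these lines, as a hypothetical sketch (lastGoodLicense and isTransientError don't exist today and only illustrate the idea; the success path would also have to store this.lastGoodLicense):

  } catch (error) {
    this.logger.warn(
      `License information could not be obtained from Elasticsearch due to ${error} error`
    );
    // Hypothetical: for transient connectivity errors, keep serving the last
    // successfully fetched license instead of emitting an errored one.
    if (this.lastGoodLicense !== undefined && isTransientError(error)) {
      return this.lastGoodLicense;
    }
    const errorMessage = this.getErrorMessage(error);
    const signature = sign({ error: errorMessage });
    return new License({ error: errorMessage, signature });
  }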

@pgayvallet
Contributor

pgayvallet commented Oct 26, 2023

Correct, this is common to most (only authenticated? all? not sure) routes. If the security plugin can't fetch the license for any reason, then authc can't be performed and requests fail with a 503.

This is done so many layers below the actual route handler that I'm not really sure what the best way to improve this would be (or if we even want/need to).

@afharo
Member

afharo commented Oct 26, 2023

The only way I think we can improve this is by improving the error handler in the license fetcher:

  } catch (error) {
    this.logger.warn(
      `License information could not be obtained from Elasticsearch due to ${error} error`
    );
    const errorMessage = this.getErrorMessage(error);
    const signature = sign({ error: errorMessage });
    return new License({
      error: this.getErrorMessage(error),
      signature,
    });
  }
};

If Kibana or Elasticsearch is overloaded and the license refresh fails, we might want to reuse the cached license.

Maybe a retry count before flagging the license as failed could help?
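
As a hypothetical sketch of that retry count (consecutiveFailures, lastGoodLicense and MAX_LICENSE_FETCH_RETRIES are made up for illustration; the success path would reset the counter and store the license):

  } catch (error) {
    this.consecutiveFailures = (this.consecutiveFailures ?? 0) + 1;
    this.logger.warn(`License fetch failed (attempt ${this.consecutiveFailures}): ${error}`);
    // Hypothetical: tolerate a few consecutive failures by re-emitting the
    // cached license before we actually flag the license as errored.
    if (this.lastGoodLicense !== undefined && this.consecutiveFailures < MAX_LICENSE_FETCH_RETRIES) {
      return this.lastGoodLicense;
    }
    const errorMessage = this.getErrorMessage(error);
    return new License({ error: errorMessage, signature: sign({ error: errorMessage }) });
  }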


Separate finding: did you know that we force-refresh the license whenever we reply >= 400?

// If we're returning an error response, refresh license info from
// Elasticsearch in case the error is due to a change in license information
// in Elasticsearch. https://github.com/elastic/x-pack-kibana/pull/2876
// We're explicit ignoring a 429 "Too Many Requests". This is being used to communicate
// that back-pressure should be applied, and we don't need to refresh the license in these
// situations.
if (res.statusCode >= 400 && res.statusCode !== 429) {
  await refresh();
}

That feels so wrong on many levels!

  1. When Kibana is struggling, it might throw errors (5xx) => force refresh => more internal load.
    On top of that, we hold the response until we successfully refresh the license.
  2. It's an easy DDoS attack: just call an authenticated route without any credentials and force a license refresh on each request. It can also be achieved with intentionally malformed queries to throw validation errors.

@pgayvallet
Contributor

If Kibana or Elasticsearch is overloaded and the license refresh fails, we might want to reuse the cached license.

It could be a good quick win to avoid sporadic problems due to connectivity issues and the like. We don't want to re-use the cached value for a long period though, and should still fail at some point; otherwise we'd be fully changing our behavior.

did you know that we force-refresh the license whenever we reply >= 400?

I probably knew at some point, and my brain likely decided it was better for my sanity to delete the information (and I can't blame it)

Yeah, it feels somewhat wrong (and the associated issue introducing that behavior is from 6 years ago). The interval-based refresh might be sufficient. Or maybe we can adapt this createOnPreResponseHandler to have some throttling around calling refresh, so it calls it at most once every X seconds/minutes?
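
Roughly, as a hypothetical sketch (REFRESH_THROTTLE_MS is an assumed value; refresh is the existing function the handler already calls):

  // Hypothetical throttle: trigger at most one license refresh per window,
  // no matter how many error responses we return in the meantime.
  const REFRESH_THROTTLE_MS = 30_000; // assumed value
  let lastRefreshAttempt = 0;

  const throttledRefresh = async () => {
    const now = Date.now();
    if (now - lastRefreshAttempt < REFRESH_THROTTLE_MS) {
      return; // a refresh was already triggered recently, skip this one
    }
    lastRefreshAttempt = now;
    await refresh();
  };

  // In createOnPreResponseHandler:
  if (res.statusCode >= 400 && res.statusCode !== 429) {
    await throttledRefresh();
  }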

Note, we're using exhaustMap in the license poller, so in practice we will never fetch the license more than once at a time, which imho significantly reduces the risk of exploit.

const fetched$ = merge(triggerRefresh$, manuallyRefresh$).pipe(
  takeUntil(stop$),
  exhaustMap(fetcher),
  share()
);

We will be queuing a lot of refresh requests though, which I agree should ideally be avoided.

@gsoldevila
Contributor Author

gsoldevila commented Oct 26, 2023

If Kibana or Elasticsearch is overloaded and the license refresh fails, we might want to reuse the cached license.

I think this wouldn't help solve the related issue, as it could mean showing Kibana as "healthy" while it is not.

Regarding the RxJS refresh pipeline, perhaps we could add a debounceTime to make sure there aren't too many refresh requests close to one another (see the sketch below).
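
A minimal sketch of that, assuming the pipeline quoted above (the 1-second window is an arbitrary value):

  import { merge } from 'rxjs';
  import { debounceTime, exhaustMap, share, takeUntil } from 'rxjs/operators';

  // Hypothetical: coalesce bursts of refresh triggers so that at most one
  // license fetch starts per quiet window.
  const fetched$ = merge(triggerRefresh$, manuallyRefresh$).pipe(
    debounceTime(1_000),
    takeUntil(stop$),
    exhaustMap(fetcher),
    share()
  );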

@afharo
Member

afharo commented Oct 26, 2023

I think this wouldn't help solve the related issue, as it could mean showing Kibana as "healthy" while it is not.

IMO, if there's a network issue when connecting to ES, the elasticsearch core service should raise it, flagging Kibana as critical. Failing to poll some pretty static information and becoming unhealthy straight away because of that seems wrong. And even worse: the overall status is only degraded, yet Kibana effectively can't accept any requests.

Or maybe we can adapt this createOnPreResponseHandler to have some throttling around calling refresh, so it call it max once every x sec/mins?

That sounds like a good compromise, IMO.

@pgayvallet
Contributor

pgayvallet commented Oct 27, 2023

I think this wouldn't help solve the https://github.com/elastic/sdh-kibana/issues/4194

I agree that it wouldn't directly help with the issue.

Now, I found that issue (for totally unrelated reasons, from an SDH), but it looks like it could help with some other problems:

From the linked issue:

I think it would make sense to have the licensing plugin be more robust against these temporary failures, and I think all consumers of this API could benefit from such robustness.

Even if it doesn't fix this particular issue, it's probably still an improvement we may want to make. I'll take a quick look, given it seems like a fairly quick win that could help in those edge cases.

@pgayvallet
Contributor

#170006 has been merged, and I don't think we want to go further on this, so I'll consider it done.
