/api/stats endpoint responds with 503 if a plugin is 'critical' #169788

Closed
gsoldevila opened this issue Oct 25, 2023 · 9 comments
Assignees: gsoldevila
Labels: bug (Fixes for quality problems that affect the customer experience), Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)

Comments

@gsoldevila
Contributor

Kibana version:
Fails on current main

Steps to reproduce:

  1. Update the "license fetch" code so that it occasionally throws an error (screenshot omitted; a sketch of such a change follows the example response below).
  2. Try to access the /api/stats endpoint. It responds with:
{
  "statusCode": 503,
  "error": "Service Unavailable",
  "message": "License is not available."
}
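
The screenshot for step 1 is not available; the following is a minimal, hypothetical sketch of the kind of change it showed, assuming the patch goes into the licensing plugin's fetchLicense (quoted further down in this thread):

  // Hypothetical repro patch: make the licensing plugin's fetchLicense fail at
  // random, so the catch block builds an errored License and dependent plugins
  // see the license as unavailable.
  try {
    if (Math.random() < 0.5) {
      throw new Error('Simulated license fetch failure');
    }
    const response = await client.asInternalUser.xpack.info();
    // ... rest of the original success path ...
  } catch (error) {
    // ... original error handling, which returns a License carrying the error ...
  }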

Problems it causes:

When this endpoint fails, Metricbeat cannot properly collect Kibana status.
NB: the information collected by Metricbeat is used by the Stack Management UI to show Kibana status.
Thus, the Stack Management UI will rely on potentially stale information for some time (up to 2 minutes).

  • During that period, there might be discrepancies between observed state and actual state.
  • After that period, Kibana status will be considered Stale:

[screenshot: Stack Management UI showing Kibana status as "Stale"]

I believe this explains issues such as https://github.com/elastic/sdh-kibana/issues/4194

Proposed solution:
Make the /api/stats endpoint more resilient to degraded plugins, and allow it to correctly report the appropriate status.

gsoldevila added the bug and Team:Core labels Oct 25, 2023
gsoldevila self-assigned this Oct 25, 2023
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@afharo
Member

afharo commented Oct 25, 2023

IMO, I don't think this is an issue exclusive to /api/stats.

That error comes from the authentication service:

if (!license.isLicenseAvailable()) {
  this.logger.error('License is not available, authentication is not possible.');
  return response.customError({
    body: 'License is not available.',
    statusCode: 503,
    headers: { 'Retry-After': '30' },
  });
}

The license becomes unavailable because, when we fail to fetch it, the licensing plugin emits a new errored License:

private fetchLicense = async (clusterClient: MaybePromise<IClusterClient>): Promise<ILicense> => {
  const client = isPromise(clusterClient) ? await clusterClient : clusterClient;
  try {
    const response = await client.asInternalUser.xpack.info();
    const normalizedLicense =
      response.license && response.license.type !== 'missing'
        ? normalizeServerLicense(response.license)
        : undefined;
    const normalizedFeatures = response.features
      ? normalizeFeatures(response.features)
      : undefined;
    const signature = sign({
      license: normalizedLicense,
      features: normalizedFeatures,
      error: '',
    });
    return new License({
      license: normalizedLicense,
      features: normalizedFeatures,
      signature,
    });
  } catch (error) {
    this.logger.warn(
      `License information could not be obtained from Elasticsearch due to ${error} error`
    );
    const errorMessage = this.getErrorMessage(error);
    const signature = sign({ error: errorMessage });
    return new License({
      error: this.getErrorMessage(error),
      signature,
    });
  }
};
(mind the catch block)

I wonder if we should handle the errors and reuse the cached license for some ES errors (e.g. network glitches). WDYT?
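
Something along these lines, as a hypothetical sketch (lastGoodLicense and isTransientError don't exist today and only illustrate the idea; the success path would also have to store this.lastGoodLicense):

  } catch (error) {
    this.logger.warn(
      `License information could not be obtained from Elasticsearch due to ${error} error`
    );
    // Hypothetical: for transient connectivity errors, keep serving the last
    // successfully fetched license instead of emitting an errored one.
    if (this.lastGoodLicense !== undefined && isTransientError(error)) {
      return this.lastGoodLicense;
    }
    const errorMessage = this.getErrorMessage(error);
    const signature = sign({ error: errorMessage });
    return new License({ error: errorMessage, signature });
  }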

@pgayvallet
Contributor

pgayvallet commented Oct 26, 2023

Correct, this is common to most (only authenticated? all? not sure) routes. If the security plugin can't fetch the license for any reason, then authc can't be performed and requests fail with a 503.

This is done so many layers below the actual route handler that I'm not really sure what the best way to improve this would be (or if we even want/need to).

@afharo
Member

afharo commented Oct 26, 2023

The only way I think we can improve this is by improving the error handler in the license fetcher:

  } catch (error) {
    this.logger.warn(
      `License information could not be obtained from Elasticsearch due to ${error} error`
    );
    const errorMessage = this.getErrorMessage(error);
    const signature = sign({ error: errorMessage });
    return new License({
      error: this.getErrorMessage(error),
      signature,
    });
  }
};

If Kibana or Elasticsearch is overloaded and the license refresh fails, we might want to reuse the cached license.

Maybe a retry count before flagging the license as failed could help?
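
As a hypothetical sketch of that retry count (consecutiveFailures, lastGoodLicense and MAX_LICENSE_FETCH_RETRIES are made up for illustration; the success path would reset the counter and store the license):

  } catch (error) {
    this.consecutiveFailures = (this.consecutiveFailures ?? 0) + 1;
    this.logger.warn(`License fetch failed (attempt ${this.consecutiveFailures}): ${error}`);
    // Hypothetical: tolerate a few consecutive failures by re-emitting the
    // cached license before we actually flag the license as errored.
    if (this.lastGoodLicense !== undefined && this.consecutiveFailures < MAX_LICENSE_FETCH_RETRIES) {
      return this.lastGoodLicense;
    }
    const errorMessage = this.getErrorMessage(error);
    return new License({ error: errorMessage, signature: sign({ error: errorMessage }) });
  }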


Separate finding: did you know that we force-refresh the license whenever we reply >= 400?

// If we're returning an error response, refresh license info from
// Elasticsearch in case the error is due to a change in license information
// in Elasticsearch. https://github.com/elastic/x-pack-kibana/pull/2876
// We're explicit ignoring a 429 "Too Many Requests". This is being used to communicate
// that back-pressure should be applied, and we don't need to refresh the license in these
// situations.
if (res.statusCode >= 400 && res.statusCode !== 429) {
  await refresh();
}

That feels so wrong on many levels!

  1. When Kibana is struggling, it might throw errors (5xx) => force refresh => more internal load.
    On top of that, we hold the response until we successfully refresh the license.
  2. It's an easy DDoS attack: just call an authenticated route without any credentials and force a license refresh on each request. It can also be achieved with intentionally malformed queries to throw validation errors.

@pgayvallet
Contributor

If Kibana or Elasticsearch is overloaded and the license refresh fails, we might want to reuse the cached license.

It could be a good quick win to avoid sporadic problems due to connectivity issues and the like. We don't want to re-use the cached value for a long period though, and should still fail at some point; otherwise we'd be fully changing our behavior.

did you know that we force-refresh the license whenever we reply >= 400?

I probably knew at some point, and my brain likely decided it was better for my sanity to delete the information (and I can't blame it)

Yeah, it feels somewhat wrong (and the associated issue introducing that behavior is from 6 years ago). The interval-based refresh might be sufficient. Or maybe we can adapt this createOnPreResponseHandler to have some throttling around calling refresh, so it calls it at most once every X seconds/minutes?
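
Roughly, as a hypothetical sketch (REFRESH_THROTTLE_MS is an assumed value; refresh is the existing function the handler already calls):

  // Hypothetical throttle: trigger at most one license refresh per window,
  // no matter how many error responses we return in the meantime.
  const REFRESH_THROTTLE_MS = 30_000; // assumed value
  let lastRefreshAttempt = 0;

  const throttledRefresh = async () => {
    const now = Date.now();
    if (now - lastRefreshAttempt < REFRESH_THROTTLE_MS) {
      return; // a refresh was already triggered recently, skip this one
    }
    lastRefreshAttempt = now;
    await refresh();
  };

  // In createOnPreResponseHandler:
  if (res.statusCode >= 400 && res.statusCode !== 429) {
    await throttledRefresh();
  }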

Note, we're using exhaustMap in the license poller, so in practice we will never fetch the license more than once at a time, which imho significantly reduces the risk of exploit.

const fetched$ = merge(triggerRefresh$, manuallyRefresh$).pipe(
  takeUntil(stop$),
  exhaustMap(fetcher),
  share()
);

We will be queuing a lot of refresh requests though, which I agree should ideally be avoided.

@gsoldevila
Contributor Author

gsoldevila commented Oct 26, 2023

If Kibana or Elasticsearch is overloaded and the license refresh fails, we might want to reuse the cached license.

I think this wouldn't help solve the related issue, as it could mean showing Kibana as "healthy" while it is not.

Regarding the RxJS refresh pipeline, perhaps we could add a debounceTime to make sure there aren't too many refresh requests close to one another (see the sketch below).
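
A minimal sketch of that, assuming the pipeline quoted above (the 1-second window is an arbitrary value):

  import { merge } from 'rxjs';
  import { debounceTime, exhaustMap, share, takeUntil } from 'rxjs/operators';

  // Hypothetical: coalesce bursts of refresh triggers so that at most one
  // license fetch starts per quiet window.
  const fetched$ = merge(triggerRefresh$, manuallyRefresh$).pipe(
    debounceTime(1_000),
    takeUntil(stop$),
    exhaustMap(fetcher),
    share()
  );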

@afharo
Member

afharo commented Oct 26, 2023

I think this wouldn't help solve the related issue, as it could mean showing Kibana as "healthy" while it is not.

IMO, if there's a network issue when connecting to ES, the elasticsearch core service should raise it, flagging Kibana as critical. Failing to poll some pretty static information and becoming unhealthy straight away because of that seems wrong. And even worse: the overall status is only degraded, yet Kibana effectively can't accept any requests.

Or maybe we can adapt this createOnPreResponseHandler to have some throttling around calling refresh, so it call it max once every x sec/mins?

That sounds like a good compromise, IMO.

@pgayvallet
Contributor

pgayvallet commented Oct 27, 2023

I think this wouldn't help solve the https://github.com/elastic/sdh-kibana/issues/4194

I agree that it wouldn't directly help with the issue.

Now, I found that issue (for totally unrelated reasons, from an SDH), but it looks like it could help with some other problems:

From the linked issue:

I think it would make sense to have the licensing plugin be more robust against these temporary failures, and I think all consumers of this API could benefit from such robustness.

Even if it doesn't fix this particular issue, it's probably still an improvement we may want to make. I'll take a quick look, given it seems like a fairly quick win that could help in those edge cases.

@pgayvallet
Contributor

#170006 has been merged, and I don't think we want to go further on this, so I'll consider it done.
