This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

FieldData and Shard Request Cache RCA #265

Merged
merged 3 commits into from
Jul 23, 2020

Conversation

khushbr
Contributor

@khushbr khushbr commented Jul 7, 2020

Issue #, if available: -N/A-

Description of changes:
Caching is one of the fundamental optimization strategies employed in information retrieval systems to expedite query processing and reduce back-end server workload. Elasticsearch supports three types of cache: the Field Data Cache, the Node Query Cache, and the Shard Request Cache.

We want to auto-size the caches dynamically, without a restart and without reliance on the node-level elasticsearch.yml. This PR adds the Field Data Cache and Shard Request Cache RCAs, which identify whether a cache is unhealthy based on input metrics; an unhealthy cache then becomes a candidate for auto-tuning.

  1. The Field Data Cache RCA identifies when the cache is unhealthy (thrashing) versus healthy. It reads the fieldDataCacheEvictions and fieldDataCacheMaxSizeGroupByOperation upstream metrics and maintains a collector that tracks the time window (tp) over which we have repeatedly seen evictions. The RCA is marked unhealthy if ‘tp’ exceeds the threshold (300 seconds) and the cache size exceeds the configured max cache size.
    Note: Hit and Miss metrics are not available for the Field Data Cache.

  2. The Shard Request Cache RCA reads the shardRequestCacheEvictions, shardRequestCacheHits, shardRequestCacheSizeGroupByOperation and shardRequestCacheMaxSizeInBytes upstream metrics and maintains collectors that track the time window (tp) over which we have repeatedly seen evictions and hits. The RCA is marked unhealthy if ‘tp’ exceeds the threshold (300 seconds) and the cache size exceeds the configured max cache size.
    Tests: Unit tests are included.
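The collector-based window logic described in the two items above can be sketched roughly as follows. The class and method names here (EvictionWindowCollector, collect) are illustrative stand-ins rather than the PR's actual types; only the 300-second threshold comes from the description:

```java
import java.util.concurrent.TimeUnit;

// Rough sketch: track how long evictions have been observed continuously,
// and report "present for threshold time" once the streak exceeds the threshold.
public class EvictionWindowCollector {
    private final long thresholdMillis;
    // Start of the current uninterrupted eviction streak; MIN_VALUE means "no streak".
    private long windowStartTimestamp = Long.MIN_VALUE;

    public EvictionWindowCollector(long thresholdSeconds) {
        this.thresholdMillis = TimeUnit.SECONDS.toMillis(thresholdSeconds);
    }

    // Called on every metric sample; a sample with zero evictions resets the window.
    public void collect(double evictionCount, long currTimestampMillis) {
        if (evictionCount > 0) {
            if (windowStartTimestamp == Long.MIN_VALUE) {
                windowStartTimestamp = currTimestampMillis;
            }
        } else {
            windowStartTimestamp = Long.MIN_VALUE;
        }
    }

    // True if evictions have been seen continuously for at least the threshold duration.
    public boolean isMetricPresentForThresholdTime(long currTimestampMillis) {
        return windowStartTimestamp != Long.MIN_VALUE
                && currTimestampMillis - windowStartTimestamp >= thresholdMillis;
    }

    public static void main(String[] args) {
        EvictionWindowCollector c = new EvictionWindowCollector(300);
        c.collect(5, 0);                                               // evictions start at t = 0s
        System.out.println(c.isMetricPresentForThresholdTime(100_000)); // false: 100s < 300s
        System.out.println(c.isMetricPresentForThresholdTime(301_000)); // true: 301s >= 300s
    }
}
```

The RCA additionally ANDs this window condition with the cache-size check before reporting unhealthy, per the description above.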

Code coverage percentage for this patch:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

commit e72afa1
Author: khushbr <[email protected]>
Date:   Tue Jul 7 11:34:59 2020 -0700

    Refreshing from Mainline

commit d1b3f21
Author: khushbr <[email protected]>
Date:   Tue Jul 7 11:13:49 2020 -0700

    Removing a comment from the util file

commit 63ddab2
Author: khushbr <[email protected]>
Date:   Tue Jul 7 02:47:36 2020 -0700

    Adding final set of changes for Cache RCAs

commit 09537c2
Author: khushbr <[email protected]>
Date:   Mon Jul 6 23:54:28 2020 -0700

    Adding weight check for async evictions in FieldDataCache

commit 65d8f43
Author: khushbr <[email protected]>
Date:   Mon Jul 6 16:04:59 2020 -0700

    Adding UT for ShardRequestCacheRca

commit cdec05b
Author: khushbr <[email protected]>
Date:   Mon Jul 6 10:45:07 2020 -0700

    Adding UT for FieldDataCacheRca

commit ebdbc7e
Author: khushbr <[email protected]>
Date:   Sun Jul 5 23:16:57 2020 -0700

    Refreshing from master

commit 4238148
Author: khushbr <[email protected]>
Date:   Sun Jul 5 23:15:08 2020 -0700

    Adding the FieldDataCacheRca and ShardRequestCacheRca

commit b6f1a51
Author: khushbr <[email protected]>
Date:   Wed Jul 1 14:17:01 2020 -0700

    Including the Protobuf re-factoring changes

commit 03cc597
Author: khushbr <[email protected]>
Date:   Wed Jun 24 21:53:08 2020 -0700

    Refreshing from master
@khushbr khushbr requested review from vigyasharma and rguo-aws July 7, 2020 19:04
codecov bot commented Jul 7, 2020

Codecov Report

Merging #265 into master will decrease coverage by 0.38%.
The diff coverage is 72.57%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #265      +/-   ##
============================================
- Coverage     66.58%   66.19%   -0.39%     
+ Complexity     1864     1807      -57     
============================================
  Files           276      275       -1     
  Lines         12307    12163     -144     
  Branches        982      981       -1     
============================================
- Hits           8194     8051     -143     
+ Misses         3782     3781       -1     
  Partials        331      331              
Impacted Files Coverage Δ Complexity Δ
...ch/performanceanalyzer/PerformanceAnalyzerApp.java 40.59% <ø> (+0.41%) 4.00 <0.00> (-3.00) ⬆️
...formanceanalyzer/collectors/StatExceptionCode.java 100.00% <ø> (ø) 3.00 <0.00> (ø)
...ca/store/rca/cluster/FieldDataCacheClusterRca.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...store/rca/cluster/ShardRequestCacheClusterRca.java 0.00% <0.00%> (ø) 0.00 <0.00> (?)
...ormanceanalyzer/rca/store/rca/cache/CacheUtil.java 64.70% <64.70%> (ø) 3.00 <3.00> (?)
...nalyzer/rca/store/rca/cache/FieldDataCacheRca.java 73.91% <73.91%> (ø) 7.00 <7.00> (?)
...yzer/rca/store/rca/cache/ShardRequestCacheRca.java 75.67% <75.67%> (ø) 9.00 <9.00> (?)
...yzer/rca/framework/api/summaries/ResourceUtil.java 97.50% <100.00%> (+0.20%) 4.00 <0.00> (ø)
...analyzer/decisionmaker/deciders/EmptyFlowUnit.java 0.00% <0.00%> (-66.67%) 0.00% <0.00%> (-1.00%)
...icsearch/performanceanalyzer/CertificateUtils.java 0.00% <0.00%> (-34.15%) 0.00% <0.00%> (-8.00%)
... and 66 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 690b69c...14127b2. Read the comment docs.

if (Double.isNaN(size)) {
LOG.error("Failed to parse metric in FlowUnit from {}", sizeMetric.getClass().getName());
} else {
sizeTotalInMB += size / CONVERT_BYTES_TO_MEGABYTES;
Contributor

Floating-point comparisons have rounding errors which can affect the places where this value is used. Should we just convert it to KB and use long values, so that we tolerate a rounding error of 1 KB?

Contributor Author


Thank you Vigya for the suggestion, I agree we can go with KB and tolerate the 1KB rounding error. I have updated the code for the same.
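A minimal sketch of what the agreed change amounts to, assuming byte-valued inputs; the helper names here are illustrative, not the PR's actual code. Sizes are truncated to whole kilobytes and compared as longs, so sub-KB floating-point noise cannot flip the comparison:

```java
// Compare cache sizes in whole kilobytes (long) rather than fractional
// megabytes (double), tolerating a rounding error of at most 1 KB.
public class CacheSizeKb {
    private static final long BYTES_PER_KB = 1024L;

    // Truncate a byte count (reported as a double) to whole kilobytes.
    static long bytesToKb(double sizeInBytes) {
        return (long) (sizeInBytes / BYTES_PER_KB);
    }

    static boolean exceedsMaxSize(double cacheSizeBytes, double cacheMaxSizeBytes) {
        long sizeKb = bytesToKb(cacheSizeBytes);
        long maxKb = bytesToKb(cacheMaxSizeBytes);
        return maxKb != 0 && sizeKb > maxKb;
    }

    public static void main(String[] args) {
        // A 10 MB cache against an 8 MB max clearly exceeds the limit.
        System.out.println(exceedsMaxSize(10 * 1024 * 1024, 8 * 1024 * 1024)); // true
        // Sub-KB floating-point noise no longer flips the comparison.
        System.out.println(exceedsMaxSize(8 * 1024 * 1024 + 0.4, 8 * 1024 * 1024)); // false
    }
}
```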

ClusterDetailsEventProcessor.NodeDetails currentNode = ClusterDetailsEventProcessor.getCurrentNodeDetails();
double cacheSize = getTotalSizeInMB(shardRequestCacheSize);
double cacheMaxSize = getTotalSizeInMB(shardRequestCacheMaxSize);
exceedsSize = cacheSize != 0 && cacheMaxSize != 0 && cacheSize > cacheMaxSize;
Contributor


same as above

// the cache is considered as unhealthy
if (cacheEvictionCollector.isMetricPresentForThresholdTime(currTimestamp)
&& cacheHitCollector.isMetricPresentForThresholdTime(currTimestamp)
&& exceedsSize) {
Contributor


It's not clear why exceedsSize is important for thrashing.
Is this to remove false positives due to TTL-based evictions? The doc string mentions that TTL-based evictions are not performed because the setting is not present, so we ignore them.

Even to remove the TTL-eviction false positives, cacheSize may not exceed maxCacheSize because:

  1. ES accounting should mostly be accurate, which prevents cacheSize from exceeding cacheMaxSize.
  2. Evictions should have brought down the cacheSize.

Maybe we should check for cacheSize to be within a threshold of cacheMaxSize: assume evictions were due to TTL if cacheMaxSize - cacheSize > threshold.

Contributor Author


The evictions in the Elasticsearch cache fall under two categories:

  1. Direct evictions: weight-exceeded and TTL-expiry based.
  2. Async evictions: entry invalidation followed by forced eviction [cache clear(), index close()], or, in the case of the request cache, entry invalidation for timed-out requests with a runnable cleaning up the invalidated entries.

The reason for using exceedsSize was to handle the 2nd category: we do not increase the cache size unless we notice evictions and the cache is filling up.

I agree, we can use cacheMaxSize - cacheSize > threshold, where the threshold is a percentage of cacheMaxSize.
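The agreed heuristic could look roughly like this; the 90% figure and the method names are illustrative assumptions, not values from the PR:

```java
// Treat evictions as TTL/async-driven (not thrashing) while the cache is far from
// full; only count the size condition once usage crosses a percentage of the max.
public class CacheFullnessCheck {
    static boolean exceedsSizeThreshold(long cacheSizeKb, long cacheMaxSizeKb,
                                        double thresholdPercent) {
        if (cacheMaxSizeKb == 0) {
            return false; // max size unknown or unreported; skip the size condition
        }
        // Equivalent to cacheMaxSize - cacheSize <= (1 - thresholdPercent) * cacheMaxSize.
        return cacheSizeKb > cacheMaxSizeKb * thresholdPercent;
    }

    public static void main(String[] args) {
        System.out.println(exceedsSizeThreshold(9500, 10000, 0.9)); // true: within 5% of max
        System.out.println(exceedsSizeThreshold(4000, 10000, 0.9)); // false: far from full
    }
}
```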

@khushbr khushbr force-pushed the khushbr-cache-rca branch from de5da1d to cff1981 Compare July 15, 2020 05:48
continue;
}

double evictionCount = flowUnit.getData().stream().mapToDouble(
Contributor


Got it, so we need to sum the stream because this is a shard level metric?
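To illustrate the point in the question above: the eviction metric arrives as one row per shard/operation, so the node-level count is the sum over rows. The Row record here is a hypothetical stand-in for the real flow-unit data type:

```java
import java.util.List;

// Why the stream is summed: each flow unit carries per-shard rows, and the
// RCA needs the aggregate eviction count for the whole node.
public class ShardMetricSum {
    // Hypothetical stand-in for a flow-unit data row.
    record Row(String shardId, double evictionCount) {}

    static double totalEvictions(List<Row> rows) {
        return rows.stream().mapToDouble(Row::evictionCount).sum();
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row("shard-0", 3.0), new Row("shard-1", 2.0));
        System.out.println(totalEvictions(rows)); // 5.0
    }
}
```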

Contributor

@rguo-aws rguo-aws left a comment


LGTM

@khushbr khushbr merged commit d75a1a7 into master Jul 23, 2020
@khushbr khushbr deleted the khushbr-cache-rca branch July 23, 2020 19:59