-
Notifications
You must be signed in to change notification settings - Fork 20
Conversation
commit e72afa1 Author: khushbr <[email protected]> Date: Tue Jul 7 11:34:59 2020 -0700 Refreshing from Mainline commit d1b3f21 Author: khushbr <[email protected]> Date: Tue Jul 7 11:13:49 2020 -0700 Removing a comment from the util file commit 63ddab2 Author: khushbr <[email protected]> Date: Tue Jul 7 02:47:36 2020 -0700 Adding final set of changes for Cache RCAs commit 09537c2 Author: khushbr <[email protected]> Date: Mon Jul 6 23:54:28 2020 -0700 Adding weight check for async evictions in FieldDataCache commit 65d8f43 Author: khushbr <[email protected]> Date: Mon Jul 6 16:04:59 2020 -0700 Addinf UT for ShardRequestCacheRca commit cdec05b Author: khushbr <[email protected]> Date: Mon Jul 6 10:45:07 2020 -0700 Adding UT for FieldDataCacheRca commit ebdbc7e Author: khushbr <[email protected]> Date: Sun Jul 5 23:16:57 2020 -0700 Refreshing from master commit 4238148 Author: khushbr <[email protected]> Date: Sun Jul 5 23:15:08 2020 -0700 Adding the FieldDataCacheRca and ShardRequestCacheRca commit b6f1a51 Author: khushbr <[email protected]> Date: Wed Jul 1 14:17:01 2020 -0700 Including the Protobuf re-factoring changes commit 03cc597 Author: khushbr <[email protected]> Date: Wed Jun 24 21:53:08 2020 -0700 Refreshing from master
Codecov Report
@@ Coverage Diff @@
## master #265 +/- ##
============================================
- Coverage 66.58% 66.19% -0.39%
+ Complexity 1864 1807 -57
============================================
Files 276 275 -1
Lines 12307 12163 -144
Branches 982 981 -1
============================================
- Hits 8194 8051 -143
+ Misses 3782 3781 -1
Partials 331 331
Continue to review full report at Codecov.
|
...a/com/amazon/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/CacheUtil.java
Outdated
Show resolved
Hide resolved
...azon/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/FieldDataCacheRca.java
Outdated
Show resolved
Hide resolved
...azon/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/FieldDataCacheRca.java
Outdated
Show resolved
Hide resolved
if (Double.isNaN(size)) { | ||
LOG.error("Failed to parse metric in FlowUnit from {}", sizeMetric.getClass().getName()); | ||
} else { | ||
sizeTotalInMB += size / CONVERT_BYTES_TO_MEGABYTES; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Floating point comparisons have rounding errors which can affect the places where this value is used. Should we just convert it to KB and use long values such that we tolerate a rounding error of 1kb?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Vigya for the suggestion, I agree we can go with KB and tolerate the 1KB rounding error. I have updated the code for the same.
...azon/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/FieldDataCacheRca.java
Show resolved
Hide resolved
...n/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/ShardRequestCacheRca.java
Outdated
Show resolved
Hide resolved
ClusterDetailsEventProcessor.NodeDetails currentNode = ClusterDetailsEventProcessor.getCurrentNodeDetails(); | ||
double cacheSize = getTotalSizeInMB(shardRequestCacheSize); | ||
double cacheMaxSize = getTotalSizeInMB(shardRequestCacheMaxSize); | ||
exceedsSize = cacheMaxSize != 0 && cacheMaxSize != 0 && cacheSize > cacheMaxSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
// the cache is considered as unhealthy | ||
if (cacheEvictionCollector.isMetricPresentForThresholdTime(currTimestamp) | ||
&& cacheHitCollector.isMetricPresentForThresholdTime(currTimestamp) | ||
&& exceedsSize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear why exceedsSize
is important for thrashing.
Is this to remove false positives due to TTL based evictions? The doc string mentions that TTL based evictions are not performed because setting is not present and so we ignore it.
Even to remove the TTL eviction false +ves, cacheSize may not exceed maxCacheSize because -
- ES accounting should mostly be accurate, which prevents cacheSize from exceeding cacheMaxSize.
- Evictions should have brought down the cacheSize.
Maybe we should check for cacheSize to be within a threshold of cacheMaxSize: assume evictions were due to TTL if cacheMaxSize - cacheSize > Threshold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The evictions in Elasticsearch cache fall under 2 category :
- Direct Evictions : Exceeds Weight and TTL expiry based
- Async Eviction : Entry Invalidation, followed by forced eviction [Cache clear(), index close()] or in case of request cache, entry invalidation for timed out requests with a runnable clearing up the invalidated entries.
The reason for using exceedsSize
was to handle the 2nd category, that we do not increase the cache size unless we notice eviction plus, the cache is filling up.
I agree, we can use cacheMaxSize - cacheSize > threshold
, where the threshold is percentage of the cacheMaxSize
...n/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/ShardRequestCacheRca.java
Show resolved
Hide resolved
...n/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/ShardRequestCacheRca.java
Show resolved
Hide resolved
...azon/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/FieldDataCacheRca.java
Outdated
Show resolved
Hide resolved
...a/com/amazon/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/CacheUtil.java
Outdated
Show resolved
Hide resolved
...a/com/amazon/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/CacheUtil.java
Outdated
Show resolved
Hide resolved
...azon/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/FieldDataCacheRca.java
Show resolved
Hide resolved
...azon/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/FieldDataCacheRca.java
Outdated
Show resolved
Hide resolved
...n/opendistro/elasticsearch/performanceanalyzer/rca/store/rca/cache/ShardRequestCacheRca.java
Show resolved
Hide resolved
de5da1d
to
cff1981
Compare
continue; | ||
} | ||
|
||
double evictionCount = flowUnit.getData().stream().mapToDouble( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, so we need to sum the stream because this is a shard level metric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Issue #, if available: -N/A-
Description of changes:
Caching is one of the fundamental optimization strategy that is employed in information retrieval systems to expedite query processing and reduce back-end server workload. ElasticSearch supports 3 types of cache : Field Data Cache, Node Query Cache and Shard Request Cache.
We want to auto-size the caches dynamically, without a restart and reliance on node level elasticsearch.yml. This PR adds changes for Field Data Cache and Shard Request Cache RCA, which will help identify if the Cache is Unhealthy based on input metrics, which will then be a candidate for auto-tuning.
Field Data Cache RCA is to identify when the cache is unhealthy(thrashing) and otherwise, healthy. The dimension we are use for this analysis are
fieldDataCacheEvictions
andfieldDataCacheMaxSizeGroupByOperation
upstream metrics and maintains a collector which keeps track of the time window period(tp) where we repeatedly see evictions for the last ‘tp ’duration. This RCA is marked as unhealthy if ’tp’ is above the threshold(300 seconds) and cache size exceeds the max cache size configured.Note : For Field Data Cache, Hit and Miss metrics aren't available.
Shard Request Cache RCA reads
shardRequestCacheEvictions
,shardRequestCacheHits
,shardRequestCacheSizeGroupByOperation
andshardRequestCacheMaxSizeInBytes
upstream metrics and maintains collectors which keep track of time window period(tp) where we repeatedly see evictions and hits for the last ‘tp’ duration. This RCA is marked as unhealthy if tp we find ‘tp’ is above the threshold(300 seconds) and cache size exceeds the max cache size configured.Tests: Unit Tests present.
Code coverage percentage for this patch:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.