SearchableSnapshotDirectory should not evict cache files when closed #66173
Conversation
Pinging @elastic/es-distributed (Team:Distributed)
I did an initial read and thought I would post my initial comments early.
 */
public void forEach(BiConsumer<K, V> consumer) {
    for (CacheSegment<K, V> segment : segments) {
        try (ReleasableLock ignored = segment.writeLock.acquire()) {
Would holding the readLock not be enough?
It is enough; I'm not sure why I used and documented the write lock here. I pushed 082e0c8 to use the read lock instead (it should prevent any mutation anyway).
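For reference, here is a minimal, self-contained sketch of the idea (not the actual Cache class; SegmentedCacheSketch and its fields are simplified stand-ins) showing why the read lock is sufficient for forEach: iteration only needs to exclude concurrent mutation, and writers already take the write lock.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.BiConsumer;

class SegmentedCacheSketch<K, V> {

    private static final int SEGMENT_COUNT = 16;

    private final List<Segment> segments = new ArrayList<>();

    SegmentedCacheSketch() {
        for (int i = 0; i < SEGMENT_COUNT; i++) {
            segments.add(new Segment());
        }
    }

    private final class Segment {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        final Map<K, V> entries = new HashMap<>();
    }

    /** Visits every entry; the read lock is enough because it excludes concurrent writers. */
    public void forEach(BiConsumer<K, V> consumer) {
        for (Segment segment : segments) {
            segment.lock.readLock().lock();
            try {
                segment.entries.forEach(consumer);
            } finally {
                segment.lock.readLock().unlock();
            }
        }
    }

    /** Mutations take the write lock, so they cannot run concurrently with forEach. */
    public void put(K key, V value) {
        Segment segment = segments.get(Math.floorMod(key.hashCode(), SEGMENT_COUNT));
        segment.lock.writeLock().lock();
        try {
            segment.entries.put(key, value);
        } finally {
            segment.lock.writeLock().unlock();
        }
    }
}
```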
        success = true;
    } finally {
        if (success == false) {
            final boolean added = evictedShards.add(shardEviction);
I am not sure this is necessary. When cache.invalidate fails, it has already removed the object from the map, so if we retry an eviction it will not hit the shard anyway.
So if we can remove this, I think we can also drop the shardsEvictionLock, which makes markShardAsEvictedInCache safer with respect to being called from the cluster applier thread.
I think you are right; I pushed 393deaf to remove the usage of the lock here (but such locks are still necessary to avoid cache file evictions while the directory is starting). Let me know what you think.
I think the locks could be removed with some restructuring, but we can tackle that in a follow-up; it is not really important.
However, if we add to evictedShards here while running the job on the thread, I think we risk it leaking since nothing will remove it later. I think we should assert success here. AFAICS, there is no way this should fail unless there is a bug somewhere.
I would be happy to know more about how you think it could be restructured. For now I'm taking note of this point and will come back to you in the near future to tackle it.
I agree there's a risk of leaking a ShardEviction, but I think it is very unlikely to happen. One idea would be to hook into the cache sync task to clean up any leftover ShardEviction instances there. If that's OK we can address this too as a follow-up, along with the previous point. For now I added the success assertion.
I've opened #67160 which should improve the situation.
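To make the outcome of this thread concrete, here is a minimal sketch under assumed, simplified names (ShardEvictionSketch, evictCacheFilesOf and the plain String shard identifier are illustrative, not the real CacheService API): marking a shard as evicted only records it in a concurrent set and schedules the disk-heavy work on another executor, so it stays safe to call from the cluster state applier thread, and the job asserts success instead of re-adding the marker on failure.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;

final class ShardEvictionSketch {

    private final Set<String> evictedShards = ConcurrentHashMap.newKeySet();
    private final ExecutorService genericExecutor;

    ShardEvictionSketch(ExecutorService genericExecutor) {
        this.genericExecutor = genericExecutor;
    }

    /** No locks and no disk IO on the calling (cluster state applier) thread. */
    void markShardAsEvictedInCache(String shardEviction) {
        if (evictedShards.add(shardEviction)) {
            genericExecutor.execute(() -> processShardEviction(shardEviction));
        }
    }

    private void processShardEviction(String shardEviction) {
        boolean success = false;
        try {
            evictCacheFilesOf(shardEviction);
            success = true;
        } finally {
            evictedShards.remove(shardEviction);
            // do not re-add shardEviction to evictedShards on failure: nothing would ever
            // remove it again, so it would leak; a failing eviction indicates a bug
            assert success : "shard eviction should not fail: " + shardEviction;
        }
    }

    private void evictCacheFilesOf(String shardEviction) {
        // the real CacheService invalidates every cache file of the evicted shard here
    }
}
```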
        }
    });
    if (cacheFilesToEvict.isEmpty() == false) {
        cacheFilesToEvict.forEach(cache::invalidate);
Perhaps we need to catch exceptions here, assert that they are IO-related, write a warning, and then continue with the next file? That ensures we remove all cache files from the cache in one go, but may leave some files lingering if there are IO issues.
Yes, your suggestion makes eviction safer, just in case. I pushed 58b7bd9.
Thanks a lot @henningandersen. I've updated the PR according to your feedback. This is ready for another review.
LGTM. I think we need to add a test, at least for the new methods on CacheService.
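As a sketch of the kind of unit test being asked for (hypothetical names throughout; the real test would live in the Elasticsearch test framework rather than this plain JUnit 4 example, and CacheServiceSketch is a tiny stand-in, not the actual CacheService): mark a shard as evicted, process the pending evictions, and verify the shard's cache entries are gone.

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.junit.Test;

public class ShardEvictionSketchTests {

    /** Tiny stand-in for the cache service behaviour under test. */
    static final class CacheServiceSketch {
        final Map<String, String> cacheFilesByShard = new ConcurrentHashMap<>();
        final Set<String> evictedShards = ConcurrentHashMap.newKeySet();

        void put(String shardId, String cacheFile) {
            cacheFilesByShard.put(shardId, cacheFile);
        }

        void markShardAsEvictedInCache(String shardId) {
            evictedShards.add(shardId);
        }

        void processShardEvictions() {
            for (Iterator<String> it = evictedShards.iterator(); it.hasNext(); ) {
                cacheFilesByShard.remove(it.next());
                it.remove();
            }
        }
    }

    @Test
    public void testMarkShardAsEvictedRemovesCacheFiles() {
        CacheServiceSketch service = new CacheServiceSketch();
        service.put("shard-0", "cache-file-0");

        service.markShardAsEvictedInCache("shard-0");
        assertTrue(service.evictedShards.contains("shard-0"));

        service.processShardEvictions();
        assertFalse(service.cacheFilesByShard.containsKey("shard-0"));
        assertTrue(service.evictedShards.isEmpty());
    }
}
```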
            cacheFilesToEvict.put(cacheKey, cacheFile);
        }
    });
    if (cacheFilesToEvict.isEmpty() == false) {
nit: this check seems superfluous?
Yes, I removed it.
    for (Map.Entry<CacheKey, CacheFile> cacheFile : cacheFilesToEvict.entrySet()) {
        try {
            cache.invalidate(cacheFile.getKey(), cacheFile.getValue());
        } catch (Exception e) {
Can we just catch RuntimeException instead, since invalidate declares that it does not throw checked exceptions?
Sure.
    try {
        cache.invalidate(cacheFile.getKey(), cacheFile.getValue());
    } catch (Exception e) {
        assert e instanceof IOException : e;
Looking closer at this, I suppose this should never happen since we catch IO exceptions in onCacheFileRemoval, so we could just assert false : e here instead?
Agreed.
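Putting the points of this thread together, here is a hedged sketch of the final shape of the loop (the Cache interface, the logging, and evictAll are stand-ins; the real code lives in CacheService and uses the Elasticsearch logger): a RuntimeException on one cache file is logged and trips an assertion, but does not stop the remaining files from being evicted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

final class CacheFileInvalidationSketch {

    private static final Logger logger = Logger.getLogger(CacheFileInvalidationSketch.class.getName());

    /** Stand-in for the cache: invalidate(key, value) declares no checked exceptions. */
    interface Cache<K, V> {
        void invalidate(K key, V value);
    }

    static <K, V> void evictAll(Map<K, V> cacheFilesToEvict, Cache<K, V> cache) {
        for (Map.Entry<K, V> cacheFile : cacheFilesToEvict.entrySet()) {
            try {
                cache.invalidate(cacheFile.getKey(), cacheFile.getValue());
            } catch (RuntimeException e) {
                // IO failures are already handled in the removal listener, so reaching this
                // catch indicates a bug: log, trip assertions in tests, and keep evicting
                logger.log(Level.WARNING, "failed to evict cache file " + cacheFile.getKey(), e);
                assert false : e;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> cacheFilesToEvict = new LinkedHashMap<>();
        cacheFilesToEvict.put("cache-key-1", "cache-file-1");
        cacheFilesToEvict.put("cache-key-2", "cache-file-2");
        evictAll(cacheFilesToEvict, (key, value) -> System.out.println("invalidated " + key));
    }
}
```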
    @Override
    public void onFailure(Exception e) {
        logger.warn(
I think we could also assert false : e here?
+1
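A small illustrative sketch of that suggestion, with hypothetical names (the real code uses an Elasticsearch ActionListener-style callback rather than this Runnable wrapper): the failure handler of the async eviction task both logs a warning and asserts, since a failure here is only expected if there is a bug.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class EvictionTaskSketch implements Runnable {

    private static final Logger logger = Logger.getLogger(EvictionTaskSketch.class.getName());

    private final String shardEviction;
    private final Runnable evictionWork;

    EvictionTaskSketch(String shardEviction, Runnable evictionWork) {
        this.shardEviction = shardEviction;
        this.evictionWork = evictionWork;
    }

    @Override
    public void run() {
        try {
            evictionWork.run();
        } catch (Exception e) {
            onFailure(e);
        }
    }

    void onFailure(Exception e) {
        logger.log(Level.WARNING, "failed to evict cache files of shard " + shardEviction, e);
        // a failure here is only expected if there is a bug, so also trip assertions in tests
        assert false : e;
    }
}
```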
Thanks a lot Henning! I merged this with an additional test, which isn't great but can be improved in a follow-up along with your suggestion about improving the locking.
…lastic#66173) This commit changes the SearchableSnapshotDirectory so that it does not evict all its cache files at closing time, but instead delegates this work to the CacheService.
…66264) Backport of #66173 for 7.11.
This pull request changes the SearchableSnapshotDirectory so that it does not evict all its cache files at closing time, but instead delegates this work to the CacheService.

This change is motivated by:
- the fact that Lucene directories are closed as a consequence of applying a new cluster state, so the closing is executed within the cluster state applier thread, and we want to minimize disk IO operations in that thread (like deleting a lot of evicted cache files);
- the future of the searchable snapshot cache, which should become persistent.

This change is built on top of the existing SearchableSnapshotIndexEventListener and a new SearchableSnapshotIndexFoldersDeletionListener (see #65926) that are used to detect when a searchable snapshot index (or searchable snapshot shard) is removed from a data node.

When such a thing happens, the listeners notify the CacheService, which maintains an internal list of removed shards. This list is used to evict the cache files associated with these shards as soon as possible (but not in the cluster state applier thread), or right before the same searchable snapshot shard is built again on the same node.

In other situations, like opening or closing a searchable snapshot shard, the cache files are no longer evicted and should be reused.
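To tie the description together, here is a hedged, highly simplified sketch of the flow (all names are illustrative, including beforeShardRecovery, and a single String stands in for shard and cache-file identifiers; the real CacheService API differs): the deletion listeners mark the shard, the eviction runs off the cluster state applier thread, and rebuilding the same shard first forces any pending eviction so stale cache files are not reused.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class SearchableSnapshotCacheFlowSketch {

    private final Set<String> pendingShardEvictions = ConcurrentHashMap.newKeySet();
    private final Map<String, String> cacheFilesByShard = new ConcurrentHashMap<>();
    private final ExecutorService genericExecutor = Executors.newSingleThreadExecutor();

    /** Called by the deletion listeners, possibly from the cluster state applier thread: no disk IO here. */
    void markShardAsEvictedInCache(String shardId) {
        if (pendingShardEvictions.add(shardId)) {
            genericExecutor.execute(() -> evictShard(shardId));
        }
    }

    /** Called right before the same searchable snapshot shard is built again on this node. */
    void beforeShardRecovery(String shardId) {
        if (pendingShardEvictions.remove(shardId)) {
            // the shard is being rebuilt before the async eviction ran: evict synchronously now
            evictShard(shardId);
        }
    }

    private void evictShard(String shardId) {
        pendingShardEvictions.remove(shardId);
        String cacheFile = cacheFilesByShard.remove(shardId);
        if (cacheFile != null) {
            // delete the evicted cache file from disk here; the concurrent map remove makes
            // this idempotent, so an async run and a synchronous run cannot evict twice
        }
    }
}
```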