Consistent updates of IndexShardSnapshotStatus #28130

tlrx · 2018-01-08T12:34:47Z

IndexShardSnapshotStatus represents the status of an index shard
snapshot. During the snapshot process it can be updated multiple times,
it can also be aborted and the the Get Snapshot and Snapshot Status APIs
are susceptible to read the snapshot statuses.

Right now, the information are not updated in a coherent manner, meaning
that a IndexShardSnapshotStatus can be FAILED but the failure is not
updated yet. Or it could be FINALIZE with no index version.

This commit changes IndexShardSnapshotStatus so that the Stage is updated
coherently with any required information. It also provides a asCopy()
method that returns the status of a IndexShardSnapshotStatus at a given
point in time, ensuring that all information are coherent.

Closes #26480

IndexShardSnapshotStatus represents the status of an index shard snapshot. During the snapshot process it can be updated multiple times, it can also be aborted and the the Get Snapshot and Snapshot Status APIs are susceptible to read the snapshot statuses. Right now, the information are not updated in a coherent manner, meaning that a IndexShardSnapshotStatus can be FAILED but the failure is not updated yet. Or it could be FINALIZE with no index version. This commit changes IndexShardSnapshotStatus so that the Stage is updated coherently with any required information. It also provides a asCopy() method that returns the status of a IndexShardSnapshotStatus at a given point in time, ensuring that all information are coherent.

ywelsch

I very much like the approach here. Not super happy about the "Copy" name though. I've left some smaller feedback.

ywelsch · 2018-01-08T13:25:14Z

core/src/main/java/org/elasticsearch/index/snapshots/IndexShardSnapshotStatus.java

-    public long time() {
-        return this.time;
+    public synchronized Copy moveToStarted(final long startTime, final int numberOfFiles, final long totalSize) {
+        ensureNotAborted();


is this necessary? Isn't it covered already by the next line?

Right, not needed anymore, thanks

ywelsch · 2018-01-08T13:25:46Z

core/src/main/java/org/elasticsearch/index/snapshots/IndexShardSnapshotStatus.java

+            this.numberOfFiles = numberOfFiles;
+            this.totalSize = totalSize;
+        } else {
+            throw new IllegalStateException("Unable to move the shard snapshot status to started: it is not initializing");


also add current state here to exception message?

ywelsch · 2018-01-08T14:15:36Z

core/src/main/java/org/elasticsearch/snapshots/SnapshotShardsService.java

-                        snapshotStatus.abort();
+                    final IndexShardSnapshotStatus.Stage stage = snapshotStatus.asCopy().getStage();
+                    if (stage == Stage.INIT || stage == Stage.STARTED) {
+                        snapshotStatus.moveToAborted("snapshot has been removed in cluster state, aborting");


Would it make sense to change moveToAborted so that it only does this if the shard is in INIT or STARTED (rename it to abortIfNotCompleted?).

That's an interesting suggestion, thanks

ywelsch · 2018-01-08T14:23:26Z

core/src/main/java/org/elasticsearch/snapshots/SnapshotShardsService.java

                                    case INIT:
                                    case STARTED:
-                                        snapshotStatus.abort();
+                                        snapshotStatus.moveToAborted("snapshot has been aborted");


same thing here. We could just call abortIfNotCompleted, and then take a snapshot and then act based on FINALIZE, DONE, FAILURE

ywelsch · 2018-01-08T14:31:24Z

core/src/main/java/org/elasticsearch/index/snapshots/IndexShardSnapshotStatus.java

+
+        @Override
+        public String toString() {
+            return new StringBuilder()


this looks a bit odd of a toString() implementation as it's very much targeted towards that one logging call site. Maybe change it to be more generic.

ywelsch · 2018-01-08T14:40:11Z

core/src/main/java/org/elasticsearch/index/snapshots/IndexShardSnapshotStatus.java

    private int numberOfFiles;
-
    private volatile int processedFiles;


this does not need to be volatile anymore.

ywelsch · 2018-01-08T14:40:29Z

core/src/main/java/org/elasticsearch/index/snapshots/IndexShardSnapshotStatus.java

    private long totalSize;
-
    private volatile long processedSize;


no need to be volatile anymore

tlrx · 2018-01-08T19:41:15Z

@ywelsch Thanks for review! I updated the code according to your comments, can you have another look please?

tlrx · 2018-01-08T19:59:45Z

Not super happy about the "Copy" name though.

Yeah I agree, but I didn't come with a better name.

ywelsch

I've added a few more comments

ywelsch · 2018-01-09T10:39:48Z

core/src/main/java/org/elasticsearch/snapshots/SnapshotShardsService.java

@@ -188,7 +187,7 @@ public void beforeIndexShardClosed(ShardId shardId, @Nullable IndexShard indexSh
            Map<ShardId, IndexShardSnapshotStatus> shards = snapshotShards.getValue().shards;
            if (shards.containsKey(shardId)) {
                logger.debug("[{}] shard closing, abort snapshotting for snapshot [{}]", shardId, snapshotShards.getKey().getSnapshotId());
-                shards.get(shardId).abort();
+                shards.get(shardId).moveToAborted("shard is closing, aborting");


I wonder if we can also use abortIfNotCompleted here. Finally, looking at BlobStoreRepository.snapshot, we should also do the store.decRef() before we move to finalization. The store is unnecessarily kept open (possibly causing ShardLockObtainFailedException) and the snapshot process does not check abortion while it is doing its finalization.

I wonder if we can also use abortIfNotCompleted here.

I think so. I wasn't sure when I added abortIfNotCompleted() but since the abort status is not checked again when finalizing the snapshot we can release the store as soon it is not needed so the shard can be closed and the finalization process continue and succeed or failed as it is already the case today.

The store is unnecessarily kept open (possibly causing ShardLockObtainFailedException)

Good catch, I changed that too.

ywelsch · 2018-01-09T10:46:28Z

core/src/main/java/org/elasticsearch/index/snapshots/IndexShardSnapshotStatus.java

-    public synchronized void processedFiles(int numberOfFiles, long totalSize) {
-        processedFiles = numberOfFiles;
-        processedSize = totalSize;
+    public void ensureNotAborted() {


I would prefer to have the previous boolean isAborted method. Call sites currently catch this exception and then throw another exception.

Ok, I revert the changes.

ywelsch · 2018-01-09T10:49:04Z

core/src/main/java/org/elasticsearch/index/snapshots/IndexShardSnapshotStatus.java

-    public int processedFiles() {
-        return processedFiles;
+    public static IndexShardSnapshotStatus newFailed(final String failure) {
+        if (failure == null) {


also add an assertion here so we fail hard in our tests?

ywelsch · 2018-01-09T10:50:51Z

core/src/main/java/org/elasticsearch/snapshots/SnapshotShardsService.java

-                    if (snapshotStatus.stage() == Stage.INIT || snapshotStatus.stage() == Stage.STARTED) {
-                        snapshotStatus.abort();
-                    }
+                    final IndexShardSnapshotStatus.Stage stage = snapshotStatus.asCopy().getStage();


this line can be removed?

tlrx · 2018-01-09T11:46:55Z

@ywelsch Code updated, thanks.

ywelsch

LGTM. Let's bake this a few days on master and then backport to 6.2.

tlrx · 2018-01-09T13:02:21Z

Thanks @ywelsch !

* master: (27 commits) Declare empty package dirs as output dirs Consistent updates of IndexShardSnapshotStatus (elastic#28130) Fix Gradle wrapper usage on Windows when building BWC (elastic#28146) [Docs] Fix some typos in comments (elastic#28098) Use Gradle wrapper when building BWC Painless: Add a simple cache for whitelist methods and fields. (elastic#28142) Fix upgrading indices which use a custom similarity plugin. (elastic#26985) Fix Licenses values for CDDL and Custom URL (elastic#27999) Cleanup TcpChannelFactory and remove classes (elastic#28102) Fix expected plugins test for transport-nio [Docs] Fix Date Math example descriptions (elastic#28125) Fail rollover if duplicated alias found in template (elastic#28110) Avoid concurrent snapshot finalizations when deleting an INIT snapshot (elastic#28078) Deprecate `isShardsAcked()` in favour of `isShardsAcknowledged()` (elastic#27819) [TEST] Wait for replicas to be allocated before shrinking Use the underlying connection version for CCS connections (elastic#28093) test: do not use asn fields Test: Add assumeFalse for test that cannot pass on windows Clarify reproduce info on Windows Remove out-of-date projectile file ...

* master: Set watermarks in single-node test cases Add the ability to bundle multiple plugins into a meta plugin (elastic#28022) Declare empty package dirs as output dirs Consistent updates of IndexShardSnapshotStatus (elastic#28130)

This commit changes IndexShardSnapshotStatus so that the Stage is updated coherently with any required information. It also provides a asCopy() method that returns the status of a IndexShardSnapshotStatus at a given point in time, ensuring that all information are coherent. Closes #26480

tlrx · 2018-01-15T13:14:53Z

I didn't see any related failures since it has been merged, so I backported this change to 6.2.0 in ea6a46a

* es/6.x: (31 commits) Fix eclipse build. (#28236) Never return null from Strings.tokenizeToStringArray (#28224) Fallback to TransportMasterNodeAction for cluster health retries (#28195) [Docs] Changes to ingest.asciidoc (#28212) TEST: Update logging for testAckedIndexing [GEO] Deprecate field parameter in GeoBoundingBoxQueryBuilder [GEO] Add WKT Support to GeoBoundingBoxQueryBuilder Avoid doing redundant work when checking for self references. (#26927) Fix casts in HotThreads. (#27578) Ignore the `-snapshot` suffix when comparing the Lucene version in the build and the docs. (#27927) Allow update of `eager_global_ordinals` on `_parent`. (#28014) Painless: Add whitelist extensions (#28161) Fix daitch_mokotoff phonetic filter to use the dedicated Lucene filter (#28225) Fix NPE on composite aggregation with sub-aggregations that need scores (#28129) #28045 restore removed import after backport Fix synonym phrase query expansion for cross_fields parsing (#28045) Introduce elasticsearch-core jar (#28191) upgrade to lucene 7.2.1 (#28218) [Docs] Fix an error in painless-types.asciidoc (#28221) Consistent updates of IndexShardSnapshotStatus (#28130) ...

tlrx added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs review v7.0.0 labels Jan 8, 2018

tlrx requested a review from ywelsch January 8, 2018 12:34

tlrx mentioned this pull request Jan 8, 2018

Consistent update of stage and failure message in IndexShardSnapshotStatus #27557

Closed

ywelsch suggested changes Jan 8, 2018

View reviewed changes

Apply feedback

25afc10

Change toString()

12bf56c

ywelsch reviewed Jan 9, 2018

View reviewed changes

Apply feedback 2

93a51a5

ywelsch approved these changes Jan 9, 2018

View reviewed changes

tlrx merged commit bba591b into elastic:master Jan 9, 2018

tlrx added the backport pending label Jan 9, 2018

tlrx added v6.2.0 and removed backport pending labels Jan 15, 2018

tlrx deleted the refactor-indexshardsnapshotstatus branch January 15, 2018 14:02

colings86 added the >bug label Jan 22, 2018

ywelsch mentioned this pull request Jan 26, 2018

[CI] SharedClusterSnapshotRestoreIT#testGetSnapshotsRequest failing #26480

Closed

tlrx mentioned this pull request Jun 4, 2018

SharedClusterSnapshotRestoreIT#testGetSnapshotsRequest fails #31054

Closed

jimczi added the v7.0.0-beta1 label Feb 7, 2019

jimczi removed the v7.0.0 label Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consistent updates of IndexShardSnapshotStatus #28130

Consistent updates of IndexShardSnapshotStatus #28130

tlrx commented Jan 8, 2018

ywelsch left a comment

ywelsch Jan 8, 2018

tlrx Jan 8, 2018

ywelsch Jan 8, 2018

ywelsch Jan 8, 2018

tlrx Jan 8, 2018

ywelsch Jan 8, 2018

ywelsch Jan 8, 2018

ywelsch Jan 8, 2018

ywelsch Jan 8, 2018

tlrx commented Jan 8, 2018

tlrx commented Jan 8, 2018

ywelsch left a comment

ywelsch Jan 9, 2018

tlrx Jan 9, 2018

ywelsch Jan 9, 2018

tlrx Jan 9, 2018

ywelsch Jan 9, 2018

ywelsch Jan 9, 2018

tlrx commented Jan 9, 2018

ywelsch left a comment •

edited

Loading

tlrx commented Jan 9, 2018

tlrx commented Jan 15, 2018

		private int numberOfFiles;

		private volatile int processedFiles;

Consistent updates of IndexShardSnapshotStatus #28130

Consistent updates of IndexShardSnapshotStatus #28130

Conversation

tlrx commented Jan 8, 2018

ywelsch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tlrx commented Jan 8, 2018

tlrx commented Jan 8, 2018

ywelsch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tlrx commented Jan 9, 2018

ywelsch left a comment • edited Loading

Choose a reason for hiding this comment

tlrx commented Jan 9, 2018

tlrx commented Jan 15, 2018

ywelsch left a comment •

edited

Loading