-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistent updates of IndexShardSnapshotStatus #28130
Conversation
IndexShardSnapshotStatus represents the status of an index shard snapshot. During the snapshot process it can be updated multiple times, it can also be aborted and the the Get Snapshot and Snapshot Status APIs are susceptible to read the snapshot statuses. Right now, the information are not updated in a coherent manner, meaning that a IndexShardSnapshotStatus can be FAILED but the failure is not updated yet. Or it could be FINALIZE with no index version. This commit changes IndexShardSnapshotStatus so that the Stage is updated coherently with any required information. It also provides a asCopy() method that returns the status of a IndexShardSnapshotStatus at a given point in time, ensuring that all information are coherent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I very much like the approach here. Not super happy about the "Copy" name though. I've left some smaller feedback.
public long time() { | ||
return this.time; | ||
public synchronized Copy moveToStarted(final long startTime, final int numberOfFiles, final long totalSize) { | ||
ensureNotAborted(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this necessary? Isn't it covered already by the next line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, not needed anymore, thanks
this.numberOfFiles = numberOfFiles; | ||
this.totalSize = totalSize; | ||
} else { | ||
throw new IllegalStateException("Unable to move the shard snapshot status to started: it is not initializing"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also add current state here to exception message?
snapshotStatus.abort(); | ||
final IndexShardSnapshotStatus.Stage stage = snapshotStatus.asCopy().getStage(); | ||
if (stage == Stage.INIT || stage == Stage.STARTED) { | ||
snapshotStatus.moveToAborted("snapshot has been removed in cluster state, aborting"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to change moveToAborted
so that it only does this if the shard is in INIT or STARTED (rename it to abortIfNotCompleted
?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an interesting suggestion, thanks
case INIT: | ||
case STARTED: | ||
snapshotStatus.abort(); | ||
snapshotStatus.moveToAborted("snapshot has been aborted"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same thing here. We could just call abortIfNotCompleted
, and then take a snapshot and then act based on FINALIZE, DONE, FAILURE
|
||
@Override | ||
public String toString() { | ||
return new StringBuilder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks a bit odd of a toString() implementation as it's very much targeted towards that one logging call site. Maybe change it to be more generic.
private int numberOfFiles; | ||
|
||
private volatile int processedFiles; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does not need to be volatile anymore.
private long totalSize; | ||
|
||
private volatile long processedSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to be volatile anymore
@ywelsch Thanks for review! I updated the code according to your comments, can you have another look please? |
Yeah I agree, but I didn't come with a better name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a few more comments
@@ -188,7 +187,7 @@ public void beforeIndexShardClosed(ShardId shardId, @Nullable IndexShard indexSh | |||
Map<ShardId, IndexShardSnapshotStatus> shards = snapshotShards.getValue().shards; | |||
if (shards.containsKey(shardId)) { | |||
logger.debug("[{}] shard closing, abort snapshotting for snapshot [{}]", shardId, snapshotShards.getKey().getSnapshotId()); | |||
shards.get(shardId).abort(); | |||
shards.get(shardId).moveToAborted("shard is closing, aborting"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can also use abortIfNotCompleted
here. Finally, looking at BlobStoreRepository.snapshot
, we should also do the store.decRef()
before we move to finalization. The store is unnecessarily kept open (possibly causing ShardLockObtainFailedException) and the snapshot process does not check abortion while it is doing its finalization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can also use abortIfNotCompleted here.
I think so. I wasn't sure when I added abortIfNotCompleted() but since the abort status is not checked again when finalizing the snapshot we can release the store as soon it is not needed so the shard can be closed and the finalization process continue and succeed or failed as it is already the case today.
The store is unnecessarily kept open (possibly causing ShardLockObtainFailedException)
Good catch, I changed that too.
public synchronized void processedFiles(int numberOfFiles, long totalSize) { | ||
processedFiles = numberOfFiles; | ||
processedSize = totalSize; | ||
public void ensureNotAborted() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to have the previous boolean isAborted
method. Call sites currently catch this exception and then throw another exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I revert the changes.
public int processedFiles() { | ||
return processedFiles; | ||
public static IndexShardSnapshotStatus newFailed(final String failure) { | ||
if (failure == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also add an assertion here so we fail hard in our tests?
if (snapshotStatus.stage() == Stage.INIT || snapshotStatus.stage() == Stage.STARTED) { | ||
snapshotStatus.abort(); | ||
} | ||
final IndexShardSnapshotStatus.Stage stage = snapshotStatus.asCopy().getStage(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line can be removed?
@ywelsch Code updated, thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Let's bake this a few days on master and then backport to 6.2.
Thanks @ywelsch ! |
* master: (27 commits) Declare empty package dirs as output dirs Consistent updates of IndexShardSnapshotStatus (elastic#28130) Fix Gradle wrapper usage on Windows when building BWC (elastic#28146) [Docs] Fix some typos in comments (elastic#28098) Use Gradle wrapper when building BWC Painless: Add a simple cache for whitelist methods and fields. (elastic#28142) Fix upgrading indices which use a custom similarity plugin. (elastic#26985) Fix Licenses values for CDDL and Custom URL (elastic#27999) Cleanup TcpChannelFactory and remove classes (elastic#28102) Fix expected plugins test for transport-nio [Docs] Fix Date Math example descriptions (elastic#28125) Fail rollover if duplicated alias found in template (elastic#28110) Avoid concurrent snapshot finalizations when deleting an INIT snapshot (elastic#28078) Deprecate `isShardsAcked()` in favour of `isShardsAcknowledged()` (elastic#27819) [TEST] Wait for replicas to be allocated before shrinking Use the underlying connection version for CCS connections (elastic#28093) test: do not use asn fields Test: Add assumeFalse for test that cannot pass on windows Clarify reproduce info on Windows Remove out-of-date projectile file ...
* master: Set watermarks in single-node test cases Add the ability to bundle multiple plugins into a meta plugin (elastic#28022) Declare empty package dirs as output dirs Consistent updates of IndexShardSnapshotStatus (elastic#28130)
This commit changes IndexShardSnapshotStatus so that the Stage is updated coherently with any required information. It also provides a asCopy() method that returns the status of a IndexShardSnapshotStatus at a given point in time, ensuring that all information are coherent. Closes #26480
I didn't see any related failures since it has been merged, so I backported this change to 6.2.0 in ea6a46a |
* es/6.x: (31 commits) Fix eclipse build. (#28236) Never return null from Strings.tokenizeToStringArray (#28224) Fallback to TransportMasterNodeAction for cluster health retries (#28195) [Docs] Changes to ingest.asciidoc (#28212) TEST: Update logging for testAckedIndexing [GEO] Deprecate field parameter in GeoBoundingBoxQueryBuilder [GEO] Add WKT Support to GeoBoundingBoxQueryBuilder Avoid doing redundant work when checking for self references. (#26927) Fix casts in HotThreads. (#27578) Ignore the `-snapshot` suffix when comparing the Lucene version in the build and the docs. (#27927) Allow update of `eager_global_ordinals` on `_parent`. (#28014) Painless: Add whitelist extensions (#28161) Fix daitch_mokotoff phonetic filter to use the dedicated Lucene filter (#28225) Fix NPE on composite aggregation with sub-aggregations that need scores (#28129) #28045 restore removed import after backport Fix synonym phrase query expansion for cross_fields parsing (#28045) Introduce elasticsearch-core jar (#28191) upgrade to lucene 7.2.1 (#28218) [Docs] Fix an error in painless-types.asciidoc (#28221) Consistent updates of IndexShardSnapshotStatus (#28130) ...
IndexShardSnapshotStatus represents the status of an index shard
snapshot. During the snapshot process it can be updated multiple times,
it can also be aborted and the the Get Snapshot and Snapshot Status APIs
are susceptible to read the snapshot statuses.
Right now, the information are not updated in a coherent manner, meaning
that a IndexShardSnapshotStatus can be FAILED but the failure is not
updated yet. Or it could be FINALIZE with no index version.
This commit changes IndexShardSnapshotStatus so that the Stage is updated
coherently with any required information. It also provides a asCopy()
method that returns the status of a IndexShardSnapshotStatus at a given
point in time, ensuring that all information are coherent.
Closes #26480