Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent updates of IndexShardSnapshotStatus #28130

Merged
merged 4 commits into from
Jan 9, 2018

Conversation

tlrx
Copy link
Member

@tlrx tlrx commented Jan 8, 2018

IndexShardSnapshotStatus represents the status of an index shard
snapshot. During the snapshot process it can be updated multiple times,
it can also be aborted and the the Get Snapshot and Snapshot Status APIs
are susceptible to read the snapshot statuses.

Right now, the information are not updated in a coherent manner, meaning
that a IndexShardSnapshotStatus can be FAILED but the failure is not
updated yet. Or it could be FINALIZE with no index version.

This commit changes IndexShardSnapshotStatus so that the Stage is updated
coherently with any required information. It also provides a asCopy()
method that returns the status of a IndexShardSnapshotStatus at a given
point in time, ensuring that all information are coherent.

Closes #26480

IndexShardSnapshotStatus represents the status of an index shard
snapshot. During the snapshot process it can be updated multiple times,
it can also be aborted and the the Get Snapshot and Snapshot Status APIs
are susceptible to read the snapshot statuses.

Right now, the information are not updated in a coherent manner, meaning
that a IndexShardSnapshotStatus can be FAILED but the failure is not
updated yet. Or it could be FINALIZE with no index version.

This commit changes IndexShardSnapshotStatus so that the Stage is updated
coherently with any required information. It also provides a asCopy()
method that returns the status of a IndexShardSnapshotStatus at a given
point in time, ensuring that all information are coherent.
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much like the approach here. Not super happy about the "Copy" name though. I've left some smaller feedback.

public long time() {
return this.time;
public synchronized Copy moveToStarted(final long startTime, final int numberOfFiles, final long totalSize) {
ensureNotAborted();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this necessary? Isn't it covered already by the next line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, not needed anymore, thanks

this.numberOfFiles = numberOfFiles;
this.totalSize = totalSize;
} else {
throw new IllegalStateException("Unable to move the shard snapshot status to started: it is not initializing");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add current state here to exception message?

snapshotStatus.abort();
final IndexShardSnapshotStatus.Stage stage = snapshotStatus.asCopy().getStage();
if (stage == Stage.INIT || stage == Stage.STARTED) {
snapshotStatus.moveToAborted("snapshot has been removed in cluster state, aborting");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to change moveToAborted so that it only does this if the shard is in INIT or STARTED (rename it to abortIfNotCompleted?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting suggestion, thanks

case INIT:
case STARTED:
snapshotStatus.abort();
snapshotStatus.moveToAborted("snapshot has been aborted");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing here. We could just call abortIfNotCompleted, and then take a snapshot and then act based on FINALIZE, DONE, FAILURE


@Override
public String toString() {
return new StringBuilder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks a bit odd of a toString() implementation as it's very much targeted towards that one logging call site. Maybe change it to be more generic.

private int numberOfFiles;

private volatile int processedFiles;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not need to be volatile anymore.

private long totalSize;

private volatile long processedSize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to be volatile anymore

@tlrx
Copy link
Member Author

tlrx commented Jan 8, 2018

@ywelsch Thanks for review! I updated the code according to your comments, can you have another look please?

@tlrx
Copy link
Member Author

tlrx commented Jan 8, 2018

Not super happy about the "Copy" name though.

Yeah I agree, but I didn't come with a better name.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few more comments

@@ -188,7 +187,7 @@ public void beforeIndexShardClosed(ShardId shardId, @Nullable IndexShard indexSh
Map<ShardId, IndexShardSnapshotStatus> shards = snapshotShards.getValue().shards;
if (shards.containsKey(shardId)) {
logger.debug("[{}] shard closing, abort snapshotting for snapshot [{}]", shardId, snapshotShards.getKey().getSnapshotId());
shards.get(shardId).abort();
shards.get(shardId).moveToAborted("shard is closing, aborting");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can also use abortIfNotCompleted here. Finally, looking at BlobStoreRepository.snapshot, we should also do the store.decRef() before we move to finalization. The store is unnecessarily kept open (possibly causing ShardLockObtainFailedException) and the snapshot process does not check abortion while it is doing its finalization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can also use abortIfNotCompleted here.

I think so. I wasn't sure when I added abortIfNotCompleted() but since the abort status is not checked again when finalizing the snapshot we can release the store as soon it is not needed so the shard can be closed and the finalization process continue and succeed or failed as it is already the case today.

The store is unnecessarily kept open (possibly causing ShardLockObtainFailedException)

Good catch, I changed that too.

public synchronized void processedFiles(int numberOfFiles, long totalSize) {
processedFiles = numberOfFiles;
processedSize = totalSize;
public void ensureNotAborted() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have the previous boolean isAborted method. Call sites currently catch this exception and then throw another exception.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I revert the changes.

public int processedFiles() {
return processedFiles;
public static IndexShardSnapshotStatus newFailed(final String failure) {
if (failure == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add an assertion here so we fail hard in our tests?

if (snapshotStatus.stage() == Stage.INIT || snapshotStatus.stage() == Stage.STARTED) {
snapshotStatus.abort();
}
final IndexShardSnapshotStatus.Stage stage = snapshotStatus.asCopy().getStage();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line can be removed?

@tlrx
Copy link
Member Author

tlrx commented Jan 9, 2018

@ywelsch Code updated, thanks.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's bake this a few days on master and then backport to 6.2.

@tlrx tlrx merged commit bba591b into elastic:master Jan 9, 2018
@tlrx
Copy link
Member Author

tlrx commented Jan 9, 2018

Thanks @ywelsch !

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 9, 2018
* master: (27 commits)
  Declare empty package dirs as output dirs
  Consistent updates of IndexShardSnapshotStatus (elastic#28130)
  Fix Gradle wrapper usage on Windows when building BWC (elastic#28146)
  [Docs] Fix some typos in comments (elastic#28098)
  Use Gradle wrapper when building BWC
  Painless: Add a simple cache for whitelist methods and fields. (elastic#28142)
  Fix upgrading indices which use a custom similarity plugin. (elastic#26985)
  Fix Licenses values for CDDL and Custom URL (elastic#27999)
  Cleanup TcpChannelFactory and remove classes (elastic#28102)
  Fix expected plugins test for transport-nio
  [Docs] Fix Date Math example descriptions (elastic#28125)
  Fail rollover if duplicated alias found in template (elastic#28110)
  Avoid concurrent snapshot finalizations when deleting an INIT snapshot (elastic#28078)
  Deprecate `isShardsAcked()` in favour of `isShardsAcknowledged()` (elastic#27819)
  [TEST] Wait for replicas to be allocated before shrinking
  Use the underlying connection version for CCS connections  (elastic#28093)
  test: do not use asn fields
  Test: Add assumeFalse for test that cannot pass on windows
  Clarify reproduce info on Windows
  Remove out-of-date projectile file
  ...
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 9, 2018
* master:
  Set watermarks in single-node test cases
  Add the ability to bundle multiple plugins into a meta plugin (elastic#28022)
  Declare empty package dirs as output dirs
  Consistent updates of IndexShardSnapshotStatus (elastic#28130)
tlrx added a commit that referenced this pull request Jan 15, 2018
This commit changes IndexShardSnapshotStatus so that the Stage is updated
coherently with any required information. It also provides a asCopy()
method that returns the status of a IndexShardSnapshotStatus at a given
point in time, ensuring that all information are coherent.

Closes #26480
@tlrx
Copy link
Member Author

tlrx commented Jan 15, 2018

I didn't see any related failures since it has been merged, so I backported this change to 6.2.0 in ea6a46a

@tlrx tlrx deleted the refactor-indexshardsnapshotstatus branch January 15, 2018 14:02
martijnvg added a commit that referenced this pull request Jan 16, 2018
* es/6.x: (31 commits)
  Fix eclipse build. (#28236)
  Never return null from Strings.tokenizeToStringArray (#28224)
  Fallback to TransportMasterNodeAction for cluster health retries (#28195)
  [Docs] Changes to ingest.asciidoc (#28212)
  TEST: Update logging for testAckedIndexing
  [GEO] Deprecate field parameter in GeoBoundingBoxQueryBuilder
  [GEO] Add WKT Support to GeoBoundingBoxQueryBuilder
  Avoid doing redundant work when checking for self references. (#26927)
  Fix casts in HotThreads. (#27578)
  Ignore the `-snapshot` suffix when comparing the Lucene version in the build and the docs. (#27927)
  Allow update of `eager_global_ordinals` on `_parent`. (#28014)
  Painless: Add whitelist extensions (#28161)
  Fix daitch_mokotoff phonetic filter to use the dedicated Lucene filter (#28225)
  Fix NPE on composite aggregation with sub-aggregations that need scores (#28129)
  #28045 restore removed import after backport
  Fix synonym phrase query expansion for cross_fields parsing (#28045)
  Introduce elasticsearch-core jar (#28191)
  upgrade to lucene 7.2.1 (#28218)
  [Docs] Fix an error in painless-types.asciidoc (#28221)
  Consistent updates of IndexShardSnapshotStatus (#28130)
  ...
@colings86 colings86 added the >bug label Jan 22, 2018
@jimczi jimczi removed the v7.0.0 label Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v6.2.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants