
SNAPSHOT: More Resilient Writes to Blob Stores #36927

Closed
wants to merge 5 commits into from

Conversation

original-brownbear
Member

@original-brownbear original-brownbear commented Dec 21, 2018

  • Fix potential failure to write the index gen files on master failover by making the writes
    to the blob store idempotent as long as the snapshot generation does not change (see the sketch after this list)
    • This moves the burden of consistency into the ES cluster state handling instead of leaving it in part with the
      blob store, which doesn't have any relevant consistency guarantees in most cases.
    • This PR does not change behaviour for S3 at all since there is no file existence check for that store anyway
    • This does add the fresh risk of overwriting the index.latest file with an incorrect generation in the unlikely case of a master becoming stuck for a long time before writing that file out (which is incredibly unlikely in practice, if not impossible, because we verify the index generation before writing in org.elasticsearch.repositories.blobstore.BlobStoreRepository#writeIndexGen, so the blob store would need to be inconsistent for a timespan longer than it takes the master to fail over)
  • Fixes the test mentioned in Master failover during snapshotting could leave the snapshot incomplete #25281
    • Actually block execution on writing the index file when that flag is set on the MockRepository
    • The master failover does not fail the snapshot if we don't do the file existence checks before writing (this is what would happen on S3 anyway) but instead finishes partially for the shards that reside on the live data node
  • closes Master failover during snapshotting could leave the snapshot incomplete #25281
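
A minimal sketch of the "idempotent as long as the generation does not change" idea (illustrative only; the container interface, the generation check, and all class/method names here are simplified stand-ins for the real BlobStoreRepository/BlobContainer code):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Simplified stand-in, not the actual Elasticsearch code.
class GenerationKeyedWriterSketch {

    interface SimpleBlobContainer {
        // mirrors BlobContainer#writeBlob(blobName, stream, length, failIfAlreadyExists)
        void writeBlob(String blobName, InputStream in, long length, boolean failIfAlreadyExists) throws IOException;
        // stand-in for reading the current generation from index.latest
        long latestGeneration() throws IOException;
    }

    void writeIndexGen(SimpleBlobContainer container, long expectedGen, byte[] indexBytes) throws IOException {
        // verify the repository is still at the generation this master expects
        if (container.latestGeneration() != expectedGen) {
            throw new IllegalStateException("repository moved past generation " + expectedGen);
        }
        long newGen = expectedGen + 1;
        // overwrite-tolerant write: with failIfAlreadyExists == false a retry of the same
        // generation after a master failover rewrites identical content instead of failing
        // a store-dependent existence check
        container.writeBlob("index-" + newGen, new ByteArrayInputStream(indexBytes), indexBytes.length, false);
    }
}
```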

* Fix potential failure to write the index gen files on master failover by making the writes
to the blob store idempotent as long as the snapshot generation does not change
  * This moves the burden of consistency into the ES cluster state handling instead of leaving it in part with the
blob store, which doesn't have any relevant consistency guarantees in most cases.
  * This PR does not change behaviour for S3 at all since there is no file existence check for that store anyway
* closes elastic#25281
@original-brownbear original-brownbear added >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v7.0.0 v6.7.0 labels Dec 21, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@@ -738,7 +731,7 @@ protected void writeIndexGen(final RepositoryData repositoryData, final long rep
         // write the index file
         final String indexBlob = INDEX_FILE_PREFIX + Long.toString(newGen);
         logger.debug("Repository [{}] writing new index generational blob [{}]", metadata.name(), indexBlob);
-        writeAtomic(indexBlob, snapshotsBytes, true);
+        writeAtomic(indexBlob, snapshotsBytes, false);
Member Author

This should be idempotent anyway; all we're doing here is potentially preventing deleting the old index file or updating the latest blob file in subsequent steps if either of those failed before a master failover.
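
For context, those subsequent steps boil down to roughly the sketch below. This is an illustrative condensation, not the real writeIndexGen method: the blob names follow the repository layout, but the class, method, and error handling are simplified assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.elasticsearch.common.blobstore.BlobContainer;

// Illustrative condensation of the writeIndexGen steps discussed here, not the real method.
class IndexGenStepsSketch {
    static void writeIndexGenSteps(BlobContainer container, long newGen, byte[] indexBytes, byte[] latestBytes) throws IOException {
        // 1. write the new index-N blob; overwrite-tolerant, so a retried write after a
        //    master failover succeeds instead of tripping an existence check
        container.writeBlob("index-" + newGen, new ByteArrayInputStream(indexBytes), indexBytes.length, false);
        // 2. point index.latest at the new generation
        container.writeBlob("index.latest", new ByteArrayInputStream(latestBytes), latestBytes.length, false);
        // 3. best-effort cleanup of the previous generation's index blob; if the old master
        //    already deleted it before failing over, the retry simply has nothing to remove
        try {
            container.deleteBlob("index-" + (newGen - 1));
        } catch (IOException e) {
            // a leftover stale index blob is harmless and can be cleaned up later
        }
    }
}
```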

@@ -150,7 +150,7 @@ public void write(T obj, BlobContainer blobContainer, String name) throws IOExce
         final String blobName = blobName(name);
         writeTo(obj, blobName, bytesArray -> {
             try (InputStream stream = bytesArray.streamInput()) {
-                blobContainer.writeBlob(blobName, stream, bytesArray.length(), true);
+                blobContainer.writeBlob(blobName, stream, bytesArray.length(), false);
Member Author

This is idempotent as far as I can see. It seems easier to simply skip an existing file quietly than to suppress the file-already-exists exception and keep going with writing the index gen files upstream, since S3 doesn't have this check anyway. (The only downside I see is that we incur traffic here for blob stores that do have a consistent existence check; not sure that's worth the extra effort though.)
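
For comparison, the "suppress the exception upstream" alternative alluded to here would look roughly like this; only the writeBlob call mirrors the code above, while the wrapper class and exception handling are assumptions:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileAlreadyExistsException;

import org.elasticsearch.common.blobstore.BlobContainer;
import org.elasticsearch.common.bytes.BytesReference;

// Sketch of the alternative mentioned above (not what this PR does): keep
// failIfAlreadyExists == true and swallow the collision for stores that actually report it.
class TolerantWriteSketch {
    static void writeIgnoringExisting(BlobContainer blobContainer, String blobName, BytesReference bytesArray) throws IOException {
        try (InputStream stream = bytesArray.streamInput()) {
            blobContainer.writeBlob(blobName, stream, bytesArray.length(), true);
        } catch (FileAlreadyExistsException e) {
            // a previous master already wrote this exact blob; treat the write as a no-op
        }
    }
}
```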

Member Author

The more I think about it, the more I dislike that we even have the failIfAlreadyExists flag on the BlobContainer interface. The parameter is simply ignored for S3 and won't work reliably with some NFS setups.
So in the end our tests aren't really valid for S3 etc. in failover scenarios, and any fixes we make to failover issues that rely on this check working will not be valid for S3 either.

=> I'm still a fan of this change; IMO it makes more sense to fix unforeseen issues with overwriting these files than to keep testing and fixing behavior that doesn't apply to all our blob stores.
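
To illustrate the concern: a store like S3, which has no atomic create-if-absent primitive, ends up having to drop the flag, roughly as in this hypothetical sketch (not the actual S3BlobContainer code; class and helper names are made up):

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only: a store without an atomic create-if-absent
// primitive cannot honor failIfAlreadyExists.
class OverwritingBlobContainerSketch {

    void writeBlob(String blobName, InputStream in, long blobSize, boolean failIfAlreadyExists) throws IOException {
        // failIfAlreadyExists is effectively dropped: an S3-style PUT always overwrites,
        // so tests that rely on the flag tripping an existence check never exercise this path.
        unconditionalPut(blobName, in, blobSize);
    }

    private void unconditionalPut(String blobName, InputStream in, long blobSize) throws IOException {
        // placeholder for the store's unconditional upload call
    }
}
```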

@ywelsch ywelsch self-requested a review December 24, 2018 14:19
Contributor

@ywelsch ywelsch left a comment

I'm not convinced that this is making things more resilient. In particular, the added leniency in writeIndexGen concerns me. I think we'll have to more fundamentally rethink the flow for writing files to the repository when dealing with failovers (whether it's master or data nodes).

@original-brownbear
Member Author

@ywelsch

I'm not convinced that this is making things more resilient. In particular, the added leniency in writeIndexGen concerns me.

Are we really worse off in any real-world situation with this, though? (I couldn't come up with any except for the master failover I mentioned in the PR description; and, as I mentioned above, I think for S3 we're in the same spot we currently are, so with any other blob store we're no worse off than we are with S3 either.)

I think we'll have to more fundamentally rethink the flow for writing files to the repository when dealing with failovers (whether it's master or data nodes).

I would argue that, since blob stores have no reliable guarantees across implementations, moving the logic towards using the cluster state for consistency and making writes to the blob stores immutable for a constant cluster state, like I did here, is at least in theory the way to go, isn't it? The cluster state seems to be our only reliable source of consistency here.

@jasontedor jasontedor added v8.0.0 and removed v7.0.0 labels Feb 6, 2019
@original-brownbear
Member Author

Closing this in favor of the more general suggestion on this in #42886.

Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v7.2.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Master failover during snapshotting could leave the snapshot incomplete
6 participants