SNAPSHOT: More Resilient Writes to Blob Stores #36927
Conversation
* Fix a potential failure to write the index generation files on master failover by making the writes to the blob store idempotent as long as the snapshot generation does not change
* This moves the burden of consistency into the ES cluster state handling instead of leaving it in part with the blob store, which doesn't have any relevant consistency guarantees in most cases
* This PR does not change behaviour for S3 at all since there is no file existence check for that store anyway
* closes elastic#25281
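The retry scenario the description targets can be sketched as follows. This is a hypothetical in-memory stand-in (the class and field names are illustrative, not the real Elasticsearch code), showing why keying blobs by generation makes a retried write after master failover harmless, so that failIfAlreadyExists can be false:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative in-memory blob store: writing the same generation-keyed
// blob twice with identical content is a no-op, so a new master retrying
// a predecessor's write does not need an existence check to be safe.
public class GenerationKeyedWrites {
    static final String INDEX_FILE_PREFIX = "index-"; // matches the real prefix

    final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    // Sketch of writeBlob with failIfAlreadyExists=false: overwrite quietly.
    void writeBlob(String name, byte[] data, boolean failIfAlreadyExists) {
        if (failIfAlreadyExists && blobs.containsKey(name)) {
            throw new IllegalStateException("blob [" + name + "] already exists");
        }
        blobs.put(name, data);
    }

    public static void main(String[] args) {
        GenerationKeyedWrites store = new GenerationKeyedWrites();
        byte[] repoData = "snapshots-for-gen-7".getBytes();
        String indexBlob = INDEX_FILE_PREFIX + Long.toString(7);

        // First master writes the generation blob, then fails over before
        // finishing; the new master retries the exact same write.
        store.writeBlob(indexBlob, repoData, false);
        store.writeBlob(indexBlob, repoData, false); // retry: harmless overwrite

        System.out.println(new String(store.blobs.get(indexBlob)));
    }
}
```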
Pinging @elastic/es-distributed
@@ -738,7 +731,7 @@ protected void writeIndexGen(final RepositoryData repositoryData, final long rep
         // write the index file
         final String indexBlob = INDEX_FILE_PREFIX + Long.toString(newGen);
         logger.debug("Repository [{}] writing new index generational blob [{}]", metadata.name(), indexBlob);
-        writeAtomic(indexBlob, snapshotsBytes, true);
+        writeAtomic(indexBlob, snapshotsBytes, false);
This should be idempotent anyway, all we're doing here is potentially preventing deleting the old index file or updating the latest blob file in subsequent steps if either of those failed before a master failover.
@@ -150,7 +150,7 @@ public void write(T obj, BlobContainer blobContainer, String name) throws IOExce
         final String blobName = blobName(name);
         writeTo(obj, blobName, bytesArray -> {
             try (InputStream stream = bytesArray.streamInput()) {
-                blobContainer.writeBlob(blobName, stream, bytesArray.length(), true);
+                blobContainer.writeBlob(blobName, stream, bytesArray.length(), false);
This is idempotent as far as I can see. It seems easier to simply skip an existing file quietly than to suppress the file-already-exists exception and keep going with writing the index gen files upstream, since S3 doesn't have this check anyway. The only downside I see is that we incur extra traffic for blob stores that do have a consistent existence check here; I'm not sure that's worth the extra effort, though.
The more I think about it, the more I dislike that we even have the failIfAlreadyExists flag on the BlobContainer interface. The parameter is simply ignored for S3 and won't work reliably on some NFS implementations.
So in the end, our tests aren't really valid for S3 etc. in failover scenarios, and any fixes we make to failover issues that rely on this check working will not be valid for S3 either.
=> I'm still a fan of this change. IMO it makes more sense to fix any unforeseen issues with overwriting these files than to continue testing and fixing behavior that doesn't apply to all our blob stores.
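The inconsistency described above can be made concrete with a small sketch. Both container classes below are hypothetical illustrations (not the real FsBlobContainer or S3BlobContainer): one honors failIfAlreadyExists, the other silently ignores it the way S3 does, so a test that relies on the flag only exercises some stores:

```java
import java.util.HashMap;
import java.util.Map;

public class FlagInconsistency {
    interface BlobContainer {
        void writeBlob(String name, byte[] data, boolean failIfAlreadyExists);
    }

    // FS-like store: the flag is honored, duplicate writes fail.
    static class FsLikeContainer implements BlobContainer {
        final Map<String, byte[]> blobs = new HashMap<>();
        public void writeBlob(String name, byte[] data, boolean failIfAlreadyExists) {
            if (failIfAlreadyExists && blobs.containsKey(name)) {
                throw new IllegalStateException("already exists: " + name);
            }
            blobs.put(name, data);
        }
    }

    // S3-like store: the flag is silently ignored, duplicates overwrite.
    static class S3LikeContainer implements BlobContainer {
        final Map<String, byte[]> blobs = new HashMap<>();
        public void writeBlob(String name, byte[] data, boolean failIfAlreadyExists) {
            blobs.put(name, data);
        }
    }

    // Does a second write with failIfAlreadyExists=true actually fail?
    static boolean overwriteFails(BlobContainer c) {
        c.writeBlob("index-1", new byte[] {1}, true);
        try {
            c.writeBlob("index-1", new byte[] {2}, true);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("fs=" + overwriteFails(new FsLikeContainer()));
        System.out.println("s3=" + overwriteFails(new S3LikeContainer()));
    }
}
```

The same assertion passes on one store and fails on the other, which is the reviewer's point about tests not being valid for S3.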
I'm not convinced that this is making things more resilient. In particular, the added leniency in writeIndexGen concerns me. I think we'll have to rethink more fundamentally the flow for writing files to the repository when dealing with failovers (whether of master or data nodes).
Are we really worse off in any real-world situation with this, though? I couldn't come up with any except for the master failover I mentioned in the PR description, and as I mentioned above we're in the same spot for S3 that we currently are, so with any other blob store we are no worse off than we are for S3 either.
I would argue that, since blob stores have no reliable guarantees across implementations, moving the logic towards using the cluster state for consistency and making writes to the blob stores immutable for an unchanged cluster state, like I did here, is at least in theory the way to go, isn't it? The cluster state seems to be our only reliable source of consistency here.
closing this in favor of the more general suggestion on this in #42886
index.latest file with an incorrect generation in the unlikely case of a master becoming stuck for a long time before writing that file out (which is incredibly unlikely in practice, if not impossible, because we verify the index generation before writing in org.elasticsearch.repositories.blobstore.BlobStoreRepository#writeIndexGen, so the blob store would need to be inconsistent for a timespan longer than it takes the master to fail over)
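The generation verification mentioned above can be sketched roughly as follows. This is an assumed simplification of the guard in BlobStoreRepository#writeIndexGen (names and structure are illustrative): a new index generation is only written if the repository's current generation still matches what the writing master read, which bounds the damage a stale master can do:

```java
import java.util.HashMap;
import java.util.Map;

public class WriteIndexGenSketch {
    long currentGen = 6;                       // generation the repository is at
    final Map<String, String> blobs = new HashMap<>();

    // Hypothetical analogue of the expectedGen check: reject writes from a
    // master whose view of the repository generation is out of date.
    void writeIndexGen(String repositoryData, long expectedGen) {
        if (currentGen != expectedGen) {
            throw new IllegalStateException(
                "expected gen [" + expectedGen + "] but repository is at [" + currentGen + "]");
        }
        long newGen = expectedGen + 1;
        blobs.put("index-" + newGen, repositoryData);      // new generation blob
        blobs.put("index.latest", Long.toString(newGen));  // pointer blob
        currentGen = newGen;
    }

    public static void main(String[] args) {
        WriteIndexGenSketch repo = new WriteIndexGenSketch();
        repo.writeIndexGen("snapshots", 6);                // succeeds, gen -> 7

        try {
            repo.writeIndexGen("stale-master-data", 6);    // stale master: rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("latest=" + repo.blobs.get("index.latest"));
    }
}
```

For the stale master to corrupt index.latest despite this check, the blob store would need to stay inconsistent for longer than the failover window, matching the "incredibly unlikely in practice" assessment above.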