Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not fail snapshot when deleting a missing snapshotted file #30332

Merged
merged 5 commits into from
May 7, 2018

Conversation

tlrx
Copy link
Member

@tlrx tlrx commented May 2, 2018

When deleting or creating a snapshot for a given shard, elasticsearch usually starts by listing all the existing snapshotted files in the repository. Then it computes a diff and deletes the snapshotted files that are not needed anymore. During this deletion, an exception is thrown if the file to be deleted does not exist anymore.

This behavior is challenging with cloud based repository implementations like S3 where a file that has been deleted can still appear in the bucket for few seconds/minutes (because the deletion can take some time to be fully replicated on S3). If the deleted file appears in the listing of files, then the following deletion will fail with a NoSuchFileException and the snapshot will be partially created/deleted.

This pull request makes the deletion of these files a bit less strict, ie not failing if the file we want to delete does not exist anymore. It introduces a new BlobContainer.deleteIgnoringIfNotExists() method that can be used at some specific places where not failing when deleting a file is considered harmless.

Closes #28322

@tlrx tlrx requested a review from ywelsch May 2, 2018 12:12
@tlrx tlrx changed the title Make blob deletions less strict Do not fail snapshot when deleting a missing snapshotted file May 2, 2018
@hub-cap hub-cap added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs and removed :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels May 3, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me as if only finalization needs special treatment. I would prefer that we first write the new index file before we start cleaning up the remaining index files.

@@ -780,7 +780,7 @@ private void writeAtomic(final String blobName, final BytesReference bytesRef) t
} catch (IOException ex) {
// temporary blob creation or move failed - try cleaning up
try {
snapshotsBlobContainer.deleteBlob(tempBlobName);
snapshotsBlobContainer.deleteBlobIgnoringIfNotExists(tempBlobName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not change anything here? We are already catching the NoSuchFileException in the line below, which is an IOException.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If something goes wrong and an IOException is thrown when writing the temporary blob in

snapshotsBlobContainer.writeBlob(tempBlobName,..)

It means that in the catch block the deletion of the temp blob will fail, and adding this NoSuchFileException as a suppressed exception here looks like noise to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -962,7 +962,7 @@ protected void finalize(List<SnapshotFiles> snapshots, int fileListGeneration, M
if (blobName.startsWith(DATA_BLOB_PREFIX)) {
if (newSnapshots.findNameFile(BlobStoreIndexShardSnapshot.FileInfo.canonicalName(blobName)) == null) {
try {
blobContainer.deleteBlob(blobName);
blobContainer.deleteBlobIgnoringIfNotExists(blobName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not change anything here? We are already catching the NoSuchFileException in the line below, which is an IOException.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the behavior is the same but it just make more explicit that it is ok to try to delete a missing file here.

@@ -946,7 +946,7 @@ protected void finalize(List<SnapshotFiles> snapshots, int fileListGeneration, M
for (String blobName : blobs.keySet()) {
if (indexShardSnapshotsFormat.isTempBlobName(blobName) || blobName.startsWith(SNAPSHOT_INDEX_PREFIX)) {
try {
blobContainer.deleteBlob(blobName);
blobContainer.deleteBlobIgnoringIfNotExists(blobName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about the comment below this:

// We cannot delete index file - this is fatal, we cannot continue, otherwise we might end up with references to non-existing files with references to non-existing files

This sounds very dangerous, and I wonder if we need special precaution here. I wonder for example if we should first write the new index file before starting to delete old ones. It sounds to me as if finalization bears the risk of losing all index files if there is a process crash between deleting old ones and writing new one. We currently compensate for this in buildBlobStoreIndexShardSnapshots by doing a file listing and loading individual snapshots, but I find it odd to rely on such a fallback logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I wanted to look at this too. I agree we should write the shard index file first, fail hard if something goes wrong, and then delete the files that are not needed anymore. We can still keep the individual file listing in buildBlobStoreIndexShardSnapshots.

@@ -139,7 +139,7 @@ public void writeAtomic(T obj, BlobContainer blobContainer, String name) throws
} catch (IOException ex) {
// Move failed - try cleaning up
try {
blobContainer.deleteBlob(tempBlobName);
blobContainer.deleteBlobIgnoringIfNotExists(tempBlobName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not change anything here? We are already catching the NoSuchFileException in the line below, which is an IOException.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - in this case it must not swallow the NoSuchFileException

@tlrx
Copy link
Member Author

tlrx commented May 3, 2018

Thanks @ywelsch. I updated the code and commented. Can you have another look please?

@@ -780,7 +780,7 @@ private void writeAtomic(final String blobName, final BytesReference bytesRef) t
} catch (IOException ex) {
// temporary blob creation or move failed - try cleaning up
try {
snapshotsBlobContainer.deleteBlob(tempBlobName);
snapshotsBlobContainer.deleteBlobIgnoringIfNotExists(tempBlobName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -930,54 +931,53 @@ public BlobStoreIndexShardSnapshot loadSnapshot() {
}

/**
* Removes all unreferenced files from the repository and writes new index file
* Writes a new shard's index file and removes all unreferenced files from the repository.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writes a new index file for the shard and removes...

}
// Delete old index files
for (final String blobName : blobs.keySet()) {
if (blobName.startsWith(SNAPSHOT_INDEX_PREFIX) && (blobName.equals(currentIndexGen) == false)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would the current index gen be part of the blobs?

Copy link
Member Author

@tlrx tlrx May 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's an unnecessary extra safety, I'll remove this.

// Delete old index files
for (final String blobName : blobs.keySet()) {
if (blobName.startsWith(SNAPSHOT_INDEX_PREFIX) && (blobName.equals(currentIndexGen) == false)) {
blobContainer.deleteBlobIgnoringIfNotExists(blobName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's still nice to keep the debug logging as we had before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After debugging many snapshot/restore issues I've never found them useful as the IOException almost always contains the blobname and is logged at a higher level.

// Delete all blobs that don't exist in a snapshot
for (final String blobName : blobs.keySet()) {
if (blobName.startsWith(DATA_BLOB_PREFIX) && (updatedSnapshots.findNameFile(canonicalName(blobName)) == null)) {
blobContainer.deleteBlobIgnoringIfNotExists(blobName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before, we would ignore any IOException, now we only ignore NoSuchFileException. I think, for optional clean-up, the previous one is better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why we should blindly ignore IOException here (except NoSuchFileException which is harmless because we want to delete the file anyway). It could hide a permission issue, and let unnecessary files in the repository that could be deleted if it was noticed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why we should blindly ignore IOException here

I'm not a fan of blindly ignoring IOException here either. But I don't think it should the fail the snapshot which successfully completed. We can log the IOExceptions which are not NoSuchFileException as warning to make them more prominent?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a good compromise, yes. Thanks!

}
} catch (IOException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preferred the exception handling how it was before, i.e. local to each action. This allows better debug messages.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree but OK, I'll add them back.

@tlrx
Copy link
Member Author

tlrx commented May 4, 2018

@ywelsch I updated the code.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tlrx tlrx force-pushed the blobcontainer-delete-ignore-if-exists branch from 5c10ea1 to da688c2 Compare May 4, 2018 11:42
@tlrx
Copy link
Member Author

tlrx commented May 4, 2018

@elasticmachine test this please

@tlrx tlrx merged commit 1987d62 into elastic:master May 7, 2018
@tlrx tlrx deleted the blobcontainer-delete-ignore-if-exists branch May 7, 2018 07:36
@tlrx
Copy link
Member Author

tlrx commented May 7, 2018

Thanks @ywelsch

tlrx added a commit that referenced this pull request May 7, 2018
When deleting or creating a snapshot for a given shard, elasticsearch 
usually starts by listing all the existing snapshotted files in the repository. 
Then it computes a diff and deletes the snapshotted files that are not 
needed anymore. During this deletion, an exception is thrown if the file 
to be deleted does not exist anymore.

This behavior is challenging with cloud based repository implementations 
like S3 where a file that has been deleted can still appear in the bucket for 
few seconds/minutes (because the deletion can take some time to be fully 
replicated on S3). If the deleted file appears in the listing of files, then the 
following deletion will fail with a NoSuchFileException and the snapshot 
will be partially created/deleted.

This pull request makes the deletion of these files a bit less strict, ie not 
failing if the file we want to delete does not exist anymore. It introduces a 
new BlobContainer.deleteIgnoringIfNotExists() method that can be used 
at some specific places where not failing when deleting a file is 
considered harmless.

Closes #28322
@tlrx tlrx added the v6.3.1 label May 7, 2018
tlrx added a commit that referenced this pull request May 7, 2018
When deleting or creating a snapshot for a given shard, elasticsearch 
usually starts by listing all the existing snapshotted files in the repository. 
Then it computes a diff and deletes the snapshotted files that are not 
needed anymore. During this deletion, an exception is thrown if the file 
to be deleted does not exist anymore.

This behavior is challenging with cloud based repository implementations 
like S3 where a file that has been deleted can still appear in the bucket for 
few seconds/minutes (because the deletion can take some time to be fully 
replicated on S3). If the deleted file appears in the listing of files, then the 
following deletion will fail with a NoSuchFileException and the snapshot 
will be partially created/deleted.

This pull request makes the deletion of these files a bit less strict, ie not 
failing if the file we want to delete does not exist anymore. It introduces a 
new BlobContainer.deleteIgnoringIfNotExists() method that can be used 
at some specific places where not failing when deleting a file is 
considered harmless.

Closes #28322
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 7, 2018
…r-you

* origin/master:
  Update forcemerge.asciidoc (elastic#30113)
  Added zentity to the list of API extension plugins (elastic#29143)
  Fix the search request default operation behavior doc (elastic#29302) (elastic#29405)
  Watcher: Mark watcher as started only after loading watches (elastic#30403)
  Pass the task to broadcast actions (elastic#29672)
  Disable REST default settings testing until elastic#29229 is back-ported
  Correct wording in log message (elastic#30336)
  Do not fail snapshot when deleting a missing snapshotted file (elastic#30332)
  AwaitsFix testCreateShrinkIndexToN
  DOCS: Correct mapping tags in put-template api
  DOCS: Fix broken link in the put index template api
  Add put index template api to high level rest client (elastic#30400)
  Relax testAckedIndexing to allow document updating
  [Docs] Add snippets for POS stop tags default value
  Move respect accept header on no handler to 6.3.1
jasontedor added a commit to jaymode/elasticsearch that referenced this pull request May 7, 2018
* origin/master: (39 commits)
  Docs: fix changelog merge
  Fix line length violation in cache tests
  Add stricter geohash parsing (elastic#30376)
  Add failing test for core cache deadlock
  [DOCS] convert forcemerge snippet
  Update forcemerge.asciidoc (elastic#30113)
  Added zentity to the list of API extension plugins (elastic#29143)
  Fix the search request default operation behavior doc (elastic#29302) (elastic#29405)
  Watcher: Mark watcher as started only after loading watches (elastic#30403)
  Pass the task to broadcast actions (elastic#29672)
  Disable REST default settings testing until elastic#29229 is back-ported
  Correct wording in log message (elastic#30336)
  Do not fail snapshot when deleting a missing snapshotted file (elastic#30332)
  AwaitsFix testCreateShrinkIndexToN
  DOCS: Correct mapping tags in put-template api
  DOCS: Fix broken link in the put index template api
  Add put index template api to high level rest client (elastic#30400)
  Relax testAckedIndexing to allow document updating
  [Docs] Add snippets for POS stop tags default value
  Move respect accept header on no handler to 6.3.1
  ...
dnhatn added a commit that referenced this pull request May 8, 2018
* elastic-master:
  Watcher: Mark watcher as started only after loading watches (#30403)
  Pass the task to broadcast actions (#29672)
  Disable REST default settings testing until #29229 is back-ported
  Correct wording in log message (#30336)
  Do not fail snapshot when deleting a missing snapshotted file (#30332)
  AwaitsFix testCreateShrinkIndexToN
  DOCS: Correct mapping tags in put-template api
  DOCS: Fix broken link in the put index template api
  Add put index template api to high level rest client (#30400)
  Relax testAckedIndexing to allow document updating
  [Docs] Add snippets for POS stop tags default value
  Move respect accept header on no handler to 6.3.1
  Respect accept header on no handler (#30383)
  [Test] Add analysis-nori plugin to the vagrant tests
  [Docs] Fix bad link
  [Docs] Fix end of section in the korean plugin docs
  Expose the Lucene Korean analyzer module in a plugin (#30397)
  Docs: remove transport_client from CCS role example (#30263)
  [Rollup] Validate timezone in range queries (#30338)
  Use readFully() to read bytes from CipherInputStream (#28515)
  Fix  docs Recently merged #29229 had a doc bug that broke the doc build. This commit fixes.
  Test: remove cluster permission from CCS user (#30262)
  Add Get Settings API support to java high-level rest client (#29229)
  Watcher: Remove unneeded index deletion in tests
colings86 pushed a commit that referenced this pull request May 8, 2018
When deleting or creating a snapshot for a given shard, elasticsearch 
usually starts by listing all the existing snapshotted files in the repository. 
Then it computes a diff and deletes the snapshotted files that are not 
needed anymore. During this deletion, an exception is thrown if the file 
to be deleted does not exist anymore.

This behavior is challenging with cloud based repository implementations 
like S3 where a file that has been deleted can still appear in the bucket for 
few seconds/minutes (because the deletion can take some time to be fully 
replicated on S3). If the deleted file appears in the listing of files, then the 
following deletion will fail with a NoSuchFileException and the snapshot 
will be partially created/deleted.

This pull request makes the deletion of these files a bit less strict, ie not 
failing if the file we want to delete does not exist anymore. It introduces a 
new BlobContainer.deleteIgnoringIfNotExists() method that can be used 
at some specific places where not failing when deleting a file is 
considered harmless.

Closes #28322
dnhatn added a commit that referenced this pull request May 8, 2018
* 6.x:
  Stop forking javac (#30462)
  Fix tribe tests
  Docs: Use task_id in examples of tasks (#30436)
  Security: Rename IndexLifecycleManager to SecurityIndexManager (#30442)
  Packaging: Set elasticsearch user to have non-existent homedir (#29007)
  [Docs] Fix typo in cardinality-aggregation.asciidoc (#30434)
  Avoid NPE in `more_like_this` when field has zero tokens (#30365)
  Build: Switch to building javadoc with html5 (#30440)
  Add a quick tour of the project to CONTRIBUTING (#30187)
  Add stricter geohash parsing (#30376)
  Reindex: Use request flavored methods (#30317)
  Silence SplitIndexIT.testSplitIndexPrimaryTerm test failure.  (#30432)
  Auto-expand replicas when adding or removing nodes (#30423)
  Silence IndexUpgradeIT test failures. (#30430)
  Fix line length violation in cache tests
  Add failing test for core cache deadlock
  [DOCS] convert forcemerge snippet
  Update forcemerge.asciidoc (#30113)
  Added zentity to the list of API extension plugins (#29143)
  Fix the search request default operation behavior doc (#29302) (#29405)
  Watcher: Mark watcher as started only after loading watches (#30403)
  Correct wording in log message (#30336)
  Do not fail snapshot when deleting a missing snapshotted file (#30332)
  AwaitsFix testCreateShrinkIndexToN
  DOCS: Correct mapping tags in put-template api
  DOCS: Fix broken link in the put index template api
  Add put index template api to high level rest client (#30400)
  [Docs] Add snippets for POS stop tags default value
  Remove entry inadvertently picked into changelog
  Move respect accept header on no handler to 6.3.1
  Respect accept header on no handler (#30383)
  [Test] Add analysis-nori plugin to the vagrant tests
  [Rollup] Validate timezone in range queries (#30338)
  [Docs] Fix bad link
  [Docs] Fix end of section in the korean plugin docs
  add the Korean nori plugin to the change logs
  Expose the Lucene Korean analyzer module in a plugin (#30397)
  Docs: remove transport_client from CCS role example (#30263)
  Test: remove cluster permission from CCS user (#30262)
  Watcher: Remove unneeded index deletion in tests
  fix docs branch version
  fix lucene snapshot version
  Upgrade to 7.4.0-snapshot-1ed95c097b (#30357)
  [ML][TEST] Clean up jobs in ModelPlotIT
  Watcher: Ensure trigger service pauses execution (#30363)
  [DOCS] Fixes ordering of changelog sections
  [DOCS] Commented out empty sections in the changelog to fix the doc build. (#30372)
  Make RepositoriesMetaData contents unmodifiable (#30361)
  Change signature of Get Repositories Response (#30333)
  6.x Backport: Terms query validate bug  (#30319)
  InternalEngineTests.testConcurrentOutOfOrderDocsOnReplica should use two documents (#30121)
  Security: reduce garbage during index resolution (#30180)
  Test: use trial license in qa tests with security
  [ML] Add integration test for model plots (#30359)
  SQL: Fix bug caused by empty composites (#30343)
  [ML] Account for gaps in data counts after job is reopened (#30294)
  [ML] Refactor DataStreamDiagnostics to use array (#30129)
  Make licensing FIPS-140 compliant (#30251)
  Do not load global state when deleting a snapshot (#29278)
  Don't load global state when only restoring indices (#29239)
  Tests: Use different watch ids per test in smoke test (#30331)
  Watcher: Make start/stop cycle more predictable and synchronous (#30118)
  [Docs] Add term query with normalizer example
  Adds Eclipse config for xpack licence headers (#30299)
  Fix message content in users tool (#30293)
  [DOCS] Removed X-Pack breaking changes page
  [DOCS] Added security breaking change
  [DOCS] Fixes link to TLS LDAP info
  [DOCS] Merges X-Pack release notes into changelog (#30350)
  [DOCS] Fixes broken links to bootstrap user (#30349)
  [Docs] Remove errant changelog line
  Fix NPE when CumulativeSum agg encounters null/empty bucket (#29641)
  [DOCS] Reorganizes authentication details in Stack Overview (#30280)
  Tests: Simplify VersionUtils released version splitting (#30322)
  Fix merging logic of Suggester Options (#29514)
  ReplicationTracker.markAllocationIdAsInSync may hang if allocation is cancelled (#30316)
  [DOCS] Adds LDAP realm configuration details (#30214)
  [DOCS] Adds native realm configuration details (#30215)
  Disable SSL on testing old BWC nodes (#30337)
  [DOCS] Enables edit links for X-Pack pages
  Cancelling a peer recovery on the source can leak a primary permit (#30318)
  SQL: Reduce number of ranges generated for comparisons (#30267)
  [DOCS] Adds links to changelog sections
  Convert server javadoc to html5 (#30279)
  REST Client: Add Request object flavored methods (#29623)
  Create default ES_TMPDIR on Windows (#30325)
  [Docs] Clarify `fuzzy_like_this` redirect (#30183)
  Fix docs of the `_ignored` meta field.
  Add a new `_ignored` meta field. (#29658)
  Move repository-azure fixture test to QA project (#30253)
ywelsch added a commit that referenced this pull request May 11, 2018
Fixes an (un-released) bug introduced in #30332.

Closes #30507
ywelsch added a commit that referenced this pull request May 11, 2018
Fixes an (un-released) bug introduced in #30332.

Closes #30507
ywelsch added a commit that referenced this pull request May 11, 2018
Fixes an (un-released) bug introduced in #30332.

Closes #30507
@bleskes bleskes added v6.3.0 and removed v6.3.1 labels May 16, 2018
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
infa-dsokolov added a commit to infa-dsokolov/elasticsearch that referenced this pull request Oct 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants