Add CoolDown Period to S3 Repository #51074
Conversation
WIP, still missing tests, but I would like to confirm we agree on the approach taken here first.
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
return new ActionListener<>() {
    @Override
    public void onResponse(T response) {
        final Scheduler.Cancellable existing = finalizationFuture.getAndSet(
@ywelsch I went with doing it this way instead of just keeping track of a timestamp and then failing a new snapshot if it's started too close to the last timestamp. I'm afraid having random failures from concurrent snapshot exceptions, when no running snapshot is visible to APIs, could mess with Cloud orchestration (not necessarily breaking it, but causing an unreasonable amount of _status requests).
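For context, a minimal sketch of the approach described here: the listener that completes snapshot finalization or deletion is only resolved after an artificial cool-down, so the operation simply appears to take longer rather than failing. Names such as `finalizationFuture`, `coolDown`, and the use of the `SNAPSHOT` pool follow the excerpt above; the exact wiring in the PR may differ.

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.threadpool.Scheduler;
import org.elasticsearch.threadpool.ThreadPool;

import java.util.concurrent.atomic.AtomicReference;

final class CooldownListenerSketch {

    private final ThreadPool threadPool;
    private final TimeValue coolDown;
    // Holds the pending cool-down task so it can be cancelled if the repository is closed.
    private final AtomicReference<Scheduler.Cancellable> finalizationFuture = new AtomicReference<>();

    CooldownListenerSketch(ThreadPool threadPool, TimeValue coolDown) {
        this.threadPool = threadPool;
        this.coolDown = coolDown;
    }

    <T> ActionListener<T> delayedListener(ActionListener<T> wrappedListener) {
        return new ActionListener<>() {
            @Override
            public void onResponse(T response) {
                // Delay resolving the wrapped listener by the cool-down period instead of completing immediately.
                final Scheduler.Cancellable existing = finalizationFuture.getAndSet(
                    threadPool.schedule(() -> {
                        finalizationFuture.set(null);
                        wrappedListener.onResponse(response);
                    }, coolDown, ThreadPool.Names.SNAPSHOT));
                assert existing == null : "only one cool-down should ever be pending";
            }

            @Override
            public void onFailure(Exception e) {
                // Failures are delayed in the same way so the cool-down also covers failed finalizations.
                final Scheduler.Cancellable existing = finalizationFuture.getAndSet(
                    threadPool.schedule(() -> {
                        finalizationFuture.set(null);
                        wrappedListener.onFailure(e);
                    }, coolDown, ThreadPool.Names.SNAPSHOT));
                assert existing == null : "only one cool-down should ever be pending";
            }
        };
    }
}
```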
I had to think a bit about this, and consulted @DaveCTurner as well. We both agree that this is the right path forward (simpler to explain to users and simpler for existing orchestration tools).
In short, this artificially extends the duration of the snapshot, i.e., taking or deleting a snapshot takes 3 minutes longer. Can we add a log message that details why we are doing this (and that we are in a repo with legacy snapshots)? Let's also document this somewhere (with the setting). This gives users the choice e.g. to move to a different repo.
> Let's also document this somewhere (with the setting)

Should we really document this? It seems to me that if you're on AWS S3, not having the cool down is a risk in 100% of cases. If we document it, the users this functionality is intended to protect might opt to turn it off to "speed things up"?
Maybe just document the waiting but not the setting?
I think we must document how users can safely speed it up (i.e. by moving to a new repo or deleting all their legacy snapshots). I'm ok with not documenting the setting itself - we already have form for leaving dangerous settings undocumented (see MergePolicyConfig for instance). Let's add this reasoning to its Javadoc along with explicit instructions not to adjust it and instead to move to a new repo or delete all the legacy snapshots, to deal with the inevitable user who comes across it in the source code.
Can we also mention {@link Version#V_7_6_0} in the Javadoc so we get a reminder to remove this in v9?
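A sketch of what such a Javadoc'd setting declaration could look like; the setting name, default value, and properties here are illustrative assumptions, not necessarily what the PR ended up with.

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.unit.TimeValue;

final class S3CooldownSettingSketch {

    /**
     * Time period over which repository operations are artificially delayed after finalizing a snapshot
     * or a delete, to protect shard-level metadata written in the pre-{@link org.elasticsearch.Version#V_7_6_0}
     * format from eventually consistent reads on AWS S3.
     * Do not adjust this setting to speed things up; instead, move to a new repository or delete all
     * snapshots created before {@link org.elasticsearch.Version#V_7_6_0} from the repository.
     * TODO: remove once pre-7.6 repository formats can no longer exist (i.e. in v9).
     */
    static final Setting<TimeValue> COOLDOWN_PERIOD = Setting.timeSetting(
        "cooldown_period",
        TimeValue.timeValueMinutes(3),
        Setting.Property.Dynamic);
}
```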
Alright, added docs to the setting, a link to 7.6, an explanatory log message and a test in f9047d7 :)
@ywelsch could you take a look here and see if this approach is agreeable to you? I'm on PTO today and tomorrow, but it's no problem to find the time to code up a BwC test for this to verify that it works fine (it worked fine in manual testing though).
Should we raise this as a blocker for 7.6.0?
@DaveCTurner yes I think we have to. I'll deal with this tomorrow at the very latest.
@@ -92,6 +111,41 @@ protected Settings nodeSettings(int nodeOrdinal) {
        .build();
    }

    public void testEnforcedCooldownPeriod() throws IOException {
This is admittedly quite the hacky test, and it takes 2x 5s of hard waits to verify the behaviour.
We could create a cleaner test by adding some BwC test infrastructure to the S3 plugin tests, but I'm not sure it's worth the complexity. Also, running a real REST test to verify the timing here makes the test even more prone to running into random CI slowness and failing in the last step, which verifies that no waiting happens when moving to a repo without any old-version snapshot => this seemed like the least bad option to me.
We are using an O(5s) hard timeout in some other repo IO-timeout tests, and so far that hasn't failed us due to CI running into a longer pause, so I'm hopeful this will be stable.
Can we have a SnapshotResiliencyTests with a mock repo that is eventually consistent on actions for X seconds, and then becomes consistent, and then use that one to verify all is going well? Would be a stronger test than this, which just verifies that some sleep is somewhere in place.
Not trivial but doable => On it :)
Argh, never mind ... then we'd have to move the cool down logic to BlobStoreRepository. We can't use the mock repository together with the S3 plugin, and we don't really have any mock infrastructure left for S3 either, so we'd either have to test this on the BlobStoreRepository or build some infrastructure that combines the mock S3 REST API infra with the snapshot resiliency test infrastructure.
Think this is worth it, given that this is a stop-gap solution?
bummer :/
I've left some more comments and ideas for tests.
public void onResponse(T response) {
    logCooldownInfo();
    final Scheduler.Cancellable existing = finalizationFuture.getAndSet(
        threadPool.schedule(() -> wrappedListener.onResponse(response), coolDown, ThreadPool.Names.SNAPSHOT));
What if we are rejected from the snapshot threadpool? Let's force it onto the threadpool (use AbstractRunnable), and notify the listener as well in AbstractRunnable.onFailure.
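A sketch of the rejection-safe variant being suggested here: wrap the delayed completion in an AbstractRunnable so that a rejection or failure still resolves the wrapped listener. Names are taken from the excerpt above and are illustrative, not necessarily the final PR code.

```java
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.common.util.concurrent.AbstractRunnable;
import org.elasticsearch.threadpool.ThreadPool;

final class CooldownSchedulingSketch {

    static <T> void scheduleDelayedResponse(ThreadPool threadPool, TimeValue coolDown,
                                            ActionListener<T> wrappedListener, T response) {
        threadPool.schedule(new AbstractRunnable() {
            @Override
            protected void doRun() {
                // Normal path: the cool-down has elapsed, resolve the wrapped listener.
                wrappedListener.onResponse(response);
            }

            @Override
            public void onFailure(Exception e) {
                // Also reached via the default onRejection (which delegates to onFailure),
                // so the listener is never silently dropped.
                wrappedListener.onFailure(e);
            }

            @Override
            public boolean isForceExecution() {
                // Force execution so a full SNAPSHOT queue cannot reject the delayed completion.
                return true;
            }
        }, coolDown, ThreadPool.Names.SNAPSHOT);
    }
}
```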
> what if we are rejected from the snapshot threadpool?

I think that's impossible unless the snapshot pool is shutting down (in which case it's kinda irrelevant what we do anyway, I guess). No other action will go onto the snapshot pool until this listener is resolved (because the snapshot or delete in progress in the CS will prevent anything else from running + we specifically made it so that no step in the snapshot operations runs on the SNAPSHOT pool before we've checked the CS, to avoid any deadlocks here).
I wouldn't bet my life on that (think e.g. about master failover). Let's make this safe.
protected void doClose() {
    final Scheduler.Cancellable cancellable = finalizationFuture.getAndSet(null);
    if (cancellable != null) {
        logger.warn("Repository closed during cooldown period");
Do we need to log this at warn level?
I figured this should be somewhat visible; it's not great if this happens, because the next master will not start waiting again. That may be something we want to add, but I figured it might not be worth the extra complication: a master failover won't be instant (since the current master must have worked fine to set the safe and pending generation equal before getting to the wait), so that "wait" might be good enough?
... retracted, see below
Actually the situation here is better than I described above ... if we're running into this, we're always re-running the last step of the delete or snapshot operation on the next master (and will fail there because the repository generation has already moved), which will trigger another wait period. So this isn't a bad spot at all :) => moving this to DEBUG.
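Putting the thread's conclusion together, the doClose() handling could look roughly like this: if the repository is closed while a cool-down is still pending, the scheduled task is cancelled and the event is logged at DEBUG, since the next master re-runs the final step and triggers a fresh wait anyway. This is an illustrative sketch, not the PR's exact code.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.elasticsearch.threadpool.Scheduler;

import java.util.concurrent.atomic.AtomicReference;

final class CooldownCloseSketch {

    private static final Logger logger = LogManager.getLogger(CooldownCloseSketch.class);

    private final AtomicReference<Scheduler.Cancellable> finalizationFuture = new AtomicReference<>();

    protected void doClose() {
        final Scheduler.Cancellable cancellable = finalizationFuture.getAndSet(null);
        if (cancellable != null) {
            // Not an error: the next master re-runs the last step and waits again.
            logger.debug("Repository closed during cool-down period");
            cancellable.cancel();
        }
    }
}
```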
final long beforeFastDelete = repository.threadPool().relativeTimeInNanos();
client().admin().cluster().prepareDeleteSnapshot(repoName, fakeOldSnapshot.getName()).get();
assertThat(repository.threadPool().relativeTimeInNanos() - beforeFastDelete, lessThan(TEST_COOLDOWN_PERIOD.getNanos()));
I wonder if there are situations where ThreadPool.schedule will return before the time value specified. This would then fail here.
> I wonder if there are situations where ThreadPool.schedule will return before the time value specified.

I figured that's impossible since I turned off the timestamp cache? I think otherwise the underlying primitives in ThreadPoolExecutor are accurate (at least on Linux).

> This would then fail here.

And rightfully so?
=> that said :) ... let me see about the suggested test via the resiliency tests
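"Turned off the timestamp cache" presumably refers to disabling the thread pool's cached time in the test's node settings, so that relativeTimeInNanos() tracks real time rather than a value that is only refreshed periodically. A small sketch of what that could look like (assumed, not copied from the PR; the helper name is hypothetical):

```java
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.threadpool.ThreadPool;

final class TimingTestSettingsSketch {

    // Node settings that disable the cached thread-pool timestamp so that
    // ThreadPool#relativeTimeInNanos() is precise enough for timing assertions.
    static Settings preciseTimeNodeSettings(Settings baseSettings) {
        return Settings.builder()
            .put(baseSettings)
            .put(ThreadPool.ESTIMATED_TIME_INTERVAL_SETTING.getKey(), TimeValue.ZERO)
            .build();
    }
}
```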
@ywelsch better tests turned out to be really complex to set up, let me know if you still want them :)
LGTM
Thanks Yannick + David!
Add cool down period after snapshot finalization and delete to prevent eventually consistent AWS S3 from corrupting shard level metadata as long as the repository is using the old format metadata on the shard level.
The 7.6.0 label is intentional here, this one is important for 7.6.