[TS | LIP-164000] Reset Sweep Progress #5277

jeremyk-91 · 2021-02-23T22:22:47Z

Goals (and why):

Add the ability to reset sweep progress. This is useful when entries are written to the targeted sweep queue after the sweep timestamp has already progressed past them: the transaction in question will necessarily fail, but its entries may still be left behind as cruft in the targeted sweep queue.

Implementation Description (bullets):

Add a config flag. If it is set, each node CASes sweep progress to 0 on startup and does not run targeted sweep. Stopping sweep is needed because a service has multiple nodes, and a node that is still sweeping could move progress forward again after another node has reset it. Once the last node rolls, assuming it starts successfully, no one is running sweep and we CAS the progress to 0, so we have 0 for all shards and strategies.
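As a rough illustration of the startup behaviour described above, here is a minimal sketch. It is not the code in this PR: `SweepStartupSketch`, `start`, and the use of `AtomicLong` as a stand-in for the persisted progress entries are all assumptions made for the example.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: AtomicLong stands in for one persisted progress entry
// per (shard, strategy) pair; the real code talks to a key-value service.
final class SweepStartupSketch {
    static final long RESET_VALUE = 0L;

    // Retry the compare-and-set from the freshly read value until the entry is 0.
    static void resetProgress(AtomicLong progress) {
        long current = progress.get();
        while (current != RESET_VALUE && !progress.compareAndSet(current, RESET_VALUE)) {
            current = progress.get();
        }
    }

    // Returns true if background sweep threads would be scheduled.
    static boolean start(boolean resetAndStopSweep, List<AtomicLong> progressEntries) {
        if (resetAndStopSweep) {
            // Reset every shard and strategy, then deliberately do not sweep.
            progressEntries.forEach(SweepStartupSketch::resetProgress);
            return false;
        }
        return true;
    }
}
```

Calling `start(true, entries)` leaves every entry at 0 and returns `false`; calling `start(false, entries)` leaves the entries untouched and returns `true`.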

Testing (What was existing testing like? What have you done to improve it?): Added some unit tests.

Concerns (what feedback would you like?):

What happens if the number of shards in the config is not what's actually persisted in the DB? I believe we do this correctly, but worth a check.
Have I thought through the concurrency model (especially across multiple nodes) correctly? And/or is this too heavy-handed?

Where should we start reviewing?: ShardProgress

Priority (whenever / two weeks / yesterday): this week

changelog-app · 2021-02-23T22:22:51Z

Generate changelog in `changelog/@unreleased`

Type

Description

Targeted sweep progress may be reset with the resetTargetedSweepQueueProgressAndStopSweep flag in targeted sweep install configuration. This may be useful in cleaning up cruft in the targeted sweep queue that may have been written by failed transactions.

As the name suggests, this will prevent sweep from cleaning up old cells, so users should not run with this configuration in the steady state. If running your service in HA, once the last node rolls and reports that it has successfully reset the sweep progress table, we can be certain that progress has been reset to zero.

gmaretic

It's fine, but I'm not really sure why we want to stop sweep in addition to resetting. Though it's a good forcing function to make people remove the flag so I guess that's the reason.

gmaretic · 2021-02-26T15:52:43Z

atlasdb-impl-shared/src/main/java/com/palantir/atlasdb/sweep/queue/ShardProgress.java

@@ -135,6 +135,32 @@ private long increaseValueFromToAtLeast(ShardAndStrategy shardAndStrategy, long
        return currentValue;
    }

+    public void resetProgressForShard(ShardAndStrategy shardAndStrategy) {


Eh, not a fan of the code duplication, but it's different enough to make it awkward to reuse. We could just delete the entry to make this much simpler but I assume we want HA -- not that sweep will work if we don't have delete consistency anyway...

gmaretic · 2021-02-26T15:59:10Z

atlasdb-impl-shared/src/main/java/com/palantir/atlasdb/sweep/queue/TargetedSweeper.java

-        conservativeScheduler.scheduleBackgroundThreads();
-        thoroughScheduler.scheduleBackgroundThreads();
+        if (shouldResetAndStopSweep) {
+            log.warn("This AtlasDB node is operating in a mode where it is attempting to reset the progress of "


Is this really what we want? We could just allow it to start sweeping. Also, the nodes on old versions will still be attempting to sweep anyway.

Discussed offline: this centers on the behaviour of targeted sweep, where each node repeatedly CASes the bound from the value it last read to its own progress. We need to wait for all nodes to report that they're done with this.
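To make the concern concrete, here is a small sketch of how a reset can be undone by a node that is still sweeping. It assumes sweepers advance their bound with a retry-until-at-least loop, which is what the name `increaseValueFromToAtLeast` in the `ShardProgress` hunk suggests. `LostResetSketch` and `advanceToAtLeast` are hypothetical names, and `AtomicLong` stands in for the persisted value.

```java
import java.util.concurrent.atomic.AtomicLong;

final class LostResetSketch {
    // Hypothetical stand-in for a sweeper advancing its bound: retry the CAS
    // from whatever value was last read until progress is at least `target`.
    static long advanceToAtLeast(AtomicLong progress, long target) {
        long current = progress.get();
        while (current < target) {
            if (progress.compareAndSet(current, target)) {
                return target;
            }
            // Someone else changed the value (possibly a reset); re-read and retry.
            current = progress.get();
        }
        return current;
    }

    // One possible interleaving, replayed sequentially: another node resets
    // progress to 0, then a still-sweeping node publishes its own progress.
    static long resetThenSweep(long initialProgress, long sweeperTarget) {
        AtomicLong progress = new AtomicLong(initialProgress);
        progress.set(0L); // the reset performed by the rolled node
        advanceToAtLeast(progress, sweeperTarget); // old node keeps sweeping
        return progress.get();
    }
}
```

With these assumptions, `LostResetSketch.resetThenSweep(100L, 150L)` returns 150 rather than 0: the sweeper simply re-reads the reset value and CASes past it. This is why the reset is only trustworthy once no node is sweeping any more.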

gmaretic · 2021-02-26T16:01:18Z

atlasdb-impl-shared/src/test/java/com/palantir/atlasdb/sweep/queue/ShardProgressTest.java

+        doThrow(new CheckAndSetException("sadness")).when(mockKvs).checkAndSet(any());
+        ShardProgress instrumentedProgress = new ShardProgress(mockKvs);
+
+        assertThatCode(() -> instrumentedProgress.resetProgressForShard(CONSERVATIVE_TEN))


Probably want to verify that it only throws once the value actually repeats

Hmm, I think this is tested by stopsTryingToResetIfSomeoneElseDid() above - in practice you could check the CheckAndSetException's actual values but we don't get that here because of mocks, so I don't see another way of easily doing this?

Oh, I meant just verify that the method was called 4 times.
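The PR itself addressed this with a Mockito `verify` (see the "mockito verify" commit). As a library-free sketch of the same idea, the example below counts CAS attempts on a hand-rolled fake. Everything here is hypothetical: `CountingFakeStore`, `ResetWithRepeatCheck`, and the rule "give up once a failed CAS is followed by an unchanged read" are assumptions for illustration, and the read sequence is chosen only so that the count comes out as four.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical fake: reads return a scripted sequence, every CAS fails,
// and the number of CAS attempts is recorded.
final class CountingFakeStore {
    private final Iterator<Long> reads;
    int casCalls = 0;

    CountingFakeStore(List<Long> scriptedReads) {
        this.reads = scriptedReads.iterator();
    }

    long read() {
        return reads.next();
    }

    boolean checkAndSet(long expected, long update) {
        casCalls++;
        return false; // always "lose" the race
    }
}

final class ResetWithRepeatCheck {
    // Keep retrying while the observed value keeps changing; throw once a
    // failed CAS is followed by a read of the same value we just tried.
    static void reset(CountingFakeStore store) {
        long current = store.read();
        while (current != 0L) {
            if (store.checkAndSet(current, 0L)) {
                return;
            }
            long observed = store.read();
            if (observed == current) {
                throw new IllegalStateException("CAS failed but value is unchanged: " + current);
            }
            current = observed;
        }
    }
}
```

With scripted reads of 10, 20, 30, 40, 40, `reset` throws `IllegalStateException` after exactly four `checkAndSet` calls, which is the kind of call-count assertion the comment asks for.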

gmaretic · 2021-02-26T16:08:01Z

Discussed offline: document why we need to stop running sweep and wait for all nodes to reset

svc-autorelease · 2021-03-03T13:00:29Z

Released 0.302.5

jeremyk-91 added 4 commits February 23, 2021 22:15

targeted sweep magic magic

37a263a

spotlessApply

0b25f25

I'm done message

cbcf110

Add generated changelog entries

ac7e542

jeremyk-91 requested a review from gmaretic February 23, 2021 22:22

bleh

ca30262

gmaretic approved these changes Feb 26, 2021

jeremyk-91 added 4 commits March 1, 2021 15:25

Add docs

b55339b

clarified too-generous message

651491f

spotless

44effec

mockito verify

348ca54

jeremyk-91 added the "autorelease" and "merge when ready" labels Mar 3, 2021

bulldozer-bot bot merged commit a6f4a1f into develop Mar 3, 2021

bulldozer-bot bot deleted the jkong/oneshot-tasks branch March 3, 2021 13:00
