
ABR12: AllNodesDisabledNamespacesUpdater #5876

Merged: 47 commits merged into develop from gs/andnu on Feb 7, 2022

Conversation

gsheasby (Contributor) commented:

Goals (and why):
Next part of ClearLocksTask.
Add a class that does the following:

  • Allows us to disable/re-enable a given set of namespaces across all nodes
  • Rolls back attempts to do this (when safe to do so!) in case of failure or unreachable nodes

For the next PR:

  • Retry logic (I think this belongs at a higher level than this class)
  • Actually wiring this thing up to AtlasRestoreService/Client

Implementation Description (bullets):

  • Added AllNodesDisabledNamespacesUpdater
  • A lot of increasingly convoluted test cases

Testing (What was existing testing like? What have you done to improve it?):
See above; I added lots of tests for the one new class.

However, I do think there's an extremely convoluted failure case that this class doesn't handle; I ran out of energy to try to test it (or rather, my conviction that it should be handled is outweighed by the effort it would take).
Consider this scenario (a sketch of one way to detect it follows the list):

  • ping reaches all remote nodes (A and B) successfully
  • node B goes down
  • disable reaches local + remote A successfully but not remote B
  • we store that we reached two nodes (including local), but not which two.
  • node B comes back up, but node A goes down
  • reenable reaches local + remote B successfully (although in B's case, it doesn't do anything), but not remote A
  • we expected two responses and got two, so we declare the re-enabling to be a success
  • end state: we report no broken state, but remote A is disabled while B + local are not.
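
For illustration only, here is a minimal sketch (hypothetical names; none of this is code from the PR) of how remembering which nodes acknowledged the disable, rather than just counting responses, would make that end state detectable:

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: track *which* nodes acknowledged the disable, not just how many.
// In the scenario above, the disable is acked by {local, A} and the re-enable by {local, B}:
// the counts match, but the sets do not, so the mismatch becomes detectable.
final class DisableAcknowledgements {
    private final Set<String> nodesThatAcknowledgedDisable = new HashSet<>();

    void recordDisableAck(String nodeId) {
        nodesThatAcknowledgedDisable.add(nodeId);
    }

    boolean reEnableCoveredAllDisabledNodes(Set<String> nodesThatAcknowledgedReEnable) {
        return nodesThatAcknowledgedReEnable.containsAll(nodesThatAcknowledgedDisable);
    }
}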

Concerns (what feedback would you like?):
This new class is large and rather convoluted. I extracted what I could from the main disable/re-enable methods, and now there are a bunch of small private methods that are (I think) clear, but also numerous. The new class is also somewhat repetitive, but in a way that is difficult to resolve.

For one thing, I'm not completely certain that we want the two operations to have the exact same rollback rules: if we fail to re-enable some node, do we want to re-disable all nodes? I suppose this would be the safest thing to do (if we leave a quorum of nodes enabled, then those nodes could agree on leadership and start serving requests again), but this needs thought and checking.

If we decide that the two operations should be exact inverses of each other, then there's scope for pulling out some generic shared logic; a sketch of what that might look like follows. However, I've resisted attempting that until we agree that we should (and I'm not convinced it would make things easier to read or maintain in any case: we might later want to modify re-enable in a different way from disable, for example).
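
As a rough illustration of that "generic stuff" (purely hypothetical names and shapes, not anything in this PR), both operations could become instances of one helper that fans a request out to all nodes and rolls back with the inverse operation on failure:

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.BiPredicate;

// Hypothetical sketch: disable and re-enable as two instances of the same fan-out helper,
// each using the other as its rollback. Returns true only if every node succeeded.
final class AllNodesOperation<NODE, REQ> {
    private final BiPredicate<NODE, REQ> apply;   // e.g. disable
    private final BiPredicate<NODE, REQ> inverse; // e.g. re-enable, used for rollback

    AllNodesOperation(BiPredicate<NODE, REQ> apply, BiPredicate<NODE, REQ> inverse) {
        this.apply = apply;
        this.inverse = inverse;
    }

    boolean runOrRollBack(Set<NODE> nodes, REQ request) {
        Set<NODE> succeeded = new LinkedHashSet<>();
        for (NODE node : nodes) {
            if (apply.test(node, request)) {
                succeeded.add(node);
            } else {
                // roll back the nodes we already changed, using the inverse operation
                succeeded.forEach(ok -> inverse.test(ok, request));
                return false;
            }
        }
        return true;
    }
}

Whether this actually reads better than two explicit methods is exactly the open question above.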

Where should we start reviewing?: AllNodesDisabledNamespacesUpdater

Priority (whenever / two weeks / yesterday): it's big, important, and complex, so I'm hoping you can carve out some time soon.

gmaretic (Contributor) left a comment:

Still in progress, but I'm leaving comments now so you can take a look and potentially get started or discuss.

return disableWasSuccessfulOnAllNodes(rollbackResponses, expectedResponseCount);
}

private List<DisableNamespacesResponse> disableNamespacesOnAllNodes(Set<Namespace> namespaces, UUID lockId) {
gmaretic (Contributor):

This is kind of strange: we attempt to disable on all nodes, but then ignore the result and disable locally before proceeding.
I would expect us to (as sketched after this list):

  • fail early if any fail
  • not disable locally if any fail
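
A minimal sketch of that ordering (placeholder names, not the PR's actual API): attempt the remote disables first, and only touch local state if every remote call succeeded.

import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: each supplier wraps one remote disable call and reports success.
final class FailEarlyDisable {
    static boolean disableEverywhere(List<Supplier<Boolean>> remoteDisables, Supplier<Boolean> localDisable) {
        for (Supplier<Boolean> remoteDisable : remoteDisables) {
            if (!remoteDisable.get()) {
                return false; // fail early: don't touch local state if any remote node fails
            }
        }
        return localDisable.get();
    }
}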

gsheasby (Contributor, Author):

I see your point: in the case of failure, we should check the local state to establish whether the namespaces are consistently or partially disabled, but we shouldn't actually make the change if they happen not to be disabled.

wasSuccessful: boolean
lockedNamespaces: set<api.Namespace>
# we can assume another restore is in progress for this namespace (we lost our lock)
consistentlyLockedNamespaces: set<api.Namespace>
gmaretic (Contributor):

This API is quite confusing. When we make requests to individual nodes, the previously locked namespaces are always going to be consistent. But then we accumulate those, and if a namespace is locked on all nodes, we assume it has been locked by the same lock and call it consistent, otherwise partial?
If my understanding is correct, we should really just have two different types of response for the two different calls (see the sketch below).
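
A rough sketch of that suggestion (hypothetical Java types; the real API is conjure-defined, and these names are not from the PR): the per-node call only reports its own lock state, while only the aggregated response talks about consistent vs. partial locking.

import java.util.Set;

// Hypothetical response for a request to a single node: it can only say what is locked locally.
final class SingleNodeDisableResponse {
    final boolean wasSuccessful;
    final Set<String> namespacesAlreadyLocked; // locked on this node by some other lock

    SingleNodeDisableResponse(boolean wasSuccessful, Set<String> namespacesAlreadyLocked) {
        this.wasSuccessful = wasSuccessful;
        this.namespacesAlreadyLocked = namespacesAlreadyLocked;
    }
}

// Hypothetical aggregated response: only here does "consistent vs. partial" make sense.
final class AllNodesDisableResponse {
    final boolean wasSuccessful;
    final Set<String> consistentlyLockedNamespaces; // locked on every node
    final Set<String> partiallyLockedNamespaces;    // locked on only some nodes

    AllNodesDisableResponse(
            boolean wasSuccessful,
            Set<String> consistentlyLockedNamespaces,
            Set<String> partiallyLockedNamespaces) {
        this.wasSuccessful = wasSuccessful;
        this.consistentlyLockedNamespaces = consistentlyLockedNamespaces;
        this.partiallyLockedNamespaces = partiallyLockedNamespaces;
    }
}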

}

UUID lockId = request.getLockId();
// should only roll back locally
gsheasby (Contributor, Author):

  • We should not roll back locally: if we're here, there's no situation in which we successfully made the local update.

For the remote nodes, we should try to roll back all of them (e.g. if we failed, it might have been because of a timeout or something).

gmaretic (Contributor) left a comment:

This is now in pretty good shape. Most comments are small, but I do think we want to go with a best-effort approach on every re-enable we do, basically the opposite of what we do with disables (see the sketch below).
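
A small sketch of that best-effort shape (placeholder names, not the PR's API): unlike the fail-fast disable path, keep attempting every node after a failure and report which nodes could not be re-enabled.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: attempt the re-enable on every node, swallowing per-node failures,
// and return a per-node success map instead of aborting on the first failure.
final class BestEffortReEnable {
    static Map<String, Boolean> reEnableOnAllNodes(Map<String, Supplier<Boolean>> reEnableByNode) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        reEnableByNode.forEach((node, reEnable) -> {
            boolean success;
            try {
                success = reEnable.get();
            } catch (RuntimeException e) {
                success = false; // best effort: keep going even if this node is unreachable
            }
            results.put(node, success);
        });
        return results;
    }
}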

gmaretic (Contributor) left a comment:

The second review was a bit rushed, but you addressed the first CR well and the test coverage is pretty good, so I am going to approve.

Set<Namespace> namespacesWithExpectedLock =
Sets.difference(namespaces, namespacesWithLockConflict.keySet());
unlockNamespaces(namespacesWithExpectedLock);

log.error(
gmaretic (Contributor):

I'm not entirely sure of the semantics of @Transaction: at this point, are we guaranteed to have either succeeded or failed, or could we log this and then have the unlock above fail to commit?

gsheasby (Contributor, Author):

The transaction doesn't throw an exception (unless we somehow can't reach the db, I guess). If we're here, we're guaranteed to have found some conflict and made our best effort to work around it.

I guess there could be some write-write conflict here if multiple threads are doing stuff. If we wanted to be extra safe, we could pull the log message outside of the dao.

That would prevent us from logging a lie if there's somehow a conflict and the txn fails to commit.
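
A tiny sketch of that ordering (hypothetical names, not the PR's dao API): perform the unlock inside the transaction, and only log after the transaction has actually committed, so a failed commit can't leave a misleading log line.

import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch: the transactional unlock runs first and throws if the commit fails,
// so the conflict log line is only ever emitted after a successful commit.
final class UnlockThenLog {
    static void unlockAndLog(
            Consumer<Set<String>> transactionalUnlock,
            Set<String> namespacesWithExpectedLock,
            Runnable logConflict) {
        transactionalUnlock.accept(namespacesWithExpectedLock); // throws if the txn fails to commit
        logConflict.run(); // only reached after a successful commit
    }
}
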
bulldozer-bot merged commit 1e6c0e8 into develop on Feb 7, 2022.
bulldozer-bot deleted the gs/andnu branch on February 7, 2022 at 10:28.