ABR12: AllNodesDisabledNamespacesUpdater #5876
Conversation
This reverts commit 7704e3e.
somehow making the class longer
Still in-progress, but leaving comments now so you can take a look and potentially get started or discuss
    return disableWasSuccessfulOnAllNodes(rollbackResponses, expectedResponseCount);
}

private List<DisableNamespacesResponse> disableNamespacesOnAllNodes(Set<Namespace> namespaces, UUID lockId) {
This is kind of strange: we attempt to disable on all nodes, but then ignore the result and disable locally before proceeding.
I would expect to:
- fail early if any fail
- not disable locally if any fail
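The fail-early flow suggested above could be sketched roughly as follows. This is a minimal, self-contained illustration; all class and method names are simplified stand-ins, not the actual AtlasDB types in the PR.

```java
import java.util.List;
import java.util.Set;
import java.util.UUID;

final class FailEarlyDisableSketch {
    // Simplified per-node response: just success/failure.
    record NodeResponse(boolean success) {}

    interface RemoteDisabler {
        List<NodeResponse> disableOnAllNodes(Set<String> namespaces, UUID lockId);
    }

    interface LocalDisabler {
        void disable(Set<String> namespaces, UUID lockId);
    }

    // Attempt the remote disables first; only touch local state if every
    // remote node reported success.
    static boolean disableEverywhere(
            RemoteDisabler remote, LocalDisabler local, Set<String> namespaces, UUID lockId) {
        List<NodeResponse> responses = remote.disableOnAllNodes(namespaces, lockId);
        boolean allSucceeded = responses.stream().allMatch(NodeResponse::success);
        if (!allSucceeded) {
            return false; // fail early: local state is left untouched
        }
        local.disable(namespaces, lockId);
        return true;
    }
}
```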
I see your point - in the case of failure, we should check the local state to establish whether the namespaces are consistently or partially disabled; but we shouldn't actually make the change if they happen not to be disabled.
wasSuccessful: boolean
lockedNamespaces: set<api.Namespace>
# we can assume another restore is in progress for this namespace (we lost our lock)
consistentlyLockedNamespaces: set<api.Namespace>
This API is quite confusing. When we make requests to individual nodes, the previously locked namespaces are always going to be consistent. But then we accumulate those and if it's locked on all nodes assume they have been locked by the same lock and call it consistent or partial instead?
If my understanding is correct, we should really just have two different types of responses for the two different calls.
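To make the suggestion concrete, the split could look something like the sketch below: a per-node response only reports what that single node observed, and the "consistent vs. partial" distinction is derived only in the aggregated response. All names here are illustrative, not the actual conjure-generated types.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ResponseTypesSketch {
    // Returned by a single node: the namespaces that node already had locked.
    record SingleNodeResponse(boolean wasSuccessful, Set<String> lockedNamespaces) {}

    // Returned to the caller after fanning out to all nodes.
    record AggregatedResponse(
            boolean wasSuccessful,
            Set<String> consistentlyLockedNamespaces,
            Set<String> partiallyLockedNamespaces) {}

    static AggregatedResponse aggregate(List<SingleNodeResponse> perNode) {
        boolean allOk = perNode.stream().allMatch(SingleNodeResponse::wasSuccessful);

        // Namespaces locked on at least one node.
        Set<String> lockedAnywhere = new HashSet<>();
        perNode.forEach(r -> lockedAnywhere.addAll(r.lockedNamespaces()));

        // Namespaces locked on every node are "consistent"; the rest are partial.
        Set<String> consistent = new HashSet<>(lockedAnywhere);
        perNode.forEach(r -> consistent.retainAll(r.lockedNamespaces()));
        Set<String> partial = new HashSet<>(lockedAnywhere);
        partial.removeAll(consistent);

        return new AggregatedResponse(allOk, consistent, partial);
    }
}
```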
}

UUID lockId = request.getLockId();
// should only roll back locally
- should not roll back locally - if we're here, there's no situation where we successfully made the local update.
For the remote ones, we should try to roll back all of them (e.g. if we failed, we might have failed because of a timeout or something).
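A best-effort remote rollback along those lines could be sketched as below: try every node, and treat each failure (which may itself just be a timeout) as something to record rather than a reason to stop. Names are illustrative, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.UUID;

final class BestEffortRollbackSketch {
    interface RemoteNode {
        void rollBackDisable(Set<String> namespaces, UUID lockId) throws Exception;
    }

    // Returns the nodes that could not be rolled back so the caller can
    // surface them; never gives up after a single node's failure.
    static List<RemoteNode> rollBackOnAllNodes(
            List<RemoteNode> nodes, Set<String> namespaces, UUID lockId) {
        List<RemoteNode> failedNodes = new ArrayList<>();
        for (RemoteNode node : nodes) {
            try {
                node.rollBackDisable(namespaces, lockId);
            } catch (Exception e) {
                // The original failure may have been a timeout, so a rollback
                // failure here is equally possible; record it and continue.
                failedNodes.add(node);
            }
        }
        return failedNodes;
    }
}
```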
This is now in pretty good shape. Most comments are small, but I do think we want to go with a best effort approach on every re-enable we do -- basically the opposite of what we do with disables.
The second review was a bit rushed, but you addressed the first CR well and the test coverage is pretty good, so I am going to approve.
Set<Namespace> namespacesWithExpectedLock =
        Sets.difference(namespaces, namespacesWithLockConflict.keySet());
unlockNamespaces(namespacesWithExpectedLock);

log.error(
I'm not entirely sure of the semantics of @Transaction: at this point, are we guaranteed to have either succeeded or failed, or could we log this and then have the unlock above fail to commit?
The transaction doesn't throw an exception (unless we somehow can't reach the db, I guess). If we're here, we're guaranteed to have found some conflict and made our best effort to work around it.
I guess there could be some write-write conflict here if multiple threads are doing stuff. If we wanted to be extra safe, we could pull the log message outside of the dao.
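Pulling the log message outside of the dao, as suggested, might look roughly like this: the transactional work only mutates state and returns what it found, and logging happens after the call returns, so a failed commit can never produce a misleading log line. The types and names below are illustrative, not the real jdbi dao from the PR.

```java
import java.util.Set;
import java.util.function.Supplier;

final class LogOutsideTransactionSketch {
    // What the transactional unlock found: namespaces whose locks conflicted.
    record ConflictResult(Set<String> namespacesWithConflict) {}

    // If the commit fails, transactionalUnlock.get() throws and we never
    // reach the log line, so we cannot "log a lie".
    static ConflictResult unlockAndReport(Supplier<ConflictResult> transactionalUnlock) {
        ConflictResult result = transactionalUnlock.get();
        if (!result.namespacesWithConflict().isEmpty()) {
            System.err.println("Failed to unlock namespaces with conflicting locks: "
                    + result.namespacesWithConflict());
        }
        return result;
    }
}
```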
prevents us from logging a lie if there's somehow a conflict and the txn fails to commit.
Goals (and why):
Next part of ClearLocksTask.
Add a class that does the following:
For the next PR:
Implementation Description (bullets):
Testing (What was existing testing like? What have you done to improve it?):
See above, added lots of tests for the one new class
However, I do think there's an extremely convoluted failure case that this class doesn't handle, but I ran out of energy to try to test it (or rather, my conviction that it should be handled is lower than the energy required to test it).
Consider this:
Concerns (what feedback would you like?):
This new class is large and rather convoluted. I extracted what I could from the main disable/re-enable methods, and now there are a bunch of small private methods that are (I think) clear, but also numerous. The new class is also somewhat repetitive, but in a way that is difficult to resolve.
For one thing, I'm not completely certain that we want the two operations to have exactly the same rollback rules - if we fail to re-enable some node, do we want to re-disable all nodes? I suppose that would be the safest option (if we leave a quorum of nodes enabled, those nodes could agree on leadership and start serving requests again) - but this needs thought and checking.
If we decide that the two operations should be exact inverses of each other, then there's scope for pulling out some generic stuff. However, I've resisted attempting that until we're agreed that we should (and I'm not convinced it will make things easier to read/maintain in any case - we might later want to modify re-enable in a different way to disable, for example).
Where should we start reviewing?: AllNodesDisabledNamespaceUpdater
Priority (whenever / two weeks / yesterday): it's big, important, and complex, so I'm hoping you can carve out some time soon.