
ABR12: AllNodesDisabledNamespacesUpdater #5876

Merged: 47 commits merged into develop from gs/andnu on Feb 7, 2022

Conversation

gsheasby (Contributor) commented:

Goals (and why):
Next part of ClearLocksTask.
Add a class that does the following:

  • Allows us to disable/re-enable a given set of namespaces across all nodes
  • Rolls back attempts to do this (when safe to do so!) in case of failure or unreachable nodes

For the next PR:

  • Retry logic (I think this belongs at a higher level than this class)
  • Actually wiring this thing up to AtlasRestoreService/Client

Implementation Description (bullets):

  • Added AllNodesDisabledNamespacesUpdater
  • A lot of increasingly convoluted test cases

Testing (What was existing testing like? What have you done to improve it?):
See above; I added lots of tests for the one new class.

However, I do think there's an extremely convoluted failure case that this class doesn't handle; I ran out of energy to try to test it (or rather, my conviction that it should be handled is outweighed by the effort it would take).
Consider this scenario (a sketch of one way to detect it follows the list):

  • ping reaches all remote nodes (A and B) successfully
  • node B goes down
  • disable reaches local + remote A successfully but not remote B
  • we store that we reached two nodes (including local), but not which two.
  • node B comes back up, but node A goes down
  • reenable reaches local + remote B successfully (although in B's case, it doesn't do anything), but not remote A
  • we expected two responses and got two, so we declare the re-enabling to be a success
  • end state: we report no broken state, but remote A is disabled while B + local are not.
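
For illustration only, here is a minimal sketch (hypothetical names; none of this is code from the PR) of how remembering which nodes acknowledged the disable, rather than just counting responses, would make that end state detectable:

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: track *which* nodes acknowledged the disable, not just how many.
// In the scenario above, the disable is acked by {local, A} and the re-enable by {local, B}:
// the counts match, but the sets do not, so the mismatch becomes detectable.
final class DisableAcknowledgements {
    private final Set<String> nodesThatAcknowledgedDisable = new HashSet<>();

    void recordDisableAck(String nodeId) {
        nodesThatAcknowledgedDisable.add(nodeId);
    }

    boolean reEnableCoveredAllDisabledNodes(Set<String> nodesThatAcknowledgedReEnable) {
        return nodesThatAcknowledgedReEnable.containsAll(nodesThatAcknowledgedDisable);
    }
}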

Concerns (what feedback would you like?):
This new class is large and rather convoluted. I extracted what I could from the main disable/re-enable methods, and now there are a bunch of small private methods that are (I think) clear, but also numerous. The new class is also somewhat repetitive, but in a way that is difficult to resolve.

For one thing, I'm not completely certain that we want the two operations to have the exact same rollback rules: if we fail to re-enable some node, do we want to re-disable all nodes? I suppose this would be the safest thing to do (if we leave a quorum of nodes enabled, then those nodes could agree on leadership and start serving requests again), but this needs thought and checking.

If we decide that the two operations should be exact inverses of each other, then there's scope for pulling out some generic shared logic; a sketch of what that might look like follows. However, I've resisted attempting that until we agree that we should (and I'm not convinced it would make things easier to read or maintain in any case: we might later want to modify re-enable in a different way from disable, for example).
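
As a rough illustration of that "generic stuff" (purely hypothetical names and shapes, not anything in this PR), both operations could become instances of one helper that fans a request out to all nodes and rolls back with the inverse operation on failure:

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.BiPredicate;

// Hypothetical sketch: disable and re-enable as two instances of the same fan-out helper,
// each using the other as its rollback. Returns true only if every node succeeded.
final class AllNodesOperation<NODE, REQ> {
    private final BiPredicate<NODE, REQ> apply;   // e.g. disable
    private final BiPredicate<NODE, REQ> inverse; // e.g. re-enable, used for rollback

    AllNodesOperation(BiPredicate<NODE, REQ> apply, BiPredicate<NODE, REQ> inverse) {
        this.apply = apply;
        this.inverse = inverse;
    }

    boolean runOrRollBack(Set<NODE> nodes, REQ request) {
        Set<NODE> succeeded = new LinkedHashSet<>();
        for (NODE node : nodes) {
            if (apply.test(node, request)) {
                succeeded.add(node);
            } else {
                // roll back the nodes we already changed, using the inverse operation
                succeeded.forEach(ok -> inverse.test(ok, request));
                return false;
            }
        }
        return true;
    }
}

Whether this actually reads better than two explicit methods is exactly the open question above.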

Where should we start reviewing?: AllNodesDisabledNamespacesUpdater

Priority (whenever / two weeks / yesterday): it's big, important, and complex, so I'm hoping you can carve out some time soon.

gmaretic (Contributor) left a comment:

Still in progress, but I'm leaving comments now so you can take a look and potentially get started or discuss.

return disableWasSuccessfulOnAllNodes(rollbackResponses, expectedResponseCount);
}

private List<DisableNamespacesResponse> disableNamespacesOnAllNodes(Set<Namespace> namespaces, UUID lockId) {
gmaretic (Contributor):

This is kind of strange: we attempt to disable on all nodes, but then ignore the result and disable locally before proceeding.
I would expect us to (as sketched after this list):

  • fail early if any fail
  • not disable locally if any fail
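
A minimal sketch of that ordering (placeholder names, not the PR's actual API): attempt the remote disables first, and only touch local state if every remote call succeeded.

import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: each supplier wraps one remote disable call and reports success.
final class FailEarlyDisable {
    static boolean disableEverywhere(List<Supplier<Boolean>> remoteDisables, Supplier<Boolean> localDisable) {
        for (Supplier<Boolean> remoteDisable : remoteDisables) {
            if (!remoteDisable.get()) {
                return false; // fail early: don't touch local state if any remote node fails
            }
        }
        return localDisable.get();
    }
}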

gsheasby (Contributor, Author):

I see your point: in the case of failure, we should check the local state to establish whether the namespaces are consistently or partially disabled, but we shouldn't actually make the change if they happen not to be disabled.

wasSuccessful: boolean
lockedNamespaces: set<api.Namespace>
# we can assume another restore is in progress for this namespace (we lost our lock)
consistentlyLockedNamespaces: set<api.Namespace>
gmaretic (Contributor):

This API is quite confusing. When we make requests to individual nodes, the previously locked namespaces are always going to be consistent. But then we accumulate those, and if a namespace is locked on all nodes, we assume it has been locked by the same lock and call it consistent, otherwise partial?
If my understanding is correct, we should really just have two different types of response for the two different calls (see the sketch below).
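
A rough sketch of that suggestion (hypothetical Java types; the real API is conjure-defined, and these names are not from the PR): the per-node call only reports its own lock state, while only the aggregated response talks about consistent vs. partial locking.

import java.util.Set;

// Hypothetical response for a request to a single node: it can only say what is locked locally.
final class SingleNodeDisableResponse {
    final boolean wasSuccessful;
    final Set<String> namespacesAlreadyLocked; // locked on this node by some other lock

    SingleNodeDisableResponse(boolean wasSuccessful, Set<String> namespacesAlreadyLocked) {
        this.wasSuccessful = wasSuccessful;
        this.namespacesAlreadyLocked = namespacesAlreadyLocked;
    }
}

// Hypothetical aggregated response: only here does "consistent vs. partial" make sense.
final class AllNodesDisableResponse {
    final boolean wasSuccessful;
    final Set<String> consistentlyLockedNamespaces; // locked on every node
    final Set<String> partiallyLockedNamespaces;    // locked on only some nodes

    AllNodesDisableResponse(
            boolean wasSuccessful,
            Set<String> consistentlyLockedNamespaces,
            Set<String> partiallyLockedNamespaces) {
        this.wasSuccessful = wasSuccessful;
        this.consistentlyLockedNamespaces = consistentlyLockedNamespaces;
        this.partiallyLockedNamespaces = partiallyLockedNamespaces;
    }
}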

}

UUID lockId = request.getLockId();
// should only roll back locally
gsheasby (Contributor, Author):

  • We should not roll back locally: if we're here, there's no situation in which we successfully made the local update.

For the remote nodes, we should try to roll back all of them (e.g. if we failed, it might have been because of a timeout or something).

gmaretic (Contributor) left a comment:

This is now in pretty good shape. Most comments are small, but I do think we want to go with a best-effort approach on every re-enable we do, basically the opposite of what we do with disables (see the sketch below).
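
A small sketch of that best-effort shape (placeholder names, not the PR's API): unlike the fail-fast disable path, keep attempting every node after a failure and report which nodes could not be re-enabled.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: attempt the re-enable on every node, swallowing per-node failures,
// and return a per-node success map instead of aborting on the first failure.
final class BestEffortReEnable {
    static Map<String, Boolean> reEnableOnAllNodes(Map<String, Supplier<Boolean>> reEnableByNode) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        reEnableByNode.forEach((node, reEnable) -> {
            boolean success;
            try {
                success = reEnable.get();
            } catch (RuntimeException e) {
                success = false; // best effort: keep going even if this node is unreachable
            }
            results.put(node, success);
        });
        return results;
    }
}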

gmaretic (Contributor) left a comment:

The second review was a bit rushed, but you addressed the first CR well and the test coverage is pretty good, so I am going to approve.

Set<Namespace> namespacesWithExpectedLock =
Sets.difference(namespaces, namespacesWithLockConflict.keySet());
unlockNamespaces(namespacesWithExpectedLock);

log.error(
gmaretic (Contributor):

I'm not entirely sure of the semantics of @Transaction: at this point, are we guaranteed to have either succeeded or failed, or could we log this and then have the unlock above fail to commit?

gsheasby (Contributor, Author):

The transaction doesn't throw an exception (unless we somehow can't reach the db, I guess). If we're here, we're guaranteed to have found some conflict and made our best effort to work around it.

I guess there could be some write-write conflict here if multiple threads are doing stuff. If we wanted to be extra safe, we could pull the log message outside of the dao.

That would prevent us from logging a lie if there's somehow a conflict and the txn fails to commit.
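
A tiny sketch of that ordering (hypothetical names, not the PR's dao API): perform the unlock inside the transaction, and only log after the transaction has actually committed, so a failed commit can't leave a misleading log line.

import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch: the transactional unlock runs first and throws if the commit fails,
// so the conflict log line is only ever emitted after a successful commit.
final class UnlockThenLog {
    static void unlockAndLog(
            Consumer<Set<String>> transactionalUnlock,
            Set<String> namespacesWithExpectedLock,
            Runnable logConflict) {
        transactionalUnlock.accept(namespacesWithExpectedLock); // throws if the txn fails to commit
        logConflict.run(); // only reached after a successful commit
    }
}
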
bulldozer-bot merged commit 1e6c0e8 into develop on Feb 7, 2022.
bulldozer-bot deleted the gs/andnu branch on February 7, 2022 at 10:28.