Skip to content
This repository has been archived by the owner on Nov 14, 2024. It is now read-only.

ABR1: AtlasBackupService #5708

Merged
merged 47 commits into from
Nov 8, 2021
Merged

ABR1: AtlasBackupService #5708

merged 47 commits into from
Nov 8, 2021

Conversation

gsheasby
Copy link
Contributor

@gsheasby gsheasby commented Oct 15, 2021

Current status (1/11):
The implementation and wiring looks complete and ready for review to me, needs more tests.

Goals (and why): Atlas should provide backup and restore services, so that our internal backup + restore solution is not overly reliant on AtlasDB internals.

Implementation Description (bullets):

  • Conjurise Timelock's LockImmutableTimestamp endpoint, as it's needed for the backup process
  • Add AtlasBackupService, which encapsulates the steps of the backup process where communication with Timelock is required
  • Introduced NamespacedLockToken and PrepareBackupResponse objects to hold the information that is needed by the backup process, without necessarily tying it to "lock immutable timestamp" (so that if we wish to extend this additively, we can do so without a breaking change)

Testing (What was existing testing like? What have you done to improve it?):

  • Added tests in AtlasBackupServiceTest.

Concerns (what feedback would you like?):

  • Would like feedback on general approach. Are the method signatures / new request/response objects appropriate?

Where should we start reviewing?:

  • AtlasBackupService

Priority (whenever / two weeks / yesterday):

  • early this week

@changelog-app
Copy link

changelog-app bot commented Oct 15, 2021

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Add AtlasBackupService, which encapsulates the steps of the backup process where communication with Timelock and/or knowledge of AtlasDB internals is required.

Check the box to generate changelog(s)

  • Generate changelog entry

Copy link
Contributor

@gmaretic gmaretic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a high-level comment, before I do a more detailed review -- we should define the backup service as a conjure API. This does require a new version of timelock to be picked up, but it decouples the server-side logic from the client

Copy link
Contributor

@gmaretic gmaretic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted above, I would structure this differently with the idea that we make calls as efficient as possible and move the logic into timelock itself while exposing a minimal interface needed for backups and not relying on the implementation.
Happy to go into more detail offline if that helps

}

// lock immutable timestamp
public PrepareBackupResponse prepareBackup(AuthHeader authHeader, String namespace) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also make this endpoint accept a set of namespaces for efficiency

}

// check immutable ts lock
public boolean checkBackupIsValid(AuthHeader authHeader, NamespacedLockToken namespacedLockToken) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would even go as far as not having the NamespacedLockToken directly, but some BackupToken that is generated by timelock. Its validity might be tied to the validity of the immutable ts lock for now, but it could change in the future and the client should not have to care about that

import java.util.Set;
import org.jetbrains.annotations.NotNull;

public class AtlasBackupService {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a thick client, most of this logic should live in timelock server

@@ -146,6 +174,11 @@ services:
namespace: string
request: ConjureLockRequest
returns: ConjureLockResponse
lockImmutableTimestamp:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not expose this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To elaborate: this is not generally something we want to allow users to do -- it is an implementation detail of how transactions work, and exposing it both makes further changes more difficult and could cause reckless users to adversely affect things they don't know about. For example, if someone calls this and holds it for an indefinite amount of time, that blocks sweep from progressing.

@gsheasby gsheasby changed the title WIP [ABR1]: AtlasBackupService ABR1: AtlasBackupService Nov 1, 2021
Copy link
Contributor

@gmaretic gmaretic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of comments, but almost there!

}

public static AtlasBackupTask create(Refreshable<ServicesConfigBlock> servicesConfigBlock, String serviceName) {
ReloadingFactory reloadingFactory = DialogueClients.create(servicesConfigBlock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can go with an AtlasBackupService that has a fixed client and the same API instead of recreating the client for each task

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the naming is misleading you here - renaming to AtlasBackupService makes sense. The serviceName is whatever the timelock service name is configured to be internally (not connected to the namespace), and the class will be created once (on internal service startup), and reused for multiple backup operations.

.collect(Collectors.toSet());
CompleteBackupRequest request = CompleteBackupRequest.of(tokens);
CompleteBackupResponse response = atlasBackupServiceBlocking.completeBackup(authHeader, request);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: actually persist the tokens using a persister that is passed in to this class, then we have an atlas-side implementation of the persister that conforms with the current backup story in internal tooling

Copy link
Contributor

@gmaretic gmaretic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost there (I guess I said that in the previous review, but now it's mainly nits)

storedTokens.put(backupToken.getNamespace(), backupToken);
}

// TODO(gs): actually persist the token using a persister passed into this class.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this should be a follow-up PR

// Then we have an atlas-side implementation of the persister that conforms with the current backup story
public Set<Namespace> completeBackup(Set<Namespace> namespaces) {
Set<InProgressBackupToken> tokens = namespaces.stream()
.map(storedTokens::get)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.map(storedTokens::get)
.map(storedTokens::remove)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the unlock fails, do we still want to remove the element from the map?

I guess that depends on why it failed?

  • If the lock is lost, remove the element
  • If we just failed to talk to timelock, keep the element

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately the current API doesn't give us the information to decide...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we will retry anyway

@@ -117,6 +144,22 @@ types:
leaderTimes: map<Namespace, LeaderTime>

services:
AtlasBackupClient:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeremyk-91 do you think this is a reasonable name even though it defies the usual naming? AtlasBackupService should really be the class we expose to clients, so I think this makes sense

Optional<InProgressBackupToken> prepareBackup(Namespace namespace) {
try {
return Optional.of(tryPrepareBackup(namespace));
} catch (Exception ex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the above ever throw?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like not - it must have done in an earlier version of the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having said that, don't you get some sort of Exception if we fail to reach timelock? Don't we want to handle that here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is server-side

Copy link
Contributor

@gmaretic gmaretic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good!

private InProgressBackupToken prepareBackup(Namespace namespace) {
AsyncTimelockService timelock = timelock(namespace);
LockImmutableTimestampResponse response = timelock.lockImmutableTimestamp(IdentifiedTimeLockRequest.create());
long timestamp = timelock.getFreshTimestamp();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised that technically we can reuse the immutable timestamp here, but let's just keep this as it makes the change smaller

@bulldozer-bot bulldozer-bot bot merged commit 9fafbf9 into develop Nov 8, 2021
@bulldozer-bot bulldozer-bot bot deleted the gs/abs branch November 8, 2021 13:49
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants