Make restore AZ aware #4039

Open
Michal-Leszczynski opened this issue Sep 23, 2024 · 11 comments

Comments

@Michal-Leszczynski
Collaborator

During restore improvement meetings, it was mentioned that making SM AZ-aware could speed up the restore process.
We should experiment with that and see the results.

@Michal-Leszczynski
Collaborator Author

Unfortunately, I don't have a clear idea of how to safely use AZ information in SM restore.
@avikivity could you explain the idea behind it?

@Michal-Leszczynski
Collaborator Author

cc: @karol-kokoszka @mykaul @tzach

@avikivity
Member

If datacenter.RF == count(datacenter.racks), then each rack gets one replica. Typical example is RF=3 and nr_racks=3.

If this holds, you can take a rack's backup and copy it to just one restored cluster rack, with nodetool refresh --load-and-stream --keep-rack (doesn't exist yet). This reduces the number of receivers from 3 to 1, and significantly reduces the compaction load.
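
For illustration, here is a minimal Go sketch of the precondition described above: the rack-level path applies only when a DC's RF equals its number of racks. The types and names are hypothetical assumptions, not Scylla Manager code.

```go
// Minimal sketch (not Scylla Manager code): per-datacenter check for the
// precondition described above, i.e. datacenter.RF == count(datacenter.racks),
// which means each rack holds exactly one full replica.
package rackrestore

// TopologyDC is a hypothetical view of one datacenter's topology.
type TopologyDC struct {
	Name  string
	RF    int                 // keyspace replication factor in this DC
	Racks map[string][]string // rack name -> node addresses
}

// CanUseRackAwarePath reports whether a rack's backup could be restored into a
// single destination rack (e.g. with a future `nodetool refresh
// --load-and-stream --keep-rack`), reducing receivers from RF to 1.
func CanUseRackAwarePath(dc TopologyDC) bool {
	return dc.RF > 0 && dc.RF == len(dc.Racks)
}
```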

@Michal-Leszczynski
Collaborator Author

This reduces the number of receivers from 3 to 1, and significantly reduces the compaction load.

The reduction in receivers is already achieved with --primary-replica-only, but I guess that streaming within the same rack should be faster.

Perhaps this would also speed up the post-restore repair, as (depending on data consistency during backup) less data would need to be transferred between the nodes during the repair.

@bhalevy
Member

bhalevy commented Sep 24, 2024

Cc @regevran

@regevran

This should be a scylladb issue, but as an optimization, not for the general case.

@karol-kokoszka
Collaborator

karol-kokoszka commented Oct 7, 2024

The scenario where the DC replication factor equals the number of racks in a given datacenter requires Scylla Manager to understand the mapping between the source rack and the destination rack. In this case, Scylla Manager could ensure that data from a single rack is restored (downloaded and then subjected to l&s) only by nodes from the corresponding rack, assuming that the --keep-rack flag is implemented in the core. This is because the l&s process would always select a node from the same rack.

A repair would not be necessary, because restoring the entire replica is guaranteed by the --keep-rack flag and the fact that the RF equals the number of racks. Currently, l&s duplicates work because the --primary-replica-only flag causes the same replica's data to always be streamed to the same node, which eventually necessitates a post-restore repair.

To clarify, we assume this optimization applies only when the restore occurs on an identical topology.
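
For illustration, a minimal Go sketch of the source-rack to destination-rack mapping under the identical-topology assumption; the names and types are hypothetical, not SM code.

```go
// Illustrative sketch only: under the "identical topology" assumption, map
// each source rack to a destination rack by name and fail fast on any
// mismatch, so that a rack's backup is restored only by nodes of the
// corresponding destination rack.
package rackrestore

import "fmt"

// MapRacks returns source rack -> destination rack, requiring a 1:1 match
// by rack name.
func MapRacks(srcRacks, dstRacks []string) (map[string]string, error) {
	if len(srcRacks) != len(dstRacks) {
		return nil, fmt.Errorf("rack count mismatch: source %d, destination %d", len(srcRacks), len(dstRacks))
	}
	dst := make(map[string]bool, len(dstRacks))
	for _, r := range dstRacks {
		dst[r] = true
	}
	m := make(map[string]string, len(srcRacks))
	for _, r := range srcRacks {
		if !dst[r] {
			return nil, fmt.Errorf("source rack %q has no matching destination rack", r)
		}
		m[r] = r
	}
	return m, nil
}
```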

Currently, the backup manifest does not include rack information. The RF can be determined from the dumped schema by parsing the CREATE KEYSPACE... string.
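
As a rough illustration of where the RF would come from, per-DC replication factors could be pulled out of the dumped CREATE KEYSPACE statement along these lines; the regexp-based approach is an assumption for brevity, and a real implementation should use proper CQL parsing.

```go
// Rough sketch (not production parsing): extract per-DC replication factors
// from a dumped CREATE KEYSPACE statement that uses NetworkTopologyStrategy,
// e.g.
//   CREATE KEYSPACE ks WITH replication =
//     {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': '3'};
package rackrestore

import (
	"regexp"
	"strconv"
)

var rfRe = regexp.MustCompile(`'([^']+)'\s*:\s*'?(\d+)'?`)

// ReplicationFactors returns DC name -> RF; the 'class' entry is skipped.
func ReplicationFactors(createKeyspace string) map[string]int {
	out := make(map[string]int)
	for _, m := range rfRe.FindAllStringSubmatch(createKeyspace, -1) {
		if m[1] == "class" {
			continue
		}
		if rf, err := strconv.Atoi(m[2]); err == nil {
			out[m[1]] = rf
		}
	}
	return out
}
```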

We could extend the manifest to include rack information, but this would mean that older backups would not contain this information. As a result, the optimization would only apply to new backups.

We will not include this in version 3.4. The prerequisite for this optimization is that the --keep-rack flag is implemented on the core side.

@karol-kokoszka
Collaborator

karol-kokoszka commented Oct 7, 2024

If datacenter.RF == count(datacenter.racks), then each rack gets one replica. Typical example is RF=3 and nr_racks=3.

If this holds, you can take a rack's backup and copy it to just one restored cluster rack, with nodetool refresh --load-and-stream --keep-rack (doesn't exist yet). This reduces the number of receivers from 3 to 1, and significantly reduces the compaction load.

It’s a bit counterintuitive to combine --primary-replica-only with --keep-rack. The primary replica is unique for a given partition key, while --keep-rack means that data is streamed to a replica in the same rack as the caller. Perhaps, instead of combining --keep-rack and --primary-replica-only, there should be a single flag that defines the desired behavior. For example, using --keep-rack without --primary-replica-only might make more sense. It should be possible to use either --keep-rack or --primary-replica-only independently.

EDIT
There is nothing about the --primary-replica-only flag in the original comment.
It's just about using --load-and-stream + --keep-rack. It's not counterintuitive to combine load & stream + keep_rack... it makes sense.

@bhalevy
Member

bhalevy commented Oct 7, 2024

This should be a scylladb issue, but as an optimization, not for the general case.

@regevran please open an issue for implementing nodetool refresh --load-and-stream --keep-rack
Cc @denesb

@bhalevy
Member

bhalevy commented Oct 7, 2024

Currently, the backup manifest does not include rack information.

FWIW, this can be retrieved from system.local or system.peers if they are backed up.
The scylla sstable tool can be used to dump their contents in JSON format.
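
For illustration, a hedged Go sketch that scans such a JSON dump for a "rack" value without committing to the exact dump layout; the key name and the string encoding are assumptions, not a documented format.

```go
// Hedged sketch: given JSON produced by dumping the backed-up system.local
// sstables with the scylla sstable tooling, walk the structure generically
// and return the first string value stored under a key named "rack".
package rackrestore

import "encoding/json"

// RackFromDump returns the rack name found in the dump, if any.
func RackFromDump(dump []byte) (string, bool) {
	var v interface{}
	if err := json.Unmarshal(dump, &v); err != nil {
		return "", false
	}
	return findRack(v)
}

func findRack(v interface{}) (string, bool) {
	switch t := v.(type) {
	case map[string]interface{}:
		if s, ok := t["rack"].(string); ok {
			return s, true
		}
		for _, child := range t {
			if s, ok := findRack(child); ok {
				return s, true
			}
		}
	case []interface{}:
		for _, child := range t {
			if s, ok := findRack(child); ok {
				return s, true
			}
		}
	}
	return "", false
}
```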

@regevran

regevran commented Oct 8, 2024

please open issue for implementing nodetool refresh --load-and-stream --keep-rack

#4798
