Make restore AZ aware #4039

Open
Michal-Leszczynski opened this issue Sep 23, 2024 · 11 comments

Comments

@Michal-Leszczynski
Collaborator

During restore improvement meetings, it was mentioned that making SM AZ-aware could speed up the restore process.
We should experiment with that and see the results.

@Michal-Leszczynski
Collaborator Author

Unfortunately, I don't have a clear idea of how to safely use AZ information in SM restore.
@avikivity could you explain the idea behind it?

@Michal-Leszczynski
Collaborator Author

cc: @karol-kokoszka @mykaul @tzach

@avikivity
Member

If datacenter.RF == count(datacenter.racks), then each rack gets one replica. Typical example is RF=3 and nr_racks=3.

If this holds, you can take a rack's backup and copy it to just one restored cluster rack, with nodetool refresh --load-and-stream --keep-rack (doesn't exist yet). This reduces the number of receivers from 3 to 1, and significantly reduces the compaction load.
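
For illustration, here is a minimal Go sketch of the precondition described above: the rack-level path applies only when a DC's RF equals its number of racks. The types and names are hypothetical assumptions, not Scylla Manager code.

```go
// Minimal sketch (not Scylla Manager code): per-datacenter check for the
// precondition described above, i.e. datacenter.RF == count(datacenter.racks),
// which means each rack holds exactly one full replica.
package rackrestore

// TopologyDC is a hypothetical view of one datacenter's topology.
type TopologyDC struct {
	Name  string
	RF    int                 // keyspace replication factor in this DC
	Racks map[string][]string // rack name -> node addresses
}

// CanUseRackAwarePath reports whether a rack's backup could be restored into a
// single destination rack (e.g. with a future `nodetool refresh
// --load-and-stream --keep-rack`), reducing receivers from RF to 1.
func CanUseRackAwarePath(dc TopologyDC) bool {
	return dc.RF > 0 && dc.RF == len(dc.Racks)
}
```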

@Michal-Leszczynski
Collaborator Author

This reduces the number of receivers from 3 to 1, and significantly reduces the compaction load.

The reduction in receivers is already achieved with --primary-replica-only, but I guess that streaming within the same rack should be faster.

Perhaps this would also speed up the post-restore repair, as (depending on data consistency during backup) less data would need to be transferred between the nodes during the repair.

@bhalevy
Member

bhalevy commented Sep 24, 2024

Cc @regevran

@regevran

This should be a scylladb issue, but as an optimization, not for the general case.

@karol-kokoszka
Collaborator

karol-kokoszka commented Oct 7, 2024

The scenario where the DC replication factor equals the number of racks in a given datacenter requires Scylla Manager to understand the mapping between the source rack and the destination rack. In this case, Scylla Manager could ensure that data from a single rack is restored (downloaded and then subjected to l&s) only by nodes from the corresponding rack, assuming that the --keep-rack flag is implemented in the core. This is because the l&s process would always select a node from the same rack.

A repair would not be necessary, because restoring the entire replica is guaranteed by the --keep-rack flag and the fact that the RF equals the number of racks. Currently, l&s duplicates work because the --primary-replica-only flag causes the same replica's data to always be streamed to the same node, which eventually necessitates a post-restore repair.

To clarify, we assume this optimization applies only when the restore occurs on an identical topology.
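
For illustration, a minimal Go sketch of the source-rack to destination-rack mapping under the identical-topology assumption; the names and types are hypothetical, not SM code.

```go
// Illustrative sketch only: under the "identical topology" assumption, map
// each source rack to a destination rack by name and fail fast on any
// mismatch, so that a rack's backup is restored only by nodes of the
// corresponding destination rack.
package rackrestore

import "fmt"

// MapRacks returns source rack -> destination rack, requiring a 1:1 match
// by rack name.
func MapRacks(srcRacks, dstRacks []string) (map[string]string, error) {
	if len(srcRacks) != len(dstRacks) {
		return nil, fmt.Errorf("rack count mismatch: source %d, destination %d", len(srcRacks), len(dstRacks))
	}
	dst := make(map[string]bool, len(dstRacks))
	for _, r := range dstRacks {
		dst[r] = true
	}
	m := make(map[string]string, len(srcRacks))
	for _, r := range srcRacks {
		if !dst[r] {
			return nil, fmt.Errorf("source rack %q has no matching destination rack", r)
		}
		m[r] = r
	}
	return m, nil
}
```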

Currently, the backup manifest does not include rack information. The RF can be determined from the dumped schema by parsing the CREATE KEYSPACE... string.
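
As a rough illustration of where the RF would come from, per-DC replication factors could be pulled out of the dumped CREATE KEYSPACE statement along these lines; the regexp-based approach is an assumption for brevity, and a real implementation should use proper CQL parsing.

```go
// Rough sketch (not production parsing): extract per-DC replication factors
// from a dumped CREATE KEYSPACE statement that uses NetworkTopologyStrategy,
// e.g.
//   CREATE KEYSPACE ks WITH replication =
//     {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': '3'};
package rackrestore

import (
	"regexp"
	"strconv"
)

var rfRe = regexp.MustCompile(`'([^']+)'\s*:\s*'?(\d+)'?`)

// ReplicationFactors returns DC name -> RF; the 'class' entry is skipped.
func ReplicationFactors(createKeyspace string) map[string]int {
	out := make(map[string]int)
	for _, m := range rfRe.FindAllStringSubmatch(createKeyspace, -1) {
		if m[1] == "class" {
			continue
		}
		if rf, err := strconv.Atoi(m[2]); err == nil {
			out[m[1]] = rf
		}
	}
	return out
}
```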

We could extend the manifest to include rack information, but this would mean that older backups would not contain this information. As a result, the optimization would only apply to new backups.

We will not include this in version 3.4. The prerequisite for this optimization is that the --keep-rack flag is implemented on the core side.

@karol-kokoszka
Collaborator

karol-kokoszka commented Oct 7, 2024

If datacenter.RF == count(datacenter.racks), then each rack gets one replica. Typical example is RF=3 and nr_racks=3.

If this holds, you can take a rack's backup and copy it to just one restored cluster rack, with nodetool refresh --load-and-stream --keep-rack (doesn't exist yet). This reduces the number of receivers from 3 to 1, and significantly reduces the compaction load.

It’s a bit counterintuitive to combine --primary-replica-only with --keep-rack. The primary replica is unique for a given partition key, while --keep-rack means that data is streamed to a replica in the same rack as the caller. Perhaps, instead of combining --keep-rack and --primary-replica-only, there should be a single flag that defines the desired behavior. For example, using --keep-rack without --primary-replica-only might make more sense. It should be possible to use either --keep-rack or --primary-replica-only independently.

EDIT
There is nothing about the --primary-replica-only flag in the original comment.
It's just about using --load-and-stream + --keep-rack. It's not counterintuitive to combine load & stream + keep_rack... it makes sense.

@bhalevy
Member

bhalevy commented Oct 7, 2024

This should be a scylladb issue, but as an optimization, not for the general case.

@regevran please open an issue for implementing nodetool refresh --load-and-stream --keep-rack
Cc @denesb

@bhalevy
Member

bhalevy commented Oct 7, 2024

Currently, the backup manifest does not include rack information.

FWIW, this can be retrieved from system.local or system.peers if they are backed up.
The scylla sstable tool can be used to dump their contents in JSON format.
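
For illustration, a hedged Go sketch that scans such a JSON dump for a "rack" value without committing to the exact dump layout; the key name and the string encoding are assumptions, not a documented format.

```go
// Hedged sketch: given JSON produced by dumping the backed-up system.local
// sstables with the scylla sstable tooling, walk the structure generically
// and return the first string value stored under a key named "rack".
package rackrestore

import "encoding/json"

// RackFromDump returns the rack name found in the dump, if any.
func RackFromDump(dump []byte) (string, bool) {
	var v interface{}
	if err := json.Unmarshal(dump, &v); err != nil {
		return "", false
	}
	return findRack(v)
}

func findRack(v interface{}) (string, bool) {
	switch t := v.(type) {
	case map[string]interface{}:
		if s, ok := t["rack"].(string); ok {
			return s, true
		}
		for _, child := range t {
			if s, ok := findRack(child); ok {
				return s, true
			}
		}
	case []interface{}:
		for _, child := range t {
			if s, ok := findRack(child); ok {
				return s, true
			}
		}
	}
	return "", false
}
```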

@regevran

regevran commented Oct 8, 2024

please open issue for implementing nodetool refresh --load-and-stream --keep-rack

#4798
