[Design Proposal] Snapshot Interoperability with Remote Store #6575

harishbhakuni · 2023-03-07T20:26:33Z

Goal:

Aim of this document is to capture design/changes required to provide user the same snapshot experience without duplicating data for remote store backed indices. For more details please refer Feature Proposal here.

Functional Requirements:

Ability to keep snapshots for remote store backed indices without duplication of data.
Snapshots should work seamlessly. Snapshot APIs that we have today should work for both doc-rep based indices as well as remote store backed indices.
Even After deletion of remote store backed indices, we should be able to recover from remote store using individual snapshots.

Solution:

Keep reference to Remote Store Metadata in Snapshot Metadata

The Idea here is to keep the reference to remote store metadata file in snapshot shard metadata. Whenever we trigger snapshot, we will pull the remote store shard metadata file name for each segment and keep it in snapshot shard metadata. we will use this reference during snapshot restore operation to restore data from the remote segment store.

High Level Diagram:

Existing Components:

Snapshot Service

Service responsible for creating snapshots. This service runs all the steps executed on the cluster-manager node during snapshot creation and deletion. This Service runs using CreateSnapshot or DeleteSnapshot Transport Action. During Create Snapshot Operation, it just updates the cluster state with a list of primary shards and the node id where the shard belongs. During Delete Snapshot Operation, this Service handles the deletion of metadata as well as deletion of actual data from the repository.

Restore Service

Service responsible for restoring snapshots. This service runs on Restore Snapshot Transport Action. During restore, it reads information about snapshot and metadata from repository. Then it performs a update cluster state task where it checks restore preconditions, restores global state if needed, and then creates a RestoreInProgress record with list of shards that needs to be restored and adds this shard to the routing table. Individual shards then gets restored as part of normal recovery process during shard creation.

Snapshot Shard Service

This service runs on data nodes and controls currently running shard snapshots on these nodes. It is responsible for starting and stopping shard level snapshots.

Indices Service

This is the main OpenSearch Indices Service. Individual Shards gets created and restored by this service, As we set the recovery source in cluster state to be SNAPSHOT during snapshot restore, Indices service restores the shard data from snapshot repository.

Blob Store Repository

The BlobStoreRepository forms the basis of implementations of Repository on top of a blob store. A Blobstore can be used as the basis for an implementation as long as it provides for GET, PUT, DELETE, and LIST operations. The data-nodes on the other hand, write the data for each individual shard but do not write any blobs outside of shard directories for shards that they hold the primary of. For each shard, the data-node holding the shard's primary writes the actual data in form of the shard's segment files to the repository as well as metadata about all the segment files that the repository stores for the shard.

Remote Segment Store Directory

A RemoteDirectory extension for remote segment store. Apart from storing actual segment files, remote segment store also keeps track of refresh checkpoints as metadata in a separate path which is handled by another instance of RemoteDirectory.

New Components:

Remote Store MD Lock Manager

This handles the locking/unlocking of remote segment store metadata files at shard level or at index level. Remote Store Metadata Lock Manager will run on both Cluster Manager nodes as well as Data Nodes. In Data Nodes, it will be responsible for adding/removing locks on individual metadata file at shard level. In Cluster Manager Nodes, Lock Manager will be responsible for adding/removing locks at shard level for a given lock acquiring resource.

Related Github Issue: here

Async Remote Store GC Process

An Async Process running on data nodes and responsible for periodic deletion of remote store metadata files and corresponding segments which do not have any active locks.

Related Github Issue: (<TODO: Add Github Issue Link>)

Snapshot Flows:

Create Snapshot Flow :

During Create Snapshot Flow, Cluster Manager node will update cluster state with the list of primary shards and the nodes in which they exist. In Data Nodes, as we trigger a flush, it will create a lucene commit at the end and with every commit remote store creates a new metadata file which includes the list of live segments for a given index in the cluster. we will use a different Shard Level Metadata File where we will keep the reference of latest remote store metadata file for the shard. Also it will call Lock Manger to lock that metadata file so that Remote Store GC process do not process it for deletion.
Once snapshot completes for all the shards, master node will add a lock at index level for the snapshot.

Restore Snapshot Flow:

During Snapshot Restore Flow, we sync segments from remote store to the local store, if customer is restoring the index as a remote store backed index, we sync the data to the new remote segment store as well.

Delete Snapshot Flow:

As Cluster manager node performs the entire snapshot deletion. It loops through each shard and deletes snapshot from index-<> md file and deletes corresponding snap-<>.dat file. Now Along with this, cluster manager node also have to release the index level lock for the snapshot from all the involved indices.

Note: not including other APIs like Clone Snapshot, Snapshot Status, etc. here as they do not contain any major changes.

Other Considered Approaches:

1. Keep All the required metadata at Snapshot End

This approach is similar to the above one, just that here the idea was we will not store any snapshot related metadata in Remote Store.
For every snapshot, there will be some metadata stored at the root/global level which would store the list of remote store metadata file paths for all shards of all indices which are part of snapshot.

Restoring Snapshot:

Restore of snapshot would involve calling the restore of shards at the remote store side by providing the file paths for each shard in the index that needs a restore. These file paths will be read from the respective snap-uuid-remote-uuid file depending on the snapshot specified in the request. This would align with the current behavior as well.

Delete Snapshot:

For single/batch deletion of snapshots, metadata files for respective snapshots will get cleaned up.
For eg: in case single snap deletion, snap-uuid-remote-uuid will get deleted.

Remote Store Metadata Cleanup:

For remote store metadata clean-up, there can be a method exposed to the remote store which will basically read all the existing snapshot metadata files for remote store and find out the unique set of metadata shard file path which is being referenced by the snapshots. This unique set should be excluded from the deletion of the remote store metadata files.
This could incur some extra latency as this would need a read of all the snapshot metadata files present, but there can be ways to cache the files referenced by the snapshots at the remote store side to reduce this latency.

Reason of not going with this Approach:

In this approach, we need remote store to read through the available snapshots metadata, which means remote store needs to have access to snapshot repo.
Also, Cleaning up of metadata files by looking into each available snapshot metadata does not seems generic and scalable.

2. Use Timestamp in Remote Store side

The Idea here is with every refresh, we will keep a map of time to shard level metadata file in Remote Store.
As we create commit whenever we trigger a snapshot, we know that there will be a corresponding metadata file in remote store which will be added for that time.

Create Snapshot:

With this approach, Create Snapshot will just mark Snapshot as SUCCEEDED for remote store based shards.

Delete Snapshot:

During Delete Snapshot as well, it will be a no-op for Remote based indices.

Snapshot Restore:

We will fetch the corresponding metadata file closest to the given time and use it to restore in the snapshot.

Remote Store Cleanup:

We can have a periodic async task to cleanup remote store metadata files which are older than a specified given time.

The benefit of this approach is that it is inline with PITR feature.
[lets say tommorow we can add a periodic task (lets say it runs at 1 min) which triggers commit in the indices. now in remote store, we will have metadata files stored for every 1 minute.]
In this Approach, We need to document that we keep snapshots of remote store indices for X days only if snapshot interoperability feature is enabled.

Reason of Not going with this Approach:

We are not going with this approach because this requires a change in the current snapshot behaviour. Today user can keep snapshots as long as they want. with this approach we cannot restore remote backed index after a given time.

More Low Level Details

Snapshot Metadata Changes

For Snapshots today, We keep the following metadata files:
Global State Metadata: This metadata file stores the cluster level metadata at the time of snapshot creation. details like cluster UUID, read only blocks, index aliases, data streams, etc. (File Link)
Index Level Metadata: This Metadata stores the index level settings for a index, what have details like number of replicas, whether remote store enabled for the index, etc. (File Link)
index-<> Shard Level Metadata: This metadata stores the details of blob files we have till now in snapshot for a given shard. with every snapshot we update this file with details of new files that we are adding. this is also used to compute the incremental file size and file count for a snapshot. (File Link)
snap-<>.dat Shard Level Metadata: This Metadata file is used to store the snapshot Name, snapshot start time, snapshot total running time, index File Info, etc (File Link)

Proposal:

for remote Store based Indices, at shard level, we need to have another metadata file which will have following details:

String snapshot; // Snapshot Name
long indexVersion; // index Version for the shard
long startTime; // start time of the snapshot
long time; // total time take for shard snapshot
String remoteStoreMetadataFileName; // name of the remote store shard md file
String indexUUID; // current index UUID (helpful for getting the directory path during restore)
String remoteStoreRepo; // remote store repository of snapshotted index

Why we cannot use existing shard level metadata files?
Existing shard level metadata files stores information about the physical files stored in the repo and uses it to compute the incremental files we need to add with a given snapshot. For remote store based shards, as we are only storing the checkpoint of remote store, so we don’t need to store incremental file info or physical file info. As remote store will be responsible for storing the physical files, we just need to store the remote store md file details in shard level metadata.

Snapshot Status API Changes

Snapshot Status API shows various details for a ongoing or completed snapshot like repository, snapshot UUID, shard level stats (like status of snapshot at shard level), etc.(Sample Response).

This includes snapshot file stats at shard and index level:

Property	Type	Description
incremental	Object	Number and size of files that still need to be copied during snapshot creation. For completed snapshots, incremental provides the number and size of files that were not already in the repository and were copied as part of the incremental snapshot.
processed	Object	Number and size of files already uploaded to the snapshot. The processed file_count and size_in_bytes are incremented in stats after a file is uploaded.
total	Object	Total number and size of files that are referenced by the snapshot.
start_time_in_millis	Long	Time (in milliseconds) when snapshot creation began.
time_in_millis	Long	Total time (in milliseconds) that the snapshot took to complete.

Some of these stats are not useful for remote store based indices when snapshot interoperability feature is enabled as the data for the indices will be stored in the respective remote store repositories.

Proposal:

We will be keeping incremental file size and file count for remote store based indices (or snapshot interop enabled indices in particular) as zero as we are not adding any files as part of snapshot for these indices. we need to pull total file size and file count from remote store during snapshot creation which we can add in these stats.

Open Questions:

Q. Should we add more details in snapshot status API for snapshot interop enabled indices/shards which could help in debugging snapshot issues? One additional detail we can add is details like size and file count which recently got added as part of snapshot commit.

Corresponding Github issue: <TODO: create a github issue in remote store for this>

Which Remote Store Metadata file to keep?

As per this approach during creation of snapshot, we will be keeping remote store metadata file name in shard level metadata. problem is we create the remote store metadata file with each commit but keep it updating on every refresh.
So if we choose the latest metadata file, it will get updated on each refresh and doing restore the current way, it will not be helpful. we cannot use second latest metadata file because that will be corresponding to a different commit generation.

Decision:

We need some changes at remote store side to keep a segment infos file from last commit point. so that we can refer the latest remote store md file but during restore we pass an argument to sync segments till last commit point.

Possible Race Condition:

Steps that will happen for a snapshot interop enabled shard snapshot: 1. Snapshot Shard Service will trigger lucene commit. 2. Remote Store component will update the metadata file in repo. 3. Snapshot Shard Service will reference the latest metadata file in the remote store repo.

as we are syncing segment at every refresh, so it should not happen. but is it possible that step 3 starts before step 2 completes?
in this scenario, we can maybe compare the commit generation we are using in snapshot side with the commit generation of the metadata file. if they don’t match we raise exception and have retries so that it gets the right metadata file for the snapshot. otherwise we fail the snapshot for the shard.

Enable/Disable Snapshot Interop Feature

We need to add additional request fields in create snapshot API. Create Snapshot API (doc link) today supports following request fields:

Field	Data type	Description
indices	String	The indices you want to include in the snapshot. You can use , to create a list of indices, * to specify an index pattern, and - to exclude certain indices. Don’t put spaces between items. Default is all indices.
ignore_unavailable	Boolean	If an index from the indices list doesn’t exist, whether to ignore it rather than fail the snapshot. Default is false.
include_global_state	Boolean	Whether to include cluster state in the snapshot. Default is true.
partial	Boolean	Whether to allow partial snapshots. Default is false, which fails the entire snapshot if one or more shards fails to stor

Option 1: we can enable this feature at cluster level (or keep it a query parameter):

We can add an additional Boolean field in request fields called remote_store_interop_enabled and store this argument in repository data for a given snapshot. this will be referenced during various snapshot operations to determine whether to refer remote store for remote store based indices.

Option 2: We can enable this feature at index level.

We need to add an additional String Field in request fields called remote_store_interop_enabled_indices and store the comma separated list of remote store enabled indices for which this feature should be enabled. We need to store this list as well corresponding to a given snapshot in repository data which we can reference during various snapshot operations.

Snapshot Restore Flow

Today in Snapshot World,
In the Snapshot Restore Flow, RestoreService runs on master node, which first restores global state using snapshot level metadata if it is provided in snapshot restore request. then, it restores the index settings using index level metadata. then it updates the index UUID in index setting with a new UUID, updates cluster state with recovery source as SNAPSHOT for each index and triggers index restore.
In each data node, we gets the restore operation where we recover shards for a given index using recovery source mentioned in cluster state.
We allow renaming of indices during snapshot restore. this means:
lets say, index idx1 have index UUID UUIDxyz, then the remote store path for it would be:
<remoteStoreRepo>/UUIDxyz/
if we restored idx1 as updated-idx1 with a new index UUID UUIDabc, and remote store feature kept enabled during restore, then the remote store path for it would be: <remoteStoreRepo>/UUIDabc/
then during index restore, we need to update the segments data from <remoteStoreRepo>/UUIDxyz/ repo path to <remoteStoreRepo>/UUIDabc/ repo path.
Similar scenario would happen if user updates the remote store repo for the restored index.

Approach 1: path1→path2→data node (preferred)

We can create a temporary remote directory instance which will point to <remoteStoreRepo>/UUIDxyz/, then load the remote store metadata file mentioned in snapshot shard level metadata. using this file, copy each segment file from location <remoteStoreRepo>/UUIDxyz/ to <remoteStoreRepo>/UUIDabc/. once all the segments files get synced in <remoteStoreRepo>/UUIDabc/ remote store repo path, just sync the segments from the remote store to local using same metadata file.

Approach 2: path1 → data node → path2

For remote store interop enabled shards,
we can create temporary remote directory instance which points to <remoteStoreRepo>/UUIDxyz/, we will sync segments in the shard from this temporary remote directory using the metadata file stored in snapshot shard metadata. then before making the shard Active, we will trigger refresh operation which will inturn upload the latest segments from local store to remote store of the shard, which points to <remoteStoreRepo>/UUIDabc/ repo path.
before making the shard Active, we need to make sure refresh operation fails if upload to remote store fails and we need to have enough retries.

Decision:

We will be going with approach 1, where we will sync segment data parallely from remote store repo to local store and if remote store kept enabled during snapshot restore of an index, we will sync segment data from one repo to another.

Also, We will support updating remote store repo for an index during restore or restoring it as just a doc rep based index.

Lock/Unlock Metadata File Logic

Covered here.

Snapshot Clone API changes

Since we are basically copying the information from existing snapshot in the repo, the only change needed for Clone API is to lock the respective remote store metadata files after each successful shard clone operation.

Changes Needed in Remote Store:

1. Sync segments to local disk using a given metadata file till last commit point.

2. Garbage Collection of old Metadata files.

3. Mechanism to lock/unlock metadata files.

4. Add Logic to Create Files directly in Remote Store.

5. Add Details in Remote Store Metadata File regarding diff b/w last two commits.

The text was updated successfully, but these errors were encountered:

harishbhakuni · 2023-03-07T23:11:27Z

@sohami @sachinpkale @gbbafna @ashking94 ^^

gbbafna · 2023-03-16T04:22:14Z

Thanks @harishbhakuni21 for the detailed design . Few initial comments/questions/doubts :

Ownership of the remote store directory - Since the index's data can live beyond its lifetime, who will be responsible for its cleanup ? Thinking of the cases where even the clusters are ephemeral. Probably, this will need more thought in the GC design , where a cluster level job having that repo scans the orphan uuids not in cluster and cleans them up if it finds no lock .
How do we differentiate multiple uuids for a same index - If an index is created/deleted/restored multiple times in a repo, how does a user differentiate them ? The same question will apply to existing snapshot repo as well .
Multiple clusters having same remote repo - Is this allowed ? Again, this question is not specific to interop .
Security Changes required - Since we are working on supporting encrypted data in remote store [Feature Proposal] Client Side Encryption in OpenSearch #6353 , we need to think about feasibility of Approach 1: path1→path2→data node . Thinking of two repos, where the keys are different, we wouldn't be able to copy data from path1 to path2 . Hence we need to think about Approach 2 here. Also lets think more about the security/encryption aspects like these more.

harishbhakuni · 2023-03-16T18:05:52Z

Ownership of the remote store directory - Since the index's data can live beyond its lifetime, who will be responsible for its cleanup ? Thinking of the cases where even the clusters are ephemeral. Probably, this will need more thought in the GC design , where a cluster level job having that repo scans the orphan uuids not in cluster and cleans them up if it finds no lock .

We are thinking of adding an Async GC as part of the Lock Manager (will add more details in its design doc), when we release a lock from an index, we will run an async job to cleanup the related lock files from each shards for this resource, and we might also add support there only to delete the md files (and related segment files) if they don't have any lock. this however do not cover the orphan uuids which do not have any locks.

How do we differentiate multiple uuids for a same index - If an index is created/deleted/restored multiple times in a repo, how does a user differentiate them ? The same question will apply to existing snapshot repo as well .

we don't use index UUIDs in snapshot today. even during restore we update the Index uuid and we expect an index with same name will not exist (or at least must be closed) in the domain. if an index is created/deleted/restored multiple times in a repo, it does not affect snapshots as segments files are stored per snapshots, we restore the index using segment files from the asked snapshot.
For remote store, if an index is created/deleted/restored multiple times, every time it will have different index UUIDs. here, we are assuming remote store side GC will not delete a remote store directory for an index if it have some index level locks even if the index is deleted from the cluster.

Multiple clusters having same remote repo - Is this allowed ? Again, this question is not specific to interop .

yes, for ex., user can register a s3 bucket in two different clusters. that does not affect any of our features.

Security Changes required - Since we are working on supporting encrypted data in remote store #6353 , we need to think about feasibility of Approach 1: path1→path2→data node . Thinking of two repos, where the keys are different, we wouldn't be able to copy data from path1 to path2 . Hence we need to think about Approach 2 here. Also lets think more about the security/encryption aspects like these more.

makes sense. need to think more on the restore approach now, maybe we have to go with approach 2 if during restore user updates the repo or other option could be don't support updating remote store repo during snapshot restore if interop feature is enabled for the snapshot.
I think other than this there should be no other challenges even after client side encryption enablement but will add a separate section for security.

harishbhakuni added enhancement Enhancement or improvement to existing feature or request untriaged labels Mar 7, 2023

anasalkouz added distributed framework and removed untriaged labels Mar 7, 2023

harishbhakuni mentioned this issue Mar 7, 2023

[Design Proposal] Locking Mechanism for Remote Store Md Files #6576

Closed

This was referenced Mar 15, 2023

[Remote Store] Sync Segments from Remote Store from given md file till last commit point #6688

Closed

[Remote Store] Add Logic to create files directly in remote store #6691

Closed

gbbafna added the Storage:Durability Issues and PRs related to the durability framework label Mar 16, 2023

harishbhakuni mentioned this issue Apr 12, 2023

[Snapshot Interop] Add Changes in Create Snapshot Flow for remote store interoperability. #7118

Merged

6 tasks

harishbhakuni mentioned this issue May 1, 2023

[Remote Store] Add Lock Manager in Remote Segment Store to persist data #6787

Merged

6 tasks

This was referenced May 10, 2023

[Snapshot Interop] Add Changes in Snapshot Status Flow for remote sto… #7495

Merged

Add Changes in Snapshot Clone Flow for remote store interoperability. #7496

Merged

Add Changes in Snapshot Delete Flow for remote store interoperability. #7497

Merged

harishbhakuni mentioned this issue Aug 16, 2023

Evaluate sync vs async segments upload to remote store (in RemoteStoreRefreshListener) #9024

Closed

harishbhakuni closed this as completed Aug 18, 2023

harishbhakuni mentioned this issue Sep 25, 2023

[Snapshot Interop] Add per shard metric for number of locks on remote store metadata #10215

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Design Proposal] Snapshot Interoperability with Remote Store #6575

[Design Proposal] Snapshot Interoperability with Remote Store #6575

harishbhakuni commented Mar 7, 2023 •

edited

Loading

harishbhakuni commented Mar 7, 2023

gbbafna commented Mar 16, 2023 •

edited

Loading

harishbhakuni commented Mar 16, 2023

[Design Proposal] Snapshot Interoperability with Remote Store #6575

[Design Proposal] Snapshot Interoperability with Remote Store #6575

Comments

harishbhakuni commented Mar 7, 2023 • edited Loading

Goal:

Functional Requirements:

Solution:

Keep reference to Remote Store Metadata in Snapshot Metadata

High Level Diagram:

Existing Components:

Snapshot Service

Restore Service

Snapshot Shard Service

Indices Service

Blob Store Repository

Remote Segment Store Directory

New Components:

Remote Store MD Lock Manager

Async Remote Store GC Process

Snapshot Flows:

Create Snapshot Flow :

Restore Snapshot Flow:

Delete Snapshot Flow:

Other Considered Approaches:

1. Keep All the required metadata at Snapshot End

Restoring Snapshot:

Delete Snapshot:

Remote Store Metadata Cleanup:

Reason of not going with this Approach:

2. Use Timestamp in Remote Store side

Create Snapshot:

Delete Snapshot:

Snapshot Restore:

Remote Store Cleanup:

Reason of Not going with this Approach:

More Low Level Details

Snapshot Metadata Changes

Proposal:

Snapshot Status API Changes

Proposal:

Open Questions:

Which Remote Store Metadata file to keep?

Decision:

Possible Race Condition:

Enable/Disable Snapshot Interop Feature

Option 1: we can enable this feature at cluster level (or keep it a query parameter):

Option 2: We can enable this feature at index level.

Snapshot Restore Flow

Approach 1: path1→path2→data node (preferred)

Approach 2: path1 → data node → path2

Decision:

Lock/Unlock Metadata File Logic

Snapshot Clone API changes

Changes Needed in Remote Store:

1. Sync segments to local disk using a given metadata file till last commit point.

2. Garbage Collection of old Metadata files.

3. Mechanism to lock/unlock metadata files.

4. Add Logic to Create Files directly in Remote Store.

5. Add Details in Remote Store Metadata File regarding diff b/w last two commits.

harishbhakuni commented Mar 7, 2023

gbbafna commented Mar 16, 2023 • edited Loading

harishbhakuni commented Mar 16, 2023

harishbhakuni commented Mar 7, 2023 •

edited

Loading

gbbafna commented Mar 16, 2023 •

edited

Loading