
Add snapshots pending deletion in cluster state to delete snapshot once index is deleted #79156

Open

wants to merge 56 commits into main

Conversation

@tlrx (Member) commented Oct 14, 2021

Note: This pull request is another attempt to delete a searchable snapshot when the mounted index is deleted. It is extracted from #75565, but this one stores the information about the snapshot to delete in the cluster state as SnapshotDeletionsInPending custom objects, whereas the older attempt tried to store that information in the RepositoryMetadata.

In #74977 we introduced a new index setting, index.store.snapshot.delete_searchable_snapshot, that can be set when mounting a snapshot as an index to indicate that the snapshot should be deleted once the searchable snapshot index is deleted.

The previous pull request added the index setting and the verifications around it. This pull request now adds the logic to detect that a searchable snapshot index with this setting is being deleted and triggers the deletion of the backing snapshot.

In order to do this, when a searchable snapshot index is deleted we check whether the setting index.store.snapshot.delete_searchable_snapshot is set. If the index to be deleted is the last searchable snapshot index that uses the snapshot, then the snapshot information is added to the cluster state in a new SnapshotDeletionsInPending custom object.
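A minimal, self-contained sketch of that decision, using simplified stand-in types rather than the actual Elasticsearch classes (only the setting name index.store.snapshot.delete_searchable_snapshot comes from the PR):

```java
import java.util.List;
import java.util.Map;

class PendingDeletionCheck {

    /** Simplified stand-in for an index: its settings and the UUID of the snapshot it was mounted from. */
    record MountedIndex(String name, Map<String, String> settings, String snapshotUuid) {}

    /**
     * Returns true if deleting {@code index} should mark its backing snapshot as pending deletion:
     * the setting must have been enabled at mount time, and no other searchable snapshot index
     * may still be using the snapshot.
     */
    static boolean shouldMarkSnapshotPendingDeletion(MountedIndex index, List<MountedIndex> remainingIndices) {
        boolean deleteRequested = Boolean.parseBoolean(
            index.settings().getOrDefault("index.store.snapshot.delete_searchable_snapshot", "false")
        );
        if (deleteRequested == false) {
            return false;
        }
        // only mark the snapshot when the deleted index is the last one mounted from it
        return remainingIndices.stream().noneMatch(other -> other.snapshotUuid().equals(index.snapshotUuid()));
    }
}
```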

Once a snapshot is pending deletion it cannot be cloned, mounted or restored in the cluster.
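As an illustration of this rule, a guard of roughly this shape could be applied by the clone, mount and restore paths (hypothetical helper, not the PR's actual code):

```java
import java.util.Set;

class PendingDeletionGuard {

    /**
     * Rejects operations that read the snapshot (clone, mount, restore) while it is queued
     * for deletion. The set would come from the SnapshotDeletionsInPending cluster state custom.
     */
    static void ensureNotPendingDeletion(Set<String> snapshotUuidsPendingDeletion, String snapshotUuid, String operation) {
        if (snapshotUuidsPendingDeletion.contains(snapshotUuid)) {
            throw new IllegalStateException(
                "cannot " + operation + " snapshot [" + snapshotUuid + "]: the snapshot is pending deletion"
            );
        }
    }
}
```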

Snapshots pending deletion are deleted by the SnapshotsService. On cluster state updates the SnapshotsService retrieves the list of snapshots to delete and triggers the deletion by executing an explicit snapshot delete request. Deletions are executed per repository, and the service tries to prevent conflicting situations before triggering them.
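A rough, self-contained sketch of that control flow (simplified types; the per-repository grouping and the conflict check are the parts described above, the helper names are made up):

```java
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class PendingDeletionsTrigger {

    record Pending(String repository, String snapshotName) {}

    /**
     * Group the pending deletions per repository and trigger one explicit delete request per
     * repository, skipping repositories with a conflicting operation in progress.
     */
    static void triggerPendingDeletions(
        List<Pending> pendingDeletions,
        Predicate<String> repositoryHasConflict,
        BiConsumer<String, List<String>> deleteSnapshots // (repository, snapshot names) -> delete request
    ) {
        Map<String, List<String>> perRepository = pendingDeletions.stream()
            .collect(Collectors.groupingBy(
                Pending::repository,
                Collectors.mapping(Pending::snapshotName, Collectors.toList())
            ));
        perRepository.forEach((repository, snapshots) -> {
            if (repositoryHasConflict.test(repository) == false) {
                deleteSnapshots.accept(repository, snapshots);
            }
        });
    }
}
```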

There are situations where a snapshot pending deletion cannot be deleted, for example if the repository is updated to point to a different location or is set to read-only. In such cases the snapshot information is kept around in SnapshotDeletionsInPending so that the deletion can be retried in the future. A hard limit of 5000 pending snapshots is enforced to avoid blowing up the cluster state.

@tlrx added the >enhancement and :Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) labels Oct 14, 2021
@elasticmachine added the Team:Distributed (Obsolete) (Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.) label Oct 14, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@henningandersen (Contributor) left a comment


I did an initial read of this and left some mostly minor comments. I will need more time to fully digest this.

/**
* Represents snapshots marked as to be deleted and pending deletion.
*/
public class SnapshotDeletionsInPending extends AbstractNamedDiffable<Custom> implements Custom {
Contributor

I think In should be removed from the class name?

Member Author

Sure, I pushed 579c21f

public static final SnapshotDeletionsInPending EMPTY = new SnapshotDeletionsInPending(Collections.emptySortedSet());
public static final String TYPE = "snapshot_deletions_pending";

public static final int MAX_PENDING_DELETIONS = 500;
Contributor

Do we need a setting for this?

Member Author

I don't have a strong opinion about this but it does not add much complexity and can help with edge cases so I pushed d15ac48

/**
* A list of snapshots to delete, sorted by creation time
*/
private final SortedSet<Entry> entries;
Contributor

I wonder if it would be simpler to have a list here? That would remove any dependency on clock differences between masters and seems slightly simpler and cheaper too.

Still want the timestamp on the entry though.

Member Author

Yes, much simpler. I pushed 7f9c32c

Comment on lines 122 to 124
builder.append('[').append(entry.repositoryName).append('/').append(entry.repositoryUuid).append(']');
builder.append('[').append(entry.snapshotId).append(',').append(entry.creationTime).append(']');
builder.append('\n');
Contributor

I am not sure I follow why there are two square brackets and a newline here? Can we format it as one entry [repo/repouuid, snapshotid, creationtime]? Also, I would prefer not to have the newline, but maybe there is precedent for having this?

Can we move the formatting of each Entry to Entry.toString?
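For illustration, the single-entry format being suggested could look roughly like this as an Entry#toString override (sketch with a simplified stand-in record, not the PR's actual Entry class):

```java
record Entry(String repositoryName, String repositoryUuid, String snapshotId, long creationTime) {
    @Override
    public String toString() {
        // one bracketed segment per entry, no trailing newline
        return "[" + repositoryName + "/" + repositoryUuid + ", " + snapshotId + ", " + creationTime + "]";
    }
}
```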

Member Author

Sorry, those were leftovers from debugging sessions. I pushed 88c0a0b

}

public Builder add(String repositoryName, String repositoryUuid, SnapshotId snapshotId, long creationTime) {
ensureLimit();
Contributor

I think we can enforce this in build alone, much like how we do in IndexGraveyard.
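A rough sketch of what enforcing the limit in build() alone (rather than on every add) could look like; the IndexGraveyard-style purge of the oldest entries is an assumption here, and the PR may handle overflow differently:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SnapshotDeletionsPendingBuilder {

    private static final int MAX_PENDING_DELETIONS = 500;

    private final List<String> entries = new ArrayList<>();

    SnapshotDeletionsPendingBuilder add(String snapshotUuid) {
        entries.add(snapshotUuid); // no limit check on every add
        return this;
    }

    /** Enforce the cap once, at build time; here the oldest entries are purged, IndexGraveyard-style. */
    List<String> build() {
        int excess = entries.size() - MAX_PENDING_DELETIONS;
        if (excess > 0) {
            entries.subList(0, excess).clear();
        }
        return Collections.unmodifiableList(entries);
    }
}
```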

Member Author

Yes, no need to enforce the limit every time an entry is added. I changed this as part of d15ac48

final Set<Index> indicesToDelete,
final Metadata metadata
) {
if (indicesToDelete.isEmpty() == false) {
Contributor

Will we ever get here without indices? And even so, this method seems to work just fine without this outer-level if?

Member Author

Agreed, I pushed 9151c28

for (SnapshotDeletionsInPending.Entry snapshot : snapshotDeletionsInPending.entries()) {
final SnapshotId snapshotId = snapshot.getSnapshotId();

// early add to avoid doing too much work on successive cluster state updates
Contributor

Did you mean "concurrent cluster state updates"? Otherwise I wonder if we might as well add it where you assign triggered=true instead.

Member Author

There should not be concurrent cluster state updates here since this work is done on the cluster state applier thread, and only from there, so we can simplify this like you suggested: 12835bb

@tlrx (Member Author) commented Oct 18, 2021

Thanks for your feedback @henningandersen! Let me know when you have more :)

@tlrx requested a review from henningandersen on October 18, 2021 at 11:31
@henningandersen (Contributor) left a comment

I have done a second round over all the production changes and will wait for a 3rd round before I look into the tests too.

);

/**
* A list of snapshots to delete, sorted by creation time
Contributor

A small precision improvement (time stamps may not be strictly in order).

Suggested change
* A list of snapshots to delete, sorted by creation time
* A list of snapshots to delete, in the order deletions were requested.

Member Author

OK, I pushed 77dfe00


@Override
public Version getMinimalSupportedVersion() {
return Version.CURRENT.minimumCompatibilityVersion();
Contributor

I think this should be a fixed version, i.e., Version.V_8_1_0 to avoid streaming this to 8.0 and 7.16?

Member Author

Right, I pushed 77dfe00

boolean changed = false;
final List<Entry> updatedEntries = new ArrayList<>();
for (Entry entry : entries) {
if (snapshotIds.contains(entry.snapshotId)) {
Contributor

I wonder if we should build the set of snapshotIds to make this lookup a hash-lookup rather than a linear scan? Just to avoid degenerate cases.
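For illustration, the hash-lookup version could look roughly like this (simplified sketch with String ids standing in for SnapshotId):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;

class RemovePendingDeletions {

    /** Copy the ids to remove into a hash set once, so each entry lookup is O(1) instead of a linear scan. */
    static List<String> withRemoved(List<String> entries, Collection<String> snapshotIdsToRemove) {
        Set<String> toRemove = Set.copyOf(snapshotIdsToRemove);
        List<String> updated = new ArrayList<>(entries);
        updated.removeIf(toRemove::contains);
        return updated;
    }
}
```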

Member Author

Makes sense, I changed this in 77dfe00.

Comment on lines 143 to 152
final Iterator<Entry> iterator = entries.stream().iterator();
while (iterator.hasNext()) {
if (prepend == false) {
builder.append(',');
}
builder.append(iterator.next());
prepend = false;
}
builder.append(']');
return builder.toString();
Contributor

This is nearly identical output to entries.toString(), except for a space. I wonder if we can use that here instead. OK to leave as is if you prefer this variant (but then I wonder if we need this as a utility).

Member Author

I used this to add extra information when debugging, but it's not needed anymore. I changed to what you suggested as part of 77dfe00.

}

private static Set<SnapshotId> listOfRestoreSources(final ClusterState state) {
final Set<SnapshotId> snapshotIds = new HashSet<>();
Contributor

nit: I wonder if this could not be a one-liner stream-map-collect? The 3 methods are sort of doing similar work but written with different styles.
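A sketch of the one-liner style being suggested, with a simplified stand-in for the restore-in-progress entries (the real method reads them from the ClusterState):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class RestoreSources {

    /** Simplified stand-in for a restore-in-progress entry. */
    record RestoreEntry(String sourceSnapshotUuid) {}

    /** The one-liner stream-map-collect style suggested in the review. */
    static Set<String> restoreSources(List<RestoreEntry> restoresInProgress) {
        return restoresInProgress.stream().map(RestoreEntry::sourceSnapshotUuid).collect(Collectors.toUnmodifiableSet());
    }
}
```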

Member Author

I renamed the methods in 77dfe00 and used streams + filters to make the methods more similar.

*
* @param state the current {@link ClusterState}
*/
private void triggerSnapshotsPendingDeletions(final ClusterState state) {
Contributor

Can we short-circuit this processing in some cases, like when none of the criteria used here have changed:

  1. Pending deletes
  2. Restores in progress (includes clones)
  3. Deletions in progress

(perhaps more)?

If we have hundreds of entries in the pending list that we cannot delete, it could become a continuous tax on the master to process this on every cluster state update. It may not be bad enough to warrant a complex check, but it would be nice if we could do a simple short circuit catching many cases.

Member Author

Ok, this suggestion is a very good one but it required a larger change than I expected (see 916689e).

I reworked this and introduced the pendingDeletionsChanged(ClusterChangedEvent) and pendingDeletionsWithConflictsChanged(ClusterChangedEvent) methods that check the cluster state update and return true if it is worth iterating over the pending snapshot deletions. Those methods are based on Set&lt;SnapshotId&gt; instances that are kept and updated locally on the master node; those sets are used to know whether a pending snapshot deletion is waiting for a conflicting situation, for example a restore, to complete.
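A simplified sketch of that short-circuit idea (hypothetical stand-in types; the real methods compare structures carried by the ClusterChangedEvent):

```java
import java.util.Set;

class PendingDeletionsShortCircuit {

    /** Simplified stand-in for the parts of a cluster state change that matter here. */
    record Change(Set<String> previousPending, Set<String> currentPending,
                  boolean restoresChanged, boolean snapshotDeletionsInProgressChanged) {}

    /**
     * Only iterate over the pending snapshot deletions when either the pending set itself changed
     * or one of the structures that can hold a conflicting operation changed.
     */
    static boolean shouldProcessPendingDeletions(Change change) {
        boolean pendingChanged = change.currentPending().equals(change.previousPending()) == false;
        return pendingChanged || change.restoresChanged() || change.snapshotDeletionsInProgressChanged();
    }
}
```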

shouldRetry = RepositoryData.MISSING_UUID.equals(repositoryUuid) == false;

} else if (e instanceof ConcurrentSnapshotExecutionException) {
logger.debug(
Contributor

I wonder if we should ever get here? Since we check for clones, in-progress snapshots and deletions before submitting it here, I think it will never throw this. AFAICS, we handle concurrent deletes silently?

If we think it should not happen, I would keep this but add assert false here.

Member Author

Yes, we handle concurrent deletes silently, but I think we can still get there with concurrent repository cleanups. I changed this code to handle all exceptions the same way (except the snapshot missing exception) in cf66c46.

snapshotId
);
} else if (e instanceof RepositoryMissingException) {
logger.warn(
Contributor

Whether to log at warn or debug here is tricky. If this ever happens we would certainly want to log it. But if it happens, I think we would then be spamming the log with this message for every cluster state update (or at least often), and per snapshot id.

It ties in with how we handle repo deletions. If we only allow forcing those when there are pending deletes, we could perhaps log (and respond with) these snapshots at that time. Also, we could log the snapshots when adding to the pending deletions list if the repo is not even there when the index is deleted?

I think I am ok with this as is, provided that we address this in follow-ups.

Member Author

It ties in with how we handle repo deletions. If we only allow forcing those when there are pending deletes, we could perhaps log (and respond with) these snapshots at that time.

I think this is a good suggestion. I'd like to do this in a follow up.

we could log the snapshots when adding to the pending deletions list if the repo is not even there when the index is deleted

I added a warning in MetadataDeleteIndexService#updateSnapshotDeletionsPending() for this.

continue;
}

// should we add some throttling to not always retry?
Contributor

Yeah, that sort of makes sense. Perhaps we can even give up if the snapshot fails to be deleted, for reasons other than a missing repo, enough times over enough time.

IIUC, we should really only retry for "system" issues, i.e., no available connection to the repo, network issues, repository issues or similar? Those would be the sort of things where some back-off will help and no back-off could be harmful.

Member Author

I implemented a basic retry mechanism in cf66c46 that retries all exceptions at a constant interval (30s by default) until a given expiration time, after which the retries are stopped and the pending snapshot deletion is removed from the cluster state with a warning message.

This expiration delay is only evaluated after the pending deletion has already been triggered and failed. It is still possible for a pending deletion to stay in the cluster state for a long time if the deletion is blocked by a missing/read-only repository or a conflicting operation.
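A minimal sketch of that retry policy: a constant interval, plus an overall expiration measured from the first failed attempt (the 30s interval comes from the comment above; the expiration default shown here is hypothetical):

```java
import java.time.Duration;
import java.time.Instant;

class PendingDeletionRetryPolicy {

    static final Duration RETRY_INTERVAL = Duration.ofSeconds(30);   // default interval mentioned above
    static final Duration EXPIRATION = Duration.ofHours(12);         // hypothetical default, not from the PR

    enum Decision { RETRY_LATER, GIVE_UP }

    /** Decide what to do after a failed deletion attempt, measured from the first failure. */
    static Decision onFailedAttempt(Instant firstFailure, Instant now) {
        if (Duration.between(firstFailure, now).compareTo(EXPIRATION) >= 0) {
            return Decision.GIVE_UP;      // drop the pending deletion from the cluster state and log a warning
        }
        return Decision.RETRY_LATER;      // reschedule the delete request after RETRY_INTERVAL
    }
}
```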

@DaveCTurner (Contributor)

I wonder if this'd be simpler if we turn it around and instead introduce a call to MetadataDeleteIndexService#deleteIndices(ClusterState, ...) in the cluster state update that commits the snapshot deletion (i.e. the one submitted in BlobStoreRepository#writeIndexGen) so that the index deletion happens in the same cluster state update.

If we did that it'd mostly work ok except for the period where we've written an updated index-N blob but not updated the repository generation in the cluster state. If there's a master failover during this period then the new master can at least tell it has to do some fixup (the repository generations in the cluster state are not equal) and I think we can include in this fixup process a check for indices-to-be-deleted so we don't miss any.

Labels: >enhancement, :Distributed Coordination/Snapshot/Restore, Team:Distributed (Obsolete), v9.0.0