Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit the number of connections used by snapshot file downloads during recoveries #79044

Closed
fcofdez opened this issue Oct 13, 2021 · 1 comment · Fixed by #79316
Closed

Limit the number of connections used by snapshot file downloads during recoveries #79044

fcofdez opened this issue Oct 13, 2021 · 1 comment · Fixed by #79316
Assignees
Labels
:Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v7.16.0 v8.0.0-beta1

Comments

@fcofdez
Copy link
Contributor

fcofdez commented Oct 13, 2021

Today we limit the number of concurrent snapshot file downloads during recoveries to 5 per recovery:

public static final Setting<Integer> INDICES_RECOVERY_MAX_CONCURRENT_SNAPSHOT_FILE_DOWNLOADS =
Setting.intSetting("indices.recovery.max_concurrent_snapshot_file_downloads",
5,
1,
20,
Property.Dynamic,
Property.NodeScope
);

This setting combined with the default node_concurrent_recoveries=2 works well in most scenarios.

public static final int DEFAULT_CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_RECOVERIES = 2;

But when the number of concurrent recoveries is higher, it's possible to exhaust the repository connections
just with recoveries, possibly affecting other repository operations such as snapshot/restore.

We should limit the number of concurrent snapshot file downloads in the target node to 25 as that would leave
room for the rest of the operations to make progress if node_concurrent_recoveries is larger than the default.

Relates #73496

@fcofdez fcofdez added :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v8.0.0 Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Oct 13, 2021
@fcofdez fcofdez self-assigned this Oct 13, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

fcofdez added a commit to fcofdez/elasticsearch that referenced this issue Oct 18, 2021
… recoveries

Today we limit the max number of concurrent snapshot file restores
per recovery. This works well when the default
node_concurrent_recoveries is used (which is 2). When this limit is
increased, it is possible to exahust the underlying repository
connection pool, affecting other workloads.

This commit adds a new setting
`indices.recovery.max_concurrent_snapshot_file_downloads_per_node` that
allows to limit the max number of snapshot file downloads per node
during recoveries. When a recovery starts in the target node it tries
to acquire a permit that allows it to download snapshot files when it is
granted. This is communicated to the source node in the
StartRecoveryRequest. This is a rather conservative approach since it is
possible that a recovery that gets a permit to use snapshot files
doesn't recover any snapshot file while there's a concurrent recovery
that doesn't get a permit could take advantage of recovering from a
snapshot.

Closes elastic#79044
fcofdez added a commit that referenced this issue Oct 18, 2021
Today we limit the max number of concurrent snapshot file restores
per recovery. This works well when the default
node_concurrent_recoveries is used (which is 2). When this limit is
increased, it is possible to exhaust the underlying repository
connection pool, affecting other workloads.

This commit adds a new setting
`indices.recovery.max_concurrent_snapshot_file_downloads_per_node` that
allows to limit the max number of snapshot file downloads per node
during recoveries. When a recovery starts in the target node it tries
to acquire a permit that allows it to download snapshot files when it is
granted. This is communicated to the source node in the
StartRecoveryRequest. This is a rather conservative approach since it is
possible that a recovery that gets a permit to use snapshot files
doesn't recover any snapshot file while there's a concurrent recovery
that doesn't get a permit could take advantage of recovering from a
snapshot.

Closes #79044
fcofdez added a commit to fcofdez/elasticsearch that referenced this issue Oct 18, 2021
Today we limit the max number of concurrent snapshot file restores
per recovery. This works well when the default
node_concurrent_recoveries is used (which is 2). When this limit is
increased, it is possible to exhaust the underlying repository
connection pool, affecting other workloads.

This commit adds a new setting
`indices.recovery.max_concurrent_snapshot_file_downloads_per_node` that
allows to limit the max number of snapshot file downloads per node
during recoveries. When a recovery starts in the target node it tries
to acquire a permit that allows it to download snapshot files when it is
granted. This is communicated to the source node in the
StartRecoveryRequest. This is a rather conservative approach since it is
possible that a recovery that gets a permit to use snapshot files
doesn't recover any snapshot file while there's a concurrent recovery
that doesn't get a permit could take advantage of recovering from a
snapshot.

Closes elastic#79044
Backport of elastic#79316
fcofdez added a commit that referenced this issue Oct 19, 2021
Today we limit the max number of concurrent snapshot file restores
per recovery. This works well when the default
node_concurrent_recoveries is used (which is 2). When this limit is
increased, it is possible to exhaust the underlying repository
connection pool, affecting other workloads.

This commit adds a new setting
`indices.recovery.max_concurrent_snapshot_file_downloads_per_node` that
allows to limit the max number of snapshot file downloads per node
during recoveries. When a recovery starts in the target node it tries
to acquire a permit that allows it to download snapshot files when it is
granted. This is communicated to the source node in the
StartRecoveryRequest. This is a rather conservative approach since it is
possible that a recovery that gets a permit to use snapshot files
doesn't recover any snapshot file while there's a concurrent recovery
that doesn't get a permit could take advantage of recovering from a
snapshot.

Closes #79044
Backport of #79316
fcofdez added a commit to fcofdez/elasticsearch that referenced this issue Oct 19, 2021
weizijun added a commit to weizijun/elasticsearch that referenced this issue Oct 19, 2021
* upstream/master:
  Validate tsdb's routing_path (elastic#79384)
  Adjust the BWC version for the return200ForClusterHealthTimeout field (elastic#79436)
  API for adding and removing indices from a data stream (elastic#79279)
  Exposing the ability to log deprecated settings at non-critical level (elastic#79107)
  Convert operator privilege license object to LicensedFeature (elastic#79407)
  Mute SnapshotBasedIndexRecoveryIT testSeqNoBasedRecoveryIsUsedAfterPrimaryFailOver (elastic#79456)
  Create cache files with CREATE_NEW & SPARSE options (elastic#79371)
  Revert "[ML] Use a new annotations index for future annotations (elastic#79151)"
  [ML] Use a new annotations index for future annotations (elastic#79151)
  [ML] Removing legacy code from ML/transform auditor (elastic#79434)
  Fix rate agg with custom `_doc_count` (elastic#79346)
  Optimize SLM Policy Queries (elastic#79341)
  Fix execution of exists query within nested queries on field with doc_values disabled (elastic#78841)
  Stricter UpdateSettingsRequest parsing on the REST layer (elastic#79227)
  Do not release snapshot file download permit during recovery retries (elastic#79409)
  Preserve request headers in a mixed version cluster (elastic#79412)
  Adjust versions after elastic#79044 backport to 7.x (elastic#79424)
  Mute BulkByScrollUsesAllScrollDocumentsAfterConflictsIntegTests (elastic#79429)
  Fail on SSPL licensed x-pack sources (elastic#79348)

# Conflicts:
#	server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v7.16.0 v8.0.0-beta1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants