Suggest a strategy for using repo analysis (#101507)
Adds some docs to suggest running a sequence of increasingly large
analyses, and to set a very generous timeout.
DaveCTurner authored Oct 30, 2023
1 parent 61ff924 commit a170d73
Showing 2 changed files with 19 additions and 9 deletions.
15 changes: 9 additions & 6 deletions docs/reference/snapshot-restore/apis/repo-analysis-api.asciidoc
@@ -54,12 +54,15 @@ The Repository analysis API performs a collection of read and write operations
 on your repository which are designed to detect incorrect behaviour and to
 measure the performance characteristics of your storage system.
 
-The default values for the parameters to this API are deliberately low to
-reduce the impact of running an analysis inadvertently. A realistic experiment
-should set `blob_count` to at least `2000`, `max_blob_size` to at least `2gb`,
-and `max_total_data_size` to at least `1tb`, and will almost certainly need to
-increase the `timeout` to allow time for the process to complete successfully.
-You should run the analysis on a multi-node cluster of a similar size to your
+The default values for the parameters to this API are deliberately low to reduce
+the impact of running an analysis inadvertently and to provide a sensible
+starting point for your investigations. Run your first analysis with the default
+parameter values to check for simple problems. If successful, run a sequence of
+increasingly large analyses until you encounter a failure or you reach a
+`blob_count` of at least `2000`, a `max_blob_size` of at least `2gb`, and a
+`max_total_data_size` of at least `1tb`. Always specify a generous timeout,
+possibly `1h` or longer, to allow time for each analysis to run to completion.
+Perform the analyses using a multi-node cluster of a similar size to your
 production cluster so that it can detect any problems that only arise when the
 repository is accessed by many nodes at once.

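The sequence suggested by the new text above can be sketched as a pair of console requests against the repository analysis API. This is an illustration only: `my_repository` is a placeholder for your registered repository name, and the exact values should be scaled to your own environment.

[source,console]
----
# First run: default parameters, with a generous timeout, to catch simple problems
POST /_snapshot/my_repository/_analyze?timeout=1h

# Later runs: increase the scale step by step until you hit a failure or reach these targets
POST /_snapshot/my_repository/_analyze?blob_count=2000&max_blob_size=2gb&max_total_data_size=1tb&timeout=1h
----

Check each response for reported failures before moving on to the next, larger analysis.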
13 changes: 10 additions & 3 deletions docs/reference/snapshot-restore/repository-s3.asciidoc
@@ -257,9 +257,16 @@ PUT /_cluster/settings
 ----
 // TEST[skip:we don't really want to change this logger]
 
-The supplier of your storage system will be able to analyse these logs to determine the problem. See
-the https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-logging.html[AWS Java SDK]
-documentation for further information.
+Collect the Elasticsearch logs covering the time period of the failed analysis
+from all nodes in your cluster and share them with the supplier of your storage
+system along with the analysis response so they can use them to determine the
+problem. See the
+https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-logging.html[AWS Java SDK]
+documentation for further information, including details about other loggers
+that can be used to obtain even more verbose logs. When you have finished
+collecting the logs needed by your supplier, set the logger settings back to
+`null` to return to the default logging configuration. See <<cluster-logger>>
+and <<cluster-update-settings>> for more information.
 
 [[repository-s3-repository]]
 ==== Repository settings
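The new text above tells readers to set the logger settings back to `null` once log collection is finished. A minimal sketch of that reset, assuming the loggers raised earlier in this section were the AWS SDK request logger and the Apache HTTP wire logger (substitute whichever logger keys you actually changed):

[source,console]
----
PUT /_cluster/settings
{
  "persistent": {
    "logger.com.amazonaws.request": null,
    "logger.org.apache.http.wire": null
  }
}
----

Setting a persistent cluster setting to `null` removes the override and restores the default logging configuration.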
