Suggest a strategy for using repo analysis (#101507)
Adds some docs to suggest running a sequence of increasingly large
analyses, and to set a very generous timeout.
DaveCTurner authored Oct 30, 2023
1 parent 61ff924 commit a170d73
Showing 2 changed files with 19 additions and 9 deletions.
15 changes: 9 additions & 6 deletions docs/reference/snapshot-restore/apis/repo-analysis-api.asciidoc
@@ -54,12 +54,15 @@ The Repository analysis API performs a collection of read and write operations
 on your repository which are designed to detect incorrect behaviour and to
 measure the performance characteristics of your storage system.
 
-The default values for the parameters to this API are deliberately low to
-reduce the impact of running an analysis inadvertently. A realistic experiment
-should set `blob_count` to at least `2000`, `max_blob_size` to at least `2gb`,
-and `max_total_data_size` to at least `1tb`, and will almost certainly need to
-increase the `timeout` to allow time for the process to complete successfully.
-You should run the analysis on a multi-node cluster of a similar size to your
+The default values for the parameters to this API are deliberately low to reduce
+the impact of running an analysis inadvertently and to provide a sensible
+starting point for your investigations. Run your first analysis with the default
+parameter values to check for simple problems. If successful, run a sequence of
+increasingly large analyses until you encounter a failure or you reach a
+`blob_count` of at least `2000`, a `max_blob_size` of at least `2gb`, and a
+`max_total_data_size` of at least `1tb`. Always specify a generous timeout,
+possibly `1h` or longer, to allow time for each analysis to run to completion.
+Perform the analyses using a multi-node cluster of a similar size to your
 production cluster so that it can detect any problems that only arise when the
 repository is accessed by many nodes at once.

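The sequence suggested by the new text above can be sketched as a pair of console requests against the repository analysis API. This is an illustration only: `my_repository` is a placeholder for your registered repository name, and the exact values should be scaled to your own environment.

[source,console]
----
# First run: default parameters, with a generous timeout, to catch simple problems
POST /_snapshot/my_repository/_analyze?timeout=1h

# Later runs: increase the scale step by step until you hit a failure or reach these targets
POST /_snapshot/my_repository/_analyze?blob_count=2000&max_blob_size=2gb&max_total_data_size=1tb&timeout=1h
----

Check each response for reported failures before moving on to the next, larger analysis.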
13 changes: 10 additions & 3 deletions docs/reference/snapshot-restore/repository-s3.asciidoc
@@ -257,9 +257,16 @@ PUT /_cluster/settings
 ----
 // TEST[skip:we don't really want to change this logger]
 
-The supplier of your storage system will be able to analyse these logs to determine the problem. See
-the https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-logging.html[AWS Java SDK]
-documentation for further information.
+Collect the Elasticsearch logs covering the time period of the failed analysis
+from all nodes in your cluster and share them with the supplier of your storage
+system along with the analysis response so they can use them to determine the
+problem. See the
+https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-logging.html[AWS Java SDK]
+documentation for further information, including details about other loggers
+that can be used to obtain even more verbose logs. When you have finished
+collecting the logs needed by your supplier, set the logger settings back to
+`null` to return to the default logging configuration. See <<cluster-logger>>
+and <<cluster-update-settings>> for more information.
 
 [[repository-s3-repository]]
 ==== Repository settings
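The new text above tells readers to set the logger settings back to `null` once log collection is finished. A minimal sketch of that reset, assuming the loggers raised earlier in this section were the AWS SDK request logger and the Apache HTTP wire logger (substitute whichever logger keys you actually changed):

[source,console]
----
PUT /_cluster/settings
{
  "persistent": {
    "logger.com.amazonaws.request": null,
    "logger.org.apache.http.wire": null
  }
}
----

Setting a persistent cluster setting to `null` removes the override and restores the default logging configuration.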
