From 465e1f07d41c910324a6eee0397161234764ff78 Mon Sep 17 00:00:00 2001
From: Vijay <vamshin@amazon.com>
Date: Mon, 20 Jul 2020 13:04:58 -0700
Subject: [PATCH 1/7] synced from master

---
 jni/external/nmslib | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/jni/external/nmslib b/jni/external/nmslib
index 5482e077..1eda05dc 160000
--- a/jni/external/nmslib
+++ b/jni/external/nmslib
@@ -1 +1 @@
-Subproject commit 5482e077d1c8637499f86231bcd3979cb7fa6aef
+Subproject commit 1eda05dccd5ed34df50a243dfc64c5e9187388f8

From 46330950b1d4197cd8ad88f53f8d3111f44c7980 Mon Sep 17 00:00:00 2001
From: Vijay <vamshin@amazon.com>
Date: Fri, 24 Jul 2020 14:17:43 -0700
Subject: [PATCH 2/7] add performance tuning doc

---
 PerformanceTuning.md | 147 +++++++++++++++++++++++++++++++++++++++++++
 README.md            |   2 +-
 2 files changed, 148 insertions(+), 1 deletion(-)
 create mode 100644 PerformanceTuning.md

diff --git a/PerformanceTuning.md b/PerformanceTuning.md
new file mode 100644
index 00000000..142b408d
--- /dev/null
+++ b/PerformanceTuning.md
@@ -0,0 +1,147 @@
+#KNN Performance Tuning
+
+
+In this section we provide recommendations for performance tuning to improve indexing/search performance with the k-NN plugin.  On a high level k-NN works on following principles
+
+* Graphs are created per (Lucene) segment
+* Queries execute on segments sequentially inside the shard (same as any other Elasticsearch query). 
+* Each graph in the segment returns ‘k’ neighbors and the size results with the highest score is returned by the coordinating node. Note that size can be greater or smaller than k.
+
+To improve performance it is necessary to keep the number of segments under control. Ideally having 1 segment per shard will give the optimal performance with respect to search latency. We can achieve more parallelism by having more shards per index. We can control the number of segments either during indexing by asking Elasticsearch to slow down the segment creation by disabling the refresh interval or choosing larger refresh interval, increasing the flush threshold OR force-merging to 1 segment after all the indexing finishes and before searches.
+
+##Indexing Performance Tuning
+
+Following steps could help improve indexing performance especially when you plan to index large number of vectors at once. 
+
+* Disable refresh interval  (Default = 1 sec)
+ 
+ Disable refresh interval or set a long duration for refresh interval to avoid creating multiple smaller segments
+  ```  
+    PUT /<index_name>/_settings
+        {
+            "index" : {
+                "refresh_interval" : "-1"
+            }
+        }
+  ```
+* Disable flush
+ ```
+    Increase the "index.translog.flush_threshold_size" to some bigger value lets say "10gb", default is 512MB
+ ```
+* No Replicas (No Elasticsearch replica shard)
+ ```
+    Having replication set to 0, will avoid duplicate construction of graphs in 
+    both primary and replicas. When we enable replicas after the indexing, the 
+    serialized graphs are directly copied.
+ ```
+    
+* Increase number of indexing threads
+  ```
+    If the hardware we choose have multiple cores, we could allow multiple threads 
+    in graph construction and there by speed up the indexing process. You could determine
+    the number of threads to be alloted by using the following setting   
+    https://github.com/opendistro-for-elasticsearch/k-NN#knnalgo_paramindex_thread_qty.
+     
+    Please keep an eye on CPU utilization and choose right number of threads. Since graph
+    construction is costly, having multiple threads can put additional load on CPU. 
+  ```
+    
+* Index all docs (Perform bulk indexing)
+
+* Forcemerge 
+  
+ Forcemerge is a costly operation and could take a while depending on number of segments and size of the segments.
+ To ensure force merge is completed, we could keep calling forcemerge with 5 minute interval till you get 200 response.
+    
+    curl -X POST "localhost:9200/myindex/_forcemerge?max_num_segments=1&pretty"
+    
+* Call refresh 
+
+ Might not needed but to ensure the buffer is cleared and all segments are up. 
+ ```
+  POST /twitter/_refresh
+```
+* Add replicas (replica shards)
+
+* We can now enable replicas to copy the serialized graphs
+
+*  Enable refresh interval
+ ```
+      PUT /<index_name>/_settings
+        {
+            "index" : {
+                "refresh_interval" : "1m"
+            }
+        }
+ ```
+
+Please refer following doc (https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html) for more details on improving indexing performance in general.
+
+##Search Performance Tuning
+
+### Warm up
+
+The graphs are constructed during indexing, but they are loaded into memory during the first search. The way search works in Lucene is that each segment is searched sequentially (so, for k-NN, each segment returns up to k nearest neighbors of the query point) and the results are aggregated together and ranked based on the score of each result (higher score --> better result). 
+
+Once a graph is loaded(graphs are loaded outside Elasticsearch JVM), we cache the graphs in memory. So the initial queries would be expensive in the order of few seconds and subsequent queries should be faster in the order of milliseconds(assuming knn circuit breaker is not hit).
+
+In order to avoid this latency penalty during your first queries, a user should use the warmup API on the indices they want to search. The API looks like this:
+
+GET /_opendistro/_knn/warmup/index1,index2,index3?pretty
+{
+  "_shards" : {
+    "total" : 6,
+    "successful" : 6,
+    "failed" : 0
+  }
+}
+
+The API loads all of the graphs for all of the shards (primaries and replicas) for the specified indices into the cache. Thus, there will be no penalty to load graphs during initial searches. *Note — * this API only loads the segments of the indices it sees into the cache. If a merge or refresh operation finishes after this API is ran or if new documents are added, this API will need to be re-ran to load those graphs into memory.
+
+### Avoid reading stored fields
+
+If the use case is to just read the nearest neighbors Ids and scores, then we could disable reading stored fields which could save some time retrieving the vectors from stored fields. 
+To understand more about stored fields, 
+please refer this [page.](https://discuss.elastic.co/t/what-does-it-mean-to-store-a-field/5893/5)
+```
+{
+ "size": 5,
+ "stored_fields": "_none_",
+ "docvalue_fields": ["_id"],
+ "query": {
+   "knn": {
+    "v": {
+      "vector": [-0.16490704,-0.047262248,-0.078923926],
+      "k": 50
+     }       
+   }
+ }
+}
+```
+##Improving Recall 
+
+Recall could depend on multiple factors like number of dimensions, segments(searching over large number of small segments and aggregating the results leads better recall than searching over small number of large segments and aggregating results. The larger the graph the more chances of losing recall if sticking to smaller algorithm parameters. Choosing larger values for algo params should help solve this issue), number of vectors, etc.  Recall can be configured by adjusting the algorithm parameters of hnsw algorithm exposed through index settings. Algorithm params that control the recall are *m, ef_construction, ef_search*. For more details on influence of algorithm parameters on the indexing, search recall, please refer this  doc (https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md).  Increasing these values could help recall(better search results) but at the cost of higher memory utilization and increased indexing time. Our default values work on a broader set of use cases from our experiments but we encourage users to run their own experiments on their data sets and choose the appropriate values. You could refer to these settings in this section (https://github.com/opendistro-for-elasticsearch/k-NN#index-level-settings). We will add details on our experiments shortly here.
+
+Memory Estimation
+
+AWS Elasticsearch Service clusters allocate 50% of available RAM in the Instance capped around 32GB (because of JVM GC performance limit). Graphs part of k-NN are loaded outside the Elasticsearch process JVM. We have circuit breakers to limit graph usage to 50% of the left over RAM space for the graphs.
+
+* Memory required for graphs =   1.1 *((4* dimensions) + (8 * M)) *Bytes/vector*
+    * (4 bytes/float * dimension float/vector)
+    * (8 * M) = 4 bytes/edge * 2 levels/node *  M edge/level
+        * Note — as an estimation, each node will have membership in roughly 2 layers, and, on each layer, it will have M edges
+    * 1.1 = an extra 10% buffer for other meta data in the data structure
+* Example:- Let us assume
+    * 1 Million vectors 
+    * 256 Dimensions (2^8)
+    * M = 16 (default setting of HNSW)
+        * Memory required for !M vectors = 1.1*(4*256 + 8*16) *1M Bytes =~ 1.26GB 
+
+##Monitoring 
+
+The KNN Stats API provides information about the current status of the KNN Plugin. The plugin keeps track of both cluster level and node level stats. Cluster level stats have a single value for the entire cluster. Node level stats have a single value for each node in the cluster. A user can filter their query by nodeID and statName in the following way:
+ ```
+  GET /_opendistro/_knn/nodeId1,nodeId2/stats/statName1,statName2
+ ```
+
+Detailed breakdown of stats api metrics here https://github.com/opendistro-for-elasticsearch/k-NN#cluster-stats
diff --git a/README.md b/README.md
index 8deaa753..3b0c9786 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ Open Distro for Elasticsearch enables you to run nearest neighbor search on bill
 
 ## Documentation
 
-To learn more, please see our [documentation](https://opendistro.github.io/for-elasticsearch-docs/).
+To learn more, please see our [documentation](https://opendistro.github.io/for-elasticsearch-docs/docs/knn).
 
 ## Setup
 

From a7a429f2820c67a61cd8656c17519e3a547feb68 Mon Sep 17 00:00:00 2001
From: Vijay <vamshin@amazon.com>
Date: Fri, 24 Jul 2020 22:54:19 -0700
Subject: [PATCH 3/7] incorporated comments

---
 PerformanceTuning.md | 26 ++++++++++++++------------
 jni/external/nmslib  |  2 +-
 2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/PerformanceTuning.md b/PerformanceTuning.md
index 142b408d..6586bff3 100644
--- a/PerformanceTuning.md
+++ b/PerformanceTuning.md
@@ -1,17 +1,18 @@
 #KNN Performance Tuning
 
 
-In this section we provide recommendations for performance tuning to improve indexing/search performance with the k-NN plugin.  On a high level k-NN works on following principles
+In this document we provide recommendations for performance tuning to improve indexing/search performance with the k-NN plugin.  From a high level k-NN works on following principles:
 
 * Graphs are created per (Lucene) segment
-* Queries execute on segments sequentially inside the shard (same as any other Elasticsearch query). 
-* Each graph in the segment returns ‘k’ neighbors and the size results with the highest score is returned by the coordinating node. Note that size can be greater or smaller than k.
+* Queries execute on segments sequentially inside the shard (same as any other Elasticsearch query) 
+* Each graph in the segment returns *<=k* neighbors. 
+* Coordinator node picks up final *size* number of neighbors from the neighbors returned by each shard
 
 To improve performance it is necessary to keep the number of segments under control. Ideally having 1 segment per shard will give the optimal performance with respect to search latency. We can achieve more parallelism by having more shards per index. We can control the number of segments either during indexing by asking Elasticsearch to slow down the segment creation by disabling the refresh interval or choosing larger refresh interval, increasing the flush threshold OR force-merging to 1 segment after all the indexing finishes and before searches.
 
 ##Indexing Performance Tuning
 
-Following steps could help improve indexing performance especially when you plan to index large number of vectors at once. 
+The following steps could help improve indexing performance especially when you plan to index large number of vectors at once. 
 
 * Disable refresh interval  (Default = 1 sec)
  
@@ -24,20 +25,18 @@ Following steps could help improve indexing performance especially when you plan
             }
         }
   ```
-* Disable flush
- ```
-    Increase the "index.translog.flush_threshold_size" to some bigger value lets say "10gb", default is 512MB
- ```
-* No Replicas (No Elasticsearch replica shard)
+
+* Disable Replicas (No Elasticsearch replica shard)
  ```
     Having replication set to 0, will avoid duplicate construction of graphs in 
     both primary and replicas. When we enable replicas after the indexing, the 
-    serialized graphs are directly copied.
+    serialized graphs are directly copied. 
  ```
+More details [here](https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html#_disable_replicas_for_initial_loads)
     
 * Increase number of indexing threads
   ```
-    If the hardware we choose have multiple cores, we could allow multiple threads 
+    If the hardware we choose has multiple cores, we could allow multiple threads 
     in graph construction and there by speed up the indexing process. You could determine
     the number of threads to be alloted by using the following setting   
     https://github.com/opendistro-for-elasticsearch/k-NN#knnalgo_paramindex_thread_qty.
@@ -57,11 +56,14 @@ Following steps could help improve indexing performance especially when you plan
     
 * Call refresh 
 
- Might not needed but to ensure the buffer is cleared and all segments are up. 
+ Calling refresh ensure the buffer is cleared and all segments are created so that documents are available for search. 
  ```
   POST /twitter/_refresh
 ```
 * Add replicas (replica shards)
+ 
+ This will make replica shards come up with the already serialized graphs created on the primary shards during indexing. This way 
+ we avoid duplicate graph construction.
 
 * We can now enable replicas to copy the serialized graphs
 
diff --git a/jni/external/nmslib b/jni/external/nmslib
index 1eda05dc..5482e077 160000
--- a/jni/external/nmslib
+++ b/jni/external/nmslib
@@ -1 +1 @@
-Subproject commit 1eda05dccd5ed34df50a243dfc64c5e9187388f8
+Subproject commit 5482e077d1c8637499f86231bcd3979cb7fa6aef

From 330076a7c161280ab679c26226c2f44ec1e19d5a Mon Sep 17 00:00:00 2001
From: Vijay <vamshin@amazon.com>
Date: Mon, 27 Jul 2020 16:15:58 -0700
Subject: [PATCH 4/7] incorporated comments

---
 PerformanceTuning.md | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/PerformanceTuning.md b/PerformanceTuning.md
index 6586bff3..e88195ea 100644
--- a/PerformanceTuning.md
+++ b/PerformanceTuning.md
@@ -8,8 +8,6 @@ In this document we provide recommendations for performance tuning to improve in
 * Each graph in the segment returns *<=k* neighbors. 
 * Coordinator node picks up final *size* number of neighbors from the neighbors returned by each shard
 
-To improve performance it is necessary to keep the number of segments under control. Ideally having 1 segment per shard will give the optimal performance with respect to search latency. We can achieve more parallelism by having more shards per index. We can control the number of segments either during indexing by asking Elasticsearch to slow down the segment creation by disabling the refresh interval or choosing larger refresh interval, increasing the flush threshold OR force-merging to 1 segment after all the indexing finishes and before searches.
-
 ##Indexing Performance Tuning
 
 The following steps could help improve indexing performance especially when you plan to index large number of vectors at once. 
@@ -30,7 +28,9 @@ The following steps could help improve indexing performance especially when you
  ```
     Having replication set to 0, will avoid duplicate construction of graphs in 
     both primary and replicas. When we enable replicas after the indexing, the 
-    serialized graphs are directly copied. 
+    serialized graphs are directly copied. Having no replicas means that losing 
+    a node(s) may incur data loss, so it is important that the data lives elsewhere 
+    so that this initial load can be retried in case of an issue.
  ```
 More details [here](https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html#_disable_replicas_for_initial_loads)
     
@@ -81,6 +81,11 @@ Please refer following doc (https://www.elastic.co/guide/en/elasticsearch/refere
 
 ##Search Performance Tuning
 
+To improve Search performance it is necessary to keep the number of segments under control. Lucene's IndexSearcher will search over all of the segments in a shard to find the 'size' best results. But, because the complexity of search for the HNSW algorithm is logarithmic with respect to the number of vectors, searching over 5 graphs with a 100 vectors each and then taking the top size results from ```5*k``` results will take longer than searching over 1 graph with 500 vectors and then taking the top size results from k results. 
+Ideally having 1 segment per shard will give the optimal performance with respect to search latency. We could configure index to have multiple shards to aviod having giant shards and achieve more parallelism.
+
+We can control the number of segments either during indexing by asking Elasticsearch to slow down the segment creation by disabling the refresh interval or choosing larger refresh interval.
+
 ### Warm up
 
 The graphs are constructed during indexing, but they are loaded into memory during the first search. The way search works in Lucene is that each segment is searched sequentially (so, for k-NN, each segment returns up to k nearest neighbors of the query point) and the results are aggregated together and ranked based on the score of each result (higher score --> better result). 
@@ -122,9 +127,12 @@ please refer this [page.](https://discuss.elastic.co/t/what-does-it-mean-to-stor
 ```
 ##Improving Recall 
 
-Recall could depend on multiple factors like number of dimensions, segments(searching over large number of small segments and aggregating the results leads better recall than searching over small number of large segments and aggregating results. The larger the graph the more chances of losing recall if sticking to smaller algorithm parameters. Choosing larger values for algo params should help solve this issue), number of vectors, etc.  Recall can be configured by adjusting the algorithm parameters of hnsw algorithm exposed through index settings. Algorithm params that control the recall are *m, ef_construction, ef_search*. For more details on influence of algorithm parameters on the indexing, search recall, please refer this  doc (https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md).  Increasing these values could help recall(better search results) but at the cost of higher memory utilization and increased indexing time. Our default values work on a broader set of use cases from our experiments but we encourage users to run their own experiments on their data sets and choose the appropriate values. You could refer to these settings in this section (https://github.com/opendistro-for-elasticsearch/k-NN#index-level-settings). We will add details on our experiments shortly here.
+Recall could depend on multiple factors like number of vectors, number of dimensions, segments etc. Searching over large number of small segments and aggregating the results leads better recall than searching over small number of large segments and aggregating results. The larger the graph the more chances of losing recall if sticking to smaller algorithm parameters. 
+Choosing larger values for algorithm params should help solve this issue but at the cost of search latency and indexing time. That being said, it is important to understand your system's requirements for latency and accuracy, and then to choose the number of segments you want your index to have based on experimentation.
+
+Recall can be configured by adjusting the algorithm parameters of hnsw algorithm exposed through index settings. Algorithm params that control the recall are *m, ef_construction, ef_search*. For more details on influence of algorithm parameters on the indexing, search recall, please refer this  doc (https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md).  Increasing these values could help recall(better search results) but at the cost of higher memory utilization and increased indexing time. Our default values work on a broader set of use cases from our experiments but we encourage users to run their own experiments on their data sets and choose the appropriate values. You could refer to these settings in this section (https://github.com/opendistro-for-elasticsearch/k-NN#index-level-settings). We will add details on our experiments shortly here.
 
-Memory Estimation
+##Memory Estimation
 
 AWS Elasticsearch Service clusters allocate 50% of available RAM in the Instance capped around 32GB (because of JVM GC performance limit). Graphs part of k-NN are loaded outside the Elasticsearch process JVM. We have circuit breakers to limit graph usage to 50% of the left over RAM space for the graphs.
 

From 55687171edb3a1e60e21ea9a0d3de00e0718242a Mon Sep 17 00:00:00 2001
From: Vijay <vamshin@amazon.com>
Date: Tue, 28 Jul 2020 12:35:51 -0700
Subject: [PATCH 5/7] incorporated comments

---
 PerformanceTuning.md | 34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/PerformanceTuning.md b/PerformanceTuning.md
index e88195ea..6eab5a90 100644
--- a/PerformanceTuning.md
+++ b/PerformanceTuning.md
@@ -1,5 +1,4 @@
-#KNN Performance Tuning
-
+# KNN Performance Tuning
 
 In this document we provide recommendations for performance tuning to improve indexing/search performance with the k-NN plugin.  From a high level k-NN works on following principles:
 
@@ -8,11 +7,11 @@ In this document we provide recommendations for performance tuning to improve in
 * Each graph in the segment returns *<=k* neighbors. 
 * Coordinator node picks up final *size* number of neighbors from the neighbors returned by each shard
 
-##Indexing Performance Tuning
+## Indexing Performance Tuning
 
 The following steps could help improve indexing performance especially when you plan to index large number of vectors at once. 
 
-* Disable refresh interval  (Default = 1 sec)
+1 Disable refresh interval  (Default = 1 sec)
  
  Disable refresh interval or set a long duration for refresh interval to avoid creating multiple smaller segments
   ```  
@@ -23,8 +22,7 @@ The following steps could help improve indexing performance especially when you
             }
         }
   ```
-
-* Disable Replicas (No Elasticsearch replica shard)
+2 Disable Replicas (No Elasticsearch replica shard)
  ```
     Having replication set to 0, will avoid duplicate construction of graphs in 
     both primary and replicas. When we enable replicas after the indexing, the 
@@ -34,7 +32,7 @@ The following steps could help improve indexing performance especially when you
  ```
 More details [here](https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html#_disable_replicas_for_initial_loads)
     
-* Increase number of indexing threads
+3 Increase number of indexing threads
   ```
     If the hardware we choose has multiple cores, we could allow multiple threads 
     in graph construction and there by speed up the indexing process. You could determine
@@ -45,29 +43,27 @@ More details [here](https://www.elastic.co/guide/en/elasticsearch/reference/mast
     construction is costly, having multiple threads can put additional load on CPU. 
   ```
     
-* Index all docs (Perform bulk indexing)
+4 Index all docs (Perform bulk indexing)
 
-* Forcemerge 
+5 Forcemerge 
   
  Forcemerge is a costly operation and could take a while depending on number of segments and size of the segments.
  To ensure force merge is completed, we could keep calling forcemerge with 5 minute interval till you get 200 response.
     
     curl -X POST "localhost:9200/myindex/_forcemerge?max_num_segments=1&pretty"
     
-* Call refresh 
+6 Call refresh 
 
  Calling refresh ensure the buffer is cleared and all segments are created so that documents are available for search. 
  ```
   POST /twitter/_refresh
 ```
-* Add replicas (replica shards)
+7 Enable replicas 
  
  This will make replica shards come up with the already serialized graphs created on the primary shards during indexing. This way 
  we avoid duplicate graph construction.
 
-* We can now enable replicas to copy the serialized graphs
-
-*  Enable refresh interval
+8  Enable refresh interval
  ```
       PUT /<index_name>/_settings
         {
@@ -79,7 +75,7 @@ More details [here](https://www.elastic.co/guide/en/elasticsearch/reference/mast
 
 Please refer following doc (https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html) for more details on improving indexing performance in general.
 
-##Search Performance Tuning
+## Search Performance Tuning
 
 To improve Search performance it is necessary to keep the number of segments under control. Lucene's IndexSearcher will search over all of the segments in a shard to find the 'size' best results. But, because the complexity of search for the HNSW algorithm is logarithmic with respect to the number of vectors, searching over 5 graphs with a 100 vectors each and then taking the top size results from ```5*k``` results will take longer than searching over 1 graph with 500 vectors and then taking the top size results from k results. 
 Ideally having 1 segment per shard will give the optimal performance with respect to search latency. We could configure index to have multiple shards to aviod having giant shards and achieve more parallelism.
@@ -88,7 +84,7 @@ We can control the number of segments either during indexing by asking Elasticse
 
 ### Warm up
 
-The graphs are constructed during indexing, but they are loaded into memory during the first search. The way search works in Lucene is that each segment is searched sequentially (so, for k-NN, each segment returns up to k nearest neighbors of the query point) and the results are aggregated together and ranked based on the score of each result (higher score --> better result). 
+The graphs are constructed during indexing, but they are loaded into memory during the first search. The way search works in Lucene is that each segment is searched sequentially (so, for k-NN, each segment returns up to k nearest neighbors of the query point) and the top ```size``` number of results based on the score would be returned from all of the results returned by segements at a shard level(higher score --> better result). 
 
 Once a graph is loaded(graphs are loaded outside Elasticsearch JVM), we cache the graphs in memory. So the initial queries would be expensive in the order of few seconds and subsequent queries should be faster in the order of milliseconds(assuming knn circuit breaker is not hit).
 
@@ -125,14 +121,14 @@ please refer this [page.](https://discuss.elastic.co/t/what-does-it-mean-to-stor
  }
 }
 ```
-##Improving Recall 
+## Improving Recall 
 
 Recall could depend on multiple factors like number of vectors, number of dimensions, segments etc. Searching over large number of small segments and aggregating the results leads better recall than searching over small number of large segments and aggregating results. The larger the graph the more chances of losing recall if sticking to smaller algorithm parameters. 
 Choosing larger values for algorithm params should help solve this issue but at the cost of search latency and indexing time. That being said, it is important to understand your system's requirements for latency and accuracy, and then to choose the number of segments you want your index to have based on experimentation.
 
 Recall can be configured by adjusting the algorithm parameters of hnsw algorithm exposed through index settings. Algorithm params that control the recall are *m, ef_construction, ef_search*. For more details on influence of algorithm parameters on the indexing, search recall, please refer this  doc (https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md).  Increasing these values could help recall(better search results) but at the cost of higher memory utilization and increased indexing time. Our default values work on a broader set of use cases from our experiments but we encourage users to run their own experiments on their data sets and choose the appropriate values. You could refer to these settings in this section (https://github.com/opendistro-for-elasticsearch/k-NN#index-level-settings). We will add details on our experiments shortly here.
 
-##Memory Estimation
+## Memory Estimation
 
 AWS Elasticsearch Service clusters allocate 50% of available RAM in the Instance capped around 32GB (because of JVM GC performance limit). Graphs part of k-NN are loaded outside the Elasticsearch process JVM. We have circuit breakers to limit graph usage to 50% of the left over RAM space for the graphs.
 
@@ -147,7 +143,7 @@ AWS Elasticsearch Service clusters allocate 50% of available RAM in the Instance
     * M = 16 (default setting of HNSW)
         * Memory required for !M vectors = 1.1*(4*256 + 8*16) *1M Bytes =~ 1.26GB 
 
-##Monitoring 
+## Monitoring 
 
 The KNN Stats API provides information about the current status of the KNN Plugin. The plugin keeps track of both cluster level and node level stats. Cluster level stats have a single value for the entire cluster. Node level stats have a single value for each node in the cluster. A user can filter their query by nodeID and statName in the following way:
  ```

From d0e889c7428d6c059236bacdaa0ffeb6055301c1 Mon Sep 17 00:00:00 2001
From: Vijay <vamshin@amazon.com>
Date: Tue, 28 Jul 2020 13:01:20 -0700
Subject: [PATCH 6/7] incorporated comments

---
 PerformanceTuning.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/PerformanceTuning.md b/PerformanceTuning.md
index 6eab5a90..fff653c3 100644
--- a/PerformanceTuning.md
+++ b/PerformanceTuning.md
@@ -77,6 +77,7 @@ Please refer following doc (https://www.elastic.co/guide/en/elasticsearch/refere
 
 ## Search Performance Tuning
 
+### Fewer Segments
 To improve Search performance it is necessary to keep the number of segments under control. Lucene's IndexSearcher will search over all of the segments in a shard to find the 'size' best results. But, because the complexity of search for the HNSW algorithm is logarithmic with respect to the number of vectors, searching over 5 graphs with a 100 vectors each and then taking the top size results from ```5*k``` results will take longer than searching over 1 graph with 500 vectors and then taking the top size results from k results. 
 Ideally having 1 segment per shard will give the optimal performance with respect to search latency. We could configure index to have multiple shards to aviod having giant shards and achieve more parallelism.
 

From 296a3689369e892f44a70d4d0f219169017de02f Mon Sep 17 00:00:00 2001
From: Vijay <vamshin@amazon.com>
Date: Tue, 28 Jul 2020 14:28:35 -0700
Subject: [PATCH 7/7] incorporated comments

---
 PerformanceTuning.md | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/PerformanceTuning.md b/PerformanceTuning.md
index fff653c3..578fd2d8 100644
--- a/PerformanceTuning.md
+++ b/PerformanceTuning.md
@@ -11,7 +11,7 @@ In this document we provide recommendations for performance tuning to improve in
 
 The following steps could help improve indexing performance especially when you plan to index large number of vectors at once. 
 
-1 Disable refresh interval  (Default = 1 sec)
+1\. Disable refresh interval  (Default = 1 sec)
  
  Disable refresh interval or set a long duration for refresh interval to avoid creating multiple smaller segments
   ```  
@@ -22,7 +22,7 @@ The following steps could help improve indexing performance especially when you
             }
         }
   ```
-2 Disable Replicas (No Elasticsearch replica shard)
+2\. Disable Replicas (No Elasticsearch replica shard)
  ```
     Having replication set to 0, will avoid duplicate construction of graphs in 
     both primary and replicas. When we enable replicas after the indexing, the 
@@ -32,7 +32,7 @@ The following steps could help improve indexing performance especially when you
  ```
 More details [here](https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html#_disable_replicas_for_initial_loads)
     
-3 Increase number of indexing threads
+3\. Increase number of indexing threads
   ```
     If the hardware we choose has multiple cores, we could allow multiple threads 
     in graph construction and there by speed up the indexing process. You could determine
@@ -43,27 +43,27 @@ More details [here](https://www.elastic.co/guide/en/elasticsearch/reference/mast
     construction is costly, having multiple threads can put additional load on CPU. 
   ```
     
-4 Index all docs (Perform bulk indexing)
+4\. Index all docs (Perform bulk indexing)
 
-5 Forcemerge 
+5\. Forcemerge 
   
  Forcemerge is a costly operation and could take a while depending on number of segments and size of the segments.
  To ensure force merge is completed, we could keep calling forcemerge with 5 minute interval till you get 200 response.
     
     curl -X POST "localhost:9200/myindex/_forcemerge?max_num_segments=1&pretty"
     
-6 Call refresh 
+6\. Call refresh 
 
  Calling refresh ensure the buffer is cleared and all segments are created so that documents are available for search. 
  ```
   POST /twitter/_refresh
 ```
-7 Enable replicas 
+7\. Enable replicas 
  
  This will make replica shards come up with the already serialized graphs created on the primary shards during indexing. This way 
  we avoid duplicate graph construction.
 
-8  Enable refresh interval
+8\.  Enable refresh interval
  ```
       PUT /<index_name>/_settings
         {
@@ -91,6 +91,7 @@ Once a graph is loaded(graphs are loaded outside Elasticsearch JVM), we cache th
 
 In order to avoid this latency penalty during your first queries, a user should use the warmup API on the indices they want to search. The API looks like this:
 
+```
 GET /_opendistro/_knn/warmup/index1,index2,index3?pretty
 {
   "_shards" : {
@@ -99,7 +100,7 @@ GET /_opendistro/_knn/warmup/index1,index2,index3?pretty
     "failed" : 0
   }
 }
-
+```
 The API loads all of the graphs for all of the shards (primaries and replicas) for the specified indices into the cache. Thus, there will be no penalty to load graphs during initial searches. *Note — * this API only loads the segments of the indices it sees into the cache. If a merge or refresh operation finishes after this API is ran or if new documents are added, this API will need to be re-ran to load those graphs into memory.
 
 ### Avoid reading stored fields