
Confused about cluster setting "cache_capacity_reached" and "knn.circuit_breaker.unset.percentage" #203

Closed
dengxianjie opened this issue Aug 28, 2020 · 7 comments
Labels
bug (Issue that exposes a bug), question (Further information is requested)

Comments

@dengxianjie
Contributor

dengxianjie commented Aug 28, 2020

To understand the cache behavior more deeply, I experimented with the cache settings as shown below:

PUT /_cluster/settings
{
    "persistent" : {
        "knn.plugin.enabled" : true,
        "knn.algo_param.index_thread_qty" : 20,
        "knn.cache.item.expiry.enabled": true,
        "knn.cache.item.expiry.minutes": "120m",
        "knn.memory.circuit_breaker.enabled" : true,
        "knn.memory.circuit_breaker.limit" : "10%",
        "knn.circuit_breaker.unset.percentage": 65
    }
}

After indexing some vectors into different indices, the k-NN stats API returns the output below.

What confuses me is that "graph_memory_usage_percentage" has gone over 100, yet "eviction_count" never changes (it stays at 0) after I warm up some new indices.
By design, "cache_capacity_reached" should be true, but unfortunately it is still false even though the "knn.circuit_breaker.unset.percentage" threshold has been well exceeded. Is anything wrong?

My machine has 250 GB of memory, and the JVM heap is 31 GB.

  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "odfe-cluster",
  "circuit_breaker_triggered" : false,
  "nodes" : {
    "xfvHY81rQjSQz5Wa0g8rmg" : {
      "miss_count" : 583,
      "graph_memory_usage_percentage" : 108.94683,
      "graph_query_requests" : 11629745,
      "graph_memory_usage" : 25204287,
      "cache_capacity_reached" : false,
      "graph_index_requests" : 57380,
      "load_exception_count" : 0,
      "load_success_count" : 544,
      "eviction_count" : 0,
      "indices_in_cache" : {
        "knn_pid_2020082711" : {
          "graph_memory_usage" : 7444895,
          "graph_memory_usage_percentage" : 32.180943,
          "graph_count" : 61
        },
        "knn_pid_2020082810" : {
          "graph_memory_usage" : 2702801,
          "graph_memory_usage_percentage" : 11.682997,
          "graph_count" : 29
        },
        "knn_pid_2020082812" : {
          "graph_memory_usage" : 2705674,
          "graph_memory_usage_percentage" : 11.6954155,
          "graph_count" : 29
        },
        "knn_pid_2020082801" : {
          "graph_memory_usage" : 2776495,
          "graph_memory_usage_percentage" : 12.001543,
          "graph_count" : 30
        },
        "knn_pid_2020082800" : {
          "graph_memory_usage" : 2741016,
          "graph_memory_usage_percentage" : 11.848183,
          "graph_count" : 29
        },
        "knn_pid_2020082811" : {
          "graph_memory_usage" : 2723480,
          "graph_memory_usage_percentage" : 11.772383,
          "graph_count" : 29
        },
        "knn_pid_2020082814" : {
          "graph_memory_usage" : 4109926,
          "graph_memory_usage_percentage" : 17.765368,
          "graph_count" : 33
        }
      },
      "graph_query_errors" : 0,
      "hit_count" : 670,
      "graph_index_errors" : 0,
      "knn_query_requests" : 10,
      "total_load_time" : 272185894228
    },
    "hS09IRT4QkqyrsNDl3wCTQ" : {
      "miss_count" : 586,
      "graph_memory_usage_percentage" : 109.56773,
      "graph_query_requests" : 12992605,
      "graph_memory_usage" : 25347929,
      "cache_capacity_reached" : false,
      "graph_index_requests" : 42870,
      "load_exception_count" : 0,
      "load_success_count" : 586,
      "eviction_count" : 0,
      "indices_in_cache" : {
        "knn_pid_2020082711" : {
          "graph_memory_usage" : 3721199,
          "graph_memory_usage_percentage" : 16.085075,
          "graph_count" : 29
        },
        "knn_pid_2020082710" : {
          "graph_memory_usage" : 3727154,
          "graph_memory_usage_percentage" : 16.110815,
          "graph_count" : 32
        },
        "knn_pid_2020082801" : {
          "graph_memory_usage" : 2776510,
          "graph_memory_usage_percentage" : 12.001608,
          "graph_count" : 27
        },
        "knn_pid_2020082812" : {
          "graph_memory_usage" : 2705705,
          "graph_memory_usage_percentage" : 11.695549,
          "graph_count" : 28
        },
        "knn_pid_2020082811" : {
          "graph_memory_usage" : 2723485,
          "graph_memory_usage_percentage" : 11.772405,
          "graph_count" : 30
        },
        "knn_pid_2020082800" : {
          "graph_memory_usage" : 5484179,
          "graph_memory_usage_percentage" : 23.705647,
          "graph_count" : 56
        },
        "knn_pid_2020082814" : {
          "graph_memory_usage" : 1475616,
          "graph_memory_usage_percentage" : 6.378426,
          "graph_count" : 18
        },
        "knn_pid_2020082813" : {
          "graph_memory_usage" : 2734081,
          "graph_memory_usage_percentage" : 11.818206,
          "graph_count" : 31
        }
      },
      "graph_query_errors" : 0,
      "hit_count" : 1667,
      "graph_index_errors" : 0,
      "knn_query_requests" : 11170,
      "total_load_time" : 168056697220
    },
    "16kZnxnPQwa4Bj76MV2shw" : {
      "miss_count" : 566,
      "graph_memory_usage_percentage" : 126.612564,
      "graph_query_requests" : 12563740,
      "graph_memory_usage" : 29291163,
      "cache_capacity_reached" : false,
      "graph_index_requests" : 62782,
      "load_exception_count" : 0,
      "load_success_count" : 566,
      "eviction_count" : 0,
      "indices_in_cache" : {
        "knn_pid_2020082810" : {
          "graph_memory_usage_percentage" : 11.695342,
          "graph_memory_usage" : 2705657,
          "graph_count" : 18
        },
        "knn_pid_2020082711" : {
          "graph_memory_usage_percentage" : 16.096361,
          "graph_memory_usage" : 3723810,
          "graph_count" : 30
        },
        "knn_pid_2020082710" : {
          "graph_memory_usage_percentage" : 16.091845,
          "graph_memory_usage" : 3722765,
          "graph_count" : 30
        },
        "knn_pid_2020082801" : {
          "graph_memory_usage_percentage" : 11.974552,
          "graph_memory_usage" : 2770251,
          "graph_count" : 24
        },
        "knn_pid_2020082812" : {
          "graph_memory_usage_percentage" : 11.700128,
          "graph_memory_usage" : 2706764,
          "graph_count" : 26
        },
        "knn_pid_2020082800" : {
          "graph_memory_usage_percentage" : 11.857425,
          "graph_memory_usage" : 2743154,
          "graph_count" : 29
        },
        "knn_pid_2020082811" : {
          "graph_memory_usage_percentage" : 11.762212,
          "graph_memory_usage" : 2721127,
          "graph_count" : 26
        },
        "knn_pid_2020082814" : {
          "graph_memory_usage_percentage" : 11.782354,
          "graph_memory_usage" : 2725787,
          "graph_count" : 31
        },
        "knn_pid_2020082813" : {
          "graph_memory_usage_percentage" : 23.652344,
          "graph_memory_usage" : 5471848,
          "graph_count" : 60
        }
      },
      "graph_query_errors" : 0,
      "hit_count" : 1932,
      "graph_index_errors" : 0,
      "knn_query_requests" : 148869,
      "total_load_time" : 128939536361
    }
  }
}
@dengxianjie changed the title from confuse about cluster setting "knn.memory.circuit_breaker.limit" and "knn.circuit_breaker.unset.percentage" to confuse about cluster setting "cache_capacity_reached" and "knn.circuit_breaker.unset.percentage" on Aug 28, 2020
@jmazanec15
Member

Hi @dengxianjie, this is definitely unusual and might be a bug.

I have a couple of questions:

  1. What artifact are you using to set up your cluster (i.e. RPM/DEB/Docker)? Which ODFE version is it?
  2. When you say "warmup" some new index, do you mean run some random searches over it or use the warmup API?
  3. How many indices did you have?
  4. What were their shard configurations?
  5. How many documents did they have?

Additionally, in order to reproduce the issue, could you provide the exact, detailed steps you followed to get the cache to the state where graph memory usage is above 100?

@vamshin added the question (Further information is requested) and bug (Issue that exposes a bug) labels on Aug 31, 2020
@dengxianjie
Contributor Author

dengxianjie commented Sep 1, 2020

Thanks for the reply.
First, let me answer your questions:

  1. It runs in a Docker environment. The image is pulled from "amazon/opendistro-for-elasticsearch:1.9.0"; the ES version is 7.8.0.
  2. "Warmup" means I run some k-NN searches with random vectors until the server responds within tens of microseconds.
  3. At that moment there were roughly ten to twenty indices, not many.
  4. As practice in preparation for deploying a production environment, I chose 2 primary shards and 1 replica; the full index settings are below, and a sketch of the create-index request I would use follows these answers.
{
  "knn_pid_2020082813" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "10s",
        "number_of_shards" : "2",
        "knn.algo_param" : {
          "ef_search" : "512",
          "ef_construction" : "512",
          "m" : "4"
        },
        "provided_name" : "knn_pid_2020082813",
        "knn.space_type" : "cosinesimil",
        "knn" : "true",
        "creation_date" : "1598533966689",
        "number_of_replicas" : "1",
        "uuid" : "N--X-5bFQCOehjGI2UyneQ",
        "version" : {
          "created" : "7080099"
        }
      }
    }
  }
}

  5. Each index has about 2 to 4 million documents.
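
As a sketch only, the create-index request behind those settings would have looked roughly like the block below; the field name "my_vector" and the dimension 128 are placeholders, not my real mapping.

PUT /knn_pid_2020082813
{
  "settings" : {
    "index" : {
      "refresh_interval" : "10s",
      "number_of_shards" : 2,
      "number_of_replicas" : 1,
      "knn" : true,
      "knn.space_type" : "cosinesimil",
      "knn.algo_param.ef_search" : 512,
      "knn.algo_param.ef_construction" : 512,
      "knn.algo_param.m" : 4
    }
  },
  "mappings" : {
    "properties" : {
      "my_vector" : {
        "type" : "knn_vector",
        "dimension" : 128
      }
    }
  }
}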

What I did when the issue occurred is below.
Step 1. I found that PUTting the k-NN cluster settings again evicts the whole cache and resets the stats API output (is that right?), so I issued this PUT:

PUT /_cluster/settings
{
    "persistent" : {
        "knn.plugin.enabled" : true,
        "knn.algo_param.index_thread_qty" : 20,
        "knn.cache.item.expiry.enabled": true,
        "knn.cache.item.expiry.minutes": "120m",
        "knn.memory.circuit_breaker.enabled" : true,
        "knn.memory.circuit_breaker.limit" : "10%",
        "knn.circuit_breaker.unset.percentage": 65
    }
}

Step 2. I ran random searches against different indices one by one, and invoked GET _opendistro/_knn/stats after every search to observe the cache stats (an example request follows these steps).
Step 3. I finally found that "graph_memory_usage_percentage" was unexpectedly above 100.
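
For reference, a request like the one below narrows the stats output to just the fields in question; filter_path is standard Elasticsearch response filtering, so whether the plugin endpoint honors it is an assumption on my part.

GET /_opendistro/_knn/stats?filter_path=circuit_breaker_triggered,nodes.*.graph_memory_usage_percentage,nodes.*.cache_capacity_reached,nodes.*.eviction_count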

Additionally, I have reached the "knn.circuit_breaker.unset.percentage" threshold before; at that time bulk requests were being limited, which suggests the circuit breaker did work previously.

@jmazanec15
Member

Thanks @dengxianjie, I was able to reproduce the issue. Looking into the root cause.

@jmazanec15
Member

Hi @dengxianjie, I think there is a bug with cache expiration. In this line, we do not convert the TimeDuration to a long number of minutes, causing a failure. We will patch this.

As a workaround, this setting can be set to false. Can you confirm that when you set this to false, this issue does not occur?
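
For clarity, a minimal workaround request would look something like the block below; only the expiry flag changes, everything else in your settings stays as-is.

PUT /_cluster/settings
{
    "persistent" : {
        "knn.cache.item.expiry.enabled" : false
    }
}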

@dengxianjie
Contributor Author

dengxianjie commented Sep 6, 2020

Hi @jmazanec15, great find. The bug you found would, logically, affect expiration.

But my issue, more precisely, is that eviction doesn't work. Is that correct? "graph_memory_usage_percentage" is unexpectedly above 100; the cache should EVICT some graphs immediately rather than wait for expiration to occur. At the same time, "cache_capacity_reached" and "circuit_breaker_triggered" should be true.

What confuses me is that something seems to be wrong with the circuit-breaker limit.

@jmazanec15
Member

Hi @dengxianjie

Yes, I believe the issue is that the cache does not actually get rebuilt with the new maximum weight set by "knn.memory.circuit_breaker.limit" : "10%".

This line is throwing an exception because of improper conversion.

So, this line never gets called.

Therefore, the old cache is still being used, which has a default CB limit of 50%. So capacity-based evictions will not occur until that 50% threshold is reached (at which point the circuit breaker will be set to true).

The confusion comes from the stats API, which uses the Elasticsearch setting to calculate graph_memory_usage_percentage here.

The reason we use the setting as opposed to the cache's actual maximum weight to calculate this percentage is that the Guava cache does not actually expose a getter for maxWeight.

The above PR should fix this issue in the short term. However, in the long term, we should not actually change the setting until the cache is rebuilt. I will create an issue to track this.
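
For anyone following along, the denominator of that percentage is the setting value itself, which can be read back with the standard cluster settings API (look for "knn.memory.circuit_breaker.limit" in the response):

GET /_cluster/settings?flat_settings=true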

@dengxianjie
Contributor Author

👍 @jmazanec15 Completely understood. Thanks.
