
Confused about cluster setting "cache_capacity_reached" and "knn.circuit_breaker.unset.percentage" #203

Closed
dengxianjie opened this issue Aug 28, 2020 · 7 comments
Labels
bug (Issue that exposes a bug), question (Further information is requested)

Comments

@dengxianjie
Contributor

dengxianjie commented Aug 28, 2020

To understand the cache behavior more deeply, I experimented with the cache settings as shown below:

PUT /_cluster/settings
{
    "persistent" : {
        "knn.plugin.enabled" : true,
        "knn.algo_param.index_thread_qty" : 20,
        "knn.cache.item.expiry.enabled": true,
        "knn.cache.item.expiry.minutes": "120m",
        "knn.memory.circuit_breaker.enabled" : true,
        "knn.memory.circuit_breaker.limit" : "10%",
        "knn.circuit_breaker.unset.percentage": 65
    }
}

After indexing some vectors into different indices, the k-NN stats API returns the output below.

What confuses me is that "graph_memory_usage_percentage" has gone over 100, yet "eviction_count" never changes (it stays at 0) after I warm up some new indices.
By design, "cache_capacity_reached" should be true, but unfortunately it is still false even though the "knn.circuit_breaker.unset.percentage" threshold has been well exceeded. Is anything wrong?

My machine has 250 GB of memory, and the JVM heap is 31 GB.

  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "odfe-cluster",
  "circuit_breaker_triggered" : false,
  "nodes" : {
    "xfvHY81rQjSQz5Wa0g8rmg" : {
      "miss_count" : 583,
      "graph_memory_usage_percentage" : 108.94683,
      "graph_query_requests" : 11629745,
      "graph_memory_usage" : 25204287,
      "cache_capacity_reached" : false,
      "graph_index_requests" : 57380,
      "load_exception_count" : 0,
      "load_success_count" : 544,
      "eviction_count" : 0,
      "indices_in_cache" : {
        "knn_pid_2020082711" : {
          "graph_memory_usage" : 7444895,
          "graph_memory_usage_percentage" : 32.180943,
          "graph_count" : 61
        },
        "knn_pid_2020082810" : {
          "graph_memory_usage" : 2702801,
          "graph_memory_usage_percentage" : 11.682997,
          "graph_count" : 29
        },
        "knn_pid_2020082812" : {
          "graph_memory_usage" : 2705674,
          "graph_memory_usage_percentage" : 11.6954155,
          "graph_count" : 29
        },
        "knn_pid_2020082801" : {
          "graph_memory_usage" : 2776495,
          "graph_memory_usage_percentage" : 12.001543,
          "graph_count" : 30
        },
        "knn_pid_2020082800" : {
          "graph_memory_usage" : 2741016,
          "graph_memory_usage_percentage" : 11.848183,
          "graph_count" : 29
        },
        "knn_pid_2020082811" : {
          "graph_memory_usage" : 2723480,
          "graph_memory_usage_percentage" : 11.772383,
          "graph_count" : 29
        },
        "knn_pid_2020082814" : {
          "graph_memory_usage" : 4109926,
          "graph_memory_usage_percentage" : 17.765368,
          "graph_count" : 33
        }
      },
      "graph_query_errors" : 0,
      "hit_count" : 670,
      "graph_index_errors" : 0,
      "knn_query_requests" : 10,
      "total_load_time" : 272185894228
    },
    "hS09IRT4QkqyrsNDl3wCTQ" : {
      "miss_count" : 586,
      "graph_memory_usage_percentage" : 109.56773,
      "graph_query_requests" : 12992605,
      "graph_memory_usage" : 25347929,
      "cache_capacity_reached" : false,
      "graph_index_requests" : 42870,
      "load_exception_count" : 0,
      "load_success_count" : 586,
      "eviction_count" : 0,
      "indices_in_cache" : {
        "knn_pid_2020082711" : {
          "graph_memory_usage" : 3721199,
          "graph_memory_usage_percentage" : 16.085075,
          "graph_count" : 29
        },
        "knn_pid_2020082710" : {
          "graph_memory_usage" : 3727154,
          "graph_memory_usage_percentage" : 16.110815,
          "graph_count" : 32
        },
        "knn_pid_2020082801" : {
          "graph_memory_usage" : 2776510,
          "graph_memory_usage_percentage" : 12.001608,
          "graph_count" : 27
        },
        "knn_pid_2020082812" : {
          "graph_memory_usage" : 2705705,
          "graph_memory_usage_percentage" : 11.695549,
          "graph_count" : 28
        },
        "knn_pid_2020082811" : {
          "graph_memory_usage" : 2723485,
          "graph_memory_usage_percentage" : 11.772405,
          "graph_count" : 30
        },
        "knn_pid_2020082800" : {
          "graph_memory_usage" : 5484179,
          "graph_memory_usage_percentage" : 23.705647,
          "graph_count" : 56
        },
        "knn_pid_2020082814" : {
          "graph_memory_usage" : 1475616,
          "graph_memory_usage_percentage" : 6.378426,
          "graph_count" : 18
        },
        "knn_pid_2020082813" : {
          "graph_memory_usage" : 2734081,
          "graph_memory_usage_percentage" : 11.818206,
          "graph_count" : 31
        }
      },
      "graph_query_errors" : 0,
      "hit_count" : 1667,
      "graph_index_errors" : 0,
      "knn_query_requests" : 11170,
      "total_load_time" : 168056697220
    },
    "16kZnxnPQwa4Bj76MV2shw" : {
      "miss_count" : 566,
      "graph_memory_usage_percentage" : 126.612564,
      "graph_query_requests" : 12563740,
      "graph_memory_usage" : 29291163,
      "cache_capacity_reached" : false,
      "graph_index_requests" : 62782,
      "load_exception_count" : 0,
      "load_success_count" : 566,
      "eviction_count" : 0,
      "indices_in_cache" : {
        "knn_pid_2020082810" : {
          "graph_memory_usage_percentage" : 11.695342,
          "graph_memory_usage" : 2705657,
          "graph_count" : 18
        },
        "knn_pid_2020082711" : {
          "graph_memory_usage_percentage" : 16.096361,
          "graph_memory_usage" : 3723810,
          "graph_count" : 30
        },
        "knn_pid_2020082710" : {
          "graph_memory_usage_percentage" : 16.091845,
          "graph_memory_usage" : 3722765,
          "graph_count" : 30
        },
        "knn_pid_2020082801" : {
          "graph_memory_usage_percentage" : 11.974552,
          "graph_memory_usage" : 2770251,
          "graph_count" : 24
        },
        "knn_pid_2020082812" : {
          "graph_memory_usage_percentage" : 11.700128,
          "graph_memory_usage" : 2706764,
          "graph_count" : 26
        },
        "knn_pid_2020082800" : {
          "graph_memory_usage_percentage" : 11.857425,
          "graph_memory_usage" : 2743154,
          "graph_count" : 29
        },
        "knn_pid_2020082811" : {
          "graph_memory_usage_percentage" : 11.762212,
          "graph_memory_usage" : 2721127,
          "graph_count" : 26
        },
        "knn_pid_2020082814" : {
          "graph_memory_usage_percentage" : 11.782354,
          "graph_memory_usage" : 2725787,
          "graph_count" : 31
        },
        "knn_pid_2020082813" : {
          "graph_memory_usage_percentage" : 23.652344,
          "graph_memory_usage" : 5471848,
          "graph_count" : 60
        }
      },
      "graph_query_errors" : 0,
      "hit_count" : 1932,
      "graph_index_errors" : 0,
      "knn_query_requests" : 148869,
      "total_load_time" : 128939536361
    }
  }
}
@dengxianjie changed the title from confuse about cluster setting "knn.memory.circuit_breaker.limit" and "knn.circuit_breaker.unset.percentage" to confuse about cluster setting "cache_capacity_reached" and "knn.circuit_breaker.unset.percentage" on Aug 28, 2020
@jmazanec15
Member

Hi @dengxianjie, this is definitely unusual and might be a bug.

I have a couple of questions:

  1. What artifact are you using to set up your cluster (i.e. RPM/DEB/Docker)? Which ODFE version is it?
  2. When you say "warmup" some new index, do you mean run some random searches over it or use the warmup API?
  3. How many indices did you have?
  4. What were their shard configurations?
  5. How many documents did they have?

Additionally, in order to reproduce the issue, could you provide the exact, detailed steps you followed to get the cache to the state where graph memory usage is above 100?

@vamshin added the question (Further information is requested) and bug (Issue that exposes a bug) labels on Aug 31, 2020
@dengxianjie
Contributor Author

dengxianjie commented Sep 1, 2020

Thanks for the reply.
First, let me answer your questions:

  1. It runs in a Docker environment. The image is pulled from "amazon/opendistro-for-elasticsearch:1.9.0"; the ES version is 7.8.0.
  2. "Warmup" means I run some k-NN searches with random vectors until the server responds within tens of microseconds.
  3. At that moment there were roughly ten to twenty indices, not many.
  4. As practice in preparation for deploying a production environment, I chose 2 primary shards and 1 replica; the full index settings are below, and a sketch of the create-index request I would use follows these answers.
{
  "knn_pid_2020082813" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "10s",
        "number_of_shards" : "2",
        "knn.algo_param" : {
          "ef_search" : "512",
          "ef_construction" : "512",
          "m" : "4"
        },
        "provided_name" : "knn_pid_2020082813",
        "knn.space_type" : "cosinesimil",
        "knn" : "true",
        "creation_date" : "1598533966689",
        "number_of_replicas" : "1",
        "uuid" : "N--X-5bFQCOehjGI2UyneQ",
        "version" : {
          "created" : "7080099"
        }
      }
    }
  }
}

  5. Each index has about 2 to 4 million documents.
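
As a sketch only, the create-index request behind those settings would have looked roughly like the block below; the field name "my_vector" and the dimension 128 are placeholders, not my real mapping.

PUT /knn_pid_2020082813
{
  "settings" : {
    "index" : {
      "refresh_interval" : "10s",
      "number_of_shards" : 2,
      "number_of_replicas" : 1,
      "knn" : true,
      "knn.space_type" : "cosinesimil",
      "knn.algo_param.ef_search" : 512,
      "knn.algo_param.ef_construction" : 512,
      "knn.algo_param.m" : 4
    }
  },
  "mappings" : {
    "properties" : {
      "my_vector" : {
        "type" : "knn_vector",
        "dimension" : 128
      }
    }
  }
}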

What I did when the issue occurred is below.
Step 1. I found that PUTting the k-NN cluster settings again evicts the whole cache and resets the stats API output (is that right?), so I issued this PUT:

PUT /_cluster/settings
{
    "persistent" : {
        "knn.plugin.enabled" : true,
        "knn.algo_param.index_thread_qty" : 20,
        "knn.cache.item.expiry.enabled": true,
        "knn.cache.item.expiry.minutes": "120m",
        "knn.memory.circuit_breaker.enabled" : true,
        "knn.memory.circuit_breaker.limit" : "10%",
        "knn.circuit_breaker.unset.percentage": 65
    }
}

Step 2. I ran random searches against different indices one by one, and invoked GET _opendistro/_knn/stats after every search to observe the cache stats (an example request follows these steps).
Step 3. I finally found that "graph_memory_usage_percentage" was unexpectedly above 100.
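
For reference, a request like the one below narrows the stats output to just the fields in question; filter_path is standard Elasticsearch response filtering, so whether the plugin endpoint honors it is an assumption on my part.

GET /_opendistro/_knn/stats?filter_path=circuit_breaker_triggered,nodes.*.graph_memory_usage_percentage,nodes.*.cache_capacity_reached,nodes.*.eviction_count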

Additionally, I have reached the "knn.circuit_breaker.unset.percentage" threshold before; at that time bulk requests were being limited, which suggests the circuit breaker did work previously.

@jmazanec15
Member

Thanks @dengxianjie, I was able to reproduce the issue. Looking into the root cause.

@jmazanec15
Member

Hi @dengxianjie, I think there is a bug with cache expiration. In this line, we do not convert the TimeDuration to a long number of minutes, causing a failure. We will patch this.

As a workaround, this setting can be set to false. Can you confirm that when you set this to false, this issue does not occur?
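
For clarity, a minimal workaround request would look something like the block below; only the expiry flag changes, everything else in your settings stays as-is.

PUT /_cluster/settings
{
    "persistent" : {
        "knn.cache.item.expiry.enabled" : false
    }
}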

@dengxianjie
Contributor Author

dengxianjie commented Sep 6, 2020

Hi @jmazanec15, great find. The bug you found would, logically, affect expiration.

But my issue, more precisely, is that eviction doesn't work. Is that correct? "graph_memory_usage_percentage" is unexpectedly above 100; the cache should EVICT some graphs immediately rather than wait for expiration to occur. At the same time, "cache_capacity_reached" and "circuit_breaker_triggered" should be true.

What confuses me is that something seems to be wrong with the circuit-breaker limit.

@jmazanec15
Member

Hi @dengxianjie

Yes, I believe the issue is that the cache does not actually get rebuilt with the new maximum weight set by "knn.memory.circuit_breaker.limit" : "10%".

This line is throwing an exception because of improper conversion.

So, this line never gets called.

Therefore, the old cache is still being used, which has a default CB limit of 50%. So capacity-based evictions will not occur until that 50% threshold is reached (at which point the circuit breaker will be set to true).

The confusion comes from the stats API, which uses the Elasticsearch setting to calculate graph_memory_usage_percentage here.

The reason we use the setting as opposed to the cache's actual maximum weight to calculate this percentage is that the Guava cache does not actually expose a getter for maxWeight.

The above PR should fix this issue in the short term. However, in the long term, we should not actually change the setting until the cache is rebuilt. I will create an issue to track this.
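
For anyone following along, the denominator of that percentage is the setting value itself, which can be read back with the standard cluster settings API (look for "knn.memory.circuit_breaker.limit" in the response):

GET /_cluster/settings?flat_settings=true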

@dengxianjie
Contributor Author

👍 @jmazanec15 Completely understood. Thanks.
