[BUG] Linux OOM killer kicks in every time I try to load my IVF+PQ+L2 indices into the native RAM cache #1301
Comments
When I switch to another instance type (x2gd.4xlarge with 256GB RAM) I CAN load all the indices. I get this now:
with these knn cluster settings
It looks like the RAM usage reported is 11275499 KiB ≈ 10.75 GiB, which would fit easily in the first instance type. Also, according to this formula (taken from https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/ ):

1.1 * ((((8 / 8) * 96 + 24) * 53,000,000) + (4 * 4096 * 768) + (2^8 * 4 * 768)) ≈ 7 GB

So either something else is going on, or the memory estimates and the metrics are totally off.
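For reference, a rough breakdown of that estimate, using the same inputs as the formula above (code_size = 8 bits, m = 96 sub-quantizers, 53M vectors, nlist = 4096, d = 768):

```
((8 / 8) * 96 + 24) * 53,000,000   ≈ 6.36 GB   (PQ codes + per-vector overhead)
4 * 4096 * 768                     ≈ 12.6 MB   (coarse IVF centroids)
2^8 * 4 * 768                      ≈ 0.8 MB    (PQ codebook)
sum * 1.1                          ≈ 7 GB
```

The per-vector term dominates; the centroid and codebook terms are tiny by comparison.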
Hi @karlney, taking a look now. As mentioned in the call this morning, could you retry using jemalloc? We have seen issues around fragmentation that can cause this and I want to see if it remediates the issue. Related issue: opensearch-project/OpenSearch#9496.

The cache itself uses the file size as an estimate of the amount of memory being used. It is probably worth investigating a more accurate accounting system using direct memory metrics such as AnonRSS, but file size has been reliable in the past. I need to double check this more closely for IVFPQ.

What is the underlying method used to collect the MEM usage in the graph above? AnonRSS?
If my math is correct here, this is 114M docs, correct? With replicas it is around 228M? Just want to double check that 53M is the correct number of vectors that were attempted to be loaded, and that it accounts for replicas.
Right, because we are using nested fields in the documents, the number of Lucene documents is a lot higher than the number of OpenSearch documents that we have. So you are reading the numbers right, but we still only have 53M vectors (106M including replicas).
An update from today is that we are NOT able to reproduce this behavior when using an index with an HNSW and PQ vector field. Then both the circuit breaker and the metrics behave exactly as expected, on all 4 different machines that we have tested on: x2gd.4xlarge, r7gd.8xlarge, m7gd.8xlarge and c7gd.12xlarge.

Next step is to switch back to using exactly the IVF+PQ settings above that we had when we got the OOM problem, and again run on all 4 instance types. We also plan to test with and without jemalloc, and to add more monitoring and/or profiling flags to the JVM.
We have done tests using the problematic IVF + PQ index now, both with jemalloc and without. We used 4 different instance types at the same time (all holding the same replicas). The specs for the instance types are:

The m7gd and c7gd crash after loading ~60% of the indices into the cache. The r7gd and x2gd ARE able to load the full data into their caches, but the total RAM usage (cache + heap) is 196 GiB (with a 40 GiB heap).

The memory usage does NOT change significantly when using jemalloc vs the default malloc implementation in the system. What does change, however, is a metric called "system mem committed_as", which is described as

But in essence it really looks like the warmup call allocates over 155 GiB of RAM for our dataset, even though the _knn/_stats only shows 11171556 KiB ≈ 10.7 GiB, which is more than 1 order of magnitude off.

Attached are 2 screenshots when we use the

It also looks like there is no difference between Graviton 2 (x2gd) and Graviton 3 (r7gd) instances; both use the same amount of RAM in the end and both can load the dataset.
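For reference, committed_as comes from the kernel's overcommit accounting and can be read directly:

```sh
# Committed_AS = the kernel's estimate of total committed virtual memory
grep -i committed_as /proc/meminfo
```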
I added detailed JVM NMT (Native Memory Tracking) by adding the below line to /etc/opensearch/jvm.options (also using a 10GB heap for this test, on an r7gd.8x).
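For reference, detailed NMT is enabled with this JVM flag:

```
-XX:NativeMemoryTracking=detail
```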
Then restarted OpenSearch and established a baseline like this:
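Presumably along these lines, using jcmd against the OpenSearch process (the pid is a placeholder):

```sh
# record the current native memory usage as the NMT baseline
jcmd <opensearch-pid> VM.native_memory baseline
```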
Then loaded two sets of indices with the warmup API. The _knn/_stats changed from all 0s to this:
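The warmup and stats calls look roughly like this (index names are placeholders):

```sh
# load the native index files for the listed indices into the cache
curl -s "localhost:9200/_plugins/_knn/warmup/index-1,index-2?pretty"

# check the k-NN memory / cache statistics
curl -s "localhost:9200/_plugins/_knn/stats?pretty"
```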
Finally I ran this command. But to me nothing stands out in that file, as (as far as I can tell) it just shows minor changes in the memory recorded there... so the RAM allocations seem to be outside of what the JVM can even measure?? I also attach the output from the jemalloc jeprof command, which was run like this:
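Presumably these were along the lines of (pid, JDK path and profile file name are placeholders):

```sh
# dump the NMT diff against the earlier baseline into a file
jcmd <opensearch-pid> VM.native_memory detail.diff > nmt-detail-diff.txt

# summarize a jemalloc heap profile with jeprof
jeprof --text /usr/share/opensearch/jdk/bin/java jeprof.<pid>.0.f.heap
```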
But if you could give some detailed instructions on how to run jemalloc with all the correct symbols available for the profiler, then we can run the tests again. Otherwise I do NOT plan to do any more investigation here unless you can tell me exactly what to look for.
Thanks @karlney. Was out of the office last week. Will pick this back up this week.
Was able to reproduce the discrepancy in memory usage in a smaller experiment with a docker container for 2.11:

Observations

1. File size and formula estimates are the same, but the formula doesn't account for redundant loading of codebooks

From the metrics, for a single segment, the formula estimate and the file-based estimate are the same. I think they will diverge as more segments are added, because there will be redundant storage of codebooks (this should be fixed so that subcomponents are reused). A more accurate formula would be:
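A sketch of what such a formula could look like, with the codebook and centroid terms repeated once per segment (variable names mirror the AWS formula above; this is not an official estimate):

```
1.1 * ( ((code_size / 8) * m + 24) * num_vectors
        + num_segments * ((4 * nlist * d) + (2^code_size * 4 * d)) )
```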
2. Docker reported memory is 30x greater

Docker reported a memory change of 0.393 GB when the segment was loaded. This is 30x larger than the formula/file estimate, which seems off. I am investigating what may be causing this now.

Experiment Details

Experimental Procedure
Node Specs
Workload Specs
Metrics (memory metrics obtained from docker stats)
As an update, I used GDB to isolate where the large allocation is happening. I added the following test here: https://github.com/opensearch-project/k-NN/blob/main/jni/tests/faiss_wrapper_test.cpp (similar to our load function here: https://github.com/opensearch-project/k-NN/blob/main/jni/src/faiss_wrapper.cpp#L187-L195)
What I noticed was the following:
This accounts for 393356 KiB (0.375 GiB) -> this is quite a bit and appears to be the culprit. I need to investigate this a little further, but there is precomputation happening around the residual for the IVFPQ index type. As a hack, I set this table limit to
With this, the loaded graph is of size 14224 KiB, which maps to 0.01357 GiB and is identical to the file and formula estimate above. As next steps, I'll look more into this precomputed table for IVFPQ and see why we need to allocate this table.
@karlney the table is in use. For your testing, there are 2 workarounds that can unblock you. First, you can use inner product instead of the l2 distance type. This will probably be embedding-model dependent, but if you are able to use innerproduct instead, it will not create a large precomputation table. Second, we can provide a tar (or whatever artifact you are using) that disables the precomputation table. This might lead to higher latencies, but would resolve the memory issue. Please let us know if either of those options is possible for testing.

In the long term, we will need to figure out how to re-use components from the faiss model across segments. This will be a bigger refactoring effort and may take some time.
That is great news 🎉 We will plan for a re-test later this week then, or early next week.
@karlney one more thing, in regards to this comment:
Overhead will be replicated per segment. So, if you are able to reduce the segment count via forcemerge, the memory consumption will also be significantly lower.
Hi again. Today we had time to re-run our tests with IVF + PQ and inner product. We used two c7gd.16xlarge instances (128 GB RAM each) and 1 replica of all data. We used the following training settings:
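A sketch of what such a training request looks like; the model id, index/field names and parameter values below are placeholders, not our actual settings:

```sh
# hypothetical IVF+PQ training request using inner product
curl -s -X POST "localhost:9200/_plugins/_knn/models/my-ivfpq-ip-model/_train?pretty" \
  -H 'Content-Type: application/json' -d '
{
  "training_index": "train-index",
  "training_field": "train_field",
  "dimension": 768,
  "method": {
    "name": "ivf",
    "engine": "faiss",
    "space_type": "innerproduct",
    "parameters": {
      "nlist": 4096,
      "nprobes": 128,
      "encoder": {
        "name": "pq",
        "parameters": { "m": 96, "code_size": 8 }
      }
    }
  }
}'
```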
After indexing the same 53M docs we ended up with 22 shards and 590 segments (including replicas), and the resource usage and recall results are very good. The _plugins/_knn/stats shows a RAM usage of ~14 GiB, and measuring memory usage at the system level, the real used memory also increases from 42 to 56 GiB, which likewise gives 14 GiB (so no discrepancy).

Recall is also not impacted that much compared to the L2 numbers (and using nprobes = 128). We still see >80% recall for all our measures (R100@100=83%, R10@10=83%) and we believe we can get even better by increasing nprobes again. So from our side we are good with this workaround.

I don't know what you want to do with the ticket, as the problem with L2 + IVF + PQ still remains I guess, but our team will not be blocked by it.
That's great to hear @karlney. Let's keep this issue open for now, as it can track other issues we are seeing.
What is the bug?
Linux OOM killer kicks in every time I try to load my IVF PQ indices into the native RAM cache.
This is what dmesg -T shows
How can one reproduce the bug?
I am unable to load my 9 IVF kNN indices into the RAM cache. Every time, on a 96 GiB instance type (c7gd.12xlarge), the Linux OOM killer kicks in and the OpenSearch Java process is killed.
I have tried a variety of settings for knn.memory.circuit_breaker.limit, from 10% up to 80%, but it does NOT make any difference. And when looking at /_plugins/_knn/stats, the highest "graph_memory_usage" I have seen before the data nodes are killed is ~4 GiB (the graph_memory_usage_percentage varies of course when I vary the knn.memory.circuit_breaker.limit setting), but it never reaches 100% and the circuit breaker never kicks in; the node just dies instead. But there should be AT LEAST 40 to 50 GiB RAM available for the kNN plugin to use after subtracting the HEAP usage.
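For reference, the limit was adjusted via the cluster settings API along these lines (50% is just one of the values tried):

```sh
curl -s -X PUT "localhost:9200/_cluster/settings?pretty" \
  -H 'Content-Type: application/json' \
  -d '{ "persistent": { "knn.memory.circuit_breaker.limit": "50%" } }'
```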
I see the problem both when running /warmup and when I execute kNN searches. I have also varied the heap: first I used a 30GB heap, then tried lower and lower values down to 8GB, with no difference.
These are my indices (note that we use child documents, so the total doc count counts Lucene docs, not OpenSearch docs). The total number of OpenSearch documents is ~53M across all 9 indices. I am able to load up to 4 or 5 indices (sometimes) before the data nodes are killed, but it varies a bit.
uuid                   pri rep docs.count store.size pri.store.size
KBQyyiA0RJC3h-t7vHN8bQ 1 1 7414810 85.6gb 42.8gb
81e7bB0HQ2CTdadQAuLM5g 3 1 48145087 575.7gb 287.8gb
QT33-UDRQbuObz-RJ6s3tg 1 1 3023301 96.4gb 48.2gb
AK3TIiI6SwmIzcYrqKZVeQ 1 1 3492437 107.1gb 53.5gb
YoPL-YVeTLKF4fgqQ7HZog 1 1 15397345 209.6gb 104.8gb
D9jsgr2HR7KhxiMeosHtfQ 1 1 14632959 197.4gb 98.7gb
JtO9jvzSTdq1m1DMOjmMIA 1 1 3970976 88.9gb 44.4gb
XDhwJR_DQCmFWkbR_N7IEw 1 1 3781568 82.2gb 41.1gb
iJ3NEHMhQ4uzcQFuD2r7Ug 1 1 14119900 169.6gb 84.8gb
These are the settings for the kNN field
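A minimal sketch of such a field, bound to a trained model (index, field and model names here are placeholders, not our actual mapping):

```sh
# hypothetical index with a knn_vector field that uses a trained IVF+PQ model
curl -s -X PUT "localhost:9200/my-knn-index?pretty" -H 'Content-Type: application/json' -d '
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "my_vector": { "type": "knn_vector", "model_id": "my-ivfpq-model" }
    }
  }
}'
```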
What is the expected behavior?
I expect all my indices to be loaded into the RAM cache, or the kNN circuit breaker to trigger.
What is your host/environment?
instance type c7gd.12xlarge
OS: NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
Version 2.11
Plugins
Do you have any screenshots?
Below is a graph of the memory usage during one test sequence. It shows committed memory on both data nodes. It starts at ~8GB due to the heap being 8GB. Then, as I start to run warmup commands, the RAM goes up until it exceeds 96 GB and the node (java process) is killed. During this run (just before the java process got killed) the graph_memory_usage reported 3893772.