[BUG] Linux OOM killer kicks in every time I try to load my IVF+PQ+L2 indices into the native RAM cache #1301

Closed
karlney opened this issue Nov 8, 2023 · 15 comments
Labels: bug (Something isn't working), v2.13.0

karlney commented Nov 8, 2023

What is the bug?
Linux OOM killer kicks in every time I try to load my IVF PQ indices into the native RAM cache.

This is what dmesg -T shows

[tis nov  7 10:54:50 2023] opensearch[psd- invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
[tis nov  7 10:54:50 2023] CPU: 22 PID: 19055 Comm: opensearch[psd- Not tainted 5.10.179-171.711.amzn2.aarch64 #1
[tis nov  7 10:54:50 2023] Hardware name: Amazon EC2 c7gd.12xlarge/, BIOS 1.0 11/1/2018
[tis nov  7 10:54:50 2023] Call trace:
[tis nov  7 10:54:50 2023]  dump_backtrace+0x0/0x204
[tis nov  7 10:54:50 2023]  show_stack+0x20/0x2c
[tis nov  7 10:54:50 2023]  dump_stack+0xe8/0x120
[tis nov  7 10:54:50 2023]  dump_header+0x50/0x1f8
[tis nov  7 10:54:50 2023]  oom_kill_process+0x25c/0x260
[tis nov  7 10:54:50 2023]  out_of_memory+0xdc/0x344
[tis nov  7 10:54:50 2023]  __alloc_pages_may_oom+0x118/0x1a0
[tis nov  7 10:54:50 2023]  __alloc_pages_slowpath.constprop.0+0x588/0x800
[tis nov  7 10:54:50 2023]  __alloc_pages_nodemask+0x2b4/0x308
[tis nov  7 10:54:50 2023]  alloc_pages_current+0x90/0x148
[tis nov  7 10:54:50 2023]  __pte_alloc+0x30/0x1c0
[tis nov  7 10:54:50 2023]  do_anonymous_page+0x3ec/0x580
[tis nov  7 10:54:50 2023]  handle_pte_fault+0x1b0/0x228
[tis nov  7 10:54:50 2023]  __handle_mm_fault+0x1dc/0x374
[tis nov  7 10:54:50 2023]  handle_mm_fault+0xd0/0x240
[tis nov  7 10:54:50 2023]  do_page_fault+0x150/0x420
[tis nov  7 10:54:50 2023]  do_translation_fault+0xb8/0xf4
[tis nov  7 10:54:50 2023]  do_mem_abort+0x48/0xa8
[tis nov  7 10:54:50 2023]  el0_da+0x44/0x80
[tis nov  7 10:54:50 2023]  el0_sync_handler+0xe0/0x120
[tis nov  7 10:54:50 2023] Mem-Info:
[tis nov  7 10:54:50 2023] active_anon:77 inactive_anon:23509894 isolated_anon:0
                            active_file:3214 inactive_file:2074 isolated_file:0
                            unevictable:0 dirty:52 writeback:0
                            slab_reclaimable:458353 slab_unreclaimable:33629
                            mapped:4904 shmem:115 pagetables:50406 bounce:0
                            free:188408 free_pcp:315 free_cma:13036
[tis nov  7 10:54:50 2023] Node 0 active_anon:308kB inactive_anon:94039576kB active_file:5044kB inactive_file:9368kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:12176kB dirty:208kB writeback:0kB shmem:460kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2048kB writeback_tmp:0kB kernel_stack:13488kB all_unreclaimable? no
[tis nov  7 10:54:50 2023] Node 0 DMA free:376048kB min:428kB low:1352kB high:2276kB reserved_highatomic:0KB active_anon:0kB inactive_anon:507724kB active_file:196kB inactive_file:348kB unevictable:0kB writepending:0kB present:1048576kB managed:940920kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:200kB local_pcp:0kB free_cma:52144kB
[tis nov  7 10:54:50 2023] lowmem_reserve[]: 0 0 93900 93900
[tis nov  7 10:54:50 2023] Node 0 Normal free:381112kB min:44624kB low:140776kB high:236928kB reserved_highatomic:2048KB active_anon:308kB inactive_anon:93531740kB active_file:4840kB inactive_file:11536kB unevictable:0kB writepending:204kB present:98009088kB managed:96153676kB mlocked:0kB pagetables:201624kB bounce:0kB free_pcp:1544kB local_pcp:0kB free_cma:0kB
[tis nov  7 10:54:50 2023] lowmem_reserve[]: 0 0 0 0
[tis nov  7 10:54:50 2023] Node 0 DMA: 8*4kB (UE) 16*8kB (UE) 882*16kB (UMEC) 395*32kB (UMEC) 128*64kB (UE) 48*128kB (UEC) 22*256kB (UMEC) 11*512kB (UEC) 12*1024kB (UE) 10*2048kB (UMEC) 71*4096kB (UMEC) = 376096kB
[tis nov  7 10:54:50 2023] Node 0 Normal: 312*4kB (ME) 3130*8kB (UMEH) 16355*16kB (UMEH) 2749*32kB (UEH) 0*64kB 1*128kB (H) 1*256kB (H) 1*512kB (H) 1*1024kB (H) 1*2048kB (H) 0*4096kB = 379904kB
[tis nov  7 10:54:50 2023] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[tis nov  7 10:54:50 2023] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
[tis nov  7 10:54:50 2023] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[tis nov  7 10:54:50 2023] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
[tis nov  7 10:54:50 2023] 5129 total pagecache pages
[tis nov  7 10:54:50 2023] 0 pages in swap cache
[tis nov  7 10:54:50 2023] Swap cache stats: add 0, delete 0, find 0/0
[tis nov  7 10:54:50 2023] Free swap  = 0kB
[tis nov  7 10:54:50 2023] Total swap = 0kB
[tis nov  7 10:54:50 2023] 24764416 pages RAM
[tis nov  7 10:54:50 2023] 0 pages HighMem/MovableOnly
[tis nov  7 10:54:50 2023] 490767 pages reserved
[tis nov  7 10:54:50 2023] 16384 pages cma reserved
[tis nov  7 10:54:50 2023] Tasks state (memory values in pages):
[tis nov  7 10:54:50 2023] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[tis nov  7 10:54:50 2023] [    674]     0   674     9561     1429   122880        0             0 systemd-journal
[tis nov  7 10:54:50 2023] [    708]     0   708    20661       66    53248        0             0 lvmetad
[tis nov  7 10:54:50 2023] [    721]     0   721     3834      320    49152        0         -1000 systemd-udevd
[tis nov  7 10:54:50 2023] [    986]     0   986     4445      112    53248        0         -1000 auditd
[tis nov  7 10:54:50 2023] [   1020]     0  1020    20430      146    57344        0             0 irqbalance
[tis nov  7 10:54:50 2023] [   1022]    81  1022     2133      144    53248        0          -900 dbus-daemon
[tis nov  7 10:54:50 2023] [   1026]    32  1026     2447      130    61440        0             0 rpcbind
[tis nov  7 10:54:50 2023] [   1029]     0  1029     1177      109    45056        0             0 systemd-logind
[tis nov  7 10:54:50 2023] [   1030]   999  1030      643       41    40960        0             0 lsmd
[tis nov  7 10:54:50 2023] [   1033]   998  1033    80488      207   102400        0             0 rngd
[tis nov  7 10:54:50 2023] [   1035]   997  1035    20606      139    57344        0             0 chronyd
[tis nov  7 10:54:50 2023] [   1036]     0  1036    42198      118    86016        0             0 gssproxy
[tis nov  7 10:54:50 2023] [   1264]     0  1264     4863      626    77824        0             0 dhclient
[tis nov  7 10:54:50 2023] [   1311]     0  1311     4863      506    73728        0             0 dhclient
[tis nov  7 10:54:50 2023] [   1461]     0  1461     5244      261    81920        0             0 master
[tis nov  7 10:54:50 2023] [   1474]    89  1474     5287      257    81920        0             0 qmgr
[tis nov  7 10:54:50 2023] [   1509]     0  1509   129645    13046   225280        0             0 aws
[tis nov  7 10:54:50 2023] [   1511]     0  1511    49916      234   139264        0             0 rsyslogd
[tis nov  7 10:54:50 2023] [   1521]     0  1521    28729      275    65536        0             0 crond
[tis nov  7 10:54:50 2023] [   1522]     0  1522     1005       50    45056        0             0 atd
[tis nov  7 10:54:50 2023] [   1533]     0  1533    28305       31    45056        0             0 agetty
[tis nov  7 10:54:50 2023] [   1535]     0  1535    28217       33    57344        0             0 agetty
[tis nov  7 10:54:50 2023] [   1707]     0  1707     4519      251    77824        0         -1000 sshd
[tis nov  7 10:54:50 2023] [   1820]     0  1820      511       26    40960        0             0 acpid
[tis nov  7 10:54:50 2023] [   8179]     0  8179  1196615    13655   868352        0             0 filebeat
[tis nov  7 10:54:50 2023] [   8284]   993  8284  1295819    23576   937984        0             0 agent
[tis nov  7 10:54:50 2023] [   8831]     0  8831     5221      327    81920        0             0 sshd
[tis nov  7 10:54:50 2023] [   8904]   613  8904     5253      387    81920        0             0 sshd
[tis nov  7 10:54:50 2023] [   8905]   613  8905    28722      376    69632        0             0 bash
[tis nov  7 10:54:50 2023] [  16336]   994 16336 181357628 23447688 202764288        0             0 java
[tis nov  7 10:54:50 2023] [  17991]   613 17991    29042      664    57344        0             0 htop
[tis nov  7 10:54:50 2023] [  18935]    89 18935     5267      541    77824        0             0 pickup
[tis nov  7 10:54:50 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=java,pid=16336,uid=994
[tis nov  7 10:54:50 2023] Out of memory: Killed process 16336 (java) total-vm:725430512kB, anon-rss:93787876kB, file-rss:2876kB, shmem-rss:0kB, UID:994 pgtables:198012kB oom_score_adj:0

How can one reproduce the bug?

I am unable to load my 9 IVF kNN indices into the RAM cache. Every time, on a 96 GiB instance (c7gd.12xlarge), the Linux OOM killer kicks in and the OpenSearch Java process is killed.

I have tried a variety of knn.memory.circuit_breaker.limit settings, from 10% up to 80%, but it does NOT make any difference.

When looking at /_plugins/_knn/stats, the highest "graph_memory_usage" I have seen before the data nodes are killed is ~4 GiB (the graph_memory_usage_percentage of course varies when I vary the knn.memory.circuit_breaker.limit setting), but it never reaches 100% and the circuit breaker never kicks in; the node just dies instead.
There should be AT LEAST 40 to 50 GiB of RAM available for the kNN plugin to use after subtracting the heap usage.

I see the problem both when running /warmup and when I execute kNN searches.

I have also varied the heap. First I used a 30 GB heap, then tried lower and lower values down to 8 GB, with no difference.

These are my indices (note that we use child documents, so the total doc count below counts Lucene docs, not OpenSearch docs). The total number of OpenSearch documents is ~53M across all 9 indices.

I am able to load up to 4 or 5 indices (sometimes) before the data nodes are killed, but it varies a bit.

KBQyyiA0RJC3h-t7vHN8bQ 1 1 7414810 85.6gb 42.8gb
81e7bB0HQ2CTdadQAuLM5g 3 1 48145087 575.7gb 287.8gb
QT33-UDRQbuObz-RJ6s3tg 1 1 3023301 96.4gb 48.2gb
AK3TIiI6SwmIzcYrqKZVeQ 1 1 3492437 107.1gb 53.5gb
YoPL-YVeTLKF4fgqQ7HZog 1 1 15397345 209.6gb 104.8gb
D9jsgr2HR7KhxiMeosHtfQ 1 1 14632959 197.4gb 98.7gb
JtO9jvzSTdq1m1DMOjmMIA 1 1 3970976 88.9gb 44.4gb
XDhwJR_DQCmFWkbR_N7IEw 1 1 3781568 82.2gb 41.1gb
iJ3NEHMhQ4uzcQFuD2r7Ug 1 1 14119900 169.6gb 84.8gb

These are the settings for the kNN field

{
  "training_index": "train-index",
  "dimension": 768,
  "method": {
    "name": "ivf",
    "engine": "faiss",
    "space_type": "l2",
    "parameters": {
      "nlist": 4096,
      "nprobes": 64,
      "encoder": {
        "name": "pq",
        "parameters": {
          "code_size": 8,
          "m": 96
        }
      }
    }
  }
}
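
For reference, these settings describe roughly the following Faiss index; a minimal C++ sketch, assuming the mapping corresponds to a standard IVF+PQ index factory string (the exact string the plugin builds internally is an assumption here):

```cpp
#include <faiss/Index.h>
#include <faiss/index_factory.h>

int main() {
    // d=768, nlist=4096, PQ with m=96 sub-quantizers and 8-bit codes, L2 metric.
    faiss::Index* index =
        faiss::index_factory(768, "IVF4096,PQ96x8", faiss::METRIC_L2);
    // The index still has to be trained on vectors from "train-index" before
    // anything can be added or searched; shown here only to illustrate what the
    // mapping parameters correspond to on the Faiss side.
    delete index;
    return 0;
}
```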

What is the expected behavior?
I expect all my indices to be loaded into the RAM cache, or the kNN circuit breaker to trigger.

What is your host/environment?

  • instance type c7gd.12xlarge

  • OS: NAME="Amazon Linux"
    VERSION="2"
    ID="amzn"
    ID_LIKE="centos rhel fedora"
    VERSION_ID="2"
    PRETTY_NAME="Amazon Linux 2"
    ANSI_COLOR="0;33"
    CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
    HOME_URL="https://amazonlinux.com/"

  • Version 2.11

  • Plugins

Do you have any screenshots?
Below is a graph of the memory usage during one test sequence. It shows committed memory on both data nodes. It starts at ~8 GB because the heap is 8 GB. When I start to run warmup commands the RAM usage goes up, passes 96 GB, and then the node (the Java process) is killed. During this run (just before the Java process was killed) graph_memory_usage reported 3893772.
Screenshot from 2023-11-07 12-02-03

@karlney karlney added bug Something isn't working untriaged labels Nov 8, 2023

karlney commented Nov 8, 2023

When I switch to another instance type (x2gd.4xlarge with 256 GB RAM), I CAN load all the indices. I get this now:

"WeoGGAJ5R72JFqomwuNfrA": {
    "graph_memory_usage_percentage": 25.760834,
    "graph_query_requests": 0,
    "graph_memory_usage": 11275499,
    "cache_capacity_reached": false,
    "load_success_count": 388,
    "training_memory_usage": 0,
    "indices_in_cache": {
      "index-1": {
        "graph_memory_usage": 760023,
        "graph_memory_usage_percentage": 1.7364044,
        "graph_count": 29
      },
      "index-2": {
        "graph_memory_usage": 4084570,
        "graph_memory_usage_percentage": 9.331908,
        "graph_count": 119
      },
      "index-3": {
        "graph_memory_usage": 555885,
        "graph_memory_usage_percentage": 1.2700157,
        "graph_count": 30
      },
      "index-4": {
        "graph_memory_usage": 584381,
        "graph_memory_usage_percentage": 1.3351197,
        "graph_count": 31
      },
      "index-5": {
        "graph_memory_usage": 1367585,
        "graph_memory_usage_percentage": 3.124485,
        "graph_count": 42
      },
      "index-6": {
        "graph_memory_usage": 1385921,
        "graph_memory_usage_percentage": 3.1663768,
        "graph_count": 40
      },
      "index-7": {
        "graph_memory_usage": 686931,
        "graph_memory_usage_percentage": 1.5694131,
        "graph_count": 32
      },
      "index-7": {
        "graph_memory_usage": 621637,
        "graph_memory_usage_percentage": 1.4202375,
        "graph_count": 28
      },
      "index-7": {
        "graph_memory_usage": 1228566,
        "graph_memory_usage_percentage": 2.8068721,
        "graph_count": 37
      }
    },
    "script_query_errors": 0,
    "hit_count": 0,
    "knn_query_requests": 0,
    "total_load_time": 273487884254,
    "miss_count": 388,
    "knn_query_with_filter_requests": 0,
    "training_memory_usage_percentage": 0,
    "lucene_initialized": true,
    "graph_index_requests": 0,
    "faiss_initialized": true,
    "load_exception_count": 0,
    "training_errors": 0,
    "eviction_count": 0,
    "nmslib_initialized": false,
    "script_compilations": 0,
    "script_query_requests": 0,
    "graph_stats": {
      "refresh": {
        "total_time_in_millis": 0,
        "total": 0
      },
      "merge": {
        "current": 0,
        "total": 0,
        "total_time_in_millis": 0,
        "current_docs": 0,
        "total_docs": 0,
        "total_size_in_bytes": 0,
        "current_size_in_bytes": 0
      }
    },
    "graph_query_errors": 0,
    "indexing_from_model_degraded": false,
    "graph_index_errors": 0,
    "training_requests": 0,
    "script_compilation_errors": 0
  },

with these knn cluster settings

"knn" : {
      "algo_param" : {
        "index_thread_qty" : "32"
      },
      "circuit_breaker" : {
        "triggered" : "false"
      },
      "memory" : {
        "circuit_breaker" : {
          "limit" : "20%",
          "enabled" : "true"
        }
      }
    },

It looks like the RAM usage reported is 11275499 KiB ≈ 10.8 GiB, which would fit easily in the first instance type.

Also, according to this formula (taken from https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/ )

1.1 * ((((code_size / 8) * m + overhead_per_vector) * num_vectors) + (4 * nlist * dimension) + (2^code_size * 4 * dimension))

1.1 * ((((8 / 8) * 96 + 24) * 53,000,000) + (4 * 4096 * 768) + (2^8 * 4 * 768)) = 7 GiB

So either there is something else going on, or the memory estimates and the metrics are totally off.

@jmazanec15 jmazanec15 self-assigned this Nov 14, 2023
jmazanec15 (Member) commented:

Hi @karlney, taking a look now. As mentioned in the call this morning, could you retry using jemalloc? We have seen issues around fragmentation that can cause this, and I want to see if it remediates the issue. Related issue: opensearch-project/OpenSearch#9496.

The cache itself uses the file size as an estimate of the amount of memory being used. It is probably worth investigating a more accurate accounting system using direct memory metrics such as AnonRSS, but file size has been reliable in the past. I need to double-check this more closely for IVFPQ. What is the underlying method used to collect the memory usage in the graph above? AnonRSS?

KBQyyiA0RJC3h-t7vHN8bQ 1 1 7414810 85.6gb 42.8gb
81e7bB0HQ2CTdadQAuLM5g 3 1 48145087 575.7gb 287.8gb
QT33-UDRQbuObz-RJ6s3tg 1 1 3023301 96.4gb 48.2gb
AK3TIiI6SwmIzcYrqKZVeQ 1 1 3492437 107.1gb 53.5gb
YoPL-YVeTLKF4fgqQ7HZog 1 1 15397345 209.6gb 104.8gb
D9jsgr2HR7KhxiMeosHtfQ 1 1 14632959 197.4gb 98.7gb
JtO9jvzSTdq1m1DMOjmMIA 1 1 3970976 88.9gb 44.4gb
XDhwJR_DQCmFWkbR_N7IEw 1 1 3781568 82.2gb 41.1gb
iJ3NEHMhQ4uzcQFuD2r7Ug 1 1 14119900 169.6gb 84.8gb

If my math is correct here, this is ~114M Lucene docs, correct? With replicas it is around 228M? I just want to double-check that the 53M vectors that were attempted to be loaded is correct and that it accounts for replicas.


karlney commented Nov 15, 2023

Right, because we are using nested fields in the documents, the number of Lucene documents is a lot higher than the number of OpenSearch documents that we have. So you are reading the numbers right, but we still only have 53M vectors (106M including replicas).


karlney commented Nov 16, 2023

An update from today: we are NOT able to reproduce this behavior when using an index with an HNSW+PQ vector field. In that case both the circuit breaker and the metrics behave exactly as expected, on all 4 different machines we have tested on.

x2gd.4xlarge, r7gd.8xlarge, m7gd.8xlarge and c7gd.12xlarge

The next step is to switch back to exactly the IVF+PQ settings above (the ones we had when we got the OOM problem) and again run on all 4 instance types above. We also plan to test with and without jemalloc, and to add more monitoring and/or profiling flags to the JVM.


karlney commented Nov 17, 2023

We have now run tests using the problematic IVF+PQ index, both with jemalloc and without. We used 4 different instance types at the same time (all holding the same replicas).

The specs for the instance types are:
instance_type = "x2gd.8xlarge" #1900 GB disk / 512G RAM / 32 cpu
instance_type = "r7gd.8xlarge" #1900 GB disk / 256G RAM / 32 cpu
instance_type = "m7gd.8xlarge" #1900 GB disk / 128G RAM / 32 cpu
instance_type = "c7gd.12xlarge" #2840 GB disk / 96G RAM / 48 cpu

The m7gd and c7gd crash after loading ~60% of the indices into the cache. The r7gd and x2gd ARE able to load the full data into their caches, but the total RAM usage (cache + heap) is 196 GiB (with a 40 GiB heap).

The memory usage does NOT change significantly when using jemalloc vs the default malloc implementation in the system.

What does change, however, is a metric called "system mem committed_as", which is described as:
"The amount of memory presently allocated on the system, even if it has not been "used" by processes as of yet. (Linux only)"

In essence it really looks like the warmup call allocates over 155 GiB of RAM for our dataset, even though _knn/stats only shows 11171556 (≈ 10.7 GiB).

That is more than an order of magnitude off.

Attached are 2 screenshots from when we run the /warmup command on our indices, with and without jemalloc. As you can see, 2 of the nodes (c7gd and m7gd) crash before we have loaded the full set of data, as they simply do not have enough RAM (it seems).

It also looks like there is no difference between Graviton 2 (x2gd) and Graviton 3 (r7gd) instances; both use the same amount of RAM in the end and both can load the dataset.

Screenshot from 2023-11-17 10-06-33


karlney commented Nov 17, 2023

I enabled detailed JVM Native Memory Tracking (NMT) by adding the line below to /etc/opensearch/jvm.options (also using a 10 GB heap for this test, on an r7gd.8xlarge):

-XX:NativeMemoryTracking=detail

Then I restarted OpenSearch and established a baseline like this:

sudo -u opensearch /usr/share/opensearch/jdk/bin/jcmd 29239 VM.native_memory baseline

Then loaded two sets of indices with the /warmup command.

The _knn/_stats output changed from all 0s to this:

"Hf2Ugd-oTL-NLPrEySKcEw": {
    "graph_memory_usage_percentage": 5.745411,
    "graph_query_requests": 0,
    "graph_memory_usage": 7146367,
    "cache_capacity_reached": false,
    "load_success_count": 227,
    "training_memory_usage": 0,
    "indices_in_cache": {
    ...
    },
   ...
  },

Finally I ran this command
sudo -u opensearch /usr/share/opensearch/jdk/bin/jcmd 29239 VM.native_memory detail.diff > native_memory.detail.txt
which produced this file: native_memory.detail.txt

But to me nothing stands out in that file; as far as I can tell it just shows minor changes in memory, so the RAM allocations seem to be outside of what the JVM can even measure??

This is what top shows
Screenshot from 2023-11-17 12-14-40

I also attach the output from the jemalloc jeprof command, which was run like this:

jeprof --show_bytes --pdf /usr/share/opensearch/jdk/bin/java jem*.heap > app-profiling.pdf

IMO that is probably not very helpful, as it seems to be missing the symbols for the faiss engine (and maybe for the JVM as well??).

But if you could give some detailed instructions on how to run jemalloc with all the correct symbols available to the profiler, then we can run the tests again.

app-profiling.pdf

Otherwise I do NOT plan to do any more investigation here unless you can tell me exactly what to look for.

jmazanec15 (Member) commented:

Thanks @karlney. Was out of the office last week. Will pick this back up this week.

jmazanec15 (Member) commented:

I was able to reproduce the discrepancy in memory usage in a smaller experiment with a Docker container running 2.11:

Observations

1. File size and formula estimates are the same, but the formula doesn't account for redundant loading of codebooks

From the metrics, for a single segment, the formula estimate and the file-based estimate are the same. I think they will diverge as more segments are added, because codebooks will be stored redundantly (this should be fixed so that subcomponents are reused). A more accurate formula would be (see the sketch after these observations):

1.1 * ((((code_size/8) * m + overhead_per_vector) * num_vectors) + num_segments * (4 * nlist * dimension) + num_segments * (2^code_size * 4 * dimension))

2. Docker-reported memory is ~30x greater

Docker reported a memory change of 0.393 GB when the segment was loaded. This is ~30x larger than the formula/file estimate, which seems off. I am investigating what may be causing this now.
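
To make the arithmetic in observation 1 concrete, here is a minimal sketch of that per-segment estimate (not the plugin's code; the function name and the 500-segment figure are illustrative):

```cpp
#include <cmath>
#include <cstdio>

// Rough native-memory estimate (bytes) for an IVF+PQ field whose coarse centroids
// and PQ codebooks are loaded redundantly once per segment, per the formula above.
static double estimate_ivfpq_bytes(double num_vectors, double num_segments,
                                   double dimension, double nlist,
                                   double m, double code_size,
                                   double overhead_per_vector = 24) {
    double per_vector_bytes = (code_size / 8.0) * m + overhead_per_vector;
    double coarse_centroids = 4.0 * nlist * dimension;                    // per segment
    double pq_codebooks     = std::pow(2.0, code_size) * 4.0 * dimension; // per segment
    return 1.1 * (per_vector_bytes * num_vectors +
                  num_segments * (coarse_centroids + pq_codebooks));
}

int main() {
    const double gib = 1024.0 * 1024.0 * 1024.0;
    // Reporter's parameters: 53M vectors, dimension 768, nlist 4096, m 96, code_size 8.
    std::printf("1 segment:    %.2f GiB\n",
                estimate_ivfpq_bytes(53e6, 1, 768, 4096, 96, 8) / gib);
    std::printf("500 segments: %.2f GiB\n",
                estimate_ivfpq_bytes(53e6, 500, 768, 4096, 96, 8) / gib);
    return 0;
}
```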

Experiment Details

Experimental Procedure

  1. Create training index
  2. Ingest training vectors
  3. Create model
  4. Create test index
  5. Ingest test vectors
  6. Force merge
  7. Shutdown container
  8. Bring container up
  9. Run warmup API
  10. Clear index

Node Specs

| Spec | Value |
|------|-------|
| Container Total Memory (GiB) | 16 |
| JVM Heap (GiB) | 8 |
| Cores | 4 |

Workload Specs

| Spec | Value |
|------|-------|
| Data set dimension | 768 |
| Data set number of vectors | 10k |
| Train vector count | 10k |
| IVF nlist | 4096 |
| IVF nprobes | 4 |
| PQ code size | 8 |
| PQ m | 96 |
| Shard count | 1 |
| Replica count | 0 |
| Segment count | 0 |

Metrics

(memory metrics obtained from docker stats)

| Metric | Value |
|--------|-------|
| Index size (GiB) | 0.15 |
| Formula Memory Estimate (GiB) | 0.01357 |
| Plugin Stats Memory Estimate - file size (GiB) | 0.01357 |
| Container Memory Before Warmup (GiB) | 8.743 |
| Container Memory After Warmup (GiB) | 9.136 |
| Container Index Memory Reported (GiB) | 0.393 |
| Container Memory After Graph Cleared (GiB) | 8.781 |


jmazanec15 commented Nov 28, 2023

As an update, I used GDB to isolate where the large allocation is happening.

I added the following test here: https://github.com/opensearch-project/k-NN/blob/main/jni/tests/faiss_wrapper_test.cpp (similar to our load function here: https://github.com/opensearch-project/k-NN/blob/main/jni/src/faiss_wrapper.cpp#L187-L195):


TEST(FaissIVFPQTest, BasicAssertions) {
    knn_jni::faiss_wrapper::InitLibrary();
    // Path to the IVF+PQ segment file produced in the earlier experiment.
    std::string pathToIndex = "<PATH_TO_FILE>/_5_165_test_field.faiss";
    // Loading the index is what triggers the allocation being investigated;
    // break before and after this call to compare RssAnon.
    faiss::Index* indexReader = faiss::read_index(pathToIndex.c_str(), faiss::IO_FLAG_READ_ONLY);
    std::cout << "This is the dimension: " << indexReader->d << std::endl;
    delete indexReader;
}

_5_165_test_field.faiss is the Faiss file from the previous experiments. I set breakpoints before and after the load, then ran cat /proc/25164/status | grep RssAnon to get the memory consumption of the process at the different points.
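
For anyone repeating this without GDB, the same numbers can be captured from inside the test process; a minimal Linux-only sketch (the helper name is made up here):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Parse RssAnon (anonymous resident memory, in kB) from /proc/self/status.
// Returns -1 if the field is not present.
static long read_rss_anon_kb() {
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line)) {
        if (line.rfind("RssAnon:", 0) == 0) {
            std::istringstream value(line.substr(8));
            long kb = -1;
            value >> kb;
            return kb;
        }
    }
    return -1;
}

// Usage inside the test above: record read_rss_anon_kb() immediately before and
// after faiss::read_index(...) and print the delta instead of setting breakpoints.
```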

What I noticed was the following:

The load accounts for 393356 kB (0.375 GiB) -> this is quite a bit and appears to be the culprit. I need to investigate this a little further, but there is precomputation happening around the residual for the IVFPQ index type.
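
For scale, if the IVFPQ precomputed residual table is sized at roughly nlist * m * 2^code_size floats (an assumption about how Faiss builds this table, not something measured here), the expected allocation for these parameters would be:

4096 * 96 * 2^8 * 4 bytes ≈ 402 MB ≈ 0.375 GiB

which lines up with the observed 393356 kB.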

As a hack, I set this table limit to size_t precomputed_table_max_bytes = ((size_t)1) << 11; and obtained the following results:

  • Before the load RssAnon = 884 kb
  • After the load RssAnon = 15108 kb

With this, the loaded graph is 14224 kB in size, which maps to 0.01357 GiB, identical to the file and formula estimates above.

As a next step, I'll look more into this precomputed table for IVFPQ and see why we need to allocate it.
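
For completeness, a small diagnostic sketch along the same lines (assuming a Faiss build where IndexIVFPQ exposes use_precomputed_table and precomputed_table; field names may vary across versions) that reports how much of a loaded segment is the precomputed table:

```cpp
#include <faiss/IndexIVFPQ.h>
#include <faiss/index_io.h>
#include <iostream>

int main() {
    const char* path = "<PATH_TO_FILE>/_5_165_test_field.faiss";
    faiss::Index* index = faiss::read_index(path, faiss::IO_FLAG_READ_ONLY);
    if (auto* ivfpq = dynamic_cast<faiss::IndexIVFPQ*>(index)) {
        std::cout << "use_precomputed_table: " << ivfpq->use_precomputed_table << "\n"
                  << "precomputed_table bytes: "
                  << ivfpq->precomputed_table.size() * sizeof(float) << std::endl;
    }
    delete index;
    return 0;
}
```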

jmazanec15 (Member) commented:

@karlney the table is in use. For your testing, there are 2 workarounds that can unblock you.

First, you can use inner product instead of the l2 distance type. This will probably be embedding-model dependent, but if you are able to use innerproduct instead, it will not create a large precomputation table.

Second, we can provide you with a tar (or whatever artifact you are using) that disables the precomputation table. This might lead to higher latencies, but it would resolve the memory issue.

Please let us know if either of those options are possible for testing.

In the long term, we will need to figure out how to re-use components from the faiss model across segments. This will be a bigger refactoring effort and may take some time.


karlney commented Nov 29, 2023

That is great news 🎉 We will plan a re-test later this week, or early next week.
We can most probably switch to the inner_product distance measure, as we have seen similar recall with it as well.
I'll let you know how it goes.

jmazanec15 (Member) commented:

@karlney one more thing, in regards to this comment:

  1. File size and formula estimates are the same, but the formula doesn't account for redundant loading of codebooks

From the metrics, for a single segment, the formula estimate and the file-based estimate are the same. I think they will diverge as more segments are added, because codebooks will be stored redundantly (this should be fixed so that subcomponents are reused). A more accurate formula would be:

1.1 * ((((code_size/8) * m + overhead_per_vector) * num_vectors) + num_segments * (4 * nlist * dimension) + num_segments * (2^code_size * 4 * dimension))

The overhead will be replicated across segments. So, if you are able to reduce the segment count via force merge, the memory consumption will also be significantly lower.


karlney commented Dec 4, 2023

Hi again. Today we had time to re-run our tests with IVF + PQ and inner product.

We used two c7gd.16xlarge instances (128 GB RAM each) and 1 replica of all data. We used the following training settings

{
  "training_index": "train-index",
  "max_training_vector_count": 1000000,
  "dimension": 768,
  "method": {
    "name": "ivf",
    "engine": "faiss",
    "space_type": "innerproduct",
    "parameters": {
      "nlist": 4096,
      "nprobes": 64,
      "encoder": {
        "name": "pq",
        "parameters": {
          "code_size": 8,
          "m": 192
        }
      }
    }
  }
}

After indexing the same 53M docs, we ended up with 22 shards and 590 segments (including replicas).

The resource usage and recall results are very good.

The _plugins/_knn/stats output shows a RAM usage of ~14 GiB, and measuring memory usage at the system level, real used memory increases from 42 GiB up to 56 GiB, which also gives 14 GiB (so no discrepancy).

Recall is also not impacted that much compared to the L2 numbers (and using nprobes = 128). We still see >80% recall for all our measures (R100@100 = 83%, R10@10 = 83%), and we believe we can get even better by increasing nprobes again.

So from our side we are good with this workaround.

I don't know what you want to do with the ticket, as the problem with L2 + IVF + PQ still remains I guess, but our team will not be blocked by it.

@karlney karlney changed the title [BUG] Linux OOM killer kicks in every time I try to load my IVF PQ indices into the native RAM cache [BUG] Linux OOM killer kicks in every time I try to load my IVF+PQ+L2 indices into the native RAM cache Dec 4, 2023
jmazanec15 (Member) commented:

That's great to hear @karlney. Let's keep this issue open for now, as it can track the other issues we are seeing.

jmazanec15 (Member) commented:

This will be fixed with #1507. In the future, we will also open an issue to share all state, as #1507 only shares the big tables.
