
[Bug]: [benchmark][standalone] Milvus standalone OOM killed frequently in concurrent upsert & query scene #37767

Closed
wangting0128 opened this issue Nov 18, 2024 · 20 comments
Labels: kind/bug, test/benchmark, triage/accepted
Milestone: 2.5.0

@wangting0128 (Contributor)

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20241116-00edec2e-amd64
- Deployment mode(standalone or cluster):standalone
- MQ type(rocksmq, pulsar or kafka):rocksmq    
- SDK version(e.g. pymilvus v2.0.0rc2):2.5.0rc106
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task:master-20241116-00edec2e-amd64

server:

[2024-11-17 00:48:25,718 -  INFO - fouram]: [Base] Deploy initial state: 
I1116 19:04:17.722678    3880 request.go:665] Waited for 1.168431781s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/discovery.k8s.io/v1beta1?timeout=32s
I1116 19:04:27.722923    3880 request.go:665] Waited for 3.596511189s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/actions.summerwind.dev/v1alpha1?timeout=32s
NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-1783600-1-71-3790-etcd-0                             1/1     Running     0               2m55s   10.104.20.132   4am-node22   <none>           <none>
upsert-count-1783600-1-71-3790-milvus-standalone-6d9bf746585977   1/1     Running     2 (2m28s ago)   2m55s   10.104.18.113   4am-node25   <none>           <none>
upsert-count-1783600-1-71-3790-minio-5f74dd55dc-mtwj2             1/1     Running     0               2m55s   10.104.20.131   4am-node22   <none>           <none> (base.py:261)
[2024-11-17 00:48:25,718 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'NAME|upsert-count-1783600-1-71-3790-milvus|upsert-count-1783600-1-71-3790-minio|upsert-count-1783600-1-71-3790-etcd|upsert-count-1783600-1-71-3790-pulsar|upsert-count-1783600-1-71-3790-zookeeper|upsert-count-1783600-1-71-3790-kafka|upsert-count-1783600-1-71-3790-log|upsert-count-1783600-1-71-3790-tikv'  (util_cmd.py:14)
[2024-11-17 00:48:46,791 -  INFO - fouram]: [CliClient] pod details of release(upsert-count-1783600-1-71-3790): 
 I1117 00:48:26.964445    4020 request.go:665] Waited for 1.175551287s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/discovery.k8s.io/v1?timeout=32s
I1117 00:48:37.164327    4020 request.go:665] Waited for 3.797207586s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/storage.k8s.io/v1beta1?timeout=32s
NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-1783600-1-71-3790-etcd-0                             1/1     Running     0               5h47m   10.104.20.132   4am-node22   <none>           <none>
upsert-count-1783600-1-71-3790-milvus-standalone-6d9bf746585977   1/1     Running     7 (73m ago)     5h47m   10.104.18.113   4am-node25   <none>           <none>
upsert-count-1783600-1-71-3790-minio-5f74dd55dc-mtwj2             1/1     Running     0               5h47m   10.104.20.131   4am-node22   <none>           <none>
[screenshot 2024-11-18 15:46:24] [screenshot 2024-11-18 15:52:06]

Before November 14, the daily regression test cases were able to run normally

argo task: upsert-count-1731610800
image: master-20241114-1d06d432-amd64

server:

NAME                                                              READY   STATUS             RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-1710800-1-18-5694-etcd-0                             1/1     Running            0                5h43m   10.104.18.159   4am-node25   <none>           <none>
upsert-count-1710800-1-18-5694-milvus-standalone-5d4db4b74fbkn8   1/1     Running            1 (5h43m ago)    5h43m   10.104.15.189   4am-node20   <none>           <none>
upsert-count-1710800-1-18-5694-minio-54b56cf4b5-tx97s             1/1     Running            0                5h43m   10.104.15.190   4am-node20   <none>           <none>
[screenshot 2024-11-18 15:51:14]

Expected Behavior

No response

Steps To Reproduce

1. deploy a standalone Milvus and reset quotaAndLimits
2. create a collection with fields: 'id' (primary key), 'float_vector' (128 dim), 'varchar_1' (partition key); shards_num=2, num_partitions=16 (per the report data below)
3. build an HNSW index on the 'float_vector' field
4. insert 2m rows
5. flush the collection
6. build the index again
7. load the collection
8. concurrent requests (concurrent number=1; sketched in pymilvus below):
   - upsert: id = 1~2000
   - query: count(*)
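
A minimal pymilvus sketch of the steps above, assuming a local standalone endpoint. The schema fields, shard/partition counts, batch size (ni_per=5000), index params, and the count(*) check are taken from the report data further down; the collection name and the random value generators are illustrative placeholders, and the real test drives step 8 concurrently through a locust harness rather than inline calls:

```python
import random
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="127.0.0.1", port="19530")  # assumed standalone endpoint

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=128),
    FieldSchema("varchar_1", DataType.VARCHAR, max_length=100, is_partition_key=True),
]
collection = Collection("upsert_count_repro", CollectionSchema(fields),
                        shards_num=2, num_partitions=16)

index = {"index_type": "HNSW", "metric_type": "L2",
         "params": {"M": 8, "efConstruction": 200}}
collection.create_index("float_vector", index)

def rows(start, nb):
    # column-ordered data matching the schema: ids, vectors, partition-key strings
    ids = list(range(start, start + nb))
    vectors = [[random.random() for _ in range(128)] for _ in ids]
    keys = [f"key_{i % 16}" for i in ids]          # placeholder partition-key values
    return [ids, vectors, keys]

# insert 2m rows in batches of 5000 (ni_per), then flush, re-index, and load
for start in range(0, 2_000_000, 5000):
    collection.insert(rows(start, 5000))
collection.flush()
collection.create_index("float_vector", index)
collection.load()

# step 8: the two request types issued concurrently by the benchmark
collection.upsert(rows(1, 2000))                   # upsert ids 1~2000
res = collection.query(expr="", output_fields=["count(*)"])
print(res)                                         # expected count(*): 2,000,000
```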

Milvus Log

No response

Anything else?

test result:

[2024-11-17 00:44:42,243 -  INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-11-17 00:44:42,243 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-11-17 00:44:42,243 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-11-17 00:44:42,243 -  INFO - fouram]: grpc     query                                                                           5554    27(0.49%) |      8       0     224      4 |    0.31        0.00 (stats.py:789)
[2024-11-17 00:44:42,243 -  INFO - fouram]: grpc     upsert                                                                          5624    45(0.80%) |   2897     140   61738   1500 |    0.31        0.00 (stats.py:789)
[2024-11-17 00:44:42,243 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-11-17 00:44:42,243 -  INFO - fouram]:          Aggregated                                                                     11178    72(0.64%) |   1462       0   61738   1500 |    0.62        0.00 (stats.py:789)
[2024-11-17 00:44:42,243 -  INFO - fouram]:  (stats.py:790)
[2024-11-17 00:44:42,244 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'standalone',
            'config_name': 'standalone_8c16m',
            'config': {'standalone': {'resources': {'limits': {'cpu': 8, 'memory': '16Gi'}, 'requests': {'cpu': 8, 'memory': '16Gi'}},
                                      'profiling': {'enabled': True}},
                       'cluster': {'enabled': False},
                       'etcd': {'replicaCount': 1, 'metrics': {'enabled': True, 'podMonitor': {'enabled': True}}},
                       'minio': {'mode': 'standalone', 'metrics': {'podMonitor': {'enabled': True}}},
                       'pulsar': {'enabled': False},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'extraConfigFiles': {'user.yaml': 'quotaAndLimits:\n'
                                                         '  dml:\n'
                                                         '    enabled: true\n'
                                                         '    upsertRate:\n'
                                                         '      max: 0.5\n'
                                                         '    insertRate:\n'
                                                         '      max: 0.5\n'
                                                         '    deleteRate:\n'
                                                         '      max: 0.5\n'
                                                         '  quotaCenterCollectInterval: 1\n'
                                                         '\n'},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus', 'tag': 'master-20241116-00edec2e-amd64'}}},
            'host': 'upsert-count-1783600-1-71-3790-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'scalars_params': {'varchar_1': {'params': {'is_partition_key': True, 'max_length': 100}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': '2m',
                                                    'ni_per': 5000},
                                 'collection_params': {'other_fields': ['varchar_1'], 'shards_num': 2, 'num_partitions': 16},
                                 'load_params': {},
                                 'release_params': {},
                                 'query_params': {},
                                 'search_params': {},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False, 'reset_db': False},
                                 'index_params': {'index_type': 'HNSW', 'index_param': {'M': 8, 'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 1, 'during_time': '5h', 'interval': 20, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'query',
                                                       'weight': 1,
                                                       'params': {'expr': '',
                                                                  'output_fields': ['count(*)'],
                                                                  'check_task': 'check_query_output_count',
                                                                  'check_items': {'query_count': 2000000}}},
                                                      {'type': 'upsert', 'weight': 1, 'params': {'nb': 2000, 'random_id': False, 'start_id': 1}}]},
            'run_id': 2024111637006363,
            'datetime': '2024-11-16 19:01:40.603365',
            'client_version': '2.5.0'},
 'result': {'test_result': {'index': {'RT': 157.6541},
                            'insert': {'total_time': 2227.87, 'VPS': 897.7184, 'batch_time': 5.5697, 'batch': 5000},
                            'flush': {'RT': 3.0349},
                            'load': {'RT': 3.2808},
                            'Locust': {'Aggregated': {'Requests': 11178,
                                                      'Fails': 72,
                                                      'RPS': 0.62,
                                                      'fail_s': 0.01,
                                                      'RT_max': 61738.25,
                                                      'RT_avg': 1462.01,
                                                      'TP50': 1500.0,
                                                      'TP99': 4100.0},
                                       'query': {'Requests': 5554,
                                                 'Fails': 27,
                                                 'RPS': 0.31,
                                                 'fail_s': 0.0,
                                                 'RT_max': 224.45,
                                                 'RT_avg': 8.17,
                                                 'TP50': 4,
                                                 'TP99': 25},
                                       'upsert': {'Requests': 5624,
                                                  'Fails': 45,
                                                  'RPS': 0.31,
                                                  'fail_s': 0.01,
                                                  'RT_max': 61738.25,
                                                  'RT_avg': 2897.76,
                                                  'TP50': 1500.0,
                                                  'TP99': 4200.0}}}}}
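
For readability, the `extraConfigFiles` override embedded above corresponds to the following user.yaml. It enables DML rate limiting and shortens the quota-collection interval (if I read the quota config correctly, the max values are interpreted in MB/s and the interval in seconds):

```yaml
quotaAndLimits:
  dml:
    enabled: true
    upsertRate:
      max: 0.5
    insertRate:
      max: 0.5
    deleteRate:
      max: 0.5
  quotaCenterCollectInterval: 1
```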
wangting0128 added the kind/bug, needs-triage, and test/benchmark labels on Nov 18, 2024
wangting0128 added this to the 2.5.0 milestone on Nov 18, 2024
czs007 assigned czs007 and unassigned yanliang567 on Nov 18, 2024
yanliang567 added the triage/accepted label and removed the needs-triage label on Nov 18, 2024
@wangting0128 (Contributor, Author)

This is a recurring problem.

Same case:
argo task: upsert-count-1731956400
image: master-20241118-3d28d994-amd64

server:

NAME                                                              READY   STATUS      RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-1756400-1-41-1139-etcd-0                             1/1     Running     0                6h      10.104.21.72    4am-node24   <none>           <none>
upsert-count-1756400-1-41-1139-milvus-standalone-59ff46f69v8w2b   1/1     Running     1 (79m ago)      6h      10.104.32.26    4am-node39   <none>           <none>
upsert-count-1756400-1-41-1139-minio-6846794b75-n4r9n             1/1     Running     0                6h      10.104.33.108   4am-node36   <none>           <none>
[screenshot 2024-11-19 10:54:33] [screenshot 2024-11-19 10:54:49]

@czs007 (Collaborator) commented Nov 19, 2024

[profiling screenshots]

Please reopen Pyroscope and run it again. There is a sudden surge in object allocation on the heap.

@liliu-z (Member) commented Nov 19, 2024

[two profiling screenshots]

Please check whether there is any memory leak during upsert.

@czs007 (Collaborator) commented Nov 20, 2024

Caused by commit 66bf254.
Still investigating the root cause.

@XuanYang-cn (Contributor)

/assign

@czs007 (Collaborator) commented Nov 21, 2024

Without commit 66bf254:
[screenshot]

With commit 66bf254:
[screenshot]

congqixia added a commit to congqixia/milvus that referenced this issue Nov 21, 2024
congqixia added a commit to congqixia/milvus that referenced this issue Nov 21, 2024
sre-ci-robot pushed a commit that referenced this issue Nov 22, 2024
@wangting0128 (Contributor, Author) commented Nov 22, 2024

Verification: no OOM.

argo task: upsert-count-792sk
image: master-20241122-cfa1f1f1-amd64

server:

NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-792sk-1-37-4371-etcd-0                               1/1     Running     0               5h45m   10.104.21.44    4am-node24   <none>           <none>
upsert-count-792sk-1-37-4371-milvus-standalone-654958798c-kg8tz   1/1     Running     2 (5h44m ago)   5h45m   10.104.32.91    4am-node39   <none>           <none>
upsert-count-792sk-1-37-4371-minio-7cbd7fb55f-p4fdx               1/1     Running     0               5h45m   10.104.34.24    4am-node37   <none>           <none> 
[screenshot 2024-11-22 16:39:25]

Request RT comparison:
[screenshot]

Upsert RT:
[screenshot 2024-11-22 17:21:34]

Query RT:
[screenshot 2024-11-22 17:22:17]

@wangting0128 (Contributor, Author)

Milvus memory usage is higher than before.

argo task: fouramf-wng45
image: master-20241125-27c22d11-amd64

server:

NAME                                                              READY   STATUS      RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-wng45-54-5994-etcd-0                                      1/1     Running     0                5h44m   10.104.20.12    4am-node22   <none>           <none>
fouramf-wng45-54-5994-milvus-standalone-7bd488d878-tmd9r          1/1     Running     2 (5h44m ago)    5h44m   10.104.16.55    4am-node21   <none>           <none>
fouramf-wng45-54-5994-minio-86764c5d57-gtffx                      1/1     Running     0                5h44m   10.104.32.203   4am-node39   <none>           <none>
[screenshot 2024-11-26 18:22:32]

JsDove pushed a commit to JsDove/milvus that referenced this issue Nov 26, 2024
@yanliang567 (Contributor)

@czs007 do we expect an improvement in memory usage?

@xiaofan-luan (Collaborator)

@yanliang567 we already rolled back this patch. Does that work?

@xiaofan-luan (Collaborator)

Remote load is not an option before we split the log.

@wangting0128 (Contributor, Author)

> @yanliang567 we already rolled back this patch. Does that work?

Which PR was rolled back? Could you link it? Thanks.

@yanliang567 (Contributor)

I think the latest update was run without remote_load enabled. @wangting0128 please help confirm or update with the latest result.

@wangting0128 (Contributor, Author)

Verification with the master image:

argo task: upsert-count-1733770800
image: master-20241209-224c2c8e-amd64

[screenshot 2024-12-10 11:50:35]

@zhagnlu (Contributor) commented Dec 11, 2024

For the newest master version, memory usage is generally normal, as shown in the figure above.

Another problem: the newest compaction is not as good as the last version. In the middle of the test, every version shows a temporary memory decrease, but the newest version does not decrease as much as the last version.

Newest version:
[screenshot]

vs.

Last version:
[screenshot]

As the figures show, the newest version generates too many L1 segments and compaction cannot shrink them as well as the last version did.

@zhagnlu (Contributor) commented Dec 12, 2024

[two screenshots]

The newest compaction failed because writeRecord failed when the datanode wrote a new segment.
Maybe related to PR #37479.
Please @tedxu help check it.

@zhagnlu (Contributor) commented Dec 13, 2024

After searching in more detail, I found the root cause:

  1. The Arrow array buffer size calculation is incorrect, as the PR above fixed; this makes the compaction file sizes incorrect.
  2. Because the segment file sizes are incorrect (generally larger than the actual size), compaction generates too many segments.
  3. The number of compaction segments is limited by a hard-coded maximum of 11, so the pre-allocated segment IDs are exhausted.
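
Point 1 lives in the Go datanode code, but a small pyarrow analogy shows the kind of over-estimate being described: summing whole buffer sizes reports far more bytes than the elements actually reference. The 1,000,000-row array and the 10-element slice below are arbitrary illustration values:

```python
import pyarrow as pa

arr = pa.array(range(1_000_000), type=pa.int64())
view = arr.slice(0, 10)   # a tiny logical view backed by the full ~8 MB buffer

print(view.nbytes)                    # bytes actually referenced by the 10 elements (~80)
print(view.get_total_buffer_size())   # size of the whole underlying buffer (~8,000,000 bytes)
```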

@XuanYang-cn (Contributor)

I think I'll take the 3rd one. It's a TODO from the compact m->n segments story.

> After searching in more detail, I found the root cause:
>
>   1. The Arrow array buffer size calculation is incorrect, as the PR above fixed; this makes the compaction file sizes incorrect.
>   2. Because the segment file sizes are incorrect (generally larger than the actual size), compaction generates too many segments.
>   3. The number of compaction segments is limited by a hard-coded maximum of 11, so the pre-allocated segment IDs are exhausted.

@zhagnlu (Contributor) commented Dec 18, 2024

sre-ci-robot pushed a commit that referenced this issue Dec 18, 2024
@wangting0128 (Contributor, Author)

Verification passed.

argo task: upsert-count-1734548400
image: master-20241218-78438ef4-amd64

[screenshot 2024-12-19 10:57:36] [screenshot 2024-12-19 10:57:57]
