
[Bug]: [benchmark][standalone] Milvus standalone OOM killed frequently in concurrent upsert & query scene #37767

Closed
wangting0128 opened this issue Nov 18, 2024 · 20 comments
Labels: kind/bug, test/benchmark, triage/accepted
Milestone: 2.5.0

@wangting0128 (Contributor)

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20241116-00edec2e-amd64
- Deployment mode(standalone or cluster):standalone
- MQ type(rocksmq, pulsar or kafka):rocksmq    
- SDK version(e.g. pymilvus v2.0.0rc2):2.5.0rc106
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task:master-20241116-00edec2e-amd64

server:

[2024-11-17 00:48:25,718 -  INFO - fouram]: [Base] Deploy initial state: 
I1116 19:04:17.722678    3880 request.go:665] Waited for 1.168431781s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/discovery.k8s.io/v1beta1?timeout=32s
I1116 19:04:27.722923    3880 request.go:665] Waited for 3.596511189s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/actions.summerwind.dev/v1alpha1?timeout=32s
NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-1783600-1-71-3790-etcd-0                             1/1     Running     0               2m55s   10.104.20.132   4am-node22   <none>           <none>
upsert-count-1783600-1-71-3790-milvus-standalone-6d9bf746585977   1/1     Running     2 (2m28s ago)   2m55s   10.104.18.113   4am-node25   <none>           <none>
upsert-count-1783600-1-71-3790-minio-5f74dd55dc-mtwj2             1/1     Running     0               2m55s   10.104.20.131   4am-node22   <none>           <none> (base.py:261)
[2024-11-17 00:48:25,718 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'NAME|upsert-count-1783600-1-71-3790-milvus|upsert-count-1783600-1-71-3790-minio|upsert-count-1783600-1-71-3790-etcd|upsert-count-1783600-1-71-3790-pulsar|upsert-count-1783600-1-71-3790-zookeeper|upsert-count-1783600-1-71-3790-kafka|upsert-count-1783600-1-71-3790-log|upsert-count-1783600-1-71-3790-tikv'  (util_cmd.py:14)
[2024-11-17 00:48:46,791 -  INFO - fouram]: [CliClient] pod details of release(upsert-count-1783600-1-71-3790): 
 I1117 00:48:26.964445    4020 request.go:665] Waited for 1.175551287s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/discovery.k8s.io/v1?timeout=32s
I1117 00:48:37.164327    4020 request.go:665] Waited for 3.797207586s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/storage.k8s.io/v1beta1?timeout=32s
NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-1783600-1-71-3790-etcd-0                             1/1     Running     0               5h47m   10.104.20.132   4am-node22   <none>           <none>
upsert-count-1783600-1-71-3790-milvus-standalone-6d9bf746585977   1/1     Running     7 (73m ago)     5h47m   10.104.18.113   4am-node25   <none>           <none>
upsert-count-1783600-1-71-3790-minio-5f74dd55dc-mtwj2             1/1     Running     0               5h47m   10.104.20.131   4am-node22   <none>           <none>
[screenshot 2024-11-18 15:46:24] [screenshot 2024-11-18 15:52:06]

Before November 14, the daily regression test cases were able to run normally

argo task: upsert-count-1731610800
image: master-20241114-1d06d432-amd64

server:

NAME                                                              READY   STATUS             RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-1710800-1-18-5694-etcd-0                             1/1     Running            0                5h43m   10.104.18.159   4am-node25   <none>           <none>
upsert-count-1710800-1-18-5694-milvus-standalone-5d4db4b74fbkn8   1/1     Running            1 (5h43m ago)    5h43m   10.104.15.189   4am-node20   <none>           <none>
upsert-count-1710800-1-18-5694-minio-54b56cf4b5-tx97s             1/1     Running            0                5h43m   10.104.15.190   4am-node20   <none>           <none>
[screenshot 2024-11-18 15:51:14]

Expected Behavior

No response

Steps To Reproduce

1. deploy a standalone Milvus and reset quotaAndLimits
2. create a collection with fields: 'id' (primary key), 'float_vector' (128 dim), 'varchar_1' (partition key); shards_num=2, num_partitions=16 (per the report data below)
3. build an HNSW index on the 'float_vector' field
4. insert 2m rows
5. flush the collection
6. build the index again
7. load the collection
8. concurrent requests (concurrent number=1; sketched in pymilvus below):
   - upsert: id = 1~2000
   - query: count(*)
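
A minimal pymilvus sketch of the steps above, assuming a local standalone endpoint. The schema fields, shard/partition counts, batch size (ni_per=5000), index params, and the count(*) check are taken from the report data further down; the collection name and the random value generators are illustrative placeholders, and the real test drives step 8 concurrently through a locust harness rather than inline calls:

```python
import random
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="127.0.0.1", port="19530")  # assumed standalone endpoint

fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=128),
    FieldSchema("varchar_1", DataType.VARCHAR, max_length=100, is_partition_key=True),
]
collection = Collection("upsert_count_repro", CollectionSchema(fields),
                        shards_num=2, num_partitions=16)

index = {"index_type": "HNSW", "metric_type": "L2",
         "params": {"M": 8, "efConstruction": 200}}
collection.create_index("float_vector", index)

def rows(start, nb):
    # column-ordered data matching the schema: ids, vectors, partition-key strings
    ids = list(range(start, start + nb))
    vectors = [[random.random() for _ in range(128)] for _ in ids]
    keys = [f"key_{i % 16}" for i in ids]          # placeholder partition-key values
    return [ids, vectors, keys]

# insert 2m rows in batches of 5000 (ni_per), then flush, re-index, and load
for start in range(0, 2_000_000, 5000):
    collection.insert(rows(start, 5000))
collection.flush()
collection.create_index("float_vector", index)
collection.load()

# step 8: the two request types issued concurrently by the benchmark
collection.upsert(rows(1, 2000))                   # upsert ids 1~2000
res = collection.query(expr="", output_fields=["count(*)"])
print(res)                                         # expected count(*): 2,000,000
```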

Milvus Log

No response

Anything else?

test result:

[2024-11-17 00:44:42,243 -  INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-11-17 00:44:42,243 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-11-17 00:44:42,243 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-11-17 00:44:42,243 -  INFO - fouram]: grpc     query                                                                           5554    27(0.49%) |      8       0     224      4 |    0.31        0.00 (stats.py:789)
[2024-11-17 00:44:42,243 -  INFO - fouram]: grpc     upsert                                                                          5624    45(0.80%) |   2897     140   61738   1500 |    0.31        0.00 (stats.py:789)
[2024-11-17 00:44:42,243 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-11-17 00:44:42,243 -  INFO - fouram]:          Aggregated                                                                     11178    72(0.64%) |   1462       0   61738   1500 |    0.62        0.00 (stats.py:789)
[2024-11-17 00:44:42,243 -  INFO - fouram]:  (stats.py:790)
[2024-11-17 00:44:42,244 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'standalone',
            'config_name': 'standalone_8c16m',
            'config': {'standalone': {'resources': {'limits': {'cpu': 8, 'memory': '16Gi'}, 'requests': {'cpu': 8, 'memory': '16Gi'}},
                                      'profiling': {'enabled': True}},
                       'cluster': {'enabled': False},
                       'etcd': {'replicaCount': 1, 'metrics': {'enabled': True, 'podMonitor': {'enabled': True}}},
                       'minio': {'mode': 'standalone', 'metrics': {'podMonitor': {'enabled': True}}},
                       'pulsar': {'enabled': False},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'extraConfigFiles': {'user.yaml': 'quotaAndLimits:\n'
                                                         '  dml:\n'
                                                         '    enabled: true\n'
                                                         '    upsertRate:\n'
                                                         '      max: 0.5\n'
                                                         '    insertRate:\n'
                                                         '      max: 0.5\n'
                                                         '    deleteRate:\n'
                                                         '      max: 0.5\n'
                                                         '  quotaCenterCollectInterval: 1\n'
                                                         '\n'},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus', 'tag': 'master-20241116-00edec2e-amd64'}}},
            'host': 'upsert-count-1783600-1-71-3790-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'scalars_params': {'varchar_1': {'params': {'is_partition_key': True, 'max_length': 100}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': '2m',
                                                    'ni_per': 5000},
                                 'collection_params': {'other_fields': ['varchar_1'], 'shards_num': 2, 'num_partitions': 16},
                                 'load_params': {},
                                 'release_params': {},
                                 'query_params': {},
                                 'search_params': {},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False, 'reset_db': False},
                                 'index_params': {'index_type': 'HNSW', 'index_param': {'M': 8, 'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 1, 'during_time': '5h', 'interval': 20, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'query',
                                                       'weight': 1,
                                                       'params': {'expr': '',
                                                                  'output_fields': ['count(*)'],
                                                                  'check_task': 'check_query_output_count',
                                                                  'check_items': {'query_count': 2000000}}},
                                                      {'type': 'upsert', 'weight': 1, 'params': {'nb': 2000, 'random_id': False, 'start_id': 1}}]},
            'run_id': 2024111637006363,
            'datetime': '2024-11-16 19:01:40.603365',
            'client_version': '2.5.0'},
 'result': {'test_result': {'index': {'RT': 157.6541},
                            'insert': {'total_time': 2227.87, 'VPS': 897.7184, 'batch_time': 5.5697, 'batch': 5000},
                            'flush': {'RT': 3.0349},
                            'load': {'RT': 3.2808},
                            'Locust': {'Aggregated': {'Requests': 11178,
                                                      'Fails': 72,
                                                      'RPS': 0.62,
                                                      'fail_s': 0.01,
                                                      'RT_max': 61738.25,
                                                      'RT_avg': 1462.01,
                                                      'TP50': 1500.0,
                                                      'TP99': 4100.0},
                                       'query': {'Requests': 5554,
                                                 'Fails': 27,
                                                 'RPS': 0.31,
                                                 'fail_s': 0.0,
                                                 'RT_max': 224.45,
                                                 'RT_avg': 8.17,
                                                 'TP50': 4,
                                                 'TP99': 25},
                                       'upsert': {'Requests': 5624,
                                                  'Fails': 45,
                                                  'RPS': 0.31,
                                                  'fail_s': 0.01,
                                                  'RT_max': 61738.25,
                                                  'RT_avg': 2897.76,
                                                  'TP50': 1500.0,
                                                  'TP99': 4200.0}}}}}
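
For readability, the `extraConfigFiles` override embedded above corresponds to the following user.yaml. It enables DML rate limiting and shortens the quota-collection interval (if I read the quota config correctly, the max values are interpreted in MB/s and the interval in seconds):

```yaml
quotaAndLimits:
  dml:
    enabled: true
    upsertRate:
      max: 0.5
    insertRate:
      max: 0.5
    deleteRate:
      max: 0.5
  quotaCenterCollectInterval: 1
```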
wangting0128 added the kind/bug, needs-triage, and test/benchmark labels on Nov 18, 2024
wangting0128 added this to the 2.5.0 milestone on Nov 18, 2024
czs007 assigned czs007 and unassigned yanliang567 on Nov 18, 2024
yanliang567 added the triage/accepted label and removed the needs-triage label on Nov 18, 2024
@wangting0128 (Contributor, Author)

This is a recurring problem.

Same case:
argo task: upsert-count-1731956400
image: master-20241118-3d28d994-amd64

server:

NAME                                                              READY   STATUS      RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-1756400-1-41-1139-etcd-0                             1/1     Running     0                6h      10.104.21.72    4am-node24   <none>           <none>
upsert-count-1756400-1-41-1139-milvus-standalone-59ff46f69v8w2b   1/1     Running     1 (79m ago)      6h      10.104.32.26    4am-node39   <none>           <none>
upsert-count-1756400-1-41-1139-minio-6846794b75-n4r9n             1/1     Running     0                6h      10.104.33.108   4am-node36   <none>           <none>
[screenshot 2024-11-19 10:54:33] [screenshot 2024-11-19 10:54:49]

@czs007 (Collaborator) commented Nov 19, 2024

[profiling screenshots]

Please reopen Pyroscope and run it again. There is a sudden surge in object allocation on the heap.

@liliu-z (Member) commented Nov 19, 2024

[two profiling screenshots]

Please check whether there is any memory leak during upsert.

@czs007 (Collaborator) commented Nov 20, 2024

Caused by commit 66bf254.
Still investigating the root cause.

@XuanYang-cn (Contributor)

/assign

@czs007 (Collaborator) commented Nov 21, 2024

Without commit 66bf254:
[screenshot]

With commit 66bf254:
[screenshot]

congqixia added a commit to congqixia/milvus that referenced this issue Nov 21, 2024
congqixia added a commit to congqixia/milvus that referenced this issue Nov 21, 2024
sre-ci-robot pushed a commit that referenced this issue Nov 22, 2024
@wangting0128 (Contributor, Author) commented Nov 22, 2024

Verification: no OOM.

argo task: upsert-count-792sk
image: master-20241122-cfa1f1f1-amd64

server:

NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-792sk-1-37-4371-etcd-0                               1/1     Running     0               5h45m   10.104.21.44    4am-node24   <none>           <none>
upsert-count-792sk-1-37-4371-milvus-standalone-654958798c-kg8tz   1/1     Running     2 (5h44m ago)   5h45m   10.104.32.91    4am-node39   <none>           <none>
upsert-count-792sk-1-37-4371-minio-7cbd7fb55f-p4fdx               1/1     Running     0               5h45m   10.104.34.24    4am-node37   <none>           <none> 
[screenshot 2024-11-22 16:39:25]

Request RT comparison:
[screenshot]

Upsert RT:
[screenshot 2024-11-22 17:21:34]

Query RT:
[screenshot 2024-11-22 17:22:17]

@wangting0128 (Contributor, Author)

Milvus memory usage is higher than before.

argo task: fouramf-wng45
image: master-20241125-27c22d11-amd64

server:

NAME                                                              READY   STATUS      RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-wng45-54-5994-etcd-0                                      1/1     Running     0                5h44m   10.104.20.12    4am-node22   <none>           <none>
fouramf-wng45-54-5994-milvus-standalone-7bd488d878-tmd9r          1/1     Running     2 (5h44m ago)    5h44m   10.104.16.55    4am-node21   <none>           <none>
fouramf-wng45-54-5994-minio-86764c5d57-gtffx                      1/1     Running     0                5h44m   10.104.32.203   4am-node39   <none>           <none>
[screenshot 2024-11-26 18:22:32]

JsDove pushed a commit to JsDove/milvus that referenced this issue Nov 26, 2024
@yanliang567 (Contributor)

@czs007 do we expect an improvement in memory usage?

@xiaofan-luan (Collaborator)

@yanliang567 we already rolled back this patch. Does that work?

@xiaofan-luan (Collaborator)

Remote load is not an option before we split the log.

@wangting0128 (Contributor, Author)

> @yanliang567 we already rolled back this patch. Does that work?

Which PR was rolled back? Could you link it? Thanks.

@yanliang567 (Contributor)

I think the latest update was run without remote_load enabled. @wangting0128 please help confirm or update with the latest result.

@wangting0128 (Contributor, Author)

Verification with the master image:

argo task: upsert-count-1733770800
image: master-20241209-224c2c8e-amd64

[screenshot 2024-12-10 11:50:35]

@zhagnlu (Contributor) commented Dec 11, 2024

For the newest master version, memory usage is generally normal, as shown in the figure above.

Another problem: the newest compaction is not as good as the last version. In the middle of the test, every version shows a temporary memory decrease, but the newest version does not decrease as much as the last version.

Newest version:
[screenshot]

vs.

Last version:
[screenshot]

As the figures show, the newest version generates too many L1 segments and compaction cannot shrink them as well as the last version did.

@zhagnlu (Contributor) commented Dec 12, 2024

[two screenshots]

The newest compaction failed because writeRecord failed when the datanode wrote a new segment.
Maybe related to PR #37479.
Please @tedxu help check it.

@zhagnlu (Contributor) commented Dec 13, 2024

After searching in more detail, I found the root cause:

  1. The Arrow array buffer size calculation is incorrect, as the PR above fixed; this makes the compaction file sizes incorrect.
  2. Because the segment file sizes are incorrect (generally larger than the actual size), compaction generates too many segments.
  3. The number of compaction segments is limited by a hard-coded maximum of 11, so the pre-allocated segment IDs are exhausted.
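
Point 1 lives in the Go datanode code, but a small pyarrow analogy shows the kind of over-estimate being described: summing whole buffer sizes reports far more bytes than the elements actually reference. The 1,000,000-row array and the 10-element slice below are arbitrary illustration values:

```python
import pyarrow as pa

arr = pa.array(range(1_000_000), type=pa.int64())
view = arr.slice(0, 10)   # a tiny logical view backed by the full ~8 MB buffer

print(view.nbytes)                    # bytes actually referenced by the 10 elements (~80)
print(view.get_total_buffer_size())   # size of the whole underlying buffer (~8,000,000 bytes)
```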

@XuanYang-cn (Contributor)

I think I'll take the 3rd one. It's a TODO from the compact m->n segments story.

> After searching in more detail, I found the root cause:
>
>   1. The Arrow array buffer size calculation is incorrect, as the PR above fixed; this makes the compaction file sizes incorrect.
>   2. Because the segment file sizes are incorrect (generally larger than the actual size), compaction generates too many segments.
>   3. The number of compaction segments is limited by a hard-coded maximum of 11, so the pre-allocated segment IDs are exhausted.

@zhagnlu (Contributor) commented Dec 18, 2024

sre-ci-robot pushed a commit that referenced this issue Dec 18, 2024
@wangting0128 (Contributor, Author)

Verification passed.

argo task: upsert-count-1734548400
image: master-20241218-78438ef4-amd64

[screenshot 2024-12-19 10:57:36] [screenshot 2024-12-19 10:57:57]
