[Bug]: Refresh load failed with error call query coordinator LoadCollection: collection not fully loaded in ci case #37166

Closed
1 task done
zhuwenxing opened this issue Oct 25, 2024 · 34 comments
Assignees
Labels
ci/bug ci/e2e kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After an import completes, a refresh load (`c.load(_refresh=True)`) fails.

[pytest : test] __________ TestCreateImportJob.test_job_e2e[False-False-False-2-3000] __________

[pytest : test] [gw1] linux -- Python 3.8.17 /usr/local/bin/python3

[pytest : test] 

[pytest : test] self = <test_jobs_operation.TestCreateImportJob object at 0x7f2c804995e0>

[pytest : test] insert_num = 3000, import_task_num = 2, auto_id = False

[pytest : test] is_partition_key = False, enable_dynamic_field = False

[pytest : test] 

[pytest : test]     @pytest.mark.parametrize("insert_num", [3000])

[pytest : test]     @pytest.mark.parametrize("import_task_num", [2])

[pytest : test]     @pytest.mark.parametrize("auto_id", [True, False])

[pytest : test]     @pytest.mark.parametrize("is_partition_key", [True, False])

[pytest : test]     @pytest.mark.parametrize("enable_dynamic_field", [True, False])

[pytest : test]     def test_job_e2e(self, insert_num, import_task_num, auto_id, is_partition_key, enable_dynamic_field):

[pytest : test]         # create collection

[pytest : test]         name = gen_collection_name()

[pytest : test]         dim = 128

[pytest : test]         payload = {

[pytest : test]             "collectionName": name,

[pytest : test]             "schema": {

[pytest : test]                 "autoId": auto_id,

[pytest : test]                 "enableDynamicField": enable_dynamic_field,

[pytest : test]                 "fields": [

[pytest : test]                     {"fieldName": "book_id", "dataType": "Int64", "isPrimary": True, "elementTypeParams": {}},

[pytest : test]                     {"fieldName": "word_count", "dataType": "Int64", "isPartitionKey": is_partition_key, "elementTypeParams": {}},

[pytest : test]                     {"fieldName": "book_describe", "dataType": "VarChar", "elementTypeParams": {"max_length": "256"}},

[pytest : test]                     {"fieldName": "book_intro", "dataType": "FloatVector", "elementTypeParams": {"dim": f"{dim}"}}

[pytest : test]                 ]

[pytest : test]             },

[pytest : test]             "indexParams": [{"fieldName": "book_intro", "indexName": "book_intro_vector", "metricType": "L2"}]

[pytest : test]         }

[pytest : test]         rsp = self.collection_client.collection_create(payload)

[pytest : test]     

[pytest : test]         # upload file to storage

[pytest : test]         data = []

[pytest : test]         for i in range(insert_num):

[pytest : test]             tmp = {

[pytest : test]                 "word_count": i,

[pytest : test]                 "book_describe": f"book_{i}",

[pytest : test]                 "book_intro": [np.float32(random.random()) for _ in range(dim)]

[pytest : test]             }

[pytest : test]             if not auto_id:

[pytest : test]                 tmp["book_id"] = i

[pytest : test]             if enable_dynamic_field:

[pytest : test]                 tmp.update({f"dynamic_field_{i}": i})

[pytest : test]             data.append(tmp)

[pytest : test]         # dump data to file

[pytest : test]         file_name = f"bulk_insert_data_{uuid4()}.json"

[pytest : test]         file_path = f"/tmp/{file_name}"

[pytest : test]         with open(file_path, "w") as f:

[pytest : test]             json.dump(data, f, cls=NumpyEncoder)

[pytest : test]         # upload file to minio storage

[pytest : test]         self.storage_client.upload_file(file_path, file_name)

[pytest : test]     

[pytest : test]         # create import job

[pytest : test]         payload = {

[pytest : test]             "collectionName": name,

[pytest : test]             "files": [[file_name]],

[pytest : test]         }

[pytest : test]         for i in range(import_task_num):

[pytest : test]             rsp = self.import_job_client.create_import_jobs(payload)

[pytest : test]         # list import job

[pytest : test]         payload = {

[pytest : test]             "collectionName": name,

[pytest : test]         }

[pytest : test]         rsp = self.import_job_client.list_import_jobs(payload)

[pytest : test]     

[pytest : test]         # get import job progress

[pytest : test]         for task in rsp['data']["records"]:

[pytest : test]             task_id = task['jobId']

[pytest : test]             finished = False

[pytest : test]             t0 = time.time()

[pytest : test]     

[pytest : test]             while not finished:

[pytest : test]                 rsp = self.import_job_client.get_import_job_progress(task_id)

[pytest : test]                 if rsp['data']['state'] == "Completed":

[pytest : test]                     finished = True

[pytest : test]                 time.sleep(5)

[pytest : test]                 if time.time() - t0 > IMPORT_TIMEOUT:

[pytest : test]                     assert False, "import job timeout"

[pytest : test]         c = Collection(name)

[pytest : test] >       c.load(_refresh=True)

[pytest : test] 

[pytest : test] testcases/test_jobs_operation.py:103: 

[pytest : test] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py:429: in load

[pytest : test]     conn.load_collection(

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/decorators.py:141: in handler

[pytest : test]     raise e from e

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/decorators.py:137: in handler

[pytest : test]     return func(*args, **kwargs)

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/decorators.py:176: in handler

[pytest : test]     return func(self, *args, **kwargs)

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/decorators.py:116: in handler

[pytest : test]     raise e from e

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/decorators.py:86: in handler

[pytest : test]     return func(*args, **kwargs)

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py:1171: in load_collection

[pytest : test]     check_status(response)

[pytest : test] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[pytest : test] 

[pytest : test] status = error_code: UnexpectedError

[pytest : test] reason: "call query coordinator LoadCollection: collection not fully loaded: collection no...l query coordinator LoadCollection: collection not fully loaded: collection not loaded[collection=453468002260052459]"

[pytest : test] 

[pytest : test] 

[pytest : test]     def check_status(status: Status):

[pytest : test]         if status.code != 0 or status.error_code != 0:

[pytest : test] >           raise MilvusException(status.code, status.reason, status.error_code)

[pytest : test] E           pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=call query coordinator LoadCollection: collection not fully loaded: collection not loaded[collection=453468002260052459])>

[pytest : test] 

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/client/utils.py:63: MilvusException

[pytest : test] ------------------------------ Captured log call -------------------------------

[pytest : test] [2024-10-25 08:27:42 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/collections/create, 

[pytest : test] cost time: 2.218313694000244, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "collectionName": "test_collection_2024_10_25_08_27_39_551139EaRekPPo",

[pytest : test]     "schema": {

[pytest : test]         "autoId": false,

[pytest : test]         "enableDynamicField": false,

[pytest : test]         "fields": [

[pytest : test]             {

[pytest : test]                 "fieldName": "book_id",

[pytest : test]                 "dataType": "Int64",

[pytest : test]                 "isPrimary": true,

[pytest : test]                 "elementTypeParams": {}

[pytest : test]             },

[pytest : test]             {

[pytest : test]                 "fieldName": "word_count",

[pytest : test]                 "dataType": "Int64",

[pytest : test]                 "isPartitionKey": false,

[pytest : test]                 "elementTypeParams": {}

[pytest : test]             },

[pytest : test]             {

[pytest : test]                 "fieldName": "book_describe",

[pytest : test]                 "dataType": "VarChar",

[pytest : test]                 "elementTypeParams": {

[pytest : test]                     "max_length": "256"

[pytest : test]                 }

[pytest : test]             },

[pytest : test]             {

[pytest : test]                 "fieldName": "book_intro",

[pytest : test]                 "dataType": "FloatVector",

[pytest : test]                 "elementTypeParams": {

[pytest : test]                     "dim": "128"

[pytest : test]                 }

[pytest : test]             }

[pytest : test]         ]

[pytest : test]     },

[pytest : test]     "indexParams": [

[pytest : test]         {

[pytest : test]             "fieldName": "book_intro",

[pytest : test]             "indexName": "book_intro_vector",

[pytest : test]             "metricType": "L2"

[pytest : test]         }

[pytest : test]     ],

[pytest : test]     "params": {

[pytest : test]         "consistencyLevel": "Strong"

[pytest : test]     }

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{}} (milvus.py:77)

[pytest : test] [2024-10-25 08:27:45 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/create, 

[pytest : test] cost time: 0.009305953979492188, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "collectionName": "test_collection_2024_10_25_08_27_39_551139EaRekPPo",

[pytest : test]     "files": [

[pytest : test]         [

[pytest : test]             "bulk_insert_data_b14dc38d-b36a-41f2-95f2-0bd4931d52a0.json"

[pytest : test]         ]

[pytest : test]     ],

[pytest : test]     "dbName": "default"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"jobId":"453468002260053157"}} (milvus.py:77)

[pytest : test] [2024-10-25 08:27:45 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/create, 

[pytest : test] cost time: 0.010378837585449219, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "collectionName": "test_collection_2024_10_25_08_27_39_551139EaRekPPo",

[pytest : test]     "files": [

[pytest : test]         [

[pytest : test]             "bulk_insert_data_b14dc38d-b36a-41f2-95f2-0bd4931d52a0.json"

[pytest : test]         ]

[pytest : test]     ],

[pytest : test]     "dbName": "default"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"jobId":"453468002260053160"}} (milvus.py:77)

[pytest : test] [2024-10-25 08:27:45 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/list, 

[pytest : test] cost time: 0.006852626800537109, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "collectionName": "test_collection_2024_10_25_08_27_39_551139EaRekPPo",

[pytest : test]     "dbName": "default"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"records":[{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","jobId":"453468002260053160","progress":0,"state":"Pending"},{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","jobId":"453468002260053157","progress":0,"state":"Pending"}]}} (milvus.py:77)

[pytest : test] [2024-10-25 08:27:45 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/get_progress, 

[pytest : test] cost time: 0.006412982940673828, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "dbName": "default",

[pytest : test]     "jobID": "453468002260053160"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","completeTime":"","details":[],"fileSize":0,"importedRows":0,"jobId":"453468002260053160","progress":0,"state":"Pending","totalRows":0}} (milvus.py:77)

[pytest : test] [2024-10-25 08:27:50 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/get_progress, 

[pytest : test] cost time: 0.006389141082763672, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "dbName": "default",

[pytest : test]     "jobID": "453468002260053160"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","completeTime":"","details":[],"fileSize":0,"importedRows":0,"jobId":"453468002260053160","progress":10,"state":"Importing","totalRows":0}} (milvus.py:77)

[pytest : test] [2024-10-25 08:27:55 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/get_progress, 

[pytest : test] cost time: 0.005665779113769531, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "dbName": "default",

[pytest : test]     "jobID": "453468002260053160"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","completeTime":"","details":[{"completeTime":"2024-10-25T08:27:54Z","fileName":"[bulk_insert_data_b14dc38d-b36a-...34610,"importedRows":3000,"progress":100,"state":"Completed","totalRows":3000}],"fileSize":8034610,"importedRows":3000,"jobId":"453468002260053160","progress":70,"state":"Importing","totalRows":3000}} (milvus.py:77)

[pytest : test] [2024-10-25 08:28:00 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/get_progress, 

[pytest : test] cost time: 0.005645036697387695, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "dbName": "default",

[pytest : test]     "jobID": "453468002260053160"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","completeTime":"","details":[{"completeTime":"2024-10-25T08:27:54Z","fileName":"[bulk_insert_data_b14dc38d-b36a-...34610,"importedRows":3000,"progress":100,"state":"Completed","totalRows":3000}],"fileSize":8034610,"importedRows":3000,"jobId":"453468002260053160","progress":80,"state":"Importing","totalRows":3000}} (milvus.py:77)

[pytest : test] [2024-10-25 08:28:05 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/get_progress, 

[pytest : test] cost time: 0.00798344612121582, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "dbName": "default",

[pytest : test]     "jobID": "453468002260053160"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","completeTime":"2024-10-25T08:28:02Z","details":[{"completeTime":"2024-10-25T08:27:54Z","fileName":"[bulk_insert...4610,"importedRows":3000,"progress":100,"state":"Completed","totalRows":3000}],"fileSize":8034610,"importedRows":3000,"jobId":"453468002260053160","progress":100,"state":"Completed","totalRows":3000}} (milvus.py:77)

[pytest : test] [2024-10-25 08:28:10 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/get_progress, 

[pytest : test] cost time: 0.0065708160400390625, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "dbName": "default",

[pytest : test]     "jobID": "453468002260053157"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","completeTime":"2024-10-25T08:28:02Z","details":[{"completeTime":"2024-10-25T08:27:54Z","fileName":"[bulk_insert...4610,"importedRows":3000,"progress":100,"state":"Completed","totalRows":3000}],"fileSize":8034610,"importedRows":3000,"jobId":"453468002260053157","progress":100,"state":"Completed","totalRows":3000}} (milvus.py:77)

[pytest : test] [2024-10-25 08:28:15 - ERROR - pymilvus.decorators]: RPC error: [load_collection], <MilvusException: (code=65535, message=call query coordinator LoadCollection: collection not fully loaded: collection not loaded[collection=453468002260052459])>, <Time:{'RPC start': '2024-10-25 08:28:15.288311', 'RPC error': '2024-10-25 08:28:15.290628'}> (decorators.py:140)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed ci job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-37148/1/pipeline

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 25, 2024
@zhuwenxing zhuwenxing added this to the 2.5.0 milestone Oct 25, 2024
@xiaofan-luan
Collaborator

/assign @weiliu1031

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 26, 2024
@yanliang567 yanliang567 removed their assignment Oct 26, 2024
@zhuwenxing
Contributor Author

failed ci job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-37148/4/pipeline

This issue currently has a relatively high reproduction probability.
@weiliu1031

@xiaofan-luan
Collaborator

/assign @bigsheeper

@zhuwenxing
Contributor Author

zhuwenxing commented Oct 30, 2024

@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Oct 30, 2024
@zhuwenxing
Contributor Author

@bigsheeper
Contributor

RESTful load operations and quick-setup create-collection operations don't wait for loading to finish. We need to add a get-load-state check and wait for loading to complete before importing.
image
Please help modify all related cases. @zhuwenxing
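
A minimal sketch of the wait suggested above, assuming the test polls a load-state check until it reports loaded or a timeout expires. The helper name `wait_until` and its default values are hypothetical; the pymilvus calls appear only in a comment because they need a running Milvus instance.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> float:
    """Poll `predicate` until it returns True and return the seconds waited.

    Raises TimeoutError if `timeout` seconds pass before the predicate holds.
    """
    start = time.monotonic()
    while True:
        if predicate():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


# Hypothetical use in a test case (assumes pymilvus and a reachable Milvus):
#   from pymilvus import utility
#   from pymilvus.client.types import LoadState
#   wait_until(lambda: utility.load_state(name) == LoadState.Loaded, timeout=60)
#   ... create the import jobs and wait for them to complete ...
#   Collection(name).load(_refresh=True)
```

The predicate is checked before the deadline test, so a collection that is already loaded returns immediately, and a slow load fails with an explicit timeout error rather than a "collection not fully loaded" error from the later refresh load.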

@bigsheeper
Contributor

/assign @zhuwenxing

@zhuwenxing
Contributor Author

[pytest : test]         c = Collection(name)

[pytest : test] >       c.load(_refresh=True)

The failed step is c.load(_refresh=True), so it should not be related to the RESTful API?

Do you mean that the call to c.load(_refresh=True) needs to wait until the previous load has completed?

@bigsheeper
Contributor

bigsheeper commented Nov 4, 2024

Do you mean that the call to c.load(_refresh=True) needs to wait until the previous load has completed?

Yes. So the test cases need to be updated to wait for loading to complete. You can add a timeout to the wait process.

BTW, there's also an issue on the server side. The import process runs for several dozen seconds, and it's problematic that the collection still hasn't finished loading in that time. This is related to issue #37395.

@bigsheeper
Contributor

After the test cases finished, many collections weren't dropped, and they occupied the pool of the target observer scheduler. This prevented newly loaded collections from updating the current target, which in turn made loading slow and caused timeouts.
image
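
On the test side, one way to avoid leaving collections behind is to register every created name and drop it on exit, even when the test fails. This is only a sketch: `collection_cleanup` is a hypothetical helper, and the `drop` callable is an assumption standing in for whatever the test suite uses to drop a collection.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def collection_cleanup(drop: Callable[[str], None]) -> Iterator[List[str]]:
    """Yield a list for collection names and drop every listed name on exit."""
    created: List[str] = []
    try:
        yield created
    finally:
        # Runs even when the test body raises, so failed tests do not leak collections.
        for name in reversed(created):
            try:
                drop(name)
            except Exception as exc:  # keep going so one failure does not skip the rest
                print(f"failed to drop {name}: {exc}")
```

In a pytest suite this could be wrapped in a fixture, with `drop` set to something like pymilvus `utility.drop_collection`; the test appends each generated collection name to the yielded list right after creating it.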

sre-ci-robot pushed a commit that referenced this issue Nov 7, 2024
issue: #37166
Caused by the misuse of timer.Reset, which made the dispatcher fail to send
messages to the virtual channel buffer. The dispatcher then split again and
again while holding the dispatcher manager's lock, which blocked the
channel-watching progress.

This PR fixes the misuse of timer.Reset.

Signed-off-by: Wei Liu <[email protected]>
@xiaofan-luan
Collaborator

/assign @zhuwenxing
Could you verify this issue?

@xiaofan-luan
Collaborator

Actually, I don't think adding more concurrency at the proxy would solve the problem.

There is no reason any DDL operation should cost more than 9s.

It's obviously caused by the abuse of some DDL requests, such as ListIndexes blocking the whole thing.

We need to fix more bottlenecks and try to batch some of the requests before we add more concurrency.

@xiaofan-luan
Collaborator

To support 10000 collections * 1000 partitions,
we will see at least 10M segments, and the current implementation becomes a huge problem.

@chyezh
Contributor

chyezh commented Nov 18, 2024

Actually, I don't think adding more concurrency at the proxy would solve the problem.

There is no reason any DDL operation should cost more than 9s.

It's obviously caused by the abuse of some DDL requests, such as ListIndexes blocking the whole thing.

We need to fix more bottlenecks and try to batch some of the requests before we add more concurrency.

Yes, indeed.
The current implementation of Milvus uses too many RPCs to finish one distributed task.
Most current DDL implementations execute all of their distributed tasks synchronously.
Those distributed tasks should be implemented as background async tasks with a polling check.
The cost of the asynchronous implementation would then depend only on the meta service or the WAL.
An RPC concurrency limit always exists.

  1. Merging all the coords together will decrease the cost of most operations.
  2. Splitting some functions from the coord into the nodes will decrease the coord dependency.
  3. Fix the determined DDL workflows with an asynchronous version.
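
To illustrate the "background async task with a polling check" idea in general terms, here is a small Python sketch. It is not Milvus code: the class `BackgroundTasks`, its method names, and the state strings are all hypothetical. The caller gets a task id back immediately and checks the state later with a cheap lookup.

```python
import threading
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class BackgroundTasks:
    """Accept work, return a task id at once, and expose a cheap state check."""

    def __init__(self, workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._futures: Dict[str, Future] = {}
        self._lock = threading.Lock()

    def submit(self, work: Callable[[], object]) -> str:
        """Queue `work` on the pool and return its id without waiting for it."""
        task_id = uuid.uuid4().hex
        future = self._pool.submit(work)
        with self._lock:
            self._futures[task_id] = future
        return task_id

    def state(self, task_id: str) -> str:
        """Report "InProgress", "Completed", or "Failed" without blocking."""
        with self._lock:
            future = self._futures[task_id]
        if not future.done():
            return "InProgress"
        return "Failed" if future.exception() is not None else "Completed"
```

With this shape, the request that starts the work only pays for queueing it, and the client polls `state(task_id)` at its own pace, much like the import job progress polling in the test case above.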

@xiaofan-luan
Collaborator

Let's do this. Merging all the coordinators into one is not a bad idea.

@xiaofan-luan
Collaborator

@zhuwenxing
Contributor Author

Load times are not very stable. Previously, setting the load timeout to 5s was sufficient, but in subsequent CI runs, load timeouts still fail quite often, even when the timeout is increased to 10s.

@zhuwenxing
Contributor Author

/assign @smellthemoon

/unassign

@liliu-z
Member

liliu-z commented Nov 19, 2024

still see it at https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-37318/13/pipeline

@smellthemoon could you help on fixing it

This looks like an import timeout.

@liliu-z
Member

liliu-z commented Nov 19, 2024

@bigsheeper

Load still cannot finish within 10s. Failed CI job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-37596/2/pipeline

Not sure why I cannot get into this link.

We've met some slow DDL issues caused by resource racing. Please help take a look, @SimFG.

@chyezh
Contributor

chyezh commented Nov 20, 2024

May be related to #37851.

@yanliang567
Contributor

/assign @SimFG

bigsheeper added a commit to bigsheeper/milvus that referenced this issue Nov 21, 2024
When there're a lot of loaded collections, they would occupy the target
observer scheduler’s pool. This prevents loading collections from
updating the current target in time, slowing down the load process.
This PR adds a separate target dispatcher for loading collections.

issue: milvus-io#37166

---------

Signed-off-by: bigsheeper <[email protected]>
czs007 pushed a commit that referenced this issue Nov 21, 2024
Remove unnecessary ListIndex and DescribeCollection RPC call during
loading.

issue: #37166,
#37630

pr: #37741

Signed-off-by: bigsheeper <[email protected]>
czs007 pushed a commit that referenced this issue Nov 21, 2024
When there're a lot of loaded collections, they would occupy the target
observer scheduler’s pool. This prevents loading collections from
updating the current target in time, slowing down the load process. This
PR adds a separate target dispatcher for loading collections.

issue: #37166

pr: #37454

Signed-off-by: bigsheeper <[email protected]>
@yanliang567
Contributor

/assign @zhuwenxing
/unassign @SimFG

@sre-ci-robot sre-ci-robot assigned zhuwenxing and unassigned SimFG Nov 21, 2024
@zhuwenxing zhuwenxing removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Nov 22, 2024
sre-ci-robot pushed a commit that referenced this issue Nov 28, 2024
When there're a lot of loaded collections, they would occupy the target
observer scheduler’s pool. This prevents loading collections from
updating the current target in time, slowing down the load process. This
PR adds a separate target dispatcher for loading collections.

issue: #37166

pr: #37454

---------

Signed-off-by: bigsheeper <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Nov 29, 2024
Remove unnecessary ListIndex and DescribeCollection RPC call during
loading.

issue: #37166,
#37630

pr: #37741

Signed-off-by: bigsheeper <[email protected]>
10 participants