[Bug]: Too many L0 segments caused query node OOM. #33278

Closed
1 task done
congguosn opened this issue May 22, 2024 · 3 comments
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@congguosn

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.4.1
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    kafka
- SDK version(e.g. pymilvus v2.0.0rc2): v2.4
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

We use upserts to ingest data into Milvus, which generates many delete messages. If the delete messages are not consumed in time, it seems that a large number of L0 segments are generated. If we then restart the query node that serves the growing segments, it tries to create tasks for all of those L0 segments and runs out of memory (OOM).

Expected Behavior

Loading L0 segments should not cause the query node to be OOM-killed.

Steps To Reproduce

Ingest data into Milvus with upserts at a high rate.
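
A minimal sketch of such a workload, for illustration only. It assumes the upsert call is pymilvus's `MilvusClient.upsert(collection_name=..., data=...)`; the field names (`id`, `vector`), the vector dimension, the key range, and the batch sizes are hypothetical and are not taken from this report. The driver receives the upsert call as a parameter, so it can be dry-run without a cluster.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, object]


def make_batch(start_id: int, batch_size: int, dim: int, rng: random.Random) -> List[Row]:
    """Build rows with primary keys start_id .. start_id + batch_size - 1."""
    return [
        {"id": start_id + i, "vector": [rng.random() for _ in range(dim)]}
        for i in range(batch_size)
    ]


def run_upserts(upsert: Callable[[List[Row]], object], key_space: int,
                batch_size: int, rounds: int, dim: int = 8, seed: int = 0) -> int:
    """Upsert the same key range `rounds` times, so later rounds rewrite existing keys.

    Returns the total number of rows sent.
    """
    rng = random.Random(seed)
    sent = 0
    for _ in range(rounds):
        for start in range(0, key_space, batch_size):
            size = min(batch_size, key_space - start)  # last batch may be short
            upsert(make_batch(start, size, dim, rng))
            sent += size
    return sent


if __name__ == "__main__":
    # Dry run with a stub that only records the batches.
    # Against a real cluster one could pass, for example:
    #   lambda rows: client.upsert(collection_name="demo", data=rows)
    # where client is a pymilvus MilvusClient and "demo" is a hypothetical collection.
    calls: List[List[Row]] = []
    total = run_upserts(calls.append, key_space=1000, batch_size=256, rounds=3)
    print(f"sent {total} rows in {len(calls)} upsert calls")
```

Rewriting the same primary keys in every round is what makes this an upsert-heavy workload of the kind described under Current Behavior, rather than a plain insert stream.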

Milvus Log

[2024/05/22 06:48:09.430 +00:00] [INFO] [task/scheduler.go:281] ["task added"] [task="[id=1716360317352] [type=Grow] [source=segment_checker] [reason=lacks of segment] [collectionID=449821998347640028] [replicaID=449933559028252673] [resourceGroup=__default_resource_group] [priority=Normal] [actionsCount=1] [actions={[type=Grow][node=1017][streaming=false]}] [segmentID=449821998398092289]"]
[2024/05/22 06:48:09.430 +00:00] [INFO] [task/scheduler.go:281] ["task added"] [task="[id=1716360317353] [type=Grow] [source=segment_checker] [reason=lacks of segment] [collectionID=449821998347640028] [replicaID=449933559028252673] [resourceGroup=__default_resource_group] [priority=Normal] [actionsCount=1] [actions={[type=Grow][node=1017][streaming=false]}] [segmentID=449821998398499739]"]
[2024/05/22 06:48:09.430 +00:00] [INFO] [task/scheduler.go:281] ["task added"] [task="[id=1716360317354] [type=Grow] [source=segment_checker] [reason=lacks of segment] [collectionID=449821998347640028] [replicaID=449933559028252673] [resourceGroup=__default_resource_group] [priority=Normal] [actionsCount=1] [actions={[type=Grow][node=1017][streaming=false]}] [segmentID=449821998414241025]"]
[2024/05/22 06:48:09.430 +00:00] [INFO] [task/scheduler.go:281] ["task added"] [task="[id=1716360317355] [type=Grow] [source=segment_checker] [reason=lacks of segment] [collectionID=449821998347640028] [replicaID=449933559028252673] [resourceGroup=__default_resource_group] [priority=Normal] [actionsCount=1] [actions={[type=Grow][node=1017][streaming=false]}] [segmentID=449821998414849387]"]
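
To gauge how many Grow tasks are queued per node from a log like the one above, the `task added` lines can be tallied with a short script. This is a sketch that assumes the bracketed `key=value` layout visible in these lines; the function name `count_grow_tasks` is hypothetical and is not part of Milvus.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Set

# Target node of the task's action and the segment the task refers to.
NODE_RE = re.compile(r"\[node=(\d+)\]")
SEGMENT_RE = re.compile(r"\[segmentID=(\d+)\]")


def count_grow_tasks(lines: Iterable[str]) -> Counter:
    """Count distinct segments that had a Grow task added, keyed by target node ID."""
    segments_by_node: Dict[int, Set[int]] = {}
    for line in lines:
        # Only scheduler "task added" entries for Grow tasks are of interest.
        if '["task added"]' not in line or "[type=Grow]" not in line:
            continue
        node = NODE_RE.search(line)
        seg = SEGMENT_RE.search(line)
        if node and seg:
            segments_by_node.setdefault(int(node.group(1)), set()).add(int(seg.group(1)))
    return Counter({n: len(s) for n, s in segments_by_node.items()})


if __name__ == "__main__":
    import sys

    # Usage: python count_grow_tasks.py < querycoord.log
    for node_id, n in count_grow_tasks(sys.stdin).most_common():
        print(f"node {node_id}: {n} segments with Grow tasks")
```

Counting distinct segment IDs rather than raw lines avoids double counting if the same task is logged more than once.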

Anything else?

No thanks.

@congguosn congguosn added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 22, 2024
@congguosn
Author

More error logs for reference.

[2024/05/22 06:48:08.854 +00:00] [INFO] [task/scheduler.go:591] ["processed tasks"] [nodeID=1012] [toProcessNum=0] [committedNum=0] [toRemoveNum=0]
[2024/05/22 06:48:08.854 +00:00] [INFO] [task/scheduler.go:597] ["process tasks related to node done"] [nodeID=1012] [processingTaskNum=1] [waitingTaskNum=0] [segmentTaskNum=0] [channelTaskNum=1]
[2024/05/22 06:48:08.916 +00:00] [INFO] [task/executor.go:375] ["subscribe channel..."] [taskID=1716360317351] [collectionID=449821998347640028] [replicaID=449933559028252673] [channel=by-dev-rootcoord-dml_4_449821998347640028v0] [node=1017] [source=channel_checker] [checkpoint=449933602989015041] [sinceCheckpoint=3.881772615s]
[2024/05/22 06:48:09.002 +00:00] [INFO] [task/executor.go:390] ["subscribe channel done"] [taskID=1716360317351] [collectionID=449821998347640028] [replicaID=449933559028252673] [channel=by-dev-rootcoord-dml_4_449821998347640028v0] [node=1017] [source=channel_checker] [taskID=1716360317351] ["time taken"=149.008397ms]
[2024/05/22 06:48:09.092 +00:00] [INFO] [querycoordv2/services.go:828] ["get replicas request received"] [traceID=c643022d4a9242881e03c586945c3afa] [collectionID=449821998347640028] [with-shard-nodes=false]
[2024/05/22 06:48:09.092 +00:00] [WARN] [querycoordv2/handlers.go:385] ["failed to get channels, collection may be not loaded or in recovering"] [collectionID=449821998347640028]
[2024/05/22 06:48:09.353 +00:00] [DEBUG] [task/scheduler.go:556] ["process tasks related to node"] [nodeID=1013] [processingTaskNum=1] [waitingTaskNum=0] [segmentTaskNum=0] [channelTaskNum=1]
[2024/05/22 06:48:09.353 +00:00] [DEBUG] [funcutil/parallel.go:54] [process] [total=0] ["time cost"=1.04µs]
[2024/05/22 06:48:09.353 +00:00] [INFO] [task/scheduler.go:591] ["processed tasks"] [nodeID=1013] [toProcessNum=0] [committedNum=0] [toRemoveNum=0]
[2024/05/22 06:48:09.353 +00:00] [INFO] [task/scheduler.go:597] ["process tasks related to node done"] [nodeID=1013] [processingTaskNum=1] [waitingTaskNum=0] [segmentTaskNum=0] [channelTaskNum=1]
[2024/05/22 06:48:09.353 +00:00] [DEBUG] [task/scheduler.go:556] ["process tasks related to node"] [nodeID=1017] [processingTaskNum=1] [waitingTaskNum=0] [segmentTaskNum=0] [channelTaskNum=1]
[2024/05/22 06:48:09.353 +00:00] [DEBUG] [funcutil/parallel.go:54] [process] [total=0] ["time cost"=240ns]
[2024/05/22 06:48:09.353 +00:00] [INFO] [task/scheduler.go:800] ["task removed"] [taskID=1716360317351] [collectionID=449821998347640028] [replicaID=449933559028252673] [status=succeeded] [channel=by-dev-rootcoord-dml_4_449821998347640028v0]
[2024/05/22 06:48:09.353 +00:00] [INFO] [task/scheduler.go:591] ["processed tasks"] [nodeID=1017] [toProcessNum=0] [committedNum=0] [toRemoveNum=1]
[2024/05/22 06:48:09.353 +00:00] [INFO] [task/scheduler.go:597] ["process tasks related to node done"] [nodeID=1017] [processingTaskNum=0] [waitingTaskNum=0] [segmentTaskNum=0] [channelTaskNum=0]
[2024/05/22 06:48:09.365 +00:00] [INFO] [observers/collection_observer.go:320] ["partition load progress"] [collectionID=449821998347640028] [partitionID=449821998347640029] [subChannelCount=1] [loadSegmentCount=0]
[2024/05/22 06:48:09.365 +00:00] [INFO] [observers/collection_observer.go:343] ["load status updated"] [collectionID=449821998347640028] [partitionID=449821998347640029] [partitionLoadPercentage=0] [collectionLoadPercentage=0]
[2024/05/22 06:48:09.370 +00:00] [INFO] [balance/utils.go:70] ["create segment task"] [collection=449821998347640028] [segmentID=449821998398092289] [replica=449933559028252673] [channel=by-dev-rootcoord-dml_4_449821998347640028v0] [from=-1] [to=1017]
[2024/05/22 06:48:09.370 +00:00] [INFO] [balance/utils.go:70] ["create segment task"] [collection=449821998347640028] [segmentID=449821998398499739] [replica=449933559028252673] [channel=by-dev-rootcoord-dml_4_449821998347640028v0] [from=-1] [to=1017]
[2024/05/22 06:48:09.370 +00:00] [INFO] [balance/utils.go:70] ["create segment task"] [collection=449821998347640028] [segmentID=449821998414241025] [replica=449933559028252673] [channel=by-dev-rootcoord-dml_4_449821998347640028v0] [from=-1] [to=1017]
[2024/05/22 06:48:09.370 +00:00] [INFO] [balance/utils.go:70] ["create segment task"] [collection=449821998347640028] [segmentID=449821998414849387] [replica=449933559028252673] [channel=by-dev-rootcoord-dml_4_449821998347640028v0] [from=-1] [to=1017]
[2024/05/22 06:48:09.370 +00:00] [INFO] [balance/utils.go:70] ["create segment task"] [collection=449821998347640028] [segmentID=449821998410970982] [replica=449933559028252673] [channel=by-dev-rootcoord-dml_4_449821998347640028v0] [from=-1] [to=1017]
[2024/05/22 06:48:09.370 +00:00] [INFO] [balance/utils.go:70] ["create segment task"] [collection=449821998347640028] [segmentID=449821998397681764] [replica=449933559028252673] [channel=by-dev-rootcoord-dml_4_449821998347640028v0] [from=-1] [to=1017]
[2024/05/22 06:48:09.370 +00:00] [INFO] [balance/utils.go:70] ["create segment task"] [collection=449821998347640028] [segmentID=449821998411788531] [replica=449933559028252673] [channel=by-dev-rootcoord-dml_4_449821998347640028v0] [from=-1] [to=1017]

@yanliang567
Contributor

/assign @XuanYang-cn
/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 23, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.2, 2.4.3 May 23, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.3, 2.4.4, 2.4.5 May 30, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.5, 2.4.6 Jun 26, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.6, 2.4.7 Jul 19, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.7, 2.4.8 Aug 12, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.8, 2.4.10 Aug 19, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.10, 2.4.11 Sep 5, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.11, 2.4.12 Sep 18, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.12, 2.4.13 Sep 27, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.13, 2.4.14 Oct 15, 2024
@XuanYang-cn
Contributor

Duplicate of #36953. Closing this issue.
