
[Bug]: On a Milvus 2.4.9 cluster, when writing data with upsert, num_entities keeps growing but count(*) always returns 600 #35893

Closed
ronghuaihai opened this issue Sep 2, 2024 · 7 comments
Labels
kind/bug Issues or changes related to a bug · triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

ronghuaihai commented Sep 2, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.9
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  pulsar  
- SDK version(e.g. pymilvus v2.0.0rc2):2.4.4
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

When I view the collection with Attu 2.3.6, it shows:
[image]

When I view the same collection with Attu 2.3.10, it shows:
[image]

When I connect directly to the instance with the Python SDK (pymilvus), I see:
[image]

The application team confirms that the collection's primary keys never conflict or repeat, but for special reasons they must write the data with upsert. In theory the count should keep growing rather than staying at 600.
Is count(*) in 2.4.9 buggy?

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

ronghuaihai added labels kind/bug (Issues or changes related to a bug) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) on Sep 2, 2024
yhmo (Contributor) commented Sep 2, 2024

In Attu 2.3.6, the "approximate entity count" is always obtained from num_entities, whether or not the collection is loaded.
In Attu 2.3.10, the "approximate entity count" is obtained from num_entities when the collection is not loaded, and from count(*) when it is loaded.
That is why the two Attu versions show different numbers.
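For reference, a rough sketch of the two ways of counting with pymilvus (the collection name is a placeholder; count(*) requires the collection to be loaded):

from pymilvus import connections, Collection

connections.connect(host="localhost", port=19530)
collection = Collection("my_collection")  # hypothetical collection name

# num_entities comes from segment statistics and does not reflect deleted/overwritten rows
print("num_entities:", collection.num_entities)

# count(*) is evaluated on the loaded data, so deleted/overwritten rows are excluded
collection.load()
res = collection.query(expr="", output_fields=["count(*)"], consistency_level="Strong")
print("count(*):", res)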

count(*) returning 600 is probably a bug; I suspect compaction is going wrong. @XuanYang-cn please take a look.

ronghuaihai (Author) commented Sep 2, 2024

[image] Our main problem now is the count(*) value shown in Attu. Why doesn't it show the real total even after compaction has been executed?
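As a side note, a minimal sketch of triggering and checking a manual compaction from pymilvus (the collection name is a placeholder), which is presumably what "executed compaction" refers to here:

from pymilvus import connections, Collection

connections.connect(host="localhost", port=19530)
collection = Collection("my_collection")  # hypothetical collection name

collection.compact()                        # trigger a manual compaction
collection.wait_for_compaction_completed()  # block until the compaction finishes
print(collection.get_compaction_state())    # inspect the resulting compaction state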

ronghuaihai (Author) commented Sep 2, 2024

I also found that data is being lost:
--- Growing: 32, Sealed: 0, Flushed: 104
--- Total Segments: 136, row count: 1822523

After a while, show segment gives:
--- Growing: 1, Sealed: 0, Flushed: 48
--- Total Segments: 49, row count: 406622

Comparing one segment between the two runs:
First show segment result:
SegmentID: 452240692225429158 State: Flushed, Row Count:23133
Second show segment result:
SegmentID: 452240692225429158 State: Dropped, Row Count:23133
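For what it's worth, a rough way to cross-check segment row counts from pymilvus (the collection name is a placeholder; this only reports segments currently visible to the query nodes, so it is not a full substitute for the show segment output above):

from pymilvus import connections, utility

connections.connect(host="localhost", port=19530)

# each entry describes one queryable segment (id, state, row count, ...)
for seg in utility.get_query_segment_info("my_collection"):  # hypothetical collection name
    print(seg)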

yanliang567 (Contributor) commented:

/assign @XuanYang-cn
please help to take a look.

@ronghuaihai it is not recommended to use "upsert" as a substitute for "insert", as it puts Milvus under heavy compaction pressure.

yanliang567 removed their assignment Sep 3, 2024
yanliang567 added the triage/accepted label (Indicates an issue or PR is ready to be actively worked on.) and removed the needs-triage label (Indicates an issue or PR lacks a `triage/foo` label and requires one.) on Sep 3, 2024
yhmo (Contributor) commented Sep 3, 2024

Reproduction script:

import random

from pymilvus import (
    connections, utility, Collection, FieldSchema, CollectionSchema, DataType,
)


connections.connect(host="localhost", port=19530)

collection_name = "AAA"
dim = 128
part_cnt = 2

schema = CollectionSchema(
    fields=[
        FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
        FieldSchema(name="vector", dtype = DataType.FLOAT_VECTOR, dim=dim),
    ])

index_params = {
    'metric_type': "L2",
    'index_type': "FLAT",
    'params': {},
}


if utility.has_collection(collection_name):
    utility.drop_collection(collection_name)
collection = Collection(name=collection_name, schema=schema)
print(collection_name, "created")


for i in range(part_cnt):
    collection.create_partition(partition_name=f"part_{i}")

option_names = ["A", "B"]

batch = 800
for i in range(part_cnt+1):
    data = [
        [f"{option_names[i%len(option_names)]}_{i*batch+k}" for k in range(batch)],
        [[random.random() for _ in range(dim)] for _ in range(batch)],
    ]
    collection.insert(data=data, partition_name=f"part_{i%part_cnt}")
    print(collection_name, "data inserted", i)

collection.flush()
print("flushed")

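# creating a "marisa-trie" scalar index on the varchar primary key is the step that triggers the duplicated ids described below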
collection.create_index(field_name="id", index_params={'index_type': "marisa-trie"}, index_name="mmm")
collection.create_index(field_name="vector", index_params=index_params)
print("indexed")
# collection.drop_index(index_name="mmm")
collection.load()
print("loaded")
results = collection.query(expr="", output_fields=["count(*)"], consistency_level="Strong")
print("AAA count(*): ", results)

#########################################################################################
for part in collection.partitions:
    print("partition:", part.name)

    query_iterator = collection.query_iterator(expr="",
                                               output_fields=["id"],
                                               offset=0,
                                               batch_size=30,
                                               partition_names=[part.name],
                                               consistency_level="Strong")
    while True:
        res = query_iterator.next()
        if len(res) == 0:
            print("query iteration finished, close")
            query_iterator.close()
            break

        ids = [x["id"] for x in res]
        print(ids)
        print("================================================================================")

With Milvus v2.3.x, the ids returned by the queryIterator are correct.
With Milvus v2.4.x, the ids returned by the queryIterator contain duplicates.

The key point is the "marisa-trie" index on the varchar primary key; if that index is not created, 2.4.x also behaves correctly.
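As a workaround sketch until the fix lands (reusing the index name "mmm" from the script above; the collection is released first because an index cannot be dropped while the collection is loaded):

collection.release()                       # unload before touching the index
collection.drop_index(index_name="mmm")    # drop the scalar index on the varchar primary key
collection.load()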

yhmo (Contributor) commented Sep 4, 2024

An urgent fix that switches to brute-force search: #35943

xiaofan-luan (Collaborator) commented:

It's actually not recommended to create an index on the PK (since the PK already has a PK index),
but we should fix the bug anyway.
