invalid input for sparse float vector #35

Open
yidasanqian opened this issue Sep 3, 2024 · 6 comments
Comments

@yidasanqian

Code:

from pymilvus.model.sparse import BM25EmbeddingFunction
from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer

analyzer = build_default_analyzer(language="zh")
bm25_ef = BM25EmbeddingFunction(analyzer)
bm25_ef.load("D:/Downloads/bm25_msmarco_v1.json")

def test():
    entities = [....]
    for entity in entities:
        docs_embeddings = bm25_ef.encode_documents([entity["content"]])
        # Convert the csr_array row into a {dimension index: value} dict for Milvus
        sparse_vector = {int(idx): float(val) for idx, val in zip(docs_embeddings[0].indices, docs_embeddings[0].data)}
        entity["content_sparse"] = sparse_vector
    res = client.upsert(collection_name=INDEX_NAME, data=entities)
    return res["ids"]

Traceback output:

  File "d:\Develop\conda\envs\mkb\lib\concurrent\futures\_base.py", line 451, in result
    return self.__get_result()
  File "d:\Develop\conda\envs\mkb\lib\concurrent\futures\_base.py", line 403, in __get_result
    raise self._exception
  File "d:\Develop\conda\envs\mkb\lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "D:\Develop\CodeProjects\mkb\src\tests\mkb\storage\build_1k_data_for_milvus.py", line 52, in append_rag_eval_entry
    res = client.upsert(collection_name=INDEX_NAME, data=entities)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\milvus_client\milvus_client.py", line 276, in upsert
    raise ex from ex
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\milvus_client\milvus_client.py", line 272, in upsert
    res = conn.upsert_rows(
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 148, in handler
    raise e from e
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 144, in handler
    return func(*args, **kwargs)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 183, in handler
    return func(self, *args, **kwargs)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 123, in handler
    raise e from e
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\decorators.py", line 87, in handler
    return func(*args, **kwargs)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\grpc_handler.py", line 715, in upsert_rows
    request = self._prepare_row_upsert_request(
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\grpc_handler.py", line 696, in _prepare_row_upsert_request
    return Prepare.row_upsert_param(
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\prepare.py", line 461, in row_upsert_param
    return cls._parse_row_request(request, fields_info, enable_dynamic, entities)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\prepare.py", line 389, in _parse_row_request
    entity_helper.pack_field_value_to_field_data(v, field_data, field_info)
  File "d:\Develop\conda\envs\mkb\lib\site-packages\pymilvus\client\entity_helper.py", line 361, in pack_field_value_to_field_data
    raise ParamError(message="invalid input for sparse float vector")
pymilvus.exceptions.ParamError: <ParamError: (code=1, message=invalid input for sparse float vector)>

What causes this error, and how can I fix it?

@yidasanqian
Author

entity["content"]="""
无机预涂板是一种环保板材。无机预涂板通常采用防火、抗菌、耐腐蚀和易清洁等,能够有效提高建筑物的装修质量和性能。\n以下是无机预涂板的环保特点:\n无机材料:无机预涂板基板采用无石棉硅酸钙板,不含有害的有机物,不会释放有害气体,不会对室内空气质量造成污染。\n绿色环保:无机预涂板符合绿色环保要求,不含有害物质,是一种绿色环保的装饰材料。\n耐久性:无机预涂板具有良好的耐久性,不易腐烂、老化、脆化和变形,使用寿命长,不会频繁更换,减少资源浪费。\n总之,无机预涂板是一种环保板材,符合绿色环保要求,对室内空气质量和人体健康无害,同时具有不错的装饰效果和耐久性。
"""

@wxywb
Collaborator

wxywb commented Sep 4, 2024

"bm25_msmarco_v1.json" was fitted on an English corpus only; you need to fit the parameters on your own documents. Here is a code example:

from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction
from pymilvus import MilvusClient, DataType

analyzer = build_default_analyzer(language="zh")

docs = [
    "无机预涂板是一种具有优良性能的环保材料,常被应用于防火、抗菌、耐化学腐蚀等领域。",
    "无机预涂板以其卓越的耐火性、抗菌性和易维护性,被广泛应用于各类建筑场景。",
    "无机预涂板拥有防火、耐腐蚀、易清洁等特点,成为现代建筑中环保材料的首选。",
    "无机预涂板兼具环保和实用性,具有防火、抗菌、耐酸碱等多种优异性能。",
    "无机预涂板由于其出色的耐火性能、抗菌功能和环保特性,广泛应用于医院、实验室等场所。"
]


bm25_ef = BM25EmbeddingFunction(analyzer)
bm25_ef.fit(docs)


docs_embeddings = bm25_ef.encode_documents(docs)

query = '无机预涂板有耐火性吗?'

query_embeddings = bm25_ef.encode_queries([query])

client = MilvusClient(uri='test.db')

schema = client.create_schema(
    auto_id=True,
    enable_dynamic_field=True,
)

schema.add_field(field_name="pk", datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="sparse_vector", datatype=DataType.SPARSE_FLOAT_VECTOR)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)

index_params = client.prepare_index_params()

client.create_collection(collection_name="test_sparse_vector", schema=schema)
index_params.add_index(
    field_name="sparse_vector",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",
)

# Create index
client.create_index(collection_name="test_sparse_vector", index_params=index_params)

search_params = {
    "metric_type": "IP",
    "params": {}
}
for i in range(len(docs)):
    entity = {'sparse_vector': docs_embeddings[[i]], 'text':docs[i]}
    client.insert(collection_name="test_sparse_vector", data=entity)

results = client.search(collection_name="test_sparse_vector", data=query_embeddings[[0]], output_fields=['text'], search_params=search_params)
print(results)
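
Why parameters fitted on an English corpus do not carry over to Chinese text can be illustrated with a toy encoder. This is a simplified sketch, not the pymilvus implementation: `fit_vocab` and `encode` are hypothetical helpers, the weights are plain term counts rather than BM25 scores, and the token lists are made up. It only shows that tokens missing from the fitted vocabulary contribute nothing, so the resulting sparse vector can end up empty. Whether an empty vector is what triggered the `ParamError` above is not confirmed in this thread.

```python
from collections import Counter


def fit_vocab(tokenized_docs):
    """Assign each distinct token a dimension index, in first-seen order."""
    vocab = {}
    for tokens in tokenized_docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Return {dimension: term count}; tokens outside the vocabulary are dropped."""
    counts = Counter(tok for tok in tokens if tok in vocab)
    return {vocab[tok]: float(n) for tok, n in counts.items()}


# Vocabulary fitted on (made-up) English tokens only.
english_vocab = fit_vocab([["fire", "resistant", "board"], ["board", "coating"]])

# Tokens a Chinese analyzer might produce (illustrative, not real analyzer output).
chinese_tokens = ["无机", "预涂板", "环保"]

in_vocab_vector = encode(["board", "board", "fire"], english_vocab)
out_of_vocab_vector = encode(chinese_tokens, english_vocab)

print(in_vocab_vector)      # non-empty: the tokens are known
print(out_of_vocab_vector)  # empty: no token is in the vocabulary
```

Fitting on your own documents, as in the example above, gives the encoder a vocabulary that actually matches the tokens your analyzer emits.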

@yidasanqian
Author

Documents are added to Milvus dynamically, and there are more than 1 million of them. Do I have to refit on all documents every time I run a BM25 query?

@wxywb
Collaborator

wxywb commented Sep 5, 2024

Although it is mathematically correct that BM25 should be fitted on all inserted documents, a more practical approach is to fit on a large number of texts, save the resulting parameters, and then load them at query time to avoid refitting.
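
The fit-once, save, and load-later workflow can be sketched with the standard library alone. This is a toy illustration under stated assumptions, not the pymilvus code: the `fit`, `idf`, `save`, and `load` functions, the JSON layout, and the file name `bm25_params.json` are all hypothetical, and the IDF formula is one common BM25 variant that may differ from what the library uses. With pymilvus itself, the corresponding steps would be `bm25_ef.fit(docs)`, then saving the fitted parameters, then `bm25_ef.load(path)` at query time.

```python
import json
import math
import os
import tempfile
from collections import Counter


def fit(tokenized_docs):
    """Collect the corpus statistics BM25 relies on."""
    df = Counter()  # document frequency: how many docs contain each token
    total_len = 0
    for tokens in tokenized_docs:
        df.update(set(tokens))
        total_len += len(tokens)
    num_docs = len(tokenized_docs)
    return {"num_docs": num_docs, "avgdl": total_len / num_docs, "df": dict(df)}


def idf(params, token):
    """A common BM25 IDF variant; rarer tokens get larger weights."""
    n = params["num_docs"]
    d = params["df"].get(token, 0)
    return math.log((n - d + 0.5) / (d + 0.5) + 1.0)


def save(params, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(params, f, ensure_ascii=False)


def load(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Offline: fit once on a (tiny, made-up) tokenized corpus and persist the result.
corpus = [["无机", "预涂板", "防火"], ["无机", "预涂板", "抗菌"], ["预涂板", "环保"]]
params = fit(corpus)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bm25_params.json")
    save(params, path)
    # Query time: load the saved statistics instead of refitting.
    loaded = load(path)

print(loaded == params)
print(idf(loaded, "预涂板") < idf(loaded, "防火"))
```

The trade-off is that the saved statistics slowly drift from the true corpus as new documents arrive, so refitting from time to time on a fresh sample is a reasonable compromise.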

@yidasanqian
Author

These documents take up about 32 GB of memory. So I need to load them all into memory, run fit, and finally call save, right? Do I need to repeat this process every time I add a document? Is there a way to update the parameters incrementally?

@wxywb
Collaborator

wxywb commented Sep 6, 2024

Yes. Currently there is no incremental update for BM25, but it is planned. Milvus will also support native BM25, so please stay tuned.
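
For readers who need something in the meantime, the corpus statistics behind BM25 (document count, total length, and per-token document frequency) are simple sums, so they can in principle be accumulated batch by batch without holding the whole corpus in memory. The sketch below is a standard-library illustration of that idea only. `RunningBM25Stats` is a hypothetical class, not part of pymilvus, and it does not plug into `BM25EmbeddingFunction`.

```python
from collections import Counter


class RunningBM25Stats:
    """Accumulates BM25 corpus statistics one batch at a time."""

    def __init__(self):
        self.num_docs = 0
        self.total_len = 0
        self.df = Counter()  # number of documents containing each token

    def update(self, tokenized_docs):
        """Fold in any iterable of token lists (a list, a generator, a file reader)."""
        for tokens in tokenized_docs:
            self.num_docs += 1
            self.total_len += len(tokens)
            self.df.update(set(tokens))

    @property
    def avgdl(self):
        """Average document length over everything seen so far."""
        return self.total_len / self.num_docs if self.num_docs else 0.0


batch_1 = [["无机", "预涂板"], ["预涂板", "防火"]]
batch_2 = [["环保", "预涂板", "抗菌"]]

# Incremental: feed batches as they arrive.
incremental = RunningBM25Stats()
incremental.update(batch_1)
incremental.update(tokens for tokens in batch_2)  # generators work too

# Reference: one pass over the full corpus.
full = RunningBM25Stats()
full.update(batch_1 + batch_2)

print(incremental.num_docs, incremental.avgdl)
print(incremental.df == full.df)
```

One caveat: vectors already stored were encoded with the statistics available at the time, so any document-side weights that depend on corpus statistics (for example, average length) stay as they were unless those documents are re-encoded.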
