[Bug]: Non-deterministic performance for binary vector search #1379

Open
1 task done
piercefreeman opened this issue Apr 20, 2023 · 1 comment

piercefreeman commented Apr 20, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

I am testing retrieval of binary vectors: one 128-dimensional vector of all True and one 128-dimensional vector of all False. When I search for the all-False vector to validate that it was inserted correctly, the top hit is sometimes the False vector and sometimes the True vector.

query_embedding = np.packbits(np.array([False] * 128, dtype=np.uint8)).tobytes()
results = milvus_client.search(
    collection_name=binary_collection_name,
    anns_field="embedding",
    data=[query_embedding],
    param=search_params,
    limit=2,
    consistency_level="Strong"
)
print(results)
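
For reference, a minimal sketch of what the packed query bytes look like. It uses only NumPy and makes no Milvus calls; the names `all_false` and `all_true` are illustrative and not part of the original report.

```python
import numpy as np

# 128 bits packed 8 per byte gives 16 bytes per vector.
all_false = np.packbits(np.array([False] * 128, dtype=np.uint8)).tobytes()
all_true = np.packbits(np.array([True] * 128, dtype=np.uint8)).tobytes()

# The all-False vector is 16 zero bytes; the all-True vector is 16 0xff bytes.
print(len(all_false), all_false == b"\x00" * 16)
print(len(all_true), all_true == b"\xff" * 16)
```

So the query sent to the server is the all-zero bit pattern, which matters for set-based metrics such as Jaccard.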

When placed in a test that verifies this behavior, the results are mixed (some runs pass, some fail):

vectordb_orm/tests/test_raw.py FFF..FFF.F                                                       [100%]

When I log the search output, it shows that both vectors are being retrieved with the same score and distance.

["['(distance: 1.0, score: 1.0, id: 440914295326004891)', '(distance: 1.0, score: 1.0, id: 440914295326004908)']"]
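
One possible reading of the identical distances, not confirmed by the maintainers in this thread: the search uses the JACCARD metric, and an all-zero query has an empty intersection with every stored vector, so both candidates may end up at the maximum distance and the tie can break either way. The sketch below is plain NumPy, not Milvus's implementation. The function names are hypothetical, and the value returned for the 0/0 case (`empty_value=1.0`) is an assumption chosen to match the logged distance.

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of bit positions where a and b differ."""
    return int(np.count_nonzero(a != b))

def jaccard_distance(a: np.ndarray, b: np.ndarray, empty_value: float = 1.0) -> float:
    """1 - |a AND b| / |a OR b|.

    When both vectors are all-zero the ratio is 0/0; empty_value is an
    ASSUMED convention for that case, not documented Milvus behavior.
    """
    union = int(np.count_nonzero(a | b))
    if union == 0:
        return empty_value
    intersection = int(np.count_nonzero(a & b))
    return 1.0 - intersection / union

zeros = np.zeros(128, dtype=bool)  # the all-False vector (also the query)
ones = np.ones(128, dtype=bool)    # the all-True vector

# Jaccard cannot separate the two candidates for an all-zero query.
print(jaccard_distance(zeros, zeros), jaccard_distance(zeros, ones))
# Hamming separates them clearly.
print(hamming_distance(zeros, zeros), hamming_distance(zeros, ones))
```

Under these assumptions, Jaccard gives the same distance for both stored vectors, while Hamming gives 0 for the all-False vector and 128 for the all-True vector.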

Expected Behavior

When I search for the False vector, it should consistently be returned as the top hit, with a higher similarity than the True vector.

Steps/Code To Reproduce behavior

import numpy as np
from pymilvus import Milvus, DataType, CollectionSchema, FieldSchema, IndexType
import pytest
from time import sleep
from uuid import uuid4

class BinaryEmbeddingObject:
    def __init__(self, id=None, embedding=None):
        self.id = id
        self.embedding = embedding

    def insert(self, milvus_client: Milvus, collection_name: str):
        embedding_bytes = np.packbits(self.embedding).tobytes()
        entities = [
            {"name": "embedding", "type": DataType.BINARY_VECTOR, "values": [embedding_bytes]}
        ]
        mutation_result = milvus_client.insert(collection_name=collection_name, entities=entities)
        self.id = mutation_result.primary_keys[0]

def create_new_collection(milvus_client: Milvus, collection_name: str, schema: CollectionSchema):
    if milvus_client.has_collection(collection_name):
        milvus_client.drop_collection(collection_name)
        sleep(2)
    milvus_client.create_collection(collection_name, schema)

def create_embedding_index(milvus_client: Milvus, collection_name: str):
    index_params = {
        "index_type": "BIN_IVF_FLAT",
        "params": {"nlist": 128},
        "metric_type": "HAMMING"
    }
    status = milvus_client.create_index(collection_name, "embedding", index_params)
    return status

@pytest.mark.parametrize('execution_number', range(10))
def test_raw(milvus_client: Milvus, execution_number):
    identifier = str(uuid4()).replace("-", "_")
    binary_collection_name = f"test_collection_{identifier}"

    # Define the Milvus schema
    primary_field = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True)
    field1 = FieldSchema(name="embedding", dtype=DataType.BINARY_VECTOR, dim=128)
    schema = CollectionSchema(fields=[primary_field, field1])

    # Create the Milvus collection
    create_new_collection(milvus_client, binary_collection_name, schema)
    create_embedding_index(milvus_client, binary_collection_name)

    # Create some BinaryEmbeddingObject instances
    obj1 = BinaryEmbeddingObject(embedding=np.array([True] * 128, dtype=np.uint8))
    obj2 = BinaryEmbeddingObject(embedding=np.array([False] * 128, dtype=np.uint8))

    # Insert the objects into Milvus
    obj1.insert(milvus_client, binary_collection_name)
    obj2.insert(milvus_client, binary_collection_name)

    milvus_client.flush([binary_collection_name])
    milvus_client.load_collection(binary_collection_name)

    # Test our ability to recall 1:1 the input content
    search_params = {"metric_type": "JACCARD"}
    query_embedding = np.packbits(np.array([False] * 128, dtype=np.uint8)).tobytes()
    results = milvus_client.search(
        collection_name=binary_collection_name,
        anns_field="embedding",
        data=[query_embedding],
        param=search_params,
        limit=2,
        consistency_level="Strong"
    )
    print(results)
    assert len(results[0]) == 2
    assert results[0][0].id == obj2.id


Environment details

- Hardware/Software conditions (OS, CPU, GPU, Memory): Mac OS X
- Method of installation (Docker, or from source): Docker for the Milvus host, Python virtualenv on OSX for pymilvus
- Milvus version (v0.3.1, or v0.4.0): v2.2.6 on server and Python
- Milvus configuration (Settings you made in `server_config.yaml`): N/A

Anything else?

None

@xiaofan-luan
Contributor

/assign @cydrain
