[Bug]: Non-deterministic performance for binary vector search #1379

piercefreeman · 2023-04-20T21:55:43Z

Is there an existing issue for this?

I have searched the existing issues

Describe the bug

I am testing retrieval of binary vectors: one 128-dimensional vector of all True and one 128-dimensional vector of all False. When I search for the False vector to validate that it was inserted correctly, the top hit is sometimes the False vector and sometimes the True vector.

query_embedding = np.packbits(np.array([False] * 128, dtype=np.uint8)).tobytes()
results = milvus_client.search(
    collection_name=binary_collection_name,
    anns_field="embedding",
    data=[query_embedding],
    param=search_params,
    limit=2,
    consistency_level="Strong"
)
print(results)
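
For reference, here is a small sketch of what the packed query bytes look like. It uses only NumPy and does not contact Milvus; the variable names `false_bytes` and `true_bytes` are illustrative and not part of the original test.

```python
import numpy as np

# 128 bits pack into 16 bytes (8 bits per byte).
false_bytes = np.packbits(np.array([False] * 128, dtype=np.uint8)).tobytes()
true_bytes = np.packbits(np.array([True] * 128, dtype=np.uint8)).tobytes()

# The all-False vector is 16 zero bytes; the all-True vector is 16 0xff bytes.
print(len(false_bytes), len(true_bytes))
print(false_bytes == b"\x00" * 16)
print(true_bytes == b"\xff" * 16)
```

So the query and the stored False vector should be byte-for-byte identical, which is why a 1:1 recall is expected.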

When placed in a test that verifies this behavior, the results are mixed. Across ten parametrized runs, seven fail and three pass:

vectordb_orm/tests/test_raw.py FFF..FFF.F                                                       [100%]

When I log the search output, it shows that both vectors are being retrieved with the same score and distance.

["['(distance: 1.0, score: 1.0, id: 440914295326004891)', '(distance: 1.0, score: 1.0, id: 440914295326004908)']"]
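
One possible reading of the identical distances, not confirmed anywhere in this thread: under the textbook Jaccard definition, the distance between the all-False query and the all-True vector is 1.0, and the distance between the all-False query and the all-False vector is 0/0, which is undefined. If the server reports that undefined case as 1.0 (an assumption, not verified against Milvus), the two hits would tie and their order would be arbitrary. The sketch below works through the arithmetic with plain NumPy; `hamming_distance` and `jaccard_distance` are hypothetical helpers written for this illustration, not Milvus functions.

```python
import numpy as np

def hamming_distance(a, b):
    """Number of positions where two boolean vectors differ."""
    return int(np.count_nonzero(a != b))

def jaccard_distance(a, b):
    """1 - |A and B| / |A or B|, or None when the union is empty (0/0)."""
    union = int(np.count_nonzero(a | b))
    if union == 0:
        return None  # undefined: neither vector has any bit set
    intersection = int(np.count_nonzero(a & b))
    return 1.0 - intersection / union

zeros = np.zeros(128, dtype=bool)  # the all-False vector (also the query)
ones = np.ones(128, dtype=bool)    # the all-True vector

print("hamming(query, False vector):", hamming_distance(zeros, zeros))
print("hamming(query, True vector): ", hamming_distance(zeros, ones))
print("jaccard(query, False vector):", jaccard_distance(zeros, zeros))
print("jaccard(query, True vector): ", jaccard_distance(zeros, ones))
```

Under Hamming distance the two candidates are clearly separated (0 versus 128), whereas Jaccard gives no usable distance for the all-False pair.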

Expected Behavior

When I search for the False vector, the False vector should consistently be returned as the top hit, ranked above the True vector.

Steps/Code To Reproduce behavior

import numpy as np
from pymilvus import Milvus, DataType, CollectionSchema, FieldSchema
import pytest
from time import sleep
from uuid import uuid4

class BinaryEmbeddingObject:
    def __init__(self, id=None, embedding=None):
        self.id = id
        self.embedding = embedding

    def insert(self, milvus_client: Milvus, collection_name: str):
        embedding_bytes = np.packbits(self.embedding).tobytes()
        entities = [
            {"name": "embedding", "type": DataType.BINARY_VECTOR, "values": [embedding_bytes]}
        ]
        mutation_result = milvus_client.insert(collection_name=collection_name, entities=entities)
        self.id = mutation_result.primary_keys[0]

def create_new_collection(milvus_client: Milvus, collection_name: str, schema: CollectionSchema):
    if milvus_client.has_collection(collection_name):
        milvus_client.drop_collection(collection_name)
        sleep(2)
    milvus_client.create_collection(collection_name, schema)

def create_embedding_index(milvus_client: Milvus, collection_name: str):
    index_params = {
        "index_type": "BIN_IVF_FLAT",
        "params": {"nlist": 128},
        "metric_type": "HAMMING"
    }
    status = milvus_client.create_index(collection_name, "embedding", index_params)
    return status

@pytest.mark.parametrize('execution_number', range(10))
def test_raw(milvus_client: Milvus, execution_number):
    identifier = str(uuid4()).replace("-", "_")
    binary_collection_name = f"test_collection_{identifier}"

    # Define the Milvus schema
    primary_field = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True)
    field1 = FieldSchema(name="embedding", dtype=DataType.BINARY_VECTOR, dim=128)
    schema = CollectionSchema(fields=[primary_field, field1])

    # Create the Milvus collection
    create_new_collection(milvus_client, binary_collection_name, schema)
    create_embedding_index(milvus_client, binary_collection_name)

    # Create some BinaryEmbeddingObject instances
    obj1 = BinaryEmbeddingObject(embedding=np.array([True] * 128, dtype=np.uint8))
    obj2 = BinaryEmbeddingObject(embedding=np.array([False] * 128, dtype=np.uint8))

    # Insert the objects into Milvus
    obj1.insert(milvus_client, binary_collection_name)
    obj2.insert(milvus_client, binary_collection_name)

    milvus_client.flush([binary_collection_name])
    milvus_client.load_collection(binary_collection_name)

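    # Test our ability to recall 1:1 the input content
    # Note: the index above was created with metric_type "HAMMING", while this search requests "JACCARD"
    search_params = {"metric_type": "JACCARD"}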
    query_embedding = np.packbits(np.array([False] * 128, dtype=np.uint8)).tobytes()
    results = milvus_client.search(
        collection_name=binary_collection_name,
        anns_field="embedding",
        data=[query_embedding],
        param=search_params,
        limit=2,
        consistency_level="Strong"
    )
    print(results)
    assert len(results[0]) == 2
    assert results[0][0].id == obj2.id
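
The final assertion only holds if the top hit is strictly closer than the runner-up. As a minimal sketch, the check below flags a tie before the id is compared. It operates on plain `(id, distance)` tuples rather than pymilvus hit objects, and `top_hit_is_unique` is a hypothetical helper, not part of the original test.

```python
def top_hit_is_unique(hits, tol=1e-9):
    """hits: list of (id, distance) tuples, best match first.

    Returns False when the first two hits are within tol of each other,
    in which case their relative order carries no information.
    """
    if len(hits) < 2:
        return True
    return abs(hits[1][1] - hits[0][1]) > tol

# Values copied from the logged search output in this report.
logged_hits = [(440914295326004891, 1.0), (440914295326004908, 1.0)]
print(top_hit_is_unique(logged_hits))
```

With the logged values the helper reports a tie, which matches the intermittent failures: either id may come back first.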



Environment details

```markdown
- Hardware/Software conditions (OS, CPU, GPU, Memory): Mac OS X
- Method of installation (Docker, or from source): Docker for the Milvus host, Python virtualenv on Mac OS X for pymilvus
- Milvus version (v0.3.1, or v0.4.0): v2.2.6 on server and Python
- Milvus configuration (Settings you made in `server_config.yaml`): N/A
```

Anything else?

None

xiaofan-luan · 2023-04-20T21:56:19Z

/assign @cydrain

sre-ci-robot assigned cydrain Apr 20, 2023

piercefreeman mentioned this issue Apr 20, 2023

Support multiple indexes and binary embeddings piercefreeman/vectordb-orm#5

Merged
