update readme, docstrings #14

Merged
merged 4 commits into from
Nov 9, 2024
94 changes: 94 additions & 0 deletions README.md
@@ -1,3 +1,97 @@
# affine

![badge](https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/ekorman/7fbb57e6d6a2c8b69617ddf141043b98/raw/affine-coverage.json)

Affine is a Python library for providing a uniform and structured interface to various backing vector databases and approximate nearest neighbor libraries. It allows simple dataclass-like objects to describe collections together with a high-level query syntax for doing filtered vector search.

For vector databases, it currently supports:

- qdrant
- weaviate
- pinecone

For local mode, the following approximate nearest neighbor libraries are supported:

- FAISS
- annoy
- pynndescent
- scikit-learn KDTree
- naive/NumPy

Note: this project is very similar to [vectordb-orm](https://github.com/piercefreeman/vectordb-orm), which looks to be no longer maintained.

## Installation

```bash
pip install affine
# or `pip install affine[qdrant]` for qdrant support
# `pip install affine[weaviate]` for weaviate support
# `pip install affine[pinecone]` for pinecone support
```

## Basic Usage

```python
from affine import Collection, Vector, Filter, Query
from affine.engine.local import LocalEngine

# Define a collection
class MyCollection(Collection):
    vec: Vector[3]  # declare a 3-dimensional vector

    # support for additional fields for filtering
    a: int
    b: str

db = LocalEngine()

# Insert vectors
db.insert(MyCollection(vec=[0.1, 0.0, -0.5], a=1, b="foo"))
db.insert(MyCollection(vec=[1.3, 2.1, 3.6], a=2, b="bar"))
db.insert(MyCollection(vec=[-0.1, 0.2, 0.3], a=3, b="foo"))

# Query vectors
result: list[MyCollection] = (
    db.query(MyCollection)
    .filter(MyCollection.b == "foo")
    .similarity([2.8, 1.8, -4.5])
    .limit(1)
)
```
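
To make the query above concrete, here is a minimal pure-Python sketch of what a filtered nearest neighbor search does conceptually: apply the metadata filter first, then rank the remaining records by distance to the query vector. It does not use affine. The `Record` dataclass and the `filtered_search` function are hypothetical names, and Euclidean distance is an assumption, since the metric used by each engine is not stated here.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    vec: list
    a: int
    b: str


def filtered_search(records, predicate, query, n):
    """Return the n records passing `predicate` that are closest to `query`."""
    candidates = [r for r in records if predicate(r)]
    # Euclidean distance is an assumption made for this sketch
    candidates.sort(key=lambda r: math.dist(r.vec, query))
    return candidates[:n]


records = [
    Record(vec=[0.1, 0.0, -0.5], a=1, b="foo"),
    Record(vec=[1.3, 2.1, 3.6], a=2, b="bar"),
    Record(vec=[-0.1, 0.2, 0.3], a=3, b="foo"),
]

# same filter, query vector, and limit as the affine example above
result = filtered_search(records, lambda r: r.b == "foo", [2.8, 1.8, -4.5], n=1)
print(result[0].a)  # prints 1
```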

## Engines

A fundamental notion of _affine_ is the `Engine` class. All engines conform to the same API for interchangeability (with the exception of a few engine-specific restrictions, which are mentioned below). There are two broad types of engines:

1. `LocalEngine`: this does nearest neighbor search on the executing machine, and supports a variety of libraries for the backing nearest neighbor search (these are called the _backend_ of the local engine).

2. Vector database engines: these are engines that connect to a vector database service, such as Qdrant, Weaviate, or Pinecone.

### Vector Databases

The currently supported vector databases are:

| Database | Class | Constructor arguments | Notes |
| -------- | ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| Qdrant | `affine.engine.QdrantEngine` | `host: str` hostname to use<br><br>`port: int` port to use | - |
| Weaviate | `affine.engine.WeaviateEngine` | `host: str` hostname to use<br><br>`port: int` port to use | - |
| Pinecone | `affine.engine.PineconeEngine` | `api_key: Union[str, None]` Pinecone API key. If not provided, it will be read from the environment variable `PINECONE_API_KEY`.<br><br>`spec: Union[ServerlessSpec, PodSpec, None]` the `PodSpec` or `ServerlessSpec` object. If not provided, a `ServerlessSpec` will be created from the environment variables `PINECONE_CLOUD` and `PINECONE_REGION`. | The Pinecone engine has the restriction that every collection must contain exactly one vector attribute. |
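
The environment-variable fallback described for `PineconeEngine` can be sketched as follows. This is not affine's implementation: `resolve_api_key` and `resolve_serverless_settings` are hypothetical helper names, and raising `ValueError` when nothing is configured is an assumption. Only the environment variable names come from the table.

```python
import os


def resolve_api_key(api_key=None):
    """Use the explicit key if given, else fall back to PINECONE_API_KEY."""
    if api_key is not None:
        return api_key
    key = os.environ.get("PINECONE_API_KEY")
    if key is None:
        # assumption: fail loudly when no key is configured anywhere
        raise ValueError("no API key passed and PINECONE_API_KEY is not set")
    return key


def resolve_serverless_settings():
    """Read the cloud and region that a default ServerlessSpec would be built from."""
    try:
        return {
            "cloud": os.environ["PINECONE_CLOUD"],
            "region": os.environ["PINECONE_REGION"],
        }
    except KeyError as e:
        raise ValueError(f"environment variable {e.args[0]} is not set") from e
```

An explicit argument always wins over the environment, which keeps tests and multi-account setups from depending on global state.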

### Approximate Nearest Neighbor Libraries

The `LocalEngine` class provides an interface for doing nearest neighbor search on the executing machine, supporting a variety of libraries for the backing nearest neighbor search. The library to use is specified by the `backend` argument to the constructor. For example, to use `annoy`:

```python
from affine.engine.local import LocalEngine, AnnoyBackend

db = LocalEngine(backend=AnnoyBackend(n_trees=10))
```

The options and settings for the various supported backends are as follows:

| Library | Class | Constructor arguments | Notes |
| ------------------- | ---------------------------------------- | ------------------------------------------------------------------------ | ----- |
| naive/numpy | `affine.engine.local.NumPyBackend` | - | - |
| scikit-learn KDTree | `affine.engine.local.KDTreeBackend` | keyword arguments that get passed directly to `sklearn.neighbors.KDTree` | - |
| annoy | `affine.engine.local.AnnoyBackend` | `n_trees: int` number of trees to use<br>`n_jobs: int` defaults to -1 | - |
| FAISS | `affine.engine.local.FAISSBackend` | `index_factory_str: str` | - |
| PyNNDescent | `affine.engine.local.PyNNDescentBackend` | keyword arguments that get passed directly to `pynndescent.NNDescent` | - |
84 changes: 81 additions & 3 deletions affine/engine/base.py
@@ -22,10 +22,36 @@ def _query(
     def query(
         self, collection_class: Type[Collection], with_vectors: bool = False
     ) -> QueryObject:
+        """
+        Parameters
+        ----------
+        collection_class
+            the collection class to query
+        with_vectors
+            whether or not the returned objects should have their vector attributes populated
+            (or otherwise be set to `None`)
+
+        Returns
+        -------
+        QueryObject
+            the resulting QueryObject
+        """
         return QueryObject(self, collection_class, with_vectors=with_vectors)

     @abstractmethod
     def insert(self, record: Collection) -> int | str:
+        """Insert a record
+
+        Parameters
+        ----------
+        record
+            the record to insert
+
+        Returns
+        -------
+        int | str
+            the resulting id of the inserted record
+        """
         pass

     @abstractmethod
@@ -35,10 +61,22 @@ def _delete_by_id(self, collection: Type[Collection], id: str) -> None:
     def delete(
         self,
         *,
-        record: Collection | str | None = None,
+        record: Collection | None = None,
         collection: Type[Collection] | None = None,
         id: str | None = None,
     ) -> None:
+        """Delete a record from the database. The record can either be specified
+        by its `Collection` object or by its id.
+
+        Parameters
+        ----------
+        record
+            the record to delete
+        collection
+            the collection the record belongs to (needed if and only if deleting a record by its id)
+        id
+            the id of the record
+        """
         if bool(record is None) == bool(collection is None and id is None):
             raise ValueError(
                 "Either record or collection and id must be provided"
@@ -58,15 +96,55 @@ def delete(

     @abstractmethod
     def get_elements_by_ids(
-        self, collection: type, ids: list[int]
+        self, collection: type, ids: list[int | str]
     ) -> list[Collection]:
+        """Get elements by ids
+
+        Parameters
+        ----------
+        ids
+            list of ids
+
+        Returns
+        -------
+        list[collection]
+            the resulting collection objects
+        """
         pass

     @abstractmethod
     def register_collection(self, collection_class: Type[Collection]) -> None:
+        """Register a collection to the database
+
+        Parameters
+        ----------
+        collection_class
+            the class of the collection to register. This class must inherit from `Collection`.
+        """
         pass

-    def get_element_by_id(self, collection: type, id_: int) -> Collection:
+    def get_element_by_id(
+        self, collection: type, id_: int | str
+    ) -> Collection:
+        """Get an element by its id
+
+        Parameters
+        ----------
+        collection
+            the collection class the record belongs to
+        id_
+            the id of the record
+
+        Returns
+        -------
+        collection
+            the corresponding collection object for the record.
+
+        Raises
+        ------
+        ValueError
+            if no record is found with the specified id.
+        """
         ret = self.get_elements_by_ids(collection, [id_])
         if len(ret) == 0:
             raise ValueError(f"No record found with id {id_}")
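
The `delete` and `get_element_by_id` contracts documented in `affine/engine/base.py` can be exercised with a small in-memory stand-in. This is a sketch, not affine code: `ToyEngine` and `ToyRecord` are hypothetical, ids are assumed to be sequential integers, and only the argument check in `delete` and the `ValueError` on a missing id mirror the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRecord:
    name: str
    id: Optional[int] = None


class ToyEngine:
    """Hypothetical in-memory engine holding records of a single collection."""

    def __init__(self):
        self._records = {}
        self._next_id = 0

    def insert(self, record):
        # assumption: ids are sequential integers assigned on insert
        record.id = self._next_id
        self._records[record.id] = record
        self._next_id += 1
        return record.id

    def delete(self, *, record=None, collection=None, id=None):
        # same check as in the diff: pass `record`, or `collection` and `id`
        if bool(record is None) == bool(collection is None and id is None):
            raise ValueError("Either record or collection and id must be provided")
        key = record.id if record is not None else id
        self._records.pop(key, None)

    def get_elements_by_ids(self, collection, ids):
        return [self._records[i] for i in ids if i in self._records]

    def get_element_by_id(self, collection, id_):
        ret = self.get_elements_by_ids(collection, [id_])
        if len(ret) == 0:
            raise ValueError(f"No record found with id {id_}")
        return ret[0]


db = ToyEngine()
apple_id = db.insert(ToyRecord(name="Apple"))
print(db.get_element_by_id(ToyRecord, apple_id).name)  # prints Apple
db.delete(collection=ToyRecord, id=apple_id)
```
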
35 changes: 32 additions & 3 deletions affine/query.py
@@ -22,6 +22,18 @@ def __init__(
         self._similarity = None

     def filter(self, filter_set: FilterSet | Filter) -> "QueryObject":
+        """Filter the result of a query by specified filters
+
+        Parameters
+        ----------
+        filter_set
+            the `FilterSet` or `Filter` object to use
+
+        Returns
+        -------
+        QueryObject
+            resulting `QueryObject`
+        """
         if isinstance(filter_set, Filter):
             filter_set = FilterSet(
                 filters=[filter_set], collection=filter_set.collection
@@ -31,9 +43,28 @@ def filter(self, filter_set: FilterSet | Filter) -> "QueryObject":
         return self

     def all(self) -> list[Collection]:
+        """Get all results of a query
+
+        Returns
+        -------
+        list[Collection]
+            all of the matching records for the query
+        """
         return self.db._query(self._filter_set, with_vectors=self.with_vectors)

     def limit(self, n: int) -> list[Collection]:
+        """Returns a fixed number of results of a query.
+
+        Parameters
+        ----------
+        n
+            how many records to retrieve. In the case of a similarity search query
+            this will be the `n`-closest neighbors
+
+        Returns
+        -------
+        list[Collection]
+        """
         return self.db._query(
             self._filter_set,
             with_vectors=self.with_vectors,
@@ -42,8 +73,6 @@ def limit(self, n: int) -> list[Collection]:
         )

     def similarity(self, similarity: Similarity) -> "QueryObject":
+        """Apply a similarity search to the query"""
         self._similarity = similarity
         return self
-
-    def get_by_id(self, id_) -> Collection:
-        return self.db.get_element_by_id(self.collection_class, id_)
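
The docstrings in `affine/query.py` describe a chained query API in which `filter` and `similarity` return the query object itself, while `all` and `limit` execute the query and return records. A minimal sketch of that pattern follows, using a hypothetical `ToyQuery` over a plain list of dicts rather than affine's `QueryObject`. Predicates as plain callables and squared Euclidean distance are assumptions made for illustration.

```python
class ToyQuery:
    """Hypothetical chained query over a list of dict rows."""

    def __init__(self, rows):
        self._rows = rows
        self._predicates = []
        self._target = None

    def filter(self, predicate):
        # chaining method: record the filter and return self
        self._predicates.append(predicate)
        return self

    def similarity(self, target):
        # chaining method: record the vector to rank by and return self
        self._target = target
        return self

    def _run(self, n=None):
        rows = list(self._rows)
        for predicate in self._predicates:
            rows = [r for r in rows if predicate(r)]
        if self._target is not None:
            # assumption: rank by squared Euclidean distance to the target
            rows.sort(
                key=lambda r: sum((x - y) ** 2 for x, y in zip(r["vec"], self._target))
            )
        return rows if n is None else rows[:n]

    def all(self):
        # terminal method: return every matching row
        return self._run()

    def limit(self, n):
        # terminal method: return at most n rows (the n closest when ranked)
        return self._run(n)


rows = [
    {"name": "Apple", "vec": [0.0, 1.0]},
    {"name": "Pear", "vec": [1.0, 1.0]},
    {"name": "Plum", "vec": [5.0, 5.0]},
]
nearest = (
    ToyQuery(rows)
    .filter(lambda r: r["name"] != "Apple")
    .similarity([0.0, 0.0])
    .limit(1)
)
print(nearest[0]["name"])  # prints Pear
```
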
2 changes: 1 addition & 1 deletion tests/conftest.py
@@ -110,7 +110,7 @@ def _test_engine(db: Engine):
     assert q9[0].name == "Apple"

     # check we can query by id
-    assert db.query(Product).get_by_id(q9[0].id).name == "Apple"
+    assert db.get_element_by_id(Product, q9[0].id).name == "Apple"

     # check we can delete
     db.delete(record=q9[0])