v1.27 Doc edits for ACORN #2691
Merged 7 commits on Oct 16, 2024
1 change: 1 addition & 0 deletions _includes/code/howto/manage-data.collections-v3.py
@@ -158,6 +158,7 @@
"cache": True, # Enable use of vector cache. Default: False
},
"vectorCacheMaxObjects": 100000, # Cache size if `cache` enabled. Default: 1000000000000
"filterStrategy": "sweeping" # or "acorn" (Available from Weaviate v1.27.0)
}
# highlight-end
}
16 changes: 8 additions & 8 deletions _includes/code/howto/manage-data.collections.py
@@ -171,18 +171,17 @@
client.collections.delete("Article")

# START SetVectorIndexParams
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.config import Configure, Property, DataType, VectorDistances, VectorFilterStrategy

client.collections.create(
"Article",
# Additional configuration not shown
# highlight-start
vector_index_config=Configure.VectorIndex.flat(
quantizer=Configure.VectorIndex.Quantizer.bq(
rescore_limit=200,
cache=True
),
vector_cache_max_objects=100000
vector_index_config=Configure.VectorIndex.hnsw(
quantizer=Configure.VectorIndex.Quantizer.bq(),
ef_construction=300,
distance_metric=VectorDistances.COSINE,
filter_strategy=VectorFilterStrategy.SWEEPING # or ACORN (Available from Weaviate v1.27.0)
),
# highlight-end
)
@@ -191,7 +190,8 @@
# Test
collection = client.collections.get("Article")
config = collection.config.get()
assert config.vector_index_type.name == "FLAT"
assert config.vector_index_config.filter_strategy.name == "SWEEPING"
assert config.vector_index_type.name == "HNSW"


# ===================================================================
@@ -30,7 +30,7 @@ With the introduction of the natively-implemented `RoaringSet` data type in `1.1

The extra efficiencies are due to various strategies that Roaring Bitmaps employ, where it divides data into chunks and applies an appropriate storage strategy to each one. This enables high data compression and set operations speeds, resulting in much improved filtering speeds for Weaviate.

Weaviate version `1.18` and onwards will include this feature, and our team will be maintaining our underlying Roaring Bitmap library to address any issues and make improvements as needed. To read more about pre-filtering read the documentation [here](/developers/weaviate/concepts/prefiltering)
Weaviate version `1.18` and onwards will include this feature, and our team will be maintaining our underlying Roaring Bitmap library to address any issues and make improvements as needed. To read more about pre-filtering read the documentation [here](/developers/weaviate/concepts/filtering)

### What this means for you
From your perspective, the only visible change will be a one-time process to create the new index. Once your Weaviate instance creates the Roaring Bitmap index, it will operate in the background to speed up your work.
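The chunking idea can be sketched in a few lines of Python (a toy model for illustration only - Weaviate's actual implementation is a full-featured Go library):

```python
# Toy sketch of the roaring bitmap idea: split 32-bit ids by their high
# 16 bits, then pick a cheap container type per chunk. Illustrative only.

def build_containers(ids):
    """Group ids into 2^16-value chunks keyed by their high 16 bits."""
    chunks = {}
    for i in ids:
        chunks.setdefault(i >> 16, set()).add(i & 0xFFFF)
    containers = {}
    for key, values in chunks.items():
        # Sparse chunks are cheapest as sorted arrays; dense chunks as bitmaps.
        if len(values) <= 4096:
            containers[key] = ("array", sorted(values))
        else:
            containers[key] = ("bitmap", values)  # stand-in for a 65536-bit map
    return containers

def intersect(a, b):
    """Set intersection only needs to visit chunk keys present in both."""
    out = set()
    for key in a.keys() & b.keys():
        _, va = a[key]
        _, vb = b[key]
        out |= {(key << 16) | v for v in set(va) & set(vb)}
    return out
```

Because whole chunks missing from either side are skipped outright, set operations touch far less data than a flat id list would.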
@@ -29,7 +29,7 @@ Object properties in Weaviate can be indexed for faster filtering. In earlier re

When you anticipate using numerical comparisons over ranges of values, enable the new rangeable index in your [collection schema](/developers/weaviate/config-refs/schema). The rangeable index is available for `int`, `number`, and `date` data types.

The rangeable index can be enabled alone or with the [filterable index](/developers/weaviate/concepts/prefiltering#indexrangefilters). Efficient range filters can be combined with other filters, such as the `Equal` filter, to quickly narrow your searches to the most relevant information.
The rangeable index can be enabled alone or with the [filterable index](/developers/weaviate/concepts/filtering#indexrangefilters). Efficient range filters can be combined with other filters, such as the `Equal` filter, to quickly narrow your searches to the most relevant information.

Internally, rangeable indexes are implemented as [roaring bitmap slices](https://www.featurebase.com/blog/range-encoded-bitmaps). This is an exciting data structure that combines several clever ideas to reduce memory requirements while improving processing speeds.
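As a sketch of how enabling the rangeable index might look with the v4 Python client (the collection and property names are hypothetical, and `index_range_filters` assumes a client and server version that support the rangeable index):

```python
import weaviate
from weaviate.classes.config import Property, DataType

client = weaviate.connect_to_local()

client.collections.create(
    "TicketSale",  # hypothetical collection
    properties=[
        Property(
            name="price",
            data_type=DataType.NUMBER,
            index_range_filters=True,  # rangeable index (off by default)
            index_filterable=True,     # can be combined with the filterable index
        ),
    ],
)

client.close()
```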

2 changes: 1 addition & 1 deletion developers/academy/py/tokenization/900_next_steps.mdx
@@ -11,7 +11,7 @@ There are many more resources available to help you continue your learning journ
- [References: Configuration: Tokenization](/developers/weaviate/config-refs/schema/index.md#tokenization)
- [References: Configuration: Stopwords](/developers/weaviate/config-refs/schema/index.md#invertedindexconfig--stopwords-stopword-lists)
- [Concepts: Inverted index](/developers/weaviate/concepts/indexing.md#inverted-indexes)
- [Concepts: Filtering](/developers/weaviate/concepts/prefiltering.md)
- [Concepts: Filtering](/developers/weaviate/concepts/filtering.md)

:::note
As a reminder, for non-English texts - especially those that do not rely on spaces between words - try the `trigram` or `gse` tokenization methods, which were added in Weaviate `v1.24`.
@@ -5,7 +5,11 @@ image: og/docs/concepts.jpg
# tags: ['architecture', 'filtered vector search', 'pre-filtering']
---

Weaviate provides powerful filtered vector search capabilities, meaning that you can eliminate candidates in your "fuzzy" vector search based on individual properties. Thanks to Weaviate's efficient pre-filtering mechanism, you can keep the recall high - even when filters are very restrictive. Additionally, the process is efficient and has minimal overhead compared to an unfiltered vector search.
Weaviate provides powerful filtered vector search capabilities, allowing you to combine vector searches with structured, scalar filters. This enables you to find the closest vectors to a query vector that also match certain conditions.

Filtered vector search in Weaviate is based on the concept of pre-filtering. This means that an allow-list of candidates matching the filter is constructed before the vector search is performed. Unlike some pre-filtering implementations, Weaviate's approach does not require a brute-force vector search and is highly efficient.

Starting in `v1.27`, Weaviate introduces its implementation of the [`ACORN`](#acorn) filter strategy. This filtering method significantly improves performance for large datasets, especially when the filter has low correlation with the query vector.

## Post-Filtering vs Pre-Filtering

@@ -27,6 +31,35 @@ In the section about Storage, [we have described in detail which parts make up a
1. An inverted index (similar to a traditional search engine) is used to create an allow-list of eligible candidates. This list is essentially a list of `uint64` ids, so it can grow very large without sacrificing efficiency.
2. A vector search is performed where the allow-list is passed to the HNSW index. The index will move along any node's edges normally, but will only add ids to the result set that are present on the allow list. The exit conditions for the search are the same as for an unfiltered search: The search will stop when the desired limit is reached and additional candidates no longer improve the result quality.
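The two-step flow above can be sketched in plain Python - with a brute-force scan standing in for the HNSW traversal, purely to show the data flow:

```python
# Illustrative sketch of pre-filtering: an inverted index produces an
# allow-list of ids, and the vector search only scores allowed candidates.
# (Weaviate's real HNSW traversal is graph-based, not a linear scan.)

inverted_index = {  # property value -> matching object ids
    ("publication", "NYT"): {0, 2, 5},
    ("publication", "Guardian"): {1, 3, 4},
}

vectors = {0: (1.0, 0.0), 1: (0.9, 0.1), 2: (0.0, 1.0),
           3: (0.5, 0.5), 4: (0.1, 0.9), 5: (0.8, 0.2)}

def filtered_search(query, prop, value, k=2):
    # Step 1: the inverted index yields the allow-list of eligible ids.
    allow_list = inverted_index.get((prop, value), set())

    def dist(i):
        vx, vy = vectors[i]
        return (vx - query[0]) ** 2 + (vy - query[1]) ** 2

    # Step 2: only ids on the allow-list may enter the result set.
    return sorted(allow_list, key=dist)[:k]
```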

## Filter strategy

As of `v1.27`, Weaviate supports two filter strategies for the HNSW index type: `sweeping` and `acorn`.

### ACORN

:::info Added in `1.27`
:::

Weaviate `1.27` adds a new filtering algorithm based on the [`ACORN`](https://arxiv.org/html/2403.04871v1) paper. We refer to this as `ACORN`, but Weaviate's implementation is a custom one inspired by the paper. (References to `ACORN` in this document refer to the Weaviate implementation.)

The `ACORN` algorithm is designed to speed up filtered searches with the [HNSW index](./vector-index.md#hierarchical-navigable-small-world-hnsw-index) by the following:

- Objects that do not meet the filters are ignored in distance calculations.
- The algorithm reaches the relevant part of the HNSW graph faster, by using a multi-hop approach to evaluate the neighborhood of candidates.
- Additional entrypoints matching the filter are randomly seeded to speed up convergence to the filtered zone.

The `ACORN` algorithm is especially useful when the filter has low correlation with the query vector - in other words, when the filter excludes many objects in the region of the graph that is most similar to the query vector.

Our internal testing indicates that for restrictive filters with low correlation to the query vector, the `ACORN` algorithm can be significantly faster, especially on large datasets. If filtered search has been a bottleneck for your use case, we recommend enabling `ACORN`.

As of `v1.27`, the `ACORN` algorithm can be enabled by setting the `filterStrategy` field for the relevant HNSW vector index [in the collection configuration](../manage-data/collections.mdx#set-vector-index-parameters).
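As a sketch with the v4 Python client (collection name illustrative, other index settings omitted):

```python
import weaviate
from weaviate.classes.config import Configure, VectorFilterStrategy

client = weaviate.connect_to_local()

client.collections.create(
    "Article",
    vector_index_config=Configure.VectorIndex.hnsw(
        # ACORN filter strategy, available from Weaviate v1.27
        filter_strategy=VectorFilterStrategy.ACORN,
    ),
)

client.close()
```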

### Sweeping

The default filter strategy in Weaviate is referred to as `sweeping`. This strategy is based on the concept of "sweeping" through the HNSW graph.

The algorithm starts at the root node and traverses the graph, evaluating the distance to the query vector at each node, while keeping the filter's "allow list" as context. Nodes that do not match the filter are skipped as results, but the traversal continues along their edges. This process repeats until the desired number of results is reached.
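A minimal best-first traversal in this spirit might look as follows (a simplification: real HNSW is layered and bounds its candidate set, which this sketch does not):

```python
import heapq

def sweeping_search(graph, vectors, query, entry, allow_list, k):
    """Best-first graph traversal: every node guides navigation, but only
    ids on the allow-list may enter the result set."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(vectors[i], query))

    visited = {entry}
    candidates = [(dist(entry), entry)]
    results = []
    while candidates:
        d, node = heapq.heappop(candidates)
        if node in allow_list:       # non-matching nodes are skipped as results
            results.append((d, node))
        for nbr in graph[node]:      # ...but their edges are still followed
            if nbr not in visited:
                visited.add(nbr)
                heapq.heappush(candidates, (dist(nbr), nbr))
    return [n for _, n in sorted(results)[:k]]
```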

## `indexFilterable`

:::info Added in `1.18`
@@ -72,10 +105,14 @@ Thanks to Weaviate's custom HNSW implementation, which persists in following all

The graphic below shows filters of varying levels of restrictiveness. From left (100% of dataset matched) to right (1% of dataset matched) the filters become more restrictive without negatively affecting recall on `k=10`, `k=15` and `k=20` vector searches with filters.

<!-- TODO - replace this graph with ACORN test figures -->

![Recall for filtered vector search](./img/recall-of-filtered-vector-search.png "Recall of filtered vector search in Weaviate")

## Flat-Search Cutoff

<!-- Need to update this section with ACORN figures. -->

Version `v1.8.0` introduces the ability to automatically switch to a flat (brute-force) vector search when a filter becomes too restrictive. This scenario only applies to combined vector and scalar searches. For a detailed explanation of why HNSW requires switching to a flat search on certain filters, see this article in [towards data science](https://towardsdatascience.com/effects-of-filtered-hnsw-searches-on-recall-and-latency-434becf8041c). In short, if a filter is very restrictive (i.e. a small percentage of the dataset is matched), an HNSW traversal becomes exhaustive. In other words, the more restrictive the filter becomes, the closer the performance of HNSW is to a brute-force search on the entire dataset. However, with such a restrictive filter, we have already narrowed down the dataset to a small fraction. So if the performance is close to brute-force anyway, it is much more efficient to only search on the matching subset as opposed to the entire dataset.

The following graphic shows filters with varying restrictiveness. From left (0%) to right (100%), the filters become more restrictive. The **cut-off is configured at ~15% of the dataset** size. This means the right side of the dotted line uses a brute-force search.
@@ -86,70 +123,31 @@ As a comparison, with pure HNSW - without the cutoff - the same filters would lo

![Prefiltering with pure HNSW](./img/prefiltering-pure-hnsw-without-cutoff.png "Prefiltering without cutoff, i.e. pure HNSW")

The cutoff value can be configured as [part of the `vectorIndexConfig` settings in the schema](/developers/weaviate/config-refs/schema/vector-index.md#how-to-configure-hnsw) for each class separately.
The cutoff value can be configured as [part of the `vectorIndexConfig` settings in the schema](/developers/weaviate/config-refs/schema/vector-index.md#how-to-configure-hnsw) for each collection separately.
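As a sketch with the v4 Python client (the cutoff value and collection name are illustrative, and `flat_search_cutoff` assumes a client version that exposes this setting):

```python
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()

client.collections.create(
    "Article",
    vector_index_config=Configure.VectorIndex.hnsw(
        # Switch to a brute-force (flat) search once a filter matches
        # fewer than this many objects (illustrative value).
        flat_search_cutoff=40000,
    ),
)

client.close()
```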

<!-- TODO - replace figures with updated post-roaring bitmaps figures -->

:::note Performance improvements from roaring bitmaps
From `v1.18.0` onwards, Weaviate implements 'Roaring bitmaps' for the inverted index which decreased filtering times, especially for large allow lists. In terms of the above graphs, the *blue* areas will be reduced the most, especially towards the left of the figures.
:::

## Cacheable Filters

Starting with `v1.8.0`, the inverted index portion of a filter can be cached and reused - even across different vector searches. As outlined in the sections above, pre-filtering is a two-step process. First, the inverted index is queried and potential matches are retrieved. This list is then passed to the HNSW index. Reading the potential matches from disk (step 1) can become a bottleneck in the following scenarios:

* when a very large number of IDs match the filter, or
* when complex query operations (e.g. wildcards, long filter chains) are used

If the state of the inverted index has not changed since the last query, these "allow lists" can now be reused.

:::note
Even with the filter portion from cache, each vector search is still performed at query time. This means that two previously unseen vector searches can still make use of the cache as long as they use the same filter.
:::

Example:

```graphql
# search 1
where: {
operator: Equal
path: ["publication"]
valueText: "NYT"
}
nearText: {
concepts: ["housing prices in the western world"]
}

# search 2
where: {
operator: Equal
path: ["publication"]
valueText: "NYT"
}
nearText: {
concepts: ["where do the best wines come from?"]
}
```

The two semantic queries have very little relation, and most likely there will be no overlap in the results. However, because the scalar filter (`publication==NYT`) is the same in both, it can be reused on the second query. This makes the second query as fast as an unfiltered vector search.
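The reuse can be modeled as a cache keyed by the filter (a toy sketch; Weaviate additionally keys entries on a per-row hash so that stale entries can never be served):

```python
# Toy filter-cache sketch: the allow-list for a filter is computed once
# and reused across different vector searches with the same filter.

class FilterCache:
    def __init__(self, inverted_index):
        self.inverted_index = inverted_index
        self.cache = {}
        self.misses = 0

    def allow_list(self, prop, value):
        key = (prop, value)
        if key not in self.cache:
            self.misses += 1  # only the first query pays the lookup cost
            self.cache[key] = frozenset(self.inverted_index.get(key, ()))
        return self.cache[key]
```

Two vector searches with entirely different query vectors but the same `(prop, value)` filter would hit the same cached allow-list.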

## Performance of vector searches with cached filters
<!-- ## Performance of vector searches with cached filters

The following was run single-threaded (i.e. you can add more CPU threads to increase throughput) on a dataset of 1M objects with random vectors of 384d with a warm filter cache (pre-`Roaring bitmap` implementation).

Each search uses a completely unique (random) search vector, meaning that only the filter portion is cached, but not the vector search portion, i.e. on `count=100`, 100 unique query vectors were used with the same filter.
Each search uses a completely unique (random) search vector, meaning that only the filter portion is cached, but not the vector search portion, i.e. on `count=100`, 100 unique query vectors were used with the same filter. -->

<!-- TODO - replace table with updated post-roaring bitmaps figures -->

[![Performance of filtered vector search with caching](./img/filtered-vector-search-with-caches-performance.png "Performance of filtered vector searches with 1M 384d objects")](./img/filtered-vector-search-with-caches-performance.png)
<!-- [![Performance of filtered vector search with caching](./img/filtered-vector-search-with-caches-performance.png "Performance of filtered vector searches with 1M 384d objects")](./img/filtered-vector-search-with-caches-performance.png) -->

:::note
<!-- :::note
Wildcard filters show considerably worse performance than exact match filters. This is because - even with caching - multiple rows need to be read from disk to make sure that no stale entries are served when using wildcards. See also "Automatic Cache Invalidation" below.
:::
::: -->

## Automatic Cache Invalidation
<!-- ## Automatic Cache Invalidation

The cache is built in a way that it cannot ever serve a stale entry. Any write to the inverted index updates a hash for the specific row. This hash is used as part of the key in the cache. This means that if the underlying inverted index is changed, the new query would first read the updated hash and then run into a cache miss (as opposed to ever serving a stale entry). The cache has a fixed size and entries for stale hashes - which cannot be accessed anymore - are overwritten when it runs full.
The cache is built in a way that it cannot ever serve a stale entry. Any write to the inverted index updates a hash for the specific row. This hash is used as part of the key in the cache. This means that if the underlying inverted index is changed, the new query would first read the updated hash and then run into a cache miss (as opposed to ever serving a stale entry). The cache has a fixed size and entries for stale hashes - which cannot be accessed anymore - are overwritten when it runs full. -->

## Further resources
:::info Related pages
2 changes: 1 addition & 1 deletion developers/weaviate/concepts/index.md
@@ -68,7 +68,7 @@ You can learn more about the individual components in this figure by following t
* Speeding up specific processes
* Preventing bottlenecks

**[Filtered vector search](./prefiltering.md)**
**[Filtered vector search](./filtering.md)**
* Combine vector search with filters
* Learn how combining an HNSW with an inverted index leads to high-recall, high-speed filtered queries

8 changes: 4 additions & 4 deletions developers/weaviate/concepts/indexing.md
@@ -16,7 +16,7 @@ Some things to bear in mind:

* Especially for large datasets, configuring the indexes is important because the more you index, the more storage is needed.
* A rule of thumb -- if you don't query over a specific field or vector space, don't index it.
* One of Weaviate's unique features is how the indexes are configured (learn more about this [here](../concepts/prefiltering.md)).
* One of Weaviate's unique features is how the indexes are configured (learn more about this [here](../concepts/filtering.md)).

## Vector indexes

@@ -35,12 +35,12 @@ For more information on vector indexes, see the [Vector Indexing](./vector-index
There are three inverted index types in Weaviate:

- `indexSearchable` - a searchable index for BM25 or hybrid search
- `indexFilterable` - a match-based index for fast [filtering](./prefiltering.md) by matching criteria
- `indexRangeFilters` - a range-based index for [filtering](./prefiltering.md) by numerical ranges
- `indexFilterable` - a match-based index for fast [filtering](./filtering.md) by matching criteria
- `indexRangeFilters` - a range-based index for [filtering](./filtering.md) by numerical ranges

Each inverted index can be set to `true` (on) or `false` (off) on a property level. The `indexSearchable` and `indexFilterable` indexes are on by default, while the `indexRangeFilters` index is off by default.

The filterable indexes are only capable of [filtering](./prefiltering.md), while the searchable index can be used for both searching and filtering (though not as fast as the filterable index).
The filterable indexes are only capable of [filtering](./filtering.md), while the searchable index can be used for both searching and filtering (though not as fast as the filterable index).

So, setting `"indexFilterable": false` and `"indexSearchable": true` (or not setting it at all) will have the trade-off of worse filtering performance but faster imports (due to only needing to update one index) and lower disk usage.
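With the v4 Python client, these flags might be set per property as follows (collection and property names are illustrative):

```python
import weaviate
from weaviate.classes.config import Property, DataType

client = weaviate.connect_to_local()

client.collections.create(
    "Article",
    properties=[
        Property(
            name="title",
            data_type=DataType.TEXT,
            index_searchable=True,   # BM25 / hybrid search (default: on)
            index_filterable=True,   # fast match-based filtering (default: on)
        ),
        Property(
            name="wordCount",
            data_type=DataType.INT,
            index_range_filters=True,  # range-based filtering (default: off)
        ),
    ],
)

client.close()
```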
