[RFC] Derived Source for Vectors #2377

jmazanec15 · 2025-01-09T18:10:52Z

Introduction

This RFC proposes removing knn_vector fields from the "_source" field without losing any of the OpenSearch functionality that "_source" enables. "_source" in this context refers to the per-document field in OpenSearch that stores the original source provided by the user as a StoredField in Lucene. See SourceFieldMapper for more details.

This is a follow-up to #1571 and #1572.

Problem

Currently, vectors for native indices are stored in 3 places by default:

_source stored field: vectors are stored along with the rest of the JSON body of the document (i.e. .fdt)
Native library files: the ANN structure and vectors are stored (i.e. .hnsw)
FlatVectorsFormat: basically doc values for vectors (i.e. .vec)

In an experiment with 10k 128-dimensional vectors, the size breakdown of these files was:

Total Index Size	24.3 MB
HNSW files	5.91 MB
Doc values	3.8 MB
Source	14.6 MB

With BEST_COMPRESSION codec:

Total Index Size	18.3 MB
HNSW files	5.91 MB
Doc values	3.75 MB
Source	8.64 MB

From this, we can see that 47%-60% of the storage goes towards _source. For more details, see: opensearch-project/OpenSearch#6356. Worse yet, for our disk-based feature with quantized vectors (vectors that get compressed in a lossy fashion), the native library files will be smaller than the FlatVectorsFormat file, so _source will take up an even larger percentage of the storage.
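The 47%-60% range can be reproduced from the two tables above. The snippet below is only a worked arithmetic check; the dictionary keys and the helper name `source_share` are made up for illustration.

```python
# Sizes in MB, copied from the two tables above.
default_codec = {"hnsw": 5.91, "doc_values": 3.8, "source": 14.6}
best_compression = {"hnsw": 5.91, "doc_values": 3.75, "source": 8.64}


def source_share(sizes):
    """Return the fraction of the total index size taken by _source."""
    return sizes["source"] / sum(sizes.values())


print(f"default codec:    {source_share(default_codec):.0%}")     # 60%
print(f"BEST_COMPRESSION: {source_share(best_compression):.0%}")  # 47%
```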

A typical user should not need to retrieve the source vector from OpenSearch. Thus, storing the vectors in _source poses significant problems for users with minimal benefits:

Users have to pay to store data they do not really need or use. This issue gets even more pronounced for disk-based vector search, where memory is no longer the bottleneck. Users end up having to provision their clusters based on storage capacity.
Vectors in _source eat up serialization/deserialization bandwidth. Whenever the _source field needs to be serialized or deserialized (e.g. written to disk, shard migration, snapshot, etc.), a major portion of the bandwidth of this channel is consumed by the vectors in _source. This can affect many areas of a user's vector search workload, such as indexing throughput, search speed, page cache utilization, shard migration, etc. Again, this gets worse with disk-based vector search, where all resources are much more scarce.

Because of this, we generally recommend that users disable storing the vectors in the source. However, this has serious limitations:

They will not be able to reindex the data
The update and update by query APIs do not work
It requires understanding a lot of concepts, which leads to a poor out-of-the-box experience

So, enter “derived_source”. We take inspiration from the “derived fields” feature of OpenSearch, which uses one format of data for another purpose on the fly. The idea is that we already have the vectors available via the FlatVectorsFormat files (.vec). When we need to read the _source, we should just inject the vector fields into the _source field from the FlatVectorsFormat file. The effect will be that all OpenSearch functionality keeps working and we get a potential reduction of more than 50% in storage space for vectors.

Proposed Solutions

[Option # 1] (Preferred) Implement Custom StoredFieldsFormat in existing KNNCodec

Because the KNN plugin already implements its own Codec, we can override the StoredFieldsFormat to intercept and inject the vector fields when needed. This format would use the delegate pattern (as the k-NN plugin already does with core codecs) and only intervene with respect to accesses on the _source stored field on read and write (see PoC).
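To make the delegate idea concrete, here is a minimal sketch in Python rather than the actual Lucene/k-NN Java code. Every name in it (`InMemoryStoredFields`, `DerivedSourceStoredFields`, `vector_reader`) is hypothetical, and it assumes that _source is a flat JSON object and that vectors can be looked up by field name and document id. It is not the PoC implementation.

```python
import json


class InMemoryStoredFields:
    """Hypothetical stand-in for the delegate (core) stored fields format."""

    def __init__(self):
        self._docs = {}

    def write(self, doc_id, field, value):
        self._docs.setdefault(doc_id, {})[field] = value

    def read(self, doc_id, field):
        return self._docs[doc_id][field]


class DerivedSourceStoredFields:
    """Delegates everything, intervening only on the _source stored field."""

    def __init__(self, delegate, vector_fields, vector_reader):
        self.delegate = delegate
        self.vector_fields = list(vector_fields)
        # Assumed callable: (field_name, doc_id) -> vector or None.
        # It plays the role of reading from the .vec file.
        self.vector_reader = vector_reader

    def write(self, doc_id, field, value):
        if field == "_source":
            # Strip the vector fields so they are not stored twice.
            source = json.loads(value)
            for name in self.vector_fields:
                source.pop(name, None)
            value = json.dumps(source).encode("utf-8")
        self.delegate.write(doc_id, field, value)

    def read(self, doc_id, field):
        value = self.delegate.read(doc_id, field)
        if field != "_source":
            return value
        # Inject the vector fields back so callers see the full source.
        source = json.loads(value)
        for name in self.vector_fields:
            vector = self.vector_reader(name, doc_id)
            if vector is not None:
                source[name] = vector
        return json.dumps(source).encode("utf-8")
```

Because the wrapper only touches `_source`, every other stored field passes through the delegate unchanged, which is what keeps the injection transparent to callers.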

Pros

Great out-of-the-box experience! Users would not need to provide any special configuration in order to get this benefit. On search, they would still need to manually exclude the vector fields, but this is consistent with the existing OpenSearch behavior.
Robust feature support. Because we are modifying the _source at a very low level, we can be confident that features built on top of _source would work without any issues. The _source injection would be totally transparent.

Cons

Unable to access OpenSearch resources: to implement this option, we would extend our existing codec. The codec abstraction is at the Lucene level, which makes it difficult to get some of the OpenSearch dependencies we would need. For instance, for nested fields, in order to get the parent/child filters, we would need to either directly use the FieldsFormat/PostingFormat (as was done in the PoC) or somehow create a searcher. It is unclear exactly what limitations we will hit here.
Coupling of different Format readers feels like an anti-pattern. Having the StoredFieldsReader rely on KNNVectorsReader creates a dependency chain between the two. This opens the door to a circular dependency in the future (although no concrete situations come to mind).

For this option, we created a PoC to showcase feasibility. The PoC was able to support the following features:

[Flat vector mappings] Injecting vectors into source
[Flat vector mappings] Reindexing
[Flat vector mappings] Update by query
[Nested] Injecting vectors into source for single nested mapping without deletes
[Nested] Reindexing
[Nested] Update by query

[Option # 2] Introduce a dedicated FetchSubPhase to inject vector into source

As an alternative, as was done in #1572 by @luyuncheng, we can also create a custom FetchSubPhase in order to prepare the payload with the injected source that we can return to the caller. Generally, this will be where _source gets read (but this is not guaranteed).

The general workflow for users would be:

Create an index with the vector fields explicitly excluded from source
On search/get, the DerivedVectorSearchFetchSubPhase would intercept the SearchResponse (without the vectors) and add the excluded vector fields back into SearchResponse

This approach has the following pros/cons:

Pros

Easy access to required OpenSearch resources: _source is an OpenSearch concept, and Lucene just sees it as a stored field. Thus, most of the configuration details around it are stored in the OpenSearch layer (as opposed to Lucene), e.g. MappedFieldTypes. Implementing at the FetchSubPhase level gives us access to these required resources. This also makes it easier to handle other OpenSearch-specific cases (such as nested fields).

Cons

A FetchSubPhase from a plugin would execute after all core FetchSubPhases. Thus, the core FetchSubPhases would not have access to the vector source. There are not any explicit use cases I can think of where they need it, but if a user comes up with a case, this would be a hard limitation.
Non-deterministic ordering of plugin-based FetchSubPhases: OpenSearch executes FetchSubPhases sequentially and controls the ordering of the FetchSubPhases that plugins add. Thus, if another plugin adds a FetchSubPhase, it is not clear whether the vector source will be present for it to use.
The overall experience is inconsistent with the existing OpenSearch experience. A user would need to exclude the vector fields from source, but would still get them in the search response.
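The workflow and the first con can be illustrated with a toy model of the fetch pipeline. This is a sketch in Python, not OpenSearch code: the phase functions, `STORED_SOURCE`, and `FLAT_VECTORS` are hypothetical stand-ins, and the only behavior it assumes is the one stated above, namely that plugin sub-phases run after the core ones.

```python
# _source as stored when the vector field is excluded at index creation time.
STORED_SOURCE = {1: {"title": "doc one"}}
# Stand-in for the vectors available from the .vec file.
FLAT_VECTORS = {("my_vector1", 1): [0.1, 0.2]}


def load_source(hit):
    """Core phase: load the stored _source, which has no vectors."""
    hit["_source"] = dict(STORED_SOURCE[hit["_id"]])


def record_visible_fields(hit):
    """Core phase that depends on _source; it only sees what is present."""
    hit["seen_by_core_phase"] = sorted(hit["_source"])


def inject_vectors(hit):
    """Plugin phase: add the excluded vector fields back into the hit."""
    for (field, doc_id), vector in FLAT_VECTORS.items():
        if doc_id == hit["_id"]:
            hit["_source"][field] = vector


def run_fetch(doc_ids, core_phases, plugin_phases):
    """Run core phases first, then plugin phases, for every hit."""
    hits = []
    for doc_id in doc_ids:
        hit = {"_id": doc_id}
        for phase in core_phases + plugin_phases:
            phase(hit)
        hits.append(hit)
    return hits


hits = run_fetch([1], [load_source, record_visible_fields], [inject_vectors])
print(hits[0]["seen_by_core_phase"])  # the core phase never saw my_vector1
print(hits[0]["_source"])             # the response does contain it
```

In this model the caller gets the vector back, but any phase that ran before `inject_vectors` worked on a source without it.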

[Option # 3] Implement Custom StoredFieldVisitor

The security plugin has a feature called “Field-level security” that lets admins limit access for different users at the field level. This feature requires automatically filtering or masking privileged fields from _source, which is similar to what we want to do for vectors. They do this by implementing a custom StoredFieldVisitor, FlsStoredFieldsVisitor. The StoredFieldVisitor is called in the StoredFieldsReader for a given document and a given field, so their visitor can intercept the “_source” field and filter or mask the fields they want. They use the “onIndexModule” extension point to inject this via a custom readerWrapper.

We could do something similar for vector derived source, where instead of filtering and masking, we inject the vector fields.
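As a rough illustration of the visitor idea, the sketch below wraps the caller's visitor and rewrites only the “_source” field on its way through. It is written in Python with hypothetical names (`SourceCollectingVisitor`, `VectorInjectingVisitor`, `visit_document`) and a simplified single-method visitor interface; it is not the Lucene StoredFieldVisitor API or the security plugin's implementation.

```python
import json


class SourceCollectingVisitor:
    """Plays the role of the visitor the caller passes to the reader."""

    def __init__(self):
        self.fields = {}

    def binary_field(self, name, value):
        self.fields[name] = value


class VectorInjectingVisitor:
    """Wraps another visitor and injects vectors when _source is visited."""

    def __init__(self, delegate, vectors_for_doc):
        self.delegate = delegate
        # Assumed shape: {field_name: vector} for the document being visited.
        self.vectors_for_doc = vectors_for_doc

    def binary_field(self, name, value):
        if name == "_source":
            source = json.loads(value)
            source.update(self.vectors_for_doc)
            value = json.dumps(source).encode("utf-8")
        self.delegate.binary_field(name, value)


def visit_document(stored_fields, visitor):
    """Stand-in for the reader: call the visitor once per stored field."""
    for name, value in stored_fields.items():
        visitor.binary_field(name, value)
```

Where field-level security removes keys in the equivalent of `binary_field`, this sketch adds them, which is the only difference in approach.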

Pros

Somewhat easy access to required OpenSearch resources: we have everything on the OpenSearch side because the extension point is onIndexModule
Closer than Option #1 to actual _source field retrieval, which means that more features will be supported out of the box

Cons

Incompatible with security plugin — indexModule.setReaderWrapper can only be called once. Thus, as it stands now, security and knn derived source would not work together.
Inconsistent user experience — A user will still need to exclude the vector fields from source, but still get them in the search response.

Summary

We are proposing Option 1 because it provides a UX consistent with the existing OpenSearch UX and extends a point that is low-level enough to be generally robust.

Proposed User Experience

The user interface will have a setting that indicates whether or not to use the derived_source feature. The example below shows it as an index setting that defaults to true.

// knn.derived_source.enabled accepts true or false and defaults to true
PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true,
      "knn.derived_source.enabled": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 2
      }
    }
  }
}

// On search, my_vector1 is excluded
POST some_index/_search
{
  "_source": {
    "excludes": ["my_vector1"]
  },
  ...
}

Open Questions

Avoid reconstruction of vectors on searches that later filter it out

In the current PoC, if someone excludes a field as in the request below, the StoredFieldsReader still injects the vector into the document, and OpenSearch logic filters it out afterwards. Instead, we need a way to skip reconstruction in the first place if the field is going to be excluded anyway. This is a bit tricky to do and may involve a change in core. One idea is to pass this information through the FieldsVisitor and use some kind of type casting to retrieve it in the StoredFieldsReader component.

// On search, my_vector1 is excluded
POST some_index/_search
{
  "_source": {
    "excludes": ["my_vector1"]
  },
  ...
}
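One way to picture the type-casting idea is sketched below in Python. The visitor classes, `read_source`, and the per-field loader callables are all hypothetical; the sketch assumes only that the reader can check the visitor's type and that the excludes can be carried on it. Whether core can expose this information is exactly the open question.

```python
import json


class PlainVisitor:
    """A visitor that carries no information about source filtering."""


class SourceFilterAwareVisitor(PlainVisitor):
    """Hypothetical visitor that carries the request's _source excludes."""

    def __init__(self, excludes):
        self.excludes = set(excludes)


def read_source(stored_source, vector_loaders, visitor):
    """Rebuild _source, skipping vectors the visitor says are excluded.

    vector_loaders maps field name -> zero-argument callable that performs
    the (expensive) vector read, so a skipped field is never loaded at all.
    """
    source = json.loads(stored_source)
    # The "type cast": only a filter-aware visitor can tell us what to skip.
    if isinstance(visitor, SourceFilterAwareVisitor):
        excluded = visitor.excludes
    else:
        excluded = set()
    for field, load_vector in vector_loaders.items():
        if field in excluded:
            continue
        source[field] = load_vector()
    return source
```

With a plain visitor the behavior falls back to what the PoC does today: the vector is reconstructed and any filtering happens later.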

Next Steps

Publish high level design
Create a PoC/proposal in core for solving the redundant vector reconstruction issue
Publish low level design


navneet1v · 2025-01-09T22:44:10Z

From this, we can see that 47%-60% of the storage goes towards _source. For more details, see: opensearch-project/OpenSearch#6356. Worse yet, for our disk-based feature with quantized vectors (vectors that get compressed in a lossy fashion), the native library files will be smaller than the FlatVectorsFormat file, so _source will take up an even larger percentage of the storage.

I think when a model gives fp32 vectors, this percentage will be higher. With 128D, the number of characters is pretty low compared with something like the Cohere datasets. I have seen this percentage go up to 80% too.

The user interface will have a setting that indicates whether or not to use the derived_source feature.

Any reason why we cannot enable it by default? I think we should enable it by default. WDYT?

@jmazanec15 one more benefit of removing the vector field from source is a speedup in force merge. I was running some experiments where I saw that, without vectors in _source, there is a clearly visible speedup in the force merge of vector indices.

// On search, my_vector1 is excluded
POST some_index/_search
{
  "_source": {
    "excludes": ["my_vector1"]
  },
  ...
}

I didn't understand why the user needs to do this. I was thinking we would just exclude the vector field while creating the index.

jmazanec15 · 2025-01-10T17:29:03Z

I think when a model gives fp32 vectors, this percentage will be higher. With 128D, the number of characters is pretty low compared with something like the Cohere datasets. I have seen this percentage go up to 80% too.

That makes sense. 80% wouldn't surprise me too much.

Any reason why we cannot enable it by default? I think we should enable it by default. WDYT?

Right - this will default to true - but there will be a setting to disable it. One reason to disable it may be for users who are pulling vectors from OpenSearch as a vector store. It may be slower in this respect.

@jmazanec15 one more benefit of removing the vector field from source is a speedup in force merge. I was running some experiments where I saw that, without vectors in _source, there is a clearly visible speedup in the force merge of vector indices.

Oh nice - yes, I think there will be a lot of side-effect benefits from this.

I didn't understand why the user needs to do this. I was thinking we would just exclude the vector field while creating the index.

They are not excluding the vector field when creating the index. It will actually be fully transparent. Thus, on search, if they do not exclude the field, it will be returned (like it is today). This keeps the experience consistent. If we wanted to exclude vector fields by default, this could be taken up separately.

navneet1v · 2025-01-10T23:58:29Z

It will actually be fully transparent.

Can you please elaborate more on this?

They are not excluding the vector field when creating the index. It will actually be fully transparent. Thus, on search, if they do not exclude the field, it will be returned (like it is today). This keeps the experience consistent. If we wanted to exclude vector fields by default, this could be taken up separately.

Sorry, I am a little confused on this part. Let me try to ask the question again. Are we suggesting that customers exclude vector fields in the index mapping or not?

jmazanec15 added RFC Request for comments v2.19.0 labels Jan 9, 2025

jmazanec15 added this to Vector Search RoadMap Jan 9, 2025

github-project-automation bot moved this to Backlog in Vector Search RoadMap Jan 9, 2025

opensearch-infra bot added this to OpenSearch Roadmap Jan 9, 2025

jmazanec15 moved this from Backlog to 2.19.0 in Vector Search RoadMap Jan 9, 2025

github-project-automation bot moved this to New in OpenSearch Roadmap Jan 9, 2025

github-actions bot added the untriaged label Jan 9, 2025

jmazanec15 removed the untriaged label Jan 9, 2025

jmazanec15 mentioned this issue Jan 10, 2025

Reuse KNNVectorFieldData for reduce disk usage #1571
