
Optimization options for source storage for particular fields #6356

Open
jmazanec15 opened this issue Feb 17, 2023 · 10 comments
Labels
distributed framework enhancement Enhancement or improvement to existing feature or request

Comments

@jmazanec15
Member

Currently, for the k-NN plugin, we introduce a data type called the knn_vector. For the purposes of this issue, this data type allows users to define fixed-dimensional arrays of floating point numbers (e.g. [1.2,3.2,4.2,5.2]). On disk, we have a codec that serializes the vectors as binary doc values. So, 1 million 128-dimensional vectors would consume 1,000,000 * 128 * 4 = 512,000,000 bytes ~= 488 MB.
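The sizing above can be checked with quick back-of-envelope arithmetic (nothing OpenSearch-specific, just float32 byte counts):

```python
# Each float32 component of a knn_vector costs 4 bytes as a binary doc value.
num_vectors = 1_000_000
dimension = 128
bytes_per_float = 4

total_bytes = num_vectors * dimension * bytes_per_float
print(total_bytes)                    # 512000000
print(round(total_bytes / 1024**2))   # ~488 (MiB)
```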

The problem is that the vectors also get stored in the source field. For 10K vectors with dimension=100, file storage with the BEST_SPEED codec looks like this:

| File | Size |
| --- | --- |
| Total index size | 24.3 MB |
| HNSW files | 5.91 MB |
| Doc values | 3.8 MB |
| Source | 14.6 MB |

With the BEST_COMPRESSION codec:

| File | Size |
| --- | --- |
| Total index size | 18.3 MB |
| HNSW files | 5.91 MB |
| Doc values | 3.75 MB |
| Source | 8.64 MB |

As you can see, in both cases the source takes up significantly more space than the doc values.

Part of the problem lies in the fact that if a floating point number is represented as a string averaging 16 characters, the total string storage size for the vectors in the example above will be 1,000,000 * 128 * 16 = 2,048,000,000 bytes ~= 1953 MB, not including additional characters like commas and spaces. I understand that this text gets compressed, but as the table above shows, the source field is still very large.
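The text overhead is easy to measure directly (a standalone sketch; the vector values here are synthetic):

```python
import json
import random

# Compare the JSON text size of a vector against its binary float32 size.
random.seed(0)
dimension = 128
vector = [random.uniform(-1, 1) for _ in range(dimension)]

json_bytes = len(json.dumps(vector).encode("utf-8"))  # chars in the source text
binary_bytes = dimension * 4                          # float32 doc-value bytes

print(json_bytes, binary_bytes)  # JSON text is several times larger
```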

I'm wondering whether it would be possible to optimize the stored field representation at the field level. I am aware of SourceMapper, where we are able to filter based on fields. I'm wondering whether it would be feasible to hook in there and modify the representation for certain types before adding it as a stored field.

@jmazanec15 jmazanec15 added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 17, 2023
@msfroh
Collaborator

msfroh commented Feb 21, 2023

@jmazanec15 do you need to include the field in the source?

You could exclude it from the source with

"mapping": {
  "_source": {
    "excludes": [
      "knn_vector_field" // Or whatever the field is named
    ]
  }
}
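For concreteness, here is a sketch of a full index-creation body with such an exclusion ("my_vector" is a placeholder field name, not something from this issue):

```python
import json

# Sketch of an index-creation request body that excludes a knn_vector field
# from _source. Field name and dimension are placeholders.
body = {
    "mappings": {
        "_source": {"excludes": ["my_vector"]},
        "properties": {"my_vector": {"type": "knn_vector", "dimension": 4}},
    }
}
print(json.dumps(body, indent=2))
```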

Of course, that doesn't work if you need to reindex, so it's not without consequences.

I wonder if we could do something clever to merge the stored source with the vector field doc values to reconstruct the full source without needing to store the vector field.
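The reconstruction idea could be sketched like this (function and field names are hypothetical, not OpenSearch APIs): store `_source` without the vector, then splice the vector back in from doc values whenever the full source is requested.

```python
# Hypothetical sketch: merge a vector read from doc values back into the
# stripped stored source to reconstruct the full document.

def reconstruct_source(stored_source: dict, doc_values: dict, vector_field: str) -> dict:
    """Return the stored source with the vector field spliced back in."""
    full = dict(stored_source)
    full[vector_field] = doc_values[vector_field]
    return full

stored = {"title": "doc1"}                    # vector excluded at index time
dv = {"my_vector": [1.2, 3.2, 4.2, 5.2]}      # per-doc values from the codec
print(reconstruct_source(stored, dv, "my_vector"))
```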

@jmazanec15
Member Author

@msfroh I believe excluding the field from source no longer avoids storing it completely. See https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/mapper/SourceFieldMapper.java#L214. A recovery source gets added if source is disabled.

Using doc values or other representations of vector values would be very nice, but I am not sure how large of a change that would require. I would need to look into that more.

@jmazanec15
Member Author

Thinking about this more, the problem may be that the content type and the source need to be the same format. So, in the case above, if the content type is JSON, consumers will expect the bytes stored in source to be formatted as JSON.

Alternatively, a field's source could be pulled from the document source and formatted differently (similar to filtering), but reconstructing the source when it is needed may add a lot of complexity. This might require changing how we read from source: we might have to add another source meta field that specifies how the source should be put back together from other formats, such as doc values or potentially a different type-optimized source field. I am not sure at the moment how this would be done.

A third option is to accept this redundancy but reduce the overhead by using a different format such as Protobuf; JSON is not very efficient for numeric types. In #4559, they are considering supporting different input formats for similar reasons.

@itiyamas I saw you opened #4559. I am wondering what your thoughts on this might be?

@navneet1v
Contributor

Did we try overriding the knn_vector field type to produce a more compressed representation when creating the source field in the SourceFieldMapper, just like we would for an int or a float field type?

@jmazanec15
Member Author

I looked into this a little bit with a debugger. The problem is that the content being stored is the content coming straight from the Netty HTTP request. With the current setup, I am not sure we can change those bytes.

@navneet1v
Contributor

> I looked into this a little bit with a debugger. The problem is that the content being stored is the content coming straight from the Netty HTTP request. With the current setup, I am not sure we can change those bytes.

Let me check and get back to you on this.

I just had one more question: when we want to move away from JSON to something better like Protobuf, do we want to do it for the whole source field or just this k-NN field? I think it's better if we can move the whole source from JSON to a better format.

@jmazanec15
Member Author

I think the idea with #4559 is to be able to parse a different input format, not necessarily to change how source is stored.

I think changing how different fields are stored is a different problem. It would require manipulating source in the SourceFieldMapper - probably removing a given field from the document and then adding it as a different source field. Putting this back together would be the challenge.

@navneet1v
Contributor

> I think the idea with #4559 is to be able to parse a different input format, not necessarily to change how source is stored.
>
> I think changing how different fields are stored is a different problem. It would require manipulating source in the SourceFieldMapper - probably removing a given field from the document and then adding it as a different source field. Putting this back together would be the challenge.

With this issue, are you looking for potential solutions, or something else? I am confused by your last reply.

@jmazanec15
Member Author

Source storage is larger than both the HNSW storage and the doc values storage for k-NN indices. The goal of this issue is to find potential solutions for reducing the storage requirement of source for docs with k-NN vector fields.

One approach is to use a better input format, like Protobuf. Another approach (I'm not sure of its feasibility) is to remove the vector field from the source document and write it as a separate stored field containing the raw binary floating point representation, so that vector source space consumption would be 4 * dimension * num_vectors bytes (roughly the size of doc values).
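The raw-binary stored-field idea can be sketched with Python's struct module standing in for the actual codec (the encode/decode helpers are hypothetical). Note that round-tripping through float32 loses a little precision relative to Python's float64:

```python
import struct

# Hypothetical stored-field representation: raw little-endian float32 bytes,
# so a vector costs exactly 4 * dimension bytes instead of its JSON text.

def encode_vector(vec):
    return struct.pack(f"<{len(vec)}f", *vec)

def decode_vector(buf):
    return list(struct.unpack(f"<{len(buf) // 4}f", buf))

vec = [1.2, 3.2, 4.2, 5.2]
buf = encode_vector(vec)
print(len(buf))             # 16 bytes = 4 * dimension
print(decode_vector(buf))   # values back, to float32 precision
```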

However, alternative to this, as @msfroh brought up, if the vectors were removed from the source entirely and when the source was needed it read from the doc values, then we wouldnt need to store vector information in source at all - saving the most space. However, I am not really sure the feasibility of doing something like this.

@navneet1v
Contributor

@jmazanec15 I created another GH issue which can help us remove the vectors from stored fields and recovery source both. Ref: #13490
