
Optimization options for source storage for particular fields #6356

Open
jmazanec15 opened this issue Feb 17, 2023 · 10 comments
Labels
distributed framework enhancement Enhancement or improvement to existing feature or request

Comments

@jmazanec15
Member

Currently, for the k-NN plugin, we introduce a data type called the knn_vector. For the purposes of this issue, this data type allows users to define fixed-dimensional arrays of floating point numbers (e.g. [1.2,3.2,4.2,5.2]). On disk, we have a codec that serializes the vectors as binary doc values. So, 1 million 128-dimensional vectors would consume 1,000,000 * 128 * 4 = 512,000,000 bytes ~= 488 MB.
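The sizing above can be checked with quick back-of-envelope arithmetic (nothing OpenSearch-specific, just float32 byte counts):

```python
# Each float32 component of a knn_vector costs 4 bytes as a binary doc value.
num_vectors = 1_000_000
dimension = 128
bytes_per_float = 4

total_bytes = num_vectors * dimension * bytes_per_float
print(total_bytes)                    # 512000000
print(round(total_bytes / 1024**2))   # ~488 (MiB)
```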

The problem is that the vectors also get stored in the source field. For 10K vectors with dimension=100, file storage with the BEST_SPEED codec looks like this:

| File | Size |
| --- | --- |
| Total index size | 24.3 MB |
| HNSW files | 5.91 MB |
| Doc values | 3.8 MB |
| Source | 14.6 MB |

With the BEST_COMPRESSION codec:

| File | Size |
| --- | --- |
| Total index size | 18.3 MB |
| HNSW files | 5.91 MB |
| Doc values | 3.75 MB |
| Source | 8.64 MB |

As you can see, in both cases the source takes up significantly more space than the doc values.

Part of the problem lies in the fact that if a floating point number is represented as a string averaging 16 characters, the total string storage size for the vectors in the example above will be 1,000,000 * 128 * 16 = 2,048,000,000 bytes ~= 1953 MB, not including additional characters like commas and spaces. I understand that this text gets compressed, but as the table above shows, the source field is still very large.
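The text overhead is easy to measure directly (a standalone sketch; the vector values here are synthetic):

```python
import json
import random

# Compare the JSON text size of a vector against its binary float32 size.
random.seed(0)
dimension = 128
vector = [random.uniform(-1, 1) for _ in range(dimension)]

json_bytes = len(json.dumps(vector).encode("utf-8"))  # chars in the source text
binary_bytes = dimension * 4                          # float32 doc-value bytes

print(json_bytes, binary_bytes)  # JSON text is several times larger
```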

I'm wondering whether it would be possible to optimize the stored field representation at the field level. I am aware of SourceMapper, where we are able to filter based on fields. I'm wondering whether it would be feasible to hook in there and modify the representation for certain types before adding it as a stored field.

@jmazanec15 jmazanec15 added enhancement Enhancement or improvement to existing feature or request untriaged labels Feb 17, 2023
@msfroh
Collaborator

msfroh commented Feb 21, 2023

@jmazanec15 do you need to include the field in the source?

You could exclude it from the source with

"mapping": {
  "_source": {
    "excludes": [
      "knn_vector_field" // Or whatever the field is named
    ]
  }
}
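For concreteness, here is a sketch of a full index-creation body with such an exclusion ("my_vector" is a placeholder field name, not something from this issue):

```python
import json

# Sketch of an index-creation request body that excludes a knn_vector field
# from _source. Field name and dimension are placeholders.
body = {
    "mappings": {
        "_source": {"excludes": ["my_vector"]},
        "properties": {"my_vector": {"type": "knn_vector", "dimension": 4}},
    }
}
print(json.dumps(body, indent=2))
```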

Of course, that doesn't work if you need to reindex, so it's not without consequences.

I wonder if we could do something clever to merge the stored source with the vector field doc values to reconstruct the full source without needing to store the vector field.
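The reconstruction idea could be sketched like this (function and field names are hypothetical, not OpenSearch APIs): store `_source` without the vector, then splice the vector back in from doc values whenever the full source is requested.

```python
# Hypothetical sketch: merge a vector read from doc values back into the
# stripped stored source to reconstruct the full document.

def reconstruct_source(stored_source: dict, doc_values: dict, vector_field: str) -> dict:
    """Return the stored source with the vector field spliced back in."""
    full = dict(stored_source)
    full[vector_field] = doc_values[vector_field]
    return full

stored = {"title": "doc1"}                    # vector excluded at index time
dv = {"my_vector": [1.2, 3.2, 4.2, 5.2]}      # per-doc values from the codec
print(reconstruct_source(stored, dv, "my_vector"))
```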

@jmazanec15
Member Author

@msfroh I believe excluding the field from source no longer avoids storing it completely. See https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/mapper/SourceFieldMapper.java#L214. A recovery source gets added if source is disabled.

Using doc values or other representations of vector values would be very nice, but I am not sure how large of a change that would require. I would need to look into that more.

@jmazanec15
Member Author

Thinking about this more, the problem may be that the content type and the source need to be the same format. So, in the case above, if the content type is JSON, consumers will expect the bytes stored in source to be formatted as JSON.

Alternatively, a field's source could be pulled from the document source and formatted differently (similar to filtering), but reconstructing the source when it is needed may add a lot of complexity. This might require changing how we read from source: we might have to add another source meta field that specifies how the source should be put back together from other formats, such as doc values or potentially a different type-optimized source field. I am not sure at the moment how this would be done.

A third option is to accept this redundancy but reduce the overhead by using a different format such as Protobuf; JSON is not very efficient for numeric types. In #4559, they are considering supporting different input formats for similar reasons.

@itiyamas I saw you opened #4559. I am wondering what your thoughts on this might be?

@navneet1v
Contributor

Did we try overriding the knn_vector field type to produce a more compressed representation when creating the source field in the SourceFieldMapper, just like we would for an int or a float field type?

@jmazanec15
Member Author

I looked into this a little bit with a debugger. The problem is that the content being stored is the content coming straight from the Netty HTTP request. With the current setup, I am not sure we can change those bytes.

@navneet1v
Contributor

> I looked into this a little bit with a debugger. The problem is that the content being stored is the content coming straight from the Netty HTTP request. With the current setup, I am not sure we can change those bytes.

Let me check and get back to you on this.

I just had one more question: when we want to move away from JSON to something better like Protobuf, do we want to do it for the whole source field or just this k-NN field? I think it's better if we can move the whole source from JSON to a better format.

@jmazanec15
Member Author

I think the idea with #4559 is to be able to parse a different input format, not necessarily to change how source is stored.

I think changing how different fields are stored is a different problem. It would require manipulating source in the SourceFieldMapper - probably removing a given field from the document and then adding it as a different source field. Putting this back together would be the challenge.

@navneet1v
Contributor

> I think the idea with #4559 is to be able to parse a different input format, not necessarily to change how source is stored.
>
> I think changing how different fields are stored is a different problem. It would require manipulating source in the SourceFieldMapper - probably removing a given field from the document and then adding it as a different source field. Putting this back together would be the challenge.

With this issue, are you looking for potential solutions, or something else? I am confused by your last reply.

@jmazanec15
Member Author

Source storage is larger than both the HNSW storage and the doc values storage for k-NN indices. The goal of this issue is to find potential solutions for reducing the storage requirement of source for docs with k-NN vector fields.

One approach is to use a better input format, like Protobuf. Another approach (I'm not sure of its feasibility) is to remove the vector field from the source document and write it as a separate stored field containing the raw binary floating point representation, so that vector source space consumption would be 4 * dimension * num_vectors bytes (roughly the size of doc values).
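The raw-binary stored-field idea can be sketched with Python's struct module standing in for the actual codec (the encode/decode helpers are hypothetical). Note that round-tripping through float32 loses a little precision relative to Python's float64:

```python
import struct

# Hypothetical stored-field representation: raw little-endian float32 bytes,
# so a vector costs exactly 4 * dimension bytes instead of its JSON text.

def encode_vector(vec):
    return struct.pack(f"<{len(vec)}f", *vec)

def decode_vector(buf):
    return list(struct.unpack(f"<{len(buf) // 4}f", buf))

vec = [1.2, 3.2, 4.2, 5.2]
buf = encode_vector(vec)
print(len(buf))             # 16 bytes = 4 * dimension
print(decode_vector(buf))   # values back, to float32 precision
```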

However, alternative to this, as @msfroh brought up, if the vectors were removed from the source entirely and when the source was needed it read from the doc values, then we wouldnt need to store vector information in source at all - saving the most space. However, I am not really sure the feasibility of doing something like this.

@navneet1v
Contributor

@jmazanec15 I created another GH issue which can help us remove the vectors from stored fields and recovery source both. Ref: #13490
