Optimization options for source storage for particular fields #6356
@jmazanec15 do you need to include the field in the source? You could exclude it from the source with:

```json
"mappings": {
  "_source": {
    "excludes": [
      "knn_vector_field" // Or whatever the field is named
    ]
  }
}
```

Of course, that doesn't work if you need to reindex, so it's not without consequences. I wonder if we could do something clever to merge the stored source with the vector field doc values to reconstruct the full source without needing to store the vector field.
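A rough sketch of the read-time half of that idea, assuming the vectors are indexed as binary doc values packed as little-endian floats (the k-NN plugin's actual serialization may differ) and reusing `knn_vector_field` from the mapping above:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

// Minimal sketch: rebuild a document's vector from its binary doc values
// instead of reading it from _source. Assumes little-endian float packing.
final class VectorFromDocValues {
    static float[] readVector(LeafReader reader, int docId) throws IOException {
        BinaryDocValues values = reader.getBinaryDocValues("knn_vector_field");
        if (values == null || !values.advanceExact(docId)) {
            return null; // document has no vector
        }
        BytesRef bytes = values.binaryValue();
        ByteBuffer buffer = ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length)
                                      .order(ByteOrder.LITTLE_ENDIAN);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buffer.getFloat();
        }
        return vector;
    }
}
```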
@msfroh I believe excluding the field from source no longer avoids storing the source completely. See https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/mapper/SourceFieldMapper.java#L214. There is a recovery source that gets added if source is disabled. Using doc values or other representations of vector values would be very nice, but I am not sure how large a change that would require. I would need to look into that more.
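For reference, a hedged paraphrase of what happens around the linked line (not the exact OpenSearch code): even when the field is excluded from `_source`, the original request bytes are still written to a separate `_recovery_source` stored field so the shard can be replayed during peer recovery, which is why the space is not reclaimed:

```java
import java.util.List;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.util.BytesRef;

// Sketch of the recovery-source fallback: if the stored _source was adapted
// (e.g. the vector field was excluded), the original bytes are persisted anyway.
final class RecoverySourceSketch {
    static void addSourceFields(List<IndexableField> doc, BytesRef original, BytesRef adapted) {
        if (adapted != null) {
            doc.add(new StoredField("_source", adapted.bytes, adapted.offset, adapted.length));
        }
        if (original != null && !original.equals(adapted)) {
            // The excluded vector bytes come back here, so little space is saved.
            doc.add(new StoredField("_recovery_source", original.bytes, original.offset, original.length));
            doc.add(new NumericDocValuesField("_recovery_source", 1));
        }
    }
}
```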
Thinking about this more, the problem may be that the content type and the source need to be in the same format. So, in the case above, if the content type is JSON, consumers will expect the bytes stored in source to be formatted as JSON. Alternatively, allowing a field's source to be pulled from the document source and formatted differently might be an option (similar to filtering), but reconstructing the source when it is needed may add a lot of complexity. This might require changing how we read from source: we might have to add another source meta field that specifies how the source should be put back together from other formats, such as doc values or potentially a different type-optimized source field. I am not sure at the moment how this would be done. Alternatively, we could put up with this redundancy but reduce the overhead by using a different format such as protobuf; JSON is not very efficient for numeric types. In #4559, they are considering supporting different input formats for similar reasons. @itiyamas I saw you opened #4559. I am wondering what your thoughts on this might be?
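To make the numeric-inefficiency point concrete, here is a small, self-contained comparison of the same 128-dimensional vector encoded as UTF-8 JSON text versus packed binary floats (illustrative only, not any plugin's actual encoding):

```java
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;

public final class EncodingSizeDemo {
    public static void main(String[] args) {
        float[] vector = new float[128];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = (float) Math.random();
        }

        // JSON-style text: each float becomes a decimal string of roughly 10 characters.
        StringJoiner json = new StringJoiner(",", "[", "]");
        for (float v : vector) {
            json.add(Float.toString(v));
        }
        int textBytes = json.toString().getBytes(StandardCharsets.UTF_8).length;

        // Binary: always a fixed 4 bytes per float.
        int binaryBytes = vector.length * Float.BYTES;

        System.out.println("JSON text bytes: " + textBytes);   // roughly 1,300 for random floats
        System.out.println("Binary bytes:    " + binaryBytes); // 512
    }
}
```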
Did we try overriding the knn_vector field type to be more compressed while creating the source field in the SourceFieldMapper, just like we would do for an int or a float field type?
I looked into this a little bit using a debugger, but the problem is that the content being stored is the content coming from the Netty HTTP request here. With the current setup, I am not sure we can change those bytes.
Let me check and get back to you on this. I had one more question: when we want to get away from JSON and move to something better like protobuf, do we want to do that for the whole source field or just this k-NN field? I think it's better if we can move the whole source from JSON to some better format.
I think the idea with #4559 would be to be able to parse a different input format, not necessarily to change how source is being stored. I think changing how different fields are stored is a different problem. It would require manipulating source in the SourceFieldMapper, probably removing a given field from the document and then adding it as a different source field. But putting this back together would be the challenge.
With this issue, are you looking for potential solutions or something else? I am confused by your last reply.
Source storage is larger than both the HNSW storage and the doc values storage for k-NN indices. The goal of this issue is to find potential solutions for reducing the storage requirement of source for docs with k-NN vector fields. One approach is to use a better input format, like protobuf. Another approach (I am not sure of its feasibility) is to remove the vector field from the source document and write it as a separate stored field that just contains the raw binary floating point representation, so that vector source space consumption would be 4 bytes per dimension per vector, matching the doc values footprint. Alternatively, as @msfroh brought up, if the vectors were removed from the source entirely and read from the doc values whenever the source is needed, then we wouldn't need to store vector information in source at all, saving the most space. However, I am not really sure of the feasibility of doing something like this.
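A minimal sketch of that second approach, packing and unpacking the raw floating point bytes that a separate stored field would hold (the class and method names are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Raw binary representation of a vector: a fixed 4 bytes per dimension, so
// 1,000,000 vectors * 128 dims * 4 bytes ~= 488 MB, matching the doc values math.
final class PackedVectorField {
    static byte[] pack(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES)
                                      .order(ByteOrder.LITTLE_ENDIAN);
        for (float v : vector) {
            buffer.putFloat(v);
        }
        return buffer.array();
    }

    static float[] unpack(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buffer.getFloat();
        }
        return vector;
    }
}
```

The packed bytes could then be written with Lucene's `StoredField(String, byte[])` rather than serialized as JSON text inside `_source`.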
@jmazanec15 I created another GH issue which can help us remove the vectors from both stored fields and recovery source. Ref: #13490
Currently, for the k-NN plugin, we introduce a data type called the `knn_vector`. For the purposes of this issue, this data type allows users to define fixed-dimensional arrays of floating point numbers (i.e. [1.2, 3.2, 4.2, 5.2]). On disk, we have a codec that serializes the vectors as binary doc values. So, 1 million 128-dimensional vectors would consume 1,000,000 * 128 * 4 = 512,000,000 bytes ~= 488 MB. The problem is that the vectors also get stored in the source field. So, the file storage looks like this for 10K vectors with dimension=100, using the BEST_SPEED codec:
With BEST_COMPRESSION codec:
As you can see, in both cases the source takes up significantly more space than the DocValues.
Part of the problem lies in the fact that if each floating point number is represented as a string averaging 16 characters, the total string storage size for the vectors in the example above will be 1,000,000 * 128 * 16 = 2,048,000,000 bytes ~= 1953 MB, not including additional characters like commas, spaces, etc. I understand that this will be compressed, but from the table above, the source field is still very large.

I'm wondering if it would be possible to optimize the stored field representation at the field level. I am aware of the SourceMapper here, where we are able to filter based on fields. I'm wondering if it would be feasible to hook in here and modify the representation for certain types before adding it as a stored field. A rough sketch of what I have in mind follows.
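As an illustration only, here is a hypothetical version of such a hook (the field name `my_vector`, the `#packed` suffix, and the direct use of Jackson are all assumptions for the sketch): strip the vector out of the parsed source, re-serialize the remaining fields as `_source`, and keep the vector as packed floats in a companion stored field.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.List;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;

// Hypothetical per-field source transform: the vector's JSON text never reaches
// the stored _source; it is kept as 4-bytes-per-dimension binary instead.
final class SourceSplittingSketch {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    @SuppressWarnings("unchecked")
    static void storeSplitSource(Document doc, byte[] sourceJson) throws IOException {
        Map<String, Object> source = MAPPER.readValue(sourceJson, Map.class);
        List<Number> raw = (List<Number>) source.remove("my_vector");
        if (raw != null) {
            ByteBuffer packed = ByteBuffer.allocate(raw.size() * Float.BYTES)
                                          .order(ByteOrder.LITTLE_ENDIAN);
            for (Number n : raw) {
                packed.putFloat(n.floatValue());
            }
            doc.add(new StoredField("my_vector#packed", packed.array()));
        }
        // Re-serialized source no longer carries the vector's JSON text.
        doc.add(new StoredField("_source", MAPPER.writeValueAsBytes(source)));
    }
}
```

The hard part would be the read path: everything that consumes `_source` would need to know to merge the packed field back in, and the recovery source behavior discussed above would also have to be handled.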