
[RFC] Support Facebook's faiss library as another Approximate k-NN engine #70

Closed
jmazanec15 opened this issue Jul 22, 2021 · 15 comments

@jmazanec15
Member

Overview

Over the past year, we have received a lot of interest in supporting Facebook’s MIT licensed faiss library, as another Approximate k-Nearest Neighbor (ANN) engine, in addition to nmslib. faiss offers a diverse set of algorithms that allow users to easily make tradeoffs between indexing latency, memory usage, query latency, and recall to fit their ANN workload requirements.

Earlier this year, a member of the community made a contribution in Open Distro for Elasticsearch for initial faiss support. This contribution integrated the faiss library into the plugin and added support for faiss’s implementation of Hierarchical Navigable Small World (HNSW) graphs. We are building on top of this contribution to support additional faiss features like vector quantizers and other ANN search methods.

Supporting faiss will allow users to choose from different ANN search methods and algorithms that are not available in nmslib. In particular, we are very interested in supporting faiss’s quantization methods that can reduce the amount of memory an ANN index requires.

Additionally, while this project focuses on integrating faiss, it should also refactor the plugin so that we can support additional ANN libraries and their methods in the future.

We are developing on the feature/faiss-support branch and are planning to merge to main once all requirements have been met.

Problem Statement

Because the k-NN plugin currently supports only one ANN engine and method, users have limited ability to customize the plugin to fit their ANN workloads.

Specifically, one problem k-NN plugin users face is that the plugin can consume a significant amount of memory. Currently, the plugin is built on top of nmslib’s implementation of HNSW. HNSW is a fast and fairly accurate ANN method. Still, for some workloads, the HNSW algorithm’s memory consumption can be an issue. From our documentation, each vector will consume approximately 1.1 * (4 * dimension + 8 * M) bytes. faiss implements several different algorithms that can provide ANN search using much less memory at the cost of additional compute during training. By supporting faiss, we can let users make memory-based tradeoffs in order to achieve the solution that they want.
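
To make the memory tradeoff concrete, the sketch below (an illustrative calculation only, not plugin code) compares the documented HNSW estimate with a rough lower bound for product-quantized codes; the PQ figure ignores per-vector ids, coarse centroids, and codebook tables.

def hnsw_bytes_per_vector(dimension, m):
    # From the k-NN documentation: ~1.1 * (4 * dimension + 8 * M) bytes per vector
    return 1.1 * (4 * dimension + 8 * m)

def pq_code_bytes_per_vector(pq_m, code_size_bits=8):
    # Each vector is encoded as pq_m sub-codes of code_size_bits bits
    return pq_m * code_size_bits / 8

dimension, m, pq_m = 128, 16, 8
print(hnsw_bytes_per_vector(dimension, m))   # ~704 bytes per vector
print(pq_code_bytes_per_vector(pq_m))        # 8 bytes per vector (codes only)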

Requirements

In the initial phase, the requirements are:

  1. Refactor the k-NN plugin to be able to support more than one ANN engine
  2. Support faiss composite indices (see the faiss sketch after these lists)
  3. Support faiss’s HNSW ANN method
  4. Support faiss’s Inverted File System (IVF) ANN method
  5. Support faiss’s Product Quantization (PQ) vector quantization method

In the future, we may consider supporting:

  1. OpenSearch’s msearch type with faiss’s bulk querying functionality
  2. Additional faiss features, such as preprocessors and post query refinement as well as other ANN methods
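
For readers less familiar with faiss, the sketch below shows, in plain faiss Python (not plugin code), the kinds of indices the initial-phase requirements refer to; the parameter values are arbitrary examples.

import faiss

d = 128  # example vector dimension

# Requirement 3: faiss's HNSW graph over full-precision vectors (M = 32 neighbors per node)
hnsw = faiss.IndexHNSWFlat(d, 32)

# Requirement 4: an inverted file (IVF) index that buckets vectors around 128 coarse centroids
coarse = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(coarse, d, 128)

# Requirements 2 and 5: a composite index combining IVF with product quantization (PQ),
# built here with faiss's index factory string notation
ivfpq = faiss.index_factory(d, "IVF128,PQ8")

# IVF and PQ indices must be trained before vectors can be added (see Training Support below)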

Proposed Solution

In order to support faiss in the k-NN plugin, we need to:

  1. Refactor the JNI to support additional libraries
  2. Add additional APIs and system resources to support ANN methods that require training
  3. Enhance the knn_vector field type to support multiple engines and methods

Training Support

Several faiss features, such as IVF and PQ, require a training step before indexing can begin. Training takes a set of training vectors and produces a model that these methods then rely on to index and search vectors.
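
As a concrete illustration (plain faiss Python, not plugin code), an IVF index with PQ encoding must learn its coarse centroids and PQ codebooks from a sample of vectors before any data can be added:

import faiss
import numpy as np

d = 16
train_vectors = np.random.random((20000, d)).astype("float32")

coarse = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(coarse, d, 128, 8, 8)  # nlist=128 centroids, 8 sub-quantizers, 8 bits each

assert not index.is_trained
index.train(train_vectors)      # learns coarse centroids and PQ codebooks; this is the "model"
assert index.is_trained

index.add(np.random.random((1000, d)).astype("float32"))  # only a trained index can ingest vectors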

From the plugin perspective, there are two approaches to support training: (1) Train a new model during segment creation with a subset of the segment’s index data and (2) Train a model before indexing can begin and use it to initialize the ANN library index during segment creation.

While the approaches are not mutually exclusive, initially we will only support Approach 2.

Approach 1 is easier to implement, but it significantly increases indexing latency. Every time a new segment is created, a new model needs to be trained. Additionally, because the model is trained with a subset of the segment’s data, it is difficult to guarantee the quantity and quality of the training data.

Approach 2 requires us to add additional APIs and OpenSearch utilities for a user to train a model and connect it to an OpenSearch k-NN index. However, it speeds up indexing and gives the user more control over the model produced. Additionally, it is recommended in the faiss documentation.

Model System Index

In order to persist faiss trained models and their metadata, we need to create a model system index.

During segment creation, a GET call is made to retrieve a model’s binary representation. The model is then used in the JNI layer to initialize the ANN library index. Once initialized, the vectors for the given segment are indexed into the ANN library index. After this completes, the ANN library index file is written to the OpenSearch index’s segment.
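
The sketch below mimics that flow in plain faiss Python (in the plugin this happens through the JNI layer): a trained but empty index stands in for the stored model, and each segment reconstructs it before adding its own vectors. File and variable names are illustrative.

import faiss
import numpy as np

d = 16

# The "model": a trained but empty index, as it would be stored in the model system index
model = faiss.index_factory(d, "IVF128,PQ8")
model.train(np.random.random((20000, d)).astype("float32"))
model_blob = faiss.serialize_index(model)                  # byte buffer suitable for persistence

# During segment creation: rebuild the index from the stored model,
# add the segment's vectors, and write the result alongside the segment
segment_index = faiss.deserialize_index(model_blob)
segment_index.add(np.random.random((1000, d)).astype("float32"))
faiss.write_index(segment_index, "segment_0.faiss")        # hypothetical file name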

[Diagram: segment creation workflow showing model retrieval and ANN library index initialization]

Train API

In order to support Approach 2, we need to give users the functionality to train a model in their OpenSearch cluster. To do this, we need to add a train API:

POST /_plugins/_knn/model/train
{
  "train_index": "train-index-name",
  "train_field": "train-field-name",
  "model_id": "custom-model-id",
  "dimension": 16,
  "method": {
      "name":"ivf",
      "engine":"faiss",
      "space_type": "l2",
      "parameters":{
         "ncentroids":128,
         "coarse_quantizer":{
            "name":"ivf",
            "parameters":{
                "ncentroids":15
            }
        },
        "encoder":{
            "name":"pq",
            "parameters":{
                "code_size":8
            }
        }
      }
  }
}
{
  "status": "SUCCESS",
  "model_id": "custom-model-id"
}

This API triggers a training workflow that reads a training set of vectors from another OpenSearch index, creates and trains an ANN library model and then serializes it into the model system index.

[Diagram: end-to-end train API workflow]

Upload API

One potential issue with training is that it can be very resource intensive, which could negatively impact an OpenSearch cluster that is processing a heavy workload. So, to unblock users who want to use models that require resource intensive training, we need to also provide an upload API:

POST /_plugins/_knn/model/upload
{
    "model_id": "custom-model-id",
    "engine: "engine-of-the-model",
    "dimension": X,
    "space_type": "space-type-of-the-model",
    "model_blob": "some base64 encoded string"
}
{
  "status": "SUCCESS",
  "model_id": "custom-model-id"
}

This API triggers a workflow that validates the uploaded model and then serializes it to the model system index.
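
For example, a user could train a model offline with plain faiss and produce the base64 model_blob for the proposed upload API. This is a sketch: it assumes model_blob is the base64-encoded serialized faiss index, a local cluster at localhost:9200, and the endpoint shape shown above, all of which may change.

import base64
import faiss
import numpy as np
import requests

d = 128
index = faiss.index_factory(d, "IVF256,PQ16")
index.train(np.random.random((50000, d)).astype("float32"))   # resource-intensive step runs offline

model_blob = base64.b64encode(faiss.serialize_index(index).tobytes()).decode("ascii")

response = requests.post(
    "http://localhost:9200/_plugins/_knn/model/upload",        # proposed endpoint from this RFC
    json={
        "model_id": "custom-model-id",
        "engine": "faiss",
        "dimension": d,
        "space_type": "l2",
        "model_blob": model_blob,
    },
)
print(response.json())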

[Diagram: end-to-end upload API workflow]

knn_vector Field Enhancements

In order for a user to configure an index to use faiss, we need to enhance our knn_vector field type. Currently, a user creates an index with the following mapping:

"my_vector":{
    "type":"knn_vector",
    "dimension": 2,
    "method":{
        "name":"hnsw",
        "engine":"nmslib",
        "space_type":"l2",
        "parameters":{
            "m":44
        }
    }
}

To support faiss indices that do not require training, we need to add an additional engine. This looks like:

"my_vector":{
    "type":"knn_vector",
    "dimension": 2,
    "method":{
        "name":"hnsw",
        "engine":"faiss",
        "space_type":"l2",
        "parameters":{
            "m":44
        }
    }
}

For indices that require training, a user needs to have already trained/uploaded the model to the model index. Once they have done this, they can create an ANN OpenSearch index with the following mapping:

"my_vector":{
    "type":"knn_vector",
    "model_id": "my_trained_model_template"
}
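
Putting it together, creating an index backed by a pre-trained model might look like the sketch below. Assumptions: a local cluster, the index.knn setting the plugin already uses, and the proposed model_id mapping parameter, whose final shape may differ.

import requests

response = requests.put(
    "http://localhost:9200/my-ann-index",
    json={
        "settings": {"index": {"knn": True}},                  # enable the k-NN plugin for this index
        "mappings": {
            "properties": {
                "my_vector": {
                    "type": "knn_vector",
                    "model_id": "my_trained_model_template",   # proposed mapping parameter from this RFC
                }
            }
        },
    },
)
print(response.json())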

Feedback

We are interested in any and all feedback you may have. Please do not hesitate to comment!

Specifically, however, we are interested in:

  1. What features in faiss do you use that you would like to be supported by the k-NN plugin?
  2. For your potential use case, if you intend to use a faiss index that requires training, would you prefer training offline and using the upload API or training online with the train API? Why?
@jmazanec15 jmazanec15 added the RFC Request for comments label Jul 22, 2021
@jmazanec15 jmazanec15 self-assigned this Jul 22, 2021
@jmazanec15
Member Author

jmazanec15 commented Aug 26, 2021

Update on Proposed APIs

Refactored the API design to center around the model resource. First draft can be found here. Second draft can be found here.

For faiss, we will introduce additional functionality to add support for faiss indices that require training. With this change, we introduce a new resource: models. A model is an empty, trained native library index that can be used to initialize another native library index during ingestion. A model will be stored as a document in the model system index, which has the following mapping:

{
    "state": keyword,
    "created_timestamp": date,
    "description": keyword, 
    "error": keyword,
    "model_blob": binary,
    "engine": keyword,
    "space_type": keyword,
    "dimension": int
} 

state — Model state. Can be CREATED, TRAINING, or FAILED.

created_timestamp — Time at which the model was created.

description — Model description a user can provide to add additional details about a model.

error — Message provided to user to communicate why model is in failed state.

model_blob — Base64 encoded representation of the model.

engine — Engine this model was created by.

space_type — Space this model was built with.

dimension — Dimension this model supports.

Get

GET /_plugins/_knn/models/{model_id}?<filter_field_1>&<filter_field_2>

{
    "model_id": "my_model_id",
    "state": "CREATED",
    "created_timestamp": "10-31-21 02:02:02",
    "description": "Model trained with dataset X",
    "error": "",
    "model_blob": "cdscsacsadcsdca",
    "engine": "faiss",
    "space_type": "l2",
    "dimension": 128
}

GET /_plugins/_knn/models/_search?<query_filters>
{
    "query": {
         ...
     }
}

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    ...
  }
} 

model_id — [Required] Specifies which model to return information for. To retrieve multiple models, use the _search API above.

filter_field — Fields to include. If not specified, all fields are returned.

Delete

DELETE /_plugins/_knn/models/{model_id}

{
    "acknowledged": true
}

model_id — [Required] Model to delete

Upload

PUT /_plugins/_knn/models/{model_id}
{
    "description": "Model trained with dataset X", 
    "model_blob": "cdscsacsadcsdca",
    "engine": "faiss",
    "space_type": "l2",
    "dimension": 128
}

{
    "acknowledged": true
}


POST /_plugins/_knn/models
{
    "description": "Model trained with dataset X", 
    "model_blob": "cdscsacsadcsdca",
    "engine": "faiss",
    "space_type": "l2",
    "dimension": 128
}

{
    "model_id": "my_model_identifier"
}

description — [Optional] Model description a user can provide to add additional details about a model.

model_blob — Base64 encoded representation of the model.

engine — Engine this model was created by.

space_type — Space this model was built with.

dimension — Dimension this model supports.

Train

POST /_plugins/_knn/models/<model_id>/_train?preference=<node_id>
{
  "train_index": "train-index-name",
  "train_field": "train-field-name",
  "dimension": 16,
  "method": {
      "name":"ivf",
      "engine":"faiss",
      "space_type": "l2",
      "parameters":{
         "ncentroids":128,
         "coarse_quantizer":{
            "name":"ivf",
            "parameters":{
                "ncentroids":15
            }
        },
        "encoder":{
            "name":"pq",
            "parameters":{
                "code_size":8
            }
        }
      }
  }
}

{
    "acknowledged": true
}

POST /_plugins/_knn/models/_train?preference=<node_id>
{
  "train_index": "train-index-name",
  "train_field": "train-field-name",
  "dimension": 16,
  "method": {
      "name":"ivf",
      "engine":"faiss",
      "space_type": "l2",
      "parameters":{
         "ncentroids":128,
         "coarse_quantizer":{
            "name":"ivf",
            "parameters":{
                "ncentroids":15
            }
        },
        "encoder":{
            "name":"pq",
            "parameters":{
                "code_size":8
            }
        }
      }
  }
}

{
    "model_id": "my_model_identifier"
}

node_id — User's preference for node to execute training.

train_index — OpenSearch index from which to pull the training data.

train_field — Field of train_index from which to pull training data.

dimension — Dimension the model should be built for.

method — Method definition to produce the model.

@wnbts

wnbts commented Aug 26, 2021

I have a few questions and suggestions for discussion.

GET /_plugins/_knn/{node_id}/train-jobs/{model_id}

a. whether a model resource belongs to a node resource
b. whether the node resource is necessary, i.e. must a user know a node id to operate training?
c. is a training job (/train-jobs) the same as a model (/{model_id})?

@wnbts

wnbts commented Aug 26, 2021

PUT /_plugins/_knn/{node_id}/train-jobs/{model_id}

a. can a job be updated once created?

@wnbts

wnbts commented Aug 26, 2021

GET /_plugins/_knn/models/{model_id}?{field_filter_id1}&{field_filter_id2}

{
   "model_id": {
      "engine: "engine-of-the-model",
      "dimension": X,
      "space_type": "space-type-of-the-model",
      "model_blob": "some base64 encoded string"
   },
   ...
}
PUT /_plugins/_knn/models/{model_id}
{
    "model_id": "custom-model-id",
    "engine: "engine-of-the-model",
    "dimension": X,
    "space_type": "space-type-of-the-model",
    "model_blob": "some base64 encoded string"
}

a. the results from get might be changed to be consistent with that from put, i.e.,

[
   {
     "model_id": "custom-model-id"
      "engine: "engine-of-the-model",
      "dimension": X,
      "space_type": "space-type-of-the-model",
      "model_blob": "some base64 encoded string"
   },
   {
     "model_id": "custom-model-id-2",
     ...
   },
]

@wnbts

wnbts commented Aug 26, 2021

PUT /_plugins/_knn/models/{model_id}

{
  "acknowledged": true,
  "model_id": "custom-model-id",
}

a. the response can be only an ack, the same as in delete, i.e.

{
  "acknowledged": true
}

@jmazanec15
Member Author

jmazanec15 commented Aug 27, 2021

Thanks for the feedback @wnbts. Let me address your comments one by one:

GET /_plugins/_knn/{node_id}/train-jobs/{model_id}

a. whether a model resource belongs to a node resource

Right, model_id probably needs to be changed to training_job_id for /_plugins/_knn/{node_id}/train-jobs APIs. I think I was trying to cut a corner here so that training-jobs could use the same id as the model_id. However, looking at it again, I think that this does not make sense. Do you agree?

b. whether the node resource is necessary, i.e. must a user know a node id to operate training?

No, specifying a node is not necessary; it is optional.

c. is a training job (/train-jobs) the same as a model (/{model_id})?

Discussed above. They are not the same thing.

PUT /_plugins/_knn/{node_id}/train-jobs/{model_id}

a. can a job be updated once created?

No, it cannot. I think I got this backwards; I will switch to POST.

GET /_plugins/_knn/models/{model_id}?{field_filter_id1}&{field_filter_id2}

a. the results from get might be changed to be consistent with that from put, i.e.,

Good point, will update.

PUT /_plugins/_knn/models/{model_id}

a. the response can be only an ack, the same as in delete, i.e.

I see, I will update. Thanks for the suggestion.

@wnbts

wnbts commented Aug 27, 2021

GET /_plugins/_knn/{node_id}/train-jobs/{model_id}

a. whether a model resource belongs to a node resource

Right, model_id probably needs to be changed to training_job_id for /_plugins/_knn/{node_id}/train-jobs APIs. I think I was trying to cut a corner here so that training-jobs could use the same id as the model_id. However, looking at it again, I think that this does not make sense. Do you agree?

I agree. The data modeling would be clearer and more natural. Training jobs are an intuitive resource. The training job request can contain a model id, or a newly generated model id can be returned in the response.

A separate question regarding the relationship between node and job: if a job is created with a node resource, is the job bound to the node? For example, if a job is /node-a/train-jobs/job-a, what would a get of /node-b/train-jobs/job-a return?

@jmazanec15
Member Author

if a job is /node-a/train-jobs/job-a, what would a get of /node-b/train-jobs/job-a return?

In this case, no results would be returned. A job is bound to a node; a model, however, is not.

@jmazanec15
Member Author

@wnbts I decided to update the APIs to center around the model resource. I felt that having separate model and train-job resources did not make sense. Please take a look at the update if you get time.

@wnbts

wnbts commented Sep 14, 2021

The new version also makes sense to me! I have a few details to raise for discussion.

{
    "state": keyword,
    "created_timestamp": date,
    "description": keyword, 
    "error": keyword,
    "model_blob": binary,
    "engine": keyword,
    "space_type": keyword,
    "dimension": int
} 
  1. Why not add model id to the resource body?

@wnbts

wnbts commented Sep 14, 2021

GET /_plugins/_knn/models/{model_id}?<filter_field_1>&<filter_field_2>

{
    "my_model_id": {
        "state": "CREATED",
        "created_timestamp": "10-31-21 02:02:02",
        "description": "Model trained with dataset X", 
        "error": "",
        "model_blob": "cdscsacsadcsdca",
        "engine": "faiss",
        "space_type": "l2",
        "dimension": 128
    },
    ...
} 
  1. Getting a single resource can be separated from searching resources, i.e.
    for getting a single resource
GET /_plugins/_knn/models/{model_id}
{
    "model_id" : {model_id}
    "state": "CREATED",
    "created_timestamp": "10-31-21 02:02:02",
    "description": "Model trained with dataset X", 
    "error": "",
    "model_blob": "cdscsacsadcsdca",
    "engine": "faiss",
    "space_type": "l2",
    "dimension": 128
}

for searching model resources, if needed

GET /_plugins/_knn/models/_search?size=10
{
    'query' : {...}
}

[
    {
        'model_id' : 'model_id_1',
        ....
    },
    {
        'model_id' : 'model_id_2',
        ...
    }
]

@wnbts

wnbts commented Sep 14, 2021

PUT /_plugins/_knn/models/<model_id>/_train?preference=<node_id>
  1. preference might change to a more specific name such as prefer_nodes to allow other preferences in the future such as timeout or retry.
  2. PUT might change to POST since it's not putting a resource.

@jmazanec15
Member Author

1. Why not add model id to the resource body?

Right, in the mapping it is implicitly defined as the document id. That being said, I think it makes sense to include the id in the responses. I will update.

1. Getting a single resource can be separated from searching resources, i.e.
   for getting a single resource

This might have been misinterpreted: filter_field is meant to filter the fields returned in the body. This is similar to how GET calls work: https://opensearch.org/docs/opensearch/rest-api/document-apis/get-documents/#url-parameters.

That being said, GET calls take a "source_includes" param. I could refactor to this.

1. `preference` might change to a more specific name such as `prefer_nodes` to allow other preferences in the future such as timeout or retry.

2. `PUT` might change to `POST` since it's not putting a resource.

Preference also follows the OpenSearch convention: https://opensearch.org/docs/opensearch/rest-api/document-apis/get-documents/#url-parameters

I thought PUT made sense when the resource id is being passed and the resource is being created. When the resource id is not passed, I made it POST

@wnbts

wnbts commented Sep 14, 2021

1. Getting a single resource can be separated from searching resources, i.e.
   for getting a single resource

This might have been misinterpreted: filter_field is meant to filter the fields returned in the body. This is similar to how GET calls work: https://opensearch.org/docs/opensearch/rest-api/document-apis/get-documents/#url-parameters.

Yeah, I misinterpreted the api. The input looks good then. The output can just be the same body used in PUT/POST.

Preference also follows the OpenSearch convention: https://opensearch.org/docs/opensearch/rest-api/document-apis/get-documents/#url-parameters

I see, preference is a convention.

I thought PUT made sense when the resource id is being passed and the resource is being created. When the resource id is not passed, I made it POST

The difference is subtle. I see the request is made to id/_train, not to id, and is therefore not a standard PUT. (A request to id/a/b would be a PUT rather than a POST.) So using POST here can make the train API simpler for users without getting into those differences.

@jmazanec15
Member Author

The difference is subtle. I see the request is made to id/_train, not to id, and is therefore not a standard PUT. (A request to id/a/b would be a PUT rather than a POST.) So using POST here can make the train API simpler for users without getting into those differences.

I see. I will update to POST. Also, I will add _search API to get multiple models.
