
[RFC] Support Facebook's faiss library as another Approximate k-NN engine #70

Closed
jmazanec15 opened this issue Jul 22, 2021 · 15 comments

@jmazanec15
Member

Overview

Over the past year, we have received a lot of interest in supporting Facebook’s MIT licensed faiss library, as another Approximate k-Nearest Neighbor (ANN) engine, in addition to nmslib. faiss offers a diverse set of algorithms that allow users to easily make tradeoffs between indexing latency, memory usage, query latency, and recall to fit their ANN workload requirements.

Earlier this year, a member of the community made a contribution in Open Distro for Elasticsearch for initial faiss support. This contribution integrated the faiss library into the plugin and added support for faiss’s implementation of Hierarchical Navigable Small World (HNSW) graphs. We are building on top of this contribution to support additional faiss features like vector quantizers and other ANN search methods.

Supporting faiss will allow users to choose from different ANN search methods and algorithms that are not available in nmslib. In particular, we are very interested in supporting faiss’s quantization methods that can reduce the amount of memory an ANN index requires.

Additionally, while this project focuses on integrating faiss, it should also refactor the plugin so that we can support additional ANN libraries and their methods in the future.

We are developing on the feature/faiss-support branch and are planning to merge to main once all requirements have been met.

Problem Statement

Because the k-NN plugin currently supports only one ANN engine and method, users have limited ability to customize the plugin to fit their ANN workloads.

Specifically, one problem k-NN plugin users face is that the plugin can consume a significant amount of memory. Currently, the plugin is built on top of nmslib’s implementation of HNSW. HNSW is a fast and fairly accurate ANN method. Still, for some workloads, the HNSW algorithm’s memory consumption can be an issue. From our documentation, each vector will consume approximately 1.1 * (4 * dimension + 8 * M) bytes. faiss implements several different algorithms that can provide ANN search using much less memory at the cost of additional compute during training. By supporting faiss, we can let users make memory-based tradeoffs in order to achieve the solution that they want.
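
To make the memory tradeoff concrete, the sketch below (an illustrative calculation only, not plugin code) compares the documented HNSW estimate with a rough lower bound for product-quantized codes; the PQ figure ignores per-vector ids, coarse centroids, and codebook tables.

def hnsw_bytes_per_vector(dimension, m):
    # From the k-NN documentation: ~1.1 * (4 * dimension + 8 * M) bytes per vector
    return 1.1 * (4 * dimension + 8 * m)

def pq_code_bytes_per_vector(pq_m, code_size_bits=8):
    # Each vector is encoded as pq_m sub-codes of code_size_bits bits
    return pq_m * code_size_bits / 8

dimension, m, pq_m = 128, 16, 8
print(hnsw_bytes_per_vector(dimension, m))   # ~704 bytes per vector
print(pq_code_bytes_per_vector(pq_m))        # 8 bytes per vector (codes only)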

Requirements

In the initial phase, the requirements are:

  1. Refactor the k-NN plugin to be able to support more than one ANN engine
  2. Support faiss composite indices (see the faiss sketch after these lists)
  3. Support faiss’s HNSW ANN method
  4. Support faiss’s Inverted File System (IVF) ANN method
  5. Support faiss’s Product Quantization (PQ) vector quantization method

In the future, we may consider supporting:

  1. OpenSearch’s msearch type with faiss’s bulk querying functionality
  2. Additional faiss features, such as preprocessors and post query refinement as well as other ANN methods
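
For readers less familiar with faiss, the sketch below shows, in plain faiss Python (not plugin code), the kinds of indices the initial-phase requirements refer to; the parameter values are arbitrary examples.

import faiss

d = 128  # example vector dimension

# Requirement 3: faiss's HNSW graph over full-precision vectors (M = 32 neighbors per node)
hnsw = faiss.IndexHNSWFlat(d, 32)

# Requirement 4: an inverted file (IVF) index that buckets vectors around 128 coarse centroids
coarse = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(coarse, d, 128)

# Requirements 2 and 5: a composite index combining IVF with product quantization (PQ),
# built here with faiss's index factory string notation
ivfpq = faiss.index_factory(d, "IVF128,PQ8")

# IVF and PQ indices must be trained before vectors can be added (see Training Support below)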

Proposed Solution

In order to support faiss in the k-NN plugin, we need to:

  1. Refactor the JNI to support additional libraries
  2. Add additional APIs and system resources to support ANN methods that require training
  3. Enhance the knn_vector field type to support multiple engines and methods

Training Support

Several faiss features, such as IVF and PQ, require a training step before indexing can begin. Training takes a set of training vectors and produces a model that these methods then rely on to index and search vectors.
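
As a concrete illustration (plain faiss Python, not plugin code), an IVF index with PQ encoding must learn its coarse centroids and PQ codebooks from a sample of vectors before any data can be added:

import faiss
import numpy as np

d = 16
train_vectors = np.random.random((20000, d)).astype("float32")

coarse = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(coarse, d, 128, 8, 8)  # nlist=128 centroids, 8 sub-quantizers, 8 bits each

assert not index.is_trained
index.train(train_vectors)      # learns coarse centroids and PQ codebooks; this is the "model"
assert index.is_trained

index.add(np.random.random((1000, d)).astype("float32"))  # only a trained index can ingest vectors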

From the plugin perspective, there are two approaches to support training: (1) Train a new model during segment creation with a subset of the segment’s index data and (2) Train a model before indexing can begin and use it to initialize the ANN library index during segment creation.

While the approaches are not mutually exclusive, initially we will only support Approach 2.

Approach 1 is easier to implement, but it significantly increases indexing latency. Every time a new segment is created, a new model needs to be trained. Additionally, because the model is trained with a subset of the segment’s data, it is difficult to guarantee the quantity and quality of the training data.

Approach 2 requires us to add additional APIs and OpenSearch utilities for a user to train a model and connect it to an OpenSearch k-NN index. However, it speeds up indexing and gives the user more control over the model produced. Additionally, it is recommended in the faiss documentation.

Model System Index

In order to persist faiss trained models and their metadata, we need to create a model system index.

During segment creation, a GET call is made to retrieve a model’s binary representation. The model is then used in the JNI layer to initialize the ANN library index. Once initialized, the vectors for the given segment are indexed into the ANN library index. After this completes, the ANN library index file is written to the OpenSearch index’s segment.
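
The sketch below mimics that flow in plain faiss Python (in the plugin this happens through the JNI layer): a trained but empty index stands in for the stored model, and each segment reconstructs it before adding its own vectors. File and variable names are illustrative.

import faiss
import numpy as np

d = 16

# The "model": a trained but empty index, as it would be stored in the model system index
model = faiss.index_factory(d, "IVF128,PQ8")
model.train(np.random.random((20000, d)).astype("float32"))
model_blob = faiss.serialize_index(model)                  # byte buffer suitable for persistence

# During segment creation: rebuild the index from the stored model,
# add the segment's vectors, and write the result alongside the segment
segment_index = faiss.deserialize_index(model_blob)
segment_index.add(np.random.random((1000, d)).astype("float32"))
faiss.write_index(segment_index, "segment_0.faiss")        # hypothetical file name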

[Diagram: segment creation workflow showing model retrieval and ANN library index initialization]

Train API

In order to support Approach 2, we need to give users the functionality to train a model in their OpenSearch cluster. To do this, we need to add a train API:

POST /_plugins/_knn/model/train
{
  "train_index": "train-index-name",
  "train_field": "train-field-name",
  "model_id": "custom-model-id",
  "dimension": 16,
  "method": {
      "name":"ivf",
      "engine":"faiss",
      "space_type": "l2",
      "parameters":{
         "ncentroids":128,
         "coarse_quantizer":{
            "name":"ivf",
            "parameters":{
                "ncentroids":15
            }
        },
        "encoder":{
            "name":"pq",
            "parameters":{
                "code_size":8
            }
        }
      }
  }
}
{
  "status": "SUCCESS",
  "model_id": "custom-model-id"
}

This API triggers a training workflow that reads a training set of vectors from another OpenSearch index, creates and trains an ANN library model and then serializes it into the model system index.

[Diagram: end-to-end train API workflow]

Upload API

One potential issue with training is that it can be very resource intensive, which could negatively impact an OpenSearch cluster that is processing a heavy workload. So, to unblock users who want to use models that require resource intensive training, we need to also provide an upload API:

POST /_plugins/_knn/model/upload
{
    "model_id": "custom-model-id",
    "engine: "engine-of-the-model",
    "dimension": X,
    "space_type": "space-type-of-the-model",
    "model_blob": "some base64 encoded string"
}
{
  "status": "SUCCESS",
  "model_id": "custom-model-id"
}

This API triggers a workflow that validates the uploaded model and then serializes it to the model system index.
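
For example, a user could train a model offline with plain faiss and produce the base64 model_blob for the proposed upload API. This is a sketch: it assumes model_blob is the base64-encoded serialized faiss index, a local cluster at localhost:9200, and the endpoint shape shown above, all of which may change.

import base64
import faiss
import numpy as np
import requests

d = 128
index = faiss.index_factory(d, "IVF256,PQ16")
index.train(np.random.random((50000, d)).astype("float32"))   # resource-intensive step runs offline

model_blob = base64.b64encode(faiss.serialize_index(index).tobytes()).decode("ascii")

response = requests.post(
    "http://localhost:9200/_plugins/_knn/model/upload",        # proposed endpoint from this RFC
    json={
        "model_id": "custom-model-id",
        "engine": "faiss",
        "dimension": d,
        "space_type": "l2",
        "model_blob": model_blob,
    },
)
print(response.json())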

[Diagram: end-to-end upload API workflow]

knn_vector Field Enhancements

In order for a user to configure an index to use faiss, we need to enhance our knn_vector field type. Currently, a user creates an index with the following mapping:

"my_vector":{
    "type":"knn_vector",
    "dimension": 2,
    "method":{
        "name":"hnsw",
        "engine":"nmslib",
        "space_type":"l2",
        "parameters":{
            "m":44
        }
    }
}

To support faiss indices that do not require training, we need to add an additional engine. This looks like:

"my_vector":{
    "type":"knn_vector",
    "dimension": 2,
    "method":{
        "name":"hnsw",
        "engine":"faiss",
        "space_type":"l2",
        "parameters":{
            "m":44
        }
    }
}

For indices that require training, a user needs to have already trained/uploaded the model to the model index. Once they have done this, they can create an ANN OpenSearch index with the following mapping:

"my_vector":{
    "type":"knn_vector",
    "model_id": "my_trained_model_template"
}
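
Putting it together, creating an index backed by a pre-trained model might look like the sketch below. Assumptions: a local cluster, the index.knn setting the plugin already uses, and the proposed model_id mapping parameter, whose final shape may differ.

import requests

response = requests.put(
    "http://localhost:9200/my-ann-index",
    json={
        "settings": {"index": {"knn": True}},                  # enable the k-NN plugin for this index
        "mappings": {
            "properties": {
                "my_vector": {
                    "type": "knn_vector",
                    "model_id": "my_trained_model_template",   # proposed mapping parameter from this RFC
                }
            }
        },
    },
)
print(response.json())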

Feedback

We are interested in any and all feedback you may have. Please do not hesitate to comment!

Specifically, however, we are interested in:

  1. What features in faiss do you use that you would like to be supported by the k-NN plugin?
  2. For your potential use case, if you intend to use a faiss index that requires training, would you prefer training offline and using the upload API or training online with the train API? Why?
@jmazanec15 jmazanec15 added the RFC Request for comments label Jul 22, 2021
@jmazanec15 jmazanec15 self-assigned this Jul 22, 2021
@jmazanec15
Member Author

jmazanec15 commented Aug 26, 2021

Update on Proposed APIs

Refactored the API design to center around the model resource. First draft can be found here. Second draft can be found here.

For faiss, we will introduce additional functionality to add support for faiss indices that require training. With this change, we introduce a new resource: models. A model is an empty, trained native library index that can be used to initialize another native library index during ingestion. A model will be stored as a document in the model system index, which has the following mapping:

{
    "state": keyword,
    "created_timestamp": date,
    "description": keyword, 
    "error": keyword,
    "model_blob": binary,
    "engine": keyword,
    "space_type": keyword,
    "dimension": int
} 

state — Model state. Can be CREATED, TRAINING, or FAILED.

created_timestamp — Time at which the model was created.

description — Model description a user can provide to add additional details about a model.

error — Message provided to user to communicate why model is in failed state.

model_blob — Base64 encoded representation of the model.

engine — Engine this model was created by.

space_type — Space this model was built with.

dimension — Dimension this model supports.

Get

GET /_plugins/_knn/models/{model_id}?<filter_field_1>&<filter_field_2>

{
    "model_id": "my_model_id",
    "state": "CREATED",
    "created_timestamp": "10-31-21 02:02:02",
    "description": "Model trained with dataset X",
    "error": "",
    "model_blob": "cdscsacsadcsdca",
    "engine": "faiss",
    "space_type": "l2",
    "dimension": 128
}

GET /_plugins/_knn/models/_search?<query_filters>
{
    "query": {
         ...
     }
}

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    ...
  }
} 

model_id — [Required] Specifies which model to return information for. To retrieve multiple models, use the _search API above.

filter_field — Fields to include. If not specified, all fields are returned.

Delete

DELETE /_plugins/_knn/models/{model_id}

{
    "acknowledged": true
}

model_id — [Required] Model to delete

Upload

PUT /_plugins/_knn/models/{model_id}
{
    "description": "Model trained with dataset X", 
    "model_blob": "cdscsacsadcsdca",
    "engine": "faiss",
    "space_type": "l2",
    "dimension": 128
}

{
    "acknowledged": true
}


POST /_plugins/_knn/models
{
    "description": "Model trained with dataset X", 
    "model_blob": "cdscsacsadcsdca",
    "engine": "faiss",
    "space_type": "l2",
    "dimension": 128
}

{
    "model_id": "my_model_identifier"
}

description — [Optional] Model description a user can provide to add additional details about a model.

model_blob — Base64 encoded representation of the model.

engine — Engine this model was created by.

space_type — Space this model was built with.

dimension — Dimension this model supports.

Train

POST /_plugins/_knn/models/<model_id>/_train?preference=<node_id>
{
  "train_index": "train-index-name",
  "train_field": "train-field-name",
  "dimension": 16,
  "method": {
      "name":"ivf",
      "engine":"faiss",
      "space_type": "l2",
      "parameters":{
         "ncentroids":128,
         "coarse_quantizer":{
            "name":"ivf",
            "parameters":{
                "ncentroids":15
            }
        },
        "encoder":{
            "name":"pq",
            "parameters":{
                "code_size":8
            }
        }
      }
  }
}

{
    "acknowledged": true
}

POST /_plugins/_knn/models/_train?preference=<node_id>
{
  "train_index": "train-index-name",
  "train_field": "train-field-name",
  "dimension": 16,
  "method": {
      "name":"ivf",
      "engine":"faiss",
      "space_type": "l2",
      "parameters":{
         "ncentroids":128,
         "coarse_quantizer":{
            "name":"ivf",
            "parameters":{
                "ncentroids":15
            }
        },
        "encoder":{
            "name":"pq",
            "parameters":{
                "code_size":8
            }
        }
      }
  }
}

{
    "model_id": "my_model_identifier"
}

node_id — User's preference for node to execute training.

train_index — OpenSearch index from which to pull the training data.

train_field — Field of train_index from which to pull training data.

dimension — Dimension the model should be built for.

method — Method definition to produce the model.

@wnbts

wnbts commented Aug 26, 2021

I have a few questions and suggestions for discussion.

GET /_plugins/_knn/{node_id}/train-jobs/{model_id}

a. whether a model resource belongs to a node resource
b. whether the node resource is necessary, i.e. must a user know a node id to operate training?
c. is a training job (/train-jobs) the same as a model (/{model_id})?

@wnbts

wnbts commented Aug 26, 2021

PUT /_plugins/_knn/{node_id}/train-jobs/{model_id}

a. can a job be updated once created?

@wnbts

wnbts commented Aug 26, 2021

GET /_plugins/_knn/models/{model_id}?{field_filter_id1}&{field_filter_id2}

{
   "model_id": {
      "engine: "engine-of-the-model",
      "dimension": X,
      "space_type": "space-type-of-the-model",
      "model_blob": "some base64 encoded string"
   },
   ...
}
PUT /_plugins/_knn/models/{model_id}
{
    "model_id": "custom-model-id",
    "engine: "engine-of-the-model",
    "dimension": X,
    "space_type": "space-type-of-the-model",
    "model_blob": "some base64 encoded string"
}

a. the results from get might be changed to be consistent with that from put, i.e.,

[
   {
     "model_id": "custom-model-id"
      "engine: "engine-of-the-model",
      "dimension": X,
      "space_type": "space-type-of-the-model",
      "model_blob": "some base64 encoded string"
   },
   {
     "model_id": "custom-model-id-2",
     ...
   },
]

@wnbts

wnbts commented Aug 26, 2021

PUT /_plugins/_knn/models/{model_id}

{
  "acknowledged": true,
  "model_id": "custom-model-id",
}

a. the response can be only an ack, the same as in delete, i.e.

{
  "acknowledged": true
}

@jmazanec15
Member Author

jmazanec15 commented Aug 27, 2021

Thanks for the feedback @wnbts. Let me address your comments one by one:

GET /_plugins/_knn/{node_id}/train-jobs/{model_id}

a. whether a model resource belongs to a node resource

Right, model_id probably needs to be changed to training_job_id for /_plugins/_knn/{node_id}/train-jobs APIs. I think I was trying to cut a corner here so that training-jobs could use the same id as the model_id. However, looking at it again, I think that this does not make sense. Do you agree?

b. whether the node resource is necessary, i.e. must a user know a node id to operate training?

No, specifying a node is not necessary; it is optional.

c. is a training job (/train-jobs) the same as a model (/{model_id})?

Discussed above. They are not the same thing.

PUT /_plugins/_knn/{node_id}/train-jobs/{model_id}

a. can a job be updated once created?

No, it cannot. I think I got this backwards; I will switch to POST.

GET /_plugins/_knn/models/{model_id}?{field_filter_id1}&{field_filter_id2}

a. the results from get might be changed to be consistent with that from put, i.e.,

Good point, will update.

PUT /_plugins/_knn/models/{model_id}

a. the response can be only an ack, the same as in delete, i.e.

I see, I will update. Thanks for the suggestion.

@wnbts

wnbts commented Aug 27, 2021

GET /_plugins/_knn/{node_id}/train-jobs/{model_id}

a. whether a model resource belongs to a node resource

Right, model_id probably needs to be changed to training_job_id for /_plugins/_knn/{node_id}/train-jobs APIs. I think I was trying to cut a corner here so that training-jobs could use the same id as the model_id. However, looking at it again, I think that this does not make sense. Do you agree?

I agree. The data modeling would be clearer and more natural. Training jobs are an intuitive resource. The training job request can contain a model id, or a newly generated model id can be returned in the response.

A separate question regarding the relationship between node and job: if a job is created with a node resource, is the job bound to the node? For example, if a job is /node-a/train-jobs/job-a, what would a get of /node-b/train-jobs/job-a return?

@jmazanec15
Member Author

if a job is /node-a/train-jobs/job-a, what would a get of /node-b/train-jobs/job-a return?

In this case, no results would be returned. A job is bound to a node; a model, however, is not.

@jmazanec15
Member Author

@wnbts I decided to update the APIs to center around the model resource. I felt that having separate model and train-job resources did not make sense. Please take a look at the update if you get time.

@wnbts

wnbts commented Sep 14, 2021

The new version also makes sense to me! I have a few details to raise for discussion.

{
    "state": keyword,
    "created_timestamp": date,
    "description": keyword, 
    "error": keyword,
    "model_blob": binary,
    "engine": keyword,
    "space_type": keyword,
    "dimension": int
} 
  1. Why not add model id to the resource body?

@wnbts

wnbts commented Sep 14, 2021

GET /_plugins/_knn/models/{model_id}?<filter_field_1>&<filter_field_2>

{
    "my_model_id": {
        "state": "CREATED",
        "created_timestamp": "10-31-21 02:02:02",
        "description": "Model trained with dataset X", 
        "error": "",
        "model_blob": "cdscsacsadcsdca",
        "engine": "faiss",
        "space_type": "l2",
        "dimension": 128
    },
    ...
} 
  1. Getting a single resource can be separated from searching resources, i.e.
    for getting a single resource
GET /_plugins/_knn/models/{model_id}
{
    "model_id" : {model_id}
    "state": "CREATED",
    "created_timestamp": "10-31-21 02:02:02",
    "description": "Model trained with dataset X", 
    "error": "",
    "model_blob": "cdscsacsadcsdca",
    "engine": "faiss",
    "space_type": "l2",
    "dimension": 128
}

for searching model resources, if needed

GET /_plugins/_knn/models/_search?size=10
{
    'query' : {...}
}

[
    {
        'model_id' : 'model_id_1',
        ....
    },
    {
        'model_id' : 'model_id_2',
        ...
    }
]

@wnbts

wnbts commented Sep 14, 2021

PUT /_plugins/_knn/models/<model_id>/_train?preference=<node_id>
  1. preference might change to a more specific name such as prefer_nodes to allow other preferences in the future such as timeout or retry.
  2. PUT might change to POST since it's not putting a resource.

@jmazanec15
Member Author

1. Why not add model id to the resource body?

Right, in the mapping it is implicitly defined as the document id. That being said, I think it makes sense to include the id in the responses. I will update.

1. Getting a single resource can be separated from searching resources, i.e.
   for getting a single resource

This might have been misinterpreted: filter_field is meant to filter the fields returned in the body. This is similar to how GET calls work: https://opensearch.org/docs/opensearch/rest-api/document-apis/get-documents/#url-parameters.

That being said, GET calls take a "source_includes" param. I could refactor to this.

1. `preference` might change to a more specific name such as `prefer_nodes` to allow other preferences in the future such as timeout or retry.

2. `PUT` might change to `POST` since it's not putting a resource.

Preference also follows the OpenSearch convention: https://opensearch.org/docs/opensearch/rest-api/document-apis/get-documents/#url-parameters

I thought PUT made sense when the resource id is being passed and the resource is being created. When the resource id is not passed, I made it POST

@wnbts

wnbts commented Sep 14, 2021

1. Getting a single resource can be separated from searching resources, i.e.
   for getting a single resource

This might have been misinterpreted: filter_field is meant to filter the fields returned in the body. This is similar to how GET calls work: https://opensearch.org/docs/opensearch/rest-api/document-apis/get-documents/#url-parameters.

Yeah, I misinterpreted the api. The input looks good then. The output can just be the same body used in PUT/POST.

Preference also follows the OpenSearch convention: https://opensearch.org/docs/opensearch/rest-api/document-apis/get-documents/#url-parameters

I see, preference is a convention.

I thought PUT made sense when the resource id is being passed and the resource is being created. When the resource id is not passed, I made it POST

The difference is subtle. I see the request is made to id/_train, not to id, and is therefore not a standard PUT. (A request to id/a/b would be a PUT rather than a POST.) So using POST here can make the train API simpler for users without getting into those differences.

@jmazanec15
Member Author

The difference is subtle. I see the request is made to id/_train, not to id, and is therefore not a standard PUT. (A request to id/a/b would be a PUT rather than a POST.) So using POST here can make the train API simpler for users without getting into those differences.

I see. I will update to POST. Also, I will add _search API to get multiple models.
