
[Proposal][RFC] Support analyzer-based neural sparse query & build BERT tokenizer as pre-defined tokenizer #1052

zhichao-aws opened this issue Jan 3, 2025 · 2 comments

@zhichao-aws
Member

What/Why

What problems are you trying to solve?

Currently, for the neural sparse query, users need to register a sparse_encoding/sparse_tokenize model in advance and provide the model ID in the query body. For bi-encoder mode, we do need the ml-commons suite to manage the lifecycle of sparse encoding models. But for doc-only mode, we only use a tokenizer at query time, and managing it with the ml-commons suite is rather heavyweight. This has several drawbacks:

  • users need to configure the only_run_on_ml_node setting to enable the tokenizer on data nodes
  • users need to register the model and manage the model group, and even the model_id
  • the tokenizer predict requests are dispatched among cluster nodes, which incurs extra network traffic

What are you proposing?

Build an analyzer-based neural sparse query. The sparse_tokenize model will be wrapped as a Lucene Analyzer. Users bind the analyzer to an index field, and the neural sparse query calls the analyzer to encode the query.

The pretrained amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 will be supported as a pre-defined tokenizer. The token weights are encoded in the payload attribute.

Besides being used for the neural sparse query, the analyzer can also be invoked like any other analyzer, e.g. via the analyze API or the chunking processor.
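
For illustration, the built-in analyzer could be called directly through the analyze API. This is only a sketch: the analyzer name follows the mapping example below, and the exact response shape (including how the token weights surface) is still to be decided.

GET /_analyze
{
  "analyzer": "bert_tokenizer",
  "text": "hello world"
}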

What is the developer experience going to be?

We will alter the model_id verification logic in the neural sparse query builder and add the pre-defined BERT analyzer.

Are there any security considerations?

N/A

Are there any breaking changes to the API?

We'll support a new way to issue the neural sparse query: users can bind the analyzer to the index field instead of providing a model ID in the query body.

What is the user experience going to be?

create index

PUT /my-index
{
  "settings": {
    "default_pipeline": "nlp-ingest-pipeline-sparse"
  },
  "mappings": {
    "properties": {
      "passage_embedding": {
        "type": "rank_features",
        "analyzer": "bert_tokenizer"
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}
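
The default_pipeline above is shown for completeness; in doc-only mode, documents are still ingested with a sparse encoding model. A sketch of such a pipeline using the existing sparse_encoding processor (the model ID below is a placeholder):

PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
{
  "description": "Ingest pipeline for doc-only neural sparse search",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<doc-side sparse encoding model id>",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}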

search

GET my-index/_search
{
    "query":{
        "neural_sparse": {
            "passage_embedding": {
                "query_text": "hello world"
            }
        }
    }
}
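
For comparison, the current doc-only query requires a registered tokenizer model and passes its ID in the query body (the model ID below is a placeholder):

GET my-index/_search
{
    "query":{
        "neural_sparse": {
            "passage_embedding": {
                "query_text": "hello world",
                "model_id": "<sparse tokenizer model id>"
            }
        }
    }
}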

What will it take to execute?

  1. Modify the neural sparse query logic: the model ID is no longer required, and the query will read the analyzer from the shard context and use it to encode the query text.
  2. Use the HuggingFaceTokenizer implementation from the DJL library (DJL is already a dependency of ml-commons) to back the analyzer; see the sketch below this list.
  3. Put the config file of amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 into the plugin resource directory.
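
A minimal sketch of how this could fit together, assuming a DJL HuggingFaceTokenizer wrapped in a Lucene Analyzer that stores per-token weights in the payload attribute. The class names, the placeholder model name, and the weight lookup are hypothetical and not the actual plugin code:

import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical analyzer wrapping a HuggingFace (BERT) tokenizer.
public class BertSparseAnalyzer extends Analyzer {

    // In the real plugin this would be built from the tokenizer config shipped in the
    // plugin resource directory; the model name here is a placeholder.
    private final HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("bert-base-uncased");

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new BertSparseTokenizer(tokenizer));
    }

    // Lucene Tokenizer that delegates to the HuggingFace tokenizer and stores a
    // per-token weight in the payload attribute.
    static final class BertSparseTokenizer extends Tokenizer {
        private final HuggingFaceTokenizer hfTokenizer;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
        private String[] tokens;
        private int pos;

        BertSparseTokenizer(HuggingFaceTokenizer hfTokenizer) {
            this.hfTokenizer = hfTokenizer;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            // Read the whole field value and run it through the HuggingFace tokenizer.
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int read;
            while ((read = input.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            Encoding encoding = hfTokenizer.encode(text.toString());
            tokens = encoding.getTokens();
            pos = 0;
        }

        @Override
        public boolean incrementToken() {
            if (tokens == null || pos >= tokens.length) {
                return false;
            }
            clearAttributes();
            String token = tokens[pos++];
            termAtt.append(token);
            // The proposal encodes the token weight in the payload attribute;
            // the lookup below is a placeholder for the pretrained weight table.
            float weight = lookupWeight(token);
            payloadAtt.setPayload(new BytesRef(ByteBuffer.allocate(Float.BYTES).putFloat(weight).array()));
            return true;
        }

        private static float lookupWeight(String token) {
            return 1.0f; // placeholder weight
        }
    }
}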
@yuye-aws
Member

yuye-aws commented Jan 3, 2025

It's good to see this RFC. I just wonder:

  1. By "analyzer": "bert_tokenizer", do you mean that bert_tokenizer is a built-in tokenizer? What are other supported tokenzer?
  2. You mention that then use analyzer to encode the query text. Can you elaborate more? For example, whether the user needs to register the sparse encoding model first and how does the analyzer locate the model for encoding.
  3. The RFC is targeted for neural sparse query. Is there any blocker for the neural dense query? Perhaps the RFC should consider both queries.

@zhichao-aws
Member Author

Hi @yuye-aws,

By "analyzer": "bert_tokenizer", do you mean that bert_tokenizer is a built-in tokenizer? What are other supported tokenzer?

Yes, we'll build the BERT tokenizer as a built-in tokenizer. For the other supported tokenizers, see https://opensearch.org/docs/latest/analyzers/tokenizers/index/

You mention that the analyzer will be used to encode the query text. Can you elaborate more? For example, does the user need to register the sparse encoding model first, and how does the analyzer locate the model for encoding?

Users only need to configure the analyzer in the index mappings. There is no need to register a model.

The RFC is targeted at the neural sparse query. Is there any blocker for the neural dense query? Perhaps the RFC should consider both queries.

I don't see an overlap between the tokenizer and the neural dense query. A tokenizer can't work alone for dense retrieval, and text embedding models already contain their own tokenizers.
