
[Proposal][RFC] Support analyzer-based neural sparse query & build BERT tokenizer as pre-defined tokenizer #1052

zhichao-aws opened this issue Jan 3, 2025 · 2 comments

@zhichao-aws
Member

What/Why

What problems are you trying to solve?

Currently, for the neural sparse query, users need to register a sparse_encoding/sparse_tokenize model in advance and provide the model ID in the query body. For bi-encoder mode, we do need the ml-commons suite to manage the lifecycle of sparse encoding models. But for doc-only mode, we only use a tokenizer at query time, and managing it with the ml-commons suite is rather heavyweight. This has several drawbacks:

  • users need to configure the only_run_on_ml_node setting to enable the tokenizer on data nodes
  • users need to register the model and manage the model group, and even the model_id
  • the tokenizer predict requests are dispatched among cluster nodes, which incurs extra network traffic

What are you proposing?

Build an analyzer-based neural sparse query. The sparse_tokenize model will be wrapped as a Lucene Analyzer. Users bind the analyzer to an index field, and the neural sparse query calls the analyzer to encode the query.

The pretrained amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 will be supported as a pre-defined tokenizer. The token weights are encoded in the payload attribute.

Besides being used for the neural sparse query, the analyzer can also be invoked like any other analyzer, e.g. via the analyze API or the chunking processor.
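
For illustration, the built-in analyzer could be called directly through the analyze API. This is only a sketch: the analyzer name follows the mapping example below, and the exact response shape (including how the token weights surface) is still to be decided.

GET /_analyze
{
  "analyzer": "bert_tokenizer",
  "text": "hello world"
}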

What is the developer experience going to be?

We will alter the model_id verification logic in the neural sparse query builder and add the pre-defined BERT analyzer.

Are there any security considerations?

N/A

Are there any breaking changes to the API?

We'll support a new way to issue the neural sparse query: users can bind the analyzer to the index field instead of providing a model ID in the query body.

What is the user experience going to be?

create index

PUT /my-index
{
  "settings": {
    "default_pipeline": "nlp-ingest-pipeline-sparse"
  },
  "mappings": {
    "properties": {
      "passage_embedding": {
        "type": "rank_features",
        "analyzer": "bert_tokenizer"
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}
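
The default_pipeline above is shown for completeness; in doc-only mode, documents are still ingested with a sparse encoding model. A sketch of such a pipeline using the existing sparse_encoding processor (the model ID below is a placeholder):

PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
{
  "description": "Ingest pipeline for doc-only neural sparse search",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<doc-side sparse encoding model id>",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}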

search

GET my-index/_search
{
    "query":{
        "neural_sparse": {
            "passage_embedding": {
                "query_text": "hello world"
            }
        }
    }
}
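
For comparison, the current doc-only query requires a registered tokenizer model and passes its ID in the query body (the model ID below is a placeholder):

GET my-index/_search
{
    "query":{
        "neural_sparse": {
            "passage_embedding": {
                "query_text": "hello world",
                "model_id": "<sparse tokenizer model id>"
            }
        }
    }
}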

What will it take to execute?

  1. Modify the neural sparse query logic: the model ID is no longer required, and the query will read the analyzer from the shard context and use it to encode the query text.
  2. Use the HuggingFaceTokenizer implementation from the DJL library (DJL is already a dependency of ml-commons) to back the analyzer; see the sketch below this list.
  3. Put the config file of amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 into the plugin resource directory.
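
A minimal sketch of how this could fit together, assuming a DJL HuggingFaceTokenizer wrapped in a Lucene Analyzer that stores per-token weights in the payload attribute. The class names, the placeholder model name, and the weight lookup are hypothetical and not the actual plugin code:

import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical analyzer wrapping a HuggingFace (BERT) tokenizer.
public class BertSparseAnalyzer extends Analyzer {

    // In the real plugin this would be built from the tokenizer config shipped in the
    // plugin resource directory; the model name here is a placeholder.
    private final HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("bert-base-uncased");

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new BertSparseTokenizer(tokenizer));
    }

    // Lucene Tokenizer that delegates to the HuggingFace tokenizer and stores a
    // per-token weight in the payload attribute.
    static final class BertSparseTokenizer extends Tokenizer {
        private final HuggingFaceTokenizer hfTokenizer;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
        private String[] tokens;
        private int pos;

        BertSparseTokenizer(HuggingFaceTokenizer hfTokenizer) {
            this.hfTokenizer = hfTokenizer;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            // Read the whole field value and run it through the HuggingFace tokenizer.
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int read;
            while ((read = input.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            Encoding encoding = hfTokenizer.encode(text.toString());
            tokens = encoding.getTokens();
            pos = 0;
        }

        @Override
        public boolean incrementToken() {
            if (tokens == null || pos >= tokens.length) {
                return false;
            }
            clearAttributes();
            String token = tokens[pos++];
            termAtt.append(token);
            // The proposal encodes the token weight in the payload attribute;
            // the lookup below is a placeholder for the pretrained weight table.
            float weight = lookupWeight(token);
            payloadAtt.setPayload(new BytesRef(ByteBuffer.allocate(Float.BYTES).putFloat(weight).array()));
            return true;
        }

        private static float lookupWeight(String token) {
            return 1.0f; // placeholder weight
        }
    }
}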
@yuye-aws
Member

yuye-aws commented Jan 3, 2025

It's good to see this RFC. I just wonder:

  1. By "analyzer": "bert_tokenizer", do you mean that bert_tokenizer is a built-in tokenizer? What are other supported tokenzer?
  2. You mention that then use analyzer to encode the query text. Can you elaborate more? For example, whether the user needs to register the sparse encoding model first and how does the analyzer locate the model for encoding.
  3. The RFC is targeted for neural sparse query. Is there any blocker for the neural dense query? Perhaps the RFC should consider both queries.

@zhichao-aws
Member Author

Hi @yuye-aws,

By "analyzer": "bert_tokenizer", do you mean that bert_tokenizer is a built-in tokenizer? What are other supported tokenzer?

Yes, we'll build the BERT tokenizer as a built-in tokenizer. For the other supported tokenizers, see https://opensearch.org/docs/latest/analyzers/tokenizers/index/

You mention that the analyzer will be used to encode the query text. Can you elaborate more? For example, does the user need to register the sparse encoding model first, and how does the analyzer locate the model for encoding?

Users only need to configure the analyzer in the index mappings. There is no need to register a model.

The RFC is targeted at the neural sparse query. Is there any blocker for the neural dense query? Perhaps the RFC should consider both queries.

I don't see an overlap between the tokenizer and the neural dense query. A tokenizer can't work alone for dense retrieval, and text embedding models already contain their own tokenizers.
