[BUG] Nested fields in field_map cause pipeline to fail. #109

Closed
dmille opened this issue Jan 27, 2023 · 3 comments
Labels
question Further information is requested


dmille commented Jan 27, 2023

What is the bug?

When defining a field_map containing nested fields, the pipeline fails to compute embeddings.

How can one reproduce the bug?

With the following configuration, which maps a top-level (non-nested) source field, embeddings are computed:

PUT /_ingest/pipeline/neural_pipeline
{
  "description": "Neural Search Pipeline for message content",
  "processors": [
    {
      "text_embedding": {
        "model_id": "SXXx8YUBR2ZWhVQIkghB",
        "field_map": {
          "message": "message_embedding"
        }
      }
    }
  ]
}

PUT /neural-test-index
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "neural_pipeline"
    },
    "mappings": {
        "properties": {
            "message_embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw",
                    "engine": "lucene"
                }
            },
            "message": { 
                "type": "text"            
            },
            "color": {
                "type": "text"
            }
        }
    }
}

POST /_bulk
{"create":{"_index":"neural-test-index","_id":"0"}}
{"message":"Text 1","color":"red"}
{"create":{"_index":"neural-test-index","_id":"1"}}
{"message":"Text 2","color":"black"}

GET /neural-test-index/_search
DELETE /neural-test-index

With the following configuration using a nested source field, embeddings are not computed:

PUT /_ingest/pipeline/neural_pipeline_nested
{
  "description": "Neural Search Pipeline for message content",
  "processors": [
    {
      "text_embedding": {
        "model_id": "SXXx8YUBR2ZWhVQIkghB",
        "field_map": {
          "message.text": "message_embedding"
        }
      }
    }
  ]
}

PUT /neural-test-index-nested
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "neural_pipeline_nested"
    },
    "mappings": {
        "properties": {
            "message_embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw",
                    "engine": "lucene"
                }
            },
            "message.text": { 
                "type": "text"            
            },
            "color": {
                "type": "text"
            }
        }
    }
}

POST /_bulk
{"create":{"_index":"neural-test-index-nested","_id":"0"}}
{"message":{"text":"Text 1"},"color":"red"}
{"create":{"_index":"neural-test-index-nested","_id":"1"}}
{"message":{"text":"Text 2"}, "color": "black"}

GET /neural-test-index-nested/_search
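
One way to confirm which documents came back without an embedding is to inspect the search response programmatically. The following is a minimal sketch, not part of the plugin: the helper names `get_path` and `ids_missing_embedding` are hypothetical, and the sample response is hand-written for illustration rather than captured from a real cluster. It only assumes the standard search response layout (`hits.hits[]` with `_id` and `_source`).

```python
from typing import Any, Dict, List, Optional


def get_path(source: Dict[str, Any], dotted_path: str) -> Optional[Any]:
    """Walk a dotted path such as "message.text" through nested dicts.

    Returns None when any segment of the path is missing.
    """
    node: Any = source
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def ids_missing_embedding(search_response: Dict[str, Any], embedding_path: str) -> List[str]:
    """Return the _id of every hit whose _source lacks a value at embedding_path."""
    hits = search_response.get("hits", {}).get("hits", [])
    return [
        hit["_id"]
        for hit in hits
        if get_path(hit.get("_source", {}), embedding_path) is None
    ]


# Hand-written sample response (illustrative only): document "0" has an
# embedding, document "1" does not.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "0", "_source": {"message": {"text": "Text 1"}, "message_embedding": [0.1, 0.2]}},
            {"_id": "1", "_source": {"message": {"text": "Text 2"}}},
        ]
    }
}

missing = ids_missing_embedding(sample_response, "message_embedding")
```

The path at which the processor writes the embedding depends on how `field_map` is defined, so `embedding_path` is passed in as a parameter and should be checked against an actual response.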

What is the expected behavior?

The neural ingestion pipeline should be able to handle nested fields.

What is your host/environment?

docker image: opensearchproject/opensearch:2.5.0

Do you have any additional context?

The model referenced above was uploaded with the following configuration:

{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.0",
  "description": "sentence transformers model",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers"
  },
  "url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}

dmille added the bug (Something isn't working) and untriaged labels on Jan 27, 2023
Collaborator

navneet1v commented Jan 27, 2023

Hi @dmille,
Thanks for reaching out. I ran the experiment, and you are right: the way you are defining the nested field in the pipeline won't work. The pipeline does support nested fields, though. To use them, please create the pipeline like this:

PUT /_ingest/pipeline/neural_pipeline_nested
{
    "description": "Neural Search Pipeline for message content",
    "processors": [
        {
            "text_embedding": {
                "model_id": "SXXx8YUBR2ZWhVQIkghB",
                "field_map": {
                    "message": {
                        "text": "message_embedding"
                    }
                }
            }
        }
    ]
}

Right now, the TextEmbedding processor doesn't interpret "." in a field name as a nested-field separator. I tested this on my side, and creating the processor as shown above works and handles the nested fields.
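
Until dotted keys are supported, a dotted `field_map` can be expanded into the nested form on the client side before the pipeline is created. The following is a minimal sketch of that idea, not something the plugin provides: `nest_field_map` and `build_pipeline_body` are hypothetical helper names, and the model ID is the one used earlier in this issue.

```python
from typing import Any, Dict


def nest_field_map(flat_map: Dict[str, Any]) -> Dict[str, Any]:
    """Expand dotted keys such as "message.text" into nested dictionaries."""
    nested: Dict[str, Any] = {}
    for dotted_key, target in flat_map.items():
        parts = dotted_key.split(".")
        node = nested
        # Descend through (or create) one dict per intermediate segment.
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"'{part}' is used both as a field and as an object")
            node = child
        leaf = parts[-1]
        if leaf in node:
            raise ValueError(f"'{dotted_key}' conflicts with an existing entry")
        node[leaf] = target
    return nested


def build_pipeline_body(model_id: str, flat_map: Dict[str, Any], description: str = "") -> Dict[str, Any]:
    """Build the JSON body for PUT /_ingest/pipeline/<name> with a nested field_map."""
    return {
        "description": description,
        "processors": [
            {
                "text_embedding": {
                    "model_id": model_id,
                    "field_map": nest_field_map(flat_map),
                }
            }
        ],
    }


body = build_pipeline_body(
    "SXXx8YUBR2ZWhVQIkghB",
    {"message.text": "message_embedding"},
    description="Neural Search Pipeline for message content",
)
```

The resulting `body` has the same shape as the pipeline definition above and can be sent with any HTTP client. Keys without a dot pass through unchanged, so the helper can also be applied to the non-nested configuration.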

I think this is something the plugin could support. I will create a feature request for it.

navneet1v added the question (Further information is requested) label and removed the bug (Something isn't working) and untriaged labels on Jan 27, 2023
Author

dmille commented Jan 27, 2023

@navneet1v Thanks for the prompt reply! This fixed my problem.

@navneet1v
Collaborator

I am closing this issue. I have created a new GitHub issue, #110, to track the feature request.
