Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug where document embedding fails to be generated due to document has dot in field name #1062

Merged

Conversation

yizheliu-amazon
Copy link
Contributor

@yizheliu-amazon yizheliu-amazon commented Jan 6, 2025

Description

Fix bug where document embedding fails to be generated due to document has dot in field name, which does not match field mapping exactly

Related Issues

Resolves #1042

Root cause

Such issue is caused by that we fail to unbox/unflatten field in ingested doc, then it causes mismatch between nested field in fieldMap config and ingested document schema.

For example, nested field can be represented as either {"a.b": "c"}, or {"a": {"b": "c"}}. If fieldMap is {"a": {"b": "b_embedding"}}, but ingested document is like {"a.b": "c"}, such schema mismatch fieldMap, and then we are not able to fetch "c" as value for inferencing.

About the change

Basically, the solution is to unbox/unflatten the field with dot from ingested doc.

Before this change, given fieldMap is below

{
  "description": "text embedding pipeline for exampe",
  "processors": [
    {
      "text_embedding": {
        "model_id": "EJTmBpQBz0MpoghfHptn",
        "field_map": {
          "nested_passages.level_2.level_3_text": "level_3_container.level_3_embedding"
        }
      }
    }
  ]
}

Simulate ingestion for below doc, there is no level_3_container.level_3_embedding generated for it.

{
  "docs": [
    {
      "_index": "neural-search-index-v2",
      "_id": "1",
      "_source": {
        "nested_passages": [
          {
            "level_2.level_3_text": "clown"
          },
          {
            "level_2": {
              "level_3_text": "batman"
            }
          }
        ]
      }
    }
  ]
}

After this change, embedding can be generated

{
  "docs": [
    {
      "doc": {
        "_index": "neural-search-index-v2",
        "_id": "1",
        "_source": {
          "nested_passages": [
            {
              "level_2": {
                "level_3_container": {
                  "level_3_embedding": [
                    0.17990997,
                    0.064552955,
                    ...
                  ]
                },
                "level_3_text": "clown"
              }
            },
            {
              "level_2": {
                "level_3_container": {
                  "level_3_embedding": [
                    0.20891039,
                    0.09188991,
                    ...
                  ]
                },
                "level_3_text": "batman"
              }
            }
          ]
        },
        "_ingest": {
          "timestamp": "2025-01-06T19:00:12.84077Z"
        }
      }
    }
  ]
}

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -80,4 +83,150 @@ public void test_with_different_configurations() throws URISyntaxException, IOEx
}
}

public void test_unflatten_simple_dot_notation() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make the test naming consistent with others?
testUnflatten_whenSimpleDotNotation_thenSucceed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Will do.

Signed-off-by: Yizhe Liu <[email protected]>
@heemin32 heemin32 merged commit 5b9f43b into opensearch-project:main Jan 8, 2025
39 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 8, 2025
…t has dot in field name (#1062)

* Fix bug where document embedding fails to be generated due to document has dot in field name

Signed-off-by: Yizhe Liu <[email protected]>

* Address comments

Signed-off-by: Yizhe Liu <[email protected]>

---------

Signed-off-by: Yizhe Liu <[email protected]>
(cherry picked from commit 5b9f43b)
q-andy pushed a commit to q-andy/neural-search that referenced this pull request Jan 8, 2025
…t has dot in field name (opensearch-project#1062)

* Fix bug where document embedding fails to be generated due to document has dot in field name

Signed-off-by: Yizhe Liu <[email protected]>

* Address comments

Signed-off-by: Yizhe Liu <[email protected]>

---------

Signed-off-by: Yizhe Liu <[email protected]>
yizheliu-amazon added a commit to yizheliu-amazon/neural-search that referenced this pull request Jan 10, 2025
…t has dot in field name (opensearch-project#1062)

* Fix bug where document embedding fails to be generated due to document has dot in field name

Signed-off-by: Yizhe Liu <[email protected]>

* Address comments

Signed-off-by: Yizhe Liu <[email protected]>

---------

Signed-off-by: Yizhe Liu <[email protected]>
junqiu-lei pushed a commit that referenced this pull request Jan 10, 2025
… due to document has dot in field name (#1076)

* Fix bug where document embedding fails to be generated due to document has dot in field name (#1062)

* Fix bug where document embedding fails to be generated due to document has dot in field name

Signed-off-by: Yizhe Liu <[email protected]>

* Address comments

Signed-off-by: Yizhe Liu <[email protected]>

---------

Signed-off-by: Yizhe Liu <[email protected]>

* Clean up unused validateFieldName() and use existing methods for TextEmbeddingProcessorIT (#1074)

Signed-off-by: Yizhe Liu <[email protected]>

---------

Signed-off-by: Yizhe Liu <[email protected]>
martin-gaievski pushed a commit that referenced this pull request Jan 10, 2025
…t has dot in field name (#1062)

* Fix bug where document embedding fails to be generated due to document has dot in field name

Signed-off-by: Yizhe Liu <[email protected]>

* Address comments

Signed-off-by: Yizhe Liu <[email protected]>

---------

Signed-off-by: Yizhe Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Fail to generate embedding for ingest document with nested field defined in field map
4 participants