
semantic_text - Field inference #103697

Conversation

carlosdelest
Member

@carlosdelest carlosdelest commented Dec 22, 2023

Performs inference in TransportBulkAction. For every group of BulkShardRequests, it performs inference on each individual request.

Bulk inference is done at the document and model level. We can extend this in the future for multiple docs.

For now, there is no chunking - but the source format is prepared to deal with it as it stores arrays of embeddings and text pairs that will be used in nested queries for passage retrieval:

{
  "infer_field": "these are not the droids you're looking for. He's free to go around",
  "another_infer_field": "Carry on. Carry on",
  "non_infer_field": "hello",
  "_semantic_text_inference": {
    "infer_field": [
      {
        "sparse_embedding": {
          "play": 0.34588584,
          "legend": 0.005075309,
          "about": 0.13270257,
          "ship": 0.13503131,
          "anime": 0.31627595,
          "walk": 0.30274966
        },
        "text": "these are not the droids you're looking for. He's free to go around"
      }
    ],
    "another_infer_field": [
      {
        "sparse_embedding": {
          "gift": 0.027486322,
          "ryan": 0.67748386,
          "possession": 0.37753758,
          "bring": 0.88360184,
          "pocket": 0.08802759
        },
        "text": "Carry on. Carry on"
      }
    ]
  }
}

Mikep86 and others added 30 commits December 8, 2023 18:10
Member Author

Added infra for doing YAML tests on ML plugin

@@ -42,6 +42,9 @@
import java.util.Set;
import java.util.stream.Collectors;

import static org.elasticsearch.action.bulk.BulkShardRequestInferenceProvider.SPARSE_VECTOR_SUBFIELD_NAME;
Member Author

Changed the dependencies so the field mappers depend on the constants defined in server code. That makes sense as server code is the one generating the embeddings.

Member

I see what you are doing here. OK, I can understand that. Basically, the thing that is satisfying the interface gets access to this param.

This is better than it was. It does seem backwards - the plugin should know the mapper & how it extracts things. But this is better than before :)

Member Author

Yes, it would be more contained if everything was in the plugin. That would mean moving the inference generation class to the plugin and injecting it into the TransportBulkAction and BulkOperation classes.

Let me check what that would look like in a separate branch.

@@ -256,32 +252,8 @@ public void testMissingSubfields() throws IOException {
);
assertThat(
ex.getMessage(),
containsString(
Member Author

This test isn't needed, as there won't be an additional level of nesting for the results.

public static final TypeParser PARSER = new FixedTypeParser(c -> new SemanticTextInferenceResultFieldMapper());

private static final Map<List<String>, Set<String>> REQUIRED_SUBFIELDS_MAP = Map.of(
List.of(),
Set.of(SPARSE_VECTOR_SUBFIELD_NAME, TEXT_SUBFIELD_NAME),
Member Author

The structure for the embeddings changes a bit. The field mapper was prepared to handle an additional nesting level, but that is not required, as the asMap() method from the results does not return the information in that format.

import java.util.function.Consumer;
import java.util.stream.Collectors;

public class BulkShardRequestInferenceProvider {
Member Author

I changed the dependency, so this class defines the constants and they are used from the field mappers. LMK if this addresses your concerns.

public static ElasticsearchCluster cluster = ElasticsearchCluster.local()
.setting("xpack.security.enabled", "false")
.setting("xpack.security.http.ssl.enabled", "false")
.plugin("org.elasticsearch.xpack.inference.mock.TestInferenceServicePlugin")
Member Author

Uses the TestInferenceServicePlugin, which defines a mock inference service.

@carlosdelest carlosdelest marked this pull request as ready for review February 7, 2024 17:12
@carlosdelest carlosdelest requested a review from a team as a code owner February 7, 2024 17:12
Member Author

@jimczi LMKWYT of these integration tests. Do you think it would be valuable to use the _bulk API in a separate test suite to test it?

@carlosdelest carlosdelest requested a review from jimczi February 7, 2024 18:06
Contributor

@Mikep86 Mikep86 left a comment

Partial review. Looking good!

k -> new HashMap<String, Object>()
);

List<String> inferenceFieldNames = getFieldNamesForInference(fieldModelsEntrySet, docMap);
Contributor

We could simplify the method signature to getFieldNamesForInference(Set<String> inferenceFields, Map<String, Object> docMap) by passing fieldModelsEntrySet.getValue(). This would also make it clearer that this helper method doesn't use the model ID.
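For illustration, the suggested signature could look like this (a sketch using the names from the snippets in this thread; the surrounding class and the string-value check are assumptions based on the discussion, not the PR's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class InferenceFieldHelper {
    // Suggested signature: takes only the field-name set, making it clear
    // that the model ID (the map entry's key) is not used by this helper.
    static List<String> getFieldNamesForInference(Set<String> inferenceFields, Map<String, Object> docMap) {
        List<String> inferenceFieldNames = new ArrayList<>();
        for (String inferenceField : inferenceFields) {
            Object fieldValue = docMap.get(inferenceField);
            // Perform inference on string, non-null values only
            if (fieldValue instanceof String) {
                inferenceFieldNames.add(inferenceField);
            }
        }
        return inferenceFieldNames;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of("title", "hello", "count", 42);
        List<String> fields = getFieldNamesForInference(Set.of("title", "count", "missing"), doc);
        System.out.println(fields); // only "title" holds a string value
    }
}
```

The caller would pass `fieldModelsEntrySet.getValue()` instead of the whole entry.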

Member Author

Good suggestion! 👍

}

private static List<String> getFieldNamesForInference(Map.Entry<String, Set<String>> fieldModelsEntrySet, Map<String, Object> docMap) {
List<String> inferenceFieldNames = new ArrayList<>();
Contributor

Minor optimization: We could pre-allocate an ArrayList of the maximum required size by using the inference field set size

Member Author

I tend not to do that for small lists - the backing array is expanded to 10 elements when the first element is added anyway, so pre-sizing would probably just remove one list expansion.
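For reference, a quick illustration of the trade-off discussed here (a sketch; ArrayList's backing array is allocated lazily and defaults to capacity 10 on first add):

```java
import java.util.ArrayList;
import java.util.List;

public class ListSizing {
    public static void main(String[] args) {
        // new ArrayList<>() allocates no backing array until the first add,
        // at which point it jumps straight to the default capacity of 10.
        // So for small lists, pre-sizing saves at most one resize.
        List<String> lazy = new ArrayList<>();
        lazy.add("first");

        // Pre-sizing mainly helps when the expected size exceeds the default:
        List<String> preSized = new ArrayList<>(100);
        for (int i = 0; i < 100; i++) {
            preSized.add("field-" + i); // no intermediate resizes
        }
        System.out.println(lazy.size() + " " + preSized.size());
    }
}
```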

// Perform inference on string, non-null values
if (fieldValue instanceof String) {
inferenceFieldNames.add(inferenceField);
}
Contributor

Do we need to handle when the field value is a non-null & non-String value (i.e. when the user has provided a value with an invalid data type)? Or will that be handled somewhere downstream/upstream?

Member Author

Good catch, we don't as of now.

There are some cases to consider; I'll work on them:

  • Array values: We could treat these as chunking, and perform inference on every array value
  • Non-string values: Convert them to strings before doing inference. Don't error out when we have a non-string value, similar to how text works.

My only concern with converting non-strings is that we'd be doing inference on fields where inference makes no sense - and potentially incurring costs - instead of warning the user.

Contributor

I wasn't even thinking of multi-valued text fields (i.e. array of strings), but that's a case we need to handle here as well.

I was thinking of handling obvious error cases, such as when the value is a Map and can't be converted to a string in a sensible way.

Contributor

Just checked how the text field handles this:

  • Primitive value (i.e. string, bool, number): Coerce to string
  • Array of primitive values: Index each value separately, coerce each value to string
  • Object (i.e. Map): Throw error
  • Array containing an object: Throw error
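That text-field behaviour could be mirrored with a small coercion helper along these lines (a sketch; the method name and error message are hypothetical, not the actual mapper code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ValueCoercion {
    // Mirrors how the text field handles values: primitives coerce to string,
    // arrays are flattened element by element, objects are rejected.
    static List<String> coerceForInference(Object fieldValue) {
        if (fieldValue == null) {
            return List.of();
        }
        if (fieldValue instanceof List<?> values) {
            // Array: handle each element separately (an object inside still throws)
            List<String> result = new ArrayList<>(values.size());
            for (Object value : values) {
                result.addAll(coerceForInference(value));
            }
            return result;
        }
        if (fieldValue instanceof Map) {
            throw new IllegalArgumentException("Cannot perform inference on an object value");
        }
        // Primitive value (string, boolean, number): coerce to string
        return List.of(fieldValue.toString());
    }

    public static void main(String[] args) {
        System.out.println(coerceForInference("some text"));      // [some text]
        System.out.println(coerceForInference(List.of("a", 42))); // [a, 42]
        System.out.println(coerceForInference(true));             // [true]
    }
}
```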

List<Map<String, Object>> inferenceFieldResultList = (List<Map<String, Object>>) rootInferenceFieldMap
.computeIfAbsent(fieldName, k -> new ArrayList<>());
// Remove previous inference results if any
inferenceFieldResultList.clear();
Contributor

If we always remove previous inference results, doesn't that mean we will re-run inference for every semantic_text field on a reindex?

Member Author

You're totally correct. We need additional logic to handle that.

This code is correct: when we receive inference results back, we want to remove the previous inference.

But we should avoid recalculating on reindex:

  • We should always avoid calculating inference for an index action if there are already inference results.
  • We should always recalculate inference for an update action for the included inference fields in the request

I'll work on that and add some more tests. Thanks!
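Those two rules could be sketched roughly as follows (a sketch only; the method name, the `isUpdateRequest` flag, and the way existing results are looked up are assumptions based on this discussion, not the PR's actual code):

```java
import java.util.List;
import java.util.Map;

public class InferenceSkipLogic {
    // Root field name as discussed in this PR
    static final String ROOT_FIELD = "_semantic_text_inference";

    static boolean needsInference(Map<String, Object> docMap, String fieldName, boolean isUpdateRequest) {
        if (isUpdateRequest) {
            // Updates always recalculate inference for the included fields
            return docMap.containsKey(fieldName);
        }
        Object existing = docMap.get(ROOT_FIELD);
        if (existing instanceof Map<?, ?> inferenceMap && inferenceMap.containsKey(fieldName)) {
            // Index action with existing results (e.g. a reindex): skip inference
            return false;
        }
        return docMap.containsKey(fieldName);
    }

    public static void main(String[] args) {
        Map<String, Object> reindexedDoc = Map.of(
            "infer_field", "some text",
            ROOT_FIELD, Map.of("infer_field", List.of())
        );
        System.out.println(needsInference(reindexedDoc, "infer_field", false)); // false: results already present
        System.out.println(needsInference(reindexedDoc, "infer_field", true));  // true: update recalculates
    }
}
```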

Contributor

We could also break this out into a follow-up task if it makes the scope of this PR too big. It's already pretty chonky 😵‍💫

String modelId = fieldModelsEntrySet.getKey();

@SuppressWarnings("unchecked")
Map<String, Object> rootInferenceFieldMap = (Map<String, Object>) docMap.computeIfAbsent(
Contributor

How do we plan on handling cast errors here? It's theoretically possible for the user to provide a value for the _semantic_text_inference field with an invalid data type.

Member Author

I'd say we're safe as we're casting to Map<String, Object> - and the document source will already be parsed at this point. I don't think this can fail if it's valid JSON (as it should be at that stage) 🤔

Member

and the document source will be parsed already at this point.

Parsed by whom?

There are three scenarios here:

  • User doing a "put/post" with the field already defined (obviously, hasn't been indexed already)
  • A Reindex occurring, obviously, this one is OK as it was indexed previously somewhere (hopefully by us :/)
  • A new document where this field exists or doesn't based on previously inferred values (more than one field, meaning it doesn't exist for the first inference result but does for the next).

This is the tricky part of having things that the mapper generally validates handled further up in the ingest pipeline.

So, we need to validate things are what we expect. We don't need to do a full parse of the internals (the mapper does this), but I am not sure blindly casting is wise.

Do we have tests covering the above scenarios I laid out yet?

Member Author

the variable docMap is retrieved using sourceAsMap() from the index or update request, which I believe invokes the XContentParser. So, it should be parsed at this point, right?

Contributor

I think the issue is that _source can be valid JSON without meeting the cast type expectations. For example:

{
  "_semantic_text_inference": "foo"
}
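One way to guard the cast against that kind of input (a sketch; the method name and error message are illustrative, not the PR's actual code):

```java
import java.util.HashMap;
import java.util.Map;

public class InferenceFieldValidation {
    static final String ROOT_FIELD = "_semantic_text_inference";

    // Instead of blindly casting, validate the shape of the (possibly
    // user-supplied) value before reusing it.
    @SuppressWarnings("unchecked")
    static Map<String, Object> rootInferenceFieldMap(Map<String, Object> docMap) {
        Object value = docMap.computeIfAbsent(ROOT_FIELD, k -> new HashMap<String, Object>());
        if (value instanceof Map == false) {
            throw new IllegalArgumentException(
                "[" + ROOT_FIELD + "] must be an object, got [" + value.getClass().getSimpleName() + "]"
            );
        }
        return (Map<String, Object>) value;
    }

    public static void main(String[] args) {
        Map<String, Object> validDoc = new HashMap<>();
        System.out.println(rootInferenceFieldMap(validDoc).isEmpty()); // true: created empty

        // The problematic case from the example above: valid JSON, wrong type
        Map<String, Object> invalidDoc = new HashMap<>(Map.of(ROOT_FIELD, "foo"));
        try {
            rootInferenceFieldMap(invalidDoc);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```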

Member Author

You're right, for some reason I was not seeing that. Thanks for catching this!

Member

from the index or update request, which I believe invokes the XContentParser. So, it should be parsed at this point, right?

It can be valid JSON (or CBOR, or SMILE), but it can 100% be invalid for what we care about.

}

tasks.named('yamlRestTest') {
usesDefaultDistribution()
Contributor

Is it strictly required that we use the default distribution here? Can these tests simply explicitly install the plugins/modules necessary?

Member Author

I'm not familiar with this. Could you please point me to some docs explaining the process, or examples that don't use the default distribution to check? 🙏

@@ -0,0 +1,15 @@
apply plugin: 'elasticsearch.internal-yaml-rest-test'
Contributor

I don't think we need to create another QA project here. We can just apply this plugin to the :x-pack:plugin:inference project and run these tests there.

Member Author

I thought that was a common pattern - I've moved it directly under the plugin root dir 👍

Contributor

Out of necessity mostly. The new testing framework makes it unnecessary in almost all circumstances.

Member Author

noted, thank you Mark!

Contributor

@jimczi jimczi left a comment

Thanks for adding these tests. I think the infrastructure is in place and we can iterate from here to add the missing pieces. +1 to merge on the branch so that we can start working on batching as a follow up.

Contributor

@Mikep86 Mikep86 left a comment

I really like how this is progressing :)

Contributor

Very nice tests!

@@ -1941,13 +1941,16 @@ protected void assertSnapshotOrGenericThread() {
client,
null,
() -> DocumentParsingObserver.EMPTY_INSTANCE

Contributor

Nitpick: Odd place for whitespace

- match: { _source._semantic_text_inference.inference_field.0.text: "updated inference test" }
- match: { _source._semantic_text_inference.another_inference_field.0.text: "another updated inference test" }


Contributor

I'm not sure how TestInferenceServicePlugin is generating embeddings, but is it possible to test that the embeddings have changed here?


- match: { _source._semantic_text_inference.inference_field.0.sparse_embedding: $inference_field_embedding }
- match: { _source._semantic_text_inference.another_inference_field.0.sparse_embedding: $another_inference_field_embedding }

Contributor

Since we know that (currently, at least) TransportBulkAction will re-generate embeddings for semantic_text fields on reindex, IMO this test should indicate as such by failing right now.

Maybe we could have TestInferenceServicePlugin generate random embeddings, regardless of input text, so we can determine when the inference service has been called?

Member Author

I need to iterate on that idea - I have already tried it, but the problem is checking that something doesn't match - AFAIK there's no not_match construct for YAML tests that would support failing the comparison.

* "dragon": 0.50991,
* "type": 0.23241979,
* "dr": 1.9312073,
* "##o": 0.2797593
Contributor

Nitpick: Indentation is off here

Contributor

The updates to this class are incomplete, there are still references to SparseEmbeddingResults.Embedding.EMBEDDING & SparseEmbeddingResults.Embedding.IS_TRUNCATED. The tests still pass because the corresponding references in the tests still exist as well.

I can handle updating SemanticTextInferenceResultFieldMapper & SemanticTextInferenceResultFieldMapperTests in a separate PR if you like; it would also keep the scope of this PR more limited.

Member Author

I see, good catch - it would help if you could push the changes to this branch, or tackle it as a separate PR.

Contributor

I'd like to handle this in a separate PR if that's OK

Member Author

sounds good, thanks Mike!

@carlosdelest
Member Author

@benwtrent , @Mikep86, @jimczi :

I've been working on adding support for:

  • Other field types (boolean, numbers)
  • Arrays
  • _reindex and _update_by_query, so inference is not recalculated

I'm making good progress, but it's adding quite a few lines to this PR. I'd like to add it iteratively if that's ok with you.

If you're missing something that can be added afterwards, please let me know and we can discuss.

Thanks!

@Mikep86
Contributor

Mikep86 commented Feb 8, 2024

@carlosdelest Agree that we should iterate on this through multiple PRs. This one is already huge! Can we just ensure that we capture all the follow-ups before closing this PR?

@benwtrent
Member

Sounds good to me @carlosdelest do your thing :). As long as we get tests and such.

@carlosdelest
Member Author

carlosdelest commented Feb 8, 2024

Sounds good to me @carlosdelest do your thing :).

My thing seems to be iterating on this PR forever. That's my idea of purgatory as of now. 👿 Thank you Ben!

As long as we get tests and such.

I've already implemented some YAML tests that handle some of the cases. I'll get back to them when I add support for the missing pieces.

@carlosdelest
Member Author

Thanks everyone for your input and guidance on this PR. I'm merging it on the feature branch.

@carlosdelest carlosdelest merged commit ca65a70 into elastic:feature/semantic-text Feb 9, 2024
13 of 14 checks passed
Labels
:ml (Machine learning), Team:ML (Meta label for the ML team), WIP
8 participants