
[feature/semantic-text] semantic text copy to support #106689

Merged
Changes from all commits (30 commits):
7aaa3b6
Revert "Extract interface from ModelRegistry so it can be used from s…
jimczi Mar 7, 2024
ebc26d2
inference as an action filter
jimczi Mar 14, 2024
86ddc9d
add more tests
jimczi Mar 18, 2024
dd73d01
Merge branch 'feature/semantic-text' into carlosdelest/semantic-text-…
carlosdelest Mar 19, 2024
c5de0da
Merge from feature branch
carlosdelest Mar 19, 2024
cf62b1b
Allow SemanticTextFieldMapper to be a multifield
carlosdelest Mar 19, 2024
f029015
Add multifields / copy_to tests to lookup and metadata
carlosdelest Mar 19, 2024
d3f9d86
First iteration for adding inference support for copy_to / multifields
carlosdelest Mar 19, 2024
140caa3
Add tests for copy_to
carlosdelest Mar 19, 2024
7d1c92a
Spotless
carlosdelest Mar 19, 2024
e023a19
Minor changes from previous PR
carlosdelest Mar 20, 2024
05aa06f
Add getMapper(field) to Mapper, so both field with multifields and ob…
carlosdelest Mar 21, 2024
c80677f
multifields support
carlosdelest Mar 22, 2024
5896c60
Add copy_to support
carlosdelest Mar 22, 2024
df0cc90
Spotless
carlosdelest Mar 22, 2024
6d4bbf3
Fix tests
carlosdelest Mar 22, 2024
068615a
Remove the need to get mappings by using MappingLookup.getFieldType()
carlosdelest Mar 22, 2024
5c3d9c4
Revert "Add getMapper(field) to Mapper, so both field with multifield…
carlosdelest Mar 22, 2024
ace17dd
PR feedback
carlosdelest Mar 26, 2024
8bebb8d
Merge branch 'feature/semantic-text' into carlosdelest/semantic-text-…
carlosdelest Mar 26, 2024
47a22d7
Fix merge with feature branch
carlosdelest Mar 26, 2024
5311424
Fix merge with feature branch
carlosdelest Mar 27, 2024
707b3f1
Remove multifield testing
carlosdelest Mar 27, 2024
5e5b32a
Now, actually drop multifields support
carlosdelest Mar 27, 2024
1abf95c
Merge branch 'feature/semantic-text' into carlosdelest/semantic-text-…
carlosdelest Mar 27, 2024
29dd33e
Fix merge with main
carlosdelest Mar 27, 2024
b17d584
Merge branch 'feature/semantic-text' into carlosdelest/semantic-text-…
carlosdelest Apr 1, 2024
82ffb5b
Fix merge with feature branch
carlosdelest Apr 1, 2024
6b26200
Check that bulk updates use all source fields for updating a semantic…
carlosdelest Apr 1, 2024
bf5b837
Add check for inference results when no value is provided
carlosdelest Apr 1, 2024
@@ -22,6 +22,7 @@
 import java.util.Arrays;
 import java.util.List;
 import java.util.Map;
+import java.util.Set;
 
 import static org.elasticsearch.xcontent.XContentFactory.jsonBuilder;
 import static org.hamcrest.Matchers.equalTo;

@@ -106,6 +107,12 @@ public void testCopyToFieldsParsing() throws Exception {
 
         fieldMapper = mapperService.documentMapper().mappers().getMapper("new_field");
         assertThat(fieldMapper.typeName(), equalTo("long"));
+
+        MappingLookup mappingLookup = mapperService.mappingLookup();
+        assertThat(mappingLookup.sourcePaths("another_field"), equalTo(Set.of("copy_test", "int_to_str_test", "another_field")));
+        assertThat(mappingLookup.sourcePaths("new_field"), equalTo(Set.of("new_field", "int_to_str_test")));
+        assertThat(mappingLookup.sourcePaths("copy_test"), equalTo(Set.of("copy_test", "cyclic_test")));
+        assertThat(mappingLookup.sourcePaths("cyclic_test"), equalTo(Set.of("cyclic_test", "copy_test")));
     }
 
     public void testCopyToFieldsInnerObjectParsing() throws Exception {
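For reference, a hypothetical mapping consistent with the sourcePaths assertions above (the real fixture is defined earlier in the test file; field types are assumptions, except new_field, which the test asserts is long):

    {
      "properties": {
        "copy_test": { "type": "text", "copy_to": ["another_field", "cyclic_test"] },
        "int_to_str_test": { "type": "integer", "copy_to": ["another_field", "new_field"] },
        "cyclic_test": { "type": "text", "copy_to": ["copy_test"] },
        "another_field": { "type": "text" },
        "new_field": { "type": "long" }
      }
    }

Each sourcePaths set is the field itself plus every field that declares a copy_to into it, which is exactly the set the inference filter later walks to gather values for a semantic_text field.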
@@ -224,6 +224,9 @@ public void testSourcePathFields() throws IOException {
         final Set<String> fieldsUsingSourcePath = new HashSet<>();
         ((FieldMapper) mapper).sourcePathUsedBy().forEachRemaining(mapper1 -> fieldsUsingSourcePath.add(mapper1.name()));
         assertThat(fieldsUsingSourcePath, equalTo(Set.of("field.subfield1", "field.subfield2")));
+
+        assertThat(mapperService.mappingLookup().sourcePaths("field.subfield1"), equalTo(Set.of("field")));
+        assertThat(mapperService.mappingLookup().sourcePaths("field.subfield2"), equalTo(Set.of("field")));
     }
 
     public void testUnknownLegacyFieldsUnderKnownRootField() throws Exception {
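The multifield case is the mirror image. A hypothetical fixture consistent with these assertions (subfield types are assumptions):

    {
      "properties": {
        "field": {
          "type": "text",
          "fields": {
            "subfield1": { "type": "keyword" },
            "subfield2": { "type": "keyword" }
          }
        }
      }
    }

sourcePaths for a multifield resolves to the parent field name, since that is where the value actually lives in _source.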
@@ -388,10 +388,12 @@ private Map<String, List<FieldInferenceRequest>> createFieldInferenceRequests(Bu
                 // item was already aborted/processed by a filter in the chain upstream (e.g. security)
                 continue;
             }
+            boolean isUpdateRequest = false;
             final IndexRequest indexRequest;
             if (item.request() instanceof IndexRequest ir) {
                 indexRequest = ir;
             } else if (item.request() instanceof UpdateRequest updateRequest) {
+                isUpdateRequest = true;
                 if (updateRequest.script() != null) {
                     addInferenceResponseFailure(
                         item.id(),

@@ -417,35 +419,50 @@
                 String field = entry.getName();
                 String inferenceId = entry.getInferenceId();
                 Object inferenceResult = inferenceMap.remove(field);
-                var value = XContentMapValues.extractValue(field, docMap);
-                if (value == null) {
-                    if (inferenceResult != null) {
-                        addInferenceResponseFailure(
-                            item.id(),
-                            new ElasticsearchStatusException(
-                                "The field [{}] is referenced in the [{}] metadata field but has no value",
-                                RestStatus.BAD_REQUEST,
-                                field,
-                                InferenceMetadataFieldMapper.NAME
-                            )
-                        );
-                    }
-                    continue;
-                }
-                ensureResponseAccumulatorSlot(item.id());
-                if (value instanceof String valueStr) {
-                    List<FieldInferenceRequest> fieldRequests = fieldRequestsMap.computeIfAbsent(inferenceId, k -> new ArrayList<>());
-                    fieldRequests.add(new FieldInferenceRequest(item.id(), field, valueStr));
-                } else {
-                    addInferenceResponseFailure(
-                        item.id(),
-                        new ElasticsearchStatusException(
-                            "Invalid format for field [{}], expected [String] got [{}]",
-                            RestStatus.BAD_REQUEST,
-                            field,
-                            value.getClass().getSimpleName()
-                        )
-                    );
-                }
+                for (var sourceField : entry.getSourceFields()) {
+                    var value = XContentMapValues.extractValue(sourceField, docMap);
+                    if (value == null) {
+                        if (isUpdateRequest) {
+                            addInferenceResponseFailure(
+                                item.id(),
+                                new ElasticsearchStatusException(
+                                    "Field [{}] must be specified on an update request to calculate inference for field [{}]",
+                                    RestStatus.BAD_REQUEST,
+                                    sourceField,
+                                    field
+                                )
+                            );
+                        } else if (inferenceResult != null) {
+                            addInferenceResponseFailure(
+                                item.id(),
+                                new ElasticsearchStatusException(
+                                    "The field [{}] is referenced in the [{}] metadata field but has no value",
+                                    RestStatus.BAD_REQUEST,
+                                    field,
+                                    InferenceMetadataFieldMapper.NAME
+                                )
+                            );
+                        }
+                        continue;
+                    }
+                    ensureResponseAccumulatorSlot(item.id());
+                    if (value instanceof String valueStr) {
+                        List<FieldInferenceRequest> fieldRequests = fieldRequestsMap.computeIfAbsent(
+                            inferenceId,
+                            k -> new ArrayList<>()
+                        );
+                        fieldRequests.add(new FieldInferenceRequest(item.id(), field, valueStr));
+                    } else {
+                        addInferenceResponseFailure(
+                            item.id(),
+                            new ElasticsearchStatusException(
+                                "Invalid format for field [{}], expected [String] got [{}]",
+                                RestStatus.BAD_REQUEST,
+                                field,
+                                value.getClass().getSimpleName()
+                            )
+                        );
+                    }
+                }
             }
         }

Review thread on the new isUpdateRequest check (conversation marked as resolved):

Member Author: I've added this check for disallowing updates that do not use all source fields. WDYT?

Contributor: This looks good, except for one edge case: say that a user wants to use an update request to delete a field from a document. Traditionally, they would do this by explicitly setting the field value to null. How do we tell the difference between the null we get when the user didn't set a value and the null we get when the user explicitly set it? IMO, since we have a way to support partial updates in the near future (chunk source attribution), we could just live with this edge case for now.
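To make that edge case concrete, a minimal sketch (hypothetical document maps, not code from the PR) of why the filter cannot distinguish the two intents:

    import java.util.HashMap;
    import java.util.Map;
    import org.elasticsearch.common.xcontent.XContentMapValues;

    // An update doc that simply omits source_field ...
    Map<String, Object> omitted = new HashMap<>();
    omitted.put("inference_field", "updated inference test");

    // ... versus one that explicitly sets it to null to delete the field.
    Map<String, Object> explicitNull = new HashMap<>();
    explicitNull.put("inference_field", "updated inference test");
    explicitNull.put("source_field", null);

    // extractValue returns null in both cases, so the filter sees the same thing:
    assert XContentMapValues.extractValue("source_field", omitted) == null;
    assert XContentMapValues.extractValue("source_field", explicitNull) == null;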
@@ -348,6 +348,8 @@ private void parseResultsObject(
             }
             parser.nextToken();
             fieldMapper.parse(context);
+            // Reset leaf object after parsing the field
+            context.path().setWithinLeafObject(true);
         }
         if (visited.containsAll(REQUIRED_SUBFIELDS) == false) {
             Set<String> missingSubfields = REQUIRED_SUBFIELDS.stream()

@@ -383,6 +385,7 @@ public SourceLoader.SyntheticFieldLoader syntheticFieldLoader() {
         return SourceLoader.SyntheticFieldLoader.NOTHING;
     }
 
+    @SuppressWarnings("unchecked")
     public static void applyFieldInference(
         Map<String, Object> inferenceMap,
         String field,

@@ -407,11 +410,12 @@ public static void applyFieldInference(
                 results.getWriteableName()
             );
         }
-        Map<String, Object> fieldMap = new LinkedHashMap<>();
-        fieldMap.put(INFERENCE_ID, model.getInferenceEntityId());
+
+        Map<String, Object> fieldMap = (Map<String, Object>) inferenceMap.computeIfAbsent(field, s -> new LinkedHashMap<>());
         fieldMap.putAll(new SemanticTextModelSettings(model).asMap());
-        fieldMap.put(CHUNKS, chunks);
-        inferenceMap.put(field, fieldMap);
+        List<Map<String, Object>> fieldChunks = (List<Map<String, Object>>) fieldMap.computeIfAbsent(CHUNKS, k -> new ArrayList<>());
+        fieldChunks.addAll(chunks);
+        fieldMap.put(INFERENCE_ID, model.getInferenceEntityId());
     }
 
     record SemanticTextMapperContext(MapperBuilderContext context, SemanticTextFieldMapper mapper) {}

Review comment on the computeIfAbsent change:

Member Author: Multiple inference results can be applied to a single field from each source field, so we need to be prepared for that.
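To illustrate the shape this accumulates, a sketch matching the copy_to REST test below (the inference_id and chunks key names follow the mapper's constants; the model settings entries that putAll adds are omitted). A semantic_text field with two copy_to sources ends up with a single _inference entry whose chunks list holds one entry per source value:

    _inference:
      inference_field:
        inference_id: dense-inference-id
        chunks:
          - text: "inference test"
            inference: { ... }
          - text: "copy_to inference test"
            inference: { ... }
          - text: "another copy_to inference test"
            inference: { ... }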
@@ -16,11 +16,15 @@
 import org.elasticsearch.plugins.Plugin;
 import org.elasticsearch.test.ESSingleNodeTestCase;
 import org.elasticsearch.xpack.inference.InferencePlugin;
+import org.hamcrest.Matchers;
 
 import java.util.Arrays;
 import java.util.Collection;
 import java.util.Collections;
 import java.util.List;
 
+import static org.hamcrest.CoreMatchers.equalTo;
+
 public class SemanticTextClusterMetadataTests extends ESSingleNodeTestCase {
 
     @Override

@@ -36,7 +40,7 @@ public void testCreateIndexWithSemanticTextField() {
         assertEquals(indexService.getMetadata().getInferenceFields().get("field").getInferenceId(), "test_model");
     }
 
-    public void testAddSemanticTextField() throws Exception {
+    public void testSingleSourceSemanticTextField() throws Exception {
         final IndexService indexService = createIndex("test", client().admin().indices().prepareCreate("test"));
         final MetadataMappingService mappingService = getInstanceFromNode(MetadataMappingService.class);
         final MetadataMappingService.PutMappingExecutor putMappingExecutor = mappingService.new PutMappingExecutor();

@@ -53,6 +57,45 @@
         assertEquals(resultingState.metadata().index("test").getInferenceFields().get("field").getInferenceId(), "test_model");
     }
 
+    public void testCopyToSemanticTextField() throws Exception {
+        final IndexService indexService = createIndex("test", client().admin().indices().prepareCreate("test"));
+        final MetadataMappingService mappingService = getInstanceFromNode(MetadataMappingService.class);
+        final MetadataMappingService.PutMappingExecutor putMappingExecutor = mappingService.new PutMappingExecutor();
+        final ClusterService clusterService = getInstanceFromNode(ClusterService.class);
+
+        final PutMappingClusterStateUpdateRequest request = new PutMappingClusterStateUpdateRequest("""
+            {
+              "properties": {
+                "semantic": {
+                  "type": "semantic_text",
+                  "inference_id": "test_model"
+                },
+                "copy_origin_1": {
+                  "type": "text",
+                  "copy_to": "semantic"
+                },
+                "copy_origin_2": {
+                  "type": "text",
+                  "copy_to": "semantic"
+                }
+              }
+            }
+            """);
+        request.indices(new Index[] { indexService.index() });
+        final var resultingState = ClusterStateTaskExecutorUtils.executeAndAssertSuccessful(
+            clusterService.state(),
+            putMappingExecutor,
+            singleTask(request)
+        );
+        IndexMetadata indexMetadata = resultingState.metadata().index("test");
+        InferenceFieldMetadata inferenceFieldMetadata = indexMetadata.getInferenceFields().get("semantic");
+        assertThat(inferenceFieldMetadata.getInferenceId(), equalTo("test_model"));
+        assertThat(
+            Arrays.asList(inferenceFieldMetadata.getSourceFields()),
+            Matchers.containsInAnyOrder("semantic", "copy_origin_1", "copy_origin_2")
+        );
+    }
+
     private static List<MetadataMappingService.PutMappingClusterStateUpdateTask> singleTask(PutMappingClusterStateUpdateRequest request) {
         return Collections.singletonList(new MetadataMappingService.PutMappingClusterStateUpdateTask(request, ActionListener.running(() -> {
             throw new AssertionError("task should not complete publication");
@@ -377,3 +377,98 @@ setup:
- match: { errors: true }
- match: { items.0.update.status: 400 }
- match: { items.0.update.error.reason: "Cannot apply update with a script on indices that contain [semantic_text] field(s)" }

---
"Fails when providing inference results and there is no value for field":
- do:
catch: /The field \[inference_field\] is referenced in the \[_inference\] metadata field but has no value/
index:
index: test-sparse-index
id: doc_1
body:
_inference:
inference_field:
chunks:
- text: "inference test"
inference:
"hello": 0.123


---
"semantic_text copy_to calculate inference for source fields":
- do:
indices.create:
index: test-copy-to-index
body:
mappings:
properties:
inference_field:
type: semantic_text
inference_id: dense-inference-id
source_field:
type: text
copy_to: inference_field
another_source_field:
type: text
copy_to: inference_field

- do:
index:
index: test-copy-to-index
id: doc_1
body:
source_field: "copy_to inference test"
inference_field: "inference test"
another_source_field: "another copy_to inference test"

- do:
get:
index: test-copy-to-index
id: doc_1

- match: { _source.inference_field: "inference test" }
- length: { _source._inference.inference_field.chunks: 3 }
- exists: _source._inference.inference_field.chunks.0.inference
- exists: _source._inference.inference_field.chunks.0.text
- exists: _source._inference.inference_field.chunks.1.inference
- exists: _source._inference.inference_field.chunks.1.text
- exists: _source._inference.inference_field.chunks.2.inference
- exists: _source._inference.inference_field.chunks.2.text


---
"semantic_text copy_to needs values for every source field for updates":
- do:
indices.create:
index: test-copy-to-index
body:
mappings:
properties:
inference_field:
type: semantic_text
inference_id: dense-inference-id
source_field:
type: text
copy_to: inference_field
another_source_field:
type: text
copy_to: inference_field

# Not every source field needed on creation
- do:
index:
index: test-copy-to-index
id: doc_1
body:
source_field: "a single source field provided"
inference_field: "inference test"

# Every source field needed on bulk updates
- do:
bulk:
body:
- '{"update": {"_index": "test-copy-to-index", "_id": "doc_1"}}'
- '{"doc": {"source_field": "a single source field is kept as provided via bulk", "inference_field": "updated inference test" }}'

- match: { items.0.update.status: 400 }
- match: { items.0.update.error.reason: "Field [another_source_field] must be specified on an update request to calculate inference for field [inference_field]" }
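For contrast, an update that provides every source field should pass the check; a sketch in the same fixture (the 200 status is an assumption, not an assertion from this PR):

    # Sketch: providing all source fields, including the semantic_text field
    # itself, satisfies the update-time check above.
    - do:
        bulk:
          body:
            - '{"update": {"_index": "test-copy-to-index", "_id": "doc_1"}}'
            - '{"doc": {"source_field": "a new value", "another_source_field": "another new value", "inference_field": "updated inference test"}}'

    - match: { items.0.update.status: 200 }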