Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

semantic_text - extract Index Metadata inference information to separate class #106328

Merged
Show file tree
Hide file tree
Changes from 29 commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
da4a550
Added skeleton code for SemanticQueryBuilder
Mikep86 Feb 9, 2024
330f867
Add boost and query name to XContent
Mikep86 Feb 9, 2024
ed2abf7
Add dimensions and similarity to ServiceSettings, create ModelSetting…
carlosdelest Feb 14, 2024
1ee7653
Change implementation of asMap() to avoid extra nesting in inference …
carlosdelest Feb 14, 2024
80c01c7
Fix BulkOperationTests
carlosdelest Feb 14, 2024
a4c1184
Fix tests of SemanticTextInferenceResultFieldMapper
carlosdelest Feb 14, 2024
f28e051
Revert "Change implementation of asMap() to avoid extra nesting in in…
carlosdelest Feb 15, 2024
0d515a4
Add service extension for dense vector embeddings
carlosdelest Feb 15, 2024
d53583e
Fix bug in model settings
carlosdelest Feb 15, 2024
1edab71
Fix spotless
carlosdelest Feb 15, 2024
3c14e12
Refactored inference services with common abstract class
carlosdelest Feb 15, 2024
adb7711
Added modelsForFields to QueryRewriteContext
Mikep86 Feb 20, 2024
ebb3827
Updated IndicesService to create the modelsForFields map
Mikep86 Feb 20, 2024
8a466a9
Updated SemanticQueryBuilder to implement doRewrite
Mikep86 Feb 21, 2024
5d27f6a
Added semanticQuery to SemanticTextFieldMapper
Mikep86 Feb 21, 2024
6abf24c
Updated SemanticQueryBuilder to implement doToQuery
Mikep86 Feb 21, 2024
b434da5
Updated SemanticQueryBuilder to add fromXContent
Mikep86 Feb 21, 2024
ba852b8
Added SemanticQueryBuilder to inference plugin
Mikep86 Feb 21, 2024
b2c4574
Use Lucene queries to build semantic query
Mikep86 Feb 21, 2024
1361f08
Spotless
carlosdelest Mar 11, 2024
ddf374c
New class for dealing with field inference metadata
carlosdelest Mar 13, 2024
354bb09
Include FieldInferenceMetadata into IndexMetadata
carlosdelest Mar 13, 2024
c88b9cd
Create new FieldInferenceMetadata structure
carlosdelest Mar 13, 2024
db7f531
Use FieldInferenceMetadata structure in lookups, some renaming
carlosdelest Mar 13, 2024
32293d0
Use FieldInferenceMetadata structure in dependencies
carlosdelest Mar 13, 2024
949a8d0
Use FieldInferenceMetadata structure in dependencies
carlosdelest Mar 13, 2024
21bf90b
Renaming fields
carlosdelest Mar 13, 2024
a21e9e4
Fix rebasing
carlosdelest Mar 13, 2024
d14cc53
Fix rebasing
carlosdelest Mar 13, 2024
053080a
Fix rebasing
carlosdelest Mar 13, 2024
bb76d53
Rework FieldInferenceMetadata to have a single map instead of multipl…
carlosdelest Mar 14, 2024
83ccfd2
Serialization fixes
carlosdelest Mar 14, 2024
a222d79
Test fixes
carlosdelest Mar 14, 2024
a026612
Spotless
carlosdelest Mar 14, 2024
f56db05
Fix serialization issues when inference is empty
carlosdelest Mar 14, 2024
0edc8d3
Fix test
carlosdelest Mar 14, 2024
ba6f00f
Use empty diff state to avoid bwc errors
carlosdelest Mar 14, 2024
f3a6af0
Fix parsing error, styling
carlosdelest Mar 14, 2024
88af861
Remove accessors for FieldInferenceMetadata, use the map instead. Ren…
carlosdelest Mar 18, 2024
3b8db71
Spotless
carlosdelest Mar 18, 2024
ce19bc9
Merge remote-tracking branch 'origin/feature/semantic-text' into carl…
carlosdelest Mar 18, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -587,7 +587,7 @@ public IndexMetadata randomChange(IndexMetadata part) {
builder.settings(Settings.builder().put(part.getSettings()).put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 0));
break;
case 3:
builder.fieldsForModels(randomFieldsForModels());
builder.fieldInferenceMetadata(randomFieldsForModels());
break;
default:
throw new IllegalArgumentException("Shouldn't be here");
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -77,8 +77,8 @@ public static void getInstance(
) {
Set<String> inferenceIds = new HashSet<>();
shardIds.stream().map(ShardId::getIndex).collect(Collectors.toSet()).stream().forEach(index -> {
var fieldsForModels = clusterState.metadata().index(index).getFieldsForModels();
inferenceIds.addAll(fieldsForModels.keySet());
var fieldsForInferenceIds = clusterState.metadata().index(index).getFieldInferenceMetadata().getFieldsForInferenceIds();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done some renamings to start using inference ids vs model

inferenceIds.addAll(fieldsForInferenceIds.keySet());
});
final Map<String, InferenceProvider> inferenceProviderMap = new ConcurrentHashMap<>();
Runnable onModelLoadingComplete = () -> listener.onResponse(
Expand Down Expand Up @@ -134,11 +134,11 @@ public void processBulkShardRequest(
BiConsumer<BulkItemRequest, Exception> onBulkItemFailure
) {

Map<String, Set<String>> fieldsForModels = clusterState.metadata()
Map<String, Set<String>> fieldsForInferenceIds = clusterState.metadata()
.index(bulkShardRequest.shardId().getIndex())
.getFieldsForModels();
.getFieldInferenceMetadata().getFieldsForInferenceIds();
// No inference fields? Terminate early
if (fieldsForModels.isEmpty()) {
if (fieldsForInferenceIds.isEmpty()) {
listener.onResponse(bulkShardRequest);
return;
}
Expand Down Expand Up @@ -176,7 +176,7 @@ public void processBulkShardRequest(
if (bulkItemRequest != null) {
performInferenceOnBulkItemRequest(
bulkItemRequest,
fieldsForModels,
fieldsForInferenceIds,
i,
onBulkItemFailureWithIndex,
bulkItemReqRef.acquire()
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -453,7 +453,11 @@ void executeRequest(
}
}
});
Rewriteable.rewriteAndFetch(original, searchService.getRewriteContext(timeProvider::absoluteStartMillis), rewriteListener);
Rewriteable.rewriteAndFetch(
original,
searchService.getRewriteContext(timeProvider::absoluteStartMillis),
rewriteListener
);
}

static void adjustSearchType(SearchRequest searchRequest, boolean singleShard) {
Expand Down
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the class that will hold inference information, and the one I'd like to have your early feedback on :) .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, FieldInferenceMetadata is immutable and will be replaced with a new instance upon a mapping change. Thus, fieldsForInferenceIds (if requested) will be valid for the entire lifetime of the FieldInferenceMetadata instance.

Do I have that right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct! A mapping update (or any cluster state change) means that we will have a new instance of FieldInferenceMetadata

Original file line number Diff line number Diff line change
@@ -0,0 +1,235 @@
/*
* Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
* or more contributor license agreements. Licensed under the Elastic License
* 2.0 and the Server Side Public License, v 1; you may not use this file except
* in compliance with, at your election, the Elastic License 2.0 or the Server
* Side Public License, v 1.
*/

package org.elasticsearch.cluster.metadata;

import org.elasticsearch.cluster.Diff;
import org.elasticsearch.cluster.Diffable;
import org.elasticsearch.cluster.DiffableUtils;
import org.elasticsearch.common.collect.ImmutableOpenMap;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.index.mapper.MappingLookup;
import org.elasticsearch.xcontent.ConstructingObjectParser;
import org.elasticsearch.xcontent.ParseField;
import org.elasticsearch.xcontent.ToXContentFragment;
import org.elasticsearch.xcontent.XContentBuilder;
import org.elasticsearch.xcontent.XContentParser;

import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

/**
* Contains field inference information. This is necessary to add to cluster state as inference can be calculated in the coordinator
* node, which not necessarily has mapping information.
*/
public class FieldInferenceMetadata implements Diffable<FieldInferenceMetadata>, ToXContentFragment {

// Keys: field names. Values: Inference ID associated to the field name for calculating inference
private final ImmutableOpenMap<String, String> inferenceIdForField;

// Keys: field names. Values: Field names that provide source for this field (either as copy_to or multifield sources)
private final ImmutableOpenMap<String, Set<String>> sourceFields;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to change this to ImmutableOpenMap<String, List<String>> so that the source field iteration order is consistent

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think order is of consequence. FieldTypeLookup.sourcePaths(), which is used for similar purposes, also returns a Set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bring it up because @benwtrent brought it up during multi-field/copy_to support discovery. Depending on how we handle inference generation for multi-valued fields in the bulk action, order could matter.

For example, if we handle it by concatenating all the values, order could affect the chunks that are produced and therefore the inference results.


// Keys: inference IDs. Values: Field names that use the inference id for calculating inference. Reverse of inferenceForFields.
private Map<String, Set<String>> fieldsForInferenceIds;

public static final FieldInferenceMetadata EMPTY = new FieldInferenceMetadata(ImmutableOpenMap.of(), ImmutableOpenMap.of());

public static final ParseField INFERENCE_FOR_FIELDS_FIELD = new ParseField("inference_for_fields");
public static final ParseField SOURCE_FIELDS_FIELD = new ParseField("source_fields");

public FieldInferenceMetadata(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've prioritised using field -> inference IDs map, as it seems clearer from the mapping perspective as well.

The class hides the actual implementation and we can calculate the reverse information on demand

ImmutableOpenMap<String, String> inferenceIdsForFields,
ImmutableOpenMap<String, Set<String>> sourceFields
) {
this.inferenceIdForField = Objects.requireNonNull(inferenceIdsForFields);
this.sourceFields = Objects.requireNonNull(sourceFields);
}

public FieldInferenceMetadata(
Map<String, String> inferenceIdsForFields,
Map<String, Set<String>> sourceFields
) {
this.inferenceIdForField = ImmutableOpenMap.builder(Objects.requireNonNull(inferenceIdsForFields)).build();
this.sourceFields = ImmutableOpenMap.builder(Objects.requireNonNull(sourceFields)).build();
}

public FieldInferenceMetadata(MappingLookup mappingLookup) {
this.inferenceIdForField = ImmutableOpenMap.builder(mappingLookup.getInferenceIdsForFields()).build();
ImmutableOpenMap.Builder<String, Set<String>> sourcePathsBuilder = ImmutableOpenMap.builder(inferenceIdForField.size());
inferenceIdForField.keySet().forEach(fieldName -> sourcePathsBuilder.put(fieldName, mappingLookup.sourcePaths(fieldName)));
this.sourceFields = sourcePathsBuilder.build();
}

public FieldInferenceMetadata(StreamInput in) throws IOException {
inferenceIdForField = in.readImmutableOpenMap(StreamInput::readString, StreamInput::readString);
sourceFields = in.readImmutableOpenMap(StreamInput::readString, i -> i.readCollectionAsImmutableSet(StreamInput::readString));
}

@Override
public void writeTo(StreamOutput out) throws IOException {
out.writeMap(inferenceIdForField, StreamOutput::writeString);
out.writeMap(sourceFields, StreamOutput::writeStringCollection);
}

@Override
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder.field(INFERENCE_FOR_FIELDS_FIELD.getPreferredName(), inferenceIdForField);
builder.field(SOURCE_FIELDS_FIELD.getPreferredName(), sourceFields);

return builder;
}

@SuppressWarnings("unchecked")
private static final ConstructingObjectParser<FieldInferenceMetadata, Void> PARSER = new ConstructingObjectParser<>(
"field_inference_metadata_parser",
false,
(args, unused) -> new FieldInferenceMetadata((Map<String, String>) args[0], (Map<String, Set<String>>) args[1])
);

static {
PARSER.declareObject(ConstructingObjectParser.constructorArg(), (p, c) -> p.mapStrings(), INFERENCE_FOR_FIELDS_FIELD);
PARSER.declareObject(
ConstructingObjectParser.constructorArg(),
(p, c) -> p.map(
HashMap::new,
v -> v.list().stream().map(Object::toString).collect(Collectors.toSet())
),
SOURCE_FIELDS_FIELD
);
}

public static FieldInferenceMetadata fromXContent(XContentParser parser) throws IOException {
return PARSER.parse(parser, null);
}

@Override
public Diff<FieldInferenceMetadata> diff(FieldInferenceMetadata previousState) {
return new FieldInferenceMetadataDiff(previousState, this);
}

public String getInferenceIdForField(String field) {
return inferenceIdForField.get(field);
}

public Map<String, String> getInferenceIdForFields() {
return inferenceIdForField;
}

public Set<String> getSourceFields(String field) {
return sourceFields.get(field);
}

public Map<String, Set<String>> getFieldsForInferenceIds() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this method be accessed by multiple threads? Should we mark fieldsForInferenceIds as volatile or maybe even wrap it in AtomicReference?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can be accessed by multiple threads.

I didn't want to overdo synchronization / early optimization, and have kept it light on purpose.

I think it's preferrable to recalculate vs having the syncing overhead - but happy to hear your thoughts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree we don't need to go heavy on synchronization here. Worst case, a race condition causes this calculation to be performed in two (or more) threads simultaneously.

I think it could be a good idea to mark fieldsForInferenceIds as volatile. This shouldn't have any syncing overhead, besides any compiler optimizations that are disabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, adding that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to have a single version of this map that is immutable and final. It's a small map so there's no need to compute all the flavours preemptively or even to cache them. The consumer can rebuild data structures on top if their access pattern is different.
Can we just keep a Map<String, FieldInference> where the keys are the semantic text fields and FieldInference contains the inference id and the list of source fields?
This way it's easy to extract the FieldInference for a field query.
For ingest we need to visit the entire map anyway so any shape is ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear I mean only keeping ImmutableOpenMap<String, FieldInference> fieldInferenceMap and adding an accessor for it. That's all we need. All this caching/reverting is non-sense at this moment imo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is a similar pattern to other Index Metadata related structures. Changing it

if (fieldsForInferenceIds != null) {
return fieldsForInferenceIds;
}

// Cache the result as a field
Map<String, Set<String>> fieldsForInferenceIdsMap = new HashMap<>();
for (Map.Entry<String, String> entry : inferenceIdForField.entrySet()) {
String inferenceId = entry.getValue();
String fieldName = entry.getKey();

// Get or create the set associated with the inferenceId
Set<String> fields = fieldsForInferenceIdsMap.computeIfAbsent(inferenceId, k -> new HashSet<>());
fields.add(fieldName);
}

fieldsForInferenceIds = Collections.unmodifiableMap(fieldsForInferenceIdsMap);
return fieldsForInferenceIds;
}

@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
FieldInferenceMetadata that = (FieldInferenceMetadata) o;
return Objects.equals(inferenceIdForField, that.inferenceIdForField) && Objects.equals(sourceFields, that.sourceFields);
}

@Override
public int hashCode() {
return Objects.hash(inferenceIdForField, sourceFields);
}

public static class FieldInferenceMetadataDiff implements Diff<FieldInferenceMetadata> {

private final Diff<ImmutableOpenMap<String, String>> inferenceForFields;
private final Diff<ImmutableOpenMap<String, Set<String>>> copyFromFields;

private static final DiffableUtils.NonDiffableValueSerializer<String, String> STRING_VALUE_SERIALIZER =
new DiffableUtils.NonDiffableValueSerializer<>() {
@Override
public void write(String value, StreamOutput out) throws IOException {
out.writeString(value);
}

@Override
public String read(StreamInput in, String key) throws IOException {
return in.readString();
}
};

FieldInferenceMetadataDiff(FieldInferenceMetadata before, FieldInferenceMetadata after) {
inferenceForFields = DiffableUtils.diff(
before.inferenceIdForField,
after.inferenceIdForField,
DiffableUtils.getStringKeySerializer(),
STRING_VALUE_SERIALIZER);
copyFromFields = DiffableUtils.diff(
before.sourceFields,
after.sourceFields,
DiffableUtils.getStringKeySerializer(),
DiffableUtils.StringSetValueSerializer.getInstance()
);
}

FieldInferenceMetadataDiff(StreamInput in) throws IOException {
inferenceForFields = DiffableUtils.readImmutableOpenMapDiff(
in,
DiffableUtils.getStringKeySerializer(),
STRING_VALUE_SERIALIZER
);
copyFromFields = DiffableUtils.readImmutableOpenMapDiff(
in,
DiffableUtils.getStringKeySerializer(),
DiffableUtils.StringSetValueSerializer.getInstance()
);
}

public static final FieldInferenceMetadataDiff EMPTY = new FieldInferenceMetadataDiff(
FieldInferenceMetadata.EMPTY,
FieldInferenceMetadata.EMPTY
) {
@Override
public FieldInferenceMetadata apply(FieldInferenceMetadata part) {
return part;
}
};
@Override
public FieldInferenceMetadata apply(FieldInferenceMetadata part) {
ImmutableOpenMap<String, String> modelForFields = this.inferenceForFields.apply(part.inferenceIdForField);
ImmutableOpenMap<String, Set<String>> copyFromFields = this.copyFromFields.apply(part.sourceFields);
return new FieldInferenceMetadata(modelForFields, copyFromFields);
}

@Override
public void writeTo(StreamOutput out) throws IOException {
inferenceForFields.writeTo(out);
copyFromFields.writeTo(out);
}
}
}
Loading