Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

semantic_text - extract Index Metadata inference information to separate class #106328

Conversation

carlosdelest
Copy link
Member

@carlosdelest carlosdelest commented Mar 13, 2024

Creates a new class for storing field inference information.

Field inference information is composed of:

  • Inference ID associated to each field
  • Source paths that the field must calculate inference on, besides its actual value. This will provide copy_to and multifield capabilities.

This refactoring allows control on the serialization of this information and provide bwc on it for the future.

Some renaming has been performed on methods to refer to inference id instead of model id.

Mikep86 and others added 30 commits March 13, 2024 19:07
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the class that will hold inference information, and the one I'd like to have your early feedback on :) .

public static final ParseField INFERENCE_FOR_FIELDS_FIELD = new ParseField("inference_for_fields");
public static final ParseField SOURCE_FIELDS_FIELD = new ParseField("source_fields");

public FieldInferenceMetadata(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've prioritised using field -> inference IDs map, as it seems clearer from the mapping perspective as well.

The class hides the actual implementation and we can calculate the reverse information on demand

@@ -632,8 +631,7 @@ public Iterator<Setting<?>> settings() {
private final Double writeLoadForecast;
@Nullable
private final Long shardSizeInBytesForecast;
// Key: model ID, Value: Fields that use model
private final ImmutableOpenMap<String, Set<String>> fieldsForModels;
private final FieldInferenceMetadata fieldInferenceMetadata;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IndexMetadata now uses the new class

@@ -77,8 +77,8 @@ public static void getInstance(
) {
Set<String> inferenceIds = new HashSet<>();
shardIds.stream().map(ShardId::getIndex).collect(Collectors.toSet()).stream().forEach(index -> {
var fieldsForModels = clusterState.metadata().index(index).getFieldsForModels();
inferenceIds.addAll(fieldsForModels.keySet());
var fieldsForInferenceIds = clusterState.metadata().index(index).getFieldInferenceMetadata().getFieldsForInferenceIds();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done some renamings to start using inference ids vs model

@carlosdelest carlosdelest requested review from Mikep86 and jimczi March 13, 2024 19:07
Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restricted my review to FieldInferenceMetadata for now. It looks great!

private final ImmutableOpenMap<String, String> inferenceIdForField;

// Keys: field names. Values: Field names that provide source for this field (either as copy_to or multifield sources)
private final ImmutableOpenMap<String, Set<String>> sourceFields;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to change this to ImmutableOpenMap<String, List<String>> so that the source field iteration order is consistent

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think order is of consequence. FieldTypeLookup.sourcePaths(), which is used for similar purposes, also returns a Set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bring it up because @benwtrent brought it up during multi-field/copy_to support discovery. Depending on how we handle inference generation for multi-valued fields in the bulk action, order could matter.

For example, if we handle it by concatenating all the values, order could affect the chunks that are produced and therefore the inference results.

return sourceFields.get(field);
}

public Map<String, Set<String>> getFieldsForInferenceIds() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this method be accessed by multiple threads? Should we mark fieldsForInferenceIds as volatile or maybe even wrap it in AtomicReference?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can be accessed by multiple threads.

I didn't want to overdo synchronization / early optimization, and have kept it light on purpose.

I think it's preferrable to recalculate vs having the syncing overhead - but happy to hear your thoughts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree we don't need to go heavy on synchronization here. Worst case, a race condition causes this calculation to be performed in two (or more) threads simultaneously.

I think it could be a good idea to mark fieldsForInferenceIds as volatile. This shouldn't have any syncing overhead, besides any compiler optimizations that are disabled.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, adding that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to have a single version of this map that is immutable and final. It's a small map so there's no need to compute all the flavours preemptively or even to cache them. The consumer can rebuild data structures on top if their access pattern is different.
Can we just keep a Map<String, FieldInference> where the keys are the semantic text fields and FieldInference contains the inference id and the list of source fields?
This way it's easy to extract the FieldInference for a field query.
For ingest we need to visit the entire map anyway so any shape is ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear I mean only keeping ImmutableOpenMap<String, FieldInference> fieldInferenceMap and adding an accessor for it. That's all we need. All this caching/reverting is non-sense at this moment imo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this is a similar pattern to other Index Metadata related structures. Changing it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, FieldInferenceMetadata is immutable and will be replaced with a new instance upon a mapping change. Thus, fieldsForInferenceIds (if requested) will be valid for the entire lifetime of the FieldInferenceMetadata instance.

Do I have that right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct! A mapping update (or any cluster state change) means that we will have a new instance of FieldInferenceMetadata

}

public Set<String> getSourceFields(String field) {
return getInferenceSafe(field, FieldInference::sourceFields);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's expose the fieldInferenceMap directly, we don't need to dictate how the access should look like.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 👍

}
}

public record FieldInference(String inferenceId, Set<String> sourceFields)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename into FieldInferenceOptions since we'll probably need to add more later (shouldFailOnError, chunking, ...)?

Copy link
Member Author

@carlosdelest carlosdelest Mar 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense 👍

@carlosdelest carlosdelest changed the title WIP - semantic_text - extract Index Metadata inference information to separate class semantic_text - extract Index Metadata inference information to separate class Mar 18, 2024
@carlosdelest carlosdelest marked this pull request as ready for review March 18, 2024 11:18
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Mar 18, 2024
@carlosdelest carlosdelest requested a review from jimczi March 18, 2024 16:18
…osdelest/semantic-text-index-metadata-update
Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@carlosdelest
Copy link
Member Author

@elasticmachine run elasticsearch-ci/part-1

@carlosdelest
Copy link
Member Author

@elasticmachine run elasticsearch-ci/bwc-snapshots

@carlosdelest
Copy link
Member Author

@elasticmachine run elasticsearch-ci/8.14.0 / bwc-snapshots

@carlosdelest carlosdelest merged commit 3ca808b into elastic:feature/semantic-text Mar 19, 2024
13 of 15 checks passed
FieldInferenceMetadata fieldInferenceMetadata = new FieldInferenceMetadata(
Map.of(
FIRST_INFERENCE_FIELD_SERVICE_1,
new FieldInferenceMetadata.FieldInferenceOptions(INFERENCE_SERVICE_1_ID, Set.of()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want this test to reflect production, the source field set should never be empty. In this case, it should contain FIRST_INFERENCE_FIELD_SERVICE_1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct - this will be fixed when we start calculating inference for copy_to and multifields, as the way of retrieving the inference texts will start using the source information for fields 👍

@@ -37,7 +37,7 @@ public void testEmpty() {
assertNotNull(names);
assertThat(names, hasSize(0));

Map<String, Set<String>> fieldsForModels = lookup.getFieldsForModels();
Map<String, String> fieldsForModels = lookup.getInferenceIdsForFields();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably clean up/rename any vars named fieldsForModels or similar

assertEquals(Map.of("test_model", Set.of("field")), indexService.getMetadata().getFieldsForModels());
assertEquals(
indexService.getMetadata().getFieldInferenceMetadata().getFieldInferenceOptions().get("field").inferenceId(),
"test_model"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: the assertEquals method signature is assertEquals(<expected>, <actual>). Reversing the args is the same functionally, but if the assertion fails it will generate a message with an incorrect expected value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spotted! I'm changing that to Hamcrest assertions, which are easier to read 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs:triage Requires assignment of a team area label >non-issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants