-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
semantic_text - extract Index Metadata inference information to separate class #106328
semantic_text - extract Index Metadata inference information to separate class #106328
Conversation
…ference results" This reverts commit bd4e19e.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the class that will hold inference information, and the one I'd like to have your early feedback on :) .
public static final ParseField INFERENCE_FOR_FIELDS_FIELD = new ParseField("inference_for_fields"); | ||
public static final ParseField SOURCE_FIELDS_FIELD = new ParseField("source_fields"); | ||
|
||
public FieldInferenceMetadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've prioritised using field -> inference IDs map, as it seems clearer from the mapping perspective as well.
The class hides the actual implementation and we can calculate the reverse information on demand
@@ -632,8 +631,7 @@ public Iterator<Setting<?>> settings() { | |||
private final Double writeLoadForecast; | |||
@Nullable | |||
private final Long shardSizeInBytesForecast; | |||
// Key: model ID, Value: Fields that use model | |||
private final ImmutableOpenMap<String, Set<String>> fieldsForModels; | |||
private final FieldInferenceMetadata fieldInferenceMetadata; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IndexMetadata now uses the new class
@@ -77,8 +77,8 @@ public static void getInstance( | |||
) { | |||
Set<String> inferenceIds = new HashSet<>(); | |||
shardIds.stream().map(ShardId::getIndex).collect(Collectors.toSet()).stream().forEach(index -> { | |||
var fieldsForModels = clusterState.metadata().index(index).getFieldsForModels(); | |||
inferenceIds.addAll(fieldsForModels.keySet()); | |||
var fieldsForInferenceIds = clusterState.metadata().index(index).getFieldInferenceMetadata().getFieldsForInferenceIds(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done some renamings to start using inference ids vs model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restricted my review to FieldInferenceMetadata
for now. It looks great!
private final ImmutableOpenMap<String, String> inferenceIdForField; | ||
|
||
// Keys: field names. Values: Field names that provide source for this field (either as copy_to or multifield sources) | ||
private final ImmutableOpenMap<String, Set<String>> sourceFields; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to change this to ImmutableOpenMap<String, List<String>>
so that the source field iteration order is consistent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think order is of consequence. FieldTypeLookup.sourcePaths()
, which is used for similar purposes, also returns a Set
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bring it up because @benwtrent brought it up during multi-field/copy_to support discovery. Depending on how we handle inference generation for multi-valued fields in the bulk action, order could matter.
For example, if we handle it by concatenating all the values, order could affect the chunks that are produced and therefore the inference results.
return sourceFields.get(field); | ||
} | ||
|
||
public Map<String, Set<String>> getFieldsForInferenceIds() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this method be accessed by multiple threads? Should we mark fieldsForInferenceIds
as volatile
or maybe even wrap it in AtomicReference
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can be accessed by multiple threads.
I didn't want to overdo synchronization / early optimization, and have kept it light on purpose.
I think it's preferrable to recalculate vs having the syncing overhead - but happy to hear your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree we don't need to go heavy on synchronization here. Worst case, a race condition causes this calculation to be performed in two (or more) threads simultaneously.
I think it could be a good idea to mark fieldsForInferenceIds
as volatile
. This shouldn't have any syncing overhead, besides any compiler optimizations that are disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, adding that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to have a single version of this map that is immutable and final. It's a small map so there's no need to compute all the flavours preemptively or even to cache them. The consumer can rebuild data structures on top if their access pattern is different.
Can we just keep a Map<String, FieldInference>
where the keys are the semantic text fields and FieldInference
contains the inference id and the list of source fields?
This way it's easy to extract the FieldInference for a field query.
For ingest we need to visit the entire map anyway so any shape is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear I mean only keeping ImmutableOpenMap<String, FieldInference> fieldInferenceMap
and adding an accessor for it. That's all we need. All this caching/reverting is non-sense at this moment imo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this is a similar pattern to other Index Metadata related structures. Changing it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, FieldInferenceMetadata
is immutable and will be replaced with a new instance upon a mapping change. Thus, fieldsForInferenceIds
(if requested) will be valid for the entire lifetime of the FieldInferenceMetadata
instance.
Do I have that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct! A mapping update (or any cluster state change) means that we will have a new instance of FieldInferenceMetadata
} | ||
|
||
public Set<String> getSourceFields(String field) { | ||
return getInferenceSafe(field, FieldInference::sourceFields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's expose the fieldInferenceMap
directly, we don't need to dictate how the access should look like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 👍
} | ||
} | ||
|
||
public record FieldInference(String inferenceId, Set<String> sourceFields) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename into FieldInferenceOptions
since we'll probably need to add more later (shouldFailOnError, chunking, ...)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense 👍
…ame FieldInference to FieldInferenceOptions
…osdelest/semantic-text-index-metadata-update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@elasticmachine run elasticsearch-ci/part-1 |
@elasticmachine run elasticsearch-ci/bwc-snapshots |
@elasticmachine run elasticsearch-ci/8.14.0 / bwc-snapshots |
3ca808b
into
elastic:feature/semantic-text
FieldInferenceMetadata fieldInferenceMetadata = new FieldInferenceMetadata( | ||
Map.of( | ||
FIRST_INFERENCE_FIELD_SERVICE_1, | ||
new FieldInferenceMetadata.FieldInferenceOptions(INFERENCE_SERVICE_1_ID, Set.of()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want this test to reflect production, the source field set should never be empty. In this case, it should contain FIRST_INFERENCE_FIELD_SERVICE_1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct - this will be fixed when we start calculating inference for copy_to and multifields, as the way of retrieving the inference texts will start using the source information for fields 👍
@@ -37,7 +37,7 @@ public void testEmpty() { | |||
assertNotNull(names); | |||
assertThat(names, hasSize(0)); | |||
|
|||
Map<String, Set<String>> fieldsForModels = lookup.getFieldsForModels(); | |||
Map<String, String> fieldsForModels = lookup.getInferenceIdsForFields(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably clean up/rename any vars named fieldsForModels
or similar
assertEquals(Map.of("test_model", Set.of("field")), indexService.getMetadata().getFieldsForModels()); | ||
assertEquals( | ||
indexService.getMetadata().getFieldInferenceMetadata().getFieldInferenceOptions().get("field").inferenceId(), | ||
"test_model" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: the assertEquals
method signature is assertEquals(<expected>, <actual>)
. Reversing the args is the same functionally, but if the assertion fails it will generate a message with an incorrect expected value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well spotted! I'm changing that to Hamcrest assertions, which are easier to read 👍
Creates a new class for storing field inference information.
Field inference information is composed of:
copy_to
and multifield capabilities.This refactoring allows control on the serialization of this information and provide bwc on it for the future.
Some renaming has been performed on methods to refer to inference id instead of model id.