forked from elastic/elasticsearch
semantic_text field mapper and inference - Partial PR #12
Closed
carlosdelest
wants to merge
85
commits into
carlosdelest/semantic-text-index-metadata-changes
from
carlosdelest/semantic-text-field-mapping
Store semantic_text model info in IndexMetadata: On document ingestion, we need to perform inference only once, in the coordinating node. Otherwise, we would be doing inference for each of the shards the document is stored in. The problem with the coordinating node is that it doesn't necessarily hold mapping information if it is not used for storing an index. A pure coordinating node doesn't have any mapping information at all. We need to understand when we need to generate text embeddings on the coordinating node. This means that the model information associated with index fields needs to be efficiently accessed from there. This information needs to be kept up to date with mapping changes, and not be recomputed otherwise. The model / fields information is going to be included as part of the IndexMetadata, to ensure it is communicated to all nodes in the cluster.
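As a rough illustration of the idea above, the sketch below shows how a coordinating node could decide which fields of an incoming document need inference, using only a per-index map of field name to model id. The class `InferenceFieldLookup`, its method, and the shape of the map are assumptions made for this example; they are not taken from the Elasticsearch code base.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: none of these names come from the Elasticsearch code base.
final class InferenceFieldLookup {
    // index name -> (semantic_text field name -> model id), standing in for the
    // information that would travel with the index metadata to every node.
    private final Map<String, Map<String, String>> modelsByIndex;

    InferenceFieldLookup(Map<String, Map<String, String>> modelsByIndex) {
        this.modelsByIndex = modelsByIndex;
    }

    // Returns, in sorted order, the fields present in the document source that
    // have a model associated with them in the given index.
    List<String> fieldsNeedingInference(String index, Map<String, Object> source) {
        Map<String, String> fieldToModel = modelsByIndex.getOrDefault(index, Map.of());
        return fieldToModel.keySet().stream()
            .filter(source::containsKey)
            .sorted()
            .collect(Collectors.toList());
    }
}
```

Because the lookup needs no mapping information, it would work the same way on a pure coordinating node, which is the point of publishing the model and field information with the index metadata.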
Adds SemanticTextInferenceResultFieldMapper, which indexes inference results for semantic_text fields.
# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java
# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java
# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java
# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java
# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java # server/src/test/java/org/elasticsearch/snapshots/SnapshotResiliencyTests.java
# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java
…elastic#106357) --------- Co-authored-by: carlosdelest <[email protected]>
…ing (elastic#106560) This PR refactors the semantic text field mapper to register its sub fields in the mapping instead of re-creating them each time when parsing documents. It also fixes the generation of these fields in case the semantic text field is defined in an object field. Lastly, this change adds a new section called model_settings in the field parameter that is updated by the field mapper when inference results are received from a bulk action. The model settings are available in the fields as soon as the first document with the inference field is ingested. They are used to validate updates, ensuring consistency between what's used in the bulk action and what's defined in the field mapping.
# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java
…ce metadata in `IndexMetadata` (elastic#106743) This change refactors the integration of the field inference metadata in IndexMetadata. Instead of partial diffs, the new class simply sends the entire object as a diff if it has changed. This PR also renames the fields and methods related to the inference fields consistently. The inference phase (in the transport shard bulk action) is also changed so that inference is not called if the document contains a value for the inference input and also contains a value for the inference results of that field (in the _inference map). If the document contains no value for the inference input but does contain an inference result for that field, it is marked as failed. --------- Co-authored-by: carlosdelest <[email protected]>
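The rules for the inference phase described above can be read as a small decision table. The following sketch encodes them; the class, enum, and method names are invented for illustration, and the last case (neither input nor results present) is an assumption, since the description does not cover it.

```java
// Hypothetical sketch; the names here are illustrative, not the PR's actual code.
final class InferenceDecision {
    enum Outcome { RUN_INFERENCE, SKIP_INFERENCE, FAIL_ITEM, NOTHING_TO_DO }

    // hasInput:   the document contains a value for the inference input
    // hasResults: the document contains inference results for that field
    static Outcome decide(boolean hasInput, boolean hasResults) {
        if (hasInput && hasResults) {
            return Outcome.SKIP_INFERENCE; // results were supplied with the input
        }
        if (!hasInput && hasResults) {
            return Outcome.FAIL_ITEM;      // results without the input they came from
        }
        if (hasInput) {
            return Outcome.RUN_INFERENCE;  // input only: inference is still needed
        }
        return Outcome.NOTHING_TO_DO;      // assumption: case not described in the PR
    }
}
```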
The merge logic in MergePositionsOperator is excessively complex and lacks flexibility. It relies on the source operator emitting pages with ascending positions. Additionally, this merge logic introduced an unusual method, appendAllValuesToCurrentPosition, to the Block.Builder. We should replace this with a simpler and more flexible approach. This PR uses a mechanism similar to the grouping aggregation. In fact, it is very close to the values aggregation. Initially, I considered using the GroupingState from ValuesAggregator. However, unlike in the values aggregation, we don't expect many multi-values in enrich. Hence, I introduced the new EnrichResultBuilders instead.
When an alias action list is posted with must_exist==false, and succeeds only partially, a list of results for each action is now returned. The results contain information about the requested action, indices, and aliases. If must_exist==true, or all actions fail, the call will return a 400 status along with the associated exception.
This moves the test cases declared in the tests for ESQL's LOCATE function to test cases which will cause elastic#106782 to properly generate all of the available signatures. It also gives us all of the testing for incorrect parameter combinations.
This takes a stab at generating the markdown files that Kibana uses for its inline help. It doesn't include all of the examples because the `@Example` annotation is not filled in - we're tracking that in elastic#104247 (comment). There are some links in the output and they are in markdown syntax. We should figure out how to make them work for Kibana.
TransformScheduler can trigger its tasks on multiple threads. TransformTask uses an AtomicReference to manage one trigger event per thread by cycling between "Started" and "Indexing". The Retry Listener now has the same protection. "shouldRunAction" will cycle to false during execution and back to true if the action fails and should be retried. Fix elastic#107215
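A minimal sketch of this kind of guard, assuming an `AtomicBoolean` flag named after the `shouldRunAction` mentioned above. The `RetryGuard` class and its methods are hypothetical and only illustrate the compare-and-set pattern, not the actual Retry Listener.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a run-once-at-a-time guard; not the actual Retry Listener.
final class RetryGuard {
    private final AtomicBoolean shouldRunAction = new AtomicBoolean(true);

    // Runs the action only if no other caller currently holds the flag.
    // The action reports success by returning true.
    // Returns true if this caller executed the action, false if it was skipped.
    boolean tryRun(BooleanSupplier action) {
        // Atomically flip true -> false; only one thread can win this race.
        if (!shouldRunAction.compareAndSet(true, false)) {
            return false;
        }
        boolean succeeded = false;
        try {
            succeeded = action.getAsBoolean();
        } finally {
            if (!succeeded) {
                // The action failed (or threw): allow a later trigger to retry.
                shouldRunAction.set(true);
            }
        }
        return true;
    }

    boolean canRun() {
        return shouldRunAction.get();
    }
}
```

With this shape, a second trigger that arrives while the action is executing is skipped, and the flag returns to true only when the action fails and should be retried.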
With this commit we split the Universal Profiling plugin into three packages: * `persistence` contains everything w.r.t index management * `rest` contains the REST API * `action` contains the transport API The `action` / `rest` structure follows the already established structure in the rest of the code base. We divide this plugin into multiple packages mainly because the different functionalities will be maintained by different teams in the future. This restructuring helps clarify boundaries.
Users of supposedly-S3-compatible storage may need to be aware of this change, so this commit expands the release notes to link to the relevant S3 documentation.
With this commit we consider a case in the TopN functions API where the specified limit is larger than the available number of TopN functions. Currently this throws an error (`IndexOutOfBoundsException`). With this check in place we just return the list as is.
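The check described above can be sketched as follows. `TopNLimiter` and `topN` are hypothetical names, and the sketch assumes a non-negative limit and a list that is already sorted.

```java
import java.util.List;

// Hypothetical sketch; not the Universal Profiling plugin's actual code.
final class TopNLimiter {
    // Returns at most `limit` leading elements of an already sorted list.
    // Assumes limit >= 0. Without the first check, subList(0, limit) throws
    // IndexOutOfBoundsException whenever limit exceeds the list size.
    static <T> List<T> topN(List<T> sorted, int limit) {
        if (limit >= sorted.size()) {
            return sorted; // fewer entries than requested: return the list as is
        }
        return sorted.subList(0, limit);
    }
}
```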
…t task from cluster state (elastic#106989)
Refactoring PR to make create, grant, and update API key actions local-only. Also ports a profiles action since it relies on the same base class as grant API key.
* Refactor DocsTest plugin to java * Rework asciidoc parsing to make adding more parser simple
This renames the function AUTO_BUCKET to just BUCKET. It also removes the experimental tagging of the function in the docs, making it generally available.
Allegedly-S3-compatible APIs are very popular these days, but many third-party systems offering such an API also support a shared filesystem interface. Shared filesystem protocols such as NFS are much better specified than the S3 API, and experience shows that they lead to fewer compatibility headaches. This commit adds a recommendation to the `repository-s3` docs to consider such an interface instead.
This checks in the generated-by-test doc files for newly renamed BUCKET function.
For system indices we don't want to emit metrics. DocumentSizeReporter will be created given an index. It will internally contain a SystemIndices instance that will verify the indexName with isSystemName
Consume the HttpEntity after the API response is parsed, releasing network and thread resources back to their respective pools. Leaving them unconsumed does not appear to be causing issues during tests, but it does log a large amount of hanging threads on test failure, making it harder to spot what may be the issue when a thread is hanging during a transform test. Close elastic#107055
Renaming GeoIpDownloaderStatsAction to GeoIpStatsAction
…pers (elastic#107147) Co-authored-by: @jimczi Co-authored-by: @Mikep86
…07020) This commit moves the legacy yaml rolling upgrade tests for vectors to the new rolling upgrade package. It also adds rolling upgrade tests for `int8_hnsw`.
…mapping # Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java # server/src/main/java/org/elasticsearch/cluster/metadata/InferenceFieldMetadata.java # server/src/test/java/org/elasticsearch/cluster/metadata/IndexMetadataTests.java
…-field-mapping' into carlosdelest/semantic-text-field-mapping
(This PR is built on top of elastic#107147 to ease review, so it's opened against a local branch instead of main. You can see both PRs merged together in elastic#107262).
This PR adds the semantic_text field mapping and inference actions:
semantic_text
dense_vector or sparse_vector fields internally