semantic_text field mapper and inference - Partial PR #12


carlosdelest

(This PR is built on top of elastic#107147 to ease review, so it's opened against a local branch instead of main. You can see both PRs merged together in elastic#107262).

This PR adds the semantic_text field mapping and inference actions:

  • ShardBulkInferenceActionFilter: An ActionFilter that intercepts BulkShardRequests and performs inference on those that have inference field metadata associated. Inference results are generated for the appropriate fields, and the model settings are included as part of those results.
  • SemanticTextFieldMapper:
    • Provides the mapping definition for semantic_text
    • On document ingestion, checks for the model settings in its mapping. If they are not present, updates the mapping so that future documents adhere to it.
    • Creates the appropriate mappings for dense_vector or sparse_vector fields internally
    • Validates and indexes the inference results as part of the document ingestion using these internal mappings

Mikep86 and others added 30 commits January 12, 2024 09:29
Store semantic_text model info in IndexMetadata:

On document ingestion, we need to perform inference only once, in the coordinating node. Otherwise, we would be doing inference for each of the shards the document is stored in.

The problem with the coordinating node is that it doesn't necessarily hold mapping information if it is not used for storing an index. A pure coordinating node doesn't have any mapping information at all.

We need to understand when we need to generate text embeddings on the coordinating node. This means that the model information associated with index fields needs to be efficiently accessed from there.

This information needs to be kept up to date with mapping changes, and not be recomputed otherwise.

The model / fields information is going to be included as part of the IndexMetadata, to ensure it is communicated to all nodes in the cluster.
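A minimal sketch of the idea above, assuming the per-index metadata boils down to a map from field name to inference id. The class `InferenceFieldLookup` and its methods are hypothetical names used for illustration and are not the classes added by this PR. It shows how a coordinating node could decide which fields of an incoming document need inference without holding any mapping.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the per-index inference metadata; not the PR's actual class. */
final class InferenceFieldLookup {
    // field name -> inference id (assumed shape, for illustration only)
    private final Map<String, String> inferenceIdByField;

    InferenceFieldLookup(Map<String, String> inferenceIdByField) {
        // Immutable copy: the metadata only changes when the mapping changes.
        this.inferenceIdByField = Map.copyOf(inferenceIdByField);
    }

    /** Returns, in sorted order, the inference fields that have a value in the document source. */
    List<String> fieldsNeedingInference(Map<String, Object> source) {
        List<String> fields = new ArrayList<>();
        for (String field : inferenceIdByField.keySet()) {
            if (source.get(field) != null) {
                fields.add(field);
            }
        }
        Collections.sort(fields);
        return fields;
    }

    /** Returns the inference id registered for the field, or null if it is not an inference field. */
    String inferenceIdFor(String field) {
        return inferenceIdByField.get(field);
    }
}
```

Because the lookup is a plain immutable map, it can be shipped to every node with the rest of the index metadata and read without recomputation.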
Adds SemanticTextInferenceResultFieldMapper, which indexes inference results for semantic_text fields.
# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
#	server/src/test/java/org/elasticsearch/snapshots/SnapshotResiliencyTests.java
# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
…ing (elastic#106560)

This PR refactors the semantic text field mapper to register its sub fields in the mapping instead of re-creating them each time when parsing documents.
It also fixes the generation of these fields in case the semantic text field is defined in an object field.
Lastly, this change adds a new section called model_settings to the field parameters, which the field mapper updates when inference results are received from a bulk action. The model settings are available in the field as soon as the first document with the inference field is ingested, and they are used to validate subsequent updates, ensuring consistency between what is used in the bulk action and what is defined in the field mapping.
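The model_settings behaviour can be sketched roughly as follows. This is an illustration under stated assumptions, not the mapper's code: `ModelSettingsGuard` is a hypothetical name, and the settings are modelled as a plain map whose keys (for example `task_type` or `dimensions`) are assumed for the example.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch: records model settings on first use and rejects later mismatches. */
final class ModelSettingsGuard {
    // null until the first document with inference results is seen
    private Map<String, Object> current;

    /**
     * Returns true if the settings were stored for the first time (a mapping update would follow),
     * false if they match what is already stored.
     * Throws IllegalArgumentException if they conflict with the stored settings.
     */
    synchronized boolean accept(Map<String, Object> incoming) {
        Objects.requireNonNull(incoming, "incoming model settings");
        if (current == null) {
            current = Map.copyOf(incoming);
            return true;
        }
        if (!current.equals(incoming)) {
            throw new IllegalArgumentException(
                "Incompatible model settings: mapping has " + current + " but request has " + incoming);
        }
        return false;
    }
}
```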
# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
…ce metadata in `IndexMetadata` (elastic#106743)

This change refactors the integration of the field inference metadata in IndexMetadata. Instead of partial diffs, the new class simply sends the entire object as diff if it has changed.
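As a rough illustration of "send the entire object as diff if it has changed", assuming the metadata can be compared with `equals`: `WholeObjectDiff` below is a hypothetical helper, not the class introduced by this change, and it does not model cluster state serialization.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a whole-object diff: either nothing changed, or the full new value is sent. */
final class WholeObjectDiff {
    private WholeObjectDiff() {}

    /** Empty when nothing changed; otherwise carries a copy of the complete new map. */
    static <K, V> Optional<Map<K, V>> diff(Map<K, V> before, Map<K, V> after) {
        return before.equals(after) ? Optional.empty() : Optional.of(Map.copyOf(after));
    }

    /** Applying the diff replaces the old value wholesale, or keeps it when the diff is empty. */
    static <K, V> Map<K, V> apply(Map<K, V> before, Optional<Map<K, V>> diff) {
        return diff.orElse(before);
    }
}
```

The trade-off is a larger payload on change in exchange for not having to track per-field additions and removals.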
This PR also renames the fields and methods related to the inference fields consistently.
The inference phase (in the transport shard bulk action) is also changed so that inference is not called if:

  • The document contains a value for the inference input.
  • The document also contains a value for the inference results of that field (in the _inference map).

If the document contains no value for the inference input but an inference result for that field, it is marked as failed.
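The rules above amount to a small decision table per inference field. The sketch below is illustrative only: `InferenceDecision` and its `Action` values are hypothetical names, and the final case (neither input nor results present) is an assumption, since the description does not cover it.

```java
/** Hypothetical sketch of the per-field decision; names are illustrative, not the PR's classes. */
final class InferenceDecision {
    enum Action { RUN_INFERENCE, SKIP, FAIL }

    private InferenceDecision() {}

    static Action decide(boolean hasInput, boolean hasResults) {
        if (hasInput && hasResults) {
            return Action.SKIP;          // results were supplied along with the input
        }
        if (!hasInput && hasResults) {
            return Action.FAIL;          // results without the input they belong to
        }
        if (hasInput) {
            return Action.RUN_INFERENCE; // input present, nothing precomputed
        }
        return Action.SKIP;              // assumption: field absent from the document, nothing to do
    }
}
```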
---------

Co-authored-by: carlosdelest <[email protected]>
dnhatn and others added 29 commits April 9, 2024 09:24
The merge logic in MergePositionsOperator is excessively complex and 
lacks flexibility. It relies on the source operator emitting pages with
ascending positions. Additionally, this merge logic introduced an
unusual method, appendAllValuesToCurrentPosition, to the Block.Builder.
We should replace this with a simpler and more flexible approach. This
PR uses a mechanism similar to the grouping aggregation. In fact, it is 
very close to the values aggregation. Initially, I considered using the
GroupingState from ValuesAggregator. However, unlike in the values
aggregation, we don't expect many multi-values in enrich. Hence, I
introduced the new EnrichResultBuilders instead.
When an alias action list is posted with must_exist==false and succeeds only partially, a list of results, one per action, is now returned. The results contain information about the requested action, indices, and aliases. If must_exist==true, or all actions fail, the call will return a 400 status along with the associated exception.
This moves the test cases declared in the tests for ESQL's LOCATE
function to test cases which will cause elastic#106782 to properly generate all
of the available signatures. It also buys us all of the testing for
incorrect parameter combinations.
This takes a stab at generating the markdown files that Kibana uses for
its inline help. It doesn't include all of the examples because the
`@Example` annotation is not filled in - we're tracking that in
elastic#104247 (comment)

There are some links in the output and they are in markdown syntax. We
should figure out how to make them work for kibana.
TransformScheduler can trigger its tasks on multiple threads.
TransformTask uses an AtomicReference to manage one trigger event per
thread by cycling between "Started" and "Indexing". The Retry Listener
now has the same protection. "shouldRunAction" will cycle to false
during execution and back to true if the action fails and should be
retried.

Fix elastic#107215
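The guard described in this commit message can be sketched with a compare-and-set flag. This is a simplified illustration, not the transform code: `RetryGuard` and `tryRun` are hypothetical names, and an `AtomicBoolean` stands in for whatever state the Retry Listener actually keeps.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: at most one thread runs the action at a time; failures re-arm the guard. */
final class RetryGuard {
    private final AtomicBoolean shouldRunAction = new AtomicBoolean(true);

    /**
     * Runs the action if the guard is armed. The action reports success by returning true.
     * Returns true if this call ran the action, false if it was skipped because another
     * thread is running it or it has already succeeded.
     */
    boolean tryRun(BooleanSupplier action) {
        // Cycle to false for the duration of the execution; only one caller wins the CAS.
        if (!shouldRunAction.compareAndSet(true, false)) {
            return false;
        }
        boolean succeeded = false;
        try {
            succeeded = action.getAsBoolean();
        } finally {
            if (!succeeded) {
                shouldRunAction.set(true); // failed or threw: allow a retry
            }
        }
        return true;
    }
}
```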
With this commit we split the Universal Profiling plugin into three
packages:

* `persistence` contains everything w.r.t index management
* `rest` contains the REST API
* `action` contains the transport API

The `action` / `rest` structure follows the already established
structure in the rest of the code base. We divide this plugin into
multiple packages mainly because the different functionalities will be
maintained by different teams in the future. This restructuring helps
clarify boundaries.
Users of supposedly-S3-compatible storage may need to be aware of this
change, so this commit expands the release notes to link to the relevant
S3 documentation.
With this commit we consider a case in the TopN functions API where the
specified limit is larger than the available number of TopN functions.
Currently this throws an error (`IndexOutOfBoundsException`). With this
check in place we just return the list as is.
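A minimal sketch of such a check, assuming the TopN functions are held in a `List`. `TopNLimit.limit` is a hypothetical helper, not the plugin's method, and rejecting a negative limit is an added assumption.

```java
import java.util.List;

/** Hypothetical sketch of a limit check that avoids IndexOutOfBoundsException from subList. */
final class TopNLimit {
    private TopNLimit() {}

    static <T> List<T> limit(List<T> items, int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must not be negative: " + limit);
        }
        if (limit >= items.size()) {
            return items; // fewer items than requested: return the list as is
        }
        return items.subList(0, limit);
    }
}
```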
Refactoring PR to make create, grant, and update API key actions
local-only. Also ports a profiles action since it relies on the same
base class as grant API key.
* Refactor DocsTest plugin to java
* Rework asciidoc parsing to make adding more parsers simple
This renames the function AUTO_BUCKET to just BUCKET.
It also removes the experimental tagging of the function in the docs, making it generally available.
Allegedly-S3-compatible APIs are very popular these days, but many
third-party systems offering such an API also support a shared
filesystem interface. Shared filesystem protocols such as NFS are much
better specified than the S3 API, and experience shows that they lead to
fewer compatibility headaches. This commit adds a recommendation to the
`repository-s3` docs to consider such an interface instead.
This checks in the generated-by-test doc files for the newly renamed BUCKET
function.
For system indices we don't want to emit metrics. DocumentSizeReporter will be created for a given index. It will internally contain a SystemIndices instance that checks the index name with isSystemName.
Consume the HttpEntity after the API response is parsed, releasing
network and thread resources back to their respective pools.

Leaving them unconsumed does not appear to be causing issues during
tests, but it does log a large number of hanging threads on test
failure, making it harder to spot what may be the issue when a thread is
hanging during a transform test.

Close elastic#107055
Renaming GeoIpDownloaderStatsAction to GeoIpStatsAction
…07020)

This commit moves the legacy yaml rolling upgrade tests for vectors to the new rolling upgrade package.

Also, it adds rolling upgrade tests for `int8_hnsw`.
…mapping

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
#	server/src/main/java/org/elasticsearch/cluster/metadata/InferenceFieldMetadata.java
#	server/src/test/java/org/elasticsearch/cluster/metadata/IndexMetadataTests.java
…-field-mapping' into carlosdelest/semantic-text-field-mapping