semantic_text field mapper and inference - Partial PR #12

carlosdelest · 2024-04-09T14:55:50Z

(This PR is built on top of elastic#107147 to ease review, so it's opened against a local branch instead of main. You can see both PRs merged together in elastic#107262).

This PR adds the semantic_text field mapping and inference actions:

ShardBulkInferenceActionFilter: An ActionFilter that intercepts BulkShardRequests and performs inference on those that have inference field metadata associated. Inference is generated for the appropriate fields, and the model settings are included as part of the inference results.
SemanticTextFieldMapper:
- Provides the mapping definition for semantic_text
- On document ingestion, checks for the model settings in its mapping. If they are not present, updates the mapping to ensure documents adhere to it in the future.
- Creates the appropriate mappings for dense_vector or sparse_vector fields internally
- Validates and indexes the inference results as part of the document ingestion using these internal mappings

Store semantic_text model info in IndexMetadata: On document ingestion, we need to perform inference only once, in the coordinating node. Otherwise, we would be doing inference for each of the shards the document is stored in. The problem with the coordinating node is that it doesn't necessarily hold mapping information if it is not used for storing an index. A pure coordinating node doesn't have any mapping information at all. We need to understand when we need to generate text embeddings on the coordinating node. This means that the model information associated with index fields needs to be efficiently accessed from there. This information needs to be kept up to date with mapping changes, and not be recomputed otherwise. The model / fields information is going to be included as part of the IndexMetadata, to ensure it is communicated to all nodes in the cluster.
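
As a rough illustration of why the field / model information is useful on the coordinating node, the sketch below decides which fields of an incoming document need inference using only a field-to-model map. The class name, the method name, and the `Map<String, String>` shape are hypothetical and are not the actual `IndexMetadata` API; grouping fields by model is also an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the per-index inference information; not the real Elasticsearch class.
final class InferenceFieldLookupSketch {

    /**
     * Returns the inference-backed fields that are present in the document, grouped by model id.
     * fieldToModelId is an assumed shape: field name -> model id, as it might be read from index metadata.
     */
    static Map<String, List<String>> fieldsNeedingInference(
        Map<String, String> fieldToModelId,
        Map<String, Object> documentSource
    ) {
        Map<String, List<String>> fieldsByModel = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fieldToModelId.entrySet()) {
            String field = entry.getKey();
            // Only fields that actually carry a value in this document are candidates for inference.
            if (documentSource.get(field) != null) {
                fieldsByModel.computeIfAbsent(entry.getValue(), k -> new ArrayList<>()).add(field);
            }
        }
        return fieldsByModel;
    }
}
```

No mapping lookup is needed here: the map alone is enough to tell whether a document requires inference, which is the property the paragraph above asks for.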

…ing (elastic#106560) This PR refactors the semantic text field mapper to register its sub fields in the mapping instead of re-creating them each time when parsing documents. It also fixes the generation of these fields in case the semantic text field is defined in an object field. Lastly this change adds a new section called model_settings in the field parameter that is updated by the field mapper when inference results are received from a bulk action. The model settings are available in the fields as soon as the first document with the inference field is ingested, and they are used to validate subsequent updates, ensuring consistency between what's used in the bulk action and what's defined in the field mapping.

…ce metadata in `IndexMetadata` (elastic#106743) This change refactors the integration of the field inference metadata in IndexMetadata. Instead of partial diffs, the new class simply sends the entire object as diff if it has changed. This PR also renames the fields and methods related to the inference fields consistently. The inference phase (in the transport shard bulk action) is also changed so that inference is not called if:
- The document contains a value for the inference input.
- The document also contains a value for the inference results of that field (in the _inference map).

If the document contains no value for the inference input but an inference result for that field, it is marked as failed. --------- Co-authored-by: carlosdelest <[email protected]>
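
The skip and failure rules described for elastic#106743 can be read as a small decision table. The sketch below is one possible encoding; the enum and method names are hypothetical, and the two cases the description does not spell out (input without results, and neither input nor results) are filled in as assumptions and marked as such in the comments.

```java
// Hypothetical decision helper; it does not mirror the actual action filter code.
final class InferenceDecisionSketch {

    enum Decision { RUN_INFERENCE, SKIP, FAIL }

    static Decision decide(boolean hasInferenceInput, boolean hasInferenceResult) {
        if (hasInferenceInput && hasInferenceResult) {
            // From the description: input and results are both present, so inference is not called.
            return Decision.SKIP;
        }
        if (!hasInferenceInput && hasInferenceResult) {
            // From the description: results without an input mark the document as failed.
            return Decision.FAIL;
        }
        if (hasInferenceInput) {
            // Assumption: an input without results is the normal case that triggers inference.
            return Decision.RUN_INFERENCE;
        }
        // Assumption: with neither input nor results there is nothing to infer.
        return Decision.SKIP;
    }
}
```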

The merge logic in MergePositionsOperator is excessively complex and lacks flexibility. It relies on the source operator emitting pages with ascending positions. Additionally, this merge logic introduced an unusual method, appendAllValuesToCurrentPosition, to the Block.Builder. We should replace this with a simpler and more flexible approach. This PR uses a mechanism similar to the grouping aggregation. In fact, it is very close to the values aggregation. Initially, I considered using the GroupingState from ValuesAggregator. However, unlike in the values aggregation, we don't expect many multi-values in enrich. Hence, I introduced the new EnrichResultBuilders instead.

When an alias action list is posted with must_exist==false and succeeds only partially, a list of results for each action is now returned. The results contain information about the requested action, indices, and aliases. If must_exist==true, or all actions fail, the call will return a 400 status along with the associated exception.

This moves the test cases declared in the tests for ESQL's LOCATE function to test cases which will cause elastic#106782 to properly generate all of the available signatures. It also buys us all of the testing for incorrect parameter combinations.

This takes a stab at generating the markdown files that Kibana uses for its inline help. It doesn't include all of the examples because the `@Example` annotation is not filled in - we're tracking that in elastic#104247 (comment). There are some links in the output and they are in markdown syntax. We should figure out how to make them work for Kibana.

TransformScheduler can trigger its tasks on multiple threads. TransformTask uses an AtomicReference to manage one trigger event per thread by cycling between "Started" and "Indexing". The Retry Listener now has the same protection. "shouldRunAction" will cycle to false during execution and back to true if the action fails and should be retried. Fix elastic#107215
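
The "cycle to false during execution and back to true on failure" behaviour can be sketched with a compare-and-set guard. This is a minimal illustration only: the class name and the `tryRun` method are hypothetical, and it uses an `AtomicBoolean` for brevity rather than whatever the transform code actually uses.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical guard; only the shouldRunAction name is taken from the commit message.
final class RetryGuardSketch {

    private final AtomicBoolean shouldRunAction = new AtomicBoolean(true);

    /**
     * Runs the action if no other caller currently holds the guard.
     * Returns true if this call attempted the action, false if it was skipped.
     */
    boolean tryRun(BooleanSupplier action) {
        // Only one thread can flip true -> false; concurrent triggers are turned away.
        if (!shouldRunAction.compareAndSet(true, false)) {
            return false;
        }
        boolean succeeded = false;
        try {
            succeeded = action.getAsBoolean();
        } finally {
            if (!succeeded) {
                // The action failed (or threw), so allow a later retry.
                shouldRunAction.set(true);
            }
        }
        return true;
    }
}
```

After a successful run the flag stays false, so later triggers are ignored; after a failure it returns to true and the next trigger retries.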

With this commit we split the Universal Profiling plugin into three packages:

* `persistence` contains everything w.r.t index management
* `rest` contains the REST API
* `action` contains the transport API

The `action` / `rest` structure follows the already established structure in the rest of the code base. We divide this plugin into multiple packages mainly because the different functionalities will be maintained by different teams in the future. This restructuring helps clarify boundaries.

With this commit we handle a case in the TopN functions API where the specified limit is larger than the available number of TopN functions. Currently this throws an error (`IndexOutOfBoundsException`). With this check in place we just return the list as is.
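
A minimal sketch of such a bounds check, assuming the TopN functions are held in a `java.util.List` and the limit is non-negative; the class and method names are hypothetical and not taken from the profiling plugin.

```java
import java.util.List;

// Hypothetical helper illustrating the limit check; not the plugin's actual code.
final class TopNLimitSketch {

    /** Returns at most limit leading elements; assumes limit is non-negative. */
    static <T> List<T> limit(List<T> items, int limit) {
        if (limit >= items.size()) {
            // Fewer items than requested: return the list as is instead of slicing past the end.
            return items;
        }
        return items.subList(0, limit);
    }
}
```

Without the guard, `items.subList(0, limit)` throws `IndexOutOfBoundsException` whenever `limit` exceeds `items.size()`.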

Allegedly-S3-compatible APIs are very popular these days, but many third-party systems offering such an API also support a shared filesystem interface. Shared filesystem protocols such as NFS are much better specified than the S3 API, and experience shows that they lead to fewer compatibility headaches. This commit adds a recommendation to the `repository-s3` docs to consider such an interface instead.

Consume the HttpEntity after the API response is parsed, releasing network and thread resources back to their respective pools. Leaving them unconsumed does not appear to be causing issues during tests, but it does log a large number of hanging threads on test failure, making it harder to spot what may be the issue when a thread is hanging during a transform test. Close elastic#107055

…mapping # Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java # server/src/main/java/org/elasticsearch/cluster/metadata/InferenceFieldMetadata.java # server/src/test/java/org/elasticsearch/cluster/metadata/IndexMetadataTests.java

Mikep86 and others added 30 commits January 12, 2024 09:29

Merge branch 'main' into feature/semantic-text

9332ef9

Merge branch 'main' into feature/semantic-text

9311f50

Merge branch 'main' into feature/semantic-text

f86ae02

Merge branch 'main' into feature/semantic-text

d06038c

semantic_text inference results indexing (elastic#103978)

64b4799

Adds SemanticTextInferenceResultFieldMapper, which indexes inference results for semantic_text fields.

Merge branch 'main' into feature/semantic-text

eda88d0

# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java

Merge remote-tracking branch 'origin/main' into feature/semantic-text

94805a6

# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java

Merge remote-tracking branch 'origin/main' into feature/semantic-text

551fe80

Merge remote-tracking branch 'origin/main' into feature/semantic-text

7e2610b

# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java

Move semantic_text field mappers to inference plugin (elastic#105187)

e3b6a65

Merge remote-tracking branch 'origin/main' into feature/semantic-text

553484c

# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java

semantic_text - Field inference (elastic#103697)

ca65a70

Merge remote-tracking branch 'origin/main' into feature/semantic-text

16762be

# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java # server/src/test/java/org/elasticsearch/snapshots/SnapshotResiliencyTests.java

Merge remote-tracking branch 'origin/main' into feature/semantic-text

f3d5a78

Merge remote-tracking branch 'origin/main' into feature/semantic-text

ffa4d40

Merge remote-tracking branch 'origin/main' into feature/semantic-text

3f7ccde

# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java

Merge remote-tracking branch 'origin/main' into feature/semantic-text

881c394

Semantic text dense vector support (elastic#105515)

b1a3ee8

This was supposed to be merged into elastic#105515 but didn't make it

2039fb3

Merge branch 'main' into feature/semantic-text

db67976

semantic_text - extract Index Metadata inference information to separ…

3ca808b

…ate class (elastic#106328)

[feature/semantic_text] Refactor inference to run as an action filter (…

823fb58

…elastic#106357) --------- Co-authored-by: carlosdelest <[email protected]>

Merge remote-tracking branch 'origin/main' into feature/semantic-text

9531948

# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java

Fix build error

122e439

Merge branch 'main' into feature/semantic-text

2e89d99

[feature/semantic-text] semantic text copy to support (elastic#106689)

b6ca8d2

Merge remote-tracking branch 'upstream/main' into feature/semantic-text

2c11a3f

dnhatn and others added 29 commits April 9, 2024 09:24

Don't overwrite DataStream.rolloverOnWrite flag on failure store ro…

12398ee

…llover (elastic#107247)

Expand release note for elastic#105044 (elastic#107257)

ec2a4ca

Users of supposedly-S3-compatible storage may need to be aware of this change, so this commit expands the release notes to link to the relevant S3 documentation.

Use merge sort instead of hashing to avoid performance issues with ma…

de171b8

…ny buckets (elastic#107218)

[Transform] Make force-stopping the transform always remove persisten…

e21f2e3

…t task from cluster state (elastic#106989)

Make API key actions local-only (elastic#107148)

c4a11de

Refactoring PR to make create, grant, and update API key actions local-only. Also ports a profiles action since it relies on the same base class as grant API key.

Port DocsTest gradle plugin to java (elastic#107124)

62729c9

* Refactor DocsTest plugin to Java
* Rework asciidoc parsing to make adding more parsers simple

ESQL: Rename AUTO_BUCKET to just BUCKET (elastic#107197)

d6f9d1e

This renames the function AUTO_BUCKET to just BUCKET. It also removes the experimental tagging of the function in the docs, making it generally available.

Rename generated docs for (renamed) BUCKET func (elastic#107299)

8bcbc97

This checks in the generated-by-test doc files for newly renamed BUCKET function.

Do not report document metering on system indices (elastic#107041)

84d6157

For system indices we don't want to emit metrics. DocumentSizeReporter will be created given an index. It will internally contain a SystemIndices instance that will verify the indexName with isSystemName.

Openai model_id is required (elastic#107286)

8638dee

[DOCS][ESQL] Render locate function docs (elastic#107305)

943885d

ES|QL: regex warnings in csv-spec tests (elastic#107273)

19e9fc3

Renaming GeoIpDownloaderStatsAction (elastic#107290)

48a88c5

Renaming GeoIpDownloaderStatsAction to GeoIpStatsAction

Log skipped elections due to shutdown marker (elastic#106701)

a9cab35

semantic_text: Add index metadata information for inference field map…

c57dd98

…pers (elastic#107147) Co-authored-by: @jimczi Co-authored-by: @Mikep86

Expanding and refactoring the vector rolling upgrade tests (elastic#1…

9e502aa

…07020) This commit moves the legacy yaml rolling upgrade tests for vectors to the new rolling upgrade package. Also, it adds rolling upgrade tests for `int8_hnsw`.

Update docs/changelog/107262.yaml

ff8365a

Merge remote-tracking branch 'carlosdelest/carlosdelest/semantic-text…

6805726

…-field-mapping' into carlosdelest/semantic-text-field-mapping

Update changelog

7dbb53b

carlosdelest closed this Aug 1, 2024
