copy_to and multifields support for semantic_text #1

carlosdelest · 2024-03-22T17:18:21Z

Adds copy_to and multifields support to semantic_text.

This must be merged after elastic#106560, since it is based on that PR.

Changes:

Iterate on the source fields for calculating inference
Allow inference to be applied from multiple responses to a single field
Parser needs to check field type for multifields
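
A minimal sketch of the first two changes, written in Python rather than the PR's Java. It assumes a hypothetical mapping summary in which each semantic_text field lists the source fields that feed it (the field itself plus any copy_to sources). The names `collect_inference_inputs` and `source_fields_by_target` are illustrative and do not come from the PR.

```python
from typing import Any, Dict, List


def collect_inference_inputs(
    doc: Dict[str, Any],
    source_fields_by_target: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Gather, per semantic_text field, every text value that needs inference."""
    inputs: Dict[str, List[str]] = {}
    for target, sources in source_fields_by_target.items():
        values: List[str] = []
        for source in sources:
            value = doc.get(source)
            if value is None:
                continue
            # A source field may hold a single string or a list of strings.
            values.extend(value if isinstance(value, list) else [value])
        if values:
            inputs[target] = values
    return inputs


# Hypothetical document: "title" and "body" are copied into "semantic".
doc = {"title": "Rust", "body": ["Ownership", "Borrowing"]}
mapping = {"semantic": ["semantic", "title", "body"]}
result = collect_inference_inputs(doc, mapping)
```

In this sketch a single target field ends up with several inputs, so whatever consumes `result` has to attach several inference responses to one field.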

* WIP Support ENRICH MATCH on TEXT
* Disallow KEYWORD from range enrich
  The ingest processor does not support this, and there is no keyword_range type to complement the numerical, date and ip range types.
* Revert: Disallow KEYWORD from range enrich
  We allow using KEYWORD to range match against ip_range.
* Update docs/changelog/106435.yaml
* Improve changelog entry
* Added yaml test for ENRICH on TEXT fields
* Allow TEXT for range, so text matches IP-range (plus test)

Packaging tests have several files that may be useful in debugging failures. Additionally, for some assertions we want to catch the failure and emit additional debugging info. This commit wraps the common ways that Elasticsearch is started and assertions are run so that all available debug information is dumped.

The shutdown integration tests test scenarios across multiple nodes. When checking if a shard is moved off a node that is shutting down, the shard migration status may not yet have been updated. This commit adds a busy wait to ensure the status has time to update before failing the test. closes elastic#77488
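
The busy wait described above can be sketched as follows. This is a generic Python illustration, not the test framework's actual helper; `assert_busy`, its parameters, and the simulated status check are all assumptions made for the example.

```python
import time
from typing import Callable


def assert_busy(check: Callable[[], None], timeout: float = 10.0, interval: float = 0.05) -> None:
    """Retry `check` until it stops raising AssertionError or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            check()
            return
        except AssertionError:
            if time.monotonic() >= deadline:
                raise  # give up and surface the last failure
            time.sleep(interval)


# Simulated shard migration status that only flips to COMPLETE on the third poll.
polls = {"count": 0}


def shard_migration_complete() -> None:
    polls["count"] += 1
    status = "COMPLETE" if polls["count"] >= 3 else "IN_PROGRESS"
    assert status == "COMPLETE", status


assert_busy(shard_migration_complete, timeout=2.0)
```

The check is retried instead of failing on the first stale status, which is the behavior the commit message describes.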

NodeShutdownIT.testStalledShardMigrationProperlyDetected has been muted for a couple of years. It apparently reproduced when the failure first started, but no longer reproduces on main. This commit re-enables the test and closes the test issue. We can open a new issue with any subsequent failure. closes elastic#77456

This makes a couple of changes to regex processing in the compute engine:
1. Process utf-8 strings directly. This should save a ton of time.
2. Snip the `toString` output if it is too big - I chose 64kb of strings.
3. I changed the formatting of the automaton to a slightly customized `dot` output. Because automata are graphs. Everyone knows it. And they are a lot easier to read as graphs. `dot` is easy to convert into a graph.
4. I implement `EvaluatorMapper` for regex operations which is pretty standard for the rest of our operations.
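
To illustrate points 2 and 3, here is a small Python sketch that renders a toy automaton as `dot` and snips long output. It is not the format the commit produces; the function names, the transition-table representation, and the interpretation of the 64kb limit as a character count are assumptions for this example.

```python
from typing import Dict, Set, Tuple

MAX_CHARS = 64 * 1024  # assumption: the 64kb limit is treated as a character count


def automaton_to_dot(transitions: Dict[Tuple[int, str], int], accepting: Set[int]) -> str:
    """Render a deterministic automaton as a Graphviz `dot` digraph."""
    lines = ["digraph Automaton {", "  rankdir = LR;"]
    states = sorted({src for (src, _) in transitions} | set(transitions.values()) | accepting)
    for state in states:
        # Accepting states are drawn with a double circle.
        shape = "doublecircle" if state in accepting else "circle"
        lines.append(f"  {state} [shape={shape}];")
    for (src, label), dst in sorted(transitions.items()):
        lines.append(f'  {src} -> {dst} [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


def snip(text: str, limit: int = MAX_CHARS) -> str:
    """Truncate overly long debug output and mark the cut."""
    return text if len(text) <= limit else text[:limit] + "..."


# Toy automaton accepting exactly the string "ab".
dot = snip(automaton_to_dot({(0, "a"): 1, (1, "b"): 2}, accepting={2}))
```

The resulting text can be passed to Graphviz to draw the automaton.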

This modifies the ESQL test infrastructure to generate more of the documentation for functions. It generates the *Description* section, the *Examples* section, and the *Parameters* section as separate files so we can use them as needed. It also generates a `layout` file that's just a guess as to how to render the whole thing. In some cases it'll work and we can use that instead of hand maintaining a "top level" description file for the function. Most newly generated files are unused. We have to choose to pick them up, either by replacing the sections we were manually maintaining with an include of the generated section, or by replacing the entire hand maintained file with the generated top level file. Relates to elastic#104247

…tic#106505) The distributions already have correct permissions set on native libraries copied to them. However, the build step that extracts the native libs relies on the upstream file permissions. This commit sets explicit permissions on the copy task which extracts native libraries.

) Since mrjars may use preview apis, forbidden apis must know about any preview apis from the jdk. However, we do not run forbidden apis with the preview enabled flag, nor in a separate jvm, so it does not know about these classes. Thus we ignore missing classes on source sets added by the mrjar plugin. This commit configures all sourcesets added by mrjar plugin to ignore forbidden apis missing classes.

The task for updating cluster state with nodes seen by shutdown was previously switched to use batched tasks. However, the task is never marked as complete, which leads to the tasks piling up. This commit marks the task as complete and re-enables a test that appears to succeed now. closes elastic#76689

If we proceed without waiting for pages, we might cancel the main request before starting the data-node request. As a result, the exchange sinks on data-nodes won't be removed until the inactive_timeout elapses, which is longer than the assertBusy timeout. Closes elastic#106443

Empty read is [short-circuited](https://github.com/elastic/elasticsearch/blob/e8039b9ecb2451752ac5377c44a6a0c662087a9f/modules/repository-s3/src/main/java/org/elasticsearch/repositories/s3/S3BlobContainer.java#L115-L116) without going to the blob store. In order to test s3 blob store, ranged read should read at least one byte. This PR ensures that. Resolves: elastic#105958

This commit exposes the query transport action to improve usage. While one can perform all operations prior to this change, it has been suggested that adding the action would improve the symmetry of the API by allowing e.g. `client().execute(builder.action(), builder.request()).actionGet(30, SECONDS);`

Test tweaks for serverless:
* Valid application name in API key tests
* Move from `cluster.health` to `info` call in roles test (the call is just used to check that a user with a cluster privilege is indeed able to execute the test)

Closes: ES-7987

In this PR we introduce the APIs that expose the global retention configuration and allow users to take advantage of it. These APIs are protected by newly introduced dedicated privileges:
* `manage_data_stream_global_retention` or higher, which allows all operations on the global retention configuration
* `monitor_data_stream_retention` or higher, which allows the retrieval of the global retention configuration

This PR is the final PR that makes global retention available to our users.

Regular feature names are extracted together with historical features during feature metadata extraction. Based on this, feature checks in tests are validated to use only known features to prevent tests from being silently disabled due to an invalid or misspelled feature name. --------- Co-authored-by: Lorenzo Dematte <[email protected]>

…elds (elastic#106862) The SearchExecutionContext supports the notion of allowed fields, provided via a specific setter method. However, fields are only filtered in the getFieldType method. getMatchingFieldNames and getFieldType need to be consistent: there are places in the code where getMatchingFieldNames is called to resolve field name patterns, and getFieldType is later called on each of the resolved fields. If the former resolves to a field that we can't retrieve a field type for, that is unexpected and should be considered a bug. In addition, this commit adds consistency for getAllFields. It is only called by field caps, a different codepath that does not seem to set allowed fields for now, but it is important for the context to provide consistent field access, especially for methods as broad as getAllFields, despite their currently very specific usage. This surfaced as we are trying to move fetching of the `_ignored` field to value fetchers, which use a search execution context and resolve the field type, whereas until now the values were retrieved directly via StoredFieldsPhase, bypassing this check completely. This commit also adds a missing test verifying that SearchExecutionContext applies the allowedFields predicate when provided.
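
The consistency requirement can be illustrated with a small Python model. This is a sketch of the idea only, not the Java class: `SearchContextSketch` and its method names are hypothetical stand-ins, and glob matching via `fnmatch` is an assumption about how patterns are resolved.

```python
import fnmatch
from typing import Callable, Dict, Optional, Set


class SearchContextSketch:
    """Toy context that applies one allowed-fields predicate to every lookup."""

    def __init__(self, field_types: Dict[str, str]) -> None:
        self._field_types = field_types
        self._allowed: Callable[[str], bool] = lambda name: True

    def set_allowed_fields(self, predicate: Callable[[str], bool]) -> None:
        self._allowed = predicate

    def get_field_type(self, name: str) -> Optional[str]:
        if not self._allowed(name):
            return None
        return self._field_types.get(name)

    def get_matching_field_names(self, pattern: str) -> Set[str]:
        # Filtering here guarantees every resolved name also has a field type.
        return {
            name
            for name in self._field_types
            if fnmatch.fnmatchcase(name, pattern) and self._allowed(name)
        }

    def get_all_fields(self) -> Dict[str, str]:
        return {n: t for n, t in self._field_types.items() if self._allowed(n)}


ctx = SearchContextSketch({"title": "text", "secret": "keyword", "tags": "keyword"})
ctx.set_allowed_fields(lambda name: name != "secret")
matched = ctx.get_matching_field_names("*")
```

With the predicate applied in all three methods, a caller that resolves a pattern and then asks for each field's type never hits a field it is not allowed to see.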

…ce metadata in `IndexMetadata` (elastic#106743) This change refactors the integration of the field inference metadata in IndexMetadata. Instead of partial diffs, the new class simply sends the entire object as the diff if it has changed. This PR also renames the fields and methods related to the inference fields consistently. The inference phase (in the transport shard bulk action) is also changed so that inference is not called if:
* The document contains a value for the inference input.
* The document also contains a value for the inference results of that field (in the _inference map).

If the document contains no value for the inference input but an inference result for that field, it is marked as failed. --------- Co-authored-by: carlosdelest <[email protected]>
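
The decision rules for the inference phase can be sketched as a small Python function. The names `Action` and `decide` are hypothetical, the document is modeled as a plain dict with an `_inference` map, and the behavior when neither input nor result is present is an assumption, since the description does not cover that case.

```python
from enum import Enum
from typing import Any, Dict


class Action(Enum):
    RUN_INFERENCE = "run_inference"
    SKIP = "skip"
    FAIL = "fail"


def decide(doc: Dict[str, Any], field: str) -> Action:
    """Decide what the inference phase does for one field of one document."""
    has_input = doc.get(field) is not None
    has_result = field in doc.get("_inference", {})
    if has_input and has_result:
        return Action.SKIP  # results were supplied with the input: do not call inference
    if has_result:
        return Action.FAIL  # results without an input value: mark the item as failed
    if has_input:
        return Action.RUN_INFERENCE
    return Action.SKIP  # assumption: nothing to infer when both are absent


with_both = decide({"body": "text", "_inference": {"body": [0.1]}}, "body")
result_only = decide({"_inference": {"body": [0.1]}}, "body")
input_only = decide({"body": "text"}, "body")
```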

Stop and Start error messages include the reason for the error followed by the suggestion to use force=true. This may cause the suggestion to be hidden by the reason, so we will move the reason after the suggestion. Close elastic#106819

Like Block#filter, Block#expand should return the specific type of the original block, rather than a generic block type. For instance, the expanded block of an IntBlock should also be an IntBlock. I encountered a situation where I had to cast the expanded block.

…Input` (elastic#106794) There are many scenarios where we create very small slices (smaller than the buffer size) from inputs that already have these bytes buffered (BKDReader#packedIndex, for example). We can save considerable memory, as well as potential IO to disk or, worse yet, the blob store, by just slicing the buffer when possible. Outside of the case of slicing and never reading from the slice, this should always save memory.
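
A rough Python sketch of the idea: if the requested slice lies entirely inside the bytes already buffered, serve it as a view over that buffer instead of creating a new input with its own buffer. The function name and parameters are hypothetical and only model the range check, not the actual Java implementation.

```python
from typing import Optional


def slice_from_buffer(
    buffer: bytes, buffer_offset: int, slice_offset: int, length: int
) -> Optional[memoryview]:
    """Return a zero-copy view when the slice is fully buffered, else None.

    `buffer_offset` is the file position of the first buffered byte.
    """
    start = slice_offset - buffer_offset
    if start < 0 or start + length > len(buffer):
        return None  # not fully buffered: the caller falls back to a regular slice
    return memoryview(buffer)[start:start + length]


buffered = b"abcdefgh"  # pretend these are file bytes 100..107
hit = slice_from_buffer(buffered, 100, 102, 3)
miss = slice_from_buffer(buffered, 100, 106, 4)
```

A `None` result stands for the fallback path, where a separate slice with its own buffer is still needed.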

…elastic#106624) Assert using greaterThanOrEqualTo to allow for additional scheduled background threads to appear in collected measurements after the thread pool stats have already been pulled, e.g. this could be the case for the cluster coordination thread pool.

…copy-to-support-inference
# Conflicts:
# server/src/test/java/org/elasticsearch/index/mapper/FieldTypeLookupTests.java
# x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilter.java
# x-pack/plugin/inference/src/test/java/org/elasticsearch/cluster/metadata/SemanticTextClusterMetadataTests.java
# x-pack/plugin/inference/src/test/java/org/elasticsearch/xpack/inference/action/filter/ShardBulkInferenceActionFilterTests.java
# x-pack/plugin/inference/src/yamlRestTest/resources/rest-api-spec/test/inference/10_semantic_text_inference.yml

rockdaboot and others added 30 commits March 19, 2024 18:33

[Profiling] Accept OTEL host architecture values (elastic#106494)

97b8977

AwaitsFix elastic#106501

0f504c1

Fix default search timeout in watcher docs (elastic#106404)

bceb38d

Update bundled JDK to Java 22 (elastic#106482)

2ab2a06

Mute EsqlClientYamlIT esql/60_enrich/Enrich on keyword

d362517

see elastic#106507

ESQL: Fix CSV tests (elastic#106506)

ea1672b

When we use `ROW` in ESQL we pick a random data set by just iterating the `Map`. It's random. Yay! And some of them don't work in this place. This just picks one that we know works. Closes elastic#106501

Allow SemanticTextFieldMapper to be a multifield

cf62b1b

Add multifields / copy_to tests to lookup and metadata

f029015

First iteration for adding inference support for copy_to / multifields

d3f9d86

[ML] Inference API Rate limiter (elastic#106330)

edbff94

* Working tests
* Adding more tests
* Adding comment
* Switching to micros and addressing feedback
* Removing nanos and adding test for bug fix

---------

Co-authored-by: Elastic Machine <[email protected]>

Add tests for copy_to

140caa3

Spotless

7d1c92a

Fix typo in OIDC docs (elastic#106207) (elastic#106517)

d5565b6

Add missing _to_ in sentence (cherry picked from commit 40a9155) Co-authored-by: Aaron Hanusa <[email protected]>

Expose lookup of realm domain config by realm id (elastic#106424)

dbb7847

The scope here is to expose a method (Realms#getRealmRef) that can be used to retrieve the realm domain assignments for any realm id.

Minor changes from previous PR

e023a19

Use cluster features for ASYNC ESQL tests (elastic#104466)

6f607e4

+ Add esql as rest test dependency for ml/native-multi-node-tests to work around the mixed testClusters/TestCluster nodes (so all have the esql plugin installed)

Add links to text_expansion in ELSER tutorial (elastic#106490)

d01adff

* Add links to text_expansion in ELSER tutorial
* Apply suggestions from code review

Co-authored-by: Liam Thompson <[email protected]>

---------

Co-authored-by: Liam Thompson <[email protected]>

jonathan-buttner and others added 28 commits March 27, 2024 21:10

Adding some missing writeable types (elastic#106841)

17a106c

I realized I forgot to add some namedwritables to our registry. I've forgotten this multiple times. Any ideas how we can improve this so we get failures if we forget in the future?

Fix misspelled ESQL feature name from 'esql.value_agg' to 'esql.agg_v…

f5b3e01

…alues' (elastic#106838)

Get and Query API Key with profile uid (elastic#106531)

3e0a0f6

Add new optional request option, `with_profile_uid`, to the Get and Query API Key Information endpoints, to return the API keys owner users' profile uid. Closes elastic#98939

unmute test, remove failing assert (elastic#106855)

058df55

Will restore the assert on the metric in a follow-up PR. Related to elastic#106834

Added content_type to profiling for rest-api-spec (elastic#106848)

8ddd03c

Skip rest tests when hdfs cluster setup fails (elastic#106856)

76f142f

For now, skip tests when the flaky hdfs cluster cannot be started. This allows investigating further without bothering others while keeping the pipeline green.

Add getter for global retention GetComposableIndexTemplateAction.Resp…

6b054cb

…onse (elastic#106858) Add missing getter

[Fleet] Added all privilege to kibana_system to logs-fleet_server.* i…

4c556fc

…ndex pattern (elastic#106815)
* Update KibanaOwnedReservedRoleDescriptors.java
* replaced all with read, delete_index

Unmute {p0=data_stream/10_basic/Delete data stream with failure store…

80094b2

…s} (elastic#106865)

Fix noop_update_total is not being updated when using the _bulk (e…

9e06cbf

…lastic#105745) Closes elastic#105742

Fix typo in functions/README.md (elastic#106870)

b85d4b1

Enable data-streams module in REST tests (elastic#106875)

f0b61f8

Fold ExactFieldName into FieldName (elastic#106867)

52fcf81

FieldName does not make much sense as an abstract class with a single private subclass. Also, the base implementation holds most of the fields that the subclass relies on to do its job. They can be unified into a single class

ESQL: Add OPTIONS clause to FROM command (elastic#106636)

aeeb597

This adds an OPTIONS clause to FROM, allowing users to specify search or index resolution options, such as preference, allow_no_indices or ignore_unavailable.

Update 8.13 release notes with known issue on downsampling (elastic#1…

a7bc24b

…06881)
* Update 8.13 release notes with known issue
* revert unintended
* reword
* reword
* reword

Merge branch 'main' into feature/semantic-text

2e89d99

Fix merge with feature branch

82ffb5b

Check that bulk updates use all source fields for updating a semantic…

6b26200

…_text field

Add check for inference results when no value is provided

bf5b837

carlosdelest closed this Nov 5, 2024

jimczi pushed a commit that referenced this pull request Nov 6, 2024

Revert "[test] Dynamically pick up the upper bound snapshot index ver…

7feb4d5

…sion (#1…" (elastic#115827) This reverts commit 32dee6a.
