Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Fix .inference index leaking out of tests. #115023

Merged
merged 10 commits into from
Oct 21, 2024

Conversation

davidkyle
Copy link
Member

@davidkyle davidkyle commented Oct 17, 2024

Many tests in the Enrich and Eql test suites are failing with the error

org.elasticsearch.client.WarningFailureException: method [DELETE], host [http://127.0.0.1:37795], URI [*,-.ds-ilm-history-*,-.ds-.slm-history-*,-.ds-.watcher-history-*?expand_wildcards=open%2Cclosed%2Chidden], status line [HTTP/1.1 200 OK]
Warnings: [this request accesses system indices: [.inference], but in a future major version, direct access to system indices will be prevented by default]
{"acknowledged":true}

.inference is a system index and will be removed automatically when the test clean up code calls the feature reset API (POST _features/_reset). The error is occurring when the indices are deleted post feature reset.

wipeAllIndices(preserveSecurityIndicesUponCompletion());

The failing tests all have monitoring enabled xpack.monitoring.collection.enabled, when enabled one of the collectors periodically calls the xpack usage API. Xpack usage calls GET _inference/_all for the list of inference endpoints. A recent change to the Inference registry persisted the default configurations the first time they read with a GET request which is what the monitoring collector was doing via the xpack uasage API. The GET causes a write to the .inference index which auto creates it causing the warning in delete indices.

The fix here is to pass a parameter do not persist when calling GET _inference/_all so that the .inference index is not created by the xpack usage API.

Reverts the change in #114928 to suppress the failures.

Closes #114962 #114791 #114790 #114789 #114775 #114773 #114771 #114769 #114763 #114761 #114840 #114839 #114773 #114749

@davidkyle davidkyle added >test Issues or PRs that are addressing/adding tests :ml Machine learning v8.16.0 v9.0.0 labels Oct 17, 2024
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Oct 17, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@davidkyle davidkyle force-pushed the inference-index-fix branch from e8f0f2d to c81dfd3 Compare October 17, 2024 14:49
@@ -63,7 +63,7 @@ protected void masterOperation(
ClusterState state,
ActionListener<XPackUsageFeatureResponse> listener
) {
GetInferenceModelAction.Request getInferenceModelAction = new GetInferenceModelAction.Request("_all", TaskType.ANY);
GetInferenceModelAction.Request getInferenceModelAction = new GetInferenceModelAction.Request("_all", TaskType.ANY, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, the issue overall that is causing test failures is that the cleanup process of the tests deletes the index and then the usage API attempts to access it here? Why can't we just swallow the exception here or check that the index exists before making the call? This would avoid the need to pass the doNotPersistDefaultConfigs flag through all of our layers just for testing purposes.

Copy link
Member Author

@davidkyle davidkyle Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.inference is a system index and should be cleaned up by the owning plugin - see SystemIndexPlugin.cleanUpFeature. The test are failing when the clean up code deletes all indices with the wildcard *, the deletion emits the warning Warnings: [this request accesses system indices: [.inference] and it is the warning about accessing the system index that fails the test.

The problem is the .inference index is been recreated between the call to SystemIndexPlugin.cleanUpFeature - called by feature reset and the delete all indices request. The culprit is the usages collector that calls GET _inference/* causing the index to be created.

The post test clean is here

This line calls feature reset which will clean up the system indices

And here the indices are deleted

wipeAllIndices(preserveSecurityIndicesUponCompletion());

@davidkyle
Copy link
Member Author

@elasticmachine update branch

@davidkyle davidkyle added v8.17.0 auto-backport Automatically create backport pull requests when merged labels Oct 21, 2024
// Set to true to avoid persisting on read.
// This setting only applies to GET * requests it has
// no effect when getting a single model
private final boolean doNotPersistDefaultConfigs;
Copy link
Contributor

@jan-elastic jan-elastic Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I dislike negative boolean variable names. Why not have persistDefaultConfig?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I flip flopped on this having switched the meaning once already. I'm not sure of the reason why I decided to use the verbose form other that it was friday afternoon. I've renamed to persistDefaultConfig

modelRegistry.getAllModels(
doNotPersistEndpoints == false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why == false?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See flip flop above. The tests should have caught this but they needed updating to

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One benefit of always using the "positive" boolean names is that you never need "flips" like this when passing the variable on.

@@ -34,19 +34,40 @@ public GetInferenceModelAction() {

public static class Request extends AcknowledgedRequest<GetInferenceModelAction.Request> {

private static boolean DEFAULT_DO_NOT_PERSIST_DEFAULT_CONFIGS = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need this as part of the user-facing request.

At first glance, this is needed to fix some issues with the test setup, which shouldn't lead to changes in production code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The option is internal to the transport action it cannot be set by an REST API user. The inference usage action however can be called by a REST API user. Xpack telemetry and monitoring also call the usage action. If the inference API is not being used it is preferable not to create the .inference index, having the flag here stops any callers of the GET /_xpack/usage from indirectly creating the index. It is also very beneficial that this change fixes the tests but even for production code it should not be creating the index if a user calls the usage action.

This entire problem relates to the design decision to persist the default configs on read. The reasons for this design are:

  • It is not known until late in the cluster startup lifecycle whether the linux x86 optimised model can be used
  • The code should make that decision once and initialise the config with the correct model variant
  • Doing this on read rather than at some fixed point in the cluster lifecycle means failures are trivially retried on the next read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that explanation helps! And I missed that this is internal-only.

@davidkyle
Copy link
Member Author

@elasticmachine update branch

@davidkyle davidkyle removed the auto-backport Automatically create backport pull requests when merged label Oct 21, 2024
Copy link
Contributor

@jan-elastic jan-elastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@davidkyle davidkyle enabled auto-merge (squash) October 21, 2024 12:20
@davidkyle
Copy link
Member Author

@elasticmachine update branch

davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Oct 22, 2024
…sage (elastic#115023)

The Inference usage API calls GET _inference/_all and because the default
configs are persisted on read it causes the creation of the .inference index.
This action is undesirable and causes test failures by leaking the system index
out of the test clean up code.
# Conflicts:
#	muted-tests.yml
#	x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/TransportGetInferenceModelAction.java
elasticsearchmachine pushed a commit that referenced this pull request Oct 22, 2024
…sage (#115023) (#115313)

The Inference usage API calls GET _inference/_all and because the default
configs are persisted on read it causes the creation of the .inference index.
This action is undesirable and causes test failures by leaking the system index
out of the test clean up code.
# Conflicts:
#	muted-tests.yml
#	x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/TransportGetInferenceModelAction.java
davidkyle added a commit to davidkyle/elasticsearch that referenced this pull request Oct 22, 2024
…sage (elastic#115023)

The Inference usage API calls GET _inference/_all and because the default
configs are persisted on read it causes the creation of the .inference index.
This action is undesirable and causes test failures by leaking the system index
out of the test clean up code.
# Conflicts:
#	muted-tests.yml
#	server/src/main/java/org/elasticsearch/TransportVersions.java
#	x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/TransportGetInferenceModelAction.java
elasticsearchmachine pushed a commit that referenced this pull request Oct 23, 2024
* [ML] Do not create the .inference index as a side effect of calling usage (#115023)

The Inference usage API calls GET _inference/_all and because the default
configs are persisted on read it causes the creation of the .inference index.
This action is undesirable and causes test failures by leaking the system index
out of the test clean up code.
# Conflicts:
#	muted-tests.yml
#	x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/TransportGetInferenceModelAction.java

* unmute tests

* precommit

* spotless

---------

Co-authored-by: Elastic Machine <[email protected]>
davidkyle added a commit that referenced this pull request Oct 24, 2024
Backport of #115023 with additional logic to handle streaming to/from a patch version
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
…sage (elastic#115023)

The Inference usage API calls GET _inference/_all and because the default 
configs are persisted on read it causes the creation of the .inference index.
This action is undesirable and causes test failures by leaking the system index
out of the test clean up code.
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
…sage (elastic#115023)

The Inference usage API calls GET _inference/_all and because the default 
configs are persisted on read it causes the creation of the .inference index.
This action is undesirable and causes test failures by leaking the system index
out of the test clean up code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning Team:ML Meta label for the ML team >test Issues or PRs that are addressing/adding tests v8.16.0 v8.17.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] EnrichIT testBasicFlowDate failing
5 participants