-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Fix .inference index leaking out of tests. #115023
Conversation
Pinging @elastic/ml-core (Team:ML) |
e8f0f2d
to
c81dfd3
Compare
@@ -63,7 +63,7 @@ protected void masterOperation( | |||
ClusterState state, | |||
ActionListener<XPackUsageFeatureResponse> listener | |||
) { | |||
GetInferenceModelAction.Request getInferenceModelAction = new GetInferenceModelAction.Request("_all", TaskType.ANY); | |||
GetInferenceModelAction.Request getInferenceModelAction = new GetInferenceModelAction.Request("_all", TaskType.ANY, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, the issue overall that is causing test failures is that the cleanup process of the tests deletes the index and then the usage API attempts to access it here? Why can't we just swallow the exception here or check that the index exists before making the call? This would avoid the need to pass the doNotPersistDefaultConfigs
flag through all of our layers just for testing purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.inference is a system index and should be cleaned up by the owning plugin - see SystemIndexPlugin.cleanUpFeature. The test are failing when the clean up code deletes all indices with the wildcard *, the deletion emits the warning Warnings: [this request accesses system indices: [.inference]
and it is the warning about accessing the system index that fails the test.
The problem is the .inference index is been recreated between the call to SystemIndexPlugin.cleanUpFeature
- called by feature reset and the delete all indices request. The culprit is the usages collector that calls GET _inference/* causing the index to be created.
The post test clean is here
elasticsearch/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
Line 851 in c43f449
private void wipeCluster() throws Exception { |
This line calls feature reset which will clean up the system indices
elasticsearch/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
Line 877 in c43f449
if (resetFeatureStates()) { |
And here the indices are deleted
elasticsearch/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
Line 889 in c43f449
wipeAllIndices(preserveSecurityIndicesUponCompletion()); |
@elasticmachine update branch |
...ore/src/main/java/org/elasticsearch/xpack/core/inference/action/GetInferenceModelAction.java
Outdated
Show resolved
Hide resolved
// Set to true to avoid persisting on read. | ||
// This setting only applies to GET * requests it has | ||
// no effect when getting a single model | ||
private final boolean doNotPersistDefaultConfigs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I dislike negative boolean variable names. Why not have persistDefaultConfig
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I flip flopped on this having switched the meaning once already. I'm not sure of the reason why I decided to use the verbose form other that it was friday afternoon. I've renamed to persistDefaultConfig
modelRegistry.getAllModels( | ||
doNotPersistEndpoints == false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why == false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See flip flop above. The tests should have caught this but they needed updating to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One benefit of always using the "positive" boolean names is that you never need "flips" like this when passing the variable on.
@@ -34,19 +34,40 @@ public GetInferenceModelAction() { | |||
|
|||
public static class Request extends AcknowledgedRequest<GetInferenceModelAction.Request> { | |||
|
|||
private static boolean DEFAULT_DO_NOT_PERSIST_DEFAULT_CONFIGS = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we need this as part of the user-facing request.
At first glance, this is needed to fix some issues with the test setup, which shouldn't lead to changes in production code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The option is internal to the transport action it cannot be set by an REST API user. The inference usage action however can be called by a REST API user. Xpack telemetry and monitoring also call the usage action. If the inference API is not being used it is preferable not to create the .inference index, having the flag here stops any callers of the GET /_xpack/usage from indirectly creating the index. It is also very beneficial that this change fixes the tests but even for production code it should not be creating the index if a user calls the usage action.
This entire problem relates to the design decision to persist the default configs on read. The reasons for this design are:
- It is not known until late in the cluster startup lifecycle whether the linux x86 optimised model can be used
- The code should make that decision once and initialise the config with the correct model variant
- Doing this on read rather than at some fixed point in the cluster lifecycle means failures are trivially retried on the next read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that explanation helps! And I missed that this is internal-only.
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@elasticmachine update branch |
…sage (elastic#115023) The Inference usage API calls GET _inference/_all and because the default configs are persisted on read it causes the creation of the .inference index. This action is undesirable and causes test failures by leaking the system index out of the test clean up code. # Conflicts: # muted-tests.yml # x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/TransportGetInferenceModelAction.java
…sage (#115023) (#115313) The Inference usage API calls GET _inference/_all and because the default configs are persisted on read it causes the creation of the .inference index. This action is undesirable and causes test failures by leaking the system index out of the test clean up code. # Conflicts: # muted-tests.yml # x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/TransportGetInferenceModelAction.java
…sage (elastic#115023) The Inference usage API calls GET _inference/_all and because the default configs are persisted on read it causes the creation of the .inference index. This action is undesirable and causes test failures by leaking the system index out of the test clean up code. # Conflicts: # muted-tests.yml # server/src/main/java/org/elasticsearch/TransportVersions.java # x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/TransportGetInferenceModelAction.java
* [ML] Do not create the .inference index as a side effect of calling usage (#115023) The Inference usage API calls GET _inference/_all and because the default configs are persisted on read it causes the creation of the .inference index. This action is undesirable and causes test failures by leaking the system index out of the test clean up code. # Conflicts: # muted-tests.yml # x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/TransportGetInferenceModelAction.java * unmute tests * precommit * spotless --------- Co-authored-by: Elastic Machine <[email protected]>
Backport of #115023 with additional logic to handle streaming to/from a patch version
…sage (elastic#115023) The Inference usage API calls GET _inference/_all and because the default configs are persisted on read it causes the creation of the .inference index. This action is undesirable and causes test failures by leaking the system index out of the test clean up code.
…sage (elastic#115023) The Inference usage API calls GET _inference/_all and because the default configs are persisted on read it causes the creation of the .inference index. This action is undesirable and causes test failures by leaking the system index out of the test clean up code.
Many tests in the Enrich and Eql test suites are failing with the error
.inference
is a system index and will be removed automatically when the test clean up code calls the feature reset API (POST _features/_reset
). The error is occurring when the indices are deleted post feature reset.elasticsearch/test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
Line 889 in a182b71
The failing tests all have monitoring enabled
xpack.monitoring.collection.enabled
, when enabled one of the collectors periodically calls the xpack usage API. Xpack usage callsGET _inference/_all
for the list of inference endpoints. A recent change to the Inference registry persisted the default configurations the first time they read with a GET request which is what the monitoring collector was doing via the xpack uasage API. The GET causes a write to the.inference
index which auto creates it causing the warning in delete indices.The fix here is to pass a parameter
do not persist
when callingGET _inference/_all
so that the .inference index is not created by the xpack usage API.Reverts the change in #114928 to suppress the failures.
Closes #114962 #114791 #114790 #114789 #114775 #114773 #114771 #114769 #114763 #114761 #114840 #114839 #114773 #114749