[ML] Make sort order for datafeeds deterministic #39187
Comments
Pinging @elastic/ml-core
I saw the same problem with the "netstore" data set. Not only are the One wrote This difference in when
I tried making the sort order deterministic and it does not completely fix the problem of non-deterministic results. The following jobs differed only in I'll try printing the input to the C++ process to find out whether the variability after sorting is on the C++ or Java side.
Printing the input to the C++ process shows that, after sorting on the Java side, the remaining difference must be coming from the C++ side:
The difference in input is only a single character out of 128474423 and it's the flush ID on a flush control message that comes right at the end of the input, after many
Going back to the original problem raised in this issue, the naive choice of field for a secondary sort order would be Current best practice is to include a tie-breaker field for sorting in the documents to be sorted. But since we do not control the input data we cannot enforce this. #25797 contained an idea to use the I think the best compromise solution for this issue is to sort first on time, then on all the other fields that are to be extracted by the datafeed that have doc values, in alphabetical order of the field names. For most cases this will result in a unique sort order, as generally the fields used in anomaly detection are numbers or keywords and have doc values by default. It is true that anomaly detection is not restricted to fields that have doc values, but even if only some of the extracted fields have doc values the sort order will become a little more deterministic. From a testing perspective, we can make sure that any test whose assertions rely on 100% deterministic sorting of the input data has doc values for all fields involved in the analysis.
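A minimal sketch of the ordering proposed above, assuming the standard Elasticsearch search `sort` array syntax. The function name `build_datafeed_sort`, the example field names, and the idea of passing doc-values information as a plain dictionary are all hypothetical and only illustrate the rule "time first, then doc-values fields in alphabetical order"; this is not the datafeed implementation.

```python
from typing import Dict, List


def build_datafeed_sort(time_field: str, extracted_fields: Dict[str, bool]) -> List[dict]:
    """Build a sort clause: time first, then doc-values fields by name.

    extracted_fields maps each extracted field name to whether it has
    doc values (hypothetical input; a real datafeed would have to look
    this up from the index mappings).
    """
    sort = [{time_field: {"order": "asc"}}]
    # Only fields with doc values can be sorted on cheaply, so the
    # others are skipped. Alphabetical order makes the clause itself
    # independent of the order in which the fields were configured.
    secondary = sorted(
        name
        for name, has_doc_values in extracted_fields.items()
        if has_doc_values and name != time_field
    )
    sort.extend({name: {"order": "asc"}} for name in secondary)
    return sort


# Hypothetical extracted fields; "message" stands in for a field without doc values.
fields = {"responsetime": True, "airline": True, "message": False, "@timestamp": True}
sort_clause = build_datafeed_sort("@timestamp", fields)
print(sort_clause)
```

If two documents agree on the time field and on every doc-values field, their relative order is still undefined, which is why this is described as a compromise rather than a guarantee.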
@dimitris-athanasiou pointed out that we don't know the performance impact of sorting by many fields on all the possible input indices we may encounter in production. Even if we measure it to be low impact in some internal tests it may turn out to seriously impact performance for some other cases. Therefore a better way forward could be to add an optional secondary sort field to the datafeed config. For some internal test cases we can set this to
@sophiec20 found a job using a When this issue was first raised the differences from run to run had always been very small and, although annoying for automated testing, would not have changed the way a user would react to the anomalies.
Following #65450 the new tiebreaker field will enable unique ordering for a given source index. However, it's important to note that the index must remain untouched between test runs. So two indices that contain exactly the same documents, for example because they've been populated from the same CSV file, will not necessarily sort in the same order using the tiebreaker of #65450.
This would seem to be the only way to solve the problem that works given a set of source documents as opposed to a specific unchanging index.
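The distinction between "a specific unchanging index" and "a set of source documents" can be illustrated with a small in-memory sketch. The `_pos` field here is a hypothetical stand-in for any tiebreaker that depends on how an index was built rather than on document content; it is not the actual tiebreaker of #65450, and the documents are made up.

```python
# Hypothetical documents: each timestamp is shared by two documents.
docs = [
    {"time": 1, "airline": "AAL", "responsetime": 20.0},
    {"time": 1, "airline": "BAW", "responsetime": 15.5},
    {"time": 2, "airline": "AAL", "responsetime": 30.0},
    {"time": 2, "airline": "KLM", "responsetime": 12.0},
]


def ingest(source_docs):
    """Simulate indexing: each document gets an index-internal position."""
    return [dict(doc, _pos=i) for i, doc in enumerate(source_docs)]


def without_pos(index_docs):
    """Drop the index-internal field so orderings can be compared by content."""
    return [{k: v for k, v in doc.items() if k != "_pos"} for doc in index_docs]


def sort_by_index_tiebreaker(index_docs):
    # Unique within one index, but depends on how that index was populated.
    return sorted(index_docs, key=lambda d: (d["time"], d["_pos"]))


def sort_by_content(index_docs):
    # Depends only on document content, so it carries over between indices.
    return sorted(index_docs, key=lambda d: (d["time"], d["airline"], d["responsetime"]))


# Two indices holding the same documents, populated in different orders.
index_a = ingest(docs)
index_b = ingest(list(reversed(docs)))

same_with_tiebreaker = (
    without_pos(sort_by_index_tiebreaker(index_a))
    == without_pos(sort_by_index_tiebreaker(index_b))
)
same_with_content = (
    without_pos(sort_by_content(index_a)) == without_pos(sort_by_content(index_b))
)
print(same_with_tiebreaker, same_with_content)
```

Within either index the position-based ordering is stable from run to run, which matches the requirement that the index remain untouched between test runs; only the content-based ordering agrees across the two indices.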
For some datasets / job configurations the final model_bytes value is not 100% deterministic. This might be because the document order in the datafeed is not deterministic for documents with identical timestamps. Deterministic behaviour here would help QA to recognize "real" changes in model_bytes.
Suggestion: Make the sort order for datafeeds 100% deterministic.