Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Make sort order for datafeeds deterministic #39187

Open
pheyos opened this issue Feb 20, 2019 · 9 comments
Open

[ML] Make sort order for datafeeds deterministic #39187

pheyos opened this issue Feb 20, 2019 · 9 comments
Assignees
Labels
:ml Machine learning

Comments

@pheyos
Copy link
Member

pheyos commented Feb 20, 2019

For some datasets / job configurations the final model_bytes value is not 100% deterministic. This might be because of the fact that the document order in the datafeed is not deterministic for documents with identical timestamps.
Deterministic behaviour in this place would help QA to recognize "real" changes in the model_bytes.

Suggestion: Make the sort order for datafeeds 100% deterministic.

@pheyos pheyos added the :ml Machine learning label Feb 20, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@droberts195
Copy link
Contributor

droberts195 commented Mar 18, 2019

I saw the same problem with the "netstore" data set. Not only are the model_bytes numbers different, but because the decision on when to write model_size_stats documents depends on model_bytes, the number of model_size_stats documents can vary. These two screenshots are from jobs that were run on the same machine and only differ only in the job_id:

mss3
mss4

One wrote model_size_stats 19 times and the other 21 times.

This difference in when autodetect reports model_size_stats then can then have a further effect on when the job is considered to have established model memory.

@droberts195
Copy link
Contributor

I tried making the sort order deterministic and it does not completely fix the problem of non-deterministic results. The following jobs differed only in job_id and were run with a special build that used _id as a secondary sort order:

sort1
sort2

I'll try printing the input to the C++ process to find out whether the variability after sorting is on the C++ or Java side.

@droberts195 droberts195 self-assigned this Mar 19, 2019
@droberts195
Copy link
Contributor

Printing the input to the C++ process shows that, after sorting on the Java side, the remaining difference must be coming from the C++ side:

bash-3.2$ ls -l netstore_sort*txt
-rw-rw-r--  1 dave  dev  128474423 Mar 19 10:38 netstore_sort_4.txt
-rw-rw-r--  1 dave  dev  128474423 Mar 19 10:43 netstore_sort_5.txt
bash-3.2$ diff netstore_sort_4.txt netstore_sort_5.txt
1623156c1623156
< ,,,,,,,f4
---
> ,,,,,,,f5

The difference in input is only a single character out of 128474423 and it's the flush ID on a flush control message that comes right at the end of the input, after many model_size_stats documents have been written.

@droberts195
Copy link
Contributor

Going back to the original problem raised in this issue, the naive choice of field for a secondary sort order would be _id. However, this is inefficient for large indices, as it is not in the doc values.

Current best practice is to include a tie-breaker field for sorting in the documents to be sorted. But since we do not control the input data we cannot enforce this.

#25797 contained an idea to use the _seq_no/_shard combination for tie-breaking, but that was discarded because #25674 was considered better. But #25674 is not implemented either.

I think the best compromise solution for this issue is to sort first on time, then on all the other fields that are to be extracted by the datafeed that have doc values, in alphabetical order of the field names. For most cases this will result in a unique sort order, as generally the fields used in anomaly detection are numbers or keywords and have doc values by default. It is true that it is not compulsory to do anomaly detection on fields that have doc values, but even then if some fields being extracted have doc values then the sort order will become a little more deterministic, and from a testing perspective we can make sure that in any test that asserts something that relies on 100% deterministic sorting of the input data we do have doc values for all fields involved in the analysis.

@droberts195
Copy link
Contributor

@dimitris-athanasiou pointed out that we don't know the performance impact of sorting by many fields on all the possible input indices we may encounter in production. Even if we measure it to be low impact in some internal tests it may turn out to seriously impact performance for some other cases.

Therefore a better way forward could be to add an optional secondary sort field to the datafeed config. For some internal test cases we can set this to _id safe in the knowledge that, while unacceptable in general, it achieves the goal of deterministic sort order for some relatively small internal test as simply as possible.

@droberts195
Copy link
Contributor

@sophiec20 found a job using a high_count over highest_registered_domain detector with 30 minute bucket span where the effect of the sort order of the input was large. Depending on sort order the same anomaly was observed to have scores of 55.38, 53.2, 98.2 and 98.19. We should drill into why the C++ code was so sensitive to input order within identical timestamps in this case. Making the sort order deterministic would mean we wouldn't see differences like this in tests, but it seems wrong that the difference in scores is so huge.

When this issue was first raised the differences from run to run had always been very small, and although annoying for automated testing would not have changed the way a user would react to the anomalies.

@droberts195
Copy link
Contributor

The work being done in #46523 to implement #26472 solves this. Once #46523 is implemented we will need to switch from scroll to search_after, but we'll be doing the search_after in a point-in-time view, so will be able to use a sort order of (time field, _doc) to get a unique ordering.

@droberts195
Copy link
Contributor

droberts195 commented Nov 30, 2020

we'll be doing the search_after in a point-in-time view, so will be able to use a sort order of (time field, _doc) to get a unique ordering

Following #65450 the new tiebreaker field will enable unique ordering for a given source index. However, it's important to note that the index must remain untouched between test runs. So two indices that contain exactly the same documents, for example because they've been populated from the same CSV file, will not necessarily sort in the same order using the tiebreaker of #65450.

Therefore a better way forward could be to add an optional secondary sort field to the datafeed config

This would seem to be the only way to solve the problem that works given a set of source documents as opposed to a specific unchanging index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning
Projects
None yet
Development

No branches or pull requests

3 participants