[ML] Make sort order for datafeeds deterministic #39187

pheyos · 2019-02-20T16:14:36Z

For some datasets / job configurations the final model_bytes value is not 100% deterministic. This might be because of the fact that the document order in the datafeed is not deterministic for documents with identical timestamps.
Deterministic behaviour in this place would help QA to recognize "real" changes in the model_bytes.

Suggestion: Make the sort order for datafeeds 100% deterministic.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2019-02-20T16:14:38Z

Pinging @elastic/ml-core

droberts195 · 2019-03-18T15:10:03Z

I saw the same problem with the "netstore" data set. Not only are the model_bytes numbers different, but because the decision on when to write model_size_stats documents depends on model_bytes, the number of model_size_stats documents can vary. These two screenshots are from jobs that were run on the same machine and only differ only in the job_id:

One wrote model_size_stats 19 times and the other 21 times.

This difference in when autodetect reports model_size_stats then can then have a further effect on when the job is considered to have established model memory.

droberts195 · 2019-03-19T09:57:06Z

I tried making the sort order deterministic and it does not completely fix the problem of non-deterministic results. The following jobs differed only in job_id and were run with a special build that used _id as a secondary sort order:

I'll try printing the input to the C++ process to find out whether the variability after sorting is on the C++ or Java side.

droberts195 · 2019-03-19T10:52:43Z

Printing the input to the C++ process shows that, after sorting on the Java side, the remaining difference must be coming from the C++ side:

bash-3.2$ ls -l netstore_sort*txt
-rw-rw-r--  1 dave  dev  128474423 Mar 19 10:38 netstore_sort_4.txt
-rw-rw-r--  1 dave  dev  128474423 Mar 19 10:43 netstore_sort_5.txt
bash-3.2$ diff netstore_sort_4.txt netstore_sort_5.txt
1623156c1623156
< ,,,,,,,f4
---
> ,,,,,,,f5

The difference in input is only a single character out of 128474423 and it's the flush ID on a flush control message that comes right at the end of the input, after many model_size_stats documents have been written.

droberts195 · 2019-03-20T11:30:11Z

Going back to the original problem raised in this issue, the naive choice of field for a secondary sort order would be _id. However, this is inefficient for large indices, as it is not in the doc values.

Current best practice is to include a tie-breaker field for sorting in the documents to be sorted. But since we do not control the input data we cannot enforce this.

#25797 contained an idea to use the _seq_no/_shard combination for tie-breaking, but that was discarded because #25674 was considered better. But #25674 is not implemented either.

I think the best compromise solution for this issue is to sort first on time, then on all the other fields that are to be extracted by the datafeed that have doc values, in alphabetical order of the field names. For most cases this will result in a unique sort order, as generally the fields used in anomaly detection are numbers or keywords and have doc values by default. It is true that it is not compulsory to do anomaly detection on fields that have doc values, but even then if some fields being extracted have doc values then the sort order will become a little more deterministic, and from a testing perspective we can make sure that in any test that asserts something that relies on 100% deterministic sorting of the input data we do have doc values for all fields involved in the analysis.

droberts195 · 2019-03-20T13:24:14Z

@dimitris-athanasiou pointed out that we don't know the performance impact of sorting by many fields on all the possible input indices we may encounter in production. Even if we measure it to be low impact in some internal tests it may turn out to seriously impact performance for some other cases.

Therefore a better way forward could be to add an optional secondary sort field to the datafeed config. For some internal test cases we can set this to _id safe in the knowledge that, while unacceptable in general, it achieves the goal of deterministic sort order for some relatively small internal test as simply as possible.

droberts195 · 2019-04-17T08:38:53Z

@sophiec20 found a job using a high_count over highest_registered_domain detector with 30 minute bucket span where the effect of the sort order of the input was large. Depending on sort order the same anomaly was observed to have scores of 55.38, 53.2, 98.2 and 98.19. We should drill into why the C++ code was so sensitive to input order within identical timestamps in this case. Making the sort order deterministic would mean we wouldn't see differences like this in tests, but it seems wrong that the difference in scores is so huge.

When this issue was first raised the differences from run to run had always been very small, and although annoying for automated testing would not have changed the way a user would react to the anomalies.

droberts195 · 2019-09-13T11:59:57Z

The work being done in #46523 to implement #26472 solves this. Once #46523 is implemented we will need to switch from scroll to search_after, but we'll be doing the search_after in a point-in-time view, so will be able to use a sort order of (time field, _doc) to get a unique ordering.

droberts195 · 2020-11-30T10:04:48Z

we'll be doing the search_after in a point-in-time view, so will be able to use a sort order of (time field, _doc) to get a unique ordering

Following #65450 the new tiebreaker field will enable unique ordering for a given source index. However, it's important to note that the index must remain untouched between test runs. So two indices that contain exactly the same documents, for example because they've been populated from the same CSV file, will not necessarily sort in the same order using the tiebreaker of #65450.

Therefore a better way forward could be to add an optional secondary sort field to the datafeed config

This would seem to be the only way to solve the problem that works given a set of source documents as opposed to a specific unchanging index.

pheyos added the :ml Machine learning label Feb 20, 2019

droberts195 added the team-discuss label Feb 20, 2019

droberts195 self-assigned this Mar 19, 2019

droberts195 removed the team-discuss label Mar 20, 2019

droberts195 mentioned this issue Apr 4, 2019

[ML] Consider using search_after instead of scroll in datafeeds #29781

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ML] Make sort order for datafeeds deterministic #39187

[ML] Make sort order for datafeeds deterministic #39187

pheyos commented Feb 20, 2019

elasticmachine commented Feb 20, 2019

droberts195 commented Mar 18, 2019 •

edited

Loading

droberts195 commented Mar 19, 2019

droberts195 commented Mar 19, 2019

droberts195 commented Mar 20, 2019

droberts195 commented Mar 20, 2019

droberts195 commented Apr 17, 2019

droberts195 commented Sep 13, 2019

droberts195 commented Nov 30, 2020 •

edited

Loading

[ML] Make sort order for datafeeds deterministic #39187

[ML] Make sort order for datafeeds deterministic #39187

Comments

pheyos commented Feb 20, 2019

elasticmachine commented Feb 20, 2019

droberts195 commented Mar 18, 2019 • edited Loading

droberts195 commented Mar 19, 2019

droberts195 commented Mar 19, 2019

droberts195 commented Mar 20, 2019

droberts195 commented Mar 20, 2019

droberts195 commented Apr 17, 2019

droberts195 commented Sep 13, 2019

droberts195 commented Nov 30, 2020 • edited Loading

droberts195 commented Mar 18, 2019 •

edited

Loading

droberts195 commented Nov 30, 2020 •

edited

Loading