[ML] verify that there are no duplicate leaf fields in aggs #41895

benwtrent · 2019-05-07T13:35:43Z

This PR adds two validations for the data frame pivot config:

That there are no duplicate fields in the group_by or the aggs definitions.
That no field is declared as both an object and a leaf field, e.g. both foo.bar.baz and foo.bar.

The best case scenario before this PR is that we can automatically determine the mapped type and we prevent the transform from even being started. However, if we rely on the dynamic mapping, index mapping failures will spam the logs until the task eventually fails due to the indexing failures.
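
A minimal sketch of the two checks described above, for illustration only. The class name PivotFieldNameCheck, the method name validate, and the wording of the duplicate message are assumptions and are not taken from the PR; only the object/field message matches the one asserted in the PR's tests.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not the code merged in this PR.
final class PivotFieldNameCheck {

    // Returns one message per problem found; an empty list means the names are valid.
    static List<String> validate(Collection<String> usedNames) {
        List<String> failures = new ArrayList<>();
        Set<String> leafNames = new LinkedHashSet<>();

        // Check 1: the same output field name must not be used twice.
        for (String name : usedNames) {
            if (!leafNames.add(name)) {
                // Message wording is an assumption.
                failures.add("duplicate field [" + name + "] detected");
            }
        }

        // Check 2: no dotted prefix of a name may itself be a leaf field,
        // e.g. foo.bar and foo.bar.baz cannot coexist.
        Set<String> reported = new HashSet<>();
        for (String name : leafNames) {
            int dot = name.indexOf('.');
            while (dot >= 0) {
                String prefix = name.substring(0, dot);
                if (leafNames.contains(prefix) && reported.add(prefix)) {
                    failures.add("field [" + prefix + "] cannot be both an object and a field");
                }
                dot = name.indexOf('.', dot + 1);
            }
        }
        return failures;
    }
}
```

Because prefixes are cut at the dots rather than built by concatenating tokens, foo.bar.baz and foobar are treated as unrelated names.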

elasticmachine · 2019-05-07T13:35:46Z

Pinging @elastic/ml-core

hendrikmuhs

added some comments

hendrikmuhs · 2019-05-08T09:38:38Z

.../core/src/main/java/org/elasticsearch/xpack/core/dataframe/transforms/pivot/PivotConfig.java

+        // TODO this will need changed once we allow multi-bucket aggs + field merging
+        aggregationConfig.getAggregatorFactories().forEach(agg -> addAggNames(agg, usedNames));
+        aggregationConfig.getPipelineAggregatorFactories().forEach(agg -> addAggNames(agg, usedNames));
+        usedNames.addAll(groups.getGroups().keySet());


I might miss something, but wouldn't it be simpler to sort and then compare adjacent name pairs?
(of course you need logic to handle the dots)
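
One way the sort-and-compare idea could look, as a sketch rather than the approach the PR ended up with. The "logic to handle the dots" is assumed here to mean comparing names token by token: a plain string sort would place foo.bar-x between foo.bar and foo.bar.baz, because '-' sorts before '.'. The class name SortedNameCheck and the duplicate message wording are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not the code merged in this PR.
final class SortedNameCheck {

    // Orders names token by token, so a name sorts directly before every name it is a prefix of.
    private static final Comparator<List<String>> BY_TOKENS = (a, b) -> {
        int shared = Math.min(a.size(), b.size());
        for (int i = 0; i < shared; i++) {
            int c = a.get(i).compareTo(b.get(i));
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.size(), b.size());
    };

    static List<String> validate(Collection<String> usedNames) {
        List<List<String>> sorted = new ArrayList<>();
        for (String name : usedNames) {
            // The -1 limit keeps trailing empty tokens, so "a." and "a" stay distinct.
            sorted.add(Arrays.asList(name.split("\\.", -1)));
        }
        sorted.sort(BY_TOKENS);

        List<String> failures = new ArrayList<>();
        for (int i = 0; i + 1 < sorted.size(); i++) {
            List<String> current = sorted.get(i);
            List<String> next = sorted.get(i + 1);
            String name = String.join(".", current);
            if (current.equals(next)) {
                // Message wording is an assumption.
                failures.add("duplicate field [" + name + "] detected");
            } else if (next.size() > current.size() && next.subList(0, current.size()).equals(current)) {
                failures.add("field [" + name + "] cannot be both an object and a field");
            }
        }
        return failures;
    }
}
```

The trade-off is an O(n log n) sort in exchange for needing only adjacent comparisons and no extra set of leaf names.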

hendrikmuhs · 2019-05-08T09:41:23Z

.../core/src/main/java/org/elasticsearch/xpack/core/dataframe/transforms/pivot/PivotConfig.java

+        }
+
+        for (String fullName : usedNames) {
+            String[] tokens = fullName.split("\\.");


you omit the dots, so what if I have foo.bar.baz and foobar? If I get it right, this would fail validation.

You are correct. I need to fix that
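
The collision raised above can be reproduced in isolation. This is a sketch that assumes the prefix was rebuilt by appending tokens without a separator; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo, not the code from this PR.
final class PrefixJoinDemo {

    // Rebuilds a prefix by concatenating the first `count` tokens with no separator.
    static String joinWithoutDots(String[] tokens, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    // Rebuilds the same prefix but restores the dots between tokens.
    static String joinWithDots(String[] tokens, int count) {
        return String.join(".", Arrays.copyOfRange(tokens, 0, count));
    }
}
```

For the tokens of foo.bar.baz, the first two joined without dots give foobar, which would wrongly match an unrelated leaf named foobar; joined with dots they give foo.bar.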

hendrikmuhs · 2019-05-08T09:44:52Z

.../core/src/main/java/org/elasticsearch/xpack/core/dataframe/transforms/pivot/PivotConfig.java

+    }
+
+
+    private static void addAggNames(AggregationBuilder aggregationBuilder, List<String> names) {


nit: could be just Collection<String> ?

hendrikmuhs · 2019-05-08T09:45:25Z

.../core/src/main/java/org/elasticsearch/xpack/core/dataframe/transforms/pivot/PivotConfig.java

+        aggregationBuilder.getPipelineAggregations().forEach(agg -> addAggNames(agg, names));
+    }
+
+    private static void addAggNames(PipelineAggregationBuilder pipelineAggregationBuilder, List<String> names) {


nit: as above, could be just Collection ?

hendrikmuhs · 2019-05-08T09:47:41Z

.../src/test/java/org/elasticsearch/xpack/core/dataframe/transforms/pivot/PivotConfigTests.java

@@ -136,6 +139,74 @@ public void testDoubleAggs() throws IOException {
        expectThrows(IllegalArgumentException.class, () -> createPivotConfigFromString(pivot, false));
    }

+    public void testAggNameValidations() throws IOException {


The tests are good, but I wonder if aggFieldValidation(...) could be refactored so that it can be tested at the unit-test level, with less boilerplate and more coverage?

przemekwitek · 2019-05-08T11:21:51Z

.../core/src/main/java/org/elasticsearch/xpack/core/dataframe/transforms/pivot/PivotConfig.java

+
+        List<String> validationFailures = new ArrayList<>();
+        List<String> usedNames = new ArrayList<>();
+        // TODO this will need changed once we allow multi-bucket aggs + field merging


"need changed" -> "need to be changed"?

przemekwitek · 2019-05-08T11:28:45Z

.../core/src/main/java/org/elasticsearch/xpack/core/dataframe/transforms/pivot/PivotConfig.java

+
+        for (String fullName : usedNames) {
+            String[] tokens = fullName.split("\\.");
+            for (int i = tokens.length - 1; i > 0; i--) {


Could you explain why you iterate backwards and create a separate StringBuilder in each iteration?
I would go for something like:
for (String fullName : usedNames) {
    String[] tokens = fullName.split("\\.");
    StringBuilder prefix = new StringBuilder();
    for each token:
        prefix.append(token)
        check in "leafNames"
}

I believe this code will be both more performant and easier to read. Please LMK if I'm missing something here.
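
The pseudocode above could be fleshed out roughly as follows. This is a sketch under assumptions, not the merged implementation: the class and method names are hypothetical, leafNames is assumed to be the set of all configured output names, and a dot is appended between tokens so that foo.bar.baz does not collide with foobar.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not the code merged in this PR.
final class ForwardPrefixCheck {

    static List<String> objectFieldConflicts(Collection<String> usedNames) {
        Set<String> leafNames = new LinkedHashSet<>(usedNames);
        Set<String> conflicts = new LinkedHashSet<>();

        for (String fullName : leafNames) {
            String[] tokens = fullName.split("\\.");
            // One builder per name, grown front to back.
            StringBuilder prefix = new StringBuilder();
            // Stop before the last token: the full name is always a leaf itself.
            for (int i = 0; i < tokens.length - 1; i++) {
                if (i > 0) {
                    prefix.append('.');
                }
                prefix.append(tokens[i]);
                String candidate = prefix.toString();
                if (leafNames.contains(candidate)) {
                    conflicts.add(candidate);
                }
            }
        }

        List<String> failures = new ArrayList<>();
        for (String conflict : conflicts) {
            failures.add("field [" + conflict + "] cannot be both an object and a field");
        }
        return failures;
    }
}
```

Collecting conflicts in a set means a prefix shared by several longer names is reported once.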

przemekwitek · 2019-05-08T11:34:31Z

.../src/test/java/org/elasticsearch/xpack/core/dataframe/transforms/pivot/PivotConfigTests.java

+        assertFalse(fieldValidation.isEmpty());
+        assertThat(fieldValidation.get(0), equalTo("field [user] cannot be both an object and a field"));
+
+        pivotAggs = "{"


Usually it is a good idea to split such a long test method with independent tests into a few (here, 4) shorter methods. This makes the tests more "unit", thus increasing readability.

hendrikmuhs · 2019-05-09T06:59:44Z

run elasticsearch-ci/1

hendrikmuhs

LGTM

przemekwitek

LGTM

…41895) * [ML] verify that there are no duplicate leaf fields in aggs * addressing pr comments * addressing PR comments * optmizing duplication check

…42025) * [ML] verify that there are no duplicate leaf fields in aggs * addressing pr comments * addressing PR comments * optmizing duplication check

[ML] verify that there are no duplicate leaf fields in aggs

4b6b5dd

benwtrent added >non-issue v8.0.0 v7.2.0 :ml/Transform Transform labels May 7, 2019

benwtrent mentioned this pull request May 7, 2019

[ML] properly nesting objects in document source #41901

Merged

przemekwitek self-requested a review May 8, 2019 05:55

hendrikmuhs reviewed May 8, 2019

przemekwitek reviewed May 8, 2019

benwtrent added 2 commits May 8, 2019 10:58

addressing pr comments

6395423

addressing PR comments

84c7cfa

hendrikmuhs approved these changes May 9, 2019

przemekwitek reviewed May 9, 2019

optmizing duplication check

5ddcd80

przemekwitek approved these changes May 9, 2019

benwtrent merged commit 0531987 into elastic:master May 9, 2019

benwtrent deleted the feature/ml-df-disallow-duplicate-leaf-fields branch May 9, 2019 15:51

benwtrent mentioned this pull request May 9, 2019

[ML] verify that there are no duplicate leaf fields in aggs (#41895) #42025

Merged

benwtrent mentioned this pull request May 10, 2019

[ML] properly nesting objects in document source (#41901) #42077

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
