[ML] properly nesting objects in document source #41901

benwtrent · 2019-05-07T15:00:20Z

While working through use cases, I found it impossible to push data to a field mapped as a range type. This was because we were not properly nesting objects in the JSON we push to the index.

This change creates objects in the document source itself. This has two major benefits:

Allows _preview to show the mapped objects as they will actually be stored
Allows for more interesting and complex use cases with user provided mappings

Example use case that is enabled with this change:

PUT data-logs-by-client
{
  "mappings": {
    "properties": {
      "time_frame": {
          "type": "date_range"
        }
    }
  }
}

PUT _data_frame/transforms/data_log
{
  "source": {
    "index": "kibana_sample_data_logs"
  },
  "dest": {
    "index": "data-logs-by-client"
  },
  "pivot": {
    "group_by": {
      "machine.os": {"terms": {"field": "machine.os.keyword"}},
      "machine.ip": {"terms": {"field": "clientip"}}
    },
    "aggregations": {
      "time_frame.lte": {
        "max": {
          "field": "timestamp"
        }
      },
      "time_frame.gte": {
        "min": {
          "field": "timestamp"
        }
      }
    }
  }
}

This results in an index that supports range queries to determine which clients accessed the website during a given time range.
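To make the effect on the document source concrete, here is a minimal Java sketch of how the two dotted aggregation names above could be folded into a single time_frame object. It is not the code from this PR: NestedSourceSketch and putNested are hypothetical names, the timestamps are made-up sample values, and throwing on a conflicting parent is only one possible choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NestedSourceSketch {

    // Hypothetical helper (not the PR's implementation): stores value under a
    // dotted name, creating intermediate maps for every token but the last.
    @SuppressWarnings("unchecked")
    static void putNested(Map<String, Object> document, String fieldName, Object value) {
        String[] tokens = fieldName.split("\\.");
        if (tokens.length == 0) {
            // A name made only of dots splits into an empty array.
            throw new IllegalArgumentException("invalid field name [" + fieldName + "]");
        }
        Map<String, Object> current = document;
        for (int i = 0; i < tokens.length - 1; i++) {
            Object child = current.get(tokens[i]);
            if (child == null) {
                child = new LinkedHashMap<String, Object>();
                current.put(tokens[i], child);
            } else if ((child instanceof Map) == false) {
                // Illustrative choice only: a leaf value already occupies this parent.
                throw new IllegalArgumentException("field [" + tokens[i] + "] already holds a value");
            }
            current = (Map<String, Object>) child;
        }
        current.put(tokens[tokens.length - 1], value);
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        // Made-up sample values standing in for the max/min of "timestamp".
        putNested(doc, "time_frame.lte", "2019-05-07T00:00:00Z");
        putNested(doc, "time_frame.gte", "2019-05-01T00:00:00Z");
        // Both bounds now live under one "time_frame" object.
        System.out.println(doc);
    }
}
```

With the source shaped this way, the destination document carries one time_frame object with gte and lte keys, which is the form a date_range field expects.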

Implementation details:

I chose not to create objects for the group_by fields, as I could not think of a use case for it. The mapping created still treats the fields as an object (machine in the above use case); the document source just does not show it as plainly. I could be convinced otherwise :)
I am not throwing an error when parsing discovers duplicate fields or conflicting objects. These validations should occur earlier in the process (see: [ML] verify that there are no duplicate leaf fields in aggs #41895), and any errors here should be logged and then allowed. This is consistent with how we treat unsupported aggregations: validate ahead of time and log if something weird occurred.

elasticmachine · 2019-05-07T15:00:25Z

Pinging @elastic/ml-core

przemekwitek · 2019-05-08T11:53:09Z

...src/main/java/org/elasticsearch/xpack/dataframe/transforms/pivot/AggregationResultUtils.java

+                        token,
+                        internalMap.get(token),
+                        value);
+                    assert false;


What is the visible effect from the user perspective when this assertion fires?

There is none unless they are running with the -ea JVM flag.
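The point about -ea can be checked with a small standalone Java sketch. AssertVisibility and its methods are hypothetical names for illustration and are not part of the PR; the sketch only shows that an assert statement is skipped entirely unless the JVM was started with assertions enabled.

```java
public class AssertVisibility {

    // Standard idiom: the assignment inside the assert only executes when
    // assertions are enabled for this class (for example via -ea).
    static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true;
        return enabled;
    }

    // Returns true only if "assert false" actually raised an AssertionError.
    static boolean assertFalseThrows() {
        try {
            assert false : "conflicting field";
            return false;
        } catch (AssertionError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
        System.out.println("assert false threw: " + assertFalseThrows());
    }
}
```

Running it once with `java AssertVisibility` and once with `java -ea AssertVisibility` should show the two values flipping together, which is why a production user would only ever see the log line.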

hendrikmuhs

Added some comments. My only concern is the error logs, where we should IMHO throw (as I understand it, neither case should be possible once validation is in).

hendrikmuhs · 2019-05-09T07:17:44Z

...src/main/java/org/elasticsearch/xpack/dataframe/transforms/pivot/AggregationResultUtils.java

@@ -97,4 +97,35 @@
        });
    }

+    @SuppressWarnings("unchecked")
+    static void updateDocument(Map<String, Object> document, String fieldName, Object value) {
+        String[] fieldTokens = fieldName.split("\\.");


in 99% of cases fieldName is flat, so I wonder if we should have a shortcut optimization like:

if (fieldName.contains(".") == false) { document.put(fieldName, value); return; }

and then briefly explain as a code comment what the special handling does.

That sounds good to me. It may provide a logic flow optimization, but I doubt there will be a performance one as updateDocument will just add the field and return.

Looking at how String#split is written, it does a handful of constant time calculations, and since we are splitting on a single char ., String#split when the char does not exist in the string is essentially the same as doing String.contains once.

I will add a clause that checks if (fieldTokens.length == 1) as an early shortcut, but I don't think it will cause any performance increase.
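As a hedged aside, the two ways of detecting a flat name discussed here can be compared directly. FlatNameCheck and its methods are hypothetical names, not code from the PR. For ordinary names the checks agree, but String#split drops trailing empty strings, so a name ending in a dot is one case where they differ. Whether such a name can ever reach this code path is not established in this thread.

```java
public class FlatNameCheck {

    // Mirrors the "fieldTokens.length == 1" shortcut.
    static boolean flatBySplit(String fieldName) {
        return fieldName.split("\\.").length == 1;
    }

    // Mirrors the reviewer's contains-based shortcut.
    static boolean flatByContains(String fieldName) {
        return fieldName.contains(".") == false;
    }

    public static void main(String[] args) {
        // "machine." is a contrived edge case: split yields ["machine"].
        String[] names = {"clientip", "time_frame.gte", "machine."};
        for (String name : names) {
            System.out.println(name
                + " split=" + flatBySplit(name)
                + " contains=" + flatByContains(name));
        }
    }
}
```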

benwtrent · 2019-05-09T15:15:17Z

run elasticsearch-ci/docbldesx

benwtrent · 2019-05-09T16:12:39Z

run elasticsearch-ci/1

hendrikmuhs

LGTM!

Thanks for going the extra mile in fixing the flawed error logging.

* [ML] properly nesting objects in document source
* Throw exception on agg extraction failure, cause it to fail df
* throwing error to stop df if unsupported agg is found

* elastic/master: (84 commits)
  [ML] adds geo_centroid aggregation support to data frames (elastic#42088)
  Add documentation for calendar/fixed intervals (elastic#41919)
  Remove global checkpoint assertion in peer recovery (elastic#41987)
  Don't create tempdir for cli scripts (elastic#41913)
  Fix debian-8 update (elastic#42056)
  Cleanup plugin bin directories (elastic#41907)
  Prevent order being lost for _nodes API filters (elastic#42045)
  Change IndexAnalyzers default analyzer access (elastic#42011)
  Remove reference to fs.data.spins in docs
  Mute failing AsyncTwoPhaseIndexerTests
  Remove close method in PageCacheRecycler/Recycler (elastic#41917)
  [ML] adding pivot.max_search_page_size option for setting paging size (elastic#41920)
  Docs: Tweak list formatting
  Simplify handling of keyword field normalizers (elastic#42002)
  [ML] properly nesting objects in document source (elastic#41901)
  Remove extra `ms` from log message (elastic#42068)
  Increase the sample space for random inner hits name generator (elastic#42057)
  Recognise direct buffers in heap size docs (elastic#42070)
  shouldRollGeneration should execute under read lock (elastic#41696)
  Wait for active shard after close in mixed cluster (elastic#42029)
  ...

[ML] properly nesting objects in document source

8aa6b65

benwtrent added >non-issue v8.0.0 v7.2.0 :ml/Transform Transform labels May 7, 2019

przemekwitek self-requested a review May 8, 2019 05:55

przemekwitek approved these changes May 8, 2019

hendrikmuhs reviewed May 9, 2019

Throw exception on agg extraction failure, cause it to fail df

c00c2fa

benwtrent requested a review from hendrikmuhs May 9, 2019 15:15

throwing error to stop df if unsupported agg is found

66b17da

hendrikmuhs approved these changes May 10, 2019

benwtrent merged commit c1d31f6 into elastic:master May 10, 2019

benwtrent deleted the feature/ml-df-better-handling-of-object-fields branch May 10, 2019 13:14

benwtrent mentioned this pull request May 10, 2019

[ML] properly nesting objects in document source (#41901) #42077

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
