Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] properly nesting objects in document source #41901

Conversation

benwtrent
Copy link
Member

While working through use-cases, I found it impossible to push data to a range type field mapping. This was because we were not properly nesting documents in the JSON we are pushing to the index.

This change creates objects in the document source itself. This has two major benefits:

  • Allows _preview to show the mapped objects as they will be actually stored
  • Allows for more interesting and complex use cases with user provided mappings

Example use case that is enabled with this change:

PUT data-logs-by-client
{
  "mappings": {
    "properties": {
      "time_frame": {
          "type": "date_range"
        }
    }
  }
}

PUT _data_frame/transforms/data_log
{
  "source": {
    "index": "kibana_sample_data_logs"
  },
  "dest": {
    "index": "data-logs-by-client"
  },
  "pivot": {
    "group_by": {
      "machine.os": {"terms": {"field": "machine.os.keyword"}},
      "machine.ip": {"terms": {"field": "clientip"}}
    },
    "aggregations": {
      "time_frame.lte": {
        "max": {
          "field": "timestamp"
        }
      },
      "time_frame.gte": {
        "min": {
          "field": "timestamp"
        }
      }
    }
  }
}

This will result in an index where range queries are possible to determine which clients accessed the website over a given range.

Implementation details:

  • I did not choose to create objects in the group_by fields as I could not think of a use case for it. The mapping created still treats the fields as an object (machine in the above use case), the document source just does not show it as plainly. I could be convinced otherwise :)
  • I am not throwing an error when parsing and discovering duplicate fields, or objects that conflict. These validations should occur earlier in the process (see: [ML] verify that there are no duplicate leaf fields in aggs #41895) and any errors here should be logged then allowed, this is consistent with how we treat unsupported aggregations - validate ahead of time and log if something weird occurred.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@przemekwitek przemekwitek self-requested a review May 8, 2019 05:55
token,
internalMap.get(token),
value);
assert false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the visible effect from the user perspective when this assertion fires?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is none unless they are running with the -ea JVM flag.

Copy link

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some comments, my only concern are the error logs where we should IMHO throw (I understood both cases shouldn't be possible when validation is in)

@@ -97,4 +97,35 @@
});
}

@SuppressWarnings("unchecked")
static void updateDocument(Map<String, Object> document, String fieldName, Object value) {
String[] fieldTokens = fieldName.split("\\.");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in 99% of cases fieldName is flat, so I wonder if we should have a shortcut optimization like:

if (fieldName.contains(".")) {
    document.put(fieldName, value);
    return;
}

and then briefly explain as a code comment what the special handling does.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good to me. It may provide a logic flow optimization, but I doubt there will be a performance one as updateDocument will just add the field and return.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at how String#split is written, it does a handful of constant time calculations, and since we are splitting on a single char ., String#split when the char does not exist in the string is essentially the same as doing String.contains once.

I will add a clause that checks if (fieldTokens.length == 1) as an early shortcut, but I don't think it will cause any performance increase.

@benwtrent
Copy link
Member Author

run elasticsearch-ci/docbldesx

@benwtrent benwtrent requested a review from hendrikmuhs May 9, 2019 15:15
@benwtrent
Copy link
Member Author

run elasticsearch-ci/1

Copy link

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Thanks for taking the extra mile fixing the flawed error logging.

@benwtrent benwtrent merged commit c1d31f6 into elastic:master May 10, 2019
@benwtrent benwtrent deleted the feature/ml-df-better-handling-of-object-fields branch May 10, 2019 13:14
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request May 10, 2019
* [ML] properly nesting objects in document source

* Throw exception on agg extraction failure, cause it to fail df

* throwing error to stop df if unsupported agg is found
benwtrent added a commit that referenced this pull request May 10, 2019
* [ML] properly nesting objects in document source

* Throw exception on agg extraction failure, cause it to fail df

* throwing error to stop df if unsupported agg is found
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 10, 2019
* elastic/master: (84 commits)
  [ML] adds geo_centroid aggregation support to data frames (elastic#42088)
  Add documentation for calendar/fixed intervals (elastic#41919)
  Remove global checkpoint assertion in peer recovery (elastic#41987)
  Don't create tempdir for cli scripts (elastic#41913)
  Fix debian-8 update (elastic#42056)
  Cleanup plugin bin directories (elastic#41907)
  Prevent order being lost for _nodes API filters (elastic#42045)
  Change IndexAnalyzers default analyzer access (elastic#42011)
  Remove reference to fs.data.spins in docs
  Mute failing AsyncTwoPhaseIndexerTests
  Remove close method in PageCacheRecycler/Recycler (elastic#41917)
  [ML] adding pivot.max_search_page_size option for setting paging size (elastic#41920)
  Docs: Tweak list formatting
  Simplify handling of keyword field normalizers (elastic#42002)
  [ML] properly nesting objects in document source (elastic#41901)
  Remove extra `ms` from log message (elastic#42068)
  Increase the sample space for random inner hits name generator (elastic#42057)
  Recognise direct buffers in heap size docs (elastic#42070)
  shouldRollGeneration should execute under read lock (elastic#41696)
  Wait for active shard after close in mixed cluster (elastic#42029)
  ...
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
* [ML] properly nesting objects in document source

* Throw exception on agg extraction failure, cause it to fail df

* throwing error to stop df if unsupported agg is found
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants