-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] properly nesting objects in document source #41901
[ML] properly nesting objects in document source #41901
Conversation
Pinging @elastic/ml-core |
token, | ||
internalMap.get(token), | ||
value); | ||
assert false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the visible effect from the user perspective when this assertion fires?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is none unless they are running with the -ea
JVM flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added some comments, my only concern are the error logs where we should IMHO throw (I understood both cases shouldn't be possible when validation is in)
@@ -97,4 +97,35 @@ | |||
}); | |||
} | |||
|
|||
@SuppressWarnings("unchecked") | |||
static void updateDocument(Map<String, Object> document, String fieldName, Object value) { | |||
String[] fieldTokens = fieldName.split("\\."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in 99% of cases fieldName
is flat, so I wonder if we should have a shortcut optimization like:
if (fieldName.contains(".")) {
document.put(fieldName, value);
return;
}
and then briefly explain as a code comment what the special handling does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good to me. It may provide a logic flow optimization, but I doubt there will be a performance one as updateDocument
will just add the field and return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at how String#split
is written, it does a handful of constant time calculations, and since we are splitting on a single char .
, String#split
when the char does not exist in the string is essentially the same as doing String.contains
once.
I will add a clause that checks if (fieldTokens.length == 1)
as an early shortcut, but I don't think it will cause any performance increase.
...src/main/java/org/elasticsearch/xpack/dataframe/transforms/pivot/AggregationResultUtils.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/xpack/dataframe/transforms/pivot/AggregationResultUtils.java
Outdated
Show resolved
Hide resolved
run elasticsearch-ci/docbldesx |
run elasticsearch-ci/1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Thanks for taking the extra mile fixing the flawed error logging.
* [ML] properly nesting objects in document source * Throw exception on agg extraction failure, cause it to fail df * throwing error to stop df if unsupported agg is found
* elastic/master: (84 commits) [ML] adds geo_centroid aggregation support to data frames (elastic#42088) Add documentation for calendar/fixed intervals (elastic#41919) Remove global checkpoint assertion in peer recovery (elastic#41987) Don't create tempdir for cli scripts (elastic#41913) Fix debian-8 update (elastic#42056) Cleanup plugin bin directories (elastic#41907) Prevent order being lost for _nodes API filters (elastic#42045) Change IndexAnalyzers default analyzer access (elastic#42011) Remove reference to fs.data.spins in docs Mute failing AsyncTwoPhaseIndexerTests Remove close method in PageCacheRecycler/Recycler (elastic#41917) [ML] adding pivot.max_search_page_size option for setting paging size (elastic#41920) Docs: Tweak list formatting Simplify handling of keyword field normalizers (elastic#42002) [ML] properly nesting objects in document source (elastic#41901) Remove extra `ms` from log message (elastic#42068) Increase the sample space for random inner hits name generator (elastic#42057) Recognise direct buffers in heap size docs (elastic#42070) shouldRollGeneration should execute under read lock (elastic#41696) Wait for active shard after close in mixed cluster (elastic#42029) ...
* [ML] properly nesting objects in document source * Throw exception on agg extraction failure, cause it to fail df * throwing error to stop df if unsupported agg is found
While working through use-cases, I found it impossible to push data to a
range
type field mapping. This was because we were not properly nesting documents in the JSON we are pushing to the index.This change creates objects in the document source itself. This has two major benefits:
Example use case that is enabled with this change:
This will result in an index where range queries are possible to determine which clients accessed the website over a given range.
Implementation details:
group_by
fields as I could not think of a use case for it. The mapping created still treats the fields as an object (machine
in the above use case), the document source just does not show it as plainly. I could be convinced otherwise :)