[ML] Return both Joda and Java formats from structure finder #33900

droberts195 · 2018-09-20T14:03:36Z

Previously the timestamp_formats field in the response
from the find_file_structure endpoint contained Joda
timestamp formats. This change makes that clear by
renaming the field to joda_timestamp_formats, and also
adds a java_timestamp_formats field containing the
equivalent Java time format strings.

Previously the timestamp_formats field in the response from the find_file_structure endpoint contained Joda timestamp formats. This change makes that clear by renaming the field to joda_timestamp_formats, and also adds a java_timestamp_formats field containing the equivalent Java time format strings.

elasticmachine · 2018-09-20T14:03:41Z

Pinging @elastic/ml-core

dimitris-athanasiou · 2018-09-21T16:24:40Z

docs/reference/ml/apis/find-file-structure.asciidoc

+    `timestamp_field` query parameter.
+<8> `joda_timestamp_formats` are used to tell Logstash and Ingest pipeline how
+    to parse timestamps.
+<9> In future Ingest pipeline will switch to use `java_timestamp_formats`.


I think this should first just say something along the lines:

"java_timestamp_formats" are the java time formats recognized in the time fields. In the future, Ingest pipeline...

dimitris-athanasiou · 2018-09-21T16:27:27Z

...in/core/src/main/java/org/elasticsearch/xpack/core/ml/filestructurefinder/FileStructure.java

@@ -652,8 +680,14 @@ public FileStructure build() {
                    throw new IllegalStateException("enum value [" + format + "] missing from switch.");
            }

-            if ((timestampField == null) != (timestampFormats == null || timestampFormats.isEmpty())) {
-                throw new IllegalArgumentException("Timestamp field and timestamp formats must both be specified or neither be specified.");
+            if ((timestampField == null) != (jodaTimestampFormats == null || jodaTimestampFormats.isEmpty())) {


nit: No need to change but I think this would read more easily if it was:

if (timestampField != null && (jodaTimestampFormats == null || jodaTimestampFormats.isEmpty())

To be equivalent it would have to say:

if ((timestampField != null && (jodaTimestampFormats == null || jodaTimestampFormats.isEmpty())) || (timestampField == null && jodaTimestampFormats != null && jodaTimestampFormats.isEmpty() == false))

I could introduce some local variables and make it:

if (wasTimestampFieldSpecified != wasJodaTimestampFormatsSpecified)

Or I could split it over two if statements, which would allow for different error messages saying exactly which setting was specified and which wasn't. (The reason I didn't bother with separate error messages originally is that this validation is not end-user validation - it's just early bug finding for developers who change the functionality in the future. But if the current code is too hard to read then I can.)

What do you think?

I think the local variables approach would definitely make it clear enough.

dimitris-athanasiou · 2018-09-21T16:31:48Z

...in/core/src/main/java/org/elasticsearch/xpack/core/ml/filestructurefinder/FileStructure.java

@@ -150,7 +152,8 @@ public String toString() {
    private final Character quote;
    private final Boolean shouldTrimFields;
    private final String grokPattern;
-    private final List<String> timestampFormats;
+    private final List<String> jodaTimestampFormats;


Reading through the code, I wonder if it would make things simpler to encapsulate this in a TimestampFormat class which will then own the joda/java pairs (or additional types if the future brings them). That would create that sort of 1-1 mapping between the alternatives. Just a thought

I'm not convinced that at the level of individual timestamp formats we want the Joda/Java formats grouped into pairs.

The output of the endpoint can currently contain something like this:

"joda_timestamp_formats" : [ "MMM dd YYYY HH:mm:ss", "MMM d YYYY HH:mm:ss" ], "java_timestamp_formats" : [ "MMM dd yyyy HH:mm:ss", "MMM d yyyy HH:mm:ss" ],

As a consumer of this information you'd want either the Joda formats or the Java formats, never both. Suppose the endpoint returned something like this:

"timestamp_formats" : [ { "joda" : "MMM dd YYYY HH:mm:ss", "java" : "MMM dd yyyy HH:mm:ss" }, { "joda" : "MMM d YYYY HH:mm:ss", "java" : "MMM d yyyy HH:mm:ss" } ]

That would be very annoying to use for somebody who wanted either the two Joda formats or the two Java formats. All consumers would end up having to write their own transformation function to get back to a list of formats of the same type.

So the other way to avoid the doubled method calls in a few places would be to have a TimestampFormats (plural) container that contains both lists of formats. The problem I foresee with this is that eventually we'll want to completely remove the Joda formats, so then the class will only have one member and will just be an annoyance.

I see your point. Sounds good!

dimitris-athanasiou

LGTM

Previously the timestamp_formats field in the response from the find_file_structure endpoint contained Joda timestamp formats. This change makes that clear by renaming the field to joda_timestamp_formats, and also adds a java_timestamp_formats field containing the equivalent Java time format strings.

rjernst · 2018-09-26T07:15:19Z

@droberts195 Isn't this a breaking change in the response structure?

dimitris-athanasiou · 2018-09-26T07:26:47Z

@rjernst The log structure finder is an unreleased feature. So no breaking changes yet :-)

rjernst · 2018-09-26T07:34:35Z

Are these formats expected to be fed back into elasticsearch? How would a user know when to choose joda vs java format? The plan for forward compatibility for java time formats in elasticsearch is described here: #27330 (comment). Instead of making a user know when to choose which format returned, could we instead produce the same formats that will be handled once that is complete in our format parsing code (which I am working on for 6.6)?

droberts195 · 2018-09-26T08:09:40Z

Are these formats expected to be fed back into elasticsearch? How would a user know when to choose joda vs java format?

I imagine few if any users will call this endpoint directly. The main user initially will be the ML UI. In 6.5 it knows to use the Joda formats. If Ingest pipeline supports Java formats in 6.6 then we'll switch the ML UI over to Java formats for 6.6 so it's using the future proof solution as soon as possible. But there's also been talk of the Ingest team using this endpoint for some sort of automatic config creation tool, and I don't think Logstash is nearly as advanced in moving over (elastic/logstash#7149) so it seemed best to output both formats. If Elasticsearch is fully switched over by 6.6 then the docs from 6.6 onwards can simply say to use Java time for Elasticsearch, Joda time for Logstash.

rjernst · 2018-09-26T08:19:47Z

"switched over" isn't how I would describe what the state will be in 6.6. The comment I linked to before describes a special marker we will use at the beginning of a format string to mean "this is a java 8 format". We only plan to add this for the cases of format specifiers that changed between joda and java 8. For those that are the same, there is no need to use this specifier.

Previously the timestamp_formats field in the response from the find_file_structure endpoint contained Joda timestamp formats. This change makes that clear by renaming the field to joda_timestamp_formats, and also adds a java_timestamp_formats field containing the equivalent Java time format strings.

droberts195 added >enhancement review v7.0.0 :ml Machine learning v6.5.0 labels Sep 20, 2018

Adding tests and fixing bugs

5ab3050

dimitris-athanasiou reviewed Sep 21, 2018

View reviewed changes

droberts195 added 3 commits September 24, 2018 10:09

Merge branch 'master' into java_and_joda_timestamp_formats

67904cb

Address some review comments

81981ee

Fixing tests for Java 8

82efc3a

dimitris-athanasiou approved these changes Sep 25, 2018

View reviewed changes

droberts195 merged commit dfe5af0 into elastic:master Sep 25, 2018

droberts195 deleted the java_and_joda_timestamp_formats branch September 25, 2018 11:52

droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Sep 25, 2018

Adjustments to match the changes of elastic#33900

cf06c02

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ML] Return both Joda and Java formats from structure finder #33900

[ML] Return both Joda and Java formats from structure finder #33900

droberts195 commented Sep 20, 2018

elasticmachine commented Sep 20, 2018

dimitris-athanasiou Sep 21, 2018

dimitris-athanasiou Sep 21, 2018

droberts195 Sep 24, 2018

dimitris-athanasiou Sep 24, 2018

dimitris-athanasiou Sep 21, 2018

droberts195 Sep 25, 2018 •

edited

Loading

dimitris-athanasiou Sep 25, 2018

dimitris-athanasiou left a comment

rjernst commented Sep 26, 2018

dimitris-athanasiou commented Sep 26, 2018

rjernst commented Sep 26, 2018

droberts195 commented Sep 26, 2018

rjernst commented Sep 26, 2018

[ML] Return both Joda and Java formats from structure finder #33900

[ML] Return both Joda and Java formats from structure finder #33900

Conversation

droberts195 commented Sep 20, 2018

elasticmachine commented Sep 20, 2018

dimitris-athanasiou Sep 21, 2018

Choose a reason for hiding this comment

dimitris-athanasiou Sep 21, 2018

Choose a reason for hiding this comment

droberts195 Sep 24, 2018

Choose a reason for hiding this comment

dimitris-athanasiou Sep 24, 2018

Choose a reason for hiding this comment

dimitris-athanasiou Sep 21, 2018

Choose a reason for hiding this comment

droberts195 Sep 25, 2018 • edited Loading

Choose a reason for hiding this comment

dimitris-athanasiou Sep 25, 2018

Choose a reason for hiding this comment

dimitris-athanasiou left a comment

Choose a reason for hiding this comment

rjernst commented Sep 26, 2018

dimitris-athanasiou commented Sep 26, 2018

rjernst commented Sep 26, 2018

droberts195 commented Sep 26, 2018

rjernst commented Sep 26, 2018

droberts195 Sep 25, 2018 •

edited

Loading