Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Return both Joda and Java formats from structure finder #33900

Merged

Conversation

droberts195
Copy link
Contributor

Previously the timestamp_formats field in the response
from the find_file_structure endpoint contained Joda
timestamp formats. This change makes that clear by
renaming the field to joda_timestamp_formats, and also
adds a java_timestamp_formats field containing the
equivalent Java time format strings.

Previously the timestamp_formats field in the response
from the find_file_structure endpoint contained Joda
timestamp formats.  This change makes that clear by
renaming the field to joda_timestamp_formats, and also
adds a java_timestamp_formats field containing the
equivalent Java time format strings.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

`timestamp_field` query parameter.
<8> `joda_timestamp_formats` are used to tell Logstash and Ingest pipeline how
to parse timestamps.
<9> In future Ingest pipeline will switch to use `java_timestamp_formats`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should first just say something along the lines:

"java_timestamp_formats" are the java time formats recognized in the time fields. In the future, Ingest pipeline...

@@ -652,8 +680,14 @@ public FileStructure build() {
throw new IllegalStateException("enum value [" + format + "] missing from switch.");
}

if ((timestampField == null) != (timestampFormats == null || timestampFormats.isEmpty())) {
throw new IllegalArgumentException("Timestamp field and timestamp formats must both be specified or neither be specified.");
if ((timestampField == null) != (jodaTimestampFormats == null || jodaTimestampFormats.isEmpty())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: No need to change but I think this would read more easily if it was:

if (timestampField != null && (jodaTimestampFormats == null || jodaTimestampFormats.isEmpty())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be equivalent it would have to say:

if ((timestampField != null && (jodaTimestampFormats == null || jodaTimestampFormats.isEmpty())) || (timestampField == null && jodaTimestampFormats != null && jodaTimestampFormats.isEmpty() == false))

I could introduce some local variables and make it:

if (wasTimestampFieldSpecified != wasJodaTimestampFormatsSpecified)

Or I could split it over two if statements, which would allow for different error messages saying exactly which setting was specified and which wasn't. (The reason I didn't bother with separate error messages originally is that this validation is not end-user validation - it's just early bug finding for developers who change the functionality in the future. But if the current code is too hard to read then I can.)

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the local variables approach would definitely make it clear enough.

@@ -150,7 +152,8 @@ public String toString() {
private final Character quote;
private final Boolean shouldTrimFields;
private final String grokPattern;
private final List<String> timestampFormats;
private final List<String> jodaTimestampFormats;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading through the code, I wonder if it would make things simpler to encapsulate this in a TimestampFormat class which will then own the joda/java pairs (or additional types if the future brings them). That would create that sort of 1-1 mapping between the alternatives. Just a thought

Copy link
Contributor Author

@droberts195 droberts195 Sep 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that at the level of individual timestamp formats we want the Joda/Java formats grouped into pairs.

The output of the endpoint can currently contain something like this:

  "joda_timestamp_formats" : [
    "MMM dd YYYY HH:mm:ss", "MMM  d YYYY HH:mm:ss"
  ],
  "java_timestamp_formats" : [
    "MMM dd yyyy HH:mm:ss", "MMM  d yyyy HH:mm:ss"
  ],

As a consumer of this information you'd want either the Joda formats or the Java formats, never both. Suppose the endpoint returned something like this:

  "timestamp_formats" : [
    {
      "joda" : "MMM dd YYYY HH:mm:ss",
      "java" : "MMM dd yyyy HH:mm:ss"
    },
    {
      "joda" : "MMM  d YYYY HH:mm:ss",
      "java" : "MMM  d yyyy HH:mm:ss"
    }
  ]

That would be very annoying to use for somebody who wanted either the two Joda formats or the two Java formats. All consumers would end up having to write their own transformation function to get back to a list of formats of the same type.

So the other way to avoid the doubled method calls in a few places would be to have a TimestampFormats (plural) container that contains both lists of formats. The problem I foresee with this is that eventually we'll want to completely remove the Joda formats, so then the class will only have one member and will just be an annoyance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. Sounds good!

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@droberts195 droberts195 merged commit dfe5af0 into elastic:master Sep 25, 2018
@droberts195 droberts195 deleted the java_and_joda_timestamp_formats branch September 25, 2018 11:52
droberts195 added a commit that referenced this pull request Sep 25, 2018
Previously the timestamp_formats field in the response
from the find_file_structure endpoint contained Joda
timestamp formats.  This change makes that clear by
renaming the field to joda_timestamp_formats, and also
adds a java_timestamp_formats field containing the
equivalent Java time format strings.
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Sep 25, 2018
@rjernst
Copy link
Member

rjernst commented Sep 26, 2018

@droberts195 Isn't this a breaking change in the response structure?

@dimitris-athanasiou
Copy link
Contributor

@rjernst The log structure finder is an unreleased feature. So no breaking changes yet :-)

@rjernst
Copy link
Member

rjernst commented Sep 26, 2018

Are these formats expected to be fed back into elasticsearch? How would a user know when to choose joda vs java format? The plan for forward compatibility for java time formats in elasticsearch is described here: #27330 (comment). Instead of making a user know when to choose which format returned, could we instead produce the same formats that will be handled once that is complete in our format parsing code (which I am working on for 6.6)?

@droberts195
Copy link
Contributor Author

Are these formats expected to be fed back into elasticsearch? How would a user know when to choose joda vs java format?

I imagine few if any users will call this endpoint directly. The main user initially will be the ML UI. In 6.5 it knows to use the Joda formats. If Ingest pipeline supports Java formats in 6.6 then we'll switch the ML UI over to Java formats for 6.6 so it's using the future proof solution as soon as possible. But there's also been talk of the Ingest team using this endpoint for some sort of automatic config creation tool, and I don't think Logstash is nearly as advanced in moving over (elastic/logstash#7149) so it seemed best to output both formats. If Elasticsearch is fully switched over by 6.6 then the docs from 6.6 onwards can simply say to use Java time for Elasticsearch, Joda time for Logstash.

@rjernst
Copy link
Member

rjernst commented Sep 26, 2018

"switched over" isn't how I would describe what the state will be in 6.6. The comment I linked to before describes a special marker we will use at the beginning of a format string to mean "this is a java 8 format". We only plan to add this for the cases of format specifiers that changed between joda and java 8. For those that are the same, there is no need to use this specifier.

kcm pushed a commit that referenced this pull request Oct 30, 2018
Previously the timestamp_formats field in the response
from the find_file_structure endpoint contained Joda
timestamp formats.  This change makes that clear by
renaming the field to joda_timestamp_formats, and also
adds a java_timestamp_formats field containing the
equivalent Java time format strings.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants