-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Return both Joda and Java formats from structure finder #33900
[ML] Return both Joda and Java formats from structure finder #33900
Conversation
Previously the timestamp_formats field in the response from the find_file_structure endpoint contained Joda timestamp formats. This change makes that clear by renaming the field to joda_timestamp_formats, and also adds a java_timestamp_formats field containing the equivalent Java time format strings.
Pinging @elastic/ml-core |
`timestamp_field` query parameter. | ||
<8> `joda_timestamp_formats` are used to tell Logstash and Ingest pipeline how | ||
to parse timestamps. | ||
<9> In future Ingest pipeline will switch to use `java_timestamp_formats`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should first just say something along the lines:
"java_timestamp_formats" are the java time formats recognized in the time fields. In the future, Ingest pipeline...
@@ -652,8 +680,14 @@ public FileStructure build() { | |||
throw new IllegalStateException("enum value [" + format + "] missing from switch."); | |||
} | |||
|
|||
if ((timestampField == null) != (timestampFormats == null || timestampFormats.isEmpty())) { | |||
throw new IllegalArgumentException("Timestamp field and timestamp formats must both be specified or neither be specified."); | |||
if ((timestampField == null) != (jodaTimestampFormats == null || jodaTimestampFormats.isEmpty())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: No need to change but I think this would read more easily if it was:
if (timestampField != null && (jodaTimestampFormats == null || jodaTimestampFormats.isEmpty())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be equivalent it would have to say:
if ((timestampField != null && (jodaTimestampFormats == null || jodaTimestampFormats.isEmpty())) || (timestampField == null && jodaTimestampFormats != null && jodaTimestampFormats.isEmpty() == false))
I could introduce some local variables and make it:
if (wasTimestampFieldSpecified != wasJodaTimestampFormatsSpecified)
Or I could split it over two if
statements, which would allow for different error messages saying exactly which setting was specified and which wasn't. (The reason I didn't bother with separate error messages originally is that this validation is not end-user validation - it's just early bug finding for developers who change the functionality in the future. But if the current code is too hard to read then I can.)
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the local variables approach would definitely make it clear enough.
@@ -150,7 +152,8 @@ public String toString() { | |||
private final Character quote; | |||
private final Boolean shouldTrimFields; | |||
private final String grokPattern; | |||
private final List<String> timestampFormats; | |||
private final List<String> jodaTimestampFormats; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading through the code, I wonder if it would make things simpler to encapsulate this in a TimestampFormat
class which will then own the joda/java pairs (or additional types if the future brings them). That would create that sort of 1-1 mapping between the alternatives. Just a thought
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that at the level of individual timestamp formats we want the Joda/Java formats grouped into pairs.
The output of the endpoint can currently contain something like this:
"joda_timestamp_formats" : [
"MMM dd YYYY HH:mm:ss", "MMM d YYYY HH:mm:ss"
],
"java_timestamp_formats" : [
"MMM dd yyyy HH:mm:ss", "MMM d yyyy HH:mm:ss"
],
As a consumer of this information you'd want either the Joda formats or the Java formats, never both. Suppose the endpoint returned something like this:
"timestamp_formats" : [
{
"joda" : "MMM dd YYYY HH:mm:ss",
"java" : "MMM dd yyyy HH:mm:ss"
},
{
"joda" : "MMM d YYYY HH:mm:ss",
"java" : "MMM d yyyy HH:mm:ss"
}
]
That would be very annoying to use for somebody who wanted either the two Joda formats or the two Java formats. All consumers would end up having to write their own transformation function to get back to a list of formats of the same type.
So the other way to avoid the doubled method calls in a few places would be to have a TimestampFormats
(plural) container that contains both lists of formats. The problem I foresee with this is that eventually we'll want to completely remove the Joda formats, so then the class will only have one member and will just be an annoyance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. Sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Previously the timestamp_formats field in the response from the find_file_structure endpoint contained Joda timestamp formats. This change makes that clear by renaming the field to joda_timestamp_formats, and also adds a java_timestamp_formats field containing the equivalent Java time format strings.
@droberts195 Isn't this a breaking change in the response structure? |
@rjernst The log structure finder is an unreleased feature. So no breaking changes yet :-) |
Are these formats expected to be fed back into elasticsearch? How would a user know when to choose joda vs java format? The plan for forward compatibility for java time formats in elasticsearch is described here: #27330 (comment). Instead of making a user know when to choose which format returned, could we instead produce the same formats that will be handled once that is complete in our format parsing code (which I am working on for 6.6)? |
I imagine few if any users will call this endpoint directly. The main user initially will be the ML UI. In 6.5 it knows to use the Joda formats. If Ingest pipeline supports Java formats in 6.6 then we'll switch the ML UI over to Java formats for 6.6 so it's using the future proof solution as soon as possible. But there's also been talk of the Ingest team using this endpoint for some sort of automatic config creation tool, and I don't think Logstash is nearly as advanced in moving over (elastic/logstash#7149) so it seemed best to output both formats. If Elasticsearch is fully switched over by 6.6 then the docs from 6.6 onwards can simply say to use Java time for Elasticsearch, Joda time for Logstash. |
"switched over" isn't how I would describe what the state will be in 6.6. The comment I linked to before describes a special marker we will use at the beginning of a format string to mean "this is a java 8 format". We only plan to add this for the cases of format specifiers that changed between joda and java 8. For those that are the same, there is no need to use this specifier. |
Previously the timestamp_formats field in the response from the find_file_structure endpoint contained Joda timestamp formats. This change makes that clear by renaming the field to joda_timestamp_formats, and also adds a java_timestamp_formats field containing the equivalent Java time format strings.
Previously the timestamp_formats field in the response
from the find_file_structure endpoint contained Joda
timestamp formats. This change makes that clear by
renaming the field to joda_timestamp_formats, and also
adds a java_timestamp_formats field containing the
equivalent Java time format strings.