
[ML] Mishandling of date format in data frame transform destination index #38926

Closed
lcawl opened this issue Jun 13, 2019 · 11 comments
Labels
bug, Feature:Transforms, :ml, v7.2.1

Comments

@lcawl (Contributor) commented Jun 13, 2019

Kibana version:
7.2 BC6

Elasticsearch version:
7.2 BC6

Server OS version:
macOS v10

Browser version:
Chrome Version 74.0.3729.169

Original install method (e.g. download page, yum, from source, etc.):

Download from build candidate.

Describe the bug:

When I create a data frame transform that generates a date field in the destination index, the date is interpreted incorrectly in the Discover page.

Steps to reproduce:

  1. Install the eCommerce sample data set.
  2. Create a data frame transform using the wizard on the Machine learning page that pivots the data by grouping on the order_date and aggregating on the cardinality of the order_id. NOTE: I explicitly change the calendar_interval of the date_histogram to "1y". Under the covers, I see that it's also adding "format": "yyyy" to the date_histogram.
  3. Confirm that the preview shows a single result with a value of "2019" for the date.
  4. Start the data frame transform and let it run to completion.
  5. Confirm that when you search the contents of the destination index, it shows the correct value ("2019").
  6. Examine the mappings that were created for the destination index. In my case, "order_date" has type date and a format of strict_date_optional_time||epoch_millis||yyyy (see the Dev Console sketch after this list).
  7. Examine the results in the Discover tab. The date there is shown incorrectly as Dec 31, 1969 @ 16:00:02.019.
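
For reference, a minimal Dev Console sketch of steps 5 and 6, assuming the destination index name retest-index from the transform config shown further below:

GET retest-index/_search

GET retest-index/_mapping

The Dec 31, 1969 @ 16:00:02.019 value in Discover is consistent with 2019 being parsed as epoch milliseconds: 1970-01-01T00:00:02.019Z, rendered in a UTC-8 browser timezone.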

Expected behavior:

I would expect the value to be "2019" in the Discover tab too.

Screenshots (if relevant):

Creating the data frame transform (note the correct date value in the preview):

[screenshot]

Correct date value in the destination index:

[screenshot]

Incorrect date value in Discover:

[screenshot]

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

Any additional context:

This is the JSON for my data frame transform:

{
  "id": "retest",
  "source": {
    "index": [
      "kibana_sample_data_ecommerce"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "retest-index"
  },
  "pivot": {
    "group_by": {
      "order_date": {
        "date_histogram": {
          "field": "order_date",
          "calendar_interval": "1y",
          "format": "yyyy"
        }
      }
    },
    "aggregations": {
      "order_id.cardinality": {
        "cardinality": {
          "field": "order_id"
        }
      }
    }
  }
}
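
For reference, the same configuration can be created and started directly through the data frame transforms API rather than the wizard; a sketch, assuming the 7.2 endpoints (the request body is the JSON above without the "id" field, which goes in the URL instead):

PUT _data_frame/transforms/retest
{ ... }

POST _data_frame/transforms/retest/_start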

@lcawl added the bug, v7.2.0, and Feature:Transforms labels on Jun 13, 2019
@benwtrent (Member) commented Jun 13, 2019

It seems to me that automatically adding the format field to the date_histogram group_by on the user's behalf could cause more problems than it solves. (See related issue: elastic/elasticsearch#43068)

Why not default to not putting in a format but still allow the end user to provide one if they wish?

@elasticmachine (Contributor)

Pinging @elastic/ml-ui

@peteharverson changed the title from "Mishandling of date format in data frame transform destination index" to "[ML] Mishandling of date format in data frame transform destination index" on Jun 14, 2019
@walterra (Contributor)

I remember @pheyos brought this up already, and we discussed whether this line could be changed: https://github.com/elastic/elasticsearch/pull/41703/files#diff-5e5072b73a7dc03174c4b104a8dfb219R84

The line says

builder.field(FORMAT, DEFAULT_TIME_FORMAT + "||" + format);

In the case of yyyy, a value like 2019 could be interpreted as epoch milliseconds, for example, because it gets overruled by DEFAULT_TIME_FORMAT. Could we change this to:

builder.field(FORMAT, format + "||" + DEFAULT_TIME_FORMAT);

Then DEFAULT_TIME_FORMAT would be more like a fallback and yyyy would be tried first.

If format causes problems in general, I think we should decide whether we want to support it at all in the UI. If we remove it from the default configs but add a custom field to override it, that might give users headaches just the same and will be harder for us to support. What do you think?

@benwtrent (Member)

I am fine removing support for it from the UI entirely, but keeping it as a valid option in the API.

@sophiec20 what do you think?

@walterra (Contributor) commented Jun 19, 2019

I looked into this a bit more. Here's a set of Kibana Dev Console statements that should demonstrate the underlying issue:

GET _cat/indices

# Create an index with a custom mapping featuring different date formats
#  date_raw: default
#  date_yyyy: custom format `yyyy`, ES also offers `strict_year` to do exactly that
#  date_override: This is the format the data frame backend is currently creating
#  date_fallback: This is what I suggest the data frame backend should be creating
PUT date_test
{
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
        "properties" : {
            "date_raw" : { "type" : "date" },
            "date_yyyy" : { "type" : "date", "format": "yyyy" },
            "date_override" : { "type" : "date", "format": "strict_date_optional_time||epoch_millis||yyyy" },
            "date_fallback" : { "type" : "date", "format": "yyyy||strict_date_optional_time||epoch_millis" }
        }
    }
}

POST date_test/_doc/test_doc_2019
{
  "date_raw": 2019,
  "date_yyyy": 2019,
  "date_override": 2019,
  "date_fallback": 2019
}

POST date_test/_doc/test_doc_2018
{
  "date_raw": 2018,
  "date_yyyy": 2018,
  "date_override": 2018,
  "date_fallback": 2018
}

POST date_test/_doc/test_doc_2017
{
  "date_raw": 2017,
  "date_yyyy": 2017,
  "date_override": 2017,
  "date_fallback": 2017
}

GET date_test/_search
{
  "aggs": {
    "date_raw": {
      "date_histogram": {
        "field": "date_raw",
        "calendar_interval": "1y"
      }
    },
    "date_yyyy": {
      "date_histogram": {
        "field": "date_yyyy",
        "calendar_interval": "1y"
      }
    },
    "date_override": {
      "date_histogram": {
        "field": "date_override",
        "calendar_interval": "1y"
      }
    },
    "date_fallback": {
      "date_histogram": {
        "field": "date_fallback",
        "calendar_interval": "1y"
      }
    }
  },
  "size": 0
}

This is the result of the aggregation search from the snippet above:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "date_raw" : {
      "buckets" : [
        {
          "key_as_string" : "1970-01-01T00:00:00.000Z",
          "key" : 0,
          "doc_count" : 3
        }
      ]
    },
    "date_override" : {
      "buckets" : [
        {
          "key_as_string" : "1970-01-01T00:00:00.000Z",
          "key" : 0,
          "doc_count" : 3
        }
      ]
    },
    "date_yyyy" : {
      "buckets" : [
        {
          "key_as_string" : "2017",
          "key" : 1483228800000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2018",
          "key" : 1514764800000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2019",
          "key" : 1546300800000,
          "doc_count" : 1
        }
      ]
    },
    "date_fallback" : {
      "buckets" : [
        {
          "key_as_string" : "2017",
          "key" : 1483228800000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2018",
          "key" : 1514764800000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2019",
          "key" : 1546300800000,
          "doc_count" : 1
        }
      ]
    }
  }
}

You can see that date_raw and date_override treat 2019 as milliseconds since the epoch (1970-01-01T00:00:02.019Z, which the 1y histogram rounds down to 1970-01-01T00:00:00.000Z), whereas date_yyyy and date_fallback correctly identify 2019 as a year.

@droberts195 (Contributor)

This problem will be solved (for transforms created by 7.3 or above) if we implement the proposal outlined in #39250 (comment).

Since 7.2 is so close to release we should probably just document the problem raised in this issue as a known bug.

@walterra (Contributor)

For 7.2.1 we could change yyyy to a format that couldn't be mixed up with epoch_millis, for example yyyy-01-01.

@droberts195 (Contributor)

Would yyyy-MM-dd give different output from yyyy-01-01? If so, I think that's hiding another bug, because the aggregation should be rounding the dates to the beginning of their buckets. Maybe the use of hardcoded 00 and 01 in date formats has disguised a timezone handling problem, for example.

But yes, a temporary workaround could be to use yyyy-MM-dd as the minimum granularity.
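
As a sketch, the group_by from the original report would then be generated with the coarser-but-unambiguous format, roughly:

"order_date": {
  "date_histogram": {
    "field": "order_date",
    "calendar_interval": "1y",
    "format": "yyyy-MM-dd"
  }
}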

@benwtrent (Member)

Honestly, if we are talking about increasing the fidelity of the format, we should just make it a full fidelity format.

@droberts195 (Contributor) commented Jun 25, 2019

Honestly, if we are talking about increasing the fidelity of the format, we should just make it a full fidelity format.

How about:

  • For 7.2.1 we apply the simplest sticking plaster solution of changing:
    builder.field(FORMAT, DEFAULT_TIME_FORMAT + "||" + format);

to:

    builder.field(FORMAT, format + "||" + DEFAULT_TIME_FORMAT);

in the back end code. No UI changes for 7.2.1.
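
With that swap, the destination index mapping generated for the original report would carry the user-supplied format first; a sketch, assuming DEFAULT_TIME_FORMAT is strict_date_optional_time||epoch_millis as the mapping in step 6 suggests (this matches the date_fallback field in the demonstration above):

"order_date" : {
  "type" : "date",
  "format" : "yyyy||strict_date_optional_time||epoch_millis"
}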

@walterra (Contributor)

Since we are covering the removal of auto-formatting in another issue (#39250) and the backend change to swap the DEFAULT_TIME_FORMAT priorities is already in, I'm closing this issue.

@walterra added the v7.2.1 label and removed the v7.3.0 label on Jun 27, 2019