
Date_histogram doesn't work too well with timezones #307

Closed
trzysiek opened this issue Jun 11, 2024 · 1 comment

trzysiek commented Jun 11, 2024

We'll need to think about how to tackle time zones, both in Elastic and in ClickHouse.
How to reproduce:
I changed the timestamp generation in docker/log-generator/logger's code to:

stamps := []time.Time{
  time.Date(2021, 1, 2, 0, 0, 0, 0, time.UTC),
  time.Date(2021, 1, 1, 23, 50, 0, 0, time.UTC),
  time.Date(2021, 1, 1, 22, 55, 0, 0, time.UTC),
  time.Date(2021, 1, 2, 0, 55, 0, 0, time.UTC),
  time.Date(2021, 1, 2, 1, 55, 0, 0, time.UTC),
}
...
"@timestamp":   stamps[rand.Intn(len(stamps))].Format("2006-01-02T15:04:05.999Z"),

Then there's a discrepancy between our response and Elastic's for the query at the bottom: the ratio of hits between the two daily buckets is very different. The five timestamps above straddle midnight of Jan 2 differently in UTC and in Europe/Warsaw (UTC+1 in winter): in UTC two of them fall on Jan 1 and three on Jan 2, whereas in Warsaw time only one falls on Jan 1 and four on Jan 2.
Elastic returns:
[Screenshot 2024-06-11 at 11 00 40]
And we return:
[Screenshot 2024-06-11 at 11 03 59]
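For reference, the per-day split can be checked directly in ClickHouse with something like the sketch below (assuming a hypothetical `logs` table whose `"@timestamp"` column is a `DateTime64` stored in UTC; the table and column names are illustrative, not the project's actual schema):

```sql
-- Group the generated documents by calendar day in Europe/Warsaw.
-- With the five timestamps above this yields roughly a 1/4 split
-- (Jan 1 vs Jan 2), whereas grouping by the plain UTC date gives 2/3.
SELECT
    toDate(toTimeZone("@timestamp", 'Europe/Warsaw')) AS day_warsaw,
    count(*) AS doc_count
FROM logs
GROUP BY day_warsaw
ORDER BY day_warsaw;
```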

Query:

POST /logs-generic-default/_search
{
    "_source": {
        "excludes": []
    },
    "aggs": {
        "0": {
            "date_histogram": {
                "fixed_interval": "1d",
                "field": "@timestamp",
                "min_doc_count": 1,
                "time_zone": "Europe/Warsaw"
            }
        }
    },
    "fields": [
        {
            "field": "@timestamp",
            "format": "date_time"
        }
    ],
    "query": {
        "bool": {
            "filter": [
                {
                    "range": {
                        "@timestamp": {
                            "format": "strict_date_optional_time",
                            "gte": "2020-03-11T04:08:33.783Z",
                            "lte": "2024-06-11T03:08:33.783Z"
                        }
                    }
                }
            ],
            "must": [],
            "must_not": [],
            "should": []
        }
    },
    "runtime_mappings": {},
    "script_fields": {},
    "size": 0,
    "stored_fields": [
        "*"
    ],
    "track_total_hits": true
}
github-merge-queue bot pushed a commit that referenced this issue Sep 10, 2024
#307

edit: one case is still wrong. The comment below ("that's also why we don't need to worry about the field being in a different timezone in ClickHouse, it's translated to UTC anyway") holds for the second case; in the first one we need to handle it ourselves.

For `date_histogram` we generate SQL in two ways:

a) For longer intervals, e.g. 1 month, we group by `toStartOfMonth(timestamp)`. When time zones are taken into account, it becomes `toStartOfMonth(toTimeZone(timestamp, 'timezone_name'))`.
Example: we have this simple table:
![Screenshot 2024-09-09 at 19 00 43](https://github.com/user-attachments/assets/204c0810-a30d-4df6-9ff6-2caf2eb174da)

Before, the request below returned what it now returns if we omit the `time_zone` parameter, i.e. `"key_as_string": "2023-12-01T00:00:00.000"`. But in Warsaw time it's `2024-01-01 00:05`, so now we return the correct, new-year bucket.

![Screenshot 2024-09-09 at 18 59 27](https://github.com/user-attachments/assets/928c2276-fff1-4137-8025-5460160bd4b4)
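To make case (a) concrete, here is a minimal sketch of the generated grouping, assuming a hypothetical `logs` table with a `DateTime64` column `timestamp` (names are illustrative, not the actual SQL Quesma emits):

```sql
-- Case (a): month-long buckets; the value is shifted into the requested
-- time zone before truncation, so boundary rows land in the right month.
-- `logs` and `timestamp` are placeholder names for this sketch.
SELECT
    toStartOfMonth(toTimeZone(timestamp, 'Europe/Warsaw')) AS month_start,
    count(*) AS doc_count
FROM logs
GROUP BY month_start
ORDER BY month_start;
```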

b) For shorter intervals, like the one in this issue (#307), `toUnixTimestamp64Milli("@timestamp") / 3600000` becomes `(toUnixTimestamp64Milli("@timestamp") + timeZoneOffset(toTimeZone("@timestamp", 'Europe/Warsaw')) * 1000) / 3600000`.
Here the SQL is a bit more complex, but I think it needs to be: the `toUnixTimestamp` family of functions always operates in `UTC` and translates other time zones to `UTC`, so we have to add the offset ourselves.
(That's also why we don't need to worry about the field being stored in a different time zone in ClickHouse: it's translated to `UTC` anyway.)
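For case (b), a minimal sketch of the offset-adjusted grouping could look like the query below (again with a placeholder `logs` table and a `DateTime64` column `"@timestamp"`; the divisor 3600000 is the interval in milliseconds quoted above, and `intDiv` stands in for the integer division implied by the bucketing):

```sql
-- Case (b): fixed-interval buckets computed on UTC epoch milliseconds,
-- with the Europe/Warsaw offset (seconds converted to ms) added by hand,
-- because toUnixTimestamp64Milli always yields UTC epoch milliseconds.
-- `logs` and `"@timestamp"` are placeholder names for this sketch.
SELECT
    intDiv(
        toUnixTimestamp64Milli("@timestamp")
            + timeZoneOffset(toTimeZone("@timestamp", 'Europe/Warsaw')) * 1000,
        3600000
    ) AS bucket,
    count(*) AS doc_count
FROM logs
GROUP BY bucket
ORDER BY bucket;
```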

You can see below that the answers now match Elastic's from #307: the ratio of `doc_count` between the two days is 1-4, not 2-3 as before this fix.
![Screenshot 2024-09-09 at 19 12 20](https://github.com/user-attachments/assets/02294d24-53bb-43c9-8bd4-89a469950869)
In pancakes it works too:
![Screenshot 2024-09-09 at 19 16 43](https://github.com/user-attachments/assets/4cf537e5-6af4-47f4-8072-3431e86fb27e)
@jakozaur (Contributor) commented:

Fixed.
